Hadoop: The Definitive Guide, 3rd Edition, O'Reilly, May 2012
- Table of Contents
- Foreword
- Preface
- Chapter 1. Meet Hadoop
- Chapter 2. MapReduce
- Chapter 3. The Hadoop Distributed Filesystem
- Chapter 4. Hadoop I/O
- Chapter 5. Developing a MapReduce Application
- Chapter 6. How MapReduce Works
- Chapter 7. MapReduce Types and Formats
- Chapter 8. MapReduce Features
- Chapter 9. Setting Up a Hadoop Cluster
- Cluster Specification
- Cluster Setup and Installation
- SSH Configuration
- Hadoop Configuration
- YARN Configuration
- Security
- Benchmarking a Hadoop Cluster
- Hadoop in the Cloud
- Chapter 10. Administering Hadoop
- Chapter 11. Pig
- Chapter 12. Hive
- Chapter 13. HBase
- Chapter 14. ZooKeeper
- Chapter 15. Sqoop
- Chapter 16. Case Studies
- Hadoop Usage at Last.fm
- Hadoop and Hive at Facebook
- Nutch Search Engine
- Log Processing at Rackspace
- Cascading
- TeraByte Sort on Apache Hadoop
- Using Pig and Wukong to Explore Billion-edge Network Graphs
- Appendix A. Installing Apache Hadoop
- Appendix B. Cloudera’s Distribution Including Apache Hadoop
- Appendix C. Preparing the NCDC Weather Data
- Index
THIRD EDITION
Hadoop: The Definitive Guide
Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo
Hadoop: The Definitive Guide, Third Edition
Foreword
Preface
Administrative Notes
Package names are often omitted in the text to reduce clutter, and program listings that import multiple classes from the same package may use the asterisk wildcard to save space (for example, import org.apache.hadoop.io.*).
What’s in This Book?
What’s New in the Second Edition?
What’s New in the Third Edition?
Conventions Used in This Book
- Constant width: used for program listings, as well as within paragraphs to refer to program elements such as variable or function names.
- Constant width bold: shows commands or other text that should be typed literally by the user.
- Constant width italic: shows text that should be replaced with user-supplied values or by values determined by context.
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
CHAPTER 1
Meet Hadoop
Data!
Data Storage and Analysis
Comparison with Other Systems
Relational Database Management System
              Traditional RDBMS              MapReduce
Data size     Gigabytes                      Petabytes
Access        Interactive and batch          Batch
Updates       Read and write many times      Write once, read many times
Structure     Static schema                  Dynamic schema
Integrity     High                           Low
Scaling       Nonlinear                      Linear
Grid Computing
Volunteer Computing
A Brief History of Hadoop
The Origin of the Name “Hadoop”
(Smaller components, by contrast, are given more descriptive, and therefore more mundane, names: the JobTracker, for example, keeps track of MapReduce jobs.)
Hadoop at Yahoo!
Apache Hadoop and the Hadoop Ecosystem
Hadoop Releases
Feature                            1.x                                0.22         2.x
Secure authentication              Yes                                No           Yes
Old configuration names            Yes                                Deprecated   Deprecated
New configuration names            No                                 Yes          Yes
Old MapReduce API                  Yes                                Yes          Yes
New MapReduce API                  Yes (with some missing libraries)  Yes          Yes
MapReduce 1 runtime (Classic)      Yes                                Yes          No
MapReduce 2 runtime (YARN)         No                                 No           Yes
HDFS federation                    No                                 No           Yes
HDFS high-availability             No                                 No           Yes
What’s Covered in This Book
Configuration names
Configuration property names were changed in releases after 1.x to give them a more regular naming structure. For example, the HDFS properties pertaining to the namenode now carry a dfs.namenode prefix, so dfs.name.dir became dfs.namenode.name.dir. Similarly, MapReduce properties use a mapreduce prefix rather than the older mapred prefix, so mapred.job.name became mapreduce.job.name. The old (deprecated) names still work in current releases, and are the ones generally used in this book.
MapReduce APIs
The examples in this book use the new API for the most part; old-API equivalents of most of the examples are available in an oldapi package in the accompanying code.
Compatibility
Hadoop classifies its user-facing interfaces by audience and stability using annotations from the org.apache.hadoop.classification package. An element marked InterfaceStability.Stable may change incompatibly only between major releases, whereas InterfaceStability.Evolving and InterfaceStability.Unstable elements may change incompatibly between minor releases or between any releases, respectively.
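As a rough sketch of how these annotations appear in practice (the class name here is hypothetical; the annotations themselves are the real ones from org.apache.hadoop.classification):

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

// Hypothetical user-facing class, marked public but still subject to change
// between minor releases.
@InterfaceAudience.Public
@InterfaceStability.Evolving
public class ShinyNewFeature {
  // ...
}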
CHAPTER 2
MapReduce
A Weather Dataset
Data Format
0057
332130 # USAF weather station identifier
99999 # WBAN weather station identifier
19500101 # observation date
0300 # observation time
4
+51317 # latitude (degrees x 1000)
+028783 # longitude (degrees x 1000)
FM-12
+0171 # elevation (meters)
99999
V020
320 # wind direction (degrees)
1 # quality code
N
0072
1
00450 # sky ceiling height (meters)
1 # quality code
C
N
010000 # visibility distance (meters)
1 # quality code
N
9
-0128 # air temperature (degrees Celsius x 10)
1 # quality code
-0139 # dew point temperature (degrees Celsius x 10)
1 # quality code
10268 # atmospheric pressure (hectopascals x 10)
1 # quality code
% ls raw/1990 | head
010010-99999-1990.gz
010014-99999-1990.gz
010015-99999-1990.gz
010016-99999-1990.gz
010017-99999-1990.gz
010030-99999-1990.gz
010040-99999-1990.gz
010080-99999-1990.gz
010100-99999-1990.gz
010150-99999-1990.gz
Analyzing the Data with Unix Tools
#!/usr/bin/env bash
for year in all/*
do
echo -ne `basename $year .gz`"\t"
gunzip -c $year | \
awk '{ temp = substr($0, 88, 5) + 0;
q = substr($0, 93, 1);
if (temp !=9999 && q ~ /[01459]/ && temp > max) max = temp }
END { print max }'
done
The script loops through the compressed year files, first printing the year and then processing each file using awk: the air temperature and quality code are extracted from each line, readings that are missing (9999) or of suspect quality are skipped, and the running maximum is updated. The awk END block, executed after all lines in the file have been processed, prints the maximum value. Here is the beginning of a run:
% ./max_temperature.sh
1901 317
1902 244
1903 289
1904 256
1905 283
...
Analyzing the Data with Hadoop
Map and Reduce
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
(1949, [111, 78])
(1950, [0, 22, −11])
(1949, 111)
(1950, 22)
Java MapReduce
Having seen how MapReduce works conceptually, we can express it in code. We need three things: a map function, a reduce function, and some code to run the job. The map function is represented by the Mapper class, which declares an abstract map() method, as the following listing shows.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String year = line.substring(15, 19);
int airTemperature;
if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
airTemperature = Integer.parseInt(line.substring(88, 92));
} else {
airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]")) {
context.write(new Text(year), new IntWritable(airTemperature));
}
}
}
The Mapper class is a generic type, with four formal type parameters specifying the input key, input value, output key, and output value types of the map function. For this example, the input key is a long integer offset, the input value is a line of text, the output key is a year, and the output value is an air temperature (an integer). Rather than using built-in Java types, Hadoop provides its own set of basic types optimized for network serialization; these are found in the org.apache.hadoop.io package. Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer). The map() method is passed a key and a value. We convert the Text value containing the line of input into a Java String, then use its substring() method to extract the columns we are interested in. The map() method also provides a Context instance to write the output to: here we write the year as a Text object (since we are just using it as a key) and the temperature wrapped in an IntWritable.
The reduce function is similarly defined using a Reducer, as shown below.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}
}
Again, four formal type parameters specify the input and output types, this time for the reduce function. The input types of the reduce function must match the output types of the map function: Text and IntWritable. The output types of the reduce function are also Text and IntWritable, for a year and its maximum temperature. The third piece of code runs the MapReduce job:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTemperature {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MaxTemperature <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
A Job object forms the specification of the job and gives you control over how it is run. When we run the job on a Hadoop cluster, we will package the code into a JAR file for Hadoop to distribute around the cluster; rather than specifying the name of the JAR file explicitly, we pass a class in the Job's setJarByClass() method, which Hadoop uses to locate the relevant JAR by looking for the JAR containing that class. Having constructed a Job object, we specify the input and output paths. The input path is specified by calling the static addInputPath() method on FileInputFormat; it can be a single file, a directory (in which case all the files in the directory form the input), or a file pattern, and addInputPath() can be called more than once to use input from multiple paths. The output path, to which the reduce function writes its output, is specified by the static setOutputPath() method on FileOutputFormat; the directory shouldn't exist before running the job, because Hadoop will refuse to run the job if it does (a precaution against accidentally overwriting results). Next, we specify the map and reduce types via setMapperClass() and setReducerClass(). The setOutputKeyClass() and setOutputValueClass() methods control the output types for the reduce function and must match what the Reducer class produces. The map output types default to the same types, so they don't need to be set when the mapper produces the same types as the reducer (as in our case); when they differ, the map output types must be set using setMapOutputKeyClass() and setMapOutputValueClass(). The input types are controlled via the input format, which we have not explicitly set, so the default TextInputFormat is used. The waitForCompletion() method on Job submits the job and waits for it to finish, printing progress to the console if the argument is true; its boolean return value indicates success (true) or failure (false), which we translate into the program's exit code of 0 or 1.
A test run
% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop MaxTemperature input/ncdc/sample.txt output
12/02/04 11:50:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library
for your platform... using builtin-java classes where applicable
12/02/04 11:50:41 WARN mapred.JobClient: Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
12/02/04 11:50:41 INFO input.FileInputFormat: Total input paths to process : 1
12/02/04 11:50:41 INFO mapred.JobClient: Running job: job_local_0001
12/02/04 11:50:41 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/02/04 11:50:41 INFO mapred.MapTask: io.sort.mb = 100
12/02/04 11:50:42 INFO mapred.MapTask: data buffer = 79691776/99614720
12/02/04 11:50:42 INFO mapred.MapTask: record buffer = 262144/327680
12/02/04 11:50:42 INFO mapred.MapTask: Starting flush of map output
12/02/04 11:50:42 INFO mapred.MapTask: Finished spill 0
12/02/04 11:50:42 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/02/04 11:50:42 INFO mapred.JobClient: map 0% reduce 0%
12/02/04 11:50:44 INFO mapred.LocalJobRunner:
12/02/04 11:50:44 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/02/04 11:50:44 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/02/04 11:50:44 INFO mapred.LocalJobRunner:
12/02/04 11:50:44 INFO mapred.Merger: Merging 1 sorted segments
12/02/04 11:50:44 INFO mapred.Merger: Down to the last merge-pass, with 1 segments
left of total size: 57 bytes
12/02/04 11:50:44 INFO mapred.LocalJobRunner:
12/02/04 11:50:45 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
12/02/04 11:50:45 INFO mapred.LocalJobRunner:
12/02/04 11:50:45 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to
commit now
12/02/04 11:50:45 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to output
12/02/04 11:50:45 INFO mapred.JobClient: map 100% reduce 0%
12/02/04 11:50:47 INFO mapred.LocalJobRunner: reduce > reduce
12/02/04 11:50:47 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
12/02/04 11:50:48 INFO mapred.JobClient: map 100% reduce 100%
12/02/04 11:50:48 INFO mapred.JobClient: Job complete: job_local_0001
12/02/04 11:50:48 INFO mapred.JobClient: Counters: 17
12/02/04 11:50:48 INFO mapred.JobClient: File Output Format Counters
12/02/04 11:50:48 INFO mapred.JobClient: Bytes Written=29
12/02/04 11:50:48 INFO mapred.JobClient: FileSystemCounters
12/02/04 11:50:48 INFO mapred.JobClient: FILE_BYTES_READ=357503
12/02/04 11:50:48 INFO mapred.JobClient: FILE_BYTES_WRITTEN=425817
12/02/04 11:50:48 INFO mapred.JobClient: File Input Format Counters
12/02/04 11:50:48 INFO mapred.JobClient: Bytes Read=529
12/02/04 11:50:48 INFO mapred.JobClient: Map-Reduce Framework
12/02/04 11:50:48 INFO mapred.JobClient: Map output materialized bytes=61
12/02/04 11:50:48 INFO mapred.JobClient: Map input records=5
12/02/04 11:50:48 INFO mapred.JobClient: Reduce shuffle bytes=0
12/02/04 11:50:48 INFO mapred.JobClient: Spilled Records=10
12/02/04 11:50:48 INFO mapred.JobClient: Map output bytes=45
12/02/04 11:50:48 INFO mapred.JobClient: Total committed heap usage (bytes)=369238016
12/02/04 11:50:48 INFO mapred.JobClient: SPLIT_RAW_BYTES=129
12/02/04 11:50:48 INFO mapred.JobClient: Combine input records=0
12/02/04 11:50:48 INFO mapred.JobClient: Reduce input records=5
12/02/04 11:50:48 INFO mapred.JobClient: Reduce input groups=2
12/02/04 11:50:48 INFO mapred.JobClient: Combine output records=0
12/02/04 11:50:48 INFO mapred.JobClient: Reduce output records=2
12/02/04 11:50:48 INFO mapred.JobClient: Map output records=5
When the hadoop command is invoked with a classname as its first argument, it launches a Java virtual machine (JVM) to run the class. Using hadoop is more convenient than plain java because it adds the Hadoop libraries (and their dependencies) to the classpath and picks up the Hadoop configuration, too. To add the application classes to the classpath, we defined an environment variable called HADOOP_CLASSPATH, which the hadoop script picks up.
The output from running the job provides some useful information. We can see that the job was given an ID of job_local_0001, and that it ran one map task and one reduce task (with the IDs attempt_local_0001_m_000000_0 and attempt_local_0001_r_000000_0); knowing the job and task IDs is very useful when debugging MapReduce jobs. The final section of the output, titled "Counters," shows statistics that Hadoop generates for each job it runs. The output was written to the output directory, which contains one output file per reducer; here there is a single reducer, so we find a single file:
% cat output/part-r-00000
1949 111
1950 22
The old and the new Java MapReduce APIs
The new Java MapReduce API (sometimes called the "context objects" API) was introduced in release 0.20.0 and is the one used in the previous section; the old API that it replaces is still widely used, and both ship with current releases. The main differences are as follows:

- The new API lives in the org.apache.hadoop.mapreduce package and subpackages (such as org.apache.hadoop.mapreduce.lib); the old API is in org.apache.hadoop.mapred.
- The new API favors abstract classes over interfaces, since abstract classes are easier to evolve: Mapper and Reducer are abstract classes in the new API.
- The new API makes extensive use of context objects that allow user code to communicate with the MapReduce system; the new Context essentially unifies the roles of the old API's JobConf, OutputCollector, and Reporter.
- In both APIs, key-value record pairs are pushed to the mapper and reducer, but in addition the new API allows both mappers and reducers to control the execution flow by overriding the run() method. In the old API this is possible only for mappers, via MapRunnable; there is no equivalent for reducers.
- Job control is performed through the Job class in the new API, rather than through the old JobClient.
- Configuration has been unified: the old API has a special JobConf object for job configuration (an extension of Hadoop's Configuration class), whereas in the new API job configuration is done through a plain Configuration, possibly via helper methods on Job.
- Output files are named slightly differently: in the old API both map and reduce outputs are named part-nnnnn, whereas in the new API map outputs are named part-m-nnnnn and reduce outputs are named part-r-nnnnn (where nnnnn is an integer designating the part number, starting from zero).
- User-overridable methods in the new API are declared to throw java.lang.InterruptedException, so your code can be written to be responsive to interrupts.
- In the new API, the reduce() method passes values as a java.lang.Iterable rather than a java.lang.Iterator (as the old API does), which makes it easier to iterate over the values using Java's for-each loop: for (VALUEIN value : values) { ... }

The listing below shows the MaxTemperature application rewritten to use the old API; the differences are those described above.
When converting Mapper and Reducer classes between the old and new APIs, don't forget to change the signatures of the map() and reduce() methods to the new form. Simply changing your class to extend the new Mapper or Reducer classes will compile without errors or warnings, because those classes provide identity implementations of map() and reduce(); your code would silently never be called. Adding the @Override annotation to your map() and reduce() methods lets the compiler catch this mistake.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class OldMaxTemperature {
static class OldMaxTemperatureMapper extends MapReduceBase
implements Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
String line = value.toString();
String year = line.substring(15, 19);
int airTemperature;
if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
airTemperature = Integer.parseInt(line.substring(88, 92));
} else {
airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]")) {
output.collect(new Text(year), new IntWritable(airTemperature));
}
}
}
static class OldMaxTemperatureReducer extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int maxValue = Integer.MIN_VALUE;
while (values.hasNext()) {
maxValue = Math.max(maxValue, values.next().get());
}
output.collect(key, new IntWritable(maxValue));
}
}
public static void main(String[] args) throws IOException {
if (args.length != 2) {
System.err.println("Usage: OldMaxTemperature <input path> <output path>");
System.exit(-1);
}
JobConf conf = new JobConf(OldMaxTemperature.class);
conf.setJobName("Max temperature");
FileInputFormat.addInputPath(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setMapperClass(OldMaxTemperatureMapper.class);
conf.setReducerClass(OldMaxTemperatureReducer.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
}
}
Scaling Out
Data Flow
Combiner Functions
(1950, 0)
(1950, 20)
(1950, 10)
(1950, 25)
(1950, 15)
(1950, [0, 20, 10, 25, 15])
(1950, 25)
(1950, [20, 25])
max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
mean(0, 20, 10, 25, 15) = 14
mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
Specifying a combiner function
The combiner function is defined using the Reducer class; in fact, for this application it is the same implementation as the reduce function in MaxTemperatureReducer. The only change needed is to set the combiner class on the Job:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCombiner {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MaxTemperatureWithCombiner <input path> " +
"<output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(MaxTemperatureWithCombiner.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Running a Distributed MapReduce Job
Hadoop Streaming
Ruby
#!/usr/bin/env ruby
STDIN.each_line do |line|
val = line
year, temp, q = val[15,4], val[87,5], val[92,1]
puts "#{year}\t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end
The program iterates over lines from standard input by executing a block for each line from STDIN (a global constant of type IO). The block pulls out the relevant fields from each input line and, if the temperature is valid, writes the year and the temperature separated by a tab character (\t) to standard output using puts. It's worth drawing a comparison with the Java API: the Java framework calls the Mapper's map() method once per record, whereas a Streaming script decides for itself how to consume its input, and a Java Mapper can also accumulate state across records and emit results in its close() method. Because the script just runs over standard input and output, it's trivial to test it without Hadoop:
% cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb
1950 +0000
1950 +0022
1950 -0011
1949 +0111
1949 +0078
#!/usr/bin/env ruby
last_key, max_val = nil, -1000000
STDIN.each_line do |line|
key, val = line.split("\t")
if last_key && last_key != key
puts "#{last_key}\t#{max_val}"
last_key, max_val = key, val.to_i
else
last_key, max_val = key, [max_val, val.to_i].max
end
end
puts "#{last_key}\t#{max_val}" if last_key
The script keeps track of the last key seen; the test last_key && last_key != key is true only when the key changes, at which point the maximum for the previous key group is printed (the final puts handles the last group once input is exhausted). We can simulate the whole MapReduce pipeline locally by putting a Unix sort between the two scripts:
% cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb | \
sort | ch02/src/main/ruby/max_temperature_reduce.rb
1949 111
1950 22
The hadoop command doesn't support a Streaming option; instead, you specify the Streaming JAR file along with the jar option. Options to the Streaming program specify the input and output paths and the map and reduce scripts:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb
When running over a large dataset on a cluster, we would also want to use a combiner. A combiner can be specified with the -combiner option; the command below instead folds the combining step into the map side by chaining the map script, a sort, and the reduce script into a single pipeline passed as the -mapper:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/all \
-output output \
-mapper "ch02/src/main/ruby/max_temperature_map.rb | sort |
ch02/src/main/ruby/max_temperature_reduce.rb" \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb \
-file ch02/src/main/ruby/max_temperature_map.rb \
-file ch02/src/main/ruby/max_temperature_reduce.rb
Note the use of -file, which we use when running Streaming programs on the cluster to ship the scripts to the cluster nodes.
Python
#!/usr/bin/env python
import re
import sys
for line in sys.stdin:
val = line.strip()
(year, temp, q) = (val[15:19], val[87:92], val[92:93])
if (temp != "+9999" and re.match("[01459]", q)):
print "%s\t%s" % (year, temp)
#!/usr/bin/env python
import sys
(last_key, max_val) = (None, -sys.maxint)
for line in sys.stdin:
(key, val) = line.strip().split("\t")
if last_key and last_key != key:
print "%s\t%s" % (last_key, max_val)
(last_key, max_val) = (key, int(val))
else:
(last_key, max_val) = (key, max(max_val, int(val)))
if last_key:
print "%s\t%s" % (last_key, max_val)
% cat input/ncdc/sample.txt | ch02/src/main/python/max_temperature_map.py | \
sort | ch02/src/main/python/max_temperature_reduce.py
1949 111
1950 22
Hadoop Pipes
#include <algorithm>
#include <climits>
#include <limits>
#include <stdint.h>
#include <string>
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"
class MaxTemperatureMapper : public HadoopPipes::Mapper {
public:
MaxTemperatureMapper(HadoopPipes::TaskContext& context) {
}
void map(HadoopPipes::MapContext& context) {
std::string line = context.getInputValue();
std::string year = line.substr(15, 4);
std::string airTemperature = line.substr(87, 5);
std::string q = line.substr(92, 1);
if (airTemperature != "+9999" &&
(q == "0" || q == "1" || q == "4" || q == "5" || q == "9")) {
context.emit(year, airTemperature);
}
}
};
class MapTemperatureReducer : public HadoopPipes::Reducer {
public:
MapTemperatureReducer(HadoopPipes::TaskContext& context) {
}
void reduce(HadoopPipes::ReduceContext& context) {
int maxValue = INT_MIN;
while (context.nextValue()) {
maxValue = std::max(maxValue, HadoopUtils::toInt(context.getInputValue()));
}
context.emit(context.getInputKey(), HadoopUtils::toString(maxValue));
}
};
int main(int argc, char *argv[]) {
return HadoopPipes::runTask(HadoopPipes::TemplateFactory<MaxTemperatureMapper,
MapTemperatureReducer>());
}
The map and reduce functions are defined by extending the Mapper and Reducer classes in the HadoopPipes namespace and providing implementations of the map() and reduce() methods in each case. These methods take a context object (of type MapContext or ReduceContext), which provides the means for reading input and writing output, as well as for accessing job configuration information (much like a JobConf). The processing in this example is very similar to the Java equivalent (note that the reducer class is named MapTemperatureReducer in the original listing). Unlike the Java interface, keys and values in the C++ interface are byte buffers represented as standard library strings, which makes the interface simpler although it puts a slightly greater burden on the application developer to convert to and from richer types; the HadoopUtils namespace helps here with string-conversion functions such as toInt() and toString(). Also notice that, unlike in the Java MaxTemperature mapper, the temperature is kept as a string in the Mapper (airTemperature) and converted to an integer only in the reducer, rather than being parsed in map(). The main() method is the application entry point: it calls HadoopPipes::runTask, which connects to the Java parent process and marshals data to and from the Mapper or Reducer; runTask() is passed a Factory so that it can create instances of the Mapper or Reducer as requested.
Compiling and Running
CC = g++
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include
max_temperature: max_temperature.cpp
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
	  -lhadooputils -lpthread -g -O2 -o $@
The Makefile expects the HADOOP_INSTALL environment variable to point to your Hadoop installation, and the PLATFORM variable to specify the operating system, architecture, and data model (for example, 32- or 64-bit). To build, run:
% export PLATFORM=Linux-i386-32
% make
On successful completion, you'll find the max_temperature executable in the current directory. Next, copy the executable and the sample input to HDFS:
% hadoop fs -put max_temperature bin/max_temperature
% hadoop fs -put input/ncdc/sample.txt sample.txt
To run the job, we use the hadoop pipes command, passing the URI of the executable in HDFS with the -program argument:
% hadoop pipes \
-D hadoop.pipes.java.recordreader=true \
-D hadoop.pipes.java.recordwriter=true \
-input sample.txt \
-output output \
-program bin/max_temperature
We specify two properties using the -D option: hadoop.pipes.java.recordreader and hadoop.pipes.java.recordwriter, setting both to true to say that we have not specified a C++ record reader or writer, but that we want to use the default Java ones (which handle text input and output).
CHAPTER 3
The Hadoop Distributed Filesystem
The Design of HDFS
HDFS Concepts
Blocks
Why Is a Block in HDFS So Large?
Like a disk filesystem, HDFS's fsck command understands blocks; for example, running the following lists the blocks that make up each file in the filesystem:
% hadoop fsck / -files -blocks
Namenodes and Datanodes
HDFS Federation
Clients access a federated cluster through client-side mount tables, implemented by ViewFileSystem and the viewfs:// URIs.
HDFS High-Availability
Failover and fencing
The Command-Line Interface
There are two properties that we set in the pseudodistributed configuration that deserve further explanation. The first is fs.default.name, set to hdfs://localhost/, which sets a default filesystem for Hadoop: HDFS running on localhost, on the default namenode port of 8020. The second property, dfs.replication, we set to 1 so that HDFS doesn't replicate blocks by the default factor of three, which would otherwise cause persistent warnings about under-replicated blocks when running on a single datanode.
Basic Filesystem Operations
You can get detailed help on every command by typing hadoop fs -help. Start by copying a file from the local filesystem to HDFS:
% hadoop fs -copyFromLocal input/docs/quangle.txt hdfs://localhost/user/tom/quangle.txt
This command invokes Hadoop's filesystem shell command fs, which supports a number of subcommands; here we are running -copyFromLocal. The local file quangle.txt is copied to the file /user/tom/quangle.txt on the HDFS instance running on localhost. In fact, we could have omitted the scheme and host of the URI (hdfs://localhost), picking up the default set in core-site.xml, or even used a relative path, which copies the file to our HDFS home directory, /user/tom:
% hadoop fs -copyFromLocal input/docs/quangle.txt /user/tom/quangle.txt
% hadoop fs -copyFromLocal input/docs/quangle.txt quangle.txt
% hadoop fs -copyToLocal quangle.txt quangle.copy.txt
% md5 input/docs/quangle.txt quangle.copy.txt
MD5 (input/docs/quangle.txt) = a16f231da6b05e2ba7a339320e7dacd9
MD5 (quangle.copy.txt) = a16f231da6b05e2ba7a339320e7dacd9
% hadoop fs -mkdir books
% hadoop fs -ls .
Found 2 items
drwxr-xr-x - tom supergroup 0 2009-04-02 22:41 /user/tom/books
-rw-r--r-- 1 tom supergroup 118 2009-04-02 22:29 /user/tom/quangle.txt
The information returned is very similar to that of the Unix ls -l command, with a few minor differences: the second column is the replication factor of the file (a dash is shown for directories, since the concept doesn't apply to them, as their metadata is stored by the namenode rather than in blocks).
File Permissions in HDFS
There are three types of permission: read (r), write (w), and execute (x). The execute permission is ignored for a file because you can't execute a file on HDFS (unlike POSIX), and for a directory it is required to access its children. Whether permissions are checked at all is controlled by the dfs.permissions property; when it is set to false, no permission checking is performed.
Hadoop Filesystems
Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there are several concrete implementations, listed below.
Hadoop filesystems (Java implementations are all under org.apache.hadoop):

- Local (URI scheme: file; Java implementation: fs.LocalFileSystem): A filesystem for a locally connected disk with client-side checksums. Use RawLocalFileSystem for a local filesystem with no checksums. See "LocalFileSystem" on page 82.
- HDFS (URI scheme: hdfs; Java implementation: hdfs.DistributedFileSystem): Hadoop's distributed filesystem. HDFS is designed to work efficiently in conjunction with MapReduce.
- HFTP (URI scheme: hftp; Java implementation: hdfs.HftpFileSystem): A filesystem providing read-only access to HDFS over HTTP. (Despite its name, HFTP has no connection with FTP.) Often used with distcp (see "Parallel Copying with distcp" on page 75) to copy data between HDFS clusters running different versions.
- HSFTP (URI scheme: hsftp; Java implementation: hdfs.HsftpFileSystem): A filesystem providing read-only access to HDFS over HTTPS. (Again, this has no connection with FTP.)
- WebHDFS (URI scheme: webhdfs; Java implementation: hdfs.web.WebHdfsFileSystem): A filesystem providing secure read-write access to HDFS over HTTP. WebHDFS is intended as a replacement for HFTP and HSFTP.
- HAR (URI scheme: har; Java implementation: fs.HarFileSystem): A filesystem layered on another filesystem for archiving files. Hadoop Archives are typically used for archiving files in HDFS to reduce the namenode's memory usage. See "Hadoop Archives" on page 77.
- KFS (CloudStore) (URI scheme: kfs; Java implementation: fs.kfs.KosmosFileSystem): CloudStore (formerly the Kosmos filesystem) is a distributed filesystem like HDFS or Google's GFS, written in C++. Find more information about it at http://code.google.com/p/kosmosfs/.
- FTP (URI scheme: ftp; Java implementation: fs.ftp.FTPFileSystem): A filesystem backed by an FTP server.
- S3 (native) (URI scheme: s3n; Java implementation: fs.s3native.NativeS3FileSystem): A filesystem backed by Amazon S3. See http://wiki.apache.org/hadoop/AmazonS3.
- S3 (block-based) (URI scheme: s3; Java implementation: fs.s3.S3FileSystem): A filesystem backed by Amazon S3, which stores files in blocks (much like HDFS) to overcome S3's 5 GB file size limit.
- Distributed RAID (URI scheme: hdfs; Java implementation: hdfs.DistributedRaidFileSystem): A "RAID" version of HDFS designed for archival storage. For each file in HDFS, a (smaller) parity file is created, which allows the HDFS replication to be reduced from three to two, reducing disk usage by 25% to 30% while keeping the probability of data loss the same. Distributed RAID requires that you run a RaidNode daemon on the cluster.
- View (URI scheme: viewfs; Java implementation: viewfs.ViewFileSystem): A client-side mount table for other Hadoop filesystems. Commonly used to create mount points for federated namenodes (see "HDFS Federation" on page 47).
% hadoop fs -ls file:///
Interfaces
Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the Java FileSystem API (for HDFS, the DistributedFileSystem implementation). Several other interfaces are layered on top of, or alongside, it:

HTTP. There are two ways of accessing HDFS over HTTP: directly, where the HDFS daemons themselves serve HTTP requests to clients (the WebHDFS protocol, enabled by setting dfs.webhdfs.enabled to true), and via one or more proxies that access HDFS on the client's behalf using the usual FileSystem API. Because both expose a REST interface, clients in any language can use them.

C. Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface; it uses the Java Native Interface (JNI) to call a Java filesystem client, so it can access any Hadoop filesystem, not just HDFS.

FUSE. Filesystem in Userspace (FUSE) allows filesystems implemented in user space to be mounted as standard filesystems. Hadoop's Fuse-DFS contrib module allows HDFS (or any Hadoop filesystem) to be mounted, after which you can use ordinary Unix utilities such as ls and cat to interact with it, as well as POSIX libraries from any programming language.
The Java Interface
In this section, we dig into the Hadoop FileSystem class: the API for interacting with one of Hadoop's filesystems. Although we focus mainly on the HDFS implementation, DistributedFileSystem, in general you should strive to write your code against the FileSystem abstract class to retain portability across filesystems.
Reading Data from a Hadoop URL
One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object to open a stream to read the data from. The general idiom is:
InputStream in = null;
try {
  in = new URL("hdfs://host/path").openStream();
  // process in
} finally {
  IOUtils.closeStream(in);
}

(As an aside, recent versions of Hadoop also provide a newer filesystem interface called FileContext, with better handling of multiple filesystems; this book's coverage sticks to FileSystem.)
A little more work is required to make Java recognize Hadoop's hdfs URL scheme: we call the setURLStreamHandlerFactory method on URL with an instance of FsUrlStreamHandlerFactory. This method can be called only once per JVM, so it is typically executed in a static block. This limitation means that if some other part of your program (perhaps a third-party component outside your control) sets a URLStreamHandlerFactory, you won't be able to use this approach for reading data from Hadoop. The next section discusses an alternative. The program below displays files from Hadoop filesystems on standard output, like the Unix cat command:
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
static {
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
}
public static void main(String[] args) throws Exception {
InputStream in = null;
try {
in = new URL(args[0]).openStream();
IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
IOUtils.closeStream(in);
}
}
}
We make use of the handy IOUtils class that comes with Hadoop for closing the stream in the finally clause and for copying bytes between the input stream and the output stream (System.out, in this case). The last two arguments to copyBytes() are the buffer size used for copying and whether to close the streams when the copy is complete; we close the input stream ourselves, and System.out doesn't need to be closed. Here's a sample run:
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
Reading Data Using the FileSystem API
As the previous section explained, it is sometimes impossible to set a URLStreamHandlerFactory for your application; in that case, you need to use the FileSystem API to open an input stream for a file. A file in a Hadoop filesystem is represented by a Hadoop Path object; you can think of a Path as a Hadoop filesystem URI, such as hdfs://localhost/user/tom/quangle.txt (it is not a java.io.File object, whose semantics are too closely tied to the local filesystem). FileSystem is a general filesystem API, so the first step is to retrieve an instance for the filesystem we want to use. There are several static factory methods for getting a FileSystem instance:
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user)
throws IOException
A Configuration object encapsulates a client or server's configuration, which is set using configuration files read from the classpath, such as conf/core-site.xml. The first method returns the default filesystem (as specified in fs.default.name, or the local filesystem if that is not set). The second uses the given URI's scheme and authority to determine the filesystem to use, falling back to the default if no scheme is specified in the URI. The third retrieves the filesystem as the given user, which is important in the context of security. In some cases, you may want a local filesystem instance, in which case you can use the convenience method getLocal():
public static LocalFileSystem getLocal(Configuration conf) throws IOException
With a FileSystem instance in hand, we invoke an open() method to get the input stream for a file (the second form allows the buffer size to be specified; the default is 4 KB):
public FSDataInputStream open(Path f) throws IOException
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
public static void main(String[] args) throws Exception {
String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
InputStream in = null;
try {
in = fs.open(new Path(uri));
IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
IOUtils.closeStream(in);
}
}
}
% hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
FSDataInputStream
The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class. This class is a specialization of java.io.DataInputStream with support for random access, so you can read from any part of the stream:
package org.apache.hadoop.fs;
public class FSDataInputStream extends DataInputStream
implements Seekable, PositionedReadable {
// implementation elided
}
The Seekable interface permits seeking to a position in the file and provides a query method for the current offset from the start of the file, getPos():
public interface Seekable {
void seek(long pos) throws IOException;
long getPos() throws IOException;
}
Calling seek() with a position that is greater than the length of the file results in an IOException. Unlike the skip() method of java.io.InputStream, which positions the stream at a point later than the current position, seek() can move to an arbitrary, absolute position in the file. The following program extends FileSystemCat by writing a file to standard output twice: after writing it once, it seeks to the start of the file and streams through it once more.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemDoubleCat {
public static void main(String[] args) throws Exception {
String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
FSDataInputStream in = null;
try {
in = fs.open(new Path(uri));
IOUtils.copyBytes(in, System.out, 4096, false);
in.seek(0); // go back to the start of the file
IOUtils.copyBytes(in, System.out, 4096, false);
} finally {
IOUtils.closeStream(in);
}
}
}
% hadoop FileSystemDoubleCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
FSDataInputStream also implements the PositionedReadable interface for reading parts of a file at a given offset:
public interface PositionedReadable {
public int read(long position, byte[] buffer, int offset, int length)
throws IOException;
public void readFully(long position, byte[] buffer, int offset, int length)
throws IOException;
public void readFully(long position, byte[] buffer) throws IOException;
}
The read() method reads up to length bytes from the given position in the file into the buffer at the given offset in the buffer; the return value is the number of bytes actually read (callers should check this value, as it may be less than length). The readFully() methods read length bytes into the buffer (or buffer.length bytes for the version that just takes a byte array), unless the end of the file is reached first, in which case an EOFException is thrown. All of these methods preserve the current offset in the file and are thread-safe, so they provide a convenient way to access another part of the file (metadata, perhaps) while reading the main body. Finally, bear in mind that calling seek() on FSDataInputStream is a relatively expensive operation and should be used sparingly; you should structure your application access patterns to rely on streaming data (with MapReduce, for example) rather than performing a large number of seeks.
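To make the positioned-read calls concrete, here is a minimal sketch (the path, offset, and read length are illustrative, and the file is assumed to be at least that long):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PositionedReadExample {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = fs.open(new Path(uri));
    byte[] buffer = new byte[16];
    // Read 16 bytes starting at byte 1,024 of the file, without disturbing
    // the stream's current offset.
    in.readFully(1024, buffer);
    System.out.println(new String(buffer, "UTF-8"));
    in.close();
  }
}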
Writing Data
The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
There are overloaded versions of create() that allow you to specify whether to forcibly overwrite existing files, the replication factor of the file, the buffer size, the block size, and file permissions. The create() methods create any parent directories of the file to be written that don't already exist; if you would rather have the operation fail in that case, check for the parent directory first with exists(). There's also an overloaded method for passing a callback interface, Progressable, so your application can be notified of the progress of the data being written to the datanodes:
package org.apache.hadoop.util;
public interface Progressable {
public void progress();
}
As an alternative to creating a new file, you can append to an existing file using the append() method (there are also other overloaded versions):
public FSDataOutputStream append(Path f) throws IOException
The program below shows how to copy a local file to a Hadoop filesystem. We illustrate progress by printing a period every time the progress() method is called by Hadoop, which happens after each 64 KB packet of data is written to the datanode pipeline.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
public static void main(String[] args) throws Exception {
String localSrc = args[0];
String dst = args[1];
InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(dst), conf);
OutputStream out = fs.create(new Path(dst), new Progressable() {
public void progress() {
System.out.print(".");
}
});
IOUtils.copyBytes(in, out, 4096, true);
}
}
% hadoop FileCopyWithProgress input/docs/1400-8.txt hdfs://localhost/user/tom/
1400-8.txt
...............
Currently, none of the other Hadoop filesystems call progress() during writes. Progress is important in MapReduce applications, as later chapters show.
FSDataOutputStream
The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a method for querying the current position in the file:
package org.apache.hadoop.fs;
public class FSDataOutputStream extends DataOutputStream implements Syncable {
public long getPos() throws IOException {
// implementation elided
}
// implementation elided
}
However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is because HDFS allows only sequential writes to an open file, or appends to an already written file; writing to anywhere other than the end of the file is not supported, so there would be no value in being able to seek while writing.
Directories
FileSystem provides a method to create a directory:

public boolean mkdirs(Path f) throws IOException

This method creates all of the necessary parent directories if they don't already exist, just like java.io.File's mkdirs() method, and returns true if the directory (and all parent directories) was successfully created. Often you don't need to create a directory explicitly, because writing a file by calling create() creates any parent directories automatically.
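For example (a minimal sketch, assuming fs is a FileSystem instance obtained as shown earlier and the path is illustrative):

Path dir = new Path("/user/tom/books");
boolean created = fs.mkdirs(dir);   // true on success; parents are created as needed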
Querying the Filesystem
File metadata: FileStatus
The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information. The getFileStatus() method on FileSystem provides a way of getting a FileStatus object for a single file or directory; the following test shows its use:
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.is;
import static org.hamcrest.Matchers.lessThanOrEqualTo;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class ShowFileStatusTest {
private MiniDFSCluster cluster; // use an in-process HDFS cluster for testing
private FileSystem fs;
@Before
public void setUp() throws IOException {
Configuration conf = new Configuration();
if (System.getProperty("test.build.data") == null) {
System.setProperty("test.build.data", "/tmp");
}
cluster = new MiniDFSCluster(conf, 1, true, null);
fs = cluster.getFileSystem();
OutputStream out = fs.create(new Path("/dir/file"));
out.write("content".getBytes("UTF-8"));
out.close();
}
@After
public void tearDown() throws IOException {
if (fs != null) { fs.close(); }
if (cluster != null) { cluster.shutdown(); }
}
@Test(expected = FileNotFoundException.class)
public void throwsFileNotFoundForNonExistentFile() throws IOException {
fs.getFileStatus(new Path("no-such-file"));
}
@Test
public void fileStatusForFile() throws IOException {
Path file = new Path("/dir/file");
FileStatus stat = fs.getFileStatus(file);
assertThat(stat.getPath().toUri().getPath(), is("/dir/file"));
assertThat(stat.isDir(), is(false));
assertThat(stat.getLen(), is(7L));
assertThat(stat.getModificationTime(),
is(lessThanOrEqualTo(System.currentTimeMillis())));
assertThat(stat.getReplication(), is((short) 1));
assertThat(stat.getBlockSize(), is(64 * 1024 * 1024L));
assertThat(stat.getOwner(), is("tom"));
assertThat(stat.getGroup(), is("supergroup"));
assertThat(stat.getPermission().toString(), is("rw-r--r--"));
}
@Test
public void fileStatusForDirectory() throws IOException {
Path dir = new Path("/dir");
FileStatus stat = fs.getFileStatus(dir);
assertThat(stat.getPath().toUri().getPath(), is("/dir"));
assertThat(stat.isDir(), is(true));
assertThat(stat.getLen(), is(0L));
assertThat(stat.getModificationTime(),
is(lessThanOrEqualTo(System.currentTimeMillis())));
assertThat(stat.getReplication(), is((short) 0));
assertThat(stat.getBlockSize(), is(0L));
assertThat(stat.getOwner(), is("tom"));
assertThat(stat.getGroup(), is("supergroup"));
assertThat(stat.getPermission().toString(), is("rwxr-xr-x"));
}
}
If no file or directory exists, a FileNotFoundException is thrown. However, if you are interested only in whether a file or directory exists, the exists() method on FileSystem is more convenient:
public boolean exists(Path f) throws IOException
Listing files
Finding information on a single file or directory is useful, but you also often need to list the contents of a directory. That's what FileSystem's listStatus() methods are for:
public FileStatus[] listStatus(Path f) throws IOException
public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException
public FileStatus[] listStatus(Path[] files) throws IOException
public FileStatus[] listStatus(Path[] files, PathFilter filter) throws IOException
When the argument is a file, the simplest variant returns an array of FileStatus objects of length 1. When the argument is a directory, it returns zero or more FileStatus objects representing the files and directories contained in the directory. Overloaded variants allow a PathFilter to restrict the files and directories to match, and if you specify an array of paths, the result is the same as calling listStatus() in turn for each path and accumulating the FileStatus object arrays into a single array. The utility method stat2Paths() on FileUtil turns an array of FileStatus objects into an array of Path objects. The following program shows listStatus() in action:
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ListStatus {
public static void main(String[] args) throws Exception {
String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
Path[] paths = new Path[args.length];
for (int i = 0; i < paths.length; i++) {
paths[i] = new Path(args[i]);
}
FileStatus[] status = fs.listStatus(paths);
Path[] listedPaths = FileUtil.stat2Paths(status);
for (Path p : listedPaths) {
System.out.println(p);
}
}
}
% hadoop ListStatus hdfs://localhost/ hdfs://localhost/user/tom
hdfs://localhost/user
hdfs://localhost/user/tom/books
hdfs://localhost/user/tom/quangle.txt
File patterns
It is a common requirement to process sets of files in a single operation, and rather than enumerating every file and directory, you can use wildcard characters to match multiple files with a single expression, an operation known as globbing. Hadoop provides two FileSystem methods for processing globs:

public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException

The globStatus() methods return an array of FileStatus objects whose paths match the supplied pattern, sorted by path. An optional PathFilter can be specified to restrict the matches further. Hadoop supports the same set of glob characters as Unix bash:
- * (asterisk): matches zero or more characters
- ? (question mark): matches a single character
- [ab] (character class): matches a single character in the set {a, b}
- [^ab] (negated character class): matches a single character that is not in the set {a, b}
- [a-b] (character range): matches a single character in the (closed) range [a, b], where a is lexicographically less than or equal to b
- [^a-b] (negated character range): matches a single character that is not in the (closed) range [a, b], where a is lexicographically less than or equal to b
- {a,b} (alternation): matches either expression a or b
- \c (escaped character): matches character c when it is a metacharacter
Imagine that logfiles are stored in a directory structure organized hierarchically by date, like this:

/
|-- 2007/
|   `-- 12/
|       |-- 30/
|       `-- 31/
`-- 2008/
    `-- 01/
        |-- 01/
        `-- 02/

Here are some file globs and their expansions:
Glob                  Expansion
/*                    /2007  /2008
/*/*                  /2007/12  /2008/01
/*/12/*               /2007/12/30  /2007/12/31
/200?                 /2007  /2008
/200[78]              /2007  /2008
/200[7-8]             /2007  /2008
/200[^01234569]       /2007  /2008
/*/*/{31,01}          /2007/12/31  /2008/01/01
/*/*/3{0,1}           /2007/12/30  /2007/12/31
/*/{12/31,01/01}      /2007/12/31  /2008/01/01
PathFilter
Glob patterns are not always powerful enough to describe the set of files you want to access. To give programmatic control over matching, the listStatus() and globStatus() methods of FileSystem take an optional PathFilter:
package org.apache.hadoop.fs;
public interface PathFilter {
boolean accept(Path path);
}
PathFilter is the equivalent of java.io.FileFilter for Path objects rather than File objects. The class below shows a PathFilter for excluding paths that match a regular expression:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexExcludePathFilter implements PathFilter {
private final String regex;
public RegexExcludePathFilter(String regex) {
this.regex = regex;
}
public boolean accept(Path path) {
return !path.toString().matches(regex);
}
}
fs.globStatus(new Path("/2007/*/*"), new RegexExcludePathFilter("^.*/2007/12/31$"))
The filter passes only those files that don't match the regular expression; after the glob picks out an initial set of files to include, the filter is used to refine the results. Filters can act only on a file's name, as represented by a Path; they can't use a file's properties, such as creation time, as the basis of the filter. Nevertheless, a PathFilter can perform matching that neither glob patterns nor regular expressions can achieve.
Deleting Data
Use the delete() method on FileSystem to permanently remove files or directories:
public boolean delete(Path f, boolean recursive) throws IOException
If f is a file or an empty directory, the value of recursive is ignored. A nonempty directory is deleted, along with its contents, only if recursive is true; otherwise, an IOException is thrown.
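For example, to remove an output directory and everything beneath it (a sketch, assuming fs is a FileSystem instance and the path is illustrative):

boolean deleted = fs.delete(new Path("/user/tom/output"), true);  // recursive delete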
Data Flow
Anatomy of a File Read
To get an idea of how data flows between the client and HDFS, consider what happens when a file is read. The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem. DistributedFileSystem calls the namenode, using RPC, to determine the locations of the first few blocks in the file; for each block, the namenode returns the addresses of the datanodes that have a copy of that block, sorted by their proximity to the client. DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client to read data from; the FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O. The client then calls read() on the stream. DFSInputStream, which has stored the datanode addresses for the first few blocks, connects to the closest datanode for the first block, and data is streamed back to the client as it calls read() repeatedly. When the end of a block is reached, DFSInputStream closes the connection to that datanode and finds the best datanode for the next block; this happens transparently to the client, which sees a continuous stream. When the client has finished reading, it calls close() on the FSDataInputStream. During reading, if DFSInputStream encounters an error communicating with a datanode, it tries the next closest datanode for that block and remembers datanodes that have failed so that it doesn't needlessly retry them for later blocks; it also verifies checksums for the data transferred from the datanode and, if a corrupted block is found, reports it to the namenode before reading a replica from another datanode. One important aspect of this design is that the client contacts datanodes directly to retrieve data and is guided by the namenode to the best datanode for each block, which allows HDFS to scale to a large number of concurrent clients because the data traffic is spread across all the datanodes in the cluster.
Network Topology and Hadoop
Data Flow | 69
Anatomy of a File Write
Next, consider how files are written: creating a new file, writing data to it, and closing the file. The client creates the file by calling create() on DistributedFileSystem, which makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it. The namenode performs checks to make sure the file doesn't already exist and that the client has permission to create it; if those checks fail, file creation fails and the client is thrown an IOException. Otherwise, DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to; just as in the read case, the FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and the namenode. As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue consumed by the DataStreamer. The DataStreamer asks the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas; these datanodes form a pipeline, and the DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the next datanode, and so on. DFSOutputStream also maintains an internal queue of packets waiting to be acknowledged, and a packet is removed from this queue only when it has been acknowledged by all the datanodes in the pipeline. If a datanode fails while data is being written to it, the pipeline is rebuilt from the remaining good datanodes, and the namenode later arranges for the under-replicated block to gain another replica. A write succeeds as long as dfs.replication.min replicas (one, by default) are written; the block is then asynchronously replicated across the cluster until its target replication factor (dfs.replication, which defaults to three) is reached. When the client has finished writing data, it calls close() on the stream; this flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode (via the DataStreamer's connection) to signal that the file is complete.
Replica Placement
Coherency Model
Path p = new Path("p");
fs.create(p);
assertThat(fs.exists(p), is(true));
Path p = new Path("p");
OutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
assertThat(fs.getFileStatus(p).getLen(), is(0L));
HDFS provides a method for forcing all buffers to be flushed to the datanodes via the sync() method on FSDataOutputStream. After a successful return from sync(), HDFS guarantees that the data written up to that point in the file has reached all the datanodes in the write pipeline and is visible to all new readers:
Path p = new Path("p");
FSDataOutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
out.sync();
assertThat(fs.getFileStatus(p).getLen(), is(((long) "content".length())));
In later releases (from 0.21), sync() is deprecated in favor of hflush(), which makes the same guarantee that new readers will see the data, and hsync(), which makes the stronger guarantee that the data has also been flushed to disk on the datanodes. This behavior is analogous to the POSIX fsync system call, which commits buffered data for a file descriptor. For example, using the standard Java API to write a local file, we are guaranteed to see the content after flushing the stream and calling the fsync equivalent:
FileOutputStream out = new FileOutputStream(localFile);
out.write("content".getBytes("UTF-8"));
out.flush(); // flush to operating system
out.getFD().sync(); // sync to disk
assertThat(localFile.length(), is(((long) "content".length())));
Closing a file in HDFS performs an implicit sync(), too:
Path p = new Path("p");
OutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.close();
assertThat(fs.getFileStatus(p).getLen(), is(((long) "content".length())));
Consequences for application design
This coherency model has implications for how you design applications. With no calls to sync(), you should be prepared to lose up to a block of data in the event of client or system failure. For many applications this is unacceptable, so you should call sync() at suitable points, such as after writing a certain number of records or bytes. Although sync() is designed not to unduly tax HDFS, it does have some overhead, so there is a trade-off between data robustness and throughput; a good choice is application-dependent, and the frequency of sync() calls should be chosen after measuring application performance.
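The following is a minimal sketch of that trade-off: syncing every fixed number of records bounds the amount of data that can be lost to roughly one interval's worth (the path, record contents, and interval here are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncingWriter {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/user/tom/events.log"));
    int syncInterval = 1000; // records between forced syncs (application-specific)
    for (int i = 0; i < 10000; i++) {
      out.write(("record " + i + "\n").getBytes("UTF-8"));
      if ((i + 1) % syncInterval == 0) {
        out.sync(); // hflush() on newer releases
      }
    }
    out.close();
  }
}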
Data Ingest with Flume and Sqoop
Rather than writing your own application to move data into HDFS (for example, by continuously running tail on a logfile and copying new lines in, an approach that is hard to make reliable), it is usually better to use an existing ingest tool: Apache Flume is designed for moving large quantities of streaming event data such as logs, and Sqoop is designed for bulk imports from structured stores such as relational databases.
Parallel Copying with distcp
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
By default, distcp skips files that already exist in the destination, but existing files can be overwritten with the -overwrite option, or you can update only the files that have changed using the -update option. For example, the following synchronizes /foo on the first cluster with /bar/foo on the second, copying only changed files:
% hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar/foo
Using either -overwrite or -update changes how the source and destination paths are interpreted, so it is worth reading the usage documentation (or trying a small test) before relying on them.
distcp is implemented as a MapReduce job in which the copying is done by maps running in parallel across the cluster (there are no reducers). The number of maps can be set with the -m argument; for example, -m 1000 would allocate 1,000 maps. To copy between two HDFS clusters that are running different versions of Hadoop, use the read-only HFTP protocol for the source:
% hadoop distcp hftp://namenode1:50070/foo hdfs://namenode2/bar
Note that the source is addressed by the namenode's web server port (controlled by dfs.http.address, 50070 by default). Alternatively, the webhdfs protocol can be used on both the source and the destination, which also works across versions:
% hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
Keeping an HDFS Cluster Balanced
When copying data into HDFS, it's important to consider cluster balance. If you specified a single map (-m 1), the first replica of every block written would be placed on the node running that map (until its disks filled up), leaving the cluster unbalanced; using more maps than nodes in the cluster, or simply the default number of maps, avoids the problem, and the balancer tool can be used afterward if needed.
Hadoop Archives
Using Hadoop Archives
% hadoop fs -lsr /my/files
-rw-r--r-- 1 tom supergroup 1 2009-04-09 19:13 /my/files/a
drwxr-xr-x - tom supergroup 0 2009-04-09 19:13 /my/files/dir
-rw-r--r-- 1 tom supergroup 1 2009-04-09 19:13 /my/files/dir/b
Running the archive command creates a HAR file; here we archive the directory /my/files into /my/files.har, with /my as the parent of the archive:
% hadoop archive -archiveName files.har /my/files /my
% hadoop fs -ls /my
Found 2 items
drwxr-xr-x - tom supergroup 0 2009-04-09 19:13 /my/files
drwxr-xr-x - tom supergroup 0 2009-04-09 19:13 /my/files.har
% hadoop fs -ls /my/files.har
Found 3 items
-rw-r--r-- 10 tom supergroup 165 2009-04-09 19:13 /my/files.har/_index
-rw-r--r-- 10 tom supergroup 23 2009-04-09 19:13 /my/files.har/_masterindex
-rw-r--r-- 1 tom supergroup 2 2009-04-09 19:13 /my/files.har/part-0
% hadoop fs -lsr har:///my/files.har
drw-r--r-- - tom supergroup 0 2009-04-09 19:13 /my/files.har/my
drw-r--r-- - tom supergroup 0 2009-04-09 19:13 /my/files.har/my/files
-rw-r--r-- 10 tom supergroup 1 2009-04-09 19:13 /my/files.har/my/files/a
drw-r--r-- - tom supergroup 0 2009-04-09 19:13 /my/files.har/my/files/dir
-rw-r--r-- 10 tom supergroup 1 2009-04-09 19:13 /my/files.har/my/files/dir/b
% hadoop fs -lsr har:///my/files.har/my/files/dir
% hadoop fs -lsr har://hdfs-localhost:8020/my/files.har/my/files/dir
% hadoop fs -rmr /my/files.har
Limitations
There are a few limitations to be aware of with HAR files. Creating an archive makes a copy of the original files, so you need as much extra disk space as the files you are archiving while you create the archive (although you can delete the originals afterward). Archives are immutable once created: to add or remove files, you must re-create the archive. Finally, although HAR files can be used as input to MapReduce, there is no archive-aware InputFormat that can pack multiple files into a single split, so processing lots of small files remains inefficient even when they are in a HAR.
CHAPTER 4
Hadoop I/O
Data Integrity
Data Integrity in HDFS
HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every io.bytes.per.checksum bytes of data; the default is 512 bytes, and because a CRC-32 checksum is 4 bytes long, the storage overhead is less than 1%.
If the client detects an error when reading a block, it reports the bad block and the datanode it was trying to read from to the namenode and throws a ChecksumException, a subclass of IOException; the namenode marks the replica as corrupt and arranges for a good copy to be re-replicated. In addition to block verification on client reads, each datanode runs a DataBlockScanner in a background thread that periodically verifies all the blocks stored on the datanode, guarding against "bit rot" in the physical storage media. It is possible to disable verification of checksums by passing false to the setVerifyChecksum() method on FileSystem before using its open() method to read a file; the same effect is possible from the shell by using the -ignoreCrc option with the -get or the equivalent -copyToLocal command.
LocalFileSystem

The Hadoop LocalFileSystem performs client-side checksumming. This means that when you write a file called filename, the filesystem client transparently creates a hidden file, .filename.crc, in the same directory containing the checksums for each chunk of the file. As in HDFS, the chunk size is controlled by the io.bytes.per.checksum property (512 bytes by default) and is stored as metadata in the .crc file, so the file can be read back correctly even if the setting changes.
Checksums are verified when the file is read, and if an error is detected, LocalFileSystem throws a ChecksumException. It is possible to disable checksums by using RawLocalFileSystem in place of LocalFileSystem; to do this globally in an application, remap the implementation for file URIs by setting the property fs.file.impl to the value org.apache.hadoop.fs.RawLocalFileSystem. Alternatively, you can directly create a RawLocalFileSystem instance, which may be useful if you want to disable checksum verification for only some reads, for example:
Configuration conf = ...
FileSystem fs = new RawLocalFileSystem();
fs.initialize(null, conf);
ChecksumFileSystem
ChecksumFileSystem

LocalFileSystem uses ChecksumFileSystem to do its work, and this class makes it easy to add checksumming to other (nonchecksummed) filesystems, because ChecksumFileSystem is just a wrapper around a FileSystem. The general idiom is:
FileSystem rawFs = ...
FileSystem checksummedFs = new ChecksumFileSystem(rawFs);
The underlying filesystem is called the raw filesystem and may be retrieved using the getRawFileSystem() method on ChecksumFileSystem. ChecksumFileSystem has a few more useful methods for working with checksums, such as getChecksumFile() for getting the path of a checksum file for any file. If an error is detected by ChecksumFileSystem when reading a file, it calls its reportChecksumFailure() method; the default implementation does nothing, but LocalFileSystem moves the offending file and its checksum to a side directory on the same device called bad_files, so an administrator can periodically check for these bad files and take action on them.
Compression
Compression format   Tool    Algorithm   Filename extension   Splittable?
DEFLATE [a]          N/A     DEFLATE     .deflate             No
gzip                 gzip    DEFLATE     .gz                  No
bzip2                bzip2   bzip2       .bz2                 Yes
LZO                  lzop    LZO         .lzo                 No [b]
LZ4                  N/A     LZ4         .lz4                 No
Snappy               N/A     Snappy      .snappy              No

[a] DEFLATE is a compression algorithm whose standard implementation is zlib. There is no commonly available command-line tool for producing files in DEFLATE format, as gzip is normally used. (Note that the gzip file format is DEFLATE with extra headers and a footer.) The .deflate filename extension is a Hadoop convention.
[b] However, LZO files are splittable if they have been indexed in a preprocessing step. See page 89.
All compression tools provide nine different options to control the space/time trade-off during compression: -1 optimizes for speed, and -9 optimizes for space. For example, the following command creates a compressed file file.gz using the fastest compression method:

gzip -1 file
Codecs
A codec is the implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an implementation of the CompressionCodec interface; GzipCodec, for example, encapsulates the compression and decompression algorithm for gzip. The codecs available for Hadoop are listed below.
Compression format   Hadoop CompressionCodec
DEFLATE              org.apache.hadoop.io.compress.DefaultCodec
gzip                 org.apache.hadoop.io.compress.GzipCodec
bzip2                org.apache.hadoop.io.compress.BZip2Codec
LZO                  com.hadoop.compression.lzo.LzopCodec
LZ4                  org.apache.hadoop.io.compress.Lz4Codec
Snappy               org.apache.hadoop.io.compress.SnappyCodec
LzopCodec, which is compatible with the lzop tool, is essentially the LZO format with extra headers and is the one you normally want; there is also an LzoCodec for the pure LZO format, which uses the .lzo_deflate filename extension (by analogy with DEFLATE, which is gzip without the headers).
Compressing and decompressing streams with CompressionCodec
CompressionCodec has two methods that allow you to easily compress or decompress data. To compress data being written to an output stream, use the createOutputStream(OutputStream out) method to create a CompressionOutputStream, to which you write your uncompressed data to have it written in compressed form to the underlying stream. Conversely, to decompress data being read from an input stream, call createInputStream(InputStream in) to obtain a CompressionInputStream, which allows you to read uncompressed data from the underlying stream. CompressionOutputStream and CompressionInputStream are similar to java.util.zip.DeflaterOutputStream and java.util.zip.DeflaterInputStream, except that both of the former provide the ability to reset their underlying compressor or decompressor, which is important for applications that compress sections of the data stream as separate blocks, such as SequenceFile. The program below illustrates how to use the API to compress data read from standard input and write it to standard output.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
public static void main(String[] args) throws Exception {
String codecClassname = args[0];
Class<?> codecClass = Class.forName(codecClassname);
Configuration conf = new Configuration();
CompressionCodec codec = (CompressionCodec)
ReflectionUtils.newInstance(codecClass, conf);
CompressionOutputStream out = codec.createOutputStream(System.out);
IOUtils.copyBytes(System.in, out, 4096, false);
out.finish();
}
}
The application expects the fully qualified name of the CompressionCodec implementation as the first command-line argument. We use ReflectionUtils to construct a new instance of the codec, then obtain a compression wrapper around System.out. We then call the utility method copyBytes() on IOUtils to copy the input to the output, which is compressed by the CompressionOutputStream. Finally, we call finish() on the CompressionOutputStream, which tells the compressor to finish writing to the compressed stream but doesn't close the stream. We can try it out with the following command line, which compresses the string "Text" using StreamCompressor with the GzipCodec, then decompresses it from standard input using gunzip:
% echo "Text" | hadoop StreamCompressor org.apache.hadoop.io.compress.GzipCodec \
| gunzip -
Text
Inferring CompressionCodecs using CompressionCodecFactory
If you are reading a compressed file, normally you can infer which codec to use by looking at its filename extension: a file ending in .gz can be read with GzipCodec, and so on. CompressionCodecFactory provides a way of mapping a filename extension to a CompressionCodec using its getCodec() method, which takes a Path object for the file in question. The following application uses this feature to decompress files:
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class FileDecompressor {
public static void main(String[] args) throws Exception {
String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
Path inputPath = new Path(uri);
CompressionCodecFactory factory = new CompressionCodecFactory(conf);
CompressionCodec codec = factory.getCodec(inputPath);
if (codec == null) {
System.err.println("No codec found for " + uri);
System.exit(1);
}
String outputUri =
CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
InputStream in = null;
OutputStream out = null;
try {
in = codec.createInputStream(fs.open(inputPath));
out = fs.create(new Path(outputUri));
IOUtils.copyBytes(in, out, conf);
} finally {
IOUtils.closeStream(in);
IOUtils.closeStream(out);
}
}
}
Once the codec has been found, it is used to strip off the file suffix to form the output filename, via the removeSuffix() static method of CompressionCodecFactory. In this way, a file named file.gz is decompressed to file by invoking the program as follows:
% hadoop FileDecompressor file.gz
CompressionCodecFactory finds codecs from a list defined by the io.compression.codecs configuration property (summarized below). By default, the list contains the codecs that ship with Hadoop, so you need to alter the property only if you have a custom codec that you wish to register.
Property name:  io.compression.codecs
Type:           comma-separated class names
Default value:  org.apache.hadoop.io.compress.DefaultCodec,
                org.apache.hadoop.io.compress.GzipCodec,
                org.apache.hadoop.io.compress.BZip2Codec
Description:    A list of the CompressionCodec classes for compression/decompression
Native libraries
Compression format   Java implementation?   Native implementation?
DEFLATE              Yes                    Yes
gzip                 Yes                    Yes
bzip2                Yes                    No
LZO                  No                     Yes
LZ4                  No                     Yes
Snappy               No                     Yes
Hadoop looks for native libraries using the java.library.path Java system property; the hadoop wrapper script in the bin directory sets this for you, but if you don't use the script, you will need to set the property yourself. By default, Hadoop uses the native library for a platform if one is available, falling back automatically to the Java implementation otherwise; to disable native libraries (for example, when debugging a compression-related problem), set the hadoop.native.lib property to false.
CodecPool
If you are using a native library and doing a lot of compression or decompression in your application, consider using CodecPool, which allows you to reuse Compressor and Decompressor instances, amortizing the cost of creating those objects. The following program shows the API; it differs from the earlier StreamCompressor only in obtaining a Compressor from the pool:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.util.ReflectionUtils;

public class PooledStreamCompressor {
  public static void main(String[] args) throws Exception {
    String codecClassname = args[0];
    Class<?> codecClass = Class.forName(codecClassname);
    Configuration conf = new Configuration();
    CompressionCodec codec = (CompressionCodec)
        ReflectionUtils.newInstance(codecClass, conf);
Compressor compressor = null;
try {
compressor = CodecPool.getCompressor(codec);
CompressionOutputStream out =
codec.createOutputStream(System.out, compressor);
IOUtils.copyBytes(System.in, out, 4096, false);
out.finish();
} finally {
CodecPool.returnCompressor(compressor);
}
}
}
We retrieve a Compressor instance from the pool for a given CompressionCodec and use it in the codec's overloaded createOutputStream() method. By using a finally block, we ensure that the compressor is returned to the pool even if there is an IOException while copying the bytes between the streams.
Compression and Input Splits
Which Compression Format Should I Use?
Using Compression in MapReduce
If you want the output of your MapReduce job to be compressed, set the mapred.output.compress property to true and the mapred.output.compression.codec property to the classname of the compression codec you want to use. Alternatively, use the static convenience methods on FileOutputFormat to set these properties, as the following version of the maximum-temperature application does:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCompression {
  public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MaxTemperatureWithCompression <input path> " +
"<output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
% hadoop MaxTemperatureWithCompression input/ncdc/sample.txt.gz output
% gunzip -c output/part-r-00000.gz
1949 111
1950 22
If you are emitting sequence files as output, you can set the mapred.output.compression.type property to control the type of compression to use. The default is RECORD, which compresses individual records; changing this to BLOCK, which compresses groups of records, is recommended because it compresses better. There is also a static convenience method, setOutputCompressionType(), on SequenceFileOutputFormat to set this property.
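For example, to request block compression for SequenceFile output in the new API (a sketch; the Job here is assumed to be the one being configured for the rest of the job):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

Job job = new Job(new Configuration());
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);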
The output-compression properties are summarized below; they can also be set on the command line (with -D property=value) if your driver implements the Tool interface.
Property name                     Type         Default value                                 Description
mapred.output.compress            boolean      false                                         Compress outputs
mapred.output.compression.codec   Class name   org.apache.hadoop.io.compress.DefaultCodec    The compression codec to use for outputs
mapred.output.compression.type    String       RECORD                                        The type of compression to use for SequenceFile outputs: NONE, RECORD, or BLOCK
Compressing map output
Property name                        Type      Default value                                 Description
mapred.compress.map.output           boolean   false                                         Compress map outputs
mapred.map.output.compression.codec  Class     org.apache.hadoop.io.compress.DefaultCodec    The compression codec to use for map outputs
Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec", GzipCodec.class,
CompressionCodec.class);
Job job = new Job(conf);
Here are the equivalent lines using the old MapReduce API's JobConf convenience methods:
conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(GzipCodec.class);
Serialization
The Writable Interface
The Writable interface defines two methods: one for writing an object's state to a DataOutput binary stream, and one for reading its state from a DataInput binary stream:
package org.apache.hadoop.io;
import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;
public interface Writable {
void write(DataOutput out) throws IOException;
void readFields(DataInput in) throws IOException;
}
Let's look at a particular Writable to see what we can do with it. We will use IntWritable, a wrapper for a Java int. We can create one and set its value using the set() method:
IntWritable writable = new IntWritable();
writable.set(163);
IntWritable writable = new IntWritable(163);
Equivalently, we can use the constructor that takes the integer value, as above. To examine the serialized form of the IntWritable, we write a small helper method that wraps a java.io.ByteArrayOutputStream in a java.io.DataOutputStream (an implementation of java.io.DataOutput) to capture the bytes in the serialized stream:
public static byte[] serialize(Writable writable) throws IOException {
ByteArrayOutputStream out = new ByteArrayOutputStream();
DataOutputStream dataOut = new DataOutputStream(out);
writable.write(dataOut);
dataOut.close();
return out.toByteArray();
}
byte[] bytes = serialize(writable);
assertThat(bytes.length, is(4));
An integer is written using four bytes (as we see using JUnit 4 assertions). The bytes are written in big-endian order, and we can see their hexadecimal representation by using a method on Hadoop's StringUtils:
assertThat(StringUtils.byteToHexString(bytes), is("000000a3"));
Let's try deserialization. Again, we create a helper method to read a Writable object from a byte array:
public static byte[] deserialize(Writable writable, byte[] bytes)
throws IOException {
ByteArrayInputStream in = new ByteArrayInputStream(bytes);
DataInputStream dataIn = new DataInputStream(in);
writable.readFields(dataIn);
dataIn.close();
return bytes;
}
We construct a new, value-less IntWritable, then call deserialize() to read from the output data that we just wrote. Then we check that its value, retrieved using the get() method, is the original value, 163:
IntWritable newWritable = new IntWritable();
deserialize(newWritable, bytes);
assertThat(newWritable.get(), is(163));
WritableComparable and comparators
IntWritable implements the WritableComparable interface, which is just a subinterface of the Writable and java.lang.Comparable interfaces:
package org.apache.hadoop.io;
public interface WritableComparable<T> extends Writable, Comparable<T> {
}
Comparison of types is crucial for MapReduce, where there is a sorting phase during which keys are compared with one another. One optimization Hadoop provides is the RawComparator extension of Java's Comparator:
package org.apache.hadoop.io;
import java.util.Comparator;
public interface RawComparator<T> extends Comparator<T> {
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}
This interface permits implementors to compare records read from a stream without deserializing them into objects, thereby avoiding the overhead of object creation. For example, the comparator for IntWritables implements the raw compare() method by reading an integer from each of the byte arrays b1 and b2 (at the given start positions s1 and s2, and with lengths l1 and l2) and comparing them directly. WritableComparator is a general-purpose implementation of RawComparator for WritableComparable classes. It provides two main functions: first, a default implementation of the raw compare() method that deserializes the objects to be compared from the stream and invokes the object compare() method; second, it acts as a factory for RawComparator instances (that Writable implementations have registered).
For example, to obtain a comparator for IntWritable, we just use:
RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);
The comparator can be used to compare two IntWritable objects, either as objects or directly on their serialized representations:
IntWritable w1 = new IntWritable(163);
IntWritable w2 = new IntWritable(67);
assertThat(comparator.compare(w1, w2), greaterThan(0));
byte[] b1 = serialize(w1);
byte[] b2 = serialize(w2);
assertThat(comparator.compare(b1, 0, b1.length, b2, 0, b2.length),
greaterThan(0));
Writable Classes
Hadoop comes with a large selection of Writable classes, which are available in the org.apache.hadoop.io package.
Writable wrappers for Java primitives
There are Writable wrappers for all the Java primitive types (see the table below) except char, which can be stored in an IntWritable. All have a get() and set() method for retrieving and storing the wrapped value.
Java primitive   Writable implementation   Serialized size (bytes)
boolean          BooleanWritable           1
byte             ByteWritable              1
short            ShortWritable             2
int              IntWritable               4
int              VIntWritable              1–5
float            FloatWritable             4
long             LongWritable              8
long             VLongWritable             1–9
double           DoubleWritable            8
When it comes to encoding integers, there is a choice between the fixed-length formats (IntWritable and LongWritable) and the variable-length formats (VIntWritable and VLongWritable). The variable-length formats use only a single byte to encode small values; otherwise, they use the first byte to indicate the sign and how many bytes follow. For example, 163 requires two bytes:
byte[] data = serialize(new VIntWritable(163));
assertThat(StringUtils.byteToHexString(data), is("8fa3"));
How do you choose between a fixed-length and a variable-length encoding? Fixed-length encodings are good when the distribution of values is fairly uniform across the whole value space, such as for a well-designed hash function. Most numeric variables tend to have nonuniform distributions, though, and on average the variable-length encoding will save space. Another advantage of variable-length encodings is that you can switch from VIntWritable to VLongWritable, because their encodings are actually the same; by choosing a variable-length representation, you have room to grow without committing to an 8-byte long representation from the beginning.
Text
Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of java.lang.String (it replaced the older, deprecated UTF8 class, which could not handle strings whose encoding was over 32,767 bytes and which used Java's modified UTF-8). The Text class uses an int (with a variable-length encoding) to store the number of bytes in the string encoding, so the maximum value is 2 GB. Furthermore, Text uses standard UTF-8, which makes it potentially easier to interoperate with other tools that understand UTF-8. Because of its emphasis on standard UTF-8, there are some differences between Text and the Java String class.

Indexing. Indexing for the Text class is in terms of position in the encoded byte sequence, not the Unicode character in the string or the Java char code unit (as it is for String). For ASCII strings, these three concepts of index position coincide. Note the use of charAt() in the following example:
Text t = new Text("hadoop");
assertThat(t.getLength(), is(6));
assertThat(t.getBytes().length, is(6));
assertThat(t.charAt(2), is((int) 'd'));
assertThat("Out of bounds", t.charAt(100), is(-1));
Notice that charAt() returns an int representing a Unicode code point, unlike the String variant, which returns a char. Text also has a find() method, which is analogous to String's indexOf():
Text t = new Text("hadoop");
assertThat("Find a substring", t.find("do"), is(2));
assertThat("Finds first 'o'", t.find("o"), is(3));
assertThat("Finds 'o' from position 4 or later", t.find("o", 4), is(4));
assertThat("No match", t.find("pig"), is(-1));
Unicode. When we start using characters that are encoded with more than a single byte, the differences between Text and String become clear. Consider the following Unicode characters:

- U+0041: LATIN CAPITAL LETTER A; UTF-8 code units: 41; Java representation: \u0041
- U+00DF: LATIN SMALL LETTER SHARP S; UTF-8 code units: c3 9f; Java representation: \u00DF
- U+6771: N/A (a unified Han ideograph); UTF-8 code units: e6 9d b1; Java representation: \u6771
- U+10400: DESERET CAPITAL LETTER LONG I; UTF-8 code units: f0 90 90 80; Java representation: \uD801\uDC00
All but the last of these characters can be represented by a single Java char; U+10400 is a supplementary character, represented by a surrogate pair of two chars. The test below shows the differences between String and Text when processing a string made up of these four characters:
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.is;

import java.io.UnsupportedEncodingException;

import org.apache.hadoop.io.Text;
import org.junit.Test;

public class StringTextComparisonTest {
@Test
public void string() throws UnsupportedEncodingException {
String s = "\u0041\u00DF\u6771\uD801\uDC00";
assertThat(s.length(), is(5));
assertThat(s.getBytes("UTF-8").length, is(10));
assertThat(s.indexOf("\u0041"), is(0));
assertThat(s.indexOf("\u00DF"), is(1));
assertThat(s.indexOf("\u6771"), is(2));
assertThat(s.indexOf("\uD801\uDC00"), is(3));
assertThat(s.charAt(0), is('\u0041'));
assertThat(s.charAt(1), is('\u00DF'));
assertThat(s.charAt(2), is('\u6771'));
assertThat(s.charAt(3), is('\uD801'));
assertThat(s.charAt(4), is('\uDC00'));
assertThat(s.codePointAt(0), is(0x0041));
assertThat(s.codePointAt(1), is(0x00DF));
assertThat(s.codePointAt(2), is(0x6771));
assertThat(s.codePointAt(3), is(0x10400));
}
@Test
public void text() {
Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
assertThat(t.getLength(), is(10));
assertThat(t.find("\u0041"), is(0));
assertThat(t.find("\u00DF"), is(1));
assertThat(t.find("\u6771"), is(3));
assertThat(t.find("\uD801\uDC00"), is(6));
assertThat(t.charAt(0), is(0x0041));
assertThat(t.charAt(1), is(0x00DF));
assertThat(t.charAt(3), is(0x6771));
assertThat(t.charAt(6), is(0x10400));
}
}
The test confirms that the length of a String is the number of char code units it contains (five: one char for each of the first three characters and a surrogate pair for the last), whereas the length of a Text object is the number of bytes in its UTF-8 encoding (10 = 1 + 2 + 3 + 4). Similarly, the indexOf() method in String returns an index in char code units, and find() for Text returns a byte offset. The charAt() method in String returns the char code unit for the given index, which in the case of a surrogate pair will not represent a whole Unicode character; the codePointAt() method, also indexed by char code unit, is needed to retrieve a single Unicode character represented as an int. In fact, the charAt() method in Text is more like codePointAt() than its namesake in String; the only difference is that it is indexed by byte offset.
Iteration. Iterating over the Unicode characters in a Text object is complicated by the use of byte offsets for indexing, because you can't just increment the index. The idiom for iteration is a little obscure: turn the Text object into a java.nio.ByteBuffer, then repeatedly call the bytesToCodePoint() static method on Text with the buffer. This method extracts the next code point as an int and updates the position in the buffer; the end of the string is detected when bytesToCodePoint() returns -1. The following program shows the idiom:
public class TextIterator {
public static void main(String[] args) {
Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
ByteBuffer buf = ByteBuffer.wrap(t.getBytes(), 0, t.getLength());
int cp;
while (buf.hasRemaining() && (cp = Text.bytesToCodePoint(buf)) != -1) {
System.out.println(Integer.toHexString(cp));
}
}
}
% hadoop TextIterator
41
df
6771
10400
Mutability. Another difference from String is that Text is mutable (like all Writable implementations in Hadoop, except NullWritable, which is a singleton). You can reuse a Text instance by calling one of the set() methods on it. For example:
Text t = new Text("hadoop");
t.set("pig");
assertThat(t.getLength(), is(3));
assertThat(t.getBytes().length, is(3));
In some situations, the byte array returned by getBytes() may be longer than the length returned by getLength():
Text t = new Text("hadoop");
t.set(new Text("pig"));
assertThat(t.getLength(), is(3));
assertThat("Byte length not shortened", t.getBytes().length,
is(6));
This shows why you should always use getLength() together with getBytes(), so you know how much of the byte array holds valid Text data.
Resorting to String. Text doesn't have as rich an API for manipulating strings as java.lang.String, so in many cases you need to convert the Text object to a String. This is done in the usual way, using the toString() method:
assertThat(new Text("hadoop").toString(), is("hadoop"));
BytesWritable
BytesWritable is a wrapper for an array of binary data. Its serialized format is a 4-byte integer field that specifies the number of bytes to follow, followed by the bytes themselves. For example, the byte array of length 2 with values 3 and 5 is serialized as the 4-byte integer 00000002 followed by the two bytes 03 and 05:
BytesWritable b = new BytesWritable(new byte[] { 3, 5 });
byte[] bytes = serialize(b);
assertThat(StringUtils.byteToHexString(bytes), is("000000020305"));
BytesWritable is mutable, and its value may be changed by calling its set() method. As with Text, the size of the byte array returned by getBytes() for BytesWritable is not necessarily the amount of data it holds (its capacity may be larger), so you must determine the actual size by calling getLength(). For example:
b.setCapacity(11);
assertThat(b.getLength(), is(2));
assertThat(b.getBytes().length, is(11));
NullWritable
NullWritable is a special type of Writable: it has a zero-length serialization, so no bytes are written to or read from the stream. It is used as a placeholder; for example, in MapReduce a key or a value can be declared as a NullWritable when you don't need to use that position, effectively storing a constant empty value. NullWritable can also be useful as the key in a SequenceFile when you want to store a list of values, as opposed to key-value pairs. It is an immutable singleton, and the instance is retrieved by calling NullWritable.get().
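A quick check of its zero-length serialization, reusing the serialize() helper shown earlier in this chapter:

byte[] bytes = serialize(NullWritable.get());
assertThat(bytes.length, is(0));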
ObjectWritable and GenericWritable
ObjectWritable is a general-purpose wrapper for Java primitives, String, enum, Writable, null, or arrays of any of these types. It is used in Hadoop RPC to marshal and unmarshal method arguments and return types. ObjectWritable is useful when a field can be of more than one type. For example, if the values in a SequenceFile have multiple types, you can declare the value type as an ObjectWritable and wrap each type in an ObjectWritable. Being a general-purpose mechanism, it wastes a fair amount of space because it writes the classname of the wrapped type every time it is serialized. In cases where the number of types is small and known ahead of time, this can be improved by having a static array of types and using the index into the array as the serialized reference to the type. This is the approach taken by GenericWritable, which you subclass to specify the types to support.
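A minimal GenericWritable sketch (the subclass and its field names here are illustrative): declaring the concrete Writable types up front means an array index, rather than a full classname, identifies each value's type in the serialized form.

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MyGenericWritable extends GenericWritable {

  // The set of types this wrapper can carry; their positions act as type IDs.
  private static final Class<? extends Writable>[] TYPES = new Class[] {
      Text.class, IntWritable.class
  };

  @Override
  protected Class<? extends Writable>[] getTypes() {
    return TYPES;
  }
}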
Writable collections
There are six Writable collection types in the org.apache.hadoop.io package: ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable, and EnumSetWritable. ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and two-dimensional arrays (arrays of arrays) of Writable instances. All the elements of an ArrayWritable or a TwoDArrayWritable must be instances of the same class, which is specified at construction, as follows:
ArrayWritable writable = new ArrayWritable(Text.class);
In contexts where the Writable is defined by type, such as in SequenceFile keys or values or as input to MapReduce in general, you need to subclass ArrayWritable (or TwoDArrayWritable, as appropriate) to set the type statically. For example:
public class TextArrayWritable extends ArrayWritable {
public TextArrayWritable() {
super(Text.class);
}
}
ArrayWritable and TwoDArrayWritable both have get() and set() methods, as well as a toArray() method that creates a shallow copy of the array (or 2D array). ArrayPrimitiveWritable is a wrapper for arrays of Java primitives; the component type is detected when you call set(), so there is no need to subclass to set the type. MapWritable and SortedMapWritable are implementations of java.util.Map<Writable, Writable> and java.util.SortedMap<WritableComparable, Writable>, respectively. The type of each key and value field is part of the serialization format for that field: the type is stored as a single byte that acts as an index into an array of types. The array is populated with the standard types in the org.apache.hadoop.io package, but custom Writable types are accommodated, too, by writing a header that encodes the type array for nonstandard types. As they are implemented, MapWritable and SortedMapWritable use positive byte values for custom types, so a maximum of 127 distinct nonstandard Writable classes can be used in any particular MapWritable or SortedMapWritable instance. Here's a demonstration of using a MapWritable with different types for keys and values:
MapWritable src = new MapWritable();
src.put(new IntWritable(1), new Text("cat"));
src.put(new VIntWritable(2), new LongWritable(163));
MapWritable dest = new MapWritable();
WritableUtils.cloneInto(dest, src);
assertThat((Text) dest.get(new IntWritable(1)), is(new Text("cat")));
assertThat((LongWritable) dest.get(new VIntWritable(2)), is(new
LongWritable(163)));
Conspicuous by their absence are Writable collection implementations for sets and lists. A general set can be emulated by using a MapWritable (or a SortedMapWritable for a sorted set) with NullWritable values; there is also EnumSetWritable for sets of enum types. For lists of a single Writable type, ArrayWritable is adequate, but to store different types of Writable in a single list, you can use GenericWritable to wrap the elements in an ArrayWritable. Alternatively, you could write a general-purpose ListWritable using the ideas of MapWritable, as sketched below.
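Here is a minimal hedged sketch of such a list type, restricted to a single element class for simplicity (the class name is an assumption; a fully general version would also record each element's type, as MapWritable does):

import java.io.*;
import java.util.*;
import org.apache.hadoop.io.*;

// Hedged sketch: a simple Writable list of Text elements.
// write() stores the element count as a VInt, then each element in turn.
public class TextListWritable implements Writable {
  private List<Text> elements = new ArrayList<Text>();

  public List<Text> get() {
    return elements;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    WritableUtils.writeVInt(out, elements.size());
    for (Text element : elements) {
      element.write(out);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    elements.clear();
    int count = WritableUtils.readVInt(in);
    for (int i = 0; i < count; i++) {
      Text element = new Text();
      element.readFields(in);
      elements.add(element);
    }
  }
}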
Implementing a Custom Writable
Hadoop comes with a useful set of Writable implementations that serve most purposes; however, on occasion you may need to write your own custom implementation. With a custom Writable you have full control over the binary representation and the sort order. Because Writables are at the heart of the MapReduce data path, tuning the binary representation can have a significant effect on performance. The stock Writable implementations are usually a good starting point, but for elaborate structures it is often better to create a new Writable type than to compose the stock types. To demonstrate, we'll write an implementation that represents a pair of strings, called TextPair:
import java.io.*;
import org.apache.hadoop.io.*;
public class TextPair implements WritableComparable<TextPair> {
private Text first;
private Text second;
public TextPair() {
set(new Text(), new Text());
}
public TextPair(String first, String second) {
set(new Text(first), new Text(second));
}
public TextPair(Text first, Text second) {
set(first, second);
}
public void set(Text first, Text second) {
this.first = first;
this.second = second;
}
public Text getFirst() {
return first;
}
public Text getSecond() {
return second;
}
@Override
public void write(DataOutput out) throws IOException {
first.write(out);
second.write(out);
}
@Override
public void readFields(DataInput in) throws IOException {
first.readFields(in);
second.readFields(in);
}
@Override
public int hashCode() {
return first.hashCode() * 163 + second.hashCode();
}
@Override
public boolean equals(Object o) {
if (o instanceof TextPair) {
TextPair tp = (TextPair) o;
return first.equals(tp.first) && second.equals(tp.second);
}
return false;
}
@Override
public String toString() {
return first + "\t" + second;
}
@Override
public int compareTo(TextPair tp) {
int cmp = first.compareTo(tp.first);
if (cmp != 0) {
return cmp;
}
return second.compareTo(tp.second);
}
}
The first part of the implementation is straightforward: there are two Text instance variables, first and second, and associated constructors, getters, and setters. All Writable implementations must have a default constructor so that the MapReduce framework can instantiate them and then populate their fields by calling readFields(). Writable instances are mutable and often reused, so you should take care to avoid allocating objects in the write() or readFields() methods.

TextPair's write() method serializes each Text object in turn to the output stream by delegating to the Text objects themselves; similarly, readFields() deserializes the bytes from the input stream by delegating to each Text. The DataOutput and DataInput interfaces have a rich set of methods for serializing and deserializing Java primitives, so in general you have complete control over the wire format of your Writable object.

Just as you would for any value object in Java, you should override hashCode(), equals(), and toString() from java.lang.Object. hashCode() is used by HashPartitioner (the default partitioner in MapReduce) to choose a reduce partition, so a good hash function that mixes well helps give evenly sized partitions. If you plan to use your custom Writable with TextOutputFormat, you should implement toString(), since TextOutputFormat calls it on keys and values for their output representation; for TextPair, we write the underlying Text objects as strings separated by a tab character.

TextPair is an implementation of WritableComparable, so it provides a compareTo() method that imposes the ordering you would expect: it sorts by the first string followed by the second. Notice that, apart from the number of Text objects it can store, TextPair differs from the TextArrayWritable class from the previous section in that TextArrayWritable is only a Writable, not a WritableComparable.
Implementing a RawComparator for speed
The code for TextPair as it stands will work, but there is a further optimization we can make. When TextPair is used as a key in MapReduce, it has to be deserialized into an object for the compareTo() method to be invoked. What if it were possible to compare two TextPair objects just by looking at their serialized representations? We can, because a TextPair is the concatenation of two Text objects, and the binary representation of a Text is a variable-length integer giving the number of bytes in the UTF-8 encoding of the string, followed by the UTF-8 bytes themselves. The trick is to read the initial length, so we know how long the first Text object's byte representation is, and then delegate to Text's raw comparator with the appropriate offsets for the first or second string. The following comparator is nested in the TextPair class:
public static class Comparator extends WritableComparator {
private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
public Comparator() {
super(TextPair.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
try {
int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
if (cmp != 0) {
return cmp;
}
return TEXT_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1,
b2, s2 + firstL2, l2 - firstL2);
} catch (IOException e) {
throw new IllegalArgumentException(e);
}
}
}
static {
WritableComparator.define(TextPair.class, new Comparator());
}
We subclass WritableComparator rather than implementing RawComparator directly, since it provides some convenience methods and default implementations. The subtle part of this code is calculating firstL1 and firstL2, the lengths of the first Text field in each byte stream. Each is made up of the length of the variable-length integer (returned by decodeVIntSize() on WritableUtils) plus the value it encodes (returned by readVInt()). The static block registers the raw comparator so that whenever MapReduce sees the TextPair class, it knows to use the raw comparator as its default comparator.
Custom comparators
As you can see with TextPair, writing raw comparators takes some care because you have to deal with details at the byte level. It is worth looking at some of the Writable implementations in the org.apache.hadoop.io package for further ideas if you need to write your own; the utility methods on WritableUtils are very handy, too. Custom comparators, which implement a sort order different from the natural one, should also be written as RawComparators if possible. The following comparator for TextPair, called FirstComparator, considers only the first string of the pair. Note that we also override the compare() method that takes objects, so both compare() methods have the same semantics:
public static class FirstComparator extends WritableComparator {
private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
public FirstComparator() {
super(TextPair.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
try {
int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
} catch (IOException e) {
throw new IllegalArgumentException(e);
}
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
if (a instanceof TextPair && b instanceof TextPair) {
return ((TextPair) a).first.compareTo(((TextPair) b).first);
}
return super.compare(a, b);
}
}
Serialization Frameworks
Although most MapReduce programs use Writable key and value types, this isn't mandated by the MapReduce API; any type can be used, as long as there is a mechanism for translating it to and from a binary representation. To support this, Hadoop has an API for pluggable serialization frameworks. A serialization framework is represented by an implementation of Serialization (in the org.apache.hadoop.io.serializer package); WritableSerialization, for example, is the implementation for Writable types. A Serialization defines a mapping from types to Serializer instances (for turning an object into a byte stream) and Deserializer instances (for turning a byte stream into an object). Set the io.serializations property to a comma-separated list of classnames to register Serialization implementations; its default value includes org.apache.hadoop.io.serializer.WritableSerialization, which is why Writable objects can be serialized and deserialized out of the box. Hadoop also ships with JavaSerialization, which uses Java Object Serialization and makes it possible to use standard Java types such as Integer or String directly, but Java serialization is not as compact or efficient as Writables, so it is generally not worth the trade-off.
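If you do want to experiment with it, a rough sketch of registering JavaSerialization (property and class names as described above; this changes only the configuration, and Writables remain the recommended choice) looks like this:

// Hedged sketch: registering Java Object Serialization alongside the default
// WritableSerialization via the io.serializations property.
Configuration conf = new Configuration();
conf.setStrings("io.serializations",
    "org.apache.hadoop.io.serializer.WritableSerialization",
    "org.apache.hadoop.io.serializer.JavaSerialization");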
Why Not Use Java Object Serialization?
In short, java.io.Serializable (and even the leaner java.io.Externalizable) was not adopted for Hadoop because it does not give precise control over the wire format, and its serialized form carries more overhead than Writables, which matters at Hadoop's scale.
Serialization IDL
Rather than defining types in code, IDL-based frameworks define them declaratively in a language-neutral interface description language, which makes it easier to share data between systems written in different languages. Hadoop's own Record I/O (in the org.apache.hadoop.record package) took this approach, but it has been deprecated in favor of Avro; Apache Thrift and Google Protocol Buffers are other popular IDL-based frameworks.
Avro
Avro Data Types and Schemas
Avro defines a small number of primitive data types, which can be combined to build application-specific data structures by writing schemas. Schemas are usually written in JSON; a primitive type can be written either as a plain string, such as "null", or in the more verbose form using the type attribute:

{ "type": "null" }
Type      Description                                                  Schema
null      The absence of a value                                       "null"
boolean   A binary value                                               "boolean"
int       32-bit signed integer                                        "int"
long      64-bit signed integer                                        "long"
float     Single-precision (32-bit) IEEE 754 floating-point number     "float"
double    Double-precision (64-bit) IEEE 754 floating-point number     "double"
bytes     Sequence of 8-bit unsigned bytes                             "bytes"
string    Sequence of Unicode characters                               "string"
Type     Description and schema example

array    An ordered collection of objects. All objects in a particular array must have the same schema.
         {
           "type": "array",
           "items": "long"
         }

map      An unordered collection of key-value pairs. Keys must be strings and values may be any type, although within a particular map, all values must have the same schema.
         {
           "type": "map",
           "values": "string"
         }

record   A collection of named fields of any type.
         {
           "type": "record",
           "name": "WeatherRecord",
           "doc": "A weather reading.",
           "fields": [
             {"name": "year", "type": "int"},
             {"name": "temperature", "type": "int"},
             {"name": "stationId", "type": "string"}
           ]
         }

enum     A set of named values.
         {
           "type": "enum",
           "name": "Cutlery",
           "doc": "An eating utensil.",
           "symbols": ["KNIFE", "FORK", "SPOON"]
         }

fixed    A fixed number of 8-bit unsigned bytes.
         {
           "type": "fixed",
           "name": "Md5Hash",
           "size": 16
         }

union    A union of schemas. A union is represented by a JSON array, where each element in the array is a schema. Data represented by a union must match one of the schemas in the union.
         [
           "null",
           "string",
           {"type": "map", "values": "string"}
         ]
Each Avro language API has its own representation for each Avro type. Avro's double, for example, is represented by a double in C, C++, and Java, by a float in Python, and by a Float in Ruby. The named types (record, enum, and fixed) additionally have a name and an optional namespace attribute.

In the Java API, an Avro string can be represented by either a Java String or the Avro Utf8 type. Utf8 is more efficient: being mutable, a single instance can be reused for reading or writing a series of values, and it decodes the UTF-8 bytes lazily, whereas constructing a String decodes them up front. Utf8 implements java.lang.CharSequence, so it interoperates with some of the Java String APIs, but converting it to a String requires a call to toString(). From Avro 1.6, you can ask for a String representation by setting the avro.java.string property in the schema:

{ "type": "string", "avro.java.string": "String" }

Alternatively, for the specific mapping you can have the code generator emit classes that use String by setting the stringType attribute to String. The following table shows how Avro types map to Java types in the generic, specific, and reflect mappings.
Avro type   Generic Java mapping                            Specific Java mapping                                                    Reflect Java mapping
null        null type
boolean     boolean
int         int                                                                                                                      short or int
long        long
float       float
double      double
bytes       java.nio.ByteBuffer                                                                                                      Array of byte
string      org.apache.avro.util.Utf8 or java.lang.String                                                                            java.lang.String
array       org.apache.avro.generic.GenericArray                                                                                     Array or java.util.Collection
map         java.util.Map
record      org.apache.avro.generic.GenericRecord           Generated class implementing org.apache.avro.specific.SpecificRecord    Arbitrary user class with a zero-argument constructor; all inherited nontransient instance fields are used
enum        java.lang.String                                Generated Java enum                                                      Arbitrary Java enum
fixed       org.apache.avro.generic.GenericFixed            Generated class implementing org.apache.avro.specific.SpecificFixed     org.apache.avro.generic.GenericFixed
union       java.lang.Object

(Where the specific or reflect column is blank, the generic mapping is used.)
In-Memory Serialization and Deserialization
{
"type": "record",
"name": "StringPair",
"doc": "A pair of strings.",
"fields": [
{"name": "left", "type": "string"},
{"name": "right", "type": "string"}
]
}
.avsc
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(getClass().getResourceAsStream("StringPair.avsc"));
GenericRecord datum = new GenericData.Record(schema);
datum.put("left", "L");
datum.put("right", "R");
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(datum, encoder);
encoder.flush();
out.close();
DatumWriter Encoder
DatumWriterEncoder
GenericDatumWriter
GenericRecordEncodernull
write()
GenericDatumWriter
write()
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
GenericRecord result = reader.read(null, decoder);
assertThat(result.get("left").toString(), is("L"));
assertThat(result.get("right").toString(), is("R"));
nullbinaryDecoder()read()
result.get("left")result.get("left")Utf8
StringtoString()
The specific API
StringPair
<project>
...
<build>
<plugins>
<plugin>
<groupId>org.apache.avro</groupId>
<artifactId>avro-maven-plugin</artifactId>
<version>${avro.version}</version>
<executions>
<execution>
<id>schemas</id>
<phase>generate-sources</phase>
<goals>
<goal>schema</goal>
</goals>
<configuration>
<includes>
<include>StringPair.avsc</include>
</includes>
<sourceDirectory>src/main/resources</sourceDirectory>
<outputDirectory>${project.build.directory}/generated-sources/java
</outputDirectory>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
...
</project>
org.apache.avro.spe
cific.SchemaTask
GenericRecord
StringPairSpecificDatumWriter
SpecificDatumReader
StringPair datum = new StringPair();
datum.left = "L";
datum.right = "R";
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<StringPair> writer =
new SpecificDatumWriter<StringPair>(StringPair.class);
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(datum, encoder);
encoder.flush();
out.close();
DatumReader<StringPair> reader =
new SpecificDatumReader<StringPair>(StringPair.class);
Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
StringPair result = reader.read(null, decoder);
assertThat(result.left.toString(), is("L"));
assertThat(result.right.toString(), is("R"));
datum.setLeft("L")result.getLeft()
Avro Datafiles
Avro's object container file format is for storing sequences of Avro objects. A datafile has a header containing metadata, including the schema and a sync marker, followed by a series of blocks containing the serialized objects. Writing to a datafile is similar to the in-memory case: we use a DatumWriter as before, but instead of using an Encoder, we wrap the DatumWriter in a DataFileWriter, create a file (which conventionally has a .avro extension), and append objects to it:
File file = new File("data.avro");
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter =
new DataFileWriter<GenericRecord>(writer);
dataFileWriter.create(schema, file);
dataFileWriter.append(datum);
dataFileWriter.close();
The objects written to the datafile must conform to the file's schema; otherwise, an exception is thrown when append() is called. This example writes to a local java.io.File, but we can write to any java.io.OutputStream by using the overloaded create() method on DataFileWriter (to write to HDFS, for example, get an OutputStream by calling create() on FileSystem). Reading back objects from a datafile is similar to reading from an in-memory stream, with one important difference: we don't have to specify a schema, since it is read from the file metadata. Indeed, we can ask the DataFileReader for the schema, using getSchema(), and check that it is the same as the one we used to write the file:
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader =
new DataFileReader<GenericRecord>(file, reader);
assertThat("Schema is the same", schema, is(dataFileReader.getSchema()));
DataFileReader is a regular Java iterator, so we can iterate through its data objects by calling hasNext() and next(). The following checks that there is only one record and that it has the expected field values:
assertThat(dataFileReader.hasNext(), is(true));
GenericRecord result = dataFileReader.next();
assertThat(result.get("left").toString(), is("L"));
assertThat(result.get("right").toString(), is("R"));
assertThat(dataFileReader.hasNext(), is(false));
Rather than the parameterless next() method, however, it is generally preferable to use the overloaded form that takes an instance of the object to be returned (here a GenericRecord), since it reuses the object and reduces allocation costs for files with many records:
GenericRecord record = null;
while (dataFileReader.hasNext()) {
record = dataFileReader.next(record);
// process record
}
If object reuse is not important, you can use this shorter form:

for (GenericRecord record : dataFileReader) {
// process record
}
To read a file on HDFS given a Hadoop Path, use the FsInput class, which implements Avro's seekable input interface, and pass it to the DataFileReader constructor. DataFileReader supports random access to a datafile (via its seek() and sync() methods), but in many cases sequential streaming access is sufficient, for which DataFileStream should be used; DataFileStream can read from any Java InputStream.
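As a rough sketch (the path here is illustrative, not from the original example), reading an Avro datafile stored in a Hadoop filesystem might look like this:

// Hedged sketch: reading a datafile from a Hadoop filesystem with FsInput.
Configuration conf = new Configuration();
Path path = new Path("pairs.avro");  // assumed path for illustration
FsInput input = new FsInput(path, conf);
DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader =
    new DataFileReader<GenericRecord>(input, datumReader);
try {
  for (GenericRecord record : dataFileReader) {
    System.out.println(record);
  }
} finally {
  dataFileReader.close();
}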
Interoperability
Python API
To demonstrate Avro's language interoperability, let's write a datafile using one language (Python) and read it back with another (C). The following program reads comma-separated strings from standard input and writes them as StringPair records to an Avro datafile. As in the Java code for writing a datafile, we create a DatumWriter and a DataFileWriter; the Avro schema is embedded in the code here, although we could equally have read it from a file. Python represents Avro records as dictionaries, so each line read from standard input is turned into a dict and appended to the DataFileWriter:
import os
import string
import sys
from avro import schema
from avro import io
from avro import datafile
if __name__ == '__main__':
if len(sys.argv) != 2:
sys.exit('Usage: %s <data_file>' % sys.argv[0])
avro_file = sys.argv[1]
writer = open(avro_file, 'wb')
datum_writer = io.DatumWriter()
schema_object = schema.parse("""\
{ "type": "record",
 "name": "StringPair",
 "doc": "A pair of strings.",
 "fields": [
 {"name": "left", "type": "string"},
 {"name": "right", "type": "string"}
 ]
}""")
dfw = datafile.DataFileWriter(writer, datum_writer, schema_object)
for line in sys.stdin.readlines():
(left, right) = string.split(line.strip(), ',')
dfw.append({'left':left, 'right':right});
dfw.close()
% easy_install avro
% python avro/src/main/py/write_pairs.py pairs.avro
a,1
c,2
b,3
b,2
^D
C API
#include <avro.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
if (argc != 2) {
fprintf(stderr, "Usage: dump_pairs <data_file>\n");
exit(EXIT_FAILURE);
}
const char *avrofile = argv[1];
avro_schema_error_t error;
avro_file_reader_t filereader;
avro_datum_t pair;
avro_datum_t left;
avro_datum_t right;
int rval;
char *p;
avro_file_reader(avrofile, &filereader);
while (1) {
rval = avro_file_reader_read(filereader, NULL, &pair);
if (rval) break;
if (avro_record_get(pair, "left", &left) == 0) {
avro_string_get(left, &p);
fprintf(stdout, "%s,", p);
}
if (avro_record_get(pair, "right", &right) == 0) {
avro_string_get(right, &p);
fprintf(stdout, "%s\n", p);
}
}
avro_file_reader_close(filereader);
return 0;
}
avro_file_reader_t avro_
file_reader
tojson
avro_
avro_file_reader_read
rval
avro_file_reader_close
avro_file_reader_read
NULL
avro_datum_t
avro_record_get
avro_string_get
% ./dump_pairs pairs.avro
a,1
c,2
b,3
b,2
Schema Resolution
We can read data back using a schema that is different from the one used to write it, as long as the two schemas resolve. For example, we can read the pairs written earlier using this new schema, which adds a description field:
{
"type": "record",
"name": "StringPair",
"doc": "A pair of strings with an added field.",
"fields": [
{"name": "left", "type": "string"},
{"name": "right", "type": "string"},
{"name": "description", "type": "string", "default": "}
]
}
We can use this schema to read the data we serialized earlier because, crucially, we have given the description field a default value (the empty string), which Avro uses when reading records that have no such field. Had the default attribute been omitted, we would get an error when reading old data. To make the default value null rather than the empty string, define the field using a union with the null Avro type:
{"name": "description", "type": ["null", "string"], "default": null}
When the reader's schema is different from the writer's, we use the constructor of GenericDatumReader that takes two schema objects, the writer's and the reader's, in that order:
DatumReader<GenericRecord> reader =
new GenericDatumReader<GenericRecord>(schema, newSchema);
Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
GenericRecord result = reader.read(null, decoder);
assertThat(result.get("left").toString(), is("L"));
assertThat(result.get("right").toString(), is("R"));
assertThat(result.get("description").toString(), is("));
For datafiles, which hold the writer's schema in the file metadata, we need to specify only the reader's schema explicitly and can pass null for the writer's:
DatumReader<GenericRecord> reader =
new GenericDatumReader<GenericRecord>(null, newSchema);
Another common use of a different reader's schema is to drop fields in a record, an operation called projection. This is useful when you want to read only some of a record's fields; this schema, for instance, reads only the right field of a StringPair:
{
"type": "record",
"name": "StringPair",
"doc": "The right field of a pair of strings.",
"fields": [
{"name": "right", "type": "string"}
]
}
New schema      Writer   Reader   Action
Added field     Old      New      The reader uses the default value of the new field, since it is not written by the writer.
                New      Old      The reader does not know about the new field written by the writer, so it is ignored (projection).
Removed field   Old      New      The reader ignores the removed field (projection).
                New      Old      The removed field is not written by the writer. If the old schema had a default defined for the field, the reader uses this; otherwise, it gets an error. In this case, it is best to update the reader's schema, either at the same time as or before the writer's.
Another useful technique for schema evolution is the use of name aliases, which let the schema used to read Avro data use different field names from the schema the data was written with. For example, the following reader's schema reads StringPair data with the new field names first and second instead of left and right:
{
"type": "record",
"name": "StringPair",
"doc": "A pair of strings with aliased field names.",
"fields": [
{"name": "first", "type": "string", "aliases": ["left"]},
{"name": "second", "type": "string", "aliases": ["right"]}
]
}
Note that the alias is used to translate the writer's schema into the reader's at read time, but the alias names are not visible to the reader: the reader sees only first and second fields, not left and right.
Sort Order
Avro defines a sort order for objects. For most types the order is the natural one you would expect, and it cannot be overridden; for records, however, you can control the sort order by specifying the order attribute on a field, which takes one of three values: ascending (the default), descending, or ignore (the field is skipped for comparison purposes). For example, the following schema orders StringPair records by the right field, descending; the left field is ignored for ordering purposes, but it is still present in the projection:
{
"type": "record",
"name": "StringPair",
"doc": "A pair of strings, sorted by right field descending.",
"fields": [
{"name": "left", "type": "string", "order": "ignore"},
{"name": "right", "type": "string", "order": "descending"}
]
}
Records are compared field by field, in the order in which the fields appear in the reader's schema, so by choosing an appropriate reader's schema you can impose an arbitrary ordering. This schema orders by the right field, then the left:
{
"type": "record",
"name": "StringPair",
"doc": "A pair of strings, sorted by right then left.",
"fields": [
{"name": "right", "type": "string"},
{"name": "left", "type": "string"}
]
}
Avro implements efficient binary comparisons: it does not have to deserialize binary data into objects to perform a comparison, because it can work directly on the encoded byte streams. For the original StringPair schema (with no order attributes), the comparison reads just enough of each stream to compare the left fields, and only inspects the right fields when the left fields are equal.
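As a rough sketch of what this looks like in code (the byte arrays b1 and b2 are assumed to hold two binary-encoded StringPair records, and the schema file location is an assumption; BinaryData is Avro's utility class for working with encoded data):

// Hedged sketch: comparing two binary-encoded records without deserializing them.
Schema schema = new Schema.Parser().parse(new File("StringPair.avsc"));
int cmp = BinaryData.compare(b1, 0, b2, 0, schema);
// cmp is negative, zero, or positive, following the schema's sort order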
Avro MapReduce
Avro provides a number of classes for running MapReduce programs over Avro data. AvroMapper and AvroReducer, in the org.apache.avro.mapred package, are specializations of Hadoop's (old API) Mapper and Reducer; partitioning and sorting of the intermediate data follow the Avro schema, with hashing performed by BinaryData's hashCode() over the encoded key. To rework the maximum temperature application using the Avro MapReduce API, we represent weather records using the following schema:
{
"type": "record",
"name": "WeatherRecord",
"doc": "A weather reading.",
"fields": [
{"name": "year", "type": "int"},
{"name": "temperature", "type": "int"},
{"name": "stationId", "type": "string"}
]
}
public class AvroGenericMaxTemperature extends Configured implements Tool {
private static final Schema SCHEMA = new Schema.Parser().parse(
"{" +
" \"type\": \"record\"," +
" \"name\": \"WeatherRecord\"," +
" \"doc\": \"A weather reading.\"," +
" \"fields\": [" +
" {\"name\": \"year\", \"type\": \"int\"}," +
" {\"name\": \"temperature\", \"type\": \"int\"}," +
" {\"name\": \"stationId\", \"type\": \"string\"}" +
" ]" +
"}"
);
public static class MaxTemperatureMapper
extends AvroMapper<Utf8, Pair<Integer, GenericRecord>> {
private NcdcRecordParser parser = new NcdcRecordParser();
private GenericRecord record = new GenericData.Record(SCHEMA);
@Override
public void map(Utf8 line,
AvroCollector<Pair<Integer, GenericRecord>> collector,
Reporter reporter) throws IOException {
parser.parse(line.toString());
if (parser.isValidTemperature()) {
record.put("year", parser.getYearInt());
record.put("temperature", parser.getAirTemperature());
record.put("stationId", parser.getStationId());
collector.collect(
new Pair<Integer, GenericRecord>(parser.getYearInt(), record));
}
}
}
public static class MaxTemperatureReducer
extends AvroReducer<Integer, GenericRecord, GenericRecord> {
@Override
public void reduce(Integer key, Iterable<GenericRecord> values,
AvroCollector<GenericRecord> collector, Reporter reporter)
throws IOException {
GenericRecord max = null;
for (GenericRecord value : values) {
if (max == null ||
(Integer) value.get("temperature") > (Integer) max.get("temperature")) {
max = newWeatherRecord(value);
}
}
collector.collect(max);
}
private GenericRecord newWeatherRecord(GenericRecord value) {
GenericRecord record = new GenericData.Record(SCHEMA);
record.put("year", value.get("year"));
record.put("temperature", value.get("temperature"));
record.put("stationId", value.get("stationId"));
return record;
}
}
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.printf("Usage: %s [generic options] <input> <output>\n",
getClass().getSimpleName());
ToolRunner.printGenericCommandUsage(System.err);
return -1;
}
JobConf conf = new JobConf(getConf(), getClass());
conf.setJobName("Max temperature");
FileInputFormat.addInputPath(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
AvroJob.setInputSchema(conf, Schema.create(Schema.Type.STRING));
AvroJob.setMapOutputSchema(conf,
Pair.getPairSchema(Schema.create(Schema.Type.INT), SCHEMA));
AvroJob.setOutputSchema(conf, SCHEMA);
conf.setInputFormat(AvroUtf8InputFormat.class);
AvroJob.setMapperClass(conf, MaxTemperatureMapper.class);
AvroJob.setReducerClass(conf, MaxTemperatureReducer.class);
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new AvroGenericMaxTemperature(), args);
System.exit(exitCode);
}
}
"temperature"
SCHEMA
org.apache.avro.mapred.Pair
MaxTemperatureMapper org.apache.avro.mapred.AvroMapper
GenericRecord
Pairorg.apache.avro.mapred.AvroReducerMaxTemperatureRe
ducer
AvroJob
AvroJob
string
AvroUtf8InputFormat
int
AvroOutputFormat
% hadoop jar avro-examples.jar AvroGenericMaxTemperature \
input/ncdc/sample.txt output
AvroSpecificMaxTemperature
% java -jar $AVRO_HOME/avro-tools-*.jar tojson output/part-00000.avro
{"year":1949,"temperature":111,"stationId":"012650-99999"}
{"year":1950,"temperature":22,"stationId":"011990-99999"}
AvroMapperAvroReducer
Sorting Using Avro MapReduce
public class AvroSort extends Configured implements Tool {
static class SortMapper<K> extends AvroMapper<K, Pair<K, K>> {
public void map(K datum, AvroCollector<Pair<K, K>> collector,
Reporter reporter) throws IOException {
collector.collect(new Pair<K, K>(datum, null, datum, null));
}
}
static class SortReducer<K> extends AvroReducer<K, K, K> {
public void reduce(K key, Iterable<K> values,
AvroCollector<K> collector,
Reporter reporter) throws IOException {
for (K value : values) {
collector.collect(value);
}
}
}
@Override
public int run(String[] args) throws Exception {
if (args.length != 3) {
System.err.printf(
"Usage: %s [generic options] <input> <output> <schema-file>\n",
getClass().getSimpleName());
ToolRunner.printGenericCommandUsage(System.err);
return -1;
}
String input = args[0];
String output = args[1];
String schemaFile = args[2];
JobConf conf = new JobConf(getConf(), getClass());
conf.setJobName("Avro sort");
FileInputFormat.addInputPath(conf, new Path(input));
FileOutputFormat.setOutputPath(conf, new Path(output));
Schema schema = new Schema.Parser().parse(new File(schemaFile));
AvroJob.setInputSchema(conf, schema);
Schema intermediateSchema = Pair.getPairSchema(schema, schema);
AvroJob.setMapOutputSchema(conf, intermediateSchema);
AvroJob.setOutputSchema(conf, schema);
AvroJob.setMapperClass(conf, SortMapper.class);
AvroJob.setReducerClass(conf, SortReducer.class);
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new AvroSort(), args);
System.exit(exitCode);
}
}
K
org.apache.avro.mapred.Pair
% java -jar $AVRO_HOME/avro-tools-*.jar tojson input/avro/pairs.avro
{"left":"a","right":"1"}
{"left":"c","right":"2"}
{"left":"b","right":"3"}
{"left":"b","right":"2"}
% hadoop jar avro-examples.jar AvroSort input/avro/pairs.avro output \
ch04-avro/src/main/resources/SortedStringPair.avsc
% java -jar $AVRO_HOME/avro-tools-*.jar tojson output/part-00000.avro
{"left":"b","right":"3"}
{"left":"c","right":"2"}
{"left":"b","right":"2"}
{"left":"a","right":"1"}
Avro MapReduce in Other Languages
AvroAsTextInputFormat
bytes
AvroTextOutputFormat
bytes
org.apache.avro.mapred
org.apache.avro.mapred.tether
File-Based Data Structures
For some applications, you need a specialized data structure to hold your data, and putting each blob of binary data into its own file doesn't scale for MapReduce-based processing, so Hadoop provides a number of higher-level containers.

SequenceFile

Imagine a logfile where each log record is a new line of text. If you want to log binary types, plain text isn't a suitable format. Hadoop's SequenceFile class fits the bill in this situation, providing a persistent data structure for binary key-value pairs. To use it as a logfile format, you would choose a key, such as a timestamp represented by a LongWritable, and the value would be a Writable representing the quantity being logged. SequenceFiles also work well as containers for smaller files: HDFS and MapReduce are optimized for large files, so packing many small files into a SequenceFile makes storing and processing them more efficient.
Writing a SequenceFile
To create a SequenceFile, use one of its createWriter() static methods, which return a SequenceFile.Writer instance. There are several overloaded versions, but they all require you to specify a stream to write to (either an FSDataOutputStream, or a FileSystem and Path pairing), a Configuration object, and the key and value types. Optional arguments include the compression type and codec, a Progressable callback to be informed of write progress, and a Metadata instance to be stored in the SequenceFile header. The keys and values stored in a SequenceFile do not necessarily need to be Writables; any types that can be serialized and deserialized by a Serialization may be used. Once you have a SequenceFile.Writer, you write key-value pairs with the append() method, and call close() when you have finished (SequenceFile.Writer implements java.io.Closeable). The following program writes some key-value pairs to a SequenceFile:
public class SequenceFileWriteDemo {
private static final String[] DATA = {
"One, two, buckle my shoe",
"Three, four, shut the door",
"Five, six, pick up sticks",
"Seven, eight, lay them straight",
"Nine, ten, a big fat hen"
};
public static void main(String[] args) throws IOException {
String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
Path path = new Path(uri);
IntWritable key = new IntWritable();
Text value = new Text();
SequenceFile.Writer writer = null;
try {
writer = SequenceFile.createWriter(fs, conf, path,
key.getClass(), value.getClass());
for (int i = 0; i < 100; i++) {
key.set(100 - i);
value.set(DATA[i % DATA.length]);
System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
writer.append(key, value);
}
} finally {
IOUtils.closeStream(writer);
}
}
}
The keys in the sequence file are integers counting down from 100 to 1, represented as IntWritable objects, and the values are Text objects. Before each record is appended, we call getLength() on the SequenceFile.Writer to discover the current position in the file (we will use this information about record boundaries in the next section, when we read the file nonsequentially) and print it along with the key and value:
% hadoop SequenceFileWriteDemo numbers.seq
[128] 100 One, two, buckle my shoe
[173] 99 Three, four, shut the door
[220] 98 Five, six, pick up sticks
[264] 97 Seven, eight, lay them straight
[314] 96 Nine, ten, a big fat hen
[359] 95 One, two, buckle my shoe
[404] 94 Three, four, shut the door
[451] 93 Five, six, pick up sticks
[495] 92 Seven, eight, lay them straight
[545] 91 Nine, ten, a big fat hen
...
[1976] 60 One, two, buckle my shoe
[2021] 59 Three, four, shut the door
[2088] 58 Five, six, pick up sticks
[2132] 57 Seven, eight, lay them straight
[2182] 56 Nine, ten, a big fat hen
...
[4557] 5 One, two, buckle my shoe
[4602] 4 Three, four, shut the door
[4649] 3 Five, six, pick up sticks
[4693] 2 Seven, eight, lay them straight
[4743] 1 Nine, ten, a big fat hen
Reading a SequenceFile
Reading sequence files from beginning to end is a matter of creating an instance of SequenceFile.Reader and iterating over records by repeatedly invoking one of the next() methods. Which one you use depends on the serialization framework in use. If you are using Writable types, you can use the method that takes a key and a value argument and reads the next key and value in the stream into those variables:

public boolean next(Writable key, Writable val)

The return value is true if a key-value pair was read and false if the end of the file has been reached.

For non-Writable serialization frameworks, you use these two methods instead:

public Object next(Object key) throws IOException
public Object getCurrentValue(Object val) throws IOException

In this case, make sure that the serialization you want to use has been set in the io.serializations property. If next() returns a non-null object, a key-value pair was read from the stream, and the value can be retrieved with getCurrentValue(); if next() returns null, the end of the file has been reached.

The program below reads a sequence file with Writable keys and values. The key and value types are discovered from the SequenceFile.Reader via getKeyClass() and getValueClass(), and ReflectionUtils is used to create instances of them, which makes the program work with any sequence file that has Writable keys and values:
public class SequenceFileReadDemo {
public static void main(String[] args) throws IOException {
String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
Path path = new Path(uri);
SequenceFile.Reader reader = null;
try {
reader = new SequenceFile.Reader(fs, path, conf);
Writable key = (Writable)
ReflectionUtils.newInstance(reader.getKeyClass(), conf);
Writable value = (Writable)
ReflectionUtils.newInstance(reader.getValueClass(), conf);
long position = reader.getPosition();
while (reader.next(key, value)) {
String syncSeen = reader.syncSeen() ? "*" : "";
System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
position = reader.getPosition(); // beginning of next record
}
} finally {
IOUtils.closeStream(reader);
}
}
}
The program also displays the position of each record and marks sync points with an asterisk. A sync point is a point in the stream that a reader can use to resynchronize with a record boundary after seeking to an arbitrary position; SequenceFile.Writer inserts one every few records as the file is written. Running the program gives:
% hadoop SequenceFileReadDemo numbers.seq
[128] 100 One, two, buckle my shoe
[173] 99 Three, four, shut the door
[220] 98 Five, six, pick up sticks
[264] 97 Seven, eight, lay them straight
[314] 96 Nine, ten, a big fat hen
[359] 95 One, two, buckle my shoe
[404] 94 Three, four, shut the door
[451] 93 Five, six, pick up sticks
[495] 92 Seven, eight, lay them straight
[545] 91 Nine, ten, a big fat hen
[590] 90 One, two, buckle my shoe
...
[1976] 60 One, two, buckle my shoe
[2021*] 59 Three, four, shut the door
[2088] 58 Five, six, pick up sticks
[2132] 57 Seven, eight, lay them straight
[2182] 56 Nine, ten, a big fat hen
...
[4557] 5 One, two, buckle my shoe
[4602] 4 Three, four, shut the door
[4649] 3 Five, six, pick up sticks
[4693] 2 Seven, eight, lay them straight
[4743] 1 Nine, ten, a big fat hen
There are two ways to seek to a given position in a sequence file. The first is the seek() method, which positions the reader at the given point in the file. Seeking to a record boundary works as expected:
reader.seek(359);
assertThat(reader.next(key, value), is(true));
assertThat(((IntWritable) key).get(), is(95));
But if the position in the file is not at a record boundary, the reader fails when next() is called:
reader.seek(360);
reader.next(key, value); // fails with IOException
The second way to find a record boundary makes use of sync points. The sync(long position) method on SequenceFile.Reader positions the reader at the next sync point after position (or at the end of the file if there are no more sync points). We can therefore call sync() with any position in the stream, not necessarily a record boundary, and the reader will reestablish itself at the next sync point so reading can continue:
reader.sync(360);
assertThat(reader.getPosition(), is(2021L));
assertThat(reader.next(key, value), is(true));
assertThat(((IntWritable) key).get(), is(59));
SequenceFile.Writer has a sync() method of its own, for inserting a sync point at the current position in the stream; this is not to be confused with the sync() method defined by the Syncable interface for synchronizing buffers to the underlying device.
Displaying a SequenceFile with the command-line interface
The hadoop fs command has a -text option to display sequence files in textual form. It looks at a file's magic number to detect the type of file and convert it to text appropriately; it recognizes gzipped files, sequence files, and Avro datafiles, and otherwise assumes plain text. For sequence files, this is really useful only if the keys and values have meaningful string representations (as defined by their toString() methods), and any custom key or value classes need to be on Hadoop's classpath:
% hadoop fs -text numbers.seq | head
100 One, two, buckle my shoe
99 Three, four, shut the door
98 Five, six, pick up sticks
97 Seven, eight, lay them straight
96 Nine, ten, a big fat hen
95 One, two, buckle my shoe
94 Three, four, shut the door
93 Five, six, pick up sticks
92 Seven, eight, lay them straight
91 Nine, ten, a big fat hen
Sorting and merging SequenceFiles
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort -r 1 \
-inFormat org.apache.hadoop.mapred.SequenceFileInputFormat \
-outFormat org.apache.hadoop.mapred.SequenceFileOutputFormat \
-outKey org.apache.hadoop.io.IntWritable \
-outValue org.apache.hadoop.io.Text \
numbers.seq sorted
% hadoop fs -text sorted/part-00000 | head
1 Nine, ten, a big fat hen
2 Seven, eight, lay them straight
3 Five, six, pick up sticks
4 Three, four, shut the door
5 One, two, buckle my shoe
6 Nine, ten, a big fat hen
7 Seven, eight, lay them straight
8 Five, six, pick up sticks
9 Three, four, shut the door
10 One, two, buckle my shoe
An alternative to using MapReduce for sorting and merging is the SequenceFile.Sorter class, which has sort() and merge() methods. These predate MapReduce and are lower level, so in general MapReduce is the preferred approach.
The SequenceFile format
A sequence file consists of a header followed by one or more records. The first three bytes are the bytes SEQ, which act as a magic number, followed by a single byte for the version number. The header also contains the names of the key and value classes, compression details, user-defined metadata, and a randomly generated sync marker, which is what allows a reader to resynchronize to a record boundary from any position in the file. Uncompressed and record-compressed records store the record length and key length as 4-byte integers written with the writeInt() method of java.io.DataOutput, with keys and values serialized using the Serialization defined for their classes. Block compression compresses multiple records at once; a new block is started when the uncompressed size of the buffered records reaches the threshold set by the io.seqfile.compress.blocksize property.
MapFile
A MapFile is a sorted SequenceFile with an index to permit lookups by key. MapFile can be thought of as a persistent form of java.util.Map (although it doesn't implement this interface) that is able to grow beyond the size of a Map kept in memory.
Writing a MapFile
Writing a MapFile is similar to writing a SequenceFile: you create an instance of MapFile.Writer, then call the append() method to add entries in order. (Attempting to add entries out of order results in an IOException.) Keys must be instances of WritableComparable and values must be Writable; contrast this with SequenceFile, which can use any serialization framework for its entries. The following program creates a MapFile and writes some entries to it; it is very similar to the SequenceFile writer shown earlier:
public class MapFileWriteDemo {
private static final String[] DATA = {
"One, two, buckle my shoe",
"Three, four, shut the door",
"Five, six, pick up sticks",
"Seven, eight, lay them straight",
"Nine, ten, a big fat hen"
};
public static void main(String[] args) throws IOException {
String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
IntWritable key = new IntWritable();
Text value = new Text();
MapFile.Writer writer = null;
try {
writer = new MapFile.Writer(conf, fs, uri,
key.getClass(), value.getClass());
for (int i = 0; i < 1024; i++) {
key.set(i + 1);
value.set(DATA[i % DATA.length]);
writer.append(key, value);
}
} finally {
IOUtils.closeStream(writer);
}
}
}
MapFile
% hadoop MapFileWriteDemo numbers.map
MapFile
% ls -l numbers.map
total 104
-rw-r--r-- 1 tom tom 47898 Jul 29 22:06 data
-rw-r--r-- 1 tom tom 251 Jul 29 22:06 index
SequenceFile
% hadoop fs -text numbers.map/data | head
1 One, two, buckle my shoe
2 Three, four, shut the door
3 Five, six, pick up sticks
4 Seven, eight, lay them straight
5 Nine, ten, a big fat hen
6 One, two, buckle my shoe
7 Three, four, shut the door
8 Five, six, pick up sticks
9 Seven, eight, lay them straight
10 Nine, ten, a big fat hen
% hadoop fs -text numbers.map/index
1 128
129 6079
257 12054
385 18030
513 24002
641 29976
769 35947
897 41922
As we can see from the output, by default only every 128th key is included in the index, although you can change this either by setting the io.map.index.interval property or by calling setIndexInterval() on the MapFile.Writer instance. Increasing the index interval decreases the amount of memory the MapFile needs to store the index; decreasing it improves the time for random lookups (since fewer records need to be skipped on average), at the expense of memory usage.
Reading a MapFile
Iterating through the entries of a MapFile in order is similar to the procedure for a SequenceFile: you create a MapFile.Reader, then call next() until it returns false, signifying that the end of the file was reached:

public boolean next(WritableComparable key, Writable val) throws IOException

A random-access lookup is performed by calling get():

public Writable get(WritableComparable key, Writable val) throws IOException

The return value determines whether an entry was found in the MapFile: if it is null, no value exists for the given key; if the key was found, the value is read into val, as well as being returned from the method. For example, retrieving the value for the key 496:
Text value = new Text();
reader.get(new IntWritable(496), value);
assertThat(value.toString(), is("One, two, buckle my shoe"));
For this operation, MapFile.Reader reads the index file into memory (the index is cached, so subsequent lookups reuse it), performs a binary search on the in-memory index to find the largest indexed key less than or equal to the search key, then seeks to the corresponding offset in the data file and reads entries sequentially until the key is found or passed, in which case null is returned. The related getClosest() method is like get() except that it returns the closest match to the specified key rather than null on no match; a boolean argument controls whether the entry immediately before or after the key is returned. Because a very large MapFile's index can take up a lot of memory, you can load only a fraction of the index keys by setting the io.map.index.skip property: 0 (the default) loads every index key, 1 loads every other key, 2 loads every third key, and so on.
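A hedged end-to-end sketch of opening a MapFile and performing a lookup (the path and key are illustrative; exception handling is omitted apart from closing the reader):

// Hedged sketch: open the numbers.map MapFile written earlier and look up key 496.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create("numbers.map"), conf);
MapFile.Reader reader = new MapFile.Reader(fs, "numbers.map", conf);
try {
  Text value = new Text();
  Writable result = reader.get(new IntWritable(496), value);  // null if key absent
  if (result != null) {
    System.out.println(value);
  }
} finally {
  reader.close();
}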
MapFile variants
Hadoop comes with a few variants on the general key-value MapFile interface:

SetFile is a specialization of MapFile for storing a set of Writable keys, which must be added in sorted order.

ArrayFile is a MapFile where the key is an integer representing the index of the element in the array and the value is a Writable.

BloomMapFile is a MapFile that offers a fast version of get(), which is especially useful for sparsely populated files. It uses an in-memory dynamic Bloom filter to test whether a key is in the map; only if the test passes is the regular get() called. The filter is tuned with the io.mapfile.bloom.size property (the approximate number of keys expected in the map) and io.mapfile.bloom.error.rate (the desired maximum false-positive rate).
Converting a SequenceFile to a MapFile
One way of looking at a MapFile is as an indexed and sorted SequenceFile, so it is natural to want to convert a SequenceFile into a MapFile. Sorting a SequenceFile was covered in the previous section; here we look at how to create the index. The program below hinges on the static utility method fix() on MapFile, which re-creates the index for a MapFile:
public class MapFileFixer {
public static void main(String[] args) throws Exception {
String mapUri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(mapUri), conf);
Path map = new Path(mapUri);
Path mapData = new Path(map, MapFile.DATA_FILE_NAME);
// Get key and value types from data sequence file
SequenceFile.Reader reader = new SequenceFile.Reader(fs, mapData, conf);
Class keyClass = reader.getKeyClass();
Class valueClass = reader.getValueClass();
reader.close();
// Create the map file index file
long entries = MapFile.fix(fs, map, keyClass, valueClass, false, conf);
System.out.printf("Created MapFile %s with %d entries\n", map, entries);
}
}
fix()
MapFile
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort -r 1 \
-inFormat org.apache.hadoop.mapred.SequenceFileInputFormat \
-outFormat org.apache.hadoop.mapred.SequenceFileOutputFormat \
-outKey org.apache.hadoop.io.IntWritable \
-outValue org.apache.hadoop.io.Text \
numbers.seq numbers.map
% hadoop fs -mv numbers.map/part-00000 numbers.map/data
% hadoop MapFileFixer numbers.map
Created MapFile numbers.map with 100 entries
MapFile
CHAPTER 5
Developing a MapReduce Application
The Configuration API
Components in Hadoop are configured using Hadoop's own configuration API. An instance of the Configuration class (in the org.apache.hadoop.conf package) represents a collection of configuration properties and their values. Each property is named by a String, and the type of a value may be one of several, including Java primitives such as boolean, int, long, and float, as well as String, Class, java.io.File, and collections of Strings. Configurations read their properties from resources, XML files with a simple structure for defining name-value pairs, such as this one (configuration-1.xml):
<?xml version="1.0"?>
<configuration>
<property>
<name>color</name>
<value>yellow</value>
<description>Color</description>
</property>
<property>
<name>size</name>
<value>10</value>
<description>Size</description>
</property>
<property>
<name>weight</name>
<value>heavy</value>
<final>true</final>
<description>Weight</description>
</property>
<property>
<name>size-weight</name>
<value>${size},${weight}</value>
<description>Size and weight</description>
</property>
</configuration>
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
assertThat(conf.get("color"), is("yellow"));
assertThat(conf.getInt("size", 0), is(10));
assertThat(conf.get("breadth", "wide"), is("wide"));
There are a couple of things to note: type information is not stored in the XML file; instead, properties are interpreted as a given type when they are read. Also, the get() methods allow you to specify a default value, which is used if the property is not defined in the XML file, as in the case of breadth here.
Combining Resources
sizeweight
<?xml version="1.0"?>
<configuration>
<property>
<name>size</name>
<value>12</value>
</property>
<property>
<name>weight</name>
<value>light</value>
</property>
</configuration>
Resources are added to a Configuration in order:
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
conf.addResource("configuration-2.xml");
Properties defined in resources that are added later override earlier definitions, so the size property here takes its value from the second file:
assertThat(conf.getInt("size", 0), is(12));
However, properties that are marked final cannot be overridden in later definitions. The weight property is final in the first configuration file, so the attempt to override it in the second fails, and it keeps the value from the first:
assertThat(conf.get("weight"), is("heavy"));
Variable Expansion
Configuration properties can be defined in terms of other properties, or of system properties. For example, the size-weight property in the first configuration file is defined as ${size},${weight}, and these variables are expanded using the values found in the configuration:
assertThat(conf.get("size-weight"), is("12,heavy"));
System.setProperty("size", "14");
assertThat(conf.get("size-weight"), is("14,heavy"));
This is useful for overriding properties on the command line with -Dproperty=value JVM arguments. Note, however, that system properties are not accessible through the configuration API unless a configuration property redefines them, so the following returns null:
System.setProperty("length", "2");
assertThat(conf.get("length"), is((String) null));
Setting Up the Development Environment
<project>
<modelVersion>4.0.0</modelVersion>
<groupId>com.hadoopbook</groupId>
<artifactId>hadoop-book-mr-dev</artifactId>
<version>3.0</version>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
</properties>
<dependencies>
<!-- Hadoop main artifact -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>1.0.0</version>
</dependency>
<!-- Unit test artifacts -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.10</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.hamcrest</groupId>
<artifactId>hamcrest-all</artifactId>
<version>1.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.mrunit</groupId>
<artifactId>mrunit</artifactId>
<version>0.8.0-incubating</version>
<scope>test</scope>
</dependency>
<!-- Hadoop test artifacts for running mini clusters -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-test</artifactId>
<version>1.0.0</version>
<scope>test</scope>
</dependency>
<!-- Missing dependency for running mini clusters -->
<dependency>
<groupId>com.sun.jersey</groupId>
<artifactId>jersey-core</artifactId>
<version>1.8</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<finalName>hadoop-examples</finalName>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.3.2</version>
<configuration>
<source>1.6</source>
<target>1.6</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<version>2.4</version>
<configuration>
<outputDirectory>${basedir}</outputDirectory>
</configuration>
</plugin>
</plugins>
</build>
</project>
hadoop-core
junithamcrest-all
mrunit
hadoop-test
jersey-core
hadoop-core
% mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true
Managing Configuration
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>local</value>
</property>
</configuration>
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost/</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://namenode/</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>jobtracker:8021</value>
</property>
</configuration>
Setting User Identity
whoami
groups
hadoop.job.ugi
preston,directors,inventors
prestondirectorsinventors
dfs.web.ugiwebuser,webgroup
With this setup, it is easy to use any configuration with the -conf command-line switch. For example, the following command lists the files in the HDFS filesystem running on localhost:
% hadoop fs -conf conf/hadoop-localhost.xml -ls .
Found 2 items
drwxr-xr-x - tom supergroup 0 2009-04-08 10:32 /user/tom/input
drwxr-xr-x - tom supergroup 0 2009-04-08 13:09 /user/tom/output
-conf
$HADOOP_INSTALL
-conf
Tool
GenericOptionsParser, Tool, and ToolRunner
Hadoop comes with a few helper classes for making it easier to run jobs from the command line. GenericOptionsParser is a class that interprets common Hadoop command-line options and sets them on a Configuration object for your application to use. You don't usually use GenericOptionsParser directly, as it's more convenient to implement the Tool interface and run your application with ToolRunner, which uses GenericOptionsParser internally:
public interface Tool extends Configurable {
int run(String [] args) throws Exception;
}
Below is a very simple implementation of Tool that prints the keys and values of all the properties in the Tool's Configuration object:
public class ConfigurationPrinter extends Configured implements Tool {
static {
Configuration.addDefaultResource("hdfs-default.xml");
Configuration.addDefaultResource("hdfs-site.xml");
Configuration.addDefaultResource("mapred-default.xml");
Configuration.addDefaultResource("mapred-site.xml");
}
@Override
public int run(String[] args) throws Exception {
Configuration conf = getConf();
for (Entry<String, String> entry: conf) {
System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
}
return 0;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);
System.exit(exitCode);
}
}
We make ConfigurationPrinter a subclass of Configured, which is an implementation of the Configurable interface. All implementations of Tool need to implement Configurable (since Tool extends it), and subclassing Configured is often the easiest way to achieve this. The run() method obtains the Configuration using Configurable's getConf() method and then iterates over it, printing each property to standard output. The static block makes sure that the HDFS and MapReduce configuration resources are picked up, in addition to the core ones (which Configuration knows about already).

ConfigurationPrinter's main() method does not invoke its own run() method directly. Instead, it calls ToolRunner's static run() method, which takes care of creating a Configuration object for the Tool before calling its run() method; ToolRunner also uses a GenericOptionsParser to pick up any standard options specified on the command line and set them on the Configuration. We can see the effect of picking up the properties specified in conf/hadoop-localhost.xml by running the following:
% mvn compile
% export HADOOP_CLASSPATH=target/classes/
% hadoop ConfigurationPrinter -conf conf/hadoop-localhost.xml \
| grep mapred.job.tracker=
mapred.job.tracker=localhost:8021
Which Properties Can I Set?
ConfigurationPrinter
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.map.tasks.max
imum mapred.tasktracker
GenericOptionsParser
% hadoop ConfigurationPrinter -D color=yellow | grep color
color=yellow
The -D option here sets the configuration property with key color to the value yellow. Options specified with -D take priority over properties from configuration files, which is very useful: you can put defaults into configuration files and then override them with -D as needed. A common example is setting the number of reducers for a MapReduce job via -D mapred.reduce.tasks=n. Do not confuse setting Hadoop properties with the -D property=value option to GenericOptionsParser (and ToolRunner) with setting JVM system properties with the -Dproperty=value option to the java command.
(Note that the JVM's -D syntax does not allow a space between the D and the property name, whereas GenericOptionsParser does.) JVM system properties are retrieved from the java.lang.System class, whereas Hadoop properties are accessible only from a Configuration object. So the following command prints nothing, even though the color system property has been set via HADOOP_OPTS, because the System class is not used by ConfigurationPrinter:

% HADOOP_OPTS='-Dcolor=yellow' \
hadoop ConfigurationPrinter | grep color

If you want to be able to set configuration through system properties, you need to mirror the system property of interest in a configuration file. The other options that GenericOptionsParser and ToolRunner support are listed below.
Option                            Description
-D property=value                 Sets the given Hadoop configuration property to the given value. Overrides any default or site properties in the configuration and any properties set via the -conf option.
-conf filename ...                Adds the given files to the list of resources in the configuration. This is a convenient way to set site properties or to set a number of properties at once.
-fs uri                           Sets the default filesystem to the given URI. Shortcut for -D fs.default.name=uri.
-jt host:port                     Sets the jobtracker to the given host and port. Shortcut for -D mapred.job.tracker=host:port.
-files file1,file2,...            Copies the specified files from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by the jobtracker (usually HDFS) and makes them available to MapReduce programs in the task's working directory. (See "Distributed Cache" on page 289 for more on the distributed cache mechanism for copying files to tasktracker machines.)
-archives archive1,archive2,...   Copies the specified archives from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by the jobtracker (usually HDFS), unarchives them, and makes them available to MapReduce programs in the task's working directory.
-libjars jar1,jar2,...            Copies the specified JAR files from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by the jobtracker (usually HDFS), and adds them to the MapReduce task's classpath. This option is a useful way of shipping JAR files that a job is dependent on.
Writing a Unit Test with MRUnit
The map and reduce functions in MapReduce are easy to test in isolation, and MRUnit is a testing library that makes it easy to pass known inputs to a mapper or a reducer and check that the outputs are as expected. MRUnit is used with a standard test execution framework such as JUnit. Here is a test for the mapper:
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.*;
public class MaxTemperatureMapperTest {
@Test
public void processesValidRecord() throws IOException, InterruptedException {
Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
// Year ^^^^
"99999V0203201N00261220001CN9999999N9-00111+99999999999");
// Temperature ^^^^^
new MapDriver<LongWritable, Text, Text, IntWritable>()
.withMapper(new MaxTemperatureMapper())
.withInputValue(value)
.withOutput(new Text("1950"), new IntWritable(-11))
.runTest();
}
}
The idea of the test is simple: pass a weather record as input to the mapper and check that the output is the year and the temperature reading. Since we are testing the mapper, we use MRUnit's MapDriver, which we configure with the mapper under test (MaxTemperatureMapper), the input value, the expected output key (a Text representing the year, 1950), and the expected output value (an IntWritable representing the temperature, -1.1°C), before finally calling runTest() to execute the test. If the expected output is not emitted by the mapper, MRUnit fails the test. (The input key can be anything, because the mapper ignores it.) Proceeding in a test-driven fashion, we create a Mapper implementation that passes the test:
public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String year = line.substring(15, 19);
int airTemperature = Integer.parseInt(line.substring(87, 92));
context.write(new Text(year), new IntWritable(airTemperature));
}
}
This is a very simple implementation that pulls the year and temperature fields from the line and writes them to the Context. Let's add a test for missing values, which in the raw data are represented by a temperature of +9999:
@Test
public void ignoresMissingTemperatureRecord() throws IOException,
InterruptedException {
Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
// Year ^^^^
"99999V0203201N00261220001CN9999999N9+99991+99999999999");
// Temperature ^^^^^
new MapDriver<LongWritable, Text, Text, IntWritable>()
.withMapper(new MaxTemperatureMapper())
.withInputValue(value)
.runTest();
}
Since records with missing temperatures should be filtered out, this test asserts that no output is produced for this input, which it does by not setting any expected output on the MapDriver (there is no call to withOutput()). The existing mapper fails this test with a NumberFormatException, because parseInt() cannot parse the +9999 value, so we fix the implementation to skip missing values:
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String year = line.substring(15, 19);
String temp = line.substring(87, 92);
if (!missing(temp)) {
int airTemperature = Integer.parseInt(temp);
context.write(new Text(year), new IntWritable(airTemperature));
}
}
private boolean missing(String temp) {
return temp.equals("+9999");
}
Reducer

The reducer has to find the maximum value for a given key. Here's a simple test for this feature, which uses MRUnit's ReduceDriver:
@Test
public void returnsMaximumIntegerInValues() throws IOException,
InterruptedException {
new ReduceDriver<Text, IntWritable, Text, IntWritable>()
.withReducer(new MaxTemperatureReducer())
.withInputKey(new Text("1950"))
.withInputValues(Arrays.asList(new IntWritable(10), new IntWritable(5)))
.withOutput(new Text("1950"), new IntWritable(10))
.runTest();
}
We construct a list of IntWritable values and verify that MaxTemperatureReducer picks the largest. The following implementation of MaxTemperatureReducer passes the test:
public class MaxTemperatureReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}
}
Running Locally on Test Data
Running a Job in a Local Job Runner
Now that we have the mapper and reducer working on controlled inputs, the next step is to write a job driver and run it on some test data on a development machine. Using the Tool interface introduced earlier in this chapter, it's easy to write a driver to run our MapReduce job for finding the maximum temperature by year, MaxTemperatureDriver:
public class MaxTemperatureDriver extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.printf("Usage: %s [generic options] <input> <output>\n",
getClass().getSimpleName());
ToolRunner.printGenericCommandUsage(System.err);
return -1;
}
Job job = new Job(getConf(), "Max temperature");
job.setJarByClass(getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
System.exit(exitCode);
}
}
MaxTemperatureDriver implements the Tool interface, so we get the benefit of being able to set the options that GenericOptionsParser supports. The run() method constructs a Job object based on the tool's configuration and uses it to launch the job. Among the job configuration parameters, we set the input and output file paths; the mapper, reducer, and combiner classes; and the output types (the input types are determined by the input format, which defaults to TextInputFormat with LongWritable keys and Text values). It's also a good idea to set a name for the job (Max temperature) so that you can pick it out in the job list during execution and after it has completed.

The local job runner is used when mapred.job.tracker is set to local rather than a host:port pair (for YARN, the equivalent is setting mapreduce.framework.name to local). We can run this tool against our local test data as follows:
% mvn compile
% export HADOOP_CLASSPATH=target/classes/
% hadoop v2.MaxTemperatureDriver -conf conf/hadoop-local.xml \
input/ncdc/micro output
Equivalently, we could use the -fs and -jt options provided by GenericOptionsParser:
% hadoop v2.MaxTemperatureDriver -fs file:/// -jt local input/ncdc/micro output
This command executes MaxTemperatureDriver using input from the local input/ncdc/micro directory, producing output in the local output directory. Note that although we set -fs so we use the local filesystem (file:///), the local job runner works fine against any filesystem, including HDFS. When we run the program against the test data, it fails with the following exception:
java.lang.NumberFormatException: For input string: "+0000"
Fixing the mapper
This exception shows that the map method still can't parse temperatures written with a leading plus sign. Rather than adding more fixes directly to the mapper, we pull the parsing logic out into its own class, NcdcRecordParser, which also handles the missing value of +9999:
public class NcdcRecordParser {
private static final int MISSING_TEMPERATURE = 9999;
private String year;
private int airTemperature;
private String quality;
public void parse(String record) {
year = record.substring(15, 19);
String airTemperatureString;
// Remove leading plus sign as parseInt doesn't like them
if (record.charAt(87) == '+') {
airTemperatureString = record.substring(88, 92);
} else {
airTemperatureString = record.substring(87, 92);
}
airTemperature = Integer.parseInt(airTemperatureString);
quality = record.substring(92, 93);
}
public void parse(Text record) {
parse(record.toString());
}
public boolean isValidTemperature() {
return airTemperature != MISSING_TEMPERATURE && quality.matches("[01459]");
}
public String getYear() {
return year;
}
public int getAirTemperature() {
return airTemperature;
}
}
The parse() method extracts the year, air temperature, and quality code from a record, and isValidTemperature() filters out missing readings and those with a bad quality code. With the parsing logic pulled out, the mapper becomes much simpler:
public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
private NcdcRecordParser parser = new NcdcRecordParser();
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
parser.parse(value);
if (parser.isValidTemperature()) {
context.write(new Text(parser.getYear()),
new IntWritable(parser.getAirTemperature()));
}
}
}
Testing the Driver
Apart from the flexible configuration options offered by making your application implement Tool, you also make it more testable because you can inject an arbitrary Configuration. You can take advantage of this to write a test that uses a local job runner to run a job against known input data and checks that the output is as expected:
@Test
public void test() throws Exception {
Configuration conf = new Configuration();
conf.set("fs.default.name", "file:///");
conf.set("mapred.job.tracker", "local");
Path input = new Path("input/ncdc/micro");
Path output = new Path("output");
FileSystem fs = FileSystem.getLocal(conf);
fs.delete(output, true); // delete old output
MaxTemperatureDriver driver = new MaxTemperatureDriver();
driver.setConf(conf);
int exitCode = driver.run(new String[] {
input.toString(), output.toString() });
assertThat(exitCode, is(0));
checkOutput(conf, output);
}
The test explicitly sets fs.default.name and mapred.job.tracker so it uses the local filesystem and the local job runner, then runs MaxTemperatureDriver via its Tool interface against a small amount of known data. At the end of the test, the checkOutput() method (not shown) compares the actual output with the expected output, line by line.

The second way of testing the driver is to run it using a "mini-" cluster. Hadoop has a set of testing classes called MiniDFSCluster, MiniMRCluster, and MiniYARNCluster that provide a programmatic way of creating in-process clusters, which, unlike the local job runner, exercise the full HDFS and MapReduce machinery. Hadoop's ClusterMapReduceTestCase abstract class is a useful base for such tests; its setUp() and tearDown() methods handle starting and stopping the in-process clusters. (The example code includes a mini-cluster version of this test, called MaxTemperatureDriverMiniTest.)
Running on a Cluster
Packaging a Job
In a distributed setting, a job's classes must be packaged into a job JAR file to send to the cluster. Rather than specifying the JAR explicitly, it is usually enough to call setJarByClass() on JobConf (or Job), which tells Hadoop to locate the JAR file containing the given class. Alternatively, you can set the JAR file explicitly by path with setJar(). Creating the job JAR is conveniently done with a build tool; with the POM shown earlier, use:
% mvn package -DskipTests
The client classpath
hadoop jar <jar>
HADOOP_CLASSPATH
HADOOP_CLASSPATH
hadoop CLASSNAME
The task classpath
On a cluster, map and reduce tasks run in separate JVMs, and their classpaths are not controlled by HADOOP_CLASSPATH, which is a client-side setting only. The task classpath consists of the job JAR file, any JAR files in its lib directory (and its classes directory, if present), and any files added to the distributed cache with the -libjars option or with the addFileToClassPath() method on DistributedCache (old API) or Job (new API).

Packaging dependencies

A job's library dependencies can therefore be handled in several ways: unpack them and rebundle them in the job JAR; package them in the lib directory of the job JAR; or keep them out of the job JAR and add them to the client classpath via HADOOP_CLASSPATH and to the task classpath via -libjars.

Task classpath precedence

User JAR files are added at the end of both classpaths, which can cause conflicts with Hadoop's own libraries. To force your classes to be picked up first, set the HADOOP_USER_CLASSPATH_FIRST environment variable to true for the client, and set mapreduce.task.classpath.first to true for the task classpath.
Launching a Job
To launch the job, we run the driver, specifying the cluster to run it on with the -conf option (we could equally have used the -fs and -jt options):
% unset HADOOP_CLASSPATH
% hadoop jar hadoop-examples.jar v3.MaxTemperatureDriver \
-conf conf/hadoop-cluster.xml input/ncdc/all max-temp
(We unset HADOOP_CLASSPATH because we have no dependencies beyond the job JAR itself; were it left pointing at target/classes/, Hadoop would load MaxTemperatureDriver from there rather than from the job JAR.) The waitForCompletion() method on Job launches the job and polls for progress, writing a line summarizing the map and reduce progress whenever either changes, and printing the job counters when the job completes:
09/04/11 08:15:52 INFO mapred.FileInputFormat: Total input paths to process : 101
09/04/11 08:15:53 INFO mapred.JobClient: Running job: job_200904110811_0002
09/04/11 08:15:54 INFO mapred.JobClient: map 0% reduce 0%
09/04/11 08:16:06 INFO mapred.JobClient: map 28% reduce 0%
09/04/11 08:16:07 INFO mapred.JobClient: map 30% reduce 0%
...
09/04/11 08:21:36 INFO mapred.JobClient: map 100% reduce 100%
09/04/11 08:21:38 INFO mapred.JobClient: Job complete: job_200904110811_0002
09/04/11 08:21:38 INFO mapred.JobClient: Counters: 19
09/04/11 08:21:38 INFO mapred.JobClient: Job Counters
09/04/11 08:21:38 INFO mapred.JobClient: Launched reduce tasks=32
09/04/11 08:21:38 INFO mapred.JobClient: Rack-local map tasks=82
09/04/11 08:21:38 INFO mapred.JobClient: Launched map tasks=127
09/04/11 08:21:38 INFO mapred.JobClient: Data-local map tasks=45
09/04/11 08:21:38 INFO mapred.JobClient: FileSystemCounters
09/04/11 08:21:38 INFO mapred.JobClient: FILE_BYTES_READ=12667214
09/04/11 08:21:38 INFO mapred.JobClient: HDFS_BYTES_READ=33485841275
09/04/11 08:21:38 INFO mapred.JobClient: FILE_BYTES_WRITTEN=989397
09/04/11 08:21:38 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=904
09/04/11 08:21:38 INFO mapred.JobClient: Map-Reduce Framework
09/04/11 08:21:38 INFO mapred.JobClient: Reduce input groups=100
09/04/11 08:21:38 INFO mapred.JobClient: Combine output records=4489
09/04/11 08:21:38 INFO mapred.JobClient: Map input records=1209901509
09/04/11 08:21:38 INFO mapred.JobClient: Reduce shuffle bytes=19140
09/04/11 08:21:38 INFO mapred.JobClient: Reduce output records=100
09/04/11 08:21:38 INFO mapred.JobClient: Spilled Records=9481
09/04/11 08:21:38 INFO mapred.JobClient: Map output bytes=10282306995
09/04/11 08:21:38 INFO mapred.JobClient: Map input bytes=274600205558
09/04/11 08:21:38 INFO mapred.JobClient: Combine input records=1142482941
09/04/11 08:21:38 INFO mapred.JobClient: Map output records=1142478555
09/04/11 08:21:38 INFO mapred.JobClient: Reduce input records=103
hadoop job
Map input bytes
HDFS_BYTES_READ
Job, Task, and Task Attempt IDs
The format of a job ID is composed of the time that the jobtracker (not the job) started and an incrementing counter maintained by the jobtracker to uniquely identify the job to that instance of the jobtracker. So the job with this ID:

job_200904110811_0002

is the second (0002; job IDs are 1-based) job run by the jobtracker, which started at 08:11 on April 11, 2009. The counter is formatted with leading zeros to make job IDs sort nicely.

Tasks belong to a job, and their IDs are formed by replacing the job prefix of a job ID with a task prefix and adding a suffix to identify the task within the job. For example:

task_200904110811_0002_m_000003

is the fourth (000003; task IDs are 0-based) map (m) task of the job with ID job_200904110811_0002.

Tasks may be executed more than once, due to failure or speculative execution, so task attempts are given unique IDs to distinguish different instances of a task execution. For example:

attempt_200904110811_0002_m_000003_0

is the first (0; attempt IDs are 0-based) attempt at running task task_200904110811_0002_m_000003.
The MapReduce Web UI
The jobtracker page
JobsetJobName()
mapred.job.name
mapred.jobtracker.com
pleteuserjobs.maximum
Job History
hadoop.job.history.location
hadoop.job.history.user.locationnone
hadoop job -history
The job page
Retrieving the Results
Each reducer produces one output file in the job's output directory. If the output is small, a convenient way of retrieving it is to copy it to the local machine with the -getmerge option to hadoop fs, which gets all the files in the source directory and merges them into a single local file:
% hadoop fs -getmerge max-temp max-temp-local
% sort max-temp-local | tail
1991 607
1992 605
1993 567
1994 568
1995 567
1996 561
1997 565
1998 568
1999 568
2000 558
Another way of retrieving small outputs is to use the -cat option to print the output files to the console:
% hadoop fs -cat max-temp/*
Debugging a Job
public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
enum Temperature {
OVER_100
}
private NcdcRecordParser parser = new NcdcRecordParser();
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
parser.parse(value);
if (parser.isValidTemperature()) {
int airTemperature = parser.getAirTemperature();
if (airTemperature > 1000) {
System.err.println("Temperature over 100 degrees for input: " + value);
context.setStatus("Detected possibly corrupt record: see logs.");
context.getCounter(Temperature.OVER_100).increment(1);
}
context.write(new Text(parser.getYear()), new IntWritable(airTemperature));
}
}
}
If the temperature is over 100°C (represented by 1000, since temperatures are stored in tenths of a degree), we print a line to standard error with the suspect record, update the task's status message using setStatus() on the Context to prompt us to look in the logs, and increment the OVER_100 counter.
The tasks page
task_200904110811_0003_m_000044
The task details page
webinterface.private.actions
true
webinterface.private.actionstrue
dfs.web.ugi
Temperature over 100 degrees for input:
0335999999433181957042302005+37950+139117SAO +0004RJSN V020113590031500703569999994
33201957010100005+35317+139650SAO +000899999V02002359002650076249N004000599+0067...
% hadoop job -counter job_200904110811_0003 'v4.MaxTemperatureMapper$Temperature' \
OVER_100
3
-counter
Handling malformed data
@Test
public void parsesMalformedTemperature() throws IOException,
InterruptedException {
Text value = new Text("0335999999433181957042302005+37950+139117SAO +0004" +
// Year ^^^^
"RJSN V02011359003150070356999999433201957010100005+353");
// Temperature ^^^^^
Counters counters = new Counters();
new MapDriver<LongWritable, Text, Text, IntWritable>()
.withMapper(new MaxTemperatureMapper())
.withInputValue(value)
.withCounters(counters)
.runTest();
Counter c = counters.findCounter(MaxTemperatureMapper.Temperature.MALFORMED);
assertThat(c.getValue(), is(1L));
}
public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
enum Temperature {
MALFORMED
}
private NcdcRecordParser parser = new NcdcRecordParser();
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
parser.parse(value);
if (parser.isValidTemperature()) {
int airTemperature = parser.getAirTemperature();
context.write(new Text(parser.getYear()), new IntWritable(airTemperature));
} else if (parser.isMalformedTemperature()) {
System.err.println("Ignoring possibly corrupt input: " + value);
context.getCounter(Temperature.MALFORMED).increment(1);
}
}
}
Hadoop Logs
Logs: System daemon logs
Primary audience: Administrators
Description: Each Hadoop daemon produces a logfile (using log4j) and another file that combines standard out and error. Written in the directory defined by the HADOOP_LOG_DIR environment variable.
Further information: "System logfiles" on page 309 and "Logging" on page 352

Logs: HDFS audit logs
Primary audience: Administrators
Description: A log of all HDFS requests, turned off by default. Written to the namenode's log, although this is configurable.
Further information: "Audit Logging" on page 346

Logs: MapReduce job history logs
Primary audience: Users
Description: A log of the events (such as task completion) that occur in the course of running a job. Saved centrally on the jobtracker and in the job's output directory in a _logs/history subdirectory.
Further information: "Job History" on page 167

Logs: MapReduce task logs
Primary audience: Users
Description: Each tasktracker child process produces a logfile using log4j (called syslog), a file for data sent to standard out (stdout), and a file for standard error (stderr). Written in the userlogs subdirectory of the directory defined by the HADOOP_LOG_DIR environment variable.
Further information: This section
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapreduce.Mapper;
public class LoggingIdentityMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
private static final Log LOG = LogFactory.getLog(LoggingIdentityMapper.class);
@Override
public void map(KEYIN key, VALUEIN value, Context context)
throws IOException, InterruptedException {
// Log to stdout file
System.out.println("Map key: " + key);
// Log to syslog file
LOG.info("Map key: " + key);
if (LOG.isDebugEnabled()) {
LOG.debug("Map value: " + value);
}
context.write((KEYOUT) key, (VALUEOUT) value);
}
}
The default log level is INFO, so DEBUG-level messages do not appear in the syslog task logfile. Sometimes, though, you want to see them; in that case, set mapred.map.child.log.level or mapred.reduce.child.log.level, as appropriate. For example:
% hadoop jar hadoop-examples.jar LoggingDriver -conf conf/hadoop-cluster.xml \
-D mapred.map.child.log.level=DEBUG input/ncdc/sample.txt logging-out
There are some controls for managing the retention and size of task logs. By default, logs are deleted after a minimum of 24 hours (set with the mapred.userlog.retain.hours property), and the mapred.userlog.limit.kb property caps the maximum size of each logfile (0 by default, meaning no cap). Sometimes you may need to debug a problem that you suspect is occurring in the JVM running a Hadoop command rather than on the cluster; DEBUG-level logs can be sent to the console like this:
% HADOOP_ROOT_LOGGER=DEBUG,console hadoop fs -text /foo/bar
Remote Debugging
When a task fails and there is not enough information logged to diagnose the error, there are a few options. You can try to reproduce the failure locally, perhaps in a debugger or a unit test, using the offending input. A common cause of failure is an out-of-memory error in the task JVM; in that case, you can set mapred.child.java.opts to include -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dumps so that a heap dump is written for later analysis. Task profiling (described later in this chapter) is another option. Older releases also shipped an IsolationRunner for rerunning failed tasks in place on the node where they failed, but it is no longer available in recent versions of Hadoop.

In all of these cases, it can be useful to keep a failed task's intermediate files for later inspection. Set keep.failed.task.files to true to keep them (or set keep.task.files.pattern to a regular expression matching the IDs of the tasks whose files you want to keep, whether or not they failed), then go to the node where the task ran and look under the directory given by mapred.local.dir, in mapred.local.dir/taskTracker/jobcache/job-ID/task-attempt-ID.
Tuning a Job
Area: Number of mappers
Best practice: How long are your mappers running for? If they are only running for a few seconds on average, you should see whether there's a way to have fewer mappers and make them all run longer, a minute or so, as a rule of thumb. The extent to which this is possible depends on the input format you are using.
Further information: "Small files and CombineFileInputFormat" on page 239

Area: Number of reducers
Best practice: For maximum performance, the number of reducers should be slightly less than the number of reduce slots in the cluster. This allows the reducers to finish in one wave and fully utilizes the cluster during the reduce phase.
Further information: "Choosing the Number of Reducers" on page 231

Area: Combiners
Best practice: Check whether your job can take advantage of a combiner to reduce the amount of data passing through the shuffle.
Further information: "Combiner Functions" on page 33

Area: Intermediate compression
Best practice: Job execution time can almost always benefit from enabling map output compression.
Further information: "Compressing map output" on page 92

Area: Custom serialization
Best practice: If you are using your own custom Writable objects or custom comparators, make sure you have implemented RawComparator.
Further information: "Implementing a RawComparator for speed" on page 106

Area: Shuffle tweaks
Best practice: The MapReduce shuffle exposes around a dozen tuning parameters for memory management, which may help you wring out the last bit of performance.
Further information: "Configuration Tuning" on page 211
Profiling Tasks
The HPROF profiler
JobConf
MaxTemperatureDriver
Configuration conf = getConf();
conf.setBoolean("mapred.task.profile", true);
conf.set("mapred.task.profile.params", "-agentlib:hprof=cpu=samples," +
"heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s");
conf.set("mapred.task.profile.maps", "0-2");
conf.set("mapred.task.profile.reduces", "); // no reduces
Job job = new Job(conf, "Max temperature");
JobContext.TASK_PROFILE (mapred.task.profile)
depth=6
JobContext.TASK_PROFILE_PARAMS (mapred.task.profile.params)
mapred.task.profile.maps
mapred.task.profile.reduces
0-2
0-1,4,6-
JobContext.NUM_MAP_PROFILES, JobContext.NUM_REDUCE_PROFILES
CPU SAMPLES BEGIN (total = 1002) Sat Apr 11 11:17:52 2009
rank self accum count trace method
1 3.49% 3.49% 35 307969 java.lang.Object.<init>
2 3.39% 6.89% 34 307954 java.lang.Object.<init>
3 3.19% 10.08% 32 307945 java.util.regex.Matcher.<init>
4 3.19% 13.27% 32 307963 java.lang.Object.<init>
5 3.19% 16.47% 32 307973 java.lang.Object.<init>
TRACE 307973: (thread=200001)
java.lang.Object.<init>(Object.java:20)
org.apache.hadoop.io.IntWritable.<init>(IntWritable.java:29)
v5.MaxTemperatureMapper.map(MaxTemperatureMapper.java:30)
v5.MaxTemperatureMapper.map(MaxTemperatureMapper.java:14)
org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:356)
IntWritable
Writable
Writable
Other profilers
MapReduce Workflows
Decomposing a Problem into MapReduce Jobs
029070-99999 19010101 0
029070-99999 19020101 -94
...
029070-99999 0101 -68
ChainMapper
ChainReducer
JobControl
JobClient.runJob(conf1);
JobClient.runJob(conf2);
runJob()IOException
waitForCompletion()Jobtrue
false
org.apache.hadoop.mapreduce.jobcontrol
JobControlorg.apache.hadoop.mapred.jobcon
trolJobControl
JobControl
JobControl
JobControl
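The classes are not shown in full here. As a rough sketch of a two-job dependency chain (assuming the new-API classes ControlledJob and JobControl, taken here from the org.apache.hadoop.mapreduce.lib.jobcontrol package, and two already-configured Job objects job1 and job2):

import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

ControlledJob cjob1 = new ControlledJob(job1.getConfiguration());
ControlledJob cjob2 = new ControlledJob(job2.getConfiguration());
cjob2.addDependingJob(cjob1);           // run job2 only after job1 succeeds

JobControl jobControl = new JobControl("max-temp-chain");
jobControl.addJob(cjob1);
jobControl.addJob(cjob2);

new Thread(jobControl).start();         // JobControl implements Runnable
while (!jobControl.allFinished()) {
  Thread.sleep(1000);                   // poll until both jobs have completed
}
jobControl.stop();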
Apache Oozie
JobControl
Defining an Oozie workflow
<workflow-app xmlns="uri:oozie:workflow:0.1" name="max-temp-workflow">
<start to="max-temp-mr"/>
<action name="max-temp-mr">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<prepare>
<delete path="${nameNode}/user/${wf:user()}/output"/>
</prepare>
<configuration>
<property>
<name>mapred.mapper.class</name>
<value>OldMaxTemperature$OldMaxTemperatureMapper</value>
</property>
<property>
<name>mapred.combiner.class</name>
<value>OldMaxTemperature$OldMaxTemperatureReducer</value>
</property>
<property>
<name>mapred.reducer.class</name>
<value>OldMaxTemperature$OldMaxTemperatureReducer</value>
</property>
<property>
<name>mapred.output.key.class</name>
<value>org.apache.hadoop.io.Text</value>
</property>
<property>
<name>mapred.output.value.class</name>
<value>org.apache.hadoop.io.IntWritable</value>
</property>
<property>
<name>mapred.input.dir</name>
<value>/user/${wf:user()}/input/ncdc/micro</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>/user/${wf:user()}/output</value>
</property>
</configuration>
</map-reduce>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>MapReduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]
</message>
</kill>
<end name="end"/>
</workflow-app>
start
map-reducekillend
startend
startmax-temp-mr
end
kill
message
map-reduce
job-trackername-node
prepare
configuration
mapred.input.dir, mapred.output.dir (FileInputFormat, FileOutputFormat)
${wf:user()}
Packaging and deploying an Oozie workflow application
max-temp-workflow/
lib/
hadoop-examples.jar
workflow.xml
% hadoop fs -put hadoop-examples/target/max-temp-workflow max-temp-workflow
Running an Oozie workflow job
oozie
OOZIE_URL
oozie
% export OOZIE_URL="http://localhost:11000/oozie"
oozieoozie help
job-run
% oozie job -config ch05/src/main/resources/max-temp-workflow.properties -run
job: 0000009-120119174508294-oozie-tom-W
-config
nameNodejobTracker
oozie.wf.application.path
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
oozie.wf.application.path=${nameNode}/user/${user.name}/max-temp-workflow
-info
oozie job
% oozie job -info 0000009-120119174508294-oozie-tom-W
RUNNINGKILLEDSUCCEEDED
% hadoop fs -cat output/part-*
1949 111
1950 22
CHAPTER 6
How MapReduce Works
Anatomy of a MapReduce Job Run
submit()Job
waitForCompletion()
mapred.job.tracker
local
mapred.job.tracker
mapreduce.framework.name
JobClient.submitJob(conf), JobClient.runJob(conf)
local, classic
yarn
Classic MapReduce (MapReduce 1)
JobTracker
TaskTracker
Job submission
submit() (Job), JobSubmitter.submitJobInternal(), waitForCompletion()
JobSubmitter
getNewJobId() (JobTracker)
mapred.submit.replication
submitJob()
JobTracker
Job initialization
JobTrackersubmitJob()
mapred.reduce.tasks Job setNumReduceTasks()
OutputCommitter
FileOutputCommitter
Task assignment
Task execution
TaskRunner
TaskRunner
OutputCommitter
Progress and status updates
Streaming and pipes.
What Constitutes Progress in MapReduce?
ReportersetStatus()
ReporterincrCounter()
Reporterprogress()
Job
Job
getStatus()JobStatus
Job completion
Job
waitForCompletion()
job.end.notification.url
YARN (MapReduce 2)
Job submission
ClientProtocol, mapreduce.framework.name = yarn
yarn.app.mapreduce.am.compute-splits-in-cluster
submitApplication()
Job initialization
submitApplication()
MRAppMaster
mapreduce.job.reduces
mapreduce.job.ubertask.maxmaps
mapreduce.job.ubertask.maxreduces, mapreduce.job.ubertask.maxbytes
mapreduce.job.ubertask.enable
false
OutputCommitter
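As a sketch, a job could opt in to uber execution through its configuration (property names as above; the thresholds are illustrative):

Configuration conf = job.getConfiguration();
conf.setBoolean("mapreduce.job.ubertask.enable", true); // let the application master run a small job in its own JVM
conf.setInt("mapreduce.job.ubertask.maxmaps", 9);       // illustrative threshold
conf.setInt("mapreduce.job.ubertask.maxreduces", 1);    // illustrative threshold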
Task assignment
mapreduce.map.memory.mb, mapreduce.reduce.memory.mb
yarn.scheduler.capacity.minimum-allocation-mb, yarn.scheduler.capacity.maximum-allocation-mb
mapreduce.map.memory.mb, mapreduce.reduce.memory.mb
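A hedged sketch of setting per-task memory requests for a MapReduce 2 job (the values are illustrative and must fall within the scheduler's minimum and maximum allocations):

Configuration conf = job.getConfiguration();
conf.setInt("mapreduce.map.memory.mb", 1536);    // container memory requested for each map task
conf.setInt("mapreduce.reduce.memory.mb", 3072); // container memory requested for each reduce task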
Task execution
YarnChild
YarnChild
Yarn
Child
Progress and status updates
mapreduce.client.progressmonitor.pollinterval
Job completion
waitForCompletion()
Job, mapreduce.client.completion.pollinterval
OutputCommitter
Failures
Failures in Classic MapReduce
Task failure
stream.non.zero.exit.is.failure
true
mapred.task.timeout
mapred.map.max.attempts, mapred.reduce.max.attempts
mapred.max.map.failures.percent, mapred.max.reduce.failures.percent
mapred.map.max.attempts, mapred.reduce.max.attempts
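For example, a job that can tolerate a few bad map tasks might be configured along these lines (old-API property names as above; the values are illustrative):

Configuration conf = job.getConfiguration();
conf.setInt("mapred.map.max.attempts", 8);          // retry each map up to 8 times before giving up
conf.setInt("mapred.max.map.failures.percent", 5);  // let the job succeed if no more than 5% of maps fail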
mapred.task.tracker.task-controller
org.apache.hadoop.mapred.LinuxTaskController
org.apache.hadoop.mapred.DefaultTaskController, setsid
hadoop job
Tasktracker failure
mapred.tasktracker.expiry.interval
mapred.max.tracker.failures
mapred.max.tracker.blacklists
Jobtracker failure
mapred.jobtracker.restart.recover
Failures in YARN
Task failure
mapreduce.task.timeout
mapreduce.map.maxattempts, mapreduce.reduce.maxattempts
mapreduce.map.failures.maxpercent
mapreduce.reduce.failures.maxpercent
Application master failure
yarn.resourcemanager.am.max-retries
yarn.app.mapreduce.am.job.recovery.enable (true)
Node manager failure
yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms
mapreduce.job.maxtaskfailures.per.tracker
Resource manager failure
yarn.resourcemanager.store.class (org.apache.hadoop.yarn.server.resourcemanager.recovery.MemStore)
Job Scheduling
mapred.job.priority
setJobPriority() (JobClient)
VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW
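A sketch of setting the priority of a job submitted to the default FIFO queue (old API; either the property or the setter can be used; the driver class name is illustrative):

JobConf conf = new JobConf(getConf(), MaxTemperatureDriver.class); // illustrative driver class
conf.setJobPriority(JobPriority.HIGH);    // one of VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW
// equivalently: conf.set("mapred.job.priority", "HIGH");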
The Fair Scheduler
mapred.jobtracker.taskScheduler
org.apache.hadoop.mapred.FairScheduler
The Capacity Scheduler
Shuffle and Sort
The Map Side
io.sort.mb
io.sort.spill.percent
0.80
mapred.local.dir
io.sort.factor
min.num.spills.for.combine
mapred.compress.map.output (true)
mapred.map.output.compression.codec
tasktracker.http.threads
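On the map side, the change that most often pays off is compressing map output. A hedged sketch using the properties named above (Snappy is assumed to be available on the cluster):

Configuration conf = job.getConfiguration();
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec",
    org.apache.hadoop.io.compress.SnappyCodec.class,
    org.apache.hadoop.io.compress.CompressionCodec.class);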
The Reduce Side
mapred.reduce.parallel.copies
mapred.job.shuffle.input.buffer.percent
mapred.job.shuffle.merge.percent
mapred.inmem.merge.threshold
io.sort.factor
Configuration Tuning
mapred.child.java.opts
io.sort.*
io.sort.mb
mapred.inmem.merge.threshold 0
mapred.job.reduce.input.buffer.percent (1.0)
io.file.buffer.size
io.sort.mb (int, default 100): The size, in megabytes, of the memory buffer to use while sorting map output.
io.sort.record.percent (float, default 0.05): The proportion of io.sort.mb reserved for storing record boundaries of the map outputs. The remaining space is used for the map output records themselves. This property was removed in releases after 1.x, as the shuffle code was improved to do a better job of using all the available memory for map output and accounting information.
io.sort.spill.percent (float, default 0.80): The threshold usage proportion for both the map output memory buffer and the record boundaries index to start the process of spilling to disk.
io.sort.factor (int, default 10): The maximum number of streams to merge at once when sorting files. This property is also used in the reduce. It's fairly common to increase this to 100.
min.num.spills.for.combine (int, default 3): The minimum number of spill files needed for the combiner to run (if a combiner is specified).
mapred.compress.map.output (boolean, default false): Compress map outputs.
mapred.map.output.compression.codec (Class name, default org.apache.hadoop.io.compress.DefaultCodec): The compression codec to use for map outputs.
tasktracker.http.threads (int, default 40): The number of worker threads per tasktracker for serving the map outputs to reducers. This is a cluster-wide setting and cannot be set by individual jobs. Not applicable in MapReduce 2.
mapred.reduce.parallel.copies (int, default 5): The number of threads used to copy map outputs to the reducer.
mapred.reduce.copy.backoff (int, default 300): The maximum amount of time, in seconds, to spend retrieving one map output for a reducer before declaring it as failed. The reducer may repeatedly reattempt a transfer within this time if it fails (using exponential backoff).
io.sort.factor (int, default 10): The maximum number of streams to merge at once when sorting files. This property is also used in the map.
mapred.job.shuffle.input.buffer.percent (float, default 0.70): The proportion of total heap size to be allocated to the map outputs buffer during the copy phase of the shuffle.
mapred.job.shuffle.merge.percent (float, default 0.66): The threshold usage proportion for the map outputs buffer (defined by mapred.job.shuffle.input.buffer.percent) for starting the process of merging the outputs and spilling to disk.
mapred.inmem.merge.threshold (int, default 1000): The threshold number of map outputs for starting the process of merging the outputs and spilling to disk. A value of 0 or less means there is no threshold, and the spill behavior is governed solely by mapred.job.shuffle.merge.percent.
mapred.job.reduce.input.buffer.percent (float, default 0.0): The proportion of total heap size to be used for retaining map outputs in memory during the reduce. For the reduce phase to begin, the size of map outputs in memory must be no more than this size. By default, all map outputs are merged to disk before the reduce begins, to give the reducers as much memory as possible. However, if your reducers require less memory, this value may be increased to minimize the number of trips to disk.
Task Execution
The Task Execution Environment
configure()MapperReducer
MapperReducer
mapred.job.id (String): The job ID (see "Job, Task, and Task Attempt IDs" on page 165 for a description of the format). Example: job_200811201130_0004
mapred.tip.id (String): The task ID. Example: task_200811201130_0004_m_000003
mapred.task.id (String): The task attempt ID (not the task ID). Example: attempt_200811201130_0004_m_000003_0
mapred.task.partition (int): The index of the task within the job. Example: 3
mapred.task.is.map (boolean): Whether this task is a map task. Example: true
Streaming environment variables
mapred.job.id
os.environ["mapred_job_id"]
-cmdenv
MAGIC_PARAMETER
-cmdenv MAGIC_PARAMETER=abracadabra
Speculative Execution
mapred.map.tasks.speculative.execution (boolean, default true): Whether extra instances of map tasks may be launched if a task is making slow progress.
mapred.reduce.tasks.speculative.execution (boolean, default true): Whether extra instances of reduce tasks may be launched if a task is making slow progress.
yarn.app.mapreduce.am.job.speculator.class (Class, default org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator): The Speculator class implementing the speculative execution policy (MapReduce 2 only).
yarn.app.mapreduce.am.job.task.estimator.class (Class, default org.apache.hadoop.mapreduce.v2.app.speculate.LegacyTaskRuntimeEstimator): An implementation of TaskRuntimeEstimator used by Speculator instances that provides estimates for task runtimes (MapReduce 2 only).
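Speculative execution can be switched on or off per job. For example, to keep it for map tasks but turn it off for reduce tasks, a sketch using the properties in the list above:

Configuration conf = job.getConfiguration();
conf.setBoolean("mapred.map.tasks.speculative.execution", true);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);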
OutputCommitter
Output Committers
OutputCommitter
setOutputCommitter()
JobConf, mapred.output.committer.class
OutputCommitter, OutputFormat.getOutputCommitter(), FileOutputCommitter
OutputCommitter
OutputCommitter
public abstract class OutputCommitter {
public abstract void setupJob(JobContext jobContext) throws IOException;
public void commitJob(JobContext jobContext) throws IOException { }
public void abortJob(JobContext jobContext, JobStatus.State state)
throws IOException { }
public abstract void setupTask(TaskAttemptContext taskContext)
throws IOException;
public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
throws IOException;
public abstract void commitTask(TaskAttemptContext taskContext)
throws IOException;
public abstract void abortTask(TaskAttemptContext taskContext)
throws IOException;
}
setupJob()
FileOutputCommitter
${mapred.output.dir}, ${mapred.output.dir}/_temporary
commitJob()
abortJob()
setupTask()
false
needsTaskCommit()
commitTask(), abortTask(), FileOutputCommitter
commitTask()
${mapred.output.dir}
abortTask()
Task side-effect files
OutputCollector
OutputCommitter
mapred.work.output.dir
getWorkOutputPath() (FileOutputFormat)
Path
NLineInputFormat
Task JVM Reuse
mapred.job.reuse.jvm.num.tasks
setNumTasksToExecutePerJvm() (JobConf)
mapred.job.reuse.jvm.num.tasks (int, default 1): The maximum number of tasks to run for a given job for each JVM on a tasktracker. A value of -1 indicates no limit, which means the same JVM may be used for all tasks for a job.
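A sketch of enabling unlimited JVM reuse for a job's tasks (old API; -1 means no limit):

JobConf conf = new JobConf(getConf());
conf.setNumTasksToExecutePerJvm(-1); // same effect as setting mapred.job.reuse.jvm.num.tasks to -1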
Skipping Bad Records
TextInputFormat
mapred.linerecordreader.maxlength
SkipBadRecords
mapred.map.max.attempts
mapred.reduce.max.attempts
hadoop fs -text
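A hedged sketch of turning on skipping mode with the old API's SkipBadRecords helper (the values are illustrative):

JobConf conf = new JobConf(getConf());
SkipBadRecords.setMapperMaxSkipRecords(conf, 1);     // acceptable size of the skipped range, in records
SkipBadRecords.setAttemptsToStartSkipping(conf, 2);  // start skipping mode after two failed attempts
conf.setInt("mapred.map.max.attempts", 6);           // leave extra attempts for narrowing down the bad record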
CHAPTER 7
MapReduce Types and Formats
MapReduce Types
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
K1V1
K2V2
K3V3
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
public class Context extends MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
// ...
}
protected void map(KEYIN key, VALUEIN value,
Context context) throws IOException, InterruptedException {
// ...
}
}
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
public class Context extends ReducerContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
// ...
}
protected void reduce(KEYIN key, Iterable<VALUEIN> values,
Context context) throws IOException,
InterruptedException {
// ...
}
}
write()
public void write(KEYOUT key, VALUEOUT value)
throws IOException, InterruptedException
MapperReducer
KEYINMapper
KEYINReducer
KEYINLongWrita
bleMapperTextReducer
KEYINK1
Reducer
K2V2
map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
K3
K2V3V2
K2V2
partition: (K2, V2) → integer
public abstract class Partitioner<KEY, VALUE> {
public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
MapReduce Signatures in the Old API
K1V1
public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
throws IOException;
}
public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {
void reduce(K2 key, Iterator<V2> values,
OutputCollector<K3, V3> output, Reporter reporter) throws IOException;
}
public interface Partitioner<K2, V2> extends JobConfigurable {
int getPartition(K2 key, V2 value, int numPartitions);
}
TextInputFormat
LongWritableText
JobJobConf
LongWritable
TextK2K3setMapOutputKeyClass()
setOutputKeyClass()V2
V3setOutputValueClass()
(The types are the input types K1 and V1, the intermediate types K2 and V2, and the output types K3 and V3.)
Properties for configuring types:
mapreduce.job.inputformat.class / setInputFormatClass(): K1, V1
mapreduce.map.output.key.class / setMapOutputKeyClass(): K2
mapreduce.map.output.value.class / setMapOutputValueClass(): V2
mapreduce.job.output.key.class / setOutputKeyClass(): K3
mapreduce.job.output.value.class / setOutputValueClass(): V3
Properties that must be consistent with the types:
mapreduce.job.map.class / setMapperClass(): K1, V1, K2, V2
mapreduce.job.combine.class / setCombinerClass(): K2, V2
mapreduce.job.partitioner.class / setPartitionerClass(): K2, V2
mapreduce.job.output.key.comparator.class / setSortComparatorClass(): K2
mapreduce.job.output.group.comparator.class / setGroupingComparatorClass(): K2
mapreduce.job.reduce.class / setReducerClass(): K2, V2, K3, V3
mapreduce.job.outputformat.class / setOutputFormatClass(): K3, V3
(The types are the input types K1 and V1, the intermediate types K2 and V2, and the output types K3 and V3.)
Properties for configuring types:
mapred.input.format.class / setInputFormat(): K1, V1
mapred.mapoutput.key.class / setMapOutputKeyClass(): K2
mapred.mapoutput.value.class / setMapOutputValueClass(): V2
mapred.output.key.class / setOutputKeyClass(): K3
mapred.output.value.class / setOutputValueClass(): V3
Properties that must be consistent with the types:
mapred.mapper.class / setMapperClass(): K1, V1, K2, V2
mapred.map.runner.class / setMapRunnerClass(): K1, V1, K2, V2
mapred.combiner.class / setCombinerClass(): K2, V2
mapred.partitioner.class / setPartitionerClass(): K2, V2
mapred.output.key.comparator.class / setOutputKeyComparatorClass(): K2
mapred.output.value.groupfn.class / setOutputValueGroupingComparator(): K2
mapred.reducer.class / setReducerClass(): K2, V2, K3, V3
mapred.output.format.class / setOutputFormat(): K3, V3
The Default MapReduce Job
public class MinimalMapReduce extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.printf("Usage: %s [generic options] <input> <output>\n",
getClass().getSimpleName());
ToolRunner.printGenericCommandUsage(System.err);
return -1;
}
Job job = new Job(getConf());
job.setJarByClass(getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new MinimalMapReduce(), args);
System.exit(exitCode);
}
}
% hadoop MinimalMapReduce "input/ncdc/all/190{1,2}.gz" output
0→0029029070999991901010106004+64333+023450FM-12+000599999V0202701N01591...
0→0035029070999991902010106004+64333+023450FM-12+000599999V0201401N01181...
135→0029029070999991901010113004+64333+023450FM-12+000599999V0202901N00821...
141→0035029070999991902010113004+64333+023450FM-12+000599999V0201401N01181...
270→0029029070999991901010120004+64333+023450FM-12+000599999V0209991C00001...
282→0035029070999991902010120004+64333+023450FM-12+000599999V0201401N01391...
MinimalMapReduce
public class MinimalMapReduceWithDefaults extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
if (job == null) {
return -1;
}
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(Mapper.class);
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);
job.setPartitionerClass(HashPartitioner.class);
job.setNumReduceTasks(1);
job.setReducerClass(Reducer.class);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
job.setOutputFormatClass(TextOutputFormat.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new MinimalMapReduceWithDefaults(), args);
System.exit(exitCode);
}
}
run()
JobBuilder
public static Job parseInputAndOutput(Tool tool, Configuration conf,
String[] args) throws IOException {
if (args.length != 2) {
printUsage(tool, "<input> <output>");
return null;
}
Job job = new Job(conf);
job.setJarByClass(tool.getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job;
}
public static void printUsage(Tool tool, String extraArgsUsage) {
System.err.printf("Usage: %s [genericOptions] %s\n\n",
tool.getClass().getSimpleName(), extraArgsUsage);
GenericOptionsParser.printGenericCommandUsage(System.err);
}
MinimalMapReduceWithDefaults
TextInputFormatLongWrita
bleText
Mapper
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
protected void map(KEYIN key, VALUEIN value,
Context context) throws IOException, InterruptedException {
context.write((KEYOUT) key, (VALUEOUT) value);
}
}
Mapper
LongWritable
Text
HashPartitioner
public class HashPartitioner<K, V> extends Partitioner<K, V> {
public int getPartition(K key, V value,
int numReduceTasks) {
return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
}
HashPartitioner
Choosing the Number of Reducers
mapred.tasktracker.reduce.tasks.maximum
Reducer
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
protected void reduce(KEYIN key, Iterable<VALUEIN> values,
Context context) throws IOException, InterruptedException {
for (VALUEIN value: values) {
context.write((KEYOUT) key, (VALUEOUT) value);
}
}
}
LongWritableText
LongWritableText
TextOutputFormat
TextOutputFormat
The default Streaming job
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper /bin/cat
TextInputFormat
LongWritableText
Text
LongWritableText
TextInputFormat
stream.map.input.ignoreKeytrue
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-mapper /bin/cat \
-partitioner org.apache.hadoop.mapred.lib.HashPartitioner \
-numReduceTasks 1 \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-outputformat org.apache.hadoop.mapred.TextOutputFormat
-combiner
-io rawbytes-io typedbytes
-io text
Keys and values in Streaming
stream.num.map.output.key.fields stream.num.reduce.output.key.fields
a,b,c
a,bc
stream.reduce.output.field.separator
a:b
abTextOutputFormat
ab
TextOutputFormat mapred.textoutputfor
mat.separator
stream.map.input.field.separator (String, default \t): The separator to use when passing the input key and value strings to the stream map process as a stream of bytes.
stream.map.output.field.separator (String, default \t): The separator to use when splitting the output from the stream map process into key and value strings for the map output.
stream.num.map.output.key.fields (int, default 1): The number of fields separated by stream.map.output.field.separator to treat as the map output key.
stream.reduce.input.field.separator (String, default \t): The separator to use when passing the input key and value strings to the stream reduce process as a stream of bytes.
stream.reduce.output.field.separator (String, default \t): The separator to use when splitting the output from the stream reduce process into key and value strings for the final reduce output.
stream.num.reduce.output.key.fields (int, default 1): The number of fields separated by stream.reduce.output.field.separator to treat as the reduce output key.
Input Formats
Input Splits and Records
DBInputFormat
InputSplit
org.apache.hadoop.mapreduce
public abstract class InputSplit {
public abstract long getLength() throws IOException, InterruptedException;
public abstract String[] getLocations() throws IOException,
InterruptedException;
}
org.apache.hadoop.mapred
InputSplit
InputSplit
InputFormatInputFormat
InputFormat
public abstract class InputFormat<K, V> {
public abstract List<InputSplit> getSplits(JobContext context)
throws IOException, InterruptedException;
public abstract RecordReader<K, V>
createRecordReader(InputSplit split,
TaskAttemptContext context) throws IOException,
InterruptedException;
}
getSplits()
createRecordReader()InputFormatRecordReader
RecordReader
Mapperrun()
public void run(Context context) throws IOException, InterruptedException {
setup(context);
while (context.nextKeyValue()) {
map(context.getCurrentKey(), context.getCurrentValue(), context);
}
cleanup(context);
}
setup()nextKeyValue()Context
RecordReader
Record
ReaderContextmap()
nextKeyValue()
falsecleanup()
Record
Reader
getCurrentKey()getCurrentValue()
nextKeyValue()
map()
Text
new Text(value)
Mapperrun()
MultithreadedMapper
mapreduce.mapper.multithreadedmapper.threads
FileInputFormat
FileInputFormatInputFormat
FileInputFormat input paths
FileInputFormat
Job
public static void addInputPath(Job job, Path path)
public static void addInputPaths(Job job, String commaSeparatedPaths)
public static void setInputPaths(Job job, Path... inputPaths)
public static void setInputPaths(Job job, String commaSeparatedPaths)
addInputPath()addInputPaths()
setInput
Paths()
Job
236 | Chapter 7:MapReduce Types and Formats
mapred.input.dir.recursivetrue
setInputPathFilter()
FileInputFormat
public static void setInputPathFilter(Job job, Class<? extends PathFilter> filter)
FileInputFormat
setInputPathFilter()
-input
mapred.input.dir (comma-separated paths, no default): The input files for a job. Paths that contain commas should have those commas escaped by a backslash character. For example, the glob {a,b} would be escaped as {a\,b}.
mapred.input.pathFilter.class (PathFilter classname, no default): The filter to apply to the input files for a job.
FileInputFormat input splits
FileInputFormatFileInputFormat
mapred.min.split.size (int, default 1): The smallest valid size in bytes for a file split.
mapred.max.split.size (long, default Long.MAX_VALUE, that is, 9223372036854775807): The largest valid size in bytes for a file split. This property is not present in the old MapReduce API (with the exception of CombineFileInputFormat). Instead, it is calculated indirectly as the size of the total input for the job, divided by the guide number of map tasks specified by mapred.map.tasks (or the setNumMapTasks() method on JobConf). Because mapred.map.tasks defaults to 1, this makes the maximum split size the size of the input.
dfs.block.size (long, default 64 MB, that is, 67108864): The size of a block in HDFS in bytes.
long
computeSplitSize()
FileInputFormat
max(minimumSize, min(maximumSize, blockSize))
minimumSize < blockSize < maximumSize
blockSize
Minimum split size 1 (default), maximum split size Long.MAX_VALUE (default), block size 64 MB (default), split size 64 MB: By default, the split size is the same as the default block size.
Minimum split size 1 (default), maximum split size Long.MAX_VALUE (default), block size 128 MB, split size 128 MB: The most natural way to increase the split size is to have larger blocks in HDFS, either by setting dfs.block.size or on a per-file basis at file construction time.
Minimum split size 128 MB, maximum split size Long.MAX_VALUE (default), block size 64 MB (default), split size 128 MB: Making the minimum split size greater than the block size increases the split size, but at the cost of locality.
Minimum split size 1 (default), maximum split size 32 MB, block size 64 MB (default), split size 32 MB: Making the maximum split size less than the block size decreases the split size.
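For jobs using the new API, the split size bounds above can also be set through FileInputFormat's static helpers. A sketch (the sizes are illustrative):

FileInputFormat.setMinInputSplitSize(job, 1);
FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024); // 32 MB maximum split size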
Small files and CombineFileInputFormat
FileInputFormat
CombineFileInputFormat
FileInputFormat
CombineFileInputFormat
CombineFileInputFormat
NullWritable
CombineFileInputFormat
CombineFileInputFormat
CombineFileInputFor
mat
CombineFileInputFormat
mapred.max.split.size
CombineFileInputFormat
FileInputFormat
CombineFileInputFormatTextInputFormat
CombineFileInputFormatgetRecordReader()
Preventing splitting
Long.MAX_VALUE
FileInputFormat
isSplitable()false
TextInputFormat
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
public class NonSplittableTextInputFormat extends TextInputFormat {
@Override
protected boolean isSplitable(JobContext context, Path file) {
return false;
}
}
File information in the mapper
getInputSplit()MapperContext
FileInputFormatInputSplit
FileSplit
configure()
MapperJobConf
getPath() / map.input.file (Path/String): The path of the input file being processed.
getStart() / map.input.start (long): The byte offset of the start of the split from the beginning of the file.
getLength() / map.input.length (long): The length of the split in bytes.
SortValidator.RecordStatsChecker
isSplitable()
FileSplit
Processing a whole file as a record
RecordReader
WholeFileInputFormat
public class WholeFileInputFormat
extends FileInputFormat<NullWritable, BytesWritable> {
@Override
protected boolean isSplitable(JobContext context, Path file) {
return false;
}
@Override
public RecordReader<NullWritable, BytesWritable> createRecordReader(
InputSplit split, TaskAttemptContext context) throws IOException,
InterruptedException {
WholeFileRecordReader reader = new WholeFileRecordReader();
reader.initialize(split, context);
return reader;
}
}
WholeFileInputFormat
NullWritableBytesWritable
isSplitable() false
createRecordReader() Record
Reader
class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
private FileSplit fileSplit;
private Configuration conf;
private BytesWritable value = new BytesWritable();
private boolean processed = false;
@Override
public void initialize(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException {
this.fileSplit = (FileSplit) split;
this.conf = context.getConfiguration();
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
if (!processed) {
byte[] contents = new byte[(int) fileSplit.getLength()];
Path file = fileSplit.getPath();
FileSystem fs = file.getFileSystem(conf);
FSDataInputStream in = null;
try {
in = fs.open(file);
IOUtils.readFully(in, contents, 0, contents.length);
value.set(contents, 0, contents.length);
} finally {
IOUtils.closeStream(in);
}
processed = true;
return true;
}
return false;
}
@Override
public NullWritable getCurrentKey() throws IOException, InterruptedException {
return NullWritable.get();
}
@Override
public BytesWritable getCurrentValue() throws IOException,
InterruptedException {
return value;
}
@Override
public float getProgress() throws IOException {
return processed ? 1.0f : 0.0f;
}
@Override
public void close() throws IOException {
// do nothing
}
}
WholeFileRecordReaderFileSplit
WholeFileRecordReader
processed, nextKeyValue()
IOUtils
BytesWritablenext()
true
close()
WholeFileInputFormat
public class SmallFilesToSequenceFileConverter extends Configured
implements Tool {
static class SequenceFileMapper
extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
private Text filenameKey;
@Override
protected void setup(Context context) throws IOException,
InterruptedException {
InputSplit split = context.getInputSplit();
Path path = ((FileSplit) split).getPath();
filenameKey = new Text(path.toString());
}
@Override
protected void map(NullWritable key, BytesWritable value, Context context)
throws IOException, InterruptedException {
context.write(filenameKey, value);
}
}
@Override
public int run(String[] args) throws Exception {
Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
if (job == null) {
return -1;
}
job.setInputFormatClass(WholeFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(BytesWritable.class);
job.setMapperClass(SequenceFileMapper.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new SmallFilesToSequenceFileConverter(), args);
System.exit(exitCode);
}
}
WholeFileInputFormat
InputSplit
FileSplit
Text
SequenceFileOutputFormat
% hadoop jar hadoop-examples.jar SmallFilesToSequenceFileConverter \
-conf conf/hadoop-localhost.xml -D mapred.reduce.tasks=2 input/smallfiles output
-text
% hadoop fs -conf conf/hadoop-localhost.xml -text output/part-r-00000
hdfs://localhost/user/tom/input/smallfiles/a 61 61 61 61 61 61 61 61 61 61
hdfs://localhost/user/tom/input/smallfiles/c 63 63 63 63 63 63 63 63 63 63
hdfs://localhost/user/tom/input/smallfiles/e
% hadoop fs -conf conf/hadoop-localhost.xml -text output/part-r-00001
hdfs://localhost/user/tom/input/smallfiles/b 62 62 62 62 62 62 62 62 62 62
hdfs://localhost/user/tom/input/smallfiles/d 64 64 64 64 64 64 64 64 64 64
hdfs://localhost/user/tom/input/smallfiles/f 66 66 66 66 66 66 66 66 66 66
CombineFileInputFormat
FileInputFormat
Text Input
InputFormat
TextInputFormat
TextInputFormatInputFormat
LongWritable
Text
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
The Relationship Between Input Splits and HDFS Blocks
FileInputFormat
TextInputFormat
KeyValueTextInputFormat
TextInputFormat
TextOutputFormat, OutputFormat, KeyValueTextInputFormat
mapreduce.input.keyvaluelinerecordreader.key.value.separator (formerly key.value.separator.in.input.line)
line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.
TextInputFormat
Text
(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
NLineInputFormat
TextInputFormat, KeyValueTextInputFormat
NLineInputFormat, InputFormat, TextInputFormat
mapreduce.input.lineinputformat.linespermap (formerly mapred.line.input.format.linespermap)
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
TextInputFormat
setNumReduceTasks()Job
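A sketch of configuring NLineInputFormat for a job (new API assumed; the helper sets the property named above):

job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 1000); // each mapper receives 1,000 input lines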
XML
TextInputFormat
StreamXmlRecordReader
org.apache.hadoop.streaming
StreamInputFormat
stream.recordreader.class = org.apache.hadoop.streaming.StreamXmlRecordReader
page
StreamXmlRecordReaderpage
Binary Input
SequenceFileInputFormat
SequenceFileIn
putFormat
IntWritableText
Mapper<IntWritable, Text, K, V>KV
XmlInputFormat
SequenceFileInputFormat
SequenceFileInputFormat
MapFileInputFormat
SequenceFileAsTextInputFormat
SequenceFileAsTextInputFormatSequenceFileInputFormat
Text
toString()
SequenceFileAsBinaryInputFormat
SequenceFileAsBinaryInputFormatSequenceFileInputFormat
BytesWritable
SequenceFile.Writer, appendRaw(), SequenceFileAsBinaryOutputFormat
Multiple Inputs
InputFormatMapper
MultipleInputs
InputFormatMapper
MultipleInputs.addInputPath(job, ncdcInputPath,
TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metOfficeInputPath,
TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);
FileInputFormat.addInputPath(), job.setMapperClass(), TextInputFormat
MaxTemperatureMapper
MetOfficeMaxTemperatureMapper
MultipleInputsaddInputPath()
public static void addInputPath(Job job, Path path,
Class<? extends InputFormat> inputFormatClass)
JobsetMapperClass()
Database Input (and Output)
DBInputFormat
MultipleInputs
DBOutputFormat
TableInputFormat
TableOutputFormat
Output Formats
OutputFormat
Text Output
TextOutputFormat
TextOutputFormat
toString()
mapreduce.output.textoutputformat.separator
mapred.textoutputformat.separator, TextOutputFormat, KeyValueTextInputFormat
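A sketch of changing the separator that TextOutputFormat writes between keys and values (property name as above; a comma is used here purely as an illustration):

job.getConfiguration().set("mapreduce.output.textoutputformat.separator", ",");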
NullOutputFormatNullWritable
TextInputFormat
Binary Output
SequenceFileOutputFormat
SequenceFileOutputFormat
SequenceFileOutputFormat
SequenceFileOutputFormat
SequenceFileAsBinaryOutputFormat
SequenceFileAsBinaryOutputFormat, SequenceFileAsBinaryInputFormat
MapFileOutputFormat
MapFileOutputFormat
MapFileOutputFormat
Multiple Outputs
FileOutputFormat
MultipleOut
puts
An example: Partitioning data
MultipleOutputFormat
MultipleOutputsMultipleOutputsMultipleOutputFormat
MultipleOutputs
MultipleOutputs
MultipleOutputFormat
public class StationPartitioner extends Partitioner<LongWritable, Text> {
private NcdcRecordParser parser = new NcdcRecordParser();
@Override
public int getPartition(LongWritable key, Text value, int numPartitions) {
parser.parse(value);
return getPartition(parser.getStationId());
}
private int getPartition(String stationId) {
...
}
}
getPartition(String)
HashPartitioner
HashPartitioner
MultipleOutputs
MultipleOutputs
MultipleOutputs
name-m-nnnnn for map outputs and name-r-nnnnn for reduce outputs, where name is set by the program and nnnnn is the part number
MultipleOutputs
public class PartitionByStationUsingMultipleOutputs extends Configured
implements Tool {
static class StationMapper
extends Mapper<LongWritable, Text, Text, Text> {
private NcdcRecordParser parser = new NcdcRecordParser();
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
parser.parse(value);
context.write(new Text(parser.getStationId()), value);
}
}
static class MultipleOutputsReducer
extends Reducer<Text, Text, NullWritable, Text> {
private MultipleOutputs<NullWritable, Text> multipleOutputs;
@Override
protected void setup(Context context)
throws IOException, InterruptedException {
multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
}
@Override
protected void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
for (Text value : values) {
multipleOutputs.write(NullWritable.get(), value, key.toString());
}
}
@Override
protected void cleanup(Context context)
throws IOException, InterruptedException {
multipleOutputs.close();
}
}
@Override
public int run(String[] args) throws Exception {
Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
if (job == null) {
return -1;
}
job.setMapperClass(StationMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setReducerClass(MultipleOutputsReducer.class);
job.setOutputKeyClass(NullWritable.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new PartitionByStationUsingMultipleOutputs(),
args);
System.exit(exitCode);
}
}
MultipleOutputs, setup()
MultipleOutputs, reduce()
write()
station_identifier-r-nnnnn
output/010010-99999-r-00027
output/010050-99999-r-00013
output/010100-99999-r-00015
output/010280-99999-r-00014
output/010550-99999-r-00000
output/010980-99999-r-00011
output/011060-99999-r-00025
output/012030-99999-r-00029
output/012350-99999-r-00018
output/012620-99999-r-00004
write()MultipleOutputs
/
@Override
protected void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
for (Text value : values) {
parser.parse(value);
String basePath = String.format("%s/%s/part",
parser.getStationId(), parser.getYear());
multipleOutputs.write(NullWritable.get(), value, basePath);
}
}
MultipleOutputsOutputFormat
TextOutputFormat
OutputFormat
Lazy Output
FileOutputFormat nnnnn
LazyOutputFormat
setOutputFormatClass() (JobConf)
-lazyOutput (LazyOutputFormat)
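A sketch of using LazyOutputFormat in the new API so that empty part files are not created:

import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class); // wrap the real output format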
Database Output
CHAPTER 8
MapReduce Features
Counters
Built-in Counters
MapReduce task counters: org.apache.hadoop.mapred.Task$Counter (1.x), org.apache.hadoop.mapreduce.TaskCounter (post-1.x). See Table 8-2.
Filesystem counters: FileSystemCounters (1.x), org.apache.hadoop.mapreduce.FileSystemCounter (post-1.x). See Table 8-3.
FileInputFormat counters: org.apache.hadoop.mapred.FileInputFormat$Counter (1.x), org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter (post-1.x). See Table 8-4.
FileOutputFormat counters: org.apache.hadoop.mapred.FileOutputFormat$Counter (1.x), org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter (post-1.x). See Table 8-5.
Job counters: org.apache.hadoop.mapred.JobInProgress$Counter (1.x), org.apache.hadoop.mapreduce.JobCounter (post-1.x). See Table 8-6.
Task counters
MAP_INPUT_RECORDS
PHYSICAL_MEMORY_BYTES
VIRTUAL_MEMORY_BYTESCOMMITTED_HEAP_BYTES
Map input records (MAP_INPUT_RECORDS): The number of input records consumed by all the maps in the job. Incremented every time a record is read from a RecordReader and passed to the map's map() method by the framework.
Map skipped records (MAP_SKIPPED_RECORDS): The number of input records skipped by all the maps in the job. See "Skipping Bad Records" on page 220.
Map input bytes (MAP_INPUT_BYTES): The number of bytes of uncompressed input consumed by all the maps in the job. Incremented every time a record is read from a RecordReader and passed to the map's map() method by the framework.
Split raw bytes (SPLIT_RAW_BYTES): The number of bytes of input-split objects read by maps. These objects represent the split metadata (that is, the offset and length within a file) rather than the split data itself, so the total size should be small.
Map output records (MAP_OUTPUT_RECORDS): The number of map output records produced by all the maps in the job. Incremented every time the collect() method is called on a map's OutputCollector.
Map output bytes (MAP_OUTPUT_BYTES): The number of bytes of uncompressed output produced by all the maps in the job. Incremented every time the collect() method is called on a map's OutputCollector.
Map output materialized bytes (MAP_OUTPUT_MATERIALIZED_BYTES): The number of bytes of map output actually written to disk. If map output compression is enabled, this is reflected in the counter value.
Combine input records (COMBINE_INPUT_RECORDS): The number of input records consumed by all the combiners (if any) in the job. Incremented every time a value is read from the combiner's iterator over values. Note that this count is the number of values consumed by the combiner, not the number of distinct key groups (which would not be a useful metric, since there is not necessarily one group per key for a combiner; see "Combiner Functions" on page 33, and also "Shuffle and Sort" on page 208).
Combine output records (COMBINE_OUTPUT_RECORDS): The number of output records produced by all the combiners (if any) in the job. Incremented every time the collect() method is called on a combiner's OutputCollector.
Reduce input groups (REDUCE_INPUT_GROUPS): The number of distinct key groups consumed by all the reducers in the job. Incremented every time the reducer's reduce() method is called by the framework.
Reduce input records (REDUCE_INPUT_RECORDS): The number of input records consumed by all the reducers in the job. Incremented every time a value is read from the reducer's iterator over values. If reducers consume all of their inputs, this count should be the same as the count for map output records.
Reduce output records (REDUCE_OUTPUT_RECORDS): The number of reduce output records produced by all the reducers in the job. Incremented every time the collect() method is called on a reducer's OutputCollector.
Reduce skipped groups (REDUCE_SKIPPED_GROUPS): The number of distinct key groups skipped by all the reducers in the job. See "Skipping Bad Records" on page 220.
Reduce skipped records (REDUCE_SKIPPED_RECORDS): The number of input records skipped by all the reducers in the job.
Reduce shuffle bytes (REDUCE_SHUFFLE_BYTES): The number of bytes of map output copied by the shuffle to reducers.
Spilled records (SPILLED_RECORDS): The number of records spilled to disk in all map and reduce tasks in the job.
CPU milliseconds (CPU_MILLISECONDS): The cumulative CPU time for a task in milliseconds, as reported by /proc/cpuinfo.
Physical memory bytes (PHYSICAL_MEMORY_BYTES): The physical memory being used by a task in bytes, as reported by /proc/meminfo.
Virtual memory bytes (VIRTUAL_MEMORY_BYTES): The virtual memory being used by a task in bytes, as reported by /proc/meminfo.
Committed heap bytes (COMMITTED_HEAP_BYTES): The total amount of memory available in the JVM in bytes, as reported by Runtime.getRuntime().totalMemory().
GC time milliseconds (GC_TIME_MILLIS): The elapsed time for garbage collection in tasks in milliseconds, as reported by GarbageCollectorMXBean.getCollectionTime(). (Not available in 1.x.)
Shuffled maps (SHUFFLED_MAPS): The number of map output files transferred to reducers by the shuffle. (See "Shuffle and Sort" on page 208.) (Not available in 1.x.)
Failed shuffle (FAILED_SHUFFLE): The number of map output copy failures during the shuffle. (Not available in 1.x.)
Merged map outputs (MERGED_MAP_OUTPUTS): The number of map outputs that have been merged on the reduce side of the shuffle. (Not available in 1.x.)
Filesystem bytes read (BYTES_READ): The number of bytes read by map and reduce tasks from each filesystem. There is a counter for each filesystem, which may be Local, HDFS, S3, KFS, and so on.
Filesystem bytes written (BYTES_WRITTEN): The number of bytes written by map and reduce tasks to each filesystem.
Bytes read (BYTES_READ): The number of bytes read by map tasks via the FileInputFormat.
Bytes written (BYTES_WRITTEN): The number of bytes written by map tasks (for map-only jobs) or reduce tasks via the FileOutputFormat.
Job counters
TOTAL_LAUNCHED_MAPS
Launched map tasks (TOTAL_LAUNCHED_MAPS): The number of map tasks that were launched. Includes tasks that were started speculatively.
Launched reduce tasks (TOTAL_LAUNCHED_REDUCES): The number of reduce tasks that were launched. Includes tasks that were started speculatively.
Launched uber tasks (TOTAL_LAUNCHED_UBERTASKS): The number of uber tasks (see "YARN (MapReduce 2)" on page 196) that were launched. (Only in YARN-based MapReduce.)
Maps in uber tasks (NUM_UBER_SUBMAPS): The number of maps in uber tasks. (Only in YARN-based MapReduce.)
Reduces in uber tasks (NUM_UBER_SUBREDUCES): The number of reduces in uber tasks. (Only in YARN-based MapReduce.)
Failed map tasks (NUM_FAILED_MAPS): The number of map tasks that failed. See "Task failure" on page 202 for potential causes.
Failed reduce tasks (NUM_FAILED_REDUCES): The number of reduce tasks that failed.
Failed uber tasks (NUM_FAILED_UBERTASKS): The number of uber tasks that failed. (Only in YARN-based MapReduce.)
Data-local map tasks (DATA_LOCAL_MAPS): The number of map tasks that ran on the same node as their input data.
Rack-local map tasks (RACK_LOCAL_MAPS): The number of map tasks that ran on a node in the same rack as their input data, but that are not data-local.
Other local map tasks (OTHER_LOCAL_MAPS): The number of map tasks that ran on a node in a different rack to their input data. Inter-rack bandwidth is scarce, and Hadoop tries to place map tasks close to their input data, so this count should be low. See Figure 2-2.
Total time in map tasks (SLOTS_MILLIS_MAPS): The total time taken running map tasks in milliseconds. Includes tasks that were started speculatively.
Total time in reduce tasks (SLOTS_MILLIS_REDUCES): The total time taken running reduce tasks in milliseconds. Includes tasks that were started speculatively.
Total time in map tasks waiting after reserving slots (FALLOW_SLOTS_MILLIS_MAPS): The total time in milliseconds spent waiting after reserving slots for map tasks. Slot reservation is a Capacity Scheduler feature for high-memory jobs; see "Task memory limits" on page 318. Not used by YARN-based MapReduce.
Total time in reduce tasks waiting after reserving slots (FALLOW_SLOTS_MILLIS_REDUCES): The total time in milliseconds spent waiting after reserving slots for reduce tasks. Slot reservation is a Capacity Scheduler feature for high-memory jobs; see "Task memory limits" on page 318. Not used by YARN-based MapReduce.
User-Defined Java Counters
public class MaxTemperatureWithCounters extends Configured implements Tool {
enum Temperature {
MISSING,
MALFORMED
}
static class MaxTemperatureMapperWithCounters
extends Mapper<LongWritable, Text, Text, IntWritable> {
private NcdcRecordParser parser = new NcdcRecordParser();
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
parser.parse(value);
if (parser.isValidTemperature()) {
int airTemperature = parser.getAirTemperature();
context.write(new Text(parser.getYear()),
new IntWritable(airTemperature));
} else if (parser.isMalformedTemperature()) {
System.err.println("Ignoring possibly corrupt input: " + value);
context.getCounter(Temperature.MALFORMED).increment(1);
} else if (parser.isMissingTemperature()) {
context.getCounter(Temperature.MISSING).increment(1);
}
// dynamic counter
context.getCounter("TemperatureQuality", parser.getQuality()).increment(1);
}
}
@Override
public int run(String[] args) throws Exception {
Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
if (job == null) {
return -1;
}
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(MaxTemperatureMapperWithCounters.class);
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new MaxTemperatureWithCounters(), args);
System.exit(exitCode);
}
}
% hadoop jar hadoop-examples.jar MaxTemperatureWithCounters \
input/ncdc/all output-counters
12/02/04 19:46:38 INFO mapred.JobClient: TemperatureQuality
12/02/04 19:46:38 INFO mapred.JobClient: 2=1246032
12/02/04 19:46:38 INFO mapred.JobClient: 1=973422173
12/02/04 19:46:38 INFO mapred.JobClient: 0=1
12/02/04 19:46:38 INFO mapred.JobClient: 6=40066
12/02/04 19:46:38 INFO mapred.JobClient: 5=158291879
12/02/04 19:46:38 INFO mapred.JobClient: 4=10764500
12/02/04 19:46:38 INFO mapred.JobClient: 9=66136858
12/02/04 19:46:38 INFO mapred.JobClient: Air Temperature Records
12/02/04 19:46:38 INFO mapred.JobClient: Malformed=3
12/02/04 19:46:38 INFO mapred.JobClient: Missing=66136856
Dynamic counters
Reporter
String
public void incrCounter(String group, String counter, long amount)
String
Readable counter names
CounterGroupName
.name
CounterGroupName=Air Temperature Records
MISSING.name=Missing
MALFORMED.name=Malformed
zh_CN
java.util.PropertyResourceBundle
Retrieving counters
hadoop job
-counter
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class MissingTemperatureFields extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
if (args.length != 1) {
JobBuilder.printUsage(this, "<job ID>");
return -1;
}
String jobID = args[0];
JobClient jobClient = new JobClient(new JobConf(getConf()));
RunningJob job = jobClient.getJob(JobID.forName(jobID));
if (job == null) {
System.err.printf("No job with ID %s found.\n", jobID);
return -1;
}
if (!job.isComplete()) {
System.err.printf("Job %s is not complete.\n", jobID);
return -1;
}
Counters counters = job.getCounters();
long missing = counters.getCounter(
MaxTemperatureWithCounters.Temperature.MISSING);
long total = counters.getCounter(Task.Counter.MAP_INPUT_RECORDS);
System.out.printf("Records with missing temperature fields: %.2f%%\n",
100.0 * missing / total);
return 0;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new MissingTemperatureFields(), args);
System.exit(exitCode);
}
}
RunningJob, JobClient.getJob()
mapred.jobtracker.completeuserjobs.maximum
RunningJob.getCounters()
Counters
Counters
getCounter()
% hadoop jar hadoop-examples.jar MissingTemperatureFields job_201202040938_0012
Records with missing temperature fields: 5.47%
Cluster
Job, RunningJob, getCounters()
Cluster cluster = new Cluster(getConf());
Job job = cluster.getJob(JobID.forName(jobID));
Counters counters = job.getCounters();
long missing = counters.findCounter(
MaxTemperatureWithCounters.Temperature.MISSING).getValue();
long total = counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
org.apache.hadoop.mapreduce.TaskCounter
org.apache.hadoop.mapred.Task.Counter
User-Defined Streaming Counters
reporter:counter:group,counter,amount
sys.stderr.write("reporter:counter:Temperature,Missing,1\n")
reporter:status:message
Sorting
Using the new MapReduce API.
Preparation
Text
IntWritableText
% hadoop jar hadoop-examples.jar SortDataPreprocessor input/ncdc/all \
input/ncdc/all-seq
public class SortDataPreprocessor extends Configured implements Tool {
static class CleanerMapper
extends Mapper<LongWritable, Text, IntWritable, Text> {
private NcdcRecordParser parser = new NcdcRecordParser();
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
parser.parse(value);
if (parser.isValidTemperature()) {
context.write(new IntWritable(parser.getAirTemperature()), value);
}
}
}
@Override
public int run(String[] args) throws Exception {
Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
if (job == null) {
return -1;
}
job.setMapperClass(CleanerMapper.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(Text.class);
job.setNumReduceTasks(0);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job,
CompressionType.BLOCK);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new SortDataPreprocessor(), args);
System.exit(exitCode);
}
}
Partial Sort
IntWritable
public class SortByTemperatureUsingHashPartitioner extends Configured
implements Tool {
@Override
public int run(String[] args) throws Exception {
Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
if (job == null) {
return -1;
}
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job,
CompressionType.BLOCK);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new SortByTemperatureUsingHashPartitioner(),
args);
System.exit(exitCode);
}
}
Controlling Sort Order
The sort order for keys is controlled by a RawComparator, which is located as follows. If the property mapred.output.key.comparator.class is set, either explicitly or by calling setSortComparatorClass() on Job (setOutputKeyComparatorClass() on JobConf in the old API), an instance of that class is used. Otherwise, keys must be a subclass of WritableComparable, and the registered comparator for the key class is used. If there is no registered comparator, a fallback RawComparator deserializes the Writable objects being compared and delegates to the key's compareTo() method. For example, to run the partial sort with 30 reducers:
% hadoop jar hadoop-examples.jar SortByTemperatureUsingHashPartitioner \
-D mapred.reduce.tasks=30 input/ncdc/all-seq output-hashsort
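As an illustration of setSortComparatorClass(), here is a minimal sketch (not from the book) of a comparator that reverses the natural IntWritable order, so each reducer's output would be sorted from highest to lowest temperature:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;

public class DescendingIntSortExample {
  public static class DescendingIntComparator extends WritableComparator {
    public DescendingIntComparator() {
      super(IntWritable.class, true); // create key instances so compare() sees real objects
    }
    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
      return -((IntWritable) a).compareTo((IntWritable) b); // reverse the natural order
    }
  }

  public static void configure(Job job) {
    job.setSortComparatorClass(DescendingIntComparator.class);
  }
}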
An application: Partitioned MapFile lookups
A variation on the partial sort is to use MapFileOutputFormat to write MapFiles instead of SequenceFiles; since MapFile keys are sorted and indexed, the output can then be used for lookups by key:
public class SortByTemperatureToMapFile extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
if (job == null) {
return -1;
}
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputFormatClass(MapFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job,
CompressionType.BLOCK);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new SortByTemperatureToMapFile(), args);
System.exit(exitCode);
}
}
MapFileOutputFormat also provides static convenience methods for performing lookups against this kind of output; the following program looks up the first record for a given temperature:
public class LookupRecordByTemperature extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
JobBuilder.printUsage(this, "<path> <key>");
return -1;
}
Path path = new Path(args[0]);
IntWritable key = new IntWritable(Integer.parseInt(args[1]));
Reader[] readers = MapFileOutputFormat.getReaders(path, getConf());
Partitioner<IntWritable, Text> partitioner =
new HashPartitioner<IntWritable, Text>();
Text val = new Text();
Writable entry =
MapFileOutputFormat.getEntry(readers, partitioner, key, val);
if (entry == null) {
System.err.println("Key not found: " + key);
return -1;
}
NcdcRecordParser parser = new NcdcRecordParser();
parser.parse(val.toString());
System.out.printf("%s\t%s\n", parser.getStationId(), parser.getYear());
return 0;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new LookupRecordByTemperature(), args);
System.exit(exitCode);
}
}
The getReaders() method opens a MapFile.Reader for each of the job's output files. The getEntry() method then uses the partitioner to choose the reader that holds the key and calls its get() method, filling val with the matching value; getEntry() returns null if the key is not found. Here we look up the first entry for a temperature of –100 (that is, –10.0°C):
% hadoop jar hadoop-examples.jar LookupRecordByTemperature output-hashmapsort -100
357460-99999 1956
We can also use the readers directly to retrieve all the records for a given key. The reader for the partition containing the key is selected with the same partitioner:

Reader reader = readers[partitioner.getPartition(key, val, readers.length)];

A get() on the MapFile reader positions it at the first matching entry, and repeated calls to next() step through the following entries until the key changes:
public class LookupRecordsByTemperature extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
if (args.length != 2) {
JobBuilder.printUsage(this, "<path> <key>");
return -1;
}
Path path = new Path(args[0]);
IntWritable key = new IntWritable(Integer.parseInt(args[1]));
Reader[] readers = MapFileOutputFormat.getReaders(path, getConf());
Partitioner<IntWritable, Text> partitioner =
new HashPartitioner<IntWritable, Text>();
Text val = new Text();
Reader reader = readers[partitioner.getPartition(key, val, readers.length)];
Writable entry = reader.get(key, val);
if (entry == null) {
System.err.println("Key not found: " + key);
return -1;
}
NcdcRecordParser parser = new NcdcRecordParser();
IntWritable nextKey = new IntWritable();
do {
parser.parse(val.toString());
System.out.printf("%s\t%s\n", parser.getStationId(), parser.getYear());
} while(reader.next(nextKey, val) && key.equals(nextKey));
return 0;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new LookupRecordsByTemperature(), args);
System.exit(exitCode);
}
}
% hadoop jar hadoop-examples.jar LookupRecordsByTemperature output-hashmapsort -100 \
2> /dev/null | wc -l
1489272
Total Sort
A naive way to produce globally sorted output with more than one reducer is to partition by fixed temperature ranges, but because the data is not evenly distributed, the resulting partitions are badly skewed; for the weather dataset the split looks like this:

Temperature range        < –10°C    [–10°C, 0°C)    [0°C, 10°C)    >= 10°C
Proportion of records    11%        13%             17%            59%
A better approach is to sample the key space to choose partition boundaries that spread records fairly evenly. The InputSampler class defines a nested Sampler interface whose implementations return a sample of keys, given an InputFormat and a Job:
public interface Sampler<K, V> {
K[] getSample(InputFormat<K, V> inf, Job job)
throws IOException, InterruptedException;
}
This interface is not usually called directly by clients; instead, the static writePartitionFile() method on InputSampler is used, which runs the sampler and writes a sequence file containing the chosen partition boundary keys:
public static <K, V> void writePartitionFile(Job job, Sampler<K, V> sampler)
throws IOException, ClassNotFoundException, InterruptedException
That sequence file is read by TotalOrderPartitioner to partition the sort job's map output, as the following program shows:
public class SortByTemperatureUsingTotalOrderPartitioner extends Configured
implements Tool {
@Override
public int run(String[] args) throws Exception {
Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
if (job == null) {
return -1;
}
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job,
CompressionType.BLOCK);
job.setPartitionerClass(TotalOrderPartitioner.class);
InputSampler.Sampler<IntWritable, Text> sampler =
new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);
InputSampler.writePartitionFile(job, sampler);
// Add to DistributedCache
Configuration conf = job.getConfiguration();
String partitionFile = TotalOrderPartitioner.getPartitionFile(conf);
URI partitionUri = new URI(partitionFile + "#" +
TotalOrderPartitioner.DEFAULT_PATH);
DistributedCache.addCacheFile(partitionUri, conf);
DistributedCache.createSymlink(conf);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(
new SortByTemperatureUsingTotalOrderPartitioner(), args);
System.exit(exitCode);
}
}
We use a RandomSampler, which samples keys with a uniform probability, here 0.1. The other constructor parameters cap the number of samples (10,000) and the number of splits sampled (10); sampling stops as soon as either limit is reached. The sampler runs on the client, so it is important to keep the number of splits it downloads small. With the partition file produced by InputSampler, the partitions are far more even:
Temperature range < –5.6°C [–5.6°C, 13.9°C) [13.9°C, 22.0°C) >= 22.0°C
Proportion of records 29% 24% 23% 24%
Other samplers are available. SplitSampler samples only the first n records in a split, which is cheap but not a good representative sample for sorted data, since records deep in a split are never seen. IntervalSampler picks keys at regular intervals through each split, which works well for sorted input. RandomSampler is a good general-purpose choice, and you can write your own Sampler implementation if none of these suits. Note, too, that InputSampler and TotalOrderPartitioner are not tied to sorting; they can be used whenever evenly sized partitions are wanted. The total sort program is run as follows:
% hadoop jar hadoop-examples.jar SortByTemperatureUsingTotalOrderPartitioner \
-D mapred.reduce.tasks=30 input/ncdc/all-seq output-totalsort
Secondary Sort
MapReduce sorts records by key, but the values for a single key arrive at the reducer in no particular order. The trick for sorting values, a secondary sort, is to make the value part of a composite key, partition and group on the natural key only, and define the key sort so the value field is ordered within each group. For the weather data, a composite key of (year, temperature), partitioned and grouped by year but sorted by year ascending and temperature descending, means the first record the reducer sees for each year carries that year's maximum temperature; the keys effectively arrive like this:
1900 35°C
1900 34°C
1900 34°C
...
1901 36°C
1901 35°C
Java code
public class MaxTemperatureUsingSecondarySort
extends Configured implements Tool {
static class MaxTemperatureMapper
extends Mapper<LongWritable, Text, IntPair, NullWritable> {
private NcdcRecordParser parser = new NcdcRecordParser();
@Override
protected void map(LongWritable key, Text value,
Context context) throws IOException, InterruptedException {
parser.parse(value);
if (parser.isValidTemperature()) {
context.write(new IntPair(parser.getYearInt(),
parser.getAirTemperature()), NullWritable.get());
}
}
}
static class MaxTemperatureReducer
extends Reducer<IntPair, NullWritable, IntPair, NullWritable> {
@Override
protected void reduce(IntPair key, Iterable<NullWritable> values,
Context context) throws IOException, InterruptedException {
context.write(key, NullWritable.get());
}
}
public static class FirstPartitioner
extends Partitioner<IntPair, NullWritable> {
@Override
public int getPartition(IntPair key, NullWritable value, int numPartitions) {
// multiply by 127 to perform some mixing
return Math.abs(key.getFirst() * 127) % numPartitions;
}
}
public static class KeyComparator extends WritableComparator {
protected KeyComparator() {
super(IntPair.class, true);
}
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
IntPair ip1 = (IntPair) w1;
IntPair ip2 = (IntPair) w2;
int cmp = IntPair.compare(ip1.getFirst(), ip2.getFirst());
if (cmp != 0) {
return cmp;
}
return -IntPair.compare(ip1.getSecond(), ip2.getSecond()); //reverse
}
}
public static class GroupComparator extends WritableComparator {
protected GroupComparator() {
super(IntPair.class, true);
}
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
IntPair ip1 = (IntPair) w1;
IntPair ip2 = (IntPair) w2;
return IntPair.compare(ip1.getFirst(), ip2.getFirst());
}
}
@Override
public int run(String[] args) throws Exception {
Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
if (job == null) {
return -1;
}
job.setMapperClass(MaxTemperatureMapper.class);
job.setPartitionerClass(FirstPartitioner.class);
job.setSortComparatorClass(KeyComparator.class);
job.setGroupingComparatorClass(GroupComparator.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(IntPair.class);
job.setOutputValueClass(NullWritable.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new MaxTemperatureUsingSecondarySort(), args);
System.exit(exitCode);
}
}
The mapper emits an IntPair key, a Writable implementation for a pair of ints (analogous to the TextPair class from Chapter 4), and a NullWritable value, since all the information we need is carried in the key. The reducer simply emits the first key of each group which, given the sort order, is the year paired with its maximum temperature; IntPair's toString() method produces the tab-separated form seen in the output. The pieces are wired together in run(): FirstPartitioner partitions on the first (year) field so all records for a year go to the same reducer, the KeyComparator set with setSortComparatorClass() sorts by year ascending and temperature descending, and the GroupComparator set with setGroupingComparatorClass() groups keys in the reducer by year alone. Running the program and sorting its output confirms the results:
% hadoop jar hadoop-examples.jar MaxTemperatureUsingSecondarySort input/ncdc/all \
> output-secondarysort
% hadoop fs -cat output-secondarysort/part-* | sort | head
1901 317
1902 244
1903 289
1904 256
1905 283
1906 294
1907 283
1908 289
1909 278
1910 294
Streaming
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-D stream.num.map.output.key.fields=2 \
-D mapred.text.key.partitioner.options=-k1,1 \
-D mapred.output.key.comparator.class=\
org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.comparator.options="-k1n -k2nr" \
-input input/ncdc/all \
-output output_secondarysort_streaming \
-mapper ch08/src/main/python/secondary_sort_map.py \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-reducer ch08/src/main/python/secondary_sort_reduce.py \
-file ch08/src/main/python/secondary_sort_map.py \
-file ch08/src/main/python/secondary_sort_reduce.py
To do a secondary sort in Streaming, stream.num.map.output.key.fields is set to 2 so that the first two tab-separated fields of the map output (year and temperature) are treated as the composite key. The map function emits just those two fields, incrementing a counter for missing readings:
#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
    val = line.strip()
    (year, temp, q) = (val[15:19], int(val[87:92]), val[92:93])
    if temp == 9999:
        sys.stderr.write("reporter:counter:Temperature,Missing,1\n")
    elif re.match("[01459]", q):
        print "%s\t%s" % (year, temp)
Partitioning is handled by KeyFieldBasedPartitioner, with mapred.text.key.partitioner.options set to -k1,1 so that only the first key field (the year) is used; key fields are separated by map.output.key.field.separator, a tab by default. The key comparator is KeyFieldBasedComparator, with mapred.text.key.comparator.options set to -k1n -k2nr: the first field is sorted numerically ascending and the second numerically descending. Like KeyFieldBasedPartitioner, it splits the key on map.output.key.field.separator. The reduce function then only needs to print the first line it sees for each year group, which holds the maximum temperature:
#!/usr/bin/env python
import sys

last_group = None
for line in sys.stdin:
    val = line.strip()
    (year, temp) = val.split("\t")
    group = year
    if last_group != group:
        print val
        last_group = group
Finally, note that KeyFieldBasedPartitioner and KeyFieldBasedComparator are not specific to Streaming; they are library classes that can be used from Java MapReduce programs as well.
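As a rough sketch (not from the book) of how those classes might be wired into an old-API Java job whose map output key is a Text of the form year\ttemperature, using the same option strings as the Streaming invocation above:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.KeyFieldBasedComparator;
import org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner;

public class KeyFieldSecondarySortConfig {
  // Assumes the map output key is a tab-separated Text such as "1950\t22"
  public static void configure(JobConf conf) {
    conf.setPartitionerClass(KeyFieldBasedPartitioner.class);
    conf.setOutputKeyComparatorClass(KeyFieldBasedComparator.class);
    conf.set("mapred.text.key.partitioner.options", "-k1,1");     // partition on the year only
    conf.set("mapred.text.key.comparator.options", "-k1n -k2nr"); // year ascending, temperature descending
  }
}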
Joins
Map-Side Joins
A map-side join requires each input to be partitioned and sorted in the same way (the same number of partitions, sorted by the join key), which is exactly what the outputs of jobs with the same number of reducers and the same keys look like. Given such inputs, the join can be performed entirely in the mapper using CompositeInputFormat from the org.apache.hadoop.mapreduce.join package, which is configured with a join expression; the org.apache.hadoop.examples.Join example program runs a map-side join over sorted, partitioned inputs.
Reduce-Side Joins
A reduce-side join is less efficient, since both datasets must flow through the shuffle, but it is more general: the inputs do not need to be structured in any particular way. The idea is for the mapper to tag each record with its source and to use the join key as the map output key, so that records with the same key are brought together in the reducer. MultipleInputs lets each input have its own mapper, and a secondary sort on the source tag ensures the reducer sees the records from one source before the other.
Here the map output key is a TextPair of the station ID and a tag: "0" for station metadata records and "1" for weather records, so that for each station ID the single station record sorts ahead of its weather records in the reducer.
public class JoinStationMapper
extends Mapper<LongWritable, Text, TextPair, Text> {
private NcdcStationMetadataParser parser = new NcdcStationMetadataParser();
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
if (parser.parse(value)) {
context.write(new TextPair(parser.getStationId(), "0"),
new Text(parser.getStationName()));
}
}
}
public class JoinRecordMapper
extends Mapper<LongWritable, Text, TextPair, Text> {
private NcdcRecordParser parser = new NcdcRecordParser();
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
parser.parse(value);
context.write(new TextPair(parser.getStationId(), "1"), value);
}
}
public class JoinReducer extends Reducer<TextPair, Text, Text, Text> {
@Override
protected void reduce(TextPair key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
Iterator<Text> iter = values.iterator();
Text stationName = new Text(iter.next());
while (iter.hasNext()) {
Text record = iter.next();
Text outValue = new Text(stationName.toString() + "\t" + record.toString());
context.write(key.getFirst(), outValue);
}
}
}
Because the keys are TextPair objects sorted by the tag, reduce() knows that the first value in values is the station name, so it stores it and then writes one output line per weather record, prefixed with the station name. (The code assumes every station ID in the weather records has exactly one matching station metadata record.) Note the defensive copy, Text stationName = new Text(iter.next()); the Text objects handed out by the values iterator are reused by the framework, so without the copy, stationName would be overwritten by the next value.
The driver ties everything together: a custom KeyPartitioner partitions on only the first field of the TextPair (the station ID), and TextPair.FirstComparator is used as the grouping comparator so all records for a station arrive in the same reduce group:
public class JoinRecordWithStationName extends Configured implements Tool {
public static class KeyPartitioner extends Partitioner<TextPair, Text> {
@Override
public int getPartition(TextPair key, Text value, int numPartitions) {
return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
}
}
@Override
public int run(String[] args) throws Exception {
if (args.length != 3) {
JobBuilder.printUsage(this, "<ncdc input> <station input> <output>");
return -1;
}
Job job = new Job(getConf(), "Join weather records with station names");
job.setJarByClass(getClass());
Path ncdcInputPath = new Path(args[0]);
Path stationInputPath = new Path(args[1]);
Path outputPath = new Path(args[2]);
MultipleInputs.addInputPath(job, ncdcInputPath,
TextInputFormat.class, JoinRecordMapper.class);
MultipleInputs.addInputPath(job, stationInputPath,
TextInputFormat.class, JoinStationMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);
job.setPartitionerClass(KeyPartitioner.class);
job.setGroupingComparatorClass(TextPair.FirstComparator.class);
job.setMapOutputKeyClass(TextPair.class);
job.setReducerClass(JoinReducer.class);
job.setOutputKeyClass(Text.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new JoinRecordWithStationName(), args);
System.exit(exitCode);
}
}
011990-99999 SIHCCAJAVRI 0067011990999991950051507004+68750...
011990-99999 SIHCCAJAVRI 0043011990999991950051512004+68750...
011990-99999 SIHCCAJAVRI 0043011990999991950051518004+68750...
012650-99999 TYNSET-HANSMOEN 0043012650999991949032412004+62300...
012650-99999 TYNSET-HANSMOEN 0043012650999991949032418004+62300...
Side Data Distribution
Using the Job Configuration
You can set arbitrary key-value pairs in the job configuration using the Configuration object (a JobConf in the old API). In the new API, a task reads the configuration through its Context via getConfiguration(); in the old API, you override configure() in the Mapper or Reducer and keep a reference to the JobConf for use in map() or reduce(). For anything beyond simple strings, you can encode objects into the configuration with Hadoop's Stringifier (DefaultStringifier uses Hadoop serialization). Don't use this mechanism for more than a few kilobytes of data, though: the configuration is read by the jobtracker, the tasktracker, and every child JVM, so bloating it puts memory pressure on the whole system.
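A minimal sketch of the pattern follows; the property name temperature.threshold and its default value are assumptions for illustration, not from the book.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfigSideDataSketch {
  // In the driver: stash a small piece of metadata in the job configuration
  public static void setThreshold(Job job, int threshold) {
    job.getConfiguration().setInt("temperature.threshold", threshold);
  }

  public static class ThresholdMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private int threshold;

    @Override
    protected void setup(Context context) {
      // In the task: read the value back from the configuration
      threshold = context.getConfiguration().getInt("temperature.threshold", 100);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // record processing that uses the threshold would go here (details elided)
    }
  }
}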
Distributed Cache
Usage
For tools that use GenericOptionsParser, files to be distributed are given as a comma-separated list of URIs with the -files option; archives (JAR, ZIP, tar, and gzipped tar files) are given with -archives and are unarchived on the task node; and JAR files given with -libjars are added to the task's classpath. (The -files option is distinct from Streaming's -file option, used earlier to ship the map and reduce scripts with the job.) For example, the following distributes the station metadata file alongside the job:
% hadoop jar hadoop-examples.jar MaxTemperatureByStationNameUsingDistributedCacheFile \
-files input/ncdc/metadata/stations-fixed-width.txt input/ncdc/all output
The job finds the maximum temperature by weather station, using the cached metadata file to look up station names; it is implemented by the MaxTemperatureByStationNameUsingDistributedCacheFile class:
public class MaxTemperatureByStationNameUsingDistributedCacheFile
extends Configured implements Tool {
static class StationTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
private NcdcRecordParser parser = new NcdcRecordParser();
@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
parser.parse(value);
if (parser.isValidTemperature()) {
context.write(new Text(parser.getStationId()),
new IntWritable(parser.getAirTemperature()));
}
}
}
static class MaxTemperatureReducerWithStationLookup
extends Reducer<Text, IntWritable, Text, IntWritable> {
private NcdcStationMetadata metadata;
@Override
protected void setup(Context context)
throws IOException, InterruptedException {
metadata = new NcdcStationMetadata();
metadata.initialize(new File("stations-fixed-width.txt"));
}
@Override
protected void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
String stationName = metadata.getStationName(key.toString());
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
maxValue = Math.max(maxValue, value.get());
}
context.write(new Text(stationName), new IntWritable(maxValue));
}
}
@Override
public int run(String[] args) throws Exception {
Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
if (job == null) {
return -1;
}
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(StationTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducerWithStationLookup.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(
new MaxTemperatureByStationNameUsingDistributedCacheFile(), args);
System.exit(exitCode);
}
}
StationTemperatureMapper simply emits (station ID, temperature) pairs, and MaxTemperatureReducer from Chapter 2 is reused as the combiner. The interesting part is MaxTemperatureReducerWithStationLookup: its setup() method opens the cached file, stations-fixed-width.txt, by its original name in the task's working directory, and reduce() translates station IDs to names in the output. (Had we distributed an archive with -archives, it would have been unarchived into a directory of the same name instead.) A sample of the output shows the lookup working:
PEATS RIDGE WARATAH 372
STRATHALBYN RACECOU 410
SHEOAKS AWS 399
WANGARATTA AERO 409
MOOGARA 334
MACKAY AERO 331
How it works
When you launch a job, the files and archives named by -files, -archives, and -libjars are copied to the jobtracker's filesystem (normally HDFS). Before a task runs, the tasktracker copies them from there to a local disk, the cache, so the task can access them; anything added with -libjars is also put on the task's classpath. The tasktracker keeps a reference count of the tasks using each cache entry; once the count drops to zero an entry becomes eligible for deletion, and entries are deleted on a least-recently-used basis when the cache grows past its size limit (10 GB by default, configurable with local.cache.size). Localized files live under ${mapred.local.dir}/taskTracker/archive, but tasks don't need to know that path, because the files are symbolically linked into the task's working directory.
The distributed cache API
Most applications don't need to use the distributed cache API directly, because they can use GenericOptionsParser as shown above. If that isn't possible (for example, when the cache file locations are computed at runtime), the Job class provides methods for placing objects in the cache, and JobContext provides methods for retrieving them in the task. The methods on Job are:
public void addCacheFile(URI uri)
public void addCacheArchive(URI uri)
public void setCacheFiles(URI[] files)
public void setCacheArchives(URI[] archives)
public void addFileToClassPath(Path file)
public void addArchiveToClassPath(Path archive)
public void createSymlink()
(In the old MapReduce API, the equivalent methods are static methods on org.apache.hadoop.filecache.DistributedCache.)
Two kinds of object can be placed in the cache: files and archives. The addCacheXXXX() methods add a single file or archive, the setCacheXXXXs() methods set the whole list in one call (replacing anything set earlier), and the addXXXXToClassPath() methods add a file or archive that should also appear on the MapReduce task's classpath. The following summarizes the Job API methods and their GenericOptionsParser equivalents.
addCacheFile(URI uri), setCacheFiles(URI[] files)
  GenericOptionsParser equivalent: -files file1,file2,...
  Add files to the distributed cache to be copied to the task node.

addCacheArchive(URI uri), setCacheArchives(URI[] files)
  GenericOptionsParser equivalent: -archives archive1,archive2,...
  Add archives to the distributed cache to be copied to the task node and unarchived there.

addFileToClassPath(Path file)
  GenericOptionsParser equivalent: -libjars jar1,jar2,...
  Add files to the distributed cache to be added to the MapReduce task's classpath. The files are not unarchived, so this is a useful way to add JAR files to the classpath.

addArchiveToClassPath(Path archive)
  GenericOptionsParser equivalent: none
  Add archives to the distributed cache to be unarchived and added to the MapReduce task's classpath. This can be useful when you want to add a directory of files to the classpath, since you can create an archive containing the files. Alternatively, you could create a JAR file and use addFileToClassPath(), which works equally well.
There are two practical differences between the add()/set() methods and GenericOptionsParser. First, the Java methods do not copy anything: the files or archives they reference must already be in the shared filesystem, whereas local files named on the command line with -files (and friends) are copied to the default shared filesystem, normally HDFS, as part of job submission. Second, the Java API does not, by default, create symbolic links to the cached files in the task's working directory; you have to call createSymlink() on the Job (and use a URI fragment to name each link), whereas GenericOptionsParser creates the links for you.
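Putting those Job methods together, here is a rough sketch (the HDFS path is hypothetical) of programmatically caching the station metadata file under the link name stations-fixed-width.txt:

import java.net.URI;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileProgrammatically {
  public static void addStationFile(Job job) throws Exception {
    // The file must already be in the shared filesystem; the "#" fragment names the symlink
    job.addCacheFile(new URI("hdfs://namenode/metadata/stations-fixed-width.txt"
        + "#stations-fixed-width.txt"));
    job.createSymlink(); // ask the framework to link cached files into the working directory
  }
}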
The second part of the API covers retrieving cached files from within the task. JobContext (and hence the Context passed to mappers and reducers) declares the following methods:
public Path[] getLocalCacheFiles() throws IOException;
public Path[] getLocalCacheArchives() throws IOException;
public Path[] getFileClassPaths();
public Path[] getArchiveClassPaths();
getLocalCacheFiles() and getLocalCacheArchives() return the local Path of every file and archive in the cache, while getFileClassPaths() and getArchiveClassPaths() return the paths of the entries that were added to the classpath. Because these are paths on the local disk, you can read them with a local FileSystem obtained from FileSystem.getLocal(), or simply with java.io.File, as in this reworked setup() method for MaxTemperatureReducerWithStationLookup:
@Override
protected void setup(Context context)
throws IOException, InterruptedException {
metadata = new NcdcStationMetadata();
Path[] localPaths = context.getLocalCacheFiles();
if (localPaths.length == 0) {
throw new FileNotFoundException("Distributed cache file not found.");
}
File localFile = new File(localPaths[0].toString());
metadata.initialize(localFile);
}
In the old MapReduce API, the same information is obtained from static methods on DistributedCache; here is the equivalent configure() method:
@Override
public void configure(JobConf conf) {
metadata = new NcdcStationMetadata();
try {
Path[] localPaths = DistributedCache.getLocalCacheFiles(conf);
if (localPaths.length == 0) {
throw new FileNotFoundException("Distributed cache file not found.");
}
File localFile = new File(localPaths[0].toString());
metadata.initialize(localFile);
} catch (IOException e) {
throw new RuntimeException(e);
}
}
MapReduce Library Classes
ChainMapper, ChainReducer
  Runs a chain of mappers in a single mapper, and runs a reducer followed by a chain of mappers in a single reducer. (Symbolically, M+RM*, where M is a mapper and R is a reducer.) This can substantially reduce the amount of disk I/O incurred compared to running multiple MapReduce jobs.

FieldSelectionMapReduce (old API); FieldSelectionMapper and FieldSelectionReducer (new API)
  A mapper and a reducer that can select fields (like the Unix cut command) from the input keys and values and emit them as output keys and values.

IntSumReducer, LongSumReducer
  Reducers that sum integer values to produce a total for every key.

InverseMapper
  A mapper that swaps keys and values.

MultithreadedMapRunner (old API), MultithreadedMapper (new API)
  A mapper (or map runner in the old API) that runs mappers concurrently in separate threads. Useful for mappers that are not CPU-bound.

TokenCounterMapper
  A mapper that tokenizes the input value into words (using Java's StringTokenizer) and emits each word along with a count of one.

RegexMapper
  A mapper that finds matches of a regular expression in the input value and emits the matches along with a count of one.
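To give a feel for how far the library classes go on their own, here is a minimal sketch (not from the book) of a word count driver built entirely from TokenCounterMapper and IntSumReducer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class LibraryWordCount {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "Library class word count");
    job.setJarByClass(LibraryWordCount.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(TokenCounterMapper.class);   // tokenize and emit (word, 1)
    job.setCombinerClass(IntSumReducer.class);      // sum counts map-side
    job.setReducerClass(IntSumReducer.class);       // sum counts per word
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}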
CHAPTER 9
Setting Up a Hadoop Cluster
Cluster Specification
Why Not Use RAID?
Network Topology
Rack awareness
To tell Hadoop about your network topology, you provide an implementation of the DNSToSwitchMapping interface:
public interface DNSToSwitchMapping {
public List<String> resolve(List<String> names);
}
The names parameter is a list of IP addresses, and the return value is a list of corresponding network locations (rack paths). The topology.node.switch.mapping.impl property selects which DNSToSwitchMapping implementation the namenode and jobtracker use. The default, ScriptBasedMapping, runs a user-supplied script, named by topology.script.file.name, to work out the mapping; if no script is configured, every node is mapped to a single network location, the default rack.
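Instead of a script, you can plug in your own Java implementation. The following is a minimal sketch under two assumptions: the interface has only the single resolve() method shown above, and the (fictitious) network assigns racks by the third octet of each node's IP address.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

public class ThirdOctetRackMapping implements DNSToSwitchMapping {
  @Override
  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>();
    for (String name : names) {
      String[] parts = name.split("\\.");
      if (parts.length == 4) {
        racks.add("/rack" + parts[2]);  // e.g. 10.1.42.7 maps to /rack42
      } else {
        racks.add("/default-rack");     // fall back for names we can't parse
      }
    }
    return racks;
  }
}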
Cluster Setup and Installation
Installing Java
% java -version
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
Creating a Hadoop User
Installing Hadoop
hadoop
% cd /usr/local
% sudo tar xzf hadoop-x.y.z.tar.gz
hadoop
% sudo chown -R hadoop:hadoop hadoop-x.y.z
--config
Testing the Installation
SSH Configuration
To enable password-less login for the control scripts, first generate an RSA key pair as the hadoop user (leave the passphrase empty, or set one and use ssh-agent so you don't have to retype it for every connection):
% ssh-keygen -t rsa -f ~/.ssh/id_rsa
The -f option specifies the file to store the private key in; the matching public key is written to the same filename with .pub appended.
Next, make sure the public key is in the ~/.ssh/authorized_keys file of the hadoop user on every machine in the cluster; if the hadoop user's home directory is shared (for example, over NFS), this can be done once with:
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Hadoop Configuration
hadoop-env.sh (Bash script): Environment variables that are used in the scripts to run Hadoop
core-site.xml (Hadoop configuration XML): Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce
hdfs-site.xml (Hadoop configuration XML): Configuration settings for HDFS daemons: the namenode, the secondary namenode, and the datanodes
mapred-site.xml (Hadoop configuration XML): Configuration settings for MapReduce daemons: the jobtracker, and the tasktrackers
masters (plain text): A list of machines (one per line) that each run a secondary namenode
slaves (plain text): A list of machines (one per line) that each run a datanode and a tasktracker
hadoop-metrics.properties (Java properties): Properties for controlling how metrics are published in Hadoop (see "Metrics" on page 352)
log4j.properties (Java properties): Properties for system logfiles, the namenode audit log, and the task log for the tasktracker child process ("Hadoop Logs" on page 175)
These files all live in Hadoop's conf directory. The configuration directory can be relocated elsewhere on the filesystem (outside the installation, which makes upgrades easier), as long as the daemons are started with the --config option, or with the HADOOP_CONF_DIR environment variable set, pointing at its location.
Configuration Management
Control scripts
HADOOP_SLAVES
Master node scenarios
Environment Settings
Memory
By default, Hadoop allocates 1,000 MB (1 GB) of memory to each daemon it runs; this is controlled by the HADOOP_HEAPSIZE setting in hadoop-env.sh. In addition, each tasktracker launches separate child JVMs to run map and reduce tasks. The number of simultaneous tasks is set by mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum (both 2 by default), and the memory given to each child JVM by mapred.child.java.opts, whose default of -Xmx200m gives each task 200 MB. The table below compares the defaults with a machine configured for eight processors and 400 MB per child:
JVM                              Default memory used (MB)    Eight processors, 400 MB per child (MB)
Datanode                         1,000                       1,000
Tasktracker                      1,000                       1,000
Tasktracker child map task       2 x 200                     7 x 400
Tasktracker child reduce task    2 x 200                     7 x 400
Total                            2,800                       7,600
The number of tasks to run on a tasktracker depends mainly on the number of processors: a rule of thumb is one to two tasks per processor, which is why the example sets mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum to 7 on an eight-processor machine, leaving headroom for the datanode and tasktracker daemons themselves.
How Much Memory Does a Namenode Need?
The namenode's heap is sized separately, by including a JVM option such as -Xmx2000m (for 2,000 MB) in HADOOP_NAMENODE_OPTS in hadoop-env.sh. If you change the namenode's memory allocation, remember to do the same for the secondary namenode (via HADOOP_SECONDARYNAMENODE_OPTS), since its memory requirements are comparable.
Java
JAVA_HOME
JAVA_HOME
System logfiles
System logfiles produced by Hadoop are stored in $HADOOP_INSTALL/logs by default; this is changed with the HADOOP_LOG_DIR setting in hadoop-env.sh. It's a good idea to move the logs out of the installation directory so they are kept in one place across upgrades, for example:
export HADOOP_LOG_DIR=/var/log/hadoop
HADOOP_IDENT_STRING
HADOOP_IDENT_STRING
SSH settings
The control scripts rely on SSH, so it is worth tuning a few SSH client settings: a shorter ConnectTimeout avoids long waits for dead nodes, and setting StrictHostKeyChecking to no automatically adds new host keys (the default, ask, prompts the user, which doesn't scale to large clusters). Extra SSH options for the control scripts are passed in the HADOOP_SSH_OPTS environment variable in hadoop-env.sh; see the ssh and ssh_config manual pages for details. The scripts can also synchronize the installation with rsync: if HADOOP_MASTER is set, worker daemons rsync the tree rooted at HADOOP_MASTER to the local HADOOP_INSTALL when a daemon starts, and HADOOP_SLAVE_SLEEP (for example, 0.1 for a tenth of a second) staggers the worker commands so the master isn't overwhelmed by simultaneous rsync requests.
Important Hadoop Daemon Properties
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://namenode/</value>
<final>true</final>
</property>
</configuration>
<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/disk1/hdfs/name,/remote/hdfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>/disk1/hdfs/data,/disk2/hdfs/data</value>
<final>true</final>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>/disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary</value>
<final>true</final>
</property>
</configuration>
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>jobtracker:8021</value>
<final>true</final>
</property>
<property>
<name>mapred.local.dir</name>
<value>/disk1/mapred/local,/disk2/mapred/local</value>
<final>true</final>
</property>
<property>
<name>mapred.system.dir</name>
<value>/tmp/hadoop/mapred/system</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>7</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>7</value>
<final>true</final>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx400m</value>
<!-- Not marked as final so jobs can include JVM debugging options -->
</property>
</configuration>
HDFS
For HDFS, fs.default.name (in core-site.xml) is an HDFS URI that both names the default filesystem and tells the namenode which host and port to run on (port 8020 by default). dfs.name.dir lists the directories where the namenode keeps its persistent metadata; a copy is written to each directory, so one entry is commonly on an NFS mount to guard against loss of the namenode's disks. dfs.data.dir lists the directories where datanodes store blocks, typically one per physical disk so that reads and writes are spread across spindles (disks mounted with the noatime option avoid the cost of recording access times). fs.checkpoint.dir lists the directories where the secondary namenode stores its checkpoints of the filesystem image.
fs.default.name (URI; default file:///)
  The default filesystem. The URI defines the hostname and port that the namenode's RPC server runs on. The default port is 8020. This property is set in core-site.xml.

dfs.name.dir (comma-separated directory names; default ${hadoop.tmp.dir}/dfs/name)
  The list of directories where the namenode stores its persistent metadata. The namenode stores a copy of the metadata in each directory in the list.

dfs.data.dir (comma-separated directory names; default ${hadoop.tmp.dir}/dfs/data)
  A list of directories where the datanode stores blocks. Each block is stored in only one of these directories.

fs.checkpoint.dir (comma-separated directory names; default ${hadoop.tmp.dir}/dfs/namesecondary)
  A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list.
Note that the HDFS storage directories all default to locations under hadoop.tmp.dir, which is /tmp/hadoop-${user.name} by default, so it is critical to set them to durable locations on a production cluster to avoid losing data when system temporary directories are cleared.
MapReduce
For MapReduce, mapred.job.tracker specifies the hostname and port of the jobtracker's RPC server. mapred.local.dir lists the local directories used for intermediate MapReduce data, which is cleared out when a job ends; spreading it across the same disks used for dfs.data.dir helps balance disk I/O. mapred.system.dir names a directory, relative to fs.default.name, where shared files are stored while a job runs. Finally, mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum set the number of task slots on each tasktracker, and mapred.child.java.opts sets the JVM options for the child processes that run the tasks, as discussed in the memory settings above.
mapred.job.tracker (hostname and port; default local)
  The hostname and port that the jobtracker's RPC server runs on. If set to the default value of local, the jobtracker is run in-process on demand when you run a MapReduce job (you don't need to start the jobtracker in this case, and in fact you will get an error if you try to start it in this mode).

mapred.local.dir (comma-separated directory names; default ${hadoop.tmp.dir}/mapred/local)
  A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.

mapred.system.dir (URI; default ${hadoop.tmp.dir}/mapred/system)
  The directory relative to fs.default.name where shared files are stored during a job run.

mapred.tasktracker.map.tasks.maximum (int; default 2)
  The number of map tasks that may be run on a tasktracker at any one time.

mapred.tasktracker.reduce.tasks.maximum (int; default 2)
  The number of reduce tasks that may be run on a tasktracker at any one time.

mapred.child.java.opts (String; default -Xmx200m)
  The JVM options used to launch the tasktracker child process that runs map and reduce tasks. This property can be set on a per-job basis, which can be useful for setting JVM properties for debugging, for example.

mapreduce.map.java.opts (String; default -Xmx200m)
  The JVM options used for the child process that runs map tasks. (Not available in 1.x.)

mapreduce.reduce.java.opts (String; default -Xmx200m)
  The JVM options used for the child process that runs reduce tasks. (Not available in 1.x.)
Hadoop Daemon Addresses and Ports
Hadoop daemons generally run both an RPC server for inter-daemon communication and an HTTP server for the web UI, and each needs an address and port. The special address 0.0.0.0 means "bind to every address on this machine." The RPC servers are:
fs.default.name (default file:///)
  When set to an HDFS URI, this property determines the namenode's RPC server address and port. The default port is 8020 if not specified.

dfs.datanode.ipc.address (default 0.0.0.0:50020)
  The datanode's RPC server address and port.

mapred.job.tracker (default local)
  When set to a hostname and port, this property specifies the jobtracker's RPC server address and port. A commonly used port is 8021.

mapred.task.tracker.report.address (default 127.0.0.1:0)
  The tasktracker's RPC server address and port. This is used by the tasktracker's child JVM to communicate with the tasktracker. Using any free port is acceptable in this case, as the server only binds to the loopback address. You should change this setting only if the machine has no loopback address.

In addition to an RPC server, datanodes run a TCP/IP server for block transfers, whose address and port are set by dfs.datanode.address and default to 0.0.0.0:50010. The HTTP servers are:
mapred.job.tracker.http.address (default 0.0.0.0:50030)
  The jobtracker's HTTP server address and port.

mapred.task.tracker.http.address (default 0.0.0.0:50060)
  The tasktracker's HTTP server address and port.

dfs.http.address (default 0.0.0.0:50070)
  The namenode's HTTP server address and port.

dfs.datanode.http.address (default 0.0.0.0:50075)
  The datanode's HTTP server address and port.

dfs.secondary.http.address (default 0.0.0.0:50090)
  The secondary namenode's HTTP server address and port.
There are also settings that control which network interface a daemon reports as its address: dfs.datanode.dns.interface and mapred.tasktracker.dns.interface both default to default, meaning the machine's default network interface is used; set them to a particular interface, such as eth0, to report that interface's address instead.
Other Hadoop Properties
Cluster membership
To help with adding and removing nodes, you can specify files listing the machines that are permitted to connect as datanodes (dfs.hosts) and as tasktrackers (mapred.hosts), along with the corresponding exclude files (dfs.hosts.exclude and mapred.hosts.exclude) used when decommissioning; see "Commissioning and Decommissioning Nodes" on page 359.
Buffer size
Hadoop's I/O operations use a 4 KB buffer by default, which is conservative; increasing it with io.file.buffer.size in core-site.xml (64 KB or 128 KB are common choices) is an easy performance win.
HDFS block size
The HDFS block size is 64 MB by default; many clusters raise dfs.block.size in hdfs-site.xml (to 128 MB or 256 MB, say) to ease memory pressure on the namenode and to give mappers more data to work on.
Reserved storage space
By default, datanodes will try to use all of the space in their storage directories; to keep some space free for other uses (such as MapReduce intermediate output on the same disks), set dfs.datanode.du.reserved to the number of bytes to leave free on each volume.
Trash
HDFS has a trash facility: deleted files are moved to a per-user trash directory and kept for a minimum period before being removed permanently. The period is set in minutes by fs.trash.interval in core-site.xml; it is zero by default, which disables trash. Trash applies only to deletions made through the filesystem shell; programs that want the same behavior must use the Trash class directly, calling moveToTrash() with the Path of the file to delete (it returns false if trash is disabled or the file is already in the trash). HDFS automatically removes trashed files once they expire, but other filesystems don't, so for them you must expunge the trash periodically, either with
% hadoop fs -expunge
or by calling the expunge() method on Trash.
Job scheduler
Reduce slow start
By default, schedulers wait until 5% of a job's map tasks have completed before scheduling its reduce tasks. On a busy cluster this can tie up reduce slots while maps are still running, so raising mapred.reduce.slowstart.completed.maps to a value such as 0.80 (80% of maps complete) can improve overall throughput.
Task memory limits
There are two main ways to keep tasks from using too much memory. The simplest is to size the child JVM heap with mapred.child.java.opts and, additionally, to cap the total virtual memory of the child process (and any processes it launches) with mapred.child.ulimit, which should be set comfortably higher than the -Xmx value in mapred.child.java.opts. The second is Hadoop's memory-aware scheduling: users declare how much virtual memory their tasks need with mapred.job.map.memory.mb and mapred.job.reduce.memory.mb, and the cluster defines slot sizes with mapred.cluster.map.memory.mb and its companions, so memory-hungry tasks occupy more than one slot. All of these properties default to -1, which leaves the feature disabled:
mapred.cluster.map.memory.mb (int; default -1)
  The amount of virtual memory, in MB, that defines a map slot. Map tasks that require more than this amount of memory will use more than one map slot.

mapred.cluster.reduce.memory.mb (int; default -1)
  The amount of virtual memory, in MB, that defines a reduce slot. Reduce tasks that require more than this amount of memory will use more than one reduce slot.

mapred.job.map.memory.mb (int; default -1)
  The amount of virtual memory, in MB, that a map task requires to run. If a map task exceeds this limit, it may be terminated and marked as failed.

mapred.job.reduce.memory.mb (int; default -1)
  The amount of virtual memory, in MB, that a reduce task requires to run. If a reduce task exceeds this limit, it may be terminated and marked as failed.

mapred.cluster.max.map.memory.mb (int; default -1)
  The maximum limit that users can set mapred.job.map.memory.mb to.

mapred.cluster.max.reduce.memory.mb (int; default -1)
  The maximum limit that users can set mapred.job.reduce.memory.mb to.
User Account Creation
% hadoop fs -mkdir /user/username
% hadoop fs -chown username:username /user/username
% hadoop dfsadmin -setSpaceQuota 1t /user/username
YARN Configuration
yarn-env.sh (Bash script): Environment variables that are used in the scripts to run YARN
yarn-site.xml (Hadoop configuration XML): Configuration settings for YARN daemons: the resource manager, the job history server, the webapp proxy server, and the node managers
Important YARN Daemon Properties
Under YARN there are no tasktrackers, so the slot-count properties above do not apply; the JVM options for tasks are still set with mapred.child.java.opts (or the finer-grained mapreduce.map.java.opts and mapreduce.reduce.java.opts) in mapred-site.xml:
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx400m</value>
<!-- Not marked as final so jobs can include JVM debugging options -->
</property>
</configuration>
<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
<property>
<name>yarn.resourcemanager.address</name>
<value>resourcemanager:8032</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/disk1/nm-local-dir,/disk2/nm-local-dir</value>
<final>true</final>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>
</configuration>
yarn.resourcemanager.address gives the hostname and port of the resource manager's RPC server; clients also need mapreduce.framework.name set to yarn for MapReduce jobs to be submitted to YARN rather than run locally. yarn.nodemanager.local-dirs takes over the role of mapred.local.dir, listing the local directories that node managers let containers use for intermediate data. Finally, yarn.nodemanager.aux-services must include mapreduce.shuffle so that node managers run the auxiliary shuffle handler that serves map output to reduce tasks.
yarn.resourcemanager.address (hostname and port; default 0.0.0.0:8032)
  The hostname and port that the resource manager's RPC server runs on.

yarn.nodemanager.local-dirs (comma-separated directory names; default /tmp/nm-local-dir)
  A list of directories where node managers allow containers to store intermediate data. The data is cleared out when the application ends.

yarn.nodemanager.aux-services (comma-separated service names; no default)
  A list of auxiliary services run by the node manager. A service is implemented by the class defined by the property yarn.nodemanager.aux-services.service-name.class. By default, no auxiliary services are specified.

yarn.nodemanager.resource.memory-mb (int; default 8192)
  The amount of physical memory (in MB) that may be allocated to containers being run by the node manager.

yarn.nodemanager.vmem-pmem-ratio (float; default 2.1)
  The ratio of virtual to physical memory for containers. Virtual memory usage may exceed the allocation by this amount.
Memory
YARN does away with fixed task slots: each node manager allocates containers out of a pool of memory whose size is set by yarn.nodemanager.resource.memory-mb. A MapReduce job requests container memory per task with mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, and these requests must be large enough to hold the task JVM's heap, which is still set by mapred.child.java.opts; for example, a heap of -Xmx800m fits comfortably in a mapreduce.map.memory.mb request of 1,024 MB. The scheduler constrains allocations to lie between yarn.scheduler.capacity.minimum-allocation-mb and yarn.scheduler.capacity.maximum-allocation-mb. A container may also use virtual memory up to yarn.nodemanager.vmem-pmem-ratio times its physical allocation, and is killed if it exceeds either limit. The task counters PHYSICAL_MEMORY_BYTES, VIRTUAL_MEMORY_BYTES, and COMMITTED_HEAP_BYTES are useful for seeing how much memory tasks actually use.
YARN Daemon Addresses and Ports
yarn.resourcemanager.address (default 0.0.0.0:8032)
  The resource manager's RPC server address and port. This is used by the client (typically outside the cluster) to communicate with the resource manager.

yarn.resourcemanager.admin.address (default 0.0.0.0:8033)
  The resource manager's admin RPC server address and port. This is used by the admin client (invoked with yarn rmadmin, typically run outside the cluster) to communicate with the resource manager.

yarn.resourcemanager.scheduler.address (default 0.0.0.0:8030)
  The resource manager scheduler's RPC server address and port. This is used by (in-cluster) application masters to communicate with the resource manager.

yarn.resourcemanager.resource-tracker.address (default 0.0.0.0:8031)
  The resource manager resource tracker's RPC server address and port. This is used by the (in-cluster) node managers to communicate with the resource manager.

yarn.nodemanager.address (default 0.0.0.0:0)
  The node manager's RPC server address and port. This is used by (in-cluster) application masters to communicate with node managers.

yarn.nodemanager.localizer.address (default 0.0.0.0:8040)
  The node manager localizer's RPC server address and port.

mapreduce.jobhistory.address (default 0.0.0.0:10020)
  The job history server's RPC server address and port. This is used by the client (typically outside the cluster) to query job history. This property is set in mapred-site.xml.
The HTTP servers are:

yarn.resourcemanager.webapp.address (default 0.0.0.0:8088)
  The resource manager's HTTP server address and port.

yarn.nodemanager.webapp.address (default 0.0.0.0:8042)
  The node manager's HTTP server address and port.

yarn.web-proxy.address (no default)
  The web app proxy server's HTTP server address and port. If not set (the default), then the web app proxy server will run in the resource manager process.

mapreduce.jobhistory.webapp.address (default 0.0.0.0:19888)
  The job history server's HTTP server address and port. This property is set in mapred-site.xml.

mapreduce.shuffle.port (default 8080)
  The shuffle handler's HTTP port number. This is used for serving map outputs, and is not a user-accessible web UI. This property is set in mapred-site.xml.
Security
hadoop fs -rmr /
Which Versions of Hadoop Support Kerberos Authentication?
Kerberos and Hadoop
In day-to-day use, a user authenticates to Kerberos by running kinit, which obtains a ticket-granting ticket that is typically valid for 10 hours, so authentication isn't needed again for a whole working day. For unattended processes, you can create a keytab file with ktutil and pass it to kinit with the -t option, so no password prompt is required.
An example
Kerberos authentication is enabled by setting hadoop.security.authentication in core-site.xml to kerberos; the default value, simple, means the client's operating-system username is trusted without verification.
You also need to turn on service-level authorization by setting hadoop.security.authorization to true. The ACLs in hadoop-policy.xml then control which users and groups may connect to each Hadoop service; an ACL defaults to *, meaning anyone, and an entry such as preston,howard directors,inventors permits the users preston and howard plus members of the directors and inventors groups. With security enabled, an unauthenticated request fails:
% hadoop fs -put quangle.txt .
10/07/03 15:44:58 WARN ipc.Client: Exception encountered while connecting to the
server: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSEx
ception: No valid credentials provided (Mechanism level: Failed to find any Ker
beros tgt)]
Bad connection to FS. command aborted. exception: Call to localhost/127.0.0.1:80
20 failed on local exception: java.io.IOException: javax.security.sasl.SaslExcep
tion: GSS initiate failed [Caused by GSSException: No valid credentials provided
(Mechanism level: Failed to find any Kerberos tgt)]
The solution is to authenticate to Kerberos with kinit first:
% kinit
Password for hadoop-user@LOCALDOMAIN: password
% hadoop fs -put quangle.txt .
% hadoop fs -stat %n quangle.txt
quangle.txt
A single kinit lasts for the lifetime of the ticket (10 hours by default), so you don't need to authenticate for every command; klist lists your current tickets, and kdestroy discards them.
Delegation Tokens
Delegation tokens save the client (and the cluster) from contacting the Kerberos KDC for every HDFS or MapReduce operation: after the initial kinit-backed authentication, the namenode issues a delegation token that subsequent calls, including those made by the job's tasks, present instead. Block access tokens, which control access to the blocks themselves on datanodes, are enabled by setting dfs.block.access.token.enable to true. If a job reads from or writes to HDFS clusters other than the default one, list them in mapreduce.job.hdfs-servers so delegation tokens for those namenodes are obtained at submission time.
Other Security Enhancements
mapred.task.tracker.task-controller
org.apache.hadoop.mapred.LinuxTaskController
mapred.acls.enabledtrue
mapreduce.job.acl-view-jobmapreduce.job.acl-modify-job
ktutil
dfs.datanode.keytab.filedfs.data
node.kerberos.principal
DataNodeProtocol
security.datanode.pro
tocol.acl
CompressionCodec
LinuxTaskController
chmod +s
Benchmarking a Hadoop Cluster
Hadoop Benchmarks
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO
TestFDSIO.0.0.4
Usage: TestFDSIO -read | -write | -clean [-nrFiles N] [-fileSize MB] [-resFile
resultFileName] [-bufferSize Bytes]
Benchmarking HDFS with TestDFSIO
TestDFSIO measures HDFS I/O performance by using a MapReduce job as a convenient way to read or write files in parallel. The following writes 10 files of 1,000 MB each:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10
-fileSize 1000
% cat TestDFSIO_results.log
----- TestDFSIO ----- : write
Date & time: Sun Apr 12 07:14:09 EDT 2009
Number of files: 10
Total MBytes processed: 10000
Throughput mb/sec: 7.796340865378244
Average IO rate mb/sec: 7.8862199783325195
IO rate std deviation: 0.9101254683525547
Test exec time sec: 163.387
The files are written under a benchmark directory in HDFS (controlled by the test.build.data property), and a summary is appended to TestDFSIO_results.log, as shown above. A read benchmark is run with -read; the files must already exist, having been written by an earlier TestDFSIO -write run:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10
-fileSize 1000
----- TestDFSIO ----- : read
Date & time: Sun Apr 12 07:24:28 EDT 2009
Number of files: 10
Total MBytes processed: 10000
Throughput mb/sec: 80.25553361904304
Average IO rate mb/sec: 98.6801528930664
IO rate std deviation: 36.63507598174921
Test exec time sec: 47.624
When you've finished benchmarking, delete the test files from HDFS with the -clean option:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -clean
Benchmarking MapReduce with Sort
Hadoop also ships with a sort program that exercises the whole MapReduce system, since the full input dataset is shuffled and sorted. The benchmark has three steps: generate random data, run the sort, and validate the results. First, RandomWriter generates the random input: it runs a MapReduce job with 10 map tasks per node, each writing roughly 10 GB of random binary data; those totals can be changed with test.randomwriter.maps_per_host and test.randomwrite.bytes_per_map. Run it like this:
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar randomwriter random-data
Next, run the Sort program on the generated data:
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort random-data sorted-data
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar testmapredsort -sortInput random-data \
-sortOutput sorted-data
This runs the SortValidator program, which performs a series of checks on the unsorted and sorted data to confirm the output is correct; if all is well, it reports:
SUCCESS! Validated the MapReduce framework's 'sort' successfully.
Other benchmarks
MRBench (invoked with mrbench) runs a small job a number of times; it is a good counterpoint to the sort benchmark for checking that small jobs run responsively. NNBench (invoked with nnbench) is useful for load-testing namenode hardware.
User Jobs
Hadoop in the Cloud
Apache Whirr
Setup
% tar xzf whirr-x.y.z.tar.gz
% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
% export AWS_ACCESS_KEY_ID='...'
% export AWS_SECRET_ACCESS_KEY='...'
Launching a cluster
% bin/whirr launch-cluster --config recipes/hadoop-ec2.properties \
--private-key-file ~/.ssh/id_rsa_whirr
launch-cluster
Configuration
--config
--private-key-file
whirr.cluster-name=hadoop
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,5 hadoop-datanode+
hadoop-tasktracker
whirr.cluster-name gives the cluster a name, which identifies its instances and names the directory under ~/.whirr where its metadata is kept. whirr.instance-templates declares how many instances of each kind to run: here, one instance running the hadoop-namenode and hadoop-jobtracker roles, and five instances each running the hadoop-datanode and hadoop-tasktracker roles.
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.provider selects the cloud provider (aws-ec2 here), and whirr.identity and whirr.credential supply the credentials for that provider; in this recipe they are picked up from the environment variables exported earlier.
whirr.hardware-id=c1.xlarge
whirr.image-id=us-east-1/ami-da0cf8b3
whirr.location-id=us-east-1
whirr.
--cluster-name hadoop
whirr.private-key-file=/user/tom/.ssh/id_rsa_whirr
Running a proxy
% . ~/.whirr/hadoop/hadoop-proxy.sh
Running a MapReduce job
Whirr generates a Hadoop client configuration for the new cluster under ~/.whirr/<cluster-name>; pointing HADOOP_CONF_DIR at that directory lets you run Hadoop commands from your local machine against the cluster:
% export HADOOP_CONF_DIR=~/.whirr/hadoop
% hadoop distcp \
-Dfs.s3n.awsAccessKeyId='...' \
-Dfs.s3n.awsSecretAccessKey='...' \
s3n://hadoopbook/ncdc/all input/ncdc/all
% ssh -i ~/.ssh/id_rsa_whirr master_host
% hadoop jar hadoop-examples.jar MaxTemperatureWithCombiner \
/user/$USER/input/ncdc/all /user/$USER/output
% hadoop jar hadoop-examples.jar MaxTemperatureWithCombiner \
/user/$USER/input/ncdc/all s3n://mybucket/output
Shutting down a cluster
To shut the cluster down, run the destroy-cluster command; this terminates all of the cluster's instances and is irreversible, so copy off any data you want to keep first:
% bin/whirr destroy-cluster --config recipes/hadoop-ec2.properties
CHAPTER 10
Administering Hadoop
HDFS
Persistent Data Structures
Namenode directory structure
${dfs.name.dir}/
current/
VERSION
edits
fsimage
fstime
Recall that dfs.name.dir lists the namenode's storage directories; each one contains the layout shown above. The VERSION file is a Java properties file with information about the version of HDFS that is running, for example:
#Tue Mar 10 19:21:36 GMT 2009
namespaceID=134368441
cTime=0
storageType=NAME_NODE
layoutVersion=-18
layoutVersion is a negative integer that defines the version of HDFS's persistent on-disk data structures; it is decremented whenever the layout changes and is not tied to the Hadoop release number. namespaceID uniquely identifies the filesystem namespace; it is created when the namenode is first formatted, and datanodes record the same value so the namenode can tell which nodes belong to this filesystem. cTime records the creation time of the namenode's storage: it is zero for newly formatted storage and is updated to a timestamp whenever the filesystem is upgraded. storageType indicates that this directory holds data structures for a namenode. The remaining files are the edit log (edits), the filesystem image (fsimage), and fstime, which records the time of the last checkpoint.
The filesystem image and edit log
hadoop dfsadmin
-saveNamespace
-checkpoint
The secondary namenode produces a checkpoint every hour by default (set in seconds by fs.checkpoint.period), or sooner if the edit log has reached 64 MB (fs.checkpoint.size, in bytes), which it checks every five minutes.
Secondary namenode directory structure
${fs.checkpoint.dir}/
current/
VERSION
edits
fsimage
fstime
previous.checkpoint/
VERSION
edits
fsimage
fstime
The layout is identical to the namenode's, and the previous.checkpoint directory holds the prior checkpoint. This is by design: if the namenode's metadata is lost, the secondary's checkpoint can be used, either by copying its storage directories to a new namenode, or by starting the namenode with the -importCheckpoint option, which loads the metadata from the latest checkpoint in the directory defined by fs.checkpoint.dir; it only does so if dfs.name.dir contains no metadata, so there is no risk of overwriting existing state.
Datanode directory structure
${dfs.data.dir}/
current/
VERSION
blk_<id_1>
blk_<id_1>.meta
blk_<id_2>
blk_<id_2>.meta
...
blk_<id_64>
blk_<id_64>.meta
subdir0/
subdir1/
...
subdir63/
#Tue Mar 10 21:32:31 GMT 2009
namespaceID=134368441
storageID=DS-547717739-172.16.85.1-50010-1236720751627
cTime=0
storageType=DATA_NODE
layoutVersion=-18
namespaceID, cTime, and layoutVersion carry the same meanings as for the namenode; the namespaceID is obtained from the namenode when the datanode first connects, which is how a datanode is tied to a particular filesystem. storageID is unique to the datanode (it is the same across all its storage directories) and is used by the namenode to identify it, while storageType marks the directory as datanode storage. The blk_ files hold the raw block data, and each has a corresponding .meta file of checksums. When the number of blocks in a directory reaches dfs.datanode.numblocks (64 by default), the datanode creates a new subdirectory, keeping the tree shallow and efficient to scan. Blocks are spread round-robin across the directories listed in dfs.data.dir.
Safe Mode
While in safe mode, the namenode waits for datanodes to report their blocks until a sufficient proportion of blocks meet the minimum replication level set by dfs.replication.min; the properties that govern this behavior are:
dfs.replication.min (int; default 1)
  The minimum number of replicas that have to be written for a write to be successful.

dfs.safemode.threshold.pct (float; default 0.999)
  The proportion of blocks in the system that must meet the minimum replication level defined by dfs.replication.min before the namenode will exit safe mode. Setting this value to 0 or less forces the namenode not to start in safe mode. Setting this value to more than 1 means the namenode never exits safe mode.

dfs.safemode.extension (int; default 30,000)
  The time, in milliseconds, to extend safe mode after the minimum replication condition defined by dfs.safemode.threshold.pct has been satisfied. For small clusters (tens of nodes), it can be set to 0.
Entering and leaving safe mode
To find out whether the namenode is in safe mode, use the dfsadmin command:
% hadoop dfsadmin -safemode get
Safe mode is ON
Sometimes you want to wait for the namenode to leave safe mode before running a command, for example in a script; the wait option does exactly that:
hadoop dfsadmin -safemode wait
# command to read or write a file
% hadoop dfsadmin -safemode enter
Safe mode is ON
(Setting dfs.safemode.threshold.pct to a value greater than 1 keeps the namenode in safe mode indefinitely, which has the same effect.) To make the namenode leave safe mode, run:
% hadoop dfsadmin -safemode leave
Safe mode is OFF
Audit Logging
Audit logging is implemented with log4j at INFO level and is disabled in the default configuration, where the audit logger's threshold is set to WARN in log4j.properties:
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=WARN
Change WARN to INFO to enable audit logging; every HDFS request is then recorded in the namenode's log with a line like this:
2009-03-13 07:11:22,982 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.
audit: ugi=tom,staff,admin ip=/127.0.0.1 cmd=listStatus src=/user/tom dst=null
perm=null
Tools
dfsadmin
dfsadmin is a multipurpose tool for finding information about the state of HDFS and for performing administration operations; it is invoked as hadoop dfsadmin, and most of its commands require HDFS superuser privileges. Use -help to get usage information for a specific command. The available commands include:
-help
  Shows help for a given command, or all commands if no command is specified.
-report
  Shows filesystem statistics (similar to those shown in the web UI) and information on connected datanodes.
-metasave
  Dumps information to a file in Hadoop's log directory about blocks that are being replicated or deleted, as well as a list of connected datanodes.
-safemode
  Changes or queries the state of safe mode. See "Safe Mode" on page 344.
-saveNamespace
  Saves the current in-memory filesystem image to a new fsimage file and resets the edits file. This operation may be performed only in safe mode.
-refreshNodes
  Updates the set of datanodes that are permitted to connect to the namenode. See "Commissioning and Decommissioning Nodes" on page 359.
-upgradeProgress
  Gets information on the progress of an HDFS upgrade or forces an upgrade to proceed. See "Upgrades" on page 362.
-finalizeUpgrade
  Removes the previous version of the namenode and datanode storage directories. Used after an upgrade has been applied and the cluster is running successfully on the new version. See "Upgrades" on page 362.
-setQuota
  Sets directory quotas. Directory quotas set a limit on the number of names (files or directories) in the directory tree. Directory quotas are useful for preventing users from creating large numbers of small files, a measure that helps preserve the namenode's memory (recall that accounting information for every file, directory, and block in the filesystem is stored in memory).
-clrQuota
  Clears specified directory quotas.
-setSpaceQuota
  Sets space quotas on directories. Space quotas set a limit on the size of files that may be stored in a directory tree. They are useful for giving users a limited amount of storage.
-clrSpaceQuota
  Clears specified space quotas.
-refreshServiceAcl
  Refreshes the namenode's service-level authorization policy file.
Filesystem check (fsck)
% hadoop fsck /
......................Status: HEALTHY
Total size: 511799225 B
Total dirs: 10
Total files: 22
Total blocks (validated): 22 (avg. block size 23263601 B)
Minimally replicated blocks: 22 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 4
Number of racks: 1
The filesystem under path '/' is HEALTHY
hadoop dfsadmin -metasave
fsck can also repair problems it finds: the -move option moves affected files to the /lost+found directory in HDFS, and -delete removes them, after which they cannot be recovered.
% hadoop fsck /user/tom/part-00007 -files -blocks -racks
/user/tom/part-00007 25582428 bytes, 1 block(s): OK
0. blk_-3724870485760122836_1035 len=25582428 repl=3 [/default-rack/10.251.43.2:50010,
/default-rack/10.251.27.178:50010, /default-rack/10.251.123.163:50010]
The -files option shows a line with the filename, size, and status; -blocks shows information about each of the file's blocks, one line per block; and -racks adds the rack location and datanode addresses for each block. Running hadoop fsck with no arguments prints the full usage instructions. Note that fsck gets all of its information from the namenode; it does not contact any datanodes or read any block data.
Datanode block scanner
Every datanode runs a block scanner (DataBlockScanner), which periodically verifies all the blocks it stores so that bad blocks are detected and repaired before they are read by clients. Blocks are verified every three weeks by default, to catch disk errors over time; the period is set by dfs.datanode.scan.period.hours (504 hours). Corrupt blocks are reported to the namenode to be fixed, and you can get a verification report for a datanode from its web interface:
Total Blocks : 21131
Verified in last hour : 70
Verified in last day : 1767
Verified in last week : 7360
Verified in last four weeks : 20057
Verified in SCAN_PERIOD : 20057
Not yet verified : 1074
Verified since restart : 35912
Scans since restart : 6541
Scan errors since restart : 0
Transient scan errors : 0
Current scan rate limit KBps : 1024
Progress this period : 109%
Time left in cur period : 53.08%
Appending ?listblocks to the report URL produces a line for every block the datanode holds, showing its verification status:
blk_6035596358209321442 : status : ok type : none scan time : 0
not yet verified
blk_3065580480714947643 : status : ok type : remote scan time : 1215755306400
2008-07-11 05:48:26,400
blk_8729669677359108508 : status : ok type : local scan time : 1215755727345
2008-07-11 05:55:27,345
The status is ok or failed, depending on whether the most recent scan succeeded. The type is local if the verification was performed by the background scanner thread, remote if it was performed by a client or another datanode, or none if the block has not yet been verified; the scan time is given in milliseconds since the epoch alongside a readable date.
Balancer
% start-balancer.sh
The -threshold argument sets the threshold percentage that defines what it means for the cluster to be balanced (the default is 10%). The balancer runs until the cluster is balanced, it can move no more blocks, or it loses contact with the namenode, writing progress like this to its log as it goes:
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
Mar 18, 2009 5:23:42 PM 0 0 KB 219.21 MB 150.29 MB
Mar 18, 2009 5:27:14 PM 1 195.24 MB 22.45 MB 150.29 MB
The cluster is balanced. Exiting...
Balancing took 6.072933333333333 minutes
To limit its impact on the cluster, the balancer copies data at no more than 1 MB/s per datanode by default; this cap is set, in bytes, by the dfs.balance.bandwidthPerSec property in hdfs-site.xml.
Monitoring
Logging
Setting log levels
Daemon log levels can be changed at runtime from the daemon's web UI or from the command line. For example, to enable debug logging for the JobTracker class, set the logger named org.apache.hadoop.mapred.JobTracker to DEBUG:
% hadoop daemonlog -setlevel jobtracker-host:50030 \
org.apache.hadoop.mapred.JobTracker DEBUG
log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG
Getting stack traces
Metrics
How Do Metrics Differ from Counters?
dfs.class=org.apache.hadoop.metrics.spi.NullContext
mapred.class=org.apache.hadoop.metrics.spi.NullContext
jvm.class=org.apache.hadoop.metrics.spi.NullContext
rpc.class=org.apache.hadoop.metrics.spi.NullContext
Each line configures a metrics context (dfs, mapred, jvm, rpc) with the MetricsContext implementation class that handles its metrics; the default, NullContext, neither publishes nor updates them. The other implementations are described below. FileContext writes metrics to a local file and exposes two configuration attributes, fileName and period (the interval, in seconds, between file updates). To enable it for the jvm context, for example, replace NullContext in hadoop-metrics.properties:
jvm.class=org.apache.hadoop.metrics.file.FileContext
jvm.fileName=/tmp/jvm_metrics.log
The first line changes the context class, and the second sets the fileName attribute (if fileName is not set, metrics are written to standard output instead). Here are two lines of jvm metrics output, one from a namenode and one from a secondary namenode:
jvm.metrics: hostName=ip-10-250-59-159, processName=NameNode, sessionId=,
gcCount=46, gcTimeMillis=394, logError=0, logFatal=0, logInfo=59, logWarn=1,
memHeapCommittedM=4.9375, memHeapUsedM=2.5322647, memNonHeapCommittedM=18.25,
memNonHeapUsedM=11.330269, threadsBlocked=0, threadsNew=0, threadsRunnable=6,
threadsTerminated=0, threadsTimedWaiting=8, threadsWaiting=13
jvm.metrics: hostName=ip-10-250-59-159, processName=SecondaryNameNode, sessionId=,
gcCount=36, gcTimeMillis=261, logError=0, logFatal=0, logInfo=18, logWarn=4,
memHeapCommittedM=5.4414062, memHeapUsedM=4.46756, memNonHeapCommittedM=18.25,
memNonHeapUsedM=10.624519, threadsBlocked=0, threadsNew=0, threadsRunnable=5,
threadsTerminated=0, threadsTimedWaiting=4, threadsWaiting=2
FileContext is handy for debugging on a small test cluster, but it is unsuitable for larger clusters, where the output files are scattered across the machines and hard to analyze. GangliaContext sends metrics to a Ganglia installation; it requires a servers attribute listing the host:port pairs of the Ganglia servers to publish to. NullContextWithUpdateThread publishes nothing, but unlike NullContext it keeps the in-memory metrics up to date with an update thread, which matters when another system, such as JMX, reads the metrics: without it the values never change. If you are already using FileContext or GangliaContext there is no need for it, since they update the metrics themselves.
All MetricsContext implementations apart from NullContext expose a period attribute (5 seconds by default) that controls how often metrics are updated, including NullContextWithUpdateThread and GangliaContext. Finally, CompositeContext lets you publish the same set of metrics to more than one context, such as a FileContext and a GangliaContext at the same time, as in this example:
jvm.class=org.apache.hadoop.metrics.spi.CompositeContext
jvm.arity=2
jvm.sub1.class=org.apache.hadoop.metrics.file.FileContext
jvm.fileName=/tmp/jvm_metrics.log
jvm.sub2.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.servers=ip-10-250-59-159.ec2.internal:8649
The arity property gives the number of subcontexts; here there are two, whose implementation classes are set with jvm.sub1.class and jvm.sub2.class.
Java Management Extensions
NameNodeActivityMBean (namenode)
  Namenode activity metrics, such as the number of create file operations.
FSNamesystemMBean (namenode)
  Namenode status metrics, such as the number of connected datanodes.
DataNodeActivityMBean (datanode)
  Datanode activity metrics, such as the number of bytes read.
FSDatasetMBean (datanode)
  Datanode storage metrics, such as capacity and free storage space.
RpcActivityMBean (all daemons that use RPC: namenode, datanode, jobtracker, and tasktracker)
  RPC statistics, such as average processing time.
The MBeans expose the metrics gathered by the metrics system, so they only show fresh values if the metrics contexts keep them updated; if a context is configured with NullContext, switch it to NullContextWithUpdateThread (or another publishing context) so that JMX clients see current values.
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.password.file=$HADOOP_CONF_DIR/jmxremote.password
-Dcom.sun.management.jmxremote.port=8004 $HADOOP_NAMENODE_OPTS"
% ./check_jmx -U service:jmx:rmi:///jndi/rmi://namenode-host:8004/jmxrmi -O \
hadoop:service=NameNode,name=FSNamesystemState -A UnderReplicatedBlocks \
-w 100 -c 1000 -username monitorRole -password secret
JMX OK - UnderReplicatedBlocks is 0
This checks the UnderReplicatedBlocks attribute of the hadoop:service=NameNode,name=FSNamesystemState MBean, raising a warning or a critical alert if the value exceeds the thresholds given with -w and -c.
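For completeness, here is a rough sketch (not from the book) of reading the same attribute with the standard Java JMX client API; the URL and object name follow the configuration shown above, and a credentials map would be needed if password authentication is enabled.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class UnderReplicatedBlocksCheck {
  public static void main(String[] args) throws Exception {
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://namenode-host:8004/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url, null); // pass a credentials map if required
    try {
      MBeanServerConnection mbsc = connector.getMBeanServerConnection();
      ObjectName name =
          new ObjectName("hadoop:service=NameNode,name=FSNamesystemState");
      Object value = mbsc.getAttribute(name, "UnderReplicatedBlocks");
      System.out.println("UnderReplicatedBlocks = " + value);
    } finally {
      connector.close();
    }
  }
}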
Maintenance
Routine Administration Procedures
Metadata backups
fs.checkpoint.dir
Data backups
hdfs oivhdfs oev
Filesystem check (fsck)
Filesystem balancer
Commissioning and Decommissioning Nodes
Commissioning new nodes
Datanodes that may connect to the namenode are listed in the include file named by dfs.hosts, and tasktrackers that may connect to the jobtracker in the file named by mapred.hosts (a single shared include file is often used for both). To commission new nodes, add their addresses to the include file, refresh the namenode and jobtracker with the commands below, update the slaves file, and then start the new datanode and tasktracker daemons.
% hadoop dfsadmin -refreshNodes
% hadoop mradmin -refreshNodes
Decommissioning old nodes
Decommissioning is controlled by exclude files, named by dfs.hosts.exclude for HDFS and mapred.hosts.exclude for MapReduce; whether a node may connect is determined by the combination of the include and exclude files:
Node appears in include file Node appears in exclude file Interpretation
No No Node may not connect.
No Yes Node may not connect.
Yes No Node may connect.
Yes Yes Node may connect and will be decommissioned.
% hadoop dfsadmin -refreshNodes
% hadoop mradmin -refreshNodes
% hadoop dfsadmin -refreshNodes
% hadoop mradmin -refreshNodes
Upgrades
HDFS data and metadata upgrades
File system image contains an old layout version -16.
An upgrade to version -18 is required.
Please restart NameNode with -upgrade option.
-upgrade
PATH
In the following steps, OLD_HADOOP_INSTALL and NEW_HADOOP_INSTALL refer to the old and new installation directories. To start the upgrade, run the new version's start script with the -upgrade option:
% $NEW_HADOOP_INSTALL/bin/start-dfs.sh -upgrade
${dfs.name.dir}/current/VERSION
/edits
/fsimage
/fstime
/previous/VERSION
/edits
/fsimage
/fstime
% $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -upgradeProgress status
Upgrade for version -18 has been completed.
Upgrade is not finalized.
% $NEW_HADOOP_INSTALL/bin/stop-dfs.sh
-rollback
Start the upgrade.
Wait until the upgrade is complete.
Check the upgrade.
Roll back the upgrade (optional).
% $OLD_HADOOP_INSTALL/bin/start-dfs.sh -rollback
% $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -finalizeUpgrade
% $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -upgradeProgress status
There are no upgrades in progress.
Finalize the upgrade (optional).
CHAPTER 11
Pig
Installing and Running Pig
% tar xzf pig-x.y.z.tar.gz
% export PIG_INSTALL=/home/tom/pig-x.y.z
% export PATH=$PATH:$PIG_INSTALL/bin
You also need Java installed and JAVA_HOME set. Typing pig -help prints the usage options and confirms the installation is working.
Execution Types
Local mode
In local mode, Pig runs in a single JVM and works against the local filesystem, which makes it suitable only for small datasets and for trying Pig out. Local mode is selected by setting the -x (or -exectype) option to local:
% pig -x local
grunt>
MapReduce mode
In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster. First check that the version of Pig you are using is compatible with the version of Hadoop on the cluster; Pig uses the HADOOP_HOME environment variable to find which Hadoop client to run, and falls back to a bundled copy of the Hadoop libraries if HADOOP_HOME is not set, which may not match your cluster's version. Next, tell Pig where the cluster's namenode and jobtracker are by setting fs.default.name and mapred.job.tracker, either by pointing Pig at the cluster's Hadoop configuration directory (via HADOOP_CONF_DIR) or by adding the properties to Pig's pig.properties file in its conf directory (the location can be changed with PIG_CONF_DIR):
fs.default.name=hdfs://localhost/
mapred.job.tracker=localhost:8021
MapReduce mode is the default, so selecting it explicitly with -x mapreduce is optional. Once configured, start Grunt and it reports the filesystem and jobtracker it has connected to:
% pig
2012-01-18 20:23:05,764 [main] INFO org.apache.pig.Main - Logging error message
s to: /private/tmp/pig_1326946985762.log
2012-01-18 20:23:06,009 [main] INFO org.apache.pig.backend.hadoop.executionengi
ne.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost/
2012-01-18 20:23:06,274 [main] INFO org.apache.pig.backend.hadoop.executionengi
ne.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:8021
grunt>
Running Pig Programs
There are three ways of executing Pig programs, and all of them work in both local and MapReduce mode. Script: pig can run a file containing Pig commands, as in pig script.pig; for very short programs, -e runs a program given as a string on the command line. Grunt: the interactive shell, which is started when no file or -e option is given; scripts can also be run from within Grunt using run and exec. Embedded: Pig programs can be run from Java using the PigServer class, and PigRunner gives programmatic access to Grunt.
Grunt
Grunt has line-editing facilities like those in GNU Readline, including command history and keyboard shortcuts, and it will also complete Pig Latin keywords and function names. For example, after typing
grunt> a = foreach b ge
pressing the Tab key expands ge to the keyword generate:
grunt> a = foreach b generate
Type help to see a list of available commands, and quit to leave Grunt.
Pig Latin Editors
An Example
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:chararray, temperature:int, quality:int);
The AS clause gives each field a name and a type: year is a chararray (like a Java String), and temperature and quality are ints.
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
The result of the LOAD statement, like that of any Pig Latin relational operator, is a relation, which here has been given the alias records. We can examine its contents with the DUMP operator:
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}
This shows that the records relation has three fields, named year, temperature, and quality, with the types we declared.
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grunt> DUMP filtered_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
The third statement uses GROUP to group the records by the year field:
grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
grouped_records
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray,filtered_records: {year: chararray,
temperature: int,quality: int}}
group
filtered_records
grunt> max_temp = FOREACH grouped_records GENERATE group,
>> MAX(filtered_records.temperature);
group
filtered_records.temperature temperature
filtered_records grouped_records MAX
filtered_records
grunt> DUMP max_temp;
(1949,111)
(1950,22)
Generating Examples
grunt> ILLUSTRATE max_temp;
-------------------------------------------------------------------------------
| records | year:chararray | temperature:int | quality:int |
-------------------------------------------------------------------------------
| | 1949 | 78 | 1 |
| | 1949 | 111 | 1 |
| | 1949 | 9999 | 1 |
-------------------------------------------------------------------------------
---------------------------------------------------------------------------------
| filtered_records | year:chararray | temperature:int | quality:int |
---------------------------------------------------------------------------------
| | 1949 | 78 | 1 |
| | 1949 | 111 | 1 |
---------------------------------------------------------------------------------
--------------------------------------------------------------------------------------
| grouped_records | group:chararray | filtered_records:bag{:tuple(year:chararray, |
temperature:int,quality:int)} |
--------------------------------------------------------------------------------------
| | 1949 | {(1949, 78, 1), (1949, 111, 1)} |
--------------------------------------------------------------------------------------
---------------------------------------------------
| max_temp | group:chararray | :int |
---------------------------------------------------
| | 1949 | 111 |
---------------------------------------------------
Comparison with Databases
Pig Latin
Structure
grouped_records = GROUP records BY year;
ls /
ls
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
-- My program
DUMP A; -- What's in A?
/**/
/*
* Description of my program spanning
* multiple lines.
*/
A = LOAD 'input/pig/join/A';
B = LOAD 'input/pig/join/B';
C = JOIN A BY $0, /* ignored */ B BY $1;
DUMP C;
cat, ls, matches, FLATTEN, DIFF, MAX
Statements
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;
Multiquery Execution
run
exec
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';
-M, -no_multiquery, pig
EXPLAIN, EXPLAIN max_temp;
Category Operator Description
Loading and storing LOAD Loads data from the filesystem or other storage into a relation
STORE Saves a relation to the filesystem or other storage
DUMP Prints a relation to the console
Filtering FILTER Removes unwanted rows from a relation
DISTINCT Removes duplicate rows from a relation
FOREACH...GENERATE Adds or removes fields from a relation
MAPREDUCE Runs a MapReduce job using a relation as input
STREAM Transforms a relation using an external program
SAMPLE Selects a random sample of a relation
Grouping and joining JOIN Joins two or more relations
COGROUP Groups the data in two or more relations
GROUP Groups the data in a single relation
CROSS Creates the cross-product of two or more relations
Sorting ORDER Sorts a relation by one or more fields
LIMIT Limits the size of a relation to a maximum number of tuples
Combining and splitting UNION Combines two or more relations into one
SPLIT Splits a relation into two or more relations
Operator Description
DESCRIBE Prints a relation’s schema
EXPLAIN Prints the logical and physical plans
ILLUSTRATE Shows a sample execution of the logical plan, using a generated subset of the input
Statement Description
REGISTER Registers a JAR file with the Pig runtime
DEFINE Creates an alias for a macro, UDF, streaming script, or command specification
IMPORT Import macros defined in a separate file into a script
Category Command Description
Hadoop Filesystem cat Prints the contents of one or more files
cd Changes the current directory
copyFromLocal Copies a local file or directory to a Hadoop filesystem
copyToLocal Copies a file or directory on a Hadoop filesystem to the local filesystem
cp Copies a file or directory to another directory
fs Accesses Hadoop’s filesystem shell
ls Lists files
mkdir Creates a new directory
mv Moves a file or directory to another directory
pwd Prints the path of the current working directory
rm Deletes a file or directory
rmf Forcibly deletes a file or directory (does not fail if the file or directory does not exist)
Hadoop MapReduce kill Kills a MapReduce job
Utility exec Runs a script in a new Grunt shell in batch mode
help Shows the available commands and options
quit Exits the interpreter
run Runs a script within the existing Grunt shell
set Sets Pig options and MapReduce job properties
sh Run a shell command from within Grunt
hadoop fs
FileSystem
fs
fs -ls fs -help
fs.default.name
set
debug
-d, -debug
grunt>
set debug on
job.name
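For example, the set command shown above can also be used to give the underlying MapReduce job a meaningful name via the job.name property; the name here is only an illustration:
grunt> set job.name 'Max temperature analysis'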
exec, run
exec
run
exec, run
pig
Expressions
Category | Expressions | Description | Examples
Constant | Literal | Constant value (see also the “Literal example” column in Table 11-6) | 1.0, 'a'
Field (by position) | $n | Field in position n (zero-based) | $0
Field (by name) | f | Field named f | year
Field (disambiguate) | r::f | Field named f from relation r after grouping or joining | A::year
Projection | c.$n, c.f | Field in container c (relation, bag, or tuple) by position, by name | records.$0, records.year
Map lookup | m#k | Value associated with key k in map m | items#'Coat'
Cast | (t) f | Cast of field f to type t | (int) year
Arithmetic | x + y, x - y | Addition, subtraction | $1 + $2, $1 - $2
Arithmetic | x * y, x / y | Multiplication, division | $1 * $2, $1 / $2
Arithmetic | x % y | Modulo, the remainder of x divided by y | $1 % $2
Arithmetic | +x, -x | Unary positive, negation | +1, -1
Conditional | x ? y : z | Bincond/ternary; y if x evaluates to true, z otherwise | quality == 0 ? 0 : 1
Comparison | x == y, x != y | Equals, does not equal | quality == 0, temperature != 9999
Comparison | x > y, x < y | Greater than, less than | quality > 0, quality < 10
Comparison | x >= y, x <= y | Greater than or equal to, less than or equal to | quality >= 1, quality <= 9
Comparison | x matches y | Pattern matching with regular expression | quality matches '[01459]'
Comparison | x is null | Is null | temperature is null
Comparison | x is not null | Is not null | temperature is not null
Boolean | x or y | Logical OR | q == 0 or q == 1
Boolean | x and y | Logical AND | q == 0 and r == 0
Boolean | not x | Logical negation | not q matches '[01459]'
Functional | fn(f1,f2,…) | Invocation of function fn on fields f1, f2, etc. | isGood(quality)
Flatten | FLATTEN(f) | Removal of a level of nesting from bags and tuples | FLATTEN(group)
Types
int, chararray
int, long, float, double
bytearray, byte
chararray, java.lang.String
boolean, byte, short, char
int, chararray, char
tuple, bag, map
Category Type Description Literal example
Numeric int 32-bit signed integer 1
long 64-bit signed integer 1L
float 32-bit floating-point number 1.0F
double 64-bit floating-point number 1.0
Text chararray Character array in UTF-16 format 'a'
Binary bytearray Byte array Not supported
Complex tuple Sequence of fields of any type (1,'pomegranate')
bag An unordered collection of tuples, possibly with duplicates {(1,'pomegranate'),(2)}
map A set of key-value pairs; keys must be character arrays, but
values may be any type ['a'#'pomegranate']
PigStorage
{(1,pomegranate),(2)}
TOTUPLE, TOBAG, TOMAP
A = {(1,2),(3,4)}; -- Error
$0
B = A.$0;
B = FOREACH A GENERATE $0;
Schemas
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: int,temperature: int,quality: int}
chararray
chararray
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year, temperature, quality);
grunt> DESCRIBE records;
records: {year: bytearray,temperature: bytearray,quality: bytearray}
year
temperature, quality, bytearray
bytearray, year
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year, temperature:int, quality:int);
grunt> DESCRIBE records;
records: {year: bytearray,temperature: int,quality: int}
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt';
grunt> DESCRIBE records;
Schema for records unknown.
$0, $1
bytearray
grunt> projected_records = FOREACH records GENERATE $0, $1, $2;
grunt> DUMP projected_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
grunt> DESCRIBE projected_records;
projected_records: {bytearray,bytearray,bytearray}
Validation and nulls
null
1950 0 1
1950 22 1
1950 e 1
1949 111 1
1949 78 1
null
grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>> AS (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,,1)
(1949,111,1)
(1949,78,1)
grunt> corrupt_records = FILTER records BY temperature is null;
grunt> DUMP corrupt_records;
(1950,,1)
is null
grunt> grouped = GROUP corrupt_records ALL;
grunt> all_grouped = FOREACH grouped GENERATE group, COUNT(corrupt_records);
grunt> DUMP all_grouped;
(all,1)
grunt> SPLIT records INTO good_records IF temperature is not null,
>> bad_records IF temperature is null;
grunt> DUMP good_records;
(1950,0,1)
(1950,22,1)
(1949,111,1)
(1949,78,1)
grunt> DUMP bad_records;
(1950,,1)
temperature
null
grunt> records = LOAD 'input/ncdc/micro-tab/sample_corrupt.txt'
>> AS (year:chararray, temperature, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,e,1)
(1949,111,1)
(1949,78,1)
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grunt> grouped_records = GROUP filtered_records BY year;
grunt> max_temp = FOREACH grouped_records GENERATE group,
>> MAX(filtered_records.temperature);
grunt> DUMP max_temp;
(1949,111.0)
(1950,22.0)
temperature, bytearray
MAX
temperature, double, MAX
double, null
MAX
SIZE
grunt> A = LOAD 'input/pig/corrupt/missing_fields';
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3)
(1,Scarf)
grunt> B = FILTER A BY SIZE(TOTUPLE(*)) > 1;
grunt> DUMP B;
(2,Tie)
(4,Coat)
(1,Scarf)
Schema merging
Functions
MAX
MAX
MAX
IsEmpty
PigStorage
Category Function Description
Eval AVG Calculates the average (mean) value of entries in a bag.
CONCAT Concatenates byte arrays or character arrays together.
COUNT Calculates the number of non-null entries in a bag.
COUNT_STAR Calculates the number of entries in a bag, including those that are null.
DIFF Calculates the set difference of two bags. If the two arguments are not bags,
returns a bag containing both if they are equal; otherwise, returns an empty
bag.
MAX Calculates the maximum value of entries in a bag.
MIN Calculates the minimum value of entries in a bag.
SIZE Calculates the size of a type. The size of numeric types is always one; for
character arrays, it is the number of characters; for byte arrays, the number
of bytes; and for containers (tuple, bag, map), it is the number of entries.
SUM Calculates the sum of the values of entries in a bag.
TOBAG Converts one or more expressions to individual tuples, which are then put in
a bag.
TOKENIZE Tokenizes a character array into a bag of its constituent words.
TOMAP Converts an even number of expressions to a map of key-value pairs.
TOP Calculates the top n tuples in a bag.
TOTUPLE Converts one or more expressions to a tuple.
Filter IsEmpty Tests whether a bag or map is empty.
Load/Store PigStorage Loads or stores relations using a field-delimited text format. Each line is
broken into fields using a configurable field delimiter (defaults to a tab
character) to be stored in the tuple’s fields. It is the default storage when
none is specified.
BinStorage Loads or stores relations from or to binary files in a Pig-specific format that
uses Hadoop Writable objects.
TextLoader Loads relations from a plain-text format. Each line corresponds to a tuple
whose single field is the line of text.
JsonLoader, JsonStorage Loads or stores relations from or to a (Pig-defined) JSON format. Each tuple
is stored on one line.
HBaseStorage Loads or stores relations from or to HBase tables.
Macros
DEFINE max_by_group(X, group_key, max_field) RETURNS Y {
A = GROUP $X by $group_key;
$Y = FOREACH A GENERATE group, MAX($X.$max_field);
};
max_by_group, X
group_key, max_field, Y
$, $X
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
max_temp = max_by_group(filtered_records, year, temperature);
DUMP max_temp;
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
macro_max_by_group_A_0 = GROUP filtered_records by (year);
max_temp = FOREACH macro_max_by_group_A_0 GENERATE group,
MAX(filtered_records.(temperature));
DUMP max_temp;
-dryrun, pig
filtered_records, year
temperature
$A
A, macro_max_by_group_A_0
IMPORT './ch11/src/main/pig/max_temp.macro';
User-Defined Functions
A Filter UDF
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);
FilterFunc, EvalFunc
EvalFunc
EvalFunc
public abstract class EvalFunc<T> {
public abstract T exec(Tuple input) throws IOException;
}
EvalFunc, exec()
T
FilterFunc T Boolean
true
IsGoodQuality, FilterFunc
exec(), Tuple
get()
Tuple, null
true, false
package com.hadoopbook.pig;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;
public class IsGoodQuality extends FilterFunc {
@Override
public Boolean exec(Tuple tuple) throws IOException {
if (tuple == null || tuple.size() == 0) {
return false;
}
try {
Object object = tuple.get(0);
if (object == null) {
return false;
}
int i = (Integer) object;
return i == 0 || i == 1 || i == 4 || i == 5 || i == 9;
} catch (ExecException e) {
throw new IOException(e);
}
}
}
grunt> REGISTER pig-examples.jar;
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> com.hadoopbook.pig.IsGoodQuality(quality);
com.hadoopbook.pig.IsGoodQuality
MAX
MAX, org.apache.pig.builtin
MAX, org.apache.pig.builtin.MAX
-Dudf.import.list=com.hadoopbook.pig
grunt> DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
grunt> filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);
Leveraging types
int
bytearray, DataByteArray, DataByteArray
Integer
exec()
getArgToFuncMapping(), EvalFunc
@Override
public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
List<FuncSpec> funcSpecs = new ArrayList<FuncSpec>();
funcSpecs.add(new FuncSpec(this.getClass().getName(),
new Schema(new Schema.FieldSchema(null, DataType.INTEGER))));
return funcSpecs;
}
FuncSpec
exec()
FieldSchema, null
INTEGER
DataType
null
exec(), false, null
-- max_temp_filter_udf.pig
REGISTER pig-examples.jar;
DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;
An Eval UDF
chararray, trim(), java.lang.String
public class Trim extends EvalFunc<String> {
@Override
public String exec(Tuple input) throws IOException {
if (input == null || input.size() == 0) {
return null;
}
try {
Object object = input.get(0);
if (object == null) {
return null;
}
return ((String) object).trim();
} catch (ExecException e) {
throw new IOException(e);
}
}
@Override
public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
List<FuncSpec> funcList = new ArrayList<FuncSpec>();
funcList.add(new FuncSpec(this.getClass().getName(), new Schema(
new Schema.FieldSchema(null, DataType.CHARARRAY))));
return funcList;
}
}
EvalFunc
String Trim exec() getArgToFuncMapping()
IsGoodQuality
udf
B = FOREACH A GENERATE udf($0);
udf
outputSchema()
Trim, chararray
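As a sketch of the outputSchema() override mentioned here (it is not part of the listing above; it reuses the Schema and DataType classes already imported for getArgToFuncMapping()), Trim could declare that it always returns a chararray:
@Override
public Schema outputSchema(Schema input) {
  // Declare a single, unnamed chararray field as the UDF's output type.
  return new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY));
}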
grunt> DUMP A;
( pomegranate)
(banana )
(apple)
( lychee )
grunt> DESCRIBE A;
A: {fruit: chararray}
grunt> B = FOREACH A GENERATE com.hadoopbook.pig.Trim(fruit);
grunt> DUMP B;
(pomegranate)
(banana)
(apple)
(lychee)
grunt> DESCRIBE B;
B: {chararray}
chararray
Trim, fruit
chararray
Dynamic invokers
Algebraic, Accumulator
StringUtils
grunt> DEFINE trim InvokeForString('org.apache.commons.lang.StringUtils.trim', 'String');
grunt> B = FOREACH A GENERATE trim(fruit);
grunt> DUMP B;
(pomegranate)
(banana)
(apple)
(lychee)
InvokeForString
String, InvokeForInt, InvokeForLong, InvokeForDouble, InvokeForFloat
A Load UDF
cut
grunt> records = LOAD 'input/ncdc/micro/sample.txt'
>> USING com.hadoopbook.pig.CutLoadFunc('16-19,88-92,93-93')
>> AS (year:int, temperature:int, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
CutLoadFunc
CutLoadFunc
public class CutLoadFunc extends LoadFunc {
private static final Log LOG = LogFactory.getLog(CutLoadFunc.class);
private final List<Range> ranges;
private final TupleFactory tupleFactory = TupleFactory.getInstance();
private RecordReader reader;
public CutLoadFunc(String cutPattern) {
ranges = Range.parse(cutPattern);
}
@Override
public void setLocation(String location, Job job)
throws IOException {
FileInputFormat.setInputPaths(job, location);
}
@Override
public InputFormat getInputFormat() {
return new TextInputFormat();
}
@Override
public void prepareToRead(RecordReader reader, PigSplit split) {
this.reader = reader;
}
@Override
public Tuple getNext() throws IOException {
try {
if (!reader.nextKeyValue()) {
return null;
}
Text value = (Text) reader.getCurrentValue();
String line = value.toString();
Tuple tuple = tupleFactory.newTuple(ranges.size());
for (int i = 0; i < ranges.size(); i++) {
Range range = ranges.get(i);
if (range.getEnd() > line.length()) {
LOG.warn(String.format(
"Range end (%s) is longer than line length (%s)",
range.getEnd(), line.length()));
continue;
}
tuple.set(i, new DataByteArray(range.getSubstring(line)));
}
return tuple;
} catch (InterruptedException e) {
throw new ExecException(e);
}
}
}
InputFormat, OutputFormat
LoadFunc
InputFormat, LoadFunc
CutLoadFunc
Range
Range
setLocation(), LoadFunc
CutLoadFunc, TextInputFormat
FileInputFormat
org.apache.hadoop.mapreduce
getInputFormat(), RecordReader
RecordReader, prepareToRead()
CutLoadFunc, getNext()
getNext()
null
getNext()
Tuple, TupleFactory
Tuple, newTuple()
Range
Range
null
null
Using a schema
bytearray
DataByteArray
getLoadCaster()
LoadCaster
public interface LoadCaster {
public Integer bytesToInteger(byte[] b) throws IOException;
public Long bytesToLong(byte[] b) throws IOException;
public Float bytesToFloat(byte[] b) throws IOException;
public Double bytesToDouble(byte[] b) throws IOException;
public String bytesToCharArray(byte[] b) throws IOException;
public Map<String, Object> bytesToMap(byte[] b) throws IOException;
public Tuple bytesToTuple(byte[] b) throws IOException;
public DataBag bytesToBag(byte[] b) throws IOException;
}
CutLoadFunc, getLoadCaster()
Utf8StorageConverter
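The base LoadFunc implementation already returns a Utf8StorageConverter, so CutLoadFunc does not need its own getLoadCaster(); an explicit override would only be a few lines, sketched here for illustration (it assumes org.apache.pig.LoadCaster and org.apache.pig.builtin.Utf8StorageConverter are imported):
@Override
public LoadCaster getLoadCaster() throws IOException {
  // Use Pig's standard UTF-8 text conversions when the schema asks for typed fields.
  return new Utf8StorageConverter();
}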
LoadMetadata
LoadFunc
LoadMetadata
LoadPushDown
CutLoadFunc
Data Processing Operators
Loading and Storing Data
grunt> STORE A INTO 'out' USING PigStorage(':');
grunt> cat out
Joe:cherry:2
Ali:apple:3
Joe:banana:2
Eve:apple:7
Filtering Data
FOREACH...GENERATE
grunt> DUMP A;
(Joe,cherry,2)
(Ali,apple,3)
(Joe,banana,2)
(Eve,apple,7)
grunt> B = FOREACH A GENERATE $0, $2+1, 'Constant';
grunt> DUMP B;
(Joe,3,Constant)
(Ali,4,Constant)
(Joe,3,Constant)
(Eve,8,Constant)
$0$2
chararrayConstant
-- year_stats.pig
REGISTER pig-examples.jar;
DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
records = LOAD 'input/ncdc/all/19{1,2,3,4,5}0*'
USING com.hadoopbook.pig.CutLoadFunc('5-10,11-15,16-19,88-92,93-93')
AS (usaf:chararray, wban:chararray, year:int, temperature:int, quality:int);
grouped_records = GROUP records BY year PARALLEL 30;
year_stats = FOREACH grouped_records {
uniq_stations = DISTINCT records.usaf;
good_records = FILTER records BY isGood(quality);
GENERATE FLATTEN(group), COUNT(uniq_stations) AS station_count,
COUNT(good_records) AS good_record_count, COUNT(records) AS record_count;
}
DUMP year_stats;
recordsrecords
(1920,8L,8595L,8595L)
(1950,1988L,8635452L,8641353L)
(1930,121L,89245L,89262L)
(1910,7L,7650L,7650L)
(1940,732L,1052333L,1052976L)
STREAM
cut
grunt> C = STREAM A THROUGH `cut -f 2`;
grunt> DUMP C;
(cherry)
(apple)
(banana)
(apple)
PigToStreamStreamToPig
org.apache.pig
#!/usr/bin/env python
import re
import sys
for line in sys.stdin:
(year, temp, q) = line.strip().split()
if (temp != "9999" and re.match("[01459]", q)):
print "%s\t%s" % (year, temp)
-- max_temp_filter_stream.pig
DEFINE is_good_quality `is_good_quality.py`
SHIP ('ch11/src/main/python/is_good_quality.py');
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = STREAM records THROUGH is_good_quality
AS (year:chararray, temperature:int);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
DUMP max_temp;
Grouping and Joining Data
JOIN
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
grunt> C = JOIN A BY $0, B BY $1 USING "replicated";
grunt> C = JOIN A BY $0 LEFT OUTER, B BY $1;
grunt> DUMP C;
(1,Scarf,,)
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
COGROUP
grunt> D = COGROUP A BY $0, B BY $1;
grunt> DUMP D;
(0,{},{(Ali,0)})
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})
"skewed"
"merge"
D = COGROUP A BY $0 OUTER, B BY $1 OUTER;
grunt> E = COGROUP A BY $0 INNER, B BY $1;
grunt> DUMP E;
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})
grunt> F = FOREACH E GENERATE FLATTEN(A), B.$0;
grunt> DUMP F;
(1,Scarf,{})
(2,Tie,{(Joe),(Hank)})
(3,Hat,{(Eve)})
(4,Coat,{(Hank)})
grunt> G = COGROUP A BY $0 INNER, B BY $1 INNER;
grunt> H = FOREACH G GENERATE FLATTEN($1), FLATTEN($2);
grunt> DUMP H;
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
JOIN A BY $0, B BY $1
-- max_temp_station_name.pig
REGISTER pig-examples.jar;
DEFINE isGood com.hadoopbook.pig.IsGoodQuality();
stations = LOAD 'input/ncdc/metadata/stations-fixed-width.txt'
USING com.hadoopbook.pig.CutLoadFunc('1-6,8-12,14-42')
AS (usaf:chararray, wban:chararray, name:chararray);
trimmed_stations = FOREACH stations GENERATE usaf, wban,
com.hadoopbook.pig.Trim(name);
records = LOAD 'input/ncdc/all/191*'
USING com.hadoopbook.pig.CutLoadFunc('5-10,11-15,88-92,93-93')
AS (usaf:chararray, wban:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND isGood(quality);
grouped_records = GROUP filtered_records BY (usaf, wban) PARALLEL 30;
max_temp = FOREACH grouped_records GENERATE FLATTEN(group),
MAX(filtered_records.temperature);
max_temp_named = JOIN max_temp BY (usaf, wban), trimmed_stations BY (usaf, wban)
PARALLEL 30;
max_temp_result = FOREACH max_temp_named GENERATE $0, $1, $5, $2;
STORE max_temp_result INTO 'max_temp_by_station';
228020 99999 SORTAVALA 322
029110 99999 VAASA AIRPORT 300
040650 99999 GRIMSEY 378
CROSS
grunt> I = CROSS A, B;
grunt> DUMP I;
(2,Tie,Joe,2)
(2,Tie,Hank,4)
(2,Tie,Ali,0)
(2,Tie,Eve,3)
(2,Tie,Hank,2)
(4,Coat,Joe,2)
(4,Coat,Hank,4)
(4,Coat,Ali,0)
(4,Coat,Eve,3)
(4,Coat,Hank,2)
(3,Hat,Joe,2)
(3,Hat,Hank,4)
(3,Hat,Ali,0)
(3,Hat,Eve,3)
(3,Hat,Hank,2)
(1,Scarf,Joe,2)
(1,Scarf,Hank,4)
(1,Scarf,Ali,0)
(1,Scarf,Eve,3)
(1,Scarf,Hank,2)
GROUP
grunt> DUMP A;
(Joe,cherry)
(Ali,apple)
(Joe,banana)
(Eve,apple)
grunt> B = GROUP A BY SIZE($1);
grunt> DUMP B;
(5,{(Ali,apple),(Eve,apple)})
(6,{(Joe,cherry),(Joe,banana)})
group
grunt> C = GROUP A ALL;
grunt> DUMP C;
(all,{(Joe,cherry),(Ali,apple),(Joe,banana),(Eve,apple)})
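Grouping everything into a single bag like this is a common way to compute aggregates over a whole relation; for example, counting the tuples in A follows the same pattern used earlier for counting corrupt records:
grunt> D = FOREACH C GENERATE COUNT(A);
grunt> DUMP D;
(4)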
Sorting Data
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
(1,2)
(2,4)
(2,3)
grunt> C = FOREACH B GENERATE *;
grunt> D = LIMIT B 2;
grunt> DUMP D;
(1,2)
(2,4)
Combining and Splitting Data
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
grunt> DUMP B;
(z,x,8)
(w,y,1)
grunt> C = UNION A, B;
grunt> DUMP C;
(2,3)
(1,2)
(2,4)
(z,x,8)
(w,y,1)
grunt> DESCRIBE A;
A: {f0: int,f1: int}
grunt> DESCRIBE B;
B: {f0: chararray,f1: chararray,f2: int}
grunt> DESCRIBE C;
Schema for C unknown.
Pig in Practice
Parallelism
pig.exec.reducers.bytes.per.reducer, pig.exec.reducers.max
grouped_records = GROUP records BY year PARALLEL 30;
default_parallel
grunt>
set default_parallel 30
Parameter Substitution
$
$input, $output
-- max_temp_param.pig
records = LOAD '$input' AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
MAX(filtered_records.temperature);
STORE max_temp into '$output';
-param
% pig -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
> -param output=/tmp/out \
> ch11/src/main/pig/max_temp_param.pig
-param_file
# Input file
input=/user/tom/input/ncdc/micro-tab/sample.txt
# Output file
output=/tmp/out
% pig -param_file ch11/src/main/pig/max_temp_param.param \
> ch11/src/main/pig/max_temp_param.pig
-param_file
-param-param_file
Dynamic parameters
-param
% pig -param input=/user/tom/input/ncdc/micro-tab/sample.txt \
> -param output=/tmp/`date "+%Y-%m-%d"`/out \
> ch11/src/main/pig/max_temp_param.pig
Parameter substitution processing
-dryrun
CHAPTER 12
Hive
Installing Hive
Which Versions of Hadoop Does Hive Work With?
HADOOP_HOME
% tar xzf hive-x.y.z.tar.gz
% export HIVE_INSTALL=/home/tom/hive-x.y.z-dev
% export PATH=$PATH:$HIVE_INSTALL/bin
hive
% hive
hive>
hadoop fs
The Hive Shell
hive> SHOW TABLES;
OK
Time taken: 10.425 seconds
show
tables;
hive
-f
% hive -f script.q
-e
% hive -e 'SELECT * FROM dummy'
Hive history file=/tmp/tom/hive_job_log_tom_201005042112_1906486281.txt
OK
X
Time taken: 4.734 seconds
SELECT
% echo 'X' > /tmp/dummy.txt
% hive -e "CREATE TABLE dummy (value STRING); \
LOAD DATA LOCAL INPATH '/tmp/dummy.txt' \
OVERWRITE INTO TABLE dummy"
-S
% hive -S -e 'SELECT * FROM dummy'
X
!
dfs
An Example
CREATE TABLE
CREATE TABLE records (year STRING, temperature INT, quality INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
records year temperature
quality
ROW FORMAT
LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample.txt'
OVERWRITE INTO TABLE records;
fs.default.name
file:///
hive.metastore.warehouse.dir
records
% ls /user/hive/warehouse/records/
sample.txt
OVERWRITELOAD DATA
hive> SELECT year, MAX(temperature)
> FROM records
> WHERE temperature != 9999
> AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
> GROUP BY year;
1949 111
1950 22
SELECTGROUP BY
MAX()
Running Hive
Configuring Hive
--config, hive
% hive --config /Users/tom/dev/hive-conf
HIVE_CONF_DIR
fs.default.name, mapred.job.tracker
-hiveconf, hive
% hive -hiveconf fs.default.name=localhost -hiveconf mapred.job.tracker=localhost:8021
% hadoop fs -mkdir /tmp
% hadoop fs -chmod a+w /tmp
% hadoop fs -mkdir /user/hive/warehouse
% hadoop fs -chmod a+w /user/hive/warehouse
g+w
SET
hive> SET hive.enforce.bucketing=true;
SET
hive> SET hive.enforce.bucketing;
hive.enforce.bucketing=true
SET
SET -v
SET
-hiveconf
Logging
% hive -hiveconf hive.root.logger=DEBUG,console
Hive Services
hive
--service, hive --service
help
cli
hiveserver
HIVE_PORT
hwi
jar
hadoop jar
metastore
METASTORE_PORT
The Hive Web Interface (HWI)
% export ANT_LIB=/path/to/ant/lib
% hive --service hwi
ANT_LIB
Hive clients
hive --service hiveserver
org.apache.hadoop.hive.jdbc.HiveDriver
jdbc:hive://host:port/dbname
jdbc:hive://
The Metastore
Failed to start database 'metastore_db'
javax.jdo.option.*
javax.jdo.option.ConnectionURL: jdbc:mysql://host/dbname?createDatabaseIfNotExist=true
javax.jdo.option.ConnectionDriverName: com.mysql.jdbc.Driver
javax.jdo
hive.metastore.local, false, hive.metastore.uris
thrift://host:port, METASTORE_PORT
Property name | Type | Default value | Description
hive.metastore.warehouse.dir | URI | /user/hive/warehouse | The directory relative to fs.default.name where managed tables are stored.
hive.metastore.local | boolean | true | Whether to use an embedded metastore server (true) or connect to a remote instance (false). If false, then hive.metastore.uris must be set.
hive.metastore.uris | Comma-separated URIs | Not set | The URIs specifying the remote metastore servers to connect to. Clients connect in a round-robin fashion when there are multiple remote servers.
javax.jdo.option.ConnectionURL | URI | jdbc:derby:;databaseName=metastore_db;create=true | The JDBC URL of the metastore database.
javax.jdo.option.ConnectionDriverName | String | org.apache.derby.jdbc.EmbeddedDriver | The JDBC driver classname.
javax.jdo.option.ConnectionUserName | String | APP | The JDBC username.
javax.jdo.option.ConnectionPassword | String | mine | The JDBC password.
Comparison with Traditional Databases
Schema on Read Versus Schema on Write
Updates, Transactions, and Indexes
INSERT INTO
SELECT * from t WHERE x = a
x
SHOW LOCKS
HiveQL
TRANSFORM, MAP, REDUCE
Feature SQL HiveQL References
Updates UPDATE, INSERT, DELETE INSERT “Inserts” on page 441; “Updates,
Transactions, and Indexes”
on page 424
Transactions Supported Supported (table- and par-
tition-level)
Indexes Supported Supported
Latency Sub-second Minutes
Data types Integral, floating-point, fixed-
point, text and binary strings,
temporal
Integral, floating-point,
Boolean, text and binary
strings, timestamp, array,
map, struct
“Data Types” on page 426
Functions Hundreds of built-in
functions
Dozens of built-in
functions
“Operators and Func-
tions” on page 428
Multitable inserts Not supported Supported “Multitable insert” on page 442
Create table as
select
Not valid SQL-92, but found in
some databases
Supported “CREATE TABLE...AS SE-
LECT” on page 442
Select SQL-92 Single table or view in the
FROM clause; SORT BY for
partial ordering. LIMIT to
limit number of rows re-
turned
“Querying Data” on page 444
Joins SQL-92 or variants (join
tables in the FROM clause, join
condition in the WHERE
clause)
Inner joins, outer joins; semi
joins, map joins (SQL-92
syntax, with hinting)
“Joins” on page 446
Subqueries In any clause (correlated or
noncorrelated)
Only in the FROM clause
(correlated subqueries not
supported)
“Subqueries” on page 449
Views Updatable (materialized or
nonmaterialized)
Read-only (materialized
views not supported)
“Views” on page 450
Extension points User-defined functions.
Stored procedures.
User-defined functions,
MapReduce scripts.
“User-Defined Functions” on page 451;
“MapReduce Scripts” on page 445
Data Types
Category Type Description Literal examples
Primitive TINYINT 1-byte (8-bit) signed integer, from -128 to
127 1
SMALLINT 2-byte (16-bit) signed integer, from
-32,768 to 32,767 1
INT 4-byte (32-bit) signed integer, from
-2,147,483,648 to 2,147,483,647 1
BIGINT 8-byte (64-bit) signed integer, from
-9,223,372,036,854,775,808 to
9,223,372,036,854,775,807
1
FLOAT 4-byte (32-bit) single-precision floating-
point number 1.0
DOUBLE 8-byte (64-bit) double-precision floating-
point number 1.0
BOOLEAN true/false value TRUE
STRING Character string 'a', "a"
BINARY Byte array Not supported
TIMESTAMP Timestamp with nanosecond precision 1325502245000, '2012-01-02
03:04:05.123456789'
Complex ARRAY An ordered collection of fields. The fields
must all be of the same type. array(1, 2) a
MAP An unordered collection of key-value pairs.
Keys must be primitives; values may be any
type. For a particular map, the keys must
be the same type, and the values must be
the same type.
map('a', 1, 'b', 2)
STRUCT A collection of named fields. The fields may
be of different types. struct('a', 1, 1.0) b
aThe literal forms for arrays, maps, and structs are provided as functions. That is, array(), map(), and struct() are built-in Hive functions.
bThe columns are named col1, col2, col3, etc.
Primitive types
TINYINT, SMALLINT, INT, BIGINT
byte, short, int, long
FLOAT, DOUBLE, float, double
BOOLEAN
STRING
STRING, VARCHAR
STRING
STRING
BINARY
TIMESTAMP
TIMESTAMP to_utc_timestamp
from_utc_timestamp
Complex types
ARRAY MAP STRUCT ARRAY MAP
STRUCT
CREATE TABLE complex (
col1 ARRAY<INT>,
col2 MAP<STRING, INT>,
col3 STRUCT<a:STRING, b:INT, c:DOUBLE>
);
ARRAYMAPSTRUCT
hive> SELECT col1[0], col2['b'], col3.c FROM complex;
1 2 1.0
Operators and Functions
x = 'a', x IS NULL, x LIKE 'a%'
x + 1
x OR y
||, concat
xpath
SHOW FUNCTIONS
DESCRIBE
hive> DESCRIBE FUNCTION length;
length(str | binary) - Returns the length of str or number of bytes in binary data
Conversions
TINYINT
INT, INT
CAST
FLOAT
STRING, DOUBLE, TINYINT, SMALLINT, INT, FLOAT, BOOLEAN
TIMESTAMP, STRING
CAST, CAST('1' AS INT)
'1'
CAST('X' AS INT), NULL
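For example, using the one-row dummy table created earlier in the chapter (a sketch; the output columns are tab-separated):
hive> SELECT CAST('1' AS INT), CAST('X' AS INT) FROM dummy;
1       NULL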
Tables
Multiple Database/Schema Support
CREATE DATABASE dbname
USE dbname DROP DATABASE dbname
dbname.tablename, default
Managed Tables and External Tables
LOAD, DROP
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table;
managed_table
NULL
SELECT
DROP TABLE managed_table;
LOAD, DROP
CREATE EXTERNAL TABLE external_table (dummy STRING)
LOCATION '/user/tom/external_table';
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table;
EXTERNAL
LOCAL
LOAD
DROP
Partitions and Buckets
Partitions
PARTITIONED BY
INSERT OVERWRITE DIRECTORY
ALTER TABLE
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs
PARTITION (dt='2001-01-01', country='GB');
logs
/user/hive/warehouse/logs
dt=2001-01-01/
country=GB/
file1
file2
country=US/
file3
dt=2001-01-02/
country=GB/
file4
country=US/
file5
file6
logs, 2001-01-01, 2001-01-02
GB US
SHOW PARTITIONS
hive> SHOW PARTITIONS logs;
dt=2001-01-01/country=GB
dt=2001-01-01/country=US
dt=2001-01-02/country=GB
dt=2001-01-02/country=US
PARTITIONED BY
SELECT
SELECT ts, dt, line
FROM logs
WHERE country='GB';
dt
Buckets
CLUSTERED
BY
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
hive> SELECT * FROM users;
0 Nat
2 Joe
3 Kay
4 Ann
hive.enforce.bucketing
true
INSERT
INSERT OVERWRITE TABLE bucketed_users
SELECT * FROM users;
bucketed_users
hive> dfs -ls /user/hive/warehouse/bucketed_users;
000000_0
000001_0
000002_0
000003_0
INT
hive> dfs -cat /user/hive/warehouse/bucketed_users/000000_0;
0Nat
4Ann
TABLESAMPLE
hive> SELECT * FROM bucketed_users
> TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
0 Nat
4 Ann
hive> SELECT * FROM bucketed_users
> TABLESAMPLE(BUCKET 1 OUT OF 2 ON id);
0 Nat
4 Ann
2 Joe
TABLESAMPLE
rand()
hive> SELECT * FROM users
> TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand());
2 Joe
Storage Formats
INSERT
The default storage format: Delimited text
ROW FORMATSTORED AS
ARRAYSTRUCTMAP
MAP
CREATE TABLE nested
AS
SELECT array(array(1, 2), array(3, 4))
FROM dummy;
hexdump
CREATE TABLE ...;
CREATE TABLE ...
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
LazySimpleSerDe
true
false
TEXTFILE
STORED AS TEXTFILE
SerDe name Java package Description
LazySimpleSerDe org.apache.hadoop.hive.
serde2.lazy
The default SerDe. Delimited textual format,
with lazy field access.
LazyBinarySerDe org.apache.hadoop.hive.
serde2.lazybinary
A more efficient version of LazySimple
SerDe. Binary format with lazy field access.
Used internally for such things as temporary
tables.
BinarySortableSerDe org.apache.hadoop.hive.
serde2.binarysortable
A binary SerDe like LazyBinarySerDe, but
optimized for sorting at the expense of com-
pactness (although it is still significantly more
compact than LazySimpleSerDe).
ColumnarSerDe org.apache.hadoop.hive.
serde2.columnar
A variant of LazySimpleSerDe for column-
based storage with RCFile.
RegexSerDe org.apache.hadoop.hive.
contrib.serde2
A SerDe for reading textual data where columns
are specified by a regular expression. Also
writes data using a formatting expression.
Useful for reading logfiles, but inefficient, so
not suitable for general-purpose storage.
ThriftByteStreamTypedSerDe org.apache.hadoop.hive.
serde2.thrift
A SerDe for reading Thrift-encoded binary data.
HBaseSerDe org.apache.hadoop.hive.
hbase
A SerDe for storing data in an HBase table.
HBase storage uses a Hive storage handler,
which unifies (and generalizes) the roles of row
format and file format. Storage handlers are
specified using a STORED BY clause, which
replaces the ROW FORMAT and STORED AS
clauses. See https://cwiki.apache.org/conflu
ence/display/Hive/HBaseIntegration.
Binary storage formats: Sequence files, Avro datafiles and RCFiles
STORED AS SEQUENCEFILECREATE TABLE
hive> CREATE TABLE compressed_users (id INT, name STRING)
> STORED AS SEQUENCEFILE;
hive> SET hive.exec.compress.output=true;
hive> SET mapred.output.compress=true;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> INSERT OVERWRITE TABLE compressed_users
> SELECT * FROM users;
CREATE TABLE
CREATE TABLE ...
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS RCFILE;
An example: RegexSerDe
CREATE TABLE stations (usaf STRING, wban STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\d{6}) (\\d{5}) (.{29}) .*"
);
DELIMITED
ROW FORMATSERDE
org.apache.hadoop.hive.contrib.serde2.RegexSerDe
WITH SERDEPROPERTIES
input.regexRegexSerDe
input.regex
usafwban
name
LOAD DATA
LOAD DATA LOCAL INPATH "input/ncdc/metadata/stations-fixed-width.txt"
INTO TABLE stations;
LOAD DATA
hive> SELECT * FROM stations LIMIT 4;
010000 99999 BOGUS NORWAY
010003 99999 BOGUS NORWAY
010010 99999 JAN MAYEN
010013 99999 ROST
(ab)+ab
?
(?:ab)+
Importing Data
LOAD DATA
INSERT
CREATE TABLE...AS
SELECT
Inserts
INSERT
INSERT OVERWRITE TABLE target
SELECT col1, col2
FROM source;
PARTITION
INSERT OVERWRITE TABLE target
PARTITION (dt='2001-01-01')
SELECT col1, col2
FROM source;
OVERWRITE, target
2001-01-01
SELECT
INSERT INTO TABLE
SELECT
INSERT OVERWRITE TABLE target
PARTITION (dt)
SELECT col1, col2, dt
FROM source;
hive.exec.dynamic.partition, true
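Dynamic partition inserts are disabled by default; a minimal sketch of enabling them before running the statement above (the nonstrict mode setting allows every partition value to be determined dynamically):
hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;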
INSERT
INSERT INTO...VALUES...
Multitable insert
INSERT, FROM
FROM source
INSERT OVERWRITE TABLE target
SELECT col1, col2;
INSERT
INSERT
FROM records2
INSERT OVERWRITE TABLE stations_by_year
SELECT year, COUNT(DISTINCT station)
GROUP BY year
INSERT OVERWRITE TABLE records_by_year
SELECT year, COUNT(1)
GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year
SELECT year, COUNT(1)
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
GROUP BY year;
records2
CREATE TABLE...AS SELECT
SELECT, target, col1
col2, source
CREATE TABLE target
AS
SELECT col1, col2
FROM source;
SELECT
Altering Tables
ALTER TABLE
ALTER TABLE source RENAME TO target;
ALTER TABLE
ALTER TABLE target ADD COLUMNS (col3 STRING);
col3
null, col3
SELECT
Dropping Tables
DROP TABLE
DELETE
TRUNCATE
hive>
dfs -rmr /user/hive/warehouse/my_table;
LIKE
CREATE TABLE new_table LIKE existing_table;
Querying Data
SELECT
Sorting and Aggregating
ORDER BY
ORDER BY
SORT BY SORT BY
DISTRIBUTE BY
hive> FROM records2
> SELECT year, temperature
> DISTRIBUTE BY year
> SORT BY year ASC, temperature DESC;
1949 111
1949 78
1950 22
1950 0
1950 -11
SORT BY, DISTRIBUTE BY, CLUSTER BY
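CLUSTER BY is a shorthand for distributing and sorting by the same columns; for instance, distributing and sorting by year alone could be written as the following sketch (ascending order only):
hive> FROM records2
    > SELECT year, temperature
    > CLUSTER BY year;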
MapReduce Scripts
TRANSFORM, MAP, REDUCE
#!/usr/bin/env python
import re
import sys
for line in sys.stdin:
(year, temp, q) = line.strip().split()
if (temp != "9999" and re.match("[01459]", q)):
print "%s\t%s" % (year, temp)
hive> ADD FILE /Users/tom/book-workspace/hadoop-book/ch12/src/main/python/is_good_quality.py;
hive> FROM records2
> SELECT TRANSFORM(year, temperature, quality)
> USING 'is_good_quality.py'
> AS year, temperature;
1950 0
1950 22
1950 -11
1949 111
1949 78
year, temperature, quality
year
temperature
MAPREDUCESELECT
TRANSFORM
FROM (
FROM records2
MAP year, temperature, quality
USING 'is_good_quality.py'
AS year, temperature) map_output
REDUCE year, temperature
USING 'max_temperature_reduce.py'
AS year, temperature;
Joins
Inner joins
sales
things
hive> SELECT * FROM sales;
Joe 2
Hank 4
Ali 0
Eve 3
Hank 2
hive> SELECT * FROM things;
2 Tie
4 Coat
3 Hat
1 Scarf
hive> SELECT sales.*, things.*
> FROM sales JOIN things ON (sales.id = things.id);
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
FROM, sales, JOIN, things
ON
id
FROM, WHERE
SELECT
SELECT sales.*, things.*
FROM sales, things
WHERE sales.id = things.id;
FROM
JOIN
AND
JOIN...ON...
EXPLAIN
EXPLAIN
SELECT sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
EXPLAIN
EXPLAIN EXTENDED
Outer joins
things
LEFT OUTER JOIN
sales
things
hive> SELECT sales.*, things.*
> FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
Ali 0 NULL NULL
JOIN
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
things
NULL
things
hive> SELECT sales.*, things.*
> FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
NULL NULL 1 Scarf
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
hive> SELECT sales.*, things.*
> FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
Ali 0 NULL NULL
NULL NULL 1 Scarf
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
Semi joins
IN, LEFT SEMI JOIN
IN, things
sales
SELECT *
FROM things
WHERE things.id IN (SELECT id from sales);
hive> SELECT *
> FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
2 Tie
3 Hat
4 Coat
LEFT SEMI JOIN
sales, ON, SELECT
Map joins
hive> SELECT /*+ MAPJOIN(things) */ sales.*, things.*
> FROM sales JOIN things ON (sales.id = things.id);
Joe 2 2 Tie
Hank 4 4 Coat
Eve 3 3 Hat
Hank 2 2 Tie
RIGHT, FULL OUTER JOIN
SET hive.optimize.bucketmapjoin=true;
Subqueries
SELECT
FROM
SELECT
SELECT
WHERE
IN
SELECT station, year, AVG(max_temperature)
FROM (
SELECT station, year, MAX(temperature) AS max_temperature
FROM records2
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
GROUP BY station, year
) mt
GROUP BY station, year;
AVG
mt
Views
SELECT
SELECT
quality
CREATE VIEW valid_records
AS
SELECT *
FROM records2
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9);
SHOW TABLES
DESCRIBE
EXTENDED view_name
valid_records
CREATE VIEW max_temperatures (station, year, max_temperature)
AS
SELECT station, year, MAX(temperature)
FROM valid_records
GROUP BY station, year;
_c2, AS
SELECT
SELECT station, year, AVG(max_temperature)
FROM max_temperatures
GROUP BY station, year;
GROUP BY
User-Defined Functions
SELECT TRANSFORM
COUNT, MAX
x
CREATE TABLE arrays (x ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002';
ROW FORMAT
^B
a^Bb
c^Bd^Be
LOAD DATA
hive> SELECT * FROM arrays;
["a","b"]
["c","d","e"]
explode
y, STRING
hive> SELECT explode(x) AS y FROM arrays;
a
b
c
d
e
SELECT
LATERAL VIEW
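A sketch of the LATERAL VIEW form mentioned here, which makes the exploded values available alongside the other columns of the query (the aliases exploded and y are illustrative):
hive> SELECT x, y FROM arrays LATERAL VIEW explode(x) exploded AS y;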
Writing a UDF
trim
strip, Strip
package com.hadoopbook.hive;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
public class Strip extends UDF {
private Text result = new Text();
public Text evaluate(Text str) {
if (str == null) {
return null;
}
result.set(StringUtils.strip(str.toString()));
return result;
}
public Text evaluate(Text str, String stripChars) {
if (str == null) {
return null;
}
result.set(StringUtils.strip(str.toString(), stripChars));
return result;
}
}
org.apache.hadoop.hive.ql.exec.UDF
evaluate()
evaluate()
evaluate()
Strip, evaluate()
StringUtils
Text
java.util.List, java.util.Map
public String evaluate(String str)
Text
ant hive
ADD JAR /path/to/hive-examples.jar;
CREATE TEMPORARY FUNCTION strip AS 'com.hadoopbook.hive.Strip';
TEMPORARY
ADD JAR
--auxpath
% hive --auxpath /path/to/hive-examples.jar
HIVE_AUX_JARS_PATH
hive> SELECT strip(' bee ') FROM dummy;
bee
hive> SELECT strip('banana', 'ab') FROM dummy;
nan
hive> SELECT STRIP(' bee ') FROM dummy;
bee
Writing a UDAF
package com.hadoopbook.hive;
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;
public class Maximum extends UDAF {
public static class MaximumIntUDAFEvaluator implements UDAFEvaluator {
private IntWritable result;
public void init() {
result = null;
}
public boolean iterate(IntWritable value) {
if (value == null) {
return true;
}
if (result == null) {
result = new IntWritable(value.get());
} else {
result.set(Math.max(result.get(), value.get()));
}
return true;
}
public IntWritable terminatePartial() {
return result;
}
public boolean merge(IntWritable other) {
return iterate(other);
}
public IntWritable terminate() {
return result;
}
}
}
org.apache.hadoop.hive.ql.exec.UDAF
org.apache.hadoop.hive.ql.exec.UDAFEvaluator, MaximumIntUDAFEvaluator
MaximumLongUDAFEvaluator, MaximumFloatUDAFEvaluator
init()
init()
MaximumIntUDAFEvaluator, IntWritable
null, null
NULL
iterate()
iterate()
iterate()
null
result, value
value
true
terminatePartial()
terminatePartial()
IntWritable
null
merge()
merge()
terminatePartial()
merge(), iterate()
terminate()
terminate()
result
hive> CREATE TEMPORARY FUNCTION maximum AS 'com.hadoopbook.hive.Maximum';
hive> SELECT maximum(temperature) FROM records;
111
A more complex UDAF
IntWritable
PartialResult
merge(), iterate()
terminatePartial(), PartialResult
terminate(), DoubleWritable
package com.hadoopbook.hive;
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;
public class Mean extends UDAF {
public static class MeanDoubleUDAFEvaluator implements UDAFEvaluator {
public static class PartialResult {
double sum;
long count;
}
private PartialResult partial;
public void init() {
partial = null;
}
public boolean iterate(DoubleWritable value) {
if (value == null) {
return true;
}
if (partial == null) {
partial = new PartialResult();
}
partial.sum += value.get();
partial.count++;
return true;
}
public PartialResult terminatePartial() {
return partial;
}
public boolean merge(PartialResult other) {
if (other == null) {
return true;
}
if (partial == null) {
partial = new PartialResult();
}
partial.sum += other.sum;
partial.count += other.count;
return true;
}
public DoubleWritable terminate() {
if (partial == null) {
return null;
}
return new DoubleWritable(partial.sum / partial.count);
}
}
}
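Registering and using the UDAF follows the same pattern as for Maximum; as a sketch (the JAR path is a placeholder):
hive> ADD JAR /path/to/hive-examples.jar;
hive> CREATE TEMPORARY FUNCTION mean AS 'com.hadoopbook.hive.Mean';
hive> SELECT mean(temperature) FROM records;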
CHAPTER 13
HBase
HBasics
Backdrop
Concepts
Whirlwind Tour of the Data Model
:
Regions
Locking
Implementation
HBase in operation
-ROOT-, .META.
-ROOT-, .META., .META.
-ROOT-
-ROOT-, .META.
.META.
-ROOT-, .META.
TestTable xyz TestTable,xyz,
1279729913622.1b6e176fb8d8aa88fd4ab6bc80247ece
.META.
.META., .META.
-ROOT-
Installation
% tar xzf hbase-x.y.z.tar.gz
JAVA_HOME
JAVA_HOME
% export HBASE_HOME=/home/hbase/hbase-x.y.z
% export PATH=$PATH:$HBASE_HOME/bin
% hbase
Usage: hbase <command>
where <command> is one of:
shell run the HBase shell
master run an HBase HMaster node
regionserver run an HBase HRegionServer node
zookeeper run a Zookeeper server
rest run an HBase REST server
thrift run an HBase Thrift server
avro run an HBase Avro server
migrate upgrade an hbase.rootdir
hbck run the hbase 'fsck' tool
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
Test Drive
% start-hbase.sh
% hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.89.0-SNAPSHOT, ra4ea1a9a7b074a2e5b7b24f761302d4ea28ed1b2, Sun Jul 18
15:01:50 PDT 2010 hbase(main):001:0>
help
help COMMAND_GROUPhelp COMMAND
disable
alter, enable
test, data
hbase(main):007:0> create 'test', 'data'
0 row(s) in 1.3066 seconds
help
list
hbase(main):019:0> list
test
1 row(s) in 0.1485 seconds
data
hbase(main):021:0> put 'test', 'row1', 'data:1', 'value1'
0 row(s) in 0.0454 seconds
hbase(main):022:0> put 'test', 'row2', 'data:2', 'value2'
0 row(s) in 0.0035 seconds
hbase(main):023:0> put 'test', 'row3', 'data:3', 'value3'
0 row(s) in 0.0090 seconds
hbase(main):024:0> scan 'test'
ROW COLUMN+CELL
row1 column=data:1, timestamp=1240148026198, value=value1
row2 column=data:2, timestamp=1240148040035, value=value2
row3 column=data:3, timestamp=1240148047497, value=value3
3 row(s) in 0.0825 seconds
hbase(main):025:0> disable 'test'
09/04/19 06:40:13 INFO client.HBaseAdmin: Disabled test
0 row(s) in 6.0426 seconds
hbase(main):026:0> drop 'test'
09/04/19 06:40:17 INFO client.HBaseAdmin: Deleted test
0 row(s) in 0.0210 seconds
hbase(main):027:0> list
0 row(s) in 2.0645 seconds
% stop-hbase.sh
Clients
Java
public class ExampleClient {
public static void main(String[] args) throws IOException {
Configuration config = HBaseConfiguration.create();
// Create table
HBaseAdmin admin = new HBaseAdmin(config);
HTableDescriptor htd = new HTableDescriptor("test");
HColumnDescriptor hcd = new HColumnDescriptor("data");
htd.addFamily(hcd);
admin.createTable(htd);
byte [] tablename = htd.getName();
HTableDescriptor [] tables = admin.listTables();
if (tables.length != 1 || !Bytes.equals(tablename, tables[0].getName())) {
throw new IOException("Failed create of table");
}
// Run some operations -- a put, a get, and a scan -- against the table.
HTable table = new HTable(config, tablename);
byte [] row1 = Bytes.toBytes("row1");
Put p1 = new Put(row1);
byte [] databytes = Bytes.toBytes("data");
p1.add(databytes, Bytes.toBytes("1"), Bytes.toBytes("value1"));
table.put(p1);
Get g = new Get(row1);
Result result = table.get(g);
System.out.println("Get: " + result);
Scan scan = new Scan();
ResultScanner scanner = table.getScanner(scan);
try {
for (Result scannerResult: scanner) {
System.out.println("Scan: " + scannerResult);
}
} finally {
scanner.close();
}
// Drop the table
admin.disableTable(tablename);
admin.deleteTable(tablename);
}
}
main
org.apache.hadoop.conf.Configuration, org.apache.hadoop.hbase.HBaseConfiguration, Configuration
Configuration
HBaseAdmin, HTable, org.apache.hadoop.hbase.client
HBaseAdmin
HTable, Configuration
HBaseAdmin
test, data
org.apache.hadoop.hbase.HTableDescriptor, org.apache.hadoop.hbase.HColumnDescriptor
org.apache.hadoop.hbase.client.HTable, Configuration
HTable
org.apache.hadoop.hbase.client, Put, value1
row1, data:1
databytes
Bytes.toBytes("1"), org.apache.hadoop.hbase.client.Get, org.apache.hadoop.hbase.client.Scan
MapReduce
org.apache.hadoop.hbase.mapreduce
TableInputFormat
TableOutputFormat, RowCounter
mapreduce
TableInputFormat
public class RowCounter {
/** Name of this 'program'. */
static final String NAME = "rowcounter";
static class RowCounterMapper
extends TableMapper<ImmutableBytesWritable, Result> {
/** Counter enumeration to count the actual rows. */
public static enum Counters {ROWS}
@Override
public void map(ImmutableBytesWritable row, Result values,
Context context)
throws IOException {
for (KeyValue value: values.list()) {
if (value.getValue().length > 0) {
context.getCounter(Counters.ROWS).increment(1);
break;
}
}
}
}
public static Job createSubmittableJob(Configuration conf, String[] args)
throws IOException {
String tableName = args[0];
Job job = new Job(conf, NAME + "_" + tableName);
job.setJarByClass(RowCounter.class);
// Columns are space delimited
StringBuilder sb = new StringBuilder();
final int columnoffset = 1;
for (int i = columnoffset; i < args.length; i++) {
if (i > columnoffset) {
sb.append(" ");
}
sb.append(args[i]);
}
Scan scan = new Scan();
scan.setFilter(new FirstKeyOnlyFilter());
if (sb.length() > 0) {
for (String columnName :sb.toString().split(" ")) {
String [] fields = columnName.split(":");
if(fields.length == 1) {
scan.addFamily(Bytes.toBytes(fields[0]));
} else {
scan.addColumn(Bytes.toBytes(fields[0]), Bytes.toBytes(fields[1]));
}
}
}
// Second argument is the table name.
job.setOutputFormatClass(NullOutputFormat.class);
TableMapReduceUtil.initTableMapperJob(tableName, scan,
RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);
job.setNumReduceTasks(0);
return job;
}
public static void main(String[] args) throws Exception {
Configuration conf = HBaseConfiguration.create();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length < 1) {
System.err.println("ERROR: Wrong number of parameters: " + args.length);
System.err.println("Usage: RowCounter <tablename> [<column1> <column2>...]");
System.exit(-1);
}
Job job = createSubmittableJob(conf, otherArgs);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
GenericOptionsParser
RowCounterMapper, TableMapper
org.apache.hadoop.mapreduce.Mapper
TableInputFormat, createSubmittableJob()
RowCounter
org.apache.hadoop.hbase.client.Scan
TableInputFormat, Mapper
org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter
createSubmittableJob(), TableMapReduceUtil.initTableMapperJob()
TableInputFormat
Counters.ROWS
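The class can be run with the hbase script's CLASSNAME form shown earlier; as a sketch, assuming the copy that ships with HBase in the org.apache.hadoop.hbase.mapreduce package (the table name test is only an illustration):
% hbase org.apache.hadoop.hbase.mapreduce.RowCounter test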
Avro, REST, and Thrift
REST
% hbase-daemon.sh start rest
% hbase-daemon.sh stop rest
Thrift
% hbase-daemon.sh start thrift
% hbase-daemon.sh stop thrift
Avro
Example
Schemas
stations
stationid
info
info:name info:location
info:description
observations
stationid, data
airtemp
stations, stationid
observations
Long.MAX_VALUE - epoch
hbase(main):036:0> create 'stations', {NAME => 'info', VERSIONS => 1}
0 row(s) in 0.1304 seconds
hbase(main):037:0> create 'observations', {NAME => 'data', VERSIONS => 1}
0 row(s) in 0.1332 seconds
VERSIONS
1
Loading Data
public class HBaseTemperatureImporter extends Configured implements Tool {
// Inner-class for map
static class HBaseTemperatureMapper<K, V> extends MapReduceBase implements
Mapper<LongWritable, Text, K, V> {
private NcdcRecordParser parser = new NcdcRecordParser();
private HTable table;
public void map(LongWritable key, Text value,
OutputCollector<K, V> output, Reporter reporter)
throws IOException {
parser.parse(value.toString());
if (parser.isValidTemperature()) {
byte[] rowKey = RowKeyConverter.makeObservationRowKey(parser.getStationId(),
parser.getObservationDate().getTime());
Put p = new Put(rowKey);
p.add(HBaseTemperatureCli.DATA_COLUMNFAMILY,
HBaseTemperatureCli.AIRTEMP_QUALIFIER,
Bytes.toBytes(parser.getAirTemperature()));
table.put(p);
}
}
public void configure(JobConf jc) {
super.configure(jc);
// Create the HBase table client once up-front and keep it around
// rather than create on each map invocation.
try {
this.table = new HTable(new HBaseConfiguration(jc), "observations");
} catch (IOException e) {
throw new RuntimeException("Failed HTable construction", e);
}
}
@Override
public void close() throws IOException {
super.close();
table.close();
}
}
public int run(String[] args) throws IOException {
if (args.length != 1) {
System.err.println("Usage: HBaseTemperatureImporter <input>");
return -1;
}
JobConf jc = new JobConf(getConf(), getClass());
FileInputFormat.addInputPath(jc, new Path(args[0]));
jc.setMapperClass(HBaseTemperatureMapper.class);
jc.setNumReduceTasks(0);
jc.setOutputFormat(NullOutputFormat.class);
JobClient.runJob(jc);
return 0;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new HBaseConfiguration(),
new HBaseTemperatureImporter(), args);
System.exit(exitCode);
}
}
HBaseTemperatureImporterHBaseTemperatureMapper
MaxTemperatureMapperTool
HBaseTemperatureMapperHBaseTemperatureMap
perMaxTemperatureMapper
NcdcRecordParser
MaxTemperatureMapper
observations
dataairtemp
HBaseTemperatureCliconfigure()
HTable observations
close()HTable
makeObservationRowKey() RowKeyConverter
public class RowKeyConverter {
private static final int STATION_ID_LENGTH = 12;
/**
* @return A row key whose format is: <station_id> <reverse_order_epoch>
*/
public static byte[] makeObservationRowKey(String stationId,
long observationTime) {
byte[] row = new byte[STATION_ID_LENGTH + Bytes.SIZEOF_LONG];
Bytes.putBytes(row, 0, Bytes.toBytes(stationId), 0, STATION_ID_LENGTH);
long reverseOrderEpoch = Long.MAX_VALUE - observationTime;
Bytes.putLong(row, STATION_ID_LENGTH, reverseOrderEpoch);
return row;
}
}
BytesmakeObservationRowKey()
makeObservationRowKey()Bytes.putLong()
Bytes.SIZEOF_LONG
% hbase HBaseTemperatureImporter input/ncdc/all
Optimization notes
stationid
TextInputFormat
HTable
HTable
HTableconfigure()
HTable.put(put)
HTable HTable.setAutoFlush(false)
HTable.flushCommits()HTable.close()
HTable.flushCommits()
close()
TableInputFormatTableOutputFormat
MaxTemperatureMapper
MaxTemperatureMapper
TableOutputFormat
Web Queries
info
public Map<String, String> getStationInfo(HTable table, String stationId)
throws IOException {
Get get = new Get(Bytes.toBytes(stationId));
get.addColumn(INFO_COLUMNFAMILY);
Result res = table.get(get);
if (res == null) {
return null;
}
Map<String, String> resultMap = new HashMap<String, String>();
resultMap.put("name", getValue(res, INFO_COLUMNFAMILY, NAME_QUALIFIER));
resultMap.put("location", getValue(res, INFO_COLUMNFAMILY, LOCATION_QUALIFIER));
resultMap.put("description", getValue(res, INFO_COLUMNFAMILY,
DESCRIPTION_QUALIFIER));
return resultMap;
}
private static String getValue(Result res, byte [] cf, byte [] qualifier) {
byte [] value = res.getValue(cf, qualifier);
return value == null ? "" : Bytes.toString(value);
}
getStationInfo()HTable
HTable.get(),Get
INFO_COLUMNFAMILY
get()Result
getStationInfo()
ResultMapMapString
Result
Map<ObservationTime, ObservedTemp>
NavigableMap<Long, Integer>descendingMap()
public NavigableMap<Long, Integer> getStationObservations(HTable table,
String stationId, long maxStamp, int maxCount) throws IOException {
byte[] startRow = RowKeyConverter.makeObservationRowKey(stationId, maxStamp);
NavigableMap<Long, Integer> resultMap = new TreeMap<Long, Integer>();
Scan scan = new Scan(startRow);
scan.addColumn(DATA_COLUMNFAMILY, AIRTEMP_QUALIFIER);
ResultScanner scanner = table.getScanner(scan);
Result res = null;
int count = 0;
try {
while ((res = scanner.next()) != null && count++ < maxCount) {
byte[] row = res.getRow();
byte[] value = res.getValue(DATA_COLUMNFAMILY, AIRTEMP_QUALIFIER);
Long stamp = Long.MAX_VALUE -
Bytes.toLong(row, row.length - Bytes.SIZEOF_LONG, Bytes.SIZEOF_LONG);
Integer temp = Bytes.toInt(value);
resultMap.put(stamp, temp);
}
} finally {
scanner.close();
}
return resultMap;
}
/**
* Return the last ten observations.
*/
public NavigableMap<Long, Integer> getStationObservations(HTable table,
String stationId) throws IOException {
return getStationObservations(table, stationId, Long.MAX_VALUE, 10);
}
getStationObservations()max
StampmaxCountNavigableMap
NavigableMap.descendingMap()
Scanners
HTable.getScanner(scan)
scanScanScan
ResultScanner
HTable.getScanner()
public interface ResultScanner extends Closeable, Iterable<Result> {
public Result next() throws IOException;
public Result [] next(int nbRows) throws IOException;
public void close();
}
next()
hbase.client.scanner.caching
Scan
hbase.regionserver.lease.periodUnknownScannerException
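Scanner caching can also be set per scan rather than through the hbase.client.scanner.caching client property. A sketch (using the same imports and column constants as the earlier examples, and assuming an open HTable for the observations table):

public static void scanTemperatures(HTable table, byte[] startRow) throws IOException {
  Scan scan = new Scan(startRow);
  scan.setCaching(500); // rows fetched per next() RPC, instead of the configured default
  scan.addColumn(DATA_COLUMNFAMILY, AIRTEMP_QUALIFIER);
  ResultScanner scanner = table.getScanner(scan);
  try {
    for (Result result : scanner) { // ResultScanner is Iterable<Result>
      // process each row, keeping the time per batch within the scanner lease period
    }
  } finally {
    scanner.close(); // release the server-side scanner promptly
  }
}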
Long.MAX_VALUE - stamp
HBase Versus RDBMS
Successful Service
HBase
Use Case: HBase at Streamy.com
Very large items tables
Very large sort merges
SELECT id, stamp, type FROM streams
WHERE type IN ('type1','type2','type3','type4',...,'typeN')
ORDER BY stamp DESC LIMIT 10 OFFSET 0;
id stamp type
MERGE (
SELECT id, stamp, type FROM streams
WHERE type = 'type1' ORDER BY stamp DESC,
...,
SELECT id, stamp, type FROM streams
WHERE type = 'typeN' ORDER BY stamp DESC
) ORDER BY stamp DESC LIMIT 10 OFFSET 0;
SELECT id, stamp, type FROM streams
WHERE type = 'typeN'
ORDER BY stamp DESC LIMIT 1 OFFSET 0;
typeN
SELECT id, stamp, type FROM streams
WHERE type = 'typeN'
ORDER BY stamp DESC LIMIT 1 OFFSET 1;
Life with HBase
Praxis
Versions
HDFS
dfs.datanode.max.xcievers
UI
Metrics
Schema Design
Joins
Row keys
Counters
stumbleupon.com
incrementColumnValue() org.apache.hadoop.hbase.HTable
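As a small sketch of the counters feature (the table, row, family, and qualifier names below are placeholders, not from the book):

HTable table = new HTable(HBaseConfiguration.create(), "counters");
long hits = table.incrementColumnValue(
    Bytes.toBytes("page-0001"), // row
    Bytes.toBytes("daily"),     // column family
    Bytes.toBytes("hits"),      // qualifier
    1);                         // amount to add; applied atomically on the regionserver
table.close();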
Bulk Load
CHAPTER 14
ZooKeeper
Installing and Running ZooKeeper
% tar xzf zookeeper-x.y.z.tar.gz
% export ZOOKEEPER_INSTALL=/home/tom/zookeeper-x.y.z
% export PATH=$PATH:$ZOOKEEPER_INSTALL/bin
ZOOCFGDIR
tickTime=2000
dataDir=/Users/tom/zookeeper
clientPort=2181
tickTimedataDir
clientPort
dataDir
% zkServer.sh start
ruok
nctelnet
% echo ruok | nc localhost 2181
imok
Category            Command   Description
Server status       ruok      Prints imok if the server is running and not in an error state.
                    conf      Prints the server configuration (from zoo.cfg).
                    envi      Prints the server environment, including ZooKeeper version, Java version and other system properties.
                    srvr      Prints server statistics, including latency statistics, the number of znodes, and the server mode (standalone, leader or follower).
                    stat      Prints server statistics and connected clients.
                    srst      Resets server statistics.
                    isro      Shows whether the server is in read-only (ro) mode (due to a network partition) or read-write mode (rw).
Client connections  dump      Lists all the sessions and ephemeral znodes for the ensemble. You must connect to the leader (see srvr) for this command.
                    cons      Lists connection statistics for all the server's clients.
                    crst      Resets connection statistics.
Watches             wchs      Lists summary information for the server's watches.
                    wchc      Lists all the server's watches by connection. Caution: may impact server performance for large number of watches.
                    wchp      Lists all the server's watches by znode path. Caution: may impact server performance for large number of watches.
Monitoring          mntr      Lists server statistics in Java Properties format, suitable as a source for monitoring systems such as Ganglia and Nagios.
mntr
An Example
Group Membership in ZooKeeper
Creating the Group
public class CreateGroup implements Watcher {
private static final int SESSION_TIMEOUT = 5000;
private ZooKeeper zk;
private CountDownLatch connectedSignal = new CountDownLatch(1);
public void connect(String hosts) throws IOException, InterruptedException {
zk = new ZooKeeper(hosts, SESSION_TIMEOUT, this);
connectedSignal.await();
}
@Override
public void process(WatchedEvent event) { // Watcher interface
if (event.getState() == KeeperState.SyncConnected) {
connectedSignal.countDown();
}
}
public void create(String groupName) throws KeeperException,
InterruptedException {
String path = "/" + groupName;
String createdPath = zk.create(path, null/*data*/, Ids.OPEN_ACL_UNSAFE,
CreateMode.PERSISTENT);
System.out.println("Created " + createdPath);
}
public void close() throws InterruptedException {
zk.close();
}
public static void main(String[] args) throws Exception {
CreateGroup createGroup = new CreateGroup();
createGroup.connect(args[0]);
createGroup.create(args[1]);
createGroup.close();
}
}
main()CreateGroup
connect()ZooKeeper
WatcherWatcher
CreateGroupWatcherZooKeeper
ZooKeeper
ZooKeeper
CountDownLatchjava.util.concurrent
ZooKeeperWatcherWatcher
public void process(WatchedEvent event);
Watcher
process()
Watcher.Event.KeeperState
SyncConnected CountDownLatch countDown()
countDown() await()
connect()
CreateGroupcreate()
create()ZooKeeper
null
create()
create()
% export CLASSPATH=ch14/target/classes/:$ZOOKEEPER_INSTALL/*:$ZOOKEEPER_INSTALL/lib/*:\
$ZOOKEEPER_INSTALL/conf
% java CreateGroup localhost zoo
Created /zoo
Joining a Group
JoinGroup
ZooKeeper
ConnectionWatcher
public class JoinGroup extends ConnectionWatcher {
public void join(String groupName, String memberName) throws KeeperException,
InterruptedException {
String path = "/" + groupName + "/" + memberName;
String createdPath = zk.create(path, null/*data*/, Ids.OPEN_ACL_UNSAFE,
CreateMode.EPHEMERAL);
System.out.println("Created " + createdPath);
}
public static void main(String[] args) throws Exception {
JoinGroup joinGroup = new JoinGroup();
joinGroup.connect(args[0]);
joinGroup.join(args[1], args[2]);
// stay alive until process is killed or thread is interrupted
Thread.sleep(Long.MAX_VALUE);
}
}
public class ConnectionWatcher implements Watcher {
private static final int SESSION_TIMEOUT = 5000;
protected ZooKeeper zk;
private CountDownLatch connectedSignal = new CountDownLatch(1);
public void connect(String hosts) throws IOException, InterruptedException {
zk = new ZooKeeper(hosts, SESSION_TIMEOUT, this);
connectedSignal.await();
}
@Override
public void process(WatchedEvent event) {
if (event.getState() == KeeperState.SyncConnected) {
connectedSignal.countDown();
}
}
public void close() throws InterruptedException {
zk.close();
}
}
JoinGroupCreateGroup
join()
Listing Members in a Group
public class ListGroup extends ConnectionWatcher {
public void list(String groupName) throws KeeperException,
InterruptedException {
String path = "/" + groupName;
try {
List<String> children = zk.getChildren(path, false);
if (children.isEmpty()) {
System.out.printf("No members in group %s\n", groupName);
System.exit(1);
}
for (String child : children) {
System.out.println(child);
}
} catch (KeeperException.NoNodeException e) {
System.out.printf("Group %s does not exist\n", groupName);
System.exit(1);
}
}
public static void main(String[] args) throws Exception {
ListGroup listGroup = new ListGroup();
listGroup.connect(args[0]);
listGroup.list(args[1]);
listGroup.close();
}
}
list()getChildren()
Watcher
KeeperException.NoNodeException
ListGroupzoo
% java ListGroup localhost zoo
No members in group zoo
JoinGroup
% java JoinGroup localhost zoo duck &
% java JoinGroup localhost zoo cow &
% java JoinGroup localhost zoo goat &
% goat_pid=$!
goat
% java ListGroup localhost zoo
goat
duck
cow
% kill $goat_pid
% java ListGroup localhost zoo
duck
cow
ZooKeeper command-line tools
% zkCli.sh localhost ls /zoo
Processing ls
WatchedEvent: Server state change. New state: SyncConnected
[duck, cow]
Deleting a Group
ZooKeeper
delete()
–1
DeleteGroup
public class DeleteGroup extends ConnectionWatcher {
public void delete(String groupName) throws KeeperException,
InterruptedException {
String path = "/" + groupName;
try {
List<String> children = zk.getChildren(path, false);
for (String child : children) {
zk.delete(path + "/" + child, -1);
}
zk.delete(path, -1);
} catch (KeeperException.NoNodeException e) {
System.out.printf("Group %s does not exist\n", groupName);
System.exit(1);
}
}
public static void main(String[] args) throws Exception {
DeleteGroup deleteGroup = new DeleteGroup();
deleteGroup.connect(args[0]);
deleteGroup.delete(args[1]);
deleteGroup.close();
}
}
zoo
% java DeleteGroup localhost zoo
% java ListGroup localhost zoo
Group zoo does not exist
The ZooKeeper Service
Data Model
java.lang.StringPathjava.net.URI
Ephemeral znodes
Sequence numbers
create()
Watches
exists
exists
exists
Operations
Operation          Description
create             Creates a znode (the parent znode must already exist)
delete             Deletes a znode (the znode must not have any children)
exists             Tests whether a znode exists and retrieves its metadata
getACL, setACL     Gets/sets the ACL for a znode
getChildren        Gets a list of the children of a znode
getData, setData   Gets/sets the data associated with a znode
sync               Synchronizes a client's view of a znode with ZooKeeper
deletesetData
exists
syncfsync()
sync
Multiupdate
multi
multi
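As an illustration (not from the book), a multi call in ZooKeeper 3.4 or later groups several operations into one atomic unit; the paths and data below are invented, and the snippet assumes org.apache.zookeeper.Op, org.apache.zookeeper.OpResult, and java.util.Arrays are imported:

public void updateBothNodes(ZooKeeper zk) throws KeeperException, InterruptedException {
  List<OpResult> results = zk.multi(Arrays.asList(
      Op.setData("/config/master", "host1:2181".getBytes(), -1),     // -1: skip the version check
      Op.setData("/config/standby", "host2:2181".getBytes(), -1)));  // applied atomically with the first
}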
APIs
contrib
exists
Stat
null
public Stat exists(String path, Watcher watcher) throws KeeperException,
InterruptedException
ZooKeeper
public void exists(String path, Watcher watcher, StatCallback cb, Object ctx)
void
StatCallback
public void processResult(int rc, String path, Object ctx, Stat stat);
rcKeeperEx
ceptionstat
nullpathctx
exists()
ctx
null
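A sketch of the asynchronous form in use (not from the book): the call returns immediately, and processResult() is later invoked on the event thread with the return code, the path, the ctx object that was passed in, and the Stat (null if the znode does not exist).

zk.exists("/zoo", false, new AsyncCallback.StatCallback() {
  @Override
  public void processResult(int rc, String path, Object ctx, Stat stat) {
    if (rc == KeeperException.Code.OK.intValue()) {
      System.out.printf("%s exists at version %d (ctx=%s)\n", path, stat.getVersion(), ctx);
    } else if (rc == KeeperException.Code.NONODE.intValue()) {
      System.out.printf("%s does not exist\n", path);
    } else {
      System.err.println("exists failed: " + KeeperException.Code.get(rc));
    }
  }
}, "request-42"); // an arbitrary context object, handed back to the callback unchanged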
zookeeper_st
pthread
zookeeper_mt
Should I Use the Synchronous or Asynchronous API?
Watch triggers
existsgetChildrengetData
createdeletesetData
exists
getData
getData
getChildren
NodeDeletedNodeChildrenChanged
Watch creation   create znode   delete znode   setData           create child          delete child
exists           NodeCreated    NodeDeleted    NodeDataChanged
getData                         NodeDeleted    NodeDataChanged
getChildren                     NodeDeleted                      NodeChildrenChanged   NodeChildrenChanged
NodeCreatedNodeDeleted
Node
ChildrenChangedgetChildren
NodeDataChanged
getData
ACLs
digest
sasl
ip
digest
zk.addAuthInfo("digest", "tom:secret".getBytes());
10.0.0.1ip
10.0.0.1READACL
new ACL(Perms.READ,
new Id("ip", "10.0.0.1"));
exists
existsStat
ACL permission   Permitted operations
CREATE           create (a child znode)
READ             getChildren, getData
WRITE            setData
DELETE           delete (a child znode)
ADMIN            setACL
ZooDefs.Ids
OPEN_ACL_UNSAFEADMIN
Implementation
Does ZooKeeper Use Paxos?
Consistency
leaderServes no
sync
sync
sync
sync
sync
Sessions
Time
States
ZooKeeper
getState()
public States getState()
StatesZooKeeper
ZooKeeper
ZooKeeper CONNECTING
CONNECTED
ZooKeeper
WatcherCONNECTED
WatchedEventKeeperStateSyncConnected
Watcher
ZooKeeper
Watcher
ZooKeeper
CONNECTEDCONNECTING
DisconnectedZooKeeper
ZooKeeperCLOSEDclose()
KeeperState
ExpiredCLOSEDZooKeeper
isAlive()States
ZooKeeper
Building Applications with ZooKeeper
A Configuration Service
ActiveKeyValueStore
public class ActiveKeyValueStore extends ConnectionWatcher {
private static final Charset CHARSET = Charset.forName("UTF-8");
public void write(String path, String value) throws InterruptedException,
KeeperException {
Stat stat = zk.exists(path, false);
if (stat == null) {
zk.create(path, value.getBytes(CHARSET), Ids.OPEN_ACL_UNSAFE,
CreateMode.PERSISTENT);
} else {
zk.setData(path, value.getBytes(CHARSET), -1);
}
}
}
write()
exists
getBytes()
ActiveKeyValueStoreConfigUpdater
public class ConfigUpdater {
public static final String PATH = "/config";
private ActiveKeyValueStore store;
private Random random = new Random();
public ConfigUpdater(String hosts) throws IOException, InterruptedException {
store = new ActiveKeyValueStore();
store.connect(hosts);
}
public void run() throws InterruptedException, KeeperException {
while (true) {
String value = random.nextInt(100) + "";
store.write(PATH, value);
System.out.printf("Set %s to %s\n", PATH, value);
TimeUnit.SECONDS.sleep(random.nextInt(10));
}
}
public static void main(String[] args) throws Exception {
ConfigUpdater configUpdater = new ConfigUpdater(args[0]);
configUpdater.run();
}
}
ConfigUpdaterActiveKeyValueStore
ConfigUpdaterrun()
ActiveKeyValueStore
public String read(String path, Watcher watcher) throws InterruptedException,
KeeperException {
byte[] data = zk.getData(path, watcher, null/*stat*/);
return new String(data, CHARSET);
}
getData()WatcherStat
StatgetData()
Stat
ConfigWatcherActiveKey
ValueStoreread()displayConfig()
public class ConfigWatcher implements Watcher {
private ActiveKeyValueStore store;
public ConfigWatcher(String hosts) throws IOException, InterruptedException {
store = new ActiveKeyValueStore();
store.connect(hosts);
}
public void displayConfig() throws InterruptedException, KeeperException {
String value = store.read(ConfigUpdater.PATH, this);
System.out.printf("Read %s as %s\n", ConfigUpdater.PATH, value);
}
@Override
public void process(WatchedEvent event) {
if (event.getType() == EventType.NodeDataChanged) {
try {
displayConfig();
} catch (InterruptedException e) {
System.err.println("Interrupted. Exiting.");
Thread.currentThread().interrupt();
} catch (KeeperException e) {
System.err.printf("KeeperException: %s. Exiting.\n", e);
}
}
}
public static void main(String[] args) throws Exception {
ConfigWatcher configWatcher = new ConfigWatcher(args[0]);
configWatcher.displayConfig();
// stay alive until process is killed or thread is interrupted
Thread.sleep(Long.MAX_VALUE);
}
}
ConfigUpdater
EventType.NodeDataChangedConfigWatcher
process()
read()ActiveKeyValueStore
ConfigUpdater
% java ConfigUpdater localhost
Set /config to 79
Set /config to 14
Set /config to 78
ConfigWatcher
% java ConfigWatcher localhost
Read /config as 79
Read /config as 14
Read /config as 78
The Resilient ZooKeeper Application
InterruptedExceptionKeeperException
InterruptedException
InterruptedException
interrupt()
InterruptedException
InterruptedException
InterruptedException
KeeperException
KeeperException
KeeperException KeeperException.NoNodeException
KeeperException
KeeperException
KeeperException.NoNodeException KeeperException.Code.NONODE
KeeperExceptionKeeperException
KeeperException
KeeperException
setData
KeeperException.BadVersionException
KeeperException.NoChildrenForEphemeralsException
State exceptions.
KeeperException.ConnectionLossException
KeeperException.ConnectionLossException
setData
KeeperException.SessionExpiredException KeeperException.AuthFailedException
A reliable configuration service
write()ActiveKeyValueStore
existscreatesetData
public void write(String path, String value) throws InterruptedException,
KeeperException {
Stat stat = zk.exists(path, false);
if (stat == null) {
zk.create(path, value.getBytes(CHARSET), Ids.OPEN_ACL_UNSAFE,
CreateMode.PERSISTENT);
} else {
zk.setData(path, value.getBytes(CHARSET), -1);
}
}
write()
write()
Recoverable exceptions.
Unrecoverable exceptions.
MAX_RETRIES
RETRY_PERIOD_SECONDS
public void write(String path, String value) throws InterruptedException,
KeeperException {
int retries = 0;
while (true) {
try {
Stat stat = zk.exists(path, false);
if (stat == null) {
zk.create(path, value.getBytes(CHARSET), Ids.OPEN_ACL_UNSAFE,
CreateMode.PERSISTENT);
} else {
zk.setData(path, value.getBytes(CHARSET), stat.getVersion());
}
return;
} catch (KeeperException.SessionExpiredException e) {
throw e;
} catch (KeeperException e) {
if (retries++ == MAX_RETRIES) {
throw e;
}
// sleep then retry
TimeUnit.SECONDS.sleep(RETRY_PERIOD_SECONDS);
}
}
}
KeeperException.SessionExpiredException
ZooKeeperCLOSED
ZooKeeperwrite()
ConfigUpdater
ResilientConfigUpdater
public static void main(String[] args) throws Exception {
while (true) {
try {
ResilientConfigUpdater configUpdater =
new ResilientConfigUpdater(args[0]);
configUpdater.run();
} catch (KeeperException.SessionExpiredException e) {
// start a new session
} catch (KeeperException e) {
// already retried, so exit
e.printStackTrace();
break;
}
}
}
KeeperException
KeeperException.Code.SESSIONEXPIRED
KeeperState
ExpiredConnectionWatcher
write() KeeperException.SessionExpiredException
ZooKeeper
IOException
org.apache.hadoop.io.retry
A Lock Service
create
The herd effect
Recoverable exceptions
getSessionId()
ZooKeeper
<sessionId>
<sessionId><sequenceNumber>
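Pulling the recipe together, here is a deliberately simplified sketch (not the WriteLock implementation that ships with ZooKeeper): it omits retries on connection loss, the embedded session ID just described, and other refinements, but shows the ephemeral sequential znode, the check for the lowest sequence number, and the watch on the immediately preceding znode that avoids the herd effect. Imports (java.util, java.util.concurrent, and org.apache.zookeeper) are elided, following the book's convention.

public class SimpleLock {
  private final ZooKeeper zk;
  private final String lockPath; // e.g. "/mylock"; must already exist
  private String znode;          // the znode this client created

  public SimpleLock(ZooKeeper zk, String lockPath) {
    this.zk = zk;
    this.lockPath = lockPath;
  }

  public void lock() throws KeeperException, InterruptedException {
    // 1. Create an ephemeral sequential znode under the lock znode.
    znode = zk.create(lockPath + "/lock-", null, Ids.OPEN_ACL_UNSAFE,
        CreateMode.EPHEMERAL_SEQUENTIAL);
    while (true) {
      // 2. If our znode has the lowest sequence number, we hold the lock.
      List<String> children = zk.getChildren(lockPath, false);
      Collections.sort(children); // sequence numbers are zero-padded, so this is numeric order
      if (znode.endsWith(children.get(0))) {
        return;
      }
      // 3. Otherwise, watch only the znode just before ours (avoids the herd effect).
      int ourIndex = children.indexOf(znode.substring(lockPath.length() + 1));
      String previous = lockPath + "/" + children.get(ourIndex - 1);
      final CountDownLatch gone = new CountDownLatch(1);
      Stat stat = zk.exists(previous, new Watcher() {
        public void process(WatchedEvent event) {
          gone.countDown();
        }
      });
      if (stat != null) {
        gone.await(); // wait for the previous znode to go away, then re-check from the top
      }
    }
  }

  public void unlock() throws KeeperException, InterruptedException {
    zk.delete(znode, -1);
  }
}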
Unrecoverable exceptions
Implementation
WriteLock
More Distributed Data Structures and Protocols
BookKeeper and Hedwig
ZooKeeper in Production
Resilience and Performance
dataDir
dataLogDir
JVMFLAGS
Configuration
dataDir
server.n=hostname:port:port
n
tickTime=2000
dataDir=/disk1/zookeeper
dataLogDir=/disk2/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=zookeeper1:2888:3888
server.2=zookeeper2:2888:3888
server.3=zookeeper3:2888:3888
zookeeper1:2181,zookeeper2:2181,zookeeper3:2181 ZooKeeper
initLimit
syncLimittickTime
initLimit
syncLimit
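A client then connects by listing every ensemble member in the connection string; it picks one server to connect to and fails over to another if that server goes down. A sketch (the 15-second session timeout is illustrative):

public ZooKeeper connectToEnsemble() throws IOException {
  return new ZooKeeper(
      "zookeeper1:2181,zookeeper2:2181,zookeeper3:2181", // all three ensemble members
      15000,                                             // session timeout in milliseconds
      new Watcher() {
        public void process(WatchedEvent event) {
          // connection state changes and watch events are delivered here
        }
      });
}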
CHAPTER 15
Sqoop
Getting Sqoop
x.y.z$SQOOP_HOME
$SQOOP_HOME/bin/sqoop
sqoop
sqoop
% sqoop
Try sqoop help for usage.
help
% sqoop help
usage: sqoop COMMAND [ARGS]
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
See 'sqoop help COMMAND' for information on a specific command.
help
% sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]
Common arguments:
--connect <jdbc-uri> Specify JDBC connect string
--driver <class-name> Manually specify JDBC driver class to use
--hadoop-home <dir> Override $HADOOP_HOME
--help Print usage instructions
-P Read password from console
--password <password> Set authentication password
--username <username> Set authentication username
--verbose Print more information while working
...
sqoop-toolname sqoop-help sqoop-import
sqoop helpsqoop import
Sqoop Connectors
A Sample Import
sudo apt-get install mysql-client mysql-server (on Debian-based systems) or sudo yum install mysql mysql-server (on Red Hat-based systems)
% mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 349
Server version: 5.1.37-1ubuntu5.4 (Ubuntu)
Type 'help;' or '\h' for help. Type '\c' to clear the current input
statement.
mysql> CREATE DATABASE hadoopguide;
Query OK, 1 row affected (0.02 sec)
mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO '%'@'localhost';
Query OK, 0 rows affected (0.00 sec)
mysql> GRANT ALL PRIVILEGES ON hadoopguide.* TO ''@'localhost';
Query OK, 0 rows affected (0.00 sec)
mysql> quit;
Bye
hadoopguide
hadoopguide
% mysql hadoopguide
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 352
Server version: 5.1.37-1ubuntu5.4 (Ubuntu)
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> CREATE TABLE widgets(id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
-> widget_name VARCHAR(64) NOT NULL,
-> price DECIMAL(10,2),
-> design_date DATE,
-> version INT,
-> design_comment VARCHAR(100));
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'sprocket', 0.25, '2010-02-10',
-> 1, 'Connects two gizmos');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'gizmo', 4.00, '2009-11-30', 4,
-> NULL);
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'gadget', 99.99, '1983-08-13',
-> 13, 'Our flagship product');
Query OK, 1 row affected (0.00 sec)
mysql> quit;
widgets
widgets
% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1
10/06/23 14:44:18 INFO tool.CodeGenTool: Beginning code generation
...
10/06/23 14:44:20 INFO mapred.JobClient: Running job: job_201006231439_0002
10/06/23 14:44:21 INFO mapred.JobClient: map 0% reduce 0%
10/06/23 14:44:32 INFO mapred.JobClient: map 100% reduce 0%
10/06/23 14:44:34 INFO mapred.JobClient: Job complete:
job_201006231439_0002
...
10/06/23 14:44:34 INFO mapreduce.ImportJobBase: Retrieved 3 records.
import
-m 1
% hadoop fs -cat widgets/part-m-00000
1,sprocket,0.25,2010-02-10,1,Connects two gizmos
2,gizmo,4.00,2009-11-30,4,null
3,gadget,99.99,1983-08-13,13,Our flagship product
localhost
sqoop help importman sqoop-import
Text and Binary File Formats
VARBINARYnull
"null"--null-string
Generated Code
sqoop import
ls widgets.java
widgets
widgets
% sqoop codegen --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets --class-name Widget
codegen
Widget
--class-name
Additional Serialization Systems
Imports: A Deeper Look
VARCHAR
INTEGERStringInteger
Widget
public Integer get_id();
public String get_widget_name();
public java.math.BigDecimal get_price();
public java.sql.Date get_design_date();
public Integer get_version();
public String get_design_comment();
DBWritableWidget
public void readFields(ResultSet __dbResults) throws SQLException;
public void write(PreparedStatement __dbStmt) throws SQLException;
ResultSet
readFields()Widget
ResultSetwrite()
Widget
InputFormat
DataDrivenDBInputFormat
SELECT col1,col2,col3,... FROM tableName
widgetsid
id
DataDrivenDBInputFormat
SELECT MIN(id), MAX(id) FROM widgets
-m 5
SELECT id, widget_name, ... FROM widgets WHERE id >= 0 AND id < 20000SELECT
id, widget_name, ... FROM widgets WHERE id >= 20000 AND id < 40000
id
-m 1
InputFormat
ResultSet
Controlling the Import
WHERE
WHERE id >= 100000
WHERE
Imports and Consistency
Direct-mode Imports
DataDrivenDBInputFormat
mysqldump
--direct
CLOBBLOB
mysqldump
mk-parallel-dump
Working with Imported Data
TextInputFormat
"1"Integer
int
parse()
TextCharSequencechar[]
MaxWidgetId
versionversion
% jar cvvf widgets.jar *.class
% HADOOP_CLASSPATH=/usr/lib/sqoop/sqoop-version.jar hadoop jar \
> widgets.jar MaxWidgetId -libjars /usr/lib/sqoop/sqoop-version.jar
$HADOOP_CLASSPATH MaxWidgetId.run()
-libjars
3,gadget,99.99,1983-08-13,13,Our flagship product
Widget
Widget
Writable
MaxWidgetId
MaxWidgetIdGenericAvro
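As an illustration of using the generated class (a sketch in the spirit of MaxWidgetId, not its source), a mapper can re-parse each imported text record into a Widget and emit its id. It assumes the generated parse(Text) method and get_id() accessor described above, and that parse failures raise RecordParser.ParseError; imports from the old org.apache.hadoop.mapred API and Sqoop's runtime library are omitted, as in the book's other listings.

static class WidgetIdMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Widget> {
  private final Widget widget = new Widget();

  public void map(LongWritable offset, Text line,
      OutputCollector<LongWritable, Widget> output, Reporter reporter)
      throws IOException {
    try {
      widget.parse(line); // populate the Widget's fields from the delimited record
      output.collect(new LongWritable(widget.get_id()), widget);
    } catch (RecordParser.ParseError e) {
      reporter.incrCounter("Widgets", "malformed", 1); // skip malformed records
    }
  }
}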
Imported Data and Hive
1,15,120 Any St.,Los Angeles,CA,90210,2010-08-01
3,4,120 Any St.,Los Angeles,CA,90210,2010-08-01
2,5,400 Some Pl.,Cupertino,CA,95014,2010-07-30
2,7,88 Mile Rd.,Manhattan,NY,10005,2010-07-18
widgets
widgets
hive> CREATE TABLE sales(widget_id INT, qty INT,
> street STRING, city STRING, state STRING,
> zip INT, sale_date STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
OK
Time taken: 5.248 seconds
hive> LOAD DATA LOCAL INPATH "sales.log" INTO TABLE sales;
Copying data from file:/home/sales.log
Loading data to table sales
OK
Time taken: 0.188 seconds
widgets
% sqoop create-hive-table --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets --fields-terminated-by ','
...
10/06/23 18:05:34 INFO hive.HiveImport: OK
10/06/23 18:05:34 INFO hive.HiveImport: Time taken: 3.22 seconds
10/06/23 18:05:35 INFO hive.HiveImport: Hive import complete.
% hive
hive> LOAD DATA INPATH "widgets" INTO TABLE widgets;
Loading data to table widgets
OK
Time taken: 3.265 seconds
10/06/23 18:09:36 WARN hive.TableDefWriter:
Column design_date had to be
cast to a less precise type in Hive
widgets
% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1 --hive-import
sqoop import--hive-import
widgets
sales
hive> CREATE TABLE zip_profits (sales_vol DOUBLE, zip INT);
OK
hive> INSERT OVERWRITE TABLE zip_profits
> SELECT SUM(w.price * s.qty) AS sales_vol, s.zip FROM SALES s
> JOIN widgets w ON (s.widget_id = w.id) GROUP BY s.zip;
...
3 Rows loaded to zip_profits
OK
hive> SELECT * FROM zip_profits ORDER BY sales_vol DESC;
...
OK
403.71 90210
28.0 10005
20.0 95014
Importing Large Objects
CLOBBLOB
java.io.InputStream java.io.Reader
widgets schematic
2,gizmo,4.00,2009-11-30,4,null,externalLob(lf,lobfile0,100,5011714)
externalLob(...)
lf
Widget.get_schematic()
BlobRefschematic
BlobRef.getDataStream()
InputStreamschematic
Widget
schematic
BlobRefClobRef
schematic
Performing an Export
StringCHAR(64)VARCHAR(200)
zip_profits
% mysql hadoopguide
mysql> CREATE TABLE sales_by_zip (volume DECIMAL(8,2), zip INTEGER);
Query OK, 0 rows affected (0.01 sec)
% sqoop export --connect jdbc:mysql://localhost/hadoopguide -m 1 \
> --table sales_by_zip --export-dir /user/hive/warehouse/zip_profits \
> --input-fields-terminated-by '\0001'
...
10/07/02 16:16:50 INFO mapreduce.ExportJobBase: Transferred 41 bytes in 10.8947
seconds (3.7633 bytes/sec)
10/07/02 16:16:50 INFO mapreduce.ExportJobBase: Exported 3 records.
% mysql hadoopguide -e 'SELECT * FROM sales_by_zip'
+--------+-------+
| volume | zip |
+--------+-------+
| 28.00 | 10005 |
| 403.71 | 90210 |
| 20.00 | 95014 |
+--------+-------+
zip_profits
0x0001
SELECT
--input-fields-terminated-by sqoop export
--input-fields-terminated-by \\0001
Escape   Description
\b       Backspace
\n       Newline
\r       Carriage return
\t       Tab
\'       Single quote
\"       Double quote
\\       Backslash
\0       NUL. This will insert NUL characters between fields or lines, or will disable enclosing/escaping if used for one of the --enclosed-by, --optionally-enclosed-by, or --escaped-by arguments.
\0ooo    The octal representation of a Unicode character's code point. The actual character is specified by the octal value ooo.
\0xhhh   The hexadecimal representation of a Unicode character's code point. This should be of the form \0xhhh, where hhh is the hex value. For example, --fields-terminated-by '\0x10' specifies the character with code point 0x10.
Exports: A Deeper Look
ResultSet
INSERT
INSERT
mysqlimport
mysqlimport
mysqlimport
CombineFileInputFormat
Exports and Transactionality
--staging-table
--clear-staging-table
Exports and SequenceFiles
Output
CollectorOutputFormat
org.apache.sqoop.lib.SqoopRecord
sqoop-codegenSqoopRecord
sqoop-export
SqoopRecord
--class-name --jar-file
widgets
% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1 --class-name WidgetHolder --as-sequencefile \
> --target-dir widget_sequence_files --bindir .
...
10/07/05 17:09:13 INFO mapreduce.ImportJobBase: Retrieved 3 records.
% mysql hadoopguide
mysql> CREATE TABLE widgets2(id INT, widget_name VARCHAR(100),
-> price DOUBLE, designed DATE, version INT, notes VARCHAR(200));
Query OK, 0 rows affected (0.03 sec)
mysql> exit;
% sqoop export --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets2 -m 1 --class-name WidgetHolder \
> --jar-file widgets.jar --export-dir widget_sequence_files
...
10/07/05 17:26:44 INFO mapreduce.ExportJobBase: Exported 3 records.
--bindir
CHAPTER 16
Case Studies
Hadoop Usage at Last.fm
Last.fm: The Social Music Revolution
Hadoop at Last.fm
Generating Charts with Hadoop
The Track Statistics Program
UserId TrackId Scrobble Radio Skip
111115 222 0 1 0
111113 225 1 0 0
111117 223 0 1 1
111115 225 1 0 0
Calculating the number of unique listeners
UniqueListenersMapper
public void map(LongWritable position, Text rawLine, OutputCollector<IntWritable,
IntWritable> output, Reporter reporter) throws IOException {
String[] parts = (rawLine.toString()).split(" ");
UniqueListenersMapper.
int scrobbles = Integer.parseInt(parts[TrackStatisticsProgram.COL_SCROBBLES]);
int radioListens = Integer.parseInt(parts[TrackStatisticsProgram.COL_RADIO]);
// if track somehow is marked with zero plays - ignore
if (scrobbles <= 0 && radioListens <= 0) {
return;
}
// if we get to here then user has listened to track,
// so output user id against track id
IntWritable trackId = new IntWritable(
Integer.parseInt(parts[TrackStatisticsProgram.COL_TRACKID]));
IntWritable userId = new IntWritable(
Integer.parseInt(parts[TrackStatisticsProgram.COL_USERID]));
output.collect(trackId, userId);
}
UniqueListenersReducers
Set
Set
public void reduce(IntWritable trackId, Iterator<IntWritable> values,
OutputCollector<IntWritable, IntWritable> output, Reporter reporter)
throws IOException {
Set<Integer> userIds = new HashSet<Integer>();
// add all userIds to the set, duplicates automatically removed (set contract)
while (values.hasNext()) {
IntWritable userId = values.next();
userIds.add(Integer.valueOf(userId.get()));
}
// output trackId -> number of unique listeners per track
output.collect(trackId, new IntWritable(userIds.size()));
}
Line of file   UserId        TrackId       Scrobbled   Radio play   Skip
LongWritable   IntWritable   IntWritable   Boolean     Boolean      Boolean
0              11115         222           0           1            0
1              11113         225           1           0            0
2              11117         223           0           1            1
3              11115         225           1           0            0
UniqueListenersReducer.
TrackId UserId
IntWritable IntWritable
222 11115
225 11113
223 11117
225 11115
TrackId #listeners
IntWritable IntWritable
222 1
225 2
223 1
Summing the track totals
TrackStats
WritableComparable
TrackStats
public void map(LongWritable position, Text rawLine,
OutputCollector<IntWritable, TrackStats> output, Reporter reporter)
throws IOException {
String[] parts = (rawLine.toString()).split(" ");
int trackId = Integer.parseInt(parts[TrackStatisticsProgram.COL_TRACKID]);
int scrobbles = Integer.parseInt(parts[TrackStatisticsProgram.COL_SCROBBLES]);
int radio = Integer.parseInt(parts[TrackStatisticsProgram.COL_RADIO]);
int skip = Integer.parseInt(parts[TrackStatisticsProgram.COL_SKIP]);
// set number of listeners to 0 (this is calculated later)
// and other values as provided in text file
TrackStats trackstat = new TrackStats(0, scrobbles + radio, scrobbles, radio, skip);
output.collect(new IntWritable(trackId), trackstat);
}
SumMapper.
public void reduce(IntWritable trackId, Iterator<TrackStats> values,
OutputCollector<IntWritable, TrackStats> output, Reporter reporter)
throws IOException {
TrackStats sum = new TrackStats(); // holds the totals for this track
while (values.hasNext()) {
TrackStats trackStats = (TrackStats) values.next();
sum.setListeners(sum.getListeners() + trackStats.getListeners());
sum.setPlays(sum.getPlays() + trackStats.getPlays());
sum.setSkips(sum.getSkips() + trackStats.getSkips());
sum.setScrobbles(sum.getScrobbles() + trackStats.getScrobbles());
sum.setRadioPlays(sum.getRadioPlays() + trackStats.getRadioPlays());
}
output.collect(trackId, sum);
}
Line UserId TrackId Scrobbled Radio play Skip
LongWritable IntWritable IntWritable Boolean Boolean Boolean
0 11115 222 0 1 0
1 11113 225 1 0 0
2 11117 223 0 1 1
3 11115 225 1 0 0
TrackId #listeners #plays #scrobbles #radio plays #skips
IntWritable IntWritable IntWritable IntWritable IntWritable IntWritable
222 0 1 0 1 0
225 0 1 1 0 0
223 0 1 0 1 1
225 0 1 1 0 0
TrackId #listeners #plays #scrobbles #radio plays #skips
IntWritable IntWritable IntWritable IntWritable IntWritable IntWritable
222 0 1 0 1 0
225 0 2 2 0 0
223 0 1 0 1 1
SumReducer.
Merging the results
MultipleInputs
JobConf
MultipleInputs.addInputPath(conf, sumInputDir,
SequenceFileInputFormat.class, IdentityMapper.class);
MultipleInputs.addInputPath(conf, listenersInputDir,
SequenceFileInputFormat.class, MergeListenersMapper.class);
UniqueListenerJob
TrackStats
SumMapper
public void map(IntWritable trackId, IntWritable uniqueListenerCount,
OutputCollector<IntWritable, TrackStats> output, Reporter reporter)
throws IOException {
TrackStats trackStats = new TrackStats();
trackStats.setListeners(uniqueListenerCount.get());
output.collect(trackId, trackStats);
}
TrackId #listeners
IntWritable IntWritable
222 1
225 2
223 1
TrackId #listeners #plays #scrobbles #radio #skips
222 1 0 0 0 0
225 2 0 0 0 0
223 1 0 0 0 0
MergeListenersMapper.
IdentityMapper SumJob
TrackStats
TrackId #listeners #plays #scrobbles #radio #skips
IntWritable IntWritable IntWritable IntWritable IntWritable IntWritable
222 0 1 0 1 0
225 0 2 2 0 0
223 0 1 0 1 1
TrackStats
SumReducer
TrackStats
TrackId #listeners #plays #scrobbles #radio #skips
IntWritable IntWritable IntWritable IntWritable IntWritable IntWritable
222 1 1 0 1 0
225 2 2 2 0 0
223 1 1 0 1 1
IdentityMapper.
SumReducer.
Summary
Hadoop and Hive at Facebook
Hadoop at Facebook
History
Use cases
Data architecture
Hadoop configuration
Hypothetical Use Case Studies
Advertiser insights and performance
Ad hoc analysis and product feedback
Data analysis
Hive
Data organization
Query language
Data pipelines using Hive
dim_ads
impression_logs
SELECT a.campaign_id, count(1), count(DISTINCT b.user_id)
FROM dim_ads a JOIN impression_logs b ON(b.ad_id = a.ad_id)
WHERE b.dateid = '2008-12-01'
GROUP BY a.campaign_id;
FROM(
SELECT a.ad_id, a.campaign_id, a.account_id, b.user_id
FROM dim_ads a JOIN impression_logs b ON (b.ad_id = a.ad_id)
WHERE b.dateid = '2008-12-01') x
INSERT OVERWRITE DIRECTORY 'results_gby_adid'
SELECT x.ad_id, count(1), count(DISTINCT x.user_id) GROUP BY x.ad_id
INSERT OVERWRITE DIRECTORY 'results_gby_campaignid'
SELECT x.campaign_id, count(1), count(DISTINCT x.user_id) GROUP BY x.campaign_id
INSERT OVERWRITE DIRECTORY 'results_gby_accountid'
SELECT x.account_id, count(1), count(DISTINCT x.user_id) GROUP BY x.account_id;
unique_idimpression_logs
unique_idad_id
unique_id
impression_logs
ad_idunique_id
impression_logs
INSERT OVERWRITE lifetime_partial_imps PARTITION(dateid='2008-12-01')
SELECT x.ad_id, x.user_id, sum(x.cnt)
FROM (
SELECT a.ad_id, a.user_id, a.cnt
FROM lifetime_partial_imps a
WHERE a.dateid = '2008-11-30'
UNION ALL
SELECT b.ad_id, b.user_id, 1 as cnt
FROM impression_log b
WHERE b.dateid = '2008-12-01'
) x
GROUP BY x.ad_id, x.user_id;
FROM(
SELECT a.ad_id, a.campaign_id, a.account_id, b.user_id, b.cnt
FROM dim_ads a JOIN lifetime_partial_imps b ON (b.ad_id = a.ad_id)
WHERE b.dateid = '2008-12-01') x
INSERT OVERWRITE DIRECTORY 'results_gby_adid'
SELECT x.ad_id, sum(x.cnt), count(DISTINCT x.user_id) GROUP BY x.ad_id
INSERT OVERWRITE DIRECTORY 'results_gby_campaignid'
SELECT x.campaign_id, sum(x.cnt), count(DISTINCT x.user_id) GROUP BY x.campaign_id
INSERT OVERWRITE DIRECTORY 'results_gby_accountid'
SELECT x.account_id, sum(x.cnt), count(DISTINCT x.user_id) GROUP BY x.account_id;
Problems and Future Work
Fair sharing
Space management
Scribe-HDFS integration
Improvements to Hive
Nutch Search Engine
<a href="..">anchor
text</a> elements
Data Structures
MapFileOutputFormatSequenceFileOutputFormat
CrawlDb
<url, CrawlDatum>
TextCrawlDatum
Writable
LinkDb
<url, Inlinks>Inlinks
Segments
MapFileOutputFormatSequenceFileOut
putFormat
<url, Content>
<url, CrawlDatum>
<url, CrawlDatum>
<url, ParseText>
Selected Examples of Hadoop Data Processing in Nutch
Link inversion
JobConf job = new JobConf(configuration);
FileInputFormat.addInputPath(job, new Path(segmentPath, "parse_data"));
job.setInputFormat(SequenceFileInputFormat.class);
job.setMapperClass(LinkDb.class);
job.setReducerClass(LinkDb.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Inlinks.class);
job.setOutputFormat(MapFileOutputFormat.class);
FileOutputFormat.setOutputPath(job, newLinkDbPath);
ParseData
java.net.URLjava.net.URI
map()reduce()
public void map(Text fromUrl, ParseData parseData,
OutputCollector<Text, Inlinks> output, Reporter reporter) {
...
Outlink[] outlinks = parseData.getOutlinks();
Inlinks inlinks = new Inlinks();
for (Outlink out : outlinks) {
inlinks.clear(); // instance reuse to avoid excessive GC
String toUrl = out.getToUrl();
String anchor = out.getAnchor();
inlinks.add(new Inlink(fromUrl, anchor));
output.collect(new Text(toUrl), inlinks);
}
}
Outlinkmap()
<toUrl, Inlinks>InlinksInlink
fromUrl
Inlinksreduce()
public void reduce(Text toUrl, Iterator<Inlinks> values,
OutputCollector<Text, Inlinks> output, Reporter reporter) {
Inlinks result = new Inlinks();
while (values.hasNext()) {
result.add(values.next());
}
output.collect(toUrl, result);
}
fromUrlstoUrl
MapFileOutputFormat
Generation of fetchlists
<url, crawlDatum>crawlDatum
SequenceFileInputFormat
<url, datum>
<score, <url, datum>>
FileInputFormat.addInputPath(job, crawlDbPath);
job.setInputFormat(SequenceFileInputFormat.class);
job.setMapperClass(Selector.class);
job.setPartitionerClass(Selector.class);
job.setReducerClass(Selector.class);
FileOutputFormat.setOutputPath(job, tempDir);
job.setOutputFormat(SequenceFileOutputFormat.class);
job.setOutputKeyClass(FloatWritable.class);
job.setOutputKeyComparatorClass(DecreasingFloatComparator.class);
job.setOutputValueClass(SelectorEntry.class);
Selector
Selector Partitioner
Step 1: Select, sort by score, limit by URL count per host.
Selector Partitioner
/** Partition by host. */
public int getPartition(FloatWritable key, Writable value, int numReduceTasks) {
return hostPartitioner.getPartition(((SelectorEntry)value).url, key,
numReduceTasks);
}
0numReduceTasks - 1
SelectorEntry
PartitionUrlByHost
/** Hash by hostname. */
public int getPartition(Text key, Writable value, int numReduceTasks) {
String urlString = key.toString();
URL url = null;
try {
url = new URL(urlString);
} catch (MalformedURLException e) {
LOG.warn("Malformed URL: '" + urlString + "'");
}
int hashCode = (url == null ? urlString : url.getHost()).hashCode();
// make hosts wind up in different partitions on different runs
hashCode ^= seed;
return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
}
MapFileOutputFormat
SelectorEntry
Selector.reduce()
<score, selectorEntry>
<url, datum>
FileInputFormat.addInputPath(job, tempDir);
job.setInputFormat(SequenceFileInputFormat.class);
job.setMapperClass(SelectorInverseMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(SelectorEntry.class);
job.setPartitionerClass(PartitionUrlByHost.class);
job.setReducerClass(PartitionReducer.class);
job.setNumReduceTasks(numParts);
FileOutputFormat.setOutputPath(job, output);
job.setOutputFormat(SequenceFileOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(CrawlDatum.class);
job.setOutputKeyComparatorClass(HashComparator.class);
SelectorInverseMapper
SelectorEntry
CrawlDatum
<Text, CrawlDatum>
<Text, SelectorEntry>
setMapOutputKeyClass()
setMapOutputValueClass()
PartitionUrlByHost
HashComparator
FetchernumParts
Fetcher
PartitionReducer <url, selectorEntry> <url, crawlDatum> HashComparator
Step 2: Invert, partition by host, sort randomly.
reduce()
SelectorEntry
public void reduce(Text key, Iterator<SelectorEntry> values,
OutputCollector<Text, CrawlDatum> output, Reporter reporter) throws IOException {
// when using HashComparator, we get only one input key in case of hash collisions
// so use only URLs extracted from values
while (values.hasNext()) {
SelectorEntry entry = values.next();
output.collect(entry.url, entry.datum);
}
}
SequenceFileOutputFormat
Fetcher: A multithreaded MapRunner in action
HashComparator
MapRunner
FetcherMapRunner
job.setSpeculativeExecution(false);
FileInputFormat.addInputPath(job, "segment/crawl_generate");
job.setInputFormat(InputFormat.class);
job.setMapRunnerClass(Fetcher.class);
FileOutputFormat.setOutputPath(job, segment);
job.setOutputFormat(FetcherOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NutchWritable.class);
InputFormat
OutputFormat
NutchWritable
NutchWritableGenericWritable
Writable
MapRunner
MapRunner
public void run(RecordReader<Text, CrawlDatum> input,
OutputCollector<Text, NutchWritable> output,
Reporter reporter) throws IOException {
int threadCount = getConf().getInt("fetcher.threads.fetch", 10);
feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);
feeder.start();
for (int i = 0; i < threadCount; i++) { // spawn threads
new FetcherThread(getConf()).start();
}
do { // wait for threads to exit
try {
Thread.sleep(1000);
} catch (InterruptedException e) {}
reportStatus(reporter);
} while (activeThreads.get() > 0);
}
FetcherQueueFeeder
FetcherThread
QueueFeeder
FetcherThread
Indexer: Using custom OutputFormat
FileInputFormat.addInputPath(job, crawlDbPath);
FileInputFormat.addInputPath(job, linkDbPath);
// add segment data
FileInputFormat.addInputPath(job, "segment/crawl_fetch");
FileInputFormat.addInputPath(job, "segment/crawl_parse");
FileInputFormat.addInputPath(job, "segment/parse_data");
FileInputFormat.addInputPath(job, "segment/parse_text");
job.setInputFormat(SequenceFileInputFormat.class);
job.setMapperClass(Indexer.class);
job.setReducerClass(Indexer.class);
FileOutputFormat.setOutputPath(job, indexDir);
job.setOutputFormat(OutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LuceneDocumentWrapper.class);
MapperIndexer
NutchWritable
NutchWritable
Reducer
CrawlDatum CrawlDatum Inlinks Parse
DataParseText
WritableLuceneDocumentWrapper
OutputFormat
public static class OutputFormat extends
FileOutputFormat<WritableComparable, LuceneDocumentWrapper> {
public RecordWriter<WritableComparable, LuceneDocumentWrapper>
getRecordWriter(final FileSystem fs, JobConf job,
String name, final Progressable progress) throws IOException {
final Path out = new Path(FileOutputFormat.getOutputPath(job), name);
final IndexWriter writer = new IndexWriter(out.toString(),
new NutchDocumentAnalyzer(job), true);
return new RecordWriter<WritableComparable, LuceneDocumentWrapper>() {
boolean closed;
public void write(WritableComparable key, LuceneDocumentWrapper value)
throws IOException { // unwrap & index doc
Document doc = value.get();
writer.addDocument(doc);
progress.progress();
}
public void close(final Reporter reporter) throws IOException {
// spawn a thread to give progress heartbeats
Thread prog = new Thread() {
public void run() {
while (!closed) {
try {
reporter.setStatus("closing");
Thread.sleep(1000);
} catch (InterruptedException e) { continue; }
catch (Throwable e) { return; }
}
}
};
try {
prog.start();
// optimize & close index
writer.optimize();
writer.close();
} finally {
closed = true;
}
}
};
}
}
RecordWriterOutputFormat
IndexWriter
LuceneDocumentWrapper
RecordWriter
Summary
Log Processing at Rackspace
Requirements/The Problem
Logs
Brief History
Choosing Hadoop
Collection and Storage
Log collection
hadoop.rpc.socket.factory.class.default SocksSocketFactory
Log storage
MapReduce for Logs
Processing
1DBD21B48AE
Nov 12 17:36:54 gate8.gate.sat.mlsrvr.com postfix/smtpd[2552]: connect from hostname
Nov 12 17:36:54 relay2.relay.sat.mlsrvr.com postfix/qmgr[9489]: 1DBD21B48AE:
from=<mapreduce@rackspace.com>, size=5950, nrcpt=1 (queue active)
Nov 12 17:36:54 relay2.relay.sat.mlsrvr.com postfix/smtpd[28085]: disconnect from
hostname
Nov 12 17:36:54 gate5.gate.sat.mlsrvr.com postfix/smtpd[22593]: too many errors
after DATA from hostname
Nov 12 17:36:54 gate5.gate.sat.mlsrvr.com postfix/smtpd[22593]: disconnect from
hostname
Nov 12 17:36:54 gate10.gate.sat.mlsrvr.com postfix/smtpd[10311]: connect from
hostname
Nov 12 17:36:54 relay2.relay.sat.mlsrvr.com postfix/smtp[28107]: D42001B48B5:
to=<mapreduce@rackspace.com>, relay=hostname[ip], delay=0.32, delays=0.28/0/0/0.04,
dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 1DBD21B48AE)
Nov 12 17:36:54 gate20.gate.sat.mlsrvr.com postfix/smtpd[27168]: disconnect from
hostname
Nov 12 17:36:54 gate5.gate.sat.mlsrvr.com postfix/qmgr[1209]: 645965A0224: removed
Nov 12 17:36:54 gate2.gate.sat.mlsrvr.com postfix/smtp[15928]: 732196384ED: to=<m
apreduce@rackspace.com>, relay=hostname[ip], conn_use=2, delay=0.69, delays=0.04/
0.44/0.04/0.17, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 02E1544C005)
Nov 12 17:36:54 gate2.gate.sat.mlsrvr.com postfix/qmgr[13764]: 732196384ED: removed
Nov 12 17:36:54 gate1.gate.sat.mlsrvr.com postfix/smtpd[26394]: NOQUEUE: reject: RCP
T from hostname 554 5.7.1 <mapreduce@rackspace.com>: Client host rejected: The
sender's mail server is blocked; from=<mapreduce@rackspace.com> to=<mapred
uce@rackspace.com> proto=ESMTP helo=<mapreduce@rackspace.com>
InputFormat
OutputFormat
InputFormat
SequenceFileRecordReaderLineRecordReader
FileSplit
HopWritable
Phase 1: Map.
Phase 1: Reduce.
HopWritable
Writable
OutputFormat
InputFormatMultiSequenceFileOutputFormat
MultipleSequenceFileOutputFormat
HopWritable
SequenceFileInputFormatIdentityMapper
HopWritable
MultiSequenceFileOutputFormat
SolrOutputFormat
HopWritable
SolrOutputFormat
OutputFormat
HashPartitioner
Partitioner
Phase 2: Map.
Phase 2: Reduce.
Merging for near-term search
SolrOutputFormat
IndexWriter.addIndexes()
MergeAgentRAMDirectory
FSDirectory<commit/>
sender:"mapreduce@rackspace.com" -recipient:"hadoop@rackspace.com"
recipient:"@rackspace.com" short-status:deferred timestamp:[1228140900 TO 2145916799]
Archiving for analysis
Sharding.
Search results.
SolrInputFormat
InputFormat
Cascading
FunctionFilterAggregator
Buffer
Fields, Tuples, and Pipes
String
select
java.lang.Comparable
Each
EachFunc
tionFilter
GroupBy
GroupBy
group by
CoGroup
CoGroup
Every
Every
GroupByCoGroupEvery
SubAssembly
SubAssembly
new GroupBy(<previous>, <grouping fields>, <secondary sorting fields>) previous
Operations
Function
Function
Each
Filter
Filter
functionFilter
AndOrXorNot
Aggregator
Aggregator
Aggregator
SumCountAverageMaxMin
Buffer
BufferAggregator
Aggregator
AggregatorsBuffer
EachEvery
concatenate(String first, String second) concatenate(Person person) concatenate() Person
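For concreteness, here is a sketch of a custom Function against the Cascading API of that era (this example is mine, not from the case study): it takes one argument field and emits an uppercased copy under a new field name.

public class UppercaseFunction extends BaseOperation implements Function {
  public UppercaseFunction() {
    super(1, new Fields("upper_word")); // one argument in, one declared output field
  }

  public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
    String word = functionCall.getArguments().getTuple().getString(0);
    functionCall.getOutputCollector().add(new Tuple(word.toUpperCase()));
  }
}

It would be wired into an assembly just like the RegexGenerator used later in this section, for example with new Each(assembly, new Fields("word"), new UppercaseFunction()).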
Taps, Schemes, and Flows
Tap
Scheme
Tap
Tap
Scheme
Scheme
Taps Pipe
TapFlowFlow
TapsTaps
Tap
Taps
Tap
Taps
Flow
Flow
Cascading in Practice
Scheme sourceScheme =
new TextLine(new Fields("line"));
Tap source =
new Hfs(sourceScheme, inputPath);
Scheme sinkScheme = new TextLine();
Tap sink =
new Hfs(sinkScheme, outputPath, SinkMode.REPLACE);
Pipe assembly = new Pipe("wordcount");
String regexString = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function regex = new RegexGenerator(new Fields("word"), regexString);
assembly =
new Each(assembly, new Fields("line"), regex);
assembly =
new GroupBy(assembly, new Fields("word"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);
assembly =
new GroupBy(assembly, new Fields("count"), new Fields("word"));
FlowConnector flowConnector = new FlowConnector();
Flow flow =
flowConnector.connect("word-count", source, sink, assembly);
flow.complete();
SchemeTuple
Fields
SchemeTuple
Tap
Tap
Taps
Each
Tuple
GroupByTuple
Every Aggregator
Tuple
GroupByTuple
Flow
Flow
SubAssembly
public class ParseWordsAssembly extends SubAssembly
{
public ParseWordsAssembly(Pipe previous)
{
String regexString = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function regex = new RegexGenerator(new Fields("word"), regexString);
previous = new Each(previous, new Fields("line"), regex);
String exprString = "word.toLowerCase()";
Function expression =
new ExpressionFunction(new Fields("word"), exprString, String.class);
previous = new Each(previous, new Fields("word"), expression);
setTails(previous);
}
}
SubAssemblyPipe
toLowerCase()String
String
SubAssembly
SubAssembly
Scheme sourceScheme = new TextLine(new Fields("line"));
Tap source = new Hfs(sourceScheme, inputPath);
Scheme sinkScheme = new TextLine(new Fields("word", "count"));
Tap sink = new Hfs(sinkScheme, outputPath, SinkMode.REPLACE);
Pipe assembly = new Pipe("wordcount");
assembly =
new ParseWordsAssembly(assembly);
assembly = new GroupBy(assembly, new Fields("word"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);
assembly = new GroupBy(assembly, new Fields("count"), new Fields("word"));
FlowConnector flowConnector = new FlowConnector();
Flow flow = flowConnector.connect("word-count", source, sink, assembly);
flow.complete();
EachParseWordsAssembly
SubAssemblyEvery
Flexibility
MapperReducer
Hadoop and Cascading at ShareThis
cascading.ClusterTestCase
FunctionsFilters
SubAssembly
SubAssembly
Tuple SubAssembly
Tuple
SubAssembly
TemplateTapTap
TupleTemplateTap
SubAssembly
public void testLogParsing() throws IOException
{
Hfs source = new Hfs(new TextLine(new Fields("line")), sampleData);
Hfs sink =
new Hfs(new TextLine(), outputPath + "/parser", SinkMode.REPLACE);
Pipe pipe = new Pipe("parser");
// split "line" on tabs
pipe = new Each(pipe, new Fields("line"), new RegexSplitter("\t"));
pipe = new LogParser(pipe);
pipe = new LogRules(pipe);
// testing only assertions
pipe = new ParserAssertions(pipe);
Flow flow = new FlowConnector().connect(source, sink, pipe);
flow.complete(); // run the test flow
// verify there are 98 tuples, 2 fields, and matches the regex pattern
// for TextLine schemes the tuples are { "offset", "line" }
validateLength(flow, 98, 2, Pattern.compile("^[0-9]+(\\t[^\\t]*){19}$"));
}
Flow
TapsTaps
Flow
Flow
Tap
TupleFlow
Flow
Flow
Summary
TeraByte Sort on Apache Hadoop
TeraGen
TeraSort
TeraValidate
TeraGen
TeraGen
TeraSort
TeraSort
io.sort.mb io.sort.factor fs.inmemory.size.mb
TeraValidate
Using Pig and Wukong to Explore Billion-edge Network Graphs
@mrflip@tom_e_white
@infochimps@hadoop
Measuring Community
Everybody’s Talkin’ at Me: The Twitter Reply Graph
class Tweet < Struct.new(:tweet_id, :screen_name, :created_at,
:reply_tweet_id, :reply_screen_name, :text)
def initialize(raw_tweet)
# ... gory details of parsing raw tweet omitted
end
# Tweet is a reply if there's something in the reply_tweet_id slot
def is_reply?
not reply_tweet_id.blank?
end
end
{"text":"Just finished the final draft for Hadoop: the Definitive Guide!",
"screen_name":"tom_e_white","reply_screen_name":null,"id":3239897342,
"reply_tweet_id":null,...}
{"text":"@tom_e_white Can't wait to get a copy!",
"screen_name":"mrflip","reply_screen_name":"tom_e_white","id":3239873453,
"reply_tweet_id":3239897342,...}
{"text":"@josephkelly great job on the #InfoChimps API.
Remind me to tell you about the time a baboon broke into our house.",
"screen_name":"wattsteve","reply_screen_name":"josephkelly","id":16434069252,...}
{"text":"@mza Re: http://j.mp/atbroxmr Check out @James_Rubino's
http://bit.ly/clusterfork ? Lots of good hadoop refs there too",
"screen_name":"mrflip","reply_screen_name":"@mza","id":7809927173,...}
{"text":"@tlipcon divide lots of data into little parts. Magic software gnomes
fix up the parts, elves then assemble those into whole things #hadoop",
"screen_name":"nealrichter","reply_screen_name":"tlipcon","id":4491069515,...}
A tweet that is not a reply carries null in its reply_screen_name and reply_tweet_id fields, as in the first example above.
class ReplyGraphMapper < LineStreamer
  def process(raw_tweet)
    tweet = Tweet.new(raw_tweet)
    if tweet.is_reply?
      emit [tweet.screen_name, tweet.reply_screen_name]
    end
  end
end
The mapper subclasses LineStreamer, which feeds each line of input to its process method in turn; process emits an edge for every reply it sees. Running it over the sample produces the reply edge list:
% reply_graph_mapper --run raw_tweets.json a_replies_b.tsv
mrflip tom_e_white
wattsteve josephkelly
mrflip mza
nealrichter tlipcon
Each line is a directed edge, read as "the first user replied to the second": @wattsteve replied to @josephkelly, and so on.
Edge pairs versus adjacency list
An edge-pairs table like this can be converted into an adjacency list by grouping on the source node with GROUP BY:
a_replies_b = LOAD 'a_replies_b.tsv' AS (src:chararray, dest:chararray);
replies_out = GROUP a_replies_b BY src;
DUMP replies_out;
(cutting,{(tom_e_white)})
(josephkelly,{(wattsteve)})
(mikeolson,{(LusciousPear),(kevinweil),(LusciousPear),(tlipcon)})
(mndoci,{(mrflip),(peteskomoroch),(LusciousPear),(mrflip)})
(mrflip,{(LusciousPear),(mndoci),(mndoci),(esammer),(ogrisel),(esammer),(wattsteve)})
(peteskomoroch,{(CMastication),(esammer),(DataJunkie),(mndoci),(nealrichter),...
(tlipcon,{(LusciousPear),(LusciousPear),(nealrichter),(mrflip),(kevinweil)})
(tom_e_white,{(mrflip),(lenbust)})
Degree
The in-degree of each user (the number of distinct people who replied to them, and the total number of replies received) falls out of a GROUP on the destination node and a nested FOREACH:
a_replies_b = LOAD 'a_replies_b.tsv' AS (src:chararray, dest:chararray);
replies_in = GROUP a_replies_b BY dest; -- group on dest to get in-links
replies_in_degree = FOREACH replies_in {
nbrs = DISTINCT a_replies_b.src;
GENERATE group, COUNT(nbrs), COUNT(a_replies_b);
};
DUMP replies_in_degree;
(cutting,1L,1L)
(josephkelly,1L,1L)
Hadoop tuning parameters such as io.sort.record.percent can be passed through to the underlying jobs with the PIG_OPTS environment variable:
PIG_OPTS="-Dio.sort.record.percent=0.25 -Dio.sort.mb=350" pig my_file.pig
(mikeolson,3L,4L)
(mndoci,3L,4L)
(mrflip,5L,9L)
(peteskomoroch,9L,18L)
(tlipcon,4L,8L)
(tom_e_white,2L,2L)
Even in this small sample the in-degrees are skewed; at full scale, celebrity accounts such as @THE_REAL_SHAQ and @sockington receive vastly more replies than a typical user like @peteskomoroch.
Symmetric Links
A single reply from @THE_REAL_SHAQ to @mndoci says little by itself, but if @mndoci and @THE_REAL_SHAQ have each replied to the other, that symmetric link suggests a genuine relationship. One way to find symmetric links is to load the edge list twice, once as an "A Replied To B" relation and once as an "A Replied By B" relation, and join the two:
a_repl_to_b = LOAD 'a_replies_b.tsv' AS (user_a:chararray, user_b:chararray);
a_repl_by_b = LOAD 'a_replies_b.tsv' AS (user_b:chararray, user_a:chararray);
-- symmetric edges appear in both sets
a_symm_b_j = JOIN a_repl_to_b BY (user_a, user_b),
a_repl_by_b BY (user_a, user_b);
...
a_replies_b = LOAD 'a_replies_b.tsv' AS (src:chararray, dest:chararray);
a_b_rels = FOREACH a_replies_b GENERATE
((src <= dest) ? src : dest) AS user_a,
((src <= dest) ? dest : src) AS user_b,
((src <= dest) ? 1 : 0) AS a_re_b:int,
((src <= dest) ? 0 : 1) AS b_re_a:int;
DUMP a_b_rels;
(mrflip,tom_e_white,1,0)
(josephkelly,wattsteve,0,1)
(mrflip,mza,1,0)
(nealrichter,tlipcon,0,1)
a_b_rels_g = GROUP a_b_rels BY (user_a, user_b);
a_symm_b_all = FOREACH a_b_rels_g GENERATE
group.user_a AS user_a,
group.user_b AS user_b,
(( (SUM(a_b_rels.a_re_b) > 0) AND
(SUM(a_b_rels.b_re_a) > 0) ) ? 1 : 0) AS is_symmetric:int;
DUMP a_symm_b_all;
(mrflip,tom_e_white,1)
(mrflip,mza,0)
(josephkelly,wattsteve,0)
(nealrichter,tlipcon,1)
...
a_symm_b = FILTER a_symm_b_all BY (is_symmetric == 1);
STORE a_symm_b INTO 'a_symm_b.tsv';
Only the symmetric pairs survive the filter; for example, @mrflip and @tom_e_white reply to each other:
(mrflip,tom_e_white,1)
(nealrichter,tlipcon,1)
...
Community Extraction
Get neighbors
Start from a seed node, here the @hadoop account, and extract its immediate neighborhood:
a_replies_b = LOAD 'a_replies_b.tsv' AS (src:chararray, dest:chararray);
-- Extract edges that originate or terminate on the seed
n0_edges = FILTER a_replies_b BY (src == 'hadoop') OR (dest == 'hadoop');
-- Choose the node in each pair that *isn't* our seed:
n1_nodes_all = FOREACH n0_edges GENERATE
((src == 'hadoop') ? dest : src) AS screen_name;
n1_nodes = DISTINCT n1_nodes_all;
DUMP n1_nodes;
Now keep every edge whose source lies in n1_nodes:
n1_edges_out_j = JOIN a_replies_b BY src,
n1_nodes BY screen_name USING 'replicated';
n1_edges_out = FOREACH n1_edges_out_j GENERATE src, dest;
The USING 'replicated' clause asks Pig to perform a map-side (fragment replicate) JOIN, holding the small n1_nodes relation entirely in memory. This pass keeps edges whose src is in the neighbor set; a second join restricts dest the same way, leaving only edges that stay within the neighborhood:
n1_edges_j = JOIN n1_edges_out BY dest,
n1_nodes BY screen_name USING 'replicated';
n1_edges = FOREACH n1_edges_j GENERATE src, dest;
DUMP n1_edges;
(mrflip,tom_e_white)
(mrflip,mza)
(wattsteve,josephkelly)
(nealrichter,tlipcon)
(bradfordcross,lusciouspear)
(mrflip,jeromatron)
(mndoci,mrflip)
(nealrichter,datajunkie)
Community metrics and the 1 million × 1 million problem
With the neighborhood edges for a seed such as @hadoop, @cloudera, or @infochimps in hand, you can compute community metrics: how densely the neighbors reply to one another, who the most active members are, and so on. The same recipe falls apart for a celebrity like @THE_REAL_SHAQ, whose millions of neighbors would force the join that builds n1_edges to pair a million candidate sources against a million candidate destinations.
Local properties at global scale
The same machinery extends to computing local graph properties for every node at global scale, including the very highest-degree accounts such as @britneyspears and @WholeFoods.
APPENDIX A
Installing Apache Hadoop
Prerequisites
Installation
% tar xzf hadoop-x.y.z.tar.gz
Make sure the JAVA_HOME environment variable is set to a suitable Java installation; it can go in your shell startup file or in Hadoop's conf/hadoop-env.sh. For example, on Mac OS X or on Ubuntu:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/
export JAVA_HOME=/usr/lib/jvm/java-6-sun
It is convenient to define an environment variable pointing at the Hadoop installation directory (HADOOP_INSTALL, say) and to add the Hadoop binary directories to your command-line path:
% export HADOOP_INSTALL=/home/tom/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
% hadoop version
Hadoop 1.0.0
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0
-r 1214675
Compiled by hortonfo on Thu Dec 15 16:36:35 UTC 2011
Configuration
Component            Property                       Standalone           Pseudodistributed   Fully distributed
Common               fs.default.name                file:/// (default)   hdfs://localhost/   hdfs://namenode/
HDFS                 dfs.replication                N/A                  1                   3 (default)
MapReduce 1          mapred.job.tracker             local (default)      localhost:8021      jobtracker:8021
YARN (MapReduce 2)   yarn.resourcemanager.address   N/A                  localhost:8032      resourcemanager:8032
Standalone Mode
In standalone (local) mode there is nothing further to configure: the default properties use the local filesystem and the local job runner, so jobs can be run straight away.
Pseudodistributed Mode
Create the following configuration files with these contents and place them in the conf directory (or in another directory of your choosing, which you then pass to the daemons with the --config option):
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost/</value>
</property>
</configuration>
<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>
<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8032</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
</configuration>
Configuring SSH
% sudo apt-get install ssh
On Cygwin, install the openssh package and set up an SSH server by running ssh-host-config -y. Then generate a key pair with an empty passphrase and authorize it for login to localhost:
% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
% ssh localhost
Formatting the HDFS filesystem
% hadoop namenode -format
Starting and stopping the daemons (MapReduce 1)
% start-dfs.sh
% start-mapred.sh
If the configuration files are kept somewhere other than the default conf directory, pass that directory to the scripts with the --config option, which takes an absolute path:
% start-dfs.sh --config path-to-config-directory
% start-mapred.sh --config path-to-config-directory
Check that the daemons started successfully by looking at the logfiles in the logs directory or by listing the Java processes with jps. Stop the daemons with:
% stop-dfs.sh
% stop-mapred.sh
Starting and stopping the daemons (MapReduce 2)
% start-dfs.sh
% start-yarn.sh
% stop-dfs.sh
% stop-yarn.sh
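Once either set of daemons is running, a quick sanity check (a suggested sketch, not part of the original installation steps) is to list the Java processes and make a request against HDFS; the home directory path here is only an example:
# list the running daemons; expect NameNode, DataNode, and SecondaryNameNode,
# plus JobTracker/TaskTracker (MapReduce 1) or ResourceManager/NodeManager (YARN)
% jps
# create a home directory in HDFS and list the filesystem root
% hadoop fs -mkdir /user/$USER
% hadoop fs -ls /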
Fully Distributed Mode
APPENDIX B
Cloudera’s Distribution Including
Apache Hadoop
APPENDIX C
Preparing the NCDC Weather Data
1901.tar.bz2
1902.tar.bz2
1903.tar.bz2
...
2000.tar.bz2
% tar jxf 1901.tar.bz2
% ls -l 1901 | head
011990-99999-1950.gz
011990-99999-1950.gz
...
011990-99999-1950.gz
#!/usr/bin/env bash
# NLineInputFormat gives a single line: key is offset, value is S3 URI
read offset s3file
# Retrieve file from S3 to local disk
echo "reporter:status:Retrieving $s3file" >&2
$HADOOP_INSTALL/bin/hadoop fs -get $s3file .
# Un-bzip and un-tar the local file
target=`basename $s3file .tar.bz2`
mkdir -p $target
echo "reporter:status:Un-tarring $s3file to $target" >&2
tar jxf `basename $s3file` -C $target
# Un-gzip each station file and concat into one file
echo "reporter:status:Un-gzipping $target" >&2
for file in $target/*/*
do
gunzip -c $file >> $target.all
echo "reporter:status:Processed $file" >&2
done
# Put gzipped version into HDFS
echo "reporter:status:Gzipping $target and putting in HDFS" >&2
gzip -c $target.all | $HADOOP_INSTALL/bin/hadoop fs -put - gz/$target.gz
s3n://hadoopbook/ncdc/raw/isd-1901.tar.bz2
s3n://hadoopbook/ncdc/raw/isd-1902.tar.bz2
...
s3n://hadoopbook/ncdc/raw/isd-2000.tar.bz2
The files to copy are listed, one per line, in a small text file (ncdc_files.txt in the command below). By specifying NLineInputFormat, each mapper receives one line of this file, that is, one archive to fetch and repack. The script streams its gzipped output straight back into HDFS by piping into hadoop fs -put -, and it reports progress to the Streaming framework by writing reporter:status lines to standard error. The whole job is run as a Streaming job with no reducers:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-D mapred.reduce.tasks=0 \
-D mapred.map.tasks.speculative.execution=false \
-D mapred.task.timeout=12000000 \
-input ncdc_files.txt \
-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
-output output \
-mapper load_ncdc_map.sh \
-file load_ncdc_map.sh
About the Author
Colophon