Programming Hive

Edward Capriolo, Dean Wampler, and Jason Rutherglen

Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo

Programming Hive
by Edward Capriolo, Dean Wampler, and Jason Rutherglen
Copyright © 2012 Edward Capriolo, Aspect Research Associates, and Jason Rutherglen. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions
are also available for most titles (http://my.safaribooksonline.com). For more information, contact our
corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editors: Mike Loukides and Courtney Nash
Indexer: Bob Pfahler
Production Editors: Iris Febres and Rachel Steely
Cover Designer: Karen Montgomery
Proofreaders: Stacie Arellano and Kiel Van Horn
Interior Designer: David Futato
Illustrator: Rebecca Demarest
October 2012: First Edition.

Revision History for the First Edition:
2012-09-17: First release
See http://oreilly.com/catalog/errata.csp?isbn=9781449319335 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of
O’Reilly Media, Inc. Programming Hive, the image of a hornet’s hive, and related trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a
trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and authors assume
no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-31933-5

Table of Contents

Preface
1. Introduction
An Overview of Hadoop and MapReduce
Hive in the Hadoop Ecosystem
Pig
HBase
Cascading, Crunch, and Others
Java Versus Hive: The Word Count Algorithm
What’s Next

2. Getting Started
Installing a Preconfigured Virtual Machine
Detailed Installation
Installing Java
Installing Hadoop
Local Mode, Pseudodistributed Mode, and Distributed Mode
Testing Hadoop
Installing Hive
What Is Inside Hive?
Starting Hive
Configuring Your Hadoop Environment
Local Mode Configuration
Distributed and Pseudodistributed Mode Configuration
Metastore Using JDBC
The Hive Command
Command Options
The Command-Line Interface
CLI Options
Variables and Properties
Hive “One Shot” Commands
Executing Hive Queries from Files
The .hiverc File
More on Using the Hive CLI
Command History
Shell Execution
Hadoop dfs Commands from Inside Hive
Comments in Hive Scripts
Query Column Headers

3. Data Types and File Formats
Primitive Data Types
Collection Data Types
Text File Encoding of Data Values
Schema on Read

4. HiveQL: Data Definition
Databases in Hive
Alter Database
Creating Tables
Managed Tables
External Tables
Partitioned, Managed Tables
External Partitioned Tables
Customizing Table Storage Formats
Dropping Tables
Alter Table
Renaming a Table
Adding, Modifying, and Dropping a Table Partition
Changing Columns
Adding Columns
Deleting or Replacing Columns
Alter Table Properties
Alter Storage Properties
Miscellaneous Alter Table Statements

5. HiveQL: Data Manipulation
Loading Data into Managed Tables
Inserting Data into Tables from Queries
Dynamic Partition Inserts
Creating Tables and Loading Them in One Query
Exporting Data

6. HiveQL: Queries
SELECT … FROM Clauses
Specify Columns with Regular Expressions
Computing with Column Values
Arithmetic Operators
Using Functions
LIMIT Clause
Column Aliases
Nested SELECT Statements
CASE … WHEN … THEN Statements
When Hive Can Avoid MapReduce
WHERE Clauses
Predicate Operators
Gotchas with Floating-Point Comparisons
LIKE and RLIKE
GROUP BY Clauses
HAVING Clauses
JOIN Statements
Inner JOIN
Join Optimizations
LEFT OUTER JOIN
OUTER JOIN Gotcha
RIGHT OUTER JOIN
FULL OUTER JOIN
LEFT SEMI-JOIN
Cartesian Product JOINs
Map-side Joins
ORDER BY and SORT BY
DISTRIBUTE BY with SORT BY
CLUSTER BY
Casting
Casting BINARY Values
Queries that Sample Data
Block Sampling
Input Pruning for Bucket Tables
UNION ALL

7. HiveQL: Views
Views to Reduce Query Complexity
Views that Restrict Data Based on Conditions
Views and Map Type for Dynamic Tables
View Odds and Ends

8. HiveQL: Indexes
Creating an Index
Bitmap Indexes
Rebuilding the Index
Showing an Index
Dropping an Index
Implementing a Custom Index Handler

9. Schema Design
Table-by-Day
Over Partitioning
Unique Keys and Normalization
Making Multiple Passes over the Same Data
The Case for Partitioning Every Table
Bucketing Table Data Storage
Adding Columns to a Table
Using Columnar Tables
Repeated Data
Many Columns
(Almost) Always Use Compression!

10. Tuning
Using EXPLAIN
EXPLAIN EXTENDED
Limit Tuning
Optimized Joins
Local Mode
Parallel Execution
Strict Mode
Tuning the Number of Mappers and Reducers
JVM Reuse
Indexes
Dynamic Partition Tuning
Speculative Execution
Single MapReduce MultiGROUP BY
Virtual Columns

11. Other File Formats and Compression
Determining Installed Codecs
Choosing a Compression Codec
Enabling Intermediate Compression
Final Output Compression
Sequence Files
Compression in Action
Archive Partition
Compression: Wrapping Up

12. Developing
Changing Log4J Properties
Connecting a Java Debugger to Hive
Building Hive from Source
Running Hive Test Cases
Execution Hooks
Setting Up Hive and Eclipse
Hive in a Maven Project
Unit Testing in Hive with hive_test
The New Plugin Developer Kit

13. Functions
Discovering and Describing Functions
Calling Functions
Standard Functions
Aggregate Functions
Table Generating Functions
A UDF for Finding a Zodiac Sign from a Day
UDF Versus GenericUDF
Permanent Functions
User-Defined Aggregate Functions
Creating a COLLECT UDAF to Emulate GROUP_CONCAT
User-Defined Table Generating Functions
UDTFs that Produce Multiple Rows
UDTFs that Produce a Single Row with Multiple Columns
UDTFs that Simulate Complex Types
Accessing the Distributed Cache from a UDF
Annotations for Use with Functions
Deterministic
Stateful
DistinctLike
Macros

14. Streaming
Identity Transformation
Changing Types
Projecting Transformation
Manipulative Transformations
Using the Distributed Cache
Producing Multiple Rows from a Single Row
Calculating Aggregates with Streaming
CLUSTER BY, DISTRIBUTE BY, SORT BY
GenericMR Tools for Streaming to Java
Calculating Cogroups

15. Customizing Hive File and Record Formats
File Versus Record Formats
Demystifying CREATE TABLE Statements
File Formats
SequenceFile
RCFile
Example of a Custom Input Format: DualInputFormat
Record Formats: SerDes
CSV and TSV SerDes
ObjectInspector
Think Big Hive Reflection ObjectInspector
XML UDF
XPath-Related Functions
JSON SerDe
Avro Hive SerDe
Defining Avro Schema Using Table Properties
Defining a Schema from a URI
Evolving Schema
Binary Output

16. Hive Thrift Service
Starting the Thrift Server
Setting Up Groovy to Connect to HiveService
Connecting to HiveServer
Getting Cluster Status
Result Set Schema
Fetching Results
Retrieving Query Plan
Metastore Methods
Example Table Checker
Administrating HiveServer
Productionizing HiveService
Cleanup
Hive ThriftMetastore
ThriftMetastore Configuration
Client Configuration

17. Storage Handlers and NoSQL
Storage Handler Background
HiveStorageHandler
HBase
Cassandra
Static Column Mapping
Transposed Column Mapping for Dynamic Columns
Cassandra SerDe Properties
DynamoDB

18. Security
Integration with Hadoop Security
Authentication with Hive
Authorization in Hive
Users, Groups, and Roles
Privileges to Grant and Revoke
Partition-Level Privileges
Automatic Grants

19. Locking
Locking Support in Hive with Zookeeper
Explicit, Exclusive Locks

20. Hive Integration with Oozie
Oozie Actions
Hive Thrift Service Action
A Two-Query Workflow
Oozie Web Console
Variables in Workflows
Capturing Output
Capturing Output to Variables

21. Hive and Amazon Web Services (AWS)
Why Elastic MapReduce?
Instances
Before You Start
Managing Your EMR Hive Cluster
Thrift Server on EMR Hive
Instance Groups on EMR
Configuring Your EMR Cluster
Deploying hive-site.xml
Deploying a .hiverc Script
Setting Up a Memory-Intensive Configuration
Persistence and the Metastore on EMR
HDFS and S3 on EMR Cluster
Putting Resources, Configs, and Bootstrap Scripts on S3
Logs on S3
Spot Instances
Security Groups
EMR Versus EC2 and Apache Hive
Wrapping Up

22. HCatalog
Introduction
MapReduce
Reading Data
Writing Data
Command Line
Security Model
Architecture

23. Case Studies
m6d.com (Media6Degrees)
Data Science at M6D Using Hive and R
M6D UDF Pseudorank
M6D Managing Hive Data Across Multiple MapReduce Clusters
Outbrain
In-Site Referrer Identification
Counting Uniques
Sessionization
NASA’s Jet Propulsion Laboratory
The Regional Climate Model Evaluation System
Our Experience: Why Hive?
Some Challenges and How We Overcame Them
Photobucket
Big Data at Photobucket
What Hardware Do We Use for Hive?
What’s in Hive?
Who Does It Support?
SimpleReach
Experiences and Needs from the Customer Trenches
A Karmasphere Perspective
Introduction
Use Case Examples from the Customer Trenches

Glossary
Appendix: References
Index

Preface

Programming Hive introduces Hive, an essential tool in the Hadoop ecosystem that
provides an SQL (Structured Query Language) dialect for querying data stored in the
Hadoop Distributed Filesystem (HDFS), in other filesystems that integrate with Hadoop,
such as MapR-FS and Amazon’s S3, and in databases like HBase (the Hadoop database)
and Cassandra.
Most data warehouse applications are implemented using relational databases that use
SQL as the query language. Hive lowers the barrier for moving these applications to
Hadoop. People who know SQL can learn Hive easily. Without Hive, these users must
learn new languages and tools to become productive again. Similarly, Hive makes it
easier for developers to port SQL-based applications to Hadoop, compared to other
tool options. Without Hive, developers would face a daunting challenge when porting
their SQL applications to Hadoop.
Still, there are aspects of Hive that are different from other SQL-based environments.
Documentation for Hive users and Hadoop developers has been sparse. We decided
to write this book to fill that gap. We provide a pragmatic, comprehensive introduction
to Hive that is suitable for SQL experts, such as database designers and business analysts. We also cover the in-depth technical details that Hadoop developers require for
tuning and customizing Hive.
You can learn more at the book’s catalog page (http://oreil.ly/Programming_Hive).

Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions. Definitions of most terms can be found in the Glossary.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements
such as variable or function names, databases, data types, environment variables,
statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This icon signifies a tip, suggestion, or general note.

This icon indicates a warning or caution.

Using Code Examples
This book is here to help you get your job done. In general, you may use the code in
this book in your programs and documentation. You do not need to contact us for
permission unless you’re reproducing a significant portion of the code. For example,
writing a program that uses several chunks of code from this book does not require
permission. Selling or distributing a CD-ROM of examples from O’Reilly books does
require permission. Answering a question by citing this book and quoting example
code does not require permission. Incorporating a significant amount of example code
from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Programming Hive by Edward Capriolo,
Dean Wampler, and Jason Rutherglen (O’Reilly). Copyright 2012 Edward Capriolo,
Aspect Research Associates, and Jason Rutherglen, 978-1-449-31933-5.”
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us at permissions@oreilly.com.

Safari® Books Online
Safari Books Online (www.safaribooksonline.com) is an on-demand digital
library that delivers expert content in both book and video form from the
world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research,
problem solving, learning, and certification training.

Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands
of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley
Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John
Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT
Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit
us online.

How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://oreil.ly/Programming_Hive.
To comment or ask technical questions about this book, send email to
bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website
at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

What Brought Us to Hive?
The three of us arrived here from different directions.

Edward Capriolo
When I first became involved with Hadoop, I saw the distributed filesystem and MapReduce as a great way to tackle computer-intensive problems. However, programming
in the MapReduce model was a paradigm shift for me. Hive offered a fast and simple
way to take advantage of MapReduce in an SQL-like world I was comfortable in. This
approach also made it easy to prototype proof-of-concept applications and to
champion Hadoop as a solution internally. Even though I am now very familiar with
Hadoop internals, Hive is still my primary method of working with Hadoop.
It is an honor to write a Hive book. Being a Hive Committer and a member of the
Apache Software Foundation is my most valued accolade.

Dean Wampler
As a “big data” consultant at Think Big Analytics, I work with experienced “data people”
who eat and breathe SQL. For them, Hive is a necessary and sufficient condition for
Hadoop to be a viable tool to leverage their investment in SQL and open up new opportunities for data analytics.
Hive has lacked good documentation. I suggested to my previous editor at O’Reilly,
Mike Loukides, that a Hive book was needed by the community. So, here we are…

Jason Rutherglen
I work at Think Big Analytics as a software architect. My career has involved an array
of technologies including search, Hadoop, mobile, cryptography, and natural language
processing. Hive is the ultimate way to build a data warehouse using open technologies
on any amount of data. I use Hive regularly on a variety of projects.

Acknowledgments
We thank everyone involved with Hive, including committers, contributors, and end
users.
Mark Grover wrote the chapter on Hive and Amazon Web Services. He is a contributor
to the Apache Hive project and is active helping others on the Hive IRC channel.
David Ha and Rumit Patel, at M6D, contributed the case study and code on the Rank
function. The ability to do Rank in Hive is a significant feature.
Ori Stitelman, at M6D, contributed the case study, Data Science using Hive and R,
which demonstrates how Hive can be used to make a first pass on large data sets and
produce results to be used by a second R process.
David Funk contributed three use cases on in-site referrer identification, sessionization,
and counting unique visitors. David’s techniques show how rewriting and optimizing
Hive queries can make large-scale MapReduce data analysis more efficient.
Ian Robertson read the entire first draft of the book and provided very helpful feedback
on it. We’re grateful to him for providing that feedback on short notice and a tight
schedule.

John Sichi provided technical review for the book. John was also instrumental in driving
through some of the newer features in Hive like StorageHandlers and Indexing Support.
He has been actively growing and supporting the Hive community.
Alan Gates, author of Programming Pig, contributed the HCatalog chapter. Nanda
Vijaydev contributed the chapter on how Karmasphere offers productized enhancements for Hive. Eric Lubow contributed the SimpleReach case study. Chris A. Mattmann, Paul Zimdars, Cameron Goodale, Andrew F. Hart, Jinwon Kim, Duane Waliser,
and Peter Lean contributed the NASA JPL case study.

CHAPTER 1

Introduction

From the early days of the Internet’s mainstream breakout, the major search engines
and ecommerce companies wrestled with ever-growing quantities of data. More recently, social networking sites experienced the same problem. Today, many organizations realize that the data they gather is a valuable resource for understanding their
customers, the performance of their business in the marketplace, and the effectiveness
of their infrastructure.
The Hadoop ecosystem emerged as a cost-effective way of working with such large data
sets. It imposes a particular programming model, called MapReduce, for breaking up
computation tasks into units that can be distributed around a cluster of commodity,
server-class hardware, thereby providing cost-effective, horizontal scalability. Underneath this computation model is a distributed filesystem called the Hadoop Distributed
Filesystem (HDFS). The filesystem is “pluggable,” however, and there are now several commercial and open source alternatives.
However, a challenge remains: how do you move an existing data infrastructure to
Hadoop, when that infrastructure is based on traditional relational databases and the
Structured Query Language (SQL)? What about the large base of SQL users, from expert
database designers and administrators to casual users who use SQL to extract
information from their data warehouses?
This is where Hive comes in. Hive provides an SQL dialect, called Hive Query Language (abbreviated HiveQL or just HQL), for querying data stored in a Hadoop cluster.
SQL knowledge is widespread for a reason: it’s an effective, reasonably intuitive model
for organizing and using data. Mapping these familiar data operations to the low-level
MapReduce Java API can be daunting, even for experienced Java developers. Hive does
this dirty work for you, so you can focus on the query itself. Hive translates most queries
to MapReduce jobs, thereby exploiting the scalability of Hadoop, while presenting a
familiar SQL abstraction. If you don’t believe us, see “Java Versus Hive: The Word
Count Algorithm” later in this chapter.

Hive is most suited for data warehouse applications, where the data is relatively static and fast response times are not required.
Hive is not a full database. The design constraints and limitations of Hadoop and HDFS
impose limits on what Hive can do. The biggest limitation is that Hive does not provide
record-level update, insert, or delete. You can, however, generate new tables from queries or
output query results to files. Also, because Hadoop is a batch-oriented system, Hive
queries have higher latency, due to the start-up overhead for MapReduce jobs. Queries
that would finish in seconds for a traditional database take longer for Hive, even for
relatively small data sets [1]. Finally, Hive does not provide transactions.
So, Hive doesn’t provide crucial features required for online transaction processing (OLTP). It’s closer to being an online analytic processing (OLAP) tool, but as we’ll see,
Hive isn’t ideal for satisfying the “online” part of OLAP, at least today, since there can
be significant latency between issuing a query and receiving a reply, due both to the
overhead of Hadoop and to the size of the data sets Hadoop was designed to serve.
If you need OLTP features for large-scale data, you should consider using a NoSQL
database. Examples include HBase, a NoSQL database integrated with Hadoop [2]; Cassandra [3]; and DynamoDB, if you are using Amazon’s Elastic MapReduce (EMR) or
Elastic Compute Cloud (EC2) [4]. You can even integrate Hive with these databases
(among others), as we’ll discuss in Chapter 17.
So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.
Because most data warehouse applications are implemented using SQL-based relational databases, Hive lowers the barrier for moving these applications to Hadoop.
People who know SQL can learn Hive easily. Without Hive, these users would need to
learn new languages and tools to be productive again.
Similarly, Hive makes it easier for developers to port SQL-based applications to
Hadoop, compared with other Hadoop languages and tools.
However, like most SQL dialects, HiveQL does not conform to the ANSI SQL standard,
and it differs in various ways from the familiar SQL dialects provided by Oracle,
MySQL, and SQL Server. (It is closest to MySQL’s dialect of SQL.)

1. However, for the big data sets Hive is designed for, this start-up overhead is trivial compared to the actual
processing time.
2. See the Apache HBase website, http://hbase.apache.org, and HBase: The Definitive Guide by Lars George
(O’Reilly).
3. See the Cassandra website, http://cassandra.apache.org/, and High Performance Cassandra Cookbook by
Edward Capriolo (Packt).
4. See the DynamoDB website, http://aws.amazon.com/dynamodb/.

So, this book has a dual purpose. First, it provides a comprehensive, example-driven
introduction to HiveQL for all users, from developers, database administrators and
architects, to less technical users, such as business analysts.
Second, the book provides the in-depth technical details required by developers and
Hadoop administrators to tune Hive query performance and to customize Hive with
user-defined functions, custom data formats, etc.
We wrote this book out of frustration that Hive lacked good documentation, especially
for new users who aren’t developers and aren’t accustomed to browsing project artifacts
like bug and feature databases, source code, etc., to get the information they need. The
Hive Wiki [5] is an invaluable source of information, but its explanations are sometimes
sparse and not always up to date. We hope this book remedies those issues, providing
a single, comprehensive guide to all the essential features of Hive and how to use them
effectively [6].

An Overview of Hadoop and MapReduce
If you’re already familiar with Hadoop and the MapReduce computing model, you can
skip this section. While you don’t need an intimate knowledge of MapReduce to use
Hive, understanding the basic principles of MapReduce will help you understand what
Hive is doing behind the scenes and how you can use Hive more effectively.
We provide a brief overview of Hadoop and MapReduce here. For more details, see
Hadoop: The Definitive Guide by Tom White (O’Reilly).

MapReduce
MapReduce is a computing model that decomposes large data manipulation jobs into
individual tasks that can be executed in parallel across a cluster of servers. The results
of the tasks can be joined together to compute the final results.
The MapReduce programming model was developed at Google and described in an
influential paper called MapReduce: simplified data processing on large clusters (see the
Appendix). The Google Filesystem was described a year earlier in a paper
called The Google filesystem (also listed in the Appendix). Both papers inspired the creation of Hadoop
by Doug Cutting.

5. See https://cwiki.apache.org/Hive/.
6. It’s worth bookmarking the wiki link, however, because the wiki contains some more obscure information
we won’t cover here.

The term MapReduce comes from the two fundamental data-transformation operations
used, map and reduce. A map operation converts the elements of a collection from one
form to another. In this case, input key-value pairs are converted to zero-to-many
output key-value pairs, where the input and output keys might be completely different
and the input and output values might be completely different.
In MapReduce, all the key-value pairs for a given key are sent to the same reduce operation.
Specifically, the key and a collection of the values are passed to the reducer. The goal
of “reduction” is to convert the collection to a value, such as the sum or average of a
collection of numbers, or to another collection. A final key-value pair is emitted by the
reducer. Again, the input versus output keys and values may be different. Note that if
the job requires no reduction step, then it can be skipped.
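The map and reduce data flow just described can be mimicked in a few lines of single-process Python. This is only an illustrative sketch, not how Hadoop is implemented: the names run_mapreduce, mapper, and reducer, and the sample records, are assumptions introduced here.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate map -> group by key -> reduce in a single process."""
    groups = defaultdict(list)
    for key, value in records:
        # A mapper may emit zero, one, or many output pairs per input pair.
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Each key, with the collection of all its values, goes to one reduce call.
    return [reducer(key, values) for key, values in sorted(groups.items())]

def mapper(key, value):
    # Hypothetical input: key is a record ID, value is "city,temperature".
    city, temperature = value.split(",")
    if temperature:  # emit nothing for records with no reading
        yield city, float(temperature)

def reducer(key, values):
    # Reduce the collection of values to a single value: the average.
    return key, sum(values) / len(values)

records = [(0, "Oslo,4.0"), (1, "Cairo,30.0"), (2, "Oslo,6.0"), (3, "Cairo,")]
result = run_mapreduce(records, mapper, reducer)
print(result)
```

Note how the output keys (city names) and values (averages) differ in type from the input keys (record IDs) and values (text), and how one input record produces no output at all.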
An implementation infrastructure like the one provided by Hadoop handles most of
the chores required to make jobs run successfully. For example, Hadoop determines
how to decompose the submitted job into individual map and reduce tasks to run, it
schedules those tasks given the available resources, it decides where to send a particular
task in the cluster (usually where the corresponding data is located, when possible, to
minimize network overhead), it monitors each task to ensure successful completion,
and it restarts tasks that fail.
The Hadoop Distributed Filesystem, HDFS, or a similar distributed filesystem, manages
data across the cluster. Each block is replicated several times (three copies is the usual
default), so that no single hard drive or server failure results in data loss. Also, because
the goal is to optimize the processing of very large data sets, HDFS and similar filesystems use very large block sizes, typically 64 MB or multiples thereof. Such large blocks
can be stored contiguously on hard drives so they can be written and read with minimal
seeking of the drive heads, thereby maximizing write and read performance.
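To get a feel for these numbers, the following sketch works out how many blocks and replicas a single file would occupy. It assumes the 64 MB block size and three-way replication mentioned above; block_usage is a hypothetical helper written for this illustration, not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64   # block size cited in the text
REPLICATION = 3      # usual default number of copies

def block_usage(file_size_mb):
    """Return (blocks, replicas stored, raw MB consumed) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION
    # Assumes a partially filled final block occupies only the space it needs.
    raw_mb = file_size_mb * REPLICATION
    return blocks, replicas, raw_mb

print(block_usage(1000))
```

Under these assumptions, a 1,000 MB file is split into 16 blocks, stored as 48 block replicas that consume about 3,000 MB of raw disk across the cluster.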
To make MapReduce clearer, let’s walk through a simple example, the Word
Count algorithm that has become the “Hello World” of MapReduce [7]. Word Count
returns a list of all the words that appear in a corpus (one or more documents) and the
count of how many times each word appears. The output shows each word found and
its count, one per line. By common convention, the word (output key) and count (output value) are usually separated by a tab separator.
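Before looking at the distributed version, it may help to see the expected result produced by an ordinary, non-distributed program. The sketch below is plain Python rather than MapReduce; the two-document corpus and the function names are made up for illustration.

```python
from collections import Counter

def word_count(documents):
    """Count words across a corpus, splitting on whitespace."""
    counts = Counter()
    for text in documents:
        counts.update(text.split())
    return counts

def format_output(counts):
    # One word per line: the word (key), a tab, then the count (value).
    return "\n".join(f"{word}\t{n}" for word, n in sorted(counts.items()))

corpus = ["Hadoop uses MapReduce", "Hive uses Hadoop"]
output = format_output(word_count(corpus))
print(output)
```

The MapReduce version produces the same kind of output, but spreads the counting across many processes.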
Figure 1-1 shows how Word Count works in MapReduce.
There is a lot going on here, so let’s walk through it from left to right.
Each Input box on the left-hand side of Figure 1-1 is a separate document. There are
four documents; the third is empty and the others contain just a few words,
to keep things simple.
By default, a separate Mapper process is invoked to process each document. In real
scenarios, large documents might be split and each split would be sent to a separate
Mapper. Also, there are techniques for combining many small documents into a single
split for a Mapper. We won’t worry about those details now.
7. If you’re not a developer, a “Hello World” program is the traditional first program you write when learning
a new language or tool set.

Figure 1-1. Word Count algorithm using MapReduce

The fundamental data structure for input and output in MapReduce is the key-value
pair. After each Mapper is started, it is called repeatedly for each line of text from the
document. For each call, the key passed to the mapper is the character offset into the
document at the start of the line. The corresponding value is the text of the line.
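As a rough illustration of these input records, the sketch below splits a document into the (offset, line) pairs a mapper would be called with. It is plain Java rather than Hadoop code, and the class name LineRecordsSketch and the sample text are hypothetical. It counts characters, as the description above does; with a multi-byte encoding, an offset measured in bytes would differ.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;

// Sketch only: produces the (key, value) = (offset, line) input records.
class LineRecordsSketch {
    // Offsets count characters from the start of the document.
    static List<SimpleEntry<Long, String>> records(String document) {
        List<SimpleEntry<Long, String>> out =
            new ArrayList<SimpleEntry<Long, String>>();
        int start = 0;
        while (start < document.length()) {
            int newline = document.indexOf('\n', start);
            int end = (newline < 0) ? document.length() : newline;
            out.add(new SimpleEntry<Long, String>(
                (long) start, document.substring(start, end)));
            start = end + 1;  // skip past the line terminator
        }
        return out;
    }

    public static void main(String[] args) {
        String document = "Hadoop is fun\nHive is fun too\n";  // hypothetical input
        for (SimpleEntry<Long, String> r : records(document)) {
            System.out.println(r.getKey() + " -> " + r.getValue());
        }
    }
}
```

Note that an empty document yields no records at all, so a mapper assigned to it is never called.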
In Word Count, the character offset (key) is discarded. The value, the line of text, is
tokenized into words, using one of several possible techniques (e.g., splitting on whitespace is the simplest, but it can leave in undesirable punctuation). We’ll also assume
that the Mapper converts each word to lowercase, so for example, “FUN” and “fun”
will be counted as the same word.
Finally, for each word in the line, the mapper outputs a key-value pair, with the word
as the key and the number 1 as the value (i.e., the count of “one occurrence”). Note
that the output types of the keys and values are different from the input types.
Part of Hadoop’s magic is the Sort and Shuffle phase that comes next. Hadoop sorts
the key-value pairs by key and it “shuffles” all pairs with the same key to the same
Reducer. There are several possible techniques that can be used to decide which reducer
gets which range of keys. We won’t worry about that here, but for illustrative purposes,
we have assumed in the figure that a particular alphanumeric partitioning was used. In
a real implementation, it would be different.
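One widely used technique is to hash the key and take the remainder modulo the number of reducers, which is the idea behind Hadoop's default hash partitioning. The sketch below shows that idea in plain Java under stated assumptions: PartitionSketch, partitionFor, and the choice of three reducers are hypothetical names and values for illustration only.

```java
// Sketch only: hash partitioning of keys across reducers.
class PartitionSketch {
    // Map a key to one of numReducers partitions. Masking with
    // Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String[] words = {"hadoop", "hive", "fun", "hive"};
        for (String w : words) {
            System.out.println(w + " -> reducer " + partitionFor(w, 3));
        }
    }
}
```

Because the partition depends only on the key, every pair with the same word lands on the same reducer, which is the property the Sort and Shuffle phase has to guarantee.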
For the mapper to simply output a count of 1 every time a word is seen is a bit wasteful
of network and disk I/O used in the sort and shuffle. (It does minimize the memory
used in the Mappers, however.) One optimization is to keep track of the count for each
word and then output only one count for each word when the Mapper finishes. There
are several ways to do this optimization, but the simple approach is logically correct
and sufficient for this discussion.
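A minimal sketch of the optimization, in plain Java rather than the Hadoop API, might keep a per-mapper table of counts and emit each distinct word once. The class name LocalAggregationSketch and the input lines are hypothetical. The trade-off noted above still applies: the table lives in the mapper's memory.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Sketch only: local aggregation inside one mapper.
class LocalAggregationSketch {
    // Count words for one mapper's input lines in memory, so each distinct
    // word is emitted once with its local total instead of once per
    // occurrence with the value 1.
    static Map<String, Integer> mapWithLocalCounts(String[] lines) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;  // blank lines produce one empty token
                }
                String word = token.toLowerCase(Locale.ROOT);
                Integer old = counts.get(word);
                counts.put(word, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {"Hadoop uses MapReduce", "", "hadoop is fun"};
        for (Map.Entry<String, Integer> e : mapWithLocalCounts(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The reducer does not change: it still sums whatever counts arrive for a word, whether they are many 1s or a few partial totals.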
The inputs to each Reducer are again key-value pairs, but this time, each key will be
one of the words found by the mappers and the value will be a collection of all the counts
emitted by all the mappers for that word. Note that the type of the key and the type of
the value collection elements are the same as the types used in the Mapper’s output.
That is, the key type is a character string and the value collection element type is an
integer.
To finish the algorithm, all the reducer has to do is add up all the counts in the value
collection and write a final key-value pair consisting of each word and the count for
that word.
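The whole flow can be imitated in a single process. The following sketch is plain Java with no Hadoop dependency, so it shows only the logic of the map, sort and shuffle, and reduce steps, not how Hadoop distributes them. All names (WordCountSketch and its methods) and the sample lines are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: Word Count as three in-memory steps.
class WordCountSketch {
    // Map: the line offset (input key) is ignored; emit (word, 1) per token.
    static List<SimpleEntry<String, Integer>> map(String line) {
        List<SimpleEntry<String, Integer>> out =
            new ArrayList<SimpleEntry<String, Integer>>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<String, Integer>(
                    token.toLowerCase(Locale.ROOT), 1));
            }
        }
        return out;
    }

    // Sort and shuffle: group all values by key; TreeMap keeps keys sorted.
    static TreeMap<String, List<Integer>> shuffle(
            List<SimpleEntry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped =
            new TreeMap<String, List<Integer>>();
        for (SimpleEntry<String, Integer> p : pairs) {
            List<Integer> values = grouped.get(p.getKey());
            if (values == null) {
                values = new ArrayList<Integer>();
                grouped.put(p.getKey(), values);
            }
            values.add(p.getValue());
        }
        return grouped;
    }

    // Reduce: collapse the collection of counts for one word into a total.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Run all three steps over the input lines.
    static TreeMap<String, Integer> wordCount(String[] lines) {
        List<SimpleEntry<String, Integer>> pairs =
            new ArrayList<SimpleEntry<String, Integer>>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        TreeMap<String, Integer> totals = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            totals.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] lines = {"Hadoop is FUN", "", "fun with hadoop"};
        for (Map.Entry<String, Integer> e : wordCount(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The output follows the convention described earlier: one word per line, with the word and its count separated by a tab.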
Word Count isn’t a toy example. The data it produces is used in spell checkers, language
detection and translation systems, and other applications.

Hive in the Hadoop Ecosystem
The Word Count algorithm, like most that you might implement with Hadoop, is a
little involved. When you actually implement such algorithms using the Hadoop Java
API, there are even more low-level details you have to manage yourself. It’s a job that’s
only suitable for an experienced Java developer, potentially putting Hadoop out of
reach of users who aren’t programmers, even when they understand the algorithm they
want to use.
In fact, many of those low-level details are actually quite repetitive from one job to the
next, from low-level chores like wiring together Mappers and Reducers to certain data
manipulation constructs, like filtering for just the data you want and performing SQL-like joins on data sets. There’s a real opportunity to eliminate reinventing these idioms
by letting “higher-level” tools handle them automatically.
That’s where Hive comes in. It not only provides a familiar programming model for
people who know SQL, it also eliminates lots of boilerplate and sometimes-tricky
coding you would have to do in Java.
This is why Hive is so important to Hadoop, whether you are a DBA or a Java developer.
Hive lets you complete a lot of work with relatively little effort.
Figure 1-2 shows the major “modules” of Hive and how they work with Hadoop.
There are several ways to interact with Hive. In this book, we will mostly focus on the
CLI, command-line interface. For people who prefer graphical user interfaces, commercial and open source options are starting to appear, including a commercial product
from Karmasphere (http://karmasphere.com), Cloudera’s open source Hue (https://github.com/cloudera/hue),
a new “Hive-as-a-service” offering from Qubole (http://qubole.com), and others.

Figure 1-2. Hive modules

Bundled with the Hive distribution is the CLI, a simple web interface called Hive web
interface (HWI), and programmatic access through JDBC, ODBC, and a Thrift server
(see Chapter 16).
All commands and queries go to the Driver, which compiles the input, optimizes the
computation required, and executes the required steps, usually with MapReduce jobs.
When MapReduce jobs are required, Hive doesn’t generate Java MapReduce programs.
Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an
XML file representing the “job plan.” In other words, these generic modules function
like mini language interpreters and the “language” to drive the computation is encoded
in XML.
Hive communicates with the JobTracker to initiate the MapReduce job. Hive does not
have to be running on the same master node with the JobTracker. In larger clusters,
it’s common to have edge nodes where tools like Hive run. They communicate remotely
with the JobTracker on the master node to execute jobs. Usually, the data files to be
processed are in HDFS, which is managed by the NameNode.
The Metastore is a separate relational database (usually a MySQL instance) where Hive
persists table schemas and other system metadata. We’ll discuss it in detail in Chapter 2.
While this is a book about Hive, it’s worth mentioning other higher-level tools that you
should consider for your needs. Hive is best suited for data warehouse applications,
where real-time responsiveness to queries and record-level inserts, updates, and deletes
are not required. Of course, Hive is also very nice for people who know SQL already.
However, some of your work may be easier to accomplish with alternative tools.

Pig
The best known alternative to Hive is Pig (see http://pig.apache.org), which was developed at Yahoo! about the same time Facebook was developing Hive. Pig is also now a
top-level Apache project that is closely associated with Hadoop.
Suppose you have one or more sources of input data and you need to perform a complex
set of transformations to generate one or more collections of output data. Using Hive,
you might be able to do this with nested queries (as we’ll see), but at some point it will
be necessary to resort to temporary tables (which you have to manage yourself) to
manage the complexity.
Pig is described as a data flow language, rather than a query language. In Pig, you write
a series of declarative statements that define relations from other relations, where each
new relation performs some new data transformation. Pig looks at these declarations
and then builds up a sequence of MapReduce jobs to perform the transformations until
the final results are computed the way that you want.
This step-by-step “flow” of data can be more intuitive than a complex set of queries.
For this reason, Pig is often used as part of ETL (Extract, Transform, and Load) processes used to ingest external data into a Hadoop cluster and transform it into a more
desirable form.
A drawback of Pig is that it uses a custom language not based on SQL. This is appropriate, since it is not designed as a query language, but it also means that Pig is less
suitable for porting over SQL applications, and experienced SQL users will face a steeper
learning curve with Pig.
Nevertheless, it’s common for Hadoop teams to use a combination of Hive and Pig,
selecting the appropriate tool for particular jobs.
Programming Pig by Alan Gates (O’Reilly) provides a comprehensive introduction to
Pig.

HBase
What if you need the database features that Hive doesn’t provide, like row-level
updates, rapid query response times, and transactions?
HBase is a distributed and scalable data store that supports row-level updates, rapid
queries, and row-level transactions (but not multirow transactions).
HBase is inspired by Google’s Bigtable, although it doesn’t implement all Bigtable
features. One of the important features HBase supports is column-oriented storage,
where columns can be organized into column families. Column families are physically
stored together in a distributed cluster, which makes reads and writes faster when the
typical query scenarios involve a small subset of the columns. Rather than reading entire
rows and discarding most of the columns, you read only the columns you need.
HBase can be used like a key-value store, where a single key is used for each row to
provide very fast reads and writes of the row’s columns or column families. HBase also
keeps a configurable number of versions of each column’s values (marked by timestamps), so it’s possible to go “back in time” to previous values, when needed.
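To make the idea of timestamped versions concrete, here is a small Java sketch of a single cell that retains a bounded number of versions and can be read “as of” an earlier time. It is an illustration of the concept only, not HBase's API or storage format; VersionedCellSketch and its methods are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch only: one cell holding several timestamped versions of a value.
class VersionedCellSketch {
    private final int maxVersions;
    // timestamp -> value, sorted so the newest version is the last entry
    private final TreeMap<Long, String> versions = new TreeMap<Long, String>();

    VersionedCellSketch(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    // Store a new version and evict the oldest ones beyond the limit.
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollFirstEntry();
        }
    }

    // Latest value, or null if the cell is empty.
    String get() {
        Map.Entry<Long, String> e = versions.lastEntry();
        return e == null ? null : e.getValue();
    }

    // Value as of the given time: the newest version with timestamp <= asOf.
    String get(long asOf) {
        Map.Entry<Long, String> e = versions.floorEntry(asOf);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        VersionedCellSketch cell = new VersionedCellSketch(2);
        cell.put(100L, "first");
        cell.put(200L, "second");
        cell.put(300L, "third");  // evicts the version written at time 100
        System.out.println(cell.get());      // latest version
        System.out.println(cell.get(250L));  // version visible at time 250
    }
}
```

With a limit of two versions, the oldest value is discarded when the third is written, so reads earlier than the oldest retained timestamp return nothing.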
Finally, what is the relationship between HBase and Hadoop? HBase uses HDFS (or
one of the other distributed filesystems) for durable file storage of data. To provide
row-level updates and fast queries, HBase also uses in-memory caching of data and
local files for the append log of updates. Periodically, the durable files are updated with
all the append log updates, etc.
HBase doesn’t provide a query language like SQL, but Hive is now integrated with
HBase. We’ll discuss this integration in “HBase” on page 222.
For more on HBase, see the HBase website, and HBase: The Definitive Guide by Lars
George.

Cascading, Crunch, and Others
There are several other “high-level” languages that have emerged outside of the Apache
Hadoop umbrella, which also provide nice abstractions on top of Hadoop to reduce
the amount of low-level boilerplate code required for typical jobs. For completeness,
we list several of them in Table 1-1. All are JVM (Java Virtual Machine) libraries that can be
used from programming languages like Java, Clojure, Scala, JRuby, Groovy, and Jython, as opposed to tools with their own languages, like Hive and Pig.
Using one of these programming languages has advantages and disadvantages. It makes
these tools less attractive to nonprogrammers who already know SQL. However, for
developers, these tools provide the full power of a Turing complete programming language. Neither Hive nor Pig is Turing complete. We’ll learn how to extend Hive with
Java code when we need additional functionality that Hive doesn’t provide.
Table 1-1. Alternative higher-level libraries for Hadoop
Name | URL | Description
Cascading | http://cascading.org | Java API with Data Processing abstractions. There are now many Domain Specific Languages (DSLs) for Cascading in other languages, e.g., Scala, Groovy, JRuby, and Jython.
Cascalog | https://github.com/nathanmarz/cascalog | A Clojure DSL for Cascading that provides additional functionality inspired by Datalog for data processing and query abstractions.
Crunch | https://github.com/cloudera/crunch | A Java and Scala API for defining data flow pipelines.

Because Hadoop is a batch-oriented system, there are tools with different distributed
computing models that are better suited for event stream processing, where closer to
“real-time” responsiveness is required. Here we list several of the many alternatives
(Table 1-2).
Table 1-2. Distributed data processing tools that don’t use MapReduce
Name | URL | Description
Spark | http://www.spark-project.org/ | A distributed computing framework based on the idea of distributed data sets with a Scala API. It can work with HDFS files and it offers notable performance improvements over Hadoop MapReduce for many computations. There is also a project to port Hive to Spark, called Shark (http://shark.cs.berkeley.edu/).
Storm | https://github.com/nathanmarz/storm | A real-time event stream processing system.
Kafka | http://incubator.apache.org/kafka/index.html | A distributed publish-subscribe messaging system.

Finally, it’s important to consider when you don’t need a full cluster (e.g., for smaller
data sets or when the time to perform a computation is less critical). Also, many alternative tools are easier to use when prototyping algorithms or doing exploration with a
subset of data. Some of the more popular options are listed in Table 1-3.
Table 1-3. Other data processing languages and tools
Name | URL | Description
R | http://r-project.org/ | An open source language for statistical analysis and graphing of data that is popular with statisticians, economists, etc. It’s not a distributed system, so the data sizes it can handle are limited. There are efforts to integrate R with Hadoop.
Matlab | http://www.mathworks.com/products/matlab/index.html | A commercial system for data analysis and numerical methods that is popular with engineers and scientists.
Octave | http://www.gnu.org/software/octave/ | An open source clone of Matlab.
Mathematica | http://www.wolfram.com/mathematica/ | A commercial data analysis, symbolic manipulation, and numerical methods system that is also popular with scientists and engineers.
SciPy, NumPy | http://scipy.org | Extensive software package for scientific programming in Python, which is widely used by data scientists.


Java Versus Hive: The Word Count Algorithm
If you are not a Java programmer, you can skip to the next section.
If you are a Java programmer, you might be reading this book because you’ll need to
support the Hive users in your organization. You might be skeptical about using Hive
for your own work. If so, consider the following example that implements the Word
Count algorithm we discussed above, first using the Java MapReduce API and then
using Hive.
It’s very common to use Word Count as the first Java MapReduce program that people
write, because the algorithm is simple to understand, so you can focus on the API.
Hence, it has become the “Hello World” of the Hadoop world.
The following Java implementation is included in the Apache Hadoop distribution.8 If
you don’t know Java (and you’re still reading this section), don’t worry, we’re only
showing you the code for the size comparison:
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

8. Apache Hadoop word count: http://wiki.apache.org/hadoop/WordCount.


That was 63 lines of Java code. We won’t explain the API details.9 Here is the same
calculation written in HiveQL, which is just 8 lines of code, and requires neither compilation nor the creation of a “JAR” (Java ARchive) file:
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

We’ll explain all this HiveQL syntax later on.

9. See Hadoop: The Definitive Guide by Tom White for the details.

In both examples, the files were tokenized into words using the simplest possible approach: splitting on whitespace boundaries. This approach doesn’t properly handle
punctuation, it doesn’t recognize that singular and plural forms of words are the same
word, etc. However, it’s good enough for our purposes here.10
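The punctuation problem is easy to see in a small Java sketch that compares whitespace splitting with one possible refinement. The refinement shown (lowercasing, then splitting on anything other than ASCII letters, digits, and apostrophes) is an assumption made for illustration, not what either example above does, and it still does nothing about singular and plural forms. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: two ways to tokenize a line of text.
class TokenizerSketch {
    // Simplest approach: split on runs of whitespace.
    // Punctuation stays attached to the neighboring word.
    static List<String> splitOnWhitespace(String line) {
        List<String> words = new ArrayList<String>();
        for (String t : line.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                words.add(t);
            }
        }
        return words;
    }

    // One possible refinement (ASCII only): lowercase the line, then split
    // on any run of characters that are not letters, digits, or apostrophes.
    static List<String> splitOnNonWords(String line) {
        List<String> words = new ArrayList<String>();
        for (String t : line.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!t.isEmpty()) {
                words.add(t);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "Hive is fun. FUN!";  // hypothetical input
        System.out.println(splitOnWhitespace(line));
        System.out.println(splitOnNonWords(line));
    }
}
```

With whitespace splitting, “fun.” and “FUN!” are counted as two different words; with the refinement, both become “fun”.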
The virtue of the Java API is the ability to customize and fine-tune every detail of an
algorithm implementation. However, most of the time, you just don’t need that level
of control and it slows you down considerably when you have to manage all those
details.
If you’re not a programmer, then writing Java MapReduce code is out of reach. However, if you already know SQL, learning Hive is relatively straightforward and many
applications are quick and easy to implement.

What’s Next
We described the important role that Hive plays in the Hadoop ecosystem. Now let’s
get started!

10. There is one other minor difference. The Hive query hardcodes a path to the data, while the Java code
takes the path as an argument. In Chapter 2, we’ll learn how to use Hive variables in scripts to avoid
hardcoding such details.

CHAPTER 2

Getting Started

Let’s install Hadoop and Hive on our personal workstation. This is a convenient way
to learn and experiment with Hadoop. Then we’ll discuss how to configure Hive for
use on Hadoop clusters.
If you already use Amazon Web Services, the fastest path to setting up Hive for learning
is to run a Hive-configured job flow on Amazon Elastic MapReduce (EMR). We discuss
this option in Chapter 21.
If you have access to a Hadoop cluster with Hive already installed, we encourage
you to skim the first part of this chapter and pick up again at “What Is Inside
Hive?” on page 22.

Installing a Preconfigured Virtual Machine
There are several ways you can install Hadoop and Hive. An easy way to install a complete Hadoop system, including Hive, is to download a preconfigured virtual machine (VM) that runs in VMWare1 or VirtualBox2. For VMWare, either VMWare
Player for Windows and Linux (free) or VMWare Fusion for Mac OS X (inexpensive)
can be used. VirtualBox is free for all these platforms, and also Solaris.
The virtual machines use Linux as the operating system, which is currently the only
recommended operating system for running Hadoop in production.3
Using a virtual machine is currently the only way to run Hadoop on
Windows systems, even when Cygwin or similar Unix-like software is
installed.

1. http://vmware.com.
2. https://www.virtualbox.org/.
3. However, some vendors are starting to support Hadoop on other systems. Hadoop has been used in
production on various Unix systems and it works fine on Mac OS X for development use.

Most of the preconfigured virtual machines (VMs) available are only designed for
VMWare, but if you prefer VirtualBox you may find instructions on the Web that
explain how to import a particular VM into VirtualBox.
You can download preconfigured virtual machines from one of the websites given in
Table 2-1.4 Follow the instructions on these websites for loading the VM into VMWare.
Table 2-1. Preconfigured Hadoop virtual machines for VMWare
Provider | URL | Notes
Cloudera, Inc. | https://ccp.cloudera.com/display/SUPPORT/Cloudera’s+Hadoop+Demo+VM | Uses Cloudera’s own distribution of Hadoop, CDH3 or CDH4.
MapR, Inc. | http://www.mapr.com/doc/display/MapR/Quick+Start+-+Test+Drive+MapR+on+a+Virtual+Machine | MapR’s Hadoop distribution, which replaces HDFS with the MapR Filesystem (MapR-FS).
Hortonworks, Inc. | http://docs.hortonworks.com/HDP-1.0.4-PREVIEW-6/Using_HDP_Single_Box_VM/HDP_Single_Box_VM.htm | Based on the latest, stable Apache releases.
Think Big Analytics, Inc. | http://thinkbigacademy.s3-website-us-east-1.amazonaws.com/vm/README.html | Based on the latest, stable Apache releases.

Next, go to “What Is Inside Hive?” on page 22.

Detailed Installation
While using a preconfigured virtual machine may be an easy way to run Hive, installing
Hadoop and Hive yourself will give you valuable insights into how these tools work,
especially if you are a developer.
The instructions that follow describe the minimum necessary Hadoop and Hive
installation steps for your personal Linux or Mac OS X workstation. For production
installations, consult the recommended installation procedures for your Hadoop
distributor.

Installing Java
Hive requires Hadoop and Hadoop requires Java. Ensure your system has a recent
v1.6.X or v1.7.X JVM (Java Virtual Machine). Although the JRE (Java Runtime Environment) is all you need to run Hive, you will need the full JDK (Java Development
Kit) to build examples in this book that demonstrate how to extend Hive with Java
code. However, if you are not a programmer, the companion source code distribution
for this book (see the Preface) contains prebuilt examples.

4. These are the current URLs at the time of this writing.

After the installation is complete, you’ll need to ensure that Java is in your path and
the JAVA_HOME environment variable is set.

Linux-specific Java steps
On Linux systems, the following instructions set up a bash file in the /etc/profile.d/
directory that defines JAVA_HOME for all users. Changing environmental settings in
this folder requires root access and affects all users of the system. (We’re using $ as the
bash shell prompt.) The Oracle JVM installer typically installs the software in /usr/java/jdk-1.6.X
(for v1.6) and it creates sym-links from /usr/java/default and /usr/java/latest
to the installation:
$ /usr/java/latest/bin/java -version
java version "1.6.0_23"
Java(TM) SE Runtime Environment (build 1.6.0_23-b05)
Java HotSpot(TM) 64-Bit Server VM (build 19.0-b09, mixed mode)
$ echo 'export JAVA_HOME=/usr/java/latest' | sudo tee /etc/profile.d/java.sh > /dev/null
$ echo 'export PATH=$PATH:$JAVA_HOME/bin' | sudo tee -a /etc/profile.d/java.sh > /dev/null
$ . /etc/profile
$ echo $JAVA_HOME
/usr/java/latest

If you’ve never used sudo (“super user do something”) before to run a
command as a “privileged” user, as in two of the commands, just type
your normal password when you’re asked for it. If you’re on a personal
machine, your user account probably has “sudo rights.” If not, ask your
administrator to run those commands.
However, if you don’t want to make permanent changes that affect all
users of the system, an alternative is to put the definitions shown for
PATH and JAVA_HOME in your $HOME/.bashrc file:
export JAVA_HOME=/usr/java/latest
export PATH=$PATH:$JAVA_HOME/bin

Mac OS X-specific Java steps
Mac OS X systems don’t have the /etc/profile.d directory and they are typically
single-user systems, so it’s best to put the environment variable definitions in your
$HOME/.bashrc. The Java paths are different, too, and they may be in one of several
places.5
Here are a few examples. You’ll need to determine where Java is installed on your Mac
and adjust the definitions accordingly. Here is a Java 1.6 example for Mac OS X:
$ export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
$ export PATH=$PATH:$JAVA_HOME/bin

5. At least that’s the current situation on Dean’s Mac. This discrepancy may actually reflect the fact that
stewardship of the Mac OS X Java port is transitioning from Apple to Oracle as of Java 1.7.

Here is a Java 1.7 example for Mac OS X:
$ export JAVA_HOME=/Library/Java/JavaVirtualMachines/1.7.0.jdk/Contents/Home
$ export PATH=$PATH:$JAVA_HOME/bin

OpenJDK 1.7 releases also install under /Library/Java/JavaVirtualMachines.

Installing Hadoop
Hive runs on top of Hadoop. Hadoop is an active open source project with many releases and branches. Also, many commercial software companies are now producing
their own distributions of Hadoop, sometimes with custom enhancements or replacements for some components. This situation promotes innovation, but also potential
confusion and compatibility issues.
Keeping software up to date lets you exploit the latest performance enhancements and
bug fixes. However, sometimes you introduce new bugs and compatibility issues. So,
for this book, we’ll show you how to install the Apache Hadoop release v0.20.2. This
release is not the most recent stable release, but it has been the reliable gold standard
for some time for performance and compatibility.
However, you should be able to choose a different version, distribution, or release
without problems for learning and using Hive, such as the Apache Hadoop v0.20.205
or 1.0.X releases, Cloudera CDH3 or CDH4, MapR M3 or M5, and the forthcoming
Hortonworks distribution. Note that the bundled Cloudera, MapR, and planned
Hortonworks distributions all include a Hive release.
However, we don’t recommend installing the new, alpha-quality, “Next Generation”
Hadoop v2.0 (also known as v0.23), at least for the purposes of this book. While this
release will bring significant enhancements to the Hadoop ecosystem, it is too new for
our purposes.
To install Hadoop on a Linux system, run the following commands. Note that we
wrapped the long line for the wget command:
$ cd ~   # or use another directory of your choice.
$ wget \
http://www.us.apache.org/dist/hadoop/common/hadoop-0.20.2/hadoop-0.20.2.tar.gz
$ tar -xzf hadoop-0.20.2.tar.gz
$ echo "export HADOOP_HOME=$PWD/hadoop-0.20.2" | sudo tee /etc/profile.d/hadoop.sh > /dev/null
$ echo 'export PATH=$PATH:$HADOOP_HOME/bin' | sudo tee -a /etc/profile.d/hadoop.sh > /dev/null
$ . /etc/profile

To install Hadoop on a Mac OS X system, run the following commands. Note that we
wrapped the long line for the curl command:
$ cd ~   # or use another directory of your choice.
$ curl -O \
http://www.us.apache.org/dist/hadoop/common/hadoop-0.20.2/hadoop-0.20.2.tar.gz
$ tar -xzf hadoop-0.20.2.tar.gz
$ echo "export HADOOP_HOME=$PWD/hadoop-0.20.2" >> $HOME/.bashrc
$ echo 'export PATH=$PATH:$HADOOP_HOME/bin' >> $HOME/.bashrc
$ . $HOME/.bashrc

In what follows, we will assume that you added $HADOOP_HOME/bin to your path, as in
the previous commands. This will allow you to simply type the hadoop command
without the path prefix.

Local Mode, Pseudodistributed Mode, and Distributed Mode
Before we proceed, let’s clarify the different runtime modes for Hadoop. We mentioned
above that the default mode is local mode, where filesystem references use the local
filesystem. Also in local mode, when Hadoop jobs are executed (including most Hive
queries), the Map and Reduce tasks are run as part of the same process.
Actual clusters are configured in distributed mode, where all filesystem references that
aren’t full URIs default to the distributed filesystem (usually HDFS) and jobs are managed by the JobTracker service, with individual tasks executed in separate processes.
A dilemma for developers working on personal machines is the fact that local mode
doesn’t closely resemble the behavior of a real cluster, which is important to remember
when testing applications. To address this need, a single machine can be configured to
run in pseudodistributed mode, where the behavior is identical to distributed mode,
namely filesystem references default to the distributed filesystem and jobs are managed
by the JobTracker service, but there is just a single machine. Hence, for example, HDFS
file block replication is limited to one copy. In other words, the behavior is like a single-node “cluster.” We’ll discuss these configuration options in “Configuring Your Hadoop Environment” on page 24.
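The phrase “filesystem references default to” can be pictured with ordinary URI resolution. The sketch below uses only java.net.URI, not Hadoop's own path handling, and covers just absolute paths and full URIs. The class name DefaultFsSketch and the NameNode address hdfs://localhost:9000/ are hypothetical values chosen for illustration.

```java
import java.net.URI;

// Sketch only: how a reference picks up the default filesystem.
class DefaultFsSketch {
    // A full URI (one with a scheme) is returned unchanged; an absolute
    // path inherits the scheme and authority of the default filesystem.
    static URI resolve(URI defaultFs, String reference) {
        return defaultFs.resolve(reference);
    }

    public static void main(String[] args) {
        URI local = URI.create("file:///");               // local mode
        URI hdfs = URI.create("hdfs://localhost:9000/");  // hypothetical NameNode
        System.out.println(resolve(local, "/tmp/wc-in"));
        System.out.println(resolve(hdfs, "/tmp/wc-in"));
        System.out.println(resolve(hdfs, "file:///tmp/wc-in"));
    }
}
```

The same reference, /tmp/wc-in, therefore names a local directory in local mode and a directory in the distributed filesystem otherwise, while a full URI always names the same place.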
Because Hive uses Hadoop jobs for most of its work, its behavior reflects the Hadoop
mode you’re using. However, even when running in distributed mode, Hive can decide
on a per-query basis whether or not it can perform the query using just local mode,
where it reads the data files and manages the MapReduce tasks itself, providing faster
turnaround. Hence, the distinction between the different modes is more of an
execution style for Hive than a deployment style, as it is for Hadoop.
For most of the book, it won’t matter which mode you’re using. We’ll assume you’re
working on a personal machine in local mode and we’ll discuss the cases where the
mode matters.
When working with small data sets, using local mode execution
will make Hive queries much faster. Running the command set
hive.exec.mode.local.auto=true; will cause Hive to use this mode more
aggressively, even when you are running Hadoop in distributed or pseudodistributed mode. To always use this setting, add the command to
your $HOME/.hiverc file (see “The .hiverc File” on page 36).

Testing Hadoop
Assuming you’re using local mode, let’s look at the local filesystem two different ways.
The following output of the Linux ls command shows the typical contents of the “root”
directory of a Linux system:
$ ls /
bin   cgroup  etc   lib    lost+found  mnt   opt   root  selinux  sys  user  var
boot  dev     home  lib64  media       null  proc  sbin  srv      tmp  usr

Hadoop provides a dfs tool that offers basic filesystem functionality like ls for the
default filesystem. Since we’re using local mode, the default filesystem is the local filesystem:6
$ hadoop dfs -ls /
Found 26 items
drwxrwxrwx   - root root      24576 2012-06-03 14:28 /tmp
drwxr-xr-x   - root root       4096 2012-01-25 22:43 /opt
drwx------   - root root      16384 2010-12-30 14:56 /lost+found
drwxr-xr-x   - root root          0 2012-05-11 16:44 /selinux
dr-xr-x---   - root root       4096 2012-05-23 22:32 /root
...

If instead you get an error message that hadoop isn’t found, either invoke the command
with the full path (e.g., $HOME/hadoop-0.20.2/bin/hadoop) or add the bin directory to
your PATH variable, as discussed in “Installing Hadoop” on page 18 above.
If you find yourself using the hadoop dfs command frequently, it’s
convenient to define an alias for it (e.g., alias hdfs="hadoop dfs").

Hadoop offers a framework for MapReduce. The Hadoop distribution contains an
implementation of the Word Count algorithm we discussed in Chapter 1. Let’s run it!
Start by creating an input directory (inside your current working directory) with files
to be processed by Hadoop:
$ mkdir wc-in
$ echo "bla bla" > wc-in/a.txt
$ echo "bla wa wa " > wc-in/b.txt

Use the hadoop command to launch the Word Count application on the input directory
we just created. Note that it’s conventional to always specify directories for input and
output, not individual files, since there will often be multiple input and/or output files
per directory, a consequence of the parallelism of the system.

6. Unfortunately, the dfs -ls command only provides a “long listing” format. There is no short format, like
the default for the Linux ls command.

If you are running these commands on your local installation that was configured to
use local mode, the hadoop command will launch the MapReduce components in the
same process. If you are running on a cluster or on a single machine using pseudodistributed mode, the hadoop command will launch one or more separate processes using
the JobTracker service (and the output below will be slightly different). Also, if you are
running with a different version of Hadoop, change the name of the examples.jar as
needed:
$ hadoop jar $HADOOP_HOME/hadoop-0.20.2-examples.jar wordcount wc-in wc-out
12/06/03 15:40:26 INFO input.FileInputFormat: Total input paths to process : 2
...
12/06/03 15:40:27 INFO mapred.JobClient: Running job: job_local_0001
12/06/03 15:40:30 INFO mapred.JobClient: map 100% reduce 0%
12/06/03 15:40:41 INFO mapred.JobClient: map 100% reduce 100%
12/06/03 15:40:41 INFO mapred.JobClient: Job complete: job_local_0001

The results of the Word count application can be viewed through local filesystem
commands:
$ ls wc-out/*
part-r-00000
$ cat wc-out/*
bla     3
wa      2

They can also be viewed by the equivalent dfs command (again, because we assume
you are running in local mode):
$ hadoop dfs -cat wc-out/*
bla     3
wa      2
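
As a cross-check on the counts above, the same tally can be reproduced with ordinary shell tools, independent of Hadoop. This is only a sketch: the wc_local function and the scratch directory are our own names, and splitting on whitespace is an assumption that happens to suit this tiny input.

```shell
# Recreate the two input files from the text in a scratch directory.
workdir=$(mktemp -d)
mkdir "$workdir/wc-in"
echo "bla bla" > "$workdir/wc-in/a.txt"
echo "bla wa wa " > "$workdir/wc-in/b.txt"

# wc_local (hypothetical helper): split every file in a directory on
# whitespace, drop empty tokens, and print "word<TAB>count" per distinct word.
wc_local() {
  cat "$1"/* | tr -s '[:space:]' '\n' | grep -v '^$' | sort | uniq -c |
    awk '{ print $2 "\t" $1 }'
}

wc_local "$workdir/wc-in"
```

For this input the helper reports bla with a count of 3 and wa with a count of 2, which is what the Word Count job should produce.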

For very big files, if you want to view just the first or last parts, there is
no -more, -head, nor -tail subcommand. Instead, just pipe the output
of the -cat command through the shell’s more, head, or tail. For example: hadoop dfs -cat wc-out/* | more.
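
If you pipe through head or tail often, small wrapper functions save typing. The following is a minimal sketch; dfs_head and dfs_tail are hypothetical names of our own, not Hadoop commands, and they assume hadoop is on your PATH.

```shell
# dfs_head / dfs_tail (hypothetical helpers): show the first or last N lines
# (default 10) of whatever "hadoop dfs -cat" prints for the given path.
dfs_head() {
  hadoop dfs -cat "$1" | head -n "${2:-10}"
}

dfs_tail() {
  hadoop dfs -cat "$1" | tail -n "${2:-10}"
}
```

For example, dfs_head 'wc-out/*' 1 would show the first line of the Word Count output. Quote the path so that Hadoop, not the shell, expands the wildcard. Note that dfs_tail still streams the whole file through the pipe before showing the end.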

Now that we have installed and tested an installation of Hadoop, we can install Hive.

Installing Hive
Installing Hive is similar to installing Hadoop. We will download and extract a tarball
for Hive, which does not include an embedded version of Hadoop. A single Hive binary
is designed to work with multiple versions of Hadoop. This means it’s often easier and
less risky to upgrade to newer Hive releases than it is to upgrade to newer Hadoop
releases.
Hive uses the environment variable HADOOP_HOME to locate the Hadoop JARs and configuration files. So, make sure you set that variable as discussed above before proceeding. The following commands work for both Linux and Mac OS X:
$ cd ~            # or use another directory of your choice.
$ curl -O http://archive.apache.org/dist/hive/hive-0.9.0/hive-0.9.0-bin.tar.gz
$ tar -xzf hive-0.9.0-bin.tar.gz
$ sudo mkdir -p /user/hive/warehouse
$ sudo chmod a+rwx /user/hive/warehouse

As you can infer from these commands, we are using the latest stable release of Hive
at the time of this writing, v0.9.0. However, most of the material in this book works
with Hive v0.7.X and v0.8.X. We’ll call out the differences as we come to them.
You’ll want to add the hive command to your path, like we did for the hadoop command.
We’ll follow the same approach, by first defining a HIVE_HOME variable, but unlike
HADOOP_HOME, this variable isn’t really essential. We’ll assume it’s defined for some examples later in the book.
For Linux, run these commands:
$ echo "export HIVE_HOME=$PWD/hive-0.9.0-bin" | sudo tee /etc/profile.d/hive.sh
$ echo 'export PATH=$PATH:$HIVE_HOME/bin' | sudo tee -a /etc/profile.d/hive.sh
$ . /etc/profile

For Mac OS X, run these commands:
$ echo "export HIVE_HOME=$PWD/hive-0.9.0-bin" >> $HOME/.bashrc
$ echo 'export PATH=$PATH:$HIVE_HOME/bin' >> $HOME/.bashrc
$ . $HOME/.bashrc
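
After reloading your shell configuration, it is worth confirming that both commands actually resolve. A small sketch follows; check_cmd is a hypothetical helper of our own, built on the standard command -v shell builtin.

```shell
# check_cmd (hypothetical helper): report where a command resolves on PATH,
# or say that it is missing and return a nonzero status.
check_cmd() {
  if cmd_path=$(command -v "$1" 2>/dev/null); then
    echo "$1 -> $cmd_path"
  else
    echo "$1 not found on PATH"
    return 1
  fi
}
```

Running check_cmd hadoop and then check_cmd hive should print the full path of each script. If either is reported missing, revisit the PATH settings above.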

What Is Inside Hive?
The core of a Hive binary distribution contains three parts. The main part is the Java
code itself. Multiple JAR (Java archive) files such as hive-exec*.jar and
hive-metastore*.jar are found under the $HIVE_HOME/lib directory. Each JAR file implements
a particular subset of Hive’s functionality, but the details don’t concern us now.
The $HIVE_HOME/bin directory contains executable scripts that launch various Hive
services, including the hive command-line interface (CLI). The CLI is the most popular
way to use Hive. We will use hive (in lowercase, with a fixed-width font) to refer to the
CLI, except where noted. The CLI can be used interactively to type in statements one
at a time or it can be used to run “scripts” of Hive statements, as we’ll see.
Hive also has other components. A Thrift service provides remote access from other
processes. Access using JDBC and ODBC is provided, too; both are implemented on
top of the Thrift service. We’ll describe these features in later chapters.
All Hive installations require a metastore service, which Hive uses to store table schemas
and other metadata. It is typically implemented using tables in a relational database.
By default, Hive uses a built-in Derby SQL server, which provides limited, single-process
storage. For example, when using Derby, you can’t run two simultaneous instances of the
Hive CLI. However, this is fine for learning Hive on a personal machine and for some
developer tasks. For clusters, MySQL or a similar relational database is
required. We will discuss the details in “Metastore Using JDBC” on page 28.
Finally, a simple web interface, called Hive Web Interface (HWI), provides remote
access to Hive.
The conf directory contains the files that configure Hive. Hive has a number of configuration
properties that we will discuss as needed. These properties control features
such as the metastore (where metadata is stored), various optimizations, and “safety
controls.”

Starting Hive
Let’s finally start the Hive command-line interface (CLI) and run a few commands!
We’ll briefly comment on what’s happening, but save the details for discussion later.
In the following session, we’ll use the $HIVE_HOME/bin/hive command, which is a
bash shell script, to start the CLI. Substitute the directory where Hive is installed on
your system whenever $HIVE_HOME is listed in the following script. Or, if you added
$HIVE_HOME/bin to your PATH, you can just type hive to run the command. We’ll make
that assumption for the rest of the book.
As before, $ is the bash prompt. In the Hive CLI, the hive> string is the hive prompt,
and the indented > is the secondary prompt. Here is a sample session, where we have
added a blank line after the output of each command, for clarity:
$ cd $HIVE_HOME
$ bin/hive
Hive history file=/tmp/myname/hive_job_log_myname_201201271126_1992326118.txt
hive> CREATE TABLE x (a INT);
OK
Time taken: 3.543 seconds

hive> SELECT * FROM x;
OK
Time taken: 0.231 seconds

hive> SELECT *
    > FROM x;
OK
Time taken: 0.072 seconds

hive> DROP TABLE x;
OK
Time taken: 0.834 seconds

hive> exit;
$

The first line printed by the CLI is the local filesystem location where the CLI writes
log data about the commands and queries you execute. If a command or query is
successful, the first line of output will be OK, followed by the output, and finished by
the line showing the amount of time taken to run the command or query.
Throughout the book, we will follow the SQL convention of showing
Hive keywords in uppercase (e.g., CREATE, TABLE, SELECT, and FROM), even
though Hive, like SQL, ignores their case.

Going forward, we’ll usually add the blank line after the command output for all sessions.
Also, when starting a session, we’ll omit the line
about the logfile. For individual commands and queries, we’ll omit the
OK and Time taken:... lines, too, except in special cases, such as when
we want to emphasize that a command or query was successful, but it
had no other output.

At the successive prompts, we create a simple table named x with a single INT (4-byte
integer) column named a, then query it twice, the second time showing how queries
and commands can spread across multiple lines. Finally, we drop the table.
If you are running with the default Derby database for the metastore, you’ll notice that
your current working directory now contains a new subdirectory called metastore_db
that was created by Derby during the short hive session you just executed. If you are
running one of the VMs, it’s possible it has configured different behavior, as we’ll discuss later.
Creating a metastore_db subdirectory under whatever working directory you happen
to be in is not convenient, as Derby “forgets” about previous metastores when you
change to a new working directory! In the next section, we’ll see how to configure a
permanent location for the metastore database, as well as make other changes.
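
If you have already started hive from several directories, a quick search shows where Derby left metastore_db directories behind. This is a sketch using the standard find utility; find_metastores is a hypothetical helper name of our own.

```shell
# find_metastores (hypothetical helper): list every directory named
# metastore_db beneath a starting directory (default: the current one),
# without descending into the ones it finds.
find_metastores() {
  find "${1:-.}" -type d -name metastore_db -prune -print
}
```

For example, find_metastores "$HOME" lists every stray copy under your home directory. Each one is a separate metastore that knows nothing about tables created through the others.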

Configuring Your Hadoop Environment
Let’s dive a little deeper into the different Hadoop modes and discuss more configuration issues relevant to Hive.
You can skip this section if you’re using Hadoop on an existing cluster or you are using
a virtual machine instance. If you are a developer or you installed Hadoop and Hive
yourself, you’ll want to understand the rest of this section. However, we won’t provide
a complete discussion. See Appendix A of Hadoop: The Definitive Guide by Tom White
for the full details on configuring the different modes.

Local Mode Configuration
Recall that in local mode, all references to files go to your local filesystem, not the
distributed filesystem. There are no services running. Instead, your jobs run all tasks
in a single JVM instance.

Figure 2-1 illustrates a Hadoop job running in local mode.

Figure 2-1. Hadoop in local mode

If you plan to use the local mode regularly, it’s worth configuring a standard location
for the Derby metastore_db, where Hive stores metadata about your tables, etc.
You can also configure a different directory for Hive to store table data, if you don’t
want to use the default location, which is file:///user/hive/warehouse, for local mode,
and hdfs://namenode_server/user/hive/warehouse for the other modes discussed next.
First, go to the $HIVE_HOME/conf directory. The curious may want to peek at the
large hive-default.xml.template file, which shows the different configuration properties
supported by Hive and their default values. Most of these properties you can safely
ignore. Changes to your configuration are done by editing the hive-site.xml file. Create
one if it doesn’t already exist.
Here is an example configuration file where we set several properties for local mode
execution (Example 2-1).
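Example 2-1. Local-mode hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/home/me/hive/warehouse</value>
    <description>
      Local or HDFS directory where Hive keeps table contents.
    </description>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
    <description>
      Use false if a production metastore server is used.
    </description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/home/me/hive/metastore_db;create=true</value>
    <description>
      The JDBC connection URL.
    </description>
  </property>
</configuration>
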
You can remove any of these <property>...</property> tags you don’t want to change.
As the <description> tags indicate, the hive.metastore.warehouse.dir property tells Hive where
in your local filesystem to keep the data contents for Hive’s tables. (This value is appended
to the value of fs.default.name defined in the Hadoop configuration, which defaults to
file:///.) You can use any directory path you want for the value. Note that this
directory will not be used to store the table metadata, which goes in the separate
metastore.
The hive.metastore.local property defaults to true, so we don’t really need to show
it in Example 2-1. It’s there more for documentation purposes. This property controls
whether to connect to a remote metastore server or open a new metastore server as part
of the Hive Client JVM. This setting is almost always set to true and JDBC is used to
communicate directly to a relational database. When it is set to false, Hive will
communicate through a metastore server, which we’ll discuss in “Metastore Methods” on page 216.
The value for the javax.jdo.option.ConnectionURL property makes one small but convenient change to the default value for this property. This property tells Hive how to
connect to the metastore server. By default, it uses the current working directory for
the databaseName part of the value string. As shown in Example 2-1, we use
databaseName=/home/me/hive/metastore_db as the absolute path instead, which is the location
where the metastore_db directory will always be located. This change eliminates the
problem of Hive creating a new metastore_db directory in the current working directory
every time we start a Hive session from a different location. Now, we’ll always have access
to all our metadata, no matter what directory we are working in.
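
Rather than hard-coding /home/me, you may prefer to generate the file from a base directory of your choice. The following sketch writes a minimal local-mode hive-site.xml; write_local_site is a hypothetical helper of our own, and it sets only the two path-related properties discussed above, leaving hive.metastore.local at its default.

```shell
# write_local_site (hypothetical helper): create BASE/warehouse and write a
# minimal hive-site.xml into CONF_DIR that keeps table data and the Derby
# metastore under BASE.
# Usage: write_local_site BASE CONF_DIR
write_local_site() {
  base=$1
  conf_dir=$2
  mkdir -p "$base/warehouse" "$conf_dir"
  cat > "$conf_dir/hive-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>$base/warehouse</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=$base/metastore_db;create=true</value>
  </property>
</configuration>
EOF
}
```

For example, write_local_site "$HOME/hive" "$HIVE_HOME/conf" would place both locations under your home directory. Be aware that this sketch overwrites any existing hive-site.xml in the target directory.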

Distributed and Pseudodistributed Mode Configuration
In distributed mode, several services run in the cluster. The JobTracker manages jobs
and the NameNode is the HDFS master. Worker nodes run individual job tasks, managed by a
TaskTracker service on each node, and they also hold blocks for files in the
distributed filesystem, managed by DataNode services.
Figure 2-2 shows a typical distributed mode configuration for a Hadoop cluster.

Figure 2-2. Hadoop in distributed mode

By convention, we use *.domain.pvt as the DNS naming scheme for the
cluster’s private, internal network.
Pseudodistributed mode is nearly identical; it’s effectively a one-node cluster.
We’ll assume that your administrator has already configured Hadoop, including your
distributed filesystem (e.g., HDFS, or see Appendix A of Hadoop: The Definitive
Guide by Tom White). Here, we’ll focus on the unique configuration steps required by
Hive.
One Hive property you might want to configure is the top-level directory for table
storage, which is specified by the property hive.metastore.warehouse.dir, which we
also discussed in “Local Mode Configuration” on page 24.
The default value for this property is /user/hive/warehouse in the Apache Hadoop and
MapR distributions, which will be interpreted as a distributed filesystem path when
Hadoop is configured for distributed or pseudodistributed mode. For Amazon Elastic
MapReduce (EMR), the default value is /mnt/hive_0M_N/warehouse when using Hive
v0.M.N (e.g., /mnt/hive_08_1/warehouse).
Specifying a different value here allows each user to define their own warehouse directory, so they don’t affect other system users. Hence, each user might use the following
statement to define their own warehouse directory:
set hive.metastore.warehouse.dir=/user/myname/hive/warehouse;

It’s tedious to type this each time you start the Hive CLI or to remember to add it to
every Hive script. Of course, it’s also easy to forget to define this property. Instead, it’s
best to put commands like this in the $HOME/.hiverc file, which will be processed
when Hive starts. See “The .hiverc File” on page 36 for more details.
We’ll assume the value is /user/hive/warehouse from here on.
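
One way to add such a setting without creating duplicates on repeated runs is sketched below. The add_line_once helper is hypothetical, and the example deliberately writes to a scratch file; substitute "$HOME/.hiverc" once you are happy with the result.

```shell
# add_line_once (hypothetical helper): append a line to a file unless an
# identical line is already there, so rerunning the setup is harmless.
add_line_once() {
  touch "$1"
  grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Assumption: a scratch file stands in for $HOME/.hiverc while experimenting.
hiverc=$(mktemp)
add_line_once "$hiverc" 'set hive.metastore.warehouse.dir=/user/myname/hive/warehouse;'
add_line_once "$hiverc" 'set hive.metastore.warehouse.dir=/user/myname/hive/warehouse;'
cat "$hiverc"
```

Although the helper is called twice, the file ends up with a single copy of the set command.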

Metastore Using JDBC
Hive requires only one extra component that Hadoop does not already have: the
metastore component. The metastore stores metadata such as table schema and partition
information that you specify when you run commands such as create table
x..., or alter table y..., etc. Because multiple users and systems are likely to need
concurrent access to the metastore, the default embedded database is not suitable for
production.
If you are using a single node in pseudodistributed mode, you may not
find it useful to set up a full relational database for the metastore. Rather,
you may wish to continue using the default Derby store, but configure
it to use a central location for its data, as described in “Local Mode
Configuration” on page 24.

Any JDBC-compliant database can be used for the metastore. In practice, most installations of Hive use MySQL. We’ll discuss how to use MySQL. It is straightforward to
adapt this information to other JDBC-compliant databases.
The information required for table schema, partition information, etc.,
is small, typically much smaller than the large quantity of data stored in
Hive. As a result, you typically don’t need a powerful dedicated database
server for the metastore. However, because it represents a Single Point
of Failure (SPOF), it is strongly recommended that you replicate and
back up this database using the standard techniques you would normally use with other
relational database instances. We won’t discuss
those techniques here.

For our MySQL configuration, we need to know the host and port the service is running
on. We will assume db1.mydomain.pvt and port 3306, which is the standard MySQL
port. Finally, we will assume that hive_db is the name of our catalog. We define these
properties in Example 2-2.
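Example 2-2. Metastore database configuration in hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://db1.mydomain.pvt/hive_db?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>database_user</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>database_pass</value>
  </property>
</configuration>
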

You may have noticed the ConnectionURL property starts with a prefix of jdbc:mysql.
For Hive to be able to connect to MySQL, we need to place the JDBC driver in our
classpath. Download the MySQL JDBC driver (Connector/J) from
http://www.mysql.com/downloads/connector/j/. The driver can be placed in the Hive library path,
$HIVE_HOME/lib. Some teams put all such support libraries in their Hadoop lib
directory.
With the driver and the configuration settings in place, Hive will store its metastore
information in MySQL.
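
Before starting Hive against MySQL, it can save some debugging to confirm that the driver JAR is where Hive will look for it. The sketch below is our own; has_mysql_driver is a hypothetical helper, and the file-name pattern mysql-connector-java*.jar is an assumption about how the downloaded archive is named, so adjust it if yours differs.

```shell
# has_mysql_driver (hypothetical helper): succeed if the given directory
# contains a JAR whose name starts with mysql-connector-java.
# The name pattern is an assumption; adjust it to match your download.
has_mysql_driver() {
  for jar in "$1"/mysql-connector-java*.jar; do
    [ -e "$jar" ] && return 0
  done
  return 1
}
```

For example, has_mysql_driver "$HIVE_HOME/lib" || echo "MySQL driver not found" prints a warning when the JAR is missing from the Hive library path.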

The Hive Command
The $HIVE_HOME/bin/hive shell command, which we’ll simply refer to as hive from now
on, is the gateway to Hive services, including the command-line interface or CLI.
We’ll also assume that you have added $HIVE_HOME/bin to your environment’s PATH so
you can type hive at the shell prompt and your shell environment (e.g., bash) will find
the command.

Command Options
If you run the following command, you’ll see a brief list of the options for the hive
command. Here is the output for Hive v0.8.X and v0.9.X:
$ bin/hive --help
Usage ./hive <parameters> --service serviceName <service parameters>
Service List: cli help hiveserver hwi jar lineage metastore rcfilecat
Parameters parsed:
  --auxpath : Auxiliary jars
  --config : Hive configuration directory
  --service : Starts specific service/component. cli is default
Parameters used:
  HADOOP_HOME : Hadoop install directory
  HIVE_OPT : Hive options
For help on a particular service:
  ./hive --service serviceName --help
Debug help: ./hive --debug --help

Note the Service List. There are several services available, including the CLI that we
will spend most of our time using. You can invoke a service using the --service name
option, although there are shorthand invocations for some of the services, as well.
Table 2-2 describes the most useful services.
Table 2-2. Hive services

Option      Name                    Description
----------  ----------------------  -----------------------------------------------------
cli         Command-line interface  Used to define tables, run queries, etc. It is the
                                    default service if no other service is specified.
                                    See “The Command-Line Interface” on page 30.
hiveserver  Hive Server             A daemon that listens for Thrift connections from
                                    other processes. See Chapter 16 for more details.
hwi         Hive Web Interface      A simple web interface for running queries and other
                                    commands without logging into a cluster machine and
                                    using the CLI.
jar                                 An extension of the hadoop jar command for running an
                                    application that also requires the Hive environment.
metastore                           Start an external Hive metastore service to support
                                    multiple clients (see also “Metastore Using JDBC” on
                                    page 28).
rcfilecat                           A tool for printing the contents of an RCFile (see
                                    “RCFile” on page 202).

The --auxpath option lets you specify a colon-separated list of “auxiliary” Java archive
(JAR) files that contain custom extensions, etc., that you might require.
The --config directory is mostly useful if you have to override the default configuration
properties in $HIVE_HOME/conf in a new directory.
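
When a directory holds several extension JARs, the colon-separated list for --auxpath can be assembled by the shell. This is a sketch; jar_list is a hypothetical helper of our own, and it assumes the JAR paths contain no colons.

```shell
# jar_list (hypothetical helper): join the paths of all *.jar files in a
# directory with colons, the form described above for --auxpath.
jar_list() {
  list=
  for jar in "$1"/*.jar; do
    [ -e "$jar" ] || continue      # skip the unexpanded pattern if no JARs exist
    list=${list:+$list:}$jar
  done
  printf '%s\n' "$list"
}
```

You might then start the CLI with hive --auxpath "$(jar_list /path/to/extensions)", where /path/to/extensions is a placeholder for your own directory.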

The Command-Line Interface
The command-line interface or CLI is the most common way to interact with Hive.
Using the CLI, you can create tables, inspect schema and query tables, etc.

CLI Options
The following command shows a brief list of the options for the CLI. Here we show
the output for Hive v0.8.X and v0.9.X:
$ hive --help --service cli
usage: hive
 -d,--define <key=value>          Variable substitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
 -h <hostname>                    connecting to Hive Server on remote host
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable substitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -p <port>                        connecting to Hive Server on port number
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)

A shorter version of this command is hive -h. Technically, that is not a supported way to
ask for help, because -h expects a hostname argument, but it produces the help output with
an additional line that complains about Missing argument for option: h.
For Hive v0.7.X, the -d, --hivevar, and -p options are not supported.
Let’s explore these options in more detail.

Variables and Properties
The --define key=value option is effectively equivalent to the --hivevar key=value
option. Both let you define on the command line custom variables that you can reference in Hive scripts to customize execution. This feature is only supported in Hive
v0.8.0 and later versions.
When you use this feature, Hive puts the key-value pair in the hivevar “namespace” to
distinguish these definitions from three other built-in namespaces, hiveconf, system,
and env.
The terms variable or property are used in different contexts, but they
function the same way in most cases.

The namespace options are described in Table 2-3.

Table 2-3. Hive namespaces for variables and properties

Namespace  Access      Description
---------  ----------  ---------------------------------------------------------
hivevar    Read/Write  (v0.8.0 and later) User-defined custom variables.
hiveconf   Read/Write  Hive-specific configuration properties.
system     Read/Write  Configuration properties defined by Java.
env        Read only   Environment variables defined by the shell environment
                       (e.g., bash).

Hive’s variables are internally stored as Java Strings. You can reference variables in
queries; Hive replaces the reference with the variable’s value before sending the query
to the query processor.
Inside the CLI, variables are displayed and changed using the SET command. For example, the following session shows the value for one variable, in the env namespace,
and then all variable definitions! Here is a Hive session where some output has been
omitted and we have added a blank line after the output of each command for clarity:
$ hive
hive> set env:HOME;
env:HOME=/home/thisuser

hive> set;
... lots of output including these variables:
hive.stats.retries.wait=3000
env:TERM=xterm
system:user.timezone=America/New_York
...

hive> set -v;
... even more output!...

Without the -v flag, set prints all the variables in the namespaces hivevar, hiveconf,
system, and env. With the -v option, it also prints all the properties defined by Hadoop,
such as properties controlling HDFS and MapReduce.
The set command is also used to set new values for variables. Let’s look specifically at
the hivevar namespace and a variable that is defined for it on the command line:
$ hive --define foo=bar
hive> set foo;
foo=bar

hive> set hivevar:foo;
hivevar:foo=bar

hive> set hivevar:foo=bar2;

hive> set foo;
foo=bar2

hive> set hivevar:foo;
hivevar:foo=bar2

As we can see, the hivevar: prefix is optional. The --hivevar flag is the same as the
--define flag.
Variable references in queries are replaced in the CLI before the query is sent to the
query processor. Consider the following hive CLI session (v0.8.X only):
hive> create table toss1(i int, ${hivevar:foo} string);

hive> describe toss1;
i       int
bar2    string

hive> create table toss2(i2 int, ${foo} string);

hive> describe toss2;
i2      int
bar2    string

hive> drop table toss1;
hive> drop table toss2;
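
Because the substitution is purely textual, its effect can be imitated outside Hive. The sketch below uses sed to do so; it is not how Hive implements the feature. The subst_hivevar helper is hypothetical, it handles a single variable at a time, and it assumes the value contains no characters that are special to sed, such as / or &.

```shell
# subst_hivevar (hypothetical helper): replace ${hivevar:NAME} and ${NAME}
# in the text on standard input with VALUE, imitating textual substitution.
# Usage: subst_hivevar NAME VALUE
subst_hivevar() {
  sed -e "s/\${hivevar:$1}/$2/g" -e "s/\${$1}/$2/g"
}

printf '%s\n' 'create table toss1(i int, ${hivevar:foo} string);' \
              'create table toss2(i2 int, ${foo} string);' |
  subst_hivevar foo bar2
```

Both statements come out with bar2 as the second column name, which matches what describe reported for the two tables above.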

Let’s look at the --hiveconf option, which is supported in Hive v0.7.X. It is used for
all properties that configure Hive behavior. We’ll use it with a property
hive.cli.print.current.db that was added in Hive v0.8.0. It turns on printing of the
current working database name in the CLI prompt. (See “Databases in
Hive” on page 49 for more on Hive databases.) The default database is named
default. This property is false by default:
$ hive --hiveconf hive.cli.print.current.db=true
hive (default)> set hive.cli.print.current.db;
hive.cli.print.current.db=true
hive (default)> set hiveconf:hive.cli.print.current.db;
hiveconf:hive.cli.print.current.db=true
hive (default)> set hiveconf:hive.cli.print.current.db=false;
hive> set hiveconf:hive.cli.print.current.db=true;
hive (default)> ...

We can even add new hiveconf entries, which is the only supported option for Hive
versions earlier than v0.8.0:
$ hive --hiveconf y=5
hive> set y;
y=5

hive> CREATE TABLE whatsit(i int);

hive> ... load data into whatsit ...

hive> SELECT * FROM whatsit WHERE i = ${hiveconf:y};
...

It’s also useful to know about the system namespace, which provides read-write access
to Java system properties, and the env namespace, which provides read-only access to
environment variables:
hive> set system:user.name;
system:user.name=myusername

hive> set system:user.name=yourusername;

hive> set system:user.name;
system:user.name=yourusername

hive> set env:HOME;
env:HOME=/home/yourusername

hive> set env:HOME=/tmp;
env:* variables can not be set.

Unlike hivevar variables, you have to use the system: or env: prefix with system properties and environment variables.
The env namespace is useful as an alternative way to pass variable definitions to Hive,
especially for Hive v0.7.X. Consider the following example:
$ YEAR=2012 hive -e 'SELECT * FROM mytable WHERE year = ${env:YEAR}';

The query processor will see the literal number 2012 in the WHERE clause.
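
One caveat when typing ${env:...} references on a bash command line: the choice of quotes decides whether Hive ever sees the reference. The sketch below demonstrates this without running Hive; show_query is a hypothetical stand-in for hive -e that simply prints the string it receives.

```shell
# show_query (hypothetical stand-in for "hive -e"): print the query string
# exactly as the invoked program would receive it.
show_query() { printf '%s\n' "$1"; }

# Single quotes: the shell passes ${env:YEAR} through untouched, leaving the
# substitution to Hive.
show_query 'SELECT * FROM mytable WHERE year = ${env:YEAR}'

# Double quotes, run in an explicit bash: bash treats ${env:YEAR} as a
# substring expansion of a shell variable named env. That variable is unset
# here, so the reference disappears before the program is even started.
bash -c 'unset env; printf "%s\n" "SELECT * FROM mytable WHERE year = ${env:YEAR}"'
```

The first command prints the query with the reference intact, while the second prints a query whose WHERE clause ends after the equals sign. Single quotes are therefore the safer choice for one-shot commands that contain variable references.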
If you are using Hive v0.7.X, some of the examples in this book that use
parameters and variables may not work as written. If so, replace the
variable reference with the corresponding value.

All of Hive’s built-in properties are listed in $HIVE_HOME/conf/hive-default.xml.template,
the “sample” configuration file. It also shows the
default values for each property.

Hive “One Shot” Commands
You may wish to run one or more queries (semicolon separated) and then have
the hive CLI exit immediately after completion. The CLI accepts a -e command argument
that enables this feature. If mytable has a string column and an integer column, we might
see the following output:
$ hive -e "SELECT * FROM mytable LIMIT 3";
OK
name1 10
name2 20
name3 30
Time taken: 4.955 seconds
$

A quick and dirty technique is to use this feature to output the query results to a file.
Adding the -S for silent mode removes the OK and Time taken ... lines, as well as other
inessential output, as in this example:
$ hive -S -e "select * FROM mytable LIMIT 3" > /tmp/myquery
$ cat /tmp/myquery
name1 10
name2 20
name3 30

Note that hive wrote the output to the standard output and the shell command redirected that output to the local filesystem, not to HDFS.
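
Building on the same technique, a small wrapper can turn the tab-separated output into comma-separated lines for a spreadsheet. This is a naive sketch: hive_to_csv is a hypothetical helper of our own, and it does not quote fields that themselves contain commas or tabs.

```shell
# hive_to_csv (hypothetical helper): run one query in silent mode and convert
# the tab-separated columns of its output to comma-separated ones.
# Naive: fields containing commas are not quoted.
hive_to_csv() {
  hive -S -e "$1" | tr '\t' ','
}
```

For example, hive_to_csv 'SELECT * FROM mytable LIMIT 3' > /tmp/myquery.csv writes the first three rows to a local file.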
Finally, here is a useful trick for finding a property name that you can’t quite remember,
without having to scroll through the list of the set output. Suppose you can’t remember
the name of the property that specifies the “warehouse” location for managed tables:
$ hive -S -e "set" | grep warehouse
hive.metastore.warehouse.dir=/user/hive/warehouse
hive.warehouse.subdir.inherit.perms=false

It’s the first one.

Executing Hive Queries from Files
Hive can execute one or more queries that were saved to a file using the -f file argument. By convention, saved Hive query files use the .q or .hql extension.
$ hive -f /path/to/file/withqueries.hql

If you are already inside the Hive shell you can use the SOURCE command to execute a
script file. Here is an example:
$ cat /path/to/file/withqueries.hql
SELECT x.* FROM src x;
$ hive
hive> source /path/to/file/withqueries.hql;
...

By the way, we’ll occasionally use the name src (“source”) for tables in queries when
the name of the table is irrelevant for the example. This convention is taken from the
unit tests in Hive’s source code, which create a src table before running any tests.
For example, when experimenting with a built-in function, it’s convenient to write a
“query” that passes literal arguments to the function, as in the following example taken
from later in the book, “XPath-Related Functions” on page 207:
hive> SELECT xpath('<a><b id="foo">b1</b><b id="bar">b2</b></a>','//@id')
    > FROM src LIMIT 1;
["foo","bar"]

The details for xpath don’t concern us here, but note that we pass string literals to the
xpath function and use FROM src LIMIT 1 to specify the required FROM clause and to limit
the output. Substitute src with the name of a table you have already created or create
a dummy table named src:
CREATE TABLE src(s STRING);

Also the source table must have at least one row of content in it:
$ echo "one row" > /tmp/myfile
$ hive -e "LOAD DATA LOCAL INPATH '/tmp/myfile' INTO TABLE src;"

The .hiverc File
The last CLI option we’ll discuss is the -i file option, which lets you specify a file of
commands for the CLI to run as it starts, before showing you the prompt. Hive automatically looks for a file named .hiverc in your HOME directory and runs the commands
it contains, if any.
These files are convenient for commands that you run frequently, such as setting
system properties (see “Variables and Properties” on page 31) or adding Java archives
(JAR files) of custom Hive extensions to Hadoop’s distributed cache (as discussed in
Chapter 15).
The following shows an example of a typical $HOME/.hiverc file:
ADD JAR /path/to/custom_hive_extensions.jar;
set hive.cli.print.current.db=true;
set hive.exec.mode.local.auto=true;

The first line adds a JAR file to the Hadoop distributed cache. The second line modifies
the CLI prompt to show the current working Hive database, as we described earlier in
“Variables and Properties” on page 31. The last line “encourages” Hive to be more
aggressive about using local-mode execution when possible, even when Hadoop is
running in distributed or pseudo-distributed mode, which speeds up queries for small
data sets.
An easy mistake to make is to forget the semicolon at the end of lines
like this. When you make this mistake, the definition of the property
will include all the text from all the subsequent lines in the file until the
next semicolon.
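
A quick check can catch this mistake before Hive does. The sketch below lists the lines of a file that do not end in a semicolon; missing_semicolons is a hypothetical helper of our own, and it assumes one statement per line, so statements deliberately spread over several lines would be reported as well.

```shell
# missing_semicolons (hypothetical helper): print, with line numbers, every
# line that is not blank, is not a "--" comment, and does not end in a semicolon.
missing_semicolons() {
  grep -n -v -e '^[[:space:]]*$' -e '^[[:space:]]*--' -e ';[[:space:]]*$' "$1"
}
```

Running missing_semicolons "$HOME/.hiverc" prints nothing when every statement is terminated. Otherwise, each offending line appears prefixed with its line number.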

More on Using the Hive CLI
The CLI supports a number of other useful features.

Autocomplete
If you start typing and hit the Tab key, the CLI will autocomplete possible keywords
and function names. For example, if you type SELE and then the Tab key, the CLI will
complete the word SELECT.
If you type the Tab key at the prompt, you’ll get this reply:
hive>
Display all 407 possibilities? (y or n)

If you enter y, you’ll get a long list of all the keywords and built-in functions.
A common source of error and confusion when pasting statements into
the CLI occurs where some lines begin with a tab. You’ll get the prompt
about displaying all possibilities, and subsequent characters in the
stream will get misinterpreted as answers to the prompt, causing the
command to fail.

Command History
You can use the up and down arrow keys to scroll through previous commands. Actually, each
previous line of input is shown separately; the CLI does not combine multiline commands and
queries into a single history entry. Hive saves the last 10,000 lines
into a file $HOME/.hivehistory.
If you want to repeat a previous command, scroll to it and hit Enter. If you want to edit
the line before entering it, use the left and right arrow keys to navigate to the point
where changes are required and edit the line. You can hit Return to submit it without
returning to the end of the line.
Most navigation keystrokes using the Control key work as they do for
the bash shell (e.g., Control-A goes to the beginning of the line and
Control-E goes to the end of the line). However, similar “meta,” Option,
or Escape keys don’t work (e.g., Option-F to move forward a word at a
time). Similarly, the Delete key will delete the character to the left of the
cursor, but the Forward Delete key doesn’t delete the character under
the cursor.

Shell Execution
You don’t need to leave the hive CLI to run simple bash shell commands. Simply
type ! followed by the command and terminate the line with a semicolon (;):
hive> ! /bin/echo "what up dog";
"what up dog"
hive> ! pwd;
/home/me/hiveplay


Don’t invoke interactive commands that require user input. Shell “pipes” don’t work
and neither do file “globs.” For example, ! ls *.hql; will look for a file named *.hql;,
rather than all files that end with the .hql extension.

Hadoop dfs Commands from Inside Hive
You can run the hadoop dfs ... commands from within the hive CLI; just drop the
hadoop word from the command and add the semicolon at the end:
hive> dfs -ls / ;
Found 3 items
drwxr-xr-x   - root   supergroup          0 2011-08-17 16:27 /etl
drwxr-xr-x   - edward supergroup          0 2012-01-18 15:51 /flag
drwxrwxr-x   - hadoop supergroup          0 2010-02-03 17:50 /users

This method of accessing hadoop commands is actually more efficient than using the
hadoop dfs ... equivalent at the bash shell, because the latter starts up a new JVM
instance each time, whereas Hive just runs the same code in its current process.
You can see a full listing of help on the options supported by dfs using this command:
hive> dfs -help;

See also http://hadoop.apache.org/common/docs/r0.20.205.0/file_system_shell.html or
similar documentation for your Hadoop distribution.

Comments in Hive Scripts
As of Hive v0.8.0, you can embed lines of comments that start with the string --, for
example:
-- Copyright (c) 2012 Megacorp, LLC.
-- This is the best Hive script evar!!
SELECT * FROM massive_table;
...

The CLI does not parse these comment lines. If you paste them into the
CLI, you’ll get errors. They only work when used in scripts executed
with hive -f script_name.

Query Column Headers
As a final example that pulls together a few things we’ve learned, let’s tell the CLI to
print column headers, which is disabled by default. We can enable this feature by setting
the hiveconf property hive.cli.print.header to true:


hive> set hive.cli.print.header=true;
hive> SELECT * FROM system_logs LIMIT 3;
tstamp             severity  server   message
1335667117.337715  ERROR     server1  Hard drive hd1 is 90% full!
1335667117.338012  WARN      server1  Slow response from server2.
1335667117.339234  WARN      server2  Uh, Dude, I'm kinda busy right now...

If you always prefer seeing the headers, put the first line in your $HOME/.hiverc file.


CHAPTER 3

Data Types and File Formats

Hive supports many of the primitive data types you find in relational databases, as well
as three collection data types that are rarely found in relational databases, for reasons
we’ll discuss shortly.
A related concern is how these types are represented in text files, as well as alternatives
to text storage that address various performance and other concerns. A unique feature
of Hive, compared to most databases, is that it provides great flexibility in how data is
encoded in files. Most databases take total control of the data, both how it is persisted
to disk and its life cycle. By letting you control all these aspects, Hive makes it easier
to manage and process data with a variety of tools.

Primitive Data Types
Hive supports several sizes of integer and floating-point types, a Boolean type, and
character strings of arbitrary length. Hive v0.8.0 added types for timestamps and binary
fields.
Table 3-1 lists the primitive types supported by Hive.
Table 3-1. Primitive data types

Type                 Size                                  Literal syntax examples
TINYINT              1 byte signed integer.                20
SMALLINT             2 byte signed integer.                20
INT                  4 byte signed integer.                20
BIGINT               8 byte signed integer.                20
BOOLEAN              Boolean true or false.                TRUE
FLOAT                Single precision floating point.      3.14159
DOUBLE               Double precision floating point.      3.14159
STRING               Sequence of characters. The           'Now is the time', "for all good men"
                     character set can be specified.
                     Single or double quotes can be used.
TIMESTAMP (v0.8.0+)  Integer, float, or string.            1327882394 (Unix epoch seconds),
                                                           1327882394.123456789 (Unix epoch seconds
                                                           plus nanoseconds), and
                                                           '2012-02-03 12:34:56.123456789'
                                                           (JDBC-compliant java.sql.Timestamp format)
BINARY (v0.8.0+)     Array of bytes.                       See discussion below

As in other SQL dialects, the case of these names is ignored.
It’s useful to remember that each of these types is implemented in Java, so the particular
behavior details will be exactly what you would expect from the corresponding Java
types. For example, STRING is implemented by the Java String, FLOAT is implemented
by Java float, etc.
Note that Hive does not support “character arrays” (strings) with maximum-allowed
lengths, as is common in other SQL dialects. Relational databases offer this feature as
a performance optimization; fixed-length records are easier to index, scan, etc. In the
“looser” world in which Hive lives, where it may not own the data files and has to be
flexible on file format, Hive relies on the presence of delimiters to separate fields. Also,
Hadoop and Hive emphasize optimizing disk reading and writing performance, where
fixing the lengths of column values is relatively unimportant.
Values of the new TIMESTAMP type can be integers, which are interpreted as seconds since
the Unix epoch time (Midnight, January 1, 1970), floats, which are interpreted as seconds since the epoch time with nanosecond resolution (up to 9 decimal places), and
strings, which are interpreted according to the JDBC date string format convention,
YYYY-MM-DD hh:mm:ss.fffffffff.
TIMESTAMPS are interpreted as UTC times. Hive provides built-in functions for converting
between UTC and other time zones: to_utc_timestamp converts a timestamp in a given time
zone to UTC, and from_utc_timestamp converts a UTC timestamp to a given time zone
(see Chapter 13 for more details).
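The three input forms can be sketched outside of Hive. The following Python fragment is only an illustration of the interpretation rules described above, not Hive's implementation; the helper name to_utc is hypothetical, and the fractional part of the string form is truncated to microseconds because Python's datetime does not carry nanoseconds:

```python
from datetime import datetime, timezone

def to_utc(value):
    """Interpret an integer, float, or JDBC-style string as a UTC datetime."""
    if isinstance(value, (int, float)):
        # Integers and floats are seconds since the Unix epoch.
        return datetime.fromtimestamp(value, tz=timezone.utc)
    # Strings follow YYYY-MM-DD hh:mm:ss[.fffffffff].
    date_part, _, fraction = value.partition(".")
    parsed = datetime.strptime(date_part, "%Y-%m-%d %H:%M:%S")
    # Keep at most six fractional digits (microseconds).
    micros = int(fraction[:6].ljust(6, "0")) if fraction else 0
    return parsed.replace(microsecond=micros, tzinfo=timezone.utc)

print(to_utc(1327882394))                        # epoch seconds from Table 3-1
print(to_utc("2012-02-03 12:34:56.123456789"))   # JDBC-style string from Table 3-1
```

Both calls return values in UTC, which mirrors the statement that TIMESTAMPS are interpreted as UTC times.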
The BINARY type is similar to the VARBINARY type found in many relational databases.
It’s not like a BLOB type, since BINARY columns are stored within the record, not separately like BLOBs. BINARY can be used as a way of including arbitrary bytes in a record
and preventing Hive from attempting to parse them as numbers, strings, etc.
Note that you don’t need BINARY if your goal is to ignore the tail end of each record. If
a table schema specifies three columns and the data files contain five values for each
record, the last two will be ignored by Hive.


What if you run a query that wants to compare a float column to a double column or
compare a value of one integer type with a value of a different integer type? Hive will
implicitly cast any integer to the larger of the two integer types, cast FLOAT to DOUBLE,
and cast any integer value to DOUBLE, as needed, so it is comparing identical types.
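As a rough model of these promotion rules, here is a small Python sketch. It encodes only what the paragraph above states for numeric types; the function name comparison_type and the ordering list are our own illustrative choices, not part of Hive:

```python
# Hive's integer types, ordered from narrowest to widest.
INTEGER_TYPES = ["TINYINT", "SMALLINT", "INT", "BIGINT"]

def comparison_type(left, right):
    """Return the type both operands are cast to before they are compared."""
    if left == right:
        return left  # identical types need no cast
    if left in INTEGER_TYPES and right in INTEGER_TYPES:
        # Two different integer types: use the larger of the two.
        return max(left, right, key=INTEGER_TYPES.index)
    # FLOAT vs. DOUBLE, or an integer vs. a floating-point type: compare as DOUBLE.
    return "DOUBLE"

print(comparison_type("TINYINT", "BIGINT"))
print(comparison_type("FLOAT", "DOUBLE"))
```

The sketch covers only the numeric types named in the text; strings and other types need the explicit casts discussed next.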
What if you run a query that wants to interpret a string column as a number? You can
explicitly cast one type to another as in the following example, where s is a string
column that holds a value representing an integer:
... cast(s AS INT) ...;

(To be clear, the AS INT are keywords, so lowercase would be fine.)
We’ll discuss data conversions in more depth in “Casting” on page 109.

Collection Data Types
Hive supports columns that are structs, maps, and arrays. Note that the literal syntax
examples in Table 3-2 are actually calls to built-in functions.
Table 3-2. Collection data types

Type    Description                                                     Literal syntax examples
STRUCT  Analogous to a C struct or an “object.” Fields can be accessed  struct('John', 'Doe')
        using the “dot” notation. For example, if a column name is of
        type STRUCT {first STRING; last STRING}, then the first name
        field can be referenced using name.first.
MAP     A collection of key-value tuples, where the fields are accessed map('first', 'John', 'last', 'Doe')
        using array notation (e.g., ['key']). For example, if a column
        name is of type MAP with key→value pairs 'first'→'John' and
        'last'→'Doe', then the last name can be referenced using
        name['last'].
ARRAY   Ordered sequences of the same type that are indexable using     array('John', 'Doe')
        zero-based integers. For example, if a column name is of type
        ARRAY of strings with the value ['John', 'Doe'], then the
        second element can be referenced using name[1].

As with simple types, the case of the type name is ignored.
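If you think in a general-purpose language, the three access notations correspond loosely to the following Python constructs. This is only an analogy, not Hive code; the Name type and the variable names are hypothetical:

```python
from collections import namedtuple

# STRUCT: named fields of fixed position, accessed with "dot" notation.
Name = namedtuple("Name", ["first", "last"])
as_struct = Name("John", "Doe")

# MAP: key-value pairs, accessed with ['key'].
as_map = {"first": "John", "last": "Doe"}

# ARRAY: ordered values of one type, accessed with a zero-based index.
as_array = ["John", "Doe"]

print(as_struct.first)   # like name.first in HiveQL
print(as_map["last"])    # like name['last']
print(as_array[1])       # like name[1]
```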
Most relational databases don’t support such collection types, because using them
tends to break normal form. For example, in traditional data models, structs might be
captured in separate tables, with foreign key relations between the tables, as
appropriate.
A practical problem with breaking normal form is the greater risk of data duplication,
leading to unnecessary disk space consumption and potential data inconsistencies, as
duplicate copies can grow out of sync as changes are made.


However, in Big Data systems, a benefit of sacrificing normal form is higher processing
throughput. Scanning data off hard disks with minimal “head seeks” is essential when
processing terabytes to petabytes of data. Embedding collections in records makes retrieval faster with minimal seeks. Navigating each foreign key relationship requires
seeking across the disk, with significant performance overhead.
Hive doesn’t have the concept of keys. However, you can index tables,
as we’ll see in Chapter 7.

Here is a table declaration that demonstrates how to use these types, an employees table
in a fictitious Human Resources application:
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>);

The name is a simple string and for most employees, a float is large enough for the salary.
The list of subordinates is an array of string values, where we treat the name as a “primary
key,” so each element in subordinates would reference another record in the table.
Employees without subordinates would have an empty array. In a traditional model,
the relationship would go the other way, from an employee to his or her manager. We’re
not arguing that our model is better for Hive; it’s just a contrived example to illustrate
the use of arrays.
The deductions is a map that holds a key-value pair for every deduction that will be
subtracted from the employee’s salary when paychecks are produced. The key is the
name of the deduction (e.g., “Federal Taxes”), and the value would either be a percentage
value or an absolute number. In a traditional data model, there might be separate tables
for deduction type (each key in our map), where the rows contain particular deduction
values and a foreign key pointing back to the corresponding employee record.
Finally, the home address of each employee is represented as a struct, where each field
is named and has a particular type.
Note that Java syntax conventions for generics are followed for the collection types. For
example, MAP<STRING, FLOAT> means that every key in the map will be of type STRING
and every value will be of type FLOAT. For an ARRAY<STRING>, every item in the array will
be a STRING. STRUCTs can mix different types, but the locations are fixed to the declared
position in the STRUCT.


Text File Encoding of Data Values
Let’s begin our exploration of file formats by looking at the simplest example, text files.
You are no doubt familiar with text files delimited with commas or tabs, the so-called
comma-separated values (CSVs) or tab-separated values (TSVs), respectively. Hive can
use those formats if you want and we’ll show you how shortly. However, there is a
drawback to both formats; you have to be careful about commas or tabs embedded in
text and not intended as field or column delimiters. For this reason, Hive uses various
control characters by default, which are less likely to appear in value strings. Hive uses
the term field when overriding the default delimiter, as we’ll see shortly. The default
delimiters are listed in Table 3-3.
Table 3-3. Hive’s default record and field delimiters

Delimiter         Description
\n                For text files, each line is a record, so the line feed character separates records.
^A (“control” A)  Separates all fields (columns). Written using the octal code \001 when explicitly
                  specified in CREATE TABLE statements.
^B                Separates the elements in an ARRAY or STRUCT, or the key-value pairs in a MAP.
                  Written using the octal code \002 when explicitly specified in CREATE TABLE
                  statements.
^C                Separates the key from the corresponding value in MAP key-value pairs. Written using
                  the octal code \003 when explicitly specified in CREATE TABLE statements.

Records for the employees table declared in the previous section would look like the
following example, where we use ^A, etc., to represent the field delimiters. A text editor
like Emacs will show the delimiters this way. Note that the lines have been wrapped in
the example because they are too long for the printed page. To clearly indicate the
division between records, we have added blank lines between them that would not
appear in the file:
John Doe^A100000.0^AMary Smith^BTodd Jones^AFederal Taxes^C.2^BState
Taxes^C.05^BInsurance^C.1^A1 Michigan Ave.^BChicago^BIL^B60600

Mary Smith^A80000.0^ABill King^AFederal Taxes^C.2^BState
Taxes^C.05^BInsurance^C.1^A100 Ontario St.^BChicago^BIL^B60601

Todd Jones^A70000.0^A^AFederal Taxes^C.15^BState
Taxes^C.03^BInsurance^C.1^A200 Chicago Ave.^BOak Park^BIL^B60700

Bill King^A60000.0^A^AFederal Taxes^C.15^BState
Taxes^C.03^BInsurance^C.1^A300 Obscure Dr.^BObscuria^BIL^B60100

This is a little hard to read, but you would normally let Hive do that for you, of course.
Let’s walk through the first line to understand the structure. First, here is what it would
look like in JavaScript Object Notation (JSON), where we have also inserted the names
from the table schema:
{
  "name": "John Doe",
  "salary": 100000.0,
  "subordinates": ["Mary Smith", "Todd Jones"],
  "deductions": {
    "Federal Taxes": 0.2,
    "State Taxes": 0.05,
    "Insurance": 0.1
  },
  "address": {
    "street": "1 Michigan Ave.",
    "city": "Chicago",
    "state": "IL",
    "zip": 60600
  }
}

You’ll note that maps and structs are effectively the same thing in JSON.
Now, here’s how the first line of the text file breaks down:
• John Doe is the name.
• 100000.0 is the salary.
• Mary Smith^BTodd Jones are the subordinates “Mary Smith” and “Todd Jones.”
• Federal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1 are the deductions, where
20% is deducted for “Federal Taxes,” 5% is deducted for “State Taxes,” and 10%
is deducted for “Insurance.”
• 1 Michigan Ave.^BChicago^BIL^B60600 is the address, “1 Michigan Ave., Chicago,
IL 60600.”
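The breakdown above can be mimicked in a few lines of Python. This is a minimal sketch of the default delimiter scheme, not how Hive itself parses records; the function name parse_employee is hypothetical, and the struct field names (street, city, state, zip) are taken from the JSON rendering shown earlier:

```python
FIELD, ITEM, KEY = "\x01", "\x02", "\x03"   # ^A, ^B, ^C

def parse_employee(line):
    """Split one text record of the employees table into nested Python values."""
    name, salary, subs, deds, addr = line.rstrip("\n").split(FIELD)
    street, city, state, zip_code = addr.split(ITEM)
    return {
        "name": name,
        "salary": float(salary),
        # An empty field means an empty array.
        "subordinates": subs.split(ITEM) if subs else [],
        # ^B separates the pairs; ^C separates each key from its value.
        "deductions": {k: float(v) for k, v in
                       (pair.split(KEY) for pair in deds.split(ITEM))} if deds else {},
        "address": {"street": street, "city": city,
                    "state": state, "zip": int(zip_code)},
    }

# The first record from the example file, built from its five fields.
line = FIELD.join([
    "John Doe",
    "100000.0",
    ITEM.join(["Mary Smith", "Todd Jones"]),
    ITEM.join(["Federal Taxes" + KEY + ".2",
               "State Taxes" + KEY + ".05",
               "Insurance" + KEY + ".1"]),
    ITEM.join(["1 Michigan Ave.", "Chicago", "IL", "60600"]),
])
record = parse_employee(line)
print(record["subordinates"], record["deductions"]["State Taxes"])
```

Note that the three delimiters form a strict hierarchy: a value containing ^A, ^B, or ^C would be split in the wrong place, which is why control characters that rarely occur in text were chosen as the defaults.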
You can override these default delimiters. This might be necessary if another application writes the data using a different convention. Here is the same table declaration
again, this time with all the format defaults explicitly specified:
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;


The ROW FORMAT DELIMITED sequence of keywords must appear before any of the other
clauses, with the exception of the STORED AS … clause.
The character \001 is the octal code for ^A. The clause ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\001' means that Hive will use the ^A character to separate fields.
Similarly, the character \002 is the octal code for ^B. The clause ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '\002' means that Hive will use the ^B character to
separate collection items.
Finally, the character \003 is the octal code for ^C. The clause ROW FORMAT DELIMITED
MAP KEYS TERMINATED BY '\003' means that Hive will use the ^C character to separate
map keys from values.
The clauses LINES TERMINATED BY '…' and STORED AS … do not require the ROW FORMAT
DELIMITED keywords.
Actually, it turns out that Hive does not currently support any character for LINES
TERMINATED BY … other than '\n'. So this clause has limited utility today.
You can override the field, collection, and key-value separators and still use the default
text file format, so the clause STORED AS TEXTFILE is rarely used. For most of this book,
we will use the default TEXTFILE file format.
There are other file format options, but we’ll defer discussing them until Chapter 15.
A related issue is compression of files, which we’ll discuss in Chapter 11.
So, while you can specify all these clauses explicitly, you will rely on the default
separators most of the time and provide only the clauses you need to override.
These specifications only affect what Hive expects to see when it reads
files. Except in a few limited cases, it’s up to you to write the data files
in the correct format.

For example, here is a table definition where the data will contain comma-delimited
fields.
CREATE TABLE some_data (
first FLOAT,
second FLOAT,
third FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Use '\t' for tab-delimited fields.


This example does not properly handle the general case of files in CSV
(comma-separated values) and TSV (tab-separated values) formats. They
can include a header row with column names and column string values
might be quoted and they might contain embedded commas or tabs,
respectively. See Chapter 15 for details on handling these file types more
generally.
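To see why a bare comma delimiter is not the same thing as CSV, compare a plain split on commas with a CSV-aware reader. The following Python sketch uses a made-up input line; the plain split only approximates what a FIELDS TERMINATED BY ',' declaration implies and is not Hive code:

```python
import csv
import io

# A hypothetical line whose second value is quoted and contains a comma.
line = '1.5,"2,500.0",3.25\n'

# Splitting on every comma, as a simple field delimiter would.
naive = line.rstrip("\n").split(",")

# A CSV-aware reader honors the quotes around the second value.
quoted = next(csv.reader(io.StringIO(line)))

print(naive)    # four pieces, two of them fragments of the quoted value
print(quoted)   # three fields, as the writer of the file intended
```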

This powerful customization feature makes it much easier to use Hive with files created
by other tools and various ETL (extract, transform, and load) processes.

Schema on Read
When you write data to a traditional database, either through loading external data,
writing the output of a query, doing UPDATE statements, etc., the database has total
control over the storage. The database is the “gatekeeper.” An important implication
of this control is that the database can enforce the schema as data is written. This is
called schema on write.
Hive has no such control over the underlying storage. There are many ways to create,
modify, and even damage the data that Hive will query. Therefore, Hive can only enforce
the schema when it reads the data. This is called schema on read.
So what if the schema doesn’t match the file contents? Hive does the best that it can to
read the data. You will get lots of null values if there aren’t enough fields in each record
to match the schema. If some fields are numbers and Hive encounters nonnumeric
strings, it will return nulls for those fields. Above all else, Hive tries to recover from all
errors as best it can.
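The forgiving behavior described above can be imitated in a few lines of Python. This is a sketch of the idea only, not Hive's reader; the function read_row and the three-column schema (borrowed from the some_data table) are illustrative assumptions:

```python
def read_row(line, schema, delimiter="\x01"):
    """Apply (name, converter) pairs to one text line, yielding None for bad or missing data."""
    fields = line.rstrip("\n").split(delimiter)
    row = {}
    for i, (name, convert) in enumerate(schema):
        try:
            row[name] = convert(fields[i])
        except (IndexError, ValueError):
            # Too few fields, or a value that is not a number: return a null.
            row[name] = None
    return row  # any extra trailing fields are simply ignored

schema = [("first", float), ("second", float), ("third", float)]
print(read_row("1.0,oops", schema, ","))     # nonnumeric and missing fields become None
print(read_row("1,2,3,4,5", schema, ","))    # the last two values are dropped
```

The trade-off is that malformed data surfaces as nulls in query results rather than as load-time errors, so it is worth checking for unexpected nulls when a new data source is first queried.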


CHAPTER 4

HiveQL: Data Definition

HiveQL is the Hive query language. Like all SQL dialects in widespread use, it doesn’t
fully conform to any particular revision of the ANSI SQL standard. It is perhaps closest
to MySQL’s dialect, but with significant differences. Hive offers no support for row-level
inserts, updates, and deletes. Hive doesn’t support transactions. Hive adds extensions to
provide better performance in the context of Hadoop and to integrate with
custom extensions and even external programs.
Still, much of HiveQL will be familiar. This chapter and the ones that follow discuss
the features of HiveQL using representative examples. In some cases, we will briefly
mention details for completeness, then explore them more fully in later chapters.
This chapter starts with the so-called data definition language parts of HiveQL, which
are used for creating, altering, and dropping databases, tables, views, functions, and
indexes. We’ll discuss databases and tables in this chapter, deferring the discussion of
views until Chapter 7, indexes until Chapter 8, and functions until Chapter 13.
We’ll also discuss the SHOW and DESCRIBE commands for listing and describing items as
we go.
Subsequent chapters explore the data manipulation language parts of HiveQL that are
used to put data into Hive tables and to extract data to the filesystem, and how to
explore and manipulate data with queries, grouping, filtering, joining, etc.

Databases in Hive
The Hive concept of a database is essentially just a catalog or namespace of tables.
However, they are very useful for larger clusters with multiple teams and users, as a
way of avoiding table name collisions. It’s also common to use databases to organize
production tables into logical groups.
If you don’t specify a database, the default database is used.
The simplest syntax for creating a database is shown in the following example:


hive> CREATE DATABASE financials;

Hive will throw an error if financials already exists. You can suppress these warnings
with this variation:
hive> CREATE DATABASE IF NOT EXISTS financials;

While normally you might like to be warned if a database of the same name already
exists, the IF NOT EXISTS clause is useful for scripts that should create a database
on-the-fly, if necessary, before proceeding.
You can also use the keyword SCHEMA instead of DATABASE in all the database-related
commands.
At any time, you can see the databases that already exist as follows:
hive> SHOW DATABASES;
default
financials
hive> CREATE DATABASE human_resources;
hive> SHOW DATABASES;
default
financials
human_resources

If you have a lot of databases, you can restrict the ones listed using a regular expression, a concept we’ll explain in “LIKE and RLIKE” on page 96, if it is new to you. The
following example lists only those databases that start with the letter h and end with
any other characters (the .* part):
hive> SHOW DATABASES LIKE 'h.*';
human_resources
hive> ...

Hive will create a directory for each database. Tables in that database will be stored in
subdirectories of the database directory. The exception is tables in the default database,
which doesn’t have its own directory.
The database directory is created under a top-level directory specified by the property
hive.metastore.warehouse.dir, which we discussed in “Local Mode Configuration” on page 24
and “Distributed and Pseudodistributed Mode Configuration” on page 26. Assuming you are
using the default value for this property, /user/hive/warehouse, when the financials
database is created, Hive will create the directory /user/hive/warehouse/financials.db.
Note the .db extension.
You can override this default location for the new directory as shown in this example:
hive> CREATE DATABASE financials
> LOCATION '/my/preferred/directory';

You can add a descriptive comment to the database, which will be shown by the
DESCRIBE DATABASE <database> command.


hive> CREATE DATABASE financials
> COMMENT 'Holds all financial tables';
hive> DESCRIBE DATABASE financials;
financials Holds all financial tables
hdfs://master-server/user/hive/warehouse/financials.db

Note that DESCRIBE DATABASE also shows the directory location for the database. In this
example, the URI scheme is hdfs. For a MapR installation, it would be maprfs. For an
Amazon Elastic MapReduce (EMR) cluster, it would also be hdfs, but you could set
hive.metastore.warehouse.dir to use Amazon S3 explicitly (i.e., by specifying
s3n://bucketname/… as the property value). You could use s3 as the scheme, but the newer
s3n is preferred.
In the output of DESCRIBE DATABASE, we’re showing master-server to indicate the URI
authority, in this case a DNS name and optional port number (i.e., server:port) for the
“master node” of the filesystem (i.e., where the NameNode service is running for
HDFS). If you are running in pseudo-distributed mode, then the master server will be
localhost. For local mode, the path will be a local path,
file:///user/hive/warehouse/financials.db.
If the authority is omitted, Hive uses the master-server name and port defined by
the property fs.default.name in the Hadoop configuration files, found in the
$HADOOP_HOME/conf directory.
To be clear, hdfs:///user/hive/warehouse/financials.db is equivalent to
hdfs://master-server/user/hive/warehouse/financials.db, where master-server is your
master node’s DNS name and optional port.
For completeness, when you specify a relative path (e.g., some/relative/path), Hive will
put this under your home directory in the distributed filesystem (e.g.,
hdfs:///user/<user-name>) for HDFS. However, if you are running in local mode, your
current working directory is used as the parent of some/relative/path.
For script portability, it’s typical to omit the authority, only specifying it when referring
to another distributed filesystem instance (including S3 buckets).
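The resolution rules for the distributed (HDFS) case can be summarized in a short Python sketch. This models only what the preceding paragraphs describe, not Hive's or Hadoop's actual code; the function name resolve, the default_fs value (standing in for fs.default.name), and the user name are placeholders, and local mode is deliberately not handled:

```python
from urllib.parse import urlsplit

def resolve(location, default_fs="hdfs://master-server", user="me"):
    """Expand a LOCATION string into a full URI for the distributed case."""
    parts = urlsplit(location)
    if parts.netloc:
        # An explicit authority (e.g., another cluster or an S3 bucket): use as-is.
        return location
    if parts.path.startswith("/"):
        # Authority omitted: fill it in from the configured default filesystem.
        return default_fs + parts.path
    # Relative path: place it under the user's home directory.
    return f"{default_fs}/user/{user}/{parts.path}"

print(resolve("hdfs:///user/hive/warehouse/financials.db"))
print(resolve("some/relative/path"))
print(resolve("s3n://bucketname/data"))
```

Because the authority is filled in at resolution time, a script that omits it keeps working when it is moved to a cluster with a different master node.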
Lastly, you can associate key-value properties with the database, although their only
function currently is to provide a way of adding information to the output of DESCRIBE
DATABASE EXTENDED <database>:
hive> CREATE DATABASE financials
> WITH DBPROPERTIES ('creator' = 'Mark Moneybags', 'date' = '2012-01-02');
hive> DESCRIBE DATABASE financials;
financials hdfs://master-server/user/hive/warehouse/financials.db
hive> DESCRIBE DATABASE EXTENDED financials;
financials hdfs://master-server/user/hive/warehouse/financials.db
{date=2012-01-02, creator=Mark Moneybags}


The USE command sets a database as your working database, analogous to changing
working directories in a filesystem:
hive> USE financials;

Now, commands such as SHOW TABLES; will list the tables in this database.
Unfortunately, there is no command to show you which database is your current
working database! Fortunately, it’s always safe to repeat the USE … command; there is
no concept in Hive of nesting of databases.
Recall that we pointed out a useful trick in “Variables and Properties” on page 31 for
setting a property to print the current database as part of the prompt (Hive v0.8.0 and
later):
hive> set hive.cli.print.current.db=true;
hive (financials)> USE default;
hive (default)> set hive.cli.print.current.db=false;
hive> ...

Finally, you can drop a database:
hive> DROP DATABASE IF EXISTS financials;

The IF EXISTS is optional and suppresses warnings if financials doesn’t exist.
By default, Hive won’t permit you to drop a database if it contains tables. You can either
drop the tables first or append the CASCADE keyword to the command, which will cause
Hive to drop the tables in the database first:
hive> DROP DATABASE IF EXISTS financials CASCADE;

Using the RESTRICT keyword instead of CASCADE is equivalent to the default behavior,
where existing tables must be dropped before dropping the database.
When a database is dropped, its directory is also deleted.

Alter Database
You can set key-value pairs in the DBPROPERTIES associated with a database using the
ALTER DATABASE command. No other metadata about the database can be changed,
including its name and directory location:
hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by' = 'Joe Dba');

There is no way to delete or “unset” a DBPROPERTY.


Creating Tables
The CREATE TABLE statement follows SQL conventions, but Hive’s version offers significant extensions to support a wide range of flexibility where the data files for tables
are stored, the formats used, etc. We discussed many of these options in “Text File
Encoding of Data Values” on page 45 and we’ll return to more advanced options later
in Chapter 15. In this section, we describe the other options available for the CREATE
TABLE statement, adapting the employees table declaration we used previously in “Collection Data Types” on page 43:
CREATE TABLE IF NOT EXISTS mydb.employees (
  name         STRING COMMENT 'Employee name',
  salary       FLOAT COMMENT 'Employee salary',
  subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
  deductions   MAP<STRING, FLOAT>
               COMMENT 'Keys are deductions names, values are percentages',
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
               COMMENT 'Home address')
COMMENT 'Description of the table'
LOCATION '/user/hive/warehouse/mydb.db/employees'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...);

First, note that you can prefix a database name, mydb in this case, if you’re not currently
working in the target database.
If you add the option IF NOT EXISTS, Hive will silently ignore the statement if the table
already exists. This is useful in scripts that should create a table the first time they run.
However, the clause has a gotcha you should know. If the schema specified differs from
the schema in the table that already exists, Hive won’t warn you. If your intention is
for this table to have the new schema, you’ll have to drop the old table, losing your
data, and then re-create it. Consider if you should use one or more ALTER TABLE statements to change the existing table schema instead. See “Alter Table” on page 66 for
details.
If you use IF NOT EXISTS and the existing table has a different schema
than the schema in the CREATE TABLE statement, Hive will ignore the
discrepancy.

You can add a comment to any column, after the type. Like databases, you can attach
a comment to the table itself and you can define one or more table properties. In most
cases, the primary benefit of TBLPROPERTIES is to add additional documentation in a
key-value format. However, when we examine Hive’s integration with databases such
as DynamoDB (see “DynamoDB” on page 225), we’ll see that the TBLPROPERTIES can
be used to express essential metadata about the database connection.


Hive automatically adds two table properties: last_modified_by holds the username of
the last user to modify the table, and last_modified_time holds the epoch time in seconds of that modification.
A planned enhancement for Hive v0.10.0 is to add a SHOW TBLPROPERTIES
table_name command that will list just the TBLPROPERTIES for a table.

Finally, you can optionally specify a location for the table data (as opposed to metadata, which the metastore will always hold). In this example, we are showing the default
location that Hive would use, /user/hive/warehouse/mydb.db/employees, where /user/
hive/warehouse is the default “warehouse” location (as discussed previously),
mydb.db is the database directory, and employees is the table directory.
By default, Hive always creates the table’s directory under the directory for the enclosing database. The exception is the default database. It doesn’t have a directory under /user/hive/warehouse, so a table in the default database will have its directory created
directly in /user/hive/warehouse (unless explicitly overridden).
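The default layout, including the exception for the default database, fits in a few lines of Python. This is a sketch of the naming rule only, under the assumption that hive.metastore.warehouse.dir has its default value; the function name table_dir is hypothetical:

```python
import posixpath

WAREHOUSE = "/user/hive/warehouse"   # default value of hive.metastore.warehouse.dir

def table_dir(table, database="default", warehouse=WAREHOUSE):
    """Return the directory Hive would use by default for a table's data."""
    if database == "default":
        # The default database has no directory of its own.
        return posixpath.join(warehouse, table)
    # Other databases get a directory named after the database, with a .db extension.
    return posixpath.join(warehouse, database + ".db", table)

print(table_dir("employees", "mydb"))
print(table_dir("employees"))
```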
To avoid potential confusion, it’s usually better to use an external table
if you don’t want to use the default table location. See “External
Tables” on page 56 for details.

You can also copy the schema (but not the data) of an existing table:
CREATE TABLE IF NOT EXISTS mydb.employees2
LIKE mydb.employees;

This version also accepts the optional LOCATION clause, but note that no other properties,
including the schema, can be defined; they are determined from the original table.
The SHOW TABLES command lists the tables. With no additional arguments, it shows the
tables in the current working database. Let’s assume we have already created a few
other tables, table1 and table2, and we did so in the mydb database:
hive> USE mydb;
hive> SHOW TABLES;
employees
table1
table2

If we aren’t in the same database, we can still list the tables in that database:
hive> USE default;
hive> SHOW TABLES IN mydb;
employees
table1
table2

If we have a lot of tables, we can limit the ones listed using a regular expression, a
concept we’ll discuss in detail in “LIKE and RLIKE” on page 96:
hive> USE mydb;
hive> SHOW TABLES 'empl.*';
employees

Not all regular expression features are supported. If you know regular expressions, it’s
better to test a candidate regular expression to make sure it actually works!
The regular expression in the single quote looks for all tables with names starting with
empl and ending with any other characters (the .* part).
Using the IN database_name clause and a regular expression for the table
names together is not supported.

We can also use the DESCRIBE EXTENDED mydb.employees command to show details about
the table. (We can drop the mydb. prefix if we’re currently using the mydb database.) We
have reformatted the output for easier reading and we have suppressed many details
to focus on the items that interest us now:
hive> DESCRIBE EXTENDED mydb.employees;
name          string             Employee name
salary        float              Employee salary
subordinates  array<string>      Names of subordinates
deductions    map<string,float>  Keys are deductions names, values are percentages
address       struct<street:string,city:string,state:string,zip:int>  Home address
Detailed Table Information  Table(tableName:employees, dbName:mydb, owner:me,
...
location:hdfs://master-server/user/hive/warehouse/mydb.db/employees,
parameters:{creator=me, created_at='2012-01-02 10:00:00',
            last_modified_by=me, last_modified_time=1337544510,
            comment:Description of the table, ...}, ...)

Replacing EXTENDED with FORMATTED provides more readable but also more verbose
output.
The first section shows the output of DESCRIBE without EXTENDED or FORMATTED (i.e., the
schema including the comments for each column).
If you only want to see the schema for a particular column, append the column to the
table name. Here, EXTENDED adds no additional output:
hive> DESCRIBE mydb.employees.salary;
salary float Employee salary

Returning to the extended output, note the line in the description that starts with
location:. It shows the full URI path in HDFS to the directory where Hive will keep
all the data for this table, as we discussed above.
We said that the last_modified_by and last_modified_time table properties are automatically created. However, they are only shown in the
Detailed Table Information if a user-specified table property has also
been defined!
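Because last_modified_time is stored as epoch seconds, it is not very readable in the DESCRIBE EXTENDED output. As a small convenience, here is one way to convert the value shown earlier into a timestamp. This uses only the Python standard library and assumes we want the result in UTC.

```python
from datetime import datetime, timezone

# Value copied from the DESCRIBE EXTENDED output shown earlier.
last_modified_time = 1337544510

# Interpret the epoch seconds as a UTC timestamp.
modified_at = datetime.fromtimestamp(last_modified_time, tz=timezone.utc)
print(modified_at.isoformat())
```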

Managed Tables
The tables we have created so far are called managed tables, or sometimes internal tables, because Hive controls the lifecycle of their data (more or less). As we’ve seen,
Hive stores the data for these tables in a subdirectory under the directory defined by
hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), by default.
When we drop a managed table (see “Dropping Tables” on page 66), Hive deletes
the data in the table.
However, managed tables are less convenient for sharing with other tools. For example,
suppose we have data that is created and used primarily by Pig or other tools, and we
want to run some queries against it without giving Hive ownership of the data. We can
define an external table that points to that data, but doesn’t take ownership of it.

External Tables
Suppose we are analyzing data from the stock markets. Periodically, we ingest the data
for NASDAQ and the NYSE from a source like Infochimps (http://infochimps.com/datasets)
and we want to study this data with many tools. (See the data sets named
infochimps_dataset_4777_download_16185 and infochimps_dataset_4778_download_16677,
respectively, which are actually sourced from Yahoo! Finance.) The schema we’ll
use next matches the schemas of both these data sources. Let’s assume the data files
are in the distributed filesystem directory /data/stocks.
The following table declaration creates an external table that can read all the data files
for this comma-delimited data in /data/stocks:
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
  exchange        STRING,
  symbol          STRING,
  ymd             STRING,
  price_open      FLOAT,
  price_high      FLOAT,
  price_low       FLOAT,
  price_close     FLOAT,
  volume          INT,
  price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

The EXTERNAL keyword tells Hive this table is external and the LOCATION … clause is
required to tell Hive where it’s located.
Because it’s external, Hive does not assume it owns the data. Therefore, dropping the
table does not delete the data, although the metadata for the table will be deleted.
There are a few other small differences between managed and external tables, where
some HiveQL constructs are not permitted for external tables. We’ll discuss those when
we come to them.
However, it’s important to note that the differences between managed and external
tables are smaller than they appear at first. Even for managed tables, you know where
they are located, so you can use other tools, hadoop dfs commands, etc., to modify and
even delete the files in the directories for managed tables. Hive may technically own
these directories and files, but it doesn’t have full control over them! Recall, in “Schema
on Read” on page 48, we said that Hive really has no control over the integrity of the
files used for storage and whether or not their contents are consistent with the table
schema. Even managed tables don’t give us this control.
Still, a general principle of good software design is to express intent. If the data is shared
between tools, then creating an external table makes this ownership explicit.
You can tell whether or not a table is managed or external using the output of DESCRIBE
EXTENDED tablename. Near the end of the Detailed Table Information output, you will
see the following for managed tables:
... tableType:MANAGED_TABLE)

For external tables, you will see the following:
... tableType:EXTERNAL_TABLE)

As for managed tables, you can also copy the schema (but not the data) of an existing
table:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.employees3
LIKE mydb.employees
LOCATION '/path/to/data';

If you omit the EXTERNAL keyword and the original table is external, the
new table will also be external. If you omit EXTERNAL and the original
table is managed, the new table will also be managed. However, if you
include the EXTERNAL keyword and the original table is managed, the new
table will be external. Even in this scenario, the LOCATION clause will
still be optional.
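The rule in this note amounts to a small truth table. The following Python sketch simply encodes the rule as stated here; the function name is hypothetical, and it is worth confirming the behavior on the Hive version you actually run.

```python
def like_table_is_external(external_keyword, original_is_external):
    """Model of the note above: is the table created by CREATE TABLE ... LIKE external?

    external_keyword: True if the statement includes the EXTERNAL keyword.
    original_is_external: True if the table being copied is external.
    """
    # The new table is external if either the keyword is present
    # or the original table is external.
    return external_keyword or original_is_external


for keyword in (False, True):
    for original in (False, True):
        kind = "external" if like_table_is_external(keyword, original) else "managed"
        print(f"EXTERNAL keyword={keyword}, original external={original}: {kind}")
```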

Partitioned, Managed Tables
The general notion of partitioning data is an old one. It can take many forms, but often
it’s used for distributing load horizontally, moving data physically closer to its most
frequent users, and other purposes.
Hive has the notion of partitioned tables. We’ll see that they have important
performance benefits, and they can help organize data in a logical fashion, such as
hierarchically.
We’ll discuss partitioned managed tables first. Let’s return to our employees table and
imagine that we work for a very large multinational corporation. Our HR people often
run queries with WHERE clauses that restrict the results to a particular country or to a
particular first-level subdivision (e.g., state in the United States or province in Canada).
(First-level subdivision is an actual term, used here, for example: http://www.commondatahub.com/state_source.jsp.)
We’ll just use the word state for simplicity. We have
redundant state information in the address field. It is distinct from the state partition.
We could remove the state element from address. There is no ambiguity in queries,
since we have to use address.state to project the value inside the address. So, let’s
partition the data first by country and then by state:
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);

Partitioning tables changes how Hive structures the data storage. If we create this table
in the mydb database, there will still be an employees directory for the table:
hdfs://master_server/user/hive/warehouse/mydb.db/employees

However, Hive will now create subdirectories reflecting the partitioning structure. For
example:
...
.../employees/country=CA/state=AB
.../employees/country=CA/state=BC
...
.../employees/country=US/state=AL
.../employees/country=US/state=AK
...

Yes, those are the actual directory names. The state directories will contain zero or more
files for the employees in those states.
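To make the naming scheme concrete, here is a minimal Python sketch that builds a partition directory from the table directory and an ordered list of partition key/value pairs. The function name partition_directory is hypothetical; the sketch mirrors the key=value layout shown above and is not Hive's implementation.

```python
def partition_directory(table_dir, partition_spec):
    """Build the directory for one partition.

    partition_spec is an ordered list of (key, value) pairs, in the same
    order as the PARTITIONED BY clause.
    """
    segments = [f"{key}={value}" for key, value in partition_spec]
    return "/".join([table_dir] + segments)


table_dir = "hdfs://master_server/user/hive/warehouse/mydb.db/employees"
print(partition_directory(table_dir, [("country", "CA"), ("state", "AB")]))
print(partition_directory(table_dir, [("country", "US"), ("state", "AK")]))
```

The order of the pairs matters: country comes before state because that is the order in which the partition columns were declared.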

Once created, the partition keys (country and state, in this case) behave like regular
columns. There is one known exception, due to a bug (see “Aggregate functions” on page 85). In fact, users of the table don’t need to care if these “columns”
are partitions or not, except when they want to optimize query performance.
For example, the following query selects all employees in the state of Illinois in the
United States:
SELECT * FROM employees
WHERE country = 'US' AND state = 'IL';

Note that because the country and state values are encoded in directory names, there
is no reason to have this data in the data files themselves. In fact, the data just gets in
the way in the files, since you have to account for it in the table schema, and this data
wastes space.
Perhaps the most important reason to partition data is for faster queries. In the previous
query, which limits the results to employees in Illinois, it is only necessary to scan the
contents of one directory. Even if we have thousands of country and state directories,
all but one can be ignored. For very large data sets, partitioning can dramatically improve query performance, but only if the partitioning scheme reflects common range
filtering (e.g., by locations, timestamp ranges).
When we add predicates to WHERE clauses that filter on partition values, these predicates
are called partition filters.
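The effect of a partition filter can be modeled as selecting directories before any file is read. The Python sketch below is a simplified illustration that handles only equality predicates; the helper names are hypothetical and Hive's real planner is far more general.

```python
def parse_partition(path):
    """Turn 'country=US/state=IL' into {'country': 'US', 'state': 'IL'}."""
    return dict(segment.split("=", 1) for segment in path.split("/"))


def prune(partitions, partition_filter):
    """Keep only the partitions whose values match every equality predicate."""
    selected = []
    for path in partitions:
        values = parse_partition(path)
        if all(values.get(key) == wanted for key, wanted in partition_filter.items()):
            selected.append(path)
    return selected


partitions = [
    "country=CA/state=AB",
    "country=CA/state=BC",
    "country=US/state=AL",
    "country=US/state=IL",
]

# WHERE country = 'US' AND state = 'IL' needs a single directory.
print(prune(partitions, {"country": "US", "state": "IL"}))
# WHERE country = 'US' needs every US directory, but none of the others.
print(prune(partitions, {"country": "US"}))
```

An empty filter selects every partition, which corresponds to the full scan discussed next.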
Even if you do a query across the entire US, Hive reads only the 65 directories covering
the 50 states, 8 territories, the District of Columbia, and 6 military “states” used
by the armed services. You can see the full list here: http://www.50states.com/abbreviations.htm.
Of course, if you need to do a query for all employees around the globe, you can still
do it. Hive will have to read every directory, but hopefully these broader disk scans will
be relatively rare.
However, a query across all partitions could trigger an enormous MapReduce job if the
table data and number of partitions are large. A highly suggested safety measure is
putting Hive into “strict” mode, which prohibits queries of partitioned tables without
a WHERE clause that filters on partitions. You can set the mode back to “nonstrict” to
lift the restriction, as in the following session:
hive> set hive.mapred.mode=strict;
hive> SELECT e.name, e.salary FROM employees e LIMIT 100;
FAILED: Error in semantic analysis: No partition predicate found for
Alias "e" Table "employees"
hive> set hive.mapred.mode=nonstrict;
hive> SELECT e.name, e.salary FROM employees e LIMIT 100;
John Doe  100000.0
...
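The check that strict mode performs in the session above can be modeled very simply. The following Python sketch is only a toy model under stated assumptions: the exception class and function name are hypothetical, and the real check happens inside Hive's semantic analyzer.

```python
class SemanticError(Exception):
    """Stand-in for the error Hive reports during semantic analysis."""


def check_query(mode, table_is_partitioned, has_partition_filter):
    """Reject unfiltered queries of partitioned tables when mode is 'strict'."""
    if mode == "strict" and table_is_partitioned and not has_partition_filter:
        raise SemanticError("No partition predicate found")
    return "ok"


# nonstrict mode lets the unfiltered query through.
print(check_query("nonstrict", True, False))

# strict mode rejects the same query.
try:
    check_query("strict", True, False)
except SemanticError as error:
    print("FAILED:", error)
```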

You can see the partitions that exist with the SHOW PARTITIONS command:
hive> SHOW PARTITIONS employees;
...
country=CA/state=AB
country=CA/state=BC
...
country=US/state=AL
country=US/state=AK
...

If you have a lot of partitions and you want to see if partitions have been defined for
particular partition keys, you can further restrict the command with an optional PARTITION
clause that specifies one or more of the partitions with specific values:
hive> SHOW PARTITIONS employees PARTITION(country='US');
country=US/state=AL
country=US/state=AK
...
hive> SHOW PARTITIONS employees PARTITION(country='US', state='AK');
country=US/state=AK

The DESCRIBE EXTENDED employees command shows the partition keys:
hive> DESCRIBE EXTENDED employees;
name         string,
salary       float,
...
address      struct<...>,
country      string,
state        string
Detailed Table Information...
partitionKeys:[FieldSchema(name:country, type:string, comment:null),
FieldSchema(name:state, type:string, comment:null)],
...

The schema part of the output lists the country and state with the other columns,
because they are columns as far as queries are concerned. The Detailed Table Information
includes the country and state as partition keys. The comments for both of these
keys are null; we could have added comments just as for regular columns.
You create partitions in managed tables by loading data into them. The following example creates a US and CA (California) partition while loading data into it from a local
directory, $HOME/california-employees. You must specify a value for each partition
column. Notice how we reference the HOME environment variable in HiveQL:
LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
INTO TABLE employees
PARTITION (country = 'US', state = 'CA');

The directory for this partition, …/employees/country=US/state=CA, will be created by
Hive and all data files in $HOME/california-employees will be copied into it. See
“Loading Data into Managed Tables” on page 71 for more information on populating
tables.

External Partitioned Tables
You can use partitioning with external tables. In fact, you may find that this is your
most common scenario for managing large production data sets. The combination gives you a way to “share” data with other tools, while still optimizing query
performance.
You also have more flexibility in the directory structure used, as you define it yourself.
We’ll see a particularly useful example in a moment.
Let’s consider a new example that fits this scenario well: logfile analysis. Most organizations use a standard format for log messages, recording a timestamp, severity (e.g.,
ERROR, WARNING, INFO), perhaps a server name and process ID, and then an arbitrary text
message. Suppose our Extract, Transform, and Load (ETL) process ingests and aggregates logfiles in our environment, converting each log message to a tab-delimited record
and also decomposing the timestamp into separate year, month, and day fields, and a
combined hms field for the remaining hour, minute, and second parts of the timestamp,
for reasons that will become clear in a moment. You could do this parsing of log messages using the string parsing functions built into Hive or Pig, for example. We could also use smaller integer types for some of the timestamp-related fields to
conserve space. Here, we are ignoring subsecond resolution.
Here’s how we might define the corresponding Hive table:
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages (
  hms        INT,
  severity   STRING,
  server     STRING,
  process_id INT,
  message    STRING)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

We’re assuming that a day’s worth of log data is about the correct size for a useful
partition and that finer-grained queries over a day’s data will be fast enough.
Recall that when we created the nonpartitioned external stocks table, a LOCATION …
clause was required. It isn’t used for external partitioned tables. Instead, an ALTER
TABLE statement is used to add each partition separately. It must specify a value for each
partition key, the year, month, and day, in this case (see “Alter Table” on page 66 for
more details on this feature). Here is an example, where we add a partition for January
2nd, 2012:
ALTER TABLE log_messages ADD PARTITION(year = 2012, month = 1, day = 2)
LOCATION 'hdfs://master_server/data/log_messages/2012/01/02';

The directory convention we use is completely up to us. Here, we follow a hierarchical
directory structure, because it’s a logical way to organize our data. We could instead
follow Hive’s directory naming convention (e.g., …/exchange=NASDAQ/symbol=AAPL),
but there is no requirement to do so.
An interesting benefit of this flexibility is that we can archive old data on inexpensive
storage, like Amazon’s S3, while keeping newer, more “interesting” data in HDFS. For
example, each day we might use the following procedure to move data older than a
month to S3:
• Copy the data for the partition being moved to S3. For example, you can use the
hadoop distcp command:
hadoop distcp /data/log_messages/2011/12/02 s3n://ourbucket/logs/2011/12/02

• Alter the table to point the partition to the S3 location:
ALTER TABLE log_messages PARTITION(year = 2011, month = 12, day = 2)
SET LOCATION 's3n://ourbucket/logs/2011/12/02';

• Remove the HDFS copy of the partition using the hadoop fs -rmr command:
hadoop fs -rmr /data/log_messages/2011/12/02
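Since all three steps must refer to the same partition date, it can help to generate the commands from a single date value rather than typing each path by hand. The Python sketch below only builds the command strings and does not run them. The function name archive_steps is hypothetical, the HDFS and S3 roots are the ones used in the example, and wrapping the statement in hive -e is an assumption about how a script would invoke the Hive CLI.

```python
from datetime import date


def archive_steps(day, hdfs_root="/data/log_messages", s3_root="s3n://ourbucket/logs"):
    """Return the three commands that move one day's partition from HDFS to S3."""
    # One zero-padded suffix is shared by the source and destination paths.
    suffix = f"{day.year:04d}/{day.month:02d}/{day.day:02d}"
    source = f"{hdfs_root}/{suffix}"
    target = f"{s3_root}/{suffix}"
    hql = (
        f"ALTER TABLE log_messages PARTITION(year = {day.year}, "
        f"month = {day.month}, day = {day.day}) SET LOCATION '{target}';"
    )
    return [
        f"hadoop distcp {source} {target}",  # 1. copy the data to S3
        f'hive -e "{hql}"',                  # 2. repoint the partition
        f"hadoop fs -rmr {source}",          # 3. remove the HDFS copy
    ]


for command in archive_steps(date(2011, 12, 2)):
    print(command)
```

Deriving every path from one date keeps the copy, the ALTER TABLE statement, and the delete aligned on the same partition.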

You don’t have to be an Amazon Elastic MapReduce user to use S3 this way. S3 support
is part of the Apache Hadoop distribution. You can still query this data, even queries
that cross the month-old “boundary,” where some data is read from HDFS and some
data is read from S3!
By the way, Hive doesn’t care if a partition directory doesn’t exist for a partition or if
it has no files. In both cases, you’ll just get no results for a query that filters for the
partition. This is convenient when you want to set up partitions before a separate process starts writing data to them. As soon as data is there, queries will return results from
that data.
This feature illustrates another benefit: new data can be written to a dedicated directory
with a clear distinction from older data in other directories. Also, whether you move
old data to an “archive” location or delete it outright, the risk of tampering with newer
data is reduced since the data subsets are in separate directories.
As for nonpartitioned external tables, Hive does not own the data and it does not delete
the data if the table is dropped.
As for managed partitioned tables, you can see an external table’s partitions with SHOW
PARTITIONS:
hive> SHOW PARTITIONS log_messages;
...
year=2011/month=12/day=31
year=2012/month=1/day=1
year=2012/month=1/day=2
...

Similarly, the DESCRIBE EXTENDED log_messages shows the partition keys both as part
of the schema and in the list of partitionKeys:
hive> DESCRIBE EXTENDED log_messages;
...
message      string,
year         int,
month        int,
day          int
Detailed Table Information...
partitionKeys:[FieldSchema(name:year, type:int, comment:null),
FieldSchema(name:month, type:int, comment:null),
FieldSchema(name:day, type:int, comment:null)],
...

This output is missing a useful bit of information, the actual location of the partition
data. There is a location field, but it only shows Hive’s default directory that would be
used if the table were a managed table. However, we can get a partition’s location as
follows:
hive> DESCRIBE EXTENDED log_messages PARTITION (year=2012, month=1, day=2);
...
location:hdfs://master_server/data/log_messages/2012/01/02,
...

We frequently use external partitioned tables because of the many benefits they provide, such as logical data management, performant queries, etc.
ALTER TABLE … ADD PARTITION is not limited to external tables. You can use it with
managed tables, too, when you have (or will have) data for partitions in directories
created outside of the LOAD and INSERT options we discussed above. You’ll need to
remember that not all of the table’s data will be under the usual Hive “warehouse”
directory, and this data won’t be deleted when you drop the managed table! Hence,
from a “sanity” perspective, it’s questionable whether you should dare to use this feature with managed tables.

Customizing Table Storage Formats
In “Text File Encoding of Data Values” on page 45, we discussed that Hive defaults to
a text file format, which is indicated by the optional clause STORED AS TEXTFILE, and
you can override the default values for the various delimiters when creating the table.
Here we repeat the definition of the employees table we used in that discussion:
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

TEXTFILE implies that all fields are encoded using alphanumeric characters, including
those from international character sets, although we observed that Hive uses nonprinting characters as “terminators” (delimiters), by default. When TEXTFILE is used,
each line is considered a separate record.
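To see how the three delimiters nest, here is a minimal Python sketch that parses one line of such a file into an employees record. It is an illustration only, not how Hive's SerDe is implemented. The sample line is made up for this example, and the address field names (street, city, state, zip) are assumptions of the sketch.

```python
FIELD = "\x01"  # '\001' separates columns
ITEM = "\x02"   # '\002' separates collection items (array, map, struct)
KEY = "\x03"    # '\003' separates a map key from its value


def parse_map(text):
    """Parse 'k1^Cv1^Bk2^Cv2' into a dict of float values."""
    result = {}
    if not text:
        return result
    for pair in text.split(ITEM):
        key, value = pair.split(KEY)
        result[key] = float(value)
    return result


def parse_employee(line):
    """Split one text line into the five employees columns."""
    name, salary, subordinates, deductions, address = line.rstrip("\n").split(FIELD)
    return {
        "name": name,
        "salary": float(salary),
        "subordinates": subordinates.split(ITEM) if subordinates else [],
        "deductions": parse_map(deductions),
        "address": dict(zip(("street", "city", "state", "zip"), address.split(ITEM))),
    }


# A made-up record, built with the same delimiters.
sample = FIELD.join([
    "John Doe",
    "100000.0",
    ITEM.join(["Mary Smith", "Todd Jones"]),
    ITEM.join(["Federal Taxes" + KEY + ".2", "State Taxes" + KEY + ".05"]),
    ITEM.join(["1 Michigan Ave.", "Chicago", "IL", "60600"]),
]) + "\n"

print(parse_employee(sample))
```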
You can replace TEXTFILE with one of the other built-in file formats supported by Hive,
including SEQUENCEFILE and RCFILE, both of which optimize disk space usage and I/O
bandwidth performance using binary encoding and optional compression. These formats are discussed in more detail in Chapter 11 and Chapter 15.
Hive draws a distinction between how records are encoded into files and how columns
are encoded into records. You customize these behaviors separately.
The record encoding is handled by an input format object (e.g., the Java code behind
TEXTFILE). For TEXTFILE, Hive uses a Java class (compiled module) named
org.apache.hadoop.mapred.TextInputFormat. If you are unfamiliar with Java, the dotted name syntax indicates a hierarchical namespace tree of packages that actually corresponds to the
directory structure for the Java code. The last name, TextInputFormat, is a class in the
lowest-level package mapred.
The record parsing is handled by a serializer/deserializer or SerDe for short. For TEXTFILE
and the encoding we described in Chapter 3 and repeated in the example above,
the SerDe Hive uses is another Java class called
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.
For completeness, there is also an output format that Hive uses for writing the
output of queries to files and to the console. For TEXTFILE, the Java class
named org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat is used for
output.
Hive uses an input format to split input streams into records, an output
format to format records into output streams (i.e., the output of queries), and a SerDe to parse records into columns when reading and to
encode columns into records when writing. We’ll explore these distinctions in greater depth in Chapter 15.
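The division of labor among the three components can be sketched with four small Python functions. This is a conceptual model only, using the tab-delimited log_messages layout from earlier in the chapter; the function names are hypothetical and do not correspond to the Java interfaces.

```python
COLUMNS = ["hms", "severity", "server", "process_id", "message"]


def split_records(stream):
    """Input format role: split a stream into records (one per line)."""
    return [line for line in stream.split("\n") if line]


def deserialize(record, columns):
    """SerDe role when reading: parse one record into named columns."""
    return dict(zip(columns, record.split("\t")))


def serialize(row, columns):
    """SerDe role when writing: encode the columns back into one record."""
    return "\t".join(row[column] for column in columns)


def join_records(records):
    """Output format role: write records to an output stream."""
    return "".join(record + "\n" for record in records)


stream = (
    "120000\tERROR\tserver1\t4242\tdisk full\n"
    "120001\tINFO\tserver2\t17\tall good\n"
)

rows = [deserialize(record, COLUMNS) for record in split_records(stream)]
print(rows[0]["severity"], rows[0]["message"])

# Writing the rows back out reproduces the original stream.
print(join_records(serialize(row, COLUMNS) for row in rows) == stream)
```

Keeping record splitting separate from column parsing is what lets a table swap in a different SerDe without changing how files are divided into records.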

Third-party input and output formats and SerDes can be specified, a feature which
permits users to customize Hive for a wide range of file formats not supported natively.
Here is a complete example that uses a custom SerDe, input format, and output format
for files accessible through the Avro protocol, which we will discuss in detail in “Avro
Hive SerDe” on page 209:

CREATE TABLE kst
PARTITIONED BY (ds string)
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES ('schema.url'='http://schema_provider/kst.avsc')
STORED AS
INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat';

The ROW FORMAT SERDE … specifies the SerDe to use. Hive provides the WITH SERDEPROPERTIES
feature that allows users to pass configuration information to the SerDe. Hive
knows nothing about the meaning of these properties. It’s up to the SerDe to decide
their meaning. Note that the name and value of each property must be a quoted string.
Finally, the STORED AS INPUTFORMAT … OUTPUTFORMAT … clause specifies the Java classes
to use for the input and output formats, respectively. If you specify one of these formats,
you are required to specify both of them.
Note that the DESCRIBE EXTENDED table command lists the input and output formats,
the SerDe, and any SerDe properties in the DETAILED TABLE INFORMATION. For our example, we would see the following:
hive> DESCRIBE EXTENDED kst;
...
inputFormat:com.linkedin.haivvreo.AvroContainerInputFormat,
outputFormat:com.linkedin.haivvreo.AvroContainerOutputFormat,
...
serdeInfo:SerDeInfo(name:null,
serializationLib:com.linkedin.haivvreo.AvroSerDe,
parameters:{schema.url=http://schema_provider/kst.avsc})
...

Finally, there are a few additional CREATE TABLE clauses that describe more details about
how the data is supposed to be stored. Let’s extend our previous stocks table example
from “External Tables” on page 56:
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
  exchange        STRING,
  symbol          STRING,
  ymd             STRING,
  price_open      FLOAT,
  price_high      FLOAT,
  price_low       FLOAT,
  price_close     FLOAT,
  volume          INT,
  price_adj_close FLOAT)
CLUSTERED BY (exchange, symbol)
SORTED BY (ymd ASC)
INTO 96 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

The CLUSTERED BY … INTO … BUCKETS clause, with an optional SORTED BY … clause is used
to optimize certain kinds of queries, which we discuss in detail in “Bucketing Table
Data Storage” on page 125.

Dropping Tables
The familiar DROP TABLE command from SQL is supported:
DROP TABLE IF EXISTS employees;

The IF EXISTS keywords are optional. If not used and the table doesn’t exist, Hive
returns an error.
For managed tables, the table metadata and data are deleted.
Actually, if you enable the Hadoop Trash feature, which is not on by
default, the data is moved to the .Trash directory in the distributed
filesystem for the user, which in HDFS is /user/$USER/.Trash. To enable
this feature, set the property fs.trash.interval to a reasonable positive
number. It’s the number of minutes between “trash checkpoints”; 1,440
would be 24 hours. While it’s not guaranteed to work for all versions of
all distributed filesystems, if you accidentally drop a managed table with
important data, you may be able to re-create the table, re-create any
partitions, and then move the files from .Trash to the correct directories
(using the filesystem commands) to restore the data.

For external tables, the metadata is deleted but the data is not.

Alter Table
Most table properties can be altered with ALTER TABLE statements, which change
metadata about the table but not the data itself. These statements can be used to fix
mistakes in schema, move partition locations (as we saw in “External Partitioned
Tables” on page 61), and do other operations.
ALTER TABLE modifies table metadata only. The data for the table is
untouched. It’s up to you to ensure that any modifications are consistent
with the actual data.

Renaming a Table
Use this statement to rename the table log_messages to logmsgs:
ALTER TABLE log_messages RENAME TO logmsgs;

Adding, Modifying, and Dropping a Table Partition
As we saw previously, ALTER TABLE table ADD PARTITION … is used to add a new partition
to a table (usually an external table). Here we repeat the same command shown previously with the additional options available:

ALTER TABLE log_messages ADD IF NOT EXISTS
PARTITION (year = 2011, month = 1, day = 1) LOCATION '/logs/2011/01/01'
PARTITION (year = 2011, month = 1, day = 2) LOCATION '/logs/2011/01/02'
PARTITION (year = 2011, month = 1, day = 3) LOCATION '/logs/2011/01/03'
...;

Multiple partitions can be added in the same query when using Hive v0.8.0 and later.
As always, IF NOT EXISTS is optional and has the usual meaning.
Hive v0.7.X allows you to use the syntax with multiple partition specifications, but it actually uses just the first partition specification, silently
ignoring the others! Instead, use a separate ALTER TABLE statement
for each partition.

Similarly, you can change a partition location, effectively moving it:
ALTER TABLE log_messages PARTITION(year = 2011, month = 12, day = 2)
SET LOCATION 's3n://ourbucket/logs/2011/12/02';

This command does not move the data from the old location, nor does it delete the old
data.
Finally, you can drop a partition:
ALTER TABLE log_messages DROP IF EXISTS PARTITION(year = 2011, month = 12, day = 2);

The IF EXISTS clause is optional, as usual. For managed tables, the data for the partition
is deleted, along with the metadata, even if the partition was created using ALTER TABLE
… ADD PARTITION. For external tables, the data is not deleted.
There are a few more ALTER statements that affect partitions discussed later
in “Alter Storage Properties” on page 68 and “Miscellaneous Alter Table Statements” on page 69.

Changing Columns
You can rename a column, change its position, type, or comment:
ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER severity;

You have to specify the old name, a new name, and the type, even if the name or type
is not changing. The keyword COLUMN is optional as is the COMMENT clause. If you aren’t
moving the column, the AFTER other_column clause is not necessary. In the example
shown, we move the column after the severity column. If you want to move the column
to the first position, use FIRST instead of AFTER other_column.
As always, this command changes metadata only. If you are moving columns, the data
must already match the new schema or you must change it to match by some other
means.
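The consequence of a metadata-only change is easy to demonstrate with a toy model. In the Python sketch below, a stored record keeps its original field order while the schema is reordered as in the CHANGE COLUMN example, so the values end up under the wrong column names. The sample record and the read_row helper are made up for illustration.

```python
def read_row(line, schema):
    """Apply a schema to a tab-delimited record, as happens at query time."""
    return dict(zip(schema, line.split("\t")))


# A record written while hms was still the first column.
stored = "120000\tERROR\tserver1\t4242\tdisk full"

old_schema = ["hms", "severity", "server", "process_id", "message"]
# After: CHANGE COLUMN hms hours_minutes_seconds INT ... AFTER severity
new_schema = ["severity", "hours_minutes_seconds", "server", "process_id", "message"]

print(read_row(stored, old_schema)["severity"])
# The file was not rewritten, so the old hms value now appears as the severity.
print(read_row(stored, new_schema)["severity"])
```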

Adding Columns
You can add new columns to the end of the existing columns, before any partition
columns.
ALTER TABLE log_messages ADD COLUMNS (
  app_name   STRING COMMENT 'Application name',
  session_id BIGINT COMMENT 'The current session id');

The COMMENT clauses are optional, as usual. If any of the new columns are in the wrong
position, use an ALTER TABLE table CHANGE COLUMN statement for each one to move it
to the correct position.

Deleting or Replacing Columns
The following example removes all the existing columns and replaces them with the
new columns specified:
ALTER TABLE log_messages REPLACE COLUMNS (
  hours_mins_secs INT    COMMENT 'hour, minute, seconds from timestamp',
  severity        STRING COMMENT 'The message severity',
  message         STRING COMMENT 'The rest of the message');

This statement effectively renames the original hms column and removes the server and
process_id columns from the original schema definition. As for all ALTER statements,
only the table metadata is changed.
The REPLACE statement can only be used with tables that use one of the native SerDe
modules: DynamicSerDe or MetadataTypedColumnsetSerDe. Recall that the SerDe determines how records are parsed into columns (deserialization) and how a record’s columns are written to storage (serialization). See Chapter 15 for more details on SerDes.

Alter Table Properties
You can add additional table properties or modify existing properties, but not remove
them:
ALTER TABLE log_messages SET TBLPROPERTIES (
'notes' = 'The process id is no longer captured; this column is always NULL');

Alter Storage Properties
There are several ALTER TABLE statements for modifying format and SerDe properties.
The following statement changes the storage format for a partition to be SEQUENCEFILE,
as we discussed in “Creating Tables” on page 53 (see “Sequence
Files” on page 148 and Chapter 15 for more information):
ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1)
SET FILEFORMAT SEQUENCEFILE;

The PARTITION clause is required if the table is partitioned.
You can specify a new SerDe along with SerDe properties or change the properties for
the existing SerDe. The following example specifies that a table will use a Java class
named com.example.JSONSerDe to process a file of JSON-encoded records:
ALTER TABLE table_using_JSON_storage
SET SERDE 'com.example.JSONSerDe'
WITH SERDEPROPERTIES (
'prop1' = 'value1',
'prop2' = 'value2');

The SERDEPROPERTIES are passed to the SerDe module (the Java class com.example.JSONSerDe,
in this case). Note that both the property names (e.g., prop1) and the values (e.g.,
value1) must be quoted strings.
The SERDEPROPERTIES feature is a convenient mechanism that SerDe implementations
can exploit to permit user customization. We’ll see a real-world example of a JSON
SerDe and how it uses SERDEPROPERTIES in “JSON SerDe” on page 208.
The following example demonstrates how to add new SERDEPROPERTIES for the current
SerDe:
ALTER TABLE table_using_JSON_storage
SET SERDEPROPERTIES (
'prop3' = 'value3',
'prop4' = 'value4');

You can alter the storage properties that we discussed in “Creating Tables”
on page 53:
ALTER TABLE stocks
CLUSTERED BY (exchange, symbol)
SORTED BY (symbol)
INTO 48 BUCKETS;

The SORTED BY clause is optional, but the CLUSTERED BY and INTO … BUCKETS clauses are required.
(See also “Bucketing Table Data Storage” on page 125 for information on the use of
data bucketing.)

Miscellaneous Alter Table Statements
In “Execution Hooks” on page 158, we’ll discuss a technique for adding execution
“hooks” for various operations. The ALTER TABLE … TOUCH statement is used to trigger
these hooks:
ALTER TABLE log_messages TOUCH
PARTITION(year = 2012, month = 1, day = 1);

The PARTITION clause is required for partitioned tables. A typical scenario for this statement is to trigger execution of the hooks when table storage files have been modified
outside of Hive. For example, a script that has just written new files for the 2012/01/01
partition for log_messages can make the following call to the Hive CLI:
hive -e 'ALTER TABLE log_messages TOUCH PARTITION(year = 2012, month = 1, day = 1);'

This statement won’t create the table or partition if it doesn’t already exist. Use the
appropriate creation commands in that case.
The ALTER TABLE … ARCHIVE PARTITION statement captures the partition files into a Hadoop archive (HAR) file. This only reduces the number of files in the filesystem, reducing the load on the NameNode, but doesn’t provide any space savings (e.g., through
compression):
ALTER TABLE log_messages ARCHIVE
PARTITION(year = 2012, month = 1, day = 1);

To reverse the operation, substitute UNARCHIVE for ARCHIVE. This feature is only available
for individual partitions of partitioned tables.
Finally, various protections are available. The following statements prevent the partition from being dropped and from being queried, respectively:
ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) ENABLE NO_DROP;
ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) ENABLE OFFLINE;

To reverse either operation, replace ENABLE with DISABLE. These operations also can’t
be used with nonpartitioned tables.
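As a sketch of the reverse operations just described, the following statements unarchive the same partition and lift both protections. They assume the ARCHIVE and ENABLE statements shown above were run first on the log_messages partition:

```sql
-- Restore the partition files from the Hadoop archive (HAR)
ALTER TABLE log_messages UNARCHIVE
PARTITION(year = 2012, month = 1, day = 1);

-- Allow the partition to be dropped and queried again
ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) DISABLE NO_DROP;
ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) DISABLE OFFLINE;
```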

CHAPTER 5

HiveQL: Data Manipulation

This chapter continues our discussion of HiveQL, the Hive query language, focusing
on the data manipulation language parts that are used to put data into tables and to
extract data from tables to the filesystem.
This chapter uses SELECT ... WHERE clauses extensively when we discuss populating
tables with data queried from other tables. So, why aren’t we covering SELECT ...
WHERE clauses first, instead of waiting until the next chapter, Chapter 6?
Since we just finished discussing how to create tables, we wanted to cover the next
obvious topic: how to get data into these tables so you’ll have something to query! We
assume you already understand the basics of SQL, so these clauses won’t be new to
you. If they are, please refer to Chapter 6 for details.

Loading Data into Managed Tables
Since Hive has no row-level insert, update, and delete operations, the only way to put
data into a table is to use one of the “bulk” load operations. Or you can just write files
in the correct directories by other means.
We saw an example of how to load data into a managed table in “Partitioned, Managed
Tables” on page 58, which we repeat here with an addition, the use of the OVERWRITE
keyword:
LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
OVERWRITE INTO TABLE employees
PARTITION (country = 'US', state = 'CA');

This command will first create the directory for the partition, if it doesn’t already exist,
then copy the data to it.
If the target table is not partitioned, you omit the PARTITION clause.
It is conventional practice to specify a path that is a directory, rather than an individual
file. Hive will copy all the files in the directory, which gives you the flexibility of organizing the data into multiple files and changing the file naming convention, without
requiring a change to your Hive scripts. Either way, the files will be copied to the appropriate location for the table and the names will be the same.
If the LOCAL keyword is used, the path is assumed to be in the local filesystem. The data
is copied into the final location. If LOCAL is omitted, the path is assumed to be in the
distributed filesystem. In this case, the data is moved from the path to the final location.
LOAD DATA LOCAL ... copies the local data to the final location in the
distributed filesystem, while LOAD DATA ... (i.e., without LOCAL) moves
the data to the final location.

The rationale for this inconsistency is the assumption that you usually don’t want
duplicate copies of your data files in the distributed filesystem.
Also, because files are moved in this case, Hive requires the source and target files and
directories to be in the same filesystem. For example, you can’t use LOAD DATA to load
(move) data from one HDFS cluster to another.
It is more robust to specify a full path, but relative paths can be used. When running
in local mode, the relative path is interpreted relative to the user’s working directory
when the Hive CLI was started. For distributed or pseudo-distributed mode, the path
is interpreted relative to the user’s home directory in the distributed filesystem, which
is /user/$USER by default in HDFS and MapRFS.
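For comparison with the LOCAL example above, here is a minimal sketch of the non-LOCAL form. The HDFS staging path is hypothetical; it only illustrates a full path in the same distributed filesystem as the table. Because LOCAL is omitted, the files are moved, not copied, so they no longer exist at the source path afterward:

```sql
-- Hypothetical staging directory in the same HDFS cluster as the warehouse
LOAD DATA INPATH '/user/me/staging/california-employees'
INTO TABLE employees
PARTITION (country = 'US', state = 'CA');
```

OVERWRITE is omitted here, so the files are added to whatever the partition already contains.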
If you specify the OVERWRITE keyword, any data already present in the target directory
will be deleted first. Without the keyword, the new files are simply added to the target
directory. However, if files already exist in the target directory that match filenames
being loaded, the old files are overwritten.
Versions of Hive before v0.9.0 had the following bug: when the OVERWRITE
keyword was not used, an existing data file in the target directory
would be overwritten if its name matched the name of a data file being
written to the directory. Hence, data would be lost. This bug was fixed
in the v0.9.0 release.

The PARTITION clause is required if the table is partitioned and you must specify a value
for each partition key.
In the example, the data will now exist in the following directory:
hdfs://master_server/user/hive/warehouse/mydb.db/employees/country=US/state=CA

Another restriction on the path given in the INPATH clause is that it cannot contain any
directories.

Hive does not verify that the data you are loading matches the schema for the table.
However, it will verify that the file format matches the table definition. For example,
if the table was created with SEQUENCEFILE storage, the loaded files must be sequence
files.

Inserting Data into Tables from Queries
The INSERT statement lets you load data into a table from a query. Reusing our employees
example from the previous chapter, here is an example for the state of Oregon,
where we presume the data is already in another table called staged_employees. For
reasons we’ll discuss shortly, let’s use different names for the country and state fields
in staged_employees, calling them cnty and st, respectively:
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';

With OVERWRITE, any previous contents of the partition (or whole table if not partitioned) are replaced.
If you replace the keyword OVERWRITE with INTO, Hive appends the data rather
than replacing it. This feature is only available in Hive v0.8.0 or later.
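As a minimal sketch of the appending form, the same Oregon query can be written with INTO in place of OVERWRITE. It reuses the staged_employees table and the cnty and st columns introduced above, and it requires Hive v0.8.0 or later:

```sql
-- Append the Oregon records instead of replacing the partition's contents
INSERT INTO TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';
```

Running this statement twice would leave two copies of the Oregon records in the partition, which is the practical difference from OVERWRITE.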
This example suggests one common scenario where this feature is useful: data has been
staged in a directory, exposed to Hive as an external table, and now you want to put it
into the final, partitioned table. A workflow like this is also useful if you want the target
table to have a different record format than the source table (e.g., a different field delimiter).
However, if staged_employees is very large and you run 65 of these statements to cover
all states, then it means you are scanning staged_employees 65 times! Hive offers an
alternative INSERT syntax that allows you to scan the input data once and split it multiple
ways. The following example shows this feature for creating the employees partitions
for three states:
FROM staged_employees se
INSERT OVERWRITE TABLE employees
  PARTITION (country = 'US', state = 'OR')
  SELECT * WHERE se.cnty = 'US' AND se.st = 'OR'
INSERT OVERWRITE TABLE employees
  PARTITION (country = 'US', state = 'CA')
  SELECT * WHERE se.cnty = 'US' AND se.st = 'CA'
INSERT OVERWRITE TABLE employees
  PARTITION (country = 'US', state = 'IL')
  SELECT * WHERE se.cnty = 'US' AND se.st = 'IL';

We have used indentation to make it clearer how the clauses group together. Each
record read from staged_employees will be evaluated with each SELECT … WHERE … clause.
Those clauses are evaluated independently; this is not an IF … THEN … ELSE … construct!
In fact, by using this construct, some records from the source table can be written to
multiple partitions of the destination table or none of them.
If a record satisfies a given SELECT … WHERE … clause, it gets written to the specified table
and partition. To be clear, each INSERT clause can insert into a different table, when
desired, and some of those tables could be partitioned while others aren’t.
Hence, some records from the input might get written to multiple output locations and
others might get dropped!
You can mix INSERT OVERWRITE clauses and INSERT INTO clauses, as well.
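The following sketch illustrates these points under stated assumptions. The table us_employees_all is hypothetical: a nonpartitioned table whose columns are assumed to match staged_employees. An Oregon record satisfies both WHERE clauses, so it is written to both destinations, while a record from outside the US satisfies neither and is dropped:

```sql
FROM staged_employees se
-- Replace the contents of the Oregon partition
INSERT OVERWRITE TABLE employees
  PARTITION (country = 'US', state = 'OR')
  SELECT * WHERE se.cnty = 'US' AND se.st = 'OR'
-- Append every US record to a hypothetical nonpartitioned table
INSERT INTO TABLE us_employees_all
  SELECT * WHERE se.cnty = 'US';
```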

Dynamic Partition Inserts
There’s still one problem with this syntax: if you have a lot of partitions to create, you
have to write a lot of SQL! Fortunately, Hive also supports a dynamic partition feature,
where it can infer the partitions to create based on query parameters. By comparison,
up until now we have considered only static partitions.
Consider this change to the previous example:
INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT ..., se.cnty, se.st
FROM staged_employees se;

Hive determines the values of the partition keys, country and state, from the last two
columns in the SELECT clause. This is why we used different names in staged_employees,
to emphasize that the relationship between the source column values and the output partition values is by position only and not by matching on names.
Suppose that staged_employees has data for a total of 100 country and state pairs. After
running this query, employees will have 100 partitions!
You can also mix dynamic and static partitions. This variation of the previous query
specifies a static value for the country (US) and a dynamic value for the state:
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state)
SELECT ..., se.st
FROM staged_employees se
WHERE se.cnty = 'US';

The static partition keys must come before the dynamic partition keys.
Dynamic partitioning is not enabled by default. When it is enabled, it works in “strict”
mode by default, where it expects at least some columns to be static. This helps protect
against a badly designed query that generates a gigantic number of partitions. For example, suppose you partition by timestamp and generate a separate partition for each second!
Perhaps you meant to partition by day or maybe hour instead. Several other properties
are also used to limit excess resource utilization. Table 5-1 describes these properties.

Table 5-1. Dynamic partitions properties
| Name | Default | Description |
| --- | --- | --- |
| hive.exec.dynamic.partition | false | Set to true to enable dynamic partitioning. |
| hive.exec.dynamic.partition.mode | strict | Set to nonstrict to enable all partitions to be determined dynamically. |
| hive.exec.max.dynamic.partitions.pernode | 100 | The maximum number of dynamic partitions that can be created by each mapper or reducer. Raises a fatal error if one mapper or reducer attempts to create more than the threshold. |
| hive.exec.max.dynamic.partitions | 1000 | The total number of dynamic partitions that can be created by one statement with dynamic partitioning. Raises a fatal error if the limit is exceeded. |
| hive.exec.max.created.files | 100000 | The maximum total number of files that can be created globally. A Hadoop counter is used to track the number of files created. Raises a fatal error if the limit is exceeded. |

So, for example, our first example using dynamic partitioning for all partitions might
actually look like this, where we set the desired properties just before use:
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.exec.max.dynamic.partitions.pernode=1000;
hive> INSERT OVERWRITE TABLE employees
    > PARTITION (country, state)
    > SELECT ..., se.cnty, se.st
    > FROM staged_employees se;
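After a dynamic partition insert, it can be useful to check which partitions Hive actually created. A minimal sketch, assuming the employees table from the examples above:

```sql
-- List every partition of the table
SHOW PARTITIONS employees;

-- Restrict the listing to partitions for one country
SHOW PARTITIONS employees PARTITION (country = 'US');
```

If the listing is much longer than expected, that is a hint that the partition columns are too fine-grained, which is the situation the limits in Table 5-1 are meant to catch.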

Creating Tables and Loading Them in One Query
You can also create a table and insert query results into it in one statement:
CREATE TABLE ca_employees
AS SELECT name, salary, address
FROM employees
WHERE state = 'CA';

This table contains just the name, salary, and address columns from the employees table
records for employees in California. The schema for the new table is taken from the
SELECT clause.
A common use for this feature is to extract a convenient subset of data from a larger,
more unwieldy table.
This feature can’t be used with external tables. Recall that “populating” a partition for
an external table is done with an ALTER TABLE statement, where we aren’t “loading”
data, per se, but pointing metadata to a location where the data can be found.

Exporting Data
How do we get data out of tables? If the data files are already formatted the way you
want, then it’s simple enough to copy the directories or files:
hadoop fs -cp source_path target_path

Otherwise, you can use INSERT … DIRECTORY …, as in this example:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address
FROM employees
WHERE state = 'CA';

OVERWRITE and LOCAL have the same interpretations as before and paths are interpreted
following the usual rules. One or more files will be written to /tmp/ca_employees,
depending on the number of reducers invoked.
The specified path can also be a full URI (e.g., hdfs://master-server/tmp/ca_employees).
Independent of how the data is actually stored in the source table, it is written to files
with all fields serialized as strings. Hive uses the same encoding in the generated output
files as it uses for the table’s internal storage.
As a reminder, we can look at the results from within the hive CLI:
hive> ! ls /tmp/ca_employees;
000000_0
hive> ! cat /tmp/ca_employees/000000_0;
John Doe100000.0201 San Antonio CircleMountain ViewCA94040
Mary Smith80000.01 Infinity LoopCupertinoCA95014
...

Yes, the filename is 000000_0. If there were two or more reducers writing output, we
would have additional files with similar names (e.g., 000001_0).
The fields appear to be joined together without delimiters because the ^A and ^B
separators aren’t rendered.
Just like inserting data to tables, you can specify multiple inserts to directories:
FROM staged_employees se
INSERT OVERWRITE DIRECTORY '/tmp/or_employees'
  SELECT * WHERE se.cnty = 'US' and se.st = 'OR'
INSERT OVERWRITE DIRECTORY '/tmp/ca_employees'
  SELECT * WHERE se.cnty = 'US' and se.st = 'CA'
INSERT OVERWRITE DIRECTORY '/tmp/il_employees'
  SELECT * WHERE se.cnty = 'US' and se.st = 'IL';

There are some limited options for customizing the output of the data (other than
writing a custom OUTPUTFORMAT, as discussed in “Customizing Table Storage Formats” on page 63). To format columns, the built-in functions include those for
formatting strings, such as converting case, padding output, and more. See “Other
built-in functions” on page 88 for more details.

The field delimiter for the table can be problematic, for example, if the table uses the default
^A delimiter. If you export table data frequently, it might be appropriate to use comma
or tab delimiters.
Another workaround is to define a “temporary” table with the storage configured to
match the desired output format (e.g., tab-delimited fields). Then write a query result
to that table and use INSERT OVERWRITE DIRECTORY, selecting from the temporary table.
Unlike many relational databases, there is no temporary table feature in Hive. You have
to manually drop any tables you create that aren’t intended to be permanent.
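The “temporary” table workaround described above might be sketched as follows. The table name ca_employees_export and the output directory are hypothetical, and only the primitive name and salary columns are exported to keep the example simple. Whether the exported files keep the tab delimiter can depend on the Hive version, so inspect the output before relying on it:

```sql
-- 1. Define a scratch table whose storage matches the desired output format
CREATE TABLE ca_employees_export (name STRING, salary FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- 2. Write the query result to the scratch table
INSERT OVERWRITE TABLE ca_employees_export
SELECT name, salary FROM employees
WHERE state = 'CA';

-- 3. Export the scratch table's contents to a local directory
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees_tsv'
SELECT * FROM ca_employees_export;

-- 4. Clean up by hand, since Hive will not drop the table for you
DROP TABLE ca_employees_export;
```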

CHAPTER 6

HiveQL: Queries

After learning the many ways we can define and format tables, let’s learn how to run
queries. Of course, we have assumed all along that you have some prior knowledge of
SQL. We’ve used some queries already to illustrate several concepts, such as loading
query data into other tables in Chapter 5. Now we’ll fill in most of the details. Some
special topics will be covered in subsequent chapters.
We’ll move quickly through details that are familiar to users with prior SQL experience
and focus on what’s unique to HiveQL, including syntax and feature differences, as
well as performance implications.

SELECT … FROM Clauses
SELECT is the projection operator in SQL. The FROM clause identifies from which table,
view, or nested query we select records (see Chapter 7).
For a given record, SELECT specifies the columns to keep, as well as the outputs of
function calls on one or more columns (e.g., the aggregation functions like count(*)).
Recall again our partitioned employees table:
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);

Let’s assume we have the same contents we showed in “Text File Encoding of Data
Values” on page 45 for four employees in the US state of Illinois (abbreviated IL). Here
are queries of this table and the output they produce:
hive> SELECT name, salary FROM employees;
John Doe    100000.0
Mary Smith  80000.0
Todd Jones  70000.0
Bill King   60000.0

The following two queries are identical. The second version uses a table alias e, which
is not very useful in this query, but becomes necessary in queries with JOINs (see “JOIN
Statements” on page 98) where several different tables are used:
hive> SELECT name, salary FROM employees;
hive> SELECT e.name, e.salary FROM employees e;

When you select columns that are one of the collection types, Hive uses JSON (JavaScript Object Notation) syntax for the output. First, let’s select the subordinates, an
ARRAY, where a comma-separated list surrounded with […] is used. Note that STRING
elements of the collection are quoted, while the primitive STRING name column is not:
hive> SELECT name, subordinates FROM employees;
John Doe    ["Mary Smith","Todd Jones"]
Mary Smith  ["Bill King"]
Todd Jones  []
Bill King   []

The deductions column is a MAP, where the JSON representation for maps is used, namely a
comma-separated list of key:value pairs, surrounded with {...}:
hive> SELECT name, deductions FROM employees;
John Doe    {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Mary Smith  {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Todd Jones  {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
Bill King   {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}

Finally, the address is a STRUCT, which is also written using the JSON map format:
hive> SELECT name, address FROM employees;
John Doe    {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
Mary Smith  {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
Todd Jones  {"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
Bill King   {"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100}

Next, let’s see how to reference elements of collections.
First, ARRAY indexing is 0-based, as in Java. Here is a query that selects the first element
of the subordinates array:
hive> SELECT name, subordinates[0] FROM employees;
John Doe    Mary Smith
Mary Smith  Bill King
Todd Jones  NULL
Bill King   NULL

Note that referencing a nonexistent element returns NULL. Also, the extracted STRING
values are no longer quoted!
To reference a MAP element, you also use ARRAY[...] syntax, but with key values instead
of integer indices:
hive> SELECT name, deductions["State Taxes"] FROM employees;
John Doe    0.05
Mary Smith  0.05
Todd Jones  0.03
Bill King   0.03

Finally, to reference an element in a STRUCT, you use “dot” notation, similar to the
table_alias.column mentioned above:
hive> SELECT name, address.city FROM employees;
John Doe    Chicago
Mary Smith  Chicago
Todd Jones  Oak Park
Bill King   Obscuria

These same referencing techniques are also used in WHERE clauses, which we discuss in
“WHERE Clauses” on page 92.
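As a brief sketch of the same three referencing techniques used as filters, the following queries assume the employees table and sample data shown above. String comparisons are used deliberately, since comparing the FLOAT deduction values against decimal literals raises precision questions that are better left to the discussion of WHERE clauses:

```sql
-- STRUCT field: employees whose address is in Chicago
SELECT name, address.city FROM employees
WHERE address.city = 'Chicago';

-- ARRAY element: employees who have at least one subordinate
SELECT name, subordinates[0] FROM employees
WHERE subordinates[0] IS NOT NULL;

-- MAP value: employees who have an entry for the given key
SELECT name, deductions['Insurance'] FROM employees
WHERE deductions['Insurance'] IS NOT NULL;
```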

Specify Columns with Regular Expressions
We can even use regular expressions to select the columns we want. The following query
selects the symbol column and all columns from stocks whose names start with the
prefix price (see footnote 1):
hive> SELECT symbol, `price.*` FROM stocks;
AAPL    195.69  197.88  194.0   194.12  194.12
AAPL    192.63  196.0   190.85  195.46  195.46
AAPL    196.73  198.37  191.57  192.05  192.05
AAPL    195.17  200.2   194.42  199.23  199.23
AAPL    195.91  196.32  193.38  195.86  195.86
...

We’ll talk more about Hive’s use of regular expressions in the section “LIKE and
RLIKE” on page 96.

Computing with Column Values
Not only can you select columns in a table, but you can manipulate column values
using function calls and arithmetic expressions.
For example, let’s select the employees’ names converted to uppercase, their salaries,
federal taxes percentage, and the value that results if we subtract the federal taxes portion from their salaries and round to the nearest integer. We could call a built-in function map_values to extract all the values from the deductions map and then add them
up with the built-in sum function.
The following query is long enough that we’ll split it over two lines. Note the secondary
prompt that Hive uses, an indented greater-than sign (>):
hive> SELECT upper(name), salary, deductions["Federal Taxes"],
    > round(salary * (1 - deductions["Federal Taxes"])) FROM employees;
JOHN DOE    100000.0  0.2   80000
MARY SMITH  80000.0   0.2   64000
TODD JONES  70000.0   0.15  59500
BILL KING   60000.0   0.15  51000

1. At the time of this writing, the Hive Wiki shows an incorrect syntax for specifying columns using regular
expressions.

Let’s discuss arithmetic operators and then discuss the use of functions in expressions.

Arithmetic Operators
All the typical arithmetic operators are supported. Table 6-1 describes the specific
details.
Table 6-1. Arithmetic operators
| Operator | Types | Description |
| --- | --- | --- |
| A + B | Numbers | Add A and B. |
| A - B | Numbers | Subtract B from A. |
| A * B | Numbers | Multiply A and B. |
| A / B | Numbers | Divide A by B. If the operands are integer types, the quotient of the division is returned. |
| A % B | Numbers | The remainder of dividing A by B. |
| A & B | Numbers | Bitwise AND of A and B. |
| A \| B | Numbers | Bitwise OR of A and B. |
| A ^ B | Numbers | Bitwise XOR of A and B. |
| ~A | Numbers | Bitwise NOT of A. |

Arithmetic operators take any numeric type. No type coercion is performed if the two
operands are of the same numeric type. Otherwise, if the types differ, then the value of
the narrower of the two types is promoted to the wider type of the other value. (Wider in the
sense that a type with more bytes can hold a wider range of values.) For example, for
INT and BIGINT operands, the INT is promoted to BIGINT. For INT and FLOAT operands,
the INT is promoted to FLOAT. Note that our query contained (1 - deductions[…]). Since
the deductions are FLOATs, the 1 was promoted to FLOAT.
You have to be careful about data overflow or underflow when doing arithmetic. Hive
follows the rules for the underlying Java types, where no attempt is made to automatically convert a result to a wider type if one exists, when overflow or underflow will
occur. Multiplication and division are most likely to trigger this problem.
It pays to be aware of the ranges of your numeric data values, whether or not those
values approach the upper or lower range limits of the types you are using in the corresponding schema, and what kinds of calculations people might do with the data.
If you are concerned about overflow or underflow, consider using wider types in the
schema. The drawback is the extra memory each data value will occupy.
You can also convert values to wider types in specific expressions, called casting. See
Table 6-2 below and “Casting” on page 109 for details.
Finally, it is sometimes useful to scale data values, such as dividing by powers of 10,
using log values, and so on. Scaling can also improve the accuracy and numerical stability of algorithms used in certain machine learning calculations, for example.
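To make the overflow and casting advice concrete, here is a minimal sketch. It assumes that the volume column of the stocks table is declared as INT, which is an assumption about the schema, and the multiplier is an arbitrary illustration:

```sql
SELECT symbol, ymd,
       -- INT * INT stays INT, so a large volume can silently overflow
       volume * 100000 AS int_product,
       -- Casting one operand first promotes the whole product to BIGINT
       cast(volume AS BIGINT) * 100000 AS bigint_product
FROM stocks;
```

Casting in the expression avoids changing the schema, at the cost of having to remember the cast in every query that needs it.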

Using Functions
Our tax-deduction example also uses a built-in mathematical function, round(), for
finding the nearest integer for a DOUBLE value.

Mathematical functions
Table 6-2 describes the built-in mathematical functions, as of Hive v0.8.0, for working
with single columns of data.
Table 6-2. Mathematical functions
| Return type | Signature | Description |
| --- | --- | --- |
| BIGINT | round(d) | Return the BIGINT for the rounded value of DOUBLE d. |
| DOUBLE | round(d, N) | Return the DOUBLE for the value of d, a DOUBLE, rounded to N decimal places. |
| BIGINT | floor(d) | Return the largest BIGINT that is <= d, a DOUBLE. |
| BIGINT | ceil(d), ceiling(d) | Return the smallest BIGINT that is >= d, a DOUBLE. |
| DOUBLE | rand(), rand(seed) | Return a pseudorandom DOUBLE that changes for each row. Passing in an integer seed makes the return value deterministic. |
| DOUBLE | exp(d) | Return e raised to the power d, a DOUBLE. |
| DOUBLE | ln(d) | Return the natural logarithm of d, a DOUBLE. |
| DOUBLE | log10(d) | Return the base-10 logarithm of d, a DOUBLE. |
| DOUBLE | log2(d) | Return the base-2 logarithm of d, a DOUBLE. |
| DOUBLE | log(base, d) | Return the base-base logarithm of d, where base and d are DOUBLEs. |
| DOUBLE | pow(d, p), power(d, p) | Return d raised to the power p, where d and p are DOUBLEs. |
| DOUBLE | sqrt(d) | Return the square root of d, a DOUBLE. |
| STRING | bin(i) | Return the STRING representing the binary value of i, a BIGINT. |
| STRING | hex(i) | Return the STRING representing the hexadecimal value of i, a BIGINT. |
| STRING | hex(str) | Return the STRING representing the hexadecimal value of str, where each two characters in the STRING str is converted to its hexadecimal representation. |
| STRING | unhex(i) | The inverse of hex(str). |
| STRING | conv(i, from_base, to_base) | Return the STRING in base to_base, an INT, representing the value of i, a BIGINT, in base from_base, an INT. |
| STRING | conv(str, from_base, to_base) | Return the STRING in base to_base, an INT, representing the value of str, a STRING, in base from_base, an INT. |
| DOUBLE | abs(d) | Return the DOUBLE that is the absolute value of d, a DOUBLE. |
| INT | pmod(i1, i2) | Return the positive modulus INT for two INTs, i1 mod i2. |
| DOUBLE | pmod(d1, d2) | Return the positive modulus DOUBLE for two DOUBLEs, d1 mod d2. |
| DOUBLE | sin(d) | Return the DOUBLE that is the sine of d, a DOUBLE, in radians. |
| DOUBLE | asin(d) | Return the DOUBLE that is the arcsine of d, a DOUBLE, in radians. |
| DOUBLE | cos(d) | Return the DOUBLE that is the cosine of d, a DOUBLE, in radians. |
| DOUBLE | acos(d) | Return the DOUBLE that is the arccosine of d, a DOUBLE, in radians. |
| DOUBLE | tan(d) | Return the DOUBLE that is the tangent of d, a DOUBLE, in radians. |
| DOUBLE | atan(d) | Return the DOUBLE that is the arctangent of d, a DOUBLE, in radians. |
| DOUBLE | degrees(d) | Return the DOUBLE that is the value of d, a DOUBLE, converted from radians to degrees. |
| DOUBLE | radians(d) | Return the DOUBLE that is the value of d, a DOUBLE, converted from degrees to radians. |
| INT | positive(i) | Return the INT value of i (i.e., it’s effectively the expression +i). |
| DOUBLE | positive(d) | Return the DOUBLE value of d (i.e., it’s effectively the expression +d). |
| INT | negative(i) | Return the negative of the INT value of i (i.e., it’s effectively the expression -i). |
| DOUBLE | negative(d) | Return the negative of the DOUBLE value of d; effectively, the expression -d. |
| FLOAT | sign(d) | Return the FLOAT value 1.0 if d, a DOUBLE, is positive; return the FLOAT value -1.0 if d is negative; otherwise return 0.0. |
| DOUBLE | e() | Return the DOUBLE that is the value of the constant e, 2.718281828459045. |
| DOUBLE | pi() | Return the DOUBLE that is the value of the constant pi, 3.141592653589793. |

Note the functions floor, round, and ceil (“ceiling”) for converting DOUBLE to BIGINT,
that is, floating-point numbers to integer numbers. These functions are the preferred
technique, rather than using the cast operator we mentioned above.

Also, there are functions for converting integers to strings in different bases (e.g.,
hexadecimal).
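A short sketch of these functions in use, assuming the employees table from the earlier examples. Dividing the salary by 12 is only an illustration of a value with a fractional part, and the literal arguments in the second query are arbitrary:

```sql
-- Three ways to turn a fractional monthly salary into a whole number,
-- plus rounding to two decimal places
SELECT name,
       floor(salary / 12),
       ceil(salary / 12),
       round(salary / 12),
       round(salary / 12, 2)
FROM employees;

-- Base conversions; the FROM clause is included because older Hive
-- versions require one even when no column is referenced
SELECT hex(255), bin(5), conv('ff', 16, 10)
FROM employees LIMIT 1;
```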

Aggregate functions
A special kind of function is the aggregate function that returns a single value resulting
from some computation over many rows. More precisely, this is the User Defined Aggregate Function, as we’ll see in “Aggregate Functions” on page 164. Perhaps the two
best known examples are count, which counts the number of rows (or values for a
specific column), and avg, which returns the average value of the specified column
values.
Here is a query that counts the number of our example employees and averages their
salaries:
hive> SELECT count(*), avg(salary) FROM employees;
4 77500.0

We’ll see other examples when we discuss GROUP BY in the section “GROUP BY Clauses” on page 97.
Table 6-3 lists Hive’s built-in aggregate functions.
Table 6-3. Aggregate functions
| Return type | Signature | Description |
| --- | --- | --- |
| BIGINT | count(*) | Return the total number of retrieved rows, including rows containing NULL values. |
| BIGINT | count(expr) | Return the number of rows for which the supplied expression is not NULL. |
| BIGINT | count(DISTINCT expr[, expr...]) | Return the number of rows for which the supplied expression(s) are unique and not NULL. |
| DOUBLE | sum(col) | Return the sum of the values. |
| DOUBLE | sum(DISTINCT col) | Return the sum of the distinct values. |
| DOUBLE | avg(col) | Return the average of the values. |
| DOUBLE | avg(DISTINCT col) | Return the average of the distinct values. |
| DOUBLE | min(col) | Return the minimum value of the values. |
| DOUBLE | max(col) | Return the maximum value of the values. |
| DOUBLE | variance(col), var_pop(col) | Return the variance of a set of numbers in a collection: col. |
| DOUBLE | var_samp(col) | Return the sample variance of a set of numbers. |
| DOUBLE | stddev_pop(col) | Return the standard deviation of a set of numbers. |
| DOUBLE | stddev_samp(col) | Return the sample standard deviation of a set of numbers. |
| DOUBLE | covar_pop(col1, col2) | Return the covariance of a set of numbers. |
| DOUBLE | covar_samp(col1, col2) | Return the sample covariance of a set of numbers. |
| DOUBLE | corr(col1, col2) | Return the correlation of two sets of numbers. |
| DOUBLE | percentile(int_expr, p) | Return the percentile of int_expr at p (range: [0,1]), where p is a DOUBLE. |
| ARRAY | percentile(int_expr, [p1, ...]) | Return the percentiles of int_expr at p (range: [0,1]), where p is a DOUBLE array. |
| DOUBLE | percentile_approx(int_expr, p, NB) | Return the approximate percentiles of int_expr at p (range: [0,1]), where p is a DOUBLE and NB is the number of histogram bins for estimating (default: 10,000 if not specified). |
| DOUBLE | percentile_approx(int_expr, [p1, ...], NB) | Return the approximate percentiles of int_expr at p (range: [0,1]), where p is a DOUBLE array and NB is the number of histogram bins for estimating (default: 10,000 if not specified). |
| ARRAY | histogram_numeric(col, NB) | Return an array of NB histogram bins, where the x value is the center and the y value is the height of the bin. |
| ARRAY | collect_set(col) | Return a set with the duplicate elements from collection col removed. |

You can usually improve the performance of aggregation by setting the property hive.map.aggr to true, as shown here:
hive> SET hive.map.aggr=true;
hive> SELECT count(*), avg(salary) FROM employees;

This setting will attempt to do “top-level” aggregation in the map phase, as in this
example. (An aggregation that isn’t top-level would be aggregation after performing a
GROUP BY.) However, this setting will require more memory.
As Table 6-3 shows, several functions accept DISTINCT … expressions. For example, we
could count the unique stock symbols this way:
hive> SELECT count(DISTINCT symbol) FROM stocks;
0

Wait, zero?? There is a bug when trying to use count(DISTINCT col)
when col is a partition column. The answer should be 743 for NASDAQ
and NYSE, at least as of early 2010 in the infochimps.org data set we
used.

Note that the Hive wiki currently claims that you can’t use more than one function(DISTINCT …)
expression in a query. For example, the following is supposed to be disallowed,
but it actually works:
hive> SELECT count(DISTINCT ymd), count(DISTINCT volume) FROM stocks;
12110 26144

So, there are 12,110 trading days of data, over 40 years’ worth.
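
To see several of the aggregate functions from Table 6-3 working together, here is a minimal sketch that summarizes Apple's closing prices in one query. It assumes the stocks table and its price_close, exchange, and symbol columns used elsewhere in this chapter. The column aliases are our own, and the results depend on your copy of the data, so no output is shown:

```sql
-- Sketch only: summary statistics for one stock symbol.
-- The aliases (trading_days, etc.) are illustrative names, not part of the data set.
SELECT count(*)                            AS trading_days,
       min(price_close)                    AS lowest_close,
       max(price_close)                    AS highest_close,
       avg(price_close)                    AS mean_close,
       stddev_pop(price_close)             AS stddev_close,
       percentile_approx(price_close, 0.5) AS approx_median_close
FROM stocks
WHERE exchange = 'NASDAQ' AND symbol = 'AAPL';
```

Because there is no GROUP BY clause, each function aggregates over all the rows that pass the WHERE clause, and the query returns a single row.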

Table generating functions
The “inverse” of aggregate functions are so-called table generating functions, which take
single columns and expand them to multiple columns or rows. We will discuss them
extensively in “Table Generating Functions” on page 165, but to complete the contents
of this section, we will discuss them briefly now and list the few built-in table generating
functions available in Hive.
To explain by way of an example, the following query converts the subordinates array
in each employees record into zero or more new records. If an employee record has an
empty subordinates array, then no new records are generated. Otherwise, one new
record per subordinate is generated:
hive> SELECT explode(subordinates) AS sub FROM employees;
Mary Smith
Todd Jones
Bill King

We used a column alias, sub, defined using the AS sub clause. When using table generating functions, column aliases are required by Hive. There are many other particular
details that you must understand to use these functions correctly. We’ll wait until
“Table Generating Functions” on page 165 to discuss the details.
Table 6-4 lists the built-in table generating functions.
Table 6-4. Table generating functions
Return type | Signature | Description
N rows | explode(array) | Return 0 to many rows, one row for each element from the input array.
N rows | explode(map) | (v0.8.0 and later) Return 0 to many rows, one row for each map key-value pair, with a field for each map key and a field for the map value.
tuple | json_tuple(jsonStr, p1, p2, …, pn) | Like get_json_object, but it takes multiple names and returns a tuple. All the input parameters and output column types are STRING.
tuple | parse_url_tuple(url, partname1, partname2, …, partnameN) where N >= 1 | Extract N parts from a URL. It takes a URL and the part names to extract, returning a tuple. All the input parameters and output column types are STRING. The valid partnames are case-sensitive and should only contain a minimum of white space: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY>.
N rows | stack(n, col1, …, colM) | Convert M columns into N rows of size M/N each.

Here is an example that uses parse_url_tuple where we assume a url_table exists that
contains a column of URLs called url:
SELECT parse_url_tuple(url, 'HOST', 'PATH', 'QUERY') as (host, path, query)
FROM url_table;

Compare parse_url_tuple with parse_url in Table 6-5 below.
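
The explode(map) form listed in Table 6-4 follows the same pattern, with one alias per generated column. As a sketch, assuming Hive v0.8.0 or later and the deductions map column of the employees table, the following should produce one row per deduction entry. The alias names deduction and rate are our own choice:

```sql
-- Sketch only: one output row per key-value pair in each deductions map.
-- "deduction" and "rate" are illustrative aliases for the map key and map value.
SELECT explode(deductions) AS (deduction, rate) FROM employees;
```

As with explode(array), there are restrictions on what else can appear in the SELECT clause alongside a table generating function. Those details are covered in the “Table Generating Functions” section referenced above.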

Other built-in functions
Table 6-5 describes the rest of the built-in functions for working with strings, maps,
arrays, JSON, and timestamps, with or without the recently introduced TIMESTAMP type
(see “Primitive Data Types” on page 41).
Table 6-5. Other built-in functions
Return type | Signature | Description
BOOLEAN | test in(val1, val2, …) | Return true if test equals one of the values in the list.
INT | length(s) | Return the length of the string.
STRING | reverse(s) | Return a reverse copy of the string.
STRING | concat(s1, s2, …) | Return the string resulting from s1 joined with s2, etc. For example, concat('ab', 'cd') results in 'abcd'. You can pass an arbitrary number of string arguments and the result will contain all of them joined together.
STRING | concat_ws(separator, s1, s2, …) | Like concat, but using the specified separator.
STRING | substr(s, start_index) | Return the substring of s starting from the start_index position, where 1 is the index of the first character, until the end of s. For example, substr('abcd', 3) results in 'cd'.
STRING | substr(s, int start, int length) | Return the substring of s starting from the start position with the given length, e.g., substr('abcdefgh', 3, 2) results in 'cd'.
STRING | upper(s) | Return the string that results from converting all characters of s to upper case, e.g., upper('hIvE') results in 'HIVE'.
STRING | ucase(s) | A synonym for upper().
STRING | lower(s) | Return the string that results from converting all characters of s to lower case, e.g., lower('hIvE') results in 'hive'.
STRING | lcase(s) | A synonym for lower().
STRING | trim(s) | Return the string that results from removing whitespace from both ends of s, e.g., trim(' hive ') results in 'hive'.
STRING | ltrim(s) | Return the string resulting from trimming spaces from the beginning (lefthand side) of s, e.g., ltrim(' hive ') results in 'hive '.
STRING | rtrim(s) | Return the string resulting from trimming spaces from the end (righthand side) of s, e.g., rtrim(' hive ') results in ' hive'.
STRING | regexp_replace(s, regex, replacement) | Return the string resulting from replacing all substrings in s that match the Java regular expression regex with replacement [a]. If replacement is blank, the matches are effectively deleted, e.g., regexp_replace('hive', '[ie]', 'z') returns 'hzvz'.
STRING | regexp_extract(subject, regex_pattern, index) | Returns the substring for the index’s match using the regex_pattern.
STRING | parse_url(url, partname, key) | Extracts the specified part from a URL. It takes a URL and the partname to extract. The valid partnames are case-sensitive: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY>. The optional key is used for the last QUERY:<KEY> request. Compare with parse_url_tuple described in Table 6-4.
INT | size(map) | Return the number of elements in the map.
INT | size(array) | Return the number of elements in the array.
value of type <type> | cast(<expr> as <type>) | Convert (“cast”) the result of the expression expr to type, e.g., cast('1' as BIGINT) will convert the string '1' to its integral representation. A NULL is returned if the conversion does not succeed.
STRING | from_unixtime(int unixtime) | Convert the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of '1970-01-01 00:00:00'.
STRING | to_date(timestamp) | Return the date part of a timestamp string, e.g., to_date("1970-01-01 00:00:00") returns '1970-01-01'.
INT | year(timestamp) | Return the year part as an INT of a timestamp string, e.g., year("1970-11-01 00:00:00") returns 1970.
INT | month(timestamp) | Return the month part as an INT of a timestamp string, e.g., month("1970-11-01 00:00:00") returns 11.
INT | day(timestamp) | Return the day part as an INT of a timestamp string, e.g., day("1970-11-01 00:00:00") returns 1.
STRING | get_json_object(json_string, path) | Extract the JSON object from a JSON string based on the given JSON path, and return the JSON string of the extracted object. NULL is returned if the input JSON string is invalid.
STRING | space(n) | Returns n spaces.
STRING | repeat(s, n) | Repeats s n times.
INT | ascii(s) | Returns the integer value for the first ASCII character in the string s.
STRING | lpad(s, len, pad) | Returns s exactly len length, prepending instances of the string pad on its left, if necessary, to reach len characters. If s is longer than len, it is truncated.
STRING | rpad(s, len, pad) | Returns s exactly len length, appending instances of the string pad on its right, if necessary, to reach len characters. If s is longer than len, it is truncated.
ARRAY<STRING> | split(s, pattern) | Returns an array of substrings of s, split on occurrences of pattern.
INT | find_in_set(s, commaSeparatedString) | Returns the index of the comma-separated string where s is found, or 0 if it is not found. NULL is returned if either argument is NULL.
INT | locate(substr, str[, pos]) | Returns the index of str after pos where substr is found.
INT | instr(str, substr) | Returns the index of str where substr is found.
MAP<STRING,STRING> | str_to_map(s, delim1, delim2) | Creates a map by parsing s, using delim1 as the separator between key-value pairs and delim2 as the key-value separator.
ARRAY<ARRAY<STRING>> | sentences(s, lang, locale) | Splits s into arrays of sentences, where each sentence is an array of words. The lang and locale arguments are optional; if omitted, the default locale is used.
ARRAY<STRUCT<STRING,DOUBLE>> | ngrams(array<array<string>>, N, K, pf) | Estimates the top-K n-grams in the text. pf is the precision factor.
ARRAY<STRUCT<STRING,DOUBLE>> | context_ngrams(array<array<string>>, array<string>, int K, int pf) | Like ngrams, but looks for n-grams that begin with the second array of words in each outer array.
BOOLEAN | in_file(s, filename) | Returns true if s appears in the file named filename.

[a] See http://docs.oracle.com/javase/tutorial/essential/regex/ for more on Java regular expression syntax.

Note that the time-related functions (near the end of the table) take integer or string
arguments. As of Hive v0.8.0, these functions also take TIMESTAMP arguments, but they
will continue to take integer or string arguments for backwards compatibility.
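
As a small illustration of the time-related functions, the following sketch applies several of them to the same timestamp string used in Table 6-5. We select from employees with LIMIT 1 only to have a table to read from, on the assumption that your Hive version requires a FROM clause; any non-empty table would do:

```sql
-- Sketch only: time-related functions with integer and string arguments.
-- The expected values in the comments are the ones given in Table 6-5.
SELECT from_unixtime(0),                -- start of the Unix epoch, in the system time zone
       to_date('1970-11-01 00:00:00'),  -- date part: 1970-11-01
       year('1970-11-01 00:00:00'),     -- 1970
       month('1970-11-01 00:00:00'),    -- 11
       day('1970-11-01 00:00:00')       -- 1
FROM employees LIMIT 1;
```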

LIMIT Clause
The results of a typical query can return a large number of rows. The LIMIT clause puts
an upper limit on the number of rows returned:
hive> SELECT upper(name), salary, deductions["Federal Taxes"],
> round(salary * (1 - deductions["Federal Taxes"])) FROM employees
> LIMIT 2;
JOHN DOE    100000.0  0.2  80000
MARY SMITH  80000.0   0.2  64000

Column Aliases
You can think of the previous example query as returning a new relation with new
columns, some of which are anonymous results of manipulating columns in
employees. It’s sometimes useful to give those anonymous columns a name, called a
column alias. Here is the previous query with column aliases for the third and fourth
columns returned by the query, fed_taxes and salary_minus_fed_taxes, respectively:
hive> SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes,
> round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
> FROM employees LIMIT 2;
JOHN DOE    100000.0  0.2  80000
MARY SMITH  80000.0   0.2  64000

Nested SELECT Statements
The column alias feature is especially useful in nested select statements. Let’s use the
previous example as a nested query:
hive> FROM (
> SELECT upper(name) as name, salary, deductions["Federal Taxes"] as fed_taxes,
> round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
> FROM employees
> ) e
> SELECT e.name, e.salary_minus_fed_taxes
> WHERE e.salary_minus_fed_taxes > 70000;
JOHN DOE      80000
BOSS MAN      140000
FRED FINANCE  105000

The previous result set is aliased as e, from which we perform a second query to select
the name and the salary_minus_fed_taxes, where the latter is greater than 70,000. (We’ll
cover WHERE clauses in “WHERE Clauses” on page 92 below.)

CASE … WHEN … THEN Statements
The CASE … WHEN … THEN clauses are like if statements for individual columns in query
results. For example:
hive> SELECT name, salary,
> CASE
>   WHEN salary < 50000.0 THEN 'low'
>   WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
>   WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
>   ELSE 'very high'
> END AS bracket FROM employees;
John Doe          100000.0  very high
Mary Smith        80000.0   high
Todd Jones        70000.0   high
Bill King         60000.0   middle
Boss Man          200000.0  very high
Fred Finance      150000.0  very high
Stacy Accountant  60000.0   middle
...

When Hive Can Avoid MapReduce
If you have been running the queries in this book so far, you have probably noticed
that a MapReduce job is started in most cases. Hive implements some kinds of queries
without using MapReduce, in so-called local mode, for example:
SELECT * FROM employees;

In this case, Hive can simply read the records from employees and dump the formatted
output to the console.
This even works for WHERE clauses that only filter on partition keys, with or without
LIMIT clauses:
SELECT * FROM employees
WHERE country = 'US' AND state = 'CA'
LIMIT 100;

Furthermore, Hive will attempt to run other operations in local mode if the
hive.exec.mode.local.auto property is set to true:
set hive.exec.mode.local.auto=true;

Otherwise, Hive uses MapReduce to run all other queries.
Trust us, you want to add set hive.exec.mode.local.auto=true; to your
$HOME/.hiverc file.

WHERE Clauses
While SELECT clauses select columns, WHERE clauses are filters; they select which records
to return. As with SELECT clauses, we have already used many simple examples of WHERE
clauses before defining the clause, on the assumption you have seen them before. Now
we’ll explore them in a bit more detail.

WHERE clauses use predicate expressions, applying predicate operators, which we’ll describe in a moment, to columns. Several predicate expressions can be joined with AND
and OR clauses. When the predicate expressions evaluate to true, the corresponding
rows are retained in the output.

We just used the following example that restricts the results to employees in the state
of California:
SELECT * FROM employees
WHERE country = 'US' AND state = 'CA';

The predicates can reference the same variety of computations over column values that
can be used in SELECT clauses. Here we adapt our previously used query involving
Federal Taxes, filtering for those rows where the salary minus the federal taxes is greater
than 70,000:
hive> SELECT name, salary, deductions["Federal Taxes"],
> salary * (1 - deductions["Federal Taxes"])
> FROM employees
> WHERE round(salary * (1 - deductions["Federal Taxes"])) > 70000;
John Doe      100000.0  0.2  80000.0
Boss Man      200000.0  0.3  140000.0
Fred Finance  150000.0  0.3  105000.0

This query is a bit ugly, because the complex expression on the second line is duplicated
in the WHERE clause. The following variation eliminates the duplication, using a column
alias, but unfortunately it’s not valid:
hive> SELECT name, salary, deductions["Federal Taxes"],
>   salary * (1 - deductions["Federal Taxes"]) as salary_minus_fed_taxes
> FROM employees
> WHERE round(salary_minus_fed_taxes) > 70000;
FAILED: Error in semantic analysis: Line 4:13 Invalid table alias or
column reference 'salary_minus_fed_taxes': (possible column names are:
name, salary, subordinates, deductions, address)

As the error message says, we can’t reference column aliases in the WHERE clause. However, we can use a nested SELECT statement:
hive> SELECT e.* FROM
> (SELECT name, salary, deductions["Federal Taxes"] as ded,
>   salary * (1 - deductions["Federal Taxes"]) as salary_minus_fed_taxes
> FROM employees) e
> WHERE round(e.salary_minus_fed_taxes) > 70000;
John Doe      100000.0  0.2  80000.0
Boss Man      200000.0  0.3  140000.0
Fred Finance  150000.0  0.3  105000.0

Predicate Operators
Table 6-6 describes the predicate operators, which are also used in JOIN … ON and
HAVING clauses.

Table 6-6. Predicate operators
Operator | Types | Description
A = B | Primitive types | True if A equals B. False otherwise.
A <> B, A != B | Primitive types | NULL if A or B is NULL; true if A is not equal to B; false otherwise.
A < B | Primitive types | NULL if A or B is NULL; true if A is less than B; false otherwise.
A <= B | Primitive types | NULL if A or B is NULL; true if A is less than or equal to B; false otherwise.
A > B | Primitive types | NULL if A or B is NULL; true if A is greater than B; false otherwise.
A >= B | Primitive types | NULL if A or B is NULL; true if A is greater than or equal to B; false otherwise.
A IS NULL | All types | True if A evaluates to NULL; false otherwise.
A IS NOT NULL | All types | False if A evaluates to NULL; true otherwise.
A LIKE B | String | True if A matches the SQL simplified regular expression specification given by B; false otherwise. B is interpreted as follows: 'x%' means A must begin with the prefix 'x', '%x' means A must end with the suffix 'x', and '%x%' means A must begin with, end with, or contain the substring 'x'. Similarly, the underscore '_' matches a single character. B must match the whole string A.
A RLIKE B, A REGEXP B | String | True if A matches the regular expression given by B; false otherwise. Matching is done by the JDK regular expression library and hence it follows the rules of that library. For example, the regular expression must match the entire string A, not just a subset. See below for more information about regular expressions.

We’ll discuss LIKE and RLIKE in detail below (“LIKE and RLIKE” on page 96). First,
let’s point out an issue with comparing floating-point numbers that you should
understand.

Gotchas with Floating-Point Comparisons
A common gotcha arises when you compare floating-point numbers of different types
(i.e., FLOAT versus DOUBLE). Consider the following query of the employees table, which
is designed to return the employee’s name, salary, and federal taxes deduction, but only
if that tax deduction exceeds 0.2 (20%) of his or her salary:
hive> SELECT name, salary, deductions['Federal Taxes']
> FROM employees WHERE deductions['Federal Taxes'] > 0.2;
John Doe      100000.0  0.2
Mary Smith    80000.0   0.2
Boss Man      200000.0  0.3
Fred Finance  150000.0  0.3

Wait! Why are records with deductions['Federal Taxes'] = 0.2 being returned?
Is it a Hive bug? There is a bug filed against Hive for this issue, but it actually reflects
the behavior of the internal representation of floating-point numbers when they are
compared and it affects almost all software written in most languages on all modern
digital computers (see https://issues.apache.org/jira/browse/HIVE-2586).
When you write a floating-point literal value like 0.2, Hive uses a DOUBLE to hold the
value. We defined the deductions map values to be FLOAT, which means that Hive will
implicitly convert the tax deduction value to DOUBLE to do the comparison. This should
work, right?
Actually, it doesn’t work. Here’s why. The number 0.2 can’t be represented exactly in
a FLOAT or DOUBLE. (See http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg
.html for an in-depth discussion of floating-point number issues.) In this particular case,
the closest exact value is just slightly greater than 0.2, with a few nonzero bits at the
least significant end of the number.
To simplify things a bit, let’s say that 0.2 is actually 0.2000001 for FLOAT and
0.200000000001 for DOUBLE, because an 8-byte DOUBLE has more significant digits (after
the decimal point). When the FLOAT value from the table is converted to DOUBLE by Hive,
it produces the DOUBLE value 0.200000100000, which is greater than 0.200000000001.
That’s why the query results appear to use >= not >!
This issue is not unique to Hive or to Java, in which Hive is implemented. Rather, it’s a
general problem for all systems that use the IEEE standard for encoding floating-point
numbers!
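
You can observe the effect directly. The following sketch, which assumes only that some non-empty table such as employees is available to select from, widens the FLOAT representation of 0.2 to DOUBLE and compares it with the DOUBLE literal 0.2:

```sql
-- Sketch only: compare FLOAT 0.2, widened to DOUBLE, against the DOUBLE literal 0.2.
-- The employees table is used just to satisfy the FROM clause.
SELECT cast(cast(0.2 AS FLOAT) AS DOUBLE),       -- the FLOAT value after widening
       cast(cast(0.2 AS FLOAT) AS DOUBLE) > 0.2  -- widened FLOAT versus DOUBLE literal
FROM employees LIMIT 1;
```

The first column should display a value a little above 0.2 rather than 0.2 itself, which is why the comparison in the second column should evaluate to true.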
However, there are two workarounds we can use in Hive.
First, if we read the data from a TEXTFILE (see Chapter 15), which is what we have been
assuming so far, then Hive reads the string “0.2” from the data file and converts it to a
real number. We could use DOUBLE instead of FLOAT in our schema. Then we would be
comparing a DOUBLE for the deductions['Federal Taxes'] with a DOUBLE for the literal
0.2. However, this change will increase the memory footprint of our queries. Also, we
can’t simply change the schema like this if the data file is a binary file format like
SEQUENCEFILE (discussed in Chapter 15).
The second workaround is to explicitly cast the 0.2 literal value to FLOAT. Java has a
nice way of doing this: you append the letter f or F to the end of the number (e.g.,
0.2f). Unfortunately, Hive doesn’t support this syntax; we have to use the cast
operator.
Here is a modified query that casts the 0.2 literal value to FLOAT. With this change, the
expected results are returned by the query:
hive> SELECT name, salary, deductions['Federal Taxes'] FROM employees
> WHERE deductions['Federal Taxes'] > cast(0.2 AS FLOAT);
Boss Man      200000.0  0.3
Fred Finance  150000.0  0.3

Note the syntax inside the cast operator: number AS FLOAT.
Actually, there is also a third solution: avoid floating-point numbers for anything involving money.
Use extreme caution when comparing floating-point numbers. Avoid
all implicit casts from smaller to wider types.

LIKE and RLIKE
Table 6-6 describes the LIKE and RLIKE predicate operators. You have probably seen
LIKE before, a standard SQL operator. It lets us match on strings that begin with or end
with a particular substring, or when the substring appears anywhere within the string.
For example, the following three queries select the employee names and addresses
where the street ends with Ave., the city begins with O, and the street contains Chicago:
hive> SELECT name, address.street FROM employees WHERE address.street LIKE '%Ave.';
John Doe    1 Michigan Ave.
Todd Jones  200 Chicago Ave.
hive> SELECT name, address.city FROM employees WHERE address.city LIKE 'O%';
Todd Jones  Oak Park
Bill King   Obscuria
hive> SELECT name, address.street FROM employees WHERE address.street LIKE '%Chi%';
Todd Jones  200 Chicago Ave.

A Hive extension is the RLIKE clause, which lets us use Java regular expressions, a more
powerful minilanguage for specifying matches. The rich details of regular expression
syntax and features are beyond the scope of this book. The entry for RLIKE in Table 6-6 provides links to resources with more details on regular expressions. Here, we
demonstrate their use with an example, which finds all the employees whose street
contains the word Chicago or Ontario:
hive> SELECT name, address.street
> FROM employees WHERE address.street RLIKE '.*(Chicago|Ontario).*';
Mary Smith  100 Ontario St.
Todd Jones  200 Chicago Ave.

The string after the RLIKE keyword has the following interpretation. A period (.) matches
any character and a star (*) means repeat the “thing to the left” (period, in the two cases
shown) zero to many times. The expression (x|y) means match either x or y.
Hence, there might be no characters before “Chicago” or “Ontario” and there might
be no characters after them. Of course, we could have written this particular example
with two LIKE clauses:

SELECT name, address FROM employees
WHERE address.street LIKE '%Chicago%' OR address.street LIKE '%Ontario%';

General regular expression matches will let us express much richer matching criteria
that would become very unwieldy with joined LIKE clauses such as these.
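
For instance, here is a sketch of a criterion that LIKE cannot express concisely: streets that begin with a house number and end with either Ave. or St. It assumes the same employees table. Note that the backslash is doubled, because the Hive string literal consumes one level of escaping before the pattern reaches the Java regular expression library:

```sql
-- Sketch only: the street starts with one or more digits and ends with "Ave." or "St.".
-- '\\.' in the Hive string literal becomes the regular expression \. (a literal period).
SELECT name, address.street FROM employees
WHERE address.street RLIKE '^[0-9]+ .*(Ave|St)\\.$';
```

The ^ and $ anchors tie the pattern to the start and end of the street value, so the house number must come first and the abbreviation must come last.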
For more details about regular expressions as implemented by Hive using Java, see the
documentation for the Java regular expression syntax at http://docs.oracle.com/javase/
6/docs/api/java/util/regex/Pattern.html or see Regular Expression Pocket Reference by
Tony Stubblebine (O’Reilly), Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan (O’Reilly), or Mastering Regular Expressions, 3rd Edition, by Jeffrey E.F.
Friedl (O’Reilly).

GROUP BY Clauses
The GROUP BY statement is often used in conjunction with aggregate functions to
group the result set by one or more columns and then perform an aggregation over each
group.
Let’s return to the stocks table we defined in “External Tables” on page 56. The following query groups stock records for Apple by year, then averages the closing price
for each year:
hive> SELECT year(ymd), avg(price_close) FROM stocks
> WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
> GROUP BY year(ymd);
1984  25.578625440597534
1985  20.193676221040867
1986  32.46102808021274
1987  53.88968399108163
1988  41.540079275138766
1989  41.65976212516664
1990  37.56268799823263
1991  52.49553383386182
1992  54.80338610251119
1993  41.02671956450572
1994  34.0813495847914
...

HAVING Clauses
The HAVING clause lets you constrain the groups produced by GROUP BY in a way that
could be expressed with a subquery, using a syntax that’s easier to express. Here’s the
previous query with an additional HAVING clause that limits the results to years where
the average closing price was greater than $50.0:

hive> SELECT year(ymd), avg(price_close) FROM stocks
> WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
> GROUP BY year(ymd)
> HAVING avg(price_close) > 50.0;
1987  53.88968399108163
1991  52.49553383386182
1992  54.80338610251119
1999  57.77071460844979
2000  71.74892876261757
2005  52.401745992993554
...

Without the HAVING clause, this query would require a nested SELECT statement:
hive> SELECT s2.year, s2.avg FROM
> (SELECT year(ymd) AS year, avg(price_close) AS avg FROM stocks
> WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
> GROUP BY year(ymd)) s2
> WHERE s2.avg > 50.0;
1987  53.88968399108163
...

JOIN Statements
Hive supports the classic SQL JOIN statement, but only equi-joins are supported.

Inner JOIN
In an inner JOIN, records are discarded unless the join criteria find matching records in
every table being joined. For example, the following query compares Apple (symbol
AAPL) and IBM (symbol IBM). The stocks table is joined against itself, a self-join, where
the dates, ymd (year-month-day) values must be equal in both tables. We say that the
ymd columns are the join keys in this query:
hive> SELECT a.ymd, a.price_close, b.price_close
> FROM stocks a JOIN stocks b ON a.ymd = b.ymd
> WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';
2010-01-04  214.01  132.45
2010-01-05  214.38  130.85
2010-01-06  210.97  130.0
2010-01-07  210.58  129.55
2010-01-08  211.98  130.85
2010-01-11  210.11  129.48
...

The ON clause specifies the conditions for joining records between the two tables. The
WHERE clause limits the lefthand table to AAPL records and the righthand table to IBM
records. You can also see that using table aliases for the two occurrences of stocks is
essential in this query.
As you may know, IBM is an older company than Apple. It has been a publicly traded
stock for much longer than Apple. However, since this is an inner JOIN, no IBM records
will be returned older than September 7, 1984, which is the date of the earliest Apple
record in our data set!
Standard SQL allows a non-equi-join on the join keys, such as the following example
that shows Apple versus IBM, but with all older records for Apple paired up with each
day of IBM data. It would be a lot of data (Example 6-1)!
Example 6-1. Query that will not work in Hive
SELECT a.ymd, a.price_close, b.price_close
FROM stocks a JOIN stocks b
ON a.ymd <= b.ymd
WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';

This is not valid in Hive, primarily because it is difficult to implement these kinds of
joins in MapReduce. It turns out that Pig offers a cross product feature that makes it
possible to implement this join, even though Pig’s native join feature doesn’t support
it, either.
Also, Hive does not currently support using OR between predicates in ON clauses.
To see a non-self join, let’s introduce the corresponding dividends data, also
available from infochimps.org, as described in “External Tables” on page 56:
CREATE EXTERNAL TABLE IF NOT EXISTS dividends (
  ymd       STRING,
  dividend  FLOAT
)
PARTITIONED BY (exchange STRING, symbol STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Here is an inner JOIN between stocks and dividends for Apple, where we use the ymd
and symbol columns as join keys:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';
1987-05-11  AAPL  77.0   0.015
1987-08-10  AAPL  48.25  0.015
1987-11-17  AAPL  35.0   0.02
...
1995-02-13  AAPL  43.75  0.03
1995-05-26  AAPL  42.69  0.03
1995-08-16  AAPL  44.5   0.03
1995-11-21  AAPL  38.63  0.03

Yes, Apple paid a dividend years ago and only recently announced it would start doing
so again! Note that because we have an inner JOIN, we only see records approximately
every three months, the typical schedule of dividend payments, which are announced
when reporting quarterly results.
You can join more than two tables together. Let’s compare Apple, IBM, and GE side
by side:

hive> SELECT a.ymd, a.price_close, b.price_close, c.price_close
> FROM stocks a JOIN stocks b ON a.ymd = b.ymd
>               JOIN stocks c ON a.ymd = c.ymd
> WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';
2010-01-04  214.01  132.45  15.45
2010-01-05  214.38  130.85  15.53
2010-01-06  210.97  130.0   15.45
2010-01-07  210.58  129.55  16.25
2010-01-08  211.98  130.85  16.6
2010-01-11  210.11  129.48  16.76
...

Most of the time, Hive will use a separate MapReduce job for each pair of things to
join. In this example, it would use one job for tables a and b, then a second job to join
the output of the first join with c.
Why not join b and c first? Hive goes from left to right.

However, this example actually benefits from an optimization we’ll discuss next.

Join Optimizations
In the previous example, every ON clause uses a.ymd as one of the join keys. In this case,
Hive can apply an optimization where it joins all three tables in a single MapReduce
job. The optimization would also be used if b.ymd were used in both ON clauses.
When joining three or more tables, if every ON clause uses the same join
key, a single MapReduce job will be used.
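
One way to check how Hive plans a multi-table join is to prefix the query with EXPLAIN, which prints the query plan instead of running the query. The sketch below applies it to the three-way join shown earlier. The exact plan text varies by Hive version, so we don't reproduce it here:

```sql
-- Sketch only: print the plan for the three-way self-join, where both ON clauses use a.ymd.
EXPLAIN
SELECT a.ymd, a.price_close, b.price_close, c.price_close
FROM stocks a JOIN stocks b ON a.ymd = b.ymd
              JOIN stocks c ON a.ymd = c.ymd
WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';
```

The list of stages in the plan gives a rough indication of how many MapReduce jobs Hive intends to launch for the query.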

Hive also assumes that the last table in the query is the largest. It attempts to buffer the
other tables and then stream the last table through, while performing joins on individual
records. Therefore, you should structure your join queries so the largest table is last.
Recall our previous join between stocks and dividends. We actually made the mistake
of using the smaller dividends table last:
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';

We should switch the positions of stocks and dividends:
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM dividends d JOIN stocks s ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';

It turns out that these data sets are too small to see a noticeable performance difference,
but for larger data sets, you’ll want to exploit this optimization.
Fortunately, you don’t have to put the largest table last in the query. Hive also provides
a “hint” mechanism to tell the query optimizer which table should be streamed:
SELECT /*+ STREAMTABLE(s) */ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';

Now Hive will attempt to stream the stocks table, even though it’s not the last table in
the query.
There is another important optimization called map-side joins that we’ll return to in
“Map-side Joins” on page 105.

LEFT OUTER JOIN
The left-outer join is indicated by adding the LEFT OUTER keywords:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';
...
1987-05-01  AAPL  80.0   NULL
1987-05-04  AAPL  79.75  NULL
1987-05-05  AAPL  80.25  NULL
1987-05-06  AAPL  80.0   NULL
1987-05-07  AAPL  80.25  NULL
1987-05-08  AAPL  79.0   NULL
1987-05-11  AAPL  77.0   0.015
1987-05-12  AAPL  75.5   NULL
1987-05-13  AAPL  78.5   NULL
1987-05-14  AAPL  79.25  NULL
1987-05-15  AAPL  78.25  NULL
1987-05-18  AAPL  75.75  NULL
1987-05-19  AAPL  73.25  NULL
1987-05-20  AAPL  74.5   NULL
...

In this join, all the records from the lefthand table that match the WHERE clause are
returned. If the righthand table doesn’t have a record that matches the ON criteria,
NULL is used for each column selected from the righthand table.
Hence, in this result set, we see that every Apple stock record is returned and the
d.dividend value is usually NULL, except on days when a dividend was paid (May 11th,
1987, in this output).

OUTER JOIN Gotcha
Before we discuss the other outer joins, let’s discuss a gotcha you should understand.


Recall what we said previously about speeding up queries by adding partition filters in
the WHERE clause. To speed up our previous query, we might choose to add predicates
that select on the exchange in both tables:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL'
> AND s.exchange = 'NASDAQ' AND d.exchange = 'NASDAQ';
1987-05-11 AAPL 77.0 0.015
1987-08-10 AAPL 48.25 0.015
1987-11-17 AAPL 35.0 0.02
1988-02-12 AAPL 41.0 0.02
1988-05-16 AAPL 41.25 0.02
...

However, the output has changed, even though we thought we were just adding an
optimization! We’re back to having approximately four stock records per year and we
have non-NULL entries for all the dividend values. In other words, we are back to the
original inner join!
This is actually common behavior for all outer joins in most SQL implementations. It
occurs because the JOIN clause is evaluated first, then the results are passed through
the WHERE clause. By the time the WHERE clause is reached, d.exchange is NULL most of the
time, so the “optimization” actually filters out all records except those on the day of
dividend payments.
One solution is straightforward; remove the clauses in the WHERE clause that reference
the dividends table:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL' AND s.exchange = 'NASDAQ';
...
1987-05-07 AAPL 80.25 NULL
1987-05-08 AAPL 79.0 NULL
1987-05-11 AAPL 77.0 0.015
1987-05-12 AAPL 75.5 NULL
1987-05-13 AAPL 78.5 NULL
...

This isn’t very satisfactory. You might wonder if you can move the predicates from the
WHERE clause into the ON clause, at least the partition filters. This does not work for outer
joins, despite documentation on the Hive Wiki that claims it should work
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins):
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d
> ON s.ymd = d.ymd AND s.symbol = d.symbol
> AND s.symbol = 'AAPL' AND s.exchange = 'NASDAQ' AND d.exchange = 'NASDAQ';
...
1962-01-02 GE 74.75 NULL
1962-01-02 IBM 572.0 NULL
1962-01-03 GE 74.0 NULL
1962-01-03 IBM 577.0 NULL
1962-01-04 GE 73.12 NULL
1962-01-04 IBM 571.25 NULL
1962-01-05 GE 71.25 NULL
1962-01-05 IBM 560.0 NULL
...

The partition filters are ignored for OUTER JOINs. However, using such filter predicates
in ON clauses for inner joins does work!
Fortunately, there is a solution that works for all joins: use nested SELECT statements:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM
> (SELECT * FROM stocks WHERE symbol = 'AAPL' AND exchange = 'NASDAQ') s
> LEFT OUTER JOIN
> (SELECT * FROM dividends WHERE symbol = 'AAPL' AND exchange = 'NASDAQ') d
> ON s.ymd = d.ymd;
...
1988-02-10 AAPL 41.0 NULL
1988-02-11 AAPL 40.63 NULL
1988-02-12 AAPL 41.0 0.02
1988-02-16 AAPL 41.25 NULL
1988-02-17 AAPL 41.88 NULL
...

The nested SELECT statement performs the required “push down” to apply the partition
filters before data is joined.
WHERE clauses are evaluated after joins are performed, so WHERE clauses
should use predicates that only filter on column values that won’t be
NULL. Also, contrary to Hive documentation, partition filters don’t work
in ON clauses for OUTER JOINS, although they do work for INNER JOINS!
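The WHERE-after-JOIN behavior is easy to reproduce outside Hive. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive; the table and column names follow our examples, but the three stock rows and one dividend row are a made-up subset chosen for illustration. It only demonstrates the general SQL semantics, not Hive's handling of partition filters.

```python
import sqlite3

# SQLite stands in for Hive here; the rows are a tiny illustrative subset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stocks (exchange TEXT, symbol TEXT, ymd TEXT, price_close REAL);
    CREATE TABLE dividends (exchange TEXT, symbol TEXT, ymd TEXT, dividend REAL);
    INSERT INTO stocks VALUES ('NASDAQ', 'AAPL', '1987-05-08', 79.0);
    INSERT INTO stocks VALUES ('NASDAQ', 'AAPL', '1987-05-11', 77.0);
    INSERT INTO stocks VALUES ('NASDAQ', 'AAPL', '1987-05-12', 75.5);
    INSERT INTO dividends VALUES ('NASDAQ', 'AAPL', '1987-05-11', 0.015);
""")

# WHERE runs after the join, so d.exchange is NULL for unmatched rows and
# the predicate discards them: the outer join behaves like an inner join.
where_on_right = conn.execute("""
    SELECT s.ymd, d.dividend
    FROM stocks s LEFT OUTER JOIN dividends d
      ON s.ymd = d.ymd AND s.symbol = d.symbol
    WHERE s.symbol = 'AAPL' AND s.exchange = 'NASDAQ' AND d.exchange = 'NASDAQ'
    ORDER BY s.ymd
""").fetchall()

# Filtering inside nested SELECTs happens before the join, so every stock
# row survives and unmatched days get NULL (None in Python).
nested = conn.execute("""
    SELECT s.ymd, d.dividend FROM
      (SELECT * FROM stocks WHERE symbol = 'AAPL' AND exchange = 'NASDAQ') s
    LEFT OUTER JOIN
      (SELECT * FROM dividends WHERE symbol = 'AAPL' AND exchange = 'NASDAQ') d
    ON s.ymd = d.ymd
    ORDER BY s.ymd
""").fetchall()

print(where_on_right)
print(nested)
```

The first result contains only the dividend day, while the second keeps all three trading days.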

RIGHT OUTER JOIN
Right-outer joins return all records in the righthand table that match the WHERE clause.
NULL is used for fields of missing records in the lefthand table.
Here we switch the places of stocks and dividends and perform a righthand join, but
leave the SELECT statement unchanged:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM dividends d RIGHT OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
> WHERE s.symbol = 'AAPL';
...
1987-05-07 AAPL 80.25 NULL
1987-05-08 AAPL 79.0 NULL
1987-05-11 AAPL 77.0 0.015
1987-05-12 AAPL 75.5 NULL
1987-05-13 AAPL 78.5 NULL
...


FULL OUTER JOIN
Finally, a full-outer join returns all records from all tables that match the WHERE clause.
NULL is used for fields in missing records in either table.

If we convert the previous query to a full-outer join, we’ll actually get the same results,
since there is never a case where a dividend record exists without a matching stock
record:
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM dividends d FULL OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
> WHERE s.symbol = 'AAPL';
...
1987-05-07 AAPL 80.25 NULL
1987-05-08 AAPL 79.0 NULL
1987-05-11 AAPL 77.0 0.015
1987-05-12 AAPL 75.5 NULL
1987-05-13 AAPL 78.5 NULL
...

LEFT SEMI-JOIN
A left semi-join returns records from the lefthand table if records are found in the righthand table that satisfy the ON predicates. It’s a special, optimized case of the more general
inner join. Most SQL dialects support an IN ... EXISTS construct to do the same thing.
For instance, the following query in Example 6-2 attempts to return stock records only
on the days of dividend payments, but it doesn’t work in Hive.
Example 6-2. Query that will not work in Hive
SELECT s.ymd, s.symbol, s.price_close FROM stocks s
WHERE s.ymd, s.symbol IN
(SELECT d.ymd, d.symbol FROM dividends d);

Instead, you use the following LEFT SEMI JOIN syntax:
hive> SELECT s.ymd, s.symbol, s.price_close
> FROM stocks s LEFT SEMI JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol;
...
1962-11-05 IBM 361.5
1962-08-07 IBM 373.25
1962-05-08 IBM 459.5
1962-02-06 IBM 551.5

Note that the SELECT and WHERE clauses can’t reference columns from the righthand
table.
Right semi-joins are not supported in Hive.


The reason semi-joins are more efficient than the more general inner join is as follows.
For a given record in the lefthand table, Hive can stop looking for matching records in
the righthand table as soon as any match is found. At that point, the selected columns
from the lefthand table record can be projected.
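The early-exit idea can be sketched in a few lines of Python. This is only an illustration of the semantics, not how Hive implements the join; the function names and the sample rows (including the deliberately duplicated dividend row) are hypothetical.

```python
def inner_join_left_columns(left, right, key):
    """Inner join projecting only left columns: one output row per match."""
    return [l for l in left for r in right if key(l) == key(r)]


def left_semi_join(left, right, key):
    """Left semi-join: keep a left row as soon as any right row matches."""
    out = []
    for l in left:
        # any() short-circuits at the first match, mirroring the early exit
        if any(key(l) == key(r) for r in right):
            out.append(l)
    return out


# Hypothetical rows: (ymd, symbol, value)
stocks = [("1962-02-06", "IBM", 551.5), ("1962-02-07", "IBM", 552.0)]
dividends = [("1962-02-06", "IBM", 0.1), ("1962-02-06", "IBM", 0.1)]


def by_day_and_symbol(row):
    return (row[0], row[1])


semi = left_semi_join(stocks, dividends, by_day_and_symbol)
inner = inner_join_left_columns(stocks, dividends, by_day_and_symbol)
print(semi)        # one row for the matching day
print(len(inner))  # the inner join repeats the row once per match
```

Note that when the righthand table contains duplicate keys, the semi-join returns each lefthand row at most once, while an inner join returns it once per match.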

Cartesian Product JOINs
A Cartesian product is a join where all the tuples in the left side of the join are paired
with all the tuples of the right table. If the left table has 5 rows and the right table has
6 rows, 30 rows of output will be produced:
SELECT * FROM stocks JOIN dividends;
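The row count is simply the product of the two table sizes. A quick check in Python, using placeholder row labels rather than real stock data:

```python
from itertools import product

left = ["L%d" % i for i in range(5)]    # 5 hypothetical left rows
right = ["R%d" % j for j in range(6)]   # 6 hypothetical right rows

# Every left row is paired with every right row
pairs = list(product(left, right))
print(len(pairs))
```

For tables with millions of rows each, the same multiplication explains why these queries produce so much data.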

Using the table of stocks and dividends, it is hard to find a reason for a join of this type,
as the dividend of one stock is not usually paired with another. Additionally, Cartesian
products create a lot of data. Unlike other join types, Cartesian products are not executed in parallel, and they are not optimized in any way using MapReduce.
It is critical to point out that using the wrong join syntax will cause a long, slow-running
Cartesian product query. For example, the following query will be optimized to an
inner join in many databases, but not in Hive:
hive> SELECT * FROM stocks JOIN dividends
> WHERE stocks.symbol = dividends.symbol and stocks.symbol='AAPL';

In Hive, this query computes the full Cartesian product before applying the WHERE
clause. It could take a very long time to finish. When the property hive.mapred.mode is
set to strict, Hive prevents users from inadvertently issuing a Cartesian product query.
We’ll discuss the features of strict mode more extensively in Chapter 10.
Cartesian product queries can be useful. For example, suppose there is
a table of user preferences, a table of news articles, and an algorithm
that predicts which articles a user would like to read. A Cartesian product is required to generate the set of all users and all pages.

Map-side Joins
If all but one table is small, the largest table can be streamed through the mappers while
the small tables are cached in memory. Hive can do all the joining map-side, since it
can look up every possible match against the small tables in memory, thereby eliminating the reduce step required in the more common join scenarios. Even on smaller
data sets, this optimization is noticeably faster than the normal join. Not only does it
eliminate reduce steps, it sometimes reduces the number of map steps, too.
The joins between stocks and dividends can exploit this optimization, as the dividends
data set is small enough to be cached.
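Conceptually, a map-side join is a hash join: the small table is loaded into an in-memory lookup structure and the large table is streamed past it. Here is a minimal Python sketch of that idea. The function name is ours, the rows are a tiny sample, and nothing here reflects Hive's actual implementation.

```python
from collections import defaultdict


def map_side_join(big_rows, small_rows, key):
    """Join by caching the small table in a dict and streaming the big one."""
    lookup = defaultdict(list)
    for row in small_rows:               # small table: cached in memory
        lookup[key(row)].append(row)
    for row in big_rows:                 # big table: streamed once, no shuffle
        for match in lookup.get(key(row), []):
            yield row + match[2:]        # append the non-key columns


def ymd_and_symbol(row):
    return (row[0], row[1])


# (ymd, symbol, price_close) and (ymd, symbol, dividend)
stocks = [("1987-05-08", "AAPL", 79.0), ("1987-05-11", "AAPL", 77.0)]
dividends = [("1987-05-11", "AAPL", 0.015)]

joined = list(map_side_join(stocks, dividends, ymd_and_symbol))
print(joined)
```

Because each streamed row needs only a dictionary lookup, no sort or reduce phase is required, which is where the savings come from.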


Before Hive v0.7, it was necessary to add a hint to the query to enable this optimization.
Returning to our inner join example:
SELECT /*+ MAPJOIN(d) */ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';

Running this query versus the original on a fast MacBook Pro laptop yielded times of
approximately 23 seconds versus 33 seconds for the original unoptimized query, which
is roughly 30% faster using our sample stock data.
The hint still works, but it’s now deprecated as of Hive v0.7. However, you still have
to set a property, hive.auto.convert.join, to true before Hive will attempt the optimization. It’s false by default:
hive> set hive.auto.convert.join=true;
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';

Note that you can also configure the threshold size for table files considered small
enough to use this optimization. Here is the default definition of the property (in bytes):
hive.mapjoin.smalltable.filesize=25000000

If you always want Hive to attempt this optimization, set one or both of these properties
in your $HOME/.hiverc file.
Hive does not support the optimization for right- and full-outer joins.
This optimization can also be used for larger tables under certain conditions when
the data for every table is bucketed, as discussed in “Bucketing Table Data Storage” on page 125. Briefly, the data must be bucketed on the keys used in the ON clause
and the number of buckets for one table must be a multiple of the number of buckets
for the other table. When these conditions are met, Hive can join individual buckets
between tables in the map phase, because it does not need to fetch the entire contents
of one table to match against each bucket in the other table.
However, this optimization is not turned on by default. It must be enabled by setting
the property hive.optimize.bucketmapjoin:
set hive.optimize.bucketmapjoin=true;

If the bucketed tables actually have the same number of buckets and the data is sorted
by the join/bucket keys, then Hive can perform an even faster sort-merge join. Once
again, properties must be set to enable the optimization:
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
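To see why the bucket counts must be multiples of each other, consider the following Python sketch. It uses zlib.crc32 as a stand-in hash (Hive's own hash function is different) and a made-up list of symbols; only the modulo arithmetic matters.

```python
import zlib


def bucket_of(key, num_buckets):
    # Stand-in hash; Hive's real hash differs, only the modulo matters here.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


BIG_BUCKETS, SMALL_BUCKETS = 4, 2   # 4 is a multiple of 2
symbols = ["AAPL", "IBM", "GE", "MSFT", "ORCL", "INTC", "HPQ", "DELL"]

# For each big-table bucket, record which small-table buckets its keys land in.
needed = {}
for sym in symbols:
    big = bucket_of(sym, BIG_BUCKETS)
    small = bucket_of(sym, SMALL_BUCKETS)
    needed.setdefault(big, set()).add(small)

for big, smalls in sorted(needed.items()):
    print(big, sorted(smalls))
```

Each bucket of the larger table maps to exactly one bucket of the smaller table (bucket number modulo 2 in this sketch), so a mapper processing one bucket needs to load only that single matching bucket.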


ORDER BY and SORT BY
The ORDER BY clause is familiar from other SQL dialects. It performs a total ordering of
the query result set. This means that all the data is passed through a single reducer,
which may take an unacceptably long time to execute for larger data sets.
Hive adds an alternative, SORT BY, that orders the data only within each reducer, thereby
performing a local ordering, where each reducer’s output will be sorted. Better performance is traded for total ordering.
In both cases, the syntax differs only by the use of the ORDER or SORT keyword. You can
specify any columns you wish and specify whether or not the columns are ascending
using the ASC keyword (the default) or descending using the DESC keyword.
Here is an example using ORDER BY:
SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
ORDER BY s.ymd ASC, s.symbol DESC;

Here is the same example using SORT BY instead:
SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
SORT BY s.ymd ASC, s.symbol DESC;

The two queries look almost identical, but if more than one reducer is invoked, the
output will be sorted differently. While each reducer’s output files will be sorted, the
data will probably overlap with the output of other reducers.
Because ORDER BY can result in excessively long run times, Hive will require a LIMIT
clause with ORDER BY if the property hive.mapred.mode is set to strict. By default, it is
set to nonstrict.

DISTRIBUTE BY with SORT BY
DISTRIBUTE BY controls how map output is divided among reducers. All data that flows
through a MapReduce job is organized into key-value pairs. Hive must use this feature
internally when it converts your queries to MapReduce jobs.
Usually, you won’t need to worry about this feature. The exceptions are queries that
use the Streaming feature (see Chapter 14) and some stateful UDAFs (User-Defined
Aggregate Functions; see “Aggregate Functions” on page 164). There is one other scenario where these clauses are useful.
By default, MapReduce computes a hash on the keys output by mappers and tries to
evenly distribute the key-value pairs among the available reducers using the hash values.
Unfortunately, this means that when we use SORT BY, the contents of one reducer’s
output will overlap significantly with the output of the other reducers, as far as sorted
order is concerned, even though the data is sorted within each reducer’s output.

Say we want the data for each stock symbol to be captured together. We can use
DISTRIBUTE BY to ensure that the records for each stock symbol go to the same reducer,
then use SORT BY to order the data the way we want. The following query demonstrates
this technique:
hive> SELECT s.ymd, s.symbol, s.price_close
> FROM stocks s
> DISTRIBUTE BY s.symbol
> SORT BY s.symbol ASC, s.ymd ASC;
1984-09-07 AAPL 26.5
1984-09-10 AAPL 26.37
1984-09-11 AAPL 26.87
1984-09-12 AAPL 26.12
1984-09-13 AAPL 27.5
1984-09-14 AAPL 27.87
1984-09-17 AAPL 28.62
1984-09-18 AAPL 27.62
1984-09-19 AAPL 27.0
1984-09-20 AAPL 27.12
...

Of course, the ASC keywords could have been omitted as they are the defaults. The
ASC keyword is placed here for reasons that will be described shortly.
DISTRIBUTE BY works similarly to GROUP BY in the sense that it controls how reducers
receive rows for processing, while SORT BY controls the sorting of data inside the reducer.

Note that Hive requires that the DISTRIBUTE BY clause come before the SORT BY clause.
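The interplay of the two clauses can be simulated in plain Python. In this sketch, zlib.crc32 is a stand-in for the hash that assigns rows to reducers, the number of reducers is fixed at two, and the IBM prices are invented; it is a model of the behavior, not of Hive's internals.

```python
import zlib


def run_reducers(rows, num_reducers, distribute_key, sort_key):
    """Route each row to a reducer by hashing distribute_key, then sort locally."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        h = zlib.crc32(str(distribute_key(row)).encode("utf-8"))  # stand-in hash
        reducers[h % num_reducers].append(row)
    return [sorted(part, key=sort_key) for part in reducers]


rows = [
    ("1984-09-10", "AAPL", 26.37), ("1984-09-07", "IBM", 123.0),
    ("1984-09-07", "AAPL", 26.5), ("1984-09-10", "IBM", 124.0),
]
outputs = run_reducers(
    rows, num_reducers=2,
    distribute_key=lambda r: r[1],       # DISTRIBUTE BY s.symbol
    sort_key=lambda r: (r[1], r[0]),     # SORT BY s.symbol ASC, s.ymd ASC
)
for i, out in enumerate(outputs):
    print(i, out)
```

Whichever reducer a symbol hashes to, all of its rows arrive there together and come out sorted by date.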

CLUSTER BY
In the previous example, the s.symbol column was used in the DISTRIBUTE BY clause,
and the s.symbol and the s.ymd columns in the SORT BY clause. Suppose that the same
columns are used in both clauses and all columns are sorted by ascending order (the
default). In this case, the CLUSTER BY clause is a shorthand way of expressing the same
query.
For example, let’s modify the previous query to drop sorting by s.ymd and use CLUSTER
BY on s.symbol:
hive> SELECT s.ymd, s.symbol, s.price_close
> FROM stocks s
> CLUSTER BY s.symbol;
2010-02-08 AAPL 194.12
2010-02-05 AAPL 195.46
2010-02-04 AAPL 192.05
2010-02-03 AAPL 199.23
2010-02-02 AAPL 195.86
2010-02-01 AAPL 194.73
2010-01-29 AAPL 192.06
2010-01-28 AAPL 199.29
2010-01-27 AAPL 207.88
...

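Because the sort requirement on s.ymd was removed, the output reflects the original
order of the stock data, which is sorted descending.
Using DISTRIBUTE BY ... SORT BY or the shorthand CLUSTER BY clause is a way to exploit
the parallelism of SORT BY while still keeping all rows with the same key together, in
sorted order, in one reducer's output. Because keys are assigned to reducers by hash
rather than by range, this is not the same as the total ordering across all output files
that ORDER BY provides.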

Casting
We briefly mentioned in “Primitive Data Types” on page 41 that Hive will perform
some implicit conversions, called casts, of numeric data types, as needed, for example,
when doing comparisons between two numbers of different types. This topic is discussed
more fully in “Predicate Operators” on page 93 and “Gotchas with Floating-Point
Comparisons” on page 94.
Here we discuss the cast() function that allows you to explicitly convert a value of one
type to another.
Recall our employees table uses a FLOAT for the salary column. Now, imagine for a
moment that STRING was used for that column instead. How could we work with the
values as FLOATS?
The following example casts the values to FLOAT before performing a comparison:
SELECT name, salary FROM employees
WHERE cast(salary AS FLOAT) < 100000.0;

The syntax of the cast function is cast(value AS TYPE). What would happen in the
example if a salary value was not a valid string for a floating-point number? In this
case, Hive returns NULL.
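The NULL-on-failure behavior can be mimicked in Python, with None standing in for NULL. The helper name and the sample salaries below are ours, and Python's float() accepts a few strings (such as "inf") that Hive may treat differently, so treat this strictly as an approximation of the idea.

```python
def cast_string_to_float(value):
    """Return the float for a numeric string, or None (standing in for NULL)."""
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return None


# Hypothetical STRING salaries, including one malformed value
salaries = {"John Doe": "100000.0", "Mary Smith": "80000.0", "Bad Row": "n/a"}

below_100k = []
for name, raw in salaries.items():
    salary = cast_string_to_float(raw)
    # A comparison against NULL is never true, so the bad row is filtered out
    if salary is not None and salary < 100000.0:
        below_100k.append(name)

print(below_100k)
```

The malformed row neither raises an error nor appears in the result, which matches what a WHERE clause does with a NULL comparison.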
Note that the preferred way to convert floating-point numbers to integers is to use the
round() or floor() functions listed in Table 6-2, rather than to use the cast operator.

Casting BINARY Values
The new BINARY type introduced in Hive v0.8.0 only supports casting BINARY to
STRING. However, if you know the value is a number, you can nest cast() invocations,
as in this example where column b is a BINARY column:
SELECT (2.0*cast(cast(b as string) as double)) from src;

You can also cast STRING to BINARY.


Queries that Sample Data
For very large data sets, sometimes you want to work with a representative sample of
a query result, not the whole thing. Hive supports this goal with queries that sample
tables organized into buckets.
In the following example, assume the numbers table has one number column with
values 1−10.
We can sample using the rand() function, which returns a random number. In the first
two queries, two distinct numbers are returned for each query. In the third query, no
results are returned:
hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;
2
4
hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;
7
10
hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;

If we bucket on a column instead of rand(), then identical results are returned on multiple runs:
hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON number) s;
2
hive> SELECT * from numbers TABLESAMPLE(BUCKET 5 OUT OF 10 ON number) s;
4
hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON number) s;
2

The denominator in the bucket clause represents the number of buckets into which
data will be hashed. The numerator is the bucket number selected:
hive> SELECT * from numbers TABLESAMPLE(BUCKET 1 OUT OF 2 ON number) s;
2
4
6
8
10
hive> SELECT * from numbers TABLESAMPLE(BUCKET 2 OUT OF 2 ON number) s;
1
3
5
7
9
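The bucket arithmetic can be modeled in a few lines of Python. We assume here that an integer hashes to itself and that bucket numbers are 1-based; both assumptions are consistent with the outputs shown above, but the function below is our own sketch, not Hive's implementation.

```python
def table_sample(values, bucket, out_of, hash_fn=lambda v: v):
    """Keep values whose hash falls in bucket `bucket` (1-based) of `out_of`."""
    return [v for v in values if hash_fn(v) % out_of == bucket - 1]


numbers = list(range(1, 11))

print(table_sample(numbers, 1, 2))    # BUCKET 1 OUT OF 2
print(table_sample(numbers, 2, 2))    # BUCKET 2 OUT OF 2
print(table_sample(numbers, 3, 10))   # BUCKET 3 OUT OF 10
```

Under these assumptions, the sketch reproduces the even and odd splits and the single value returned by BUCKET 3 OUT OF 10 ON number.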


Block Sampling
Hive offers another syntax for sampling a percentage of blocks of an input path as an
alternative to sampling based on rows:
hive> SELECT * FROM numbersflat TABLESAMPLE(0.1 PERCENT) s;

This sampling is not known to work with all file formats. Also, the
smallest unit of sampling is a single HDFS block. Hence, for tables less
than the typical block size of 128 MB, all rows will be returned.

Percentage-based sampling offers a variable to control the seed information for block-based tuning. Different seeds produce different samples:

<property>
  <name>hive.sample.seednumber</name>
  <value>0</value>
  <description>A number used for percentage sampling. By changing this
  number, user will change the subsets of data sampled.</description>
</property>


Input Pruning for Bucket Tables
From a first look at the TABLESAMPLE syntax, an astute user might come to the conclusion
that the following query would be equivalent to the TABLESAMPLE operation:
hive> SELECT * FROM numbersflat WHERE number % 2 = 0;
2
4
6
8
10

It is true that for most table types, sampling scans through the entire table and selects
every Nth row. However, if the columns specified in the TABLESAMPLE clause match the
columns in the CLUSTERED BY clause, TABLESAMPLE queries only scan the required hash
partitions of the table:
hive> CREATE TABLE numbers_bucketed (number int) CLUSTERED BY (number) INTO 3 BUCKETS;
hive> SET hive.enforce.bucketing=true;
hive> INSERT OVERWRITE TABLE numbers_bucketed SELECT number FROM numbers;
hive> dfs -ls /user/hive/warehouse/mydb.db/numbers_bucketed;
/user/hive/warehouse/mydb.db/numbers_bucketed/000000_0
/user/hive/warehouse/mydb.db/numbers_bucketed/000001_0
/user/hive/warehouse/mydb.db/numbers_bucketed/000002_0


hive> dfs -cat /user/hive/warehouse/mydb.db/numbers_bucketed/000001_0;
1
7
10
4

Because this table is clustered into three buckets, the following query can be used to
sample only one of the buckets efficiently:
hive> SELECT * FROM numbers_bucketed TABLESAMPLE (BUCKET 2 OUT OF 3 ON NUMBER) s;
1
7
10
4

UNION ALL
UNION ALL combines two or more tables. Each subquery of the union query must produce
the same number of columns, and for each column, its type must match all the
column types in the same position. For example, if the second column is a FLOAT, then
the second column of all the other query results must be a FLOAT.
Here is an example that merges log data:
SELECT log.ymd, log.level, log.message
FROM (
SELECT l1.ymd, l1.level,
l1.message, 'Log1' AS source
FROM log1 l1
UNION ALL
SELECT l2.ymd, l2.level,
l2.message, 'Log2' AS source
FROM log2 l2
) log
SORT BY log.ymd ASC;
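The shape of this query can be tried out with Python's sqlite3 module. SQLite is only a stand-in for Hive: it has no SORT BY, so the sketch uses ORDER BY, and the two log rows are invented. We also select the source column in the outer query so the origin of each row is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE log1 (ymd TEXT, level TEXT, message TEXT);
    CREATE TABLE log2 (ymd TEXT, level TEXT, message TEXT);
    INSERT INTO log1 VALUES ('2012-01-02', 'INFO', 'started');
    INSERT INTO log2 VALUES ('2012-01-01', 'WARN', 'disk nearly full');
""")

# Both branches produce the same four columns, in the same order and types.
merged = conn.execute("""
    SELECT log.ymd, log.level, log.message, log.source
    FROM (
        SELECT l1.ymd, l1.level, l1.message, 'Log1' AS source FROM log1 l1
        UNION ALL
        SELECT l2.ymd, l2.level, l2.message, 'Log2' AS source FROM log2 l2
    ) log
    ORDER BY log.ymd ASC
""").fetchall()

for row in merged:
    print(row)
```

The constant source column is a convenient way to remember which table each merged row came from.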

UNION may be used when a clause selects from the same source table. Logically, the same
results could be achieved with a single SELECT and WHERE clause. This technique increases
readability by breaking up a long complex WHERE clause into two or more UNION queries.

However, unless the source table is indexed, the query will have to make multiple passes
over the same source data. For example:
FROM (
FROM src SELECT src.key, src.value WHERE src.key < 100
UNION ALL
FROM src SELECT src.* WHERE src.key > 110
) unioninput
INSERT OVERWRITE DIRECTORY '/tmp/union.out' SELECT unioninput.*


CHAPTER 7

HiveQL: Views

A view allows a query to be saved and treated like a table. It is a logical construct, as it
does not store data like a table. In other words, materialized views are not currently
supported by Hive.
When a query references a view, the information in its definition is combined with the
rest of the query by Hive’s query planner. Logically, you can imagine that Hive executes
the view and then uses the results in the rest of the query.

Views to Reduce Query Complexity
When a query becomes long or complicated, a view may be used to hide the complexity
by dividing the query into smaller, more manageable pieces, similar to writing a function
in a programming language or the concept of layered design in software. Encapsulating
the complexity makes it easier for end users to construct complex queries from
reusable parts. For example, consider the following query with a nested subquery:
FROM (
SELECT * FROM people JOIN cart
ON (cart.people_id=people.id) WHERE firstname='john'
) a SELECT a.lastname WHERE a.id=3;

It is common for Hive queries to have many levels of nesting. In the following example,
the nested portion of the query is turned into a view:
CREATE VIEW shorter_join AS
SELECT * FROM people JOIN cart
ON (cart.people_id=people.id) WHERE firstname='john';

Now the view is used like any other table. In this query we added a WHERE clause to the
SELECT statement. This exactly emulates the original query:
SELECT lastname FROM shorter_join WHERE id=3;
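To convince yourself that the view and the nested query are interchangeable, you can run both against a small database. The sketch below uses Python's sqlite3 module as a stand-in for Hive, with hypothetical columns and rows for the people and cart tables (the text does not define their schemas). The nested query is written in standard SELECT ... FROM order because SQLite does not accept Hive's FROM-first form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER, firstname TEXT, lastname TEXT);
    CREATE TABLE cart (people_id INTEGER, item TEXT);
    INSERT INTO people VALUES (3, 'john', 'smith');
    INSERT INTO people VALUES (4, 'jane', 'doe');
    INSERT INTO cart VALUES (3, 'book');
    INSERT INTO cart VALUES (4, 'lamp');
    CREATE VIEW shorter_join AS
        SELECT * FROM people JOIN cart
        ON (cart.people_id = people.id) WHERE firstname = 'john';
""")

# The original nested form
nested = conn.execute("""
    SELECT a.lastname FROM (
        SELECT * FROM people JOIN cart
        ON (cart.people_id = people.id) WHERE firstname = 'john'
    ) a WHERE a.id = 3
""").fetchall()

# The same question asked through the view
via_view = conn.execute(
    "SELECT lastname FROM shorter_join WHERE id = 3").fetchall()

print(nested, via_view)
```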


Views that Restrict Data Based on Conditions
A common use case for views is restricting the result rows based on the value of one or
more columns. Some databases allow a view to be used as a security mechanism. Rather
than give the user access to the raw table with sensitive data, the user is given access to
a view with a WHERE clause that restricts the data. Hive does not currently support this
feature, as the user must have access to the entire underlying raw table for the view to
work. However, the concept of a view created to limit data access can be used to protect
information from the casual query:
hive> CREATE TABLE userinfo (
> firstname string, lastname string, ssn string, password string);
hive> CREATE VIEW safer_user_info AS
> SELECT firstname,lastname FROM userinfo;

Here is another example where a view is used to restrict data based on a WHERE clause.
In this case, we wish to provide a view on an employee table that only exposes employees
from a specific department:
hive> CREATE TABLE employee (firstname string, lastname string,
> ssn string, password string, department string);
hive> CREATE VIEW techops_employee AS
> SELECT firstname,lastname,ssn FROM employee WHERE department='techops';

Views and Map Type for Dynamic Tables
Recall from Chapter 3 that Hive supports array, map, and struct data types. These
data types are not common in traditional databases as they break first normal form.
Hive’s ability to treat a line of text as a map, rather than a fixed set of columns, combined
with the view feature, allows you to define multiple logical tables over one physical table.
For example, consider the following sample data file that treats an entire row as a map
rather than a list of fixed columns. Rather than using Hive’s default values for separators,
this file uses ^A (Control-A) as the collection item separator (i.e., between key-value
pairs in this case, where the collection is a map) and ^B (Control-B) as the separator
between keys and values in the map. The long lines wrap in the following listing,
so we added a blank line between them for better clarity:
time^B1298598398404^Atype^Brequest^Astate^Bny^Acity^Bwhite
plains^Apart\^Bmuffler

time^B1298598398432^Atype^Bresponse^Astate^Bny^Acity^Btarrytown^Apart\^Bmuffler

time^B1298598399404^Atype^Brequest^Astate^Btx^Acity^Baustin^Apart^Bheadlight

Now we create our table:

CREATE EXTERNAL TABLE dynamictable(cols map<string, string>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\004'
COLLECTION ITEMS TERMINATED BY '\001'
MAP KEYS TERMINATED BY '\002'
STORED AS TEXTFILE;

Because there is only one field per row, the FIELDS TERMINATED BY value actually has
no effect.
Now we can create a view that extracts only rows with type equal to request and gets
the state, city, and part into a view called orders:
CREATE VIEW orders(state, city, part) AS
SELECT cols["state"], cols["city"], cols["part"]
FROM dynamictable
WHERE cols["type"] = "request";

A second view is created named shipments. This view returns the time and part column
from rows where the type is response:
CREATE VIEW shipments(time, part) AS
SELECT cols["time"], cols["part"]
FROM dynamictable
WHERE cols["type"] = "response";
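The idea of several logical tables over one physical file can be modeled in Python. In this sketch, each line is parsed into a dictionary (playing the role of the cols map) and each "view" is a filtered projection. The helper names are ours, and the sample lines are simplified versions of the data file shown earlier, built with the same ^A and ^B separators.

```python
ITEM_SEP = "\x01"   # ^A, between key-value pairs
KEY_SEP = "\x02"    # ^B, between a key and its value


def make_line(pairs):
    """Build one delimited line from (key, value) pairs."""
    return ITEM_SEP.join(KEY_SEP.join(kv) for kv in pairs)


def parse_row(line):
    """Turn one delimited line into a dict, like the cols map column."""
    return dict(pair.split(KEY_SEP, 1) for pair in line.split(ITEM_SEP))


lines = [
    make_line([("time", "1298598398404"), ("type", "request"), ("state", "ny"),
               ("city", "white plains"), ("part", "muffler")]),
    make_line([("time", "1298598398432"), ("type", "response"), ("state", "ny"),
               ("city", "tarrytown"), ("part", "muffler")]),
    make_line([("time", "1298598399404"), ("type", "request"), ("state", "tx"),
               ("city", "austin"), ("part", "headlight")]),
]
table = [parse_row(line) for line in lines]

# Two "views" over the same rows, selected by the value of the type key
orders = [(c["state"], c["city"], c["part"]) for c in table if c["type"] == "request"]
shipments = [(c["time"], c["part"]) for c in table if c["type"] == "response"]

print(orders)
print(shipments)
```

Adding a new kind of record requires no schema change; a new view simply filters on another value of type.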

For another example of this feature, see
http://dev.bizo.com/2011/02/columns-in-hive.html#!/2011/02/columns-in-hive.html.

View Odds and Ends
We said that Hive evaluates the view and then uses the results to evaluate the query.
However, as part of Hive’s query optimization, the clauses of both the query and view
may be combined together into a single actual query.
Nevertheless, the conceptual view still applies when the view and a query that uses it
both contain an ORDER BY clause or a LIMIT clause. The view’s clauses are evaluated
before the using query’s clauses.
For example, if the view has a LIMIT 100 clause and the query has a LIMIT 200 clause,
you’ll get at most 100 results.
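The same evaluation order can be observed with small numbers in SQLite, used here through Python's sqlite3 module as a stand-in for Hive. The table t, the view name, and the limits of 3 and 5 are all made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

# The view carries its own ORDER BY and LIMIT
conn.execute("CREATE VIEW first_three AS SELECT n FROM t ORDER BY n LIMIT 3")

# The outer LIMIT is larger, but the view's LIMIT has already been applied
rows = conn.execute("SELECT n FROM first_three LIMIT 5").fetchall()
print(rows)
```

Only three rows come back, because the outer query can never see more than the view produces.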
While defining a view doesn’t “materialize” any data, the view is frozen to any subsequent changes to any tables and columns that the view uses. Hence, a query using a
view can fail if the referenced tables or columns no longer exist.
There are a few other clauses you can use when creating views. Modifying our last
example:
CREATE VIEW IF NOT EXISTS shipments(time, part)
COMMENT 'Time and parts for shipments.'
TBLPROPERTIES ('creator' = 'me')
AS SELECT ...;


As for tables, the IF NOT EXISTS and COMMENT … clauses are optional, and have the same
meaning they have for tables.
A view’s name must be unique compared to all other table and view names in the same
database.
You can also add a COMMENT for any or all of the new column names. The comments are
not “inherited” from the definition of the original table.
Also, if the AS SELECT contains an expression without an alias—e.g., size(cols) (the
number of items in cols)—then Hive will use _CN as the name, where N is a number
starting with 0. The view definition will fail if the AS SELECT clause is invalid.
Before the AS SELECT clause, you can also define TBLPROPERTIES, just like for tables. In
the example, we defined a property for the “creator” of the view.
The CREATE TABLE … LIKE … construct discussed in “Creating Tables” on page 53 can
also be used to copy a view, that is with a view as part of the LIKE expression:
CREATE TABLE shipments2
LIKE shipments;

You can also use the optional EXTERNAL keyword and LOCATION … clause, as before.
The behavior of this statement differs between Hive v0.8.0 and earlier
versions of Hive. For v0.8.0, the command creates a new table, not a
new view. It uses defaults for the SerDe and file formats. For earlier
versions, a new view is created.

A view is dropped in the same way as a table:
DROP VIEW IF EXISTS shipments;

As usual, IF EXISTS is optional.
A view will be shown using SHOW TABLES (there is no SHOW VIEWS); however, DROP TABLE
cannot be used to delete a view.
As for tables, DESCRIBE shipments and DESCRIBE EXTENDED shipments display the usual
data for the shipments view. With the latter, there will be a tableType value in the
Detailed Table Information indicating the “table” is a VIRTUAL_VIEW.
You cannot use a view as a target of an INSERT or LOAD command.
Finally, views are read-only. You can only alter the metadata TBLPROPERTIES for a view:
ALTER VIEW shipments SET TBLPROPERTIES ('created_at' = 'some_timestamp');


CHAPTER 8

HiveQL: Indexes

Hive has limited indexing capabilities. There are no keys in the usual relational database
sense, but you can build an index on columns to speed some operations. The index
data for a table is stored in another table.
Also, the feature is relatively new, so it doesn’t have a lot of options yet. However, the
indexing process is designed to be customizable with plug-in Java code, so teams can
extend the feature to meet their needs.
Indexing is also a good alternative to partitioning when the logical partitions would
actually be too numerous and small to be useful. Indexing can aid in pruning some
blocks from a table as input for a MapReduce job. Not all queries can benefit from an
index—the EXPLAIN syntax and Hive can be used to determine if a given query is aided
by an index.
Indexes in Hive, like those in relational databases, need to be evaluated carefully.
Maintaining an index requires extra disk space and building an index has a processing
cost. The user must weigh these costs against the benefits they offer when querying a
table.

Creating an Index
Let’s create an index for our managed, partitioned employees table we described in
“Partitioned, Managed Tables” on page 58. Here is the table definition we used previously, for reference:
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);

Let’s index on the country partition only:

CREATE INDEX employees_index
ON TABLE employees (country)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD
IDXPROPERTIES ('creator' = 'me', 'created_at' = 'some_time')
IN TABLE employees_index_table
PARTITIONED BY (country, name)
COMMENT 'Employees indexed by country and name.';

In this case, we did not partition the index table to the same level of granularity as the
original table. We could choose to do so. If we omitted the PARTITIONED BY clause
completely, the index would span all partitions of the original table.
The AS ... clause specifies the index handler, a Java class that implements indexing.
Hive ships with a few representative implementations; the CompactIndexHandler shown
was in the first release of this feature. Third-party implementations can optimize certain
scenarios, support specific file formats, and more. We’ll provide more information on
implementing your own index handler in “Implementing a Custom Index Handler” on page 119.
We’ll discuss the meaning of WITH DEFERRED REBUILD in the next section.
It’s not a requirement for the index handler to save its data in a new table, but if it does,
the IN TABLE ... clause is used. It supports many of the options available when creating
other tables. Specifically, the example doesn’t use the optional ROW FORMAT, STORED AS,
STORED BY, LOCATION, and TBLPROPERTIES clauses that we discussed in Chapter 4. All
would appear before the final COMMENT clause shown.
Currently, indexing external tables and views is supported except for data residing
in S3.

Bitmap Indexes
Hive v0.8.0 adds a built-in bitmap index handler. Bitmap indexes are commonly used
for columns with few distinct values. Here is our previous example rewritten to use the
bitmap index handler:
CREATE INDEX employees_index
ON TABLE employees (country)
AS 'BITMAP'
WITH DEFERRED REBUILD
IDXPROPERTIES ('creator' = 'me', 'created_at' = 'some_time')
IN TABLE employees_index_table
PARTITIONED BY (country, name)
COMMENT 'Employees indexed by country and name.';

Rebuilding the Index
If you specified WITH DEFERRED REBUILD, the new index starts empty. At any time, the
index can be built the first time or rebuilt using the ALTER INDEX statement:
ALTER INDEX employees_index
ON employees
PARTITION (country = 'US')
REBUILD;

If the PARTITION clause is omitted, the index is rebuilt for all partitions.
There is no built-in mechanism to trigger an automatic rebuild of the index if the underlying table or a particular partition changes. However, if you have a workflow that
updates table partitions with data, one where you might already use the ALTER TABLE ...
TOUCH PARTITION(...) feature described in “Miscellaneous Alter Table Statements” on page 69, that same workflow could issue the ALTER INDEX ... REBUILD
command for a corresponding index.
The rebuild is atomic in the sense that if the rebuild fails, the index is left in the
state it was in before the rebuild started.
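As a sketch of the workflow step described above, assume a job has just loaded new data into one partition of employees. The state value 'IL' is purely illustrative and does not come from the original example. Note that Hive's documented ALTER INDEX grammar takes the table name directly after ON:

```sql
-- Hypothetical post-load step; the partition values are illustrative only.
ALTER TABLE employees TOUCH PARTITION (country = 'US', state = 'IL');

-- Rebuild the index for the country partition that was just updated.
ALTER INDEX employees_index
ON employees
PARTITION (country = 'US')
REBUILD;
```

Keeping both statements in the same script makes it harder for the index to drift out of date when a partition is reloaded.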

Showing an Index
The following command will show all the indexes defined for any column in the indexed
table:
SHOW FORMATTED INDEX ON employees;

FORMATTED is optional. It causes column titles to be added to the output. You can also
replace INDEX with INDEXES, as the output may list multiple indexes.

Dropping an Index
Dropping an index also drops the index table, if any:
DROP INDEX IF EXISTS employees_index ON employees;

Hive won’t let you attempt to drop the index table directly with DROP TABLE. As always,
IF EXISTS is optional and serves to suppress errors if the index doesn’t exist.
If the table that was indexed is dropped, the index itself and its table are dropped. Similarly, if a partition of the original table is dropped, the corresponding partition index
is also dropped.

Implementing a Custom Index Handler
The full details for implementing a custom index handler are given on the Hive Wiki
page, https://cwiki.apache.org/confluence/display/Hive/IndexDev#CREATE_INDEX,
where the initial design of indexing is documented. Of course, you can use the
source code for org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler as an
example.

When the index is created, the Java code you implement for the index handler has to
do some initial validation and define the schema for the index table, if used. It also has
to implement the rebuilding process where it reads the table to be indexed and writes
to the index storage (e.g., the index table). The handler must clean up any nontable
storage it uses for the index when the index is dropped, relying on Hive to drop the
index table, as needed. Finally, the handler must participate in optimizing queries.

CHAPTER 9

Schema Design

Hive looks and acts like a relational database. Users have a familiar nomenclature such
as tables and columns, as well as a query language that is remarkably similar to SQL
dialects they have used before. However, Hive is implemented and used in ways that
are very different from conventional relational databases. Often, users try to carry over
paradigms from the relational world that are actually Hive anti-patterns. This section
highlights some Hive patterns you should use and some anti-patterns you should avoid.

Table-by-Day
Table-by-day is a pattern where a timestamp is appended to a table name such as supply,
producing tables named supply_2011_01_01, supply_2011_01_02, etc. Table-by-day is an anti-pattern in
the database world, but due to common implementation challenges of ever-growing
data sets, it is still widely used:
hive> CREATE TABLE supply_2011_01_02 (id int, part string, quantity int);
hive> CREATE TABLE supply_2011_01_03 (id int, part string, quantity int);
hive> CREATE TABLE supply_2011_01_04 (id int, part string, quantity int);
hive> .... load data ...
hive> SELECT part,quantity FROM supply_2011_01_02
> UNION ALL
> SELECT part,quantity from supply_2011_01_03
> WHERE quantity < 4;

With Hive, a partitioned table should be used instead. Hive uses expressions in the
WHERE clause to select input only from the partitions needed for the query. This query
will run efficiently, and it is clean and easy on the eyes:
hive> CREATE TABLE supply (id int, part string, quantity int)
> PARTITIONED BY (day int);
hive> ALTER TABLE supply add PARTITION (day=20110102);
hive> ALTER TABLE supply add PARTITION (day=20110103);
hive> ALTER TABLE supply add PARTITION (day=20110104);
hive> .... load data ...
hive> SELECT part,quantity FROM supply
> WHERE day>=20110102 AND day<20110103 AND quantity < 4;

Over Partitioning
The partitioning feature is very useful in Hive. This is because Hive typically performs
full scans over all input to satisfy a query (we’ll leave Hive’s indexing out for this
discussion). However, a design that creates too many partitions may optimize some
queries, but be detrimental for other important queries:
hive> CREATE TABLE weblogs (url string, time bigint)
> PARTITIONED BY (day int, state string, city string);
hive> SELECT * FROM weblogs WHERE day=20110102;

HDFS was designed for many millions of large files, not billions of small files. The first
drawback of having too many partitions is the large number of Hadoop files and directories that are created unnecessarily. Each partition corresponds to a directory that
usually contains multiple files. If a given table contains thousands of partitions, it may
have tens of thousands of files, possibly created every day. If the retention of this table
is multiplied over years, it will eventually exhaust the capacity of the NameNode to
manage the filesystem metadata. The NameNode must keep all metadata for the filesystem in memory. While each file requires a small number of bytes for its metadata
(approximately 150 bytes/file), the net effect is to impose an upper limit on the total
number of files that can be managed in an HDFS installation. Other filesystems, like
MapR and Amazon S3, don’t have this limitation.
MapReduce processing converts a job into multiple tasks. In the default case, each task
is a new JVM instance, requiring the overhead of start up and tear down. For small
files, a separate task will be used for each file. In pathological scenarios, the overhead
of JVM start up and tear down can exceed the actual processing time!
Hence, an ideal partition scheme should not result in too many partitions and their
directories, and the files in each directory should be large, some multiple of the filesystem block size.
A good strategy for time-range partitioning, for example, is to determine the approximate size of your data accumulation over different granularities of time, and start with
the granularity that results in “modest” growth in the number of partitions over time,
while each partition contains files at least on the order of the filesystem block size or
multiples thereof. This balancing keeps the partitions large, which optimizes
throughput for the general case query. Consider when the next level of granularity is
appropriate, especially if query WHERE clauses typically select ranges of smaller
granularities:
hive> CREATE TABLE weblogs (url string, time bigint, state string, city string)
> PARTITIONED BY (day int);
hive> SELECT * FROM weblogs WHERE day=20110102;

Another solution is to use two levels of partitions along different dimensions. For example, the first partition might be by day and the second-level partition might be by
geographic region, like the state:
hive> CREATE TABLE weblogs (url string, time bigint, city string)
> PARTITIONED BY (day int, state string);
hive> SELECT * FROM weblogs WHERE day=20110102;

However, since some states will probably produce much more data than others, you could
see imbalanced map tasks, as processing the larger states takes a lot longer than processing the smaller states.
If you can’t find good, comparatively sized partition choices, consider using bucketing as described in “Bucketing Table Data Storage” on page 125.

Unique Keys and Normalization
Relational databases typically use unique keys, indexes, and normalization to store data
sets that fit into memory or mostly into memory. Hive, however, does not have the
concept of primary keys or automatic, sequence-based key generation. Joins should be
avoided in favor of denormalized data, when feasible. The complex types, Array, Map,
and Struct, help by allowing the storage of one-to-many data inside a single row. This
is not to say normalization should never be utilized, but star-schema type designs are
nonoptimal.
The primary reason to avoid normalization is to minimize disk seeks, such as those
typically required to navigate foreign key relations. Denormalizing data permits it to
be scanned from or written to large, contiguous sections of disk drives, which optimizes
I/O performance. However, you pay the penalties of denormalization: data duplication
and a greater risk of inconsistent data.
For example, consider our running example, the employees table. Here it is again with
some changes for clarity:
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>);

The data model of this example breaks the traditional design rules in a few ways.

First, we are informally using name as the primary key, although we all know that names
are often not unique! Ignoring that issue for now, a relational model would have a single
foreign key relation from an employee record to the manager record, using the name
key. We represented this relation the other way around: each employee has an ARRAY
of names of subordinates.
Second, the value for each deduction is unique to the employee, but the map keys are
duplicated data, even if you substitute “flags” (say, integers) for the actual key strings.
A normal relational model would have a separate, two-column table for the deduction
name (or flag) and value, with a one-to-many relationship between the employees and
this deductions table.
Finally, chances are that at least some employees live at the same address, but we are
duplicating the address for each employee, rather than using a many-to-one relationship
to an addresses table.
It’s up to us to manage referential integrity (or deal with the consequences), and to fix
the duplicates of a particular piece of data that has changed. Hive does not give us a
convenient way to UPDATE single records.
Still, when you have tens of terabytes to many petabytes of data, optimizing speed makes
these limitations worth accepting.

Making Multiple Passes over the Same Data
Hive has a special syntax for producing multiple aggregations from a single pass
through a source of data, rather than rescanning it for each aggregation. This change
can save considerable processing time for large input data sets. We discussed the details
previously in Chapter 5.
For example, each of the following two queries creates a table from the same source
table, history:
hive> INSERT OVERWRITE TABLE sales
> SELECT * FROM history WHERE action='purchased';
hive> INSERT OVERWRITE TABLE credits
> SELECT * FROM history WHERE action='returned';

This syntax is correct, but inefficient. The following rewrite achieves the same thing,
but using a single pass through the source history table:
hive> FROM history
> INSERT OVERWRITE TABLE sales SELECT * WHERE action='purchased'
> INSERT OVERWRITE TABLE credits SELECT * WHERE action='returned';

The Case for Partitioning Every Table
Many ETL processes involve multiple processing steps. Each step may produce one or
more temporary tables that are only needed until the end of the next job. At first it may
appear that partitioning these temporary tables is unnecessary. However, imagine a
scenario where a mistake in a step’s query or raw data forces a rerun of the ETL process
for several days of input. You will likely need to run the catch-up process a day at a
time in order to make sure that one job does not overwrite the temporary table before
other tasks have completed.
For example, the following design creates an intermediate table by the name
of distinct_ip_in_logs to be used by a subsequent processing step:
$ hive -hiveconf dt=2011-01-01
hive> INSERT OVERWRITE table distinct_ip_in_logs
> SELECT distinct(ip) as ip from weblogs
> WHERE hit_date='${hiveconf:dt}';
hive> CREATE TABLE state_city_for_day (state string,city string);
hive> INSERT OVERWRITE TABLE state_city_for_day
> SELECT DISTINCT state, city FROM distinct_ip_in_logs
> JOIN geodata ON (distinct_ip_in_logs.ip=geodata.ip);

This approach works; however, computing a single day causes the results of the previous
day to be removed by the INSERT OVERWRITE clause. If two instances of this process are
run at once for different days, they could stomp on each other’s results.
A more robust approach is to carry the partition information all the way through the
process. This makes synchronization a nonissue. Also, as a side effect, this approach
allows you to compare the intermediate data day over day:
$ hive -hiveconf dt=2011-01-01
hive> INSERT OVERWRITE table distinct_ip_in_logs
> PARTITION (hit_date='${hiveconf:dt}')
> SELECT distinct(ip) as ip from weblogs
> WHERE hit_date='${hiveconf:dt}';
hive> CREATE TABLE state_city_for_day (state string,city string)
> PARTITIONED BY (hit_date string);
hive> INSERT OVERWRITE table state_city_for_day
> PARTITION (hit_date='${hiveconf:dt}')
> SELECT DISTINCT state, city FROM distinct_ip_in_logs
> JOIN geodata ON (distinct_ip_in_logs.ip=geodata.ip)
> WHERE (hit_date='${hiveconf:dt}');

A drawback of this approach is that you will need to manage the intermediate table
and delete older partitions, but these tasks are easy to automate.
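The cleanup can be as small as one statement per intermediate table. The sketch below assumes both tables are partitioned by hit_date as in the example above. The retention date is an arbitrary illustration; a scheduler would normally compute it:

```sql
-- Illustrative cleanup; the cutoff date here is an assumption.
ALTER TABLE distinct_ip_in_logs
  DROP IF EXISTS PARTITION (hit_date='2010-12-01');

ALTER TABLE state_city_for_day
  DROP IF EXISTS PARTITION (hit_date='2010-12-01');
```

IF EXISTS keeps the script from failing when a partition has already been removed, which makes the job safe to rerun.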

Bucketing Table Data Storage
Partitions offer a convenient way to segregate data and to optimize queries. However,
not all data sets lead to sensible partitioning, especially given the concerns raised earlier
about appropriate sizing.
Bucketing is another technique for decomposing data sets into more manageable parts.
For example, suppose a table using the date dt as the top-level partition and the
user_id as the second-level partition leads to too many small partitions. Recall that if
you use dynamic partitioning to create these partitions, by default Hive limits the maximum number of dynamic partitions that may be created. The limit prevents the extreme case
where so many partitions are created that they overwhelm the filesystem’s ability to manage
them, among other problems. So, the following commands might fail:
hive> CREATE TABLE weblog (url STRING, source_ip STRING)
> PARTITIONED BY (dt STRING, user_id INT);
hive> FROM raw_weblog
> INSERT OVERWRITE TABLE weblog PARTITION(dt='2012-06-08', user_id)
> SELECT url, source_ip, user_id;

Instead, if we bucket the weblog table and use user_id as the bucketing column, the
value of this column will be hashed into a user-defined number of buckets. Records
with the same user_id will always be stored in the same bucket. Assuming the number
of users is much greater than the number of buckets, each bucket will have many users:
hive> CREATE TABLE weblog (user_id INT, url STRING, source_ip STRING)
> PARTITIONED BY (dt STRING)
> CLUSTERED BY (user_id) INTO 96 BUCKETS;

However, it is up to you to insert data correctly into the table! The specification in
CREATE TABLE only defines metadata, but has no effect on commands that actually populate the table.
This is how to populate the table correctly, when using an INSERT … TABLE statement.
First, we set a property that forces Hive to choose the correct number of reducers corresponding to the target table’s bucketing setup. Then we run a query to populate the
partitions. For example:
hive> SET hive.enforce.bucketing = true;
hive> FROM raw_logs
> INSERT OVERWRITE TABLE weblog
> PARTITION (dt='2009-02-25')
> SELECT user_id, url, source_ip WHERE dt='2009-02-25';

If we didn’t use the hive.enforce.bucketing property, we would have to set the number
of reducers to match the number of buckets, using set mapred.reduce.tasks=96. Then
the INSERT query would require a CLUSTER BY clause after the SELECT clause.
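The manual alternative just described might look like the following sketch. It reuses the raw_logs and weblog tables from the previous example, and assumes the table was declared with 96 buckets on user_id:

```sql
-- Manual alternative to hive.enforce.bucketing: one reducer per bucket.
SET mapred.reduce.tasks=96;

FROM raw_logs
INSERT OVERWRITE TABLE weblog
PARTITION (dt='2009-02-25')
SELECT user_id, url, source_ip
WHERE dt='2009-02-25'
-- CLUSTER BY routes rows with the same user_id to the same reducer.
CLUSTER BY user_id;
```

The reducer count and the CLUSTER BY column must both match the table's CLUSTERED BY declaration; Hive does not check this for you in the manual form.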
As for all table metadata, specifying bucketing doesn’t ensure that the
table is properly populated. Follow the previous example to ensure that
you correctly populate bucketed tables.

Bucketing has several advantages. The number of buckets is fixed so it does not fluctuate with data. Buckets are ideal for sampling. If two tables are bucketed by user_id,
Hive can create a logically correct sampling. Bucketing also aids in doing efficient map-side joins, as we discussed in “Map-side Joins” on page 105.
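As an illustration of sampling, the sketch below reads a fraction of the bucketed weblog table defined earlier. It assumes the table was populated correctly into 96 buckets on user_id. The alias w and the choice of bucket numbers are arbitrary:

```sql
-- Read a single bucket (1 of 96) for one day.
SELECT w.user_id, w.url
FROM weblog TABLESAMPLE (BUCKET 1 OUT OF 96 ON user_id) w
WHERE w.dt = '2009-02-25';

-- A larger sample: 1 OUT OF 32 covers three of the 96 buckets.
SELECT w.user_id, w.url
FROM weblog TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id) w
WHERE w.dt = '2009-02-25';
```

Because the sampling column matches the bucketing column, Hive can select whole bucket files instead of scanning the full partition.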

Adding Columns to a Table
Hive allows the definition of a schema over raw data files, unlike many databases that
force the conversion and importation of data following a specific format. A benefit of
this separation of concerns is the ability to adapt a table definition easily when new
columns are added to the data files.
Hive offers the SerDe abstraction, which enables the extraction of data from input. The
SerDe also enables the output of data, though the output feature is not used as frequently because Hive is used primarily as a query mechanism. A SerDe usually parses
from left to right, splitting rows by specified delimiters into columns. The SerDes tend
to be very forgiving. For example, if a row has fewer columns than expected, the missing
columns will be returned as null. If the row has more columns than expected, they will
be ignored. Adding new columns to the schema involves a single ALTER TABLE ADD COLUMNS
command. This is very useful as log formats tend to only add more information to
a message:
hive> CREATE TABLE weblogs (version BIGINT, url STRING)
> PARTITIONED BY (hit_date int)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> ! cat log1.txt;
1 /mystuff
1 /toys
hive> LOAD DATA LOCAL INPATH 'log1.txt' INTO TABLE weblogs PARTITION (hit_date=20110101);
hive> SELECT * FROM weblogs;
1 /mystuff 20110101
1 /toys 20110101

Over time a new column may be added to the underlying data. In the following example
the column user_id is added to the data. Note that some older raw data files may not
have this column:
hive> ! cat log2.txt;
2 /cars bob
2 /stuff terry
hive> ALTER TABLE weblogs ADD COLUMNS (user_id string);
hive> LOAD DATA LOCAL INPATH 'log2.txt' INTO TABLE weblogs PARTITION (hit_date=20110102);
hive> SELECT * from weblogs;
1 /mystuff 20110101 NULL
1 /toys 20110101 NULL
2 /cars 20110102 bob
2 /stuff 20110102 terry

Note that with this approach, columns cannot be added in the beginning or the middle.

Using Columnar Tables
Hive typically uses row-oriented storage; however, Hive also has a columnar SerDe that
stores information in a hybrid row-column oriented form. While this format can be
used for any type of data, it is optimal for certain data sets.

Repeated Data
Given enough rows, fields like state and age will have the same data repeated many
times. This type of data benefits from column-based storage.
state  uid     age
NY     Bob     40
NJ     Sara    32
NY     Peter   14
NY     Sandra  4

Many Columns
The table below has a large number of columns.
state  uid     age  server  tz   many_more …
NY     Bob     40   web1    est  stuff
NJ     Sara    32   web1    est  stuff
NY     Peter   14   web3    pst  stuff
NY     Sandra  4    web45   pst  stuff

Queries typically only use a single column or a small set of columns. Column-based
storage will make analyzing the table data faster:
hive> SELECT distinct(state) from weblogs;
NY
NJ

See “RCFile” on page 202 for how to use this format.
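As a minimal sketch, a columnar copy of such a table could be created as shown below. The table name weblogs_rc is hypothetical, and the sketch assumes the source weblogs table has the columns shown in the table above. It relies on the STORED AS RCFILE shorthand rather than spelling out the SerDe and file format classes:

```sql
-- Hypothetical columnar copy of the wide table shown above.
CREATE TABLE weblogs_rc (
  state  STRING,
  uid    STRING,
  age    INT,
  server STRING,
  tz     STRING)
STORED AS RCFILE;

-- Populate it from the row-oriented table.
INSERT OVERWRITE TABLE weblogs_rc
SELECT state, uid, age, server, tz FROM weblogs;

-- Only the state column needs to be read from disk.
SELECT DISTINCT state FROM weblogs_rc;
```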

(Almost) Always Use Compression!
In almost all cases, compression makes data smaller on disk, which usually makes
queries faster by reducing I/O overhead. Hive works seamlessly with many compression
types. The only compelling reason to not use compression is when the data produced
is intended for use by an external system, and an uncompressed format, such as text,
is the most compatible.
But compression and decompression consume CPU resources. MapReduce jobs tend
to be I/O bound, so the extra CPU overhead is usually not a problem. However, for
workflows that are CPU intensive, such as some machine-learning algorithms, compression may actually reduce performance by stealing valuable CPU resources from
more essential operations.
See Chapter 11 for more on how to use compression.

CHAPTER 10

Tuning

HiveQL is a declarative language: users issue queries and Hive figures
out how to translate them into MapReduce jobs. Most of the time, you don’t need to
understand how Hive works, freeing you to focus on the problem at hand. While the
sophisticated process of query parsing, planning, optimization, and execution is the
result of many years of hard engineering work by the Hive team, most of the time you
can remain oblivious to it.
However, as you become more experienced with Hive, learning about the theory behind
Hive, and the low-level implementation details, will let you use Hive more effectively,
especially where performance optimizations are concerned.
This chapter covers several different topics related to tuning Hive performance. Some
tuning involves adjusting numeric configuration parameters (“turning the knobs”),
while other tuning steps involve enabling or disabling specific features.

Using EXPLAIN
The first step to learning how Hive works (after reading this book…) is to use the
EXPLAIN feature to learn how Hive translates queries into MapReduce jobs.
Consider the following example:
hive> DESCRIBE onecol;
number int
hive> SELECT * FROM onecol;
5
5
4
hive> SELECT SUM(number) FROM onecol;
14

Now, put the EXPLAIN keyword in front of the last query to see the query plan and other
information. The query will not be executed.

hive> EXPLAIN SELECT SUM(number) FROM onecol;

The output requires some explaining and practice to understand.
First, the abstract syntax tree is printed. This shows how Hive parsed the query into
tokens and literals, as part of the first step in turning the query into the ultimate result:
ABSTRACT SYNTAX TREE:
(TOK_QUERY
(TOK_FROM (TOK_TABREF (TOK_TABNAME onecol)))
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
(TOK_SELECT
(TOK_SELEXPR
(TOK_FUNCTION sum (TOK_TABLE_OR_COL number))))))

(The indentation of the actual output was changed to fit the page.)
For those not familiar with parsers and tokenizers, this can look overwhelming. However, even if you are a novice in this area, you can study the output to get a sense for
what Hive is doing with the SQL statement. (As a first step, ignore the TOK_ prefixes.)
Even though our query will write its output to the console, Hive will actually write the
output to a temporary file first, as shown by this part of the output:
'(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))'

Next, we can see references to our column name number, our table name onecol, and
the sum function.
A Hive job consists of one or more stages, with dependencies between different stages.
As you might expect, more complex queries will usually involve more stages, and more
stages usually require more processing time to complete.
A stage could be a MapReduce job, a sampling stage, a merge stage, a limit stage, or a
stage for some other task Hive needs to do. By default, Hive executes these stages one
at a time, although later we’ll discuss parallel execution in “Parallel Execution” on page 136.
Some stages will be short, like those that move files around. Other stages may also
finish quickly if they have little data to process, even though they require a map or
reduce task:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage

The STAGE PLAN section is verbose and complex. Stage-1 is the bulk of the processing
for this job and happens via a MapReduce job. A TableScan takes the input of the table
and produces a single output column number. The Group By Operator applies the
sum(number) and produces an output column _col0 (a synthesized name for an anonymous result). All this is happening on the map side of the job, under the Map Operator
Tree:

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        onecol
          TableScan
            alias: onecol
            Select Operator
              expressions:
                    expr: number
                    type: int
              outputColumnNames: number
              Group By Operator
                aggregations:
                      expr: sum(number)
                bucketGroup: false
                mode: hash
                outputColumnNames: _col0
                Reduce Output Operator
                  sort order:
                  tag: -1
                  value expressions:
                        expr: _col0
                        type: bigint

On the reduce side, under the Reduce Operator Tree, we see the same Group By Operator
but this time it is applying sum on _col0. Finally, in the reducer we see the File
Output Operator, which shows that the output will be text, based on the string output
format: HiveIgnoreKeyTextOutputFormat:
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE._col0)
          bucketGroup: false
          mode: mergepartial
          outputColumnNames: _col0
          Select Operator
            expressions:
                  expr: _col0
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format:
                      org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Because this job has no LIMIT clause, Stage-0 is a no-op stage:
  Stage: Stage-0
    Fetch Operator
      limit: -1

Understanding the intricate details of how Hive parses and plans every query is not
useful all of the time. However, it is nice to have for analyzing complex or poorly
performing queries, especially as we try various tuning steps. We can observe what
effect these changes have at the “logical” level, in tandem with performance measurements.

EXPLAIN EXTENDED
Using EXPLAIN EXTENDED produces even more output. In an effort to “go green,” we
won’t show the entire output, but we will show you the Reduce Operator Tree to
demonstrate the different output:
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE._col0)
          bucketGroup: false
          mode: mergepartial
          outputColumnNames: _col0
          Select Operator
            expressions:
                  expr: _col0
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              directory: file:/tmp/edward/hive_2012-[long number]/-ext-10001
              NumFilesPerFileSink: 1
              Stats Publishing Key Prefix:
                  file:/tmp/edward/hive_2012-[long number]/-ext-10001/
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format:
                      org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  properties:
                    columns _col0
                    columns.types bigint
                    escape.delim \
                    serialization.format 1
              TotalFiles: 1
              GatherStats: false
              MultiFileSpray: false

We encourage you to compare the two outputs for the Reduce Operator Tree.

Limit Tuning
The LIMIT clause is commonly used, often by people working with the CLI. However,
in many cases a LIMIT clause still executes the entire query, then only returns a handful
of results. Because this behavior is generally wasteful, it should be avoided when
possible. Hive has a configuration property to enable sampling of source data for use
with LIMIT:

<property>
  <name>hive.limit.optimize.enable</name>
  <value>true</value>
  <description>Whether to enable to optimization to
    try a smaller subset of data for simple LIMIT first.</description>
</property>


Once the hive.limit.optimize.enable is set to true, two variables control its operation,
hive.limit.row.max.size and hive.limit.optimize.limit.file:

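<property>
  <name>hive.limit.row.max.size</name>
  <value>100000</value>
  <description>When trying a smaller subset of data for simple LIMIT,
    how much size we need to guarantee each row to have at least.</description>
</property>

<property>
  <name>hive.limit.optimize.limit.file</name>
  <value>10</value>
  <description>When trying a smaller subset of data for simple LIMIT,
    maximum number of files we can sample.</description>
</property>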


A drawback of this feature is the risk that useful input data will never get processed.
For example, any query that requires a reduce step, such as most JOIN and GROUP BY
operations, most calls to aggregate functions, etc., will have very different results. Perhaps this difference is okay in many cases, but it’s important to understand.
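For experimentation, the same three properties can be set for a single CLI session instead of in the configuration file. This sketch uses the values shown above and the weblogs table from the previous chapter as a stand-in for any large table:

```sql
-- Session-level settings; they revert when the CLI session ends.
SET hive.limit.optimize.enable=true;
SET hive.limit.row.max.size=100000;
SET hive.limit.optimize.limit.file=10;

-- A simple LIMIT query of the kind this optimization targets.
SELECT * FROM weblogs LIMIT 10;
```

Trying the settings per session first makes it easier to compare results with and without sampling before changing the behavior for every user.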

Optimized Joins
We discussed optimizing join performance in “Join Optimizations” on page 100 and
“Map-side Joins” on page 105. We won’t reproduce the details here, but remember
that it’s important to know which table is the largest and put it last in the
JOIN clause, or use the /*+ STREAMTABLE(table_name) */ directive.
If all but one table is small enough, typically to fit in memory, then Hive can perform
a map-side join, eliminating the need for reduce tasks and even some map tasks. Sometimes even tables that do not fit in memory are good candidates because removing the
reduce phase outweighs the cost of bringing semi-large tables into each map task.
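The two hints can be sketched as follows. The tables fact (large) and dim (small) and their columns are hypothetical names chosen only for illustration:

```sql
-- Tell Hive which table to stream, regardless of its position in the JOIN.
SELECT /*+ STREAMTABLE(f) */ f.id, d.name
FROM fact f JOIN dim d ON (f.dim_id = d.id);

-- Request a map-side join that loads the small table d into memory.
SELECT /*+ MAPJOIN(d) */ f.id, d.name
FROM fact f JOIN dim d ON (f.dim_id = d.id);
```

Note the plus sign after the opening comment marker; without it, Hive treats the text as an ordinary comment and ignores the hint.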

Local Mode
Many Hadoop jobs need the full scalability benefits of Hadoop to process large data
sets. However, there are times when the input to Hive is very small. In these cases, the
overhead of launching tasks for queries consumes a significant percentage of the overall
job execution time. In many of these cases, Hive can leverage the lighter weight of the
local mode to perform all the tasks for the job on a single machine and sometimes in
the same process. The reduction in execution times can be dramatic for small data sets.
You can explicitly enable local mode temporarily, as in this example:
hive> set oldjobtracker=${hiveconf:mapred.job.tracker};
hive> set mapred.job.tracker=local;
hive> set mapred.tmp.dir=/home/edward/tmp;
hive> SELECT * from people WHERE firstname='bob';
...
hive> set mapred.job.tracker=${oldjobtracker};

You can also tell Hive to automatically apply this optimization by setting
hive.exec.mode.local.auto to true, perhaps in your $HOME/.hiverc.
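For a single user, the line to add to $HOME/.hiverc (or to type at the CLI prompt) would be something like this sketch:

```sql
-- Let Hive decide per query whether local mode is appropriate.
SET hive.exec.mode.local.auto=true;
```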
To set this property permanently for all users, change the value in your $HIVE_HOME/
conf/hive-site.xml:

<property>
  <name>hive.exec.mode.local.auto</name>
  <value>true</value>
  <description>
    Let hive determine whether to run in local mode automatically
  </description>
</property>



Parallel Execution
Hive converts a query into one or more stages. Stages could be a MapReduce stage, a
sampling stage, a merge stage, a limit stage, or other possible tasks Hive needs to do.
By default, Hive executes these stages one at a time. However, a particular job may
consist of some stages that are not dependent on each other and could be executed in
parallel, possibly allowing the overall job to complete more quickly.
Setting hive.exec.parallel to true enables parallel execution. Be careful in a shared
cluster, however. If a job is running more stages in parallel, it will increase its cluster
utilization:

<property>
  <name>hive.exec.parallel</name>
  <value>true</value>
  <description>Whether to execute jobs in parallel</description>
</property>
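To see whether a query has stages that could overlap, you can inspect its plan before enabling the option. This sketch uses the fracture_ins table that appears in the examples in the rest of this chapter. The UNION ALL query is only an illustration, and hive.exec.parallel.thread.number (a cap on concurrently running stages) is an additional property you should confirm for your Hive version:

```sql
SET hive.exec.parallel=true;
-- Assumption: this property caps how many stages run at once; 8 is illustrative.
SET hive.exec.parallel.thread.number=8;

-- The two aggregations do not depend on each other, so their MapReduce
-- stages are candidates for running side by side.
EXPLAIN
SELECT u.pixel_id, u.cnt
FROM (
  SELECT pixel_id, count(1) AS cnt FROM fracture_ins
  WHERE hit_date=20120119 GROUP BY pixel_id
  UNION ALL
  SELECT pixel_id, count(1) AS cnt FROM fracture_ins
  WHERE hit_date=20120120 GROUP BY pixel_id
) u;
```

The STAGE DEPENDENCIES section of the EXPLAIN output lists which stages depend on which. Stages with no dependency between them are the ones parallel execution can overlap.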


Strict Mode
Strict mode is a setting in Hive that prevents users from issuing queries that could have
unintended and undesirable effects.
Setting the property hive.mapred.mode to strict disables three types of queries.
First, queries on partitioned tables are not permitted unless they include a partition
filter in the WHERE clause, limiting their scope. In other words, you’re prevented from
queries that will scan all partitions. The rationale for this limitation is that partitioned
tables often hold very large data sets that may be growing rapidly. An unrestricted
query could consume unacceptably large resources over such a large table:
hive> SELECT DISTINCT(planner_id) FROM fracture_ins WHERE planner_id=5;
FAILED: Error in semantic analysis: No Partition Predicate Found for
Alias "fracture_ins" Table "fracture_ins"

The following enhancement adds a filter on the table's partition column, hit_date, to the
WHERE clause:
hive> SELECT DISTINCT(planner_id) FROM fracture_ins
> WHERE planner_id=5 AND hit_date=20120101;
... normal results ...

The second type of restricted query is one with an ORDER BY clause but no LIMIT clause.
Because ORDER BY sends all results to a single reducer to perform the ordering, forcing
the user to specify a LIMIT clause prevents the reducer from executing for an extended
period of time:
hive> SELECT * FROM fracture_ins WHERE hit_date>2012 ORDER BY planner_id;
FAILED: Error in semantic analysis: line 1:56 In strict mode,
limit must be specified if ORDER BY is present planner_id

To issue this query, add a LIMIT clause:
hive> SELECT * FROM fracture_ins WHERE hit_date>2012 ORDER BY planner_id
> LIMIT 100000;
... normal results ...

The third and final type of query prevented is a Cartesian product. Users coming from
the relational database world may expect that queries that perform a JOIN not with an
ON clause but with a WHERE clause will have the query optimized by the query planner,
effectively converting the WHERE clause into an ON clause. Unfortunately, Hive does not
perform this optimization, so a runaway query will occur if the tables are large:
hive> SELECT * FROM fracture_act JOIN fracture_ads
> WHERE fracture_act.planner_id = fracture_ads.planner_id;
FAILED: Error in semantic analysis: In strict mode, cartesian product
is not allowed. If you really want to perform the operation,
+set hive.mapred.mode=nonstrict+

Here is a properly constructed query with JOIN and ON clauses:
hive> SELECT * FROM fracture_act JOIN fracture_ads
> ON (fracture_act.planner_id = fracture_ads.planner_id);
... normal results ...
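Strict mode can also be toggled per session, which is handy when one deliberate full scan is needed on a cluster where strict is the configured default. A minimal sketch, using the table and columns from the examples in this chapter:

```sql
-- Enforce the three checks for this session.
SET hive.mapred.mode=strict;

-- Passes all three checks: partition filter, ORDER BY with LIMIT, no join.
SELECT planner_id, pixel_id FROM fracture_ins
WHERE hit_date=20120101
ORDER BY planner_id
LIMIT 100;

-- Relax the checks only around a query that intentionally scans every
-- partition, then restore them.
SET hive.mapred.mode=nonstrict;
SELECT count(1) FROM fracture_ins;
SET hive.mapred.mode=strict;
```

Keeping the nonstrict window as narrow as possible preserves the protection for the rest of the session.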

Tuning the Number of Mappers and Reducers
Hive is able to parallelize queries by breaking the query into one or more MapReduce
jobs, each of which might have multiple mapper and reducer tasks, at least some of
which can run in parallel. Determining the optimal number of mappers and reducers
depends on many variables, such as the size of the input and the operation being performed on the data.
A balance is required. Having too many mapper or reducer tasks causes excessive overhead in starting, scheduling, and running the job, while too few tasks means the
inherent parallelism of the cluster is underutilized.
When running a Hive query that has a reduce phase, the CLI prints information about
how the number of reducers can be tuned. Let’s see an example that uses a GROUP BY
query, because they always require a reduce phase. In contrast, many other queries are
converted into map-only jobs:
hive> SELECT pixel_id, count(1) FROM fracture_ins WHERE hit_date=20120119
> GROUP BY pixel_id;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
...

Hive is determining the number of reducers from the input size. This can be confirmed
using the dfs -count command, which works something like the Linux du -s command;
it computes a total size for all the data under a given directory:
[edward@etl02 ~]$ hadoop dfs -count /user/media6/fracture/ins/* | tail -4
1  8 2614608737 hdfs://.../user/media6/fracture/ins/hit_date=20120118
1  7 2742992546 hdfs://.../user/media6/fracture/ins/hit_date=20120119
1 17 2656878252 hdfs://.../user/media6/fracture/ins/hit_date=20120120
1  2  362657644 hdfs://.../user/media6/fracture/ins/hit_date=20120121

(We’ve reformatted the output and elided some details for space.)
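The default value of hive.exec.reducers.bytes.per.reducer is 1 GB. Changing this
value to 750 MB causes Hive to estimate four reducers for this job, because the roughly 2.7 GB in the hit_date=20120119 partition no longer fits in three 750 MB shares: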
hive> set hive.exec.reducers.bytes.per.reducer=750000000;
hive> SELECT pixel_id,count(1) FROM fracture_ins WHERE hit_date=20120119
> GROUP BY pixel_id;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 4
...

This default typically yields good results. However, there are cases where a query’s map
phase will create significantly more data than the input size. In that case, an estimate
based on input size might select too few reducers. Likewise, the map function might filter out a large portion of the data set, in which case
fewer reducers may be justified.
A quick way to experiment is by setting the number of reducers to a fixed value, rather
than allowing Hive to calculate it. Recall that Hive’s estimate for this query, with the default settings,
was three reducers. Set mapred.reduce.tasks to different numbers and determine if more
or fewer reducers result in faster run times. Remember that benchmarking like this is
complicated by external factors such as other users running jobs simultaneously. Hadoop also has a few seconds of overhead to start up and schedule map and reduce tasks. When
executing performance tests, it’s important to keep these factors in mind, especially if
the jobs are small.
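Such an experiment might look like the following sketch. The reducer counts 2 and 8 are arbitrary starting points, not recommendations, and each run should be timed more than once because of the external factors just mentioned:

```sql
-- Try a fixed reducer count smaller than Hive's estimate of three...
SET mapred.reduce.tasks=2;
SELECT pixel_id, count(1) FROM fracture_ins WHERE hit_date=20120119
GROUP BY pixel_id;

-- ...and a larger one, then compare the reported run times.
SET mapred.reduce.tasks=8;
SELECT pixel_id, count(1) FROM fracture_ins WHERE hit_date=20120119
GROUP BY pixel_id;

-- A value of -1 tells Hive to go back to estimating the count itself.
SET mapred.reduce.tasks=-1;
```

Resetting the property at the end matters because, like any value set in the CLI, it otherwise applies to every later query in the session.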
The hive.exec.reducers.max property is useful for controlling resource utilization on
shared clusters when dealing with large jobs. A Hadoop cluster has a fixed number of
map and reduce “slots” to allocate to tasks. One large job could reserve all of the slots
and block other jobs from starting. Setting hive.exec.reducers.max can stop a query
from taking too many reducer resources. It is a good idea to set this value in your
$HIVE_HOME/conf/hive-site.xml. A suggested formula is to set the value to the result
of this calculation:
(Total Cluster Reduce Slots * 1.5) / (avg number of queries running)

The 1.5 multiplier is a fudge factor to prevent underutilization of the cluster.
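As a worked example of the formula, assume a hypothetical cluster with 40 reduce slots and an average of 4 queries running at once. Both numbers are invented for illustration. The formula gives (40 * 1.5) / 4 = 15:

```sql
-- Hypothetical cluster: 40 reduce slots, about 4 concurrent queries.
-- (40 * 1.5) / 4 = 15
SET hive.exec.reducers.max=15;
```

The same value would go in $HIVE_HOME/conf/hive-site.xml to apply it to all users rather than to one session.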

JVM Reuse
JVM reuse is a Hadoop tuning parameter that is very relevant to Hive performance,
especially in scenarios where it’s hard to avoid small files and scenarios with lots of tasks,
most of which have short execution times.
The default configuration of Hadoop will typically launch map or reduce tasks in a
forked JVM. The JVM start-up may create significant overhead, especially when
launching jobs with hundreds or thousands of tasks. Reuse allows a JVM instance to
be reused up to N times for the same job. This value is set in Hadoop’s mapred-site.xml (in $HADOOP_HOME/conf):

<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>10</value>
  <description>How many tasks to run per jvm. If set to -1, there is no limit.
  </description>
</property>




A drawback of this feature is that JVM reuse will keep reserved task slots open until
the job completes, in case they are needed for reuse. If an “unbalanced” job has some
reduce tasks that run considerably longer than the others, the reserved slots will sit idle,
unavailable for other jobs, until the last task completes.
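Because mapred.job.reuse.jvm.num.tasks applies per job, it can also be set from a Hive session for the queries that benefit, rather than cluster-wide. This is a sketch that assumes your cluster configuration does not mark the property as final. The value 10 mirrors the configuration above:

```sql
-- Reuse each task JVM for up to 10 tasks of the same job.
SET mapred.job.reuse.jvm.num.tasks=10;

-- A query with many short tasks is the kind that benefits most.
SELECT pixel_id, count(1) FROM fracture_ins WHERE hit_date=20120119
GROUP BY pixel_id;

-- Return to one task per JVM (the Hadoop default) for later queries.
SET mapred.job.reuse.jvm.num.tasks=1;
```

Limiting reuse to selected queries also limits the idle-slot drawback described above to those jobs.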

Indexes
Indexes may be used to accelerate the calculation speed of a GROUP BY query.
Hive has included an implementation of bitmap indexes since v0.8.0. The main use case
for bitmap indexes is when there are comparatively few values for a given column. See
“Bitmap Indexes” on page 118 for more information.

Dynamic Partition Tuning
As explained in “Dynamic Partition Inserts” on page 74, dynamic partition INSERT
statements enable a succinct SELECT statement to create many new partitions for insertion into a partitioned table.
This is a very powerful feature. However, if the number of partitions is high, a large
number of output handles must be created on the system. This is a somewhat uncommon use case for Hadoop, which typically creates a few files at once and streams large
amounts of data to them.
Out of the box, Hive is configured to prevent dynamic partition inserts from creating
more than 1,000 or so partitions. While it can be bad for a table to have too many
partitions, it is generally better to tune this setting to a larger value and allow these
queries to work.
First, it is always good to set the dynamic partition mode to strict in your hive-site.xml, as discussed in “Strict Mode” on page 137. When strict mode is on, at
least one partition has to be static, as demonstrated in “Dynamic Partition Inserts” on page 74:

<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>strict</value>
  <description>In strict mode, the user must specify at least one
  static partition in case the user accidentally overwrites all
  partitions.</description>
</property>


Then, increase the other relevant properties to allow queries that will create a large
number of dynamic partitions, for example:

<property>
  <name>hive.exec.max.dynamic.partitions</name>
  <value>300000</value>
  <description>Maximum number of dynamic partitions allowed to be
  created in total.</description>
</property>

<property>
  <name>hive.exec.max.dynamic.partitions.pernode</name>
  <value>10000</value>
  <description>Maximum number of dynamic partitions allowed to be
  created in each mapper/reducer node.</description>
</property>
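The same properties can be set for one session before a large dynamic partition insert. In the sketch below, the tables hits_staging and hits_by_day and their columns are hypothetical, invented only to show the shape of the statement. hive.exec.dynamic.partition, which switches the feature on, is assumed to be off in your configuration:

```sql
-- Hypothetical tables: hits_staging is unpartitioned; hits_by_day is
-- PARTITIONED BY (yr INT, hit_date INT).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=strict;
SET hive.exec.max.dynamic.partitions=300000;
SET hive.exec.max.dynamic.partitions.pernode=10000;

-- yr is static, which satisfies strict mode; hit_date is dynamic and is
-- taken from the last column of the SELECT.
INSERT OVERWRITE TABLE hits_by_day PARTITION (yr=2012, hit_date)
SELECT s.planner_id, s.pixel_id, s.hit_date
FROM hits_staging s
WHERE s.hit_date >= 20120101 AND s.hit_date <= 20121231;
```

With one partition per day, this statement would create at most a few hundred partitions, well under both limits. The raised limits matter when the dynamic column has many more distinct values.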


Another setting controls how many files a DataNode will allow to be open at once. It
must be set in the DataNode’s $HADOOP_HOME/conf/hdfs-site.xml.
In Hadoop v0.20.2, the default value is 256, which is too low. The value affects the
maximum number of threads and resources, so setting it to a very high number is not
recommended. Note also that in Hadoop v0.20.2, changing this variable requires restarting the DataNode to take effect:

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>8192</value>
</property>


Speculative Execution
Speculative execution is a feature of Hadoop that launches a certain number of duplicate tasks. While this consumes more resources computing duplicate copies of data
that may be discarded, the goal of this feature is to improve overall job progress by
getting individual task results faster, and by detecting and then blacklisting slow-running
TaskTrackers.
Hadoop speculative execution is controlled in the $HADOOP_HOME/conf/mapred-site.xml file by the following two variables:

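<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
  <description>If true, then multiple instances of some map tasks
  may be executed in parallel.</description>
</property>

<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
  <description>If true, then multiple instances of some reduce tasks
  may be executed in parallel.</description>
</property>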


However, Hive provides its own variable to control reduce-side speculative execution:

<property>
  <name>hive.mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
  <description>Whether speculative execution for
  reducers should be turned on. </description>
</property>


It is hard to give a concrete recommendation about tuning these speculative execution
variables. If you are very sensitive to deviations in runtime, you may wish to turn these
features on. However, if you have long-running map or reduce tasks due to large
amounts of input, the waste could be significant.
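If you decide the duplicate work is too costly for a particular long-running job, the properties above can be turned off for just that session. This is a sketch that assumes your administrators have not marked the properties as final in the cluster configuration:

```sql
-- Avoid launching duplicate map and reduce tasks for queries in this session.
SET mapred.map.tasks.speculative.execution=false;
SET mapred.reduce.tasks.speculative.execution=false;
SET hive.mapred.reduce.tasks.speculative.execution=false;
```

Setting the values back to true, or starting a new session, restores the earlier behavior.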

Single MapReduce MultiGROUP BY
Another special optimization attempts to combine multiple GROUP BY operations in a
query into a single MapReduce job. For this optimization to work, a common set of
GROUP BY keys is required:

<property>
  <name>hive.multigroupby.singlemr</name>
  <value>false</value>
  <description>Whether to optimize multi group by query to generate single M/R
  job plan. If the multi group by query has common group by keys, it will be
  optimized to generate single M/R job.</description>
</property>
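A query that qualifies is a multi-insert in which every GROUP BY uses the same key. The following sketch shows the shape. The destination tables pixel_hit_counts and pixel_max_planner are hypothetical and would need to exist with matching columns:

```sql
SET hive.multigroupby.singlemr=true;

-- Hypothetical destination tables. Both aggregations group by pixel_id,
-- the common key the optimization needs.
FROM fracture_ins
INSERT OVERWRITE TABLE pixel_hit_counts
  SELECT pixel_id, count(1)
  WHERE hit_date=20120119
  GROUP BY pixel_id
INSERT OVERWRITE TABLE pixel_max_planner
  SELECT pixel_id, max(planner_id)
  WHERE hit_date=20120119
  GROUP BY pixel_id;
```

Comparing the EXPLAIN output with the property set to true and to false is a quick way to check whether the plan actually collapses to a single job in your Hive version.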


Virtual Columns
Hive provides two virtual columns: one for the input filename for split and the other
for the block offset in the file. These are helpful when diagnosing queries where Hive
is producing unexpected or null results. By projecting these “columns,” you can see
which file and row is causing problems:
hive> set hive.exec.rowoffset=true;
hive> SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, line
> FROM hive_text WHERE line LIKE '%hive%' LIMIT 2;
har://file/user/hive/warehouse/hive_text/folder=docs/
data.har/user/hive/warehouse/hive_text/folder=docs/README.txt 2243
http://hive.apache.org/

har://file/user/hive/warehouse/hive_text/folder=docs/
data.har/user/hive/warehouse/hive_text/folder=docs/README.txt 3646
- Hive 0.8.0 ignores the hive-default.xml file, though we continue

(We wrapped the long output and put a blank line between the two output rows.)
A third virtual column provides the row offset of the file. It must be enabled explicitly:

<property>
  <name>hive.exec.rowoffset</name>
  <value>true</value>
  <description>Whether to provide the row offset virtual column</description>
</property>


Now it can be used in queries:
hive> SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE,
> ROW__OFFSET__INSIDE__BLOCK
> FROM hive_text WHERE line LIKE '%hive%' limit 2;
file:/user/hive/warehouse/hive_text/folder=docs/README.txt 2243 0
file:/user/hive/warehouse/hive_text/folder=docs/README.txt 3646 0
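The same projection helps when hunting for bad records. As a sketch, the following lists where rows with NULL in the line column come from. The hive_text table and line column are from the example above, and the IS NULL filter is just one illustrative symptom to search for:

```sql
-- Show the file and block offset of each suspicious row.
SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
FROM hive_text
WHERE line IS NULL
LIMIT 10;
```

Once the file and offset are known, the raw data can be inspected directly with dfs -cat or dfs -text.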


CHAPTER 11

Other File Formats and Compression

One of Hive’s unique features is that Hive does not force data to be converted to a
specific format. Hive leverages Hadoop’s InputFormat APIs to read data from a variety
of sources, such as text files, sequence files, or even custom formats. Likewise, the
OutputFormat API is used to write data to various formats.
While Hadoop offers linear scalability in file storage for uncompressed data, storing
data in compressed form has many benefits. Compression typically saves significant
disk storage; for example, text-based files may compress 40% or more. Compression
also can increase throughput and performance. This may seem counterintuitive because compressing and decompressing data incurs extra CPU overhead. However, the
I/O savings resulting from moving fewer bytes into memory can result in a net performance gain.
Hadoop jobs tend to be I/O bound, rather than CPU bound. If so, compression will
improve performance. However, if your jobs are CPU bound, then compression will
probably lower your performance. The only way to really know is to experiment with
different options and measure the results.

Determining Installed Codecs
Based on your Hadoop version, different codecs will be available to you. The set feature
in Hive can be used to display the value of hiveconf or Hadoop configuration values.
The codecs available are in a comma-separated list named io.compression.codecs:
# hive -e "set io.compression.codecs"
io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec


Choosing a Compression Codec
Using compression has the advantage of minimizing the disk space required for files
and the overhead of disk and network I/O. However, compressing and decompressing
files increases the CPU overhead. Therefore, compression is best used for I/O-bound
jobs, where there is extra CPU capacity, or when disk space is at a premium.
All recent versions of Hadoop have built-in support for the GZip and BZip2 compression schemes, including native Linux libraries that accelerate compression and decompression for these formats. Bundled support for Snappy compression was recently
added, but if your version of Hadoop doesn’t support it, you can add the appropriate
libraries yourself.1 Finally, LZO compression is often used.2
So, why do we need different compression schemes? Each scheme makes a trade-off
between speed and minimizing the size of the compressed output. BZip2 creates the
smallest compressed output, but with the highest CPU overhead. GZip is next in terms
of compressed size versus speed. Hence, if disk space utilization and I/O overhead are
concerns, both are attractive choices.
LZO and Snappy create larger files but are much faster, especially for decompression.
They are good choices if disk space and I/O overhead are less important than rapid
decompression of frequently read data.
Another important consideration is whether or not the compression format is splittable. MapReduce wants to split very large input files into splits (often one split per filesystem block, i.e., a multiple of 64 MB), where each split is sent to a separate map
process. This can only work if Hadoop knows the record boundaries in the file. In text
files, each line is a record, but these boundaries are obscured by GZip and Snappy.
However, BZip2 and LZO provide block-level compression, where each block has
complete records, so Hadoop can split these files on block boundaries.
The desire for splittable files doesn’t rule out GZip and Snappy. When you create your
data files, you could partition them so that they are approximately the desired size.
Typically the number of output files is equal to the number of reducers. If you are using
N reducers you typically get N output files. Be careful: if you have a large nonsplittable
file, a single task will have to read the entire file from beginning to end.
There’s much more we could say about compression, but instead we’ll refer you to
Hadoop: The Definitive Guide by Tom White (O’Reilly) for more details, and we’ll focus
now on how to tell Hive what format you’re using.
1. See http://code.google.com/p/hadoop-snappy/.
2. See http://wiki.apache.org/hadoop/UsingLzoCompression.
From Hive’s point of view, there are two aspects to the file format. One aspect is how
the file is delimited into rows (records). Text files use \n (linefeed) as the default row
delimiter. When you aren’t using the default text file format, you tell Hive the name of
an InputFormat and an OutputFormat to use. Actually, you will specify the names of Java
classes that implement these formats. The InputFormat knows how to read splits and
partition them into records, and the OutputFormat knows how to write these splits back
to files or console output.
The second aspect is how records are partitioned into fields (or columns). Hive uses
^A by default to separate fields in text files. Hive uses the name SerDe, which is short
for serializer/deserializer, for the “module” that partitions incoming records (the deserializer) and also knows how to write records in this format (the serializer). This time
you will specify a single Java class that performs both jobs.
All this information is specified as part of the table definition when you create the table.
After creation, you query the table as you normally would, agnostic to the underlying
format. Hence, if you’re a user of Hive, but not a Java developer, don’t worry about
the Java aspects. The developers on your team will help you specify this information
when needed, after which you’ll work as you normally do.

Enabling Intermediate Compression
Intermediate compression shrinks the data shuffled between the map and reduce tasks
for a job. For intermediate compression, choosing a codec that has lower CPU cost is
typically more important than choosing a codec that results in the most compression.
The property hive.exec.compress.intermediate defaults to false, but it should generally be set to
true in your configuration:

<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
  <description> This controls whether intermediate files produced by Hive between
  multiple map-reduce jobs are compressed. The compression codec and other options
  are determined from hadoop config variables mapred.output.compress* </description>
</property>


The property that controls intermediate compression for other Hadoop
jobs is mapred.compress.map.output.

Hadoop compression has a DefaultCodec. Changing the codec involves setting the
mapred.map.output.compression.codec property. This is a Hadoop variable and can be
set in the $HADOOP_HOME/conf/mapred-site.xml or the $HIVE_HOME/conf/
hive-site.xml. SnappyCodec is a good choice for intermediate compression because it
combines good compression performance with low CPU cost:

<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <description> This controls whether intermediate files produced by Hive
  between multiple map-reduce jobs are compressed. The compression codec
  and other options are determined from hadoop config variables
  mapred.output.compress* </description>
</property>


Final Output Compression
When Hive writes output to a table, that content can also be compressed. The property
hive.exec.compress.output controls this feature. You may wish to leave this value set
to false in the global configuration file, so that the default output is uncompressed
clear text. Users can turn on final compression by setting the property to true on a
query-by-query basis or in their scripts:

<property>
  <name>hive.exec.compress.output</name>
  <value>false</value>
  <description> This controls whether the final outputs of a query
  (to a local/hdfs file or a Hive table) is compressed. The compression
  codec and other options are determined from hadoop config variables
  mapred.output.compress* </description>
</property>


The property that controls final compression for other Hadoop jobs is
mapred.output.compress.

If hive.exec.compress.output is set true, a codec can be chosen. GZip compression is
a good choice for output compression because it typically reduces the size of files significantly, but remember that GZipped files aren’t splittable by subsequent MapReduce
jobs:

<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
  <description>If the job outputs are compressed, how should they be compressed?
  </description>
</property>



Sequence Files
Compressing files results in space savings but one of the downsides of storing raw
compressed files in Hadoop is that often these files are not splittable. Splittable files
can be broken up and processed in parts by multiple mappers in parallel. Most compressed files are not splittable because you can only start reading from the beginning.
The sequence file format supported by Hadoop breaks a file into blocks and then optionally compresses the blocks in a splittable way.


To use sequence files from Hive, add the STORED AS SEQUENCEFILE clause to a CREATE
TABLE statement:
CREATE TABLE a_sequence_file_table STORED AS SEQUENCEFILE;
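In practice the statement also needs a column list, unless it is written as CREATE TABLE ... AS SELECT. As a sketch, with a table name and columns that are hypothetical and chosen only for illustration:

```sql
-- Hypothetical table and columns, for illustration only.
CREATE TABLE page_views_seq (
  user_id STRING,
  url     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE;
```

Queries against such a table look the same as queries against a text table. Only the storage clause differs.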

Sequence files have three different compression options: NONE, RECORD, and BLOCK.
RECORD is the default. However, BLOCK compression is usually more efficient and it still
provides the desired splittability. Like many other compression properties, this one is
not Hive-specific. It can be defined in Hadoop’s mapred-site.xml file, in Hive’s hive-site.xml, or as needed in scripts or before individual queries:

<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
  <description>If the job outputs are to be compressed as SequenceFiles,
  how should they be compressed? Should be one of NONE, RECORD or BLOCK.
  </description>
</property>



Compression in Action
We have introduced a number of compression-related properties in Hive, and different
permutations of these options result in different output. Let’s use these properties in
some examples and show what they produce. Remember that variables set by the CLI
persist across the rest of the queries in the session, so between examples you should
revert the settings or simply restart the Hive session:
hive> SELECT * FROM a;
4    5
3    2
hive> DESCRIBE a;
a    int
b    int

First, let’s enable intermediate compression. This won’t affect the final output, however
the job counters will show less physical data transferred for the job, since the shuffle
sort data was compressed:
hive> set hive.exec.compress.intermediate=true;
hive> CREATE TABLE intermediate_comp_on
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> AS SELECT * FROM a;
Moving data to: file:/user/hive/warehouse/intermediate_comp_on
Table default.intermediate_comp_on stats: [num_partitions: 0, num_files: 1,
num_rows: 2, total_size: 8, raw_data_size: 6]
...

As expected, intermediate compression did not affect the final output, which remains
uncompressed:
hive> dfs -ls /user/hive/warehouse/intermediate_comp_on;
Found 1 items
/user/hive/warehouse/intermediate_comp_on/000000_0
hive> dfs -cat /user/hive/warehouse/intermediate_comp_on/000000_0;
4    5
3    2

We can also choose an intermediate compression codec other than the default codec. In
this case we chose GZip, although Snappy is normally a better option. The first line is
wrapped for space:
hive> set mapred.map.output.compression.codec
=org.apache.hadoop.io.compress.GzipCodec;
hive> set hive.exec.compress.intermediate=true;
hive> CREATE TABLE intermediate_comp_on_gz
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> AS SELECT * FROM a;
Moving data to: file:/user/hive/warehouse/intermediate_comp_on_gz
Table default.intermediate_comp_on_gz stats:
[num_partitions: 0, num_files: 1, num_rows: 2, total_size: 8, raw_data_size: 6]
hive> dfs -cat /user/hive/warehouse/intermediate_comp_on_gz/000000_0;
4    5
3    2

Next, we can enable output compression:
hive> set hive.exec.compress.output=true;
hive> CREATE TABLE final_comp_on
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> AS SELECT * FROM a;
Moving data to: file:/tmp/hive-edward/hive_2012-01-15_11-11-01_884_.../-ext-10001
Moving data to: file:/user/hive/warehouse/final_comp_on
Table default.final_comp_on stats:
[num_partitions: 0, num_files: 1, num_rows: 2, total_size: 16, raw_data_size: 6]
hive> dfs -ls /user/hive/warehouse/final_comp_on;
Found 1 items
/user/hive/warehouse/final_comp_on/000000_0.deflate

The output table statistics show that the total_size is 16, but the raw_data_size is 6.
The extra space is overhead for the deflate algorithm. We can also see that the output file
has the .deflate extension.
Trying to cat the file is not suggested, as you get binary output. However, Hive can
query this data normally:
hive> dfs -cat /user/hive/warehouse/final_comp_on/000000_0.deflate;
... UGLYBINARYHERE ...
hive> SELECT * FROM final_comp_on;
4    5
3    2

This ability to seamlessly work with compressed files is not Hive-specific; Hadoop’s
TextInputFormat is at work here. While the name is confusing in this case, TextInput
Format understands file extensions such as .deflate or .gz and decompresses these files
on the fly. Hive is unaware if the underlying files are uncompressed or compressed
using any of the supported compression schemes.
Let’s change the codec used by output compression to see the results (another line wrap
for space):
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec
=org.apache.hadoop.io.compress.GzipCodec;
hive> CREATE TABLE final_comp_on_gz
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> AS SELECT * FROM a;
Moving data to: file:/user/hive/warehouse/final_comp_on_gz
Table default.final_comp_on_gz stats:
[num_partitions: 0, num_files: 1, num_rows: 2, total_size: 28, raw_data_size: 6]
hive> dfs -ls /user/hive/warehouse/final_comp_on_gz;
Found 1 items
/user/hive/warehouse/final_comp_on_gz/000000_0.gz

As you can see, the output folder now contains a .gz file. Hive has a quick
hack to execute local commands like zcat from inside the Hive shell. The ! tells Hive
to fork and run the external command and block until the system returns a result.
zcat is a command-line utility that decompresses and displays output:
hive> ! /bin/zcat /user/hive/warehouse/final_comp_on_gz/000000_0.gz;
4    5
3    2
hive> SELECT * FROM final_comp_on_gz;
OK
4    5
3    2
Time taken: 0.159 seconds

Using output compression like this results in binary compressed files that are small
and, as a result, operations on them are very fast. However, recall that the number of
output files is a side effect of how many mappers or reducers processed the data. In the
worst case scenario, you can end up with one large binary file in a directory that is not
splittable. This means that subsequent steps that have to read this data cannot work
in parallel. The answer to this problem is to use sequence files:
hive> set mapred.output.compression.type=BLOCK;
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> CREATE TABLE final_comp_on_gz_seq
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS SEQUENCEFILE
> AS SELECT * FROM a;
Moving data to: file:/user/hive/warehouse/final_comp_on_gz_seq
Table default.final_comp_on_gz_seq stats:
[num_partitions: 0, num_files: 1, num_rows: 2, total_size: 199, raw_data_size: 6]
hive> dfs -ls /user/hive/warehouse/final_comp_on_gz_seq;
Found 1 items
/user/hive/warehouse/final_comp_on_gz_seq/000000_0

Sequence files are binary, but it is a nice exercise to look at the header to confirm that the
results are what was intended (output wrapped):
hive> dfs -cat /user/hive/warehouse/final_comp_on_gz_seq/000000_0;
SEQ[]org.apache.hadoop.io.BytesWritable[]org.apache.hadoop.io.BytesWritable[]
org.apache.hadoop.io.compress.GzipCodec[]

Because of the meta-information embedded in the sequence file and in the Hive metastore, Hive can query the table without any specific settings. Hadoop also offers the
dfs -text command to strip the header and compression away from sequence files and
return the raw result:
hive> dfs -text /user/hive/warehouse/final_comp_on_gz_seq/000000_0;
4    5
3    2
hive> select * from final_comp_on_gz_seq;
OK
4    5
3    2

Finally, let’s use intermediate and output compression at the same time and set different
compression codecs for each while saving the final output to sequence files! These
settings are commonly done for production environments where data sets are large and
such settings improve performance:
hive> set mapred.map.output.compression.codec
=org.apache.hadoop.io.compress.SnappyCodec;
hive> set hive.exec.compress.intermediate=true;
hive> set mapred.output.compression.type=BLOCK;
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec
=org.apache.hadoop.io.compress.GzipCodec;
hive> CREATE TABLE final_comp_on_gz_int_compress_snappy_seq
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS SEQUENCEFILE AS SELECT * FROM a;
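Because these settings persist for the rest of the session, a script that continues with other work may want to undo them afterward. A minimal sketch that switches both kinds of compression back off, using the false values discussed earlier in this chapter:

```sql
-- Turn intermediate and final output compression back off.
SET hive.exec.compress.intermediate=false;
SET hive.exec.compress.output=false;

-- Confirm the current values; SET with only a property name prints it.
SET hive.exec.compress.intermediate;
SET hive.exec.compress.output;
```

Restarting the Hive session has the same effect, as noted at the start of this section.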

Archive Partition
Hadoop has a format for storage known as HAR, which stands for Hadoop ARchive. A
HAR file is like a TAR file that lives in the HDFS filesystem as a single file. However,
internally it can contain multiple files and directories. In some use cases, older directories and files are less commonly accessed than newer files. If a particular partition
contains thousands of files, it will require significant overhead to manage it in the HDFS
NameNode. By archiving the partition, it is stored as a single, large file, but it can still
be accessed by Hive. The trade-off is that HAR files will be less efficient to query. Also,
HAR files are not compressed, so they don’t save any space.
In the following example, we’ll use Hive’s own documentation as data.
First, create a partitioned table and load it with the text data from the Hive package:
hive> CREATE TABLE hive_text (line STRING) PARTITIONED BY (folder STRING);
hive> ! ls $HIVE_HOME;
LICENSE
README.txt
RELEASE_NOTES.txt
hive> ALTER TABLE hive_text ADD PARTITION (folder='docs');
hive> LOAD DATA INPATH '${env:HIVE_HOME}/README.txt'
> INTO TABLE hive_text PARTITION (folder='docs');
Loading data to table default.hive_text partition (folder=docs)
hive> LOAD DATA INPATH '${env:HIVE_HOME}/RELEASE_NOTES.txt'
> INTO TABLE hive_text PARTITION (folder='docs');
Loading data to table default.hive_text partition (folder=docs)
hive> SELECT * FROM hive_text WHERE line LIKE '%hive%' LIMIT 2;
http://hive.apache.org/    docs
- Hive 0.8.0 ignores the hive-default.xml file, though we continue    docs

Some versions of Hadoop, such as Hadoop v0.20.2, will require the JAR containing
the Hadoop archive tools to be placed on the Hive auxlib:
$ mkdir $HIVE_HOME/auxlib
$ cp $HADOOP_HOME/hadoop-0.20.2-tools.jar $HIVE_HOME/auxlib/

Take a look at the underlying structure of the table, before we archive it. Note the
location of the table’s data partition, since it’s a managed, partitioned table:
hive> dfs -ls /user/hive/warehouse/hive_text/folder=docs;
Found 2 items
/user/hive/warehouse/hive_text/folder=docs/README.txt
/user/hive/warehouse/hive_text/folder=docs/RELEASE_NOTES.txt

The ALTER TABLE ... ARCHIVE PARTITION statement converts the partition into an archived
partition:
hive> SET hive.archive.enabled=true;
hive> ALTER TABLE hive_text ARCHIVE PARTITION (folder='docs');
intermediate.archived is
file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ARCHIVED
intermediate.original is
file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ORIGINAL
Creating data.har for file:/user/hive/warehouse/hive_text/folder=docs
in file:/tmp/hive-edward/hive_..._3862901820512961909/-ext-10000/partlevel
Please wait... (this may take a while)
Moving file:/tmp/hive-edward/hive_..._3862901820512961909/-ext-10000/partlevel
to file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ARCHIVED
Moving file:/user/hive/warehouse/hive_text/folder=docs
to file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ORIGINAL
Moving file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ARCHIVED
to file:/user/hive/warehouse/hive_text/folder=docs

(We reformatted the output slightly so it would fit, and used ... to replace two timestamp strings in the original output.)
The underlying table has gone from two files to one Hadoop archive (HAR file):
hive> dfs -ls /user/hive/warehouse/hive_text/folder=docs;
Found 1 items
/user/hive/warehouse/hive_text/folder=docs/data.har

The ALTER TABLE ... UNARCHIVE PARTITION command extracts the files from the HAR
and puts them back into HDFS:
hive> ALTER TABLE hive_text UNARCHIVE PARTITION (folder='docs');

Compression: Wrapping Up
Hive’s ability to read and write different types of compressed files is a big performance
win as it saves disk space and processing overhead. This flexibility also aids in integration with other tools, as Hive can query many native file types without the need to write
custom “adapters” in Java.

CHAPTER 12

Developing

Hive won’t provide everything you could possibly need. Sometimes a third-party library
will fill a gap. At other times, you or someone else who is a Java developer will need to
write user-defined functions (UDFs; see Chapter 13), SerDes (see “Record Formats:
SerDes” on page 205), input and/or output formats (see Chapter 15), or other
enhancements.
This chapter explores working with the Hive source code itself, including the new
Plugin Developer Kit introduced in Hive v0.8.0.

Changing Log4J Properties
Hive can be configured with two separate Log4J configuration files found in
$HIVE_HOME/conf. The hive-log4j.properties file controls the logging of the CLI or
other locally launched components. The hive-exec-log4j.properties file controls the logging inside the MapReduce tasks. These files do not need to be present inside the Hive
installation because the default properties come built inside the Hive JARs. In fact, the
actual files in the conf directory have the .template extension, so they are ignored by
default. To use either of them, copy it with a name that removes the .template extension
and edit it to taste:
$ cp conf/hive-log4j.properties.template conf/hive-log4j.properties
$ ... edit file ...

It is also possible to change the logging configuration of Hive temporarily without
copying and editing the Log4J files. The -hiveconf switch can be specified on start-up
with definitions of any properties in the hive-log4j.properties file. For example, here we set
the default logger to the DEBUG level and send output to the console appender:
$ bin/hive -hiveconf hive.root.logger=DEBUG,console
12/03/27 08:46:01 WARN conf.HiveConf: hive-site.xml not found on CLASSPATH
12/03/27 08:46:01 DEBUG conf.Configuration: java.io.IOException: config()

Connecting a Java Debugger to Hive
When enabling more verbose output does not help find the solution to the problem
you are troubleshooting, attaching a Java debugger will give you the ability to step
through the Hive code and hopefully find the problem.
Remote debugging is a feature of Java that is manually enabled by setting specific command-line properties for the JVM. The Hive shell script provides a switch and help
screen that makes it easy to set these properties (some output truncated for space):
$ bin/hive --help --debug
Allows to debug Hive by connecting to it via JDI API
Usage: hive --debug[:comma-separated parameters list]
Parameters:
recursive=     Should child JVMs also be started in debug mode. Default: y
port=          Port on which main JVM listens for debug connection. Defaul...
mainSuspend=   Should main JVM wait with execution for the debugger to con...
childSuspend=  Should child JVMs wait with execution for the debugger to c...
swapSuspend    Swaps suspend options between main and child JVMs
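The help text does not show which JVM properties the script sets. As a rough illustration of the underlying Java mechanism only, the sketch below builds the standard JDWP agent option and checks whether the current JVM was started with it. The class and method names are hypothetical, port 8000 is an arbitrary example value, and nothing here is taken from the Hive script itself.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Sketch of the JVM options behind remote debugging (illustrative names only). */
final class DebugAgentSketch {

    /** Builds the standard JDWP agent option for a JVM that listens for a debugger. */
    static String jdwpOption(int port, boolean suspend) {
        return "-agentlib:jdwp=transport=dt_socket,server=y,suspend="
                + (suspend ? "y" : "n") + ",address=" + port;
    }

    /** Returns true if any JVM argument in the list loads the JDWP agent. */
    static boolean hasDebugAgent(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith("-agentlib:jdwp") || arg.startsWith("-Xrunjdwp")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Port 8000 is an arbitrary example value, not a documented default.
        System.out.println(jdwpOption(8000, true));
        // Inspect the arguments the current JVM was actually started with.
        List<String> actual = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("debug agent present: " + hasDebugAgent(actual));
    }
}
```

With suspend set to y, the JVM waits for a debugger to attach before running, which is the behavior the mainSuspend and childSuspend parameters describe.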

Building Hive from Source
Running Apache releases is usually a good idea. However, you may wish to use features
that are not part of a release, or have an internal branch with nonpublic customizations.
In those cases, you’ll need to build Hive from source. The minimum requirements for building
Hive are a recent Java JDK, Subversion, and ANT. Hive also contains components, such
as Thrift-generated classes, that are not built by default. Rebuilding those components
requires a Thrift compiler, too.
The following commands check out the Hive trunk and build it, producing output in
the hive-trunk/build/dist directory:
$ svn co http://svn.apache.org/repos/asf/hive/trunk hive-trunk
$ cd hive-trunk
$ ant package
$ ls build/dist/
bin   examples  LICENSE  README.txt         scripts
conf  lib       NOTICE   RELEASE_NOTES.txt

Running Hive Test Cases
Hive has a unique built-in infrastructure for testing. Hive does have traditional JUnit
tests; however, the majority of the testing happens by running queries saved in .q files,
then comparing the results with a previous run saved in the Hive source. (That is, they are
more like feature or acceptance tests.) There are multiple directories of test queries
inside the Hive source folder. “Positive” tests are those that should pass,
while “negative” tests should fail.
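The compare-with-a-saved-run idea can be sketched in a few lines of plain Java. This is not Hive's test driver; the class and method names are hypothetical, and the sketch only shows the line-by-line comparison such a test relies on.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of a saved-output comparison, the idea behind .q and .q.out files. */
final class GoldenFileSketch {

    /**
     * Returns the index of the first line that differs between the saved
     * (expected) output and the output of the current run, or -1 if they match.
     */
    static int firstDifference(List<String> expected, List<String> actual) {
        int common = Math.min(expected.size(), actual.size());
        for (int i = 0; i < common; i++) {
            if (!expected.get(i).equals(actual.get(i))) {
                return i;
            }
        }
        // One output is a prefix of the other: the first extra line is the difference.
        return expected.size() == actual.size() ? -1 : common;
    }

    public static void main(String[] args) {
        List<String> saved = Arrays.asList("5", "5.0", "5.0");
        List<String> current = Arrays.asList("5", "5.0", "6.0");
        System.out.println("first differing line: " + firstDifference(saved, current));
    }
}
```

A test of this style passes when the method returns -1 and otherwise reports the first line where the new run diverges from the saved one.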
An example of a positive test is a well-formed query. An example of a negative test is a
query that is malformed or tries doing something that is not allowed by HiveQL:
$ ls -lah ql/src/test/queries/
total 76K
drwxrwxr-x. 7 edward edward 4.0K May 28  2011 .
drwxrwxr-x. 8 edward edward 4.0K May 28  2011 ..
drwxrwxr-x. 3 edward edward  20K Feb 21 20:08 clientnegative
drwxrwxr-x. 3 edward edward  36K Mar  8 09:17 clientpositive
drwxrwxr-x. 3 edward edward 4.0K May 28  2011 negative
drwxrwxr-x. 3 edward edward 4.0K Mar 12 09:25 positive

Take a look at ql/src/test/queries/clientpositive/cast1.q. The first thing you should know
is that a src table is the first table automatically created in the test process. It is a table
with two columns, key and value, both of type STRING. Because
Hive does not currently have the ability to do a SELECT without a FROM clause, selecting
a single row from the src table is the trick used to test out functions that don’t really
need to retrieve table data; inputs can be “hard-coded” instead.
As you can see in the following example queries, the src table is never referenced in the
SELECT clauses:
hive> CREATE TABLE dest1(c1 INT, c2 DOUBLE, c3 DOUBLE,
> c4 DOUBLE, c5 INT, c6 STRING, c7 INT) STORED AS TEXTFILE;
hive> EXPLAIN
> FROM src INSERT OVERWRITE TABLE dest1
> SELECT 3 + 2, 3.0 + 2, 3 + 2.0, 3.0 + 2.0,
> 3 + CAST(2.0 AS INT) + CAST(CAST(0 AS SMALLINT) AS INT),
> CAST(1 AS BOOLEAN), CAST(TRUE AS INT) WHERE src.key = 86;

hive> FROM src INSERT OVERWRITE TABLE dest1
> SELECT 3 + 2, 3.0 + 2, 3 + 2.0, 3.0 + 2.0,
> 3 + CAST(2.0 AS INT) + CAST(CAST(0 AS SMALLINT) AS INT),
> CAST(1 AS BOOLEAN), CAST(TRUE AS INT) WHERE src.key = 86;

hive> SELECT dest1.* FROM dest1;

The results of the script are found here: ql/src/test/results/clientpositive/cast1.q.out. The
result file is large and printing the complete results inline will kill too many trees. However, portions of the file are worth noting.
These commands invoke a positive and a negative test case for the Hive client:
ant test -Dtestcase=TestCliDriver -Dqfile=mapreduce1.q
ant test -Dtestcase=TestNegativeCliDriver -Dqfile=script_broken_pipe1.q

The tests in the positive and negative directories only parse queries. They do not
actually run the client. They are now deprecated in favor of clientpositive and clientnegative.

You can also run multiple tests in one ant invocation to save time (the last -Dqfile=…
string was wrapped for space; it’s all one string):
ant test -Dtestcase=TestCliDriver -Dqfile=avro_change_schema.q,avro_joins.q,
avro_schema_error_message.q,avro_evolved_schemas.q,avro_sanity_test.q,
avro_schema_literal.q

Execution Hooks
PreHooks and PostHooks are utilities that allow user code to hook into parts of Hive
and execute custom code. Hive’s testing framework uses hooks to echo commands that
produce no output, so that the results show up inside tests:
PREHOOK: query: CREATE TABLE dest1(c1 INT, c2 DOUBLE, c3 DOUBLE,
c4 DOUBLE, c5 INT, c6 STRING, c7 INT) STORED AS TEXTFILE
PREHOOK: type: CREATETABLE
POSTHOOK: query: CREATE TABLE dest1(c1 INT, c2 DOUBLE, c3 DOUBLE,
c4 DOUBLE, c5 INT, c6 STRING, c7 INT) STORED AS TEXTFILE
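The pattern behind that output can be sketched without any Hive classes. The interface and class below are hypothetical stand-ins, not Hive's hook API. They only show callbacks running before and after a statement and echoing it, which is how a statement with no output of its own still leaves a trace in a test log.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of pre- and post-execution hooks (hypothetical types, not Hive's API). */
final class HookSketch {

    /** Hypothetical callback type; Hive's real hook interfaces differ. */
    interface QueryHook {
        void run(String query);
    }

    private final List<QueryHook> preHooks = new ArrayList<>();
    private final List<QueryHook> postHooks = new ArrayList<>();

    void addPreHook(QueryHook hook) { preHooks.add(hook); }
    void addPostHook(QueryHook hook) { postHooks.add(hook); }

    /** Runs every pre-hook, then the statement body, then every post-hook. */
    void execute(String query, Runnable body) {
        for (QueryHook hook : preHooks) hook.run(query);
        body.run();
        for (QueryHook hook : postHooks) hook.run(query);
    }

    /** Registers echo hooks and returns the lines they record for one statement. */
    static List<String> runWithEchoHooks(String query) {
        List<String> log = new ArrayList<>();
        HookSketch runner = new HookSketch();
        runner.addPreHook(q -> log.add("PREHOOK: query: " + q));
        runner.addPostHook(q -> log.add("POSTHOOK: query: " + q));
        runner.execute(query, () -> { /* a statement that prints nothing */ });
        return log;
    }

    public static void main(String[] args) {
        for (String line : runWithEchoHooks("CREATE TABLE t (c INT)")) {
            System.out.println(line);
        }
    }
}
```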

Setting Up Hive and Eclipse
Eclipse is an open source IDE (Integrated Development Environment). The following
steps allow you to use Eclipse to work with the Hive source code:
$ ant clean package eclipse-files
$ cd metastore
$ ant model-jar
$ cd ../ql
$ ant gen-test

Once built, you can import the project into Eclipse and use it as you normally would.
Create a workspace in Eclipse, as normal. Then use the File → Import command and
then select General → Existing Projects into Workspace. Select the directory where Hive
is installed.
When the list of available projects is shown in the wizard, you’ll see one named hive-trunk, which you should select and click Finish.
Figure 12-1 shows how to start the Hive Command CLI Driver from within Eclipse.

Figure 12-1. Starting the Hive Command CLI Driver from within Eclipse

Hive in a Maven Project
You can set up Hive as a dependency in Maven builds. The Maven repository
http://mvnrepository.com/artifact/org.apache.hive/hive-service contains the most recent
releases. This page also lists the dependencies hive-service requires.
Here is the top-level dependency definition for Hive v0.9.0, not including the tree of
transitive dependencies, which is quite deep:
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-service</artifactId>
  <version>0.9.0</version>
</dependency>


The pom.xml file for hive_test, which we discuss next, provides a complete example
of the transitive dependencies for Hive v0.9.0. You can find that file at
https://github.com/edwardcapriolo/hive_test/blob/master/pom.xml.

Unit Testing in Hive with hive_test
The optimal way to write applications to work with Hive is to access Hive with Thrift
through the HiveService. However, the Thrift service was traditionally difficult to bring
up in an embedded environment due to Hive’s many JAR dependencies and the metastore component.
Hive_test fetches all the Hive dependencies from Maven, sets up the metastore and
Thrift service locally, and provides test classes to make unit testing easier. It is also
very lightweight, so unit tests run quickly. This is in contrast to the elaborate test
targets inside Hive, which have to rebuild the entire project to execute any unit test.
Hive_test is ideal for testing code such as UDFs, input formats, SerDes, or any component
that only adds a pluggable feature for the language. It is not useful for internal
Hive development because all the Hive components are pulled from Maven and are
external to the project.
In your Maven project, create a pom.xml and include hive_test as a dependency, as
shown here:

<dependency>
  <groupId>com.jointhegrid</groupId>
  <artifactId>hive_test</artifactId>
  <version>3.0.1-SNAPSHOT</version>
</dependency>


Then create a version of hive-site.xml:
$ cp $HIVE_HOME/conf/* src/test/resources/
$ vi src/test/resources/hive-site.xml

Unlike a normal hive-site.xml, this version should not save any data to a
permanent place. This is because unit tests are not supposed to create or preserve any
permanent state. javax.jdo.option.ConnectionURL is set to use a feature in Derby that
only stores the database in main memory. The warehouse directory
hive.metastore.warehouse.dir is set to a location inside /tmp that will be deleted on each
run of the unit test:


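<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:memory:metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/tmp/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
</configuration>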



Hive_test provides several classes that extend JUnit test cases. HiveTestService sets up
the environment, clears out the warehouse directory, and launches a metastore and
HiveService in-process. This is typically the component to extend for testing. However,
other components, such as HiveTestEmbedded, are also available:
package com.jointhegrid.hive_test;

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;

/* Extending HiveTestService creates and initializes
   the metastore and thrift service in an embedded mode */
public class ServiceHiveTest extends HiveTestService {

  public ServiceHiveTest() throws IOException {
    super();
  }

  public void testExecute() throws Exception {
    /* Use the Hadoop filesystem API to create a
       data file */
    Path p = new Path(this.ROOT_DIR, "afile");
    FSDataOutputStream o = this.getFileSystem().create(p);
    BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(o));
    bw.write("1\n");
    bw.write("2\n");
    bw.close();

    /* ServiceHive is a component that connects
       to an embedded or network HiveService based
       on the constructor used */
    ServiceHive sh = new ServiceHive();

    /* We can now interact through the HiveService
       and assert on results */
    sh.client.execute("create table atest (num int)");
    sh.client.execute("load data local inpath '"
        + p.toString() + "' into table atest");
    sh.client.execute("select count(1) as cnt from atest");
    String row = sh.client.fetchOne();
    assertEquals("2", row);
    sh.client.execute("drop table atest");
  }
}

The New Plugin Developer Kit
Hive v0.8.0 introduced a Plugin Developer Kit (PDK). Its intent is to allow developers
to build and test plug-ins without the Hive source. Only Hive binary code is required.
The PDK is relatively new and has some subtle bugs of its own that can make it difficult
to use. If you want to try using the PDK anyway, consult the wiki page,
https://cwiki.apache.org/Hive/plugindeveloperkit.html, but note that this page has a few
errors, at least at the time of this writing.

CHAPTER 13

Functions

User-Defined Functions (UDFs) are a powerful feature that allows users to extend
HiveQL. As we’ll see, you implement them in Java and once you add them to your
session (interactive or driven by a script), they work just like built-in functions, right
down to the online help. Hive has several types of user-defined functions, each of which
performs a particular “class” of transformations on input data.
In an ETL workload, a process might have several processing steps. The Hive language
has multiple ways to pipeline the output from one step to the next and produce multiple
outputs during a single query. Users also have the ability to create their own functions
for custom processing. Without this feature a process might have to include a custom
MapReduce step or move the data into another system to apply the changes. Interconnecting
systems adds complexity and increases the chance of misconfigurations or other
errors. Moving data between systems is time consuming when dealing with gigabyte- or
terabyte-sized data sets. In contrast, UDFs run in the same processes as the tasks for
your Hive queries, so they work efficiently and eliminate the complexity of integration
with other systems. This chapter covers best practices associated with creating and
using UDFs.

Discovering and Describing Functions
Before writing custom UDFs, let’s familiarize ourselves with the ones that are already
part of Hive. Note that it’s common in the Hive community to use “UDF” to refer to
any function, user-defined or built-in.
The SHOW FUNCTIONS command lists the functions currently loaded in the Hive session,
both built-in and any user-defined functions that have been loaded using the techniques
we will discuss shortly:
hive> SHOW FUNCTIONS;
abs
acos
and
array
array_contains
...

Functions usually have their own documentation. Use DESCRIBE FUNCTION to display a
short description:
hive> DESCRIBE FUNCTION concat;
concat(str1, str2, ... strN) - returns the concatenation of str1, str2, ... strN

Functions may also contain extended documentation that can be accessed by adding
the EXTENDED keyword:
hive> DESCRIBE FUNCTION EXTENDED concat;
concat(str1, str2, ... strN) - returns the concatenation of str1, str2, ... strN
Returns NULL if any argument is NULL.
Example:
> SELECT concat('abc', 'def') FROM src LIMIT 1;
'abcdef'

Calling Functions
To use a function, simply call it by name in a query, passing in any required arguments.
Some functions take a specific number of arguments and argument types, while other
functions accept a variable number of arguments with variable types. Just like keywords, the case of function names is ignored:
SELECT concat(column1,column2) AS x FROM table;

Standard Functions
The term user-defined function (UDF) is also used in a narrower sense to refer to any
function that takes a row argument or one or more columns from a row and returns a
single value. Most functions fall into this category.
Examples include many of the mathematical functions, like round() and floor(), for
converting DOUBLES to BIGINTS, and abs(), for taking the absolute value of a number.
Other examples include string manipulation functions, like ucase(), which converts
the string to upper case; reverse(), which reverses a string; and concat(), which joins
multiple input strings into one output string.
Note that these UDFs can return a complex object, such as an array, map, or struct.

Aggregate Functions
Another type of function is an aggregate function. All aggregate functions, user-defined
and built-in, are referred to generically as user-defined aggregate functions (UDAFs).
An aggregate function takes one or more columns from zero to many rows and returns
a single result. Examples include the math functions: sum(), which returns a sum of all
inputs; avg(), which computes the average of the values; and min() and max(), which return
the lowest and highest values, respectively:
hive> SELECT avg(price_close)
> FROM stocks
> WHERE exchange = 'NASDAQ' AND symbol = 'AAPL';

Aggregate functions are often combined with GROUP BY clauses. We saw this example in
“GROUP BY Clauses” on page 97:
hive> SELECT year(ymd), avg(price_close) FROM stocks
> WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
> GROUP BY year(ymd);
1984	25.578625440597534
1985	20.193676221040867
1986	32.46102808021274
...

Table 6-3 in Chapter 6 lists the built-in aggregate functions in HiveQL.

Table Generating Functions
A third type of function supported by Hive is a table generating function. As for the
other function kinds, all table generating functions, user-defined and built-in, are often
referred to generically as user-defined table generating functions (UDTFs).
Table generating functions take zero or more inputs and produce multiple columns or
rows of output. The array function takes a list of arguments and returns the list as a
single array type. Suppose we start with this query using an array:
hive> SELECT array(1,2,3) FROM dual;
[1,2,3]

The explode() function is a UDTF that takes an array of input and iterates through the
list, returning each element from the list in a separate row.
hive> SELECT explode(array(1,2,3)) AS element FROM src;
1
2
3

However, Hive only allows table generating functions to be used in limited ways. For
example, we can’t project out any other columns from the table, a significant limitation.
Here is a query we would like to write with the employees table we have used before.
We want to list each manager-subordinate pair.
Example 13-1. Invalid use of explode
hive> SELECT name, explode(subordinates) FROM employees;
FAILED: Error in semantic analysis: UDTF's are not supported outside
the SELECT clause, nor nested in expressions

However, Hive offers a LATERAL VIEW feature to allow this kind of query:

hive> SELECT name, sub
> FROM employees
> LATERAL VIEW explode(subordinates) subView AS sub;
John Doe	Mary Smith
John Doe	Todd Jones
Mary Smith	Bill King

Note that there are no output rows for employees who aren’t managers (i.e., who have
no subordinates), namely Bill King and Todd Jones. Hence, explode outputs zero to
many new records.
The LATERAL VIEW wraps the output of the explode call. A view alias and column alias
are required, subView and sub, respectively, in this case.
The list of built-in, table generating functions can be found in Table 6-4 in Chapter 6.
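To make the row-multiplying behavior concrete, here is a minimal plain-Java sketch of what the LATERAL VIEW explode query computes. It is an analogy rather than Hive code: the class and method names are hypothetical, and the sample data simply mirrors the four employees in the output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java analogy for SELECT name, sub ... LATERAL VIEW explode(subordinates). */
final class LateralViewSketch {

    /** Emits one "name<TAB>subordinate" row per array element; empty arrays emit nothing. */
    static List<String> explodePairs(Map<String, List<String>> subordinatesByName) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, List<String>> employee : subordinatesByName.entrySet()) {
            for (String sub : employee.getValue()) {
                rows.add(employee.getKey() + "\t" + sub);
            }
        }
        return rows;
    }

    /** Sample data mirroring the employees in the query output above. */
    static Map<String, List<String>> sampleEmployees() {
        Map<String, List<String>> employees = new LinkedHashMap<>();
        employees.put("John Doe", Arrays.asList("Mary Smith", "Todd Jones"));
        employees.put("Mary Smith", Arrays.asList("Bill King"));
        employees.put("Todd Jones", new ArrayList<String>());
        employees.put("Bill King", new ArrayList<String>());
        return employees;
    }

    public static void main(String[] args) {
        for (String row : explodePairs(sampleEmployees())) {
            System.out.println(row);
        }
    }
}
```

The inner loop is the reason employees with an empty subordinates array contribute no rows.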

A UDF for Finding a Zodiac Sign from a Day
Let’s tackle writing our own UDF. Imagine we have a table with each user’s birth date
stored as a column of a table. With that information, we would like to determine the
user’s Zodiac sign. This process can be implemented with a standard function (UDF
in the most restrictive sense). Specifically, we assume we have a discrete input either as
a date formatted as a string or as a month and a day. The function must return a discrete
single column of output.
Here is a sample data set, which we’ll put in a file called littlebigdata.txt in our home
directory:
edward capriolo,edward@media6degrees.com,2-12-1981,209.191.139.200,M,10
bob,bob@test.net,10-10-2004,10.10.10.1,M,50
sara connor,sara@sky.net,4-5-1974,64.64.5.1,F,2

Load this data set into a table called littlebigdata:
hive> CREATE TABLE IF NOT EXISTS littlebigdata(
> name STRING,
> email STRING,
> bday STRING,
> ip STRING,
> gender STRING,
> anum INT)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH '${env:HOME}/littlebigdata.txt'
> INTO TABLE littlebigdata;

The input for the function will be a date and the output will be a string representing
the user’s Zodiac sign.
Here is a Java implementation of the UDF we need:
package org.apache.hadoop.hive.contrib.udf.example;

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;

@Description(name = "zodiac",
  value = "_FUNC_(date) - from the input date string "+
    "or separate month and day arguments, returns the sign of the Zodiac.",
  extended = "Example:\n"
    + " > SELECT _FUNC_(date_string) FROM src;\n"
    + " > SELECT _FUNC_(month, day) FROM src;")
public class UDFZodiacSign extends UDF {

  private SimpleDateFormat df;

  public UDFZodiacSign() {
    df = new SimpleDateFormat("MM-dd-yyyy");
  }

  public String evaluate(Date bday) {
    // getMonth() is zero-based and getDate() is the day of the month
    // (getDay() would return the day of the week instead).
    return this.evaluate(bday.getMonth() + 1, bday.getDate());
  }

  public String evaluate(String bday) {
    Date date = null;
    try {
      date = df.parse(bday);
    } catch (Exception ex) {
      // Unparseable input yields NULL rather than failing the query.
      return null;
    }
    return this.evaluate(date.getMonth() + 1, date.getDate());
  }

  public String evaluate(Integer month, Integer day) {
    if (month == 1) {
      if (day < 20) {
        return "Capricorn";
      } else {
        return "Aquarius";
      }
    }
    if (month == 2) {
      if (day < 19) {
        return "Aquarius";
      } else {
        return "Pisces";
      }
    }
    /* ...other months here */
    return null;
  }
}
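The month and day logic can be tried outside Hive with a standalone sketch like the one below. It is not the UDF itself: the class name is hypothetical, it uses Calendar instead of the deprecated Date accessors (note that Date.getDay() returns the day of the week, not the day of the month), and the cutoff days for March through December are commonly quoted values that should be treated as assumptions, since the listing above elides them.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;

/** Standalone sketch of the Zodiac lookup (hypothetical class, assumed cutoffs). */
final class ZodiacSketch {

    // Day of each month (January first) on which the next sign begins.
    // January and February match the listing; the rest are assumptions.
    private static final int[] NEXT_SIGN_STARTS =
        {20, 19, 21, 20, 21, 21, 23, 23, 23, 23, 22, 22};

    // Sign in effect on the first day of each month, January first.
    private static final String[] SIGN_AT_START_OF_MONTH = {
        "Capricorn", "Aquarius", "Pisces", "Aries", "Taurus", "Gemini",
        "Cancer", "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius"};

    /** Looks up the sign for a 1-based month and a day of the month. */
    static String sign(int month, int day) {
        if (month < 1 || month > 12) {
            return null;
        }
        return day < NEXT_SIGN_STARTS[month - 1]
            ? SIGN_AT_START_OF_MONTH[month - 1]
            : SIGN_AT_START_OF_MONTH[month % 12];
    }

    /** Parses an MM-dd-yyyy string and returns its sign, or null if unparseable. */
    static String sign(String bday) {
        try {
            Calendar cal = Calendar.getInstance();
            cal.setTime(new SimpleDateFormat("MM-dd-yyyy").parse(bday));
            // Calendar.MONTH is zero-based; DAY_OF_MONTH is the calendar day.
            return sign(cal.get(Calendar.MONTH) + 1, cal.get(Calendar.DAY_OF_MONTH));
        } catch (ParseException ex) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(sign("2-12-1981"));
        System.out.println(sign("10-10-2004"));
        System.out.println(sign("4-5-1974"));
    }
}
```

The three dates in main are the birthdays from littlebigdata.txt.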

To write a UDF, start by extending the UDF class and implementing the evaluate()
function. During query processing, an instance of the class is instantiated for each usage
of the function in a query. The evaluate() method is called for each input row. The result of
evaluate() is returned to Hive. It is legal to overload the evaluate method. Hive will
pick the method that matches in a similar way to Java method overloading.
The @Description(...) is an optional Java annotation. This is how function documentation is defined and you should use these annotations to document your own UDFs.
When a user invokes DESCRIBE FUNCTION ..., the _FUNC_ strings will be replaced with
the function name the user picks when defining a “temporary” function, as discussed
below.
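The placeholder substitution amounts to a plain string replacement. The sketch below shows the effect under that assumption. It is not Hive's implementation, and the class and method names are hypothetical.

```java
/** Sketch of expanding the _FUNC_ placeholder in a function description. */
final class FuncPlaceholderSketch {

    /** Replaces every _FUNC_ occurrence with the name the function was registered under. */
    static String render(String template, String functionName) {
        return template.replace("_FUNC_", functionName);
    }

    public static void main(String[] args) {
        String value = "_FUNC_(date) - from the input date string "
            + "or separate month and day arguments, returns the sign of the Zodiac.";
        // The same template adapts to whatever name the user picks.
        System.out.println(render(value, "zodiac"));
        System.out.println(render(value, "star_sign"));
    }
}
```

Because the name is filled in at display time, the same JAR can be registered under different function names without editing the annotation.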
The arguments and return types of the UDF’s evaluate() function can
only be types that Hive can serialize. For example, if you are working
with whole numbers, a UDF can take as input a primitive int, an Integer
wrapper object, or an IntWritable, which is the Hadoop wrapper
for integers. You do not have to worry specifically about what the caller
is sending because Hive will convert the types for you if they do not
match. Remember that null is valid for any type in Hive, but in Java
primitives are not objects and cannot be null.
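The difference between wrapper and primitive parameters is easy to see in plain Java. The sketch below uses hypothetical names and says nothing about how Hive itself dispatches a NULL value. It only shows why a method that must accept a missing value needs a wrapper type.

```java
/** Sketch contrasting wrapper and primitive parameters when a value is missing. */
final class NullArgumentSketch {

    /** A wrapper parameter can carry null, so a missing value can be passed through. */
    static Integer plusOne(Integer n) {
        return n == null ? null : n + 1;
    }

    /** A primitive parameter has no way to represent a missing value. */
    static int plusOnePrimitive(int n) {
        return n + 1;
    }

    /** Demonstrates that auto-unboxing a null Integer fails at runtime. */
    static boolean unboxingNullFails() {
        Integer missing = null;
        try {
            plusOnePrimitive(missing); // auto-unboxing happens at the call site
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(plusOne(41));
        System.out.println(plusOne(null));
        System.out.println("unboxing null fails: " + unboxingNullFails());
    }
}
```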

To use the UDF inside Hive, compile the Java code and package the UDF bytecode
class file into a JAR file. Then, in your Hive session, add the JAR to the classpath and
use a CREATE FUNCTION statement to define a function that uses the Java class:
hive> ADD JAR /full/path/to/zodiac.jar;
hive> CREATE TEMPORARY FUNCTION zodiac
> AS 'org.apache.hadoop.hive.contrib.udf.example.UDFZodiacSign';

Note that quotes are not required around the JAR file path and currently it needs to be
a full path to the file on a local filesystem. Hive not only adds this JAR to the classpath,
it puts the JAR file in the distributed cache so it’s available around the cluster.
Now the Zodiac UDF can be used like any other function. Notice the word TEMPORARY
in the CREATE FUNCTION statement. Functions declared this way will only be available
in the current session. You will have to add the JAR and create the function in each
session. However, if you use the same JAR files and functions frequently, you can add
these statements to your $HOME/.hiverc file.
With the function defined, we can view its documentation and call it in a query:
hive> DESCRIBE FUNCTION zodiac;
zodiac(date) - from the input date string or separate month and day
arguments, returns the sign of the Zodiac.
hive> DESCRIBE FUNCTION EXTENDED zodiac;
zodiac(date) - from the input date string or separate month and day
arguments, returns the sign of the Zodiac.
Example:
> SELECT zodiac(date_string) FROM src;
> SELECT zodiac(month, day) FROM src;
hive> SELECT name, bday, zodiac(bday) FROM littlebigdata;
edward capriolo 2-12-1981 Aquarius
bob 10-10-2004 Libra
sara connor 4-5-1974 Aries

To recap, our UDF allows us to do custom transformations inside the Hive language.
Hive can now convert the user’s birthday to the corresponding Zodiac sign while it is
doing any other aggregations and transformations.
If we’re finished with the function, we can drop it:
hive> DROP TEMPORARY FUNCTION IF EXISTS zodiac;

As usual, the IF EXISTS is optional. It suppresses errors if the function doesn’t exist.

UDF Versus GenericUDF
In our Zodiac example we extended the UDF class. Hive offers a counterpart called
GenericUDF. GenericUDF is a more complex abstraction, but it offers support for better
null handling and makes it possible to handle some types of operations programmatically that a standard UDF cannot support. An example of a generic UDF is the Hive
CASE ... WHEN statement, which has complex logic depending on the arguments to the
statement. We will demonstrate how to use the GenericUDF class to write a user-defined
function, called nvl(), which returns a default value if null is passed in.
The nvl() function takes two arguments. If the first argument is non-null, it is returned.
If the first argument is null, the second argument is returned. The GenericUDF framework is a good fit for this problem. A standard UDF could be used as a solution but it
would be cumbersome because it requires overloading the evaluate method to handle
many different input types. GenericUDF will detect the type of input to the function
programmatically and provide an appropriate response.
We begin with the usual laundry list of import statements:
package org.apache.hadoop.hive.ql.udf.generic;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;

Next, we use the @Description annotation to document the UDF:
@Description(name = "nvl",
value = "_FUNC_(value,default_value) - Returns default value if value"
+" is null else returns value",
extended = "Example:\n"
+ " > SELECT _FUNC_(null,'bla') FROM src LIMIT 1;\n")

Now the class extends GenericUDF, a requirement to exploit the generic handling we
want.
The initialize() method is called and passed an ObjectInspector for each argument.
The goal of this method is to determine the return type from the arguments. The user
can also throw an Exception to signal that bad types are being sent to the method. The
returnOIResolver is a built-in class that determines the return type by finding the type
of non-null variables and using that type:
public class GenericUDFNvl extends GenericUDF {
  private GenericUDFUtils.ReturnObjectInspectorResolver returnOIResolver;
  private ObjectInspector[] argumentOIs;

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments)
      throws UDFArgumentException {
    argumentOIs = arguments;
    if (arguments.length != 2) {
      throw new UDFArgumentLengthException(
          "The operator 'NVL' accepts 2 arguments.");
    }
    returnOIResolver = new GenericUDFUtils.ReturnObjectInspectorResolver(true);
    if (!(returnOIResolver.update(arguments[0]) && returnOIResolver
        .update(arguments[1]))) {
      throw new UDFArgumentTypeException(2,
          "The 1st and 2nd args of function NVL should have the same type, "
          + "but they are different: \"" + arguments[0].getTypeName()
          + "\" and \"" + arguments[1].getTypeName() + "\"");
    }
    return returnOIResolver.get();
  }
  ...

The evaluate method has access to the values passed to the method stored in an array
of DeferredObject values. The returnOIResolver created in the initialize method is
used to get values from the DeferredObjects. In this case, the function returns the first
non-null value:
  ...
  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    Object retVal = returnOIResolver.convertIfNecessary(arguments[0].get(),
        argumentOIs[0]);
    if (retVal == null) {
      retVal = returnOIResolver.convertIfNecessary(arguments[1].get(),
          argumentOIs[1]);
    }
    return retVal;
  }
  ...

The final method to override is getDisplayString(), which is used inside the Hadoop
tasks to display debugging information when the function is being used:
  ...
  @Override
  public String getDisplayString(String[] children) {
    StringBuilder sb = new StringBuilder();
    sb.append("if ");
    sb.append(children[0]);
    sb.append(" is null ");
    sb.append("returns ");
    sb.append(children[1]);
    return sb.toString();
  }
}

To test the generic nature of the UDF, it is called several times, each time passing values
of different types, as shown in the following example:
hive> ADD JAR /path/to/jar.jar;
hive> CREATE TEMPORARY FUNCTION nvl
> AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFNvl';
hive> SELECT nvl( 1 , 2 ) AS COL1,
> nvl( NULL, 5 ) AS COL2,
> nvl( NULL, "STUFF" ) AS COL3
> FROM src LIMIT 1;
1	5	STUFF

Permanent Functions
Until this point we have bundled our code into JAR files, then used ADD JAR and CREATE
TEMPORARY FUNCTION to make use of them.
Your function may also be added permanently to Hive; however, this requires a small
modification to a Hive Java file and then rebuilding Hive.
Inside the Hive source code, a one-line change is required to the FunctionRegistry class
found at ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java. Then you
rebuild Hive following the instructions that come with the source distribution.
While it is recommended that you redeploy the entire new build, only the hive-exec*.jar, where * is the version number, needs to be replaced.
Here is an example change to FunctionRegistry where the new nvl() function is added
to Hive’s list of built-in functions:
...
registerUDF("parse_url", UDFParseUrl.class, false);
registerGenericUDF("nvl", GenericUDFNvl.class);
registerGenericUDF("split", GenericUDFSplit.class);
...

User-Defined Aggregate Functions
Users are able to define aggregate functions, too. However, the interface is more complex to implement. Aggregate functions are processed in several phases. Depending on
the transformation the UDAF performs, the types returned by each phase could be
different. For example, a sum() UDAF could accept primitive integer input, create integer PARTIAL data, and produce a final integer result. However, an aggregate like
median() could take primitive integer input, have an intermediate list of integers as
PARTIAL data, and then produce a final integer as the result.
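
To make the difference in PARTIAL types concrete, here is a small sketch in plain Java with no Hive classes. All names (PartialTypesSketch, sumPartial, medianPartial, and so on) are hypothetical, and the median takes the lower middle element for an even count as a simplification. The partial result for sum is a single number, while the partial result for median has to carry every value seen so far:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PartialTypesSketch {
    // sum: the partial result is a single number, like the final result.
    static long sumPartial(List<Integer> rows) {
        long total = 0;
        for (int r : rows) {
            total += r;
        }
        return total;
    }

    static long sumMerge(long a, long b) {
        return a + b;
    }

    // median: the partial result must keep every input value.
    static List<Integer> medianPartial(List<Integer> rows) {
        return new ArrayList<>(rows);
    }

    static List<Integer> medianMerge(List<Integer> a, List<Integer> b) {
        List<Integer> merged = new ArrayList<>(a);
        merged.addAll(b);
        return merged;
    }

    // Final step: sort the collected values and pick the (lower) middle one.
    static int medianFinal(List<Integer> partial) {
        List<Integer> sorted = new ArrayList<>(partial);
        Collections.sort(sorted);
        return sorted.get((sorted.size() - 1) / 2);
    }

    public static void main(String[] args) {
        List<Integer> split1 = Arrays.asList(7, 1, 9);
        List<Integer> split2 = Arrays.asList(3, 5);
        // Partial sums 17 and 8 merge to 25.
        System.out.println(sumMerge(sumPartial(split1), sumPartial(split2)));
        // The merged list 1,3,5,7,9 has median 5.
        System.out.println(medianFinal(
            medianMerge(medianPartial(split1), medianPartial(split2))));
    }
}
```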
For an example of a generic user-defined aggregate function, see the source code for
GenericUDAFAverage available at http://svn.apache.org/repos/asf/hive/branches/branch-0.8/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFAverage.java.
Aggregations execute inside the context of a map or reduce task, which
is a Java process with memory limitations. Therefore, storing large
structures inside an aggregate may exceed available heap space. The
min() UDAF only requires a single element be stored in memory for
comparison. The collect_set() UDAF uses a set internally to deduplicate data in order to limit memory usage. percentile_approx()
uses approximations to achieve a near correct result while limiting
memory usage. It is important to keep memory usage in mind when
writing a UDAF. You can increase your available memory to some extent
by adjusting mapred.child.java.opts, but that solution does not scale:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m</value>
</property>


Creating a COLLECT UDAF to Emulate GROUP_CONCAT
MySQL has a useful function known as GROUP_CONCAT, which combines all the
elements of a group into a single string using a user-specified delimiter. Below is an
example MySQL query that shows how to use its version of this function:
mysql> CREATE TABLE people (
         name STRING,
         friendname STRING );
mysql> SELECT * FROM people;
bob     sara
bob     john
bob     ted
john    sara
ted     bob
ted     sara
mysql> SELECT name, GROUP_CONCAT(friendname SEPARATOR ',')
       FROM people
       GROUP BY name;
bob     sara,john,ted
john    sara
ted     bob,sara

We can do the same transformation in Hive without the need for additional grammar
in the language. First, we need an aggregate function that builds a list of all input to
the aggregate. Hive already has a UDAF called collect_set that adds all input into a
java.util.Set collection. Sets automatically de-duplicate entries on insertion, which
is undesirable for GROUP_CONCAT. To build collect, we will take the code in
collect_set and replace instances of Set with instances of ArrayList. This will stop the
de-duplication. The result of the aggregate will be a single array of all values.
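
The effect of swapping the Set for an ArrayList can be seen in a few lines of plain Java. This is only a sketch: the class and method names are hypothetical, a LinkedHashSet is used so that the printed order is predictable (an assumption, since the text above only says java.util.Set), and String.join stands in for whatever delimiter-joining step is applied to the collected array:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CollectVsCollectSet {
    // Models collect_set: duplicates are dropped on insertion.
    static Set<String> collectSet(List<String> values) {
        return new LinkedHashSet<>(values);
    }

    // Models collect: every input value is kept, duplicates included.
    static List<String> collect(List<String> values) {
        return new ArrayList<>(values);
    }

    public static void main(String[] args) {
        List<String> friends = Arrays.asList("sara", "john", "sara", "ted");
        // Prints sara,john,ted (the second "sara" is lost).
        System.out.println(String.join(",", collectSet(friends)));
        // Prints sara,john,sara,ted (matches GROUP_CONCAT semantics).
        System.out.println(String.join(",", collect(friends)));
    }
}
```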
It is important to remember that the computation of your aggregation must be arbitrarily divisible over the data. Think of it as writing a divide-and-conquer algorithm
where the partitioning of the data is completely out of your control and handled by
Hive. More formally, given any subset of the input rows, you should be able to compute
a partial result, and also be able to merge any pair of partial results into another partial
result.
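
As an illustration of this requirement, consider an average. The sketch below is plain Java with hypothetical names and is not taken from GenericUDAFAverage. It keeps a (sum, count) pair as the partial result, which can be merged in any grouping, and contrasts it with averaging the per-split averages, which depends on how the rows happened to be split:

```java
import java.util.Arrays;
import java.util.List;

public class MergeablePartials {
    // Partial result for an average: a running sum and a row count.
    static final class Partial {
        final long sum;
        final long count;

        Partial(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    // Computes a partial result from any subset of the rows.
    static Partial partialOf(List<Integer> rows) {
        long sum = 0;
        for (int r : rows) {
            sum += r;
        }
        return new Partial(sum, rows.size());
    }

    // Merges two partial results into another partial result.
    static Partial merge(Partial a, Partial b) {
        return new Partial(a.sum + b.sum, a.count + b.count);
    }

    // Produces the final average from a partial result.
    static double finish(Partial p) {
        return (double) p.sum / p.count;
    }

    public static void main(String[] args) {
        List<Integer> left = Arrays.asList(1, 2, 3);
        List<Integer> right = Arrays.asList(4, 10);
        // Correct: merge the partials, then finish. Prints 4.0.
        System.out.println(finish(merge(partialOf(left), partialOf(right))));
        // Incorrect: average of averages, (2.0 + 7.0) / 2. Prints 4.5.
        System.out.println(
            (finish(partialOf(left)) + finish(partialOf(right))) / 2);
    }
}
```

Any other split of the same five rows produces the same merged (sum, count) pair, which is what makes the partial result safe to compute wherever Hive places the data.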
The following code is available on GitHub. All the input to the aggregation must be
primitive types. Rather than returning an ObjectInspector, as GenericUDFs do, aggregate
resolvers return a subclass of GenericUDAFEvaluator:
@Description(name = "collect", value = "_FUNC_(x) - Returns a list of objects. "+
    "CAUTION will easily OOM on large data sets" )
public class GenericUDAFCollect extends AbstractGenericUDAFResolver {
  static final Log LOG = LogFactory.getLog(GenericUDAFCollect.class.getName());

  public GenericUDAFCollect() {
  }

  @Override
  public GenericUDAFEvaluator getEvaluator(TypeInfo[] parameters)
      throws SemanticException {
    if (parameters.length != 1) {
      throw new UDFArgumentTypeException(parameters.length - 1,
          "Exactly one argument is expected.");
    }
    if (parameters[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
      throw new UDFArgumentTypeException(0,
          "Only primitive type arguments are accepted but "
          + parameters[0].getTypeName() + " was passed as parameter 1.");
    }
    return new GenericUDAFMkListEvaluator();
  }
  ...

Table 13-1 describes the methods that are part of the GenericUDAFEvaluator base class.

Table 13-1. Methods in GenericUDAFEvaluator

Method                    Description
init                      Called by Hive to initialize an instance of the UDAF evaluator class.
getNewAggregationBuffer   Return an object that will be used to store temporary aggregation results.
iterate                   Process a new row of data into the aggregation buffer.
terminatePartial          Return the contents of the current aggregation in a persistable way. Here, persistable means the return value can only be built up in terms of Java primitives, arrays, primitive wrappers (e.g., Double), Hadoop Writables, Lists, and Maps. Do NOT use your own classes (even if they implement java.io.Serializable).
merge                     Merge a partial aggregation returned by terminatePartial into the current aggregation.
terminate                 Return the final result of the aggregation to Hive.
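
The order in which these methods are called can be sketched without any Hive classes. The following plain Java simulation is a simplified stand-in, not the real GenericUDAFEvaluator API: the class CollectEvaluator and the mapSide and reduceSide helpers are hypothetical, the buffer is a bare List, and init is omitted because the sketch has no object inspectors to set up:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EvaluatorLifecycleSketch {
    // Hypothetical, simplified stand-in for the evaluator methods in the table.
    static class CollectEvaluator {
        List<Object> getNewAggregationBuffer() {
            return new ArrayList<>();
        }

        void iterate(List<Object> buffer, Object row) {
            buffer.add(row);
        }

        // Only standard Java collections leave the map side.
        List<Object> terminatePartial(List<Object> buffer) {
            return new ArrayList<>(buffer);
        }

        void merge(List<Object> buffer, List<Object> partial) {
            buffer.addAll(partial);
        }

        List<Object> terminate(List<Object> buffer) {
            return new ArrayList<>(buffer);
        }
    }

    // Simulates one map task: iterate over rows, then emit a partial result.
    static List<Object> mapSide(CollectEvaluator e, List<?> rows) {
        List<Object> buffer = e.getNewAggregationBuffer();
        for (Object row : rows) {
            e.iterate(buffer, row);
        }
        return e.terminatePartial(buffer);
    }

    // Simulates the reduce task: merge all partials, then emit the final result.
    static List<Object> reduceSide(CollectEvaluator e, List<List<Object>> partials) {
        List<Object> buffer = e.getNewAggregationBuffer();
        for (List<Object> partial : partials) {
            e.merge(buffer, partial);
        }
        return e.terminate(buffer);
    }

    public static void main(String[] args) {
        CollectEvaluator e = new CollectEvaluator();
        List<Object> p1 = mapSide(e, Arrays.asList("sara", "john"));
        List<Object> p2 = mapSide(e, Arrays.asList("ted"));
        System.out.println(reduceSide(e, Arrays.asList(p1, p2)));
    }
}
```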

In the init method, the evaluator first determines which mode it is in and then sets the
object inspectors for the result type.
The iterate() and terminatePartial() methods are used on the map side, while
terminate() and merge() are used on the reduce side to produce the final result. In all cases
the merges are building larger lists:
  public static class GenericUDAFMkListEvaluator extends GenericUDAFEvaluator {
    private PrimitiveObjectInspector inputOI;
    private StandardListObjectInspector loi;
    private StandardListObjectInspector internalMergeOI;

    @Override
    public ObjectInspector init(Mode m, ObjectInspector[] parameters)
        throws HiveException {
      super.init(m, parameters);
      if (m == Mode.PARTIAL1) {
        inputOI = (PrimitiveObjectInspector) parameters[0];
        return ObjectInspectorFactory
            .getStandardListObjectInspector(
                (PrimitiveObjectInspector) ObjectInspectorUtils
                    .getStandardObjectInspector(inputOI));
      } else {
        if (!(parameters[0] instanceof StandardListObjectInspector)) {
          inputOI = (PrimitiveObjectInspector) ObjectInspectorUtils
              .getStandardObjectInspector(parameters[0]);
          return (StandardListObjectInspector) ObjectInspectorFactory
              .getStandardListObjectInspector(inputOI);
        } else {
          internalMergeOI = (StandardListObjectInspector) parameters[0];
          inputOI = (PrimitiveObjectInspector)
              internalMergeOI.getListElementObjectInspector();
          loi = (StandardListObjectInspector) ObjectInspectorUtils
              .getStandardObjectInspector(internalMergeOI);
          return loi;
        }
      }
    }
...

The remaining methods and class definition define MkArrayAggregationBuffer as well
as top-level methods that modify the contents of the buffer:
You may have noticed that Hive tends to avoid allocating objects with
new whenever possible. Hadoop and Hive use this pattern to create fewer
temporary objects and thus less work for the JVM’s garbage collection
algorithms. Keep this in mind when writing UDFs, because references are
typically reused. Assuming immutable objects will lead to bugs!
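
The kind of bug this warning refers to can be reproduced in plain Java. In the sketch below, MutableText is a hypothetical stand-in for a reused, mutable value object; it is not a Hive or Hadoop class. Storing the reference that the caller keeps overwriting leaves the buffer full of the last value, while copying the value before storing it preserves each row:

```java
import java.util.ArrayList;
import java.util.List;

public class ReusedObjectSketch {
    // A mutable holder that the caller reuses for every row
    // (hypothetical stand-in for a reused value object).
    static final class MutableText {
        private String value;

        void set(String v) {
            value = v;
        }

        String get() {
            return value;
        }
    }

    // Buggy: adds the same reference for every row.
    static List<String> collectByReference(String[] rows) {
        MutableText reused = new MutableText();
        List<MutableText> buffer = new ArrayList<>();
        for (String row : rows) {
            reused.set(row);
            buffer.add(reused);
        }
        return render(buffer);
    }

    // Correct: copies the current value into a fresh object before storing it.
    static List<String> collectByCopy(String[] rows) {
        MutableText reused = new MutableText();
        List<MutableText> buffer = new ArrayList<>();
        for (String row : rows) {
            reused.set(row);
            MutableText copy = new MutableText();
            copy.set(reused.get());
            buffer.add(copy);
        }
        return render(buffer);
    }

    // Reads the buffered values back out as plain strings.
    static List<String> render(List<MutableText> buffer) {
        List<String> out = new ArrayList<>();
        for (MutableText t : buffer) {
            out.add(t.get());
        }
        return out;
    }

    public static void main(String[] args) {
        String[] rows = {"sara", "john", "ted"};
        System.out.println(collectByReference(rows)); // [ted, ted, ted]
        System.out.println(collectByCopy(rows));      // [sara, john, ted]
    }
}
```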
...
static class MkArrayAggregationBuffer implements AggregationBuffer {
List