
Programming Hive
Edward Capriolo, Dean Wampler, and Jason Rutherglen
Programming Hive
by Edward Capriolo, Dean Wampler, and Jason Rutherglen
Copyright © 2012 Edward Capriolo, Aspect Research Associates, and Jason Rutherglen. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions
are also available for most titles. For more information, contact our
corporate/institutional sales department: 800-998-9938.
Editors: Mike Loukides and Courtney Nash
Production Editors: Iris Febres and Rachel Steely
Proofreaders: Stacie Arellano and Kiel Van Horn
Indexer: Bob Pfahler
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Rebecca Demarest
October 2012: First Edition.
Revision History for the First Edition:
2012-09-17 First release
See for release details.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of
O’Reilly Media, Inc. Programming Hive, the image of a hornet’s hive, and related trade dress are trade-
marks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a
trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and authors assume
no responsibility for errors or omissions, or for damages resulting from the use of the information con-
tained herein.
ISBN: 978-1-449-31933-5
Table of Contents
Preface .................................................................... xiii
1. Introduction ........................................................... 1
An Overview of Hadoop and MapReduce 3
Hive in the Hadoop Ecosystem 6
Pig 8
HBase 8
Cascading, Crunch, and Others 9
Java Versus Hive: The Word Count Algorithm 10
What’s Next 13
2. Getting Started ........................................................ 15
Installing a Preconfigured Virtual Machine 15
Detailed Installation 16
Installing Java 16
Installing Hadoop 18
Local Mode, Pseudodistributed Mode, and Distributed Mode 19
Testing Hadoop 20
Installing Hive 21
What Is Inside Hive? 22
Starting Hive 23
Configuring Your Hadoop Environment 24
Local Mode Configuration 24
Distributed and Pseudodistributed Mode Configuration 26
Metastore Using JDBC 28
The Hive Command 29
Command Options 29
The Command-Line Interface 30
CLI Options 31
Variables and Properties 31
Hive “One Shot” Commands 34
Executing Hive Queries from Files 35
The .hiverc File 36
More on Using the Hive CLI 36
Command History 37
Shell Execution 37
Hadoop dfs Commands from Inside Hive 38
Comments in Hive Scripts 38
Query Column Headers 38
3. Data Types and File Formats ............................................. 41
Primitive Data Types 41
Collection Data Types 43
Text File Encoding of Data Values 45
Schema on Read 48
4. HiveQL: Data Definition ................................................. 49
Databases in Hive 49
Alter Database 52
Creating Tables 53
Managed Tables 56
External Tables 56
Partitioned, Managed Tables 58
External Partitioned Tables 61
Customizing Table Storage Formats 63
Dropping Tables 66
Alter Table 66
Renaming a Table 66
Adding, Modifying, and Dropping a Table Partition 66
Changing Columns 67
Adding Columns 68
Deleting or Replacing Columns 68
Alter Table Properties 68
Alter Storage Properties 68
Miscellaneous Alter Table Statements 69
5. HiveQL: Data Manipulation .............................................. 71
Loading Data into Managed Tables 71
Inserting Data into Tables from Queries 73
Dynamic Partition Inserts 74
Creating Tables and Loading Them in One Query 75
Exporting Data 76
iv | Table of Contents
6. HiveQL: Queries ........................................................ 79
SELECT … FROM Clauses 79
Specify Columns with Regular Expressions 81
Computing with Column Values 81
Arithmetic Operators 82
Using Functions 83
LIMIT Clause 91
Column Aliases 91
Nested SELECT Statements 91
CASE … WHEN … THEN Statements 91
When Hive Can Avoid MapReduce 92
WHERE Clauses 92
Predicate Operators 93
Gotchas with Floating-Point Comparisons 94
GROUP BY Clauses 97
HAVING Clauses 97
JOIN Statements 98
Inner JOIN 98
Join Optimizations 100
OUTER JOIN Gotcha 101
Cartesian Product JOINs 105
Map-side Joins 105
Casting 109
Casting BINARY Values 109
Queries that Sample Data 110
Block Sampling 111
Input Pruning for Bucket Tables 111
7. HiveQL: Views ........................................................ 113
Views to Reduce Query Complexity 113
Views that Restrict Data Based on Conditions 114
Views and Map Type for Dynamic Tables 114
View Odds and Ends 115
8. HiveQL: Indexes ...................................................... 117
Creating an Index 117
Bitmap Indexes 118
Rebuilding the Index 118
Showing an Index 119
Dropping an Index 119
Implementing a Custom Index Handler 119
9. Schema Design ....................................................... 121
Table-by-Day 121
Over Partitioning 122
Unique Keys and Normalization 123
Making Multiple Passes over the Same Data 124
The Case for Partitioning Every Table 124
Bucketing Table Data Storage 125
Adding Columns to a Table 127
Using Columnar Tables 128
Repeated Data 128
Many Columns 128
(Almost) Always Use Compression! 128
10. Tuning .............................................................. 131
Using EXPLAIN 131
Limit Tuning 134
Optimized Joins 135
Local Mode 135
Parallel Execution 136
Strict Mode 137
Tuning the Number of Mappers and Reducers 138
JVM Reuse 139
Indexes 140
Dynamic Partition Tuning 140
Speculative Execution 141
Single MapReduce MultiGROUP BY 142
Virtual Columns 142
11. Other File Formats and Compression ................................... 145
Determining Installed Codecs 145
Choosing a Compression Codec 146
Enabling Intermediate Compression 147
Final Output Compression 148
Sequence Files 148
Compression in Action 149
Archive Partition 152
Compression: Wrapping Up 154
12. Developing .......................................................... 155
Changing Log4J Properties 155
Connecting a Java Debugger to Hive 156
Building Hive from Source 156
Running Hive Test Cases 156
Execution Hooks 158
Setting Up Hive and Eclipse 158
Hive in a Maven Project 158
Unit Testing in Hive with hive_test 159
The New Plugin Developer Kit 161
13. Functions ............................................................ 163
Discovering and Describing Functions 163
Calling Functions 164
Standard Functions 164
Aggregate Functions 164
Table Generating Functions 165
A UDF for Finding a Zodiac Sign from a Day 166
UDF Versus GenericUDF 169
Permanent Functions 171
User-Defined Aggregate Functions 172
Creating a COLLECT UDAF to Emulate GROUP_CONCAT 172
User-Defined Table Generating Functions 177
UDTFs that Produce Multiple Rows 177
UDTFs that Produce a Single Row with Multiple Columns 179
UDTFs that Simulate Complex Types 179
Accessing the Distributed Cache from a UDF 182
Annotations for Use with Functions 184
Deterministic 184
Stateful 184
DistinctLike 185
Macros 185
14. Streaming ........................................................... 187
Identity Transformation 188
Changing Types 188
Projecting Transformation 188
Manipulative Transformations 189
Using the Distributed Cache 189
Producing Multiple Rows from a Single Row 190
Calculating Aggregates with Streaming 191
GenericMR Tools for Streaming to Java 194
Calculating Cogroups 196
15. Customizing Hive File and Record Formats ............................. 199
File Versus Record Formats 199
Demystifying CREATE TABLE Statements 199
File Formats 201
SequenceFile 201
RCFile 202
Example of a Custom Input Format: DualInputFormat 203
Record Formats: SerDes 205
CSV and TSV SerDes 206
ObjectInspector 206
Think Big Hive Reflection ObjectInspector 206
XPath-Related Functions 207
JSON SerDe 208
Avro Hive SerDe 209
Defining Avro Schema Using Table Properties 209
Defining a Schema from a URI 210
Evolving Schema 210
Binary Output 211
16. Hive Thrift Service .................................................... 213
Starting the Thrift Server 213
Setting Up Groovy to Connect to HiveService 214
Connecting to HiveServer 214
Getting Cluster Status 215
Result Set Schema 215
Fetching Results 215
Retrieving Query Plan 216
Metastore Methods 216
Example Table Checker 216
Administrating HiveServer 217
Productionizing HiveService 217
Cleanup 218
Hive ThriftMetastore 219
ThriftMetastore Configuration 219
Client Configuration 219
17. Storage Handlers and NoSQL ............................................ 221
Storage Handler Background 221
HiveStorageHandler 222
HBase 222
Cassandra 224
Static Column Mapping 224
Transposed Column Mapping for Dynamic Columns 224
Cassandra SerDe Properties 224
DynamoDB 225
18. Security ............................................................. 227
Integration with Hadoop Security 228
Authentication with Hive 228
Authorization in Hive 229
Users, Groups, and Roles 230
Privileges to Grant and Revoke 231
Partition-Level Privileges 233
Automatic Grants 233
19. Locking ............................................................. 235
Locking Support in Hive with Zookeeper 235
Explicit, Exclusive Locks 238
20. Hive Integration with Oozie ......................................... 239
Oozie Actions 239
Hive Thrift Service Action 240
A Two-Query Workflow 240
Oozie Web Console 242
Variables in Workflows 242
Capturing Output 243
Capturing Output to Variables 243
21. Hive and Amazon Web Services (AWS) .................................. 245
Why Elastic MapReduce? 245
Instances 245
Before You Start 246
Managing Your EMR Hive Cluster 246
Thrift Server on EMR Hive 247
Instance Groups on EMR 247
Configuring Your EMR Cluster 248
Deploying hive-site.xml 248
Deploying a .hiverc Script 249
Setting Up a Memory-Intensive Configuration 249
Persistence and the Metastore on EMR 250
HDFS and S3 on EMR Cluster 251
Putting Resources, Configs, and Bootstrap Scripts on S3 252
Logs on S3 252
Spot Instances 252
Security Groups 253
EMR Versus EC2 and Apache Hive 254
Wrapping Up 254
22. HCatalog ............................................................ 255
Introduction 255
MapReduce 256
Reading Data 256
Writing Data 258
Command Line 261
Security Model 261
Architecture 262
23. Case Studies ......................................................... 265
M6D (Media6Degrees) 265
Data Science at M6D Using Hive and R 265
M6D UDF Pseudorank 270
M6D Managing Hive Data Across Multiple MapReduce Clusters 274
Outbrain 278
In-Site Referrer Identification 278
Counting Uniques 280
Sessionization 282
NASA’s Jet Propulsion Laboratory 287
The Regional Climate Model Evaluation System 287
Our Experience: Why Hive? 290
Some Challenges and How We Overcame Them 291
Photobucket 292
Big Data at Photobucket 292
What Hardware Do We Use for Hive? 293
What’s in Hive? 293
Who Does It Support? 293
SimpleReach 294
Experiences and Needs from the Customer Trenches 296
A Karmasphere Perspective 296
Introduction 296
Use Case Examples from the Customer Trenches 297
Glossary ................................................................... 305
Appendix: References ....................................................... 309
Index ..................................................................... 313
Preface

Programming Hive introduces Hive, an essential tool in the Hadoop ecosystem that
provides an SQL (Structured Query Language) dialect for querying data stored in the
Hadoop Distributed Filesystem (HDFS), in other filesystems that integrate with Hadoop,
such as MapR-FS and Amazon’s S3, and in databases like HBase (the Hadoop database)
and Cassandra.
Most data warehouse applications are implemented using relational databases that use
SQL as the query language. Hive lowers the barrier for moving these applications to
Hadoop. People who know SQL can learn Hive easily. Without Hive, these users must
learn new languages and tools to become productive again. Similarly, Hive makes it
easier for developers to port SQL-based applications to Hadoop, compared to other
tool options. Without Hive, developers would face a daunting challenge when porting
their SQL applications to Hadoop.
Still, there are aspects of Hive that are different from other SQL-based environments.
Documentation for Hive users and Hadoop developers has been sparse. We decided
to write this book to fill that gap. We provide a pragmatic, comprehensive introduction
to Hive that is suitable for SQL experts, such as database designers and business ana-
lysts. We also cover the in-depth technical details that Hadoop developers require for
tuning and customizing Hive.
You can learn more at the book’s catalog page.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions. Definitions of most terms can be found in the Glossary.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements
such as variable or function names, databases, data types, environment variables,
statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values deter-
mined by context.
This icon signifies a tip, suggestion, or general note.
This icon indicates a warning or caution.
Using Code Examples
This book is here to help you get your job done. In general, you may use the code in
this book in your programs and documentation. You do not need to contact us for
permission unless you’re reproducing a significant portion of the code. For example,
writing a program that uses several chunks of code from this book does not require
permission. Selling or distributing a CD-ROM of examples from O’Reilly books does
require permission. Answering a question by citing this book and quoting example
code does not require permission. Incorporating a significant amount of example code
from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Programming Hive by Edward Capriolo,
Dean Wampler, and Jason Rutherglen (O’Reilly). Copyright 2012 Edward Capriolo,
Aspect Research Associates, and Jason Rutherglen, 978-1-449-31933-5.”
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us.
Safari® Books Online
Safari Books Online is an on-demand digital
library that delivers expert content in both book and video form from the
world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and cre-
ative professionals use Safari Books Online as their primary resource for research,
problem solving, learning, and certification training.
xiv | Preface
Safari Books Online offers a range of product mixes and pricing programs for organi-
zations, government agencies, and individuals. Subscribers have access to thousands
of books, training videos, and prepublication manuscripts in one fully searchable da-
tabase from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley
Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John
Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT
Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Tech-
nology, and dozens more. For more information about Safari Books Online, please visit
us online.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information.
To comment or ask technical questions about this book, send us email.
For more information about our books, courses, conferences, and news, see our website.
What Brought Us to Hive?
The three of us arrived here from different directions.
Edward Capriolo
When I first became involved with Hadoop, I saw the distributed filesystem and Map-
Reduce as a great way to tackle computer-intensive problems. However, programming
in the MapReduce model was a paradigm shift for me. Hive offered a fast and simple
way to take advantage of MapReduce in an SQL-like world I was comfortable in. This
approach also made it easy to prototype proof-of-concept applications and also to
champion Hadoop as a solution internally. Even though I am now very familiar with
Hadoop internals, Hive is still my primary method of working with Hadoop.
It is an honor to write a Hive book. Being a Hive Committer and a member of the
Apache Software Foundation is my most valued accolade.
Dean Wampler
As a “big data” consultant at Think Big Analytics, I work with experienced “data people”
who eat and breathe SQL. For them, Hive is a necessary and sufficient condition for
Hadoop to be a viable tool to leverage their investment in SQL and open up new op-
portunities for data analytics.
Hive has lacked good documentation. I suggested to my previous editor at O’Reilly,
Mike Loukides, that a Hive book was needed by the community. So, here we are…
Jason Rutherglen
I work at Think Big Analytics as a software architect. My career has involved an array
of technologies including search, Hadoop, mobile, cryptography, and natural language
processing. Hive is the ultimate way to build a data warehouse using open technologies
on any amount of data. I use Hive regularly on a variety of projects.
Acknowledgments

We thank everyone involved with Hive. This includes committers, contributors, and end users.
Mark Grover wrote the chapter on Hive and Amazon Web Services. He is a contributor
to the Apache Hive project and is active helping others on the Hive IRC channel.
David Ha and Rumit Patel, at M6D, contributed the case study and code on the Rank
function. The ability to do Rank in Hive is a significant feature.
Ori Stitelman, at M6D, contributed the case study, Data Science using Hive and R,
which demonstrates how Hive can be used to make a first pass on large data sets and
produce results to be used by a second R process.
David Funk contributed three use cases on in-site referrer identification, sessionization,
and counting unique visitors. David’s techniques show how rewriting and optimizing
Hive queries can make large-scale MapReduce data analysis more efficient.
Ian Robertson read the entire first draft of the book and provided very helpful feedback
on it. We’re grateful to him for providing that feedback on short notice and a tight
schedule.
John Sichi provided technical review for the book. John was also instrumental in driving
through some of the newer features in Hive like StorageHandlers and Indexing Support.
He has been actively growing and supporting the Hive community.
Alan Gates, author of Programming Pig, contributed the HCatalog chapter. Nanda
Vijaydev contributed the chapter on how Karmasphere offers productized enhance-
ments for Hive. Eric Lubow contributed the SimpleReach case study. Chris A. Matt-
mann, Paul Zimdars, Cameron Goodale, Andrew F. Hart, Jinwon Kim, Duane Waliser,
and Peter Lean contributed the NASA JPL case study.
Chapter 1. Introduction

From the early days of the Internet’s mainstream breakout, the major search engines
and ecommerce companies wrestled with ever-growing quantities of data. More re-
cently, social networking sites experienced the same problem. Today, many organiza-
tions realize that the data they gather is a valuable resource for understanding their
customers, the performance of their business in the marketplace, and the effectiveness
of their infrastructure.
The Hadoop ecosystem emerged as a cost-effective way of working with such large data
sets. It imposes a particular programming model, called MapReduce, for breaking up
computation tasks into units that can be distributed around a cluster of commodity,
server class hardware, thereby providing cost-effective, horizontal scalability. Under-
neath this computation model is a distributed file system called the Hadoop Distributed
Filesystem (HDFS). Although the filesystem is “pluggable,” there are now several com-
mercial and open source alternatives.
However, a challenge remains: how do you move an existing data infrastructure to
Hadoop, when that infrastructure is based on traditional relational databases and the
Structured Query Language (SQL)? What about the large base of SQL users, both expert
database designers and administrators, as well as casual users who use SQL to extract
information from their data warehouses?
This is where Hive comes in. Hive provides an SQL dialect, called Hive Query Lan-
guage (abbreviated HiveQL or just HQL) for querying data stored in a Hadoop cluster.
SQL knowledge is widespread for a reason; it’s an effective, reasonably intuitive model
for organizing and using data. Mapping these familiar data operations to the low-level
MapReduce Java API can be daunting, even for experienced Java developers. Hive does
this dirty work for you, so you can focus on the query itself. Hive translates most queries
to MapReduce jobs, thereby exploiting the scalability of Hadoop, while presenting a
familiar SQL abstraction. If you don’t believe us, see “Java Versus Hive: The Word
Count Algorithm” on page 10 later in this chapter.
Hive is most suited for data warehouse applications, where relatively static data is an-
alyzed, fast response times are not required, and when the data is not changing rapidly.
Hive is not a full database. The design constraints and limitations of Hadoop and HDFS
impose limits on what Hive can do. The biggest limitation is that Hive does not provide
record-level updates, inserts, or deletes. You can generate new tables from queries or
output query results to files. Also, because Hadoop is a batch-oriented system, Hive
queries have higher latency, due to the start-up overhead for MapReduce jobs. Queries
that would finish in seconds for a traditional database take longer for Hive, even for
relatively small data sets.1 Finally, Hive does not provide transactions.
So, Hive doesn’t provide crucial features required for OLTP (Online Transaction Processing). It’s closer to being an OLAP (Online Analytical Processing) tool, but as we’ll see,
Hive isn’t ideal for satisfying the “online” part of OLAP, at least today, since there can
be significant latency between issuing a query and receiving a reply, both due to the
overhead of Hadoop and due to the size of the data sets Hadoop was designed to serve.
If you need OLTP features for large-scale data, you should consider using a NoSQL
database. Examples include HBase, a NoSQL database integrated with Hadoop,2 Cassandra,3 and DynamoDB, if you are using Amazon’s Elastic MapReduce (EMR) or
Elastic Compute Cloud (EC2).4 You can even integrate Hive with these databases
(among others), as we’ll discuss in Chapter 17.
So, Hive is best suited for data warehouse applications, where a large data set is main-
tained and mined for insights, reports, etc.
Because most data warehouse applications are implemented using SQL-based rela-
tional databases, Hive lowers the barrier for moving these applications to Hadoop.
People who know SQL can learn Hive easily. Without Hive, these users would need to
learn new languages and tools to be productive again.
Similarly, Hive makes it easier for developers to port SQL-based applications to
Hadoop, compared with other Hadoop languages and tools.
However, like most SQL dialects, HiveQL does not conform to the ANSI SQL standard
and it differs in various ways from the familiar SQL dialects provided by Oracle,
MySQL, and SQL Server. (However, it is closest to MySQL’s dialect of SQL.)
1. However, for the big data sets Hive is designed for, this start-up overhead is trivial compared to the actual
processing time.
2. See the Apache HBase website and HBase: The Definitive Guide by Lars George (O’Reilly).
3. See the Cassandra website and High Performance Cassandra Cookbook by Edward Capriolo (Packt).
4. See the DynamoDB website.
2 | Chapter 1:Introduction
So, this book has a dual purpose. First, it provides a comprehensive, example-driven
introduction to HiveQL for all users, from developers, database administrators and
architects, to less technical users, such as business analysts.
Second, the book provides the in-depth technical details required by developers and
Hadoop administrators to tune Hive query performance and to customize Hive with
user-defined functions, custom data formats, etc.
We wrote this book out of frustration that Hive lacked good documentation, especially
for new users who aren’t developers and aren’t accustomed to browsing project artifacts
like bug and feature databases, source code, etc., to get the information they need. The
Hive Wiki5 is an invaluable source of information, but its explanations are sometimes
sparse and not always up to date. We hope this book remedies those issues, providing
a single, comprehensive guide to all the essential features of Hive and how to use them.6
An Overview of Hadoop and MapReduce
If you’re already familiar with Hadoop and the MapReduce computing model, you can
skip this section. While you don’t need an intimate knowledge of MapReduce to use
Hive, understanding the basic principles of MapReduce will help you understand what
Hive is doing behind the scenes and how you can use Hive more effectively.
We provide a brief overview of Hadoop and MapReduce here. For more details, see
Hadoop: The Definitive Guide by Tom White (O’Reilly).
MapReduce is a computing model that decomposes large data manipulation jobs into
individual tasks that can be executed in parallel across a cluster of servers. The results
of the tasks can be joined together to compute the final results.
The MapReduce programming model was developed at Google and described in an
influential paper called “MapReduce: Simplified Data Processing on Large Clusters” (see the
Appendix, page 309). The Google filesystem was described a year earlier in a paper
called “The Google File System” (see page 310). Both papers inspired the creation of Hadoop
by Doug Cutting.
The term MapReduce comes from the two fundamental data-transformation operations
used, map and reduce. A map operation converts the elements of a collection from one
form to another. In this case, input key-value pairs are converted to zero-to-many
output key-value pairs, where the input and output keys might be completely different
and the input and output values might be completely different.

5. See the Apache Hive wiki.
6. It’s worth bookmarking the wiki link, however, because the wiki contains some more obscure information
we won’t cover here.
In MapReduce, all the key-value pairs for a given key are sent to the same reduce operation.
Specifically, the key and a collection of the values are passed to the reducer. The goal
of “reduction” is to convert the collection to a value, such as summing or averaging a
collection of numbers, or to another collection. A final key-value pair is emitted by the
reducer. Again, the input versus output keys and values may be different. Note that if
the job requires no reduction step, then it can be skipped.
An implementation infrastructure like the one provided by Hadoop handles most of
the chores required to make jobs run successfully. For example, Hadoop determines
how to decompose the submitted job into individual map and reduce tasks to run, it
schedules those tasks given the available resources, it decides where to send a particular
task in the cluster (usually where the corresponding data is located, when possible, to
minimize network overhead), it monitors each task to ensure successful completion,
and it restarts tasks that fail.
The Hadoop Distributed Filesystem, HDFS, or a similar distributed filesystem, manages
data across the cluster. Each block is replicated several times (three copies is the usual
default), so that no single hard drive or server failure results in data loss. Also, because
the goal is to optimize the processing of very large data sets, HDFS and similar filesys-
tems use very large block sizes, typically 64 MB or multiples thereof. Such large blocks
can be stored contiguously on hard drives so they can be written and read with minimal
seeking of the drive heads, thereby maximizing write and read performance.
To make MapReduce more clear, let’s walk through a simple example, the Word
Count algorithm that has become the “Hello World” of MapReduce.7 Word Count
returns a list of all the words that appear in a corpus (one or more documents) and the
count of how many times each word appears. The output shows each word found and
its count, one per line. By common convention, the word (output key) and count (output
value) are separated by a tab.
Figure 1-1 shows how Word Count works in MapReduce.
There is a lot going on here, so let’s walk through it from left to right.
Each Input box on the left-hand side of Figure 1-1 is a separate document. Here are
four documents, the third of which is empty and the others contain just a few words,
to keep things simple.
By default, a separate Mapper process is invoked to process each document. In real
scenarios, large documents might be split and each split would be sent to a separate
Mapper. Also, there are techniques for combining many small documents into a single
split for a Mapper. We won’t worry about those details now.
7. If you’re not a developer, a “Hello World” program is the traditional first program you write when learning
a new language or tool set.
The fundamental data structure for input and output in MapReduce is the key-value
pair. After each Mapper is started, it is called repeatedly for each line of text from the
document. For each call, the key passed to the mapper is the character offset into the
document at the start of the line. The corresponding value is the text of the line.
In Word Count, the character offset (key) is discarded. The value, the line of text, is
tokenized into words, using one of several possible techniques (e.g., splitting on white-
space is the simplest, but it can leave in undesirable punctuation). We’ll also assume
that the Mapper converts each word to lowercase, so for example, “FUN” and “fun”
will be counted as the same word.
Finally, for each word in the line, the mapper outputs a key-value pair, with the word
as the key and the number 1 as the value (i.e., the count of “one occurrence”). Note
that the output types of the keys and values are different from the input types.
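That mapper logic can be sketched in a few lines of plain Python. This is an illustrative simulation, not the Hadoop Java API; the function name is invented here, and the tokenization is the simple whitespace split described above:

```python
def word_count_mapper(offset, line):
    # The input key (the line's character offset) is discarded.
    # The input value (the line of text) is lowercased and split on
    # whitespace, and one (word, 1) pair is emitted per occurrence.
    return [(word, 1) for word in line.lower().split()]

pairs = word_count_mapper(0, "Hive is FUN and Hive is fast")
# Every occurrence yields its own pair, so "hive" appears twice
# with a count of 1 each; combining them is the reducer's job.
```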
Part of Hadoop’s magic is the Sort and Shuffle phase that comes next. Hadoop sorts
the key-value pairs by key and it “shuffles” all pairs with the same key to the same
Reducer. There are several possible techniques that can be used to decide which reducer
gets which range of keys. We won’t worry about that here, but for illustrative purposes,
we have assumed in the figure that a particular alphanumeric partitioning was used. In
a real implementation, it would be different.
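One common technique, and the idea behind Hadoop's default HashPartitioner (the sketch below imitates it but is not the Hadoop class itself), is to hash the key and take the remainder modulo the number of reducers:

```java
// Sketch of hash partitioning: every pair with the same key is routed to
// the same reducer, because the key alone determines the reducer index.
public class PartitionSketch {
    public static int reducerFor(String key, int numReducers) {
        // Clear the sign bit so the index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // All occurrences of "hadoop" map to a single reducer index in [0, 3).
        System.out.println(reducerFor("hadoop", 3));
    }
}
```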
Figure 1-1. Word Count algorithm using MapReduce
For the mapper to simply output a count of 1 every time a word is seen is a bit wasteful
of network and disk I/O used in the sort and shuffle. (It does minimize the memory
used in the Mappers, however.) One optimization is to keep track of the count for each
word and then output only one count for each word when the Mapper finishes. There
are several ways to do this optimization, but the simple approach is logically correct
and sufficient for this discussion.
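A sketch of that optimization (the names are hypothetical; Hadoop also supports a separate Combiner step that achieves a similar effect): the Mapper tallies counts per word and emits one pair per distinct word at the end.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of in-mapper combining: accumulate a count per word while
// processing the document, then emit one (word, count) pair per word,
// instead of one (word, 1) pair per occurrence.
public class CombiningMapperSketch {
    public static Map<String, Integer> mapDocument(String doc) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : doc.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three word occurrences, but only two distinct pairs are emitted.
        System.out.println(mapDocument("bla bla wa"));
    }
}
```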
The inputs to each Reducer are again key-value pairs, but this time, each key will be
one of the words found by the mappers and the value will be a collection of all the counts
emitted by all the mappers for that word. Note that the type of the key and the type of
the value collection elements are the same as the types used in the Mapper’s output.
That is, the key type is a character string and the value collection element type is an integer.
To finish the algorithm, all the reducer has to do is add up all the counts in the value
collection and write a final key-value pair consisting of each word and the count for
that word.
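The grouping done by Sort and Shuffle, followed by the Reducer's summing, can be simulated in a few lines of standalone Java (the names are ours for illustration; on a real cluster this grouping happens across many machines):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;

// Sketch of Sort/Shuffle plus Reduce: group the mappers' (word, 1)
// pairs by key, then sum each group into a final (word, count) pair.
public class ShuffleReduceSketch {
    public static Map<String, Integer> shuffleAndReduce(
            List<? extends Entry<String, Integer>> pairs) {
        // A TreeMap keeps the keys sorted, mimicking the sort phase.
        Map<String, Integer> result = new TreeMap<>();
        for (Entry<String, Integer> pair : pairs) {
            result.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Entry<String, Integer>> pairs = Arrays.asList(
                new SimpleEntry<>("bla", 1),
                new SimpleEntry<>("wa", 1),
                new SimpleEntry<>("bla", 1));
        System.out.println(shuffleAndReduce(pairs));
    }
}
```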
Word Count isn’t a toy example. The data it produces is used in spell checkers, language
detection and translation systems, and other applications.
Hive in the Hadoop Ecosystem
The Word Count algorithm, like most that you might implement with Hadoop, is a
little involved. When you actually implement such algorithms using the Hadoop Java
API, there are even more low-level details you have to manage yourself. It’s a job that’s
only suitable for an experienced Java developer, potentially putting Hadoop out of
reach of users who aren’t programmers, even when they understand the algorithm they
want to use.
In fact, many of those low-level details are actually quite repetitive from one job to the
next, from low-level chores like wiring together Mappers and Reducers to certain data
manipulation constructs, like filtering for just the data you want and performing SQL-
like joins on data sets. There’s a real opportunity to eliminate reinventing these idioms
by letting “higher-level” tools handle them automatically.
That’s where Hive comes in. It not only provides a familiar programming model for
people who know SQL, it also eliminates lots of boilerplate and sometimes-tricky
coding you would have to do in Java.
This is why Hive is so important to Hadoop, whether you are a DBA or a Java developer.
Hive lets you complete a lot of work with relatively little effort.
Figure 1-2 shows the major “modules” of Hive and how they work with Hadoop.
There are several ways to interact with Hive. In this book, we will mostly focus on the
CLI, command-line interface. For people who prefer graphical user interfaces, commercial
and open source options are starting to appear, including a commercial product
from Karmasphere, Cloudera’s open source Hue, a new “Hive-as-a-service” offering
from Qubole (, and others.
Bundled with the Hive distribution is the CLI, a simple web interface called Hive web
interface (HWI), and programmatic access through JDBC, ODBC, and a Thrift server
(see Chapter 16).
All commands and queries go to the Driver, which compiles the input, optimizes the
computation required, and executes the required steps, usually with MapReduce jobs.
When MapReduce jobs are required, Hive doesn’t generate Java MapReduce programs.
Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an
XML file representing the “job plan.” In other words, these generic modules function
like mini language interpreters and the “language” to drive the computation is encoded
in XML.
Hive communicates with the JobTracker to initiate the MapReduce job. Hive does not
have to be running on the same master node with the JobTracker. In larger clusters,
it’s common to have edge nodes where tools like Hive run. They communicate remotely
with the JobTracker on the master node to execute jobs. Usually, the data files to be
processed are in HDFS, which is managed by the NameNode.
The Metastore is a separate relational database (usually a MySQL instance) where Hive
persists table schemas and other system metadata. We’ll discuss it in detail in Chapter 2.
While this is a book about Hive, it’s worth mentioning other higher-level tools that you
should consider for your needs. Hive is best suited for data warehouse applications,
where real-time responsiveness to queries and record-level inserts, updates, and deletes
are not required. Of course, Hive is also very nice for people who know SQL already.
Figure 1-2. Hive modules
However, some of your work may be easier to accomplish with alternative tools.
Pig
The best known alternative to Hive is Pig, which was developed at Yahoo! about the
same time Facebook was developing Hive. Pig is also now a
top-level Apache project that is closely associated with Hadoop.
Suppose you have one or more sources of input data and you need to perform a complex
set of transformations to generate one or more collections of output data. Using Hive,
you might be able to do this with nested queries (as we’ll see), but at some point it will
be necessary to resort to temporary tables (which you have to manage yourself) to
manage the complexity.
Pig is described as a data flow language, rather than a query language. In Pig, you write
a series of declarative statements that define relations from other relations, where each
new relation performs some new data transformation. Pig looks at these declarations
and then builds up a sequence of MapReduce jobs to perform the transformations until
the final results are computed the way that you want.
This step-by-step “flow” of data can be more intuitive than a complex set of queries.
For this reason, Pig is often used as part of ETL (Extract, Transform, and Load)
processes used to ingest external data into a Hadoop cluster and transform it into a more
desirable form.
A drawback of Pig is that it uses a custom language not based on SQL. This is
appropriate, since it is not designed as a query language, but it also means that Pig is less
suitable for porting over SQL applications and experienced SQL users will have a larger
learning curve with Pig.
Nevertheless, it’s common for Hadoop teams to use a combination of Hive and Pig,
selecting the appropriate tool for particular jobs.
Programming Pig by Alan Gates (O’Reilly) provides a comprehensive introduction to Pig.
HBase
What if you need the database features that Hive doesn’t provide, like row-level
updates, rapid query response times, and transactions?
HBase is a distributed and scalable data store that supports row-level updates, rapid
queries, and row-level transactions (but not multirow transactions).
HBase is inspired by Google’s Big Table, although it doesn’t implement all Big Table
features. One of the important features HBase supports is column-oriented storage,
where columns can be organized into column families. Column families are physically
8 | Chapter 1:Introduction
stored together in a distributed cluster, which makes reads and writes faster when the
typical query scenarios involve a small subset of the columns. Rather than reading entire
rows and discarding most of the columns, you read only the columns you need.
HBase can be used like a key-value store, where a single key is used for each row to
provide very fast reads and writes of the row’s columns or column families. HBase also
keeps a configurable number of versions of each column’s values (marked by timestamps),
so it’s possible to go “back in time” to previous values, when needed.
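This versioned-cell behavior can be modeled conceptually with a sorted map from timestamp to value. The following is a standalone sketch with hypothetical names, not the HBase API:

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Conceptual sketch of an HBase-style versioned cell: a bounded number
// of timestamped values, with reads that can go "back in time."
public class VersionedCellSketch {
    private final NavigableMap<Long, String> versions = new TreeMap<>();
    private final int maxVersions;

    public VersionedCellSketch(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    public void put(long timestamp, String value) {
        versions.put(timestamp, value);
        // Keep only the most recent maxVersions values.
        while (versions.size() > maxVersions) {
            versions.pollFirstEntry();
        }
    }

    // The latest value at or before the given timestamp, or null if none.
    public String getAsOf(long timestamp) {
        Map.Entry<Long, String> entry = versions.floorEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        VersionedCellSketch cell = new VersionedCellSketch(3);
        cell.put(100L, "a");
        cell.put(200L, "b");
        System.out.println(cell.getAsOf(150L));
    }
}
```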
Finally, what is the relationship between HBase and Hadoop? HBase uses HDFS (or
one of the other distributed filesystems) for durable file storage of data. To provide
row-level updates and fast queries, HBase also uses in-memory caching of data and
local files for the append log of updates. Periodically, the durable files are updated with
all the append log updates, etc.
HBase doesn’t provide a query language like SQL, but Hive is now integrated with
HBase. We’ll discuss this integration in “HBase” on page 222.
For more on HBase, see the HBase website, and HBase: The Definitive Guide by Lars George (O’Reilly).
Cascading, Crunch, and Others
There are several other “high-level” languages that have emerged outside of the Apache
Hadoop umbrella, which also provide nice abstractions on top of Hadoop to reduce
the amount of low-level boilerplate code required for typical jobs. For completeness,
we list several of them here. All are JVM (Java Virtual Machine) libraries that can be
used from programming languages like Java, Clojure, Scala, JRuby, Groovy, and
Jython, as opposed to tools with their own languages, like Hive and Pig.
Using one of these programming languages has advantages and disadvantages. It makes
these tools less attractive to nonprogrammers who already know SQL. However, for
developers, these tools provide the full power of a Turing-complete programming
language. Neither Hive nor Pig is Turing complete. We’ll learn how to extend Hive with
Java code when we need additional functionality that Hive doesn’t provide (Table 1-1).
Table 1-1. Alternative higher-level libraries for Hadoop
Cascading: Java API with data processing abstractions. There are now many Domain
Specific Languages (DSLs) for Cascading in other languages, e.g., Scala, Groovy,
JRuby, and Jython.
Cascalog: A Clojure DSL for Cascading that provides additional functionality inspired
by Datalog for data processing and query abstractions.
Crunch: A Java and Scala API for defining data flow pipelines.
Because Hadoop is a batch-oriented system, there are tools with different distributed
computing models that are better suited for event stream processing, where closer to
“real-time” responsiveness is required. Here we list several of the many alternatives
(Table 1-2).
Table 1-2. Distributed data processing tools that don’t use MapReduce
Spark: A distributed computing framework based on the idea of distributed data sets
with a Scala API. It can work with HDFS files and it offers notable performance
improvements over Hadoop MapReduce for many computations. There is also a
project to port Hive to Spark, called Shark.
Storm: A real-time event stream processing system.
Kafka: A distributed publish-subscribe messaging system.
Finally, it’s important to consider when you don’t need a full cluster (e.g., for smaller
data sets or when the time to perform a computation is less critical). Also, many alter-
native tools are easier to use when prototyping algorithms or doing exploration with a
subset of data. Some of the more popular options are listed in Table 1-3.
Table 1-3. Other data processing languages and tools
R: An open source language for statistical analysis and graphing of data that is popular
with statisticians, economists, etc. It’s not a distributed system, so the data sizes it
can handle are limited. There are efforts to integrate R with Hadoop.
MatLab: A commercial system for data analysis and numerical methods that is popular
with engineers and scientists.
Octave: An open source clone of MatLab.
Mathematica: A commercial data analysis, symbolic manipulation, and numerical
methods system that is also popular with scientists and engineers.
SciPy, NumPy: Extensive software package for scientific programming in Python,
which is widely used by data scientists.
Java Versus Hive: The Word Count Algorithm
If you are not a Java programmer, you can skip to the next section.
If you are a Java programmer, you might be reading this book because you’ll need to
support the Hive users in your organization. You might be skeptical about using Hive
for your own work. If so, consider the following example that implements the Word
Count algorithm we discussed above, first using the Java MapReduce API and then
using Hive.
It’s very common to use Word Count as the first Java MapReduce program that people
write, because the algorithm is simple to understand, so you can focus on the API.
Hence, it has become the “Hello World” of the Hadoop world.
The following Java implementation is included in the Apache Hadoop distribution.8 If
you don’t know Java (and you’re still reading this section), don’t worry, we’re only
showing you the code for the size comparison:
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}
8. Apache Hadoop word count:
That was 63 lines of Java code. We won’t explain the API details.9 Here is the same
calculation written in HiveQL, which is just 8 lines of code, and does not require
compilation nor the creation of a “JAR” (Java ARchive) file:
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
We’ll explain all this HiveQL syntax later on.
9. See Hadoop: The Definitive Guide by Tom White for the details.
In both examples, the files were tokenized into words using the simplest possible
approach: splitting on whitespace boundaries. This approach doesn’t properly handle
punctuation, it doesn’t recognize that singular and plural forms of words are the same
word, etc. However, it’s good enough for our purposes here.10
The virtue of the Java API is the ability to customize and fine-tune every detail of an
algorithm implementation. However, most of the time, you just don’t need that level
of control and it slows you down considerably when you have to manage all those details.
If you’re not a programmer, then writing Java MapReduce code is out of reach. However,
if you already know SQL, learning Hive is relatively straightforward and many
applications are quick and easy to implement.
What’s Next
We described the important role that Hive plays in the Hadoop ecosystem. Now let’s
get started!
10. There is one other minor difference. The Hive query hardcodes a path to the data, while the Java code
takes the path as an argument. In Chapter 2, we’ll learn how to use Hive variables in scripts to avoid
hardcoding such details.
Getting Started
Let’s install Hadoop and Hive on our personal workstation. This is a convenient way
to learn and experiment with Hadoop. Then we’ll discuss how to configure Hive for
use on Hadoop clusters.
If you already use Amazon Web Services, the fastest path to setting up Hive for learning
is to run a Hive-configured job flow on Amazon Elastic MapReduce (EMR). We discuss
this option in Chapter 21.
If you have access to a Hadoop cluster with Hive already installed, we encourage
you to skim the first part of this chapter and pick up again at “What Is Inside
Hive?” on page 22.
Installing a Preconfigured Virtual Machine
There are several ways you can install Hadoop and Hive. An easy way to install a
complete Hadoop system, including Hive, is to download a preconfigured virtual
machine (VM) that runs in VMWare or VirtualBox. For VMWare, either VMWare
Player for Windows and Linux (free) or VMWare Fusion for Mac OS X (inexpensive)
can be used. VirtualBox is free for all these platforms, and also Solaris.
The virtual machines use Linux as the operating system, which is currently the only
recommended operating system for running Hadoop in production.3
Using a virtual machine is currently the only way to run Hadoop on
Windows systems, even when Cygwin or similar Unix-like software is installed.
3. However, some vendors are starting to support Hadoop on other systems. Hadoop has been used in
production on various Unix systems and it works fine on Mac OS X for development use.
Most of the preconfigured virtual machines (VMs) available are only designed for
VMWare, but if you prefer VirtualBox you may find instructions on the Web that
explain how to import a particular VM into VirtualBox.
You can download preconfigured virtual machines from one of the websites given in
Table 2-1.4 Follow the instructions on these web sites for loading the VM into VMWare.
Table 2-1. Preconfigured Hadoop virtual machines for VMWare
Cloudera, Inc.: Uses Cloudera’s own distribution of Hadoop, CDH3 or CDH4.
MapR, Inc.: MapR’s Hadoop distribution, which replaces HDFS with the MapR
Filesystem (MapR-FS).
Think Big Analytics, Inc.: Based on the latest, stable Apache release.
Next, go to “What Is Inside Hive?” on page 22.
Detailed Installation
While using a preconfigured virtual machine may be an easy way to run Hive, installing
Hadoop and Hive yourself will give you valuable insights into how these tools work,
especially if you are a developer.
The instructions that follow describe the minimum necessary Hadoop and Hive
installation steps for your personal Linux or Mac OS X workstation. For production
installations, consult the recommended installation procedures for your Hadoop distribution.
Installing Java
Hive requires Hadoop and Hadoop requires Java. Ensure your system has a recent
v1.6.X or v1.7.X JVM (Java Virtual Machine). Although the JRE (Java Runtime
Environment) is all you need to run Hive, you will need the full JDK (Java Development
Kit) to build examples in this book that demonstrate how to extend Hive with Java
code. However, if you are not a programmer, the companion source code distribution
for this book (see the Preface) contains prebuilt examples.
4. These are the current URLs at the time of this writing.
After the installation is complete, you’ll need to ensure that Java is in your path and
the JAVA_HOME environment variable is set.
Linux-specific Java steps
On Linux systems, the following instructions set up a bash file in the /etc/profile.d/
directory that defines JAVA_HOME for all users. Changing environmental settings in
this folder requires root access and affects all users of the system. (We’re using $ as the
bash shell prompt.) The Oracle JVM installer typically installs the software in
/usr/java/jdk-1.6.X (for v1.6) and it creates sym-links from /usr/java/default and
/usr/java/latest to the installation:
$ /usr/java/latest/bin/java -version
java version "1.6.0_23"
Java(TM) SE Runtime Environment (build 1.6.0_23-b05)
Java HotSpot(TM) 64-Bit Server VM (build 19.0-b09, mixed mode)
$ sudo echo "export JAVA_HOME=/usr/java/latest" > /etc/profile.d/
$ sudo echo "PATH=$PATH:$JAVA_HOME/bin" >> /etc/profile.d/
$ . /etc/profile
$ echo $JAVA_HOME
If you’ve never used sudo (“super user do something”) before to run a
command as a “privileged” user, as in two of the commands, just type
your normal password when you’re asked for it. If you’re on a personal
machine, your user account probably has “sudo rights.” If not, ask your
administrator to run those commands.
However, if you don’t want to make permanent changes that affect all
users of the system, an alternative is to put the definitions shown for
PATH and JAVA_HOME in your $HOME/.bashrc file:
export JAVA_HOME=/usr/java/latest
export PATH=$PATH:$JAVA_HOME/bin
Mac OS X−specific Java steps
Mac OS X systems don’t have the /etc/profile.d directory and they are typically
single-user systems, so it’s best to put the environment variable definitions in your
$HOME/.bashrc. The Java paths are different, too, and they may be in one of several places.
Here are a few examples. You’ll need to determine where Java is installed on your Mac
and adjust the definitions accordingly. Here is a Java 1.6 example for Mac OS X:
$ export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
$ export PATH=$PATH:$JAVA_HOME/bin
5. At least that’s the current situation on Dean’s Mac. This discrepancy may actually reflect the fact that
stewardship of the Mac OS X Java port is transitioning from Apple to Oracle as of Java 1.7.
Here is a Java 1.7 example for Mac OS X:
$ export JAVA_HOME=/Library/Java/JavaVirtualMachines/1.7.0.jdk/Contents/Home
$ export PATH=$PATH:$JAVA_HOME/bin
OpenJDK 1.7 releases also install under /Library/Java/JavaVirtualMachines.
Installing Hadoop
Hive runs on top of Hadoop. Hadoop is an active open source project with many
releases and branches. Also, many commercial software companies are now producing
their own distributions of Hadoop, sometimes with custom enhancements or
replacements for some components. This situation promotes innovation, but also potential
confusion and compatibility issues.
Keeping software up to date lets you exploit the latest performance enhancements and
bug fixes. However, sometimes you introduce new bugs and compatibility issues. So,
for this book, we’ll show you how to install the Apache Hadoop release v0.20.2. This
edition is not the most recent stable release, but it has been the reliable gold standard
for some time for performance and compatibility.
However, you should be able to choose a different version, distribution, or release
without problems for learning and using Hive, such as the Apache Hadoop v0.20.205
or 1.0.X releases, Cloudera CDH3 or CDH4, MapR M3 or M5, and the forthcoming
Hortonworks distribution. Note that the bundled Cloudera, MapR, and planned
Hortonworks distributions all include a Hive release.
However, we don’t recommend installing the new, alpha-quality, “Next Generation”
Hadoop v2.0 (also known as v0.23), at least for the purposes of this book. While this
release will bring significant enhancements to the Hadoop ecosystem, it is too new for
our purposes.
To install Hadoop on a Linux system, run the following commands. Note that we
wrapped the long line for the wget command:
$ cd ~ # or use another directory of your choice.
$ wget \
$ tar -xzf hadoop-0.20.2.tar.gz
$ sudo echo "export HADOOP_HOME=$PWD/hadoop-0.20.2" > /etc/profile.d/
$ sudo echo "PATH=$PATH:$HADOOP_HOME/bin" >> /etc/profile.d/
$ . /etc/profile
To install Hadoop on a Mac OS X system, run the following commands. Note that we
wrapped the long line for the curl command:
$ cd ~ # or use another directory of your choice.
$ curl -o \
$ tar -xzf hadoop-0.20.2.tar.gz
$ echo "export HADOOP_HOME=$PWD/hadoop-0.20.2" >> $HOME/.bashrc
$ echo "PATH=$PATH:$HADOOP_HOME/bin" >> $HOME/.bashrc
$ . $HOME/.bashrc
In what follows, we will assume that you added $HADOOP_HOME/bin to your path, as in
the previous commands. This will allow you to simply type the hadoop command
without the path prefix.
Local Mode, Pseudodistributed Mode, and Distributed Mode
Before we proceed, let’s clarify the different runtime modes for Hadoop. We mentioned
above that the default mode is local mode, where filesystem references use the local
filesystem. Also in local mode, when Hadoop jobs are executed (including most Hive
queries), the Map and Reduce tasks are run as part of the same process.
Actual clusters are configured in distributed mode, where all filesystem references that
aren’t full URIs default to the distributed filesystem (usually HDFS) and jobs are
managed by the JobTracker service, with individual tasks executed in separate processes.
A dilemma for developers working on personal machines is the fact that local mode
doesn’t closely resemble the behavior of a real cluster, which is important to remember
when testing applications. To address this need, a single machine can be configured to
run in pseudodistributed mode, where the behavior is identical to distributed mode,
namely filesystem references default to the distributed filesystem and jobs are managed
by the JobTracker service, but there is just a single machine. Hence, for example, HDFS
file block replication is limited to one copy. In other words, the behavior is like a
single-node “cluster.” We’ll discuss these configuration options in “Configuring Your
Hadoop Environment” on page 24.
Because Hive uses Hadoop jobs for most of its work, its behavior reflects the Hadoop
mode you’re using. However, even when running in distributed mode, Hive can decide
on a per-query basis whether or not it can perform the query using just local mode,
where it reads the data files and manages the MapReduce tasks itself, providing faster
turnaround. Hence, the distinction between the different modes is more of an
execution style for Hive than a deployment style, as it is for Hadoop.
For most of the book, it won’t matter which mode you’re using. We’ll assume you’re
working on a personal machine in local mode and we’ll discuss the cases where the
mode matters.
When working with small data sets, using local mode execution
will make Hive queries much faster. Setting the property set; will cause Hive to use this mode more
aggressively, even when you are running Hadoop in distributed or pseudodistributed
mode. To always use this setting, add the command to
your $HOME/.hiverc file (see “The .hiverc File” on page 36).
Testing Hadoop
Assuming you’re using local mode, let’s look at the local filesystem two different ways.
The following output of the Linux ls command shows the typical contents of the “root”
directory of a Linux system:
$ ls /
bin cgroup etc lib lost+found mnt opt root selinux sys user var
boot dev home lib64 media null proc sbin srv tmp usr
Hadoop provides a dfs tool that offers basic filesystem functionality like ls for the
default filesystem. Since we’re using local mode, the default filesystem is the local
filesystem:
$ hadoop dfs -ls /
Found 26 items
drwxrwxrwx - root root 24576 2012-06-03 14:28 /tmp
drwxr-xr-x - root root 4096 2012-01-25 22:43 /opt
drwx------ - root root 16384 2010-12-30 14:56 /lost+found
drwxr-xr-x - root root 0 2012-05-11 16:44 /selinux
dr-xr-x--- - root root 4096 2012-05-23 22:32 /root
If instead you get an error message that hadoop isn’t found, either invoke the command
with the full path (e.g., $HOME/hadoop-0.20.2/bin/hadoop) or add the bin directory to
your PATH variable, as discussed in “Installing Hadoop” on page 18 above.
If you find yourself using the hadoop dfs command frequently, it’s
convenient to define an alias for it (e.g., alias hdfs="hadoop dfs").
Hadoop offers a framework for MapReduce. The Hadoop distribution contains an
implementation of the Word Count algorithm we discussed in Chapter 1. Let’s run it!
Start by creating an input directory (inside your current working directory) with files
to be processed by Hadoop:
$ mkdir wc-in
$ echo "bla bla" > wc-in/a.txt
$ echo "bla wa wa " > wc-in/b.txt
Use the hadoop command to launch the Word Count application on the input directory
we just created. Note that it’s conventional to always specify directories for input and
output, not individual files, since there will often be multiple input and/or output files
per directory, a consequence of the parallelism of the system.
6. Unfortunately, the dfs -ls command only provides a “long listing” format. There is no short format, like
the default for the Linux ls command.
If you are running these commands on your local installation that was configured to
use local mode, the hadoop command will launch the MapReduce components in the
same process. If you are running on a cluster or on a single machine using
pseudodistributed mode, the hadoop command will launch one or more separate processes using
the JobTracker service (and the output below will be slightly different). Also, if you are
running with a different version of Hadoop, change the name of the examples.jar as needed:
$ hadoop jar $HADOOP_HOME/hadoop-0.20.2-examples.jar wordcount wc-in wc-out
12/06/03 15:40:26 INFO input.FileInputFormat: Total input paths to process : 2
12/06/03 15:40:27 INFO mapred.JobClient: Running job: job_local_0001
12/06/03 15:40:30 INFO mapred.JobClient: map 100% reduce 0%
12/06/03 15:40:41 INFO mapred.JobClient: map 100% reduce 100%
12/06/03 15:40:41 INFO mapred.JobClient: Job complete: job_local_0001
The results of the Word Count application can be viewed through local filesystem commands:
$ ls wc-out/*
$ cat wc-out/*
bla 3
wa 2
They can also be viewed by the equivalent dfs command (again, because we assume
you are running in local mode):
$ hadoop dfs -cat wc-out/*
bla 3
wa 2
For very big files, if you want to view just the first or last parts, there is
no -more, -head, nor -tail subcommand. Instead, just pipe the output
of the -cat command through the shell’s more, head, or tail. For exam-
ple: hadoop dfs -cat wc-out/* | more.
Now that we have installed and tested an installation of Hadoop, we can install Hive.
Installing Hive
Installing Hive is similar to installing Hadoop. We will download and extract a tarball
for Hive, which does not include an embedded version of Hadoop. A single Hive binary
is designed to work with multiple versions of Hadoop. This means it’s often easier and
less risky to upgrade to newer Hive releases than it is to upgrade to newer Hadoop versions.
Hive uses the environment variable HADOOP_HOME to locate the Hadoop JARs and
configuration files. So, make sure you set that variable as discussed above before
proceeding. The following commands work for both Linux and Mac OS X:
$ cd ~ # or use another directory of your choice.
$ curl -o
$ tar -xzf hive-0.9.0.tar.gz
$ sudo mkdir -p /user/hive/warehouse
$ sudo chmod a+rwx /user/hive/warehouse
As you can infer from these commands, we are using the latest stable release of Hive
at the time of this writing, v0.9.0. However, most of the material in this book works
with Hive v0.7.X and v0.8.X. We’ll call out the differences as we come to them.
You’ll want to add the hive command to your path, like we did for the hadoop command.
We’ll follow the same approach, by first defining a HIVE_HOME variable, but unlike
HADOOP_HOME, this variable isn’t really essential. We’ll assume it’s defined for some
examples later in the book.
For Linux, run these commands:
$ sudo echo "export HIVE_HOME=$PWD/hive-0.9.0" > /etc/profile.d/
$ sudo echo "PATH=$PATH:$HIVE_HOME/bin" >> /etc/profile.d/
$ . /etc/profile
For Mac OS X, run these commands:
$ echo "export HIVE_HOME=$PWD/hive-0.9.0" >> $HOME/.bashrc
$ echo "PATH=$PATH:$HIVE_HOME/bin" >> $HOME/.bashrc
$ . $HOME/.bashrc
What Is Inside Hive?
The core of a Hive binary distribution contains three parts. The main part is the Java
code itself. Multiple JAR (Java archive) files such as hive-exec*.jar and
hive-metastore*.jar are found under the $HIVE_HOME/lib directory. Each JAR file implements
a particular subset of Hive’s functionality, but the details don’t concern us now.
The $HIVE_HOME/bin directory contains executable scripts that launch various Hive
services, including the hive command-line interface (CLI). The CLI is the most popular
way to use Hive. We will use hive (in lowercase, with a fixed-width font) to refer to the
CLI, except where noted. The CLI can be used interactively to type in statements one
at a time or it can be used to run “scripts” of Hive statements, as we’ll see.
Hive also has other components. A Thrift service provides remote access from other
processes. Access using JDBC and ODBC are provided, too. They are implemented on
top of the Thrift service. We’ll describe these features in later chapters.
All Hive installations require a metastore service, which Hive uses to store table schemas
and other metadata. It is typically implemented using tables in a relational database.
By default, Hive uses a built-in Derby SQL server, which provides limited, single-
process storage. For example, when using Derby, you can’t run two simultaneous in-
stances of the Hive CLI. However, this is fine for learning Hive on a personal machine
and some developer tasks. For clusters, MySQL or a similar relational database is
required. We will discuss the details in “Metastore Using JDBC” on page 28.
Finally, a simple web interface, called Hive Web Interface (HWI), provides remote
access to Hive.
The conf directory contains the files that configure Hive. Hive has a number of
configuration properties that we will discuss as needed. These properties control
features such as the metastore (where table schemas and other metadata are stored),
various optimizations, and “safety controls.”
Starting Hive
Let’s finally start the Hive command-line interface (CLI) and run a few commands!
We’ll briefly comment on what’s happening, but save the details for discussion later.
In the following session, we’ll use the $HIVE_HOME/bin/hive command, which is a
bash shell script, to start the CLI. Substitute the directory where Hive is installed on
your system whenever $HIVE_HOME is listed in the following script. Or, if you added
$HIVE_HOME/bin to your PATH, you can just type hive to run the command. We’ll make
that assumption for the rest of the book.
As before, $ is the bash prompt. In the Hive CLI, the hive> string is the hive prompt,
and the indented > is the secondary prompt. Here is a sample session, where we have
added a blank line after the output of each command, for clarity:
$ bin/hive
Hive history file=/tmp/myname/hive_job_log_myname_201201271126_1992326118.txt
hive> CREATE TABLE x (a INT);
Time taken: 3.543 seconds
hive> SELECT * FROM x;
Time taken: 0.231 seconds
hive> SELECT *
> FROM x;
Time taken: 0.072 seconds
hive> DROP TABLE x;
Time taken: 0.834 seconds
hive> exit;
The first line printed by the CLI is the local filesystem location where the CLI writes
log data about the commands and queries you execute. If a command or query is
successful, the first line of output will be OK, followed by the output, and finished by
the line showing the amount of time taken to run the command or query.
Throughout the book, we will follow the SQL convention of showing
Hive keywords in uppercase (e.g., CREATE, TABLE, SELECT, and FROM),
even though Hive ignores the case of keywords.
Going forward, we’ll usually add the blank line after the command out-
put for all sessions. Also, when starting a session, we’ll omit the line
about the logfile. For individual commands and queries, we’ll omit the
OK and Time taken:... lines, too, except in special cases, such as when
we want to emphasize that a command or query was successful, but it
had no other output.
At the successive prompts, we create a simple table named x with a single INT (4-byte
integer) column named a, then query it twice, the second time showing how queries
and commands can spread across multiple lines. Finally, we drop the table.
If you are running with the default Derby database for the metastore, you’ll notice that
your current working directory now contains a new subdirectory called metastore_db
that was created by Derby during the short hive session you just executed. If you are
running one of the VMs, it’s possible it has configured different behavior, as we’ll dis-
cuss later.
Creating a metastore_db subdirectory under whatever working directory you happen
to be in is not convenient, as Derby “forgets” about previous metastores when you
change to a new working directory! In the next section, we’ll see how to configure a
permanent location for the metastore database, as well as make other changes.
Configuring Your Hadoop Environment
Let’s dive a little deeper into the different Hadoop modes and discuss more configu-
ration issues relevant to Hive.
You can skip this section if you’re using Hadoop on an existing cluster or you are using
a virtual machine instance. If you are a developer or you installed Hadoop and Hive
yourself, you’ll want to understand the rest of this section. However, we won’t provide
a complete discussion. See Appendix A of Hadoop: The Definitive Guide by Tom White
for the full details on configuring the different modes.
Local Mode Configuration
Recall that in local mode, all references to files go to your local filesystem, not the
distributed filesystem. There are no services running. Instead, your jobs run all tasks
in a single JVM instance.
Figure 2-1 illustrates a Hadoop job running in local mode.
Figure 2-1. Hadoop in local mode
If you plan to use the local mode regularly, it’s worth configuring a standard location
for the Derby metastore_db, where Hive stores metadata about your tables, etc.
You can also configure a different directory for Hive to store table data, if you don’t
want to use the default location, which is file:///user/hive/warehouse, for local mode,
and hdfs://namenode_server/user/hive/warehouse for the other modes discussed next.
First, go to the $HIVE_HOME/conf directory. The curious may want to peek at the
large hive-default.xml.template file, which shows the different configuration properties
supported by Hive and their default values. Most of these properties you can safely
ignore. Changes to your configuration are done by editing the hive-site.xml file. Create
one if it doesn’t already exist.
Here is an example configuration file where we set several properties for local mode
execution (Example 2-1).
Example 2-1. Local-mode hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/home/me/hive/warehouse</value>
    <description>
      Local or HDFS directory where Hive keeps table contents.
    </description>
  </property>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
    <description>
      Use false if a production metastore server is used.
    </description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=/home/me/hive/metastore_db;create=true</value>
    <description>
      The JDBC connection URL.
    </description>
  </property>
</configuration>
You can remove any of these <property>...</property> tags you don’t want to change.
As the <description> tags indicate, the hive.metastore.warehouse.dir tells Hive where
in your local filesystem to keep the data contents for Hive’s tables. (This value is
appended to the value of fs.default.name defined in the Hadoop configuration, which
defaults to file:///.) You can use any directory path you want for the value. Note that this
directory will not be used to store the table metadata, which goes in the separate
metastore database.
The hive.metastore.local property defaults to true, so we don’t really need to show
it in Example 2-1. It’s there more for documentation purposes. This property controls
whether to connect to a remote metastore server or open a new metastore server as part
of the Hive Client JVM. This setting is almost always set to true and JDBC is used to
communicate directly to a relational database. When it is set to false, Hive will
communicate through a metastore server, which we’ll discuss in “Metastore Meth-
ods” on page 216.
The value for the javax.jdo.option.ConnectionURL property makes one small but con-
venient change to the default value for this property. This property tells Hive how to
connect to the metastore server. By default, it uses the current working directory for
the databaseName part of the value string. As shown in Example 2-1, we use database
Name=/home/me/hive/metastore_db as the absolute path instead, which is the location
where the metastore_db directory will always be located. This change eliminates the
problem of Hive dropping the metastore_db directory in the current working directory
every time we start a new Hive session. Now, we’ll always have access to all our
metadata, no matter what directory we are working in.
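In isolation, the connection property uses Derby’s JDBC syntax with an absolute databaseName, as in this sketch (substitute your own path; create=true tells Derby to create the database if it doesn’t already exist):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/home/me/hive/metastore_db;create=true</value>
</property>
```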
Distributed and Pseudodistributed Mode Configuration
In distributed mode, several services run in the cluster. The JobTracker manages jobs
and the NameNode is the HDFS master. Worker nodes run individual job tasks, man-
aged by a TaskTracker service on each node, and then hold blocks for files in the
distributed filesystem, managed by DataNode services.
Figure 2-2 shows a typical distributed mode configuration for a Hadoop cluster.
We’re using the convention that *.domain.pvt is our DNS naming convention for the
cluster’s private, internal network.
Pseudodistributed mode is nearly identical; it’s effectively a one-node cluster.
We’ll assume that your administrator has already configured Hadoop, including your
distributed filesystem (e.g., HDFS, or see Appendix A of Hadoop: The Definitive
Guide by Tom White). Here, we’ll focus on the unique configuration steps required by Hive.
One Hive property you might want to configure is the top-level directory for table
storage, which is specified by the property hive.metastore.warehouse.dir, which we
also discussed in “Local Mode Configuration” on page 24.
The default value for this property is /user/hive/warehouse in the Apache Hadoop and
MapR distributions, which will be interpreted as a distributed filesystem path when
Hadoop is configured for distributed or pseudodistributed mode. For Amazon Elastic
MapReduce (EMR), the default value is /mnt/hive_0M_N/warehouse when using Hive
v0.M.N (e.g., /mnt/hive_08_1/warehouse).
Specifying a different value here allows each user to define their own warehouse direc-
tory, so they don’t affect other system users. Hence, each user might use the following
statement to define their own warehouse directory:
set hive.metastore.warehouse.dir=/user/myname/hive/warehouse;
It’s tedious to type this each time you start the Hive CLI or to remember to add it to
every Hive script. Of course, it’s also easy to forget to define this property. Instead, it’s
Figure 2-2. Hadoop in distributed mode
best to put commands like this in the $HOME/.hiverc file, which will be processed
when Hive starts. See “The .hiverc File” on page 36 for more details.
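For example, the setting can be appended from the shell; a sketch, where “myname” is a placeholder username and the path is whatever warehouse directory you chose:

```shell
# Record the per-user warehouse setting in $HOME/.hiverc so every
# new CLI session picks it up automatically.
echo 'set hive.metastore.warehouse.dir=/user/myname/hive/warehouse;' >> "$HOME/.hiverc"
tail -n 1 "$HOME/.hiverc"
```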
We’ll assume the value is /user/hive/warehouse from here on.
Metastore Using JDBC
Hive requires only one extra component that Hadoop does not already have: the
metastore component. The metastore stores metadata such as table schema and parti-
tion information that you specify when you run commands such as create table
x..., or alter table y..., etc. Because multiple users and systems are likely to need
concurrent access to the metastore, the default embedded database is not suitable for production use.
If you are using a single node in pseudodistributed mode, you may not
find it useful to set up a full relational database for the metastore. Rather,
you may wish to continue using the default Derby store, but configure
it to use a central location for its data, as described in “Local Mode
Configuration” on page 24.
Any JDBC-compliant database can be used for the metastore. In practice, most instal-
lations of Hive use MySQL. We’ll discuss how to use MySQL. It is straightforward to
adapt this information to other JDBC-compliant databases.
The information required for table schema, partition information, etc.,
is small, typically much smaller than the large quantity of data stored in
Hive. As a result, you typically don’t need a powerful dedicated database
server for the metastore. However because it represents a Single Point
of Failure (SPOF), it is strongly recommended that you replicate and
back up this database using the standard techniques you would nor-
mally use with other relational database instances. We won’t discuss
those techniques here.
For our MySQL configuration, we need to know the host and port the service is running
on. We will assume db1.mydomain.pvt and port 3306, which is the standard MySQL
port. Finally, we will assume that hive_db is the name of our catalog. We define these
properties in Example 2-2.
Example 2-2. Metastore database configuration in hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://db1.mydomain.pvt:3306/hive_db?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>database_user</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>database_pass</value>
  </property>
</configuration>
You may have noticed the ConnectionURL property starts with a prefix of jdbc:mysql.
For Hive to be able to connect to MySQL, we need to place the JDBC driver in our
classpath. Download the MySQL JDBC driver (Connector/J) from http://www.mysql
.com/downloads/connector/j/. The driver can be placed in the Hive library path,
$HIVE_HOME/lib. Some teams put all such support libraries in their Hadoop lib
directory instead.
With the driver and the configuration settings in place, Hive will store its metastore
information in MySQL.
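Concretely, copying the driver into place might look like this (a sketch; the jar’s exact name depends on the Connector/J version you download):

```
$ cp mysql-connector-java-5.1.22-bin.jar $HIVE_HOME/lib/
```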
The Hive Command
The $HIVE_HOME/bin/hive shell command, which we’ll simply refer to as hive from now
on, is the gateway to Hive services, including the command-line interface or CLI.
We’ll also assume that you have added $HIVE_HOME/bin to your environment’s PATH so
you can type hive at the shell prompt and your shell environment (e.g., bash) will find
the command.
Command Options
If you run the following command, you’ll see a brief list of the options for the hive
command. Here is the output for Hive v0.8.X and v0.9.X:
$ bin/hive --help
Usage ./hive <parameters> --service serviceName <service parameters>
Service List: cli help hiveserver hwi jar lineage metastore rcfilecat
Parameters parsed:
--auxpath : Auxiliary jars
--config : Hive configuration directory
--service : Starts specific service/component. cli is default
Parameters used:
HADOOP_HOME : Hadoop install directory
HIVE_OPT : Hive options
For help on a particular service:
./hive --service serviceName --help
Debug help: ./hive --debug --help
Note the Service List. There are several services available, including the CLI that we
will spend most of our time using. You can invoke a service using the --service name
option, although there are shorthand invocations for some of the services, as well.
Table 2-2 describes the most useful services.
Table 2-2. Hive services
Option Name Description
cli Command-line interface Used to define tables, run queries, etc. It is the default service
if no other service is specified. See “The Command-Line Inter-
face” on page 30.
hiveserver Hive Server A daemon that listens for Thrift connections from other pro-
cesses. See Chapter 16 for more details.
hwi Hive Web Interface A simple web interface for running queries and other
commands without logging into a cluster machine and using the CLI.
jar An extension of the hadoop jar command for running an
application that also requires the Hive environment.
metastore Start an external Hive metastore service to support multiple
clients (see also “Metastore Using JDBC” on page 28).
rcfilecat A tool for printing the contents of an RCFile (see
“RCFile” on page 202).
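For instance, invoking a service longhand looks like the following sketch, assuming hive is on your PATH (hwi and metastore also need the corresponding configuration in place before they will do anything useful):

```
$ hive --service metastore   # run a standalone metastore service
$ hive --service hwi         # run the Hive Web Interface
$ hive --service rcfilecat --help
```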
The --auxpath option lets you specify a colon-separated list of “auxiliary” Java archive
(JAR) files that contain custom extensions, etc., that you might require.
The --config directory is mostly useful if you have to override the default configuration
properties in $HIVE_HOME/conf in a new directory.
The Command-Line Interface
The command-line interface or CLI is the most common way to interact with Hive.
Using the CLI, you can create tables, inspect schema and query tables, etc.
CLI Options
The following command shows a brief list of the options for the CLI. Here we show
the output for Hive v0.8.X and v0.9.X:
$ hive --help --service cli
usage: hive
-d,--define <key=value> Variable substitution to apply to hive
commands. e.g. -d A=B or --define A=B
-e <quoted-query-string> SQL from command line
-f <filename> SQL from files
-H,--help Print help information
-h <hostname> connecting to Hive Server on remote host
--hiveconf <property=value> Use value for given property
--hivevar <key=value> Variable substitution to apply to hive
commands. e.g. --hivevar A=B
-i <filename> Initialization SQL file
-p <port> connecting to Hive Server on port number
-S,--silent Silent mode in interactive shell
-v,--verbose Verbose mode (echo executed SQL to the
A shorter version of this command is hive -h. However, it is technically an unsupported
option; it produces the help output with an additional line that complains
about Missing argument for option: h.
For Hive v0.7.X, the -d, --hivevar, and -p options are not supported.
Let’s explore these options in more detail.
Variables and Properties
The --define key=value option is effectively equivalent to the --hivevar key=value
option. Both let you define on the command line custom variables that you can refer-
ence in Hive scripts to customize execution. This feature is only supported in Hive
v0.8.0 and later versions.
When you use this feature, Hive puts the key-value pair in the hivevar “namespace” to
distinguish these definitions from three other built-in namespaces, hiveconf, system,
and env.
The terms variable or property are used in different contexts, but they
function the same way in most cases.
The namespace options are described in Table 2-3.
Table 2-3. Hive namespaces for variables and properties
Namespace Access Description
hivevar Read/Write (v0.8.0 and later) User-defined custom variables.
hiveconf Read/Write Hive-specific configuration properties.
system Read/Write Configuration properties defined by Java.
env Read only Environment variables defined by the shell environment (e.g., bash).
Hive’s variables are internally stored as Java Strings. You can reference variables in
queries; Hive replaces the reference with the variable’s value before sending the query
to the query processor.
Inside the CLI, variables are displayed and changed using the SET command. For ex-
ample, the following session shows the value for one variable, in the env namespace,
and then all variable definitions! Here is a Hive session where some output has been
omitted and we have added a blank line after the output of each command for clarity:
$ hive
hive> set env:HOME;
hive> set;
... lots of output including these variables:
hive> set -v;
... even more output!...
Without the -v flag, set prints all the variables in the namespaces hivevar, hiveconf,
system, and env. With the -v option, it also prints all the properties defined by Hadoop,
such as properties controlling HDFS and MapReduce.
The set command is also used to set new values for variables. Let’s look specifically at
the hivevar namespace and a variable that is defined for it on the command line:
$ hive --define foo=bar
hive> set foo;
hive> set hivevar:foo;
hive> set hivevar:foo=bar2;
hive> set foo;
hive> set hivevar:foo;
As we can see, the hivevar: prefix is optional. The --hivevar flag is the same as the
--define flag.
Variable references in queries are replaced in the CLI before the query is sent to the
query processor. Consider the following hive CLI session (v0.8.X only):
hive> create table toss1(i int, ${hivevar:foo} string);
hive> describe toss1;
i int
bar2 string
hive> create table toss2(i2 int, ${foo} string);
hive> describe toss2;
i2 int
bar2 string
hive> drop table toss1;
hive> drop table toss2;
Let’s look at the --hiveconf option, which is supported in Hive v0.7.X. It is used for
all properties that configure Hive behavior. We’ll use it with a property
hive.cli.print.current.db that was added in Hive v0.8.0. It turns on printing of the
current working database name in the CLI prompt. (See “Databases in
Hive” on page 49 for more on Hive databases.) The default database is named
default. This property is false by default:
$ hive --hiveconf hive.cli.print.current.db=true
hive (default)> set hive.cli.print.current.db;
hive (default)> set hiveconf:hive.cli.print.current.db;
hive (default)> set hiveconf:hive.cli.print.current.db=false;
hive> set hiveconf:hive.cli.print.current.db=true;
hive (default)> ...
We can even add new hiveconf entries, which is the only supported option for Hive
versions earlier than v0.8.0:
$ hive --hiveconf y=5
hive> set y;
hive> CREATE TABLE whatsit(i int);
hive> ... load data into whatsit ...
hive> SELECT * FROM whatsit WHERE i = ${hiveconf:y};
It’s also useful to know about the system namespace, which provides read-write access
to Java system properties, and the env namespace, which provides read-only access to
environment variables:
hive> set system:user.name;

hive> set system:user.name=yourusername;

hive> set system:user.name;

hive> set env:HOME;

hive> set env:HOME=/tmp;
env:* variables can not be set.
Unlike hivevar variables, you have to use the system: or env: prefix with system prop-
erties and environment variables.
The env namespace is useful as an alternative way to pass variable definitions to Hive,
especially for Hive v0.7.X. Consider the following example:
$ YEAR=2012 hive -e "SELECT * FROM mytable WHERE year = ${env:YEAR}";
The query processor will see the literal number 2012 in the WHERE clause.
If you are using Hive v0.7.X, some of the examples in this book that use
parameters and variables may not work as written. If so, replace the
variable reference with the corresponding value.
All of Hive’s built-in properties are listed in $HIVE_HOME/conf/hive-
default.xml.template, the “sample” configuration file. It also shows the
default values for each property.
Hive “One Shot” Commands
The user may wish to run one or more queries (semicolon separated) and then have
the hive CLI exit immediately after completion. The CLI accepts a -e command argument
that enables this feature. If mytable has a string and integer column, we might see the
following output:
$ hive -e "SELECT * FROM mytable LIMIT 3";
name1 10
name2 20
name3 30
Time taken: 4.955 seconds
A quick and dirty technique is to use this feature to output the query results to a file.
Adding the -S for silent mode removes the OK and Time taken ... lines, as well as other
inessential output, as in this example:
$ hive -S -e "select * FROM mytable LIMIT 3" > /tmp/myquery
$ cat /tmp/myquery
name1 10
name2 20
name3 30
Note that hive wrote the output to the standard output and the shell command redi-
rected that output to the local filesystem, not to HDFS.
Finally, here is a useful trick for finding a property name that you can’t quite remember,
without having to scroll through the list of the set output. Suppose you can’t remember
the name of the property that specifies the “warehouse” location for managed tables:
$ hive -S -e "set" | grep warehouse
The one we want, hive.metastore.warehouse.dir, is the first one listed.
Executing Hive Queries from Files
Hive can execute one or more queries that were saved to a file using the -f file argu-
ment. By convention, saved Hive query files use the .q or .hql extension.
$ hive -f /path/to/file/withqueries.hql
If you are already inside the Hive shell you can use the SOURCE command to execute a
script file. Here is an example:
$ cat /path/to/file/withqueries.hql
SELECT x.* FROM src x;
$ hive
hive> source /path/to/file/withqueries.hql;
By the way, we’ll occasionally use the name src (“source”) for tables in queries when
the name of the table is irrelevant for the example. This convention is taken from the
unit tests in Hive’s source code, which create a src table before running each test.
For example, when experimenting with a built-in function, it’s convenient to write a
“query” that passes literal arguments to the function, as in the following example taken
from later in the book, “XPath-Related Functions” on page 207:
hive> SELECT xpath('<a><b id="foo">b1</b><b id="bar">b2</b></a>', '//@id')
> FROM src LIMIT 1;
The details for xpath don’t concern us here, but note that we pass string literals to the
xpath function and use FROM src LIMIT 1 to specify the required FROM clause and to limit
the output. Substitute src with the name of a table you have already created or create
a dummy table named src:
$ hive -e "CREATE TABLE src(s STRING);"
Also, the source table must have at least one row of content in it:
$ echo "one row" > /tmp/myfile
$ hive -e "LOAD DATA LOCAL INPATH '/tmp/myfile' INTO TABLE src;"
The .hiverc File
The last CLI option we’ll discuss is the -i file option, which lets you specify a file of
commands for the CLI to run as it starts, before showing you the prompt. Hive auto-
matically looks for a file named .hiverc in your HOME directory and runs the commands
it contains, if any.
These files are convenient for commands that you run frequently, such as setting
system properties (see “Variables and Properties” on page 31) or adding Java archives
(JAR files) of custom Hive extensions to Hadoop’s distributed cache (as discussed in
Chapter 15).
The following shows an example of a typical $HOME/.hiverc file:
ADD JAR /path/to/custom_hive_extensions.jar;
set hive.cli.print.current.db=true;
set hive.exec.mode.local.auto=true;
The first line adds a JAR file to the Hadoop distributed cache. The second line modifies
the CLI prompt to show the current working Hive database, as we described earlier in
“Variables and Properties” on page 31. The last line “encourages” Hive to be more
aggressive about using local-mode execution when possible, even when Hadoop is
running in distributed or pseudo-distributed mode, which speeds up queries for small
data sets.
An easy mistake to make is to forget the semicolon at the end of lines
like this. When you make this mistake, the definition of the property
will include all the text from all the subsequent lines in the file until the
next semicolon.
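For instance, if the first line below were missing its trailing semicolon, Hive would read both lines as a single set statement, and the first property’s value would swallow all of the second line’s text up to the final semicolon (a sketch of the broken input):

```
set hive.cli.print.current.db=true
set hive.exec.mode.local.auto=true;
```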
More on Using the Hive CLI
The CLI supports a number of other useful features.
If you start typing and hit the Tab key, the CLI will autocomplete possible keywords
and function names. For example, if you type SELE and then the Tab key, the CLI will
complete the word SELECT.
If you type the Tab key at the prompt, you’ll get this reply:
Display all 407 possibilities? (y or n)
If you enter y, you’ll get a long list of all the keywords and built-in functions.
A common source of error and confusion when pasting statements into
the CLI occurs where some lines begin with a tab. You’ll get the prompt
about displaying all possibilities, and subsequent characters in the
stream will get misinterpreted as answers to the prompt, causing the
command to fail.
Command History
You can use the up and down arrow keys to scroll through previous commands. Ac-
tually, each previous line of input is shown separately; the CLI does not combine mul-
tiline commands and queries into a single history entry. Hive saves the last 100,000 lines
into a file named $HOME/.hivehistory.
If you want to repeat a previous command, scroll to it and hit Enter. If you want to edit
the line before entering it, use the left and right arrow keys to navigate to the point
where changes are required and edit the line. You can hit Return to submit it without
returning to the end of the line.
Most navigation keystrokes using the Control key work as they do for
the bash shell (e.g., Control-A goes to the beginning of the line and
Control-E goes to the end of the line). However, similar “meta,” Option,
or Escape keys don’t work (e.g., Option-F to move forward a word at a
time). Similarly, the Delete key will delete the character to the left of the
cursor, but the Forward Delete key doesn’t delete the character under
the cursor.
Shell Execution
You don’t need to leave the hive CLI to run simple bash shell commands. Simply
type ! followed by the command and terminate the line with a semicolon (;):
hive> ! /bin/echo "what up dog";
"what up dog"
hive> ! pwd;
Don’t invoke interactive commands that require user input. Shell “pipes” don’t work
and neither do file “globs.” For example, ! ls *.hql; will look for a file named *.hql;,
rather than all files that end with the .hql extension.
Hadoop dfs Commands from Inside Hive
You can run the hadoop dfs ... commands from within the hive CLI; just drop the
hadoop word from the command and add the semicolon at the end:
hive> dfs -ls / ;
Found 3 items
drwxr-xr-x - root supergroup 0 2011-08-17 16:27 /etl
drwxr-xr-x - edward supergroup 0 2012-01-18 15:51 /flag
drwxrwxr-x - hadoop supergroup 0 2010-02-03 17:50 /users
This method of accessing hadoop commands is actually more efficient than using the
hadoop dfs ... equivalent at the bash shell, because the latter starts up a new JVM
instance each time, whereas Hive just runs the same code in its current process.
You can see a full listing of help on the options supported by dfs using this command:
hive> dfs -help;
See also the filesystem shell documentation for your Hadoop distribution.
Comments in Hive Scripts
As of Hive v0.8.0, you can embed lines of comments that start with the string --, for example:
-- Copyright (c) 2012 Megacorp, LLC.
-- This is the best Hive script evar!!
SELECT * FROM massive_table;
The CLI does not parse these comment lines. If you paste them into the
CLI, you’ll get errors. They only work when used in scripts executed
with hive -f script_name.
Query Column Headers
As a final example that pulls together a few things we’ve learned, let’s tell the CLI to
print column headers, which is disabled by default. We can enable this feature by setting
the hiveconf property hive.cli.print.header to true:
hive> set hive.cli.print.header=true;
hive> SELECT * FROM system_logs LIMIT 3;
tstamp severity server message
1335667117.337715 ERROR server1 Hard drive hd1 is 90% full!
1335667117.338012 WARN server1 Slow response from server2.
1335667117.339234 WARN server2 Uh, Dude, I'm kinda busy right now...
If you always prefer seeing the headers, put the first line in your $HOME/.hiverc file.
Data Types and File Formats
Hive supports many of the primitive data types you find in relational databases, as well
as three collection data types that are rarely found in relational databases, for reasons
we’ll discuss shortly.
A related concern is how these types are represented in text files, as well as alternatives
to text storage that address various performance and other concerns. A unique feature
of Hive, compared to most databases, is that it provides great flexibility in how data is
encoded in files. Most databases take total control of the data, both how it is persisted
to disk and its life cycle. By letting you control all these aspects, Hive makes it easier
to manage and process data with a variety of tools.
Primitive Data Types
Hive supports several sizes of integer and floating-point types, a Boolean type, and
character strings of arbitrary length. Hive v0.8.0 added types for timestamps and binary fields.
Table 3-1 lists the primitive types supported by Hive.
Table 3-1. Primitive data types
Type        Size                               Literal syntax examples
TINYINT     1-byte signed integer.             20
SMALLINT    2-byte signed integer.             20
INT         4-byte signed integer.             20
BIGINT      8-byte signed integer.             20
BOOLEAN     Boolean true or false.             TRUE
FLOAT       Single-precision floating point.   3.14159
DOUBLE      Double-precision floating point.   3.14159
STRING      Sequence of characters. The        'Now is the time', "for all
            character set can be specified.    good men"
            Single or double quotes can
            be used.
TIMESTAMP   (v0.8.0+) Integer, float, or       1327882394 (Unix epoch seconds),
            string.                            1327882394.123456789 (Unix epoch
                                               seconds plus nanoseconds), and
                                               '2012-02-03 12:34:56.123456789'
                                               (JDBC-compliant java.sql.Timestamp
                                               format)
BINARY      (v0.8.0+) Array of bytes.          See discussion below
As for other SQL dialects, the case of these names is ignored.
It’s useful to remember that each of these types is implemented in Java, so the particular
behavior details will be exactly what you would expect from the corresponding Java
types. For example, STRING is implemented by the Java String, FLOAT is implemented
by Java float, etc.
Note that Hive does not support “character arrays” (strings) with maximum-allowed
lengths, as is common in other SQL dialects. Relational databases offer this feature as
a performance optimization; fixed-length records are easier to index, scan, etc. In the
“looser” world in which Hive lives, where it may not own the data files and has to be
flexible on file format, Hive relies on the presence of delimiters to separate fields. Also,
Hadoop and Hive emphasize optimizing disk reading and writing performance, where
fixing the lengths of column values is relatively unimportant.
Values of the new TIMESTAMP type can be integers, which are interpreted as seconds since
the Unix epoch time (Midnight, January 1, 1970), floats, which are interpreted as sec-
onds since the epoch time with nanosecond resolution (up to 9 decimal places), and
strings, which are interpreted according to the JDBC date string format convention,
YYYY-MM-DD hh:mm:ss.fffffffff.
TIMESTAMPS are interpreted as UTC times. Built-in functions for conversion to and from
timezones are provided by Hive, to_utc_timestamp and from_utc_timestamp, respec-
tively (see Chapter 13 for more details).
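For instance, here is a sketch of using these conversion functions, assuming a hypothetical table t with a TIMESTAMP column ts (the table and column names are ours, not from an example in this book):

```sql
-- Treat ts as a local time in America/New_York and convert it to UTC:
SELECT to_utc_timestamp(ts, 'America/New_York') FROM t;

-- Treat ts as UTC and render it as America/New_York local time:
SELECT from_utc_timestamp(ts, 'America/New_York') FROM t;
```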
The BINARY type is similar to the VARBINARY type found in many relational databases.
It’s not like a BLOB type, since BINARY columns are stored within the record, not sepa-
rately like BLOBs. BINARY can be used as a way of including arbitrary bytes in a record
and preventing Hive from attempting to parse them as numbers, strings, etc.
Note that you don’t need BINARY if your goal is to ignore the tail end of each record. If
a table schema specifies three columns and the data files contain five values for each
record, the last two will be ignored by Hive.
What if you run a query that wants to compare a float column to a double column or
compare a value of one integer type with a value of a different integer type? Hive will
implicitly cast any integer to the larger of the two integer types, cast FLOAT to DOUBLE,
and cast any integer value to DOUBLE, as needed, so it is comparing identical types.
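As a sketch of these promotion rules (assuming a hypothetical table t with columns i INT, b BIGINT, f FLOAT, and d DOUBLE):

```sql
SELECT * FROM t WHERE i = b;     -- i is implicitly cast to BIGINT
SELECT * FROM t WHERE f < d;     -- f is implicitly cast to DOUBLE
SELECT * FROM t WHERE i < 3.14;  -- i is implicitly cast to DOUBLE
```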
What if you run a query that wants to interpret a string column as a number? You can
explicitly cast one type to another as in the following example, where s is a string
column that holds a value representing an integer:
... cast(s AS INT) ...;
(To be clear, the AS INT are keywords, so lowercase would be fine.)
We’ll discuss data conversions in more depth in “Casting” on page 109.
Collection Data Types
Hive supports columns that are structs, maps, and arrays. Note that the literal syntax
examples in Table 3-2 are actually calls to built-in functions.
Table 3-2. Collection data types

Type    Description                                                       Literal syntax examples
STRUCT  Analogous to a C struct or an “object.” Fields can be accessed    struct('John', 'Doe')
        using the “dot” notation. For example, if a column name is of
        type STRUCT {first STRING; last STRING}, then the first
        name field can be referenced using name.first.
MAP     A collection of key-value tuples, where the fields are accessed   map('first', 'John',
        using array notation (e.g., ['key']). For example, if a column    'last', 'Doe')
        name is of type MAP with key-value pairs 'first'→'John'
        and 'last'→'Doe', then the last name can be referenced
        using name['last'].
ARRAY   Ordered sequences of the same type that are indexable using       array('John', 'Doe')
        zero-based integers. For example, if a column name is of type
        ARRAY of strings with the value ['John', 'Doe'], then the
        second element can be referenced using name[1].
As for simple types, the case of the type name is ignored.
Most relational databases don’t support such collection types, because using them
tends to break normal form. For example, in traditional data models, structs might be
captured in separate tables, with foreign key relations between the tables, as appropriate.
A practical problem with breaking normal form is the greater risk of data duplication,
leading to unnecessary disk space consumption and potential data inconsistencies, as
duplicate copies can grow out of sync as changes are made.
However, in Big Data systems, a benefit of sacrificing normal form is higher processing
throughput. Scanning data off hard disks with minimal “head seeks” is essential when
processing terabytes to petabytes of data. Embedding collections in records makes re-
trieval faster with minimal seeks. Navigating each foreign key relationship requires
seeking across the disk, with significant performance overhead.
Hive doesn’t have the concept of primary or foreign keys. However, you can index tables,
as we’ll see in Chapter 7.
Here is a table declaration that demonstrates how to use these types, an employees table
in a fictitious Human Resources application:
CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>);
The name is a simple string and for most employees, a float is large enough for the salary.
The list of subordinates is an array of string values, where we treat the name as a “primary
key,” so each element in subordinates would reference another record in the table.
Employees without subordinates would have an empty array. In a traditional model,
the relationship would go the other way, from an employee to his or her manager. We’re
not arguing that our model is better for Hive; it’s just a contrived example to illustrate
the use of arrays.
The deductions is a map that holds a key-value pair for every deduction that will be
subtracted from the employee’s salary when paychecks are produced. The key is the
name of the deduction (e.g., “Federal Taxes”), and the value would either be a percentage
value or an absolute number. In a traditional data model, there might be separate tables
for deduction type (each key in our map), where the rows contain particular deduction
values and a foreign key pointing back to the corresponding employee record.
Finally, the home address of each employee is represented as a struct, where each field
is named and has a particular type.
Note that Java syntax conventions for generics are followed for the collection types. For
example, MAP<STRING, FLOAT> means that every key in the map will be of type STRING
and every value will be of type FLOAT. For an ARRAY<STRING>, every item in the array will
be a STRING. STRUCTs can mix different types, but the locations are fixed to the declared
position in the STRUCT.
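Putting the access notations from Table 3-2 together, here is a sketch of a query against the employees table just declared (assuming the table has been populated):

```sql
SELECT name,
       subordinates[0],              -- first element of the ARRAY
       deductions['Federal Taxes'],  -- value for a MAP key
       address.city                  -- field of the STRUCT
FROM employees;
```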
Text File Encoding of Data Values
Let’s begin our exploration of file formats by looking at the simplest example, text files.
You are no doubt familiar with text files delimited with commas or tabs, the so-called
comma-separated values (CSVs) or tab-separated values (TSVs), respectively. Hive can
use those formats if you want, and we’ll show you how shortly. However, there is a
drawback to both formats: you have to be careful about commas or tabs embedded in
text that are not intended as field or column delimiters. For this reason, Hive uses
various control characters by default, which are less likely to appear in value strings.
Hive uses the term field when overriding the default delimiters, as we’ll see shortly.
They are listed in Table 3-3.
Table 3-3. Hive’s default record and field delimiters

Delimiter         Description
\n                For text files, each line is a record, so the line feed character separates records.
^A (“control” A)  Separates all fields (columns). Written using the octal code \001 when explicitly
                  specified in CREATE TABLE statements.
^B                Separates the elements in an ARRAY or STRUCT, or the key-value pairs in a MAP.
                  Written using the octal code \002 when explicitly specified in CREATE TABLE
                  statements.
^C                Separates the key from the corresponding value in MAP key-value pairs. Written
                  using the octal code \003 when explicitly specified in CREATE TABLE statements.
Records for the employees table declared in the previous section would look like the
following example, where we use ^A, etc., to represent the field delimiters. A text editor
like Emacs will show the delimiters this way. Note that the lines have been wrapped in
the example because they are too long for the printed page. To clearly indicate the
division between records, we have added blank lines between them that would not
appear in the file:
John Doe^A100000.0^AMary Smith^BTodd Jones^AFederal Taxes^C.2^BState
Taxes^C.05^BInsurance^C.1^A1 Michigan Ave.^BChicago^BIL^B60600
Mary Smith^A80000.0^ABill King^AFederal Taxes^C.2^BState Taxes^C.
05^BInsurance^C.1^A100 Ontario St.^BChicago^BIL^B60601
Todd Jones^A70000.0^AFederal Taxes^C.15^BState Taxes^C.03^BInsurance^C.
1^A200 Chicago Ave.^BOak Park^BIL^B60700
Bill King^A60000.0^AFederal Taxes^C.15^BState Taxes^C.03^BInsurance^C.
1^A300 Obscure Dr.^BObscuria^BIL^B60100
This is a little hard to read, but you would normally let Hive do that for you, of course.
Let’s walk through the first line to understand the structure. First, here is what it would
look like in JavaScript Object Notation (JSON), where we have also inserted the names
from the table schema:
{
  "name": "John Doe",
  "salary": 100000.0,
  "subordinates": ["Mary Smith", "Todd Jones"],
  "deductions": {
    "Federal Taxes": .2,
    "State Taxes": .05,
    "Insurance": .1
  },
  "address": {
    "street": "1 Michigan Ave.",
    "city": "Chicago",
    "state": "IL",
    "zip": 60600
  }
}
You’ll note that maps and structs are effectively the same thing in JSON.
Now, here’s how the first line of the text file breaks down:
John Doe is the name.
100000.0 is the salary.
Mary Smith^BTodd Jones are the subordinates “Mary Smith” and “Todd Jones.”
Federal Taxes^C.2^BState Taxes^C.05^BInsurance^C.1 are the deductions, where
20% is deducted for “Federal Taxes,” 5% is deducted for “State Taxes,” and 10%
is deducted for “Insurance.”
1 Michigan Ave.^BChicago^BIL^B60600 is the address, “1 Michigan Ave., Chicago,
IL 60600.”
You can override these default delimiters. This might be necessary if another applica-
tion writes the data using a different convention. Here is the same table declaration
again, this time with all the format defaults explicitly specified:
CREATE TABLE employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
The ROW FORMAT DELIMITED sequence of keywords must appear before any of the other
clauses, with the exception of the STORED AS … clause.
The character \001 is the octal code for ^A. The clause ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\001' means that Hive will use the ^A character to separate fields.
Similarly, the character \002 is the octal code for ^B. The clause ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '\002' means that Hive will use the ^B character to
separate collection items.
Finally, the character \003 is the octal code for ^C. The clause ROW FORMAT DELIMITED
MAP KEYS TERMINATED BY '\003' means that Hive will use the ^C character to separate
map keys from values.
The clauses LINES TERMINATED BY '…' and STORED AS … do not require the ROW FORMAT
DELIMITED keywords.
Actually, it turns out that Hive does not currently support any character for LINES
TERMINATED BY … other than '\n'. So this clause has limited utility today.
You can override the field, collection, and key-value separators and still use the default
text file format, so the clause STORED AS TEXTFILE is rarely used. For most of this book,
we will use the default TEXTFILE file format.
There are other file format options, but we’ll defer discussing them until Chapter 15.
A related issue is compression of files, which we’ll discuss in Chapter 11.
So, while you can specify all these clauses explicitly, most of the time you use the
default separators and only provide the clauses you are explicitly overriding.
These specifications only affect what Hive expects to see when it reads
files. Except in a few limited cases, it’s up to you to write the data files
in the correct format.
For example, here is a table definition where the data will contain comma-delimited
fields:
CREATE TABLE some_data (
  first FLOAT,
  second FLOAT,
  third FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Use '\t' for tab-delimited fields.
This example does not properly handle the general case of files in CSV
(comma-separated values) and TSV (tab-separated values) formats. They
can include a header row with column names and column string values
might be quoted and they might contain embedded commas or tabs,
respectively. See Chapter 15 for details on handling these file types more gracefully.
This powerful customization feature makes it much easier to use Hive with files created
by other tools and various ETL (extract, transform, and load) processes.
Schema on Read
When you write data to a traditional database, either through loading external data,
writing the output of a query, doing UPDATE statements, etc., the database has total
control over the storage. The database is the “gatekeeper.” An important implication
of this control is that the database can enforce the schema as data is written. This is
called schema on write.
Hive has no such control over the underlying storage. There are many ways to create,
modify, and even damage the data that Hive will query. Therefore, Hive can only en-
force queries on read. This is called schema on read.
So what if the schema doesn’t match the file contents? Hive does the best that it can to
read the data. You will get lots of null values if there aren’t enough fields in each record
to match the schema. If some fields are numbers and Hive encounters nonnumeric
strings, it will return nulls for those fields. Above all else, Hive tries to recover from all
errors as best it can.
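To make this concrete, here is a sketch using the comma-delimited some_data table from earlier in this chapter, with a hypothetical malformed data file (not one of the book's example files):

```sql
-- Suppose the file backing some_data contains these lines:
--   1.0,2.0,3.0
--   4.0,oops
-- Then this query returns (1.0, 2.0, 3.0) for the first row and
-- (4.0, NULL, NULL) for the second: "oops" is not a valid FLOAT,
-- and the third field is missing entirely.
SELECT first, second, third FROM some_data;
```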
HiveQL: Data Definition
HiveQL is the Hive query language. Like all SQL dialects in widespread use, it doesn’t
fully conform to any particular revision of the ANSI SQL standard. It is perhaps closest
to MySQL’s dialect, but with significant differences. Hive offers no support for row-
level inserts, updates, and deletes. Hive doesn’t support transactions. Hive adds ex-
tensions to provide better performance in the context of Hadoop and to integrate with
custom extensions and even external programs.
Still, much of HiveQL will be familiar. This chapter and the ones that follow discuss
the features of HiveQL using representative examples. In some cases, we will briefly
mention details for completeness, then explore them more fully in later chapters.
This chapter starts with the so-called data definition language parts of HiveQL, which
are used for creating, altering, and dropping databases, tables, views, functions, and
indexes. We’ll discuss databases and tables in this chapter, deferring the discussion of
views until Chapter 7, indexes until Chapter 8, and functions until Chapter 13.
We’ll also discuss the SHOW and DESCRIBE commands for listing and describing items as
we go.
Subsequent chapters explore the data manipulation language parts of HiveQL that are
used to put data into Hive tables and to extract data to the filesystem, and how to
explore and manipulate data with queries, grouping, filtering, joining, etc.
Databases in Hive
The Hive concept of a database is essentially just a catalog or namespace of tables.
However, databases are very useful for larger clusters with multiple teams and users, as a
way of avoiding table name collisions. It’s also common to use databases to organize
production tables into logical groups.
If you don’t specify a database, the default database is used.
The simplest syntax for creating a database is shown in the following example:
hive> CREATE DATABASE financials;
Hive will throw an error if financials already exists. You can suppress these warnings
with this variation:
hive> CREATE DATABASE IF NOT EXISTS financials;
While normally you might like to be warned if a database of the same name already
exists, the IF NOT EXISTS clause is useful for scripts that should create a database on-
the-fly, if necessary, before proceeding.
You can also use the keyword SCHEMA instead of DATABASE in all the database-related
commands.
At any time, you can see the databases that already exist as follows:
hive> SHOW DATABASES;
default
financials

hive> CREATE DATABASE human_resources;
hive> SHOW DATABASES;
default
financials
human_resources
If you have a lot of databases, you can restrict the ones listed using a regular expres-
sion, a concept we’ll explain in “LIKE and RLIKE” on page 96, if it is new to you. The
following example lists only those databases that start with the letter h and end with
any other characters (the .* part):
hive> SHOW DATABASES LIKE 'h.*';
human_resources
hive> ...
Hive will create a directory for each database. Tables in that database will be stored in
subdirectories of the database directory. The exception is tables in the default database,
which doesn’t have its own directory.
The database directory is created under a top-level directory specified by the property
hive.metastore.warehouse.dir, which we discussed in “Local Mode Configura-
tion” on page 24 and “Distributed and Pseudodistributed Mode Configura-
tion” on page 26. Assuming you are using the default value for this property, /user/hive/
warehouse, when the financials database is created, Hive will create the directory /user/
hive/warehouse/financials.db. Note the .db extension.
You can override this default location for the new directory as shown in this example:
hive> CREATE DATABASE financials
> LOCATION '/my/preferred/directory';
You can add a descriptive comment to the database, which will be shown by the
DESCRIBE DATABASE <database> command.
hive> CREATE DATABASE financials
> COMMENT 'Holds all financial tables';
hive> DESCRIBE DATABASE financials;
financials Holds all financial tables
Note that DESCRIBE DATABASE also shows the directory location for the database. In this
example, the URI scheme is hdfs. For a MapR installation, it would be maprfs. For an
Amazon Elastic MapReduce (EMR) cluster, it would also be hdfs, but you could set
hive.metastore.warehouse.dir to use Amazon S3 explicitly (i.e., by specifying s3n://
bucketname/… as the property value). You could use s3 as the scheme, but the newer
s3n is preferred.
In the output of DESCRIBE DATABASE, we’re showing master-server to indicate the URI
authority, in this case a DNS name and optional port number (i.e., server:port) for the
“master node” of the filesystem (i.e., where the NameNode service is running for
HDFS). If you are running in pseudo-distributed mode, then the master server will be
localhost. For local mode, the path will be a local path, file:///user/hive/warehouse/
financials.db.
If the authority is omitted, Hive uses the master-server name and port defined by
the property fs.default.name in the Hadoop configuration files, found in the
$HADOOP_HOME/conf directory.
To be clear, hdfs:///user/hive/warehouse/financials.db is equivalent to hdfs://master-
server/user/hive/warehouse/financials.db, where master-server is your master node’s
DNS name and optional port.
For completeness, when you specify a relative path (e.g., some/relative/path), Hive will
put this under your home directory in the distributed filesystem (e.g., hdfs:///user/<user-
name>) for HDFS. However, if you are running in local mode, your current working
directory is used as the parent of some/relative/path.
For script portability, it’s typical to omit the authority, only specifying it when referring
to another distributed filesystem instance (including S3 buckets).
Lastly, you can associate key-value properties with the database, although their only
function currently is to provide a way of adding information to the output of DESCRIBE
DATABASE EXTENDED <database>:
hive> CREATE DATABASE financials
    > WITH DBPROPERTIES ('creator' = 'Mark Moneybags', 'date' = '2012-01-02');
hive> DESCRIBE DATABASE financials;
financials   hdfs://master-server/user/hive/warehouse/financials.db

hive> DESCRIBE DATABASE EXTENDED financials;
financials   hdfs://master-server/user/hive/warehouse/financials.db
 {date=2012-01-02, creator=Mark Moneybags}
The USE command sets a database as your working database, analogous to changing
working directories in a filesystem:
hive> USE financials;
Now, commands such as SHOW TABLES; will list the tables in this database.
Unfortunately, there is no command to show you which database is your current
working database! Fortunately, it’s always safe to repeat the USE … command; there is
no concept in Hive of nesting of databases.
Recall that we pointed out a useful trick in “Variables and Properties” on page 31 for
setting a property to print the current database as part of the prompt (Hive v0.8.0 and
later):
hive> set hive.cli.print.current.db=true;
hive (financials)> USE default;
hive (default)> set hive.cli.print.current.db=false;
hive> ...
Finally, you can drop a database:
hive> DROP DATABASE IF EXISTS financials;
The IF EXISTS is optional and suppresses warnings if financials doesn’t exist.
By default, Hive won’t permit you to drop a database if it contains tables. You can either
drop the tables first or append the CASCADE keyword to the command, which will cause
Hive to drop the tables in the database first:
hive> DROP DATABASE IF EXISTS financials CASCADE;
Using the RESTRICT keyword instead of CASCADE is equivalent to the default behavior,
where existing tables must be dropped before dropping the database.
When a database is dropped, its directory is also deleted.
Alter Database
You can set key-value pairs in the DBPROPERTIES associated with a database using the
ALTER DATABASE command. No other metadata about the database can be changed,
including its name and directory location:
hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by' = 'Joe Dba');
There is no way to delete or “unset” a DBPROPERTY.
Creating Tables
The CREATE TABLE statement follows SQL conventions, but Hive’s version offers sig-
nificant extensions to support a wide range of flexibility where the data files for tables
are stored, the formats used, etc. We discussed many of these options in “Text File
Encoding of Data Values” on page 45 and we’ll return to more advanced options later
in Chapter 15. In this section, we describe the other options available for the CREATE
TABLE statement, adapting the employees table declaration we used previously in
“Collection Data Types” on page 43:
CREATE TABLE IF NOT EXISTS mydb.employees (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT>
COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
COMMENT 'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';
First, note that you can prefix a database name, mydb in this case, if you’re not currently
working in the target database.
If you add the option IF NOT EXISTS, Hive will silently ignore the statement if the table
already exists. This is useful in scripts that should create a table the first time they run.
However, the clause has a gotcha you should know. If the schema specified differs from
the schema in the table that already exists, Hive won’t warn you. If your intention is
for this table to have the new schema, you’ll have to drop the old table, losing your
data, and then re-create it. Consider if you should use one or more ALTER TABLE
statements to change the existing table schema instead. See “Alter Table” on page 66
for details.
If you use IF NOT EXISTS and the existing table has a different schema
than the schema in the CREATE TABLE statement, Hive will ignore the
discrepancy.
You can add a comment to any column, after the type. Like databases, you can attach
a comment to the table itself and you can define one or more table properties. In most
cases, the primary benefit of TBLPROPERTIES is to add additional documentation in a
key-value format. However, when we examine Hive’s integration with databases such
as DynamoDB (see “DynamoDB” on page 225), we’ll see that the TBLPROPERTIES can
be used to express essential metadata about the database connection.
Hive automatically adds two table properties: last_modified_by holds the username of
the last user to modify the table, and last_modified_time holds the epoch time in sec-
onds of that modification.
A planned enhancement for Hive v0.10.0 is to add a SHOW TBLPROPERTIES
table_name command that will list just the TBLPROPERTIES for a table.
Finally, you can optionally specify a location for the table data (as opposed to meta-
data, which the metastore will always hold). In this example, we are showing the default
location that Hive would use, /user/hive/warehouse/mydb.db/employees, where /user/
hive/warehouse is the default “warehouse” location (as discussed previously),
mydb.db is the database directory, and employees is the table directory.
By default, Hive always creates the table’s directory under the directory for the enclos-
ing database. The exception is the default database. It doesn’t have a directory un-
der /user/hive/warehouse, so a table in the default database will have its directory created
directly in /user/hive/warehouse (unless explicitly overridden).
To avoid potential confusion, it’s usually better to use an external table
if you don’t want to use the default location. See “External
Tables” on page 56 for details.
You can also copy the schema (but not the data) of an existing table:
CREATE TABLE IF NOT EXISTS mydb.employees2
LIKE mydb.employees;
This version also accepts the optional LOCATION clause, but note that no other properties,
including the schema, can be defined; they are determined from the original table.
The SHOW TABLES command lists the tables. With no additional arguments, it shows the
tables in the current working database. Let’s assume we have already created a few
other tables, table1 and table2, and we did so in the mydb database:
hive> USE mydb;
hive> SHOW TABLES;
employees
table1
table2
If we aren’t in the same database, we can still list the tables in that database:
hive> USE default;
hive> SHOW TABLES IN mydb;
employees
table1
table2
If we have a lot of tables, we can limit the ones listed using a regular expression, a
concept we’ll discuss in detail in “LIKE and RLIKE” on page 96:
hive> USE mydb;
hive> SHOW TABLES 'empl.*';
employees
Not all regular expression features are supported. If you know regular expressions, it’s
better to test a candidate regular expression to make sure it actually works!
The regular expression in the single quote looks for all tables with names starting with
empl and ending with any other characters (the .* part).
Using the IN database_name clause and a regular expression for the table
names together is not supported.
We can also use the DESCRIBE EXTENDED mydb.employees command to show details about
the table. (We can drop the mydb. prefix if we’re currently using the mydb database.) We
have reformatted the output for easier reading and we have suppressed many details
to focus on the items that interest us now:
hive> DESCRIBE EXTENDED mydb.employees;
name string Employee name
salary float Employee salary
subordinates array<string> Names of subordinates
deductions map<string,float> Keys are deductions names, values are percentages
address struct<street:string,city:string,state:string,zip:int> Home address
Detailed Table Information Table(tableName:employees, dbName:mydb, owner:me,
parameters:{creator=me, created_at='2012-01-02 10:00:00',
last_modified_user=me, last_modified_time=1337544510,
comment:Description of the table, ...}, ...)
Replacing EXTENDED with FORMATTED provides more readable but also more verbose
output.
The first section shows the output of DESCRIBE without EXTENDED or FORMATTED (i.e., the
schema including the comments for each column).
If you only want to see the schema for a particular column, append the column to the
table name. Here, EXTENDED adds no additional output:
hive> DESCRIBE mydb.employees.salary;
salary float Employee salary
Returning to the extended output, note the line in the description that starts with
location:. It shows the full URI path in HDFS to the directory where Hive will keep
all the data for this table, as we discussed above.
We said that the last_modified_by and last_modified_time table prop-
erties are automatically created. However, they are only shown in the
Detailed Table Information if a user-specified table property has also
been defined!
Managed Tables
The tables we have created so far are called managed tables or sometimes internal
tables, because Hive controls the lifecycle of their data (more or less). As we’ve seen,
Hive stores the data for these tables in a subdirectory under the directory defined by
hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), by default.
When we drop a managed table (see “Dropping Tables” on page 66), Hive deletes
the data in the table.
However, managed tables are less convenient for sharing with other tools. For example,
suppose we have data that is created and used primarily by Pig or other tools, but we
want to run some queries against it, but not give Hive ownership of the data. We can
define an external table that points to that data, but doesn’t take ownership of it.
External Tables
Suppose we are analyzing data from the stock markets. Periodically, we ingest the data
for NASDAQ and the NYSE from a source like Infochimps (http://infochimps.com/datasets)
and we want to study this data with many tools. (See the data sets named
infochimps_dataset_4777_download_16185 and infochimps_dataset_4778_download_
16677, respectively, which are actually sourced from Yahoo! Finance.) The schema we’ll
use next matches the schemas of both these data sources. Let’s assume the data files
are in the distributed filesystem directory /data/stocks.
The following table declaration creates an external table that can read all the data files
for this comma-delimited data in /data/stocks:
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
  exchange STRING,
  symbol STRING,
  ymd STRING,
  price_open FLOAT,
  price_high FLOAT,
  price_low FLOAT,
  price_close FLOAT,
  volume INT,
  price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';
The EXTERNAL keyword tells Hive this table is external and the LOCATION … clause is
required to tell Hive where it’s located.
Because it’s external, Hive does not assume it owns the data. Therefore, dropping the
table does not delete the data, although the metadata for the table will be deleted.
There are a few other small differences between managed and external tables, where
some HiveQL constructs are not permitted for external tables. We’ll discuss those when
we come to them.
However, it’s important to note that the differences between managed and external
tables are smaller than they appear at first. Even for managed tables, you know where
they are located, so you can use other tools, hadoop dfs commands, etc., to modify and
even delete the files in the directories for managed tables. Hive may technically own
these directories and files, but it doesn’t have full control over them! Recall, in “Schema
on Read” on page 48, we said that Hive really has no control over the integrity of the
files used for storage and whether or not their contents are consistent with the table
schema. Even managed tables don’t give us this control.
Still, a general principle of good software design is to express intent. If the data is shared
between tools, then creating an external table makes this ownership explicit.
You can tell whether or not a table is managed or external using the output of DESCRIBE
EXTENDED tablename. Near the end of the Detailed Table Information output, you will
see the following for managed tables:
... tableType:MANAGED_TABLE)
For external tables, you will see the following:
... tableType:EXTERNAL_TABLE)
As for managed tables, you can also copy the schema (but not the data) of an existing
table:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.employees3
LIKE mydb.employees
LOCATION '/path/to/data';
If you omit the EXTERNAL keyword and the original table is external, the
new table will also be external. If you omit EXTERNAL and the original
table is managed, the new table will also be managed. However, if you
include the EXTERNAL keyword and the original table is managed, the new
table will be external. Even in this scenario, the LOCATION clause will
still be optional.
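For example, the two managed-original cases in the note might look like this (the copy names are hypothetical; each statement copies only the schema):

```sql
-- mydb.employees is managed, and EXTERNAL is omitted,
-- so the copy is also managed:
CREATE TABLE IF NOT EXISTS mydb.employees_copy
LIKE mydb.employees;

-- Adding EXTERNAL makes the copy external even though the
-- original is managed; LOCATION remains optional here:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.employees_ext
LIKE mydb.employees;
```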
Partitioned, Managed Tables
The general notion of partitioning data is an old one. It can take many forms, but often
it’s used for distributing load horizontally, moving data physically closer to its most
frequent users, and other purposes.
Hive has the notion of partitioned tables. We’ll see that they have important
performance benefits, and they can help organize data in a logical fashion, such as
hierarchically.
We’ll discuss partitioned managed tables first. Let’s return to our employees table and
imagine that we work for a very large multinational corporation. Our HR people often
run queries with WHERE clauses that restrict the results to a particular country or to a
particular first-level subdivision (e.g., state in the United States or province in Canada).
(First-level subdivision is an actual term, used here, for example: http://www.common….) We’ll just use the word state for simplicity. We have
redundant state information in the address field. It is distinct from the state partition.
We could remove the state element from address. There is no ambiguity in queries,
since we have to use address.state to project the value inside the address. So, let’s
partition the data first by country and then by state:
CREATE TABLE employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);
Partitioning tables changes how Hive structures the data storage. If we create this table
in the mydb database, there will still be an employees directory for the table:
hdfs://master_server/user/hive/warehouse/mydb.db/employees
However, Hive will now create subdirectories reflecting the partitioning structure. For
example:
...
.../employees/country=CA/state=AB
.../employees/country=CA/state=BC
...
.../employees/country=US/state=AK
.../employees/country=US/state=AL
...
Yes, those are the actual directory names. The state directories will contain zero or more
files for the employees in those states.
Once created, the partition keys (country and state, in this case) behave like regular
columns. There is one known exception, due to a bug (see “Aggregate func-
tions” on page 85). In fact, users of the table don’t need to care if these “columns”
are partitions or not, except when they want to optimize query performance.
For example, the following query selects all employees in the state of Illinois in the
United States:
SELECT * FROM employees
WHERE country = 'US' AND state = 'IL';
Note that because the country and state values are encoded in directory names, there
is no reason to have this data in the data files themselves. In fact, the data just gets in
the way in the files, since you have to account for it in the table schema, and this data
wastes space.
Perhaps the most important reason to partition data is for faster queries. In the previous
query, which limits the results to employees in Illinois, it is only necessary to scan the
contents of one directory. Even if we have thousands of country and state directories,
all but one can be ignored. For very large data sets, partitioning can dramatically im-
prove query performance, but only if the partitioning scheme reflects common range
filtering (e.g., by locations, timestamp ranges).
When we add predicates to WHERE clauses that filter on partition values, these predicates
are called partition filters.
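For example, both predicates below are partition filters; this is a sketch against the partitioned employees table above:

```sql
-- Both country and state are partition keys, so Hive prunes to the
-- matching country=.../state=... directories before reading any files.
SELECT * FROM employees
WHERE country = 'US' AND (state = 'IL' OR state = 'IN');
```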
Even if you do a query across the entire US, Hive only reads the 65 directories covering
the 50 states, 9 territories, and the District of Columbia, and 6 military “states” used
by the armed services. You can see the full list of these codes online.
Of course, if you need to do a query for all employees around the globe, you can still
do it. Hive will have to read every directory, but hopefully these broader disk scans will
be relatively rare.
However, a query across all partitions could trigger an enormous MapReduce job if the
table data and number of partitions are large. A highly suggested safety measure is
putting Hive into “strict” mode, which prohibits queries of partitioned tables without
a WHERE clause that filters on partitions. You can set the mode to “nonstrict,” as in the
following session:
hive> set hive.mapred.mode=strict;
hive> SELECT e.name, e.salary FROM employees e LIMIT 100;
FAILED: Error in semantic analysis: No partition predicate found for
Alias "e" Table "employees"
hive> set hive.mapred.mode=nonstrict;
hive> SELECT e.name, e.salary FROM employees e LIMIT 100;
John Doe 100000.0
You can see the partitions that exist with the SHOW PARTITIONS command:
hive> SHOW PARTITIONS employees;
...
country=CA/state=AB
country=CA/state=BC
...
country=US/state=AK
country=US/state=AL
...
If you have a lot of partitions and you want to see if partitions have been defined for
particular partition keys, you can further restrict the command with an optional
PARTITION clause that specifies one or more of the partitions with specific values:
hive> SHOW PARTITIONS employees PARTITION(country='US');
hive> SHOW PARTITIONS employees PARTITION(country='US', state='AK');
The DESCRIBE EXTENDED employees command shows the partition keys:
hive> DESCRIBE EXTENDED employees;
name string,
salary float,
subordinates array<string>,
deductions map<string,float>,
address struct<...>,
country string,
state string
Detailed Table Information...
partitionKeys:[FieldSchema(name:country, type:string, comment:null),
FieldSchema(name:state, type:string, comment:null)],
The schema part of the output lists the country and state with the other columns,
because they are columns as far as queries are concerned. The Detailed Table Infor
mation includes the country and state as partition keys. The comments for both of these
keys are null; we could have added comments just as for regular columns.
You create partitions in managed tables by loading data into them. The following ex-
ample creates a US and CA (California) partition while loading data into it from a local
directory, $HOME/california-employees. You must specify a value for each partition
column. Notice how we reference the HOME environment variable in HiveQL:
LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
INTO TABLE employees
PARTITION (country = 'US', state = 'CA');
The directory for this partition, …/employees/country=US/state=CA, will be created by
Hive and all data files in $HOME/california-employees will be copied into it. See
“Loading Data into Managed Tables” on page 71 for more information on populating
managed tables.
External Partitioned Tables
You can use partitioning with external tables. In fact, you may find that this is your
most common scenario for managing large production data sets. The combination
gives you a way to “share” data with other tools, while still optimizing query
performance.
You also have more flexibility in the directory structure used, as you define it yourself.
We’ll see a particularly useful example in a moment.
Let’s consider a new example that fits this scenario well: logfile analysis. Most organ-
izations use a standard format for log messages, recording a timestamp, severity (e.g.,
ERROR, WARNING, INFO), perhaps a server name and process ID, and then an arbitrary text
message. Suppose our Extract, Transform, and Load (ETL) process ingests and aggre-
gates logfiles in our environment, converting each log message to a tab-delimited record
and also decomposing the timestamp into separate year, month, and day fields, and a
combined hms field for the remaining hour, minute, and second parts of the timestamp,
for reasons that will become clear in a moment. You could do this parsing of log mes-
sages using the string parsing functions built into Hive or Pig, for example. Alterna-
tively, we could use smaller integer types for some of the timestamp-related fields to
conserve space. Here, we are ignoring subsecond resolution.
Here’s how we might define the corresponding Hive table:
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages (
  hms INT,
  severity STRING,
  server STRING,
  process_id INT,
  message STRING)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
We’re assuming that a day’s worth of log data is about the correct size for a useful
partition and finer grain queries over a day’s data will be fast enough.
Recall that when we created the nonpartitioned external stocks table, a LOCATION …
clause was required. It isn’t used for external partitioned tables. Instead, an ALTER
TABLE statement is used to add each partition separately. It must specify a value for each
partition key, the year, month, and day, in this case (see “Alter Table” on page 66 for
more details on this feature). Here is an example, where we add a partition for January
2nd, 2012:
ALTER TABLE log_messages ADD PARTITION(year = 2012, month = 1, day = 2)
LOCATION 'hdfs://master_server/data/log_messages/2012/01/02';
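ALTER TABLE … ADD PARTITION also accepts several partition specifications in one statement, and IF NOT EXISTS makes the command safe to re-run. A sketch continuing the example (the paths are illustrative):

```sql
-- Add two daily partitions at once; nothing happens for
-- partitions that already exist.
ALTER TABLE log_messages ADD IF NOT EXISTS
PARTITION (year = 2012, month = 1, day = 3)
  LOCATION 'hdfs://master_server/data/log_messages/2012/01/03'
PARTITION (year = 2012, month = 1, day = 4)
  LOCATION 'hdfs://master_server/data/log_messages/2012/01/04';
```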
The directory convention we use is completely up to us. Here, we follow a hierarchical
directory structure, because it’s a logical way to organize our data, but there is no
requirement to do so. We could even follow Hive’s directory naming convention (e.g., …/
exchange=NASDAQ/symbol=AAPL), but again, it isn’t required.
An interesting benefit of this flexibility is that we can archive old data on inexpensive
storage, like Amazon’s S3, while keeping newer, more “interesting” data in HDFS. For
example, each day we might use the following procedure to move data older than a
month to S3:
1. Copy the data for the partition being moved to S3. For example, you can use the
hadoop distcp command:
hadoop distcp /data/log_messages/2011/12/02 s3n://ourbucket/logs/2011/12/02
2. Alter the table to point the partition to the S3 location:
ALTER TABLE log_messages PARTITION(year = 2011, month = 12, day = 2)
SET LOCATION 's3n://ourbucket/logs/2011/12/02';
3. Remove the HDFS copy of the partition using the hadoop fs -rmr command:
hadoop fs -rmr /data/log_messages/2011/12/02
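The three steps above can be sketched as a small script. The paths and bucket name follow the example; they, and the hive/hadoop invocations, would need adjusting for a real cluster. By default the script only prints each command:

```shell
#!/bin/sh
# Sketch of the daily archive procedure. RUN=echo (the default)
# just prints the commands; set RUN= to execute them for real.
YEAR=2011
MONTH=12
DAY=02
HDFS_DIR="/data/log_messages/$YEAR/$MONTH/$DAY"
S3_DIR="s3n://ourbucket/logs/$YEAR/$MONTH/$DAY"
RUN=${RUN:-echo}

# 1. Copy the partition's data to S3.
$RUN hadoop distcp "$HDFS_DIR" "$S3_DIR"

# 2. Point the partition at its new S3 location.
$RUN hive -e "ALTER TABLE log_messages PARTITION(year=$YEAR, month=$MONTH, day=$DAY) SET LOCATION '$S3_DIR';"

# 3. Remove the now-redundant HDFS copy.
$RUN hadoop fs -rmr "$HDFS_DIR"
```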
You don’t have to be an Amazon Elastic MapReduce user to use S3 this way. S3 support
is part of the Apache Hadoop distribution. You can still query this data, even queries
that cross the month-old “boundary,” where some data is read from HDFS and some
data is read from S3!
By the way, Hive doesn’t care if a partition directory doesn’t exist for a partition or if
it has no files. In both cases, you’ll just get no results for a query that filters for the
partition. This is convenient when you want to set up partitions before a separate pro-
cess starts writing data to them. As soon as data is there, queries will return results from
that data.
This feature illustrates another benefit: new data can be written to a dedicated directory
with a clear distinction from older data in other directories. Also, whether you move
old data to an “archive” location or delete it outright, the risk of tampering with newer
data is reduced since the data subsets are in separate directories.
As for nonpartitioned external tables, Hive does not own the data and it does not delete
the data if the table is dropped.
As for managed partitioned tables, you can see an external table’s partitions with SHOW
PARTITIONS:
hive> SHOW PARTITIONS log_messages;
Similarly, the DESCRIBE EXTENDED log_messages command shows the partition keys both as part
of the schema and in the list of partitionKeys:
hive> DESCRIBE EXTENDED log_messages;
hms int,
severity string,
server string,
process_id int,
message string,
year int,
month int,
day int
Detailed Table Information...
partitionKeys:[FieldSchema(name:year, type:int, comment:null),
FieldSchema(name:month, type:int, comment:null),
FieldSchema(name:day, type:int, comment:null)],
This output is missing a useful bit of information, the actual location of the partition
data. There is a location field, but it only shows Hive’s default directory that would be
used if the table were a managed table. However, we can get a partition’s location as
follows:
hive> DESCRIBE EXTENDED log_messages PARTITION (year=2012, month=1, day=2);
We frequently use external partitioned tables because of the many benefits they pro-
vide, such as logical data management, performant queries, etc.
ALTER TABLE … ADD PARTITION is not limited to external tables. You can use it with
managed tables, too, when you have (or will have) data for partitions in directories
created outside of the LOAD and INSERT options we discussed above. You’ll need to
remember that not all of the table’s data will be under the usual Hive “warehouse”
directory, and this data won’t be deleted when you drop the managed table! Hence,
from a “sanity” perspective, it’s questionable whether you should dare to use this fea-
ture with managed tables.
Customizing Table Storage Formats
In “Text File Encoding of Data Values” on page 45, we discussed that Hive defaults to
a text file format, which is indicated by the optional clause STORED AS TEXTFILE, and
you can overload the default values for the various delimiters when creating the table.
Here we repeat the definition of the employees table we used in that discussion:
CREATE TABLE employees (
name STRING,
salary FLOAT,