Big Data Guide 2014

DZONE RESEARCH PRESENTS

2014 GUIDE TO BIG DATA

BROUGHT TO YOU IN PARTNERSHIP WITH

dzone.com/research/bigdata

Welcome
Dear Reader,
Welcome to our fifth DZone Research Guide and welcome
to The Age of Big Data. It’s fitting that this guide follows
our Internet of Things guide, as by all accounts, IoT is
driving the creation of more data than ever before. In
the blink of an eye we can fill more storage, in a smaller
form factor, than the largest hard drives of 15 years
ago. Through the never-ending march of technology
and the broad availability of the cloud, available storage
and computing power are now effectively limitless. The
combination of these technologies gives developers and
analysts such as ourselves a wealth of new possibilities
to draw conclusions from our data and make better
business decisions.
Just as our fourth guide focused on the platforms and
devices that are driving this amazing creation of new
data, this guide focuses on the tools that developers
and architects will use to gather
and analyze data more effectively. We’ve covered a
wide spectrum of tools: from NoSQL databases like
MongoDB and data processing platforms like Hadoop,
to business intelligence (BI) tools like Actuate BIRT,
down to traditional relational databases like Oracle,
MySQL, and PostgreSQL.
Gathering the data is easy; it's what you do with it after
the fact that makes it interesting.
As you’ll find while you read through our findings from
nearly 1,000 developers, architects, and executives,
Big Data is no longer a passing fad or something that
people are just beginning to explore. Nearly 89% of all
respondents told us that they are either exploring a Big
Data implementation or have already rolled out at least
one project. This is amazing growth for an industry that
barely existed even 5 years ago. So, welcome to the DZone
Big Data Guide, and we hope you enjoy the data and the
resources that we’ve collected.

TABLE OF CONTENTS

Summary & Key Takeaways ... 3
Key Research Findings ... 4
The No Fluff Introduction to Big Data, by Benjamin Ball ... 6
The Evolution of MapReduce and Hadoop, by Srinath Perera & Adam Diaz ... 10
The Developer's Guide to Data Science, by Sander Mak ... 14
The DIY Big Data Cluster, by Chanwit Kaewkasi ... 20
Finding the Database for Your Use Case ... 22
Big Data Solutions Directory ... 25
Glossary ... 35

CREDITS

DZone Research
Jayashree Gopalakrishnan, Director of Research
Mitch Pronschinske, Senior Research Analyst
Benjamin Ball, Research Analyst, Author
Matt Werner, Market Researcher
John Esposito, Refcardz Editor/Coordinator
Alec Noller, Senior Content Curator
research@dzone.com

DZone Marketing and Sales
Kellet Atkinson, Director of Marketing
Ashley Slate, Director of Design
Chris Smith, Production Advisor
Alex Crafts, Senior Account Manager
Mike Lento, Senior Account Manager

Special thanks to our topic experts
David Rosenthal, Oren Eini, Otis
Gospodnetic, Daniel Bryant,
Marvin Froeder, Isaac Sacolick,
Arnon Rotem-Gal-Oz, Kate Borger,
and our trusted DZone Most
Valuable Bloggers for all their help
and feedback in making this report
a great success.

DZone Corporate
Rick Ross, CEO
Matt Schmidt, CTO, President
Brandon Nokes, VP of Operations
Hernâni Cerqueira, Lead Software Engineer


Summary & KEY TAKEAWAYS
Big Data, NoSQL, and NewSQL—these are the high-level concepts relating to the new, unprecedented data management and analysis challenges
that enterprises and startups are now facing. Some estimates expect the amount of digital data in the world to double every two years [1], while
other estimates suggest that 90% of the world’s current data was created in the last two years [2]. The predictions for data growth are staggering
no matter where you look, but what does that mean practically for you, the developer, the sysadmin, the product manager, or C-level leader?
DZone’s 2014 Guide to Big Data is the definitive resource for learning how industry experts are handling the massive growth and diversity of data. It
contains resources that will help you navigate and excel in the world of Big Data management. These resources include:
• Side-by-side feature comparison of the best analytics tools, databases, and data processing platforms (selected based on several criteria including solution maturity, technical innovativeness, relevance, and data availability).
• Comprehensive data sourced from 850+ IT professionals on data management tools, strategies, and experiences.
• A database selection tool for discovering the strong and weak use cases of each database type.
• A guide to the emerging field of data science.

• Forecasts of the future challenges in processing and creating business value from Big Data.

Key Takeaways

Big Data Plans are Underway for Most Organizations
Big Data analysis is a key project for many organizations, with only 11% of survey respondents saying they have no plans to add large-scale data gathering and analysis to their systems. A large portion, however, are only in the exploratory stages with new data management technologies (35%). 24% are currently building or testing their first solution, and 30% have either deployed a solution or are improving a solution that they've already built. As you'll discover in this guide, this final group is finding plenty of new data correlations, but their biggest challenge is finding the causes behind these observations.

BIG DATA PROJECT STATUS: No Plans 11%, Exploring Stage 35%, Research Stage 7%, Proof of Concept Stage 14%, Testing Stage 3%, First Iteration Stage 8%, New Features Stage 9%, Production Stage 13%, Big Data Toolmaker

Hadoop is Now Prevalent in Modern IT
Apache Hadoop isn’t the sole focus of Big Data management, but it certainly opened the door for more cost-effective batch processing. Today
the project has a comprehensive arsenal of data processing tools and compatible projects. 53% of respondents have used Hadoop and 35% of
respondents’ organizations use Hadoop, so even though many organizations still don’t use Hadoop, a majority of developers have taken the initiative
to familiarize themselves with this influential tool. Looking forward, YARN and Apache Spark are likely to become highly influential projects as well.

RDBMS Still Dominates the Broader IT Industry
Relational database mainstays including MySQL and Oracle are being used at 59% and 54% of respondents' organizations respectively. SQL Server (45%) and PostgreSQL (33%) are also popular choices. Even though NoSQL databases gained a rabid following from developers who were fed up with certain aspects of RDBMS, today relational data stores are still a dominant part of the IT industry, and the strengths of SQL databases have been reiterated by many key players in the space. A multi-database solution, or polyglot persistence approach, is a popular pattern among today's experts. NoSQL databases are certainly making inroads at companies besides the high-profile web companies that created them. Currently, MongoDB is the most popular NoSQL database with 29% of respondents' organizations currently using it. For a look at the strong and weak use cases of each database type, see the Finding the Database For Your Use Case section of this guide.

WHICH DATABASES DOES YOUR ORGANIZATION USE? MySQL 59%, Oracle 54%, SQL Server 45%, PostgreSQL 33%, MongoDB 29%
[1] http://www.emc.com/about/news/press/2014/20140409-01.htm
[2] http://www.sciencedaily.com/releases/2013/05/130522085217.htm


Key Research Findings
More than 850 IT professionals responded to DZone’s 2014 Big Data Survey. Here are the
demographics for this survey:
• Developers (43%) and development team leads (26%) were the most common roles.
• 60% of respondents come from large organizations (100 or more employees) and 40%
come from small organizations (under 100 employees).
• The majority of respondents are headquartered in the US (35%) or Europe (38%).
• Over half of the respondents (63%) have over 10 years of experience as IT professionals.
• A large majority of respondents’ organizations use Java (86%). Python is the next highest (37%).
DATA SOURCES: Files 56%, Server Logs 55%, Enterprise Data 43%, User Generated 37%, Sensors/Hardware 25%, Logistics/Supply 23%

Files and Logs are the Most Common Data Sources
The first step to getting a handle on your data is understanding the sources and types
of data that your systems record and generate. In the three Vs of Big Data, sources and
types of data represent your data variety. After asking the respondents about the data
sources they deal with, we found that files such as documents and media are the most
common (56%), while server logs from applications are close behind (55%). Even though
the Big Data explosion is being driven mainly by user generated data (37%), most of the
companies in this survey aren’t dealing with it as much as files, server logs, and enterprise
system data such as ERP and CRM entries. Much of the exploding user generated data is
in the hands of the high-profile web companies like Facebook, Google, and Twitter.

LARGEST DATA TYPES

Next, users were asked about the types of data they analyze. Structured data
(76%) and semi-structured data (53%) are the two most common types of data.
Among unstructured data types, event messaging data (34%), text (38%), and
clickstreams (32%) are the most common. For most organizations, their largest
set of data being analyzed is structured data (39%). However, when filtering
just for organizations that have a high number of data processing nodes or that
use NoSQL, unstructured data types and user generated data sources are more
common than the overall survey pool.

Unstructured Data is More Common in Hadoop and NoSQL Orgs
(Charts: LARGEST DATA TYPES and LARGEST DATA SOURCES, covering structured data, text, enterprise systems, server logs, web logs, and files)

Cloud Storage Customers Take Full Advantage

AVERAGE USAGE OF STORAGE MEDIA TYPES: Disk Drives 81%, Hosted Storage 64%, Solid State Drives 35%, Magnetic Tape 33%, Optical Disks 28%

Another important aspect of Big Data is volume. Sometimes the easiest way to
deal with data volume is more hardware. The most common storage medium
for respondents is disk-drives (84%), which is not surprising; however, solid-state
drives (34%) and third-party hosted storage (22%) also have significant traction.
The average disk-drive-using respondent uses it for 81% of their storage needs. If
a respondent uses hosted storage, they often make heavy usage of it, averaging
about 64% of their storage.


Almost All Orgs Expect Their Storage Needs to Grow Exponentially
Just how much volume do today’s IT organizations have? The majority of respondents surveyed don’t have more than 9 TBs in their
entire organization, but in just the next year, only 10% think they will have less than 1 TB. The bulk of organizations expect to have
between 1 and 50 TBs. Since one year predictions are often more accurate than 2-5 year predictions, this is strong evidence that most
organizations will experience exponential data growth. The past trends of the IT industry also provide clear evidence that exponential
data growth is common and a constant factor in technology.

Estimated organization data storage and usage (currently vs. in the next year)
<1 TB: 24% currently, 10% in the next year
1-9 TB: 29% currently, 25% in the next year
10-49 TB: 16% currently, 20% in the next year
50-99 TB: 9% currently, 13% in the next year
100-499 TB: 11% currently, 17% in the next year
500 TB-1 PB: 6% currently, 8% in the next year
>1 PB: 5% currently, 7% in the next year

Hadoop Usage is High Despite the Learning Curve
53% of respondents have used Apache Hadoop. However, only 29% have used
it at their work, while 38% have used it for personal projects. This indicates that
the respondents are preparing themselves for the tool’s increasing ubiquity in
modern data processing. The top three uses for Hadoop among respondents
were statistics and pattern analysis (63%), data transformation and preparation
(60%), and reporting/BI (53%). When asked about their experience with
Hadoop, 39% said it was difficult to use. This is reflected in another question
that asked about the most challenging aspects of Hadoop. The three biggest
are the learning curve (68%), development effort (59%), and hiring experienced
developers (44%). Finally, users were asked about the Hadoop distribution they
use. The most commonly used distribution is the basic Apache distribution
(48%), but close behind is Cloudera (40%).

WHICH ROLES MANAGE ANALYTICS IN YOUR ORGANIZATION?
Architect 40%, Developer 40%, Data Scientist 35%, DBA 27%, Data Analyst 25%, IT Director 23%

IS HADOOP DIFFICULT TO USE?
No: 61%, Yes: 39%

Big Data is Very Much in the Hands of the Developers
The three departments in respondents’ organizations that are most
commonly responsible for providing a data analytics environment are
operations, research or analysis, and the application group. When it comes
to developing and managing data analytics, application architects and
developers are called upon most often in the surveyed organizations (40%).
Data scientists were also common stewards of data analytics with 35% of
organizations utilizing them. The least likely manager of data analytics was
the CIO or similar executive (8%).

Teams Running Hadoop and Column Stores
Tend to Have Bigger Analytics Clusters
Almost half of organizations (46%) have just one to four data
processing nodes in a data processing cluster, but there is a
wide distribution of respondents using various node amounts.
Companies that gather sensor data, user data, or logistics/supply
chain data tend to have higher node counts among respondents,
which suggests that these are probably more data-heavy sources.
Organizations with audio/video data or scientific data are more
likely than other segments to have over four data nodes. Also,
companies that use Hadoop often had more data processing
nodes than non-Hadoop companies. This is where it becomes
clear how much Hadoop helps with handling multi-node data
processing clusters. Teams with data scientist titles and teams
running HBase or Cassandra also tend to have more nodes.

HOW MANY NODES ARE TYPICALLY IN YOUR DATA PROCESSING CLUSTERS?
1-4 nodes: 46%, 5-9 nodes: 22%, 10-19 nodes: 15%, 20-49 nodes: 7%, 50+ nodes: 10%


The No Fluff
Introduction to Big Data
by Benjamin Ball
Big Data traditionally has referred to a collection of data too massive to be handled efficiently by traditional database tools and methods. This original definition has expanded over the years to identify tools (Big Data tools) that tackle extremely large datasets (NoSQL databases, MapReduce, Hadoop, NewSQL, etc.), and to describe the industry challenge posed by having data harvesting abilities that far outstrip the ability to process, interpret, and act on that data. Technologists knew that those huge batches of user data and other data types were full of insights that could be extracted by analyzing the data in large aggregates. They just didn't have any cheap, simple technology for organizing and querying these large batches of raw, unstructured data.

The term quickly became a buzzword for every sort of data processing product's marketing team. Big Data became a catchall term for anything that handled non-trivial sizes of data. Sean Owen, a data scientist at Cloudera, has suggested that Big Data is a stage where individual data points are irrelevant and only aggregate analysis matters [1]. But this is true for a 400 person survey as well, and most people wouldn't consider that very big. The key part missing from that definition is the transformation of unstructured data batches into structured datasets. It doesn't matter if the database is relational or non-relational. Big Data is not defined by a number of terabytes: it's rooted in the push to discover hidden insights in data that companies used to disregard or throw away.

Due to the obstacles presented by large scale data management, the goal for developers and data scientists is two-fold: first, systems must be created to handle large-scale data, and second, business intelligence and insights should be acquired from analysis of the data. Acquiring the tools and methods to meet these goals is a major focus in the data science industry, but it's a landscape where needs and goals are still shifting.

What are the Characteristics of Big Data?
Tech companies are constantly amassing data from a variety of digital sources that is almost without end—everything from email addresses to digital images, MP3s, social media communication, server traffic logs, purchase history, and demographics. And it's not just the data itself, but data about the data (metadata). It is a barrage of information on every level. What is it that makes this mountain of data Big Data?

One of the most helpful models for understanding the nature of Big Data is "the three Vs": volume, velocity, and variety.

Data Volume
Volume is the sheer size of the data being collected. There was a point in not-so-distant history where managing gigabytes of data was considered a serious task—now we have web giants like Google and Facebook handling petabytes of information about users' digital activities. The size of the data is often seen as the first challenge of characterizing Big Data storage, but even beyond that is the capability of programs to provide architecture that can not only store but query these massive datasets. One of the most popular models for Big Data architecture comes from Google's MapReduce concept, which was the basis for Apache Hadoop, a popular data management solution.

Data Velocity
Velocity is a problem that flows naturally from the volume characteristics of Big Data. Data velocity is the speed at which data is flowing into a business's infrastructure and the ability of software solutions to receive and ingest that data quickly. Certain types of high-velocity data, such as streaming data, need to be moved into storage and processed on the fly. This is often referred to as complex event processing (CEP). The ability to intercept and analyze data that has a lifespan of milliseconds is widely sought after. This kind of quick-fire data processing has long been the cornerstone of digital financial transactions, but it is also used to track live consumer behavior or to bring instant updates to social media feeds.
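To make "processing on the fly" concrete at the smallest possible scale, here is a rough Scala sketch; the event type, window size, and alert threshold are hypothetical, and a real deployment would rely on a CEP engine or stream processor rather than a single in-memory queue. It evaluates every event against a one-minute sliding window the moment the event arrives:

import scala.collection.mutable

// Hypothetical event: an arrival timestamp in milliseconds plus a payload.
case class Event(timestampMs: Long, payload: String)

class SlidingWindowCounter(windowMs: Long = 60000L, alertThreshold: Int = 1000) {
  private val window = mutable.Queue[Event]()

  // Called once per incoming event; the window is inspected immediately,
  // not in a nightly batch.
  def onEvent(e: Event): Unit = {
    window.enqueue(e)
    // Evict anything older than the window.
    while (window.nonEmpty && window.head.timestampMs < e.timestampMs - windowMs)
      window.dequeue()
    if (window.size > alertThreshold)
      println(s"Spike detected: ${window.size} events in the last minute")
  }
}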

Data Variety
Variety refers to the source and type of data collected. This data could be anything from raw image data to sensor readings, audio recordings, social media communication, and metadata. The challenge of data variety is being able to take raw, unstructured data and organize it so that an application can use it. This kind of structure can be achieved through architectural models that traditionally favor relational databases—but there is often a need to tidy up this data before it will even be useful to store in a refined form. Sometimes a better option is to use a schemaless, non-relational database.

How Do You Manage Big Data?
The Three Vs is a great model for getting an initial understanding of what makes Big Data a challenge for businesses. However, Big Data is not just about the data itself, but the way that it is handled. A popular way of thinking about these challenges is to look at how a business stores, processes, and accesses their data.

• Store: Can you store the vast amounts of data being collected?
• Process: Can you organize, clean, and analyze the data collected?
• Access: Can you search and query this data in an organized manner?

The Store, Process, and Access model is useful for two reasons: it reminds businesses that Big Data is largely about managing data, and it demonstrates the problem of scale within Big Data management. "Big" is relative. The data batches that challenge some companies could be moved through a single Google datacenter in under a minute. The only question a company needs to ask itself is how it will store and access increasingly massive amounts of data for its particular use case. There are several high level approaches that companies have turned to in the last few years.

The Traditional Approach
The traditional method for handling most data is to use relational databases. Data warehouses are then used to integrate and analyze data from many sources. These databases are structured according to the concept of "early structure binding"—essentially, the database has predetermined "questions" that can be asked based on a schema. Relational databases are highly functional, and the goal with this type of data processing is for the database to be fully transactional. Although relational databases are the most common persistence type by a large margin (see Key Findings pg. 4-5), a growing number of use cases are not well-suited for relational schemas. Relational architectures tend to have difficulty when dealing with the velocity and variety of Big Data, since their structure is very rigid. When you perform functions such as JOIN on many complex data sets, the volume can be a problem as well. Instead, businesses are looking to non-relational databases, or a mixture of both types, to meet data demand.

The Newer Approach - MapReduce, Hadoop, and NoSQL Databases
In the early 2000s, web giant Google released two helpful web technologies: Google File System (GFS) and MapReduce. Both were new and unique approaches to the growing problem of Big Data, but MapReduce was chief among them, especially when it comes to its role as a major influencer of later solution models. MapReduce is a programming paradigm that allows for low cost data analysis and clustered scale-out processing.

MapReduce became the primary architectural influence for the next big thing in Big Data: the creation of the Big Data management infrastructure known as Hadoop. Hadoop's open source ecosystem and ease of use for handling large-scale data processing operations have secured a large part of the Big Data marketplace.

Besides Hadoop, there was a host of non-relational (NoSQL) databases that emerged around 2009 to meet a different set of demands for processing Big Data. Whereas Hadoop is used for its massive scalability and parallel processing, NoSQL databases are especially useful for handling data stored within large multi-structured datasets. This kind of discrete data handling is not traditionally seen as a strong point of relational databases, but it's also not the same kind of data operations that Hadoop runs. The solution for many businesses ends up being a combination of these approaches to data management.

Finding Hidden Data Insights
Once you get beyond storage and management, you still have the enormous task of creating actionable business intelligence (BI) from the datasets you've collected. This problem of processing and analyzing data is maybe one of the trickiest in the data management lifecycle. The best options for data analytics will favor an approach that is predictive and adaptable to changing data streams. The thing is, there are so many types of analytic models, and different ways of providing infrastructure for this process. Your analytics solution should scale, but to what degree? Scalability can be an enormous pain in your analytical neck, due to the problem of decreasing performance returns when scaling out many algorithms.

Ultimately, analytics tools rely on a great deal of reasoning and analysis to extract data patterns and data insights, but this capacity means nothing for a business if they can't then create actionable intelligence. Part of this problem is that many businesses have the infrastructure to accommodate Big Data, but they aren't asking questions about what problems they're going to solve with the data. Implementing a Big Data-ready infrastructure before knowing what questions you want to ask is like putting the cart before the horse.

But even if we do know the questions we want to ask, data analysis can always reveal many correlations with no clear causes. As organizations get better at processing and analyzing Big Data, the next major hurdle will be pinpointing the causes behind the trends by asking the right questions and embracing the complexity of our answers.

[1] http://www.quora.com/What-is-big-data

WRITTEN BY
Benjamin Ball
Benjamin Ball is a Research Analyst and Content Curator at DZone. When he's not producing technical content or tweeting about someone else's (@bendzone), he is an avid reader, writer, and gadget collector.


Being data-driven has
never been easier
New Relic Insights ™ is a real-time
analytics platform that collects and
analyzes billions of events directly
from your software to provide instant,
actionable intelligence about your
applications, customers, and business.

Start making better decisions today
www.newrelic.com/insights

“We’ve come to rely on
New Relic Insights so much
that we’ve basically stopped
using all of our other analytics tools.”
Andrew Sutherland
Founder and CTO, Quizlet

©2008-14 New Relic, Inc. All rights reserved.


Sponsored Opinion

Tackling Big Data Challenges
with Software Analytics
Companies today are amassing a tidal wave of data created from their production software, websites, mobile applications, and back-end systems. They recognize the value that lies in this big data, yet many struggle to make data-driven decisions. Businesses need answers to their data questions now—not in days or weeks.

While there are various analytics tools out there that can help address such challenges—anything from web analytics, business intelligence tools, log search, NoSQL, and Hadoop—each falls short in some capacity. When evaluating solutions for big data analysis, consider the following criteria:

• Real-time data collection and analysis. Does the solution allow you to collect and ingest data in real time or is there a lag between collection and analysis? Many analytics tools don't provide data in real time, often resulting in up to 24-hour lag times between collection and reporting. For example, you might have visibility into the front-end user experience, but not into back-end technical performance issues that might be impacting the frontend.

• Lightning-fast querying. In addition to real-time data collection, can you query the database and get an immediate answer? Do you have to wait for a nightly batch process to occur, or for an existing query to finish running, before you can see the results? Many open source technologies have attempted to make progress on this front (Hadoop, Hive, Tez, Impala, Spark), but they are still not delivering the real-time analysis businesses need to succeed.

• Flexibility and granularity of data. Can you easily add custom data types and associated attributes using your analytics toolset? Can you add them at volume without hitting some arbitrary limit or needing to upgrade to a higher service tier? Companies need the ability to add data into their analytics tool and get user-level analyses out of their datasets. This means being able to go beyond simple aggregations to get useful segments of your data to generate insights that impact your bottom line.

• Ease of setup and use. What does it require to set up and get your analytics environment up and running? It shouldn't take a team of developers weeks or months to build out your data pipeline, nor should it require advanced data scientists to get the answers to your questions. Instead, look for a tool that's easy for both technical and business users to run queries.

Software analytics combines these key capabilities, gathering metrics in real time from live production software and transforming them into actionable data. It allows you to ask your software questions and get immediate answers. In short, software analytics is all about removing the time, complexity, and high cost of analytics, so that companies both big and small can make fast and informed data decisions that help propel the business forward. To learn more, visit: www.newrelic.com/insights.

by Ankush Rustagi
Product Marketing Manager, New Relic

New Relic Insights by New Relic
Data Platform, Operational Intelligence
New Relic's analytics platform provides real-time data collection and querying capabilities based on closed-source database technologies.
Database integrations: New Relic Query Language
Hosting: SaaS
Hadoop: No Hadoop Support
Integration support: None
Notable customers: Sony, Microsoft, Nike, NBC, Groupon, Intuit
High Availability, Load Balancing, Automatic Failover
Strengths:
• Provides actionable, real-time business insights from the billions of metrics your software is producing
• Collects every event automatically, as it happens, directly from the source
• Stores data in a cloud-hosted database - no installation, provisioning or configuration required
• Queries billions of real-time events in milliseconds using SQL-like query language
• Enables fast and informed decision making about your software, customers, and business
Full profile: dzone.com/r/pdHQ
Website: newrelic.com | Twitter: @newrelic | Proprietary


The Evolution of
MapReduce and Hadoop
by Srinath Perera & Adam Diaz
With its Google pedigree, MapReduce has had a
far-ranging impact on the computing industry [1].
It is built on the simple concept of mapping (i.e.
filtering and sorting) and then reducing data (i.e.
running a formula for summarization), but the true
value of MapReduce lies with its ability to run these
processes in parallel on commodity servers while
balancing disk, CPU, and I/O evenly across each
node in a computing cluster. When used alongside
a distributed storage architecture, this horizontally
scalable system is cheap enough for a fledgling
startup. It is also a cost-effective alternative for
large organizations that were previously forced
to use expensive high-performance computing
methods and complicated tools such as MPI
(the Message Passing Interface library). With
MapReduce, companies no longer need to delete old
logs that are ripe with insights—or dump them onto
unmanageable tape storage—before they’ve had a
chance to analyze them.
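As a minimal illustration of that idea, the following single-machine Scala sketch (the log lines are made up) runs a map step that emits key-value pairs, groups them by key, and then reduces each group to a summary; a real MapReduce job distributes the same phases across the disks and CPUs of a cluster:

// Toy "map, shuffle, reduce" over an in-memory collection.
val logLines = Seq("GET /index", "GET /about", "POST /login", "GET /index")

val mapped  = logLines.map(line => (line.split(" ")(0), 1))   // map: emit (key, 1)
val grouped = mapped.groupBy { case (key, _) => key }         // shuffle: group by key
val reduced = grouped.map { case (key, pairs) =>              // reduce: summarize each group
  (key, pairs.map(_._2).sum)
}
println(reduced)  // e.g. Map(GET -> 3, POST -> 1)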
Hadoop Takes Over
Today, the Apache Hadoop project is the most widely used
implementation of MapReduce. It handles all the details required
to scale MapReduce operations. The industry support and
community contributions have been so strong over the years
that Hadoop has become a fully-featured, extensible data-processing platform. There are scores of other open source
projects designed specifically to work with Hadoop. Apache Pig
and Cascading, for instance, provide high-level languages and
abstractions for data manipulation. Apache Hive provides a data
warehouse on top of Hadoop.
As the Hadoop ecosystem left the competition behind,
companies like Microsoft, who were trying to build their own
MapReduce platform, eventually gave up and decided to
support Hadoop under the pressure of customer demand [2].
Other tech powerhouses like Netflix, LinkedIn, Facebook, and
Yahoo (where the project originated) have been using Hadoop
for years. A new Hadoop user in the industry, TRUECar, recently


reported having a cost of $0.23 per GB with Hadoop. Before
Hadoop, they were spending $19 per GB [3]. Smaller shops
looking to keep costs even lower have tried to run virtual
Hadoop instances. However, virtualizing Hadoop is the subject of
some controversy amongst Hadoop vendors and architects. The
cost and performance of virtualized Hadoop is fiercely debated.
Hadoop’s strengths are more clearly visible in use cases such as
clickstream and server log analytics. Analytics like financial risk
scores, sensor-based mechanical failure predictions, and vehicle
fleet route analysis are just some of the areas where Hadoop is
making an impact. With some of these
industries having 60 to 90 day time limits
on data retention, Hadoop is unlocking
insights that were once extremely difficult
to obtain in time. If an organization is allowed to store data longer, the Hadoop Distributed File System (HDFS) can save data in its raw, unstructured form while it waits to be processed, just like the NoSQL databases that have broadened our options for managing massive data.

"We don't really use MapReduce anymore." - Urs Hölzle, Google

Where MapReduce Falls Short
• It usually doesn’t make sense to use Hadoop and MapReduce
if you’re not dealing with large datasets like high-traffic web
logs or clickstreams.
• Joining two large datasets with complex conditions—a
problem that has baffled database people for decades—is
also difficult for MapReduce.
• Machine learning algorithms such as KMeans and Support
Vector Machines (SVM) are often too complex for MapReduce.
• When the map phase generates too many keys (e.g. taking
the cross product of two datasets), then the mapping phase
will take a very long time.
• If processing is highly stateful (e.g. evaluating a state
machine), MapReduce won’t be as efficient.
As the software industry starts to encounter these harder use
cases, MapReduce will not be the right tool for the job, but
Hadoop might be.

Hadoop Adapts
Long before Google's dropping of MapReduce, software vendors
and communities were building new technologies to handle
some of the harder use cases described above. The Hadoop project



made significant changes just last year and
now has a cluster resource management
platform called YARN that allows developers
to use many other non-MapReduce
technologies on top of it. The Hadoop
project contributors were already thinking
about a resource manager for Hadoop back
in early 2008 [4].

With YARN, developers can run a variety
of jobs in a YARN container. Instead of
scheduling the jobs, the whole YARN
container is scheduled. The code inside that
container can be any normal programming
construct, so MapReduce is just one of many
application types that Hadoop can harness.
Even the MPI library from the pre-MapReduce days can run on
Hadoop. The number of products and projects that the YARN
ecosystem enables is too large to list here, but this table will give
you an idea of the wide ranging capabilities YARN can support:

Category: Projects
Search: Solr, Elasticsearch
NoSQL: HBase, Accumulo
Streaming: Storm, Spark Streaming
In-Memory: Impala, Spark
Proprietary Apps and Vendors: Microsoft, SAS, SAP, Informatica, HP, etc.

As you saw in the table, there are plenty of technologies that take full advantage of the YARN model to expand Hadoop's analysis capabilities far beyond the limits of the original Hadoop. Apache Tez greatly improves Hive query times. Cloudera's Impala project is a massively parallel processing (MPP) SQL query engine. And then there's Apache Spark, which is close to doubling its contributors in less than a year [5].

Apache Spark Starts a Fire
Spark runs natively on YARN. In addition to supporting MapReduce-style operations, Spark lets you point to a large dataset and define a virtual variable to represent the large dataset. Then you can apply functions to each element in the dataset and create a new dataset. So you can pick the right functions for the right kinds of data manipulation. But that's not even the best part.

The real power of Spark comes from performing operations on top of virtual variables. Virtual variables enable data flow optimization across one execution step to the other, and they should optimize common data processing challenges (e.g. cascading tasks and iterations). Spark Streaming uses a technology called "micro-batching" while Storm uses an event-driven system to analyze data.
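As a rough sketch of what working against such a "virtual variable" looks like in code, the snippet below assumes a local Spark 1.x setup and a hypothetical logs.txt file; each line only describes a derived dataset (an RDD), and nothing is computed until the final action is called:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("RDD sketch"))
val lines  = sc.textFile("logs.txt")                          // point at a large dataset
val errors = lines.filter(line => line.contains("ERROR"))     // derive a new dataset
val counts = errors.map(line => (line.split(" ")(0), 1))      // hypothetical key extraction
                   .reduceByKey(_ + _)                        // aggregate per key
println(counts.count())                                       // action: triggers execution
sc.stop()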

Three Ways to Start Using YARN
Below are three basic options for using YARN (but not the only options). The complexity decreases as you go down the list, but the granular control over the project also decreases:

1. Directly code a YARN application master to create a YARN application. This will give you more control over the behavior of the application, but it will be the most challenging to program.
2. Use Apache Tez, which has a number of features including more complex directed acyclic graphs than MapReduce, Tez sessions, and the ability to express data processing flows through a simple Java API.
3. Use Apache Slider, which provides a client to submit JAR files for launching work on YARN-based clusters. Slider provides the least programmatic control out of these three options, but it also has the lowest cost of entry for trying out new code on YARN because it provides a ready to use application master.

For organizations migrating from Hadoop 1.x (pre-YARN) to Hadoop 2, the migration shouldn't be too difficult since the APIs are fully compatible between the two versions. Most legacy code should just work, but in certain very specific cases custom source code may need to simply be recompiled against newer Hadoop 2 JARs.

Just One Tool in the Toolbox
MapReduce's main strength is simplicity. When it first emerged in the software industry, it was widely adopted and soon became synonymous with Big Data, along with Hadoop. Hadoop is still the toolbox most commonly associated with Big Data, but now more organizations are realizing that MapReduce is not always the best tool in the box.

[1] http://research.google.com/archive/mapreduce.html
[2] http://www.zdnet.com/blog/microsoft/microsoft-drops-dryad-puts-its-big-data-bets-on-hadoop/11226
[3] http://blogs.wsj.com/cio/2014/06/04/hadoop-hits-the-big-time/
[4] https://issues.apache.org/jira/browse/MAPREDUCE-279
[5] http://inside-bigdata.com/2014/07/15/theres-spark-theres-fire-state-apache-spark-2014/

WRITTEN BY
Adam Diaz
Adam Diaz is a Hadoop Architect at Teradata. He has previously worked at big data powerhouses like IBM, SAS, and Hortonworks.

WRITTEN BY
Srinath Perera
Srinath Perera is a Research Director and architect at WSO2. He is a member of the Apache Software Foundation, a PMC member of the Apache Web Services project, a committer on Apache Axis, Axis2, and Geronimo, and a co-founder of Apache Axis2.


White Paper

Getting Started with Spark
If you are exploring Big Data and wondering if Spark is right for your
Reactive application, then this white paper is for you. It provides an
insightful overview of new trends in Big Data and includes handy
diagrams of representative architectures, such as:
• Hadoop with MapReduce and Spark
• Event streaming Reactive apps with Typesafe
• Streaming data with a combination of Akka and Spark

DOWNLOAD YOUR COPY OF

Getting Started with Spark at typesafe.com/spark


Sponsored Opinion

Building Big Data Applications with
Spark & the Typesafe Reactive Platform
Why Spark?

In the Hadoop community, an emerging consensus is forming around Apache
Spark as the next-generation, multi-purpose compute engine for Big Data
applications. Spark improves upon the venerable MapReduce compute engine
in several key areas:
• Spark provides a more flexible and concise programming model for
developers, and features significantly better performance in most
production scenarios.
• Spark supports traditional batch-mode applications, but it also provides a
streaming model.
• The functional programming foundation of Spark and its support for
iterative algorithms provide the basis for a wide range of libraries—
including SparkSQL for integrated SQL-based queries over data with defined
schemas, Spark Streaming for handling incoming events in near-real time,
GraphX for computations over graphs, and MLlib for machine-learning.
• Spark scales down to a single machine and up to large clusters. Spark jobs
can run in Hadoop using the YARN resource manager, on a Mesos cluster,
or in small standalone clusters.

Spark’s greater flexibility offers new opportunities for integrating data analytics
in event streaming reactive applications built with the Typesafe Reactive
Platform. A possible architecture is shown in the following figure.

(Architecture figure: REST requests and Reactive Streams feed Play and Akka services, which stream data into Spark; Mesos manages the deployed services, with access to databases and local or distributed file systems.)

Services can be deployed and managed by Mesos, providing efficient allocation of cluster resources running on bare hardware or Infrastructure-as-a-Service (IaaS). Play and Akka implement web services and ingest reactive streams of data from message queues and other sources. Database access is also shown. Akka streams data to Spark, whose streaming model works on time slices of event traffic.

Spark is used to perform analytics: anything from running aggregations to machine-learning algorithms. Spark and Akka can read and write data in local or distributed file systems. Additional batch-mode Spark jobs would run periodically to perform large-scale data analysis, like aggregations over long time frames, and ETL (extract, transform, and load) tasks like data cleansing, reformatting, and archiving.

Spark and Scala
Functional programming provides a set of operations on data that are broadly applicable and work together to build non-trivial transformations of data. The Scala library implements these operations for data sets that fit into memory for a single JVM process. Scala developers can write concise, expressive code, with high productivity.

The Spark API scales up the idioms of the Scala library to the size of clusters. Therefore, developers using Spark enjoy the same productivity benefits that other Scala developers enjoy. The Spark API is also remarkably similar to the Scala Collections API. Let's look at a simple example, the famous Word Count algorithm, where we read in one or more documents, tokenize them into words, then count the occurrences of every word.

The following listing shows an implementation using the Scala Collections API, where we read all the text from a single file.

Scala Code
import java.io._
import scala.io._

val wordsCounted = Source.fromFile(...)              // Read from a file,
  .getLines.map(line => line.toLowerCase)            // convert to lower case,
  .flatMap(line => line.split("""\W+""")).toSeq      // split into words,
  .groupBy(word => word)                             // group words together,
  .map { case (word, group) => (word, group.size) }  // count group sizes,
val out = new PrintStream(new File(...))             // write results.
wordsCounted foreach (word_count => out.println(word_count))

The comments provide the essential details. Note how concise this source code is!

The Spark implementation looks almost the same. There are differences in handling the input and output, and in how the environment is set up and torn down, but the core logic is identical. The same idioms work for small data in a single process all the way up to a massive cluster.

Spark Code
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("local", "Word Count (2)")  // "Context" driver
// Except for how input/output is handled, the sequence of calls is identical.
val wordsCounted = sc.textFile(args(0)).map(line => line.toLowerCase)
  .flatMap(line => line.split("""\W+"""))
  .groupBy(word => word)
  .map { case (word, group) => (word, group.size) }
wordsCounted.saveAsTextFile(args(1))
sc.stop()

Apache Spark is a natural extension to the Typesafe Reactive Platform for adding sophisticated data analytics.

by Dean Wampler
Consultant, Typesafe


The Developer's Guide to Data Science
By Sander Mak

When developers talk about using data, they are
usually concerned with ACID, scalability, and other
operational aspects of managing data. But data
science is not just about making fancy business
intelligence reports for management. Data drives
the user experience directly, not after the fact.
Large scale analysis and adaptive features are being built into the
fabric of many of today’s applications. The world is already full of
applications that learn what we like. Gmail sorts our priority inbox
for us. Facebook decides what’s important in our newsfeed on our
behalf. E-commerce sites are full of recommendations, sometimes
eerily accurate. We see automatic tagging and classification of
natural language resources. Ad-targeting systems predict how
likely you are to click on a given ad. The list goes on and on.
Many of the applications discussed above emerged from web
giants like Google, Yahoo, and Facebook and other successful
startups. Yes, these places are filled to the brim with very smart
people, working on the bleeding edge. But make no mistake, this
trend will trickle down into “regular” application development too.
In fact, it already has. When users interact with slick and intelligent
apps every day, their expectations for business applications rise as
well. For enterprise applications it’s not a matter of if, but when.
This is why many enterprise developers will need to familiarize
themselves with data science. Granted, the term is incredibly
hyped, but there’s a lot of substance behind the hype. So we
might as well give it a name and try to figure out what it means for
us as developers.

From developer to data scientist
How do we cope with these increased expectations? It’s not just
a software engineering problem. You can’t just throw libraries at
it and hope for the best. Yes, there are great machine learning
libraries, like Apache Mahout (Java) and scikit-learn (Python). There
are even programming languages squarely aimed at doing data
science, such as the R language. But it’s not just about that. There
is a more fundamental level of understanding you need to attain
before you can properly wield these tools.

14

This article will not be enough to gain the required level of understanding. It can, however, show you the landmarks along the road to data science. Drew Conway's data science Venn diagram, from which this article is adapted, shows the lay of the land [1]: hacking skills, math and statistics knowledge, and substantive (domain) expertise, with data science at their intersection.

As software engineers, we can relate to hacking skills. It's our bread and butter. And that's good, because from that solid foundation you can branch out into the other fields and become more well-rounded.

Let's tackle domain expertise first. It may sound obvious, but if you want to create good models for your data, then you need to know what you're talking about. This is not strictly true for all approaches. For example, deep learning and other machine learning techniques might be viewed as an exception. In general though, having more domain-specific knowledge is better. So start looking beyond the user-stories in your backlog and talk to your domain experts about what really makes the clock tick. Beware though: if you only know your domain and can churn out decent code, you're in the danger zone. This means you're at risk of re-inventing the wheel, misapplying techniques, and shooting yourself in the foot in a myriad of other ways.

Of course, the elephant in the room here is "math & statistics." The link between math and the implementation of features such as recommendation or classification is very strong. Even if you're not building a recommender algorithm from scratch (which hopefully you wouldn't have to), you need to know what goes on under the hood in order to select the right one and to tune it correctly. As the diagram points out, the combination of domain expertise and math and statistics knowledge is traditionally the expertise area of researchers and analysts within companies. But when you combine these skills with software engineering prowess, many new doors will open.

What can you do as a developer if you don't want to miss the bus? Before diving head-first into libraries and tools, there are several areas where you can focus your energy:

• Data management
• Statistics
• Math

We'll look at each of them in the remainder of this article. Think of these items as the major stops on the road to data science.

Data management
Recommendation, classification, and prediction engines cannot be
coded in a vacuum. You need data to drive the process of creating/
tuning a good recommender engine for your application, in your
specific context. It all starts with gathering relevant data, which
might already be in your databases. If you don’t already have the
data, you might have to set up new ways of capturing relevant data.
Then comes the act of combining and cleaning data. This is also
known as data wrangling or munging. Different algorithms have
different pre-conditions on input data. You’ll have to develop a
strong intuition for good data versus messy data.
Typically, this phase of a data science project is very experimental.
You’ll need tools that help you quickly process lots of
heterogeneous data and iterate on different strategies. Real world
data is ugly and lacks structure. Dynamic scripting languages are
often used to filter and organize data because they fit this challenge
perfectly. A popular choice is Python with Pandas or the R language.


It’s important to keep a close eye on
everything related to data munging.
Just because it’s not production code,
doesn’t mean it’s not important. There
won’t be any compiler errors or test
failures when you silently omit or distort
data, but it will influence the validity
of all subsequent steps. Make sure
you keep all your data management
scripts, and keep both mangled and
unmangled data. That way you can
always trace your steps. Garbage in,
garbage out applies as always.
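As a small example of the kind of munging script worth keeping around, here is a hedged Scala sketch (the file names and the "userId,country,amount" layout are hypothetical): it separates malformed rows from clean ones and writes both out, so the raw data is never silently lost and every step can be traced later.

import scala.io.Source
import java.io.PrintWriter

// Hypothetical raw input: "userId,country,amount" lines, possibly messy.
val raw = Source.fromFile("signups_raw.csv").getLines().toList

// Partition into parseable and rejected rows instead of silently dropping data.
val (clean, rejected) = raw.map(_.split(",").map(_.trim)).partition { cols =>
  cols.length == 3 && cols(2).matches("""-?\d+(\.\d+)?""")
}

// Light normalization on the clean rows (e.g. upper-case country codes).
val normalized = clean.map(cols => (cols(0), cols(1).toUpperCase, cols(2).toDouble))

val cleanOut = new PrintWriter("signups_clean.csv")
normalized.foreach { case (id, country, amount) => cleanOut.println(s"$id,$country,$amount") }
cleanOut.close()

val rejectOut = new PrintWriter("signups_rejected.csv")
rejected.foreach(cols => rejectOut.println(cols.mkString(",")))
rejectOut.close()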

Statistics
Once you have data in the appropriate format, the time has come to do something useful with it. Much of the time you'll be working with sample data to create models that handle yet unseen data. How can you infer valid information from this sample? How do you even know your data is representative? This is where we enter the domain of statistics, a vitally important part of data science. I've heard it said: "a Data Scientist is a person who is better at statistics than any software engineer and better at software engineering than any statistician."

What should you know? Start by mastering the basics. Understand probabilities and probability distributions. When is a sample large enough to be representative? Know about common assumptions such as independence of probabilities, or that values are expected to follow a normal distribution. Many statistical procedures only make sense in the context of these assumptions. How do you test the significance of your findings? How do you select promising features from your data as input for algorithms? Any introductory material on statistics can teach you this. After that, move on to Bayesian statistics. It will pop up more and more in the context of machine learning.

It's not just theory. Did you notice how we conveniently glossed over the "science" part of data science up till now? Doing data science is essentially setting up experiments with data. Fortunately, the world of statistics knows a thing or two about experimental setup. You'll learn that you should always divide your data into a training set (to build your model) and a test set (to validate your model). Otherwise, your model won't work for real-world data: you'll end up with an overfitting model. Even then, you're still susceptible to pitfalls like multiple testing. There's a lot to take into account.
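The train/test discipline is easy to get into the habit of even without a framework. A minimal Scala sketch, assuming a hypothetical Example case class for labeled records, is to shuffle once with a fixed seed and hold out a fraction of the data for validation:

import scala.util.Random

case class Example(features: Vector[Double], label: Double)  // hypothetical labeled record

// Hold out a fraction of the data; fit on the first element, validate on the second.
def trainTestSplit(data: Seq[Example], testFraction: Double = 0.2, seed: Long = 42L)
    : (Seq[Example], Seq[Example]) = {
  val shuffled = new Random(seed).shuffle(data)
  val testSize = (data.size * testFraction).toInt
  (shuffled.drop(testSize), shuffled.take(testSize))
}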

Math
Statistics tells you about the when and why, but for the how, math
is unavoidable. Many popular algorithms such as linear regression,
neural networks, and various recommendation algorithms all boil
down to math. Linear algebra, to be more precise. So brushing up
on vector and matrix manipulations is a must. Again, many libraries
abstract over the details for you, but it is essential to know what is
going on behind the scenes in order to know which knobs to turn.
When results are different than you expected, you need to know
how to debug the algorithm.
It’s also very instructive to try and code at least one algorithm from
scratch. Take linear regression for example, implemented with
gradient descent. You will experience the intimate connection
between optimization, derivatives, and linear algebra when
researching and implementing it. Andrew Ng’s Machine Learning
class on Coursera takes you through this journey in a surprisingly
accessible way.
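For the curious, here is a compact version of that exercise: simple one-variable linear regression fit by batch gradient descent in plain Scala. The data, learning rate, and iteration count are illustrative only; a serious implementation would add feature scaling and a convergence check.

// Fit y = w * x + b by stepping against the gradient of the mean squared error.
def fitLine(xs: Array[Double], ys: Array[Double],
            lr: Double = 0.01, iters: Int = 1000): (Double, Double) = {
  var w = 0.0
  var b = 0.0
  val n = xs.length
  for (_ <- 1 to iters) {
    // Partial derivatives of MSE with respect to w and b.
    val gradW = (0 until n).map(i => (w * xs(i) + b - ys(i)) * xs(i)).sum * 2.0 / n
    val gradB = (0 until n).map(i => w * xs(i) + b - ys(i)).sum * 2.0 / n
    w -= lr * gradW
    b -= lr * gradB
  }
  (w, b)
}

// Points on y = 2x + 1 should recover w close to 2 and b close to 1.
val (w, b) = fitLine(Array(0.0, 1.0, 2.0, 3.0, 4.0), Array(1.0, 3.0, 5.0, 7.0, 9.0))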

But wait, there’s more...
Besides the fundamentals discussed so far, getting good at data
science includes many other skills, such as clearly communicating
the results of data-driven experiments, or scaling whatever
algorithm or data munging method you selected across a cluster
for large datasets. Also, many algorithms in data science are
“batch-oriented,” requiring expensive recalculations. Translation
into online versions of these algorithms is often necessary.
Fortunately, many (open source) products and libraries can help
with the last two challenges.


Data science is a fascinating combination between real-world
software engineering, math, and statistics. This explains why the
field is currently dominated by PhDs. On the flipside, we live in
an age where education has never been more accessible, be it
through MOOCs, websites, or books. If you want to read a hands-on
book to get started, read Machine Learning for Hackers, then move
on to a more rigorous book like Elements of Statistical Learning.
There are no shortcuts on the road to data science. Broadening
your view from software engineering to data science will be hard,
but certainly rewarding.
[1] http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

WRITTEN B Y

Sander Mak
Sander Mak works as Senior Software Engineer
at Luminis Technologies. He has been awarded a
JavaOne Rockstar award for his talk Data Science with
R for Java Developers and is giving three talks at this
year’s JavaOne conference. Sander speaks regularly at
various international developer conferences, sharing
his passion for Java and JVM languages.


Sponsored Opinion

A Modern Approach to
SQL-on-Hadoop
Modern applications for social, mobile, and sensor data are generating an order of magnitude more data than ever before. It's not just the scale, but the variety and variability of these datasets that are a challenge. These datasets are often self-describing, include complex content and evolve rapidly, making it difficult for traditional DBAs to maintain the schemas required in SQL RDBMSs. This delays time to insight from data for business analysts.

One such example is JSON, the lingua franca of data for APIs, data exchange, data storage, and data processing. HBase, another example, is a highly scalable NoSQL database capable of storing 1000s of columns in a single row, and every row has its own schema. Other formats/systems include Parquet, Avro, and Protobuf.

New projects from the Apache Hadoop community such as Apache Drill take a different approach to SQL-on-Hadoop. The goal: perform self-service data exploration by bringing the SQL ecosystem and performance of the relational systems to Big Data scale without compromising on Hadoop/NoSQL flexibility.

The core elements of Drill include:

• Agility: Perform direct queries on self-describing, semi-structured data in files and HBase tables without needing to specify metadata definitions in a centralized store; this saves weeks or months on data preparation, modeling, and subsequent schema management.

• Flexibility: Drill provides a JSON-like internal data model to represent and process data, so you can query both simple and complex/nested data types.

• Familiarity: Use Drill to leverage familiar ANSI SQL syntax and BI/analytics tools through JDBC/ODBC drivers.

To learn more, read the quick start guide and visit the Apache Drill web site.

by Neeraja Rentachintala
Director of Product Management, MapR

MapR Distribution including Apache Hadoop
Data Platform | by MapR Technologies | Proprietary & Open Source
Data Management, Data Integration, Analytics

MapR's distribution features true built-in enterprise-grade features like high availability, full multi-tenancy, integrated optimized NoSQL, and full NFS access.

Database Integrations: Oracle, HBase, IBM DB2, SQL Server, MongoDB, MapR-DB
Hosting: SaaS, PaaS, On-Premise
Hadoop: Built on Hadoop
Integration Support: ETL
Notable Customers: Samsung, Cisco, comScore, Beats Music, HP, TransUnion
Features: High Availability, Load Balancing, Automatic Failover

Strengths:
• Built-in enterprise grade capabilities to ensure no data or work loss despite disasters
• Full multi-tenancy to manage distinct user groups, data sets, or jobs in a single cluster
• Higher performance to do more work with less hardware, resulting in lower TCO
• Integrated security to ensure enterprise data is only accessed by authorized users
• Integrated NoSQL to run combined operational and analytical workloads in one cluster

Full profile link: dzone.com/r/7u6h
Website: mapr.com | Twitter: @mapr

This blog post was written by DZone’s in-house content curator Alec Noller, who actively writes for DZone’s NoSQL Zone.
You can stop by DZone’s NoSQL Zone for daily news and articles: http://dzone.com/mz/nosql

Oversaturation Problem
BY ALEC NOLLER

It’s a familiar story at this point - trying out NoSQL, then
moving back to relational databases - and the response is
generally consistent as well: NoSQL will only be useful if
you understand your individual problem and choose the
appropriate solution.
According to Matthew Mombrea at IT World, though,
that doesn’t quite cover it. In this recent article, he
shares his own “NoSQL and back again” thoughts,
which hinge on the idea that there are simply too many
NoSQL solutions out there, preventing newcomers from
jumping right in.
Mombrea acknowledges that there are use cases where
NoSQL is ideal. However, he argues that there are some
major drawbacks that require additional effort:
It’s helpful to think of NoSQL as a flat file
storage system where the filename is the
key and the file contents are the value.
You can store whatever you want in these
files and you can read/write to them very
quickly, but . . . the brains of a relational
database are gone and you’re left to
implement everything you’ve taken for
granted with SQL in your code . . . for every
application. The overhead is not justifiable
for most applications.

Beyond that, he argues that the advantages of NoSQL
don't even apply to most use cases:

The big draw to NoSQL is its ability to scale out with ease and
to provide very high throughput. While it would be really nice
to have the same scalability with an RDBMS, the real-world fact
is that 99% of applications will never operate at a scale where
it matters. Look at Stack Exchange. They are one of the most
trafficked sites on the planet and they run on MSSQL using
commodity servers.

Given these drawbacks, how is one to decide what solution is
appropriate? If nothing else, NoSQL demands quite a bit more
research, which is hard to justify when it's not even clear that
it's necessary or beneficial. This is potentially an
oversimplification, though, of the use cases that call for NoSQL
solutions. According to Moshe Kaplan's list of times when one
ought to choose MongoDB over MySQL, for example, there are quite
a few other scenarios. Just a few ideas:

• If you need high availability in an unreliable environment
• If your data is location-based
• If you don't have a DBA

Mombrea's conclusion, though, still hits an interesting point:
NoSQL is young, and adoption may become more practical as it
matures, given that such a central aspect is understanding which
solution is appropriate for any given job. That may be a more
manageable task when the market has thinned a bit, leaving behind
a handful of well-tested solutions to well-defined problems.

Find this article online: http://bit.ly/1z45kCY

Find more at DZone's NoSQL and Big Data Zones:
NoSQL: dzone.com/mz/nosql
Big Data: dzone.com/mz/big-data

Alec Noller is the Senior Content Curator at DZone. When he's not
creating and curating content for DZone, he spends his time writing
and developing Java and Android applications.

DIVING DEEPER INTO BIG DATA
TOP TEN #BIGDATA TWITTER FEEDS

@KirkDBorne

@kdnuggets

@marcusborba

@data_nerd

@BigDataGal

@jameskobielus

@medriscoll

@spyced

@IBMbigdata

@InformaticaCorp

TOP 6 BIG DATA WEBSITES

Big Data University bit.ly/1pdcuxt
A collection of courses that teach users about a wide variety of Big Data concepts, frameworks, and uses, including Hadoop, analytics, and relational databases.

Big Data and the History of Information Storage bit.ly/1AHocGZ
A timeline of Big Data concepts, from the 1880 US census to the modern digital data explosion.

Top 5 Big Data Resources bit.ly/1qKSg3i
A list of five top articles about various Big Data concepts, from Hadoop to data mining to machine learning.

DB-Engines Ranking bit.ly/1rZY37n
Database rankings according to popularity and industry prevalence.

The Data Store: on Big Data bit.ly/1oDxZYW
Current news and information on Big Data concepts from The Guardian.

What is Big Data? oreil.ly/1tQ62E1
O'Reilly's introduction to Big Data's basic concepts.

DZONE REFCARDZ
Big Data Machine Learning: Patterns for Predictive Analytics bit.ly/WUFymk
This Refcard covers machine learning for predictive analytics, a powerful instrument in the developer's Big Data toolbox.

Apache HBase: The NoSQL Database for Hadoop and Big Data bit.ly/1nRgDYh
HBase is the Hadoop database: a distributed, scalable Big Data store that lets you host very large tables — billions of rows multiplied by millions of columns — on clusters built with commodity hardware.

MongoDB: Flexible NoSQL for Humongous Data bit.ly/YEgXEi
This cheat sheet covers a bunch of handy and easily forgotten options, commands, and techniques for getting the most out of MongoDB.

Getting Started with Apache Hadoop bit.ly/1wnFbQI
This Refcard presents Apache Hadoop, a software framework that enables distributed storage and processing of large datasets using simple high-level programming models.

DZONE BIG DATA ZONES

Big Data Zone http://dzone.com/mz/big-data
We're on top of all the best tips and news for Hadoop, R, and data visualization technologies. We also give you advice from data science experts on how to understand and present that data.

NoSQL Zone http://dzone.com/mz/nosql
DZone's portal for following the news and trends of the non-relational database ecosystem, including solutions such as MongoDB, Cassandra, Redis, and many others.

SQL Zone http://www.dzone.com/mz/sql
DZone's portal for following the news and trends of the relational database ecosystem, which includes solutions such as MySQL, PostgreSQL, SQL Server, and many others.

TOP 6 BIG DATA TUTORIALS

Hadoop Tutorial Modules yhoo.it/1wiJhH3
The Yahoo! Hadoop tutorial, with an introduction to Hadoop and Hadoop tutorial modules.

Hadoop, Hive, and Pig bit.ly/1wiJbiM
A tutorial from Hortonworks on using Hadoop, Hive, and Pig.

R Introduction bit.ly/1rVhyrM
An in-depth introduction to the R language.

Hive Tutorial bit.ly/1qKRlQ8
A tutorial on getting started with Hive.

Using R in Hive bit.ly/1pXZtI7
A tutorial on using R in MapReduce and Hive.

91 R Tutorials bit.ly/1sD047U
91 tutorials to help you explore R.

The DIY Big Data Cluster
By Chanwit Kaewkasi

Hadoop has been widely adopted to run on both public and private clouds. Unfortunately, the public cloud is not always a safe place for your sensitive data.

The Heartbleed exploit is one recent example of a major security bug in many public infrastructures that went undiscovered for years. For some organizations, it makes more sense to use a private cloud. Unfortunately, private clouds can be costly both in terms of building and operating, especially with x86-class processors. To save on operating costs, Baidu (the largest Chinese search engine) has recently changed its servers from x86-based to custom ARM servers. After this transition, they reported 25% savings in total cost of ownership.

An on-premise Hadoop cluster built using system-on-a-chip (SoC) boards is the perfect answer for a cash-strapped startup or low-budget departmental experiment. It removes the need for a data center and gives you more control over security and tuning than a public cloud service. This article will introduce a method for building your own ARM-based Hadoop cluster. Even if you aren't interested in actually building one of these clusters, you can still learn a lot about how Hadoop clusters are modeled, and what the possibilities are.

The Aiyara Cluster

Readers are always amazed when I show them that they can have their own Hadoop cluster for data analytics without a data center. My colleagues and I call ours the Aiyara cluster model for Big Data.

The first Aiyara cluster was presented on DZone as an ARM cluster consisting of 22 Cubieboards (Cubieboard A10). These ARM-based SoC boards make it fully modular so that each node can be easily replaced or upgraded. If you're familiar with ARM architecture, you know that these devices will produce less heat (no cooling system required) and consume less power than other chip architectures like x86. In Thailand, it costs our group $0.13 a day to run our 22-board cluster.

Twenty boards are used as Spark worker boards and the other two are the master and driver boards. All of the worker nodes are connected to a solid-state drive (SSD) via SATA port, and two of the workers are also running Hadoop data nodes. As far as performance goes, our cluster is even faster than we expected. The software stack we use is simply Apache Spark over HDFS, and in our benchmarks we are able to batch process 34GB of Wikipedia articles in 38 minutes. After batch processing, each ad hoc query can be executed in 9-14 seconds.

Building Your Own Cluster

If you'd like to try making your own ARM cluster, I'd recommend starting with a 10-node cluster. You can pick an alternate board or use the same one that we did. Eight of them will be workers and the other two will be your master and driver boards. Your board can be any ARM SoC that has:

1. At least 1 GHz CPU
2. At least 1 GB memory
3. At least 4 GB of built-in NAND flash
4. A SATA connector
5. A 100 Mbps or greater ethernet connector

You'll also need an ethernet switch with enough ports. The size of each SSD depends on how much storage you want. 120 GB for each board would be fine for a 10-node cluster, but the default replication level of Hadoop is three, so your usable capacity will be one-third of the total capacity. Finally, you'll need a power supply large enough for the cluster, and you'll need a power splitter to send power to all the boards. These are the minimum requirements for building an Aiyara cluster.

Fig 1. Block Diagram for Physical Cluster Model

Software Stack

To build an Aiyara cluster, you need the software packages listed below.
• Linux 3.4+ (distribution of your choice)
• Java Development Kit 1.7+ for ARM
• Apache Hadoop 2.3+ and Apache Spark 0.9+ from CDH5
• Python 2.4 and Ansible 1.6+ for cluster-wide management
• Ganglia or JMX for cluster monitoring

Fig 2. Overview of the logical clusters, HDFS, and Spark, laid atop the hardware boards.

The host name and IP address on each node are set through DHCP at startup via the DHCP host name script. We have found that it is good to map the host name to the node's IP address rather than allow it to be assigned randomly. For example, we map 192.168.0.11 to node01 and 192.168.0.22 to node12, following the "node${ip - 10}" pattern. This technique allows us to scale up to 200 nodes per logical rack, and it will work fine in most cases.
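
As a rough illustration of that naming rule, here is a tiny Python sketch; it is a hypothetical helper, not the actual DHCP host name script used on the Aiyara cluster:

# Hypothetical sketch of the "node${ip - 10}" naming rule described above:
# 192.168.0.11 -> node01, 192.168.0.22 -> node12.
def hostname_for(ip):
    last_octet = int(ip.split(".")[-1])
    return "node%02d" % (last_octet - 10)

print(hostname_for("192.168.0.11"))  # node01
print(hostname_for("192.168.0.22"))  # node12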
We write Bash scripts to perform cluster-wide operations through SSH. For example, we have a script to perform disk checks on every worker node using the fsck command via SSH before starting the cluster. The following are the steps for properly starting the cluster (a small driver sketch follows the list):

• Perform a file system check on every node
• Mount the SSD on every node. Rather than having fstab auto-mount on each node, we mount the storage manually via the master node.
• Start HDFS (NameNode and DataNodes)
• Start Spark (master and workers)
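
Those steps lend themselves to a small driver script run from the master board. The Python sketch below is only illustrative: the host names, device paths, and install locations are assumptions, not the Bash scripts the Aiyara team actually uses.

# Illustrative cluster-start helper (assumed host names, paths, and devices).
import subprocess

WORKERS = ["node%02d" % i for i in range(1, 9)]  # node01 .. node08

def remote(host, command):
    """Run a command on a node over SSH; raise if it fails."""
    subprocess.check_call(["ssh", host, command])

# 1. File system check on every worker's data disk before starting anything.
for host in WORKERS:
    remote(host, "fsck -y /dev/sda1")

# 2. Mount each SSD manually from the master instead of relying on fstab.
for host in WORKERS:
    remote(host, "mount /dev/sda1 /data/hdfs")

# 3. Bring up HDFS, then Spark (script locations depend on your installation).
remote("master", "/opt/hadoop/sbin/start-dfs.sh")
remote("master", "/opt/spark/sbin/start-all.sh")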

When it comes to maintaining software packages, doing so with plain scripts and SSH becomes harder to manage. To make cluster management easier, Ansible is a natural choice for us. Ansible is compatible with the armhf (ARM) architecture on Debian, and it uses an agent-less architecture, so we only need to install it on the master board.

For monitoring, there are several tools you can use. If you would like to use Ganglia to monitor the whole cluster, you need to install Ganglia's agent on each node. My team chose a more convenient option: we just use JMX to monitor all Spark nodes with VisualVM, a tool that comes with the JDK. Techniques for monitoring Linux and Java servers can generally be applied to this kind of cluster. With JMX, we can observe not only CPU usage, but also JVM-related resources such as garbage collection and thread behavior. Logs from HDFS and Spark are also important for troubleshooting.

Managing Performance

We normally use the internal NAND flash to store the operating system, Hadoop, and Spark on each node. External SSDs should only be used to store data written by HDFS. When something goes wrong with an SSD, we can just replace it with a new one. Then, we just ask Hadoop to re-balance data to the newly added storage. In contrast, when a worker board fails, we can just re-flash the whole file system to its internal NAND flash. If this recovery process does not help, just throw the board away and buy a new one (Cubieboards cost us $49 each). That's one of the big advantages of having a commodity cluster.

Because of memory limitations on each node, we configure Spark to spill intermediate results to internal NAND flash when the JVM heap is full. But writing a large set of small files to ext4fs over a NAND device is not a good idea: many write failures occurred and the file system became unstable.

For an Aiyara cluster, we solve this problem by setting up a swap partition for each worker node, then mounting a tmpfs (an in-memory file system) to use as the spill directory for Spark. In our Aiyara Cluster Mk-I, we have a 2 GB swap partition on each node. When a Spark executor spills intermediate results to the tmpfs and they are later paged out to disk, small files in the tmpfs are grouped into larger blocks.
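
A minimal sketch of that arrangement is shown below, assuming the tmpfs is already mounted (for example at /mnt/spark-spill) and swap is enabled. The property name is Spark's standard scratch-directory setting, but the paths, master URL, and memory size are illustrative assumptions rather than the actual Aiyara configuration.

# Illustrative PySpark configuration: spill/scratch space on a tmpfs mount,
# which the kernel can page out to the swap partition when memory runs low.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://master:7077")            # assumed master URL
        .setAppName("wikipedia-batch")
        .set("spark.local.dir", "/mnt/spark-spill")  # scratch dir on tmpfs
        .set("spark.executor.memory", "512m"))       # modest heap for 1 GB boards

sc = SparkContext(conf=conf)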

Fig 3. The cluster architecture to solve spilling problems on ext4fs over the NAND flash device.

Performance degradation from paging out generally does not affect a Big Data cluster used for batch processing, so we do not need to worry about this issue. The only requirement is that we tune network parameters to prevent dissociation of Spark's executors.

Start building your own!

My team believes that the Aiyara cluster model is a viable solution for batch processing, stream processing, and interactive ad-hoc querying on a shoestring budget. All of the components and techniques have been thoroughly tested during our research. In the near future, ARM SoC boards will become cheaper and even more powerful. I believe that having this kind of low-cost Big Data cluster in a small or medium size company will become a more compelling alternative to managing a data center or outsourcing your data processing.

WRITTEN BY

Chanwit Kaewkasi
Chanwit Kaewkasi is an Assistant Professor at the Suranaree University of Technology's School of Computer Engineering in Thailand. He currently co-develops a series of low-cost Big Data clusters with Spark and Hadoop. He is also a contributor to the Grails framework, and leads development of the ZK plugin for Grails.



Sponsored Opinion

Big Data & Multi-Cloud Go Hand-In-Hand

In IT organizations around the globe, CTOs, CIOs, and CEOs are asking the same question: "how can we use Big Data technologies to improve our platform operations?" Whether you're responsible for overcoming real-time monitoring and alerting obstacles, or providing solutions to platform operations analysis, behavioral targeting, and marketing operations, the solutions for each of these use cases can vary widely.

Given the wide variety of new and complex solutions, however, it's no surprise that a recent survey of IT professionals showed that more than 55% of Big Data projects fail to achieve their goals. The most significant challenge cited was a lack of understanding of and the ability to pilot the range of technologies on the market.

This challenge systematically pushes companies toward a limited set of proprietary platforms that often reduce the choice down to a single technology, perpetuating the tendency to seek one cure-all technology solution. But this is no longer a realistic strategy.

When it comes to Big Data and the cloud, variety is the key. There isn't a single one-size-fits-all solution for every one of your use cases, and assuming a single solution fits all use cases is a pitfall that could cost you your job. As a result, companies are frequently using three to five Big Data solutions, and their platform infrastructure now spans a mix of cloud and dedicated servers.

With the freedom to use multiple solutions, the challenge is how to use them effectively. Whether you are choosing a cloud provider or a Big Data technology, you never want to be locked into a single vendor. When you're evaluating solutions, it makes sense to try out a few options, run some tests, and ensure you have the right solution for your particular use case.

It's easy for an executive to tell you, "I want to use Hadoop," but it's your job that's on the line if Hadoop doesn't meet your specific needs. Similarly, if your cloud vendor has a vulnerability, only businesses with a multi-cloud strategy can shift workloads to another cloud with little or no impact to their customers.

Bottom line, businesses no longer have to tie themselves to a single solution or vendor and hope it's the right decision. Freedom to choose use-case-specific solutions on top of a reliable, multi-cloud infrastructure empowers businesses to take advantage of the best technologies without the risk.

No single technology such as a database can solve every problem, especially when it comes to Big Data. Even if such a unique solution could serve multiple needs, successful companies are always trialing new solutions in the quest to perpetually innovate and thereby achieve (or maintain) a competitive edge.

by Andrew Nester, Director of Marketing, GoGrid LLC

GoGrid Orchestration Services Big Data Cloud
Big Data PaaS | by GoGrid | Proprietary

GoGrid offers a dedicated PaaS for running big data applications, including several one-button deployments to simplify the production of big data apps.

Description
GoGrid offers an easier, faster way for users to take advantage of big data with a cloud platform. Businesses can benefit from lower costs, orchestrated solution deployment, and support for several open source solutions. These solutions are integrated through a 1-Button Deploy system to simplify the process of moving Big Data applications from trial to pilot project and finally to full-scale production.

Strengths
• 1-Button Deploy support for several databases including Cassandra, DataStax, and HBase
• PaaS solution designed specifically for working with big data
• Utilizes Hadoop for predictive analytics, processing of large data sets, clickstream analysis, and managing log file data
• Support for several open source solutions means customers have no proprietary or platform lock-in

Notable Customers: Condé Nast Digital, Glam, Merkle, MartiniMedia, Artizone
Free Trial: 14-day free trial
Full profile link: dzone.com/r/Tvu6
Website: gogrid.com | Twitter: @gogrid

Finding the Database for your Use Case
This chart will help you find the best types of databases to try testing with your software.
Relational DB
Examples: MySQL, PostgreSQL, SQL Server

Strong Use Cases:
• When ACID transactions are required
• Looking up data by different keys with secondary indexes (also a feature of several NoSQL DBs)
• When strong consistency for results and queries is required
• Conventional online transaction processing
• Risk-averse projects seeking very mature technologies and widely available skills
• Products for enterprise customers that are more familiar with relational DBs

Weak Use Cases:
• Systems that need to tolerate partition failures
• Schema-free management
• Handling any complex / rich entities that require you to do multiple joins to get the entire entity back

Key-Value Store
Examples: Redis, Riak, DynamoDB

Strong Use Cases:
• Handling lots of small, continuous, and potentially volatile reads and writes; also look for any DB with fast in-memory access or SSD storage
• Storing session information, user preferences, and e-commerce carts
• Simplifying the upgrade path of your software with the support of optional fields, adding fields, and removing fields without having to build a schema migration framework

Weak Use Cases:
• Correlating data between different sets of keys
• Saving multiple transactions (Redis is exempt from this weakness)
• Performing well during key searches based on values (DynamoDB is exempt)
• Operating on multiple keys (it's only possible through the client side)
• Returning only partial values is required
• Updates in place are necessary

Document Store
Examples: MongoDB, Couchbase, RavenDB

Strong Use Cases:
• Handling a wide variety of access patterns and data types
• Handling reads with low latency
• Handling frequently changing, user generated data
• Simplifying the upgrade path of your software with the support of optional fields, adding fields, and removing fields without having to build a schema migration framework
• Deployment on a mobile device (Mobile Couchbase)

Weak Use Cases:
• Atomic cross-document operations (RavenDB is exempt)
• Querying large aggregate data structures that frequently change
• Returning only partial values is required
• Joins are desired
• Foreign key usage is desired
• Partial updates of documents (especially child/sub-documents)

Column Store
Examples: Cassandra, HBase, Accumulo

Strong Use Cases:
• When high availability is crucial, and eventual consistency is tolerable
• Event Sourcing
• Logging continuous streams of data that have no consistency guarantees
• Storing a constantly growing set of data that is accessed rarely
• Deep visitor analytics
• Handling frequently expiring data (Redis can also set values to expire)

Weak Use Cases:
• Early prototyping or situations where there will be significant query changes (high cost for query changes compared to schema changes)
• Referential integrity required
• Processing many columns simultaneously

Graph Store
Examples: Neo4j, Titan, Giraph

Strong Use Cases:
• Handling entities that have a large number of relationships, such as social graphs, tag systems, or any link-rich domain
• Routing and location services
• Recommendation engines or user data mapping
• Dynamically building relationships between objects with dynamic properties
• Allowing a very deep join depth

Weak Use Cases:
• High volume write situations
• Serving and storing binary data
• Querying unrestricted across massive data sets

Sources: NoSQL Distilled, High Scalability


The Solutions Directory

This directory of data management and analysis tools provides comprehensive, factual comparison data gathered from third-party sources and the tool creators' organizations. Solutions in the directory are selected based on several impartial criteria, including solution maturity, technical innovativeness, relevance, and data availability. The solution summaries underneath the product titles are based on the organization's opinion of its most distinguishing features.

NOTE: The bulk of information gathered about these solutions is not present in these quarter-page profiles. For example, the Language/Drivers Supported and Database Integration sections only contain a subset of the databases and languages that these solutions support. To view an extended profile of any product, you can click the shortcode link found at the bottom of each profile, or simply go to dzone.com/zb/products and enter the shortcode at the end of the link.

Actian Analytics Platform
Data Platform
Data Management, Data Integration, Analytics

Actian's platform offers complete end-to-end analytics, built with next-gen software architecture that can be run on commodity hardware.
hadoop support

Db integrations

Hadoop integrations
available

 Oracle SQL Server 
IBM DB2 SAP HANA 
 MongoDB

integration support

•R
• SAS

Built-in ide

Cloud hosting

No IDE

SaaS, PaaS, On-Premise

Stream processing

Mapreduce job designer

No

FULL PROFILE LINK

Proprietary

Aerospike
Database

Data Platform

Aerospike is a flash-optimized database that indexes
data in RAM or flash for predictable low latency, high
throughput, and ACID transactions.
Languages/drivers supported

• Asynchronous

 C   C++  Java
 Node.js   Python

SQL Support

transactions Supported

“Compare-and-set” type
transactions

Consistency model

Auto-Sharding

Yes

Indexing Capabilities

Simple indexes
Full Text Search

Open Source
TWITTER @aerospikedb

Data Management, Data Integration, Analytics

hadoop support

 MySQL PostgreSQL 
MongoDB

integration support

Statistical Languages

• ETL
• ELT

• None

Built-in ide

Cloud hosting

No IDE

SaaS, PaaS

Stream processing

Mapreduce job designer

FULL PROFILE LINK

Proprietary

dzone.com/r/j3GR
website aerospike.com

Db integrations

Hadoop integrations
available

Yes

No

Amazon

Kinesis is a fully managed service for real-time processing
of streaming data at massive scale.

Replication

Strong consistency

website actian.com

Amazon Kinesis

In-Memory, Unordered Key-Value

No

dzone.com/r/QM4k

TWITTER @ActianCorp

Aerospike

• Synchronous

Statistical Languages

• ETL
• ELT

No

Get easy access to full product profiles with this URL.

Actian

TWITTER @awscloud

Yes

FULL PROFILE LINK dzone.com/r/VsbC
website aws.amazon.com


BigMemory
Data Platform

Software AG

Data Management, Data Integration

BigMemory supports hundreds of terabytes of data in-memory with a predictable latency of low milliseconds
regardless of scale.
hadoop support

Db integrations

No Hadoop Support

integration support

Data Platform

Actuate

Data Management, Data Integration, Analytics

BIRT iHub has inherent extensibility at multiple levels,
enabling developers to address the most complex
application requirements.
hadoop support

 Oracle SQL Server 
IBM DB2 MySQL 

Built on Hadoop

Db integrations

 Oracle SQL Server 
IBM DB2 SAP Hana
 MongoDB

• ETL

Yes

business modeler

integration support
• ELT

• None

Built-in ide

Cloud hosting

Built-in ide

Cloud hosting

Eclipse

SaaS, On-Premise

Eclipse

SaaS, PaaS, On-Premise

Stream processing

Mapreduce job designer

Stream processing

Mapreduce job designer

Yes

No

FULL PROFILE LINK

Open Source

dzone.com/r/LG7u

TWITTER @softwareag_NA

website terracotta.org

Cassandra
Database

Columnar

Languages/drivers supported

• Asynchronous

 C#  Go  Java
 Node.js  Ruby

SQL Support

transactions Supported

No

No support

Consistency model

Tunable per operation
Auto-Sharding

Yes

Indexing Capabilities

Rich query language
Full Text Search

Via DataStax

Open Source
TWITTER @cassandra

Proprietary
TWITTER @actuate

Data Platform

Replication

• Synchronous

Yes

Statistical Languages

Yes

FULL PROFILE LINK

dzone.com/r/YyWw
website actuate.com

Cloudera Enterprise

Apache Cassandra is a NoSQL database originally
developed at Facebook, and is currently used at tech
firms like Adobe and Netflix.


BIRT iHub

FULL PROFILE LINK

dzone.com/r/zrPN
website cassandra.apache.org


Cloudera

Data Management, Data Integration, Analytics

A unified platform with compliance-ready security and
governance, holistic system management, broad partner
integration, and world-class support.
hadoop support

Built on Hadoop

integration support

Db integrations

 Oracle SQL Server 
IBM DB2 MongoDB
 Hbase
Statistical Languages

• ETL
• ELT

•R
• SAS

Built-in ide

Cloud hosting

No IDE

SaaS

Stream processing

Mapreduce job designer

Yes

Proprietary &
Open Source
TWITTER @Cloudera

No

FULL PROFILE LINK

dzone.com/r/Jf3V
website cloudera.com


Continuuity Reactor
Data Platform

Couchbase Server

Continuuity

Data Management, Data Integration, Analytics

Database

Couchbase

In-Memory, Unordered Key-Value, Document

Allows developers to build data applications quickly by
enabling them to focus on business logic and value rather
than infrastructure.

Couchbase Server features an integrated cache, rack
awareness, cross data center replication, and an
integrated administration console.

hadoop support

Replication

Languages/drivers supported

• Asynchronous

 C   C++  Java
 Node.js  Ruby

SQL Support

transactions Supported

Db integrations

Built on Hadoop

 MongoDB Cassandra
Hbase

integration support

No

Built-in ide

Cloud hosting

None

SaaS, PaaS

Stream processing

Mapreduce job designer

Yes

Open Source
TWITTER @continuuity

No

business modeler

• ETL
• ELT

“Compare-and-set” type
transactions

Consistency model

Strong consistency

Indexing Capabilities

Rich query language

Auto-Sharding

Yes

Full Text Search

Via ElasticSearch

No

FULL PROFILE LINK

FULL PROFILE LINK

Open Source

dzone.com/r/HaQ4
website continuuity.com

DataTorrent RTS
Data Platform

• Synchronous

DataTorrent

TWITTER @couchbase

website couchbase.com

FoundationDB

FoundationDB

Database

Data Management

dzone.com/r/RGL7

Ordered Key-Value

DataTorrent RTS enables real-time insights via an easy-to-use, high-performance, scalable, fault-tolerant Hadoop 2.0 native platform.

FoundationDB provides a key-value API with ordering
and full ACID transactions that allow users to layer
multiple data models.

hadoop support

Replication

Built on Hadoop

integration support

Db integrations

 Oracle SQL Server 
IBM DB2 SAP Hana
 MongoDB

Built-in ide

Cloud hosting

None

On-Premise

Auto-Sharding

Stream processing

Mapreduce job designer

Yes

TWITTER @datatorrent

transactions Supported

Full ANSI SQL

business modeler

No

Proprietary

 C   Java  Node.js 
Ruby  Python

SQL Support

• ETL

Yes

Languages/drivers supported

• Synchronous

Consistency model

Strong consistency

Arbitrary multi-statement
transactions spanning arbitrary
nodes
Indexing Capabilities

Rich query language
Full Text Search

Yes

No

FULL PROFILE LINK

Proprietary

dzone.com/r/aPAt
website datatorrent.com

TWITTER @FoundationDB

FULL PROFILE LINK dzone.com/r/fsVb
website foundationdb.com


Hazelcast
Database

HBase

Hazelcast

In-Memory, Relational, Document, Key-Value

Database

Columnar

Hazelcast is a small, open source, 3.1MB JAR database
library with no external dependencies that is easily
embeddable into database apps.

HBase excels at random, real-time read/write access
to very large tables of data atop clusters of commodity
hardware.

Replication

Languages/drivers supported

Replication

• Asynchronous

 C   C++  Java
 Python Scala

• Asynchronous

SQL Support

transactions Supported

SQL Support

• Synchronous

Limited subset
Consistency model

Tunable per database

Arbitrary multi-statement
transactions spanning arbitrary
nodes
Indexing Capabilities

Simple indexes

Auto-Sharding

Yes

Full Text Search

Consistency model

Strong consistency
Auto-Sharding

Yes

IBM DB2
Database

Indexing Capabilities

Via Solr
Full Text Search
FULL PROFILE LINK

Open Source

dzone.com/r/MPaA
website hazelcast.com

Arbitrary multi-statement
transactions spanning arbitrary
nodes

Via open source libraries

FULL PROFILE LINK

TWITTER @hazelcast

 C   C++  C#
 Java  Python
transactions Supported

No

Via open source libraries

Open Source

Languages/drivers supported

dzone.com/r/rNMp

TWITTER @HBase

website hbase.apache.org

InfiniteGraph

IBM

Database

Relational

Objectivity

Graph

IBM DB2 is a database for Linux, Windows, UNIX, and
z/OS, offering high-performing storage and analytics
capabilities for distributed systems.

InfiniteGraph is a database providing scalability of
data and processing with performance in a distributed
environment.

Replication

Languages/drivers supported

Replication

• Asynchronous

 C   C++  Java
 PHP  Ruby

• Synchronous

SQL Support

transactions Supported

SQL Support

• Synchronous

Full ANSI SQL
Consistency model

Eventual consistency
Auto-Sharding

Yes

Arbitrary multi-statement
transactions spanning arbitrary
nodes

Limited subset

Indexing Capabilities

Strong consistency

Simple indexes
Full Text Search

Consistency model

Auto-Sharding

No

TWITTER @ibm


 C++  Java  Python
 C#
transactions Supported

Arbitrary multi-statement
transactions spanning arbitrary
nodes
Indexing Capabilities

No
Full Text Search

Yes

Proprietary

Languages/drivers supported

Via Lucene
FULL PROFILE LINK

dzone.com/r/QQ4k
website ibm.com


Proprietary
TWITTER @objectivitydb

FULL PROFILE LINK

dzone.com/r/Qa4k
website objectivity.com


Informatica BDE
Data Platform

MapR

Informatica

Data Management, Data Integration

hadoop support

Db integrations

integration support

Data Management, Data Integration, Analytics

Data Platform

BDE provides a safe, efficient way to integrate & process
all types of data on Hadoop at any scale, without having
to learn Hadoop.
Hadoop integrations
available

MapR Technologies

MapR's platform features true built-in enterprise-grade features like high availability, full multi-tenancy,
integrated optimized NoSQL, and full NFS access.
hadoop support

 Oracle SQL Server 
IBM DB2 SAP Hana
 MongoDB

Built on Hadoop

Db integrations

 Oracle SQL Server 
IBM DB2 SAP HANA 
 MongoDB

• ETL
• ELT

Yes

business modeler

integration support
• ETL

•R
• SAS

Built-in ide

Cloud hosting

Informatica Developer

SaaS, On-Premise

Built-in ide

Cloud hosting

No IDE

SaaS, PaaS, On-Premise

Stream processing

Mapreduce job designer

Stream processing

Mapreduce job designer

Yes

Yes

FULL PROFILE LINK

Proprietary

Database

No

FULL PROFILE LINK

Proprietary &
Open Source

dzone.com/r/rWNp

TWITTER @INFA_BD

MemSQL

Yes

website informatica.com

Statistical Languages

dzone.com/r/7u6h

TWITTER @mapr

website mapr.com

MongoDB Enterprise

MemSQL

Database

In-Memory, Relational, Columnar

MongoDB, INC.

Document

MemSQL has fast data load and query execution during
mixed OLTP/OLAP workloads due to compiled query
plans and lock-free data structures.

MongoDB blends linear scalability and schema flexibility
with the rich query and indexing functionality of an
RDBMS.

Replication

Languages/drivers supported

Replication

Languages/drivers supported

• Asynchronous

 C   C++  Java
 Node.js  Ruby

• Synchronous
• Asynchronous

 C   C++  Java
 Node.js  Ruby

SQL Support

transactions Supported

SQL Support

transactions Supported

• Synchronous

Full ANSI SQL
Consistency model

Strong consistency
Auto-Sharding

Yes

Arbitrary multi-statement
transactions spanning arbitrary
nodes

No

Indexing Capabilities

Strong consistency

Rich query language
Full Text Search

“Compare-and-set” type
transactions

Consistency model
Indexing Capabilities

Rich query language

Auto-Sharding

Yes

Full Text Search

No

Proprietary
TWITTER @memsql

Yes
FULL PROFILE LINK

Open Source

dzone.com/r/TLv6
website memsql.com

TWITTER @mongoDBinc

FULL PROFILE LINK dzone.com/r/w9dP
website mongodb.com


Neo4j

New Relic Insights

Neo Technology

Database

Graph

Data Platform

Neo4j is a schema-optional, ACID, scalable graph
database with minutes-to-milliseconds performance
over RDBMS and NOSQL.
Replication

Languages/drivers supported

• Asynchronous

 C   C++  Java
 Node.js  Ruby

SQL Support

Operational Intelligence

New Relic’s analytics platform provides real-time data
collection and querying capabilities based on closedsource database technologies.
hadoop support

Db integrations

No Hadoop Support

 New Relic Query Language

transactions Supported

No
Consistency model

Eventual consistency
Auto-Sharding

No

Arbitrary multi-statement
transactions spanning arbitrary
nodes
Indexing Capabilities

Rich query language
Full Text Search

Via Lucene

website neo4j.com

Statistical Languages

• None

• None

Built-in ide

Cloud hosting

No IDE

SaaS

Stream processing

Mapreduce job designer

Yes

No

FULL PROFILE LINK

Proprietary

dzone.com/r/rdNp

TWITTER @neo4j

NuoDB

integration support

FULL PROFILE LINK

Open Source

Database

New Relic

dzone.com/r/pdHQ

TWITTER @newrelic

website newrelic.com/insights

Oracle Database

NuoDB
In-Memory, Distributed relational object store

Database

Oracle

Relational, Document, Columnar, Graph

NuoDB is a distributed, peer-to-peer database with
elastic scaling capabilities to store and analyze big data.

Oracle features a multitenant architecture, automatic
data optimization, defense-in-depth security, high
availability, and failure protection.

Replication

Languages/drivers supported

Replication

Languages/drivers supported

 C   C++  Java
 Node.js  Ruby

• Synchronous
• Asynchronous

 C   C++  Java
 Node.js  Ruby

transactions Supported

SQL Support

transactions Supported

• Asynchronous

SQL Support

Full ANSI SQL

Arbitrary multi-statement
transactions on a single node

Consistency model

Strong consistency
Auto-Sharding

No

Full ANSI SQL
Consistency model

Indexing Capabilities

Rich query language
Full Text Search

Strong consistency
Auto-Sharding

No

TWITTER @nuodb


Indexing Capabilities

Rich query language
Full Text Search

No

Proprietary

Arbitrary multi-statement
transactions spanning arbitrary
nodes

Yes
FULL PROFILE LINK

dzone.com/r/NdpH
website nuodb.com


Proprietary
TWITTER @OracleDatabase

FULL PROFILE LINK

dzone.com/r/ttUJ
website oracle.com/database


Pivotal Big Data Suite
Data Platform

Postgres Plus

Pivotal

Data Management, Data Integration, Analytics

Database

EnterpriseDB

Relational, Document, Key-Value

Pivotal Big Data Suite delivers HAWQ, GemFire,
Greenplum, and Hadoop in a single integrated analytics
and integration platform.

Postgres Plus Advanced Server advances PostgreSQL
with enterprise-grade performance, security and
manageability enhancements.

hadoop support

Replication

Languages/drivers supported

• Asynchronous

 C++  Java  Node.js 
Ruby  Python

Db integrations

Built on Hadoop

 MySQL   Hbase  Cassandra

• Synchronous

SQL Support
integration support
• ETL
• ELT

No

Built-in ide

Cloud hosting

Spring Tools Suite

SaaS, PaaS

Stream processing

Mapreduce job designer

Yes

Consistency model

Strong consistency

Indexing Capabilities

Rich query language

Auto-Sharding

No

Full Text Search

Yes

FULL PROFILE LINK

website pivotal.io

FULL PROFILE LINK

Open Source

dzone.com/r/wdrP

TWITTER @Pivotal

Database

Arbitrary multi-statement
transactions on a single node

No

Proprietary

RavenDB

transactions Supported

Full ANSI SQL

business modeler

dzone.com/r/MdaA

TWITTER @enterprisedb

Redis Cloud

Hibernating Rhinos

Database

Document

website enterprisedb.com

Redis Labs

In-Memory, Unordered Key-Value

RavenDB is a self-optimizing ACID database with multi-master replication, dynamic queries, and strong support
for reporting.

Redis Cloud is an infinitely scalable, highly available, and
top performing hosted Redis service.

Replication

Languages/drivers supported

Replication

Languages/drivers supported

 Java  Node.js   Python
 PHP  Scala

• Synchronous
• Asynchronous

 C   C++  Java
 Node.js  Ruby

transactions Supported

SQL Support

transactions Supported

• Asynchronous

SQL Support

No
Consistency model

Strong consistency
Auto-Sharding

Yes

Arbitrary multi-statement
transactions spanning arbitrary
nodes

No

Indexing Capabilities

Strong consistency

Rich query language
Full Text Search

Arbitrary multi-statement
transactions on a single node

Consistency model
Indexing Capabilities

No

Auto-Sharding

Yes

Full Text Search

Yes

Open Source
TWITTER @ravendb

No
FULL PROFILE LINK

Proprietary

dzone.com/r/U4Jf
website ravendb.net

TWITTER @redislabsinc

FULL PROFILE LINK dzone.com/r/3sjG
website redislabs.com


Riak

SAP HANA

Basho

Database

Unordered Key-Value

Data Platform

Riak excels at high write volumes using a straight
key value store and vector clocks to provide the most
flexibility for data storage.
Replication

Languages/drivers supported

• Synchronous

 C   C++  Java
 Node.js  Ruby

SQL Support

Consistency model

Eventual consistency

SAP HANA is a platform with a columnar database, an
app and web server, as well as predictive, spatial, graph,
and text processing libraries and engines.
hadoop support

Db integrations

Hadoop integrations
available

Riak 2.0 adds single-object compare-and-set for strongly consistent
bucket types
Indexing Capabilities

Rich query language

Auto-Sharding

Yes

Full Text Search

SAP HANA IBM DB2
 Cassandra

Open Source

Statistical Languages

• ETL
• ELT

•R

Built-in ide

Cloud hosting

No IDE

SaaS, PaaS, On-Premise

Stream processing

Mapreduce job designer

FULL PROFILE LINK

website basho.com

Job Designer

FULL PROFILE LINK

Proprietary

dzone.com/r/r9Np

TWITTER @basho

SAS Platform

integration support

Yes

Yes

dzone.com/r/Uksf

TWITTER @SAPInMemory

website sap.com

ScaleOut hServer

SAS

Data Management, Data Integration, Analytics

Database

ScaleOut Software

In-Memory, Key-Value

SAS provides advanced analytics, data management,
BI & visualization products for data scientists, business
analysts and IT.

Runs Hadoop MapReduce applications continuously on
live, fast-changing, memory-based data with low latency
and high scalability.

hadoop support

Db integrations

Replication

Languages/drivers supported

 Oracle IBM DB2
SAP HANA  Teradata
 Netezza

• Asynchronous

 C   C++  C#
 Java REST

SQL Support

transactions Supported

Built on Hadoop

integration support

Statistical Languages

Limited subset

• SAS
•R

Built-in ide

Cloud hosting

SAS AppDev Studio

SaaS, PaaS, On-Premise

Auto-Sharding

Stream processing

Mapreduce job designer

Yes

Proprietary
TWITTER @SASsoftware

• PMML

• Synchronous

• ETL
• ELT

Yes


Data Management, Analytics, Data Integration

transactions Supported

No

Data Platform

SAP AG

Consistency model

Strong consistency

dzone.com/r/L7vu
website sas.com


Indexing Capabilities

Rich query language
Full Text Search

Yes

FULL PROFILE LINK

“Compare-and-set” type
transactions

No

Proprietary
TWITTER @ScaleOut_Inc

FULL PROFILE LINK

dzone.com/r/d9PM
website scaleoutsoftware.com


Splunk Enterprise

Spring XD

Splunk

Operational Intelligence

Data Platform

Data Platform

Pivotal

Data Management, Data Integration, Analytics

Splunk is a fully-integrated platform that supports both
real-time and batch search, and treats time series data as
a first-class construct.

Spring XD offers a unified, distributed, highly available,
and extensible runtime for Big Data ingestion, analytics,
batch processing, and data export.

hadoop support

hadoop support

Db integrations

Hadoop integrations
available

 Oracle  SQL Server 
MongoDB Cassandra
 Hbase

integration support

Db integrations

Hadoop Integrations
Available

Statistical Languages

 Oracle  SQL Server 
 IBM DB2  SAP HANA
 MySQL

integration support

business modeler

• None

• None

• None

Built-in ide

Cloud hosting

Built-in ide

Cloud hosting

No IDE

SaaS, On-Premise

Spring Tools Suite

SaaS, PaaS, On-Premise

Stream processing

Mapreduce job designer

Stream processing

Mapreduce job designer

Yes

No

FULL PROFILE LINK

Proprietary

Database

website splunk.com/product

In-Memory, Relational

Data Platform

SQL Server is considered the de facto database for .NET
development. Its ecosystem is connected to numerous
.NET technologies.
Replication

Languages/drivers supported

• Asynchronous

 C#  Java Ruby
 PHP   Visual Basic

SQL Support

transactions Supported

Full ANSI SQL
Consistency model

Tunable per database
Auto-Sharding

No

Indexing Capabilities

Rich query language

TWITTER @Microsoft

Sumo Logic

Sumo Logic features machine-learning based analytics,
elastic log processing, real-time dashboards, and multitenant SaaS architecture.
hadoop support

Db integrations

 Oracle SQL Server 
IBM DB2 MySQL 

integration support

Statistical Languages

• None

• None

Built-in ide

Cloud hosting

No IDE

SaaS

Stream processing

Mapreduce job designer

FULL PROFILE LINK

Proprietary

dzone.com/r/CKLT
website microsoft.com

dzone.com/r/dPNM

Operational Intelligence

Yes

Yes

FULL PROFILE LINK

website pivotal.io

Hadoop integrations
available

Arbitrary multi-statement
transactions spanning arbitrary
nodes

Full Text Search

Proprietary

TWITTER @Pivotal

Sumo Logic

Microsoft

• Synchronous

No

open source

dzone.com/r/jsGR

TWITTER @splunkdev

SQL Server

Yes

No

TWITTER @sumologic

dzone.com/research/bigdata

|

dzone.com

|

No

FULL PROFILE LINK

dzone.com/r/Xx9z
website sumologic.com

research@dzone.com

|

(919) 678-0300

33

dzone.com/research/bigdata

Tableau — Tableau Software (Data Platform: Data Management, Analytics)
Tableau has a highly-rated user experience with intuitive visual data
exploration tools that make ordinary business users into data experts.
Hadoop support: Hadoop integrations available
DB integrations: Oracle, SQL Server, MySQL, PostgreSQL, Teradata
Integration support: None
Statistical languages: R
Built-in IDE: No IDE
Cloud hosting: SaaS, On-Premise
Stream processing: No
License: Proprietary
Twitter: @tableau
Website: tableausoftware.com
Full profile link: dzone.com/r/NNpH

Teradata Aster — Teradata (Database: Relational)
Teradata's strengths and maturity lie in the data warehouse market. Their
focus is also on Hadoop capabilities and multistructured formats.
Replication: Asynchronous, Synchronous
Languages/drivers supported: Java, C#
SQL support: Full ANSI SQL
Transactions supported: Arbitrary multi-statement transactions on a single node
Consistency model: Strong consistency
Auto-sharding: Yes
Indexing capabilities: Rich query language, Full Text Search
License: Proprietary
Twitter: @Teradata
Website: teradata.com
Full profile link: dzone.com/r/jGCR

Tibco Spotfire — Tibco Software (Data Platform: Data Management, Data Integration, Analytics)
Tibco's strength lies in data discovery with real-time and bidirectional
integration with business processes in an easy-to-use interface.
Hadoop support: Hadoop integrations available
DB integrations: Oracle, SQL Server, IBM DB2, SAP HANA, MySQL, HP Vertica
Statistical languages: R, S+
Built-in IDE: Tibco Spotfire S+
Cloud hosting: SaaS, On-Premise
License: Proprietary
Twitter: @TIBCO
Website: spotfire.tibco.com
Full profile link: dzone.com/r/99zr

Vertica — Hewlett-Packard (Data Platform: Data Management, Analytics)
Vertica thrives at petabyte-scale, offers complete SQL on Hadoop, and queries
have been found to run 50-1,000x faster with Vertica than on legacy solutions.
Hadoop support: Hadoop integrations available
Integration support: ETL, ELT
Statistical languages: R
Built-in IDE: No IDE
Cloud hosting: SaaS, PaaS, On-Premise
License: Proprietary
Twitter: @HPVertica
Website: vertica.com
Full profile link: dzone.com/r/taUJ

Glossary of Terms
  A  				
ACID (Atomicity, Consistency,
Isolation, Durability): A term that refers
to the model properties of database transactions,
traditionally used for SQL databases.
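
To see these properties in practice, here is a minimal JDBC sketch (the connection URL, credentials, and accounts table are illustrative placeholders, not tied to any product in this guide): the two updates either commit together or are rolled back together.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferExample {
    public static void main(String[] args) throws SQLException {
        // Hypothetical JDBC URL and schema; any ACID-compliant relational database behaves the same way.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/bank", "app", "secret")) {
            conn.setAutoCommit(false); // group the two updates into one atomic transaction
            try (PreparedStatement debit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setBigDecimal(1, new java.math.BigDecimal("100.00"));
                debit.setLong(2, 1);
                debit.executeUpdate();
                credit.setBigDecimal(1, new java.math.BigDecimal("100.00"));
                credit.setLong(2, 2);
                credit.executeUpdate();
                conn.commit();   // both updates become durable together...
            } catch (SQLException e) {
                conn.rollback(); // ...or neither is applied
                throw e;
            }
        }
    }
}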

  B  				
BASE (Basic Availability, Soft State,
Eventual Consistency): A term that refers
to the model properties of database transactions,
specifically for NoSQL databases needing to
manage unstructured data.
Batch Processing: The execution of a series
of programs (jobs) that process sets of records as
complete units (batches). This method is commonly
used for processing large sets of data offline for fast
analysis later.
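
As a toy illustration of the idea, the following self-contained Java sketch (the record source and batch size are invented for the example) walks a set of records in fixed-size chunks, the way a batch job would process one unit of work at a time.

import java.util.ArrayList;
import java.util.List;

public class BatchJob {
    // Process records in fixed-size chunks so each batch can be loaded,
    // handled, and released before the next one is read.
    static void runInBatches(List<String> records, int batchSize) {
        for (int start = 0; start < records.size(); start += batchSize) {
            List<String> batch = records.subList(start, Math.min(start + batchSize, records.size()));
            // stand-in for real work: parse, aggregate, or write the batch to storage
            System.out.println("Processing batch of " + batch.size() + " records");
        }
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 2_500; i++) records.add("record-" + i);
        runInBatches(records, 1_000); // three batches: 1000, 1000, 500
    }
}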
Big Data: The entire process of collecting,
managing, and analyzing datasets too massive
to be handled efficiently by traditional database
tools and methods; the industry challenge posed
by the management of massive structured and
unstructured datasets.
Business Intelligence (BI): The use of tools
and systems for the identification and analysis of
business data to provide historical and predictive
insights.

  C  				
Column Store: A high-availability database
deployed on multiple datacenters that is primarily
used for logging continuous streams of data with
few consistency guarantees.
Complex Event Processing: An
organizational process for collecting data from
multiple streams for the purpose of analysis and
planning.
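
The sketch below is only a toy, in-process illustration of collecting events from multiple streams (the two streams and the one-minute windows are invented for the example); dedicated CEP engines add pattern matching, correlation rules, and continuous queries on top of this kind of windowed aggregation.

import java.util.Map;
import java.util.TreeMap;
import java.util.stream.LongStream;

public class EventWindowCount {
    public static void main(String[] args) {
        // Hypothetical event timestamps (in seconds) from two separate streams, e.g. orders and payments.
        long[] orderEvents = {0, 61, 62, 130};
        long[] paymentEvents = {5, 65, 125};

        // Merge both streams and count events per one-minute window.
        Map<Long, Long> perWindow = new TreeMap<>();
        LongStream.concat(LongStream.of(orderEvents), LongStream.of(paymentEvents))
                  .forEach(t -> perWindow.merge(t / 60, 1L, Long::sum));

        perWindow.forEach((window, count) ->
            System.out.println("minute " + window + ": " + count + " events"));
    }
}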

  D  				
Data Analytics: The process of harvesting,
managing, and analyzing large sets of data to
identify patterns and insights.
Data Management: The complete lifecycle
of how an organization handles storing, processing,
and analyzing datasets.
Data Mining: The process of discovering patterns in large
sets of data and transforming that information into
an understandable format.
Data Munging: The process of converting and mapping raw
data into other formats using automated tools
to create visualizations, aggregations, and models.
Data Science: The field of study broadly
related to the collection, management, and
analysis of raw data by various means of tools,
methods, and technologies.

Data Warehouse: A collection of
accumulated data from multiple streams within a
business, aggregated for the purpose of business
management.

Database Management System (DBMS):
A suite of software and tools that manages data
between the end user and the database.

Document Store: A type of database that
aggregates data from documents rather than
defined tables and is used to present document
data in a searchable form.

  E  
Extract Load Transform (ELT): The
process of preparing integrated data in a database
to be used by downstream users.

Extract Transform Load (ETL): The
process of extracting, transforming, and loading
data during the data storage process; often used to
integrate data from multiple sources.

Event-Stream Processing (ESP): An
organizational process for handling data that
includes event visualization, event processing
languages, and event-driven middleware.

Eventual Consistency: The idea that
databases conforming to the BASE model will
contain data that becomes consistent over time.

  F  
Fault Tolerance: A system’s ability to
respond to hardware or software failure without
disrupting other systems.

  G  
Graph Store: A type of database used for
handling entities that have a large number of
relationships, such as social graphs, tag systems, or
any link-rich domain; it is also often used for routing
and location services.

  H  
Hadoop: An Apache Software Foundation
framework developed specifically for high-scalability,
data-intensive, distributed computing.

Hadoop Distributed File System
(HDFS): A distributed file system created by
Apache Hadoop to utilize the data throughput and
access from the MapReduce algorithm.

  K  
Key-Value Store: A type of database that
stores data in simple key-value pairs. They are
used for handling lots of small, continuous, and
potentially volatile reads and writes.

  M  
Machine Learning: An area of study in
artificial intelligence (AI) that attempts to mimic
human intelligence by enabling computers to
interpret situations through observation and analysis.

MapReduce: A programming model created
by Google for high scalability and distribution on
multiple clusters for the purpose of data processing.

Massively Parallel Processing (MPP):
The strategy of pairing independent database
processors together with a messaging interface to
create cooperative clusters of processors.

Message Passing Interface (MPI): A
standardized messaging interface created to govern
parallel computing systems.

  N  
NewSQL: A shorthand descriptor for relational
database systems that provide horizontal scalability
and performance on par with NoSQL systems.

NoSQL: A class of database systems that
incorporate other means of querying outside of
traditional SQL and do not follow standard relational
database rules.

  O  
Online Analytical Processing (OLAP):
A concept that refers to tools which aid in the
processing of complex queries, often for the
purpose of data mining.

Online Transaction Processing
(OLTP): A type of system that supports the
efficient processing of large numbers of database
transactions, used heavily for business client services.

  R  
Relational Database: A database that
structures interrelated datasets in tables, records,
and columns.

  S  
Strong Consistency: A database concept
that refers to the inability to commit transactions
that violate a database’s rules for data validity.

Structured Query Language (SQL): A
programming language designed for managing
and manipulating data; used primarily in relational
databases.

System-on-a-Chip (SOC): An integrated chip
that is comprised of electronic circuits of multiple
computer components to create a complete device.
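
To make the MapReduce entry above concrete, here is a minimal, single-process word count written with Java 8 streams (the sample documents are invented): the flatMap call plays the role of the map phase, and the grouping collector stands in for the shuffle and reduce phases. Frameworks such as Hadoop execute this same model distributed across a cluster.

import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCount {
    public static void main(String[] args) {
        String[] documents = {
            "big data is more than big databases",
            "data pipelines move big data"
        };

        // "Map" phase: split each document into individual words;
        // "Reduce" phase: group equal words and sum their counts.
        Map<String, Long> counts = Arrays.stream(documents)
            .flatMap(doc -> Arrays.stream(doc.split("\\s+")))
            .collect(Collectors.groupingBy(word -> word, Collectors.counting()));

        counts.forEach((word, count) -> System.out.println(word + " -> " + count));
    }
}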


smart content for tech professionals

dzone.com

now hiring
JAVA DEVELOPERS,
WEB DESIGNERS, FRONT-END/UI BUILDERS, AND OTHER SMART PEOPLE

DZone was recently named to the Inc. 5000 as one of the
fastest growing companies in the US, and we are looking for
talented people to help us continue our growth. With you on
our team, hopefully we can move up to the Inc. 500 next year!

required
SKILLS
JAVA DEVELOPERS: Excellent working
knowledge of Java and Java Web
Architectures. Skilled in back-end technologies
like Spring, Hibernate, Lucene and SQL, as
well as standard web technologies.
WEB DESIGNERS: “Live and breathe” the Web
with superior creative and innovative
problem-solving skills. Knowledge of key web
technologies like HTML, CSS, Bootstrap, as
well as Adobe Creative Suite.
FRONT-END/UI BUILDER: Passion for simple and
beautiful designs to help the look and feel of
our web and mobile products. Knowledge of
standard web technologies like HTML, CSS,
Bootstrap, LESS, and JavaScript, as well as
Adobe Creative Suite.

why work
AT DZONE?
Working at DZone sets you up with a meaningful
career and the ability to see your hard work
make a difference on a global scale.
• Work with other smart, ambitious people who
are passionate about technology
• Ability to work in our Cary, NC headquarters
or telecommute from anywhere around the
world
• Flexible and fun startup environment
• Opportunity for personal growth and learning
experiences through a variety of projects (you
won’t be doing the same thing every day)
• Fantastic benefits package including your
choice of several comprehensive medical,
life, and disability plans

about

DZONE

DZone makes online content and resources
for developers, tech professionals, and
smart people everywhere.
Our website, DZone.com, is visited by
millions of tech pros from all over the
world every month, and our free resources
have been downloaded millions of times.
AnswerHub, our software platform for
building online communities, is used by
some of the most recognizable companies
in the world including LinkedIn, eBay, Epic
Games and Microsoft.

• Awesome perks like catered weekly lunch,
XBox games, snacks, and beer on tap

Check out dzone.com/jobs to learn more.


