Participant Guide

Big Data Basics
Version 6.3 edition 2

Copyright 2017 Talend Inc. All rights reserved.
Information in this document is subject to change without notice. The software described in this document is furnished under a license
agreement or nondisclosure agreement. The software may be used or copied only in accordance with the terms of those agreements. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic
or mechanical, including photocopying and recording, for any purpose other than the purchaser's personal use, without the written
permission of Talend Inc.
Talend Inc.
800 Bridge Parkway, Suite 200
Redwood City, CA 94065
United States
+1 (650) 539 3200

Welcome to Talend Training

Congratulations on choosing a Talend training course.

Working through the course
You will develop your skills by working through use cases and practice exercises using live software. Completing the exercises is critical to learning!
If you are following a self-paced, on-demand training (ODT) module, and you need an answer to proceed with a particular exercise,
use the help suggestions on your image desktop. If you can’t access your image, contact customercare@talend.com.

Exploring
You will be working in actual Talend software, not a simulation. We hope you have fun and get lots of practice using the software!
However, if you work on tasks beyond the scope of the training, you could run out of time with the environment, or you could mess up
data or Jobs needed for subsequent exercises. We suggest finishing the course first, and if you have remaining time, explore as you
wish. Keep in mind that our technical support team can’t assist with your exploring beyond the course materials.

For more information
Talend product documentation (help.talend.com)
Talend Community (community.talend.com)

Sharing
This course is provided for your personal use under an agreement with Talend. You may not take screenshots or redistribute the content or software.


CONTENTS

LESSON 1 Big Data in Context
Concepts

LESSON 2 Basic Concepts
Concepts
Basic Concepts
Opening a Project
Monitoring the Hadoop Cluster
Creating Cluster Metadata
Review

LESSON 3 Reading and Writing Data in HDFS
Concepts
Reading and Writing Data in HDFS
Storing a File on HDFS
Storing Multiple files on HDFS
Reading Data from HDFS
Storing Sparse Dataset with HBase
Challenges
Solutions
Review

LESSON 4 Working with Tables
Concepts
Working With Tables
Importing Tables with Sqoop
Creating Tables with Hive
Review

LESSON 5 Processing Data and Tables in HDFS
Concepts
Processing Data and Tables in HDFS
Processing Hive Tables with Jobs
Profiling Hive Tables - Optional
Processing Data with Pig
Processing Data with Big Data Batch Job
Review

LESSON 1

Big Data in Context
This chapter discusses:

Concepts


Concepts

What is Big Data? How can we define it?
One simple definition is provided by the three Vs.
The first V corresponds to volume. Nowadays, the
amount of data that’s stored and processed is huge.
This is a primary characteristic of Big Data: How big it
is. The second V stands for variety. You can collect different types of data from different sources. This
presents a challenge: How do you deal with these different types of data? The last V stands for velocity.
Data arrives from different sources at different
speeds. For example, social media, such as Twitter or
Facebook, generates data at increasing speeds. Big
Data comes with many challenges that Hadoop tries to
solve.
Today we’re generating more data than ever: from financial transactions, sensor networks, social media,
and server logs, just to name a few. We’re generating
data faster than ever because of automation and
user-generated content. This data has many valuable
applications for marketing analysis, product recommendations, and even fraud detection. And these are
only a few examples. To extract this valuable information, the data must be processed.
Fortunately, while storage capacity has increased, the
cost of storage has decreased. Disk performance has
also improved in recent years so from a business perspective, it’s more valuable to store and process the
data than throw it away. Unfortunately, transfer rates
have not improved as fast as storage capacity. Even if
we can process data more quickly, accessing it is slow.
Technically speaking, there are two main problems to
solve: Large-scale data storage, and large-scale data
analysis.


Hadoop offers a solution to these problems by helping
you store and process your data. HDFS is the Hadoop
Distributed File System, where you can store your
data.
MapReduce will help you analyze your data.
Hadoop runs on a cluster.

A cluster is made of different elements. The most basic
element is a node. A node is simple, cheap dedicated
hardware, composed of CPU, RAM, and disk. Multiple
nodes form a rack. And multiple racks form a cluster.
This cluster is used to store and process your data.

HDFS is the Hadoop distributed file system. Imagine
you have a large file to store on HDFS. First it’ll be split
into small pieces of 64 or 128 megabytes. This size is
configurable. Then each piece will be distributed
across the cluster. This enables faster processing
because multiple nodes can operate on your large file
simultaneously.

Your file blocks are replicated three times by default
and distributed in different data nodes. Then the
address of the different blocks is stored in a special
node called a name node. If one node fails, it’s still possible to find a copy on another node.
Another benefit of duplicating and distributing file
blocks is that you increase the data locality. That
means you’ll have a better chance to find your blocks
quickly, which will improve your computational performance.
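For reference, the block size and replication factor described above can be seen directly in the standard Hadoop Java client. The following is only a minimal sketch (not part of the course Jobs), writing a small file with an explicit 128 MB block size and a replication factor of 3; the hadoopcluster host name and the target path are assumptions for illustration.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Namenode address of the training cluster (assumption for this sketch)
        conf.set("fs.defaultFS", "hdfs://hadoopcluster:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path target = new Path("/user/student/demo/sample.txt");
            // 128 MB blocks, replicated 3 times (the defaults discussed above)
            long blockSize = 128L * 1024 * 1024;
            short replication = 3;
            try (FSDataOutputStream out =
                     fs.create(target, true, 4096, replication, blockSize)) {
                out.write("hello HDFS".getBytes(StandardCharsets.UTF_8));
            }
            // A large file would be split into several such blocks,
            // each reported with the data nodes hosting its replicas
            System.out.println("Replication factor: "
                    + fs.getFileStatus(target).getReplication());
        }
    }
}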


Usually, high availability is configured with a standby
name node. This standby name node is synchronized
with the main name node while it’s available. If the
main name node fails, the standby takes over. The key
concept here is the name node. In order to work on
your cluster, many Big Data components within
Talend Studio require your name node address.

To illustrate the use of the MapReduce framework we
take a basic word count use case.
Given some text files stored on HDFS, we want to
count the number of times that certain words are mentioned. For example, suppose we want to know how
many times the words Big Data, Talend, and MapReduce appear in the text files.

The MapReduce framework processes your data in
two stages. Stage one, map, and stage two, reduce.
First, the mapper processes each file block independently. The result is given as key-value pairs.
As the file blocks are distributed, the computation is
done in parallel by nodes, each hosting a file block.
This allows you to get your results faster than if you
were doing this computation in a single node.

A reduce node is designated. Then the results are
sent to the reducer.
The results are shuffled and sorted before reduction.
And finally, the reducer aggregates the result.
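To make the two stages concrete, here is a minimal word count sketch written with the classic Hadoop MapReduce Java API. It is illustrative only: the course itself uses Talend components rather than hand-written MapReduce, and the input and output paths are placeholders.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: each mapper reads one file block and emits (word, 1) pairs
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce stage: values for the same word are shuffled and sorted
    // to the reducer, which sums them
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/student/wordcount/in"));
        FileOutputFormat.setOutputPath(job, new Path("/user/student/wordcount/out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}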


The JobTracker splits our job into tasks, and dispatches them to a TaskTracker running on each node
of your cluster. Each task can be either a map or a
reduce task.
Each TaskTracker gives updates on its task to the
JobTracker, which will send information back to the client on the overall job progress.
The JobTracker is one of Talend Studio’s entry points
into your cluster. Just as for the name node, many components require that you provide the JobTracker
address.

To summarize, MapReduce version 1 handles processing and resource management and uses a static
master-slave model.
This solves some Big Data issues, but the resource
management is not optimal.
MapReduce version 1 has some drawbacks. It has an
inflexible slot-based memory management model.
Each TaskTracker is configured at startup to have a
certain number of slots.
And a task, map or reduce, is executed in a single slot.
Each slot is configured with a maximum memory at
startup of the cluster.
Due to the limitations of MapReduce v1, a new framework has been created, called YARN. YARN is
designed to manage only resources. It facilitates the
development of distributed applications of any kind
and is not limited to MapReduce. For example, Spark
can run on top of YARN. Spark is covered in the Big
Data Advanced course.
YARN provides daemons just as MapReduce does.
The JobTracker daemon was introduced with MapReduce. In YARN, the JobTracker is split into two daemons. The first one is the ResourceManager, which
administers resources on the cluster. The second daemon is the Application Master. It manages applications
such as MapReduce applications.


Hadoop is an open-source project from Apache. It’s
not actually a single product but instead is a collection
of several components. Some of them are illustrated
here. The first one is HDFS which stores your data
files. An alternative data storage method is HBase, a
NoSQL database optimized for sparse data. Sparse
data sets have a lot of NULL values. On top of HDFS
sits YARN. To process your data using YARN, you can
use MapReduce v2. Alternatively, you can write your
own Java code, or use other options.

With Hive, you can create tables on your cluster and
then process them using a SQL-like query language.
You can also use Pig, which is a scripting language. It
is a higher-level language than Java, which makes it
much easier to write your processing. Your Pig code
will be automatically translated into mappers and reducers and then run on your cluster. Another option is to
use Mahout, which is a machine learning library.

Sqoop is a tool designed for efficiently transferring bulk
data between Apache Hadoop and structured data
stores, such as relational databases. Oozie is an application used to schedule Hadoop jobs. It combines multiple jobs sequentially into one logical unit of work.
Oozie supports Hadoop jobs for MapReduce, Pig,
Hive, and Sqoop. Zookeeper is a centralized service
for maintaining configuration information, naming,
providing distributed synchronization, and providing
group services. All these kinds of services are used by
distributed applications. Hue is a set of web applications that allow you to interact with your cluster. Hue
applications help you browse your files stored on
HDFS, track your MapReduce Jobs, work with Hive,
or use Oozie workflows.
Those separate projects can be installed independently, or you can use a packaged version of
Hadoop. Several companies offer a prepackaged version of Hadoop, for example Hortonworks, Cloudera,
and Amazon.
Talend Studio for Big Data supports all of these distributions and much more.


LESSON 2

Basic Concepts
This chapter discusses:

Concepts
Basic Concepts
Opening a Project
Monitoring the Hadoop Cluster
Creating Cluster Metadata
Review

Concepts


Basic Concepts
Overview
During this course, you will be assigned a preconfigured Hadoop cluster. The Hadoop cluster was built with a Cloudera
CDH 5.8 distribution. The purpose of this exercise is to try different functions, not to have a production cluster. So, this
training cluster is in pseudo-distributed mode. That means that there is only one node. This is enough to understand the
different concepts in this training.
Before starting to create Jobs to read and write data to and from HDFS, there are some prerequisites. First, you will open
a new project in the Talend Studio. Then, you will connect to your cluster to monitor it, using Cloudera Manager and Hue.
For your Jobs to succeed, the cluster must be up and running. So, if a service fails you need to know how to restart it.
Finally, you will create your cluster Metadata. This step will avoid repetitive configuration of components in the Jobs you
will create.

Objectives
After completing this lesson, you will be able to:
Open a new project in Talend Studio
Connect to the Cloudera Manager
Connect to Hue
Create Cluster Metadata manually, using configuration files, and automatically

Before you begin
Be sure that you are working in an environment that contains the following:
A properly installed copy of Talend Studio
A properly configured Hadoop cluster
The supporting files for this lesson
Everything has already been set up in your training environment.
The first step is to open your training project in the Talend Studio.


Opening a Project
Task outline
Before developing Talend Jobs, you will need a project to store them in. In this exercise, you will open a pre-existing project for your Big Data Jobs.

Accessing the training environment
For this course, two virtual machines have been set up for you. The first machine is a Windows machine where the Talend Studio is
installed. This is where you will create your Job.
The second machine is a Linux machine hosting a Cloudera CDH5.8 cluster. You do not need to access this machine. In fact, the
cluster is monitored from the Windows machine through a Web Browser.
To connect to the Windows machine, in your Skytap environment, run the machine named TalendStudio.
To start your training cluster, run the Linux machine.

Run
1. START TALEND STUDIO
Double-click the Talend Studio shortcut on your desktop to run Talend Studio.

2. OPEN YOUR TRAINING PROJECT
Using the Local connection, open the BDBasics project.


a. In the Connection list, make sure that the Local (Local) connection is selected.
b. Click Select an existing project, then click BDBasics in the Project list.

The project has already been created for you.
Note: You may have a different version of the Studio in your training environment. However, you will have the same
functions as in the Talend Real-time Big Data Platform.
c. Click Finish to open the project.

Talend Forge
1. LOG IN
When the Connect to TalendForge window appears, log in with your existing Talend account, or create a new one.


2. When the initialization is complete, Talend Studio displays this Welcome page.


Start
Click Start now!. The Talend Studio main window appears, ready for you to create Jobs:

The next step is to connect to your Hadoop cluster.


Monitoring the Hadoop Cluster
Task outline
In order to develop Jobs using HDFS, HBase, and Hive to store your data, you need a running Hadoop cluster.
For this training, a Hadoop cluster has been installed using a Cloudera CDH 5.8 distribution. It is running the core Hadoop
functions, such as HDFS, YARN, HBase, Hive or Sqoop. It also runs Hue, which will help you to browse your files and
tables.

Connect to Cloudera Manager
1. CONNECT TO THE Cloudera Manager
Click the Cloudera Manager shortcut in your web browser.
Or navigate to the URL http://hadoopcluster:7180.
If you can't access the web page, wait a couple of minutes so that the service starts, and try again.
2. LOGIN
Enter your credentials and log in.
a. In the Username box, enter admin then, in the Password box, enter admin.
b. Click Login.

Check services' health
The Cloudera Manager is a web interface to monitor and perform administration tasks on a cluster. It helps to check services' health
and to restart services individually if needed.


If a service is red flagged, it means it is in bad health. You can restart it individually by clicking the black arrow on the right side of the
service then clicking Restart.
If you see question marks on a white background instead of green or red lights, restart the Cloudera Management services.
This is done by clicking the right arrow next to Cloudera Management services and then clicking Restart. This should fix your issue
and you should be able to monitor your cluster as shown above. Otherwise, refer to the troubleshooting guide.

Connect to Hue
Hue is a web interface that helps you check what is happening on your cluster. You can browse your files and tables. It is also possible to
track Map Reduce tasks.
1. CONNECT TO Hue
Navigate to the Hue web interface.
a. Open a new tab in your web browser.
b. Click the Hue shortcut or navigate to http://hadoopcluster:8888.
2. LOGIN
Enter your credentials and log in.
a. In the Username box, enter student. In the Password box, enter training.
b. Click Sign in.
You are now logged in to Hue.

The next step is to create Metadata on your Hadoop Cluster.


Creating Cluster Metadata
Task outline
To be able to run Big Data Jobs, Talend Studio needs to be connected to a running Hadoop Cluster. The connection
information can be configured in each component individually, or the configuration can be stored as metadata in the
Repository and be reused as needed in the different components.
Instead of individual connection configuration of components and Jobs, you will use the second approach.
You will now create your cluster metadata using three different methods. First, you will configure the connection to the
cluster manually, next, from the Hadoop configuration files, and last, using a Wizard.

Manual configuration
The first way to create a Hadoop cluster Metadata is to create it manually. This requires that you already have information about your
cluster, such as the Namenode URI, and either the Resource Manager address or the JobTracker URI, depending on whether you will use
YARN or Map Reduce v1. You may also need other information such as the Job History server, or the Resource Manager Scheduler
location.
1. CREATE A NEW HADOOP CLUSTER METADATA
Create a new Hadoop cluster metadata named TrainingCluster_manual.
a. In the Studio, in the Repository, under Metadata, locate Hadoop Cluster.
b. Right-click Hadoop Cluster then, click Create Hadoop Cluster:


c. In the Name box, enter TrainingCluster_manual, then click Next.
The Hadoop Configuration Import Wizard opens:

2. SELECT THE DISTRIBUTION AND THE VERSION
In the Distribution list, select Cloudera, and in the Version list, select Cloudera CDH5.8(YARN mode).
3. SELECT THE MANUAL CONFIGURATION
Select Enter manually Hadoop services, then, click Finish.


The Hadoop Cluster Connection window opens.

4. OBSERVE THE DEFAULT CONFIGURATION
Check that the Distribution information is correct.
There are a few values preconfigured, such as the Namenode URI and the Resource Manager address.
The localhost value and the 0.0.0.0 value must be changed to the IP address or to the DNS name of your cluster. If the
cluster was configured keeping the default port values, then 8020 and 8032 are the ports for the Namenode and the
Resource Manager, respectively.
The Hadoop cluster has already been configured for you with the name hadoopcluster.
5. CONFIGURE THE CONNECTION
Replace the localhost and 0.0.0.0 values by the hostname of the cluster.
a. Configure the connection as follows:
Namenode URI: hdfs://hadoopcluster:8020
Resource Manager: hadoopcluster:8032
Resource Manager Scheduler: hadoopcluster:8030
Job History: hadoopcluster:10020
Staging directory: /user
User name: student


b. Check your configuration.

c. Click Finish. Your cluster metadata appears under Repository/Metadata/Hadoop Cluster.

Discovering Hadoop configuration files
You can also create your Metadata using the Hadoop configuration files. The configuration files have been copied in the
HadoopConf folder under C:\StudentFiles\BDBasics.
The Hadoop configuration files are .xml files that describe each parameter value of your Hadoop cluster. In the HadoopConf folder,
you will find four files: core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml. The files were copied from a Hadoop cluster,
installed with the same distribution and version as your training cluster.
1. EXAMINE core-site.xml
Open core-site.xml with Notepad++:


The first property that appears in this file is the location of the Namenode:
hdfs://hadoopcluster:8020.
2. EXAMINE yarn-site.xml
Open yarn-site.xml with Notepad++:

In this file, you will find the Resource Manager address and the Resource Manager Scheduler address.
3. EXAMINE mapred-site.xml
Open mapred-site.xml with Notepad++:

In this file, you will find the Job History address.
You need all this information to create your Cluster Metadata. If you choose to create the cluster Metadata using the configuration
files, the files will be parsed to find all these values.
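If you are curious, you can reproduce what the wizard does with a few lines of standard Hadoop client code: add the XML files as resources to a Configuration object and read back the properties. This is only an illustrative sketch; the folder path is the one used in this training, and the property names are the usual Hadoop ones that these files contain.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ReadHadoopConf {
    public static void main(String[] args) {
        Configuration conf = new Configuration(false); // do not load the built-in defaults
        String dir = "C:/StudentFiles/BDBasics/HadoopConf/";
        conf.addResource(new Path(dir + "core-site.xml"));
        conf.addResource(new Path(dir + "hdfs-site.xml"));
        conf.addResource(new Path(dir + "yarn-site.xml"));
        conf.addResource(new Path(dir + "mapred-site.xml"));

        // The same values you enter manually, or let the wizard parse
        System.out.println("Namenode URI      : " + conf.get("fs.defaultFS"));
        System.out.println("Resource Manager  : " + conf.get("yarn.resourcemanager.address"));
        System.out.println("RM Scheduler      : " + conf.get("yarn.resourcemanager.scheduler.address"));
        System.out.println("Job History server: " + conf.get("mapreduce.jobhistory.address"));
    }
}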

Configuration with the Hadoop configuration files
1. CREATE A NEW HADOOP CLUSTER METADATA
Create a new Hadoop cluster metadata and name it TrainingCluster_files.


a. Right-click Hadoop Cluster, then click Create Hadoop Cluster.
b. In the Name box, enter TrainingCluster_files then, click Next.
2. SELECT THE DISTRIBUTION AND THE VERSION
In the Distribution list, select Cloudera and in the Version list, select Cloudera CDH5.8(YARN mode).
3. SELECT A CONFIGURATION FROM FILES
Select Import configuration from local files, then click Next.
4. LOCATE THE CONFIGURATION FILES
Click Browse... then locate the configuration files under
C:\StudentFiles\BDBasics\HadoopConf. Click OK.

5. FINALIZE THE CONFIGURATION
Set the user name to student and finalize the metadata creation.
a. Click Finish.
b. The configuration is almost done, except for the user name.
In the User name box, enter student, then click Finish.
Your cluster's metadata appears under Repository/Metadata/Hadoop Cluster.

Automatic configuration
The last way to configure your metadata is to connect to the Cloudera Manager, so that all of the connection information will be
retrieved automatically to create the cluster's metadata.
1. CREATE A NEW HADOOP CLUSTER METADATA
Create a new Hadoop cluster metadata and name it TrainingCluster
a. Right-click Hadoop Cluster, then click Create Hadoop Cluster.
b. In the Name box, enter TrainingCluster, then, click Next.
2. SELECT THE DISTRIBUTION AND THE VERSION
In the Distribution list, select Cloudera and in the Version list, select Cloudera CDH5.8(YARN mode).
3. SELECT AN AUTOMATIC CONFIGURATION
Select Retrieve configuration from Ambari or Cloudera then, click Next.


4. SET THE Cloudera Manager URI
In the Manager URI (with port) box, enter
http://hadoopcluster:7180.
5. ENTER CREDENTIALS
In the Username and Password boxes, enter admin.
6. CONNECT TO THE Cloudera Manager
Click Connect. This will list all the clusters administered by the Cloudera Manager:

7. FETCH SERVICES
Click Fetch. The wizard will retrieve the configuration files of all running services in your cluster:


For each service listed, the wizard can create the corresponding metadata in the Repository.
For this lab, you will create metadata only for YARN. You will investigate later how to create metadata for HDFS, Hive and
HBase.
8. CREATE METADATA ONLY FOR YARN
You will create metadata only for the YARN service.
a. Click Deselect All then, select YARN.
b. Click Finish.


9. SET THE USERNAME
Enter student in the User name box.

10. CHECK SERVICES
Click Check Services to check whether you succeeded in connecting to the Namenode and the Resource Manager.


If the progress bars go up to 100%, and you have no error message, then your connection was successful.
11. FINALIZE THE HADOOP CLUSTER METADATA CREATION
Now, you have to complete the last steps to create the metadata in the Repository.
a. Click Close.
b. Click Finish to finalize the metadata creation. It will appear in the Repository under Metadata/Hadoop Cluster.
You have now created the same metadata three times using different techniques. For the next lessons, the last metadata, named
TrainingCluster, will be used.

Next step
You have almost finished this section. Time for a quick review.


Review
Recap
In this lesson, you covered the key base knowledge required to be efficient in building and running Big Data Jobs.
First, you opened your project, then, you connected to Cloudera Manager and Hue. Cloudera Manager will help you to restart a service if needed. Hue will help you to browse your files and tables, and track the execution of your Map Reduce Jobs.
You also learned how to create Metadata on your cluster. You created the same Metadata three times using different techniques, so
you can use any of them for your Jobs.
For the next lessons, the TrainingCluster Metadata will be used. Feel free to try any of the three Metadata. The results will be the
same. If not, it means that one of them is not configured correctly.


LESSON 3

Reading and Writing Data in HDFS
This chapter discusses:

Concepts
Reading and Writing Data in HDFS
Storing a File on HDFS
Storing Multiple files on HDFS
Reading Data from HDFS
Storing Sparse Dataset with HBase
Challenges
Solutions
Review

Concepts


Reading and Writing Data in HDFS
Overview
Hadoop's file system—HDFS—scales to hold petabytes of data. In this lesson you will use Talend Big Data components
to read and write data to HDFS.
First, you will read a text file that stores customer information from your local system and write it to HDFS. Then, you will
read a collection of customer Twitter messages, stored in separate files, and write them individually to HDFS. Next, you
will develop a Job to read a designated subset of the Twitter files back from HDFS.
Finally, you will simulate a sparse data set and write it to HDFS using dedicated HBase components.

Objectives
After completing this lesson, you will be able to:
Write text files to HDFS
Read text files from HDFS
Use HBase to store sparse data on HDFS
Configure connections to HDFS and HBase

Before you begin
Be sure that you are in a working environment that contains the following:
A properly installed copy of the Talend Studio
A properly configured Hadoop cluster
The supporting files for this lesson
The first step is to create a new Job to read a local text file and write it to HDFS.


Storing a File on HDFS
Task outline
Your first step is to store text data on HDFS. In this lab, you are saving a file containing customer information such as first
name, last name, city, state, etc...
As you did previously for your cluster connection, you can create metadata for the connection to HDFS. Once connected to HDFS, you will be able to read and write files.

Configure HDFS connection
Now you will create metadata for an HDFS connection.
1. CREATE HDFS CONNECTION METADATA
Right-click TrainingCluster under Repository>Metadata>Hadoop Cluster, then click Create HDFS:

2. NAME THE METADATA
In the Name box, enter HDFSConnection , then click Next:


A default configuration is proposed, but you can adjust it to your needs.
For this training, you will keep the default values. This means that your username to connect to HDFS will be student, "\n" will
be the row separator, and ";" will be the field separator.
3. CHECK THE CONNECTION
Click Check. You should quickly be connected to your cluster, and you will have the following success message:

4. FINALIZE THE METADATA CREATION
Click OK , then click Finish to create your HDFS connection Metadata. It will appear in the repository, below the TrainingCluster metadata:


Create a Job to write data to HDFS
In the C:\StudentFiles\BDBasics folder, you will find the CustomersData.csv file. You will create a Job that will read this file and write it
to HDFS.
1. CREATE A NEW STANDARD JOB
Create a Standard Job named PutCustomersData.
a. In the Repository, right-click Job Designs , then click Create Standard Job:

b. Name your Job PutCustomersData.
2. ADD A tHDFSPut COMPONENT
Place a tHDFSPut component on the design workspace. You use this component to move files from a local file system to
HDFS.
3. CONFIGURE THE COMPONENT TO USE THE HDFSConnection METADATA
Configure the tHDFSPut component to use the HDFSConnection metadata created previously.
a. Double-click to open the Component view.
b. In the Property Type list, select Repository.
c. Click (...) , then locate the HDFSConnection metadata that you previously created.
4. CONFIGURE THE LOCAL FOLDER
To configure the Local directory, click (...) , then navigate to "C:\StudentFiles\BDBasics".
5. CONFIGURE THE TARGET FOLDER
Configure tHDFSPut to overwrite the /user/student/BDBasics/Customers folder.
a. In the HDFS directory box, enter /user/student/BDBasics/Customers.
b. In the Overwrite file list, select always. This tells the component to always replace existing files when a new file of
the same name is being saved.
6. READ THE CUSTOMERS DATA FILE
Configure the Files table to read the CustomersData.csv file.
a. Click the green plus sign below the Files table.
b. In the Filemask column, enter "CustomersData.csv".
c. Verify your configuration.
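Behind the scenes, tHDFSPut performs roughly the equivalent of the HDFS client's copyFromLocalFile operation. The sketch below is only an illustration of that idea, reusing the connection details of the training cluster; it is not the code Talend generates.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutCustomersDataSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hadoopcluster:8020");

        // Connect as the "student" user, as configured in HDFSConnection
        try (FileSystem fs = FileSystem.get(
                new java.net.URI("hdfs://hadoopcluster:8020"), conf, "student")) {
            Path src = new Path("C:/StudentFiles/BDBasics/CustomersData.csv");
            Path dst = new Path("/user/student/BDBasics/Customers/CustomersData.csv");
            fs.mkdirs(dst.getParent());
            // delSrc = false, overwrite = true (the "always" option of the component)
            fs.copyFromLocalFile(false, true, src, dst);
        }
    }
}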


Next, you will run your Job and check the result in Hue.

Run the Job and verify the results
1. RUN THE JOB AND VERIFY THE RESULTS IN THE CONSOLE
Run your Job and follow the execution in the Console. At the end of execution, you should have an exit code equal to 0 and
several other messages that prove that the Job successfully executed:


Note: This is not the default Console output for a Standard Job. The BDBasics project has been configured with a Log4j
level set to INFO. This provides more details about the execution of your Job. For example, you will see the names of the
files copied and the number of files uploaded to HDFS.
2. CONNECT TO HUE AND VERIFY THE RESULTS
Check your results using the File Browser of Hue.
a. In your web browser, the page with Hue should be already opened. Otherwise, navigate to http://hadoopcluster:8888 and login with student/training.
b. Click File Browser:


c. Navigate to /user/student/BDBasics/Customers/CustomersData.csv:

This shows you the content of the CustomersData.csv file. The file was created by your Job in the new directory BDBasics/Customers that you specified.

Examining the Console output - Winutils.exe error
Take a few minutes to read the output in the Console. At the end of the execution of the PutCustomersData Job, you should have the
following text in the Console:


In the Console, you will get logs created by your cluster. By default, only logs with WARN or ERROR status are displayed.
You may have noticed that there is an error displayed in the Console:

You will see an error message regarding the winutils.exe binary not being available in the Hadoop binaries. In the current Job, this
error won't prevent the Job from succeeding, but in some cases it could.
You can fix this issue by downloading the winutils.exe file and then saving it in a tmp folder. This has been done for you in your training
environment.
1. FIX THE ERROR
Using the Advanced settings tab, set the Hadoop home directory to the location of the winutils.exe file.
a. In the Run view, click Advanced settings.
b. Select the Use specific JVM arguments option.
c. Click New....


d. In the Argument box, enter:
-Dhadoop.home.dir=C:/tmp/winutils

e. Click OK to save the new argument and run your Job again.
The Job should run successfully without error messages:

For the next Jobs built in this training, you can add a JVM argument to set the Hadoop home directory to C:\tmp\winutils, or you can
safely ignore this error.

Next, you will put multiple files to HDFS.


Storing Multiple files on HDFS
Task outline
Your next step is to store a set of files on HDFS. In this scenario, you are saving a series of Twitter message files for later
analysis.

Create the Job
Under the C:\StudentFiles\BDBasics\tweets_in folder, you will find tweets saved as text files. You will put these files on HDFS.
1. DUPLICATE THE JOB
Duplicate the PutCustomersData Job and name the new Job PutTweets.
a. Right-click PutCustomersData Job in the Repository, then click Duplicate.
b. In the Input new name box, enter PutTweets and in the Job Type list, select Standard:

c. Click OK to duplicate the Job. Then, open PutTweets.
2. UPDATE THE LOCAL DIRECTORY
Configure tHDFSPut to read from the C:/StudentFiles/BDBasics/tweets_in folder in your local file system.
a. Double-click tHDFSPut to open the Component view.
b. In the Local directory, locate C:/StudentFiles/BDBasics/tweets_in.
c. In the Filemask column, enter "*".
This selects all the files under C:\StudentFiles\BDBasics\tweets_in.
3. UPDATE THE TARGET DIRECTORY
In the HDFS directory box, enter /user/student/BDBasics/tweets.


Next you will run your Job and check the result in Hue.

Run
1. RUN YOUR JOB AND VERIFY THE RESULTS
Run your Job and follow the execution in the Console. At the end of execution, you should have an exit code equal to 0 and several other messages that prove that the Job executed successfully:


2. CONNECT TO Hue
In your web browser, the page with Hue should already be opened. Otherwise, navigate to http://hadoopcluster:8888 and
login with student/training.
3. VERIFY THE RESULTS
Click File Browser and navigate to
/user/student/BDBasics/tweets:

There should be eleven files, because there are eleven tweet files in your local directory. The tHDFSPut component wrote
all of them into the Hadoop file system.
You used a file mask to write a file set. File masks give you considerable control over what files you want Talend components
to operate on.
Now that some data exists on HDFS, you can create a Job to read it.


Reading Data from HDFS
Task outline
Now you will create a Job to read HDFS data. You are going to transfer a subset of the Twitter files from HDFS to your
local file system.

Create the Job
If you examine the files in the tweets folder that you just created on HDFS, you will notice that the files were produced in 2012 and
2013. You will read only the 2012 files.
1. CREATE A NEW STANDARD JOB
Create a new Job, naming it GetTweets.
2. ADD A tHDFSGet COMPONENT
Place a tHDFSGet component on the design workspace.
You use tHDFSGet to read files stored on HDFS and make copies of them on your local system.
3. CONFIGURE tHDFSGet TO USE THE HDFSConnection METADATA
Use the HDFSConnection metadata previously created.
a. Double-click tHDFSGet to open the Component view.
b. In the Property Type list, select Repository.
c. Click (...) , then locate the HDFSConnection metadata that you previously created.
4. CONFIGURE THE HDFS DIRECTORY
To configure the HDFS directory, click (...) , then navigate to /user/student/BDBasics/tweets.
5. CONFIGURE THE LOCAL DIRECTORY
Configure tHDFSGet to overwrite the C:/StudentFiles/BDBasics/tweets_out folder.
a. In the Local directory, enter "C:/StudentFiles/BDBasics/tweets_out".
b. In the Overwrite file list, select always. This tells the component to always replace existing files when a new file of
the same name is being saved.
6. ADD A CONDITION TO READ ALL THE 2012 TWEETS
Configure the Files table to read the tweets corresponding to 2012.
a. Click the green plus sign below the Files table.
b. Replace "newline" with "2012*".
You are using a file mask to request all files in the HDFS directory tweets with names that begin with the string
"2012":


Now, you will run the Job and verify the results in your local file browser.

Run
1. RUN YOUR JOB AND VERIFY THE RESULTS IN THE CONSOLE
Run your Job and follow the execution in the Console. At the end of execution, you should have an exit code equal to 0 and
several other messages that prove that the Job successfully executed:

2. VERIFY THE RESULTS IN YOUR LOCAL FILE BROWSER
In a file browser, navigate to C:\StudentFiles\BDBasics\tweets_out:


There are five tweet files that your Job read from HDFS and put on your local system. Because a file mask was used to limit
the files selected for the operation, only files with names that begin with "2012" were read from HDFS.
Now that you have worked with text files on HDFS, it is time to work with a sparse dataset and experiment with how to store it efficiently
in HDFS.


Storing Sparse Dataset with HBase
Task outline
Apache HBase is the Hadoop column-oriented database. It is an open-source, non-relational database modeled after
Google's BigTable, and provides the same functionality on top of Hadoop and HDFS.
HBase is useful when you need random, real-time read/write access to your Big Data. Furthermore, HBase can host very
large tables -- billions of rows by millions of columns -- and is particularly well suited for sparse data sets.
If your relational table looks like the table below (data missing in columns), it is considered "sparse" and is a good candidate for HBase.

In this lab, you will generate sparse data and build a Job to store it in HDFS with HBase:


Generate sparse data
First, you will open a Job that will generate a sparse data set.
It will create a dataset that represents the number of connections per month to a website by customers identified by their Id.
1. IMPORT THE GenerateSparseData JOB
From the JobDesigns.zip archive, import the GenerateSparseData Job.
a. Under the C:\StudentFiles\BDBasics folder, you will find the JobDesigns.zip archive file.
b. Import the Job named GenerateSparseData:

This Job is composed of two components. The tRowGenerator component will generate an integer value for the
customers ID, and random integer values to simulate the number of connections per month.
The second component will save the data in a file named HBase_sample.txt under the C:\StudentFiles\BDBasics
folder.
2. RUN THE JOB AND VERIFY THE RESULTS
Run the Job, then locate and open C:\StudentFiles\BDBasics\HBase_sample.txt with Notepad++.

As expected, there are a lot of 0 values in the data.
Next, you will configure the connection to HBase.

Configure HBase connection
An alternative way to configure a connection is to use a dedicated component. You will now create a new Job. The first step will be to
configure the connection to HBase.
1. CREATE A NEW STANDARD JOB
Create a new standard Job and name it StoreSparseData.
2. ADD A tHBaseConnection COMPONENT
Add a tHBaseConnection component and open the Component view.


3. CONFIGURE THE DISTRIBUTION AND THE VERSION OF YOUR CLUSTER
In the Distribution list, select Cloudera and in the HBase version list, select Cloudera CDH5.8 (YARN mode).
4. CONFIGURE THE CONNECTION TO Zookeeper
Configure the host name and the port of the Zookeeper service.
a. In the Zookeeper quorum box, enter "hadoopcluster".
b. In the Zookeeper client port box, enter "2181".
Your configuration should be as follows:
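For reference, these two Zookeeper settings are essentially all the standard HBase Java client needs to connect. The sketch below is an illustration of the equivalent call with the HBase 1.x client API shipped with CDH 5.8; it is not the code the component generates.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class HBaseConnectSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Same values as the tHBaseConnection component
        conf.set("hbase.zookeeper.quorum", "hadoopcluster");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            System.out.println("Connected to HBase: " + !connection.isClosed());
        }
    }
}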

Read sparse data
You will now continue to build the Job to store your data with HBase.
1. ADD A tFileInputDelimited COMPONENT
Add a tFileInputDelimited component below tHBaseConnection.
2. CONNECT WITH A TRIGGER
Connect tHBaseConnection to tFileInputDelimited with an OnSubjobOk trigger.
3. READ THE HBase_sample.txt FILE
Configure tFileInputDelimited to read the "C:/StudentFiles/BDBasics/HBase_sample.txt" file, knowing that the row separator is "\n", the field separator is "\t" and that there is 1 Header row.

4. CONFIGURE THE SCHEMA
The schema of the HBase_sample.txt file can be found in the GenerateSparseData Job.


a. In the GenerateSparseData Job, double-click tFileOutputDelimited to open the Component view.
b. Edit the schema, select all the columns and copy the schema:

c. In the StoreSparseData Job, double-click tFileInputDelimited to open the Component view.
d. Edit the schema and paste the schema:

Handle Null values
HBase is well suited for sparse datasets because it does not persist Null values. In our current data set, there are a lot of zeros. You will
process the data to find the zeros and replace them with Nulls.
You will store the raw data set in HBase, as well as the processed data.
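Conceptually, a row with null cells simply results in fewer writes: a cell that is never written costs nothing in HBase. The hedged sketch below illustrates this with the HBase 1.x client API; the table and family names match the ones you create later in this exercise, while the connection variable and sample values are assumptions for illustration.

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SparseRowSketch {
    // Writes one customer row, skipping months with no connections
    static void writeRow(Connection connection, int id, Integer[] monthlyCounts)
            throws Exception {
        String[] months = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
                           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"};
        try (Table table = connection.getTable(TableName.valueOf("data_withNulls"))) {
            Put put = new Put(Bytes.toBytes(Integer.toString(id)));
            put.addColumn(Bytes.toBytes("Id"), Bytes.toBytes("Id"),
                    Bytes.toBytes(Integer.toString(id)));
            for (int i = 0; i < months.length; i++) {
                // Null (originally 0) values are simply not written: no cell is stored
                if (monthlyCounts[i] != null) {
                    put.addColumn(Bytes.toBytes("Date"), Bytes.toBytes(months[i]),
                            Bytes.toBytes(Integer.toString(monthlyCounts[i])));
                }
            }
            table.put(put);
        }
    }
}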
1. ADD A tReplicate COMPONENT
Add a tReplicate component at the right side of tFileInputDelimited and connect it with the Main row.
2. ADD A tMap COMPONENT
Add a tMap component at the right side of tReplicate and connect it with the Main row.
3. OPEN THE MAPPING EDITOR
Double-click to open the tMap editor.
4. CREATE A NEW OUTPUT TABLE
Add an output and name it out_with_NULLS.


5. COPY ALL INPUT COLUMNS INTO THE OUTPUT TABLE
Select all the columns in the row table and drag them to the out_with_NULLS table.
Note: You may have a different row index.
6. CONFIGURE THE MAPPING TO REPLACE ALL 0 VALUES BY Null VALUES
Edit the expression for each month and modify it as follows:
(row3.Jan==0)?null:row3.Jan
This means that all 0 values will be replaced by a null.
Note: The Jan value must be replaced as needed to fit all months.
Your configuration should be similar to this:

Click Ok.

Save data to HBase
Now you will add components to save raw and processed data to HBase.
1. ADD A tHBaseOutput COMPONENT
At the right side of tMap, add a tHBaseOutput component and connect it with the out_with_NULLS row. Then, open the
Component view.
2. USE THE EXISTING CONNECTION TO HBASE
Select Use an existing connection and select tHBaseConnection_1 in the list.
3. CONFIGURE THE SCHEMA
Click Sync columns.
4. GIVE A NAME TO THE HBASE TABLE
In the Table name box, enter data_withNulls.
5. SELECT THE APPROPRIATE ACTION
In the Action on table list, select Drop table if exists and create.
6. CREATE THE COLUMN FAMILY NAMES
To create HBase tables, a family name must be associated with each column.


a. To create the family names, click Advanced settings.
b. In the Family parameters table, add 2 lines as follows:

7. CONFIGURE THE Families TABLE
The Id family name should be used for the Id column and the Date family name for all other columns.
a. In the Basic settings tab, set the Id column Family Name to "Id".
b. Set the other column's Family Name to "Date".
Your configuration should be as follows:

8. ADD ANOTHER tHBaseOutput COMPONENT AND CONFIGURE IT AS THE FIRST ONE
Copy tHBaseOutput_1 and paste it below the tMap component.
9. CONNECT IT TO tReplicate
Connect tHBaseOutput_2 to tReplicate with a Main row.


10. NAME THE TABLE RawData
Open the Component view of tHBaseOutput_2 and change the Table name to "RawData":

11. ADD A tHBaseClose COMPONENT AND CONNECT IT
Add a tHBaseClose component below the tFileInputDelimited component and connect it with an OnSubjobOk trigger.
Now you can run your Job and check the results in Hue.

Run the Job and verify the results in Hue
1. RUN YOUR JOB AND VERIFY THE RESULT IN THE CONSOLE
Run your Job and check the results of the execution in the Console.


Your execution should be successful. Otherwise, double check your Job and check HBase health in the Cloudera Manager.
2. VERIFY THE RESULTS IN Hue
Connect to Hue to check your results.
a. Connect to Hue and click Data Browsers>HBase.
This will give you the list of HBase tables. You should see Data_withNulls and RawData in the list:


b. Click RawData:

This is an extract of the RawData table. You will still find zeros, but if you compare with the content of Data_withNulls:

The Null values are not stored.
Now that you have used HBase to store sparse data, it's time to move to the Challenge to test your knowledge.


Challenges
Task outline
Complete these exercises to further reinforce the skills you learned in the previous lesson. See Solutions for possible solutions to the exercises.

Add Support File
Develop a Job to write an XML file of support requests stored locally to HDFS. Configure the input and output targets as follows:
Source file: C:/StudentFiles/BDBasics/support/support.xml
Target file: /user/student/BDBasics/support/support.xml

Double Up Orders
Develop a second Job to write a file of duplicated orders to HDFS. Use the local file C:/StudentFiles/BDBasics/duplicated_orders as
the source file. Put the file into the HDFS directory /user/student/BDBasics/erp, keeping the same file name. Use the schema stored
in the file orders.xml and sort the file on column id before saving it.
Hint: use tSortRow to sort the file.


Solutions
Suggested solutions
These are possible solutions to the exercises. Note that your solutions may differ and still be acceptable.

Put Support file
1. Use a tHDFSPut component to create the HDFS folder and write the target file to it.
2. Configure the component as shown in the following image:


Double-up orders
1. Use a tFileInputDelimited to read the file duplicated_orders from the local file system:

2. Import the schema from the file orders.xml:

3. Sort the data in ascending order by the column id by using the tSortRow component :


4. Use a tHDFSOutput component to rewrite the data to HDFS:

5. Click Overwrite in the Action list.
6. Run your Job and check the results in Hue.

Next step
You have almost finished this section. Time for a quick review.


Review
Recap
In this lesson, you learned the basics of writing files using Talend's Big Data components for Hadoop Distributed File System, HDFS.
You used tHDFSPut to write different types of files to HDFS. Then, you used tHDFSGet to read a subset of the files back from
HDFS.
Last, you used HBase dedicated components to store sparse data on HDFS (tHBaseConnection, tHBaseOutput,
tHBaseClose).

Further Reading
For more information about topics covered in this lesson, see the Talend Data Integration Studio User Guide, Talend Big Data Platform User Guide and the Talend Components Reference Guide.


LESSON 4

Working with Tables
This chapter discusses:

Concepts
Working With Tables
Importing Tables with Sqoop
Creating Tables with Hive
Review

Concepts


Working With Tables
Overview
In this lesson, you will cover two common use cases.
HDFS can be used for data warehouse optimization, so you could decide to move your data from a relational database to HDFS. The first use case will show you how to transfer MySQL tables to HDFS using Sqoop.
The second use case will show you how to create a table using Hive. Then, this table can be processed using Hive QL,
which is very similar to SQL.

Objectives
After completing this lesson, you will be able to:
Transfer MySQL tables to HDFS using Sqoop
Create Hive connection Metadata
Save data as Hive tables

Before you begin
Be sure that you are in a working environment that contains the following:
A properly installed copy of the Talend Studio
A properly configured Hadoop cluster
The supporting files for this lesson

First, you will use Sqoop to import a MySQL table to HDFS.


Importing Tables with Sqoop
Task outline
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data
from a relational database management system (RDBMS) such as MySQL or Oracle, into the Hadoop Distributed File
System (HDFS), transform the data using Map Reduce, and export that data back into a RDBMS.
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported.
Sqoop creates and runs a Map-only Map Reduce Job to import the data.
In this exercise, you will transfer MySQL tables into HDFS using Sqoop dedicated components. First, you will push the customers data into a MySQL database.

Preparing the MySQL database
The Job to copy the Customers data in MySQL has already been created for you. It is saved in the C:\StudentFiles\BDBasics folder.
1. IMPORT THE PushCustomerDataToMySQL JOB
From the C:\StudentFiles\BDBasics\JobDesigns.zip archive file, import the PushCustomerDataToMySQL Job and
the associated RemoteMySQL database Metadata.
2. RUN THE JOB
Run PushCustomerDataToMySQL.
This will copy 1 million rows into a table named CustomersData in a remotely hosted MySQL database:

3. USE THE DATA VIEWER TO VALIDATE THE DATA
To check the data copied in the CustomersData table, right-click the tMysqlOutput component, then click Data Viewer:


You will now transfer this table to HDFS using the tSqoopImport component.

Create a generic database metadata
The tSqoopImport component calls Sqoop to transfer data from a relational database management system, such as MySQL or
Oracle, into the Hadoop Distributed File System.
First, you will create a generic database Metadata, which is required for Sqoop to connect to your MySQL database.
1. CREATE A NEW DATABASE CONNECTION METADATA
Create a new Generic JDBC database connection metadata named genericConnection.
a. In the Repository, under Metadata, right-click Db Connections , then click Create connection.
b. In the Name box, enter genericConnection. You can also add a Purpose and a Description. Then, click Next.
c. In the DB Type list, select General JDBC.
2. CONFIGURE THE METADATA TO CONNECT TO YOUR MySQL DATABASE
To configure the genericConnection metadata, add the database URL, driver jar, class name, credentials and mapping file.
a. In the JDBC URL box, enter
jdbc:mysql://hadoopcluster:3306/training .
b. Click the (...) next to the Driver jar box and select
mysql-connector-java-5.1.30-bin.jar.
c. Click the (...) next to the Class name box, then, in the drop-down list, select com.mysql.jdbc.Driver.


d. In the User name and Password boxes, enter root.
e. Click the (...) next to the Mapping file box, then select mapping_Mysql.xml in the list. Click OK.
Your configuration should be as follows:

3. TEST THE CONNECTIVITY
Click Check. Your connection should be successful:

4. FINALIZE THE METADATA CREATION
Click OK and Finish.
The genericConnection Metadata appears in the Repository.
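The Check button essentially opens a plain JDBC connection with the values you entered. As a hedged illustration only (not what the Studio runs internally), the equivalent in Java would look like this, using the driver class, URL, and credentials from the training environment:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CheckGenericConnection {
    public static void main(String[] args) throws Exception {
        // Same driver class and URL as the genericConnection metadata
        Class.forName("com.mysql.jdbc.Driver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://hadoopcluster:3306/training", "root", "root");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM CustomersData")) {
            if (rs.next()) {
                System.out.println("Rows in CustomersData: " + rs.getLong(1));
            }
        }
    }
}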

Importing tables
You will create a simple Job to import the CustomersData MySQL table into HDFS using a tSqoopImport component.
In the tSqoopImport component, the first option is to choose the Mode: Commandline or Java API.
If you choose Commandline, the Sqoop shell is used to call Sqoop. In this mode, you have to deploy and run the Job in the host where
Sqoop is installed. This means that you have to install and use the Jobserver, as described in the Talend Data Integration Advanced
training, or as described in the Talend Installation Guide.
If you select Use Java API, the Java API is used to call Sqoop. In this mode, the Job can be run locally in the Studio, but you have to
configure the connection to your cluster.
Note: A JDK is required to execute the Job with the Java API and the JDK versions on both machines must be compatible.
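In Java API mode, the component ultimately drives Sqoop's own Java entry point. Purely as an illustration of that idea (not the exact code the component generates), a hand-written call with the same options used in this exercise might look like the sketch below; it assumes Sqoop 1's org.apache.sqoop.Sqoop class is on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Cluster entry points, as in the HDFSConnection metadata
        conf.set("fs.defaultFS", "hdfs://hadoopcluster:8020");
        conf.set("yarn.resourcemanager.address", "hadoopcluster:8032");

        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://hadoopcluster:3306/training",
            "--username", "root",
            "--password", "root",
            "--table", "CustomersData",
            "--target-dir", "/user/student/BDBasics/SqoopTable",
            "--delete-target-dir"
        };
        // Runs the map-only MapReduce import described above
        int exitCode = Sqoop.runTool(sqoopArgs, conf);
        System.out.println("Sqoop import finished with exit code " + exitCode);
    }
}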
1. CREATE A NEW STANDARD JOB
Create a new standard Job and name it SqoopImport.
2. ADD A tSqoopImport COMPONENT
Add a tSqoopImport component and open the Component view.
3. CONFIGURE THE tSqoopImport COMPONENT TO USE THE Java API
In the Mode box, select Use Java API.


4. CONFIGURE THE tSqoopImport COMPONENT TO USE THE HDFSConnection METADATA
In the Hadoop Property list, select Repository , then browse to find the HDFSConnection you configured earlier in the
course.
This will configure the Distribution, the Hadoop version, the NameNode URI, the Resource Manager, and the Hadoop user
name:

5. USE YOUR DATABASE CONNECTION METADATA
In the JDBC Property list, select Repository, then browse to find the genericConnection Metadata.
This configures the Connection, the Username, the Password, and the Driver JAR values:

6. IMPORT THE CustomersData TABLE
In the Table Name box, enter "CustomersData" and select Delete target directory.
7. CONFIGURE THE TARGET DIRECTORY
Select Specify Target Dir and enter
"/user/student/BDBasics/SqoopTable".


Your configuration should be as follows:

Run the Job and verify the results
As you did previously, you will run your Job and check the results in the Console and in Hue.
1. RUN YOUR JOB AND VERIFY THE RESULTS IN THE CONSOLE
Examine the output in the Console.
a. Run your Job and check the results in the Console. The last line should be an exit code equal to 0.
b. You can investigate what you see in the Console a little bit. For example, you can see the execution of the Map
Reduce Job generated by the Sqoop import:

Note: Your Job Id will be different. The Id is the number following "job_...". In the current example, the Job Id is
1476871003091_0006.


c. Right after, you will find a recap of various counters:

Here you can see that 4 map tasks ran on the cluster and that the 1 million records were transferred in approximately 21 seconds.
2. CONNECT TO Hue AND VERIFY YOUR RESULTS
In Hue, use the Job browser to find your jobs and the File browser to find your data.
a. To see your Job in Hue, click Job Browser.
The Job Browser window will open, and you will see the list of all your Jobs.
b. From the Console, find the Id of your Job. Then, in Hue, find your Job Id in the Job list. It should be followed by
green boxes, corresponding to a successful execution of Map and Reduce tasks:


c. In Hue, click File Browser and navigate to /user/student/BDBasics/SqoopTable:

The data has been split in multiple parts.
d. Click part-m-00000:

Here you can check that your data has been imported as expected.
Now that you have imported a MySQL table to HDFS using a tSqoopImport component, you can continue to experiment working
with tables. The next topic will show you how to create tables in HDFS with Hive.


Creating Tables with Hive
Task outline
Hive is a data warehouse infrastructure tool used to process structured data in Hadoop. It is a database that resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Hive supports HiveQL, which is similar to SQL but does not support the complete SQL syntax.
Hive converts the HiveQL query into Map Reduce code and submits it to the Hadoop cluster. Through the HiveQL language, Hive provides a higher level of abstraction over Java Map Reduce programming.
First, you will create Hive Metadata in the Repository. Then, you will use various methods to create Hive tables.

Create Hive connection metadata
As you did previously for the cluster and HDFS connections, you will create a Hive connection metadata in the Repository.
1. CREATE A HIVE CONNECTION METADATA
Right-click TrainingCluster in the Repository under Metadata/Hadoop Cluster, then click Create Hive.
2. NAME THE METADATA
In the Name box, enter HiveConnection, then click Next.
3. CONFIGURE THE CONNECTION TO HIVE
Select the Hive Model and enter the port number.
a. In the Hive Model list, select Standalone.
b. In the Port box, enter 10000.
Your configuration should be as follows:


4. TEST THE CONNECTION TO HIVE
Click Check. You should have a successful connection message:

5. FINALIZE THE METADATA CREATION
Click OK and Finish to create the Hive connection Metadata.
The Hive connection metadata appears in the Repository under Hadoop Cluster/TrainingCluster. The Hive connection
metadata is also available in the Repository under Db Connections.
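In Standalone mode, the connection goes through HiveServer2's JDBC interface on port 10000. The sketch below only illustrates that check with the standard Hive JDBC driver; it is not the code the Studio runs, and the empty password is an assumption for the training environment.

import java.sql.Connection;
import java.sql.DriverManager;

public class CheckHiveConnection {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver and URL (host and port from the metadata)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hadoopcluster:10000/default", "student", "")) {
            System.out.println("Connected to Hive: " + !conn.isClosed());
        }
    }
}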

Create a Hive table manually
You will now create a Job that will create a table and populate it with the customer's data.
1. CREATE A NEW STANDARD JOB
Create a new standard Job and name it HiveLoad.
2. ADD A tHiveCreateTable COMPONENT
Add a tHiveCreateTable and open the Component view.
The tHiveCreateTable component will create an empty Hive table according to the specified schema.
3. USE THE HiveConnection METADATA
In the Property Type list, select Repository, then browse to find the HiveConnection Metadata:

4. USE THE CustomersData GENERIC SCHEMA
In the Schema list, select Repository then browse to find the CustomersData generic schema metadata.
5. NAME THE CustomersData TABLE
In the Table Name box, enter "CustomersData".
6. SELECT THE ACTION ON TABLE
In the Action on table list, select Create table if not exists:

7. ADD A tHiveLoad COMPONENT
Add a tHiveLoad component.
The tHiveLoad component will populate the table with data.
8. CONNECT WITH A TRIGGER
Connect tHiveCreateTable to tHiveLoad with an OnSubjobOk trigger and open the tHiveLoad Component view.
9. USE THE HiveConnection METADATA
Set the Property Type to Repository and use the HiveConnection Metadata.
10. LOAD THE CustomersData.csv FILE
Configure tHiveLoad to load the CustomersData.csv file into a table named CustomersData (the equivalent HiveQL is sketched after this list).
a. In the Load action list, select LOAD.
b. In the File Path box, enter
"/user/student/BDBasics/Customers/CustomersData.csv".
c. In the Table Name box, enter "CustomersData".
d. In the Action on file list, select OVERWRITE.
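For reference, the tHiveCreateTable and tHiveLoad components configured above correspond roughly to HiveQL statements like the ones below. This is only a sketch: the Studio generates the exact statements from the component settings, and the column types and the semicolon field delimiter shown here are assumptions based on the CustomersData generic schema and source file.

CREATE TABLE IF NOT EXISTS CustomersData (
  Id INT,
  FirstName STRING,
  LastName STRING,
  City STRING,
  State STRING,
  ProductCategory STRING,
  Gender STRING,
  PurchaseDate TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';';

LOAD DATA INPATH '/user/student/BDBasics/Customers/CustomersData.csv'
OVERWRITE INTO TABLE CustomersData;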

You will now run your Job and check the results in the Console and in Hue.

Run the Job and verify the results
1. RUN YOUR JOB
Run your Job and check the output in the Console:

The Job successfully executed. Now, you can check the results in Hue.
2. CONNECT TO Hue AND CHECK YOUR RESULTS
In Hue, use the Data Browsers to examine your CustomersData Hive table.
a. In Hue, click Data Browsers. Then, click Metastore Tables:

b. Click the customersdata table:

This is the overview of your CustomersData table.

c. Click Columns (8) to inspect the schema of the customersdata table:

Note: The column types have been automatically converted.
d. Click Sample:

If you examine the results in the purchasedate column, you will see only NULL values. This is because the timestamp format of Hive does not match the date format used in the Talend Studio, so the data conversion fails.
A possible workaround is to treat dates as String types.
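For example, if PurchaseDate is declared as a STRING in the Hive schema, it can still be interpreted as a date at query time with Hive's built-in date functions. The query below is only a sketch: the 'MM/dd/yyyy' pattern is an assumption and must be adapted to the actual date format used in the source file.

select Id, unix_timestamp(PurchaseDate, 'MM/dd/yyyy') as purchase_ts
from CustomersData
limit 10;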
You will now experiment with another way to create a Hive table. You will use a Wizard which will automatically create a Hive table
from a file stored in HDFS.

Using the Hive table creation wizard
To create a Hive table automatically from the CustomersData.csv file stored on HDFS, you will have to change the perspective of the Studio. You will move to the Profiling perspective, which is dedicated to Data Quality analysis on databases or on HDFS, depending on where you stored your data.
1. COPY CustomersData.csv TO HDFS
To make sure that the CustomersData.csv file is available for the following steps, run the PutCustomersData Job again.

2. SWITCH TO THE Profiling PERSPECTIVE
In the upper-right corner of the Studio, click Profiling to open the Profiling perspective:

3. LOCATE THE HDFSConnection METADATA
In the DQ Repository, expand Metadata/Hadoop Cluster/TrainingCluster:

There, you will find the connection metadata you created earlier.
4. CREATE A HIVE TABLE FROM CustomersData.csv
From the CustomersData.csv file copied in HDFS, use the wizard to create a Hive table.
a. Right-click HDFSConnection. Then click Create Hive Table:

The connection to HDFS is checked, and then the wizard for creating a Hive table from a file stored in HDFS starts.
b. Browse to find the CustomersData.csv file under /user/student/BDBasics/Customers.

c. Select CustomersData.csv and wait until the Creation status changes to Success:

As expected, the wizard detects 8 columns in the CustomersData.csv file.
Note: The Hive table creation wizard will convert all of the files under the selected folder. So, in the current
example, only the CustomersData.csv file will be converted because it is the only file in the Customers folder.
d. Click Next.
e. In the Schema, edit the columns' names as follows:
Column0 > Id
Column1 > FirstName
Column2 > LastName
Column3 > City
Column4 > State
Column5 > ProductCategory
Column6 > Gender
Column7 > PurchaseDate

f. Click Next.
g. In the New Table Name box, enter CustomersData_auto.
h. In the Hive connection list, select HiveConnection, then click Finish.
The Hive table is created and then the wizard closes.
5. REFRESH THE HIVE TABLE LIST
From the Repository, reload the Hive table list.

a. Under DQ Repository>Metadata>DB connections>HiveConnection>default, right-click Tables and click Reload table list:

b. Click Reload in the Reload pop up message.
The CustomersData_auto table appears in the table list.

Verify the Hive table
You can check the table in the Studio or in Hue.
1. PREVIEW THE HIVE TABLE
From the Repository, preview the customersdata_auto Hive table.
a. In the DQ Repository, right-click the CustomersData_auto table, then click Preview:

b. This Preview translates into a HiveQL query applied to your Hive table:

The query is "select * from default.customersdata_auto" and appears in the SQL Editor, at the top of
the window.
The result appears in tab 1, displayed below the SQL Editor.
2. VERIFY THE HIVE TABLE FROM Hue
Use the Data Browsers in Hue to check the customersdata_auto Hive table.
a. To check the new table in Hue, click Data Browsers>Metastore Tables, then click customersdata_auto in the
table list.
b. Click Sample:

You have covered the various ways to work with tables in Hive.

Next step
You have almost finished this section. Time for a quick review.

Review
Recap
In this lesson, you learned how to use Talend's Big Data components for Hive and Sqoop.
First, you imported a MySQL table to HDFS using the tSqoopImport component. The import was done through a Map-only Map
Reduce Job.
Next, you manually created a Hive table with the tHiveCreateTable component and populated it with the tHiveLoad component.
Last, you used the Hive table creation wizard to automatically create your Hive table from a file stored on HDFS.


LESSON 5

Processing Data and Tables in
HDFS
This chapter discusses:

Concepts 102
Processing Data and Tables in HDFS 107
Processing Hive Tables with Jobs 108
Profiling Hive Tables - Optional 116
Processing Data with Pig 128
Processing Data with Big Data Batch Job 136
Review 147

Concepts


Processing Data and Tables in HDFS
Use case
Once your tables and data are stored in HDFS, you will need to process them to extract useful information.
Depending on your data type, you can adopt various strategies.
If your data is stored as Hive tables, HiveQL might be the best way to address your needs. HiveQL is a high-level query language similar to SQL. Hive converts the request into Map Reduce Jobs that are executed on your cluster.
In this lesson, you will analyze Hive tables with a Job or with the Profiling view of the Studio.
If your data is stored as text files, one option is to use Pig. Pig Latin is a high-level language providing common operations
to group, filter, and join data. The Pig Latin script is automatically converted into Java Map Reduce code to be executed on
the cluster. Talend provides components to use Pig with minimal programming effort.
Another way to process your data, covered in this lesson, is to use a Big Data Batch Job. This kind of Job automatically
converts the components into Java Map Reduce code that will be run on the cluster.

Objectives
After completing this lesson, you will be able to:
Process Hive tables with a standard Job
Process Hive tables in the Profiling perspective of the Studio
Process data with Pig components
Process data with a Big Data Batch Job

Before you begin
Be sure that you are in a working environment that contains the following:
A properly installed copy of the Talend Studio
A properly configured Hadoop cluster
The supporting files for this lesson

First, you will use Hive to process the tables created in the previous lesson.

Processing Hive Tables with Jobs
Task outline
Hive converts the HiveQL query into Map Reduce code and then submits it to the Hadoop cluster. Through HiveQL language, Hive provides a higher level of abstraction compared to Java Map Reduce programming.
You will now analyze the customers data table you created in the Working with Tables lesson.
Using the Studio, you can analyze your data with your own Hive queries or you can use the Profiling view and use the
Data Quality functions of the Studio over your Hive tables.
You will use various components to extract useful data from a Hive table, process it, and then store the result in another
Hive table.
At the end of this lab, your Job will look like the following:

Extracting useful data
The first step is to collect useful data from the CustomersData Hive table you previously created.
You will limit your investigations to the first 100,000 rows of the table.
1. SWITCH TO THE Integration PERSPECTIVE
In the upper-right corner of the Studio, click Integration to open the Integration perspective.
2. CREATE A NEW STANDARD JOB
Create a new Standard Job and name it HiveProcessing.

3. ADD A tHiveConnection COMPONENT
Add a tHiveConnection component which uses the HiveConnection metadata.
a. In the Repository, click HiveConnection under Metadata/Hadoop cluster/TrainingCluster/Hive.
b. Drag it to the Designer.
c. Select tHiveConnection in the Components list and click OK.
4. ADD A tHiveInput COMPONENT
Add a tHiveInput component which uses the existing connection to Hive.
a. Add a tHiveInput component and connect it with an OnSubjobOk trigger.
b. Open the Component view.
c. Select the Use an existing connection option.
d. Ensure that tHiveConnection_1 is selected on the Component List.
5. CONFIGURE THE SCHEMA
Set the Schema to Repository and then use the CustomersData generic schema metadata.
6. CONFIGURE THE TABLE NAME
In the Table Name box, enter "CustomersData".
7. READ THE FIRST 100,000 ROWS OF THE TABLE
In the Query box, you will enter the HiveQL query that will be sent to the cluster.
As mentioned in the overview, the investigations will be limited to the first 100,000 rows.
In the Query box, enter:
"select * from CustomersData where CustomersData.Id<=100000"
Your configuration should be as follows:

Process data
You will now aggregate the data and store the result in HDFS.
1. ADD A tAggregateRow COMPONENT
Add a tAggregateRow component, connect it with the Main row, and then, open the Component view.
2. CONFIGURE THE SCHEMA
Configure the schema to have 3 output columns named ProductCategory, Gender and Count. The first 2 columns are
strings and the third one is an Integer.

a. Click Sync columns and then click (...) to edit the schema.
b. Configure the output schema to have 3 columns named ProductCategory, Gender and Count, as follows:

c. Click OK to save the schema.
3. CONFIGURE THE AGGREGATION
Configure the Group by and Operations tables to aggregate your data by ProductCategory and Gender (a HiveQL equivalent is sketched after this step).
a. Click the green plus sign below the Group by table to add two Output columns: ProductCategory and Gender.
b. Click the green plus sign below the Operations table to add Count to the Output Column.
c. In the Function column, select count.
d. In the Input column position column, select ProductCategory.
Your configuration should be as follows:

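For reference, the aggregation configured in this step computes roughly the same result as the HiveQL query below (a sketch for comparison only; in this Job the aggregation is performed by the tAggregateRow component rather than pushed to Hive). The count(*) value corresponds to the Count output column:

select ProductCategory, Gender, count(*)
from CustomersData
where Id <= 100000
group by ProductCategory, Gender;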
4. ADD A tHDFSOutput COMPONENT AND CONFIGURE THE TARGET FOLDER
Add a tHDFSOutput component. Configure it to use the HDFSConnection metadata and to write the results in the /user/student/BDBasics/Hive/agg_results folder.
a. In the Repository, click HDFSConnection under Metadata/Hadoop cluster/TrainingCluster/HDFS.
b. Drag it to the Designer.
c. Select tHDFSOutput in the Components list, then click OK.
d. Connect tAggregateRow to tHDFSOutput with the Main row and then open the Component view.
e. In the File Name box, enter
"/user/student/BDBasics/Hive/agg_results".
This will save the aggregation results in HDFS. The last step is to transfer the results into a Hive table.

Transfer results to Hive
1. COPY THE OUTPUT SCHEMA
In the tHDFSOutput component, copy the schema. It will be reused in a tHiveCreateTable component.

a. In the tHDFSOutput Component view, click (...) to edit the schema.
b. Select ProductCategory, Gender, and Count in the Input or Output table, then copy the schema:

c. Close the schema window.
2. ADD A tHiveCreateTable COMPONENT AND CONNECT IT
Add a tHiveCreateTable below tHiveInput and connect it with an OnSubjobOk trigger.
3. CONFIGURE tHiveCreateTable
Configure the tHiveCreateTable component to create a table named AggResults with the previously copied schema.
a. Open the Component view.
b. Select the Use an existing connection option.
c. Click (...) to edit the schema.
d. Paste the Schema:

e. In the DB Type column, select STRING for ProductCategory and Gender and then, select INT for Count:

f. Click OK to save the schema.
g. In the Table Name box, enter "AggResults".
h. In the Action on table list, select Create table if not exists.
Your configuration should be as follows:

4. ADD A tHiveRow COMPONENT
Add a tHiveRow component below tHiveCreateTable and connect it with an OnSubjobOk trigger.
5. CONFIGURE tHiveRow
In the tHiveRow component, write a query to populate the AggResults Hive table with the agg_results file.
a. In the Component view, select the Use an existing connection option.
b. Copy the schema in tHiveCreateTable and paste it in the schema of the tHiveRow component.
c. Click OK to save the schema.
d. In the Query box, you will be able to write your own HiveQL.
The query in a tHiveRow component is executed at each flow iteration in your Job. In the current Job, the query will be executed only once, to transfer the data from the HDFS file to the AggResults Hive table.
In the Query box, enter:
"LOAD DATA INPATH
'/user/student/BDBasics/Hive/agg_results' OVERWRITE INTO TABLE AggResults "
Note: Copy and paste the Query from the LabCodeToCopy file in the C:\StudentFiles\BDBasics folder.
e. Your configuration should be as follows:

Your Job is now complete. It's time to run it and check the results.

Run the Job and verify the results
1. RUN YOUR JOB
Run your Job and check the results in the Console:

If you did not include the extra JVM argument as explained in the first lab, you will see an error message about the winutils.exe binary not being available in the Hadoop binaries. In the current Job, this error won't prevent the Job from succeeding, but it could in some cases.
2. FIX THE winutils.exe ERROR
Using the Advanced settings tab, set the Hadoop home directory to the location of the winutils.exe file.
a. In the Run view, click Advanced settings.
b. Select the Use specific JVM arguments option.
c. Click New....
d. In the Argument box, enter:
-Dhadoop.home.dir=C:/tmp/winutils

e. Click OK to save the new argument and run your Job again.
The Job should run successfully without error messages:

3. Even if it is not clearly stated in the Console, the HiveQL query of the tHiveInput component was executed on the cluster. You
can see the generated Job in the Hue Job Browser:

4. Using the Hue File Browser, navigate to
/user/student/BDBasics/Hive:

The folder is empty because the LOAD DATA INPATH statement moves the data into the Hive table's storage, removing the file from its original HDFS location.
5. In Hue, click Data Browsers>Metastore Tables. The table AggResults should be in the table list.
6. Click AggResults, then click the Sample tab:

You have processed your Hive Table with various components such as tHiveInput and tHiveRow. You will now process your Hive
table using the Profiling perspective of the Studio.

Profiling Hive Tables - Optional
Task outline
Using the Studio, you can run various analyses over your Hive tables. In this exercise, you will use the Profiling view of the
Studio to run analyses of your Hive connection, Hive tables, and columns.

Hive connection analysis
1. SWITCH TO THE Profiling PERSPECTIVE
In the upper-right corner of the Studio, click Profiling to switch to the Profiling perspective.
2. REFRESH THE HIVE TABLE LIST
From the DQ Repository, reload the list of Hive tables.
a. In the DQ Repository, under Metadata/DB connections, you will see your HiveConnection.
Right-click HiveConnection/default/Tables, then click Reload table list:

b. Click Reload in the Reload pop up message.
This will update the table list:

As expected, you will find the AggResults, CustomersData, and CustomersData_auto Hive tables.
Now, you will start the Overview Analysis of HiveConnection.
3. CREATE AN OVERVIEW ANALYSIS
Create an Overview Analysis on the HiveConnection metadata.
a. Right-click HiveConnection under DQ Repository/Metadata/DB connections, then click Overview
analysis:

b. In the Name box, enter OverviewAnalysis and click Next.

c. It is possible to filter the tables of interest by using the Table name filter and View name filter boxes. In this lab,
you will keep these boxes empty.
d. Click Finish.
Your analysis appears in the DQ Repository, under Data Profiling/Analyses:

4. START THE ANALYSIS
The OverviewAnalysis opens so that you can examine the Analysis settings.

Click the green arrow to start the analysis:

5. EXAMINE THE RESULTS
At the bottom of the Connection Analysis window, you will find the Analysis summary:

In the Analysis summary, you should see that the analysis was successful. There is 1 schema, named default. To get more
details, click default. In the table below, you will get 3 tables, named customersdata, customersdata_auto, and aggresults.
As expected, CustomersData and CustomersData_auto have 1 million rows, and AggResults has 14 rows.
Now that you have an overview of your Hive Tables, you can move to the next step, which is to analyze each table.

Hive tables analysis
In the Profiling view, you can easily refine the kind of analysis you run to suit your needs. You will now focus on the CustomersData table.
1. VISUALIZE THE customersdata HIVE TABLE
From the DQ Repository, display the customersdata Hive table.
a. Right-click customersdata under DQ Repository/Metadata/DB connections/HiveConnection/default/Tables, then
click Preview:

b. This will open a SQL Editor window, which is split into two parts.
The upper part is the request submitted to preview your table:

The bottom part is a preview of the CustomersData table:

2. CREATE A TABLE ANALYSIS
From the DQ Repository, create a Table analysis on the customersdata table. Add a filter to examine data corresponding to
women.
a. Right-click customersdata and click Table analysis.

b. In the Name box, enter TableAnalysis and click Finish.
c. The TableAnalysis window opens. By default, a row count operation is proposed. You will add a filter to count the
number of rows where the customer is a woman (the equivalent HiveQL is sketched after this step).
In the Where box, in the Data Filter tab, enter customersdata.gender='F':

d. Click the green arrow icon to start the analysis.

e. On the right side of the Table Analysis, you will find the result of your request:

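Behind the scenes, this filtered row count amounts to a HiveQL query of roughly the following form (a sketch; the exact statement generated by the Studio may differ):

select count(*)
from default.customersdata
where customersdata.gender = 'F';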
You will continue to investigate your data, running analyses on some columns of the CustomersData table.

Column analysis (optional)
1. CREATE A COLUMN ANALYSIS
From the DQ Repository, create a Column Analysis named ColumnAnalysis.
a. Right-click customersdata under DQ Repository/Metadata/DB connections/HiveConnection/default/Tables, then
click Column Analysis:

b. In the Name box, enter ColumnAnalysis, then click Finish.
The Column Analysis page opens.
2. CONFIGURE THE COLUMN ANALYSIS
As the analysis can be time-consuming, you will reduce its scope by excluding the id, firstname, lastname, city, and purchasedate columns.
a. Under Data preview, click Select Columns, then select the state, productcategory and gender columns:

b. Click OK to save the selection.
3. EXAMINE THE ANALYSIS CONFIGURATION
Under Analyzed Columns you will find state, productcategory and gender columns.
Click the plus sign next to state to view the details of what will be analyzed in the state column:

You will see basic information about your columns, such as the number of rows, the number of blank or null values, and the distinct count.
4. CONFIGURE THE EXECUTION ENGINE
In the Analysis Parameters tab, you can choose the execution engine: SQL or Java. To send your requests to the cluster, select SQL, then click the green arrow icon:

5. CONNECT TO Hue
While the analysis is running, you can go to the Hue Job Browser and follow the execution of the Map and Reduce tasks launched by the Column Analysis:

6. EXAMINE THE ANALYSIS RESULTS
At the end of the execution, open the Analysis Results tab. There you will find the results of the analysis for the state, productcategory, and gender columns.
There are 7 distinct product categories. You will now run an analysis to list these values.

Product Category column analysis
1. CREATE A COLUMN ANALYSIS
From the DQ Repository, create a Column Analysis over the productcategory column and name it ProductCategoryAnalysis.
a. Right-click productcategory under DQ Repository/Metadata/DB connections/HiveConnection/default/Tables/customersdata/Columns, and click Column Analysis>Analyze:

b. In the Name box, enter ProductCategoryAnalysis and click Finish.
2. SELECT INDICATORS
Select the Row Count and Value Frequency indicators.

a. In the Column Analysis window, under Data preview, click Select Indicators:

This is where you will specify which statistics you are interested in.
b. If you scroll down, you will see an option named Hide non applicable indicators.
Select this option. This will simplify the indicators selection:

c. Expand Simple Statistics and select Row Count in the productcategory column:

d. Expand Advanced Statistics and select Value Frequency.
e. Click OK to save your selection:

3. RUN THE ANALYSIS
Click the green arrow to start the analysis.
4. EXAMINE THE ANALYSIS RESULTS
Open the Analysis Results tab:

The frequency analysis lists the product category values and the count for each value. The numbers are very close to each other because this data set has been randomly generated.
You have covered the last analysis for this exercise. Now it's time to move to the next exercise, where you will cover how to process
data on HDFS using Pig.

Processing Data with Pig
Task outline
Map Reduce is very powerful but it requires a Java programmer, and the programmer may have to re-invent common
functions such as joining or filtering. This is the motivation behind Pig.
Pig is a platform for analyzing large data sets. It consists of a high-level programming language, Pig Latin, that opens
Hadoop to non-Java programmers. Pig Latin also provides common operations to group, filter, join, or sort data.
Pig provides an execution engine on top of Hadoop.
The Pig Latin script is converted to Map Reduce code, which will be executed in your cluster.
In this lab, you will process the Customers data stored in HDFS. You will perform basic tasks such as filtering rows, mapping, aggregating and sorting data, and storing your results in HDFS.
At the end of this lab, your Job will look like the following:

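For comparison only, the pipeline you are about to build with Pig components (load, filter, map, aggregate, sort, store) computes roughly the same result as the following HiveQL statement run against the CustomersData table from the previous lesson. This sketch is given to clarify the logic; the lab itself runs Pig Latin generated by the Studio, not Hive:

select ProductCategory, Gender, count(*)
from CustomersData
where State = 'California'
group by ProductCategory, Gender
order by ProductCategory;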
Writing data to HDFS
Earlier in the course, you used the tHDFSPut component to copy a file from your local file system to HDFS. Another way
to write data to HDFS is to use the tHDFSOutput component, which writes a data flow to HDFS.
You will create a Job that reads the CustomersData.csv file and writes it to HDFS using the tHDFSOutput component.
1. SWITCH TO THE Integration PERSPECTIVE
In the upper-right corner of the Studio, click Integration to switch your Studio to the Integration perspective.
2. CREATE A NEW STANDARD JOB
Create a new Standard Job and name it WritingHDFS.
3. ADD A tFileInputDelimited COMPONENT
Add a tFileInputDelimited component and open the Component view.
4. READ THE INPUT FILE
Configure the tFileInputDelimited component to read the CustomersData.csv file.
a. Next to the File name box, click (...) and navigate to "C:\StudentFiles\BDBasics\CustomersData.csv".
b. Set the Row Separator to "\n" and the Field Separator to ";".

c. Set the Schema type to Repository and browse to find the CustomersData generic schema metadata.
Your configuration should be as follows:

5. ADD A tHDFSOutput COMPONENT
Add a tHDFSOutput component and connect it with the Main row.
6. CONFIGURE THE OUTPUT FOLDER
Configure tHDFSOutput to write in the /user/student/BDBasics/CustomersData folder.
a. Double-click the tHDFSOutput component to open the Component view.
b. In the Property Type list, select Repository. Then, find HDFSConnection.
c. In the Schema list, select Repository. Then, find the CustomersData generic schema metadata.
d. In the File Name box, enter
"/user/student/BDBasics/CustomersData".
Your configuration should be as follows:

e. Save your Job.
This Job will be executed later from another Job, using a tRunJob component.

Load data
You will create a new Job named PigProcessing, which will process the customers' data.
1. CREATE A NEW STANDARD JOB
Create a new Standard Job and name it PigProcessing.
2. ADD THE WritingHDFS JOB
Drag the WritingHDFS Job from the Repository and drop it in the Designer. It will appear as a tRunJob component labelled WritingHDFS:

3. ADD A tPigLoad COMPONENT
Add a tPigLoad component and connect the WritingHDFS component to tPigLoad with an OnSubjobOk trigger.
4. SELECT THE EXECUTION MODE
Double-click tPigLoad to open the Component view.
Pig Jobs can be executed in local or Map/Reduce mode. This is configured in the tPigLoad component. For this lab, you will
use the Map/Reduce mode.
5. CONFIGURE THE CONNECTION TO HDFS
As you did for the tHDFSOutput component, set the Property Type to Repository using the HDFSConnection
metadata.
6. CONFIGURE THE SCHEMA
Set the Schema type to Repository using the CustomersData generic schema metadata.
7. CONFIGURE THE INPUT FILE
In the Input file URI box, enter
"/user/student/BDBasics/CustomersData".
Your configuration should be as follows:

Filter and map Data
You will continue to build your Job with the next two components, in order to filter and map your data. The goal here is to extract customers living in California and get the corresponding gender and product category.
You will use the tPigFilterRow component to filter the State. Then, you will use the tPigMap component to extract the data from the
Gender and ProductCategory columns.
1. ADD A tPigFilterRow COMPONENT
Add a tPigFilterRow component and connect it with a Pig Combine row.
2. CONFIGURE THE FILTERING
Add a filter to select customers living in California.
a. Open the Component view.
b. Select the Use advanced filter option.
c. In the Filter box, enter "State matches 'California'":

3. ADD A tPigMap COMPONENT AND CONNECT IT
Add a tPigMap component, connect it with the Pig Combine row, and then open the Component view.
4. CREATE THE MAPPING OUTPUT
Create a new output named MappingOut.
5. CONFIGURE THE MAPPING
Select the ProductCategory and Gender columns in the row2 table and drag them to the MappingOut table.
Your mapping should be as follows:

6. SAVE THE MAPPING
Click OK to save the mapping.

Aggregate and sort data
Now that you have extracted the data, you will aggregate and sort it. The goal here is to have the number of men and women per ProductCategory, and then sort them by alphabetical order.
1. ADD A tPigAggregate COMPONENT
Add a tPigAggregate component, and connect it with the MappingOut row. Then, open the Component view.
2. CONFIGURE THE OUTPUT SCHEMA
Add a new column named Count, of type Integer, to the output table.

a. Edit the schema and add a column named Count with an Integer type:

b. Click OK to save the schema.
3. CONFIGURE THE AGGREGATION
Configure the Group by and Operations tables to aggregate your data by ProductCategory and Gender.
a. Click the green plus sign below the Group by table to add the ProductCategory and Gender columns.
b. Click the green plus sign below the Operations table.
c. In the Additional Output Column list, select Count.
d. In the Function list, select count.
e. In the Input Column list, select ProductCategory.
Your configuration should be as follows:

4. ADD A tPigSort COMPONENT
Add a tPigSort component, connect it with the Pig Combine row, and then, open the Component view.
5. CONFIGURE THE SORTING
Configure the Sort key table to sort ProductCategory in Ascending order.

a. Click the green plus sign below the Sort key table.
b. Configure to sort the ProductCategory column by ascending order, as shown below:

Store results
Once processed, the last step is to store your results on HDFS.
1. ADD A tPigStoreResult COMPONENT
Add a tPigStoreResult component, connect it with the Pig Combine row, and then, open the Component view.
2. USE THE HDFSConnection METADATA
Set the Property Type to Repository and then select HDFSConnection.
3. CONFIGURE THE OUTPUT FOLDER
Write the results in the /user/student/BDBasics/Pig/out folder.
a. In the Result Folder URI, enter "/user/student/BDBasics/Pig/out".
b. Select the Remove directory if exists option.
This will allow you to run the Job again as needed.
c. Your configuration should be as follows:

Run the job and verify the results
1. RUN YOUR JOB AND VERIFY THE RESULTS
Run the Job and inspect the results in the Console.
a. Scrolling down the Console, you will find information about the WritingHDFS Job:

The WritingHDFS Job executes successfully, and then the Pig components start to work.

b. If you continue to investigate in the logs, you will find the Pig requests equivalent to each Pig component. The first
Pig component is the tPigLoad:

c. Right after, you will find Pig requests for tPigFilterRow, tPigMap, tPigAggregate, tPigSort and tPigStoreResult:

d. If you continue to scroll down, you will see several MapReduce Jobs submitted, as well as their statistics. This is the
final report of all Map Reduce Jobs execution:

There, you can see that the execution was successful. First, 1 million rows were read from the CustomersData file.
Then, 14 records were written to the /user/student/BDBasics/Pig/out folder.
2. In Hue, using the File Browser, navigate to /user/student/BDBasics/Pig/out.

3. Click the part-r-00000 file to see the result:

4. In Hue, in the Job Browser, you will be able to see all the Jobs submitted by the Studio. The Jobs have different IDs but they
have the same name: BDBASICS_PigProcessing_0.1_tPigLoad_1.
All your Jobs have succeeded:

You can now continue investigating data processing on HDFS and move to the next exercise.

Processing Data with Big Data Batch Job
Task outline
The last way to process data covered in this course is to use Big Data Batch Jobs. First, you will create a Standard Job and
then convert it to a Big Data Batch Job using the Map Reduce framework.
The Job will be very similar to the PigProcessing Job. At the end of this exercise, your Job will look like the following:

The first step will be to read the data using the HDFSConnection metadata.

Read and filter data
1. CREATE A NEW STANDARD JOB
Create a new Standard Job and name it StandardProcessing.
2. ADD A tHDFSInput
Add a tHDFSInput component which uses the HDFSConnection metadata.
a. In the Repository, under Metadata/Hadoop Cluster/TrainingCluster/HDFS, click HDFSConnection and drag it to
the Designer.
b. Select tHDFSInput in the Components list and click OK:

c. Double-click the tHDFSInput component to open the Component view.
3. CONFIGURE tHDFSInput
Configure tHDFSInput to read the CustomersData file.
a. In the Schema list, select Repository. Then, navigate to find the CustomersData generic schema metadata.
b. Next to the File Name box, click (...) and browse to find
/user/student/BDBasics/CustomersData.

Your configuration should be as follows:

4. ADD A tFilterRow COMPONENT
Add a tFilterRow component and connect it with the Main row.
5. FILTER THE DATA
Configure the Conditions table to filter customers living in California.
a. In the Component view, click the green plus sign below the Conditions table to add a new line.
b. In the InputColumn column, select State.
c. In the Function column, select Match.
d. In the Operator column, select Equals.
e. In the Value column, enter "California".
Your configuration should be as follows:

In the next section, you will map and aggregate your data.

Map and Aggregate Data
1. ADD A TMAP COMPONENT
Add a tMap component and connect it with the Filter row.
2. CREATE A MAPPING OUTPUT
Open the mapping editor and create a new output named out.

3. CONFIGURE THE MAPPING
In the row2 table, select Gender and ProductCategory and drag to the out table, as follows:

4. SAVE THE MAPPING
Click OK to save your configuration.
5. ADD A tAggregateRow COMPONENT
Add a tAggregateRow component and connect it with the out row.
6. CONFIGURE THE AGGREGATION AS PREVIOUSLY
Configure the Group by and Operations tables to aggregate your data by ProductCategory and Gender. The result of
the aggregation is saved in a new output column named Count.
a. Open the Component view.
b. Edit the schema and add a new column named Count with an Integer type:

c. Click OK to save the schema.
d. Click the green plus sign below the Group by table and add the ProductCategory and Gender columns.
e. Click the green plus sign below the Operations table.
f. In the Output column list, select Count.
g. In the Function list, select count.
h. In the Input column position list, select ProductCategory:

The last steps are to sort the results and then to save them on HDFS.

Sort and save Data
1. ADD A tSortRow COMPONENT
Add a tSortRow component, connect it with the Main row, and then, open the Component view.
2. CONFIGURE THE SORTING
Configure the Criteria table to sort ProductCategory in ascending order.
a. Click the green plus sign below the Criteria table.
b. In the Schema column, select ProductCategory.
c. In the sort num or alpha? column, select alpha.
d. In the Order asc or desc column, select asc:

3. ADD A tHDFSOutput
Add a tHDFSOutput component which uses the HDFSConnection metadata.
a. In the Repository, under Metadata/Hadoop Cluster/TrainingCluster/HDFS, click HDFSConnection and drag it
to the Designer.
b. In the Components list, select tHDFSOutput.
c. Connect it with the Main row and open the Component view.
4. CONFIGURE THE OUTPUT FILE
Write the data in the /user/student/BDBasics/CustomersDataOut folder.
a. In the File Name box, enter
"/user/student/BDBasics/CustomersDataOut".
b. In the Action list, select Overwrite.
Your configuration should be as follows:

Your Job is now ready to run.

Run the job and verify the results
1. RUN THE JOB
Run your Job and check the results in the Console:

Your Job should execute successfully.
2. CONNECT TO Hue AND VERIFY THE RESULTS
In Hue, using the File Browser, navigate to /user/student/BDBasics and open CustomersDataOut:

The results should be the same as in the previous lab, Processing Data with Pig.
You will now convert this Standard Job into a Big Data Batch Job, which will use the Map Reduce framework.

Convert to a Map Reduce batch Job
Using the Studio, you can convert a standard Job to a Big Data Batch Job, and choose between a Map Reduce or Spark framework.
In this lesson, you will focus on Map Reduce Jobs.
Instead of converting your current Job, you will duplicate it as a Map Reduce based Batch Job.
1. DUPLICATE YOUR JOB
Duplicate the StandardProcessing Job and name the duplicate MRProcessing.
MRProcessing is a Big Data batch Job using the MapReduce framework.
a. In the Repository, right-click the StandardProcessing Job, and then, click Duplicate.
b. In the Input new name box, enter MRProcessing.
c. In the Job Type list, select Big Data Batch.
d. In the Framework list, select MapReduce:

e. Click OK.
2. CONFIGURE THE CONNECTION TO THE HADOOP CLUSTER
The Hadoop cluster configuration is set at the Job level in Big Data Batch Jobs. So, before the duplication completes, you will be asked to choose from which component the Hadoop configuration will be retrieved:

3. SELECT A COMPONENT
Select the HDFSConnection component with the blue arrow and click OK.
4. OPEN THE JOB
In the Repository, under Big Data Batch Jobs, you will find your MRProcessing Job. If you haven't done so already,
double-click to open it:

5. CONFIGURE THE OUTPUT FOLDER
Configure tHDFSOutput to write the results in the /user/student/BDBasics/CustomersDataOut_MR folder.
a. Double-click the tHDFSOutput component to open the Component view.
b. In the Folder box, enter
"/user/student/BDBasics/CustomersDataOut_MR".
Your standard Job is now converted to a Map Reduce Job.

Run the Job and verify the results
Before running your Job, you will check that the Hadoop cluster configuration is correct.
1. OPEN THE RUN VIEW
Open the Run view of the MRProcessing Job.
2. VERIFY THE HADOOP CLUSTER CONNECTION INFORMATION
Click the Hadoop Configuration tab:

As the configuration has been retrieved from a tHDFSInput component, some settings are missing because they are not necessary to read from or write to HDFS.
The Resource Manager address is necessary to run a Map Reduce Job, and is currently missing. If you run the Job, it will fail
with the following error message:

3. CONFIGURE THE RESOURCE MANAGER ADDRESS
To fix this, you can either use your TrainingCluster metadata, or set the Resource Manager to "hadoopcluster:8032":

4. RUN YOUR JOB
Go back to the Basic Run tab and run your Job.
5. OBSERVE THE DESIGNER
You can check that your Job is running in the Designer view. As the different Map Reduce Jobs execute, you will see the progress bars changing.
At first they are empty; then, when a Map or Reduce task starts to execute, the progress bar becomes red. When a task completes successfully, the progress bar becomes green:

6. OBSERVE THE CONSOLE
You can also follow the execution in the Console:

You should see three reports, one for each Map Reduce Job launched by the Studio.
The progress of the Map Reduce tasks is reported as a percentage.
7. CONNECT TO Hue
Use the Job Browser and the File Browser to check the execution and the results.
a. From the Console, you can get your Job ID and find it in the Hue Job Browser:

b. Using the File Browser in Hue, check the results saved in the /user/student/BDBasics/CustomersDataOut_MR folder.

c. Click the part-r-00000 file:

You should have the same results as in previous labs.
You have now covered the different ways to process tables and data stored on HDFS.

Next step
You have almost finished this section. Time for a quick review.

Review
Recap
In this chapter, you covered how to process tables and data stored on HDFS.
The first part of the chapter was dedicated to Hive tables. You built a Job to extract data of interest from your Hive table with a tHiveInput component. You processed the data, saved the result in HDFS with a tHDFSOutput component, and transferred the result to a Hive table using tHiveCreateTable and tHiveRow.
Next, you used the Profiling view of the Studio to perform different levels of analyses. You started at the connection level, moved to
the Tables and column level, and ended with a custom analysis on the ProductCategory column. Each request was run as a Map Reduce Job in your cluster.
The second part of the chapter was dedicated to data processing with Pig and Big Data Batch Jobs.
You created a Job using Pig components to process your data. Each Pig request was executed as a Map Reduce Job in your cluster.
Next, you created a standard Job to process your data and reproduced the results obtained with Pig. Then, you duplicated the standard Job and created a Big Data Batch Job using the Map Reduce Framework.

Further reading
If you want to discover more about data profiling, the Talend Data Quality trainings will help you. If you are interested in discovering
more about Big Data Batch Jobs, the Talend Big Data Advanced training will give you an overview of real life use cases using Big
Data Batch Jobs.
