Participant Guide Big Data Basics Version 6.3 edition 2 Copyright 2017 Talend Inc. All rights reserved. Information in this document is subject to change without notice. The software described in this document is furnished under a license agreement or nondisclosure agreement. The software may be used or copied only in accordance with the terms of those agreements. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or any means electronic or mechanical, including photocopying and recording for any purpose other than the purchaser's personal use without the written permission of Talend Inc. Talend Inc. 800 Bridge Parkway, Suite 200 Redwood City, CA 94065 United States +1 (650) 539 3200 Welcome to Talend Training Congratulations on choosing a Talend training course. Working through the course You will develop your skills by working through use cases and practice exercises using live software. Completing the exercises is critical to learning! If you are following a self-paced, on-demand training (ODT) module, and you need an answer to proceed with a particular exercise, use the help suggestions on your image desktop. If you can’t access your image, contact customercare@talend.com. Exploring You will be working in actual Talend software, not a simulation. We hope you have fun and get lots of practice using the software! However, if you work on tasks beyond the scope of the training, you could run out of time with the environment, or you could mess up data or Jobs needed for subsequent exercises. We suggest finishing the course first, and if you have remaining time, explore as you wish. Keep in mind that our technical support team can’t assist with your exploring beyond the course materials. For more information Talend product documentation (help.talend.com) Talend Community (community.talend.com) Sharing This course is provided for your personal use under an agreement with Talend. You may not take screenshots or redistribute the content or software. Intentionally blank CONTENTS | Participant Guide CONTENTS LESSON 1 Big Data in Context Concepts 8 LESSON 2 Basic Concepts Concepts Basic Concepts Opening a Project Monitoring the Hadoop Cluster Creating Cluster Metadata Review 16 21 22 27 29 39 LESSON 3 Reading and Writing Data in HDFS Concepts Reading and Writing Data in HDFS Storing a File on HDFS Storing Multiple files on HDFS Reading Data from HDFS Storing Sparse Dataset with HBase Challenges Solutions Review 42 46 47 55 58 61 70 71 74 LESSON 4 Working with Tables Concepts Working With Tables Importing Tables with Sqoop Creating Tables with Hive Review 76 81 82 89 99 LESSON 5 Processing Data and Tables in HDFS Concepts Processing Data and Tables in HDFS Processing Hive Tables with Jobs Profiling Hive Tables - Optional Processing Data with Pig 102 107 108 116 128 Processing Data with Big Data Batch Job Review CONTENTS | Participant Guide 136 147 LESSON 1 Big Data in Context This chapter discusses: Concepts 8 Concepts What is Big Data? How can we define it? One simple definition is provided by the three Vs. The first V corresponds to volume. Nowadays, the amount of data that’s stored and processed is huge. This is a primary characteristic of Big Data: How big it is. The second V stands for variety. You can collect different types of data from different sources. This presents a challenge: How do you deal with these different types of data? The last V stands for velocity. Data arrives from different sources at different speeds. 
For example, social media, such as Twitter or Facebook, generates data at increasing speeds. Big Data comes with many challenges that Hadoop tries to solve. Today we’re generating more data than ever: from financial transactions, sensor networks, social media, and server logs, just to name a few. We’re generating data faster than ever because of automation and user-generated content. This data has many valuable applications for marketing analysis, product recommendations, and even fraud detection. And these are only a few examples. To extract this valuable information, the data must be processed. Fortunately, while storage capacity has increased, the cost of storage has decreased. Disk performance has also improved in recent years so from a business perspective, it’s more valuable to store and process the data than throw it away. Unfortunately, transfer rates have not improved as fast as storage capacity. Even if we can process data more quickly, accessing it is slow. Technically speaking, there are two main problems to solve: Large-scale data storage, and large-scale data analysis. 8 | Big Data Basics - Participant Guide Hadoop offers a solution to these problems by helping you store and process your data. HDFS is the Hadoop Distributed File System, where you can store your data. MapReduce will help you analyze your data. Hadoop runs on a cluster. A cluster is made of different elements. The most basic element is a node. A node is simple, cheap dedicated hardware, composed of CPU, RAM, and disk. Multiple nodes form a rack. And multiple racks form a cluster. This cluster is used to store and process your data. HDFS is the Hadoop distributed file system. Imagine you have a large file to store on HDFS. First it’ll be split into small pieces of 64 or 128 megabytes. This size is configurable. Then each piece will be distributed across the cluster. This enables faster processing because multiple nodes can operate on your large file simultaneously. Your file blocks are replicated three times by default and distributed in different data nodes. Then the address of the different blocks is stored in a special node called a name node. If one node fails, it’s still possible to find a copy on another node. Another benefit of duplicating and distributing file blocks is that you increase the data locality. That means you’ll have a better chance to find your blocks quickly, which will improve your computational performance. LESSON 1 | 9 Usually, high availability is configured with a standby name node. This standby name node is synchronized with the main name node while it’s available. If the main name node fails, the standby takes over. The key concept here is the name node. In order to work on your cluster, many Big Data components within Talend Studio require your name node address. To illustrate the use of the MapReduce framework we take a basic word count use case. Given some text files stored on HDFS, we want to count the number of times that certain words are mentioned. For example, suppose we want to know how many times the words Big Data, Talend, and MapReduce appear in the text files. The MapReduce framework processes your data in two stages. Stage one, map, and stage two, reduce. First, the mapper processes each file block independently. The result is given as key-value pairs. As the file blocks are distributed, the computation is done in parallel by nodes, each hosting a file block. This allows you to get your results faster than if you were doing this computation in a single node. 
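To make the map stage concrete, here is a minimal sketch of a word-count mapper written against Hadoop's Java MapReduce API. It is an illustration only, not code used in this course; the class and variable names are invented for the example.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each mapper works on one block (input split) of the file and emits
// a (word, 1) key-value pair for every word it encounters.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

Because every block is handled by its own mapper, these pairs are produced in parallel across the nodes that host the blocks.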
A reduce node is designated. Then the results are sent to the reducer. The results are shuffled and sorted before reduction. And finally, the reducer aggregates the result. 10 | Big Data Basics - Participant Guide The JobTracker splits our job into tasks, and dispatches them to a TaskTracker running on each node of your cluster. Each task can be either a map or a reduce task. Each TaskTracker gives updates on its task to the JobTracker, which will send information back to the client on the overall job progress. The JobTracker is one of Talend Studio’s entry points into your cluster. Just as for the name node, many components require that you provide the JobTracker address. To summarize, MapReduce version 1 handles processing and resource management and uses a static master-slave model. This solves some Big Data issues, but the resource management is not optimal. MapReduce version 1 has some drawbacks. It has an inflexible slot-based memory management model. Each TaskTracker is configured at startup to have a certain number of slots. And a task, map or reduce, is executed in a single slot. Each slot is configured with a maximum memory at startup of the cluster. Due to the limitations of MapReduce v1, a new framework has been created called YARN.YARN is designed to manage only resources.It facilitates the development of distributed applications of any kind and is not limited to MapReduce.For example, Spark can run on top of YARN. Spark is covered in the Big Data Advanced course. YARN provides daemons just as MapReduce does. The JobTracker daemon was introduced with MapReduce.In YARN, the JobTracker is split into two daemons. The first one is the ResourceManager, which administers resources on the cluster.The second daemon is the Application Master. It manages applications such as MapReduce applications. LESSON 1 | 11 Hadoop is an open-source project from Apache. It’s not actually a single product but instead is a collection of several components. Some of them are illustrated here. The first one is HDFS which stores your data files. An alternative data storage method is HBase, a No SQL database optimized for sparse data. Sparse data sets have a lot of NULL values. On top of HDFS sits YARN. To process your data using YARN, you can use MapReduce v2. Alternatively, you can write your own Java code, or use other options. With Hive, you can create tables on your cluster and then process them using a SQL-like query language. You can also use Pig, which is a scripting language. It is a higher-level language than Java, which makes it much easier to write your processing. Your Pig code will be automatically translated into mappers and reducers and then run on your cluster. Another option is to use Mahout, which is a machine learning library. Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores, such as relational databases. Oozie is an application used to schedule Hadoop jobs. It combines multiple jobs sequentially into one logical unit of work. Oozie supports Hadoop jobs for MapReduce, Pig, Hive, and Sqoop. Zookeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All these kinds of services are used by distributed applications. Hue is a set of web applications that allow you to interact with your cluster. Hue applications help you browse your files stored on HDFS, track your MapReduce Jobs, work with Hive, or use Oozie workflows. 
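To complete the word-count example introduced above, the reduce stage can be sketched as follows. Again, this is an illustrative example rather than course code, and it pairs with the mapper sketch shown earlier: after the shuffle and sort, each call to reduce receives one word together with all the 1s emitted for it and simply adds them up.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The shuffle and sort phase groups all values by key, so the reducer
// receives, for example, ("Talend", [1, 1, 1, ...]) and outputs ("Talend", total).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable count : counts) {
            total += count.get();
        }
        context.write(word, new IntWritable(total));
    }
}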
Those separate projects can be installed independently, or you can use a packaged version of Hadoop. Several companies offer a prepackaged version of Hadoop, for example Hortonworks, Cloudera, and Amazon. Talend Studio for Big Data supports all of these distributions and much more. 12 | Big Data Basics - Participant Guide LESSON 1 | 13 Intentionally blank LESSON 2 Basic Concepts This chapter discusses: Concepts 16 Basic Concepts 21 Opening a Project 22 Monitoring the Hadoop Cluster 27 Creating Cluster Metadata 29 Review 39 Concepts 16 | Big Data Basics - Participant Guide LESSON 2 | 17 18 | Big Data Basics - Participant Guide LESSON 2 | 19 20 | Big Data Basics - Participant Guide Basic Concepts Overview During this course, you will be assigned a preconfigured Hadoop cluster. The Hadoop cluster was built with a Cloudera CDH 5.8 distribution. The purpose of this exercise is to try different functions, not to have a production cluster. So, this training cluster is in pseudo-distributed mode. That means that there is only one node. This is enough to understand the different concepts in this training. Before starting to create Jobs to read and write data to and from HDFS, there are some prerequisites. First, you will open a new project in the Talend Studio. Then, you will connect to your cluster to monitor it, using Cloudera Manager and Hue. For your Jobs to succeed, the cluster must be up and running. So, if a service fails you need to know how to restart it. Finally, you will create your cluster Metadata. This step will avoid repetitive configuration of components in the Jobs you will create. Objectives After completing this lesson, you will be able to: Open a new project in Talend Studio Connect to the Cloudera Manager Connect to Hue Create Cluster Metadata manually, using configuration files, and automatically Before you begin Be sure that you are working in an environment that contains the following: A properly installed copy of Talend Studio A properly configured Hadoop cluster The supporting files for this lesson Everything has already been set up in your training environment. The first step is to open your training project in the Talend Studio. LESSON 2 | 21 Opening a Project Task outline Before developing Talend Jobs, you will need a project to store them in. In this exercise, you will open a pre-existing project for your Big Data Jobs. Accessing the training environment For this course, two virtual machines have been set up for you. The first machine is a Windows machine where the Talend Studio is installed. This is where you will create your Job. The second machine is a Linux machine hosting a Cloudera CDH5.8 cluster. You do not need to access this machine. In fact, the cluster is monitored from the Windows machine through a Web Browser. To connect to the Windows machine, in your Skytap environment, run the machine named TalendStudio. To start your training cluster, run the Linux machine. Run 1. START TALEND STUDIO Double-click the Talend Studio shortcut on your desktop to run Talend Studio. 2. OPEN YOUR TRAINING PROJECT Using the Local connection, open the BDBasics project. 22 | Big Data Basics - Participant Guide a. On the On Connection list, make sure that the Local (Local) connection is selected. b. Click Select an existing project ,then click BDBasics in the Project list. The project has already been created for you. Note: You may have a different version of the Studio in your training environment. 
However, you will have the same functions as in the Talend Real-time Big Data Platform. c. Click Finish to open the project. Talend Forge 1. LOG IN When the Connect to TalendForge window appears, log in with your existing Talend account, or create a new one. LESSON 2 | 23 24 | Big Data Basics - Participant Guide 2. When the initialization is complete, Talend Studio displays this Welcome page. LESSON 2 | 25 Start Click Start now!. The Talend Studio main window appears, ready for you to create Jobs: The next step is to connect to your Hadoop cluster. 26 | Big Data Basics - Participant Guide Monitoring the Hadoop Cluster Task outline In order to develop Jobs using HDFS, HBase, and Hive to store your data, you need a running Hadoop cluster. For this training, a Hadoop cluster has been installed using a Cloudera CDH 5.8 distribution. It is running the core Hadoop functions, such as HDFS, YARN, HBase, Hive or Sqoop. It also runs Hue, which will help you to browse your files and tables. Connect to Cloudera Manager 1. CONNECT TO THE Cloudera Manager Click the Cloudera Manager shortcut in your web browser. Or navigate to the url http://hadoopcluster:7180. If you can't access the web page, wait a couple of minutes so that the service starts, and try again. 2. LOGIN Enter your credentials and log in. a. In the Username box, enter admin then, in the Password box, enter admin. b. Click Login. Check services' health The Cloudera Manager is a web interface to monitor and perform administration tasks on a cluster. It helps to check services' health and to restart services individually if needed. LESSON 2 | 27 If a service is red flagged, it means it is in bad health. You can restart it individually by clicking the black arrow on the right side of the service then clicking Restart. If you can see interrogation points on a white background instead of green or red ligths, restart the Cloudera Management services. This is done by clicking the right arrow next to Cloudera Management services and then clicking Restart. This should fix your issue and you should be able to monitor your cluster as shown above. Otherwise, refer to the troubleshooting guide. Connect to Hue Hue is a web interface that helps to check on what is done on your cluster. You can browse your files and tables. It is also possible to track Map Reduce tasks. 1. CONNECT TO Hue Navigate to the Hue web interface. a. Open a new tab in your web browser. b. Click the Hue shortcut or navigate to http://hadoopcluster:8888. 2. LOGIN Enter your credentials and log in. a. In the Username box, enter student. In the Password box, enter training. b. Click Sign in. You are now logged in Hue. The next step is to create Metadata on your Hadoop Cluster. 28 | Big Data Basics - Participant Guide Creating Cluster Metadata Task outline To be able to run Big Data Jobs, Talend Studio needs to be connected to a running Hadoop Cluster. The connection information can be configured in each component individually, or the configuration can be stored as metadata in the Repository and be reused as needed in the different components. Instead of individual connection configuration of components and Jobs, you will use the second approach. You will now create your cluster metadata using three different methods. First, you will configure the connection to the cluster manually, next, from the Hadoop configuration files, and last, using a Wizard. Manual configuration The first way to create a Hadoop cluster Metadata is to create it manually. 
This requires that you already have information about your cluster, such as the Namenode URI, and either the Resource Manager address or the Job Tracker URI- depending on if you will use YARN or Map Reduce v1. You may also need other information such as the Job History server, or the Resource Manager Scheduler location. 1. CREATE A NEW HADOOP CLUSTER METADATA Create a new Hadoop cluster metadata named TrainingCluster_manual. a. In the Studio, in the Repository, under Metadata, locate Hadoop Cluster. b. Right-click Hadoop Cluster then, click Create Hadoop Cluster: LESSON 2 | 29 c. In the Name box, enter TrainingCluster_manual, then click Next. The Hadoop Configuration Import Wizard opens: 2. SELECT THE DISTRIBUTION AND THE VERSION In the Distribution list, select Cloudera, and in the Version list, select Cloudera CDH5.8(YARN mode). 3. SELECT THE MANUAL CONFIGURATION Select Enter manually Hadoop services, then, click Finish. 30 | Big Data Basics - Participant Guide The Hadoop Cluster Connection window opens. 4. OBSERVE THE DEFAULT CONFIGURATION Check that the Distribution information is correct. There are a few values preconfigured, such as the Namenode URI and the Resource Manager address. The localhost value and the 0.0.0.0 value must be changed to the IP address or to the DNS name of your cluster. If the cluster was configured keeping the default port values, then 8020 and 8032 are the host port for the Namenode and the Resource Manager respectively. The Hadoop cluster has already been configured for you with the name hadoopcluster. 5. CONFIGURE THE CONNECTION Replace the localhost and 0.0.0.0 values by the hostname of the cluster. a. Configure the connection as follows: Namenode URI: hdfs://hadoopcluster:8020 Resource Manager: hadoopcluster:8032 Resource Manager Scheduler: hadoopcluster:8030 Job History: hadoopcluster:10020 Staging directory: /user User name: student LESSON 2 | 31 b. Check your configuration. c. Click Finish. Your cluster metadata appears under Repository/Metadata/Hadoop Cluster. Discovering Hadoop configuration files You can also create your Metadata using the Hadoop configuration files. The configuration files have been copied in the HadoopConf folder under C:\StudentFiles\BDBasics. The Hadoop configuration files are .xml files that describe each parameter value of your Hadoop cluster. In the HadoopConf folder, you will find four files: core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml. The files were copied from a Hadoop cluster, installed with the same distribution and version as your training cluster. 1. EXAMINE core-site.xml Open core-site.xml with Notepad++: 32 | Big Data Basics - Participant Guide The first property that appears in this file is the location of the Namenode: hdfs://hadoopcluster:8020. 2. EXAMINE yarn-site.xml Open yarn-site.xml with Notepad++: In this file, you will find the Resource Manager address and the Resource Manager Scheduler address. 3. EXAMINE mapred-site.xml Open mapred-site.xml with Notepad++: In this file, you will find the Job History address. You need all this information to create your Cluster Metadata. If you choose to create the cluster Metadata using the configuration files, the files will be parsed to find all these values. Configuration with the Hadoop configuration files 1. CREATE A NEW HADOOP CLUSTER METADATA Create a new Hadoop cluster metadata and name it TrainingCluster_files. LESSON 2 | 33 a. Right-click Hadoop Cluster, then click Create Hadoop Cluster. b. 
In the Name box, enter TrainingCluster_files then, click Next. 2. SELECT THE DISTRIBUTION AND THE VERSION In the Distribution list, select Cloudera and in the Version list, select Cloudera CDH5.8(YARN mode). 3. SELECT A CONFIGURATION FROM FILES Select Import configuration from local files,then click Next. 4. LOCATE THE CONFIGURATION FILES Click Browse... then locate the configuration files under C:\StudentFiles\BDBasics\HadoopConf. Click OK. 5. FINALIZE THE CONFIGURATION Set the user name to student and finalize the metadata creation. a. Click Finish. b. The configuration is almost done, except for the user name. In the User name box, enter student, then click Finish. Your cluster's metadata appears under Repository/Metadata/Hadoop Cluster. Automatic configuration The last way to configure your metadata is to connect to the Cloudera Manager, so that all of the connection information will be retrieved automatically to create the cluster's metadata. 1. CREATE A NEW HADOOP CLUSTER METADATA Create a new Hadoop cluster metadata and name it TrainingCluster a. Right-click Hadoop Cluster, then click Create Hadoop Cluster. b. In the Name box, enter TrainingCluster, then, click Next. 2. SELECT THE DISTRIBUTION AND THE VERSION In the Distribution list, select Cloudera and in the Version list, select Cloudera CDH5.8(YARN mode). 3. SELECT AN AUTOMATIC CONFIGURATION Select Retrieve configuration from Ambari or Cloudera then, click Next. 34 | Big Data Basics - Participant Guide 4. SET THE Cloudera Manager URI In the Manager URI (with port) box, enter http://hadoopcluster:7180. 5. ENTER CREDENTIALS In the Username and Password boxes, enter admin. 6. CONNECT TO THE Cloudera Manager Click Connect. This will list all the clusters administered by the Cloudera Manager: 7. FETCH SERVICES Click Fetch. The wizard will retrieve the configurations files of all running services in your cluster: LESSON 2 | 35 For each service listed, the wizard can create the corresponding metadata in the Repository. For this lab, you will create metadata only for YARN. You will investigate later how to create metadata for HDFS, Hive and HBase. 8. CREATE METADATA ONLY FOR YARN You will create metadata only for the YARN service. a. Click Deselect All then, select YARN. b. Click Finish. 36 | Big Data Basics - Participant Guide 9. SET THE USERNAME Enter student in the User name box. 10. CHECK SERVICES Click Check Services, to check if you succeeded in connecting to the Namenode and the Resource Manager. LESSON 2 | 37 If the progress bars go up to 100%, and you have no error message, then your connection was successful. 11. FINALIZE THE HADOOP CLUSTER METADATA CREATION Now, you have to complete the last steps to create the metadata in the Repository. a. Click Close. b. Click Finish to finalize the metadata creation. It will appears in the Repository under Metadata/Hadoop Cluster. You have now created the same metadata three times using different techniques. For the next lessons, the last metadata, named TrainingCluster, will be used. Next step You have almost finished this section. Time for a quick review. 38 | Big Data Basics - Participant Guide Review Recap In this lesson, you covered the key base knowledge required to be efficient in building and running Big Data Jobs. First, you opened your project, then, you connected to Cloudera Manager and Hue. Cloudera Manager will help you to restart a service if needed. Hue will help you to browse your files and tables, and track the execution of your Map Reduce Jobs. 
You also learned how to create Metadata on your cluster. You created the same Metadata three times using different techniques, so you can use any of them for your Jobs. For the next lessons, the TrainingCluster Metadata will be used. Feel free to try any of the three Metadata. The results will be the same. If not, that means that a metadata is not well configured. LESSON 2 | 39 Intentionally blank LESSON 3 Reading and Writing Data in HDFS This chapter discusses: Concepts 42 Reading and Writing Data in HDFS 46 Storing a File on HDFS 47 Storing Multiple files on HDFS 55 Reading Data from HDFS 58 Storing Sparse Dataset with HBase 61 Challenges 70 Solutions 71 Review 74 Concepts 42 | Big Data Basics - Participant Guide LESSON 3 | 43 44 | Big Data Basics - Participant Guide LESSON 3 | 45 Reading and Writing Data in HDFS Overview Hadoop's file system—HDFS—scales to hold petabytes of data. In this lesson you will use Talend Big Data components to read and write data to HDFS. First, you will read a text file that stores customer information from your local system and write it to HDFS. Then, you will read a collection of customer Twitter messages, stored in separate files, and write them individually to HDFS. Next, you will develop a Job to read a designated subset of the Twitter files back from HDFS. Finally, you will simulate a sparse data set write it to HDFS using HBase dedicated components. Objectives After completing this lesson, you will be able to: Write text files to HDFS Read text files from HDFS Use HBase to store sparse data on HDFS Configure connections to HDFS and HBase Before you begin Be sure that you are in a working environment that contains the following: A properly installed copy of the Talend Studio A properly configured Hadoop cluster The supporting files for this lesson The first step is to create a new Job to read a local text file and write it to HDFS. 46 | Big Data Basics - Participant Guide Storing a File on HDFS Task outline Your first step is to store text data on HDFS. In this lab, you are saving a file containing customer information such as first name, last name, city, state, etc... As you did previously for your cluster connection, it is possible to create metadata to get connected to HDFS. Once connected to HDFS, you will be able to read and write files on HDFS. Configure HDFS connection Now you will create metadata for an HDFS connection. 1. CREATE HDFS CONNECTION METADATA Right-click TrainingCluster under Repository>Metadata>Hadoop Cluster, then click Create HDFS: 2. NAME THE METADATA In the Name box, enter HDFSConnection , then click Next: LESSON 3 | 47 A default configuration is proposed, but you can adjust it to your needs. For this training, you will keep the default values. This means that your username to connect to HDFS will be student, "\n" will be the row separator, and ";" will be the field separator. 3. CHECK THE CONNECTION Click Check. You should quickly be connected to your cluster, and you will have the following success message: 4. FINALIZE THE METADATA CREATION Click OK , then click Finish to create your HDFS connection Metadata. It will appear in the repository, below the TrainingCluster metadata: 48 | Big Data Basics - Participant Guide Create a Job to write data to HDFS In the C:\StudentFiles\BDBasics folder, you will find the CustomersData.csv file. You will create a Job that will read this file and write it to HDFS. 1. CREATE A NEW STANDARD JOB Create a Standard Job named PutCustomersData. a. 
In the Repository, right-click Job Designs , then click Create Standard Job: b. Name your Job PutCustomersData. 2. ADD A tHDFSPut COMPONENT Place a tHDFSPut component on the design workspace. You use this component to move files from a local file system to HDFS. 3. CONFIGURE THE COMPONENT TO USE THE HDFSConnection METADATA Configure the tHDFSPut component to use the HDFSConnection metadata created previously. a. Double-click to open the Component view. b. In the Property Type list, select Repository. c. Click (...) , then locate the HDFSConnection metadata that you previously created. 4. CONFIGURE THE LOCAL FOLDER To configure the Local directory, click (...) , then navigate to "C:\StudentFiles\BDBasics". 5. CONFIGURE THE TARGET FOLDER Configure tHDFSPut to overwrite the /user/student/BDBasics/Customers folder. a. In the HDFS directory box, enter /user/student/BDBasics/Customers. b. In the Overwrite file list, select always. This tells the component to always replace existing files when a new file of the same name is being saved. 6. READ THE CUSTOMERS DATA FILE Configure the Files table to read the CustomersData.csv file. a. Click the green plus sign below the Files table. b. In the Filemask column, enter "CustomersData.csv". c. Verify your configuration. LESSON 3 | 49 Next, you will run your Job and check the result in Hue. Run the Job and verify the results 1. RUN THE JOB AND VERIFY THE RESULTS IN THE CONSOLE Run your Job and follow the execution in the Console. At the end of execution, you should have an exit code equal to 0 and several other messages that prove that the Job successfully executed: 50 | Big Data Basics - Participant Guide Note: This is not the default Console output for a Standard Job. The BDBasics project has been configured with a Log4j level set to INFO. This allows to have more details about the execution of your Job. For example, you will get the name of the files copied and the number of files uploaded to HDFS. 2. CONNECT TO HUE AND VERIFY THE RESULTS Check your results using the File Browser of Hue. a. In your web browser, the page with Hue should be already opened. Otherwise, navigate to http://hadoopcluster:8888 and login with student/training. b. Click File Browser: LESSON 3 | 51 c. Navigate to /user/student/BDBasics/Customers/CustomersData.csv: This shows you the content of the CustomersData.csv file. The file was created by your Job in the new directory BDBasics/Customers that you specified. Examining the Console output - Winutils.exe error Take a few minutes to read the output in the Console. At the end of the execution of the PutCustomersData Job, you should have the following text in the Console: 52 | Big Data Basics - Participant Guide In the Console, you will get logs created by your cluster. By default, only logs with WARN or ERROR status are displayed. You may have noticed that there is an error displayed in the Console: You will see an error message regarding the winutils.exe binary not being available in the Hadoop binaries. In the current Job, this error won't prevent the Job to succeed but it could in some cases. You can fix this issue by downloading the winutils.exe file and then saving it in a tmp folder. This has been done for you in your training environment. 1. FIX THE ERROR Using the Advanced settings tab, set the Hadoop home directory to the location of the winutils.exe file. a. In the Run view, click Advanced settings. b. Select the Use specific JVM arguments option. c. Click New.... LESSON 3 | 53 d. 
In the Argument box, enter: -Dhadoop.home.dir=C:/tmp/winutils e. Click OK to save the new argument and run your Job again. The Job should run successfully without error messages: For the next Jobs built in this training, you can add a JVM argument to set the Hadoop home directory to C:\tmp\winutils, or you can safely ignore this error. Next, you will put multiple files to HDFS. Storing Multiple files on HDFS Task outline Your next step is to store a set of files on HDFS. In this scenario, you are saving a series of Twitter message files for later analysis. Create the Job Under the C:\StudentFiles\BDBasics\tweets_in folder, you will find tweets saved as text files. You will put these files on HDFS. 1. DUPLICATE THE JOB Duplicate the PutCustomersData Job and name the new Job PutTweets. a. Right-click the PutCustomersData Job in the Repository, then click Duplicate. b. In the Input new name box, enter PutTweets and in the Job Type list, select Standard: c. Click OK to duplicate the Job. Then, open PutTweets. 2. UPDATE THE LOCAL DIRECTORY Configure tHDFSPut to read from the C:/StudentFiles/BDBasics/tweets_in folder in your local file system. a. Double-click tHDFSPut to open the Component view. b. In the Local directory, locate C:/StudentFiles/BDBasics/tweets_in. c. In the Filemask column, enter "*". This selects all the files under C:\StudentFiles\BDBasics\tweets_in. 3. UPDATE THE TARGET DIRECTORY In the HDFS directory box, enter /user/student/BDBasics/tweets.
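For reference, the local-to-HDFS copy that tHDFSPut performs corresponds roughly to the following sketch written with the Hadoop FileSystem Java API. This is not the code Talend generates; it simply shows the same operation, using the NameNode URI, user name, and paths from this exercise.

import java.io.File;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutTweetsSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the training cluster's NameNode as the student user.
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://hadoopcluster:8020"), new Configuration(), "student");

        Path target = new Path("/user/student/BDBasics/tweets");
        fs.mkdirs(target);

        // File mask "*": copy every file found in the local tweets_in folder.
        for (File file : new File("C:/StudentFiles/BDBasics/tweets_in").listFiles()) {
            fs.copyFromLocalFile(new Path(file.getAbsolutePath()),
                    new Path(target, file.getName()));
        }
        fs.close();
    }
}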
, then navigate to /user/student/BDBasics/tweets. 5. CONFIGURE THE LOCAL DIRECTORY Configure tHDFSGet to overwrite the C:/StudentFiles/BDBasics/tweets_out folder. a. In the Local directory, enter "C:/StudentFiles/BDBasics/tweets_out". b. In the Overwrite file list, select always. This tells the component to always replace existing files when a new file of the same name is being saved. 6. ADD A CONDITION TO READ ALL THE 2012 TWEETS Configure the Files table to read the tweets corresponding to 2012. a. Click the green plus sign below the Files table. b. Replace "newline" with "2012*". You are using a file mask to request all files in the HDFS directory tweets with names that begin with the string "2012": 58 | Big Data Basics - Participant Guide Now, you will run the Job and verify the results in your local file browser. Run 1. RUN YOUR JOB AND VERIFY THE RESULTS IN THE CONSOLE Run your Job and follow the execution in the Console. At the end of execution, you should have an exit code equal to 0 and several other messages that prove that the Job successfully executed: 2. VERIFY THE RESULTS IN YOUR LOCAL FILE BROWSER In a file browser, navigate to C:\StudentFiles\BDBasics\tweets_out: LESSON 3 | 59 There are five tweet files that your Job read from HDFS and put on your local system. Because a file mask was used to limit the files selected for the operation, only files with names that begin with "2012" were read from HDFS. Now that you have worked with text files on HDFS, it is time to work with sparse dataset and experiment how to store them efficiently in HDFS. 60 | Big Data Basics - Participant Guide Storing Sparse Dataset with HBase Task outline Apache HBase is the Hadoop column-oriented database. It is an open-source, non-relational database modeled after Google's BigTable, and provides the same functionality in top of Hadoop and HDFS. HBase is useful when you need random, real-time read/write access to your Big Data. Furthermore, HBase can host very large tables -- billions of rows X millions of columns-- and is particularly well suited for sparse data sets. If your relational table looks like the table below (data missing in columns), it is considered "sparse" and is a good candidate for HBase. In this lab, you will generate sparse data and build a Job to store them to HDFS with HBase: LESSON 3 | 61 Generate sparse data First, you will open a Job that will generate a sparse data set. It will create a dataset that represents the number of connections per month to a website by customers identified by their Id. 1. IMPORT THE GenerateSparseData JOB From the JobDesigns.zip archive, import the GenerateSparseData Job. a. Under the C:\StudentFiles\BDBasics folder, you will find the JobDesigns.zip archive file. b. Import the Job named GenerateSparseData: This Job is composed of two components. The tRowGenerator component will generate an integer value for the customers ID, and random integer values to simulate the number of connections per month. The second component will save the data in a file named HBase_sample.txt under the C:\StudentFiles\BDBasics folder. 2. RUN THE JOB AND VERIFY THE RESULTS Run the Job, then locate and open C:\StudentFiles\BDBasics\HBase_sample.txt with Notepad++. As expected, there are a lot of 0 values in the data. Next, you will configure the connection to HBase. Configure HBase connection An alternative way to configure a connection is to use a dedicated component. You will now create a new Job. The first step will be to configure the connection to HBase. 1. 
CREATE A NEW STANDARD JOB Create a new standard Job and name it StoreSparseData. 2. ADD A tHBaseConnection COMPONENT Add a tHBaseConnection component and open the Component view. 62 | Big Data Basics - Participant Guide 3. CONFIGURE THE DISTRIBUTION AND THE VERSION OF YOUR CLUSTER In the Distribution list, select Cloudera and in the HBase version list, select Cloudera CDH5.8 (YARN mode). 4. CONFIGURE THE CONNECTION TO Zookeeper Configure the host name and the port of the Zookeeper service. a. In the Zookeeper quorum box, enter "hadoopcluster". b. In the Zookeeper client port box, enter "2181". Your configuration should be as follows: Read sparse data You will now continue to build the Job to store your data with HBase. 1. ADD A tFileInputDelimited COMPONENT Add a tFileInputDelimited component below tHBaseConnection. 2. CONNECT WITH A TRIGGER Connect tHBaseConnection to tFileInputDelimited with an OnSubjobOk trigger. 3. READ THE HBase_sample.txt FILE Configure tFileInputDelimited to read the "C:/StudentFiles/BDBasics/HBase_sample.txt" file, knowing that the row separator is "\n", the field separator is "\t" and that there is 1 Header row. 4. CONFIGURE THE SCHEMA The schema of the HBase_sample.txt file can be found in the GenerateSparseData Job. LESSON 3 | 63 a. In the GenerateSparseData Job, double-click tFileOutputDelimited to open the Component view. b. Edit the schema, select all the columns and copy the schema: c. In the StoreSparseData Job, double-click tFileInputDelimited to open the Component view. d. Edit the schema and paste the schema: Handle Null values HBase is well suited for sparse dataset because it does not persist Null values. In our current data set, there are a lot of zeros. You will process the data to find the zeros and replace them with Nulls. You will store the raw data set in HBase, as well as the processed data. 1. ADD A tReplicate COMPONENT Add a tReplicate component at the right side of tFileInputDelimited and connect it with the Main row. 2. ADD A tMap COMPONENT Add a tMap component at the right side of tReplicate and connect it with the Main row. 3. OPEN THE MAPPING EDITOR Double-click to open the tMap editor. 4. CREATE A NEW OUTPUT TABLE Add an output and name it out_with_NULLS. 64 | Big Data Basics - Participant Guide 5. COPY ALL INPUT COLMUNS IN THE OUTPUT TABLE Select all the columns in the row table and drag to the out_with_NULLS table. Note: You may have a different row index. 6. CONFIGURE THE MAPPING TO REPLACE ALL 0 VALUES BY Null VALUES Edit the expression for each month and modify it as follows: (row3.Jan==0)?null:row3.Jan This means that all 0 values will be replaced by a null. Note: The Jan value must be replaced as needed to fit all months. Your configuration should be similar to this: Click Ok. Save data to HBase Now you will add components to save raw and processed data to HBase. 1. ADD A tHBaseOutput COMPONENT At the right side of tMap, add a tHBaseOutput component and connect it with the out_with_NULLS row.Then, open the Component view. 2. USE THE EXISTING CONNECTION TO HBASE Select Use an existing connection and select tHBaseConnection_1 in the list. 3. Configure the schema Click Sync columns. 4. GIVE A NAME TO THE HBASE TABLE In the Table name box, enter data_withNulls. 5. SELECT THE APPROPRIATE ACTION In the Action on table list, select Drop table if exists and create. 6. CREATE THE COLUMN FAMILY NAMES To create HBase tables, a family name must be associated with each column. LESSON 3 | 65 a. 
To create the family names, click Advanced settings. b. In the Family parameters table, add 2 lines as follows: 7. CONFIGURE THE Families TABLE The Id family name should be used for the Id column and the Date family name for all other columns. a. In the Basic settings tab, set the Id column Family Name to "Id". b. Set the other column's Family Name to "Date". Your configuration should be as follows: 8. ADD ANOTHER tHBaseOutput COMPONENT AND CONFIGURE IT AS THE FIRST ONE Copy tHBaseOutput_1 and paste below the tMap component. 9. CONNECT IT TO tReplicateE Connect tHBaseOutput_2 to tReplicate with a Main row. 66 | Big Data Basics - Participant Guide 10. NAME THE TABLE RawData Open the Component view of tHBaseOutput_2 and change the Table name to "RawData": 11. ADD A tHBaseClose COMPONENT AND CONNECT IT Add a tHBaseClose component below the tFileInputDelimited component and connect it with an OnSubjobOk trigger. Now you can run your Job and check the results in Hue. Run the Job and verify the results in Hue 1. RUN YOUR JOB AND VERIFY THE RESULT IN THE CONSOLE Run your Job and check the results of the execution in the Console. LESSON 3 | 67 Your execution should be successful. Otherwise, double check your Job and check HBase health in the Cloudera Manager. 2. VERIFY THE RESULTS IN Hue Connect to Hue to check your results. a. Connect to Hue and click Data Browsers>HBase. This will give you the list of HBase tables. You should see Data_withNulls and RawData in the list: 68 | Big Data Basics - Participant Guide b. Click RawData: This is an extract of the RawData table. You will still find zeros, but if you compare with the content of Data_ withNulls: The Null values are not stored. Now that you have used HBase to store sparse data, it's time to move to the Challenge to test your knowledge. LESSON 3 | 69 Challenges Task outline Complete these exercises to further reinforce the skills you learned in the previous lesson. See Solutions for possible solutions to the exercises. Add Support File Develop a Job to write an XML file of support requests stored locally to HDFS. Configure the input and output targets as follows: Source file: C:/StudentFiles/BDBasics/support/support.xml Target file: /user/student/BDBasics/support/support.xml Double Up Orders Develop a second Job to write a file of duplicated orders to HDFS. Use the local file C:/StudentFiles/BDBasics/duplicated_orders as the source file. Put the file into the HDFS directory /user/student/BDBasics/erp, keeping the same file name. Use the schema stored in the file orders.xml and sort the file on column id before saving it. Hint: use tSortRow to sort the file. 70 | Big Data Basics - Participant Guide Solutions Suggested solutions These are possible solutions to the exercises. Note that your solutions may differ and still be acceptable. Put Support file 1. Use a tHDFSPut component to create the HDFS folder and write the target file to it. 2. Configure the component as shown in the following image: LESSON 3 | 71 Double-up orders 1. Use a tFileInputDelimited to read the file duplicated_orders from the local file system: 2. Import the schema from the file orders.xml: 3. Sort the data in ascending order by the column id by using the tSortRow component : 72 | Big Data Basics - Participant Guide 4. Use a tHDFSOutput component to rewrite the data to HDFS: 5. Click Overwrite in the Action list. 6. Run your Job and check the results in Hue. Next step You have almost finished this section. Time for a quick review. 
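As a reference before the review, the HBase writes performed by the StoreSparseData Job correspond roughly to the sketch below, which uses the HBase Java client API with the Zookeeper settings, table name, and family names from this lab. The row key and the sample values are invented for the illustration. Note how a null value is simply not written, which is why it never appears in the stored table.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hadoopcluster");     // Zookeeper quorum
        conf.set("hbase.zookeeper.property.clientPort", "2181"); // Zookeeper client port

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("data_withNulls"))) {

            Integer jan = null; // a 0 connection count that the tMap turned into null
            Integer feb = 12;   // a real connection count

            Put put = new Put(Bytes.toBytes("42")); // row key: the customer Id
            put.addColumn(Bytes.toBytes("Id"), Bytes.toBytes("Id"), Bytes.toBytes(42));
            if (jan != null) { // null cells are skipped, so nothing is stored for January
                put.addColumn(Bytes.toBytes("Date"), Bytes.toBytes("Jan"), Bytes.toBytes(jan.intValue()));
            }
            if (feb != null) {
                put.addColumn(Bytes.toBytes("Date"), Bytes.toBytes("Feb"), Bytes.toBytes(feb.intValue()));
            }
            table.put(put);
        }
    }
}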
LESSON 3 | 73 Review Recap In this lesson, you learned the basics of writing files using Talend's Big Data components for Hadoop Distributed File System, HDFS. You used tHDFSPut to write different types of files to HDFS. Then, you used tHDFSGet to read a subset of the files back from HDFS. Last, you used HBase dedicated components to store sparse data on HDFS (tHBaseConnection, tHBaseOutput, tHBaseClose). Further Reading For more information about topics covered in this lesson, see the Talend Data Integration Studio User Guide, Talend Big Data Platform User Guide and the Talend Components Reference Guide. 74 | Big Data Basics - Participant Guide LESSON 4 Working with Tables This chapter discusses: Concepts 76 Working With Tables 81 Importing Tables with Sqoop 82 Creating Tables with Hive 89 Review 99 Concepts 76 | Big Data Basics - Participant Guide LESSON 4 | 77 78 | Big Data Basics - Participant Guide LESSON 4 | 79 80 | Big Data Basics - Participant Guide Working With Tables Overview In this lesson, you will cover two common use cases. HDFS can be used to for data warehouse optimization. So, you could decide to move your data from a relational database to HDFS. The first use case will show you how to transfer MySQL tables to HDFS using Sqoop. The second use case will show you how to create a table using Hive. Then, this table can be processed using Hive QL, which is very similar to SQL. Objectives After completing this lesson, you will be able to: Transfer MySQL tables to HDFS using Sqoop Create Hive connection Metadata Save data as Hive tables Before you begin Be sure that you are in a working environment that contains the following: A properly installed copy of the Talend Studio A properly configured Hadoop cluster The supporting files for this lesson First, you will use Sqoop to import a MySQL table to HDFS. LESSON 4 | 81 Importing Tables with Sqoop Task outline Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle, into the Hadoop Distributed File System (HDFS), transform the data using Map Reduce, and export that data back into a RDBMS. Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop creates and runs a Map-only Map Reduce Job to import the data. In this exercise, you will transfer MySQL tables into HDFS using Sqoop dedicated components. First, you will push the customers data into a MySQL database. Preparing the MySQL database The Job to copy the Customers data in MySQL has already been created for you. It is saved in the C:\StudentFiles\BDBasics folder. 1. IMPORT THE PushCustomerDataToMySQL JOB From the C:\StudentFiles\BDBasics\JobDesigns.zip archive file, import the PushCustomerDataToMySQL Job and the associated RemoteMySQL database Metadata. 2. RUN THE JOB Run PushCustomerDataToMySQL. This will copy 1 million rows in a remotely hosted MySQL database named CustomersData: 3. USE THE DATA VIEWER TO VALIDATE THE DATA To check the data copied in the CustomersData table, right-click the tMysqlOutput component, then click Data Viewer: 82 | Big Data Basics - Participant Guide You will now transfer this table to HDFS using the tSqoopImport component. Create a generic database metadata The tSqoopImport component calls Sqoop to transfer data from a relational database management system, such as MySQL or Oracle, into the Hadoop Distributed File System. 
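When the Java API mode is used (as it will be later in this lesson), the component drives Sqoop programmatically. The call it makes is roughly equivalent to the sketch below, which reuses the connection string, credentials, table, and target directory from this exercise; the exact arguments Talend builds, and the additional YARN settings it passes, may differ.

import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hadoopcluster:8020"); // NameNode of the training cluster

        // Import the CustomersData MySQL table into HDFS as delimited files.
        int exitCode = Sqoop.runTool(new String[] {
                "import",
                "--connect", "jdbc:mysql://hadoopcluster:3306/training",
                "--username", "root",
                "--password", "root",
                "--table", "CustomersData",
                "--target-dir", "/user/student/BDBasics/SqoopTable",
                "--delete-target-dir"
        }, conf);
        System.out.println("Sqoop import finished with exit code " + exitCode);
    }
}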
First, you will create a generic database Metadata, which is required for Sqoop to connect to your MySQL database. 1. CREATE A NEW DATABASE CONNECTION METADATA Create a new Generic JDBC database connection metadata named genericConnection. a. In the Repository, under Metadata, right-click Db Connections , then click Create connection. b. In the Name box, enter genericConnection. You can also add a Purpose and a Description. Then, click Next. c. In the DB Type list, select General JDBC. 2. CONFIGURE THE METADATA TO CONFIGURE THE CONNECTION TO YOUR MySQL DATABASE To configure the genericConnection metadata, add the database URL, driver jar, class name, credentials and mapping file. a. In the JDBC URL box, enter jdbc:mysql://hadoopcluster:3306/training . b. Click the (...) next to the Driver jar box and select mysql-connector-java-5.1.30-bin.jar. c. Click the (...) next to the Class name box, then, in the drop-down list, select com.mysql.jdbc.Driver. LESSON 4 | 83 d. In the User name and Password boxes, enter root. e. Click the (...) next to the Mapping file box, then select mapping_Mysql.xml in the list. Click OK. Your configuration should be as follows: 3. TEST THE CONNECTIVITY Click Check. Your connection should be successful: 4. FINALIZE THE METADATA CREATION Click OK and Finish. The genericConnection Metadata appears in the Repository. Importing tables You will create a simple Job to import the CustomersData MySQL table into HDFS using a tSqoopImport component. In the tSqoopImport component, the first option is to choose the Mode: Commandline or Java API. If you choose Commandline, the Sqoop shell is used to call Sqoop. In this mode, you have to deploy and run the Job in the host where Sqoop is installed. This means that you have to install and use the Jobserver, as described in the Talend Data Integration Advanced training, or as described in the Talend Installation Guide. If you select Use Java API, the Java API is used to call Sqoop. In this mode, the Job can be run locally in the Studio, but you have to configure the connection to your cluster. Note: A JDK is required to execute the Job with the Java API and the JDK versions on both machines must be compatible. 1. CREATE A NEW STANDARD JOB Create a new standard Job and name it SqoopImport. 2. ADD A tSqoopImport COMPONENT Add a tSqoopImport component and open the Component view. 3. CONFIGURE THE tSqoopImport COMPONENT TO USE THE Java API In the Mode box, select Use Java API. 84 | Big Data Basics - Participant Guide 4. CONFIGURE THE tSqoopImport COMPONENT TO USE THE HDFSConnection METADATA In the Hadoop Property list, select Repository , then browse to find the HDFSConnection you configured earlier in the course. This will configure the Distribution, the Hadoop version, the NameNode URI, the Resource Manager, and the Hadoop user name: 5. USE YOUR DATABASE CONNECTION METADATA In the JDBC Property list, select Repository, then browse to find the genericConnection Metadata. This configures the Connection, the Username, the Password, and the Driver JAR values: 6. IMPORT THE CustomersData TABLE In the Table Name box, enter "CustomersData"and select Delete target directory. 7. CONFIGURE THE TARGET DIRECTORY Select Specify Target Dir and enter "/user/student/BDBasics/SqoopTable". LESSON 4 | 85 Your configuration should be as follows: Run the Job and verify the results As you did previously, you will run your Job and check the results in the Console and in Hue. 1. 
RUN YOUR JOB AND VERIFY THE RESULTS IN THE CONSOLE Examine the output in the Console. a. Run your Job and check the results in the Console. The last line should be an exit code equal to 0. b. You can investigate what you see in the Console a little bit. For example, you can see the execution of the Map Reduce Job generated by the Sqoop import: Note: Your Job Id will be different. The Id is the number following "job_...". In the current example, the Job Id is 1476871003091_0006. 86 | Big Data Basics - Participant Guide c. Right after, you will find a recap of various counters: Here you can see that 4 map tasks ran on the cluster and that the 1 million records were transferred in approximately 21 seconds. 2. CONNECT TO Hue AND VERIFY YOUR RESULTS In Hue, use the Job browser to find your jobs and the File browser to find your data. a. To see your Job in Hue, click Job Browser. The Job Browser window will open, and you will see the list of all your Jobs. b. From the Console, find the Id of your Job. Then, in Hue, find your Job Id in the Job list. It should be followed by green boxes, corresponding to a successful execution of Map and Reduce tasks: LESSON 4 | 87 c. In Hue, click File Browser and navigate to /user/student/BDBasics/SqoopTable: The data has been split in multiple parts. d. Click part-m-00000: Here you can check that your data have been imported as expected. Now that you have imported a MySQL table to HDFS using a tSqoopImport component, you can continue to experiment working with tables. The next topic will show you how to create tables in HDFS with Hive. 88 | Big Data Basics - Participant Guide Creating Tables with Hive Task outline Hive is a data warehouse infrastructure tool used to process structured data in Hadoop. It is a database that resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy. Hive supports HiveQL, which is similar to SQL, but does not support the complete construct of SQL. Hive converts the HiveQL query into Map Reduce code and submits it to the Hadoop cluster. Hive, through HiveQL language, provides a higher level of abstraction over Java Map Reduce programming. First, you will create Hive Metadata in the Repository. Then, you will use various methods to create Hive tables. Create Hive connection metadata As you did previously for the cluster and HDFS connections, you will create a Hive connection metadata in the Repository. 1. CREATE A HIVE CONNECTION METADATA Right-click TrainingCluster in the Repository under Metadata/Hadoop Cluster, , then click Create Hive. 2. NAME THE METADATA In the Name box, enter HiveConnection, then click Next. 3. CONFIGURE THE CONNECTION TO HIVE Select the Hive Model and enter the port number. a. In the Hive Model list, select Standalone. b. In the Port box, enter 10000. Your configuration should be as follows: LESSON 4 | 89 4. TEST THE CONNECTION TO HIVE Click Check. You should have a successful connection message: 5. FINALIZE THE METADATA CREATION Click Ok and Finish to create the Hive connection Metadata. The Hive connection metadata appears in the Repository under Hadoop Cluster/TrainingCluster. The Hive connection metadata is also available in the Repository under Db Connections. Create a Hive table manually You will now create a Job that will create a table and populate it with the customer's data. 1. CREATE A NEW STANDARD JOB Create a new standard Job and name it HiveLoad. 2. ADD A tHiveCreateTable COMPONENT Add a tHiveCreateTable and open the Component view. 
The tHiveCreateTable component will create an empty Hive table according to the specified schema. 3. USE THE HiveConnection METADATA In the Property Type list, select Repository , then browse to find the HiveConnection Metadata: 90 | Big Data Basics - Participant Guide 4. USE THE CustomersData GENERIC SCHEMA In the Schema list, select Repository then browse to find the CustomersData generic schema metadata. 5. READ THE CustomersData TABLE In the Table Name box, enter "CustomersData". 6. SELECT THE ACTION ON TABLE In the Action on table list, select Create table if not exists: 7. ADD A tHiveLoad COMPONENT Add a tHiveLoad component. The tHiveLoad component will populate the table with data. 8. CONNECT WITH A TRIGGER Connect tHiveLoad to tHiveCreateTable with an OnSubjobOk trigger and open the Component view. 9. USE THE HiveConnection METADATA Set the Property Type to Repository and use the HiveConnection Metadata. 10. LOAD the CustomersData.csv FILE Configure tHiveLoad to load the CustomersData.csv file in a table named CustomersData. a. In the Load action list, select LOAD. b. In the File Path box, enter "/user/student/BDBasics/Customers/CustomersData.csv". c. In the Table Name box, enter "CustomersData". d. In the Action on file list, select OVERWRITE. You will now run your Job and check the results in the Console and in Hue. Run the Job and verify the results 1. RUN YOUR JOB Run your Job and check the output in the Console: LESSON 4 | 91 The Job successfully executed. Now, you can check the results in Hue. 2. CONNECT TO Hue AND CHECK YOUR RESULTS In Hue, use the Data Browsers to examine your CustomersData Hive table. a. In Hue, click Data Browsers.Then, click Metastore Tables: 92 | Big Data Basics - Participant Guide b. Click the customersdata table: This is the overview of your CustomersData table. LESSON 4 | 93 c. Click Columns(8) to inspect the schema of the customersdata table: Note:The columns type have been automatically converted. d. Click Sample: If you examine the results in the purchasedata column, you will see only NULL values. This is due to the fact that the timestamp format of Hive is not equivalent to the date format in the Talend Studio. This leads to a failure in the data conversion. A possible workaround is to consider dates as String types. You will now experiment with another way to create a Hive table. You will use a Wizard which will automatically create a Hive table from a file stored in HDFS. Using the Hive table creation wizard To create a Hive table automatically from the CustomersData.csv file stored on HDFS, you will have to change the perspective of the Studio. You will move to the Profiling perspective, which is dedicated to Data Quality analysis on database or on HDFS, depending on where you stored your data. 1. COPY CustomersData.csv TO HDFS To make sure that the CustomersData.csv file is available for the following steps, run the PutCustomersData Job again. 94 | Big Data Basics - Participant Guide 2. SWITCH TO THE Profiling PERSPECTIVE In the upper-right corner of the Studio, click Profiling to open the Profiling perspective: 3. LOCATE THE HDFSConnection METADATA In the DQ Repository, expand Metadata/Hadoop Cluster/TrainingCluster: There, you will find the connection metadata you created earlier. 4. CREATE A HIVE TABLE FROM CustomersData.csv From the CustomersData.csv file copied in HDFS, use the wizard to create a Hive table. a. Right-click HDFSConnection. 
Then click Create Hive Table: The connection to HDFS will be checked and next, the wizard to create a Hive table based on a file stored in HDFS will start. b. Browse to find the CustomersData.csv file under /user/student/BDBasics/Customers. LESSON 4 | 95 c. Select CustomersData.csv and wait until the Creation status changes to Success: As expected, the wizard detects 8 columns in the CustomersData.csv file. Note: The Hive table creation wizard will convert all of the files under the selected folder. So, in the current example, only the CustomersData.csv file will be converted because it is the only file in the Customers folder. d. Click Next. e. In the Schema, edit the columns' names as follows: Column0 > Id Column1 > FirstName Column2 > LastName Column3 > City Column4 > State Column5 > ProductCategory Column6 > Gender Column7 > PurchaseDate f. Click Next. g. In the New Table Name box, enter CustomersData_auto. h. In the Hive connection list, select HiveConnection , then click Finish. The Hive table is created and then the wizard closes. 5. REFRESH THE HIVE TABLE LIST From the Repository, reload the Hive table list. 96 | Big Data Basics - Participant Guide a. Under DQ Repository>Metadata>DB connections>HiveConnection>default, right-click Tables and clickReload table list: b. Click Reload in the Reload pop up message. The CustomersData_auto table appears in the table list. Verify the Hive table You can check the table in the Studio or in Hue. 1. PREVIEW THE HIVE TABLE From the Repository, preview the customersdata_auto Hive table. a. In the DQ Repository, right-click the CustomersData_auto table, then click Preview: LESSON 4 | 97 b. This Preview translates into a HiveQL query applied to your Hive table: The query is "select * from default.customersdata_auto" and appears in the SQL Editor, at the top of the window. The result appears in the tab 1, displayed below the SQL editor. 2. VERIFY THE HIVE TABLE FROM Hue Use the Data Browsers in Hue to check the customersdata_auto Hive table. a. To check the new table in Hue, click Data Browsers>Metastore Tables, then click customersdata_auto in the table list. b. Click Sample: You have covered the various ways to work with Tables with Hive. Next step You have almost finished this section. Time for a quick review. 98 | Big Data Basics - Participant Guide Review Recap In this lesson, you learned how to use Talend's Big Data components for Hive and Sqoop. First, you imported a MySQL table to HDFS using the tSqoopImport component. The import was done through a Map-only Map Reduce Job. Next, you manually created a Hive table with the tHiveCreateTable component and populated it with the tHiveLoad component. Last, you used the Hive table creation wizard to automatically create your Hive table from a file stored on HDFS. LESSON 4 | 99 Intentionally blank LESSON 5 Processing Data and Tables in HDFS This chapter discusses: Concepts 102 Processing Data and Tables in HDFS 107 Processing Hive Tables with Jobs 108 Profiling Hive Tables - Optional 116 Processing Data with Pig 128 Processing Data with Big Data Batch Job 136 Review 147 Concepts 102 | Big Data Basics - Participant Guide LESSON 5 | 103 104 | Big Data Basics - Participant Guide LESSON 5 | 105 106 | Big Data Basics - Participant Guide Processing Data and Tables in HDFS Use case Once stored in HDFS, you will need to process your tables and data to extract useful information. Depending on your data type, you can adopt various strategies. 
If your data is stored as Hive Tables, Hive QL might be the best way to address your needs. Hive QL is a high level programming language similar to SQL. Hive converts the request as Map Reduce Jobs that will be executed on your cluster. In this lesson, you will analyze Hive tables with a Job or with the Profiling view of the Studio. If your data is stored as text files, one option is to use Pig. Pig Latin is a high-level language providing common operations to group, filter and join data. The Pig Latin script is automatically converted in Java Map Reduce code to be executed on the cluster. Talend provides components to use Pig with minimal programming efforts. Another way to process your data, covered in this lesson, is to use a Big Data Batch Job. This kind of Job automatically converts the components in Java Map Reduce code that will be run on the Cluster. Objectives After completing this lesson, you will be able to: Process Hive tables with a standard Job Process Hive tables in the Profiling perspective of the Studio Process data with Pig components Process data with a Big Data Batch Job Before you begin Be sure that you are in a working environment that contains the following: A properly installed copy of the Talend Studio A properly configured Hadoop cluster The supporting files for this lesson First, you will use Hive to process the tables created in the previous lesson. LESSON 5 | 107 Processing Hive Tables with Jobs Task outline Hive converts the HiveQL query into Map Reduce code and then submits it to the Hadoop cluster. Through HiveQL language, Hive provides a higher level of abstraction compared to Java Map Reduce programming. You will now analyze the customers data table you created in the Working with Tables lesson. Using the Studio, you can analyze your data with your own Hive queries or you can use the Profiling view and use the Data Quality functions of the Studio over your Hive tables. You will use various components to extract useful data from a Hive table, process it, and then store the result in another Hive table. At the end of this lab, your Job will look like the following: Extracting useful data The first step is to collect useful data from the CustomersData Hive table you previously created. You will limit your investigations to the first 100 000 rows of the table. 1. SWITCH TO THE Integration PERSPECTIVE In the upper-right corner of the Studio, click Integration to open the Integration perspective. 2. CREATE A NEW STANDARD JOB Create a new Standard Job and name it HiveProcessing. 108 | Big Data Basics - Participant Guide 3. ADD A tHiveConnection COMPONENT Add a tHiveConnection component which uses the HiveConnection metadata. a. In the Repository, click HiveConnection under Metadata/Hadoop cluster/TrainingCluster/Hive. b. Drag it to the Designer. c. Select tHiveConnection in the Components list and click OK. 4. ADD A tHiveInput COMPONENT Add a tHiveInput component which uses the existing connection to Hive. a. Add a tHiveInput component and connect it with an OnSubjobOk trigger. b. Open the Component view. c. Select the Use an existing connection option. d. Ensure that tHiveConnection_1 is selected on the Component List. 5. CONFIGURE THE SCHEMA Set the Schema to Repository and then use the CustomersData generic schema metadata. 6. CONFIGURE THE TABLE NAME In the Table Name box, enter "CustomersData". 7. READ THE FIRST 100,000 ROWS OF THE TABLE In the Query box, you will enter the HiveQL query that will be sent to the cluster. 
As mentioned earlier, the investigation is limited to the first 100,000 rows of the table. In the Query box, enter: "select * from CustomersData where CustomersData.Id<=100000" Your configuration should be as follows: Process data You will now aggregate the data and store the result in HDFS. 1. ADD A tAggregateRow COMPONENT Add a tAggregateRow component, connect it with the Main row, and then open the Component view. 2. CONFIGURE THE SCHEMA Configure the schema to have 3 output columns named ProductCategory, Gender, and Count. The first 2 columns are strings and the third one is an Integer. a. Click Sync columns and then click (...) to edit the schema. b. Configure the output schema to have 3 columns named ProductCategory, Gender, and Count, as follows: c. Click OK to save the schema. 3. CONFIGURE THE AGGREGATION Configure the Group by and Operations tables to aggregate your data by ProductCategory and Gender. a. Click the green plus sign below the Group by table to add 2 Output columns: ProductCategory and Gender. b. Click the green plus sign below the Operations table to add Count to the Output Column. c. In the Function column, select count. d. In the Input column position column, select ProductCategory. Your configuration should be as follows: 4. ADD A tHDFSOutput COMPONENT AND CONFIGURE THE TARGET FOLDER Add a tHDFSOutput component. Configure it to use the HDFSConnection metadata and to write the results in the /user/student/BDBasics/Hive/agg_results folder. a. In the Repository, click HDFSConnection under Metadata/Hadoop cluster/TrainingCluster/HDFS. b. Drag it to the Designer. c. Select tHDFSOutput in the Components list, then click OK. d. Connect tAggregateRow to tHDFSOutput with the Main row and then open the Component view. e. In the File Name box, enter "/user/student/BDBasics/Hive/agg_results". This will save the aggregation results in HDFS. The last step is to transfer the results to a Hive table.
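Note: For reference, the result computed by this part of the Job (the tHiveInput query followed by the tAggregateRow grouping) is equivalent to what the following HiveQL statement would return if you ran it yourself, for example in Hue's Hive editor. This is only a sketch for comparison; the Job does not submit this exact statement:
select ProductCategory, Gender, count(ProductCategory)
from CustomersData
where Id <= 100000
group by ProductCategory, Gender
Running it is optional; the counts it returns should match the rows written to agg_results.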
Transfer results to Hive 1. COPY THE OUTPUT SCHEMA In the tHDFSOutput component, copy the schema. It will be reused in a tHiveCreateTable component. a. In the tHDFSOutput Component view, click (...) to edit the schema. b. Select ProductCategory, Gender, and Count in the Input or Output table, then copy the schema: c. Close the schema window. 2. ADD A tHiveCreateTable COMPONENT AND CONNECT IT Add a tHiveCreateTable below tHiveInput and connect it with an OnSubjobOk trigger. 3. CONFIGURE tHiveCreateTable Configure the tHiveCreateTable component to create a table named AggResults with the previously copied schema. a. Open the Component view. b. Select the Use an existing connection option. c. Click (...) to edit the schema. d. Paste the schema: e. In the DB Type column, select STRING for ProductCategory and Gender and then select INT for Count: f. Click OK to save the schema. g. In the Table Name box, enter "AggResults". h. In the Action on table list, select Create table if not exists. Your configuration should be as follows: 4. ADD A tHiveRow COMPONENT Add a tHiveRow component below tHiveCreateTable and connect it with an OnSubjobOk trigger. 5. CONFIGURE tHiveRow In the tHiveRow component, write a query to populate the AggResults Hive table with the agg_results file. a. In the Component view, select the Use an existing connection option. b. Copy the schema in tHiveCreateTable and paste it in the schema of the tHiveRow component. c. Click OK to save the schema. d. In the Query box, you can write your own HiveQL. The query in a tHiveRow component is executed at each flow iteration in your Job. In the current Job, the query will be executed only once, to transfer the data from the HDFS file to the AggResults Hive table. In the Query box, enter: "LOAD DATA INPATH '/user/student/BDBasics/Hive/agg_results' OVERWRITE INTO TABLE AggResults" Note: Copy and paste the Query from the LabCodeToCopy file in the C:\StudentFiles\BDBasics folder. e. Your configuration should be as follows: Your Job is now complete. It's time to run it and check the results. Run the Job and verify the results 1. RUN YOUR JOB Run your Job and check the results in the Console: If you did not include the extra JVM argument as explained in the first lab, you will see an error message saying that the winutils.exe binary is not available in the Hadoop binaries. In the current Job, this error won't prevent the Job from succeeding, but it could in some cases. 2. FIX THE winutils.exe ERROR Using the Advanced settings tab, set the Hadoop home directory to the location of the winutils.exe file. a. In the Run view, click Advanced settings. b. Select the Use specific JVM arguments option. c. Click New.... d. In the Argument box, enter: -Dhadoop.home.dir=C:/tmp/winutils e. Click OK to save the new argument and run your Job again. The Job should run successfully without error messages: 3. Even if it is not clearly stated in the Console, the HiveQL query from the tHiveInput component was executed on the cluster. You can see the generated Job in the Hue Job Browser: 4. Using the Hue File Browser, navigate to /user/student/BDBasics/Hive: The folder is empty because the data has been transferred to the Hive table, which removes the file from HDFS. 5. In Hue, click Data Browsers>Metastore Tables. The table AggResults should be in the table list. 6. Click AggResults, then click the Sample tab: You have processed your Hive table with various components, such as tHiveInput and tHiveRow. You will now process your Hive table using the Profiling perspective of the Studio. Profiling Hive Tables - Optional Task outline Using the Studio, you can run various analyses on your Hive tables. In this exercise, you will use the Profiling view of the Studio to run analyses of your Hive connection, Hive tables, and columns. Hive connection analysis 1. SWITCH TO THE Profiling PERSPECTIVE In the upper-right corner of the Studio, click Profiling to switch to the Profiling perspective. 2. REFRESH THE HIVE TABLE LIST From the DQ Repository, reload the list of Hive tables. a. In the DQ Repository, under Metadata/DB connections, you will see your HiveConnection. Right-click HiveConnection/default/Tables, then click Reload table list: b. Click Reload in the Reload pop-up message. This will update the table list: As expected, you will find the AggResults, CustomersData, and CustomersData_auto Hive tables. Now, you will start the Overview Analysis of HiveConnection. 3. CREATE AN OVERVIEW ANALYSIS Create an Overview Analysis on the HiveConnection metadata. a. Right-click HiveConnection under DQ Repository/Metadata/DB connections, then click Overview analysis: b. In the Name box, enter OverviewAnalysis and click Next. c. It is possible to filter the tables of interest by using the Table name filter and View name filter boxes. In this lab, you will keep these boxes empty. d.
Click Finish. Your analysis appears in the DQ Repository, under Data Profiling/Analyses: 4. START THE ANALYSIS The OverviewAnalysis opens so that you can examine the Analysis settings. LESSON 5 | 117 Click the green arrow to start the analysis: 5. EXAMINE THE RESULTS At the bottom of the Connection Analysis window, you will find the Analysis summary: 118 | Big Data Basics - Participant Guide In the Analysis summary, you should see that the analysis was successful. There is 1 schema, named default. To get more details, click default. In the table below, you will get 3 tables, named customersdata, customersdata_auto, and aggresults. As expected, CustomersData and CustomersData_auto have 1 million rows, and AggResults has 14 rows. Now that you have an overview of your Hive Tables, you can move to the next step, which is to analyze each table. Hive tables analysis In the Profiling view, you can easily refine the kind of analysis needed to suit your needs. You will now focus on the CustomersData table. 1. VISUALIZE THE customersdata HIVE TABLE From the DQ Repository, display the customersdata Hive table. a. Right-click customersdata under DQ Repository/Metadata/DB connections/HiveConnection/default/Tables, then click Preview: b. This will open a SQL Editor window which is split in two parts. The upper part is the request submitted to preview your table: LESSON 5 | 119 The bottom part is a preview of the CustomersData table: 2. CREATE A TABLE ANALYSIS From the DQ Repository, create a Table analysis on the customersdata table. Add a filter to examine data corresponding to women. a. Right-click customersdata and click Table analysis. b. In the Name box, enter TableAnalysis and click Finish. c. The TableAnalysis window opens. By default, a row count operation is proposed. You will add a filter to count the number of rows where the customer is a woman. In the Where box, in the Data Filter tab, enter customersdata.gender='F' : d. Click the green arrow icon to start the analysis. 120 | Big Data Basics - Participant Guide e. On the right side of the Table Analysis, you will find the result of your request: You will continue to investigate your data, running analysis on some columns of the CustomersData table. Column analysis (optional) 1. CREATE A COLUMN ANALYSIS From the DQ Repository, create a Column Analysis named ColumnAnalysis. a. Right-click customersdata under DQ Repository/Metadata/DB connections/HiveConnection/default/Tables, then click Column Analysis: b. In the Name box, enter ColumnAnalysis, then click Finish. The Column Analysis page opens. 2. CONFIGURE THE COLUMN ANALYSIS As the analysis can be time consuming, you will need to reduce the scope of the analysis. You will exclude the id, firstname, lastname, city, and purchasedate columns. LESSON 5 | 121 a. Under Data preview, click Select Columns, then select the state, productcategory and gender columns: b. Click OK to save the selection. 3. EXAMINE THE ANALYSIS CONFIGURATION Under Analyzed Columns you will find state, productcategory and gender columns. Click the plus sign next to state to view the details of what will be analyzed in the state column: You will see basic information about your columns, such as the number of rows, the number of blank or null values, the number of distinct count. 4. CONFIGURE THE EXECUTION ENGINE In the Analysis Parameters tab, you can choose the execution engine: SQL or Java. To send your requests to the cluster, 122 | Big Data Basics - Participant Guide select SQL, then click the green arrow icon: 5. 
CONNECT TO Hue While the analysis is running, you can go into the Hue Job Browser and follow the execution of the Map and Reduce tasks launched by the Column Analysis: 6. EXAMINE THE ANALYSIS RESULTS At the end of the execution, open the Analysis Results tab. There you will find the results of each analysis of the state, productcategory, and gender columns. There are 7 distinct product categories. You will now run an analysis to list these values. Product Category column analysis 1. CREATE A COLUMN ANALYSIS From the DQ Repository, create a Column Analysis on the productcategory column and name it ProductCategoryAnalysis. a. Right-click productcategory under DQ Repository/Metadata/DB connections/HiveConnection/default/Tables/customersdata/Columns, then click Column Analysis>Analyze: b. In the Name box, enter ProductCategoryAnalysis and click Finish. 2. SELECT INDICATORS Select the Row Count and Value Frequency indicators. a. In the Column Analysis window, under Data preview, click Select Indicators: This is where you specify which statistics you are interested in. b. If you scroll down, you will see an option named Hide non applicable indicators. Select this option. This will simplify the indicator selection: c. Expand Simple Statistics and select Row Count in the productcategory column: d. Expand Advanced Statistics and select Value Frequency. e. Click OK to save your selection: 3. RUN THE ANALYSIS Click the green arrow to start the analysis. 4. EXAMINE THE ANALYSIS RESULTS Open the Analysis Results tab: The frequency analysis lists the product category values and the count for each value. The numbers are very close to each other because this data set was randomly generated.
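Note: If you want to check the Value Frequency result yourself, a query along these lines, run from Hue's Hive editor or a tHiveRow component, returns one row per product category with its number of occurrences. This is only a sketch; when the SQL engine is selected, the Profiling perspective generates its own statements:
select productcategory, count(*)
from customersdata
group by productcategory
The counts it returns should match the Value Frequency values displayed in the Analysis Results tab.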
You have covered the last analysis for this exercise. Now it's time to move to the next exercise, where you will cover how to process data on HDFS using Pig. Processing Data with Pig Task outline Map Reduce is very powerful, but it requires a Java programmer, and the programmer may have to re-invent common functions such as joining or filtering. This is the motivation behind Pig. Pig is a platform for analyzing large data sets. It consists of a high-level programming language, Pig Latin, that opens Hadoop to non-Java programmers. Pig Latin also provides common operations to group, filter, join, or sort data. Pig provides an execution engine on top of Hadoop. The Pig Latin script is converted to Map Reduce code, which is executed on your cluster. In this lab, you will process the Customers data stored in HDFS. You will perform basic tasks such as filtering rows, mapping, aggregating, and sorting data, and storing your results in HDFS. At the end of this lab, your Job will look like the following: Writing data to HDFS Earlier in the course, you used the tHDFSPut component to copy a file from your local file system to HDFS. Another way to write data to HDFS is to use the tHDFSOutput component, which writes a data flow to HDFS. You will create a Job that reads the CustomersData.csv file and writes it to HDFS using the tHDFSOutput component. 1. SWITCH TO THE Integration PERSPECTIVE In the upper-right corner of the Studio, click Integration to switch your Studio to the Integration perspective. 2. CREATE A NEW STANDARD JOB Create a new Standard Job and name it WritingHDFS. 3. ADD A tFileInputDelimited COMPONENT Add a tFileInputDelimited component and open the Component view. 4. READ THE INPUT FILE Configure the tFileInputDelimited component to read the CustomersData.csv file. a. Next to the File name box, click (...) and navigate to "C:\StudentFiles\BDBasics\CustomersData.csv". b. Set the Row Separator to "\n" and the Field Separator to ";". c. Set the Schema type to Repository and browse to find the CustomersData generic schema metadata. Your configuration should be as follows: 5. ADD A tHDFSOutput COMPONENT Add a tHDFSOutput component and connect it with the Main row. 6. CONFIGURE THE OUTPUT FOLDER Configure tHDFSOutput to write in the /user/student/BDBasics/CustomersData folder. a. Double-click the tHDFSOutput component to open the Component view. b. In the Property Type list, select Repository. Then, find HDFSConnection. c. In the Schema list, select Repository. Then, find the CustomersData generic schema metadata. d. In the File Name box, enter "/user/student/BDBasics/CustomersData". Your configuration should be as follows: e. Save your Job. This Job will be executed later from another Job, using a tRunJob component. Load data You will create a new Job named PigProcessing, which will process the customers' data. 1. CREATE A NEW STANDARD JOB Create a new Standard Job and name it PigProcessing. 2. ADD THE WritingHDFS JOB Drag the WritingHDFS Job from the Repository and drop it in the Designer. It will appear as a tRunJob component labelled WritingHDFS: 3. ADD A tPigLoad COMPONENT Add a tPigLoad component and connect the WritingHDFS component to tPigLoad with an OnSubjobOk trigger. 4. SELECT THE EXECUTION MODE Double-click tPigLoad to open the Component view. Pig Jobs can be executed in local or Map/Reduce mode. This is configured in the tPigLoad component. For this lab, you will use the Map/Reduce mode. 5. CONFIGURE THE CONNECTION TO HDFS As you did for the tHDFSOutput component, set the Property Type to Repository using the HDFSConnection metadata. 6. CONFIGURE THE SCHEMA Set the Schema type to Repository using the CustomersData generic schema metadata. 7. CONFIGURE THE INPUT FILE In the Input file URI box, enter "/user/student/BDBasics/CustomersData". Your configuration should be as follows: Filter and map Data You will continue to build your Job with the next two components, in order to filter and map your data. The goal here is to extract customers living in California and get the corresponding gender and product category. You will use the tPigFilterRow component to filter the State. Then, you will use the tPigMap component to extract the data from the Gender and ProductCategory columns. 1. ADD A tPigFilterRow COMPONENT Add a tPigFilterRow component and connect it with a Pig Combine row. 2. CONFIGURE THE FILTERING Add a filter to select customers living in California. a. Open the Component view. b. Select the Use advanced filter option. c. In the Filter box, enter "State matches 'California'": 3. ADD A tPigMap COMPONENT AND CONNECT IT Add a tPigMap component, connect it with the Pig Combine row, and then open the Component view. 4. CREATE THE MAPPING OUTPUT Create a new output named MappingOut. 5. CONFIGURE THE MAPPING Select the ProductCategory and Gender columns in the row2 table and drag them to the MappingOut table. Your mapping should be as follows: 6. SAVE THE MAPPING Click OK to save the mapping.
Aggregate and sort data Now that you have extracted the data, you will aggregate and sort it. The goal here is to get the number of men and women per ProductCategory, and then sort the results in alphabetical order. 1. ADD A tPigAggregate COMPONENT Add a tPigAggregate component and connect it with the MappingOut row. Then, open the Component view. 2. CONFIGURE THE OUTPUT SCHEMA Add a new column to the output table, named Count, which is an Integer. a. Edit the schema and add a column named Count with an Integer type: b. Click OK to save the schema. 3. CONFIGURE THE AGGREGATION Configure the Group by and Operations tables to aggregate your data by ProductCategory and Gender. a. Click the green plus sign below the Group by table to add the ProductCategory and Gender columns. b. Click the green plus sign below the Operations table. c. In the Additional Output Column list, select Count. d. In the Function list, select count. e. In the Input Column list, select ProductCategory. Your configuration should be as follows: 4. ADD A tPigSort COMPONENT Add a tPigSort component, connect it with the Pig Combine row, and then open the Component view. 5. CONFIGURE THE SORTING Configure the Sort key table to sort ProductCategory in Ascending order. a. Click the green plus sign below the Sort key table. b. Configure it to sort the ProductCategory column in ascending order, as shown below: Store results Once the data is processed, the last step is to store your results on HDFS. 1. ADD A tPigStoreResult COMPONENT Add a tPigStoreResult component, connect it with the Pig Combine row, and then open the Component view. 2. USE THE HDFSConnection METADATA Set the Property Type to Repository and then select HDFSConnection. 3. CONFIGURE THE OUTPUT FOLDER Write the results in the /user/student/BDBasics/Pig/out folder. a. In the Result Folder URI, enter "/user/student/BDBasics/Pig/out". b. Select the Remove directory if exists option. This will allow you to run the Job again as needed. c. Your configuration should be as follows: Run the job and verify the results 1. RUN YOUR JOB AND VERIFY THE RESULTS Run the Job and inspect the results in the Console. a. Scrolling down the Console, you will find information about the WritingHDFS Job: The WritingHDFS Job executes successfully and then the Pig components start to work. b. If you continue to investigate the logs, you will find the Pig requests equivalent to each Pig component. The first Pig component is the tPigLoad: c. Right after, you will find Pig requests for tPigFilterRow, tPigMap, tPigAggregate, tPigSort, and tPigStoreResult: d. If you continue to scroll down, you will see several MapReduce Jobs submitted, as well as their statistics. This is the final report of all MapReduce Job executions: There, you can see that the execution was successful. First, 1 million rows were read from the CustomersData file. Then, 14 records were written to the /user/student/BDBasics/Pig/out folder. 2. In Hue, using the File Browser, navigate to /user/student/BDBasics/Pig/out. 3. Click the part-r-00000 file to see the result: 4. In Hue, in the Job Browser, you will be able to see all the Jobs submitted by the Studio. The Jobs have different IDs, but they have the same name: BDBASICS_PigProcessing_0.1_tPigLoad_1. All your Jobs have succeeded:
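Note: Because the same CustomersData.csv file was loaded into the customersdata Hive table in the previous lesson, you can optionally cross-check the Pig output with a HiveQL query. This is only a sketch for comparison; the Pig Job reads the file directly from HDFS and does not use Hive at all:
select productcategory, gender, count(productcategory)
from customersdata
where state = 'California'
group by productcategory, gender
order by productcategory
The 14 rows it returns should match the counts stored in part-r-00000.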
You can now continue investigating processing data on HDFS and move to the next exercise. Processing Data with Big Data Batch Job Task outline The last way to process data covered in this course is to use Big Data Batch Jobs. First, you will create a Standard Job and then convert it to a Big Data Batch Job using the Map Reduce framework. The Job will be very similar to the PigProcessing Job. At the end of this exercise, your Job will look like the following: The first step will be to read the data using the HDFSConnection metadata. Read and filter data 1. CREATE A NEW STANDARD JOB Create a new Standard Job and name it StandardProcessing. 2. ADD A tHDFSInput COMPONENT Add a tHDFSInput component which uses the HDFSConnection metadata. a. In the Repository, under Metadata/Hadoop Cluster/TrainingCluster/HDFS, click HDFSConnection and drag it to the Designer. b. Select tHDFSInput in the Components list and click OK: c. Double-click the tHDFSInput component to open the Component view. 3. CONFIGURE tHDFSInput Configure tHDFSInput to read the CustomersData file. a. In the Schema list, select Repository. Then, navigate to find the CustomersData generic schema metadata. b. Next to the File Name box, click (...) and browse to find /user/student/BDBasics/CustomersData. Your configuration should be as follows: 4. ADD A tFilterRow COMPONENT Add a tFilterRow component and connect it with the Main row. 5. FILTER THE DATA Configure the Conditions table to filter customers living in California. a. In the Component view, click the green plus sign below the Conditions table to add a new line. b. In the InputColumn column, select State. c. In the Function column, select Match. d. In the Operator column, select Equals. e. In the Value column, enter "California". Your configuration should be as follows: In the next section, you will map and aggregate your data. Map and Aggregate Data 1. ADD A tMap COMPONENT Add a tMap component and connect it with the Filter row. 2. CREATE A MAPPING OUTPUT Open the mapping editor and create a new output named out. 3. CONFIGURE THE MAPPING In the row2 table, select Gender and ProductCategory and drag them to the out table, as follows: 4. SAVE THE MAPPING Click OK to save your configuration. 5. ADD A tAggregateRow COMPONENT Add a tAggregateRow component and connect it with the out row. 6. CONFIGURE THE AGGREGATION AS PREVIOUSLY Configure the Group by and Operations tables to aggregate your data by ProductCategory and Gender. The result of the aggregation is saved in a new output column named Count. a. Open the Component view. b. Edit the schema and add a new column named Count with an Integer type: c. Click OK to save the schema. d. Click the green plus sign below the Group by table and add the ProductCategory and Gender columns. e. Click the green plus sign below the Operations table. f. In the Output column list, select Count. g. In the Function list, select count. h. In the Input column position list, select ProductCategory: The last steps are to sort the results and then save them on HDFS. Sort and save Data 1. ADD A tSortRow COMPONENT Add a tSortRow component, connect it with the Main row, and then open the Component view. 2. CONFIGURE THE SORTING Configure the Criteria table to sort ProductCategory in ascending order. a. Click the green plus sign below the Criteria table. b. In the Schema column, select ProductCategory. c. In the sort num or alpha? column, select alpha. d. In the Order asc or desc column, select asc: 3.
ADD A tHDFSOutput Add a tHDFSOutput component which uses the HDFSConnection metadata. a. In the Repository, under Metadata/Hadoop Cluster/TrainingCluster/HDFS, click HDFSConnection and drag it to the Designer. b. In the Components list, select tHDFSOutput. c. Connect it with the Main row and open the Component view. 4. CONFIGURE THE OUTPUT FILE Write the data in the /user/student/BDBasics/CustomersDataOut folder. a. In the File Name box, enter "/user/student/BDBasics/CustomersDataOut". b. In the Action list, select Overwrite. Your configuration should be as follows: Your Job is now ready to run. LESSON 5 | 139 Run the job and verify the results 1. RUN THE JOB Run your Job and check the results in the Console: Your Job should execute successfully. 2. CONNECT TO Hue AND VERIFY THE RESULTS In Hue, using the File Browser, navigate to /user/student/BDBasics and open CustomersDataOut: The results should be the same as in the previous lab, Processing Data with Pig. You will now convert this Standard Job into a Big Data Batch Job, which will use the Map Reduce framework. 140 | Big Data Basics - Participant Guide Convert to a Map Reduce batch Job Using the Studio, you can convert a standard Job to a Big Data Batch Job, and choose between a Map Reduce or Spark framework. In this lesson, you will focus on Map Reduce Jobs. Instead of converting your current Job, you will duplicate it as a Map Reduce based Batch Job. 1. DUPLICATE YOUR JOB Duplicate the StandardProcessing Job and name the duplicate MRProcessing. MRProcessing is a Big Data batch Job using the MapReduce framework. a. In the Repository, right-click the StandardProcessing Job, and then, click Duplicate. b. In the Input new name box, enter MRProcessing. c. In the Job Type list, select Big Data Batch. d. In the Framework list, select MapReduce: e. Click OK. 2. CONFIGURE THE CONNECTION TO THE HADOOP CLUSTER The Hadoop cluster configuration is set at the Job level in Big Data Batch Jobs. So, before duplicating your Job, you will be LESSON 5 | 141 asked to choose from which component the Hadoop configuration will be retrieved: 3. SELECT A COMPONENT Select the HDFSConnection component with the blue arrow and click OK. 4. OPEN THE JOB In the Repository, under Big Data Batch Jobs, you will find your MRProcessing Job. If you haven't done so already, double-click to open it: 5. CONFIGURE THE OUTPUT FOLDER Configure tHDFSOutput to write the results in the /user/student/BDBasics/CustomersDataOut_MR folder. a. Double-click the tHDFSOutput component to open the Component view. b. In the Folder box, enter "/user/student/BDBasics/CustomersDataOut_MR". Your standard Job is now converted to a Map Reduce Job. Run the Job and verify the results Before running your Job, you will check that the Hadoop cluster configuration is correct. 1. OPEN THE RUN VIEW Open the Run view of the MRProcessing Job. 2. VERIFY THE HADOOP CLUSTER CONNECTION INFORMATION Click the Hadoop Configuration tab: 142 | Big Data Basics - Participant Guide As the configuration has been retrieved from a tHDFSInput component, some configurations are missing, because they are not necessary to read/write from HDFS. The Resource Manager address is necessary to run a Map Reduce Job, and is currently missing. If you run the Job, it will fail with the following error message: LESSON 5 | 143 3. CONFIGURE THE RESOURCE MANAGER ADDRESS To fix this, you can either use your TrainingCluster metadata, or set the Resource Managerto "hadoopcluster:8032": 4. 
RUN YOUR JOB Go back to the Basic Run tab and run your Job. 5. OBSERVE THE DESIGNER You can check that your Job is running in the Designer view. As the different Map Reduce Jobs execute, you will see the progress bars changing. First they are empty; then, when a Map or Reduce task starts to execute, the progress bar becomes red. When a task successfully completes, the progress bar becomes green: 6. OBSERVE THE CONSOLE You can also follow the execution in the Console: You should see three reports, one for each Map Reduce Job launched by the Studio. The execution of Map Reduce tasks is reported as a percentage. 7. CONNECT TO Hue Use the Job Browser and the File Browser to check the execution and the results. a. From the Console, you can get your Job ID and find it in the Hue Job Browser: b. Using the File Browser in Hue, check the results saved in the /user/student/BDBasics/CustomersDataOut_MR folder. c. Click the part-r-00000 file: You should have the same results as in the previous labs. You have now covered the different ways to process tables and data stored on HDFS. Next step You have almost finished this section. Time for a quick review. Review Recap In this chapter, you covered how to process tables and data stored on HDFS. The first part of the chapter was dedicated to Hive tables. You built a Job to extract data of interest from your Hive table with a tHiveInput component. You processed the data, saved the result in HDFS with a tHDFSOutput component, and transferred the result to a Hive table using tHiveCreateTable and tHiveRow. Next, you used the Profiling view of the Studio to perform different levels of analyses. You started at the connection level, moved to the table and column level, and ended with a custom analysis on the ProductCategory column. Each request was run as a Map Reduce Job in your cluster. The second part of the chapter was dedicated to data processing with Pig and Big Data Batch Jobs. You created a Job using Pig components to process your data. Each Pig request was executed as a Map Reduce Job in your cluster. Next, you created a standard Job to process your data and reproduced the results obtained with Pig. Then, you duplicated the standard Job and created a Big Data Batch Job using the Map Reduce framework. Further reading If you want to discover more about data profiling, the Talend Data Quality trainings will help you. If you are interested in discovering more about Big Data Batch Jobs, the Talend Big Data Advanced training will give you an overview of real-life use cases using Big Data Batch Jobs.