Instructions
Lab Center – Hands-on Lab Session 1865
Spark for Cloudant Analytics

Holger Kache, IBM, kache@de.ibm.com
Mayya Sharipova, IBM, mayyas@ca.ibm.com
Tony Sun, IBM, tonysun@us.ibm.com

Table of Contents

Lab instructions
  Draw insights from Twitter data about the upcoming US election
  Create a Bluemix account
  Prepare the data set and data flow
    1. Provision a new Cloudant account
    2. Create a Cloudant database
    3. Create an Insights for Twitter service instance
    4. Create an Apache Spark service instance
  Work with a Python notebook
  Work with a Scala notebook
We Value Your Feedback!

Lab instructions

The instructions in this lab are completely web based and require a working internet connection. We recommend Firefox as the browser (version 45.2 is installed on the lab computer). You can also run the lab on your own laptop.
The instructions have no local dependencies and all resources are accessible online. There are no specific platform requirements either.

Draw insights from Twitter data about the upcoming US election

The lab shows you how to analyze tweets about the upcoming US election and extract interesting insights from them. You will learn how to find, filter, and sort tweets by location, sentiment, party affiliation, and candidate. The work is done in Jupyter notebooks running on Bluemix. A shared Spark cluster runs your computations, and your results are immediately available in the notebooks. Data is extracted from the Twitter API and staged in your own Cloudant database instance. The analysis results are written back to another Cloudant database and plotted in graphs inline with the notebook. You will run the data analysis in two languages, Python and Scala, and use frameworks including Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX.

Create a Bluemix account

The services you are about to use in this tutorial are all hosted on the IBM Platform-as-a-Service called Bluemix. If you already have a Bluemix account, you can skip this section and proceed to the next one.

To sign up for Bluemix, navigate to http://bluemix.net/

On the signup sheet you have to provide your contact details and create a password with security questions to recover it. For country or region, select "UNITED STATES" for the purpose of this lab. You can change your profile later, but using the US location will give you better performance while you are on-site at the convention.

A successful sign-up gives you a 30-day trial account at no charge. No credit card information is required to create the account. It expires automatically after 30 days. You have to provide payment information only if you want to convert to an unlimited account after the 30 days.
To activate the account, open your email inbox and find a note from "The Bluemix Team" with the subject "Action required: Confirm your Bluemix account". It contains a link to Confirm your account.

Note: The activation email should arrive within minutes but can theoretically take up to 24 hours. If the activation note does not show up in your inbox shortly, please ask us. We have a few active Bluemix accounts available that we can share.

Upon activation you should get a Success message with a link to log into your Bluemix dashboard. A three-page wizard opens with a few additional questions.

1 - Your location should already be set to "US South". If not, please set it.
2 - Name your organization. Feel free to pick any name (including the suggested one).
3 - Create a space. You can use the "dev" space for this lab.

With that you are all set and should be in your Bluemix console.

Prepare the data set and data flow

From the Bluemix overview console you can pick the Catalog in the upper right-hand menu. It offers a complete catalog of all services hosted on Bluemix. Search for "Cloudant" with the search bar and pick the one in the "Data & Analytics" section called Cloudant NoSQL DB.

1. Provision a new Cloudant account

Click on the Cloudant NoSQL DB service icon to provision a new instance of a Cloudant account. From the list of Pricing Plans, choose the Lite plan.

Create the instance and make note of the user id and password that were created automatically with your Cloudant service instance. You will need those credentials later and can find them in the tab called "Service Credentials".

2. Create a Cloudant database

Navigate back to the "Manage" tab and open the Cloudant dashboard with the "Launch" button. The experience changes to a completely different dashboard outside of Bluemix. Here you navigate the pages from the left-hand panel. To create a database, use the Databases page.
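
When you pick the database name, keep in mind that Cloudant follows the CouchDB naming rules: a name must start with a lowercase letter and may contain only lowercase letters, digits, and the characters _ $ ( ) + - /. The following minimal Python sketch checks a candidate name against that rule before you type it into the dashboard. The function name and the sample names are hypothetical and not part of the lab material.

```python
import re

# Naming rule for Cloudant/CouchDB databases: first character is a lowercase
# letter, followed by lowercase letters, digits, or any of _ $ ( ) + - /
_DB_NAME_RULE = re.compile(r"^[a-z][a-z0-9_$()+/-]*\Z")


def is_valid_db_name(name):
    """Return True if `name` satisfies the database naming rule."""
    return bool(_DB_NAME_RULE.match(name))


# Hypothetical candidate names, used only for illustration
for candidate in ["election_tweets", "Election", "2016tweets"]:
    print(candidate, is_valid_db_name(candidate))
```

A name that fails the check (for example one with uppercase letters or a leading digit) is likely to be rejected when you try to create the database.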
Create a database and note the database name for later.

3. Create an Insights for Twitter service instance

The next step in the analysis process is to harvest the data. You can use the IBM Insights for Twitter API service at https://console.ng.bluemix.net/catalog/services/insights-for-twitter/ to get Twitter data about the election. Back in the Bluemix catalog, you can also find the service by searching for the keyword "Twitter".

Create a service instance and accept the default values, including the Free Plan tier.

Make note of the Service Credentials in your newly deployed Insights for Twitter service instance.

4. Create an Apache Spark service instance

In this step you will use the IBM Apache Spark service at https://console.ng.bluemix.net/catalog/services/apache-spark/ in Bluemix to create a Jupyter notebook. The notebook is written in Python and lets you script the calls to the Twitter service API created above. Results of the Twitter service API calls are persisted into your new Cloudant database. Again, you can search the Bluemix catalog for the keyword "Spark" to find the same service.

Create a new service instance for the IBM Apache Spark service with the default settings. Select the Personal plan.

Note: The Personal plan is priced at $0.70 per hour for a 2-node execution engine. With the 30-day Bluemix trial account, this lab does not incur any costs. If you use your personal Bluemix account and do not want to see a charge on your credit card for this lab, please ask us to give you a trial account instead.

Work with a Python notebook

In the new service instance you are presented with the console. Create a new notebook and select the "Create Notebook" From URL option.
Provide a name, an optional description, and enter the following Notebook URL:

https://raw.githubusercontent.com/cloudant-labs/spark-cloudant/master/tutorials/wowPython.ipynb

Note: If you get an error like "The service is not responding, try again later", use the back button and try loading again.

The notebook you just loaded contains the actual instructions, with Spark as the engine and Cloudant as the data store. All code in the notebook is written in Python 2 syntax and requires a running Python 2.7 kernel. By default, the kernel should already be started and your notebook connected to it.

Code is structured into cells, which you execute sequentially, starting at the top. While a cell executes, you see a [*] next to the cell that indicates the running status. When the cell has completed, you see a number like [1]. That number increments with every cell execution. Nothing stops you from running a cell "out of order" instead of sequentially. Just make sure all the conditions for the execution of a cell are met. We prepared the notebook so that all conditions are met when you run it top to bottom.

A successful cell execution almost always prints some output right below the cell. If there is no output, you probably have a problem with your Python kernel and should restart the kernel. Use the menu options or action buttons at the top of your notebook to interrupt or restart the kernel. A kernel restart clears your entire session context, and you have to re-run every instruction required up to the point where you want to resume your work. The comments in the notebook should make it reasonably clear what every cell requires for execution. If you are unsure how to proceed after a kernel restart, just ask us.

Please go ahead and follow the instructions given in the Python notebook at this point.
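
For orientation, the Cloudant credentials you noted earlier are what a notebook needs to reach your database. The sketch below shows one way to collect them into the option keys read by the spark-cloudant connector (cloudant.host, cloudant.username, cloudant.password). The helper name, the sample credentials, and the default host pattern <username>.cloudant.com are assumptions for illustration; the notebook itself may organize its connection details differently.

```python
def cloudant_options(username, password, host=None):
    """Collect Cloudant credentials under the connector's option keys."""
    if host is None:
        # Assumption: the account host follows the <username>.cloudant.com pattern
        host = username + ".cloudant.com"
    return {
        "cloudant.host": host,
        "cloudant.username": username,
        "cloudant.password": password,
    }


# Placeholder credentials; replace with the values from "Service Credentials"
opts = cloudant_options("myaccount-bluemix", "my-password")
print(sorted(opts))

# In a notebook with a SQLContext and the connector available, the options
# could then be applied roughly like this (not executed here):
#   reader = sqlContext.read.format("com.cloudant.spark")
#   for key, value in opts.items():
#       reader = reader.option(key, value)
#   df = reader.load("election_tweets")  # hypothetical database name
```

Keeping the credentials in one place makes it easier to adjust them again after a kernel restart.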
To validate the success of your executions, you can compare with the RESULT output we provided at

https://github.com/cloudant-labs/spark-cloudant/master/tutorials/wowPython_RESULT.ipynb

Your cell execution output should look very similar to the output on that page.

Work with a Scala notebook

With the Python notebook above you have already seen a lot of Spark core and graphing technology at play. There is, however, another powerful framework in Spark, Spark Streaming, which supports the analysis of data streams. Spark Streaming is currently supported only in Scala notebooks. We prepared a second notebook, written in Scala, that you can use to exercise Spark Streaming technology.

Note: Don't worry if you are not familiar with Scala as a language. The notebook can again be executed in a point-and-click fashion and does not require any actual coding.

To load the Scala notebook, open a new browser tab and navigate to the Spark service instance you created in Bluemix earlier.

Note: If you need a quick way to navigate back to the Spark service, use the Bluemix menu on the left-hand side and click the "Analytics" tab. There you will immediately see all your notebooks.

Create a new notebook and select the "Create Notebook" From URL option again. Provide a name, an optional description, and enter the following Notebook URL:

https://raw.githubusercontent.com/cloudant-labs/spark-cloudant/master/tutorials/wowScala.ipynb

Please go ahead and follow the instructions given in the Scala notebook at this point.

At one point in the notebook you are asked to start the Spark Streaming Context (ssc). It subscribes to changes in your Cloudant database and processes any changes in a window of 120 seconds. During these 120 seconds, you should initiate changes in your Cloudant database. A simple way to do that is to add new tweets. Go back to your Python notebook and re-execute the cell that calls the TwitterToCloudant() class.
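
To picture what the 120-second window means, here is a small sketch in plain Python. It is not Spark Streaming code and does not talk to Cloudant; it only illustrates that changes arriving inside the window are processed while later ones are not. The function name, the event tuples, and the timestamps are hypothetical.

```python
WINDOW_SECONDS = 120  # window length given in the lab text


def changes_in_window(events, start, window=WINDOW_SECONDS):
    """Return ids of events whose timestamp lies in [start, start + window)."""
    end = start + window
    return [doc_id for timestamp, doc_id in events if start <= timestamp < end]


# Hypothetical change events as (seconds since the context started, document id)
events = [(5, "tweet-a"), (90, "tweet-b"), (130, "tweet-c")]
print(changes_in_window(events, start=0))
```

In this picture, a tweet loaded after the window has closed is only seen if you restart the context or choose a longer duration.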
Note: If you accidentally closed the tab with the Python notebook, you have to re-open it and re-execute the cells that load libraries and create the context. Don't forget to adjust your connection details either.

Here you can optionally change the query you want to use for the Twitter API and/or the count of tweets to process. You now simulate a stream of events (loading new documents with tweets) in your Cloudant database and see how the Spark Streaming receiver processes these events. After the 120 seconds you can always restart your Streaming Context or increase the duration.

To validate the success of your executions, you can compare with the RESULT output we provided at

https://github.com/cloudant-labs/spark-cloudant/master/tutorials/wowScala_RESULT.ipynb

Your cell execution output should look very similar to the output on that page.

We Value Your Feedback!

• Don’t forget to submit your World of Watson session and speaker feedback! Your feedback is very important to us – we use it to continually improve the conference.
• Access the World of Watson Conference Connect tool to quickly submit your surveys from your smartphone, laptop or conference kiosk.
Source document: WoW2016_LabInstructions_AL1865_student_version (PDF version 1.4, 17 pages). Author: Holger Kache. Created with Word on 2016-10-20.