Lab Center – Hands-on Lab
Session 1865
Spark for Cloudant Analytics
Holger Kache, IBM, kache@de.ibm.com
Mayya Sharipova, IBM, mayyas@ca.ibm.com
Tony Sun, IBM, tonysun@us.ibm.com

Table of Contents
Lab instructions
Draw insights from Twitter data about the upcoming US election
Create a Bluemix account
Prepare the data set and data flow
1. Provision a new Cloudant account
2. Create a Cloudant database
3. Create an Insights for Twitter service instance
4. Create an Apache Spark service instance
Work with a Python notebook
Work with a Scala notebook
We Value Your Feedback!


Lab instructions
The instructions in this lab are completely web based and require a working internet
connection. As a browser we recommend Firefox (version 45.2 is installed on the lab
computer).
You can also run the lab on your own laptop. The instructions have no
local dependencies and all resources are accessible online. There are no specific
platform requirements either.

Draw insights from Twitter data about the upcoming US
election
The lab shows you how to analyze tweets about the upcoming US election and extract
interesting insights from these tweets. You will learn how to find, filter, and sort tweets
by location, sentiment, party affiliation, and candidate.
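
As a rough illustration of what "find, filter, and sort" means here, the sketch below works on a handful of made-up tweet records in plain Python. The field names (state, sentiment, party, candidate) and all sample values are assumptions for this example only; the notebooks define their own schema and use Spark rather than plain lists.

```python
# Hypothetical tweet records; field names and values are invented for illustration.
tweets = [
    {"text": "tweet A", "state": "Nevada", "sentiment": "POSITIVE",
     "party": "party-1", "candidate": "candidate-x"},
    {"text": "tweet B", "state": "Ohio", "sentiment": "NEGATIVE",
     "party": "party-2", "candidate": "candidate-y"},
    {"text": "tweet C", "state": "Florida", "sentiment": "POSITIVE",
     "party": "party-2", "candidate": "candidate-y"},
    {"text": "tweet D", "state": "Ohio", "sentiment": "POSITIVE",
     "party": "party-1", "candidate": "candidate-x"},
]


def filter_tweets(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [r for r in records
            if all(r.get(key) == value for key, value in criteria.items())]


def sort_tweets(records, field):
    """Return the records sorted by one field in ascending order."""
    return sorted(records, key=lambda r: r[field])


# Keep only positive tweets, then order them by location.
positive = filter_tweets(tweets, sentiment="POSITIVE")
by_state = sort_tweets(positive, "state")
```

Several criteria can be combined in one call, for example filter_tweets(tweets, state="Ohio", party="party-1").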
The work is done in Jupyter notebooks running on Bluemix. A shared Spark cluster
runs your computations, and your results are immediately available in the notebooks.
Data is extracted from the Twitter API and staged in your own Cloudant database
instance. The analysis results are written back to another Cloudant database and
plotted in graphs inline in the notebook.
You will use two languages, Python and Scala, for the data analysis and leverage
frameworks including Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and
Spark GraphX.


Create a Bluemix account
The services you are about to use in this tutorial are all hosted on the IBM Platform-as-a-Service called Bluemix. If you already have a Bluemix account, you can skip this section
and proceed to the next section.
To sign up for Bluemix, navigate to
http://bluemix.net/
On the sign-up form, provide your contact details and create a password with
security questions to recover it. For country or region, select "UNITED STATES" for the
purpose of this lab. You can change your profile later, but using the US location will
give you better performance while you are on-site at the convention.
After a successful sign-up you get a 30-day trial account at no charge. No credit card
information is required to create the account. It expires automatically after 30 days.
You have to provide payment information only if you want to convert to an unlimited
account after the 30 days.
To activate the account, open your email inbox and find a note from "The Bluemix
Team" with a subject "Action required: Confirm your Bluemix account". It contains a link
to Confirm your account.
Note: The activation email should arrive within minutes but can theoretically take
up to 24 hours. If you don't have the activation note in your inbox shortly, please
ask us. We have a few active Bluemix accounts available we can share.
Upon activation you should get a Success message with a link to log into your Bluemix
dashboard. A three-page wizard opens with a few additional questions.
1 - Your location should already be set to "US South". If not, select it.
2 - Name your organization. Feel free to pick any name (including the suggested one).
3 - Create a space. You can use the "dev" space for this lab.

With that you are all set and should be in your Bluemix console looking like the one
below.


Prepare the data set and data flow
From the Bluemix overview console, open the Catalog from the upper right-hand
menu. It lists all services hosted on Bluemix. Search for "Cloudant" with the
search bar and pick the service in the "Data & Analytics" section called
Cloudant NoSQL DB.


1. Provision a new Cloudant account
Click on the Cloudant NoSQL DB service icon to provision a new instance of a Cloudant
account. From the list of Pricing Plans you want the Lite plan:

Create the instance and make note of the user ID and password that were created
automatically with your Cloudant service instance. You will need those credentials later
and can find them in the tab called "Service Credentials".
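
To keep the credentials handy for the notebooks, one option is to paste the JSON from the "Service Credentials" tab into a small helper like the sketch below. The key names (username, password, host) reflect how such credentials are commonly laid out and are an assumption here, as are all sample values; check them against what your own tab shows.

```python
import json

# Made-up credentials; the key names and every value are assumptions.
# Replace this with the JSON copied from your own "Service Credentials" tab.
SAMPLE_CREDENTIALS = """
{
  "username": "example-user-bluemix",
  "password": "example-password",
  "host": "example-user-bluemix.cloudant.com",
  "port": 443
}
"""


def parse_credentials(raw_json):
    """Extract the fields needed later and derive the account base URL."""
    creds = json.loads(raw_json)
    missing = [key for key in ("username", "password", "host") if not creds.get(key)]
    if missing:
        raise ValueError("missing credential fields: " + ", ".join(missing))
    return {
        "username": creds["username"],
        "password": creds["password"],
        "host": creds["host"],
        "base_url": "https://" + creds["host"],
    }


credentials = parse_credentials(SAMPLE_CREDENTIALS)
```

Failing early on a missing field makes a copy-and-paste mistake visible before any notebook cell tries to connect.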


2. Create a Cloudant database
Navigate back to the "Manage" tab and open the Cloudant dashboard with the
"Launch" button. This takes you to a completely different dashboard
outside of Bluemix, where you navigate between pages using the left-hand panel.
To create a database, use the Databases page.

Create a database and make note of its name for later.
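
The dashboard is all this lab needs, but it can help to know the naming rule before you pick a name: Cloudant follows the CouchDB convention that a database name starts with a lowercase letter and may contain only lowercase letters, digits, and the characters _ $ ( ) + - /. The sketch below checks a name against that rule and describes the equivalent HTTP call (a PUT on the database URL). The function names and the example host are hypothetical, and no request is actually sent.

```python
import re

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2, as used by the lab notebook

# Start with a lowercase letter, then lowercase letters, digits, or _ $ ( ) + - /
_DB_NAME = re.compile(r"^[a-z][a-z0-9_$()+/-]*\Z")


def is_valid_database_name(name):
    """Check a candidate name against the CouchDB/Cloudant naming rule."""
    return bool(_DB_NAME.match(name))


def create_database_request(base_url, name):
    """Describe the HTTP call that creates a database: PUT <base_url>/<name>."""
    if not is_valid_database_name(name):
        raise ValueError("invalid database name: {!r}".format(name))
    # Percent-encode the name so that characters such as "/" stay in one path segment.
    return ("PUT", base_url.rstrip("/") + "/" + quote(name, safe=""))


request = create_database_request("https://example.cloudant.com/", "election_tweets")
```

A name such as "Election" or "1tweets" is rejected by the check, which is a common reason for a database creation to fail.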


3. Create an Insights for Twitter service instance
The next step in the analysis process is to harvest the data. You can use the IBM
Insights for Twitter API service at
https://console.ng.bluemix.net/catalog/services/insights-for-twitter/
to get Twitter data about the election. You can also find the service in the
Bluemix catalog by searching for the keyword "Twitter".

Create a service instance and accept the default values, including the Free Plan tier:


Make note of the Service Credentials of your newly deployed Insights for Twitter service
instance.


4. Create an Apache Spark service instance
In this step you will use the IBM Apache Spark service at
https://console.ng.bluemix.net/catalog/services/apache-spark/
in Bluemix to create a Jupyter notebook. The notebook is written in Python and lets
you script the calls to the Twitter service API you created above. The results of those
calls are persisted in your new Cloudant database.
Alternatively, search the Bluemix catalog for the keyword "Spark"
to find the same service:

Create a new service instance for the IBM Apache Spark service with the default
settings. Select the Personal plan.


Note: The Personal plan is priced at $0.70 per hour for a 2-node execution
engine. With the 30-day Bluemix trial account you don't incur any costs for this
lab. If you use your personal Bluemix account and don't want to see any
charge on your credit card for this lab, please ask us to give you a trial account
instead.
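
If you do run the lab on a paid account, a back-of-the-envelope estimate follows directly from the quoted rate. The sketch below assumes the $0.70 covers the whole 2-node engine and that billing scales linearly with hours; neither detail is confirmed by these instructions, so treat the result as a rough guide only.

```python
from decimal import Decimal

# Rate quoted in the note above; assumed to cover the whole 2-node engine.
HOURLY_RATE_USD = Decimal("0.70")


def estimated_cost(hours):
    """Estimate the charge in USD, assuming simple linear hourly billing."""
    return HOURLY_RATE_USD * Decimal(str(hours))


two_hour_lab = estimated_cost(2)
```

Decimal is used instead of float so that currency amounts are not distorted by binary rounding.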


Work with a Python notebook
In the new service instance you are presented with the console. Create a new Notebook
and select the "Create Notebook" From URL option. Provide a name, an optional
description, and enter the following Notebook URL:
https://raw.githubusercontent.com/cloudant-labs/spark-cloudant/master/tutorials/wowPython.ipynb
Note: If you get an error like "The service is not responding, try again later",
use the back button and try loading the notebook again.
The notebook you just loaded contains the actual lab instructions, with Spark as the
engine and Cloudant as the data store. All code in the notebook is written in Python 2
syntax and requires a running Python 2.7 kernel. By default, the kernel should already
be started and your notebook connected to it.
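
To give an idea of how Spark and Cloudant fit together, the sketch below shows one way a Cloudant database can be loaded as a Spark DataFrame. The data source name com.cloudant.spark and the cloudant.* option keys are taken from the spark-cloudant project's documentation; the helper names are hypothetical, and the cells in the notebook itself remain the authoritative version for this lab.

```python
# Data source name as documented by the spark-cloudant project.
CLOUDANT_DATA_SOURCE = "com.cloudant.spark"


def cloudant_options(host, username, password):
    """Collect the Cloudant connection settings as Spark reader options."""
    return {
        "cloudant.host": host,
        "cloudant.username": username,
        "cloudant.password": password,
    }


def load_cloudant_database(sql_context, database, options):
    """Load one Cloudant database as a DataFrame through an existing SQLContext."""
    reader = sql_context.read.format(CLOUDANT_DATA_SOURCE)
    for key in sorted(options):
        reader = reader.option(key, options[key])
    return reader.load(database)
```

In a notebook where sqlContext already exists, a call would look like load_cloudant_database(sqlContext, "your_database", cloudant_options(host, username, password)), using the credentials and database name you noted earlier.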
Code is structured into cells, which you should execute sequentially, starting at the
top. While a cell executes, you see a [*] next to it, indicating that it is
running. When the cell has completed, you see a number like [1]. That number increments
with every cell execution. Nothing stops you from running a cell "out of order" instead of
sequentially; just make sure all the conditions for executing that cell are met. We
prepared the notebook so that all conditions are met when you run it top to bottom.
A successful cell execution will almost always dump some output right below the cell.
For example:

Should there be no output, you probably have a problem with your Python kernel and
should perform a kernel restart. Use the menu options or action buttons atop your
notebook to interrupt or restart the kernel.

A kernel restart clears your entire session context and you will have to re-run every
instruction required up to the point where you want to resume your work. The comments
provided in the notebook should make it somewhat obvious what every cell requires for
execution. If you are unclear how to proceed after a kernel restart just ask us.

Please go ahead and follow the instructions given in the Python notebook at this point.
To validate the success of your executions you can compare with the RESULT output
we provided at
https://github.com/cloudant-labs/spark-cloudant/master/tutorials/wowPython_RESULT.ipynb

Your cell execution output should look very similar to the output in that HTML page.


Work with a Scala notebook
With the Python notebook above you have already seen a lot of Spark core and
graphing technology at play. Spark also offers a powerful framework for analyzing
data streams: Spark Streaming. Spark Streaming is currently
supported only in Scala notebooks, so we prepared a second notebook, written in Scala,
that you can use to exercise Spark Streaming technology.
Note: Don't worry if you are not familiar with Scala as a language. The notebook
can again be executed in a point-and-click fashion and does not require any
actual coding.
To load the Scala notebook, open a new browser tab and navigate to the Spark service
instance you already created in Bluemix earlier.
Note: If you need a quick way to navigate back to the Spark service, use the
Bluemix menu on the left hand side and click the "Analytics" tab. There you will
immediately see all your notebooks.

Create a new Notebook and select the "Create Notebook" From URL option again.
Provide a name, an optional description, and enter the following Notebook URL:
https://raw.githubusercontent.com/cloudant-labs/spark-cloudant/master/tutorials/wowScala.ipynb


Please go ahead and follow the instructions given in the Scala notebook at this point.
At one point in the notebook you are asked to start the Spark Streaming Context (ssc).
It subscribes to changes in your Cloudant database and processes any changes that occur
within a window of 120 seconds. During these 120 seconds, you should initiate changes
in your Cloudant database.
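
The effect of the window can be pictured with a small plain-Python model. This is not the Spark Streaming API: the event list, the timestamps (seconds since the context was started), and the function name are all invented for illustration. Only changes that arrive while the window is open are processed.

```python
WINDOW_SECONDS = 120  # window length used in this lab


def changes_in_window(events, window_start, window_seconds=WINDOW_SECONDS):
    """Return the documents whose timestamp falls in [start, start + window)."""
    window_end = window_start + window_seconds
    return [doc for timestamp, doc in events
            if window_start <= timestamp < window_end]


# Hypothetical change events: (seconds since the context started, document id).
events = [(5, "tweet-1"), (60, "tweet-2"), (119, "tweet-3"),
          (120, "tweet-4"), (300, "tweet-5")]

seen = changes_in_window(events, window_start=0)
```

In this model the documents arriving at 120 and 300 seconds are missed, which is why new tweets need to be added while the context is still running.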
A simple way to initiate changes to the Cloudant database is to add new tweets.
Go back to your Python notebook and re-execute the cell that calls the
TwitterToCloudant() class.
Note: If you accidentally closed the tab with the Python notebook, you will have
to re-open it and re-execute the cells that load the libraries and create the context.
Don't forget to adjust your connection details either.

Here you can optionally change the query you want to use for the Twitter API and/or the
count of tweets to process.
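
How TwitterToCloudant() stores the tweets internally is not described here. As a generic illustration of adding a limited number of documents to Cloudant in one call, the sketch below builds a request for the standard _bulk_docs endpoint, which accepts a JSON body of the form {"docs": [...]}. The function name, the example host, and the sample tweets are hypothetical, and no request is actually sent.

```python
import json


def bulk_docs_request(base_url, database, tweets, count):
    """Describe a POST to <base_url>/<database>/_bulk_docs for at most `count` tweets."""
    docs = list(tweets)[:count]
    body = json.dumps({"docs": docs})
    url = base_url.rstrip("/") + "/" + database + "/_bulk_docs"
    return ("POST", url, body)


# Hypothetical sample data standing in for tweets returned by a query.
sample_tweets = [{"text": "tweet {}".format(i)} for i in range(5)]

method, url, body = bulk_docs_request(
    "https://example.cloudant.com", "election_tweets", sample_tweets, count=3)
```

Each document added this way shows up as a change in the database, which is what the Spark Streaming receiver in the Scala notebook reacts to.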

You are now simulating a stream of events (new tweet documents being loaded) in your
Cloudant database and can see how the Spark Streaming receiver processes these
events.
After the 120 seconds you can always restart your Streaming Context or increase the
duration.

To validate the success of your executions you can compare with the RESULT output
we provided at
https://github.com/cloudant-labs/spark-cloudant/master/tutorials/wowScala_RESULT.ipynb

Your cell execution output should look very similar to the output in that HTML page.


We Value Your Feedback!
• Don’t forget to submit your World of Watson session and speaker feedback! Your
feedback is very important to us – we use it to continually improve the conference.

• Access the World of Watson Conference Connect tool to quickly submit your surveys
from your smartphone, laptop or conference kiosk.



