Scala: Guide for Data Science
Professionals
Scala will be a valuable tool to have on hand during your data science journey for
everything from data cleaning to cutting-edge machine learning

A course in three modules

BIRMINGHAM - MUMBAI

Scala: Guide for Data Science Professionals
Copyright © 2017 Packt Publishing

All rights reserved. No part of this course may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this course to ensure the accuracy
of the information presented. However, the information contained in this course
is sold without warranty, either express or implied. Neither the authors, nor Packt
Publishing, and its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this course.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this course by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.

Published on: January 2017

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78728-285-8
www.packtpub.com

Credits
Authors
Pascal Bugnion
Arun Manivannan
Patrick R. Nicolas

Reviewers
Umanga Bista
Radek Ostrowski
Yuanhang Wang
Amir Hajian
Shams Mahmood Imam
Gerald Loeffler
Subhajit Datta
Rui Gonçalves
Patricia Hoffman, PhD
Md Zahidul Islam

Content Development Editor
Trusha Shriyan

Graphics
Kirk D'Penha

Production Coordinator
Shantanu N. Zagade

Preface
Scala is a popular language for data science. By emphasizing immutability and
functional constructs, Scala lends itself well to the construction of robust libraries
for concurrency and big data analysis. A rich ecosystem of tools for data science has
therefore developed around Scala, including libraries for accessing SQL and NoSQL
databases, frameworks for building distributed applications such as Apache Spark, and
libraries for linear algebra and numerical algorithms. We will explore this rich and
growing ecosystem in this learning path.

What this learning path covers
Module 1, Scala for Data Science, will introduce you to the libraries for ingesting,
storing, manipulating, processing, and visualizing data in Scala. Packed with
real-world examples and interesting data sets, this module will teach you to ingest data
from flat files and web APIs and store it in a SQL or NoSQL database. It will show
you how to design scalable architectures to process and model your data, starting
from simple concurrency constructs such as parallel collections and futures, through
to actor systems and Apache Spark. As well as learning about Scala's emphasis on
functional structures and immutability, you will learn how to use the right parallel
construct for the job at hand, minimizing development time without compromising
scalability. Finally, you will learn how to build beautiful interactive visualizations
using web frameworks. This module gives tutorials on some of the most common Scala
libraries for data science, allowing you to quickly get up to speed with building data
science and data engineering solutions.

Module 2, Scala Data Analysis Cookbook, will introduce you to the most popular
Scala tools, libraries, and frameworks through practical recipes around loading,
manipulating, and preparing your data. It will also help you explore and make sense
of your data using stunning and insightful visualizations and machine learning
toolkits. Starting with introductory recipes on utilizing the Breeze and Spark libraries,
you will get to grips with how to import data from a host of possible sources and how to
pre-process numerical, string, and date data. Next, you'll get an understanding of
concepts that will help you visualize data using the Apache Zeppelin and Bokeh
bindings in Scala, enabling exploratory data analysis. You will discover how to program
quintessential machine learning algorithms using the Spark ML library, and work through
the steps to scale your machine learning models and deploy them to a standalone
cluster, EC2, YARN, and Mesos. Finally, you will dip into the powerful options presented
by Spark Streaming and machine learning for streaming data, as well as utilizing
Spark GraphX.
Module 3, Scala for Machine Learning, will introduce you to the functional
capabilities of the Scala programming language that are critical to the creation
of machine learning algorithms, such as dependency injection and implicits. Your
learning journey starts with data pre-processing and filtering techniques, then
moves on to clustering and dimension reduction, Naïve Bayes, regression models,
sequential data, regularization and kernelization, support vector machines, neural
networks, genetic algorithms, and reinforcement learning. A review of the Akka
framework and Apache Spark clusters concludes the tutorial. Techniques throughout
the module are applied to the analysis, recommendation, classification, and prediction
of financial markets.
This module will guide you through the process of building AI applications with
diagrams, formal mathematical notation, source code snippets and useful tips.

What you need for this learning path
The examples provided in this learning path require that you have a working Scala
installation and SBT, the Simple Build Tool, a command line utility for compiling
and running Scala code. We will walk you through how to install these in the next
sections. We do not require a specific IDE. The code examples can be written in your
favorite text editor or IDE.

Who this learning path is for
This learning path is perfect for those who are comfortable with Scala programming
and now want to enter the field of data science. Some knowledge of statistics is
expected.

Reader feedback
Feedback from our readers is always welcome. Let us know what you think about
this course—what you liked or disliked. Reader feedback is important for us as it
helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail feedback@packtpub.com, and mention
the course's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing
or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support
Now that you are the proud owner of a Packt course, we have a number of things to
help you to get the most from your purchase.

Downloading the example code
You can download the example code files for this course from your account at
http://www.packtpub.com. If you purchased this course elsewhere, you can visit
http://www.packtpub.com/support and register to have the files e-mailed directly
to you.
You can download the code files by following these steps:
1. Log in to or register on our website using your e-mail address and password.
2. Hover the mouse pointer over the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the course in the Search box.
5. Select the course for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this course.
7. Click on Code Download.
You can also download the code files by clicking on the Code Files button on the
course's webpage at the Packt Publishing website. This page can be accessed by
entering the course's name in the Search box. Please note that you need to be logged
in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder
using the latest version of:
• WinRAR / 7-Zip for Windows
• Zipeg / iZip / UnRarX for Mac
• 7-Zip / PeaZip for Linux

The code bundle for the course is also hosted on GitHub at
https://github.com/PacktPublishing/Scala-Guide-for-Data-Science-Professionals. We
also have other code bundles from our rich catalog of books, videos, and courses
available at https://github.com/PacktPublishing/. Check them out!

Errata
Although we have taken every care to ensure the accuracy of our content, mistakes
do happen. If you find a mistake in one of our courses—maybe a mistake in the text
or the code—we would be grateful if you could report this to us. By doing so, you
can save other readers from frustration and help us improve subsequent versions
of this course. If you find any errata, please report them by visiting
http://www.packtpub.com/submit-errata, selecting your course, clicking on the Errata
Submission Form link, and entering the details of your errata. Once your errata are
verified, your submission will be accepted and the errata will be uploaded to our
website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the course in the
search field. The required information will appear under the Errata section.

Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all
media. At Packt, we take the protection of our copyright and licenses very seriously.
If you come across any illegal copies of our works in any form on the Internet, please
provide us with the location address or website name immediately so that we can
pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected pirated
material.
We appreciate your help in protecting our authors and our ability to bring you
valuable content.

Questions
If you have a problem with any aspect of this course, you can contact us at
questions@packtpub.com, and we will do our best to address the problem.

Table of Contents

Module 1: Scala for Data Science
Chapter 1: Scala and Data Science
Data science
Programming in data science
Why Scala?
When not to use Scala
Summary
References

Chapter 2: Manipulating Data with Breeze
Code examples
Installing Breeze
Getting help on Breeze
Basic Breeze data types
An example – logistic regression
Towards re-usable code
Alternatives to Breeze
Summary
References

Chapter 3: Plotting with breeze-viz
Diving into Breeze
Customizing plots
Customizing the line type
More advanced scatter plots
Multi-plot example – scatterplot matrix plots
Managing without documentation
Breeze-viz reference
Data visualization beyond breeze-viz
Summary

Chapter 4: Parallel Collections and Futures
Parallel collections
Futures
Summary
References

Chapter 5: Scala and SQL through JDBC
Interacting with JDBC
First steps with JDBC
JDBC summary
Functional wrappers for JDBC
Safer JDBC connections with the loan pattern
Enriching JDBC statements with the "pimp my library" pattern
Wrapping result sets in a stream
Looser coupling with type classes
Creating a data access layer
Summary
References

Chapter 6: Slick – A Functional Interface for SQL
FEC data
Invokers
Operations on columns
Aggregations with "Group by"
Accessing database metadata
Slick versus JDBC
Summary
References

Chapter 7: Web APIs
A whirlwind tour of JSON
Querying web APIs
JSON in Scala – an exercise in pattern matching
Extraction using case classes
Concurrency and exception handling with futures
Authentication – adding HTTP headers
Summary
References

Chapter 8: Scala and MongoDB
MongoDB
Connecting to MongoDB with Casbah
Inserting documents
Extracting objects from the database
Complex queries
Casbah query DSL
Custom type serialization
Beyond Casbah
Summary
References

Chapter 9: Concurrency with Akka
GitHub follower graph
Actors as people
Hello world with Akka
Case classes as messages
Actor construction
Anatomy of an actor
Follower network crawler
Fetcher actors
Routing
Message passing between actors
Queue control and the pull pattern
Accessing the sender of a message
Stateful actors
Follower network crawler
Fault tolerance
Custom supervisor strategies
Life-cycle hooks
What we have not talked about
Summary
References

Chapter 10: Distributed Batch Processing with Spark
Installing Spark
Acquiring the example data
Resilient distributed datasets
Building and running standalone programs
Spam filtering
Lifting the hood
Data shuffling and partitions
Summary
Reference

Chapter 11: Spark SQL and DataFrames
DataFrames – a whirlwind introduction
Aggregation operations
Joining DataFrames together
Custom functions on DataFrames
DataFrame immutability and persistence
SQL statements on DataFrames
Complex data types – arrays, maps, and structs
Interacting with data sources
Standalone programs
Summary
References

Chapter 12: Distributed Machine Learning with MLlib
Introducing MLlib – Spam classification
Pipeline components
Evaluation
Regularization in logistic regression
Cross-validation and model selection
Beyond logistic regression
Summary
References

Chapter 13: Web APIs with Play
Client-server applications
Introduction to web frameworks
Model-View-Controller architecture
Single page applications
Building an application
The Play framework
Dynamic routing
Actions
Interacting with JSON
Querying external APIs and consuming JSON
Creating APIs with Play: a summary
Rest APIs: best practice
Summary
References

Chapter 14: Visualization with D3 and the Play Framework
GitHub user data
Do I need a backend?
JavaScript dependencies through web-jars
Towards a web application: HTML templates
Modular JavaScript through RequireJS
Bootstrapping the applications
Client-side program architecture
Drawing plots with NVD3
Summary
References

Appendix: Pattern Matching and Extractors
Pattern matching in for comprehensions
Pattern matching internals
Extracting sequences
Summary
Reference

Module 2: Scala Data Analysis Cookbook
Chapter 1: Getting Started with Breeze
Introduction
Getting Breeze – the linear algebra library
Working with vectors
Working with matrices
Vectors and matrices with randomly distributed values
Reading and writing CSV files

Chapter 2: Getting Started with Apache Spark DataFrames
Introduction
Getting Apache Spark
Creating a DataFrame from CSV
Manipulating DataFrames
Creating a DataFrame from Scala case classes

Chapter 3: Loading and Preparing Data – DataFrame
Introduction
Loading more than 22 features into classes
Loading JSON into DataFrames
Storing data as Parquet files
Using the Avro data model in Parquet
Loading from RDBMS
Preparing data in Dataframes

Chapter 4: Data Visualization
Introduction
Visualizing using Zeppelin
Creating scatter plots with Bokeh-Scala
Creating a time series MultiPlot with Bokeh-Scala

Chapter 5: Learning from Data
Introduction
Supervised and unsupervised learning
Gradient descent
Predicting continuous values using linear regression
Binary classification using LogisticRegression and SVM
Binary classification using LogisticRegression with Pipeline API
Clustering using K-means
Feature reduction using principal component analysis

Chapter 6: Scaling Up
Introduction
Building the Uber JAR
Submitting jobs to the Spark cluster (local)
Running the Spark Standalone cluster on EC2
Running the Spark Job on Mesos (local)
Running the Spark Job on YARN (local)

Chapter 7: Going Further
Introduction
Using Spark Streaming to subscribe to a Twitter stream
Using Spark as an ETL tool
Using StreamingLogisticRegression to classify a Twitter stream using Kafka as a training stream
Using GraphX to analyze Twitter data

Module 3: Scala for Machine Learning
Chapter 1: Getting Started
Mathematical notation for the curious
Why machine learning?
Why Scala?
Model categorization
Taxonomy of machine learning algorithms
Tools and frameworks
Source code
Let's kick the tires
Summary

Chapter 2: Hello World!
Modeling
Designing a workflow
Assessing a model
Summary

Chapter 3: Data Preprocessing
Time series
Moving averages
Fourier analysis
The Kalman filter
Alternative preprocessing techniques
Summary

Chapter 4: Unsupervised Learning
Clustering
Dimension reduction
Performance considerations
Summary

Chapter 5: Naïve Bayes Classifiers
Probabilistic graphical models
Naïve Bayes classifiers
Multivariate Bernoulli classification
Naïve Bayes and text mining
Pros and cons
Summary

Chapter 6: Regression and Regularization
Linear regression
Regularization
Numerical optimization
The logistic regression
Summary

Chapter 7: Sequential Data Models
Markov decision processes
The hidden Markov model (HMM)
Conditional random fields
CRF and text analytics
Comparing CRF and HMM
Performance consideration
Summary

Chapter 8: Kernel Models and Support Vector Machines
Kernel functions
The support vector machine (SVM)
Support vector classifier (SVC)
Anomaly detection with one-class SVC
Support vector regression (SVR)
Performance considerations
Summary

Chapter 9: Artificial Neural Networks
Feed-forward neural networks (FFNN)
The multilayer perceptron (MLP)
Evaluation
Benefits and limitations
Summary

Chapter 10: Genetic Algorithms
Evolution
Genetic algorithms and machine learning
Genetic algorithm components
Implementation
GA for trading strategies
Advantages and risks of genetic algorithms
Summary

Chapter 11: Reinforcement Learning
Introduction
Learning classifier systems
Summary

Chapter 12: Scalable Frameworks
Overview
Scala
Scalability with Actors
Akka
Apache Spark
Summary

Appendix A: Basic Concepts
Scala programming
Mathematics
Finances 101
Suggested online courses
References

Bibliography

Module 1

Scala for Data Science
Leverage the power of Scala to build scalable, robust data science applications

Scala and Data Science
The second half of the 20th century was the age of silicon. In fifty years, computing
power went from extremely scarce to entirely mundane. The first half of the 21st
century is the age of the Internet. The last 20 years have seen the rise of giants such as
Google, Twitter, and Facebook—giants that have forever changed the way we view
knowledge.
The Internet is a vast nexus of information. Ninety percent of the data generated by
humanity has been generated in the last 18 months. The programmers, statisticians,
and scientists who can harness this glut of data to derive real understanding will
have an ever greater influence on how businesses, governments, and charities
make decisions.
This book strives to introduce some of the tools that you will need to synthesize the
avalanche of data to produce true insight.

Data science
Data science is the process of extracting useful information from data. As a discipline,
it remains somewhat ill-defined, with nearly as many definitions as there are experts.
Rather than add yet another definition, I will follow Drew Conway's description
(http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram).
He describes data science as the culmination of three orthogonal sets of skills:
• Data scientists must have hacking skills. Data is stored and transmitted
  through computers. Computers, programming languages, and libraries
  are the hammers and chisels of data scientists; they must wield them with
  confidence and accuracy to sculpt the data as they please. This is where Scala
  comes in: it's a powerful tool to have in your programming toolkit.

• Data scientists must have a sound understanding of statistics and numerical
  algorithms. Good data scientists will understand how machine learning
  algorithms function and how to interpret results. They will not be fooled by
  misleading metrics, deceptive statistics, or misinterpreted causal links.

• A good data scientist must have a sound understanding of the problem
  domain. The data science process involves building and discovering
  knowledge about the problem domain in a scientifically rigorous manner.
  The data scientist must, therefore, ask the right questions, be aware of
  previous results, and understand how the data science effort fits in the wider
  business or research context.

Drew Conway summarizes this elegantly with a Venn diagram showing data
science at the intersection of hacking skills, maths and statistics knowledge,
and substantive expertise.

It is, of course, rare for people to be experts in more than one of these areas. Data
scientists often work in cross-functional teams, with different members providing the
expertise for different areas. To function effectively, every member of the team must
nevertheless have a general working knowledge of all three areas.

To give a more concrete overview of the workflow in a data science project, let's
imagine that we are trying to write an application that analyzes the public perception
of a political campaign. This is what the data science pipeline might look like:
• Obtaining data: This might involve extracting information from text files,
  polling a sensor network or querying a web API. We could, for instance,
  query the Twitter API to obtain lists of tweets with the relevant hashtags.

• Data ingestion: Data often comes from many different sources and might be
  unstructured or semi-structured. Data ingestion involves moving data from
  the data source, processing it to extract structured information, and storing
  this information in a database. For tweets, for instance, we might extract the
  username, the names of other users mentioned in the tweet, the hashtags, text
  of the tweet, and whether the tweet contains certain keywords.

• Exploring data: We often have a clear idea of what information we want to
  extract from the data but very little idea how. For instance, let's imagine that
  we have ingested thousands of tweets containing hashtags relevant to our
  political campaign. There is no clear path to go from our database of tweets
  to the end goal: insight into the overall public perception of our campaign.
  Data exploration involves mapping out how we are going to get there. This
  step will often uncover new questions or sources of data, which requires
  going back to the first step of the pipeline. For our tweet database, we might,
  for instance, decide that we need to have a human manually label a thousand
  or more tweets as expressing "positive" or "negative" sentiments toward
  the political campaign. We could then use these tweets as a training set to
  construct a model.

• Feature building: A machine learning algorithm is only as good as the
  features that enter it. A significant fraction of a data scientist's time involves
  transforming and combining existing features to create new features more
  closely related to the problem that we are trying to solve. For instance, we
  might construct a new feature corresponding to the number of "positive"
  sounding words or pairs of words in a tweet.

• Model construction and training: Having built the features that enter the
  model, the data scientist can now train machine learning algorithms on their
  datasets. This will often involve trying different algorithms and optimizing
  model hyperparameters. We might, for instance, settle on using a random
  forest algorithm to decide whether a tweet is "positive" or "negative" about
  the campaign. Constructing the model involves choosing the right number
  of trees and how to calculate impurity measures. A sound understanding of
  statistics and the problem domain will help inform these decisions.

• Model extrapolation and prediction: The data scientists can now use their
  new model to try and infer information about previously unseen data points.
  They might pass a new tweet through their model to ascertain whether it
  speaks positively or negatively of the political campaign.

• Distillation of intelligence and insight from the model: The data scientists
  combine the outcome of the data analysis process with knowledge of the
  business domain to inform business decisions. They might discover that
  specific messages resonate better with the target audience, or with specific
  segments of the target audience, leading to more accurate targeting. A key
  part of informing stakeholders involves data visualization and presentation:
  data scientists create graphs, visualizations, and reports to help make the
  insights derived clear and compelling.
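
To make the feature-building step a little more concrete, here is a minimal sketch of a feature that counts "positive"-sounding words in a tweet. The TweetFeatures object, the positiveWordCount function, and the tiny word list are all hypothetical names and data invented for this illustration; a real project would rely on a curated sentiment lexicon.

```scala
object TweetFeatures {
  // Hypothetical toy lexicon of "positive" words (an assumption made for
  // this example, not a real sentiment lexicon).
  val positiveWords: Set[String] = Set("great", "good", "love", "win")

  // Lower-case the tweet, split it on runs of characters other than a-z,
  // and count how many of the resulting tokens appear in the lexicon.
  def positiveWordCount(tweet: String): Int =
    tweet.toLowerCase
      .split("[^a-z]+")
      .count(positiveWords.contains)
}
```

For instance, TweetFeatures.positiveWordCount("Great rally, I love this campaign!") evaluates to 2. The resulting count could then be used as one column of the feature vector passed to a model.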

This is far from a linear pipeline. Often, insights gained at one stage will require the
data scientists to backtrack to a previous stage of the pipeline. Indeed, the generation
of business insights from raw data is normally an iterative process: the data scientists
might do a rapid first pass to verify the premise of the problem and then gradually
refine the approach by adding new data sources or new features or trying new
machine learning algorithms.
In this book, you will learn how to deal with each step of the pipeline in Scala,
leveraging existing libraries to build robust applications.

Programming in data science
This book is not a book about data science. It is a book about how to use Scala, a
programming language, for data science. So, where does programming come in
when processing data?
Computers are involved at every step of the data science pipeline, but not necessarily
in the same manner. The style of programs that we build will be drastically different
if we are just writing throwaway scripts to explore data or trying to build a scalable
application that pushes data through a well-understood pipeline to continuously
deliver business intelligence.
Let's imagine that we work for a company making games for mobile phones in
which you can purchase in-game benefits. The majority of users never buy anything,
but a small fraction is likely to spend a lot of money. We want to build a model that
recognizes big spenders based on their play patterns.

The first step is to explore data, find the right features, and build a model based on
a subset of the data. In this exploration phase, we have a clear goal in mind but little
idea of how to get there. We want a light, flexible language with strong libraries to
get us a working model as soon as possible.
Once we have a working model, we need to deploy it on our gaming platform to
analyze the usage patterns of all the current users. This is a very different problem:
we have a relatively clear understanding of the goals of the program and of how to
get there. The challenge comes in designing software that will scale out to handle all
the users and be robust to future changes in usage patterns.
In practice, the type of software that we write typically lies on a spectrum ranging
from a single throwaway script to production-level code that must be proof against
future expansion and load increases. Before writing any code, the data scientist
must understand where their software lies on this spectrum. Let's call this the
permanence spectrum.

Why Scala?
You want to write a program that handles data. Which language should you choose?
There are a few different options. You might choose a dynamic language such as
Python or R or a more traditional object-oriented language such as Java. In this
section, we will explore how Scala differs from these languages and when it might
make sense to use it.
When choosing a language, the architect's trade-off lies in a balance of provable
correctness versus development speed. Which of these aspects you need to
emphasize will depend on the application requirements and where on the
permanence spectrum your program lies. Is this a short script that will be used by a
few people who can easily fix any problems that arise? If so, you can probably permit
a certain number of bugs in rarely used code paths: when a developer hits a snag,
they can just fix the problem as it arises. By contrast, if you are developing a database
engine that you plan on releasing to the wider world, you will, in all likelihood, favor
correctness over rapid development. The SQLite database engine, for instance, is
famous for its extensive test suite, with 800 times as much testing code as application
code (https://www.sqlite.org/testing.html).
What matters, when estimating the correctness of a program, is not the perceived
absence of bugs but the degree to which you can prove that certain bugs are absent.

There are several ways of proving the absence of bugs before the code has even run:
• Static type checking occurs at compile time in statically typed languages,
  but this can also be used in strongly typed dynamic languages that support
  type annotations or type hints. Type checking helps verify that we are using
  functions and classes as intended.

• Static analyzers and linters that check for undefined variables or suspicious
  behavior (such as parts of the code that can never be reached).

• Declaring some attributes as immutable or constant in compiled languages.

• Unit testing to demonstrate the absence of bugs along particular code paths.

There are several more ways of checking for the absence of some bugs at runtime:
• Dynamic type checking in both statically typed and dynamic languages

• Assertions verifying supposed program invariants or expected contracts
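
As a brief illustration of such runtime checks, the following sketch uses the require and assert methods from Scala's Predef. The RuntimeChecks object and the trainTestSplit function are hypothetical names chosen for this example.

```scala
object RuntimeChecks {
  // Splits a data set into a training part and a test part.
  def trainTestSplit[A](data: Seq[A], trainFraction: Double): (Seq[A], Seq[A]) = {
    // Expected contract on the caller's arguments: `require` throws an
    // IllegalArgumentException when the condition is false.
    require(trainFraction >= 0.0 && trainFraction <= 1.0,
      "trainFraction must be between 0 and 1")
    val nTrain = (data.size * trainFraction).toInt
    val (train, test) = data.splitAt(nTrain)
    // Supposed invariant: no element is lost or duplicated by the split.
    // `assert` throws an AssertionError when the condition is false.
    assert(train.size + test.size == data.size)
    (train, test)
  }
}
```

Calling RuntimeChecks.trainTestSplit(Seq(1, 2, 3, 4), 0.5) returns the two halves of the sequence, whereas passing a fraction of 1.5 fails immediately at the require call rather than producing a silently wrong result.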

In the next sections, we will examine how Scala compares to other languages in
data science.

Static typing and type inference
Scala's static typing system is very versatile. A lot of information as to the program's
behavior can be encoded in types, allowing the compiler to guarantee a certain
level of correctness. This is particularly useful for code paths that are rarely used. A
dynamic language cannot catch errors until a particular branch of execution runs, so
a bug can persist for a long time until the program runs into it. In a statically typed
language, any bug that can be caught by the compiler will be caught at compile time,
before the program has even started running.
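One way of encoding information in types, sketched below under assumed names (Sentiments, Sentiment, and score are hypothetical), is to model a closed set of labels as a sealed trait. The compiler then knows every possible case and can flag a pattern match that forgets one of them.

```scala
object Sentiments {
  // A closed set of labels: because the trait is sealed, all of its
  // subtypes must be declared in this file, so the compiler knows them all.
  sealed trait Sentiment
  case object Positive extends Sentiment
  case object Negative extends Sentiment

  // If one of the cases below were removed, the compiler would warn that
  // the match may not be exhaustive, before the program is ever run.
  def score(sentiment: Sentiment): Int = sentiment match {
    case Positive => 1
    case Negative => -1
  }
}
```

The check applies to every match on a Sentiment, including those buried in rarely executed branches.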
Statically typed object-oriented languages have often been criticized for being
needlessly verbose. Consider the initialization of an instance of the Example
class in Java:
Example myInstance = new Example();

We have to repeat the class name twice—once to define the compile-time type of
the myInstance variable and once to construct the instance itself. This feels like
unnecessary work: the compiler knows that the type of myInstance is Example (or a
superclass of Example) as we are binding a value of the Example type.

Scala, like most functional languages, uses type inference to allow the compiler to
infer the type of variables from the instances bound to them. We would write the
equivalent line in Scala as follows:
val myInstance = new Example()

The Scala compiler infers that myInstance has the Example type at compile time. A
lot of the time, it is enough to specify the types of the arguments and of the return
value of a function. The compiler can then infer types for all the variables defined in
the body of the function. Scala code is usually much more concise and readable than
the equivalent Java code, without compromising any of the type safety.
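A short sketch of this pattern, assuming a hypothetical function meanAbsoluteDeviation: only the parameter and return types are written out, and the types of the local values are left to the compiler.

```scala
// Only the signature is annotated; the locals are inferred.
def meanAbsoluteDeviation(values: Seq[Double]): Double = {
  val mean = values.sum / values.length                 // inferred: Double
  val deviations = values.map(v => math.abs(v - mean))  // inferred: Seq[Double]
  deviations.sum / deviations.length
}
```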

Scala encourages immutability
Scala encourages the use of immutable objects. In Scala, it is very easy to define an
attribute as immutable:
val amountSpent = 200

The default collections are immutable:
val clientIds = List("123", "456") // List is immutable
clientIds(1) = "589" // Compile-time error

Having immutable objects removes a common source of bugs. Knowing that some
objects cannot be changed once instantiated reduces the number of places bugs
can creep in. Instead of considering the lifetime of the object, we can narrow in
on the constructor.
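To "change" an immutable collection, we build a new one instead. The following sketch uses the standard List.updated method; the value name corrected is only illustrative.

```scala
val clientIds = List("123", "456")

// updated returns a new list; the original list is left untouched.
val corrected = clientIds.updated(1, "589")
```

Any code still holding a reference to clientIds keeps seeing List("123", "456").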

Scala and functional programs
Scala encourages functional code. A lot of Scala code consists of using higher-order
functions to transform collections. You, as a programmer, do not have to deal with
the details of iterating over the collection. Let's write an occurrencesOf function
that returns the indices at which an element occurs in a list:
def occurrencesOf[A](elem:A, collection:List[A]):List[Int] = {
  for {
    (currentElem, index) <- collection.zipWithIndex
    if (currentElem == elem)
  } yield index
}

How does this work? We first declare a new list, collection.zipWithIndex, whose
elements are (collection(0), 0), (collection(1), 1), and so on: pairs of the
collection's elements and their indexes.

We then tell Scala that we want to iterate over this collection, binding the
currentElem variable to the current element and index to the index. We apply
a filter on the iteration, selecting only those elements for which currentElem ==
elem. We then tell Scala to just return the index variable.
We did not need to deal with the details of the iteration process in Scala. The syntax is
very declarative: we tell the compiler that we want the index of every element equal to
elem in collection and let the compiler worry about how to iterate over collection.
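The for-comprehension is syntactic sugar: the compiler rewrites it into calls to higher-order methods on the collection. As a sketch, the same transformation can also be written with collect, which filters and maps in one step. The name occurrencesOfCollect is hypothetical.

```scala
// Same behavior as occurrencesOf, written with collection methods directly.
def occurrencesOfCollect[A](elem: A, collection: List[A]): List[Int] =
  collection.zipWithIndex.collect {
    case (currentElem, index) if currentElem == elem => index
  }
```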
Consider the equivalent in Java:
static <T> List<Integer> occurrencesOf(T elem, List<T> collection) {
  List<Integer> occurrences = new ArrayList<Integer>();
  for (int i = 0; i < collection.size(); i++) {
    if (collection.get(i).equals(elem)) {
      occurrences.add(i);
    }
  }
  return occurrences;
}

[...]

scala> val nTosses = 100
nTosses: Int = 100
scala> def trial = (0 until nTosses).count { i =>
  util.Random.nextBoolean() // count the number of heads
}
trial: Int


The trial function runs a single set of 100 throws, returning the number of heads:
scala> trial
Int = 51

To get our answer, we just need to repeat trial as many times as we can and
aggregate the results. Repeating the same set of operations is ideally suited to
parallel collections:
scala> val nTrials = 100000
nTrials: Int = 100000
scala> (0 until nTrials).par.count { i => trial >= 60 }
Int = 2745

The probability is thus approximately 2.5% to 3%. All we had to do to distribute the
calculation over every CPU in our computer is use the par method to parallelize
the range (0 until nTrials). This demonstrates the benefits of having a coherent
abstraction: parallel collections let us trivially parallelize any computation that can be
phrased in terms of higher-order functions on collections.
Clearly, not every problem is as easy to parallelize as a simple Monte Carlo problem.
However, by offering a rich set of intuitive abstractions, Scala makes writing parallel
applications manageable.
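The Monte Carlo estimate can be cross-checked against the exact binomial result. For a fair coin, the probability of at least 60 heads in 100 tosses is the sum of C(100, k) / 2^100 for k from 60 to 100. The sketch below evaluates this with the standard library's BigInt; the helper name binomial is an assumption made for this illustration.

```scala
// Binomial coefficient C(n, k), built up as C(n, i) = C(n, i - 1) * (n - i + 1) / i.
// Each intermediate division is exact.
def binomial(n: Int, k: Int): BigInt =
  (1 to k).foldLeft(BigInt(1)) { (acc, i) => acc * (n - i + 1) / i }

val nTosses = 100

// Number of head/tail sequences with at least 60 heads.
val favourable = (60 to nTosses).map(k => binomial(nTosses, k)).sum

// Total number of equally likely sequences.
val total = BigInt(2).pow(nTosses)

val probability = (BigDecimal(favourable) / BigDecimal(total)).toDouble
```

The value of probability falls inside the 2.5% to 3% range quoted above, which is consistent with the simulated count of 2745 out of 100000 trials.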

Interoperability with Java
Scala runs on the Java virtual machine. The Scala compiler compiles programs to
Java byte code. Thus, Scala developers have access to Java libraries natively. Given
the phenomenal number of applications written in Java, both open source and as
part of the legacy code in organizations, the interoperability of Scala and Java helps
explain the rapid uptake of Scala.
Interoperability has not just been unidirectional: some Scala libraries, such as the
Play framework, are becoming increasingly popular among Java developers.
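As a minimal sketch of this interoperability, the following snippet instantiates a Java collection and calls a static Java method from Scala, using only classes from the JDK. The names stored in the list are arbitrary.

```scala
import java.util.{ArrayList, Collections}

// A Java collection, instantiated and used directly from Scala.
val names = new ArrayList[String]()
names.add("Odersky")
names.add("Bugnion")
names.add("Nicolas")

// A static Java method, called like a method on a Scala object.
Collections.sort(names)
val first = names.get(0)
```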


When not to use Scala
In the previous sections, we described how Scala's strong type system, preference
for immutability, functional capabilities, and parallelism abstractions make it easy to
write reliable programs and minimize the risk of unexpected behavior.
What reasons might you have to avoid Scala in your next project? One important
reason is familiarity. Scala introduces many concepts such as implicits, type classes,
and composition using traits that might not be familiar to programmers coming
from the object-oriented world. Scala's type system is very expressive, but getting to
know it well enough to use its full power takes time and requires adjusting to a new
programming paradigm. Finally, dealing with immutable data structures can feel
alien to programmers coming from Java or Python.
Nevertheless, these are all drawbacks that can be overcome with time. Scala does
fall short of the other data science languages in library availability. The IPython
Notebook, coupled with matplotlib, is an unparalleled resource for data exploration.
There are ongoing efforts to provide similar functionality in Scala (Spark Notebooks
or Apache Zeppelin, for instance), but there are no projects with the same level of
maturity. The type system can also be a minor hindrance when one is exploring data
or trying out different models.
Thus, in this author's biased opinion, Scala excels for more permanent programs. If
you are writing a throwaway script or exploring data, you might be better served
with Python. If you are writing something that will need to be reused and requires a
certain level of provable correctness, you will find Scala extremely powerful.

Summary
Now that the obligatory introduction is over, it is time to write some Scala code. In
the next chapter, you will learn about leveraging Breeze for numerical computations
with Scala. For our first foray into data science, we will use logistic regression to
predict the gender of a person given their height and weight.


References
By far, the best book on Scala is Programming in Scala by Martin Odersky, Lex Spoon,
and Bill Venners. Besides being authoritative (Martin Odersky is the driving force
behind Scala), this book is also approachable and readable.
Scala Puzzlers by Andrew Phillips and Nermin Šerifović provides a fun way to learn
more advanced Scala.
Scala for Machine Learning by Patrick R. Nicolas provides examples of how to write
machine learning algorithms with Scala.


Manipulating Data with Breeze
Data science is, by and large, concerned with the manipulation of structured data.
A large fraction of structured datasets can be viewed as tabular data: each row
represents a particular instance, and columns represent different attributes of that
instance. The ubiquity of tabular representations explains the success of spreadsheet
programs like Microsoft Excel, or of tools like SQL databases.
To be useful to data scientists, a language must support the manipulation of columns
or tables of data. Python does this through NumPy and pandas, for instance.
Unfortunately, there is no single, coherent ecosystem for numerical computing in
Scala that quite measures up to the SciPy ecosystem in Python.
In this chapter, we will introduce Breeze, a library for fast linear algebra and
manipulation of data arrays as well as many other features necessary for scientific
computing and data science.

Code examples
The easiest way to access the code examples in this book is to clone the GitHub
repository:
$ git clone 'https://github.com/pbugnion/s4ds'

The code samples for each chapter are in a single, standalone folder. You may also
browse the code online on GitHub.


Installing Breeze
If you have downloaded the code examples for this book, the easiest way of using
Breeze is to go into the chap02 directory and type sbt console at the command
line. This will open a Scala console in which you can import Breeze.
If you want to build a standalone project, the most common way of installing Breeze
(and, indeed, any Scala module) is through SBT. To fetch the dependencies required
for this chapter, copy the following lines to a file called build.sbt, taking care to
leave an empty line after scalaVersion:
scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze" % "0.11.2",
  "org.scalanlp" %% "breeze-natives" % "0.11.2"
)

Open a Scala console in the same directory as your build.sbt file by typing sbt
console in a terminal. You can check that Breeze is working correctly by importing
Breeze from the Scala prompt:

scala> import breeze.linalg._
import breeze.linalg._

Getting help on Breeze
This chapter gives a reasonably detailed introduction to Breeze, but it does not aim
to give a complete API reference.
To get a full list of Breeze's functionality, consult the Breeze Wiki page on GitHub
at https://github.com/scalanlp/breeze/wiki. This is very complete for some
modules and less complete for others. The source code
(https://github.com/scalanlp/breeze/) is detailed and gives a lot of information. To understand how a
particular function is meant to be used, look at the unit tests for that function.


Chapter 2

Basic Breeze data types
Breeze is an extensive library providing fast and easy manipulation of arrays of
data, routines for optimization, interpolation, linear algebra, signal processing,
and numerical integration.
The basic linear algebra operations underlying Breeze rely on the netlib-java
library, which can use system-optimized BLAS and LAPACK libraries, if present.
Thus, linear algebra operations in Breeze are often extremely fast. Breeze is still
undergoing rapid development and can, therefore, be somewhat unstable.

Vectors
Breeze makes manipulating one- and two-dimensional data structures easy. To start,
open a Scala console through SBT and import Breeze:
$ sbt console
scala> import breeze.linalg._
import breeze.linalg._

Let's dive straight in and define a vector:
scala> val v = DenseVector(1.0, 2.0, 3.0)
breeze.linalg.DenseVector[Double] = DenseVector(1.0, 2.0, 3.0)

We have just defined a three-element vector, v. Vectors are just one-dimensional
arrays of data exposing methods tailored to numerical uses. They can be indexed
like other Scala collections:
scala> v(1)
Double = 2.0

They support element-wise operations with a scalar:
scala> v :* 2.0 // :* is 'element-wise multiplication'
breeze.linalg.DenseVector[Double] = DenseVector(2.0, 4.0, 6.0)

They also support element-wise operations with another vector:
scala> v :+ DenseVector(4.0, 5.0, 6.0) // :+ is 'element-wise addition'
breeze.linalg.DenseVector[Double] = DenseVector(5.0, 7.0, 9.0)


Breeze makes writing vector operations intuitive and considerably more readable
than the native Scala equivalent.
Note that Breeze will refuse (at compile time) to coerce operands to the correct type:
scala> v :* 2 // element-wise multiplication by integer
<console>:15: error: could not find implicit value for parameter op:
...

It will also refuse (at runtime) to add vectors together if they have different lengths:
scala> v :+ DenseVector(8.0, 9.0)
java.lang.IllegalArgumentException: requirement failed: Vectors must have
same length: 3 != 2
...
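For comparison, here is a sketch of what a plain-Scala counterpart of these element-wise operations might look like. The function addElementwise is hypothetical. Note that zip silently drops unmatched elements, so an explicit require is needed to mimic the runtime length check shown above.

```scala
// Hypothetical plain-Scala counterpart of v :+ w for Vector[Double].
def addElementwise(v: Vector[Double], w: Vector[Double]): Vector[Double] = {
  // zip would silently truncate to the shorter input, so check lengths first.
  require(v.length == w.length,
    s"Vectors must have same length: ${v.length} != ${w.length}")
  v.zip(w).map { case (a, b) => a + b }
}

val v = Vector(1.0, 2.0, 3.0)
val doubled = v.map(_ * 2.0)                          // like v :* 2.0
val summed = addElementwise(v, Vector(4.0, 5.0, 6.0)) // like v :+ DenseVector(4.0, 5.0, 6.0)
```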

Basic manipulation of vectors in Breeze will feel natural to anyone used to working
with NumPy, MATLAB, or R.
So far, we have only looked at element-wise operators. These are all prefixed with
a colon. All the usual suspects are present: :+, :*, :-, :/, :% (remainder), and :^
(power) as well as Boolean operators. To see the full list of operators, have a look at
the API documentation for DenseVector or DenseMatrix
(https://github.com/scalanlp/breeze/wiki/Linear-Algebra-Cheat-Sheet).
Besides element-wise operations, Breeze vectors support the operations you might
expect of mathematical vectors, such as the dot product:
scala> val v2 = DenseVector(4.0, 5.0, 6.0)
breeze.linalg.DenseVector[Double] = DenseVector(4.0, 5.0, 6.0)
scala> v dot v2
Double = 32.0


Pitfalls of element-wise operators
Besides the :+ and :- operators for element-wise addition and
subtraction that we have seen so far, we can also use the more
traditional + and - operators:
scala> v + v2
breeze.linalg.DenseVector[Double] = DenseVector(5.0, 7.0, 9.0)

One must, however, be very careful with operator precedence rules
when mixing :+ or :* with + operators. The :+ and :* operators have
very low operator precedence, so they will be evaluated last. This can
lead to some counter-intuitive behavior:
scala> 2.0 :* v + v2 // !! equivalent to 2.0 :* (v + v2)
breeze.linalg.DenseVector[Double] = DenseVector(10.0, 14.0, 18.0)

By contrast, if we use :+ instead of +, the mathematical precedence of
operators is respected:
scala> 2.0 :* v :+ v2 // equivalent to (2.0 :* v) :+ v2
breeze.linalg.DenseVector[Double] = DenseVector(6.0, 9.0, 12.0)

In summary, one should avoid mixing the :+ style operators with the +
style operators as much as possible.
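This behavior comes from Scala itself rather than from Breeze: the precedence of an infix operator is determined by its first character, and operators starting with a colon rank below those starting with + or -. The sketch below reproduces the effect with a toy Vec class and a ScalarOps wrapper, both hypothetical stand-ins that do not use Breeze.

```scala
object PrecedenceDemo {
  // Toy vector type, standing in for a Breeze vector.
  final case class Vec(values: Vector[Double]) {
    def +(that: Vec): Vec =
      Vec(values.zip(that.values).map { case (a, b) => a + b })
    def :+(that: Vec): Vec = this + that
  }

  // Lets a Double appear on the left of :*, as in 2.0 :* v.
  implicit class ScalarOps(val k: Double) {
    def :*(v: Vec): Vec = Vec(v.values.map(_ * k))
  }

  val v = Vec(Vector(1.0, 2.0, 3.0))
  val v2 = Vec(Vector(4.0, 5.0, 6.0))

  // '+' binds tighter than an operator starting with ':', so this is 2.0 :* (v + v2).
  val mixed = 2.0 :* v + v2

  // Both operators start with ':', so they are applied left to right: (2.0 :* v) :+ v2.
  val uniform = 2.0 :* v :+ v2
}
```

Here mixed holds (10.0, 14.0, 18.0) and uniform holds (6.0, 9.0, 12.0), matching the two Breeze results above.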

Dense and sparse vectors and the vector trait
All the vectors we have looked at thus far have been dense vectors. Breeze also
supports sparse vectors. When dealing with arrays of numbers that are mostly zero,
it may be more computationally efficient to use sparse vectors. The point at which
a vector has enough zeros to warrant switching to a sparse representation depends
strongly on the type of operations, so you should run your own benchmarks to
determine which type to use. Nevertheless, a good heuristic is that, if your vector is
about 90% zero, you may benefit from using a sparse representation.
Sparse vectors are available in Breeze as the SparseVector and HashVector classes.
Both these types support many of the same operations as DenseVector but use a
different internal implementation. The SparseVector instances are very memory-efficient, but adding non-zero elements is slow. HashVector is more versatile, at
the cost of an increase in memory footprint and computational time for iterating
over non-zero elements. Unless you need to squeeze the last bits of memory out of
your application, I recommend using HashVector. We will not discuss these further
in this book, but the reader should find them straightforward to use if needed.
DenseVector, SparseVector, and HashVector all implement the Vector trait,
giving them a common interface.

Breeze remains very experimental and, as of this writing, somewhat
unstable. I have found dealing with specific implementations of the
Vector trait, such as DenseVector or SparseVector, to be more
reliable than dealing with the Vector trait directly. In this chapter,
we will explicitly type every vector as DenseVector.

Matrices
Breeze allows the construction and manipulation of two-dimensional arrays in a
similar manner:
scala> val m = DenseMatrix((1.0, 2.0, 3.0), (4.0, 5.0, 6.0))
breeze.linalg.DenseMatrix[Double] =
1.0  2.0  3.0
4.0  5.0  6.0
scala> 2.0 :* m
breeze.linalg.DenseMatrix[Double] =
2.0  4.0   6.0
8.0  10.0  12.0

Building vectors and matrices
We have seen how to explicitly build vectors and matrices by passing their
values to the constructor (or rather, to the companion object's apply method):
DenseVector(1.0, 2.0, 3.0). Breeze offers several other powerful ways of
building vectors and matrices:
scala> DenseVector.ones[Double](5)
breeze.linalg.DenseVector[Double] = DenseVector(1.0, 1.0, 1.0, 1.0, 1.0)
scala> DenseVector.zeros[Int](3)
breeze.linalg.DenseVector[Int] = DenseVector(0, 0, 0)

The linspace method (available in the breeze.linalg package object) creates a
Double vector of equally spaced values. For instance, to create a vector of 10 values
distributed uniformly between 0 and 1, perform the following:
scala> linspace(0.0, 1.0, 10)
breeze.linalg.DenseVector[Double] = DenseVector(0.0, 0.1111111111111111,
..., 1.0)

The tabulate method lets us construct vectors and matrices from functions:
scala> DenseVector.tabulate(4) { i => 5.0 * i }
breeze.linalg.DenseVector[Double] = DenseVector(0.0, 5.0, 10.0, 15.0)
scala> DenseMatrix.tabulate[Int](2, 3) {
  (irow, icol) => irow*2 + icol
}
breeze.linalg.DenseMatrix[Int] =
0  1  2
2  3  4

The first argument to DenseVector.tabulate is the size of the vector, and the
second is a function returning the value of the vector at a particular position.
This is useful for creating ranges of data, among other things.
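Scala's standard collections offer tabulate methods with the same shape, which may help when reasoning about the index-to-value function. The sketch below uses only the standard library, with arbitrary value names, and mirrors the two Breeze calls above.

```scala
// One-dimensional: element i is computed from its index.
val multiplesOfFive = Vector.tabulate(4) { i => 5.0 * i }

// Two-dimensional: element (irow, icol) is computed from both indices.
val grid = Array.tabulate(2, 3) { (irow, icol) => irow * 2 + icol }
```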
The rand function lets us create random vectors and matrices:
scala> DenseVector.rand(2)
breeze.linalg.DenseVector[Double] = DenseVector(0.8072865137359484,
0.5566507203838562)
scala> DenseMatrix.rand(2, 3)
breeze.linalg.DenseMatrix[Double] =
0.5755491874682879   0.8142161471517582  0.9043780212739738
0.31530195124023974  0.2095094278911871  0.22069103504148346

Finally, we can construct vectors from Scala arrays:
scala> DenseVector(Array(2, 3, 4))
breeze.linalg.DenseVector[Int] = DenseVector(2, 3, 4)

To construct vectors from other Scala collections, you must use the splat operator
(: _*):
scala> val l = Seq(2, 3, 4)
l: Seq[Int] = List(2, 3, 4)
scala> DenseVector(l: _*)
breeze.linalg.DenseVector[Int] = DenseVector(2, 3, 4)


Advanced indexing and slicing
We have already seen how to select a particular element in a vector v by its index
with, for instance, v(2). Breeze also offers several powerful methods for selecting
parts of a vector.
Let's start by creating a vector to play around with:
scala> val v = DenseVector.tabulate(5) { _.toDouble }
breeze.linalg.DenseVector[Double] = DenseVector(0.0, 1.0, 2.0, 3.0, 4.0)

Unlike native Scala collections, Breeze vectors support negative indexing:
scala> v(-1) // last element
Double = 4.0

Breeze lets us slice the vector using a range:
scala> v(1 to 3)
breeze.linalg.DenseVector[Double] = DenseVector(1.0, 2.0, 3.0)
scala> v(1 until 3) // equivalent to Python v[1:3]
breeze.linalg.DenseVector[Double] = DenseVector(1.0, 2.0)
scala> v(v.length-1 to 0 by -1) // reverse view of v
breeze.linalg.DenseVector[Double] = DenseVector(4.0, 3.0, 2.0, 1.0, 0.0)

Indexing by a range returns a view of the original vector: when running
val v2 = v(1 to 3), no data is copied. This means that slicing is
extremely efficient. Taking a slice of a huge vector does not increase the
memory footprint at all. It also means that one should be careful updating
a slice, since it will also update the original vector. We will discuss
mutating vectors and matrices in a subsequent section in this chapter.

Breeze also lets us select an arbitrary set of elements from a vector:
scala> val vSlice = v(2, 4) // Select elements at index 2 and 4
breeze.linalg.SliceVector[Int,Double] = breeze.linalg.SliceVector@9c04d22


This creates a SliceVector, which behaves like a DenseVector (both implement
the Vector interface), but does not actually have memory allocated for values: it
just knows how to map from its indices to values in its parent vector. One should
think of vSlice as a specific view of v. We can materialize the view (give it its own
data rather than acting as a lens through which v is viewed) by converting it to
DenseVector:
scala> vSlice.toDenseVector
breeze.linalg.DenseVector[Double] = DenseVector(2.0, 4.0)

Note that if an element of a slice is out of bounds, an exception will only be thrown
when that element is accessed:
scala> val vSlice = v(2, 7) // there is no v(7)
breeze.linalg.SliceVector[Int,Double] = breeze.linalg.
SliceVector@2a83f9d1
scala> vSlice(0) // valid since v(2) is still valid
Double = 2.0
scala> vSlice(1) // invalid since v(7) is out of bounds
java.lang.IndexOutOfBoundsException: 7 not in [-5,5)
...

Finally, one can index vectors using Boolean arrays. Let's start by defining an array:
scala> val mask = DenseVector(true, false, false, true, true)
breeze.linalg.DenseVector[Boolean] = DenseVector(true, false, false,
true, true)

Then, v(mask) results in a view containing the elements of v for which mask is true:
scala> v(mask).toDenseVector
breeze.linalg.DenseVector[Double] = DenseVector(0.0, 3.0, 4.0)

This can be used as a way of filtering certain elements in a vector. For instance, to
select the elements of v which are less than 3.0:
scala> val filtered = v(v :< 3.0) // :< is element-wise "less than"
breeze.linalg.SliceVector[Int,Double] = breeze.linalg.
SliceVector@2b1edef3
scala> filtered.toDenseVector
breeze.linalg.DenseVector[Double] = DenseVector(0.0, 1.0, 2.0)
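To make the semantics of mask indexing concrete, here is a plain-Scala sketch of the same two selections, with no Breeze involved. Unlike the Breeze slices, these produce copies rather than views, and the value names are illustrative.

```scala
val values = Vector(0.0, 1.0, 2.0, 3.0, 4.0)
val mask = Vector(true, false, false, true, true)

// Keep the elements whose mask entry is true, like v(mask).toDenseVector.
val selected = values.zip(mask).collect { case (x, true) => x }

// Build the mask from a predicate first, like v(v :< 3.0).toDenseVector.
val belowThree = values.zip(values.map(_ < 3.0)).collect { case (x, true) => x }
```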

Matrices can be indexed in much the same way as vectors. Matrix indexing functions
take two arguments—the first argument selects the row(s) and the second one slices
the column(s):
scala> val m = DenseMatrix((1.0, 2.0, 3.0), (5.0, 6.0, 7.0))
m: breeze.linalg.DenseMatrix[Double] =
1.0  2.0  3.0
5.0  6.0  7.0
scala> m(1, 2)
Double = 7.0
scala> m(1, -1)
Double = 7.0
scala> m(0 until 2, 0 until 2)
breeze.linalg.DenseMatrix[Double] =
1.0  2.0
5.0  6.0

You can also mix different slicing types for rows and columns:
scala> m(0 until 2, 0)
breeze.linalg.DenseVector[Double] = DenseVector(1.0, 5.0)

Note how, in this case, Breeze returns a vector. In general, slicing returns the
following objects:
• A scalar when single indices are passed as the row and column arguments
• A vector when the row argument is a range and the column argument is a
  single index
• A vector transpose when the column argument is a range and the row
  argument is a single index
• A matrix otherwise

The symbol :: can be used to indicate every element along a particular direction. For
instance, we can select the second column of m:
scala> m(::, 1)
breeze.linalg.DenseVector[Double] = DenseVector(2.0, 6.0)


Mutating vectors and matrices
Breeze vectors and matrices are mutable. Most of the slicing operations described
above can also be used to set elements of a vector or matrix:
scala> val v = DenseVector(1.0, 2.0, 3.0)
v: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 2.0, 3.0)
scala> v(1) = 22.0 // v is now DenseVector(1.0, 22.0, 3.0)

We are not limited to mutating single elements. In fact, all the indexing operations
outlined above can be used to set the elements of vectors or matrices. When mutating
slices of vectors or matrices, use the element-wise assignment operator, :=:
scala> v(0 until 2) := DenseVector(50.0, 51.0) // set elements at position 0 and 1
breeze.linalg.DenseVector[Double] = DenseVector(50.0, 51.0)
scala> v
breeze.linalg.DenseVector[Double] = DenseVector(50.0, 51.0, 3.0)

The assignment operator, :=, works like other element-wise operators in Breeze. If
the right-hand side is a scalar, it will automatically be broadcast to a vector of the
given shape:
scala> v(0 until 2) := 0.0 // equivalent to v(0 until 2) := DenseVector(0.0, 0.0)
breeze.linalg.DenseVector[Double] = DenseVector(0.0, 0.0)
scala> v
breeze.linalg.DenseVector[Double] = DenseVector(0.0, 0.0, 3.0)

All element-wise operators have an update counterpart. For instance, the :+=
operator acts like the element-wise addition operator :+, but also updates its
left-hand operand:
scala> val v = DenseVector(1.0, 2.0, 3.0)
v: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 2.0, 3.0)
scala> v :+= 4.0
breeze.linalg.DenseVector[Double] = DenseVector(5.0, 6.0, 7.0)
scala> v
breeze.linalg.DenseVector[Double] = DenseVector(5.0, 6.0, 7.0)

Notice how the update operator updates the vector in place and returns it.
We have learnt how to slice vectors and matrices in Breeze to create new views of
the original data. These views are not independent of the vector they were created
from—updating the view will update the underlying vector and vice-versa. This is
best illustrated with an example:
scala> val v = DenseVector.tabulate(6) { _.toDouble }
breeze.linalg.DenseVector[Double] = DenseVector(0.0, 1.0, 2.0, 3.0, 4.0,
5.0)
scala> val viewEvens = v(0 until v.length by 2)
breeze.linalg.DenseVector[Double] = DenseVector(0.0, 2.0, 4.0)
scala> viewEvens := 10.0 // mutate viewEvens
breeze.linalg.DenseVector[Double] = DenseVector(10.0, 10.0, 10.0)
scala> viewEvens
breeze.linalg.DenseVector[Double] = DenseVector(10.0, 10.0, 10.0)
scala> v // v has also been mutated!
breeze.linalg.DenseVector[Double] = DenseVector(10.0, 1.0, 10.0, 3.0, 10.0, 5.0)

This quickly becomes intuitive if we remember that, when we create a vector or
matrix, we are creating a view of an underlying data array rather than creating the
data itself:
[Figure: an underlying array with elements 0 to 6; the vector v covers the whole array, and the slice v(0 to 6 by 2) covers every second element.]

A vector slice v(0 to 6 by 2) of the v vector is just a different view of the array underlying v.
The view itself contains no data. It just contains pointers to the data in the original array. Internally,
the view is just stored as a pointer to the underlying data and a recipe for iterating over that data: in the
case of this slice, the recipe is just "start at the first element of the underlying data and go to the seventh element
of the underlying data in steps of two".
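The "pointer plus recipe" idea can be sketched in a few lines of plain Scala. The StridedView class below is a hypothetical illustration, not Breeze's actual implementation: it owns no data and only records where to start in the shared array, how far to step, and how many elements to expose.

```scala
// Hypothetical sketch of a view: a reference to the shared array plus a
// recipe (offset, stride, length) for walking over it.
final class StridedView(data: Array[Double], offset: Int, stride: Int, val length: Int) {
  def apply(i: Int): Double = data(offset + i * stride)
  def update(i: Int, value: Double): Unit = data(offset + i * stride) = value
  // Materializes the view into an independent copy.
  def toVector: Vector[Double] = Vector.tabulate(length)(i => apply(i))
}

val underlying = Array(0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
val evens = new StridedView(underlying, offset = 0, stride = 2, length = 4)

evens(1) = 10.0 // writes through to underlying(2)
```

After the assignment, underlying(2) is 10.0, because the view and the array share the same storage. The vector returned by toVector is a separate copy and is unaffected by later writes.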


Breeze offers a copy function for when we want to create independent copies of data.
In the previous example, we can construct a copy of viewEvens as:
scala> val copyEvens = v(0 until v.length by 2).copy
breeze.linalg.DenseVector[Double] = DenseVector(10.0, 10.0, 10.0)

We can now update copyEvens independently of v.

Matrix multiplication, transposition, and the
orientation of vectors
So far, we have mostly looked at element-wise operations on vectors and matrices.
Let's now look at matrix multiplication and related operations.
The matrix multiplication operator is *:
scala> val m1 = DenseMatrix((2.0, 3.0), (5.0, 6.0), (8.0, 9.0))
breeze.linalg.DenseMatrix[Double] =
2.0  3.0
5.0  6.0
8.0  9.0
scala> val m2 = DenseMatrix((10.0, 11.0), (12.0, 13.0))
breeze.linalg.DenseMatrix[Double] =
10.0  11.0
12.0  13.0
scala> m1 * m2
breeze.linalg.DenseMatrix[Double] =
56.0   61.0
122.0  133.0
188.0  205.0


Besides matrix-matrix multiplication, we can use the matrix multiplication operator
between matrices and vectors. All vectors in Breeze are column vectors. This means
that, when multiplying matrices and vectors together, a vector should be viewed as
an (n × 1) matrix. Let's walk through an example of matrix-vector multiplication. We
want the following operation:

\[
\begin{pmatrix} 2 & 3 \\ 5 & 6 \\ 8 & 9 \end{pmatrix}
\begin{pmatrix} 1 \\ 2 \end{pmatrix}
\]
scala> val v = DenseVector(1.0, 2.0)
breeze.linalg.DenseVector[Double] = DenseVector(1.0, 2.0)
scala> m1 * v
breeze.linalg.DenseVector[Double] = DenseVector(8.0, 17.0, 26.0)

By contrast, if we wanted:

\[
\begin{pmatrix} 1 & 2 \end{pmatrix}
\begin{pmatrix} 10 & 11 \\ 12 & 13 \end{pmatrix}
\]

we must convert v to a row vector. We can do this using the transpose operation:
scala> val vt = v.t
breeze.linalg.Transpose[breeze.linalg.DenseVector[Double]] =
Transpose(DenseVector(1.0, 2.0))
scala> vt * m2
breeze.linalg.Transpose[breeze.linalg.DenseVector[Double]] =
Transpose(DenseVector(34.0, 37.0))

Note that the type of v.t is Transpose[DenseVector[_]]. A
Transpose[DenseVector[_]] behaves in much the same way as a DenseVector as far
as element-wise operations are concerned, but it does not support mutation or slicing.
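The arithmetic behind these two products can be checked by hand with plain Scala collections. The following sketch does not use Breeze, and the helper dot is defined here only for illustration. It treats a matrix as a vector of rows: a matrix times a column vector takes one dot product per row, and a row vector times a matrix takes one dot product per column.

```scala
// Rows of m1 and m2 from the examples above, as plain Scala collections.
val m1 = Vector(Vector(2.0, 3.0), Vector(5.0, 6.0), Vector(8.0, 9.0))
val m2 = Vector(Vector(10.0, 11.0), Vector(12.0, 13.0))
val v = Vector(1.0, 2.0)

def dot(a: Vector[Double], b: Vector[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// Matrix times column vector: one dot product per row of m1.
val m1TimesV = m1.map(row => dot(row, v))

// Row vector times matrix: one dot product per column of m2.
val vtTimesM2 = m2.transpose.map(col => dot(v, col))
```

For instance, the first entry of m1TimesV is 2 × 1 + 3 × 2 = 8, and the first entry of vtTimesM2 is 1 × 10 + 2 × 12 = 34, in agreement with the Breeze output.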

Data preprocessing and feature engineering
We have now discovered the basic components of Breeze. In the next few sections,
we will apply them to real examples to understand how they fit together to form a
robust base for data science.

An important part of data science involves preprocessing datasets to construct useful
features. Let's walk through an example of this. To follow this example and access
the data, you will need to download the code examples for the book
(www.github.com/pbugnion/s4ds).
You will find, in directory chap02/data/ of the code attached to this book, a CSV file
with true heights and weights as well as self-reported heights and weights for 181
men and women. The original dataset was collected as part of a study on body image.
Refer to the following link for more information:
http://vincentarelbundock.github.io/Rdatasets/doc/car/Davis.html.
There is a helper function in the package provided with the book to load the data
into Breeze arrays:
scala> val data = HWData.load
HWData [ 181 rows ]
scala> data.genders
breeze.linalg.Vector[Char] = DenseVector(M, F, F, M, ... )

The data object contains five vectors, each 181 elements long:
• data.genders: A Char vector describing the gender of the participants
• data.heights: A Double vector of the true height of the participants
• data.weights: A Double vector of the true weight of the participants
• data.reportedHeights: A Double vector of the self-reported height of the participants
• data.reportedWeights: A Double vector of the self-reported weight of the participants

Let's start by counting the number of men and women in the study. We will
define an array that contains just 'M' and do an element-wise comparison with
data.genders:
scala> val maleVector = DenseVector.fill(data.genders.length)('M')
breeze.linalg.DenseVector[Char] = DenseVector(M, M, M, M, M, M,... )
scala> val isMale = (data.genders :== maleVector)
breeze.linalg.DenseVector[Boolean] = DenseVector(true, false, false, true
...)


The isMale vector is the same length as data.genders. It is true where the
participant is male, and false otherwise. We can use this Boolean array as a mask
for the other arrays in the dataset (remember that vector(mask) selects the elements
of vector where mask is true). Let's get the height of the men in our dataset:
scala> val maleHeights = data.heights(isMale)
breeze.linalg.SliceVector[Int,Double] = breeze.linalg.
SliceVector@61717d42
scala> maleHeights.toDenseVector
breeze.linalg.DenseVector[Double] = DenseVector(182.0, 177.0, 170.0, ...

To count the number of men in our dataset, we can use the indicator function. This
transforms a Boolean array into an array of doubles, mapping false to 0.0 and
true to 1.0:
scala> import breeze.numerics._
import breeze.numerics._
scala> sum(I(isMale))
Double = 82.0

Let's calculate the mean height of men and women in the experiment. We can
calculate the mean of a vector using mean(v), which we can access by importing
breeze.stats._:
scala> import breeze.stats._
import breeze.stats._
scala> mean(data.heights)
Double = 170.75690607734808

To calculate the mean height of the men, we can use our isMale array to slice
data.heights; data.heights(isMale) is a view of the data.heights array with all the
height values for the men:
scala> mean(data.heights(isMale)) // mean male height
Double = 178.0121951219512
scala> mean(data.heights(!isMale)) // mean female height
Double = 164.74747474747474


As a somewhat more involved example, let's look at the discrepancy between real
and reported weight for both men and women in this experiment. We can get an
array of the difference between the true weight and the reported weight, expressed
as a fraction of the true weight:
scala> val discrepancy =
(data.weights - data.reportedWeights) / data.weights
breeze.linalg.Vector[Double] = DenseVector(0.0, 0.1206896551724138,
-0.018867924528301886, -0.029411764705882353, ... )

Notice how Breeze's overloading of mathematical operators allows us to manipulate
data arrays easily and elegantly.
We can now calculate the mean and standard deviation of this array for men:
scala> mean(discrepancy(isMale))
res6: Double = -0.008451852933123775
scala> stddev(discrepancy(isMale))
res8: Double = 0.031901519634244195

We can also calculate the number of men who overestimated their height:
scala> val overReportMask =
  (data.reportedHeights :> data.heights).toDenseVector
breeze.linalg.DenseVector[Boolean] = DenseVector(false, false, false,
false...
scala> sum(I(overReportMask :& isMale))
Double = 10.0

There are thus ten men who believe they are taller than they actually are. The
element-wise AND operator :& returns a vector that is true for all indices for which
both its arguments are true. The vector overReportMask :& isMale is thus true for
all participants that are male and over-reported their height.

Breeze – function optimization
Having studied feature engineering, let's now look at the other end of the data
science pipeline. Typically, a machine learning algorithm defines a loss function that
is a function of a set of parameters. The value of the loss function measures how
poorly the model fits the data, so lower values are better. The parameters are then
optimized to minimize the loss function (or, equivalently, to maximize an objective
such as the likelihood).


Manipulating Data with Breeze

In Chapter 12, Distributed Machine Learning with MLlib, we will look at MLlib, a
machine learning library that contains many well-known algorithms. Often, we
don't need to worry about optimizing loss functions directly since we can rely on the
machine learning algorithms provided by MLlib. It is nevertheless useful to have a
basic knowledge of optimization.
Breeze has an optimize module that contains functions for finding a local minimum:
scala> import breeze.optimize._
import breeze.optimize._

Let's create a toy function that we want to optimize:

$$f(x) = \sum_i x_i^2$$

We can represent this function in Scala as follows:
scala> def f(xs:DenseVector[Double]) = sum(xs :^ 2.0)
f: (xs: breeze.linalg.DenseVector[Double])Double

Most local optimizers also require the gradient of the function being optimized. The
gradient is a vector of the same dimension as the arguments to the function. In our
case, the gradient is:

$$\nabla f = 2\vec{x}$$

We can represent the gradient in Breeze with a function that takes a vector argument
and returns a vector of the same length:
scala> def gradf(xs:DenseVector[Double]) = 2.0 :* xs
gradf: (xs:breeze.linalg.DenseVector[Double])breeze.linalg.
DenseVector[Double]

For instance, at the point (1, 1, 1), we have:
scala> val xs = DenseVector.ones[Double](3)
breeze.linalg.DenseVector[Double] = DenseVector(1.0, 1.0, 1.0)
scala> f(xs)
Double = 3.0
scala> gradf(xs)
breeze.linalg.DenseVector[Double] = DenseVector(2.0, 2.0, 2.0)

Let's set up the optimization problem. Breeze's optimization methods require that
we pass in an implementation of the DiffFunction trait with a single method,
calculate. This method must return a tuple of the function's value and its gradient
at the given point:
scala> val optTrait = new DiffFunction[DenseVector[Double]] {
  def calculate(xs:DenseVector[Double]) = (f(xs), gradf(xs))
}
breeze.optimize.DiffFunction[breeze.linalg.DenseVector[Double]] = <function1>

We are now ready to run the optimization. The optimize module provides a
minimize function that does just what we want. We pass it optTrait and a starting
point for the optimization:
scala> val minimum = minimize(optTrait, DenseVector(1.0, 1.0, 1.0))
breeze.linalg.DenseVector[Double] = DenseVector(0.0, 0.0, 0.0)

The true minimum is at (0.0, 0.0, 0.0). The optimizer therefore correctly finds
the minimum.
The minimize function uses the L-BFGS method to run the optimization by default.
It takes several additional arguments to control the optimization. We will explore
these in the next sections.

Numerical derivatives
In the previous example, we specified the gradient of f explicitly. While this is
generally good practice, calculating the gradient of a function can often be tedious.
Breeze provides a gradient approximation function using finite differences. Reusing
the same objective function def f(xs:DenseVector[Double]) = sum(xs :^ 2.0)
as in the previous section:
scala> val approxOptTrait = new ApproximateGradientFunction(f)
breeze.optimize.ApproximateGradientFunction[Int,breeze.linalg.
DenseVector[Double]] = <function1>

The approxOptTrait instance has a gradientAt method that returns an approximation
to the gradient at a point:
scala> approxOptTrait.gradientAt(DenseVector.ones(3))
breeze.linalg.DenseVector[Double] = DenseVector(2.00001000001393,
2.00001000001393, 2.00001000001393)


Note that this can be quite inaccurate. The ApproximateGradientFunction
constructor takes an epsilon optional argument that controls the size of the step
taken when calculating the finite differences. Changing the value of epsilon can
improve the accuracy of the finite difference algorithm.
The ApproximateGradientFunction instance implements the DiffFunction trait. It
can therefore be passed to minimize directly:
scala> minimize(approxOptTrait, DenseVector.ones[Double](3))
breeze.linalg.DenseVector[Double] = DenseVector(-5.000001063126813E-6,
-5.000001063126813E-6, -5.000001063126813E-6)

This, again, gives a result close to zero, but somewhat further away than when we
specified the gradient explicitly. In general, it will be significantly more efficient
and more accurate to calculate the gradient of a function analytically than to rely on
Breeze's numerical gradient. It is probably best to only use the numerical gradient
during data exploration or to check analytical gradients.
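To make the role of epsilon concrete, here is a minimal sketch of finite-difference gradients in plain Scala, without Breeze. The object and method names are hypothetical. The gradient value printed earlier (about 2.00001) is consistent with a forward difference using a step of roughly 1e-5, but that is an inference from the output, not a statement about how Breeze implements ApproximateGradientFunction.

```scala
object FiniteDifferenceSketch {
  // Toy objective from the text: f(x) = sum of x_i squared.
  def f(xs: Array[Double]): Double = xs.map(x => x * x).sum

  // Forward difference: (g(x + eps * e_i) - g(x)) / eps for each coordinate i.
  // The truncation error is proportional to eps.
  def forwardGradient(
      g: Array[Double] => Double,
      xs: Array[Double],
      eps: Double = 1e-5): Array[Double] = {
    val gx = g(xs)
    xs.indices.map { i =>
      val shifted = xs.clone()
      shifted(i) += eps
      (g(shifted) - gx) / eps
    }.toArray
  }

  // Central difference: (g(x + eps * e_i) - g(x - eps * e_i)) / (2 * eps).
  // The truncation error is proportional to eps squared.
  def centralGradient(
      g: Array[Double] => Double,
      xs: Array[Double],
      eps: Double = 1e-5): Array[Double] =
    xs.indices.map { i =>
      val up = xs.clone()
      up(i) += eps
      val down = xs.clone()
      down(i) -= eps
      (g(up) - g(down)) / (2 * eps)
    }.toArray
}
```

For the toy function at the point (1, 1, 1), the forward difference overshoots the exact gradient of 2.0 by about eps, while the central difference is much closer. Shrinking eps reduces the truncation error, but a very small eps amplifies floating-point rounding error, so the step size is a trade-off.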

Regularization
The minimize function takes many optional arguments relevant to machine learning
algorithms. In particular, we can instruct the optimizer to use a regularization
parameter when performing the optimization. Regularization introduces a penalty
in the loss function to prevent the parameters from growing arbitrarily. This is useful
to avoid overfitting. We will discuss regularization in greater detail in Chapter 12,
Distributed Machine Learning with MLlib.
For instance, to use L2Regularization with a hyperparameter of 0.5:
scala> minimize(optTrait,
DenseVector(1.0, 1.0, 1.0), L2Regularization(0.5))
breeze.linalg.DenseVector[Double] = DenseVector(0.0, 0.0, 0.0)

The regularization makes no difference in this case, since the parameters are zero at
the minimum.
To see a list of optional arguments that can be passed to minimize, consult the
Breeze documentation online.


An example – logistic regression
Let's now imagine we want to build a classifier that takes a person's height and
weight and assigns a probability to their being Male or Female. We will reuse the
height and weight data introduced earlier in this chapter. Let's start by plotting
the dataset:

Height versus weight data for 181 men and women

There are many different algorithms for classification. A first glance at the data
shows that we can, approximately, separate men from women by drawing a straight
line across the plot. A linear method is therefore a reasonable initial attempt at
classification. In this section, we will use logistic regression to build a classifier.
A detailed explanation of logistic regression is beyond the scope of this book. The
reader unfamiliar with logistic regression is referred to The Elements of Statistical
Learning by Hastie, Tibshirani, and Friedman. We will just give a brief summary here.


Logistic regression estimates the probability of a given height and weight belonging to
a male with the following sigmoid function:

$$P(\text{male} \mid \text{height}, \text{weight}) = \frac{1}{1 + \exp\left(-f(\text{height}, \text{weight}; \text{params})\right)}$$

Here, f is a linear function:

$$f(\text{height}, \text{weight}; \text{params}) = \text{params}(0) + \text{height} \cdot \text{params}(1) + \text{weight} \cdot \text{params}(2)$$

Here, params is an array of parameters that we need to determine using the training
set. If we consider the height and weight as a features = (height, weight) matrix, we can
re-write the sigmoid kernel f as a matrix multiplication of the features matrix with the
params vector:

$$f(\text{features}; \text{params}) = \text{params}(0) + \text{features} \cdot \text{params}(1{:})$$

To simplify this expression further, it is common to add a dummy feature whose
value is always 1 to the features matrix. We can then multiply params(0) by this
feature, allowing us to write the entire sigmoid kernel f as a single matrix-vector
multiplication:

$$f(\text{features}; \text{params}) = \text{params} \cdot \text{features}$$

The feature matrix, features, is now a (181 * 3) matrix, where each row is (1, height,
weight) for a particular participant.
To find the optimal values of the parameters, we can maximize the likelihood
function, L(params|features). The likelihood takes a given set of parameter values
as input and returns the probability of observing the targets in the training set
under the model defined by those parameters. For a set of parameters and associated
probability function P(male|features_i), the likelihood is:

$$L(\text{params} \mid \text{features}) = \prod_{i:\ \text{target}_i \text{ is male}} P(\text{male} \mid \text{features}_i) \times \prod_{i:\ \text{target}_i \text{ not male}} \left[ 1 - P(\text{male} \mid \text{features}_i) \right]$$

If we magically know, ahead of time, the gender of everyone in the population,
we can assign P(male)=1 for the men and P(male)=0 for the women. The likelihood
function would then be 1. Conversely, any uncertainty leads to a reduction in
the likelihood function. If we choose a set of parameters that consistently lead to
classification errors (low P(male) for men or high P(male) for women), the likelihood
function drops to 0.
The maximum likelihood corresponds to those values of the parameters most likely
to describe the observed data. Thus, to find the parameters that best describe our
training set, we just need to find parameters that maximize L(params|features).
However, maximizing the likelihood function itself is very rarely done, since it
involves multiplying many small values together, which quickly leads to floating
point underflow. It is best to maximize the log of the likelihood, which has the
same maximum as the likelihood. Finally, since most optimization algorithms
are geared to minimize a function rather than maximize it, we will minimize
−log(L(params | features)).
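The underflow problem is easy to reproduce. The following plain-Scala sketch uses hypothetical per-observation probabilities (1000 observations, each with probability 0.01) to compare the direct product with the sum of logarithms.

```scala
object LikelihoodUnderflowSketch {
  // Hypothetical per-observation probabilities, for illustration only.
  val probs: Seq[Double] = Seq.fill(1000)(0.01)

  // Direct product: mathematically 1e-2000, far below the smallest
  // positive Double, so the result underflows to zero.
  val likelihood: Double = probs.product

  // Sum of logs: carries the same information and is easily representable.
  val logLikelihood: Double = probs.map(p => math.log(p)).sum
}
```

Once the product has underflowed to 0.0, every parameter setting looks equally bad to an optimizer, whereas the log-likelihood still distinguishes between them.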

For logistic regression, this is equivalent to minimizing:

$$\text{Cost}(\text{params}) = -\sum_i \left[ \text{target}_i \times (\text{params} \cdot \text{features}_i) - \log\left( \exp(\text{params} \cdot \text{features}_i) + 1 \right) \right]$$

Here, the sum runs over all participants in the training data, features_i is the vector
(1, height_i, weight_i) for the i-th observation in the training set, and target_i is 1 if the
participant is male and 0 if the participant is female.
To minimize the Cost function, we must also know its gradient with respect to the
parameters. This is:

$$\nabla_{\text{params}} \text{Cost} = \sum_i \text{features}_i \cdot \left[ P(\text{male} \mid \text{features}_i) - \text{target}_i \right]$$
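For readers who want to see where this gradient comes from, here is a short derivation sketch. It takes the cost to be the negative log-likelihood (the quantity computed by costFunction in the listing that follows) and introduces the shorthand symbols z_i, t_i and sigma, which are not used elsewhere in the text.

```latex
% Shorthand: z_i = params . features_i, t_i = target_i,
% sigma(z) = 1 / (1 + e^{-z}) = P(male | features_i).
\begin{align*}
\text{Cost} &= -\sum_i \left[ t_i z_i - \log\left(1 + e^{z_i}\right) \right] \\
\frac{\partial\, \text{Cost}}{\partial z_i}
  &= -\left[ t_i - \frac{e^{z_i}}{1 + e^{z_i}} \right]
   = \sigma(z_i) - t_i \\
\nabla_{\text{params}} \text{Cost}
  &= \sum_i \frac{\partial\, \text{Cost}}{\partial z_i}\, \nabla_{\text{params}} z_i
   = \sum_i \text{features}_i \left[ \sigma(z_i) - t_i \right]
\end{align*}
```

The second line uses the identity e^z / (1 + e^z) = 1 / (1 + e^(-z)), and the third uses the fact that z_i is linear in params, so its gradient is simply features_i.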


We will start by rescaling the height and weight by their mean and standard deviation.
While this is not strictly necessary for logistic regression, it is generally good practice.
It facilitates the optimization and would become necessary if we wanted to use
regularization methods or build superlinear features (features that allow the boundary
separating men from women to be curved rather than a straight line).
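As a small illustration of the rescaling step, here is a plain-Scala sketch that works on ordinary collections rather than Breeze vectors. The helper names and the sample heights are illustrative, and the sketch assumes the sample standard deviation (dividing by n - 1).

```scala
object RescaleSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Sample standard deviation (divides by n - 1); an assumption of this sketch.
  def stddev(xs: Seq[Double]): Double = {
    val m = mean(xs)
    math.sqrt(xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1))
  }

  // Shift by the mean and divide by the standard deviation.
  def rescaled(xs: Seq[Double]): Seq[Double] = {
    val m = mean(xs)
    val s = stddev(xs)
    xs.map(x => (x - m) / s)
  }

  // Illustrative heights in cm.
  val heights: Seq[Double] = Seq(182.0, 161.0, 161.0, 177.0, 157.0)
  val rescaledHeights: Seq[Double] = rescaled(heights)
}
```

After rescaling, the values have a mean of zero and a standard deviation of one (up to rounding), so height and weight contribute on comparable scales regardless of their original units.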
For this example, we will move away from the Scala shell and write a standalone
Scala script. Here's the full code listing. Don't worry if this looks daunting. We will
break it up into manageable chunks in a minute:
import breeze.linalg._
import breeze.numerics._
import breeze.optimize._
import breeze.stats._

object LogisticRegressionHWData extends App {

  val data = HWData.load

  // Rescale the features to have mean of 0.0 and s.d. of 1.0
  def rescaled(v:DenseVector[Double]) =
    (v - mean(v)) / stddev(v)

  val rescaledHeights = rescaled(data.heights)
  val rescaledWeights = rescaled(data.weights)

  // Build the feature matrix as a matrix with
  // 181 rows and 3 columns.
  val rescaledHeightsAsMatrix = rescaledHeights.toDenseMatrix.t
  val rescaledWeightsAsMatrix = rescaledWeights.toDenseMatrix.t

  val featureMatrix = DenseMatrix.horzcat(
    DenseMatrix.ones[Double](rescaledHeightsAsMatrix.rows, 1),
    rescaledHeightsAsMatrix,
    rescaledWeightsAsMatrix
  )

  println(s"Feature matrix size: ${featureMatrix.rows} x " +
    s"${featureMatrix.cols}")

  // Build the target variable to be 1.0 where a participant
  // is male, and 0.0 where the participant is female.
  val target = data.genders.values.map {
    gender => if(gender == 'M') 1.0 else 0.0
  }

  // Build the loss function ready for optimization.
  // We will worry about refactoring this to be more
  // efficient later.
  def costFunction(parameters:DenseVector[Double]):Double = {
    val xBeta = featureMatrix * parameters
    val expXBeta = exp(xBeta)
    - sum((target :* xBeta) - log1p(expXBeta))
  }

  def costFunctionGradient(parameters:DenseVector[Double])
  :DenseVector[Double] = {
    val xBeta = featureMatrix * parameters
    val probs = sigmoid(xBeta)
    featureMatrix.t * (probs - target)
  }

  val f = new DiffFunction[DenseVector[Double]] {
    def calculate(parameters:DenseVector[Double]) =
      (costFunction(parameters), costFunctionGradient(parameters))
  }

  val optimalParameters = minimize(f, DenseVector(0.0, 0.0, 0.0))

  println(optimalParameters)
  // => DenseVector(-0.0751454743, 2.476293647, 2.23054540)
}

That was a mouthful! Let's take this one step at a time. After the obvious imports, we
start with:
object LogisticRegressionHWData extends App {

By extending the built-in App trait, we tell Scala to treat the entire object as a main
function. This just cuts out def main(args:Array[String]) boilerplate. We
then load the data and rescale the height and weight to have a mean of zero and a
standard deviation of one:
def rescaled(v:DenseVector[Double]) =
(v - mean(v)) / stddev(v)
val rescaledHeights = rescaled(data.heights)
val rescaledWeights = rescaled(data.weights)

The rescaledHeights and rescaledWeights vectors will be the features of our
model. We can now build the training set matrix for this model. This is a (181 * 3)
matrix, for which the i-th row is (1, height(i), weight(i)), corresponding to the
values of the height and weight for the i-th participant. We start by transforming both
rescaledHeights and rescaledWeights from vectors to (181 * 1) matrices:
val rescaledHeightsAsMatrix = rescaledHeights.toDenseMatrix.t
val rescaledWeightsAsMatrix = rescaledWeights.toDenseMatrix.t

We must also create a (181 * 1) matrix containing just 1 to act as the dummy feature.
We can do this using:
DenseMatrix.ones[Double](rescaledHeightsAsMatrix.rows, 1)

We now need to combine our three (181 * 1) matrices together into a single feature
matrix of shape (181 * 3). We can use the horzcat method to concatenate the three
matrices together:
val featureMatrix = DenseMatrix.horzcat(
DenseMatrix.ones[Double](rescaledHeightsAsMatrix.rows, 1),
rescaledHeightsAsMatrix,
rescaledWeightsAsMatrix
)

The final step in the data preprocessing stage is to create the target variable. We need
to convert the data.genders vector to a vector of ones and zeros. We assign a value
of one for men and zero for women. Thus, our classifier will predict the probability
that any given person is male. We will use the .values.map method, a method
equivalent to the .map method on Scala collections:
val target = data.genders.values.map {
gender => if(gender == 'M') 1.0 else 0.0
}

Note that we could also have used the indicator function which we discovered earlier:
val maleVector = DenseVector.fill(data.genders.size)('M')
val target = I(data.genders :== maleVector)

This results in the allocation of a temporary array, maleVector, and might
therefore increase the program's memory footprint if there were many
participants in the experiment.


We now have a matrix representing the training set and a vector denoting the target
variable. We can write the loss function that we want to minimize. As mentioned
previously, we will minimize −log(L(parameters | training)). The loss function
takes as input a set of values for the linear coefficients and returns a number
indicating how well those values of the linear coefficients fit the training data:

def costFunction(parameters:DenseVector[Double]):Double = {
val xBeta = featureMatrix * parameters
val expXBeta = exp(xBeta)
- sum((target :* xBeta) - log1p(expXBeta))
}

Note that we use log1p(x) to calculate log(1+x). This stays accurate for small values
of x, where adding 1 to x directly would round away most of its digits.
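The benefit of log1p can be seen with the standard library alone. The sketch below also includes a hypothetical softplus helper, which is not part of the listing: it computes log(1 + exp(z)) in a way that avoids overflow when z is large, a situation the listing's exp followed by log1p does not guard against.

```scala
object Log1pSketch {
  val tiny = 1e-20

  // 1.0 + 1e-20 rounds to exactly 1.0, so the naive form returns 0.0.
  val naive: Double = math.log(1.0 + tiny)

  // log1p works on x directly and keeps the leading term.
  val accurate: Double = math.log1p(tiny)

  // log(1 + exp(z)) rewritten as max(z, 0) + log1p(exp(-|z|)),
  // so that exp is only ever called on a non-positive argument.
  def softplus(z: Double): Double =
    math.max(z, 0.0) + math.log1p(math.exp(-math.abs(z)))
}
```

With rescaled features and moderate coefficients, the simpler form in the listing is adequate; the rewritten form matters only when params ⋅ features can become very large.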
Let's explore the cost function:
costFunction(DenseVector(0.0, 0.0, 0.0)) // 125.45963968135031
costFunction(DenseVector(0.0, 0.1, 0.1)) // 113.33336518036882
costFunction(DenseVector(0.0, -0.1, -0.1)) // 139.17134594294433

We can see that the cost function is somewhat lower for slightly positive values of
the height and weight parameters. This indicates that the likelihood function is larger
for slightly positive values of the height and weight. This, in turn, implies (as we
expect from the plot) that people who are taller and heavier than average are more
likely to be male.
We also need a function that calculates the gradient of the loss function, since that
will help with the optimization:
def costFunctionGradient(parameters:DenseVector[Double])
:DenseVector[Double] = {
val xBeta = featureMatrix * parameters
val probs = sigmoid(xBeta)
featureMatrix.t * (probs - target)
}
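A hand-written gradient is easy to get wrong, so it is worth checking it against finite differences. The following plain-Scala sketch does this on a tiny, made-up dataset. The data, the object name and the helper names are all hypothetical; only the cost and gradient formulas mirror costFunction and costFunctionGradient.

```scala
object GradientCheckSketch {
  // Hypothetical toy data: each row is (1, height, weight) in rescaled units.
  val features: Array[Array[Double]] = Array(
    Array(1.0, 1.2, 0.8),
    Array(1.0, -0.7, -1.1),
    Array(1.0, 0.3, -0.2),
    Array(1.0, -1.0, 0.4))
  val target: Array[Double] = Array(1.0, 0.0, 1.0, 0.0)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Negative log-likelihood, mirroring costFunction.
  def cost(params: Array[Double]): Double = {
    val logLikelihood = features.zip(target).map { case (row, t) =>
      val z = dot(row, params)
      t * z - math.log1p(math.exp(z))
    }.sum
    -logLikelihood
  }

  // Analytical gradient: sum over i of features_i * (P_i - target_i).
  def gradient(params: Array[Double]): Array[Double] = {
    val residuals = features.zip(target).map { case (row, t) =>
      sigmoid(dot(row, params)) - t
    }
    params.indices.map { j =>
      features.zip(residuals).map { case (row, r) => row(j) * r }.sum
    }.toArray
  }

  // Central-difference approximation, used as an independent check.
  def numericalGradient(params: Array[Double], eps: Double = 1e-6): Array[Double] =
    params.indices.map { j =>
      val up = params.clone()
      up(j) += eps
      val down = params.clone()
      down(j) -= eps
      (cost(up) - cost(down)) / (2 * eps)
    }.toArray

  // Largest absolute disagreement between the two gradients at a test point.
  val point: Array[Double] = Array(0.1, -0.2, 0.3)
  val maxDifference: Double =
    gradient(point).zip(numericalGradient(point))
      .map { case (a, n) => math.abs(a - n) }.max
}
```

If the analytical gradient had a sign error or a missing term, maxDifference would be of order one rather than vanishingly small.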

Having defined the loss function and gradient, we are now in a position to set up
the optimization:
val f = new DiffFunction[DenseVector[Double]] {
def calculate(parameters:DenseVector[Double]) =
(costFunction(parameters), costFunctionGradient(parameters))
}


All that is left now is to run the optimization. The cost function for logistic
regression is convex (any local minimum is also a global minimum), so the starting
point for optimization is irrelevant in principle. In practice, it is common to start
with a coefficient vector that is zero everywhere (equivalent to assigning a 0.5
probability of being male to every participant):
val optimalParameters = minimize(f, DenseVector(0.0, 0.0, 0.0))

This returns the vector of optimal parameters:
DenseVector(-0.0751454743, 2.476293647, 2.23054540)

How can we interpret the values of the optimal parameters? The coefficients for
the height and weight are both positive, indicating that people who are taller and
heavier are more likely to be male.
We can also get the decision boundary (the line separating (height, weight) pairs
more likely to belong to a woman from (height, weight) pairs more likely to belong
to a man) directly from the coefficients. The decision boundary is:

−0.075 + 2.48 rescaledHeight + 2.23 rescaledWeight = 0

Height and weight data (shifted by the mean and rescaled by the standard deviation).
The orange line is the logistic regression decision boundary. Logistic regression predicts that
individuals above the boundary are male.
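To make the fitted coefficients easier to use, the following plain-Scala sketch turns them into a probability function and solves the boundary equation for the rescaled weight. The object and function names are hypothetical; the three coefficient values are the ones printed by the listing.

```scala
object DecisionBoundarySketch {
  // Coefficients printed by the listing: (intercept, height, weight).
  val (b0, b1, b2) = (-0.0751454743, 2.476293647, 2.23054540)

  // Probability of being male for rescaled height h and rescaled weight w.
  def probMale(h: Double, w: Double): Double =
    1.0 / (1.0 + math.exp(-(b0 + b1 * h + b2 * w)))

  // Solve b0 + b1 * h + b2 * w = 0 for w: the boundary as a line
  // in the (rescaled height, rescaled weight) plane.
  def boundaryWeight(h: Double): Double = -(b0 + b1 * h) / b2
}
```

Points exactly on the boundary get a probability of 0.5. Moving towards larger heights or weights pushes the probability above 0.5, which matches the reading of the plot.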


Towards re-usable code
In the previous section, we performed all of the computation in a single script. While
this is fine for data exploration, it means that we cannot reuse the logistic regression
code that we have built. In this section, we will start the construction of a machine
learning library that you can reuse across different projects.
We will factor the logistic regression algorithm out into its own class. We construct a
LogisticRegression class:
import breeze.linalg._
import breeze.numerics._
import breeze.optimize._
class LogisticRegression(
val training:DenseMatrix[Double],
val target:DenseVector[Double])
{

The class takes, as input, a matrix representing the training set and a vector denoting
the target variable. Notice how we assign these to vals, meaning that they are set
when an instance is created and cannot be reassigned afterwards. Of course, the
DenseMatrix and DenseVector objects are mutable, so the contents of the objects that
training and target point to might change. Since mutable state makes reasoning about
program behavior difficult, we will avoid taking advantage of this mutability.
Let's add a method that calculates the cost function and its gradient:
  def costFunctionAndGradient(coefficients:DenseVector[Double])
  :(Double, DenseVector[Double]) = {
    val xBeta = training * coefficients
    val expXBeta = exp(xBeta)
    val cost = - sum((target :* xBeta) - log1p(expXBeta))
    val probs = sigmoid(xBeta)
    val grad = training.t * (probs - target)
    (cost, grad)
  }


We are now all set up to run the optimization to calculate the coefficients that best
reproduce the training set. In traditional object-oriented languages, we might define
a getOptimalCoefficients method that returns a DenseVector of the coefficients.
Scala, however, is more elegant. Since we have defined the training and target
attributes as vals, there is only one possible set of values of the optimal coefficients.
We could, therefore, define a val optimalCoefficients = ??? class attribute
that holds the optimal coefficients. The problem with this is that it forces all the
computation to happen when the instance is constructed. This will be unexpected
for the user and might be wasteful: if the user is only interested in accessing the cost
function, for instance, the time spent minimizing it will be wasted. The solution is to
use a lazy val. This value will only be evaluated when the client code requests it:
lazy val optimalCoefficients = ???
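The behavior of lazy val can be checked with a small standalone sketch. The Model class below is a hypothetical stand-in for an expensive computation, not the LogisticRegression class itself, and the counter exists only to make the evaluation visible.

```scala
object LazyValSketch {
  // Counts how many times the "expensive" block runs.
  var evaluations = 0

  class Model {
    // Stand-in for an expensive optimization; evaluated on first access only.
    lazy val optimalCoefficients: Vector[Double] = {
      evaluations += 1
      Vector(0.0, 0.0, 0.0)
    }
  }
}
```

Constructing a Model does not run the block. The first access to optimalCoefficients runs it once, and later accesses reuse the cached result.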

To help with the calculation of the coefficients, we will define a private helper method:
  private def calculateOptimalCoefficients
  :DenseVector[Double] = {
    val f = new DiffFunction[DenseVector[Double]] {
      def calculate(parameters:DenseVector[Double]) =
        costFunctionAndGradient(parameters)
    }
    minimize(f, DenseVector.zeros[Double](training.cols))
  }

  lazy val optimalCoefficients = calculateOptimalCoefficients

We have refactored the logistic regression into its own class, which we can reuse across
different projects.
If we were planning on reusing the height-weight data, we could, similarly, refactor
it into a class of its own that facilitates data loading, feature scaling, and any other
functionality that we find ourselves reusing often.


Alternatives to Breeze
Breeze is the most feature-rich and approachable Scala framework for linear algebra
and numeric computation. However, do not take my word for it: experiment with
other libraries for tabular data. In particular, I recommend trying Saddle, which
provides a Frame object similar to data frames in pandas or R. In the Java world, the
Apache Commons Math library provides a very rich toolkit for numerical computation.
In Chapter 10, Distributed Batch Processing with Spark, Chapter 11, Spark SQL and
DataFrames, and Chapter 12, Distributed Machine Learning with MLlib, we will explore
Spark and MLlib, which allow the user to run distributed machine learning algorithms.

Summary
This concludes our brief overview of Breeze. We have learned how to manipulate
basic Breeze data types, how to use them for linear algebra, and how to perform
convex optimization. We then used our knowledge to clean a real dataset and
performed logistic regression on it.
In the next chapter, we will discuss breeze-viz, a plotting library for Scala.

References
The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman, gives a lucid,
practical description of the mathematical underpinnings of machine learning.
Anyone aspiring to do more than mindlessly apply machine learning algorithms as
black boxes ought to have a well-thumbed copy of this book.
Scala for Machine Learning, by Patrick R. Nicolas, describes practical implementations
of many useful machine learning algorithms in Scala.
The Breeze documentation (https://github.com/scalanlp/breeze/wiki/Quickstart),
API docs (http://www.scalanlp.org/api/breeze/#package), and
source code (https://github.com/scalanlp/breeze) provide the most up-to-date
sources of documentation on Breeze.


Plotting with breeze-viz
Data visualization is an integral part of data science. Visualization needs fall into
two broad categories: during the development and validation of new models and,
at the end of the pipeline, to distill meaning from the data and the models to provide
insight to external stakeholders.
The two types of visualizations are quite different. At the data exploration and
model development stage, the most important feature of a visualization library
is its ease of use. It should take as few steps as possible to go from having data as
arrays of numbers (or CSVs or in a database) to having data displayed on a screen.
The lifetime of graphs is also quite short: once the data scientist has learned all they
can from the graph or visualization, it is normally discarded. By contrast, when
developing visualization widgets for external stakeholders, one is willing to tolerate
increased development time for greater flexibility. The visualizations can have a
significant lifetime, especially if the underlying data changes over time.
The tool of choice in Scala for the first type of visualization is breeze-viz. When
developing visualizations for external stakeholders, web-based visualizations
(such as D3) and Tableau tend to be favored.
In this chapter, we will explore breeze-viz. In Chapter 14, Visualization with D3
and the Play Framework, we will learn how to build Scala backends for JavaScript
visualizations.
Breeze-viz is (no points for guessing) Breeze's visualization library. It wraps
JFreeChart, a very popular Java charting library. Breeze-viz is still very experimental.
In particular, it is much less feature-rich than matplotlib in Python, or R or MATLAB.
Nevertheless, breeze-viz allows access to the underlying JFreeChart objects so one
can always fall back to editing these objects directly. The syntax for breeze-viz is
inspired by MATLAB and matplotlib.


Diving into Breeze
Let's get started. We will work in the Scala console, but a program similar to this
example is available in BreezeDemo.scala in the examples corresponding to this
chapter. Create a build.sbt file with the following lines:
scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.scalanlp" %% "breeze" % "0.11.2",
  "org.scalanlp" %% "breeze-viz" % "0.11.2",
  "org.scalanlp" %% "breeze-natives" % "0.11.2"
)

Start an sbt console:
$ sbt console
scala> import breeze.linalg._
import breeze.linalg._
scala> import breeze.plot._
import breeze.plot._
scala> import breeze.numerics._
import breeze.numerics._

Let's start by plotting a sigmoid curve, f(x) = 1 / (1 + e^(−x)). We will first generate the
data using Breeze. Recall that the linspace method creates a vector of doubles,
uniformly distributed between two values:
scala> val x = linspace(-4.0, 4.0, 200)
x: DenseVector[Double] = DenseVector(-4.0, -3.959798...
scala> val fx = sigmoid(x)
fx: DenseVector[Double] = DenseVector(0.0179862099620915,...

We now have the data ready for plotting. The first step is to create a figure:
scala> val fig = Figure()
fig: breeze.plot.Figure = breeze.plot.Figure@37e36de9


Chapter 3

This creates an empty Java Swing window (which may appear on your taskbar or
equivalent). A figure can contain one or more plots. Let's add a plot to our figure:
scala> val plt = fig.subplot(0)
plt: breeze.plot.Plot = breeze.plot.Plot@171c2840

For now, let's ignore the 0 passed as argument to .subplot. We can add data points
to our plot:
scala> plt += plot(x, fx)
breeze.plot.Plot = breeze.plot.Plot@63d6a0f8

The plot function takes two arguments, corresponding to the x and y values of the
data series to be plotted. To view the changes, you need to refresh the figure:
scala> fig.refresh()

Look at the Swing window now. You should see a beautiful sigmoid, similar to the
one below. Right-clicking on the window lets you interact with the plot and save the
image as a PNG:


You can also save the image programmatically as follows:
scala> fig.saveas("sigmoid.png")

Breeze-viz currently only supports exporting to PNG.

Customizing plots
We now have a curve on our chart. Let's add a few more:
scala> val f2x = sigmoid(2.0*x)
f2x: breeze.linalg.DenseVector[Double] = DenseVector(3.353501304664E-4...
scala> val f10x = sigmoid(10.0*x)
f10x: breeze.linalg.DenseVector[Double] = DenseVector(4.24835425529E-18...
scala> plt += plot(x, f2x, name="S(2x)")
breeze.plot.Plot = breeze.plot.Plot@63d6a0f8
scala> plt += plot(x, f10x, name="S(10x)")
breeze.plot.Plot = breeze.plot.Plot@63d6a0f8
scala> fig.refresh()

Looking at the figure now, you should see all three curves in different colors. Notice
that we named the data series as we added them to the plot, using the name=""
keyword argument. To view the names, we must set the legend attribute:
scala> plt.legend = true


Our plot still leaves a lot to be desired. Let's start by restricting the range of the x axis
to remove the bands of white space on either side of the plot:
scala> plt.xlim = (-4.0, 4.0)
plt.xlim: (Double, Double) = (-4.0,4.0)

Now, notice how, while the x ticks are sensibly spaced, there are only two y ticks: at
0 and 1. It would be useful to have ticks every 0.1 increment. Breeze does not provide
a way to set this directly. Instead, it exposes the underlying JFreeChart Axis object
belonging to the current plot:
scala> plt.yaxis
org.jfree.chart.axis.NumberAxis = org.jfree.chart.axis.NumberAxis@0

The Axis object supports a .setTickUnit method that lets us set the tick spacing:
scala> import org.jfree.chart.axis.NumberTickUnit
import org.jfree.chart.axis.NumberTickUnit
scala> plt.yaxis.setTickUnit(new NumberTickUnit(0.1))


JFreeChart allows extensive customization of the Axis object. For a full list of
methods available, consult the JFreeChart documentation
(http://www.jfree.org/jfreechart/api/javadoc/org/jfree/chart/axis/Axis.html).
Let's also add a vertical line at x=0 and a horizontal line at f(x)=1. We will need to
access the underlying JFreeChart plot to add these lines. This is available (somewhat
confusingly) as the .plot attribute in our Breeze Plot object:
scala> plt.plot
org.jfree.chart.plot.XYPlot = org.jfree.chart.plot.XYPlot@17e4db6c

We can use the .addDomainMarker and .addRangeMarker methods to add vertical
and horizontal lines to JFreeChart XYPlot objects:
scala> import org.jfree.chart.plot.ValueMarker
import org.jfree.chart.plot.ValueMarker
scala> plt.plot.addDomainMarker(new ValueMarker(0.0))
scala> plt.plot.addRangeMarker(new ValueMarker(1.0))

Let's also add labels to the axes:
scala> plt.xlabel = "x"
plt.xlabel: String = x
scala> plt.ylabel = "f(x)"
plt.ylabel: String = f(x)


If you have run all these commands, you should have a graph that looks like this:

We now know how to customize the basic building blocks of a graph. The next step
is to learn how to change how curves are drawn.

Customizing the line type
So far, we have just plotted lines using the default settings. Breeze lets us customize
how lines are drawn, at least to some extent.
For this example, we will use the height-weight data discussed in Chapter 2,
Manipulating Data with Breeze. We will use the Scala shell here for demonstrative
purposes, but you will find a program in BreezeDemo.scala that follows the
example shell session.
The code examples for this chapter come with a module for loading the data,
HWData.scala, that loads the data from the CSVs:
scala> val data = HWData.load
data: HWData = HWData [ 181 rows ]
scala> data.heights
breeze.linalg.DenseVector[Double] = DenseVector(182.0, ...
scala> data.weights
breeze.linalg.DenseVector[Double] = DenseVector(77.0, 58.0...

Let's create a scatter plot of the heights against the weights:
scala> val fig = Figure("height vs. weight")
fig: breeze.plot.Figure = breeze.plot.Figure@743f2558
scala> val plt = fig.subplot(0)
plt: breeze.plot.Plot = breeze.plot.Plot@501ea274
scala> plt += plot(data.heights, data.weights, '+',
colorcode="black")
breeze.plot.Plot = breeze.plot.Plot@501ea274

This produces a scatter-plot of the height-weight data:


Note that we passed a third argument to the plot method, '+'. This controls the
plotting style. As of this writing, there are three available styles: '-' (the default),
'+', and '.'. Experiment with these to see what they do. Finally, we pass a
colorcode="black" argument to control the color of the line. This is either a color
name or an RGB triple, written as a string. Thus, to plot red points, we could have
passed colorcode="[255,0,0]".
Looking at the height-weight plot, there is clearly a trend between height and
weight. Let's try and fit a straight line through the data points. We will fit the
following function:

weight (kg) = a + b × height (cm)

Scientific literature suggests that it would be better to fit
something more like mass ∝ height². You should find it
straightforward to fit a quadratic curve to the data, should you
wish to.

We will use Breeze's least squares function to find the values of a and b. The
leastSquares method expects an input matrix of features and a target vector, just
like the LogisticRegression class that we defined in the previous chapter. Recall
that in Chapter 2, Manipulating Data with Breeze, when we prepared the training set
for logistic regression classification, we introduced a dummy feature that was one for
every participant to provide the degree of freedom for the y intercept. We will use
the same approach here. Our feature matrix, therefore, contains two columns—one
that is 1 everywhere and one for the height:
scala> val features = DenseMatrix.horzcat(
DenseMatrix.ones[Double](data.npoints, 1),
data.heights.toDenseMatrix.t
)
features: breeze.linalg.DenseMatrix[Double] =
1.0  182.0
1.0  161.0
1.0  161.0
1.0  177.0
1.0  157.0
...
scala> import breeze.stats.regression._
import breeze.stats.regression._
scala> val leastSquaresResult = leastSquares(features, data.weights)
leastSquaresResult: breeze.stats.regression.LeastSquaresRegressionResult
= 

The leastSquares method returns an instance of LeastSquaresRegressionResult,
which contains a coefficients attribute containing the coefficients that best fit the
data:
scala> leastSquaresResult.coefficients
breeze.linalg.DenseVector[Double] = DenseVector(-131.042322, 1.1521875)

The best-fit line is therefore:

weight (kg) = −131.04 + 1.1522 × height (cm)

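For a single feature plus an intercept, the least-squares coefficients have a well-known closed form, which can serve as a sanity check on the values returned by leastSquares. The following is a minimal sketch using plain Scala collections rather than Breeze; the names SimpleLeastSquares and fit are illustrative and are not part of Breeze.

```scala
object SimpleLeastSquares {
  /** Fit y = a + b * x by ordinary least squares; returns (a, b).
    * Assumes xs and ys have the same length and xs is not constant. */
  def fit(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    require(xs.size == ys.size, "xs and ys must have the same length")
    val n = xs.size.toDouble
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // slope = sum of cross-deviations / sum of squared x-deviations
    val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = xs.map { x => (x - meanX) * (x - meanX) }.sum
    val b = sxy / sxx
    // the fitted line passes through the point (meanX, meanY)
    val a = meanY - b * meanX
    (a, b)
  }
}
```

Applied to the heights and weights converted to ordinary Scala sequences, this should agree with the Breeze coefficients up to floating-point error.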
Let's extract the coefficients. An elegant way of doing this is to use Scala's pattern
matching capabilities:
scala> val Array(a, b) = leastSquaresResult.coefficients.toArray
a: Double = -131.04232269750622
b: Double = 1.1521875435418725

By writing val Array(a, b) = ..., we are telling Scala that the right-hand side of
the expression is a two-element array and to bind the first element of that array to the
value a and the second to the value b. See Appendix, Pattern Matching and Extractors,
for a discussion of pattern matching.
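One caveat worth seeing in action: the compiler does not check that the array really has two elements, so a mismatch only surfaces at runtime. The sketch below, with illustrative names, shows both the successful binding and the failure case, captured with scala.util.Try.

```scala
import scala.util.Try

object ArrayPatternSketch {
  // The pattern succeeds only when the array has exactly two elements.
  val Array(a, b) = Array(1.5, 2.5)

  // A length mismatch is not caught at compile time: it throws a
  // scala.MatchError at runtime, captured here with Try.
  val mismatch: Try[Double] = Try {
    val Array(x, y) = Array(1.0, 2.0, 3.0)
    x + y
  }
}
```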
We can now add the best-fit line to our graph. We start by generating evenly-spaced
dummy height values:
scala> val dummyHeights = linspace(min(data.heights),
max(data.heights), 200)
dummyHeights: breeze.linalg.DenseVector[Double] = DenseVector(148.0, ...
scala> val fittedWeights = a :+ (b :* dummyHeights)
fittedWeights: breeze.linalg.DenseVector[Double] = DenseVector(39.4814...
scala> plt += plot(dummyHeights, fittedWeights, colorcode="red")
breeze.plot.Plot = breeze.plot.Plot@501ea274


Let's also add the equation for the best-fit line to the graph as an annotation. We will
first generate the label:
scala> val label = f"weight = $a%.4f + $b%.4f * height"
label: String = weight = -131.0423 + 1.1522 * height

To add an annotation, we must access the underlying JFreeChart plot:
scala> import org.jfree.chart.annotations.XYTextAnnotation
import org.jfree.chart.annotations.XYTextAnnotation
scala> plt.plot.addAnnotation(new XYTextAnnotation(label, 175.0, 105.0))

The XYTextAnnotation constructor takes three parameters: the annotation string
and a pair of (x, y) coordinates defining the centre of the annotation on the graph.
The coordinates of the annotation are expressed in the coordinate system of the
data. Thus, calling new XYTextAnnotation(label, 175.0, 105.0) generates an
annotation whose centroid is at the point corresponding to a height of 175 cm and
weight of 105 kg:


More advanced scatter plots
Breeze-viz offers a scatter function that adds a significant degree of customization
to scatter plots. In particular, we can use the size and color of the marker points to
add additional dimensions of information to the plot.
The scatter function takes, as its first two arguments, collections of x and y points.
The third argument is a function mapping an integer i to a Double indicating the
size of the ith point. The size of the point is measured in units of the x axis. If you
have the sizes as a Scala collection or a Breeze vector, you can use that collection's
apply method as the function. Let's see how this works in practice.
As with the previous examples, we will use the REPL, but you can find a sample
program in BreezeDemo.scala:
scala> val fig = new Figure("Advanced scatter example")
fig: breeze.plot.Figure = breeze.plot.Figure@220821bc
scala> val plt = fig.subplot(0)
plt: breeze.plot.Plot = breeze.plot.Plot@668f8ae0
scala> val xs = linspace(0.0, 1.0, 100)
xs: breeze.linalg.DenseVector[Double] = DenseVector(0.0,
0.010101010101010102, 0.0202 ...
scala> val sizes = 0.025 * DenseVector.rand(100) // random sizes
sizes: breeze.linalg.DenseVector[Double] =
DenseVector(0.014879265631723166, 0.00219551...
scala> plt += scatter(xs, xs :^ 2.0, sizes.apply)
breeze.plot.Plot = breeze.plot.Plot@668f8ae0
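The sizes.apply idiom used above is ordinary Scala: a method passed where a function is expected is converted to a function value. Here is a small sketch with a plain Scala Vector standing in for the Breeze vector (the object and value names are illustrative):

```scala
object SizeFunctionSketch {
  val sizes: Vector[Double] = Vector(0.01, 0.02, 0.005)

  // Passing the apply method where a function is expected
  // converts it to an Int => Double function.
  val sizeOf: Int => Double = sizes.apply

  // An explicit function works just as well, for instance to give
  // every point the same size regardless of its index.
  val constantSize: Int => Double = _ => 0.01
}
```

Any Int => Double function can play this role, so sizes can also be computed on the fly from the index rather than stored in a collection.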


Selecting custom colors works in a similar manner: we pass in a colors argument
that maps an integer index to a java.awt.Paint object. Using these directly
can be cumbersome, so Breeze provides some default palettes. For instance, the
GradientPaintScale maps doubles in a given domain to a uniform color gradient.
Let's map doubles in the range 0.0 to 1.0 to the colors between red and green:
scala> val palette = new GradientPaintScale(
0.0, 1.0, PaintScale.RedToGreen)
palette: breeze.plot.GradientPaintScale[Double] = 
scala> palette(0.5) // half-way between red and green
java.awt.Paint = java.awt.Color[r=127,g=127,b=0]
scala> palette(1.0) // green
java.awt.Paint = java.awt.Color[r=0,g=254,b=0]
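To build some intuition for what a gradient scale does, here is a rough sketch that linearly interpolates between red and green. It is not Breeze's implementation (the REPL output above shows that Breeze computes its colors slightly differently), and it returns plain (r, g, b) triples rather than java.awt.Paint objects to keep the example free of dependencies. The names are illustrative.

```scala
object GradientSketch {
  /** Map a value in [lower, upper] to an (r, g, b) triple between
    * pure red and pure green. Out-of-range values are clamped. */
  def redToGreen(lower: Double, upper: Double)(value: Double): (Int, Int, Int) = {
    require(upper > lower, "upper must exceed lower")
    // t is the position of value within the domain, clamped to [0, 1]
    val t = math.max(0.0, math.min(1.0, (value - lower) / (upper - lower)))
    val red = math.round(255 * (1.0 - t)).toInt
    val green = math.round(255 * t).toInt
    (red, green, 0)
  }
}
```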

Besides the GradientPaintScale, breeze-viz provides a CategoricalPaintScale
class for categorical palettes. For an overview of the different palettes, consult the
source file PaintScale.scala:
https://github.com/scalanlp/breeze/blob/master/viz/src/main/scala/breeze/plot/PaintScale.scala.
Let's use our newfound knowledge to draw a multicolor scatter plot. We will assume
the same initialization as the previous example. We will assign a random color to
each point:
scala> val palette = new GradientPaintScale(0.0, 1.0,
PaintScale.MaroonToGold)
palette: breeze.plot.GradientPaintScale[Double] = 
scala> val colors = DenseVector.rand(100).mapValues(palette)
colors: breeze.linalg.DenseVector[java.awt.Paint] = DenseVector(java.awt.
Color[r=162,g=5,b=0], ...
scala> plt += scatter(xs, xs :^ 2.0, sizes.apply, colors.apply)
breeze.plot.Plot = breeze.plot.Plot@8ff7e27


Multi-plot example – scatterplot matrix plots
In this section, we will learn how to have several plots in the same figure.
The key new method that allows multiple plots in the same figure is
fig.subplot(nrows, ncols, plotIndex). This method, an overloaded version of the
fig.subplot method we have been using up to now, both sets the number of rows
and columns in the figure and returns a specific subplot. It takes three arguments:
• nrows: The number of rows of subplots in the figure
• ncols: The number of columns of subplots in the figure
• plotIndex: The index of the plot to return

Users familiar with MATLAB or matplotlib will note that the .subplot method closely
mirrors the eponymous methods in these frameworks, except that plot indices here
start at 0 rather than 1. This might seem a little complex, so let's look at an
example (you will find the code for this in BreezeDemo.scala):
import breeze.plot._

def subplotExample {
  val data = HWData.load
  val fig = new Figure("Subplot example")

  // upper subplot: plot index '0' refers to the first plot
  var plt = fig.subplot(2, 1, 0)
  plt += plot(data.heights, data.weights, '.')

  // lower subplot: plot index '1' refers to the second plot
  plt = fig.subplot(2, 1, 1)
  plt += plot(data.heights, data.reportedHeights, '.',
    colorcode="black")

  fig.refresh
}

Running this example produces the following plot:

Now that we have a basic grasp of how to add several subplots to the same figure,
let's do something a little more interesting. We will write a class to draw scatterplot
matrices. These are useful for exploring correlations between different features.


If you are not familiar with scatterplot matrices, have a look at the figure at the end
of this section for an idea of what we are constructing. The idea is to build a square
matrix of scatter plots for each pair of features. Element (i, j) in the matrix is a scatter
plot of feature i against feature j. Since a scatter plot of a variable against itself is
of limited use, one normally draws histograms of each feature along the diagonal.
Finally, since a scatter plot of feature i against feature j contains the same information
as a scatter plot of feature j against feature i, one normally only plots the upper
triangle or the lower triangle of the matrix.
Let's start by writing functions for the individual plots. These will take a Plot object
referencing the correct subplot and vectors of the data to plot:
import breeze.plot._
import breeze.linalg._

class ScatterplotMatrix(val fig:Figure) {

  /** Draw the histograms on the diagonal */
  private def plotHistogram(plt:Plot)(
      data:DenseVector[Double], label:String) {
    plt += hist(data)
    plt.xlabel = label
  }

  /** Draw the off-diagonal scatter plots */
  private def plotScatter(plt:Plot)(
      xdata:DenseVector[Double],
      ydata:DenseVector[Double],
      xlabel:String,
      ylabel:String) {
    plt += plot(xdata, ydata, '.')
    plt.xlabel = xlabel
    plt.ylabel = ylabel
  }

  ...

Notice the use of hist(data) to draw a histogram. The argument to hist must
be a vector of data points. The hist method will bin these and represent them
as a histogram.
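To make the binning step concrete, here is a minimal sketch of equal-width binning in plain Scala. This is an assumption about the general technique, not a transcription of what hist does internally; the names BinningSketch and binCounts are illustrative.

```scala
object BinningSketch {
  /** Count how many values fall into each of nBins equal-width bins
    * spanning [min, max]. The maximum value is placed in the last bin. */
  def binCounts(data: Seq[Double], nBins: Int): Vector[Int] = {
    require(data.nonEmpty && nBins > 0, "need data and at least one bin")
    val lo = data.min
    val hi = data.max
    val width = (hi - lo) / nBins
    val counts = Array.fill(nBins)(0)
    data.foreach { x =>
      // Guard against width == 0 (all values equal) and against x == hi,
      // which would otherwise land one past the last bin.
      val raw = if (width == 0.0) 0 else ((x - lo) / width).toInt
      counts(math.min(raw, nBins - 1)) += 1
    }
    counts.toVector
  }
}
```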


Now that we have the machinery for drawing individual plots, we just need to wire
everything together. The tricky part is to know how to select the correct subplot
for a given row and column position in the matrix. We can select a single plot by
calling fig.subplot(nrows, ncolumns, plotIndex), but translating from a (row,
column) index pair to a single plotIndex is not obvious. The plots are numbered in
increasing order, first from left to right, then from top to bottom:
0 1 2 3
4 5 6 7
...

Let's write a short function to select a plot at a (row, column) index pair:
private def selectPlot(ncols:Int)(irow:Int, icol:Int):Plot = {
  fig.subplot(ncols, ncols, irow*ncols + icol)
}
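The index arithmetic in selectPlot is the usual row-major mapping. The sketch below isolates it from the plotting code so that it can be checked on its own, together with the inverse mapping; the names toIndex and toRowCol are illustrative.

```scala
object PlotIndexSketch {
  /** Row-major index of the cell at (irow, icol) in a grid with ncols columns. */
  def toIndex(ncols: Int)(irow: Int, icol: Int): Int = irow * ncols + icol

  /** Inverse mapping: recover (irow, icol) from a row-major index. */
  def toRowCol(ncols: Int)(index: Int): (Int, Int) = (index / ncols, index % ncols)
}
```

With four columns, the cell in row 1 and column 2 maps to index 6, matching the numbering shown earlier.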

We are now in a position to draw the matrix plot itself:
  /** Draw a scatterplot matrix.
    *
    * This function draws a scatterplot matrix of the correlation
    * between each pair of columns in `featureMatrix`.
    *
    * @param featureMatrix A matrix of features, with each column
    *   representing a feature.
    * @param labels Names of the features.
    */
  def plotFeatures(
    featureMatrix:DenseMatrix[Double],
    labels:List[String]
  ) {
    val ncols = featureMatrix.cols
    require(ncols == labels.size,
      "Number of columns in feature matrix "+
      "must match length of labels"
    )
    fig.clear
    fig.subplot(ncols, ncols, 0)

    (0 until ncols) foreach { irow =>
      val p = selectPlot(ncols)(irow, irow)
      plotHistogram(p)(featureMatrix(::, irow), labels(irow))

      (0 until irow) foreach { icol =>
        val p = selectPlot(ncols)(irow, icol)
        plotScatter(p)(
          featureMatrix(::, irow),
          featureMatrix(::, icol),
          labels(irow),
          labels(icol)
        )
      }
    }
  }
}

Let's write an example for our class. We will use the height-weight data again:
import breeze.linalg._
import breeze.numerics._
import breeze.plot._

object ScatterplotMatrixDemo extends App {
  val data = HWData.load
  val m = new ScatterplotMatrix(Figure("Scatterplot matrix demo"))

  // Make a matrix with three columns: the height, weight and
  // reported weight data.
  val featureMatrix = DenseMatrix.horzcat(
    data.heights.toDenseMatrix.t,
    data.weights.toDenseMatrix.t,
    data.reportedWeights.toDenseMatrix.t
  )
  m.plotFeatures(featureMatrix,
    List("height", "weight", "reportedWeights"))
}


Running this through SBT produces the following plot:

Managing without documentation
Breeze-viz is unfortunately rather poorly documented. This can make the learning
curve somewhat steep. Fortunately, it is still quite a small project: at the time of
writing, there are just ten source files
(https://github.com/scalanlp/breeze/tree/master/viz/src/main/scala/breeze/plot).
A good way to understand
exactly what breeze-viz does is to read the source code. For instance, to see what
methods are available on a Plot object, read the source file Plot.scala. If you
need functionality beyond that provided by Breeze, consult the documentation
for JFreeChart to discover if you can implement what you need by accessing the
underlying JFreeChart objects.


Breeze-viz reference
Writing a reference in a programming book is a dangerous exercise: you quickly
become out of date. Nevertheless, given the paucity of documentation for breeze-viz,
this section becomes more relevant – it is easier to compete against something that
does not exist. Take this section with a pinch of salt, and if a command in this section
does not work, head over to the source code:
Command                               Description
plt += plot(xs, ys)                   This plots a series of (xs, ys) values. The xs and ys
                                      values must be collection-like objects (Breeze vectors,
                                      Scala arrays, or lists, for instance).
plt += scatter(xs, ys, size)          This plots a series of (xs, ys) values as a scatter plot.
plt += scatter(xs, ys, size, color)   The size argument is an (Int) => Double function
                                      mapping the index of a point to its size (in the same
                                      units as the x axis). The color argument is an (Int)
                                      => java.awt.Paint function mapping from integers
                                      to colors. Read the More advanced scatter plots section
                                      for further details.
plt += hist(xs)                       This bins xs and plots a histogram. The bins argument
plt += hist(xs, bins=10)              controls the number of bins.
plt += image(mat)                     This plots an image or matrix. The mat argument
                                      should be Matrix[Double]. Read the package.scala
                                      source file in breeze.plot for details
                                      (https://github.com/scalanlp/breeze/blob/master/viz/src/main/scala/breeze/plot/package.scala).

It is also useful to summarize the options available on a plot object:
Attribute                 Description
plt.xlabel = "x-label"    This sets the axis label
plt.ylabel = "y-label"
plt.xlim = (0.0, 1.0)     This sets the axis maximum and minimum value
plt.ylim = (0.0, 1.0)
plt.logScaleX = true      This switches the axis to a log scale
plt.logScaleY = true
plt.title = "title"       This sets the plot title


Data visualization beyond breeze-viz
Other tools for data visualization in Scala are emerging: Spark notebooks
(https://github.com/andypetrella/spark-notebook#description), based on the IPython
notebook, and Apache Zeppelin (https://zeppelin.incubator.apache.org). Both
of these rely on Apache Spark, which we will explore later in this book.

Summary
In this chapter, we learned how to draw simple charts with breeze-viz. In the last
chapter of this book, we will learn how to build interactive visualizations using
JavaScript libraries.
Next, we will learn about basic Scala concurrency constructs—specifically,
parallel collections.


Parallel Collections and Futures
Data science often involves processing medium or large amounts of data. Since the
previously exponential growth in the speed of individual CPUs has slowed down
and the amount of data continues to increase, leveraging computers effectively must
entail parallel computation.
In this chapter, we will look at ways of parallelizing computation and data
processing over a single computer. Virtually all new computers have more than one
processing unit, and distributing a calculation over these cores can be an effective
way of hastening medium-sized calculations.
Parallelizing calculations over a single chip is suitable for calculations involving
gigabytes or a few terabytes of data. For larger data flows, we must resort to
distributing the computation over several computers in parallel. We will discuss
Apache Spark, a framework for parallel data processing, in Chapter 10, Distributed
Batch Processing with Spark.
In this book, we will look at three common ways of leveraging parallel architectures
in a single machine: parallel collections, futures, and actors. We will consider the first
two in this chapter, and leave the study of actors to Chapter 9, Concurrency with Akka.

Parallel collections
Parallel collections offer an extremely easy way to parallelize independent tasks.
The reader, being familiar with Scala, will know that many tasks can be phrased as
operations on collections, such as map, reduce, filter, or groupBy. Parallel collections
are an implementation of Scala collections that parallelize these operations to run
over several threads.

Let's start with an example. We want to calculate the frequency of occurrence of each
letter in a sentence:
scala> val sentence = "The quick brown fox jumped over the lazy dog"
sentence: String = The quick brown fox jumped ...

Let's start by converting our sentence from a string to a vector of characters:
scala> val characters = sentence.toVector
Vector[Char] = Vector(T, h, e,  , q, u, i, c, k, ...)

We can now convert characters to a parallel vector, a ParVector. To do this, we use
the par method:
scala> val charactersPar = characters.par
ParVector[Char] = ParVector(T, h, e,  , q, u, i, c, k,  , ...)

ParVector collections support the same operations as regular vectors, but their
methods are executed in parallel over several threads.

Let's start by filtering out the spaces in charactersPar:
scala> val lettersPar = charactersPar.filter { _ != ' ' }
ParVector[Char] = ParVector(T, h, e, q, u, i, c, k, ...)

Notice how Scala hides the execution details. The filter operation was performed
using multiple threads, and you barely even noticed! The interface and behavior of a
parallel vector is identical to its serial counterpart, save for a few details that we will
explore in the next section.
Let's now use the toLower function to make the letters lowercase:
scala> val lowerLettersPar = lettersPar.map { _.toLower }
ParVector[Char] = ParVector(t, h, e, q, u, i, c, k, ...)

As before, the map method was applied in parallel. To find the frequency of
occurrence of each letter, we use the groupBy method to group characters into
vectors containing all the occurrences of that character:
scala> val intermediateMap = lowerLettersPar.groupBy(identity)
ParMap[Char,ParVector[Char]] = ParMap(e -> ParVector(e, e, e, e), ...)


Note how the groupBy method has created a ParMap instance, the parallel equivalent
of an immutable map. To get the number of occurrences of each letter, we do a
mapValues call on intermediateMap, replacing each vector by its length:
scala> val occurenceNumber = intermediateMap.mapValues { _.length }
ParMap[Char,Int] = ParMap(e -> 4, x -> 1, n -> 1, j -> 1, ...)

Congratulations! We've written a multi-threaded algorithm for finding the frequency
of occurrence of each letter in a few lines of code. You should find it straightforward
to adapt this to find the frequency of occurrence of each word in a document, a
common preprocessing problem for analyzing text data.
Parallel collections make it very easy to parallelize some operation pipelines: all we
had to do was call .par on the characters vector. All subsequent operations were
parallelized. This makes switching from a serial to a parallel implementation very easy.
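As a starting point for the word-frequency adaptation mentioned above, here is a minimal serial sketch. The tokenization rule (splitting on runs of non-word characters) is an assumption made for illustration, and the names WordCountSketch and wordCounts are hypothetical.

```scala
object WordCountSketch {
  /** Count occurrences of each lowercase word in the text. */
  def wordCounts(text: String): Map[String, Int] =
    text.toLowerCase
      .split("\\W+")          // split on runs of non-word characters
      .filter(_.nonEmpty)     // drop the empty token a leading separator produces
      .toVector
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

With the Scala version used in this chapter, adding a .par call after .toVector should parallelize the grouping and counting steps, provided the return type is adjusted to the parallel map that results.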

Limitations of parallel collections
Part of the power and the appeal of parallel collections is that they present the same
interface as their serial counterparts: they have a map method, a foreach method, a
filter method, and so on. By and large, these methods work in the same way on
parallel collections as they do in serial. There are, however, some notable caveats.
The most important one has to do with side effects. If an operation on a parallel
collection has a side effect, this may result in a race condition: a situation in which
the final result depends on the order in which the threads perform their operations.
Side effects in collections arise most commonly when we update a variable defined
outside of the collection. To give a trivial example of unexpected behavior, let's
define a count variable and increment it a thousand times using a parallel range:
scala> var count = 0
count: Int = 0
scala> (0 until 1000).par.foreach { i => count += 1 }
scala> count
count: Int = 874 // not 1000!


What happened here? The function passed to foreach has a side effect: it increments
count, a variable outside of the scope of the function. This is a problem because the
+= operator is a sequence of two operations:
• Retrieve the value of count and add one to it
• Assign the result back to count

To understand why this causes unexpected behavior, let's imagine that the foreach
loop has been parallelized over two threads. Thread A might read the count variable
when it is 832 and add one to it to give 833. Before it has time to reassign 833 to
count, Thread B reads count, still at 832, and adds one to give 833. Thread A then
assigns 833 to count. Thread B then assigns 833 to count. We've run through two
updates but only incremented the count by one. The problem arises because += can
be separated into two instructions: it is not atomic. This leaves room for threads to
interleave their operations:

The anatomy of a race condition: both thread A and thread B are trying to update count concurrently,
resulting in one of the updates being overwritten. The final value of count is 833 instead of 834.
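If a shared counter really is needed, one way to remove the race is to make the increment atomic. The sketch below uses java.util.concurrent.atomic.AtomicInteger and, as a simplifying assumption, plain Java threads in place of the threads that a parallel collection would manage for us. The names are illustrative.

```scala
import java.util.concurrent.atomic.AtomicInteger

object AtomicCounterSketch {
  /** Increment a shared counter perThread times from each of nThreads threads. */
  def countWithThreads(nThreads: Int, perThread: Int): Int = {
    val count = new AtomicInteger(0)
    val threads = (0 until nThreads).map { _ =>
      new Thread(new Runnable {
        // incrementAndGet reads, adds one and writes back as a single atomic step
        def run(): Unit = (0 until perThread).foreach { _ => count.incrementAndGet() }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())   // wait for every thread to finish
    count.get()
  }
}
```

Because no update can be interleaved with another, four threads performing 250 increments each always produce 1000. Avoiding the shared variable altogether usually remains the cleaner option.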

To give a somewhat more realistic example of problems caused by non-atomicity,
let's look at a different method for counting the frequency of occurrence of each letter
in our sentence. We define a mutable Char -> Int hash map outside of the loop.
Each time we encounter a letter, we increment the corresponding integer in the map:
scala> import scala.collection.mutable
import scala.collection.mutable
scala> val occurenceNumber = mutable.Map.empty[Char, Int]
occurenceNumber: mutable.Map[Char,Int] = Map()
scala> lowerLettersPar.foreach { c =>
occurenceNumber(c) = occurenceNumber.getOrElse(c, 0) + 1
}
scala> occurenceNumber('e') // Should be 4
Int = 2

The discrepancy occurs because of the non-atomicity of the operations in the
foreach loop.
In general, it is good practice to avoid side effects in higher-order functions on
collections. They make the code harder to understand and preclude switching from
serial to parallel collections. It is also good practice to avoid exposing mutable state:
immutable objects can be shared freely between threads and cannot be affected by
side effects.
Another limitation of parallel collections occurs in reduction (or folding) operations.
The function used to combine items together must be associative. For instance:
scala> (0 until 1000).par.reduce {_ - _ } // should be -499500
Int = 63620

The minus operator, -, is not associative. The order in which consecutive operations
are applied matters: (a - b) - c is not the same as a - (b - c). The function
used to reduce a parallel collection must be associative because the order in which
the reduction occurs is not tied to the order of the collection.
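We can reproduce the effect without any threads by reducing a collection in chunks and then reducing the partial results, which is roughly how the work might be divided in a parallel reduction. The chunking scheme here is an assumption made for illustration and does not reflect how Scala actually splits a parallel collection.

```scala
object ChunkedReduceSketch {
  /** Reduce each chunk separately, then reduce the partial results,
    * mimicking how work might be split across threads. */
  def chunkedReduce(xs: Seq[Int], chunkSize: Int)(op: (Int, Int) => Int): Int =
    xs.grouped(chunkSize).map(_.reduce(op)).reduce(op)

  val xs: Seq[Int] = 0 until 1000

  // Addition is associative: the grouping does not matter.
  val serialSum = xs.reduce(_ + _)
  val chunkedSum = chunkedReduce(xs, 250)(_ + _)

  // Subtraction is not: the grouping changes the answer.
  val serialDiff = xs.reduce(_ - _)
  val chunkedDiff = chunkedReduce(xs, 250)(_ - _)
}
```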

Error handling
In single-threaded programs, exception handling is relatively straightforward: if an
exception occurs, the function can either handle it or escalate it. This is not nearly
as obvious when parallelism is introduced: a single thread might fail, but the others
might return successfully.
Parallel collection methods will throw an exception if they fail on any element, just
like their serial counterparts:
scala> Vector(2, 0, 5).par.map { 10 / _ }
java.lang.ArithmeticException: / by zero
...


There are cases when this isn't the behavior that we want. For instance, we might
be using a parallel collection to retrieve a large number of web pages in parallel.
We might not mind if a few of the pages cannot be fetched.
Scala's Try type was designed for sandboxing code that might throw exceptions. It is
similar to Option in that it is a one-element container:
scala> import scala.util._
import scala.util._
scala> Try { 2 + 2 }
Try[Int] = Success(4)

Unlike the Option type, which indicates whether an expression has a useful value,
the Try type indicates whether an expression can be executed without throwing an
exception. It takes on the following two values:
• Try { 2 + 2 } == Success(4) if the expression in the Try statement is
  evaluated successfully
• Try { 2 / 0 } == Failure(java.lang.ArithmeticException: / by zero) if the
  expression in the Try block results in an exception
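The two cases above can be handled with ordinary pattern matching. The following sketch wraps an integer division in Try and turns the result into a message; safeDivide and describe are illustrative names.

```scala
import scala.util.{Try, Success, Failure}

object TrySketch {
  /** Integer division that captures the exception instead of throwing it. */
  def safeDivide(numerator: Int, denominator: Int): Try[Int] =
    Try { numerator / denominator }

  /** Turn a Try into a human-readable message by pattern matching. */
  def describe(result: Try[Int]): String = result match {
    case Success(value) => s"ok: $value"
    case Failure(error) => s"failed: ${error.getMessage}"
  }
}
```

Like Option, Try also offers getOrElse, so TrySketch.safeDivide(1, 0).getOrElse(-1) falls back to -1 rather than throwing.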

This will make more sense with an example. To see the Try type in action, we will
try to fetch web pages in a fault tolerant manner. We will use the built-in
Source.fromURL method which fetches a web page and opens an iterator of the page's
content. If it fails to fetch the web page, it throws an exception:
scala> import scala.io.Source
import scala.io.Source
scala> val html = Source.fromURL("http://www.google.com")
scala.io.BufferedSource = non-empty iterator
scala> val html = Source.fromURL("garbage")
java.net.MalformedURLException: no protocol: garbage
...

Instead of letting the exception propagate out and crash the rest of our code, we can
wrap the call to Source.fromURL in Try:
scala> Try { Source.fromURL("http://www.google.com") }

Try[BufferedSource] = Success(non-empty iterator)
scala> Try { Source.fromURL("garbage") }
Try[BufferedSource] = Failure(java.net.MalformedURLException: no
protocol: garbage)

To see the power of our Try statement, let's now retrieve a list of URLs in parallel in
a fault tolerant manner:
scala> val URLs = Vector("http://www.google.com",
"http://www.bbc.co.uk",
"not-a-url"
)
URLs: Vector[String] = Vector(http://www.google.com, http://www.bbc.
co.uk, not-a-url)
scala> val pages = URLs.par.map { url =>
url -> Try { Source.fromURL(url) }
}
pages: ParVector[(String, Try[BufferedSource])] = ParVector((http://
www.google.com,Success(non-empty iterator)), (http://www.bbc.
co.uk,Success(non-empty iterator)), (not-a-url,Failure(java.net.
MalformedURLException: no protocol: not-a-url)))

We can then use a collect statement to act on the pages we could fetch successfully.
For instance, to get the number of characters on each page:
scala> pages.collect { case(url, Success(it)) => url -> it.size }
ParVector[(String, Int)] = ParVector((http://www.google.com,18976),
(http://www.bbc.co.uk,132893))

By making good use of Scala's built-in Try classes and parallel collections, we have
built a fault tolerant, multithreaded URL retriever in a few lines of code. (Compare
this to the myriad of Java/C++ books that prefix code examples with 'error handling
is left out for clarity'.)
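The same wrap-then-collect pattern can be tried without network access. In the sketch below, parsing a string as an integer stands in for Source.fromURL (an assumption made purely so that the example is self-contained), and a second collect gathers the error messages for the inputs that failed.

```scala
import scala.util.{Try, Success, Failure}

object CollectSuccessSketch {
  val inputs = Vector("42", "7", "not-a-number")

  // Pair each input with the outcome of parsing it.
  val parsed: Vector[(String, Try[Int])] =
    inputs.map { s => s -> Try { s.toInt } }

  // Keep only the inputs that parsed successfully...
  val successes: Vector[(String, Int)] =
    parsed.collect { case (s, Success(n)) => s -> n }

  // ...and, separately, the error messages for those that did not.
  val failures: Vector[(String, String)] =
    parsed.collect { case (s, Failure(e)) => s -> e.getMessage }
}
```

Keeping the failures alongside the successes makes it easy to log or retry the inputs that could not be processed.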


The Try type versus try/catch statements
Programmers with imperative or object-oriented backgrounds will be
more familiar with try/catch blocks for handling exceptions. We could
have accomplished similar functionality here by wrapping the code for
fetching URLs in a try block, returning null if the call raises an exception.
However, besides being more verbose, returning null is less satisfactory:
we lose all information about the exception and null is less expressive
than Failure(exception). Furthermore, returning a Try[T] type
forces the caller to consider the possibility that the function might fail, by
encoding this possibility in the type of the return value. In contrast, just
returning T and coding failure with a null value allows the caller to ignore
failure, raising the possibility of a confusing NullPointerException
being thrown at a completely different point in the program.
In short, Try[T] is just another higher-order type, like Option[T] or
List[T]. Treating the possibility of failure in the same way as the rest of
the code adds coherence to the program and encourages programmers to
tackle the possibility of exceptions explicitly.

Setting the parallelism level
So far, we have considered parallel collections as black boxes: add par to a normal
collection and all the operations are performed in parallel. Often, we will want more
control over how the tasks are executed.
Internally, parallel collections work by distributing an operation over multiple
threads. Since the threads share memory, parallel collections do not need to copy any
data. Changing the number of threads available to the parallel collection will change
the number of CPUs that are used to perform the tasks.
Parallel collections have a tasksupport attribute that controls task execution:
scala> val parRange = (0 to 100).par
parRange: ParRange = ParRange(0, 1, 2, 3, 4, 5,...
scala> parRange.tasksupport
TaskSupport = scala.collection.parallel.ExecutionContextTaskSupport@311a0b3e
scala> parRange.tasksupport.parallelismLevel
Int = 8 // Number of threads to be used


The task support object of a collection is an execution context, an abstraction capable
of executing Scala expressions in a separate thread. By default, the execution context
in Scala 2.11 is a work-stealing thread pool. When a parallel collection submits tasks,
the context allocates these tasks to its threads. If a thread finds that it has finished
its queued tasks, it will try and steal outstanding tasks from the other threads. The
default execution context maintains a thread pool with a number of threads equal to
the number of CPUs.
The number of threads over which the parallel collection distributes the work can
be changed by changing the task support. For instance, to parallelize the operations
performed by a range over four threads:
scala> import scala.collection.parallel._
import scala.collection.parallel._
scala> parRange.tasksupport = new ForkJoinTaskSupport(
new scala.concurrent.forkjoin.ForkJoinPool(4)
)
parRange.tasksupport: scala.collection.parallel.TaskSupport = scala.
collection.parallel.ForkJoinTaskSupport@6e1134e1
scala> parRange.tasksupport.parallelismLevel
Int = 4

An example – cross-validation with parallel collections
Let's apply what you have learned so far to solve data science problems. There are
many parts of a machine learning pipeline that can be parallelized trivially. One such
part is cross-validation.
We will give a brief description of cross-validation here, but you can refer to
The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman for a more
in-depth discussion.
Typically, a supervised machine learning problem involves training an algorithm
over a training set. For instance, when we built a model to calculate the probability
of a person being male based on their height and weight, the training set was the
(height, weight) data for each participant, together with the male/female label for
each row. Once the algorithm is trained on the training set, we can use it to classify
new data. This process only really makes sense if the training set is representative of
the new data that we are likely to encounter.

The training set has a finite number of entries. It will thus, inevitably, have
idiosyncrasies that are not representative of the population at large, merely due to
its finite nature. These idiosyncrasies will result in prediction errors when predicting
whether a new person is male or female, over and above the prediction error of the
algorithm on the training set itself. Cross-validation is a tool for estimating the error
caused by the idiosyncrasies of the training set that do not reflect the population
at large.
Cross-validation works by dividing the training set in two parts: a smaller, new
training set and a cross-validation set. The algorithm is trained on the reduced
training set. We then see how well the algorithm models the cross-validation set.
Since we know the right answer for the cross-validation set, we can measure how
well our algorithm is performing when shown new information. We repeat this
procedure many times with different cross-validation sets.
There are several different types of cross-validation, which differ in how we
choose the cross-validation set. In this chapter, we will look at repeated random
subsampling: we select k rows at random from the training data to form the
cross-validation set. We do this many times, calculating the cross-validation error for
each subsample. Since each iteration is independent of the previous ones, we
can parallelize this process trivially. It is therefore a good candidate for parallel
collections. We will look at an alternative form of cross-validation, k-fold
cross-validation, in Chapter 12, Distributed Machine Learning with MLlib.
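Before building the Breeze-based class, the subsampling procedure described above can be sketched with the standard library alone. This is a minimal illustration, not the implementation used in this chapter: the names SubsampleSketch, randomSplit, and meanCvError are hypothetical, and the loop is sequential so that the sketch has no dependencies.

```scala
import scala.util.Random

object SubsampleSketch {

  /** Split the indices 0 until nElems into (training, test),
    * holding out k indices chosen at random for the test set.
    */
  def randomSplit(nElems: Int, k: Int, rng: Random): (Vector[Int], Vector[Int]) = {
    require(k > 0 && k < nElems, "k must satisfy 0 < k < nElems")
    val shuffled = rng.shuffle((0 until nElems).toVector)
    val (test, training) = shuffled.splitAt(k)
    (training, test)
  }

  /** Run nRuns independent random splits and average the error
    * returned by the user-supplied cvError function.
    */
  def meanCvError(nElems: Int, k: Int, nRuns: Int, rng: Random)(
      cvError: (Seq[Int], Seq[Int]) => Double): Double = {
    val errors = (0 until nRuns).map { _ =>
      val (training, test) = randomSplit(nElems, k, rng)
      cvError(training, test)
    }
    errors.sum / nRuns
  }
}
```

Each run depends only on its own split, which is what makes the loop a candidate for a parallel collection. Note that this sketch shares a single Random instance between runs, so it would need a per-task generator before being parallelized.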
We will build a class that performs cross-validation in parallel. I encourage you
to write the code as you go, but you will find the source code corresponding to
these examples on GitHub (https://github.com/pbugnion/s4ds). We will use
parallel collections to handle the parallelism and Breeze data types in the inner loop.
The build.sbt file is identical to the one we used in Chapter 2, Manipulating Data
with Breeze:
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
"org.scalanlp" %% "breeze" % "0.11.2",
"org.scalanlp" %% "breeze-natives" % "0.11.2"
)

We will build a RandomSubsample class. The class exposes a type alias, CVFunction,
for a function that takes two lists of indices—the first corresponding to the
reduced training set and the second to the validation set—and returns a Double
corresponding to the cross-validation error:
type CVFunction = (Seq[Int], Seq[Int]) => Double

The RandomSubsample class will expose a single method, mapSamples, which takes a
CVFunction, repeatedly passes it different partitions of indices, and returns a vector
of the errors. This is what the class looks like:
// RandomSubsample.scala
import breeze.linalg._
import breeze.numerics._

/** Random subsample cross-validation
  *
  * @param nElems Total number of elements in the training set.
  * @param nCrossValidation Number of elements to leave out of
  *   the training set.
  */
class RandomSubsample(val nElems:Int, val nCrossValidation:Int) {

  type CVFunction = (Seq[Int], Seq[Int]) => Double

  require(nElems > nCrossValidation,
    "nCrossValidation, the number of elements " +
    "withheld, must be < nElems")

  private val indexList = DenseVector.range(0, nElems)

  /** Perform multiple random sub-sample CV runs on f
    *
    * @param nShuffles Number of random sub-sample runs.
    * @param f user-defined function mapping from a list of
    *   indices in the training set and a list of indices in the
    *   test-set to a double indicating the out-of-sample score
    *   for this split.
    * @return DenseVector of the CV error for each random split.
    */
  def mapSamples(nShuffles:Int)(f:CVFunction)
  :DenseVector[Double] = {
    // `until` excludes nShuffles, giving exactly nShuffles runs
    val cvResults = (0 until nShuffles).par.map { i =>
      // Randomly split indices between test and training
      val shuffledIndices = breeze.linalg.shuffle(indexList)
      val Seq(testIndices, trainingIndices) =
        split(shuffledIndices, Seq(nCrossValidation))
      // Apply f for this split
      f(trainingIndices.toScalaVector,
        testIndices.toScalaVector)
    }
    DenseVector(cvResults.toArray)
  }
}

Let's look at what happens in more detail, starting with the arguments passed to
the constructor:
class RandomSubsample(val nElems:Int, val nCrossValidation:Int)

We pass the total number of elements in the training set and the number of elements
to leave out for cross-validation in the class constructor. Thus, passing 100 to nElems
and 20 to nCrossValidation implies that our training set will have 80 random
elements of the total data and that the test set will have 20 elements.
We then construct a vector of all integers from 0 up to, but not including, nElems:
private val indexList = DenseVector.range(0, nElems)

For each iteration of the cross-validation, we will shuffle this list and take the
first nCrossValidation elements to be the indices of rows in our test set and the
remaining to be the indices of rows in our training set.
Our class exposes a single method, mapSamples, that takes two curried arguments:
nShuffles, the number of times to perform random subsampling, and f, a
CVFunction:
def mapSamples(nShuffles:Int)(f:CVFunction):DenseVector[Double]

With all this set up, the code for doing cross-validation is deceptively simple. We
generate a parallel range with nShuffles elements and, for each item in the range,
generate a new train-test split and calculate the cross-validation error:
val cvResults = (0 until nShuffles).par.map { i =>
  val shuffledIndices = breeze.linalg.shuffle(indexList)
  val Seq(testIndices, trainingIndices) =
    split(shuffledIndices, Seq(nCrossValidation))
  f(trainingIndices.toScalaVector, testIndices.toScalaVector)
}

The only tricky part of this function is splitting the shuffled index list into a list
of indices for the training set and a list of indices for the test set. We use Breeze's
split method. This takes a vector as its first argument and a list of split-points as
its second, and returns a list of fragments of the original vector. We then use pattern
matching to extract the individual parts.
Finally, mapSamples converts cvResults to a Breeze vector:
DenseVector(cvResults.toArray)

Let's see this in action. We can test our class by running cross-validation on the
logistic regression example developed in Chapter 2, Manipulating Data with Breeze.
In that chapter, we developed a LogisticRegression class that takes a training
set (in the form of a DenseMatrix) and target (in the form of a DenseVector) at
construction time. The class then calculates the parameters that best represent the
training set. We will first add two methods to the LogisticRegression class to use
the trained model to classify previously unseen examples:
• The predictProbabilitiesMany method uses the trained model to calculate
the probability of having the target variable set to one. In the context of our
example, this is the probability of being male, given a height and weight.
• The classifyMany method assigns classification labels (one or zero) to
members of a test set. We will assign a one if predictProbabilitiesMany
returns a value greater than 0.5.

With these two functions, our LogisticRegression class becomes:
// LogisticRegression.scala
class LogisticRegression(
  val training:DenseMatrix[Double],
  val target:DenseVector[Double]
) {
  ...
  /** Probability of classification for each row
    * in test set.
    */
  def predictProbabilitiesMany(test:DenseMatrix[Double])
  :DenseVector[Double] = {
    val xBeta = test * optimalCoefficients
    sigmoid(xBeta)
  }

  /** Predict the value of the target variable
    * for each row in test set.
    */
  def classifyMany(test:DenseMatrix[Double])
  :DenseVector[Double] = {
    val probabilities = predictProbabilitiesMany(test)
    I((probabilities :> 0.5).toDenseVector)
  }
  ...
}
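
The logic of these two methods can be illustrated without Breeze. The sketch below uses plain Scala sequences in place of DenseMatrix and DenseVector; the object name ClassifySketch and its method names are hypothetical, and the coefficients are passed in rather than fitted.

```scala
object ClassifySketch {

  /** Logistic function, mapping any real number into (0, 1). */
  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  /** Probability of the positive class for each row:
    * the sigmoid of the dot product of the row with the coefficients.
    */
  def predictProbabilities(
      rows: Seq[Seq[Double]], coefficients: Seq[Double]): Seq[Double] =
    rows.map { row =>
      require(row.length == coefficients.length,
        "row and coefficient lengths differ")
      val xBeta = row.zip(coefficients).map { case (x, b) => x * b }.sum
      sigmoid(xBeta)
    }

  /** Label a row 1.0 if its probability is strictly greater than 0.5, else 0.0. */
  def classify(
      rows: Seq[Seq[Double]], coefficients: Seq[Double]): Seq[Double] =
    predictProbabilities(rows, coefficients).map { p =>
      if (p > 0.5) 1.0 else 0.0
    }
}
```

A row whose probability is exactly 0.5 is labelled zero here, matching the strict "greater than 0.5" rule described above.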

We can now put together an example program for our RandomSubsample class. We
will use the same height-weight data as in Chapter 2, Manipulating Data with Breeze.
The data preprocessing will be similar. The code examples for this chapter provide a
helper module, HWData, to load the height-weight data into Breeze vectors. The data
itself is in the data/ directory of the code examples for this chapter (available on
GitHub at https://github.com/pbugnion/s4ds/tree/master/chap04).
For each new subsample, we create a new LogisticRegression instance, train it
on the subset of the training set to get the best coefficients for this train-test split, and
use classifyMany to generate predictions on the cross-validation set in this split. We
then calculate the classification error and report the average classification error over
every train-test split:
// RandomSubsampleDemo.scala
import breeze.linalg._
import breeze.linalg.functions.manhattanDistance
import breeze.numerics._
import breeze.stats._

object RandomSubsampleDemo extends App {

  /* Load and pre-process data */
  val data = HWData.load
  val rescaledHeights:DenseVector[Double] =
    (data.heights - mean(data.heights)) / stddev(data.heights)
  val rescaledWeights:DenseVector[Double] =
    (data.weights - mean(data.weights)) / stddev(data.weights)
  val featureMatrix:DenseMatrix[Double] =
    DenseMatrix.horzcat(
      DenseMatrix.ones[Double](data.npoints, 1),
      rescaledHeights.toDenseMatrix.t,
      rescaledWeights.toDenseMatrix.t
    )
  val target:DenseVector[Double] = data.genders.values.map {
    gender => if(gender == 'M') 1.0 else 0.0
  }

  /* Cross-validation */
  val testSize = 20
  val cvCalculator = new RandomSubsample(data.npoints, testSize)

  // Start parallel CV loop
  val cvErrors = cvCalculator.mapSamples(1000) {
    (trainingIndices, testIndices) =>
      val regressor = new LogisticRegression(
        data.featureMatrix(trainingIndices, ::).toDenseMatrix,
        data.target(trainingIndices).toDenseVector
      )
      // Predictions on test-set
      val genderPredictions = regressor.classifyMany(
        data.featureMatrix(testIndices, ::).toDenseMatrix
      )
      // Calculate number of mis-classified examples
      val dist = manhattanDistance(
        genderPredictions, data.target(testIndices)
      )
      // Calculate mis-classification rate
      dist / testSize.toDouble
  }

  println(s"Mean classification error: ${mean(cvErrors)}")
}

Running this program on the height-weight data gives a classification error of 10%.
We now have a fully working, parallelized cross-validation class. Scala's parallel
range made it simple to repeatedly compute the same function in different threads.
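
The demo counts misclassified examples with a Manhattan distance. For labels that are all zero or one, each wrong prediction contributes exactly one to the distance, so dividing by the test set size gives the error rate. A minimal standard-library sketch of that arithmetic follows; the object and method names are hypothetical, and this is not Breeze's manhattanDistance.

```scala
object ErrorRateSketch {

  /** Sum of absolute element-wise differences between two sequences. */
  def manhattanDistance(a: Seq[Double], b: Seq[Double]): Double = {
    require(a.length == b.length, "sequences must have the same length")
    a.zip(b).map { case (x, y) => math.abs(x - y) }.sum
  }

  /** Fraction of 0/1 predictions that differ from the actual labels. */
  def misclassificationRate(
      predicted: Seq[Double], actual: Seq[Double]): Double =
    manhattanDistance(predicted, actual) / actual.length
}
```

For example, predictions (1, 0, 1, 1) against labels (1, 1, 1, 0) differ in two places out of four, giving a rate of 0.5.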

Futures
Parallel collections offer a simple, yet powerful, framework for parallel operations.
However, they are limited in one respect: the total amount of work must be
known in advance, and each thread must perform the same function (possibly
on different inputs).
Imagine that we want to write a program that fetches a web page (or queries a web
API) every few seconds and extracts data for further processing from this web page.
A typical example might involve querying a web API to maintain an up-to-date
value of a particular stock price. Fetching data from an external web page takes a few
hundred milliseconds, typically. If we perform this operation on the main thread, that
thread will sit blocked, doing no useful work, while it waits for the web server to reply.
The solution is to wrap the code for fetching the web page in a future. A future is
a one-element container containing the future result of a computation. When you
create a future, the computation in it gets off-loaded to a different thread in order to
avoid blocking the main thread. When the computation finishes, the result is written
to the future and thus made accessible to the main thread.
As an example, we will write a program that queries the "Markit on demand"
API to fetch the price of a given stock. For instance, the URL for the current price
of a Google share is
http://dev.markitondemand.com/MODApis/Api/v2/Quote?symbol=GOOG.
Go ahead and paste this in the address box of your web
browser. You will see an XML string appear with, among other things, the current
stock price. Let's fetch this programmatically without resorting to a future first:
scala> import scala.io._
import scala.io._
scala> val url = "http://dev.markitondemand.com/MODApis/Api/v2/Quote?symbol=GOOG"
url: String = http://dev.markitondemand.com/MODApis/Api/v2/Quote?symbol=GOOG
scala> val response = Source.fromURL(url).mkString
response: String = <StockQuote><Status>SUCCESS</Status>
...

Notice how it takes a little bit of time to query the API. Let's now do the same, but
using a future (don't worry about the imports for now, we will discuss what they
mean in more detail further on):
scala> import scala.concurrent._
import scala.concurrent._
scala> import scala.util._
import scala.util._
scala> import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.ExecutionContext.Implicits.global
scala> val response = Future { Source.fromURL(url).mkString }
response: Future[String] = Promise$DefaultPromise@3301801b

If you run this, you will notice that control returns to the shell instantly before
the API has had a chance to respond. To make this evident, let's simulate a slow
connection by adding a call to Thread.sleep:
scala> val response = Future {
Thread.sleep(10000) // sleep for 10s
Source.fromURL(url).mkString
}
response: Future[String] = Promise$DefaultPromise@231f98ef

When you run this, you do not have to wait for ten seconds for the next prompt to
appear: you regain control of the shell straightaway. The bit of code in the future is
executed asynchronously: its execution is independent of the main program flow.
How do we retrieve the result of the computation? We note that response has type
Future[String]. We can check whether the computation wrapped in the future has
finished by querying the future's isCompleted attribute:
scala> response.isCompleted
Boolean = true

The future exposes a value attribute that contains the computation result:
scala> response.value
Option[Try[String]] = Some(Success(<StockQuote><Status>SUCCESS</Status>
...

The value attribute of a future has type Option[Try[T]]. We have already seen
how to use the Try type to handle exceptions gracefully in the context of parallel
collections. It is used in the same way here. A future's value attribute is None until
the future is complete, then it is set to Some(Success(value)) if the future ran
successfully, or Some(Failure(error)) if an exception was thrown.
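The three states of the value attribute can be made concrete with a small pattern match. The sketch below is illustrative only: the names FutureValueSketch, describe, and demo are hypothetical, and a Promise (not covered in the text above) is used purely as a device to control when the future completes, so that the pending state can be observed reliably.

```scala
import scala.concurrent.{Future, Promise}
import scala.util.{Failure, Success, Try}

object FutureValueSketch {

  /** Describe the three states that a future's value can report. */
  def describe[T](value: Option[Try[T]]): String = value match {
    case None             => "pending"
    case Some(Success(v)) => s"succeeded with $v"
    case Some(Failure(e)) => s"failed with ${e.getMessage}"
  }

  def demo(): (String, String, String) = {
    // A promise lets us decide when its future completes
    val promise = Promise[String]()
    val response = promise.future
    val before = describe(response.value)   // not completed yet
    promise.success("SUCCESS")
    val after = describe(response.value)    // completed successfully

    // Future.failed builds a future that is already completed with an error
    val failed = Future.failed[String](new RuntimeException("no protocol"))
    (before, after, describe(failed.value))
  }
}
```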
Repeatedly calling response.value until the future completes works well in the shell,
but it does not generalize to more complex programs. Instead, we want to tell the
computer to do something once the future is complete: we want to bind a callback
function to the future. We can do this by passing the callback to the future's
onComplete method. Let's tell the future to print the API response when it completes:
scala> response.onComplete {
  case Success(s) => println(s)
  case Failure(e) => println(s"Error fetching page: $e")
}
scala>
// Wait for response to complete, then prints:
<StockQuote><Status>SUCCESS</Status><Name>Alphabet Inc</Name><Symbol>GOOGL</Symbol><LastPrice>695.22</LastPrice>...

scala> import scala.xml.XML
import scala.xml.XML

We will use the same URL as in the previous section:
http://dev.markitondemand.com/MODApis/Api/v2/Quote?symbol=GOOG

It is sometimes useful to think of a future as a collection that either contains one
element if a calculation has been successful, or zero elements if it has failed. For
instance, if the web API has been queried successfully, our future contains a string
representation of the response. Like other container types in Scala, futures support a
map method that applies a function to the element contained in the future, returning
a new future, and does nothing if the calculation in the future failed. But what does
this mean in the context of a computation that might not be finished yet? The map
method gets applied as soon as the future is complete, like the onComplete method.
We can use the future's map method to apply a transformation to the result of the
future asynchronously. Let's poll the "Markit on demand" API again. This time,
instead of printing the result, we will parse it as XML.
scala> val strResponse = Future {
Thread.sleep(20000) // Sleep for 20s
val res = Source.fromURL(url).mkString
println("finished fetching url")
res
}
strResponse: Future[String] = Promise$DefaultPromise@1dda9bc8
scala> val xmlResponse = strResponse.map { s =>
println("applying string to xml transformation")
XML.loadString(s)
}
xmlResponse: Future[xml.Elem] = Promise$DefaultPromise@25d1262a
// wait while the remainder of the 20s elapses
finished fetching url
applying string to xml transformation
scala> xmlResponse.value
Option[Try[xml.Elem]] = Some(Success(<StockQuote><Status>SUCCESS</Status>...

By registering subsequent maps on futures, we are providing a road map to the
executor running the future for what to do.
If any of the steps fail, the failed Try instance containing the exception gets
propagated instead:
scala> val strResponse = Future {
Source.fromURL("empty").mkString
}
scala> val xmlResponse = strResponse.map {
s => XML.loadString(s)
}
scala> xmlResponse.value
Option[Try[xml.Elem]] = Some(Failure(MalformedURLException: no protocol:
empty))

This behavior makes sense if you think of a failed future as an empty container.
When applying a map to an empty list, it returns the same empty list. Similarly,
when applying a map to an empty (failed) future, the empty future is returned.
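
The same behavior can be reproduced without a network connection. In the sketch below, parsing a string to an integer stands in for the fetch step; the names FutureMapSketch, parseAndDouble, and outcome are hypothetical, and Await.ready is used only to wait for the result so that it can be inspected.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Try

object FutureMapSketch {

  /** Parse text to an Int asynchronously, then double it with map.
    * If parsing throws, the map step is skipped and the failure propagates.
    */
  def parseAndDouble(text: String): Future[Int] =
    Future { text.trim.toInt }.map { n => n * 2 }

  /** Block until the future finishes and return its outcome as a Try. */
  def outcome[T](f: Future[T]): Try[T] =
    Await.ready(f, 5.seconds).value.get
}
```

Calling outcome(parseAndDouble(" 21 ")) yields Success(42), while outcome(parseAndDouble("empty")) yields a Failure wrapping the NumberFormatException thrown by the first step.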

Blocking until completion
The code for fetching stock prices works fine in the shell. However, if you paste it
in a standalone program, you will notice that nothing gets printed and the program
finishes straightaway. Let's look at a trivial example of this:
// BlockDemo.scala
import scala.concurrent._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
object BlockDemo extends App {
val f = Future { Thread.sleep(10000) }
f.onComplete { _ => println("future completed") }
// "future completed" is not printed
}

The program stops running as soon as the main thread has completed its tasks,
which, in this example, just involves creating the future. The threads of the default
execution context are daemon threads, so they do not keep the JVM alive. In particular,
the line "future completed" is never printed. If we want the main thread to wait for a
future to execute, we must explicitly tell it to block execution until the future has
finished running. This is done using the Await.ready or Await.result methods.
Both these methods block the execution of the main thread until the future
completes. We could make the above program work as intended by adding this line:
Await.ready(f, 1.minute)

The Await methods take the future as their first argument and a Duration object
as the second. If the future takes longer to complete than the specified duration, a
TimeoutException is thrown. Pass Duration.Inf to set an infinite timeout.
The difference between Await.ready and Await.result is that the latter returns
the value inside the future. In particular, if the future resulted in an exception, that
exception will get thrown. In contrast, Await.ready returns the future itself.
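The difference between the two methods is easiest to see on a future that fails. The following is a minimal sketch with hypothetical names (AwaitSketch, demo); the five-second timeouts are arbitrary.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Try

object AwaitSketch {

  def demo(): (Boolean, Int, Boolean) = {
    val ok = Future { 21 * 2 }
    val bad: Future[Int] = Future { throw new IllegalStateException("boom") }

    // Await.ready returns the completed future itself, even if it failed,
    // so the failure can be inspected through its value
    val readyFailed = Await.ready(bad, 5.seconds).value.get.isFailure

    // Await.result unwraps the value of a successful future ...
    val result = Await.result(ok, 5.seconds)

    // ... and rethrows the exception of a failed one (captured here with Try)
    val rethrown = Try(Await.result(bad, 5.seconds)).isFailure

    (readyFailed, result, rethrown)
  }
}
```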
In general, one should try to avoid blocking as much as possible: the whole point
of futures is to run code in background threads in order to keep the main thread of
execution responsive. However, a common, legitimate use case for blocking is at
the end of a program. If we are running a large-scale integration process, we might
dispatch several futures to query web APIs, read from text files, or insert data into
a database. Embedding the code in futures is more scalable than performing these
operations sequentially. However, as the majority of the intensive work is running in
background threads, we are left with many outstanding futures when the main thread
completes. It makes sense, at this stage, to block until all the futures have completed.
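
One way to block until several futures have completed is to combine them into a single future and await that. The sketch below uses Future.sequence from the standard library, which is not introduced in the text above; the names WaitForAllSketch and squareAll are hypothetical, and squaring a number stands in for real work such as an API query.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object WaitForAllSketch {

  /** Dispatch one future per input, then block until all have finished. */
  def squareAll(inputs: Seq[Int]): Seq[Int] = {
    val futures: Seq[Future[Int]] = inputs.map { n => Future { n * n } }
    // Turn a sequence of futures into one future of a sequence;
    // results keep the order of the inputs
    val all: Future[Seq[Int]] = Future.sequence(futures)
    Await.result(all, 10.seconds)
  }
}
```

If any of the futures fails, the combined future fails too, and Await.result rethrows the exception.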

Controlling parallel execution with execution contexts
Now that we know how to define futures, let's look at controlling how they run. In
particular, you might want to control the number of threads to use when running a
large number of futures.
When a future is defined, it is passed an execution context, either directly or implicitly.
An execution context is an object that exposes an execute method that takes a block
of code and runs it, possibly asynchronously. By changing the execution context, we
can change the "backend" that runs the futures. We have already seen how to use
execution contexts to control the execution of parallel collections.
So far, we have just been using the default execution context by importing
scala.concurrent.ExecutionContext.Implicits.global. This is a fork/join thread
pool with as many threads as there are underlying CPUs.
Let's now define a new execution context that uses sixteen threads:
scala> import java.util.concurrent.Executors
import java.util.concurrent.Executors
scala> val ec = ExecutionContext.fromExecutorService(
Executors.newFixedThreadPool(16)
)
ec: ExecutionContextExecutorService = ExecutionContextImpl$$anon$1@1351ce60

Having defined the execution context, we can pass it explicitly to futures as they
are defined:
scala> val f = Future { Thread.sleep(1000) } (ec)
f: Future[Unit] = Promise$DefaultPromise@458b456

Alternatively, we can define the execution context implicitly:
scala> implicit val context = ec
context: ExecutionContextExecutorService = ExecutionContextImpl$$anon$1@1351ce60

It is then passed as an implicit parameter to all new futures as they are constructed:
scala> val f = Future { Thread.sleep(1000) }
f: Future[Unit] = Promise$DefaultPromise@3c4b7755

You can shut the execution context down to destroy the thread pool:
scala> ec.shutdown()

When an execution context receives a shutdown command, it will finish executing its
current tasks but will refuse any new tasks.
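
Put together as a standalone program, the steps above might look like the following sketch. The names FixedPoolSketch and runOnPool are hypothetical, and Future.sequence is used only to wait for all the tasks. Each task records the name of the thread it ran on, which shows that the futures never use more threads than the pool provides.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import scala.concurrent.duration._

object FixedPoolSketch {

  /** Run nTasks futures on a pool of nThreads threads. Returns the number
    * of distinct threads used and whether the pool has been shut down.
    */
  def runOnPool(nThreads: Int, nTasks: Int): (Int, Boolean) = {
    // Implicit, so that every future below picks up this context
    implicit val ec: ExecutionContextExecutorService =
      ExecutionContext.fromExecutorService(
        Executors.newFixedThreadPool(nThreads))

    val distinctThreads =
      try {
        val futures = (1 to nTasks).map { _ =>
          Future { Thread.currentThread.getName }
        }
        Await.result(Future.sequence(futures), 10.seconds).distinct.size
      } finally {
        ec.shutdown() // always release the pool's threads
      }
    (distinctThreads, ec.isShutdown)
  }
}
```

Shutting the pool down in a finally block matters in a standalone program: the threads created by Executors.newFixedThreadPool are not daemon threads, so a pool that is never shut down keeps the JVM running.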

Futures example – stock price fetcher
Let's bring some of the concepts that we covered in this section together to build a
command-line application that prompts the user for the name of a stock and fetches
the value of that stock. The catch is that, to keep the UI responsive, we will fetch the
stock using a future:
// StockPriceDemo.scala
import scala.concurrent._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.io._
import scala.xml.XML
import scala.util._

object StockPriceDemo extends App {

  /* Construct URL for a stock symbol */
  def urlFor(stockSymbol:String) =
    ("http://dev.markitondemand.com/MODApis/Api/v2/Quote?" +
      s"symbol=${stockSymbol}")

  /* Build a future that fetches the stock price */
  def fetchStockPrice(stockSymbol:String):Future[BigDecimal] = {
    val url = urlFor(stockSymbol)
    val strResponse = Future { Source.fromURL(url).mkString }
    val xmlResponse = strResponse.map { s => XML.loadString(s) }
    val price = xmlResponse.map {
      r => BigDecimal((r \ "LastPrice").text)
    }
    price
  }

  /* Command line interface */
  println("Enter symbol at prompt.")
  while (true) {
    // Wait for user input (StdIn.readLine replaces the
    // deprecated Predef.readLine)
    val symbol = StdIn.readLine("> ")
    // When user puts in symbol, fetch data in background
    // thread and print to screen when complete
    fetchStockPrice(symbol).onComplete { res =>
      println()
      res match {
        case Success(price) => println(s"$symbol: USD $price")
        case Failure(e) => println(s"Error fetching $symbol: $e")
      }
      print("> ") // Simulate the appearance of a new prompt
    }
  }
}

Try running the program and entering the code for some stocks:
[info] Running StockPriceDemo
Enter symbol at prompt.
> GOOG
> MSFT
>
GOOG: USD 695.22
>
MSFT: USD 47.48
> AAPL
>
AAPL: USD 111.01

Let's summarize how the code works. When you enter a stock, the main thread
constructs a future that fetches the stock information from the API, converts it to
XML, and extracts the price. We use (r \ "LastPrice").text to extract the text
inside the LastPrice tag from the XML node r. We then convert the value to a big
decimal. When the transformations are complete, the result is printed to screen by
binding a callback through onComplete. Exceptions are handled naturally by our use
of map for the transformations: a failure at any stage propagates to the final future
and is reported by the Failure case of the callback.
By wrapping the code for fetching a stock price in a future, we free up the main
thread to just respond to the user. This means that the user interface does not get
blocked if we have, for instance, a slow internet connection.
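The shape of fetchStockPrice can be exercised offline by passing the download step in as a function. The sketch below is an illustration under stated assumptions, not the program above: the names PricePipelineSketch, extractPrice, and fetchPrice are hypothetical, and a regular expression stands in for scala.xml so that the example needs only the standard library.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object PricePipelineSketch {

  // Matches the text inside a LastPrice element (a stand-in for XML parsing)
  private val LastPrice = "<LastPrice>([^<]+)</LastPrice>".r

  /** Extract the price from a response body; throws if the tag is missing. */
  def extractPrice(body: String): BigDecimal =
    LastPrice.findFirstMatchIn(body) match {
      case Some(m) => BigDecimal(m.group(1))
      case None =>
        throw new NoSuchElementException("no LastPrice element in response")
    }

  /** Same structure as fetchStockPrice, with the download injected as
    * `fetch`. An exception in either step yields a failed future.
    */
  def fetchPrice(symbol: String)(fetch: String => String): Future[BigDecimal] =
    Future { fetch(symbol) }.map(extractPrice)
}
```

Passing a stub such as `_ => "<StockQuote><LastPrice>695.22</LastPrice></StockQuote>"` for fetch produces a future that completes with 695.22, whereas a body without a LastPrice element produces a failed future, which the onComplete callback would report through its Failure case.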
This example is somewhat artificial, but you could easily wrap much more
complicated logic: stock prices could be written to a database and we could add
additional commands to plot the stock price over time, for instance.
We have only scratched the surface of what futures can offer in this section. We will
revisit futures in more detail when we look at polling web APIs in Chapter 7, Web
APIs and Chapter 9, Concurrency with Akka.
Futures are a key part of the data scientist's toolkit for building scalable systems.
Moving expensive computation (either in terms of CPU time or wall time) to
background threads improves scalability greatly. For this reason, futures are an
important part of many Scala libraries such as Akka and the Play framework.

Summary
By providing high-level concurrency abstractions, Scala makes writing parallel code
intuitive and straightforward. Parallel collections and futures form an invaluable
part of a data scientist's toolbox, allowing them to parallelize their code with minimal
effort. However, while these high-level abstractions obviate the need to deal directly
with threads, an understanding of the internals of Scala's concurrency model is
necessary to avoid race conditions.
In the next chapter, we will put concurrency on hold and study how to interact with
SQL databases. However, this is only temporary: futures will play an important role
in many of the remaining chapters in this book.

References
Aleksandar Prokopec, Learning Concurrent Programming in Scala. This is a detailed
introduction to the basics of concurrent programming in Scala. In particular, it
explores parallel collections and futures in much greater detail than this chapter.
Daniel Westheide's blog gives an excellent introduction to many Scala concepts,
in particular:
• Futures: http://danielwestheide.com/blog/2013/01/09/the-neophytes-guide-to-scala-part-8-welcome-to-the-future.html
• The Try type: http://danielwestheide.com/blog/2012/12/26/the-neophytes-guide-to-scala-part-6-error-handling-with-try.html

For a discussion of cross-validation, see The Elements of Statistical Learning by Hastie,
Tibshirani, and Friedman.

Scala and SQL through JDBC
One of data science's raisons d'être is the difficulty of manipulating large datasets.
Much of the data of interest to a company or research group cannot fit conveniently
in a single computer's RAM. Storing the data in a way that is easy to query is
therefore a complex problem.
Relational databases have been successful at solving the data storage problem.
The relational model was originally proposed in 1970
(http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf), and the overwhelming
majority of databases in active use today are still relational. In that time, the
price of RAM per megabyte has decreased by a factor of a
hundred million. Similarly, hard drive capacity has increased from tens or hundreds
of megabytes to terabytes. It is remarkable that, despite this exponential growth in
data storage capacity, the relational model has remained dominant.
Virtually all relational databases are described and queried with variants of SQL
(Structured Query Language). With the advent of distributed computing, the
position of SQL databases as the de facto data storage standard is being challenged
by other types of databases, commonly grouped under the umbrella term NoSQL.
Many NoSQL databases are more partition-tolerant than SQL databases: they can
be split into several parts residing on different computers. While this author expects
that NoSQL databases will become increasingly popular, SQL databases are likely to
remain prevalent as a data persistence mechanism; hence, a significant portion of this
book is devoted to interacting with SQL from Scala.

While SQL is standardized, most implementations do not follow the full standard.
Additionally, most implementations provide extensions to the standard. This
means that, while many of the concepts in this book will apply to all SQL backends,
the exact syntax will need to be adjusted. We will consider only the MySQL
implementation here.
In this chapter, you will learn how to interact with SQL databases from Scala using
JDBC, a bare bones Java API. In the next chapter, we will consider Slick, an Object
Relational Mapper (ORM) that gives a more Scala-esque feel to interacting with SQL.
This chapter is roughly composed of two sections: we will first discuss the basic
functionality for connecting and interacting with SQL databases, and then discuss
useful functional patterns that can be used to create an elegant, loosely coupled, and
coherent data access layer.
This chapter assumes that you have a basic working knowledge of SQL. If you do
not, you would be better off first reading one of the reference books mentioned at the
end of the chapter.

Interacting with JDBC
JDBC is an API for connecting to SQL databases in Java. It remains the simplest way
of connecting to SQL databases from Scala. Furthermore, the majority of higher-level
abstractions for interacting with databases still use JDBC as a backend.
JDBC is not a library in itself. Rather, it exposes a set of interfaces to interact with
databases. Relational database vendors then provide specific implementations of
these interfaces.
Let's start by creating a build.sbt file. We will declare a dependency on the MySQL
JDBC connector:
scalaVersion := "2.11.7"
libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.36"

First steps with JDBC
Let's start by connecting to JDBC from the command line. To follow along with the
examples, you will need access to a running MySQL server. Once you have added the
MySQL connector to the list of dependencies, open a Scala console by typing
the following command:
$ sbt console

Let's import JDBC:
scala> import java.sql._
import java.sql._

We then need to tell JDBC to use a specific connector. This is normally done using
reflection, loading the driver at runtime:
scala> Class.forName("com.mysql.jdbc.Driver")
Class[_] = class com.mysql.jdbc.Driver

This loads the appropriate driver into the namespace at runtime. If this seems
somewhat magical to you, it's probably not worth worrying about exactly how this
works. This is the only example of reflection that we will consider in this book, and it
is not particularly idiomatic Scala.

Connecting to a database server
Having specified the SQL connector, we can now connect to a database. Let's assume
that we have a database called test on host 127.0.0.1, listening on port 3306. We
create a connection as follows:
scala> val connection = DriverManager.getConnection(
"jdbc:mysql://127.0.0.1:3306/test",
"root", // username when connecting
"" // password
)
java.sql.Connection = com.mysql.jdbc.JDBC4Connection@12e78a69

The first argument to getConnection is a URL-like string of the form
jdbc:mysql://host[:port]/database. The second and third arguments are the username
and password. Pass in an empty string if you can connect without a password.
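
When the host, port, and database name come from configuration, the connection string can be assembled by a small helper. The sketch below is only an illustration of the format described above; JdbcUrlSketch and mysqlUrl are hypothetical names, not part of JDBC or of the MySQL connector.

```scala
object JdbcUrlSketch {

  /** Build a string of the form jdbc:mysql://host[:port]/database.
    * The port segment is omitted when no port is given.
    */
  def mysqlUrl(host: String, database: String, port: Option[Int] = None): String = {
    val portPart = port.map(p => s":$p").getOrElse("")
    s"jdbc:mysql://$host$portPart/$database"
  }
}
```

With the values used in this section, JdbcUrlSketch.mysqlUrl("127.0.0.1", "test", Some(3306)) gives the string passed to getConnection above.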

Creating tables
Now that we have a database connection, let's interact with the server. For these
examples, you will find it useful to have a MySQL shell open (or a MySQL GUI such
as MySQLWorkbench) as well as the Scala console. You can open a MySQL shell by
typing the following command in a terminal:
$ mysql

As an example, we will create a small table to keep track of famous physicists. In a
mysql shell, we would run the following command:
mysql> USE test;
mysql> CREATE TABLE physicists (
id INT(11) AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(32) NOT NULL
);

To achieve the same with Scala, we send a JDBC statement to the connection:
scala> val statementString = """
CREATE TABLE physicists (
id INT(11) AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(32) NOT NULL
)
"""
scala> val statement = connection.prepareStatement(statementString)
PreparedStatement = JDBC4PreparedStatement@c983201: CREATE TABLE ...
scala> statement.executeUpdate()
results: Int = 0

Let's ignore the return value of executeUpdate for now.

Inserting data
Now that we have created a table, let's insert some data into it. We can do this with a
SQL INSERT statement:
scala> val statement = connection.prepareStatement("""
INSERT INTO physicists (name) VALUES ('Isaac Newton')
""")
scala> statement.executeUpdate()
Int = 1

In this case, executeUpdate returns 1. When inserting rows, it returns the number
of rows that were inserted. Similarly, if we had used a SQL UPDATE statement, this
would return the number of rows that were updated. For statements that do not
manipulate rows directly (such as the CREATE TABLE statement in the previous
section), executeUpdate just returns 0.
Let's jump into a mysql shell to verify that the insertion was performed correctly:
mysql> select * from physicists ;
+----+--------------+
| id | name         |
+----+--------------+
|  1 | Isaac Newton |
+----+--------------+
1 row in set (0.00 sec)

Let's quickly summarize what we have seen so far: to execute SQL statements that do
not return results, use the following:
val statement = connection.prepareStatement("SQL statement string")
statement.executeUpdate()

In the context of data science, we frequently need to insert or update many rows at a
time. For instance, we might have a list of physicists:
scala> val physicistNames = List(
  "Marie Curie", "Albert Einstein", "Paul Dirac")

We want to insert all of these into the database. While we could create a statement
for each physicist and send it to the database, this is quite inefficient. A better
solution is to create a batch of statements and send them to the database together.
We start by creating a statement template:
scala> val statement = connection.prepareStatement("""
INSERT INTO physicists (name) VALUES (?)
""")
PreparedStatement = JDBC4PreparedStatement@621a8225: INSERT INTO
physicists (name) VALUES (** NOT SPECIFIED **)

This is identical to the previous prepareStatement calls, except that we replaced
the physicist's name with a ? placeholder. We can set the placeholder value with
the statement.setString method:
scala> statement.setString(1, "Richard Feynman")

This replaces the first placeholder in the statement with the string Richard Feynman:
scala> statement
com.mysql.jdbc.JDBC4PreparedStatement@5fdd16c3:
INSERT INTO physicists (name) VALUES ('Richard Feynman')

Note that JDBC, somewhat counter-intuitively, counts the placeholder positions from
1 rather than 0.
We have now created the first statement in the batch of updates. Run the following
command:
scala> statement.addBatch()

By running the preceding command, we initiate a batch insert: the statement is
added to a temporary buffer that will be executed when we run the executeBatch
method. Let's add all the physicists in our list:
scala> physicistNames.foreach { name =>
statement.setString(1, name)
statement.addBatch()
}

We can now execute all the statements in the batch:
scala> statement.executeBatch
Array[Int] = Array(1, 1, 1, 1)

The return value of executeBatch is an array of the number of rows altered or
inserted by each item in the batch.
Note that we used statement.setString to fill in the template with a particular
name. The PreparedStatement object has setXXX methods for all basic types. To get
a complete list, read the PreparedStatement API documentation
(http://docs.oracle.com/javase/7/docs/api/java/sql/PreparedStatement.html).
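
The batch-insert steps above can be collected into a small helper. The following is a minimal sketch rather than code from this chapter: the BatchInsert object and the insertNames method are hypothetical names, and the statement passed in is assumed to have exactly one ? placeholder. Only setString, addBatch, and executeBatch are JDBC calls.

```scala
import java.sql.PreparedStatement

// Hypothetical helper (not from the text): fills the single `?` placeholder
// once per name, queues each filled-in statement, then sends the batch.
object BatchInsert {
  def insertNames(statement: PreparedStatement, names: Seq[String]): Array[Int] = {
    names.foreach { name =>
      statement.setString(1, name) // placeholder positions count from 1
      statement.addBatch()         // queue this set of parameter values
    }
    statement.executeBatch()       // one update count per queued item
  }
}
```

With the template statement from this section, a call such as BatchInsert.insertNames(statement, physicistNames) would queue one insert per name and send them together.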

Reading data
Now that we know how to insert data into a database, let's look at the converse:
reading data. We use SQL SELECT statements to query the database. Let's do this in
the MySQL shell first:
mysql> SELECT * FROM physicists;
+----+-----------------+
| id | name            |
+----+-----------------+
|  1 | Isaac Newton    |
|  2 | Richard Feynman |
|  3 | Marie Curie     |
|  4 | Albert Einstein |
|  5 | Paul Dirac      |
+----+-----------------+
5 rows in set (0.01 sec)

To extract this information in Scala, we define a PreparedStatement:
scala> val statement = connection.prepareStatement("""
SELECT name FROM physicists
""")
PreparedStatement = JDBC4PreparedStatement@3c577c9d:
SELECT name FROM physicists

We execute this statement by running the following command:
scala> val results = statement.executeQuery()
results: java.sql.ResultSet = com.mysql.jdbc.JDBC4ResultSet@74a2e158

This returns a JDBC ResultSet instance. The ResultSet is an abstraction
representing a set of rows from the database. Note that we used
statement.executeQuery rather than statement.executeUpdate. In general, one should
execute statements that return data (in the form of ResultSet) with executeQuery.
Statements that modify the database without returning data (insert, create, alter, or
update statements, among others) are executed with executeUpdate.
The ResultSet object behaves somewhat like an iterator. It exposes a next method
that advances itself to the next record, returning true if there are records left in
ResultSet:
scala> results.next // Advance to the first record
Boolean = true

When the ResultSet instance points to a record, we can extract fields in this record
by passing in the field name:
scala> results.getString("name")
String = Isaac Newton

We can also extract fields using positional arguments. The fields are indexed
from one:
scala> results.getString(1) // first positional argument
String = Isaac Newton

When we are done with a particular record, we call the next method to advance the
ResultSet to the next record:
scala> results.next // advances the ResultSet by one record
Boolean = true
scala> results.getString("name")
String = Richard Feynman

[Figure: successive calls to next() return true while getString("name") yields "Isaac Newton", "Richard Feynman", and "Marie Curie" in turn; the final call to next() returns false.]
A ResultSet object supports the getXXX(fieldName) methods to access the fields of a record and a next
method to advance to the next record in the result set.

One can iterate over a result set using a while loop:
scala> while(results.next) { println(results.getString("name")) }
Marie Curie
Albert Einstein
Paul Dirac

A word of warning applies to reading fields that are nullable. While
one might expect JDBC to return null when faced with a null SQL
field, the return value depends on the getXXX method used. For
instance, getInt and getLong will return 0 for any field that is
null. Similarly, getDouble and getFloat return 0.0. This can
lead to some subtle bugs in code. In general, one should be careful
with getters that return Java value types (int, long) rather than
objects. To find out if a value is null in the database, query it first
with getInt (or getLong or getDouble, as appropriate), then use
the wasNull method, which returns true if the last value read
was null:
scala> rs.getInt("field")
0
scala> rs.wasNull // was the last item read null?
true

This (surprising) behavior makes reading from ResultSet
instances error-prone. One of the goals of the second part of this
chapter is to give you the tools to build an abstraction layer on top
of the ResultSet interface to avoid having to call methods such as
getInt directly.
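
One way to keep the getInt and wasNull pair together is to wrap both calls in a function that returns an Option. This is a minimal sketch and not the abstraction built later in the chapter: the NullSafe object and the getIntOption name are hypothetical, while getInt and wasNull are the JDBC methods described in the note above.

```scala
import java.sql.ResultSet

// Hypothetical helper (not from the text): reads an integer column as an
// Option so that SQL NULL cannot be mistaken for a genuine 0.
object NullSafe {
  def getIntOption(rs: ResultSet, field: String): Option[Int] = {
    val value = rs.getInt(field)          // yields 0 when the column is NULL
    if (rs.wasNull) None else Some(value) // wasNull refers to the last read
  }
}
```

The important detail is the ordering: wasNull only reports on the most recent getter call, so it must be checked immediately after getInt and before any other column is read.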

Reading values directly from ResultSet objects feels quite unnatural in Scala. We
will look, further on in this chapter, at constructing a layer through which you can
access the result set using type classes.
We now know how to read and write to a database. Having finished with the
database for now, we close the result sets, prepared statements, and connections:
scala> results.close
scala> statement.close
scala> connection.close

While closing statements and connections is not important in the Scala shell (they
will get closed when you exit), it is important when you run programs; otherwise,
the objects will persist, leading to "out of memory exceptions". In the next sections,
we will look at establishing connections and statements with the loan pattern, a
design pattern that closes a resource automatically when we finish using it.

JDBC summary
We now have an overview of JDBC. The rest of this chapter will concentrate on
writing abstractions that sit above JDBC, making database accesses feel more natural.
Before we do this, let's summarize what we have seen so far.
We have used three JDBC classes:
• The Connection class represents a connection to a specific SQL database.
Instantiate a connection as follows:
import java.sql._
Class.forName("com.mysql.jdbc.Driver")
val connection = DriverManager.getConnection(
"jdbc:mysql://127.0.0.1:3306/test",
"root", // username when connecting
"" // password
)

Our main use of Connection instances has been to generate
PreparedStatement objects:
connection.prepareStatement("SELECT * FROM physicists")

• A PreparedStatement instance represents a SQL statement about to be sent
to the database. It also represents the template for a SQL statement with
placeholders for values yet to be filled in. The class exposes the following
methods:

statement.executeUpdate
    This sends the statement to the database. Use this for SQL statements
    that modify the database and do not return any data, such as INSERT,
    UPDATE, DELETE, and CREATE statements.

val results = statement.executeQuery
    This sends the statement to the database. Use this for SQL statements
    that return data (predominantly, the SELECT statements). This returns a
    ResultSet instance.

statement.addBatch
statement.executeBatch
    The addBatch method adds the current statement to a batch of statements,
    and executeBatch sends the batch of statements to the database.

statement.setString(1, "Scala")
statement.setInt(1, 42)
statement.setBoolean(1, true)
    Fill in the placeholder values in the PreparedStatement. The first
    argument is the position in the statement (counting from 1). The second
    argument is the value.
    One common use case for these is in a batch update or insert: we might
    have a Scala list of objects that we want to insert into the database.
    We fill in the placeholders for each object in the list using the
    .setXXX methods, then add this statement to the batch using .addBatch.
    We can then send the entire batch to the database using .executeBatch.

statement.setNull(1, java.sql.Types.BOOLEAN)
    This sets a particular item in the statement to NULL. The second
    argument specifies the NULL type. If we are setting a cell in a Boolean
    column, for instance, this should be Types.BOOLEAN. A full list of types
    is given in the API documentation for the java.sql.Types class
    (http://docs.oracle.com/javase/7/docs/api/java/sql/Types.html).

• A ResultSet instance represents a set of rows returned by a SELECT or SHOW
statement. ResultSet exposes methods to access fields in the current row:

rs.getString(i)
rs.getInt(i)
    These methods get the value of the ith field in the current row; i is
    measured from 1.

rs.getString("name")
rs.getInt("age")
    These methods get the value of a specific field, which is indexed by
    the column name.

rs.wasNull
    This returns whether the last column read was NULL. This is
    particularly important when reading Java value types, such as getInt,
    getBoolean, or getDouble, as these return a default value when reading
    a NULL value.

The ResultSet instance exposes the .next method to move to the next row; .next
returns true until the ResultSet has advanced to just beyond the last row.

Functional wrappers for JDBC
We now have a basic overview of the tools afforded by JDBC. All the objects that we
have interacted with so far feel somewhat clunky and out of place in Scala. They do
not encourage a functional style of programming.

Of course, elegance is not necessarily a goal in itself (or, at least, you will probably
struggle to convince your CEO that he should delay the launch of a product because
the code lacks elegance). However, it is usually a symptom: either the code is
not extensible or too tightly coupled, or it is easy to introduce bugs. The latter is
particularly the case for JDBC. Forgot to check wasNull? That will come back to
bite you. Forgot to close your connections? You'll get an "out of memory exception"
(hopefully not in production).
In the next sections, we will look at patterns that we can use to wrap JDBC types in
order to mitigate many of these risks. The patterns that we introduce here are used
very commonly in Scala libraries and applications. Thus, besides writing robust
classes to interact with JDBC, learning about these patterns will, I hope, give you
greater understanding of Scala programming.

Safer JDBC connections with the loan pattern
We have already seen how to connect to a JDBC database and send statements to the
database for execution. This technique, however, is somewhat error prone: you have
to remember to close connections and statements; otherwise, you will quickly run out
of memory. In a more traditional imperative style, we would write the following
try-finally block around every connection:
// WARNING: poor Scala code
val connection = DriverManager.getConnection(url, user, password)
try {
  // do something with connection
}
finally {
  connection.close()
}

Scala, with first-class functions, provides us with an alternative: the loan pattern. We
write a function that is responsible for opening the connection, loaning it to the client
code to do something interesting with it, and then closing it when the client code is
done. Thus, the client code is not responsible for closing the connection any more.

Let's create a new SqlUtils object with a usingConnection method that leverages
the loan pattern:
// SqlUtils.scala
import java.sql._

object SqlUtils {

  /** Create an auto-closing connection using
    * the loan pattern */
  def usingConnection[T](
    db:String,
    host:String="127.0.0.1",
    user:String="root",
    password:String="",
    port:Int=3306
  )(f:Connection => T):T = {
    // Create the connection
    val url = s"jdbc:mysql://$host:$port/$db"
    Class.forName("com.mysql.jdbc.Driver")
    val connection = DriverManager.getConnection(
      url, user, password)
    // give the connection to the client, through the callable
    // `f` passed in as argument
    try {
      f(connection)
    }
    finally {
      // When client is done, close the connection
      connection.close()
    }
  }
}

Let's see this function in action:
scala> SqlUtils.usingConnection("test") {
connection => println(connection)
}
com.mysql.jdbc.JDBC4Connection@46fd3d66

Thus, the client doesn't have to remember to close the connection, and the resultant
code (for the client) feels much more like Scala.
How does our usingConnection function work? The function definition is
def usingConnection[T]( ... )(f : Connection => T ):T. It takes, as its
second set of arguments, a function that acts on a Connection object. The body of
usingConnection creates the connection, then passes it to f, and finally closes the
connection. This syntax is somewhat similar to code blocks in Ruby or the with
statement in Python.
Be careful when mixing the loan pattern with lazy operations. This
applies particularly to returning iterators, streams, and futures from
f. As soon as the thread of execution leaves f, the connection will be
closed. Any data structure that is not materialized at this point will
not be able to carry on accessing the connection.
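
The pitfall can be reproduced without a database. In the following sketch, FakeResource is a hypothetical stand-in for a connection (it is not a JDBC class) whose read method fails once the resource has been closed, and withResource plays the role of usingConnection.

```scala
// A stand-in for a connection (hypothetical, for illustration only):
// reading from it after close() throws.
class FakeResource {
  private var closed = false
  def read(i: Int): Int = {
    if (closed) throw new IllegalStateException("resource is closed")
    i * 10
  }
  def close(): Unit = { closed = true }
}

object LoanDemo {
  // Loan pattern: open the resource, lend it to f, always close it.
  def withResource[T](f: FakeResource => T): T = {
    val resource = new FakeResource
    try f(resource) finally resource.close()
  }

  // Safe: the list is fully built before the resource is closed.
  def eager(): List[Int] =
    withResource { r => List(1, 2, 3).map(i => r.read(i)) }

  // Unsafe: the iterator is lazy, so r.read only runs when the caller
  // consumes it, by which time the resource has been closed.
  def lazyIterator(): Iterator[Int] =
    withResource { r => Iterator(1, 2, 3).map(i => r.read(i)) }
}
```

Calling LoanDemo.eager() succeeds, whereas consuming the iterator returned by LoanDemo.lazyIterator() throws, because no element is read until after withResource has returned.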

The loan pattern is, of course, not exclusive to database connections. It is useful
whenever you have the following pattern, in pseudocode:
open resource (eg. database connection, file ...)
use resource somehow // loan resource to client for this part.
close resource
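
The pseudocode above translates almost line for line into a generic Scala function. This is a sketch under the assumption that the resource implements java.lang.AutoCloseable (which JDBC connections, statements, and result sets do); the Loan object and the using name are hypothetical and do not appear elsewhere in this chapter.

```scala
object Loan {
  // Generic loan pattern (a sketch): lend any AutoCloseable resource to f,
  // then close it whether f returns normally or throws.
  def using[R <: AutoCloseable, T](resource: R)(f: R => T): T =
    try f(resource) finally resource.close()
}
```

The second parameter list lets client code pass a block, for example Loan.using(openSomething()) { resource => ... }, where openSomething is whatever call creates the resource. Scala 2.13 and later also ship a comparable utility, scala.util.Using, in the standard library.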

Enriching JDBC statements with the "pimp my library" pattern
In the previous section, we saw how to create self-closing connections with the
loan pattern. This allows us to open connections to the database without having
to remember to close them. However, we still have to remember to close any
ResultSet and PreparedStatement that we open:
// WARNING: Poor Scala code
SqlUtils.usingConnection("test") { connection =>
  val statement = connection.prepareStatement(
    "SELECT * FROM physicists")
  val results = statement.executeQuery
  // do something useful with the results
  results.close
  statement.close
}

Having to open and close the statement is somewhat ugly and error prone. This is
another natural use case for the loan pattern. Ideally, we would like to write the
following:
usingConnection("test") { connection =>
  connection.withQuery("SELECT * FROM physicists") {
    resultSet => // process results
  }
}

How can we define a .withQuery method on the Connection class? We do not
control the Connection class definition as it is part of the JDBC API. We would
like to be able to somehow reopen the Connection class definition to add the
withQuery method.
Scala does not let us reopen classes to add new methods (a practice known as
monkey-patching). We can still, however, enrich existing libraries with implicit
conversions using the pimp my library pattern
(http://www.artima.com/weblogs/viewpost.jsp?thread=179766). We first define a
RichConnection class that contains the withQuery method. This RichConnection class
is created from an existing Connection instance.
// RichConnection.scala
import java.sql.{Connection, ResultSet}

class RichConnection(val underlying:Connection) {

  /** Execute a SQL query and process the ResultSet */
  def withQuery[T](query:String)(f:ResultSet => T):T = {
    val statement = underlying.prepareStatement(query)
    val results = statement.executeQuery
    try {
      f(results) // loan the ResultSet to the client
    }
    finally {
      // Ensure all the resources get freed.
      results.close
      statement.close
    }
  }
}

We could use this class by just wrapping every Connection instance in a
RichConnection instance:
// Warning: poor Scala code
SqlUtils.usingConnection("test") { connection =>
  val richConnection = new RichConnection(connection)
  richConnection.withQuery("SELECT * FROM physicists") {
    resultSet => // process resultSet
  }
}

This adds unnecessary boilerplate: we have to remember to convert every connection
instance to RichConnection to use withQuery. Fortunately, Scala provides an easier
way with implicit conversions: we tell Scala how to convert from Connection to
RichConnection and vice versa, and tell it to perform this conversion automatically
(implicitly), if necessary:
// Implicits.scala
import java.sql.Connection
import scala.language.implicitConversions

// Implicit conversion methods are often put in
// an object called Implicits.
object Implicits {
  // Explicit return types make the conversions easier to read
  implicit def pimpConnection(conn:Connection):RichConnection =
    new RichConnection(conn)
  implicit def depimpConnection(conn:RichConnection):Connection =
    conn.underlying
}

Now, whenever pimpConnection and depimpConnection are in the current
scope, Scala will automatically use them to convert from Connection instances to
RichConnection and back as needed.
We can now write the following (I have added type information for emphasis):
// Bring the conversion functions into the current scope
import Implicits._
SqlUtils.usingConnection("test") { (connection:Connection) =>
  connection.withQuery("SELECT * FROM physicists") {
    // Wow! It's like we have just added
    // .withQuery to the JDBC Connection class!
    resultSet => // process results
  }
}

This might look like magic, so let's step back and look at what happens when we
call withQuery on a Connection instance. The Scala compiler will first look to see if
the class definition of Connection defines a withQuery method. When it finds that
it does not, it will look for implicit methods that convert a Connection instance to
a class that defines withQuery. It will find that the pimpConnection method allows
conversion from Connection to RichConnection, which defines withQuery. The
Scala compiler automatically uses pimpConnection to transform the Connection
instance to RichConnection.
Note that we used the names pimpConnection and depimpConnection for the
conversion functions, but they could have been anything. We never call these
methods explicitly.
Let's summarize how to use the pimp my library pattern to add methods to an
existing class:
1. Write a class that wraps the class you want to enrich: class
RichConnection(val underlying:Connection). Add all the methods that
you wish the original class had.
2. Write a method to convert from your original class to your enriched class
as part of an object called (conventionally) Implicits. Make sure that you
tell Scala to use this conversion automatically with the implicit keyword:
implicit def pimpConnection(conn:Connection):RichConnection.
You can also tell Scala to automatically convert back from the enriched class
to the original class by adding the reverse conversion method.
3. Allow implicit conversions by importing the implicit conversion methods:
import Implicits._.
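
Since Scala 2.10, steps 1 and 2 can be combined by declaring the wrapper as an implicit class, which defines the enriched class and its conversion in one go. The sketch below applies this to java.util.Iterator rather than to Connection so that it runs without a database; the IteratorSyntax object and the toScalaList method are hypothetical names chosen for illustration.

```scala
object IteratorSyntax {
  // The implicit class wraps java.util.Iterator and doubles as the
  // implicit conversion; toScalaList is the (hypothetical) added method.
  implicit class RichJavaIterator[A](val underlying: java.util.Iterator[A]) {
    def toScalaList: List[A] = {
      val builder = List.newBuilder[A]
      while (underlying.hasNext) builder += underlying.next()
      builder.result()
    }
  }
}
```

After import IteratorSyntax._, any java.util.Iterator appears to have a toScalaList method, just as Connection appears to have withQuery. Unlike the explicit pair of conversion methods, an implicit class only provides the conversion from the original type to the enriched type.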

Wrapping result sets in a stream
The JDBC ResultSet object plays very badly with Scala collections. The only real
way of doing anything useful with it is to loop through it directly with a while loop.
For instance, to get a list of the names of physicists in our database, we could write
the following code:
// WARNING: poor Scala code
import Implicits._ // import implicit conversions
SqlUtils.usingConnection("test") { connection =>
  connection.withQuery("SELECT * FROM physicists") { resultSet =>
    var names = List.empty[String]
    while(resultSet.next) {
      val name = resultSet.getString("name")
      names = name :: names
    }
    names
  }
}
//=> List[String] = List(Paul Dirac, Albert Einstein, Marie Curie,
//   Richard Feynman, Isaac Newton)

The ResultSet interface feels unnatural because it behaves very differently from
Scala collections. In particular, it does not support the higher-order functions that we
take for granted in Scala: no map, filter, fold, or for comprehensions. Thankfully,
writing a stream that wraps ResultSet is quite straightforward. A Scala stream is a
lazily evaluated list: it evaluates the next element in the collection when it is needed
and forgets previous elements when they are no longer used.
We can define a stream method that wraps ResultSet as follows:
// SqlUtils.scala
object SqlUtils {
  ...
  def stream(results:ResultSet):Stream[ResultSet] =
    if (results.next) { results #:: stream(results) }
    else { Stream.empty[ResultSet] }
}

This might look quite confusing, so let's take it slowly. We define a stream method
that wraps ResultSet, returning a Stream[ResultSet]. When the client calls
stream on an empty result set, this just returns an empty stream. When the client
calls stream on a non-empty ResultSet, the ResultSet instance is advanced by one
row, and the client gets back results #:: stream(results). The #:: operator on a
stream is similar to the cons operator, ::, on a list: it prepends results to an existing
Stream. The critical difference is that, unlike a list, stream(results) does not get
evaluated until necessary. This, therefore, avoids duplicating the entire ResultSet
in memory.
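
The lazy behavior can be observed without a database by applying the same recursion to a simple cursor. In this sketch, Cursor is a hypothetical class (not part of JDBC) that mimics the next-then-read protocol of ResultSet and counts how often next() is called; CursorStream.stream has the same shape as SqlUtils.stream. Note that Stream is deprecated in favor of LazyList from Scala 2.13 onwards; it is used here to match the text.

```scala
// Hypothetical cursor (not JDBC): like a ResultSet, it starts before the
// first row, and next() advances it, returning false past the last row.
class Cursor[A](rows: Vector[A]) {
  private var index = -1
  var nextCalls = 0 // number of times next() has been invoked
  def next(): Boolean = {
    nextCalls += 1
    index += 1
    index < rows.length
  }
  def current: A = rows(index)
}

object CursorStream {
  // Same shape as SqlUtils.stream: the tail after #:: is evaluated lazily.
  def stream[A](cursor: Cursor[A]): Stream[Cursor[A]] =
    if (cursor.next()) cursor #:: stream(cursor)
    else Stream.empty[Cursor[A]]
}
```

Immediately after calling CursorStream.stream(cursor), the cursor has been advanced only once; the remaining calls to next() happen as the stream is traversed. Every element of the stream is the same mutable cursor, so each row has to be read (for instance with map) before the stream moves on to the next one.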
Let's use our brand new stream function to get the name of all the physicists in our
database:
import Implicits._
SqlUtils.usingConnection("test") { connection =>
  connection.withQuery("SELECT * FROM physicists") { results =>
    val resultsStream = SqlUtils.stream(results)
    resultsStream.map { _.getString("name") }.toVector
  }
}
//=> Vector(Isaac Newton, Richard Feynman, Marie Curie, Albert Einstein,
//   Paul Dirac)

Streaming the results, rather than using the result set directly, lets us interact with
the data much more naturally as we are now dealing with just a Scala collection.
When you use stream in a withQuery block (or, generally, in a block that
automatically closes the result set), you must always materialize the stream within
the function, hence the call to toVector. Otherwise, the stream will wait until its
elements are needed to materialize them, and by then, the ResultSet instance will
be closed.

Looser coupling with type classes
So far, we have been reading and writing simple types to the database. Let's imagine
that we want to add a gender column to our database. We will store the gender as an
enumeration in our physicists database. Our table is now as follows:
mysql> CREATE TABLE physicists (
id INT(11) AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(32) NOT NULL,
gender ENUM("Female", "Male") NOT NULL
);

How can we represent genders in Scala? A good way of doing this is with an
enumeration:
// Gender.scala
object Gender extends Enumeration {
val Male = Value
val Female = Value
}

However, we now have a problem when deserializing objects from the database:
JDBC has no built-in mechanism to convert from a SQL ENUM type to a Scala Gender
type. We could achieve this by just converting manually every time we need to read
gender information:
resultsStream.map {
rs => Gender.withName(rs.getString("gender"))
}.toVector

However, we would need to write this everywhere that we want to read the gender
field. This goes against the DRY (don't repeat yourself) principle, leading to code
that is difficult to maintain. If we decide to change the way gender is stored in the
database, we would need to find every instance in the code where we read the
gender field and change it.
A somewhat better solution would be to add a getGender method to the ResultSet
class using the pimp my library idiom that we used extensively in this chapter. This
solution is still not optimal. We are adding unnecessary specificity to ResultSet: it is
now coupled to the structure of our databases.
We could create a subclass of ResultSet using inheritance, such as
PhysicistResultSet, that can read the fields in a specific table. However, this
approach is not composable: if we had another table that kept track of pets, with
name, species, and gender fields, we would have to either reimplement the code
for reading gender in a new PetResultSet or factor out a GenderedResultSet
superclass. As the number of tables grows, the inheritance hierarchy would become
unmanageable. A better approach would let us compose the functionality that we
need. In particular, we want to decouple the process of extracting Scala objects from
a result set from the code for iterating over a result set.

Type classes
Scala provides an elegant solution using type classes. Type classes are a very powerful
arrow in the Scala architect's quiver. However, they can present a bit of a learning
curve, especially as there is no direct equivalent in object-oriented programming.
Instead of presenting an abstract explanation, I will dive into an example: I will
describe how we can leverage type classes to convert fields in a ResultSet to Scala
types. The aim is to define a read[T](field) method on ResultSet that knows
exactly how to deserialize to objects of type T. This method will replace and extend
the getXXX methods in ResultSet:
// results is a ResultSet instance
val name = results.read[String]("name")
val gender = results.read[Gender.Value]("gender")

We start by defining an abstract SqlReader[T] trait that exposes a read method to
read a specific field from a ResultSet and return an instance of type T:
// SqlReader.scala
import java.sql._
trait SqlReader[T] {
def read(results:ResultSet, field:String):T
}

We now need to provide a concrete implementation of SqlReader[T] for every T
type that we want to read. Let's provide concrete implementations for the Gender and
String fields. We will place the implementation in a SqlReader companion object:
// SqlReader.scala
object SqlReader {
  implicit object StringReader extends SqlReader[String] {
    def read(results:ResultSet, field:String):String =
      results.getString(field)
  }
  implicit object GenderReader extends SqlReader[Gender.Value] {
    def read(results:ResultSet, field:String):Gender.Value =
      Gender.withName(StringReader.read(results, field))
  }
}

We could now use our StringReader and GenderReader objects to read from a result set:
import SqlReader._
val name = StringReader.read(results, "name")
val gender = GenderReader.read(results, "gender")

This is already somewhat better than using the following:
Gender.withName(results.getString("gender"))

This is because the code to map from a ResultSet field to Gender.Value is
centralized in a single place: GenderReader. However, it would be great if we
could tell Scala to use GenderReader whenever it needs to read Gender.Value,
and use StringReader whenever it needs to read a String value. This is exactly
what type classes do.

Coding against type classes
We defined a SqlReader[T] interface that abstracts how to read an object of type T
from a field in a ResultSet. How do we tell Scala that it needs to use the right
SqlReader object to convert from the ResultSet fields to the appropriate Scala type?
The key is the implicit keyword that we used to prefix the GenderReader and
StringReader object definitions. It lets us write:
implicitly[SqlReader[Gender.Value]].read(results, "gender")
implicitly[SqlReader[String]].read(results, "name")

By writing implicitly[SqlReader[T]], we are telling the Scala compiler to find a
value (here, an object) of type SqlReader[T] that is marked for implicit use. Try
this out by pasting the following in the Scala console, for instance:
scala> :paste
import Implicits._ // Connection to RichConnection conversion
SqlUtils.usingConnection("test") {
  _.withQuery("select * from physicists") {
    rs => {
      rs.next() // advance to first record
      implicitly[SqlReader[Gender.Value]].read(rs, "gender")
    }
  }
}

Of course, using implicitly[SqlReader[T]] everywhere is not particularly elegant.
Let's use the pimp my library idiom to add a read[T] method to ResultSet. We first
define a RichResultSet class that we can use to "pimp" the ResultSet class:
// RichResultSet.scala
import java.sql.ResultSet

class RichResultSet(val underlying:ResultSet) {
  def read[T : SqlReader](field:String):T = {
    implicitly[SqlReader[T]].read(underlying, field)
  }
}

The only unfamiliar part of this should be the read[T : SqlReader] generic
definition. We are stating here that read will accept any T type, provided an instance
of SqlReader[T] exists. This is called a context bound.
We must also add implicit methods to the Implicits object to convert from
ResultSet to RichResultSet. You should be familiar with this now, so I will not
bore you with the details. You can now call results.read[T](fieldName) for
any T for which you have a SqlReader[T] implicit object defined:
import Implicits._
SqlUtils.usingConnection("test") { connection =>
  connection.withQuery("SELECT * FROM physicists") {
    results =>
      val resultStream = SqlUtils.stream(results)
      resultStream.map { row =>
        val name = row.read[String]("name")
        val gender = row.read[Gender.Value]("gender")
        (name, gender)
      }.toVector
  }
}
//=> Vector[(String, Gender.Value)] = Vector((Albert Einstein,Male),
//   (Marie Curie,Female))
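
The implicit conversions that the preceding example relies on might look like the following sketch. The method names pimpResultSet and depimpResultSet are assumptions that mirror pimpConnection and depimpConnection. SqlReader and RichResultSet are repeated from the text (with only the String reader) so that the block compiles on its own; in the chapter's code base, the two conversions would simply be added to the existing Implicits object.

```scala
import java.sql.ResultSet
import scala.language.implicitConversions

// Repeated from the text so that this sketch is self-contained.
trait SqlReader[T] {
  def read(results: ResultSet, field: String): T
}

object SqlReader {
  implicit object StringReader extends SqlReader[String] {
    def read(results: ResultSet, field: String): String =
      results.getString(field)
  }
}

class RichResultSet(val underlying: ResultSet) {
  def read[T: SqlReader](field: String): T =
    implicitly[SqlReader[T]].read(underlying, field)
}

object Implicits {
  // Assumed names, following the pimpConnection/depimpConnection pair.
  implicit def pimpResultSet(rs: ResultSet): RichResultSet =
    new RichResultSet(rs)
  implicit def depimpResultSet(rs: RichResultSet): ResultSet =
    rs.underlying
}
```

With import Implicits._ in scope, the compiler rewrites row.read[String]("name") into pimpResultSet(row).read[String]("name") and supplies StringReader as the implicit SqlReader[String].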

Let's summarize the steps needed for type classes to work. We will do this in the
context of deserializing from SQL, but you will be able to adapt these steps to solve
other problems:
• Define an abstract generic trait that provides the interface for the type class,
for example, SqlReader[T]. Any functionality that is independent of T can
be added to this base trait.

• Create the companion object for the base trait and add implicit objects
extending the trait for each T, for example,
implicit object StringReader extends SqlReader[String].

• Type classes are always used in generic methods. A method that relies
on the existence of a type class for an argument must contain a context
bound in the generic definition, for example, def read[T : SqlReader]
(field:String):T. To access the type class in this method, use the
implicitly method: implicitly[SqlReader[T]].
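
To show that the three steps carry over to other problems, here is a minimal sketch that applies them to parsing comma-separated text instead of reading from SQL. All of the names (CellParser, IntParser, BooleanParser, Csv, parseRow) are hypothetical and chosen only for this illustration.

```scala
// Step 1: the trait that defines the type class interface.
trait CellParser[T] {
  def parse(cell: String): T
}

// Step 2: implicit instances, placed in the companion object so that
// the compiler finds them without an import.
object CellParser {
  implicit object IntParser extends CellParser[Int] {
    def parse(cell: String): Int = cell.trim.toInt
  }
  implicit object BooleanParser extends CellParser[Boolean] {
    def parse(cell: String): Boolean = cell.trim.toLowerCase == "true"
  }
}

// Step 3: a generic method with a context bound; the instance is
// retrieved with implicitly.
object Csv {
  def parseRow[T: CellParser](row: String): Vector[T] = {
    val parser = implicitly[CellParser[T]]
    row.split(",").toVector.map(cell => parser.parse(cell))
  }
}
```

A call such as Csv.parseRow[Int]("1, 2, 3") selects IntParser at compile time, while Csv.parseRow[Boolean] selects BooleanParser; asking for a type with no instance in scope is a compile-time error rather than a runtime failure.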

When to use type classes
Type classes are useful when you need a particular behavior for many different
types, but exactly how this behavior is implemented varies between these types.
For instance, we need to be able to read several different types from ResultSet, but
exactly how each type is read differs between types: for strings, we must read from
ResultSet using getString, whereas for integers, we must use getInt followed by
wasNull.
A good rule of thumb is when you start thinking "Oh, I could just write a generic
method to do this. Ah, but wait, I will have to write the Int implementation as a
specific edge case as it behaves differently. Oh, and the Gender implementation. I
wonder if there's a better way?", then type classes might be useful.

Benefits of type classes
Data scientists frequently have to deal with new input streams, changing
requirements, and new data types. Having an object-relational mapping layer that
is easy to extend or alter is therefore critical to responding to changes efficiently.
Minimizing coupling between code entities and separating concerns are among the
most effective ways to ensure that the code can be changed in response to new data.
With type classes, we maintain orthogonality between accessing records in the
database (through the ResultSet class) and how individual fields are transformed
to Scala objects: both can vary independently. The only coupling between these two
concerns is through the SqlReader[T] interface.
This means that both concerns can evolve independently: to read a new data
type, we just need to implement a SqlReader[T] object. Conversely, we can
add functionality to ResultSet without needing to reimplement how fields are
converted. For instance, we could add a getColumn method that returns a Vector[T]
of all the values of a field in a ResultSet instance:
def getColumn[T : SqlReader](field:String):Vector[T] = {
val resultStream = SqlUtils.stream(results)
resultStream.map { _.read[T](field) }.toVector
}

Note how we could do this without increasing the coupling to the way in which
individual fields are read.

Creating a data access layer
Let's bring together everything that we have seen and build a data-mapper class
for fetching Physicist objects from the database. These classes (also called data
access objects) are useful to decouple the internal representation of an object from its
representation in the database.
We start by defining the Physicist class:
// Physicist.scala
case class Physicist(
  name: String,
  gender: Gender.Value
)

The data access object will expose a single method, readAll, that returns a
Vector[Physicist] of all the physicists in our database:
// PhysicistDao.scala
import java.sql.{ ResultSet, Connection }
import Implicits._ // implicit conversions

object PhysicistDao {
  /* Helper method for reading a single row */
  private def readFromResultSet(results:ResultSet):Physicist = {
    Physicist(
      results.read[String]("name"),
      results.read[Gender.Value]("gender")
    )
  }

  /* Read the entire 'physicists' table. */
  def readAll(connection:Connection):Vector[Physicist] = {
    connection.withQuery("SELECT * FROM physicists") { results =>
      val resultStream = SqlUtils.stream(results)
      resultStream.map(readFromResultSet).toVector
    }
  }
}

The data access layer can be used by client code as in the following example:
object PhysicistDaoDemo extends App {
  val physicists = SqlUtils.usingConnection("test") {
    connection => PhysicistDao.readAll(connection)
  }
  // physicists is a Vector[Physicist] instance.
  physicists.foreach { println }
  //=> Physicist(Albert Einstein,Male)
  //=> Physicist(Marie Curie,Female)
}

Summary
In this chapter, we learned how to interact with SQL databases using JDBC.
We wrote a library to wrap native JDBC objects, aiming to give them a more
functional interface.
In the next chapter, you will learn about Slick, a Scala library that provides functional
wrappers to interact with relational databases.

References
The API documentation for JDBC is very complete:
http://docs.oracle.com/javase/7/docs/api/java/sql/package-summary.html
The API documentation for the ResultSet interface
(http://docs.oracle.com/javase/7/docs/api/java/sql/ResultSet.html), the
PreparedStatement interface
(http://docs.oracle.com/javase/7/docs/api/java/sql/PreparedStatement.html), and the
Connection interface
(http://docs.oracle.com/javase/7/docs/api/java/sql/Connection.html) is particularly relevant.
The data mapper pattern is described extensively in Martin Fowler's Patterns of
Enterprise Application Architecture. A brief description is also available on his website
(http://martinfowler.com/eaaCatalog/dataMapper.html).

For an introduction to SQL, I suggest Learning SQL by Alan Beaulieu (O'Reilly).
For another discussion of type classes, read
http://danielwestheide.com/blog/2013/02/06/the-neophytes-guide-to-scala-part-12-type-classes.html.
This post describes how some common object-oriented design patterns can be
reimplemented more elegantly in Scala using type classes:
https://staticallytyped.wordpress.com/2013/03/24/gang-of-four-patterns-with-type-classes-and-implicits-in-scala-part-2/

This post by Martin Odersky details the Pimp my Library pattern:
http://www.artima.com/weblogs/viewpost.jsp?thread=179766

Slick – A Functional Interface for SQL
In Chapter 5, Scala and SQL through JDBC, we investigated how to access SQL
databases with JDBC. As interacting with JDBC feels somewhat unnatural, we
extended JDBC using custom wrappers. The wrappers were developed to provide
a functional interface to hide the imperative nature of JDBC.
With the difficulty of interacting directly with JDBC from Scala and the ubiquity of
SQL databases, you would expect there to be existing Scala libraries that wrap JDBC.
Slick is such a library.
Slick styles itself as a functional-relational mapping library, a play on the more
traditional object-relational mapping name used to denote libraries that build objects
from relational databases. It presents a functional interface to SQL databases,
allowing the client to interact with them in a manner similar to native Scala
collections.

FEC data
In this chapter, we will use a somewhat more involved example dataset. The
Federal Election Commission of the United States (FEC) records all donations to
presidential candidates greater than $200. These records are publicly available. We
will look at the donations for the campaign leading up to the 2012 general election
that resulted in Barack Obama's re-election. The data includes donations to the two
presidential candidates, Obama and Romney, and also to the other contenders in the
Republican primaries (Obama faced no significant challenger in the Democratic primaries).
In this chapter, we will take the transaction data provided by the FEC, store it in a
table, and learn how to query and analyze it.

The first step is to acquire the data. If you have downloaded the code samples
from the Packt website, you should already have two CSVs in the data directory
of the code samples for this chapter. If not, you can download the files using the
following links:
• data.scala4datascience.com/fec/ohio.csv.gz (or ohio.csv.zip)

• data.scala4datascience.com/fec/us.csv.gz (or us.csv.zip)

Decompress the two files and place them in a directory called data/ in the same
location as the source code examples for this chapter. The data files correspond
to the following:
• The ohio.csv file is a CSV of all the donations made by donors in Ohio.

• The us.csv file is a CSV of all the donations made by donors across the
  country. This is quite a large file, with six million rows.

The two CSV files contain identical columns. Use the Ohio dataset for more
responsive behavior, or the nationwide data file if you want to wrestle with a
larger dataset. The dataset is adapted from a list of contributions downloaded
from http://www.fec.gov/disclosurep/PDownload.do.
Let's start by creating a Scala case class to represent a transaction. In the context of
this chapter, a transaction is a single donation from an individual to a candidate:
// Transaction.scala
import java.sql.Date
case class Transaction(
id:Option[Int], // unique identifier
candidate:String, // candidate receiving the donation
contributor:String, // name of the contributor
contributorState:String, // contributor state
contributorOccupation:Option[String], // contributor job
amount:Long, // amount in cents
date:Date // date of the donation
)

The code repository for this chapter includes helper functions in an FECData
singleton object to load the data from CSVs:
scala> val ohioData = FECData.loadOhio
s4ds.FECData = s4ds.FECData@718454de

Calling FECData.loadOhio or FECData.loadAll will create an FECData object with
a single attribute, transactions, which is an iterator over all the donations coming
from Ohio or the entire United States:
scala> val ohioTransactions = ohioData.transactions
Iterator[Transaction] = non-empty iterator
scala> ohioTransactions.take(5).foreach(println)
Transaction(None,Paul, Ron,BROWN, TODD W MR.,OH,Some(ENGINEER),5000,2011-01-03)
Transaction(None,Paul, Ron,DIEHL, MARGO SONJA,OH,Some(RETIRED),2500,2011-01-03)
Transaction(None,Paul, Ron,KIRCHMEYER, BENJAMIN,OH,Some(COMPUTER PROGRAMMER),20120,2011-01-03)
Transaction(None,Obama, Barack,KEYES, STEPHEN,OH,Some(HR EXECUTIVE / ATTORNEY),10000,2011-01-03)
Transaction(None,Obama, Barack,MURPHY, MIKE W,OH,Some(MANAGER),5000,2011-01-03)

Now that we have some data to play with, let's try and put it in the database so that
we can run some useful queries on it.

Importing Slick
To add Slick to the list of dependencies, you will need to add "com.typesafe.slick"
%% "slick" % "2.1.0" to the list of dependencies in your build.sbt file. You will
also need to make sure that Slick has access to a JDBC driver. In this chapter, we
will connect to a MySQL database, and must, therefore, add the MySQL connector
"mysql" % "mysql-connector-java" % "5.1.37" to the list of dependencies.
Slick is imported by importing a specific database driver. As we are using MySQL,
we must import the following:
scala> import slick.driver.MySQLDriver.simple._
import slick.driver.MySQLDriver.simple._

To connect to a different flavor of SQL database, import the relevant driver. The
easiest way of seeing what drivers are available is to consult the API documentation
for the slick.driver package, which is available at
http://slick.typesafe.com/doc/2.1.0/api/#scala.slick.driver.package. All the common
SQL flavors are supported (including H2, PostgreSQL, MS SQL Server, and SQLite).

Defining the schema
Let's create a table to represent our transactions. We will use the following schema:
CREATE TABLE transactions(
id INT(11) AUTO_INCREMENT PRIMARY KEY,
candidate VARCHAR(254) NOT NULL,
contributor VARCHAR(254) NOT NULL,
contributor_state VARCHAR(2) NOT NULL,
contributor_occupation VARCHAR(254),
amount BIGINT(20) NOT NULL,
date DATE NOT NULL
);

Note that the donation amount is in cents. This allows us to use an integer field
(rather than a fixed point decimal, or worse, a float).
You should never use a floating point format to represent money
or, in fact, any discrete quantity because binary floats cannot represent
most decimal fractions exactly:
scala> 0.1 + 0.2
Double = 0.30000000000000004

This seemingly nonsensical result occurs because none of 0.1, 0.2, and 0.3 can be
stored exactly as a double, so rounding errors creep into the sum.
This post gives an extensive discussion of the limitations of the
floating point format:
http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
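To make the point about integer cents concrete, here is a small sketch that compares summing ten 10-cent donations as doubles and as Long cents. The object and value names are made up for this illustration; only standard library calls are used.

```scala
object MoneySketch {
  // Ten donations of $0.10 each, stored as doubles: the rounding errors
  // accumulate, so the sum is not exactly 1.0.
  val asDoubles: Double = List.fill(10)(0.1).sum

  // The same donations stored as integer cents sum exactly.
  val asCents: Long = List.fill(10)(10L).sum

  // Convert to a decimal dollar amount only at the edge, for display.
  val asDollars: BigDecimal = BigDecimal(asCents) / 100
}
```

Keeping the amount as a Long throughout, as the transactions table does, means that sums and comparisons stay exact, and the conversion to dollars happens only when presenting results.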

To use Slick with tables in our database, we first need to tell Slick about the database
schema. We do this by creating a class that extends the Table abstract class. The way
in which a schema is defined is quite straightforward, so let's dive straight into the
code. We will store our schema in a Tables singleton. We define a Transactions
class that provides the mapping to go from collections of Transaction instances to
SQL tables structured like the transactions table:
// Tables.scala
import java.sql.Date
import slick.driver.MySQLDriver.simple._

/** Singleton object for table definitions */
object Tables {
  // Transactions table definition
  class Transactions(tag:Tag)
  extends Table[Transaction](tag, "transactions") {
    def id = column[Int]("id", O.PrimaryKey, O.AutoInc)
    def candidate = column[String]("candidate")
    def contributor = column[String]("contributor")
    def contributorState = column[String](
      "contributor_state", O.DBType("VARCHAR(2)"))
    def contributorOccupation = column[Option[String]](
      "contributor_occupation")
    def amount = column[Long]("amount")
    def date = column[Date]("date")
    def * = (id.?, candidate, contributor,
      contributorState, contributorOccupation, amount, date) <> (
      Transaction.tupled, Transaction.unapply)
  }

  val transactions = TableQuery[Transactions]
}

Let's go through this line by line. We first define a Transactions class, which must
take a Slick Tag object as its first argument. The Tag object is used by Slick internally
to construct SQL statements. The Transactions class extends the Table class,
passing it the tag and name of the table in the database. We could, optionally, have
added a database name by extending Table[Transaction](tag, Some("fec"),
"transactions") rather than just Table[Transaction](tag, "transactions").
The Table type is parametrized by Transaction. This means that running SELECT
statements on the database returns Transaction objects. Similarly, we will insert
data into the database by passing a transaction or list of transactions to the relevant
Slick methods.
Let's look at the Transactions class definition in more detail. The body of the class
starts by listing the database columns. For instance, the id column is defined as follows:
def id = column[Int]("id", O.PrimaryKey, O.AutoInc)

We tell Slick that it should read the column called id and transform it to a Scala
integer. Additionally, we tell Slick that this column is the primary key and that
it is auto-incrementing. The Slick documentation contains a list of available
options for column.

The candidate and contributor columns are straightforward: we tell Slick to read
these as String from the database. The contributor_state column is a little more
interesting. Besides specifying that it should be read from the database as a String,
we also tell Slick that it should be stored in the database with type VARCHAR(2).
The contributor_occupation column in our table can contain NULL values. When
defining the schema, we pass the Option[String] type to the column method:
def contributorOccupation =
column[Option[String]]("contributor_occupation")

When reading from the database, a NULL field will get converted to None for columns
specified as Option[T]. Conversely, if the field has a value, it will be returned as
Some(value).
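The NULL-to-None convention can be illustrated in plain Scala. This sketch does not show Slick's internals; rawOccupation is a hypothetical stand-in for a lower-level call that may return null for a NULL column.

```scala
object NullableSketch {
  // Hypothetical stand-in for a raw database read that yields null
  // when the column holds NULL.
  def rawOccupation(hasValue: Boolean): String =
    if (hasValue) "ENGINEER" else null

  // Option.apply maps null to None and any other value to Some(value).
  val present: Option[String] = Option(rawOccupation(hasValue = true))
  val missing: Option[String] = Option(rawOccupation(hasValue = false))
}
```

Client code can then use map, getOrElse, or pattern matching on the Option rather than checking for null.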
The last line of the class body is the most interesting part: it specifies how to
transform the raw data read from the database into a Transaction object and how to
convert a Transaction object to raw fields ready for insertion:
def * = (id.?, candidate, contributor,
contributorState, contributorOccupation, amount, date) <> (
Transaction.tupled, Transaction.unapply)

The first part is just a tuple of fields to be read from the database: (id.?,
candidate, contributor, contributorState, contributorOccupation,
amount, date), with a small amount of metadata. The second part is a pair of
functions that describe how to transform this tuple into a Transaction object and
back. In this case, as Transaction is a case class, we can take advantage of the
Transaction.tupled and Transaction.unapply methods automatically provided
for case classes.
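The pair of functions passed to <> can be tried out without Slick. The sketch below assumes Scala 2.x (as used throughout this book), where unapply on a case class companion returns an Option of a tuple; Donation is a simplified, hypothetical stand-in for Transaction.

```scala
object MappingSketch {
  // Simplified, hypothetical stand-in for the chapter's Transaction class.
  case class Donation(candidate: String, amountCents: Long)

  // tuple => object: the direction used when rows are read.
  val fromRow: ((String, Long)) => Donation = (Donation.apply _).tupled

  // object => Option[tuple]: the direction used when rows are written.
  val toRow: Donation => Option[(String, Long)] = Donation.unapply _
}
```

For a class that is not a case class, you would supply two hand-written functions with these same shapes.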

Notice how we followed the id entry with .?. In our Transaction class, the
donation id has the Option[Int] type, but the column in the database has the
INT type with the additional O.AutoInc option. The .? suffix lifts the column to
an optional column so that its type matches the Option[Int] field of Transaction.
When we insert a Transaction, Slick leaves out the auto-incrementing column,
letting the database assign the id.
Finally, we define the value:
val transactions = TableQuery[Transactions]

This is the handle that we use to actually interact with the database. For instance, as
we will see later, to get a list of donations to Barack Obama, we run the following
query (don't worry about the details of the query for now):
Tables.transactions.filter {_.candidate === "Obama, Barack"}.list

Let's summarize the parts of our Transactions mapper class:
• The Transactions class must extend the Table abstract class parametrized
  by the type that we want to return: Table[Transaction].

• We define the columns to read from the database explicitly using column,
  for example,
  def contributorState = column[String]("contributor_state", O.DBType("VARCHAR(2)")).
  The [String] type parameter defines
  the Scala type that this column gets read as. The first argument is the SQL
  column name. Consult the Slick documentation for a full list of additional
  arguments (http://slick.typesafe.com/doc/2.1.0/schemas.html).

• We describe how to convert from a tuple of the column values to a Scala
  object and vice versa using
  def * = (id.?, candidate, ...) <> (Transaction.tupled, Transaction.unapply).

Connecting to the database
So far, you have learned how to define Table classes that encode the transformation
from rows in a SQL table to Scala case classes. To move beyond table definitions and
start interacting with a database server, we must connect to a database. As in the
previous chapter, we will assume that there is a MySQL server running on localhost
on port 3306.
We will use the console to demonstrate the functionality in this chapter, but you can
find an equivalent sample program in SlickDemo.scala. Let's open a Scala console
and connect to the database running on port 3306:
scala> import slick.driver.MySQLDriver.simple._
import slick.driver.MySQLDriver.simple._
scala> val db = Database.forURL(
"jdbc:mysql://127.0.0.1:3306/test",
driver="com.mysql.jdbc.Driver"
)
db: slick.driver.MySQLDriver.backend.DatabaseDef = slick.jdbc.JdbcBackend$DatabaseDef@3632d1dd

If you have read the previous chapter, you will recognize the first argument as a
JDBC-style URL. The URL starts by defining a protocol, in this case, jdbc:mysql,
followed by the IP address and port of the database server, followed by the database
name (test, here).

The second argument to forURL is the class name of the JDBC driver. This driver is
imported at runtime using reflection. Note that the driver specified here must match
the Slick driver imported statically.
Having defined the database, we can now use it to create a connection:
scala> db.withSession { implicit session =>
// do something useful with the database
println(session)
}
scala.slick.jdbc.JdbcBackend$BaseSession@af5a276

Slick functions that require access to the database take a Session argument
implicitly: if a Session instance marked as implicit is available in scope, they will
use it. Thus, preceding session with the implicit keyword saves us having to pass
session explicitly every time we run an operation on the database.
If you have read the previous chapter, you will recognize that Slick deals with the
need to close connections with the loan pattern: a database connection is created in the
form of a session object and passed temporarily to the client. When the client code
returns, the session is closed, ensuring that all opened connections are closed. The
client code is therefore spared the responsibility of closing the connection.
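The combination of the loan pattern and an implicit session can be sketched in a few lines of plain Scala. FakeSession, withFakeSession, and isOpen are hypothetical names for this illustration; they only mimic the shape of db.withSession and are not Slick's implementation.

```scala
object LoanPatternSketch {
  // Hypothetical stand-in for a database session.
  class FakeSession {
    var closed = false
    def close(): Unit = { closed = true }
  }

  // Lend a fresh session to f and close it afterwards, even if f throws.
  def withFakeSession[T](f: FakeSession => T): T = {
    val session = new FakeSession
    try f(session) finally session.close()
  }

  // An operation that picks up the session implicitly.
  def isOpen(implicit session: FakeSession): Boolean = !session.closed

  // Keep a reference so we can inspect the session after it is returned.
  var lent: Option[FakeSession] = None

  val openDuringCall: Boolean = withFakeSession { implicit session =>
    lent = Some(session)
    isOpen // no need to pass the session explicitly
  }
}
```

Inside the block the session is open and found implicitly by isOpen; once withFakeSession returns, the session held in lent has been closed without the client code doing anything.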
The loan pattern is very useful in production code, but it can be somewhat
cumbersome in the shell. Slick lets us create a session explicitly as follows:
scala> implicit val session = db.createSession
session: slick.driver.MySQLDriver.backend.Session = scala.slick.jdbc.JdbcBackend$BaseSession@2b775b49
scala> session.close

Creating tables
Let's use our new connection to create the transaction table in the database. We
can access methods to create and drop tables using the ddl attribute on our
TableQuery[Transactions] instance:
scala> db.withSession { implicit session =>
Tables.transactions.ddl.create
}

If you jump into a mysql shell, you will see that a transactions table has been created:
mysql> describe transactions ;
+------------------------+--------------+------+-----+
| Field                  | Type         | Null | Key |
+------------------------+--------------+------+-----+
| id                     | int(11)      | NO   | PRI |
| candidate              | varchar(254) | NO   |     |
| contributor            | varchar(254) | NO   |     |
| contributor_state      | varchar(2)   | NO   |     |
| contributor_occupation | varchar(254) | YES  |     |
| amount                 | bigint(20)   | NO   |     |
| date                   | date         | NO   |     |
+------------------------+--------------+------+-----+
7 rows in set (0.01 sec)

The ddl attribute also includes a drop method to drop the table. Incidentally, ddl
stands for "data-definition language" and is commonly used to refer to the parts of
SQL relevant to schema and constraint definitions.

Inserting data
Slick TableQuery instances let us interact with SQL tables with an interface similar
to Scala collections.
Let's create a transaction first. We will pretend that a donation occurred on the 22nd
of June, 2010. Unfortunately, the code to create dates in Scala and pass these to JDBC
is particularly clunky. We first create a java.util.Date instance, which we must
then convert to a java.sql.Date to use in our newly created transaction:
scala> import java.text.SimpleDateFormat
import java.text.SimpleDateFormat
scala> val date = new SimpleDateFormat("dd-MM-yyyy").parse("22-06-2010")
date: java.util.Date = Tue Jun 22 00:00:00 BST 2010
scala> val sqlDate = new java.sql.Date(date.getTime())
sqlDate: java.sql.Date = 2010-06-22
scala> val transaction = Transaction(
None, "Obama, Barack", "Doe, John", "TX", None, 200, sqlDate
)
transaction: Transaction = Transaction(None,Obama, Barack,Doe, John,TX,None,200,2010-06-22)
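As a possible shortcut for the date conversion used above, java.sql.Date provides a valueOf factory method that parses a yyyy-mm-dd string directly, and on Java 8 or later a second overload that accepts a java.time.LocalDate. The sketch below is an alternative to the SimpleDateFormat route, not what the chapter's code samples use.

```scala
import java.sql.Date
import java.time.LocalDate

object SqlDateSketch {
  // Parse a yyyy-mm-dd string straight into a java.sql.Date.
  val fromString: Date = Date.valueOf("2010-06-22")

  // Or convert from java.time.LocalDate (requires Java 8 or later).
  val fromLocalDate: Date = Date.valueOf(LocalDate.of(2010, 6, 22))
}
```

Either value can be passed wherever the Transaction constructor expects a java.sql.Date.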

Much of the interface provided by the TableQuery instance mirrors that of a mutable
list. To insert a single row in the transaction table, we can use the += operator:
scala> db.withSession {
implicit session => Tables.transactions += transaction
}
Int = 1

Under the hood, this will create a JDBC prepared statement and run this statement's
executeUpdate method.
If you are committing many rows at a time, you should use Slick's bulk insert
operator: ++=. This takes a List[Transaction] as input and inserts all the
transactions in a single batch by taking advantage of JDBC's addBatch and
executeBatch functionality.
Let's insert all the FEC transactions so that we have some data to play with when
running queries in the next section. We can load an iterator of transactions for Ohio
by calling the following:
scala> val transactions = FECData.loadOhio.transactions
transactions: Iterator[Transaction] = non-empty iterator

We can also load the transactions for the whole of the United States:
scala> val transactions = FECData.loadAll.transactions
transactions: Iterator[Transaction] = non-empty iterator

To avoid materializing all the transactions in one fell swoop (thus potentially
exceeding our computer's available memory), we will take batches of transactions
from the iterator and insert them:
scala> val batchSize = 100000
batchSize: Int = 100000
scala> val transactionBatches = transactions.grouped(batchSize)
transactionBatches: transactions.GroupedIterator[Transaction] = non-empty iterator

An iterator's grouped method splits the iterator into batches. It is useful to split a
long collection or iterator into manageable batches that can be processed one after
the other. This is important when integrating or processing large datasets.
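The behavior of grouped is easy to check on a small iterator. In this sketch, an iterator over ten integers stands in for the iterator of transactions, and a batch size of 4 stands in for 100000.

```scala
object GroupedSketch {
  // Ten "rows", split into batches of at most four.
  val rows: Iterator[Int] = Iterator.range(0, 10)

  // Each batch is materialized only when the grouped iterator reaches it.
  val batches: List[List[Int]] = rows.grouped(4).map(_.toList).toList
  // batches == List(List(0, 1, 2, 3), List(4, 5, 6, 7), List(8, 9))
}
```

Note that the last batch is shorter than the others when the number of rows is not a multiple of the batch size.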
All that we have to do now is iterate over our batches, inserting them into the
database as we go:
scala> db.withSession { implicit session =>
transactionBatches.foreach {
batch => Tables.transactions ++= batch.toList
}
}

While this works, it is sometimes useful to see progress reports when doing
long-running integration processes. As we have split the integration into batches,
we know (to the nearest batch) how far into the integration we are. Let's print the
progress information at the beginning of every batch:
scala> db.withSession { implicit session =>
transactionBatches.zipWithIndex.foreach {
case (batch, batchNumber) =>
println(s"Processing row ${batchNumber*batchSize}")
Tables.transactions ++= batch.toList
}
}
Processing row 0
Processing row 100000
...

We use the .zipWithIndex method to transform our iterator over batches into
an iterator of (batch, current index) pairs. In a full-scale application, the progress
information would probably be written to a log file rather than to the screen.
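The same progress calculation can be tried on a toy iterator. This sketch uses eight integers and a batch size of three in place of the real data, and collects the (row offset, batch length) pairs instead of printing them.

```scala
object ProgressSketch {
  val batchSize = 3
  val batches = Iterator.range(0, 8).grouped(batchSize)

  // Pair each batch with its index, then derive the offset of its first row.
  val progress: List[(Int, Int)] = batches.zipWithIndex.map {
    case (batch, batchNumber) => (batchNumber * batchSize, batch.length)
  }.toList
  // progress == List((0, 3), (3, 3), (6, 2))
}
```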
Slick's well-designed interface makes inserting data very intuitive, integrating well
with native Scala types.

Querying data
In the previous section, we used Slick to insert donation data into our database. Let's
explore this data now.
When defining the Transactions class, we defined a TableQuery object,
transactions, that acts as the handle for accessing the transaction table. It exposes
an interface similar to Scala iterators. For instance, to see the first five elements in our
database, we can call take(5):
scala> db.withSession { implicit session =>
Tables.transactions.take(5).list
}
List[Tables.Transactions#TableElementType] =
List(Transaction(Some(1),Obama, Barack,Doe, ...

Internally, Slick implements the .take method using a SQL LIMIT. We can, in fact,
get the SQL statement using the .selectStatement method on the query:
scala> db.withSession { implicit session =>
println(Tables.transactions.take(5).selectStatement)
}
select x2.`id`, x2.`candidate`, x2.`contributor`, x2.`contributor_
state`, x2.`contributor_occupation`, x2.`amount`, x2.`date` from
(select x3.`date` as `date`, x3.`contributor` as `contributor`,
x3.`amount` as `amount`, x3.`id` as `id`, x3.`candidate` as `candidate`,
x3.`contributor_state` as `contributor_state`, x3.`contributor_
occupation` as `contributor_occupation` from `transactions` x3 limit 5)
x2

Our Slick query is made up of the following two parts:
• .take(n): This part is called the invoker. Invokers build up the SQL
  statement but do not actually fire it to the database. You can chain many
  invokers together to build complex SQL statements.

• .list: This part sends the statement prepared by the invoker to the database
  and converts the result to Scala objects. This takes a session argument,
  possibly implicitly.

Invokers
Invokers are the components of a Slick query that build up the SQL select statement.
Slick exposes a variety of invokers that allow the construction of complex queries.
Let's look at some of these invokers here:
• The map invoker is useful to select individual columns or apply operations
  to columns:
scala> db.withSession { implicit session =>
Tables.transactions.map {
_.candidate
}.take(5).list
}
List[String] = List(Obama, Barack, Paul, Ron, Paul, Ron, Paul,
Ron, Obama, Barack)

• The filter invoker is the equivalent of the WHERE statements in SQL. Note
  that Slick fields must be compared using ===:
scala> db.withSession { implicit session =>
Tables.transactions.filter {
_.candidate === "Obama, Barack"
}.take(5).list
}
List[Tables.Transactions#TableElementType] =
List(Transaction(Some(1),Obama, Barack,Doe,
John,TX,None,200,2010-06-22), ...

Similarly, to filter out donations to Barack Obama, use the =!= operator:
scala> db.withSession { implicit session =>
Tables.transactions.filter {
_.candidate =!= "Obama, Barack"
}.take(5).list
}
List[Tables.Transactions#TableElementType] =
List(Transaction(Some(2),Paul, Ron,BROWN, TODD W MR.,OH,...

• The sortBy invoker is the equivalent of the ORDER BY statement in SQL:
scala> db.withSession { implicit session =>
Tables.transactions.sortBy {
_.date.desc
}.take(5).list
}
List[Tables.Transactions#TableElementType] = List(Transaction(Some(65536),Obama, Barack,COPELAND, THOMAS,OH,Some(COLLEGE TEACHING),10000,2012-01-02)

• The leftJoin, rightJoin, innerJoin, and outerJoin invokers are used
  for joining tables. As we do not cover interactions between multiple tables
  in this tutorial, we cannot demonstrate joins. See the Slick documentation
  (http://slick.typesafe.com/doc/2.1.0/queries.html#joining-and-zipping)
  for examples of these.

• Aggregation invokers such as length, min, max, sum, and avg can be used for
  computing summary statistics. These must be executed using .run, rather
  than .list, as they return single values. For instance, to get the total
  donations to Barack Obama:
scala> db.withSession { implicit session =>
  Tables.transactions.filter {
    _.candidate === "Obama, Barack"
  }.map { _.amount }.sum.run
}
Option[Long] = Some(849636799) // (in cents)

Operations on columns
In the previous section, you learned about the different invokers and how they
mapped to SQL statements. We brushed over the methods supported by columns
themselves, however: we can compare for equality using ===, but what other
operations are supported by Slick columns?
Most of the SQL functions are supported. For instance, to get the donations to
candidates whose name starts with "O", we could run the following:
scala> db.withSession { implicit session =>
Tables.transactions.filter {
_.candidate.startsWith("O")
}.take(5).list
}
List[Tables.Transactions#TableElementType] = List(Transaction(Some(1594098)...

Similarly, to count donations that happened between January 1, 2011 and
February 1, 2011, we can use the .between method on the date column:
scala> val dateParser = new SimpleDateFormat("dd-MM-yyyy")
dateParser: java.text.SimpleDateFormat = SimpleDateFormat
scala> val startDate = new java.sql.Date(dateParser.parse("01-01-2011").getTime())
startDate: java.sql.Date = 2011-01-01
scala> val endDate = new java.sql.Date(dateParser.parse("01-02-2011").getTime())
endDate: java.sql.Date = 2011-02-01
scala> db.withSession { implicit session =>
Tables.transactions.filter {
_.date.between(startDate, endDate)
}.length.run
}
Int = 9772

The equivalent of the SQL IN (...) operator that selects values in a specific set is
inSet. For instance, to select all transactions to Barack Obama and Mitt Romney, we
can use the following:
scala> val candidateList = List("Obama, Barack", "Romney, Mitt")
candidateList: List[String] = List(Obama, Barack, Romney, Mitt)
scala> val donationCents = db.withSession { implicit session =>
Tables.transactions.filter {
_.candidate.inSet(candidateList)
}.map { _.amount }.sum.run
}
donationCents: Option[Long] = Some(2874484657)
scala> val donationDollars = donationCents.map { _ / 100 }
donationDollars: Option[Long] = Some(28744846)

So, between them, Mitt Romney and Barack Obama received over 28 million dollars
in registered donations.

We can also negate a Boolean column with the ! operator. For instance, to calculate
the total amount of donations received by all candidates apart from Barack Obama
and Mitt Romney:
scala> db.withSession { implicit session =>
Tables.transactions.filter {
! _.candidate.inSet(candidateList)
}.map { _.amount }.sum.run
}.map { _ / 100 }
Option[Long] = Some(1930747)

Column operations are added by implicit conversion on the base Column instances.
For a full list of methods available on String columns, consult the API documentation
for the StringColumnExtensionMethods class
(http://slick.typesafe.com/doc/2.1.0/api/#scala.slick.lifted.StringColumnExtensionMethods).
For the methods available on Boolean columns, consult the API documentation for
the BooleanColumnExtensionMethods class
(http://slick.typesafe.com/doc/2.1.0/api/#scala.slick.lifted.BooleanColumnExtensionMethods).
For the methods available on numeric columns, consult the API documentation for
NumericColumnExtensionMethods
(http://slick.typesafe.com/doc/2.1.0/api/#scala.slick.lifted.NumericColumnExtensionMethods).

Aggregations with "Group by"
Slick also provides a groupBy method that behaves like the groupBy method of
native Scala collections. Let's get a list of candidates with all the donations for
each candidate:
scala> val grouped = Tables.transactions.groupBy { _.candidate }
grouped: scala.slick.lifted.Query[(scala.slick.lifted.Column[...
scala> val aggregated = grouped.map {
case (candidate, group) =>
(candidate -> group.map { _.amount }.sum)
}
aggregated: scala.slick.lifted.Query[(scala.slick.lifted.Column[...
scala> val groupedDonations = db.withSession {
implicit session => aggregated.list
}
groupedDonations: List[(String, Option[Long])] = List((Bachmann,
Michele,Some(7439272)),...

Let's break this down. The first statement, transactions.groupBy { _.candidate
}, specifies the key by which to group. You can think of this as building an
intermediate list of (String, List[Transaction]) tuples mapping the group key
to a list of all the table rows that satisfy this key. This behavior is identical to calling
groupBy on a Scala collection.
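For comparison, the same two-step pattern on an ordinary Scala collection might look like the following sketch. The Transaction case class and the sample rows are made up for illustration; they are not the chapter's Tables.transactions schema.

```scala
// Hypothetical in-memory stand-in for rows of the transactions table.
case class Transaction(candidate: String, amount: Long)

object GroupByDemo {
  val rows = List(
    Transaction("Paul, Ron", 1000L),
    Transaction("Stein, Jill", 250L),
    Transaction("Paul, Ron", 500L)
  )

  // Step 1: groupBy builds a Map from each key to the rows sharing that key.
  val grouped: Map[String, List[Transaction]] = rows.groupBy { _.candidate }

  // Step 2: map over the (key, group) pairs and aggregate each group.
  val totals: Map[String, Long] = grouped.map {
    case (candidate, group) => candidate -> group.map { _.amount }.sum
  }
}
```

One difference is visible in the REPL output above: the Slick query yields an Option[Long] for each sum, whereas sum on a plain collection returns a Long directly.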
The call to groupBy must be followed by a map that aggregates the groups. The
function passed to map takes the (String, List[Transaction]) pair
created by the groupBy call as its sole argument. The map call is responsible for
aggregating the List[Transaction] object. We first pick out the amount
field of each transaction and then sum these. Finally, we call .list
on the whole pipeline to actually run the query. This just returns a Scala list. Let's
convert the total donations from cents to dollars:
scala> val groupedDonationDollars = groupedDonations.map {
case (candidate, donationCentsOption) =>
candidate -> (donationCentsOption.getOrElse(0L) / 100)
}
groupedDonationDollars: List[(String, Long)] = List((Bachmann,
Michele,74392),...
scala> groupedDonationDollars.sortBy {
_._2
}.reverse.foreach { println }
(Romney, Mitt,20248496)
(Obama, Barack,8496347)
(Paul, Ron,565060)
(Santorum, Rick,334926)
(Perry, Rick,301780)
(Gingrich, Newt,277079)
(Cain, Herman,210768)
(Johnson, Gary Earl,83610)
(Bachmann, Michele,74392)
(Pawlenty, Timothy,42500)
(Huntsman, Jon,23571)
(Roemer, Charles E. 'Buddy' III,8579)
(Stein, Jill,5270)
(McCotter, Thaddeus G,3210)

Accessing database metadata
Commonly, especially during development, you might start the script by dropping
the table if it exists, then recreating it. We can find if a table is defined by accessing
the database metadata through the MTable object. To get a list of tables with name
matching a certain pattern, we can run MTable.getTables(pattern):
scala> import slick.jdbc.meta.MTable
import slick.jdbc.meta.MTable
scala> db.withSession { implicit session =>
MTable.getTables("transactions").list
}
List[scala.slick.jdbc.meta.MTable] = List(MTable(MQName(fec.transactions)
,TABLE,,None,None,None) ...)

Thus, to drop the transactions table if it exists, we can run the following:
scala> db.withSession { implicit session =>
if(MTable.getTables("transactions").list.nonEmpty) {
Tables.transactions.ddl.drop
}
}

The MTable instance contains a lot of metadata about the table. Go ahead and
recreate the transactions table if you dropped it in the previous example. Then, to
find information about the table's primary keys:
scala> db.withSession { implicit session =>
val tableMeta = MTable.getTables("transactions").first
tableMeta.getPrimaryKeys.list
}
List[MPrimaryKey] = List(MPrimaryKey(MQName(test.transactions),id,1,Some(
PRIMARY)))

For a full list of methods available on MTable instances, consult the Slick
documentation (http://slick.typesafe.com/doc/2.1.0/api/index.html#scala.slick.jdbc.meta.MTable).


Slick versus JDBC
This chapter and the previous one introduced two different ways of interacting with
SQL. In the previous chapter, we described how to use JDBC and build extensions
on top of JDBC to make it more usable. In this chapter, we introduced Slick, a library
that provides a functional interface on top of JDBC.
Which method should you choose? If you are starting a new project, you should
consider using Slick. Even if you spend a considerable amount of time writing
wrappers that sit on top of JDBC, it is unlikely that you will achieve the fluidity
that Slick offers.
If you are working on an existing project that makes extensive use of JDBC, I hope
that the previous chapter demonstrates that, with a little time and effort, you can
write JDBC wrappers that reduce the impedance between the imperative style of
JDBC and Scala's functional approach.

Summary
In the previous two chapters, we looked extensively at how to query relational
databases from Scala. In this chapter, you learned how to use Slick, a
"functional-relational" mapper that allows interacting with SQL databases as one would with
Scala collections.
In the next chapter, you will learn how to ingest data by querying web APIs.

References
To learn more about Slick, you can refer to the Slick documentation
(http://slick.typesafe.com/doc/2.1.0/) and its API documentation
(http://slick.typesafe.com/doc/2.1.0/api/#package).


Web APIs
Data scientists and data engineers get data from a variety of different sources. Often,
data might come as CSV files or database dumps. Sometimes, we have to obtain the
data through a web API.
An individual or organization sets up a web API to distribute data to programs
over the Internet (or an internal network). Unlike websites, where the data is
intended to be consumed by a web browser and shown to the user, the data
provided by a web API is agnostic to the type of program querying it. Web servers
serving HTML and web servers backing an API are queried in essentially the same
way: through HTTP requests.
We have already seen an example of a web API in Chapter 4, Parallel Collections and
Futures, where we queried the "Markit on demand" API for current stock prices.
In this chapter, we will explore how to interact with web APIs in more detail;
specifically, how to convert the data returned by the API to Scala objects and how to
add additional information to the request through HTTP headers (for authentication,
for instance).
The "Markit on demand" API returned the data formatted as an XML object, but
increasingly, new web APIs return data formatted as JSON. We will therefore focus
on JSON in this chapter, but the concepts will port easily to XML.
JSON is a language for formatting structured data. Many readers will have come
across JSON in the past, but if not, there is a brief introduction to the syntax and
concepts later on in this chapter. You will find it quite straightforward.
In this chapter, we will poll the GitHub API. GitHub has, over the last few years,
become the de facto tool for collaborating on open source software. It provides a
powerful, feature-rich API that gives programmatic access to nearly all the data
available through the website.


Let's get a taste of what we can do. Type api.github.com/users/odersky in your
web browser address bar. This will return the data offered by the API on a particular
user (Martin Odersky, in this case):
{
"login": "odersky",
"id": 795990,
...
"public_repos": 8,
"public_gists": 3,
"followers": 707,
"following": 0,
"created_at": "2011-05-18T14:51:21Z",
"updated_at": "2015-09-15T15:14:33Z"
}

The data is returned as a JSON object. This chapter is devoted to learning how to
access and parse this data programmatically. In Chapter 13, Web APIs with Play, you
will learn how to build your own web API.
The GitHub API is extensive and very well-documented. We
will explore some of the features of the API in this chapter.
To see the full extent of the API, visit the documentation
(https://developer.github.com/v3/).

A whirlwind tour of JSON
JSON is a format for transferring structured data. It is flexible, easy for computers
to generate and parse, and relatively readable for humans. It has become very
common as a means of persisting program data structures and transferring data
between programs.
JSON has four basic types: Numbers, Strings, Booleans, and null, and two
compound types: Arrays and Objects. Objects are unordered collections of key-value
pairs, where the key is always a string and the value can be any simple or compound
type. We have already seen a JSON object: the data returned by the API call
api.github.com/users/odersky.


Arrays are ordered lists of simple or compound types. For instance, type
api.github.com/users/odersky/repos in your browser to get an array of objects,
each representing a GitHub repository:
[
{
"id": 17335228,
"name": "dotty",
"full_name": "odersky/dotty",
...
},
{
"id": 15053153,
"name": "frontend",
"full_name": "odersky/frontend",
...
},
...
]

We can construct complex structures by nesting objects within other objects or
arrays. Nevertheless, most web APIs return JSON structures with no more than
one or two levels of nesting. If you are not familiar with JSON, I encourage you to
explore the GitHub API through your web browser.

Querying web APIs
The easiest way of querying a web API from Scala is to use Source.fromURL. We
have already used this in Chapter 4, Parallel Collections and Futures, when we queried
the "Markit on demand" API. Source.fromURL presents an interface similar to
Source.fromFile:
scala> import scala.io._
import scala.io._
scala> val response = Source.fromURL(
"https://api.github.com/users/odersky"
).mkString
response: String = {"login":"odersky","id":795990, ...


Source.fromURL returns an iterator over the characters of the response. We
materialize the iterator into a string using its .mkString method. We now have the
response as a Scala string. The next step is to parse the string with a JSON parser.
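Outside the REPL, it is worth closing the source and capturing I/O errors instead of letting them propagate. The following is a minimal sketch of one way to do this; the helper name fetch is an assumption made for illustration, not part of scala.io.

```scala
import scala.io.Source
import scala.util.Try

object Fetch {
  // Read the whole response as a String and close the underlying stream
  // afterwards. Any exception (bad URL, missing resource, network error)
  // is captured in the returned Try rather than thrown.
  def fetch(url: String): Try[String] = Try {
    val source = Source.fromURL(url)
    try source.mkString finally source.close()
  }
}
```

Calling Fetch.fetch("https://api.github.com/users/odersky") would then give a Success wrapping the JSON text, or a Failure describing what went wrong.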

JSON in Scala – an exercise in pattern matching
There are several libraries for manipulating JSON in Scala. We prefer json4s, but if
you are a die-hard fan of another JSON library, you should be able to readily adapt
the examples in this chapter. Let's create a build.sbt file with a dependency on
json4s:
// build.sbt
scalaVersion := "2.11.7"
libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11"

We can then import json4s into an SBT console session with:
scala> import org.json4s._
import org.json4s._
scala> import org.json4s.native.JsonMethods._
import org.json4s.native.JsonMethods._

Let's use json4s to parse the response to our GitHub API query:
scala> val jsonResponse = parse(response)
jsonResponse: org.json4s.JValue = JObject(List((login,JString(odersky)),(
id,JInt(795990)),...

The parse method takes a string (that contains well-formatted JSON) and converts it
to a JValue, a supertype for all json4s objects. The runtime type of the response to
this particular query is JObject, which is a json4s type representing a JSON object.
JObject is a wrapper around a List[JField], and JField represents an individual
key-value pair in the object. We can use extractors to access this list:
scala> val JObject(fields) = jsonResponse
fields: List[JField] = List((login,JString(odersky)),...


What's happened here? By writing val JObject(fields) = ..., we are telling Scala:
• The right-hand side has runtime type JObject
• Go into the JObject instance and bind the list of fields to the constant fields

Readers familiar with Python might recognize the similarity with tuple unpacking,
though Scala extractors are much more powerful and versatile. Extractors are used
extensively to extract Scala types from json4s types.
Pattern matching using case classes
How exactly does the Scala compiler know what to do with an
extractor such as:
val JObject(fields) = ...

JObject is a case class with the following constructor:
case class JObject(obj:List[JField])

Case classes all come with an extractor that reverses the constructor
exactly. Thus, writing val JObject(fields) will bind fields
to the obj attribute of the JObject. For further details on how
extractors work, read Appendix, Pattern Matching and Extractors.
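To make the mechanism concrete, here is a small sketch that does not depend on json4s. Point and Even are made-up names used only for illustration: the first shows the extractor a case class gets for free, the second a hand-written unapply method.

```scala
object ExtractorDemo {
  // A case class automatically gets a companion with an `unapply` method ...
  case class Point(x: Int, y: Int)

  // ... which is what lets a pattern appear on the left-hand side of a val.
  val Point(px, py) = Point(3, 4)

  // A hand-written extractor: any object with a suitable `unapply`
  // can be used in a pattern.
  object Even {
    def unapply(n: Int): Option[Int] =
      if (n % 2 == 0) Some(n / 2) else None
  }

  def describe(n: Int): String = n match {
    case Even(half) => s"$n is twice $half"
    case _          => s"$n is odd"
  }
}
```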

We have now extracted fields, a (plain old Scala) list of fields from the JObject. A
JField is a key-value pair, with the key being a string and value being a subtype of
JValue. Again, we can use extractors to extract the values in the field:
scala> val firstField = fields.head
firstField: JField = (login,JString(odersky))
scala> val JField(key, JString(value)) = firstField
key: String = login
value: String = odersky

We matched the right-hand side against the pattern JField(_, JString(_)),
binding the first element to key and the second to value. What happens if the
right-hand side does not match the pattern?
scala> val JField(key, JInt(value)) = firstField
scala.MatchError: (login,JString(odersky)) (of class scala.Tuple2)
...


The code throws a MatchError at runtime. These examples demonstrate the power
of nested pattern matching: in a single line, we managed to verify the type of
firstField, that its value has type JString, and we have bound the key and value
to the key and value variables, respectively. As another example, if we know that the
first field is the login field, we can both verify this and extract the value:
scala> val JField("login", JString(loginName)) = firstField
loginName: String = odersky

Notice how this style of programming is declarative rather than imperative: we
declare that we want a JField("login", JString(_)) variable on the right-hand
side. We then let the language figure out how to check the variable types. Pattern
matching is a recurring theme in functional languages.
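When the shape of a field is not known in advance, a match expression with a fallback case avoids the MatchError. The sketch below assumes json4s is on the classpath, as set up in build.sbt; the helper name stringValue is hypothetical.

```scala
import org.json4s._

object SafeMatch {
  // Return the value only when the field holds a JSON string;
  // any other kind of value falls through to None instead of throwing.
  def stringValue(field: JField): Option[String] = field match {
    case JField(_, JString(value)) => Some(value)
    case _                         => None
  }
}
```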
We can also use pattern matching in a for loop when looping over fields. When used
in a for loop, a pattern match defines a partial function: only elements that match
the pattern pass through the loop. This lets us filter the collection for elements that
match a pattern and also apply a transformation to these elements. For instance, we
can extract every string field in our fields list:
scala> for {
JField(key, JString(value)) <- fields
} yield (key -> value)
List[(String, String)] = List((login,odersky), (avatar_url,https://
avatars.githubusercontent.com/...

We can use this to search for specific fields. For instance, to extract the "followers"
field:
scala> val followersList = for {
JField("followers", JInt(followers)) <- fields
} yield followers
followersList: List[BigInt] = List(707)
scala> val followers = followersList.headOption
followers: Option[BigInt] = Some(707)

We first extracted all fields that matched the pattern JField("followers",
JInt(_)), returning the integer inside the JInt. As the source collection, fields, is
a list, this returns a list of integers. We then extract the first value from this list using
headOption, which returns the head of the list if the list has at least one element, or
None if the list is empty.
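The same lookup can be written in one step with collectFirst from the standard collections library, which applies a pattern to each element and stops at the first match. This is a sketch of an alternative, not the approach used in the rest of the chapter; the wrapper object and method name are made up.

```scala
import org.json4s._

object Followers {
  // Equivalent to the for comprehension followed by headOption:
  // Some(count) for the first "followers" field holding an integer,
  // None if there is no such field.
  def followers(fields: List[JField]): Option[BigInt] =
    fields.collectFirst { case JField("followers", JInt(n)) => n }
}
```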


We are not limited to extracting a single field at a time. For instance, to extract the
"id" and "login" fields together:
scala> {
for {
JField("login", JString(loginName)) <- fields
JField("id", JInt(id)) <- fields
} yield (id -> loginName)
}.headOption
Option[(BigInt, String)] = Some((795990,odersky))

Scala's pattern matching and extractors provide an extremely powerful
way of traversing the json4s tree and extracting the fields that we need.

JSON4S types
We have already discovered parts of json4s's type hierarchy: strings are wrapped
in JString objects, integers (or big integers) are wrapped in JInt, and so on. In this
section, we will take a step back and formalize the type structure and what Scala
types they extract to. These are the json4s runtime types:
• val JString(s) // => extracts to a String
• val JDouble(d) // => extracts to a Double
• val JDecimal(d) // => extracts to a BigDecimal
• val JInt(i) // => extracts to a BigInt
• val JBool(b) // => extracts to a Boolean
• val JObject(l) // => extracts to a List[JField]
• val JArray(l) // => extracts to a List[JValue]
• JNull // => represents a JSON null

All these types are subclasses of JValue. The compile-time result of parse is
JValue, which you normally need to cast to a concrete type using an extractor.
The last type in the hierarchy is JField, which represents a key-value pair. JField is
just a type alias for the (String, JValue) tuple. It is thus not a subtype of JValue.
We can extract the key and value using the following extractor:
val JField(key, JInt(value)) = ...
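As a way of seeing the whole hierarchy at once, the sketch below matches a JValue against each of the types listed above. The describe function is a made-up helper; the final catch-all case is there because json4s has further JValue subtypes (such as JNothing) that are not in the list.

```scala
import org.json4s._

object Describe {
  // One case per runtime type; each extractor binds the wrapped Scala value.
  def describe(json: JValue): String = json match {
    case JString(s)  => s"string: $s"
    case JDouble(d)  => s"double: $d"
    case JDecimal(d) => s"decimal: $d"
    case JInt(i)     => s"integer: $i"
    case JBool(b)    => s"boolean: $b"
    case JObject(fs) => s"object with ${fs.size} field(s)"
    case JArray(vs)  => s"array with ${vs.size} element(s)"
    case JNull       => "null"
    case other       => s"other: $other" // any subtype not listed above
  }
}
```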


Extracting fields using XPath
In the previous sections, you learned how to traverse JSON objects using extractors.
In this section, we will look at a different way of traversing JSON objects and
extracting specific fields: the XPath DSL (domain-specific language). XPath is a query
language for traversing tree-like structures. It was originally designed for addressing
specific nodes in an XML document, but it works just as well with JSON. We have
already seen an example of XPath syntax when we extracted the stock price from
the XML document returned by the "Markit on demand" API in Chapter 4, Parallel
Collections and Futures. We extracted the node with tag "LastPrice" using r \
"LastPrice". The \ operator was defined by the scala.xml package.
The json4s package exposes a similar DSL to extract fields from JObject instances.
For instance, we can extract the "login" field from the JSON object jsonResponse:
scala> jsonResponse \ "login"
org.json4s.JValue = JString(odersky)

This returns a JValue that we can transform into a Scala string using an extractor:
scala> val JString(loginName) = jsonResponse \ "login"
loginName: String = odersky

Notice the similarity between the XPath DSL and traversing a filesystem: we can
think of JObject instances as directories. Field names correspond to file names and
the field value to the content of the file. This is more evident for nested structures.
The users endpoint of the GitHub API does not have nested documents, so let's try
another endpoint. We will query the API for the repository corresponding to this
book: "https://api.github.com/repos/pbugnion/s4ds". The response has the
following structure:
{
"id": 42269470,
"name": "s4ds",
...
"owner": { "login": "pbugnion", "id": 1392879 ... }
...
}


Let's fetch this document and use the XPath syntax to extract the repository owner's
login name:
scala> val jsonResponse = parse(Source.fromURL(
"https://api.github.com/repos/pbugnion/s4ds"
).mkString)
jsonResponse: JValue = JObject(List((id,JInt(42269470)),
(name,JString(s4ds))...
scala> val JString(ownerLogin) = jsonResponse \ "owner" \ "login"
ownerLogin: String = pbugnion

Again, this is much like traversing a filesystem: jsonResponse \ "owner" returns
a JObject corresponding to the "owner" object. This JObject can, in turn, be
queried for the "login" field, returning the value JString(pbugnion) associated
with this key.
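If a key along the path is absent, the \ operator yields JNothing rather than throwing, so an extractor on the left-hand side of a val would fail with a MatchError. A more defensive version, sketched here with a hypothetical helper named ownerLogin and an abbreviated, hand-written response, matches on the result instead:

```scala
import org.json4s._
import org.json4s.native.JsonMethods._

object OwnerLogin {
  // Follow owner -> login, returning None if either step is absent
  // or the value found is not a JSON string.
  def ownerLogin(json: JValue): Option[String] =
    json \ "owner" \ "login" match {
      case JString(login) => Some(login)
      case _              => None
    }
}
```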
What if the API response is an array? The filesystem analogy breaks down
somewhat. Let's query the API endpoint listing Martin Odersky's repositories:
https://api.github.com/users/odersky/repos. The response is an array of
JSON objects, each of which represents a repository:
[
{
"id": 17335228,
"name": "dotty",
"size": 14699,
...
},
{
"id": 15053153,
"name": "frontend",
"size": 392,
...
},
{
"id": 2890092,
"name": "scala",
"size": 76133,
...
},
...
]


Let's fetch this and parse it as JSON:
scala> val jsonResponse = parse(Source.fromURL(
"https://api.github.com/users/odersky/repos"
).mkString)
jsonResponse: JValue = JArray(List(JObject(List((id,JInt(17335228)),
(name,JString(dotty)), ...

This returns a JArray. The XPath DSL works in the same way on a JArray as on
a JObject, but now, instead of returning a single JValue, it returns an array of
fields matching the path in every object in the array. Let's get the size of all Martin
Odersky's repositories:
scala> jsonResponse \ "size"
JValue = JArray(List(JInt(14699), JInt(392), ...

We now have a JArray of the values corresponding to the "size" field in every
repository. We can iterate over this array with a for comprehension and use
extractors to convert elements to Scala objects:
scala> for {
JInt(size) <- (jsonResponse \ "size")
} yield size
List[BigInt] = List(14699, 392, 76133, 32010, 98166, 1358, 144, 273)

Thus, combining extractors with the XPath DSL gives us powerful, complementary
tools to extract information from JSON objects.
There is much more to the XPath syntax than we have space to cover here, including
the ability to extract fields nested at any level of depth below the current root or
fields that match a predicate or a certain type. We find that well-designed APIs
obviate the need for many of these more powerful functions, but do consult the
documentation (json4s.org) to get an overview of what you can do.
In the next section, we will look at extracting JSON directly into case classes.

Extraction using case classes
In the previous sections, we extracted specific fields from the JSON response using
Scala extractors. We can do one better and extract full case classes.


When moving beyond the REPL, programming best practice dictates that we move
from json4s types to Scala objects as soon as possible rather than passing json4s
types around the program. Converting from json4s types to Scala types (or case
classes representing domain objects) is good practice because:
• It decouples the program from the structure of the data that we receive from
the API, something we have little control over.
• It improves type safety: a JObject is, as far as the compiler is concerned,
always a JObject, whatever fields it contains. By contrast, the compiler will
never mistake a User for a Repository.

Json4s lets us extract case classes directly from JObject instances, making writing
the layer converting JObject instances to custom types easy.

Let's define a case class representing a GitHub user:
scala> case class User(id:Long, login:String)
defined class User

To extract a case class from a JObject, we must first define an implicit Formats
value that defines how simple types should be serialized and deserialized. We will
use the default DefaultFormats provided with json4s:
scala> implicit val formats = DefaultFormats
formats: DefaultFormats.type = DefaultFormats$@750e685a

We can now extract instances of User. Let's do this for Martin Odersky:
scala> val url = "https://api.github.com/users/odersky"
url: String = https://api.github.com/users/odersky
scala> val jsonResponse = parse(Source.fromURL(url).mkString)
jsonResponse: JValue = JObject(List((login,JString(odersky)), ...
scala> jsonResponse.extract[User]
User = User(795990,odersky)

This works as long as the object is well-formatted. The extract method looks
for fields in the JObject that match the attributes of User. In this case, extract
will note that the JObject contains the "login": "odersky" field and that
JString("odersky") can be converted to a Scala string, so it binds "odersky"
to the login attribute in User.


What if the attribute names differ from the field names in the JSON object? We must
first transform the object to have the correct fields. For instance, let's rename the
login attribute to userName in our User class:
scala> case class User(id:Long, userName:String)
defined class User

If we try to use extract[User] on jsonResponse, we will get a mapping error
because the deserializer cannot find a userName field in the response. We can fix this
using the transformField method on jsonResponse to rename the login field:
scala> jsonResponse.transformField {
case("login", n) => "userName" -> n
}.extract[User]
User = User(795990,odersky)

What about optional fields? Let's assume that the JSON object returned by the GitHub
API does not always contain the login field. We could represent this in our object
model by giving the login parameter the type Option[String] rather than String:
scala> case class User(id:Long, login:Option[String])
defined class User

This works just as you would expect. When the response contains a non-null login
field, calling extract[User] will deserialize it to Some(value), and when it's
missing or JNull, it will produce None:
scala> jsonResponse.extract[User]
User = User(795990,Some(odersky))
scala> jsonResponse.removeField {
case(k, _) => k == "login" // remove the "login" field
}.extract[User]
User = User(795990,None)
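The same mechanism extends to arrays: extracting to a List of case classes converts every element of a JArray. The following sketch assumes a hypothetical Repo class and a short hand-written response (not real API output) in which the language field is null for one repository.

```scala
import org.json4s._
import org.json4s.native.JsonMethods._

object RepoExtraction {
  implicit val formats: Formats = DefaultFormats

  // Hypothetical domain class: `language` may be null or absent.
  case class Repo(name: String, size: Long, language: Option[String])

  // Hand-written sample standing in for the API response.
  val json = parse("""[
    {"name": "dotty", "size": 14699, "language": "Scala"},
    {"name": "frontend", "size": 392, "language": null}
  ]""")

  // Each JObject in the array becomes one Repo.
  val repos: List[Repo] = json.extract[List[Repo]]
}
```

The case class is defined inside an object rather than inside a method, mirroring the programs in this chapter.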

Let's wrap this up in a small program. The program will take a single command-line
argument, the user's login name, extract a User instance, and print it to screen:
// GitHubUser.scala
import scala.io._
import org.json4s._

import org.json4s.native.JsonMethods._
object GitHubUser {
implicit val formats = DefaultFormats
case class User(id:Long, userName:String)
/** Query the GitHub API corresponding to `url`
* and convert the response to a User.
*/
def fetchUserFromUrl(url:String):User = {
val response = Source.fromURL(url).mkString
val jsonResponse = parse(response)
extractUser(jsonResponse)
}
/** Helper method for transforming the response to a User */
def extractUser(obj:JValue):User = {
val transformedObject = obj.transformField {
case ("login", name) => ("userName", name)
}
transformedObject.extract[User]
}
def main(args:Array[String]) {
// Extract username from argument list
val name = args.headOption.getOrElse {
throw new IllegalArgumentException(
"Missing command line argument for user.")
}
val user = fetchUserFromUrl(
s"https://api.github.com/users/$name")
println(s"** Extracted for $name:")
println()
println(user)
}
}


We can run this from an SBT console as follows:
$ sbt
> runMain GitHubUser pbugnion
** Extracted for pbugnion:
User(1392879,pbugnion)

Concurrency and exception handling with futures
While the program that we wrote in the previous section works, it is very brittle. It
will crash if we enter a non-existent user name or the GitHub API changes or returns
a badly-formatted response. We need to make it fault-tolerant.
What if we also wanted to fetch multiple users? The program, as written, is entirely
single-threaded. The fetchUserFromUrl method fires a call to the API and blocks
until the API sends data back. A better solution would be to fetch multiple users
in parallel.
As you learned in Chapter 4, Parallel Collections and Futures, there are two
straightforward ways to implement both fault tolerance and parallel execution:
we can either put all the user names in a parallel collection and wrap the code for
fetching and extracting the user in a Try block or we can wrap each query in a future.
When querying web APIs, it is sometimes the case that a request can take abnormally
long. To prevent this from blocking the other threads, it is preferable to rely on
futures rather than parallel collections for concurrency, as we saw in the Parallel
collection or Future? section at the end of Chapter 4, Parallel Collections and Futures.
Let's rewrite the code from the previous section to handle fetching multiple users
concurrently in a fault-tolerant manner. We will change the fetchUserFromUrl
method to query the API asynchronously. This is not terribly different from Chapter
4, Parallel Collections and Futures, in which we queried the "Markit on demand" API:
// GitHubUserConcurrent.scala
import scala.io._
import scala.concurrent._
import scala.concurrent.duration._
import ExecutionContext.Implicits.global
import scala.util._

import org.json4s._
import org.json4s.native.JsonMethods._
object GitHubUserConcurrent {
implicit val formats = DefaultFormats
case class User(id:Long, userName:String)
// Fetch and extract the `User` corresponding to `url`
def fetchUserFromUrl(url:String):Future[User] = {
val response = Future { Source.fromURL(url).mkString }
val parsedResponse = response.map { r => parse(r) }
parsedResponse.map { extractUser }
}
// Helper method for extracting a user from a JObject
def extractUser(jsonResponse:JValue):User = {
val o = jsonResponse.transformField {
case ("login", name) => ("userName", name)
}
o.extract[User]
}
def main(args:Array[String]) {
val names = args.toList
// Loop over each username and send a request to the API
// for that user
val name2User = for {
name <- names
url = s"https://api.github.com/users/$name"
user = fetchUserFromUrl(url)
} yield name -> user
// callback function
name2User.foreach { case(name, user) =>
user.onComplete {
case Success(u) => println(s" ** Extracted for $name: $u")
case Failure(e) => println(s" ** Error fetching $name: $e")
}
}
// Block until all the calls have finished.
Await.ready(Future.sequence(name2User.map { _._2 }), 1 minute)
}
}

Let's run the code through sbt:
$ sbt
> runMain GitHubUserConcurrent odersky derekwyatt not-a-user-675
** Error fetching not-a-user-675: java.io.FileNotFoundException:
https://api.github.com/users/not-a-user-675
** Extracted for odersky: User(795990,odersky)
** Extracted for derekwyatt: User(62324,derekwyatt)

The code itself should be straightforward. All the concepts used here have been
explored in this chapter or in Chapter 4, Parallel Collections and Futures, apart from
the last line:
Await.ready(Future.sequence(name2User.map { _._2 }), 1 minute)

This statement tells the program to wait until all futures in our list have been
completed. Await.ready(..., 1 minute) takes a future as its first argument and
blocks execution until this future completes. The second argument is a time-out on this
future. The only catch is that we need to pass a single future to Await rather than a
list of futures. We can use Future.sequence to merge a collection of futures into
a single future. This future completes successfully once all the futures in the
sequence have completed successfully; if any of them fails, the merged future fails too.
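One caveat is that a failed element makes the future returned by Future.sequence fail as well, possibly before the remaining futures have finished. If every outcome is needed, each future can first be converted into one that always succeeds with a Try. The sketch below uses only the standard library; lift and run are made-up names, and the three futures stand in for API calls.

```scala
import scala.concurrent._
import scala.concurrent.duration._
import scala.util.{Try, Success, Failure}
import ExecutionContext.Implicits.global

object SequenceDemo {
  // Turn a future that may fail into one that always succeeds with a Try.
  def lift[T](f: Future[T]): Future[Try[T]] =
    f.map { v => Success(v): Try[T] }.recover { case e => Failure(e) }

  def run(): List[Try[Int]] = {
    val futures: List[Future[Int]] = List(
      Future { 1 },
      Future { throw new IllegalStateException("boom") },
      Future { 3 }
    )
    // Because every lifted future succeeds, the merged future completes
    // only after all of them have finished, whatever their outcome.
    val all: Future[List[Try[Int]]] = Future.sequence(futures.map { f => lift(f) })
    Await.result(all, 10.seconds)
  }
}
```

Await.result is used here instead of Await.ready because we want the values themselves; Await.ready only waits for completion.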

Authentication – adding HTTP headers
So far, we have been using the GitHub API without authentication. This limits us to
sixty requests per hour. Now that we can query the API in parallel, we could exceed
this limit in seconds.
Fortunately, GitHub is much more generous if you authenticate when you query
the API. The limit increases to 5,000 requests per hour. You must have a GitHub
user account to authenticate, so go ahead and create one now if you need to. After
creating an account, navigate to https://github.com/settings/tokens and click
on the Generate new token button. Accept the default settings and enter a token
description. A long hexadecimal number should then appear on the screen. Copy the
token for now.


HTTP – a whirlwind overview
Before using our newly generated token, let's take a few minutes to review how
HTTP works.
HTTP is a protocol for transferring information between different computers. It is
the protocol that we have been using throughout the chapter, though Scala hid the
details from us in the call to Source.fromURL. It is also the protocol that you use
when you point your web browser to a website, for instance.
In HTTP, a computer will typically make a request to a remote server, and the server
will send back a response. Requests contain a verb, which defines the type of request,
and a URL identifying a resource. For instance, when we typed
api.github.com/users/pbugnion in our browsers, this was translated into a GET (the verb) request
for the users/pbugnion resource. All the calls that we have made so far have
been GET requests. You might use a different type of request, for instance, a POST
request, to modify (rather than just view) some content on GitHub.
Besides the verb and resource, there are two more parts to an HTTP request:
• The headers include metadata about the request, such as the expected format
and character set of the response or the authentication credentials. Headers
are just a list of key-value pairs. We will pass the OAuth token that we have
just generated to the API using the Authorization header. This Wikipedia
article lists commonly used header fields: en.wikipedia.org/wiki/List_of_HTTP_header_fields.
• The request body is not used in GET requests but becomes important for
requests that modify the resource they query. For instance, if I wanted to create
a new repository on GitHub programmatically, I would send a POST request
to /user/repos. The POST body would then be a JSON object describing
the new repository. We will not use the request body in this chapter.
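To tie the four parts together, here is a simplified sketch that models a request as a case class and renders it in the HTTP/1.1 text format. The Request class and render function are illustrative only; a real client adds further headers and handles encoding, and the token value is a placeholder.

```scala
object HttpRequestSketch {
  // The four parts of a request: verb, resource, headers, and body.
  case class Request(
    verb: String,
    resource: String,
    headers: List[(String, String)],
    body: Option[String]
  )

  // Render roughly what is sent over the wire.
  def render(r: Request): String = {
    val requestLine = s"${r.verb} ${r.resource} HTTP/1.1"
    val headerLines = r.headers.map { case (k, v) => s"$k: $v" }
    // A blank line separates the headers from the (optional) body.
    (requestLine :: headerLines).mkString("\r\n") + "\r\n\r\n" + r.body.getOrElse("")
  }
}
```

For a GET request to the users endpoint, the headers list might hold "Host" -> "api.github.com" and "Authorization" -> "token <your-token>", with the body left as None.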

Adding headers to HTTP requests in Scala
We will pass the OAuth token as a header with our HTTP request. Unfortunately, the
Source.fromURL method is not particularly suited to adding headers when creating
a GET request. We will, instead, use a library, scalaj-http.
Let's add scalaj-http to the dependencies in our build.sbt:
libraryDependencies += "org.scalaj" %% "scalaj-http" % "1.1.6"

We can now import scalaj-http:
scala> import scalaj.http._
import scalaj.http._

We start by creating an HttpRequest object:
scala> val request = Http("https://api.github.com/users/pbugnion")
request: scalaj.http.HttpRequest = HttpRequest(api.github.com/users/pbugnion,GET,...

We can now add the authorization header to the request (add your own token
string here):
scala> val authorizedRequest = request.header("Authorization", "token e836389ce ...")
authorizedRequest: scalaj.http.HttpRequest = HttpRequest(api.github.com/users/pbugnion,GET,...

The .header method returns a new HttpRequest instance.
It does not modify the request in place. Thus, just calling
request.header(...) does not actually add the header to
request itself, which can be a source of confusion.

Let's fire the request. We do this through the request's asString method, which
queries the API, fetches the response, and parses it as a Scala String:
scala> val response = authorizedRequest.asString
response: scalaj.http.HttpResponse[String] = HttpResponse({"login":"pbugnion",...

The response is made up of three components:
• The status code, which should be 200 for a successful request:
scala> response.code
Int = 200

• The response body, which is the part that we are interested in:
scala> response.body
String = {"login":"pbugnion","id":1392879,...

• The response headers (metadata about the response):
scala> response.headers
Map[String,String] = Map(Access-Control-Allow-Credentials -> true,
...

To verify that the authorization was successful, query the X-RateLimit-Limit header:
scala> response.headers("X-RateLimit-Limit")
String = 5000

This value is the maximum number of requests per hour that you can make to the
GitHub API as an authenticated user. Unauthenticated requests are subject to a much
lower limit, which is counted per IP address.
Now that we have some understanding of how to add authentication to GET
requests, let's modify our script for fetching users to use the OAuth token for
authentication. We first need to import scalaj-http:
import scalaj.http._

Injecting the value of the token into the code can be somewhat tricky. You might
be tempted to hardcode it, but this prohibits you from sharing the code. A better
solution is to use an environment variable. Environment variables are a set of variables
present in your terminal session that are accessible to all processes running in that
session. To get a list of the current environment variables, type the following on
Linux or Mac OS:
$ env
HOME=/Users/pascal
SHELL=/bin/zsh
...

On Windows, the equivalent command is SET. Let's add the GitHub token to the
environment. Use the following command on Mac OS or Linux:
$ export GHTOKEN="e83638..." # enter your token here

On Windows, use the following command (without quotes, which would otherwise
become part of the value):
> SET GHTOKEN=e83638...

If you were to reuse this environment variable across many projects, entering
export GHTOKEN=... in the shell for every session gets old quite quickly. A
more permanent solution is to add export GHTOKEN="e83638…" to your shell
configuration file (your .bashrc file if you are using Bash). This is safe provided
your .bashrc is readable by the user only. Any new shell session will have access
to the GHTOKEN environment variable.

We can access environment variables from a Scala program using sys.env, which
returns a Map[String, String] of the variables. Let's add a lazy val token to our
class, containing the token value:
lazy val token:Option[String] = sys.env.get("GHTOKEN") orElse {
  println("No token found: continuing without authentication")
  None
}

Now that we have the token, the only part of the code that must change, to add
authentication, is the fetchUserFromUrl method:
def fetchUserFromUrl(url:String):Future[User] = {
  val baseRequest = Http(url)
  val request = token match {
    case Some(t) => baseRequest.header(
      "Authorization", s"token $t")
    case None => baseRequest
  }
  val response = Future {
    request.asString.body
  }
  val parsedResponse = response.map { r => parse(r) }
  parsedResponse.map(extractUser)
}
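
The optional token can also be handled without a pattern match. The following is a small sketch using only the standard library: tokenFrom and authHeaders are hypothetical helper names, and taking the environment as a plain Map is an assumption made so that the logic can be tested without touching the real environment.

```scala
object AuthHeaderSketch {
  /** Look the token up in any map, so tests can pass a fake environment. */
  def tokenFrom(env: Map[String, String]): Option[String] =
    env.get("GHTOKEN").filter(_.nonEmpty)

  /** Headers to attach to a request: empty when there is no token. */
  def authHeaders(token: Option[String]): Map[String, String] =
    token.toList.map(t => "Authorization" -> s"token $t").toMap
}
```

In real code, you would call AuthHeaderSketch.tokenFrom(sys.env) and then add each resulting key-value pair to the request with .header, as in fetchUserFromUrl.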

We can also check that the response's status code is 200 before parsing the body,
which gives clearer error messages when a request fails. As this is straightforward,
it is left as an exercise.
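
One possible starting point for a status check is sketched below. It is deliberately independent of scalaj-http: bodyIfOk takes the status code and body as plain values (which you would read from response.code and response.body), and HttpStatusException is a hypothetical exception type introduced here for illustration.

```scala
import scala.util.{Failure, Success, Try}

object StatusCheckSketch {
  /** Hypothetical exception type carrying the offending status code. */
  final case class HttpStatusException(code: Int, body: String)
    extends Exception(s"Unexpected HTTP status $code: ${body.take(200)}")

  /** Return the body for a 200 response, and a descriptive failure otherwise. */
  def bodyIfOk(code: Int, body: String): Try[String] =
    if (code == 200) Success(body)
    else Failure(HttpStatusException(code, body))
}
```

Calling .get on the result inside the Future block would make the future fail with the descriptive exception instead of failing later during JSON parsing.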

Summary
In this chapter, you learned how to query the GitHub API, converting the response
to Scala objects. Of course, merely printing results to screen is not terribly interesting.
In the next chapter, we will look at the next step of the data ingestion process:
storing data in a database. We will query the GitHub API and store the results in a
MongoDB database.
In Chapter 13, Web APIs with Play, we will look at building our own simple web API.

References
The GitHub API, with its extensive documentation, is a good place to explore how a
rich API is constructed. It has a Getting Started section that is worth reading:
https://developer.github.com/guides/getting-started/

Of course, this is not specific to Scala: it uses cURL to query the API.
Read the documentation (http://json4s.org) and source code
(https://github.com/json4s/json4s) for json4s for a complete reference. There are many
parts of this package that we have not explored, in particular, how to build JSON from Scala.

Scala and MongoDB
In Chapter 5, Scala and SQL through JDBC, and Chapter 6, Slick – A Functional Interface
for SQL, you learned how to insert, transform, and read data in SQL databases. These
databases remain (and are likely to remain) very popular in data science, but NoSQL
databases are emerging as strong contenders.
The needs for data storage are growing rapidly. Companies are producing and
storing more data points in the hope of acquiring better business intelligence. They
are also building increasingly large teams of data scientists, who all need to access
the data store. Maintaining constant access time as the data load increases requires
taking advantage of parallel architectures: we need to distribute the database across
several computers so that, as the load on the server increases, we can just add more
machines to improve throughput.
In SQL databases, the data is naturally split across different tables. Complex
queries necessitate joining across several tables. This makes partitioning the database
across different computers difficult. NoSQL databases emerged to fill this gap.
In this chapter, you will learn to interact with MongoDB, an open source database that
offers high performance and can be distributed easily. MongoDB is one of the more
popular NoSQL databases with a strong community. It offers a reasonable balance of
speed and flexibility, making it a natural alternative to SQL for storing large datasets
with uncertain query requirements, as might happen in data science. Many of the
concepts and recipes in this chapter will apply to other NoSQL databases.

MongoDB
MongoDB is a document-oriented database. It contains collections of documents. Each
document is a JSON-like object:
{
  _id: ObjectId("558e846730044ede70743be9"),
  name: "Gandalf",
  age: 2000,
  pseudonyms: [ "Mithrandir", "Olorin", "Greyhame" ],
  possessions: [
    { name: "Glamdring", type: "sword" },
    { name: "Narya", type: "ring" }
  ]
}

Just as in JSON, a document is a set of key-value pairs, where the values can be
strings, numbers, Booleans, dates, arrays, or subdocuments. Documents are grouped
in collections, and collections are grouped in databases.
You might be thinking that this is not very different from SQL: a document is similar
to a row and a collection corresponds to a table. There are two important differences:
• The values in documents can be simple values, arrays, subdocuments, or
arrays of subdocuments. This lets us encode one-to-many and many-to-many
relationships in a single collection. For instance, consider the wizard collection.
In SQL, if we wanted to store pseudonyms for each wizard, we would have
to use a separate wizard2pseudonym table with a row for each
wizard-pseudonym pair. In MongoDB, we can just use an array. In practice, this means
that we can normally use a single document to represent an entity (a customer,
transaction, or wizard, for instance). In SQL, we would normally have to join
across several tables to retrieve all the information on a specific entity.

• MongoDB is schemaless. Documents in a collection can have varying sets of
fields with different types for the same field across different documents. In
practice, MongoDB collections have a loose schema enforced either client
side or by convention: most documents will have a subset of the same fields,
and fields will, in general, contain the same data type. Having a flexible
schema makes adjusting the data structure easy as there is no need for
time-consuming ALTER TABLE statements. The downside is that there is no
easy way of enforcing our flexible schema on the database side.

Note the _id field: this is a unique key. MongoDB will generate one automatically if
we insert a document without an _id field.
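
The difference between the two data models can be illustrated in plain Scala, without any database. In the sketch below, all class names (WizardRow, PseudonymRow, WizardDocument) are hypothetical: the first two mimic rows in two SQL tables, the third mimics a MongoDB document, and join emulates the work that a SQL join would do to rebuild one document per wizard.

```scala
object DocumentModelSketch {
  // Relational style: pseudonyms live in a separate table and must be joined.
  final case class WizardRow(id: Int, name: String, age: Int)
  final case class PseudonymRow(wizardId: Int, pseudonym: String)

  // Document style: the one-to-many relationship is embedded as an array.
  final case class WizardDocument(name: String, age: Int, pseudonyms: List[String])

  /** Emulate the SQL join needed to rebuild one document per wizard. */
  def join(
      wizards: List[WizardRow],
      pseudonyms: List[PseudonymRow]): List[WizardDocument] = {
    val byWizard = pseudonyms.groupBy(_.wizardId)
    wizards.map { w =>
      val names = byWizard.getOrElse(w.id, Nil).map(_.pseudonym)
      WizardDocument(w.name, w.age, names)
    }
  }
}
```

With the document model, the result of join is what is stored directly, so reading a wizard back requires no join at all.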

This chapter gives recipes for interacting with a MongoDB database from Scala,
including maintaining type safety and best practices. We will not cover advanced
MongoDB functionality (such as aggregation or distributing the database). We
will assume that you have MongoDB installed on your computer
(http://docs.mongodb.org/manual/installation/). It will also help to have a very basic
knowledge of MongoDB (we discuss some references at the end of this chapter, but
any basic tutorial available online will be sufficient for the needs of this chapter).

Connecting to MongoDB with Casbah
The official MongoDB driver for Scala is called Casbah. Rather than a fully-fledged
driver, Casbah wraps the Java Mongo driver, providing a more functional interface.
There are other MongoDB drivers for Scala, which we will discuss briefly at the end
of this chapter. For now, we will stick to Casbah.
Let's start by adding Casbah to our build.sbt file:
scalaVersion := "2.11.7"
libraryDependencies += "org.mongodb" %% "casbah" % "3.0.0"

Casbah also expects slf4j bindings (slf4j is a logging facade for the JVM) to be
available, so let's also add slf4j-nop:
libraryDependencies += "org.slf4j" % "slf4j-nop" % "1.7.12"

We can now start an SBT console and import Casbah in the Scala shell:
$ sbt console
scala> import com.mongodb.casbah.Imports._
import com.mongodb.casbah.Imports._
scala> val client = MongoClient()
client: com.mongodb.casbah.MongoClient = com.mongodb.casbah.MongoClient@4ac17318

This connects to a MongoDB server on the default host (localhost) and default
port (27017). To connect to a different server, pass the host and port as arguments to
MongoClient:
scala> val client = MongoClient("192.168.1.1", 27017)
client: com.mongodb.casbah.MongoClient = com.mongodb.casbah.MongoClient@584c6b02

Note that creating a client is a lazy operation: it does not attempt to connect to the
server until it needs to. This means that if you enter the wrong URL or password,
you will not know about it until you try and access documents on the server.
Once we have a connection to the server, accessing a database is as simple as using
the client's apply method. For instance, to access the github database:
scala> val db = client("github")
db: com.mongodb.casbah.MongoDB = DB{name='github'}

We can then access the "users" collection:
scala> val coll = db("users")
coll: com.mongodb.casbah.MongoCollection = users

Connecting with authentication
MongoDB supports several different authentication mechanisms. In this section,
we will assume that your server is using the SCRAM-SHA-1 mechanism, but you
should find adapting the code to a different type of authentication straightforward.
The easiest way of authenticating is to pass username and password in the URI
when connecting:
scala> val username = "USER"
username: String = USER
scala> val password = "PASSWORD"
password: String = PASSWORD
scala> val uri = MongoClientURI(
  s"mongodb://$username:$password@localhost/?authMechanism=SCRAM-SHA-1"
)
uri: MongoClientURI = mongodb://USER:PASSWORD@localhost/?authMechanism=SCRAM-SHA-1
scala> val mongoClient = MongoClient(uri)
mongoClient: com.mongodb.casbah.MongoClient = com.mongodb.casbah.MongoClient@4ac17318

In general, you will not want to put your password in plain text in the code.
You can either prompt for a password on the command line or pass it through
environment variables, as we did with the GitHub OAuth token in Chapter 7, Web
APIs. The following code snippet demonstrates how to pass credentials through
the environment:
// Credentials.scala
import com.mongodb.casbah.Imports._

object Credentials extends App {
  // Fail early, with a clear message, if a credential is missing.
  val username = sys.env.getOrElse("MONGOUSER",
    throw new IllegalStateException(
      "Need a MONGOUSER variable in the environment")
  )
  val password = sys.env.getOrElse("MONGOPASSWORD",
    throw new IllegalStateException(
      "Need a MONGOPASSWORD variable in the environment")
  )
  val host = "127.0.0.1"
  val port = 27017
  // The URI must be a single string literal.
  val uri = s"mongodb://$username:$password@$host:$port/?authMechanism=SCRAM-SHA-1"
  val client = MongoClient(MongoClientURI(uri))
}

You can run it through SBT as follows:
$ MONGOUSER="pascal" MONGOPASSWORD="scalarulez" sbt
> runMain Credentials

Inserting documents
Let's insert some documents into our newly created database. We want to store
information about GitHub users, using the following document structure:
{
  _id: <MongoDB object ID>,
  login: "pbugnion",
  github_id: 1392879,
  repos: [
    {
      name: "scikit-monaco",
      id: 14821551,
      language: "Python"
    },
    {
      name: "contactpp",
      id: 20448325,
      language: "Python"
    }
  ]
}

Casbah provides a DBObject class to represent MongoDB documents (and
subdocuments) in Scala. Let's start by creating a DBObject instance for each
repository subdocument:
scala> val repo1 = DBObject("name" -> "scikit-monaco", "id" -> 14821551,
"language" -> "Python")
repo1: DBObject = { "name" : "scikit-monaco" , "id" : 14821551,
"language" : "Python"}

As you can see, a DBObject is just a list of key-value pairs, where the keys are
strings. The values have compile-time type AnyRef, but Casbah will fail (at runtime)
if you try to add a value that cannot be serialized.
We can also create DBObject instances from lists of key-value pairs directly. This is
particularly useful when converting from a Scala map to a DBObject:
scala> val fields:Map[String, Any] = Map(
  "name" -> "contactpp",
  "id" -> 20448325,
  "language" -> "Python"
)
fields: Map[String, Any] = Map(name -> contactpp, id -> 20448325, language -> Python)
scala> val repo2 = DBObject(fields.toList)
repo2: DBObject = { "name" : "contactpp" , "id" : 20448325, "language" : "Python"}

The DBObject class provides many of the same methods as a map. For instance, we
can address individual fields:
scala> repo1("name")
AnyRef = scikit-monaco

We can construct a new object by adding a field to an existing object:
scala> repo1 + ("fork" -> true)
mutable.Map[String,Any] = { "name" : "scikit-monaco" , "id" : 14821551,
"language" : "Python", "fork" : true}

Note the return type: mutable.Map[String,Any]. Rather than implementing
methods such as + directly, Casbah adds them to DBObject by providing an implicit
conversion to and from mutable.Map.
New DBObject instances can also be created by concatenating two existing instances:
scala> repo1 ++ DBObject(
"locs" -> 6342,
"description" -> "Python library for Monte Carlo integration"
)
DBObject = { "name" : "scikit-monaco" , "id" : 14821551, "language" :
"Python", "locs" : 6342 , "description" : "Python library for Monte Carlo
integration"}

DBObject instances can then be inserted into a collection using the += operator. Let's
insert our first document into the users collection:
scala> val userDocument = DBObject(
  "login" -> "pbugnion",
  "github_id" -> 1392879,
  "repos" -> List(repo1, repo2)
)
userDocument: DBObject = { "login" : "pbugnion" , ... }
scala> val coll = MongoClient()("github")("users")
coll: com.mongodb.casbah.MongoCollection = users
scala> coll += userDocument
com.mongodb.casbah.TypeImports.WriteResult = WriteResult{, n=0,
updateOfExisting=false, upsertedId=null}

A database containing a single document is a bit boring, so let's add a few more
documents queried directly from the GitHub API. You learned how to query the
GitHub API in the previous chapter, so we won't dwell on how to do this here.
In the code examples for this chapter, we have provided a class called
GitHubUserIterator that queries the GitHub API (specifically the /users
endpoint) for user documents, converts them to a case class, and offers them as an
iterator. You will find the class in the code examples for this chapter (available on
GitHub at https://github.com/pbugnion/s4ds/tree/master/chap08) in the
GitHubUserIterator.scala file. The easiest way to have access to the class is to
open an SBT console in the directory of the code examples for this chapter. The API
then fetches users in increasing order of their login ID:
scala> val it = new GitHubUserIterator
it: GitHubUserIterator = non-empty iterator
scala> it.next // Fetch the first user
User = User(mojombo,1,List(Repo(...

GitHubUserIterator returns instances of the User case class, defined as follows:
// User.scala
case class User(login:String, id:Long, repos:List[Repo])
// Repo.scala
case class Repo(name:String, id:Long, language:String)

Let's write a short program to fetch 500 users and insert them into the MongoDB
database. We will need to authenticate with the GitHub API to retrieve these users.
The constructor for GitHubUserIterator takes the GitHub OAuth token as an
optional argument. We will inject the token through the environment, as we did in
the previous chapter.

We first give the entire code listing before breaking it down—if you are typing
this out, you will need to copy GitHubUserIterator.scala from the code
examples for this chapter to the directory in which you are running this to access
the GitHubUserIterator class. The class relies on scalaj-http and json4s, so
either copy the build.sbt file from the code examples or specify those packages as
dependencies in your build.sbt file.
// InsertUsers.scala
import com.mongodb.casbah.Imports._

object InsertUsers {

  /** Read the GitHub token from the environment. */
  lazy val token:Option[String] = sys.env.get("GHTOKEN") orElse {
    println("No token found: continuing without authentication")
    None
  }

  /** Transform a Repo instance to a DBObject */
  def repoToDBObject(repo:Repo):DBObject = DBObject(
    "github_id" -> repo.id,
    "name" -> repo.name,
    "language" -> repo.language
  )

  /** Transform a User instance to a DBObject */
  def userToDBObject(user:User):DBObject = DBObject(
    "github_id" -> user.id,
    "login" -> user.login,
    "repos" -> user.repos.map(repoToDBObject)
  )

  /** Insert a list of users into a collection. */
  def insertUsers(coll:MongoCollection)(users:Iterable[User]) {
    users.foreach { user => coll += userToDBObject(user) }
  }

  /** Fetch users from GitHub and pass them to `inserter` */
  def ingestUsers(nusers:Int)(inserter:Iterable[User] => Unit) {
    val it = new GitHubUserIterator(token)
    val users = it.take(nusers).toList
    inserter(users)
  }

  def main(args:Array[String]) {
    val coll = MongoClient()("github")("users")
    val nusers = 500
    coll.dropCollection()
    val inserter = insertUsers(coll)_
    // Argument order follows the signature: the count first, then the inserter.
    ingestUsers(nusers)(inserter)
  }
}

Before diving into the details of how this program works, let's run it through SBT.
You will want to query the API with authentication to avoid hitting the rate limit.
Recall that we need to set the GHTOKEN environment variable:
$ GHTOKEN="e83638..." sbt
> runMain InsertUsers

The program will take about five minutes to run (depending on your Internet
connection). To verify that the program works, we can query the number of
documents in the users collection of the github database:
$ mongo github --quiet --eval "db.users.count()"
500

Let's break the code down. We first load the OAuth token to authenticate with the
GitHub API. The token is stored as an environment variable, GHTOKEN. The token
variable is a lazy val, so the token is loaded only when we formulate the first
request to the API. We have already used this pattern in Chapter 7, Web APIs.
We then define two methods to transform from classes in the domain model to
DBObject instances:
def repoToDBObject(repo:Repo):DBObject = ...
def userToDBObject(user:User):DBObject = ...

Armed with these two methods, we can add users to our MongoDB collection easily:
def insertUsers(coll:MongoCollection)(users:Iterable[User]) {
users.foreach { user => coll += userToDBObject(user) }
}

We used currying to split the arguments of insertUsers. This lets us use
insertUsers as a function factory:
val inserter = insertUsers(coll)_

This creates a new function, inserter, with signature Iterable[User] => Unit that
inserts users into coll. To see how this might come in useful, let's write a function
to wrap the whole data ingestion process. This is how a first attempt at this function
could look:
def ingestUsers(nusers:Int)(inserter:Iterable[User] => Unit) {
val it = new GitHubUserIterator(token)
val users = it.take(nusers).toList
inserter(users)
}

Notice how ingestUsers takes, as its second argument, a function that specifies how
the list of users is inserted into the database. This function encapsulates the
entire code specific to insertion into a MongoDB collection. If we decide, at some
later date, that we hate MongoDB and must insert the documents into a SQL
database or write them to a flat file, all we need to do is pass a different inserter
function to ingestUsers. The rest of the code remains the same. This demonstrates
the increased flexibility afforded by using higher-order functions: we can easily build
a framework and let the client code plug in the components that it needs.
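
To see the plug-in idea in isolation, here is a minimal sketch that swaps the MongoDB inserter for one that writes CSV lines to an in-memory buffer. Everything in it is a stand-in introduced for illustration: the simplified User class, the ingest function (which takes any iterator instead of GitHubUserIterator), and csvInserter.

```scala
import scala.collection.mutable.ArrayBuffer

object InserterSketch {
  /** Simplified stand-in for the chapter's User class. */
  final case class User(login: String, id: Long)

  /** Same shape as ingestUsers, but fed by any iterator instead of the GitHub API. */
  def ingest(source: Iterator[User], nusers: Int)(
      inserter: Iterable[User] => Unit): Unit =
    inserter(source.take(nusers).toList)

  /** A stand-in inserter that appends one CSV line per user to a buffer. */
  def csvInserter(sink: ArrayBuffer[String])(users: Iterable[User]): Unit =
    users.foreach { user => sink += s"${user.id},${user.login}" }
}
```

Because csvInserter is curried in the same way as insertUsers, csvInserter(sink) has the type Iterable[User] => Unit and can be passed wherever the MongoDB inserter was used.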
The ingestUsers method, as defined previously, has one problem: if the nusers value
is large, it will consume a lot of memory in constructing the entire list of users. A better
solution would be to break it down into batches: we fetch a batch of users from the
API, insert them into the database, and move on to the next batch. This allows us to
control memory usage by changing the batch size. It is also more fault tolerant: if the
program crashes, we can just restart from the last successfully inserted batch.
The .grouped method, available on all iterables, is useful for batching. It returns an
iterator over fragments of the original iterable:
scala> val it = (0 to 10)
it: Range.Inclusive = Range(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> it.grouped(3).foreach { println } // In batches of 3
Vector(0, 1, 2)
Vector(3, 4, 5)
Vector(6, 7, 8)
Vector(9, 10)

Let's rewrite our ingestUsers method to use batches. We will also add a progress
report after each batch in order to give the user some feedback:
/** Fetch users from GitHub and pass them to `inserter` */
def ingestUsers(nusers:Int)(inserter:Iterable[User] => Unit) {
  val batchSize = 100
  val it = new GitHubUserIterator(token)
  print("Inserted #users: ")
  it.take(nusers).grouped(batchSize).zipWithIndex.foreach {
    case (users, batchNumber) =>
      print(s"${batchNumber*batchSize} ")
      inserter(users)
  }
  println()
}

Let's look at the line that starts with it.take(nusers) more closely. We start from the
user iterator, it. We then take the first nusers. This returns an Iterator[User] that,
instead of happily churning through every user in the GitHub database, will terminate
after nusers. We then group this iterator into batches of 100 users. The .grouped
method returns an Iterator[Seq[User]]. We then zip each batch with its index so that
we know which batch we are currently processing (we use this in the print statement).
The .zipWithIndex method returns an Iterator[(Seq[User], Int)]. We unpack this tuple
in the loop using a case statement that binds users to a Seq[User] and batchNumber
to the index. Let's run this through SBT:
$ GHTOKEN="2502761..." sbt
> runMain InsertUsers
[info] Running InsertUsers
Inserted #users: 0 100 200 300 400
[success] Total time: 215 s, completed 01-Nov-2015 18:44:30
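
The batching pattern does not depend on GitHub or MongoDB. The sketch below applies it to an arbitrary iterator: ingestInBatches is a hypothetical generic version of ingestUsers that also returns the number of batches processed, which makes the behavior easy to check.

```scala
object BatchingSketch {
  /** Pass the first `n` elements of `source` to `inserter` in batches,
    * returning the number of batches processed.
    */
  def ingestInBatches[A](source: Iterator[A], n: Int, batchSize: Int)(
      inserter: Iterable[A] => Unit): Int = {
    var batches = 0
    // grouped yields one Seq per batch, built only when the loop reaches it.
    source.take(n).grouped(batchSize).foreach { batch =>
      inserter(batch)
      batches += 1
    }
    batches
  }
}
```

Note that the last batch may be smaller than batchSize: ingesting 250 elements in batches of 100 produces batches of 100, 100, and 50 elements.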

Extracting objects from the database
We now have a database populated with a few users. Let's query this database from
the REPL:
scala> import com.mongodb.casbah.Imports._
import com.mongodb.casbah.Imports._
scala> val collection = MongoClient()("github")("users")
MongoCollection = users
scala> val maybeUser = collection.findOne
Option[collection.T] = Some({ "_id" : { "$oid" :
"562e922546f953739c43df02"} , "github_id" : 1 , "login" : "mojombo" ,
"repos" : ...

The findOne method returns a single DBObject object wrapped in an option, unless
the collection is empty, in which case it returns None. We must therefore use the get
method to extract the object:
scala> val user = maybeUser.get
collection.T = { "_id" : { "$oid" : "562e922546f953739c43df02"} ,
"github_id" : 1 , "login" : "mojombo" , "repos" : ...

As you learned earlier in this chapter, DBObject is a map-like object with keys of
type String and values of type AnyRef:
scala> user("login")
AnyRef = mojombo

In general, we want to restore compile-time type information as early as possible
when importing objects from the database: we do not want to pass AnyRefs around
when we can be more specific. We can use the getAs method to extract a field and
cast it to a specific type:
scala> user.getAs[String]("login")
Option[String] = Some(mojombo)

If the field is missing in the document or if the value cannot be cast, getAs will
return None:
scala> user.getAs[Int]("login")
Option[Int] = None

The astute reader may note that the interface provided by getAs[T] is similar to the
read[T] method that we defined on a JDBC result set in Chapter 5, Scala and SQL
through JDBC.
If getAs fails (for instance, because the field is missing), we can use the orElse
method of Option to recover:
scala> val loginName = user.getAs[String]("login") orElse {
println("No login field found. Falling back to 'name'")
user.getAs[String]("name")
}
loginName: Option[String] = Some(mojombo)
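
The same fallback chain can be tried out without a database. In this sketch, a plain Map[String, Any] stands in for the DBObject, and stringField is a hypothetical helper that mimics the Option returned by getAs[String]; it is not part of Casbah.

```scala
object FallbackSketch {
  /** Typed lookup on a plain Map, mimicking the Option returned by getAs. */
  def stringField(doc: Map[String, Any], key: String): Option[String] =
    doc.get(key).collect { case s: String => s }

  /** Prefer "login", fall back to "name", and finally to a default value. */
  def displayName(doc: Map[String, Any]): String =
    stringField(doc, "login")
      .orElse(stringField(doc, "name"))
      .getOrElse("<unknown>")
}
```

Each step in the chain is only evaluated if the previous one produced None, so the fallbacks cost nothing when the preferred field is present.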

The getAsOrElse method allows us to substitute a default value if the cast fails:
scala> user.getAsOrElse[Int]("id", 5)
Int = 1392879

Note that we can also use getAsOrElse to throw an exception:
scala> user.getAsOrElse[String]("name",
throw new IllegalArgumentException(
"Missing value for name")
)
java.lang.IllegalArgumentException: Missing value for name
...

Arrays embedded in documents can be cast to List[T] objects, where T is the type of
elements in the array:
scala> user.getAsOrElse[List[DBObject]]("repos",
List.empty[DBObject])
List[DBObject] = List({ "github_id" : 26899533 , "name" :
"30daysoflaptops.github.io" ...

Retrieving a single document at a time is not very useful. To retrieve all the
documents in a collection, use the .find method:
scala> val userIterator = collection.find()
userIterator: collection.CursorType = non-empty iterator

This returns an iterator of DBObjects. To actually fetch the documents from the
database, you need to materialize the iterator by transforming it into a collection,
using, for instance, .toList:
scala> val userList = userIterator.toList
List[DBObject] = List({ "_id" : { "$oid": ...

Let's bring all of this together. We will write a toy program that prints the average
number of repositories per user in our collection. The code works by fetching
every document in the collection, extracting the number of repositories from each
document, and then averaging over these:
// RepoNumber.scala
import com.mongodb.casbah.Imports._

object RepoNumber {

  /** Extract the number of repos from a DBObject
    * representing a user.
    */
  def extractNumber(obj:DBObject):Option[Int] = {
    val repos = obj.getAs[List[DBObject]]("repos") orElse {
      println("Could not find or parse 'repos' field")
      None
    }
    repos.map { _.size }
  }

  val collection = MongoClient()("github")("users")

  def main(args:Array[String]) {
    val userIterator = collection.find()
    // Convert from documents to Option[Int]
    val repoNumbers = userIterator.map { extractNumber }
    // Convert from Option[Int] to Int
    val wellFormattedNumbers = repoNumbers.collect {
      case Some(v) => v
    }.toList
    // Calculate summary statistics.
    // sum (unlike reduce) does not throw on an empty list.
    val sum = wellFormattedNumbers.sum
    val count = wellFormattedNumbers.size
    if (count == 0) {
      println("No repos found")
    }
    else {
      val mean = sum.toDouble / count.toDouble
      println(s"Total number of users with repos: $count")
      println(s"Total number of repos: $sum")
      println(s"Mean number of repos: $mean")
    }
  }
}

Let's run this through SBT:
> runMain RepoNumber
Total number of users with repos: 500
Total number of repos: 9649
Mean number of repos: 19.298

The code starts with the extractNumber function, which extracts the number of
repositories from each DBObject. The return value is None if the document does
not contain the repos field.
The main body of the code starts by creating an iterator over DBObjects in the
collection. This iterator is then mapped through the extractNumber function, which
transforms it into an iterator of Option[Int]. We then run .collect on this iterator
to collect all the values that are not None, converting from Option[Int] to Int in
the process. Only then do we materialize the iterator to a list using .toList. The
resulting list, wellFormattedNumbers, has the List[Int] type. We then just take the
mean of this list and print it to screen.
Note that, besides the extractNumber function, none of this program deals with
Casbah-specific types: the iterator returned by .find() is just a Scala iterator. This
makes Casbah straightforward to use: the only data type that you need to familiarize
yourself with is DBObject (compare this with JDBC's ResultSet, which we had to
explicitly wrap in a stream, for instance).
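
Since the pipeline is just a chain of operations on a Scala iterator, its core can be written and tested without Casbah. The following sketch is one possible variant, not the program above: meanOfDefined is a hypothetical helper that drops the None values with flatMap and returns the mean as an Option, so that the empty case is explicit.

```scala
object MeanSketch {
  /** Mean of the defined values, or None if there are none. */
  def meanOfDefined(values: Iterator[Option[Int]]): Option[Double] = {
    // flatMap(_.toList) drops the None values, like collect { case Some(v) => v }.
    val defined = values.flatMap(_.toList).toList
    if (defined.isEmpty) None
    else Some(defined.sum.toDouble / defined.size)
  }
}
```

In RepoNumber, the argument would be the iterator obtained by mapping extractNumber over the documents.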

Complex queries
We now know how to convert DBObject instances to custom Scala classes. In this
section, you will learn how to construct queries that only return a subset of the
documents in the collection.
In the previous section, you learned to retrieve all the documents in a collection
as follows:
scala> val objs = collection.find().toList
List[DBObject] = List({ "_id" : { "$oid" : "56365cec46f9534fae8ffd7f"}
,...

The collection.find() method returns an iterator over all the documents in the
collection. By calling .toList on this iterator, we materialize it to a list.
We can customize which documents are returned by passing a query document to
the .find method. For instance, we can retrieve documents for a specific login name:
scala> val query = DBObject("login" -> "mojombo")
query: DBObject = { "login" : "mojombo"}
scala> val objs = collection.find(query).toList
List[DBObject] = List({ "_id" : { "$oid" : "562e922546f953739c43df02"} ,
"login" : "mojombo",...

MongoDB queries are expressed as DBObject instances. Keys in the DBObject
correspond to fields in the collection's documents, and the values are expressions
controlling the allowed values of this field. Thus, DBObject("login" ->
"mojombo") will select all the documents for which the login field is mojombo.
Using a DBObject instance to represent a query might seem a little obscure, but it
will quickly make sense if you read the MongoDB documentation
(https://docs.mongodb.org/manual/core/crud-introduction/): queries are themselves just
JSON objects in MongoDB. Thus, the fact that the query in Casbah is represented as
a DBObject is consistent with other MongoDB client implementations. It also allows
someone familiar with MongoDB to start writing Casbah queries in no time.
MongoDB supports more complex queries. For instance, to query everyone with
"github_id" between 20 and 30, we can write the following query:
scala> val query = DBObject("github_id" ->
DBObject("$gte" -> 20, "$lt" -> 30))
query: DBObject = { "github_id" : { "$gte" : 20 , "$lt" : 30}}
scala> collection.find(query).toList
List[com.mongodb.casbah.Imports.DBObject] = List({ "_id" : { "$oid" :
"562e922546f953739c43df0f"} , "github_id" : 23 , "login" : "takeo" , ...

We limit the range of values that github_id can take with DBObject("$gte" -> 20,
"$lt" -> 30). The "$gte" string indicates that github_id must be greater than or equal
to 20. Similarly, "$lt" denotes the less than operator. To get a full list of operators
that you can use when querying, consult the MongoDB reference documentation
(http://docs.mongodb.org/manual/reference/operator/query/).

So far, we have only looked at queries on top-level fields. Casbah also lets us query
fields in subdocuments and arrays using the dot notation. In the context of array
values, this will return all the documents for which at least one value in the array
matches the query. For instance, to retrieve all users who have a repository whose
main language is Scala:
scala> val query = DBObject("repos.language" -> "Scala")
query: DBObject = { "repos.language" : "Scala"}
scala> collection.find(query).toList
List[DBObject] = List({ "_id" : { "$oid" : "5635da4446f953234ca634df"},
"login" : "kevinclark"...

Scala and MongoDB

Casbah query DSL
Using DBObject instances to express queries can be very verbose and somewhat
difficult to read. Casbah provides a DSL to express queries much more succinctly.
For instance, to get all the documents with the github_id field between 20 and 30,
we would write the following:
scala> collection.find("github_id" $gte 20 $lt 30).toList
List[com.mongodb.casbah.Imports.DBObject] = List({ "_id" : { "$oid" :
"562e922546f953739c43df0f"} , "github_id" : 23 , "login" : "takeo" ,
"repos" : ...

The operators provided by the DSL will automatically construct DBObject instances.
Using the DSL operators as much as possible generally leads to much more readable
and maintainable code.
Going into the full details of the query DSL is beyond the scope of this chapter. You
should find it quite easy to use. For a full list of the operators supported by the DSL,
refer to the Casbah documentation at
http://mongodb.github.io/casbah/3.0/reference/query_dsl/. We summarize the most
important operators here:
Operators                                         Description
"login" $eq "mojombo"                             This selects documents whose login field is exactly mojombo
"login" $ne "mojombo"                             This selects documents whose login field is not mojombo
"github_id" $gt 1 $lt 20                          This selects documents with github_id greater than 1 and less than 20
"github_id" $gte 1 $lte 20                        This selects documents with github_id greater than or equal to 1 and less than or equal to 20
"login" $in ("mojombo", "defunkt")                The login field is either mojombo or defunkt
"login" $nin ("mojombo", "defunkt")               The login field is not mojombo or defunkt
"login" $regex "^moj.*"                           The login field matches the particular regular expression
"login" $exists true                              The login field exists
$or("login" $eq "mojombo", "github_id" $gte 22)   Either the login field is mojombo or the github_id field is greater or equal to 22
$and("login" $eq "mojombo", "github_id" $gte 22)  The login field is mojombo and the github_id field is greater or equal to 22

We can also use the dot notation to query arrays and subdocuments. For instance, the
following query will count all the users who have a repository in Scala:
scala> collection.find("repos.language" $eq "Scala").size
Int = 30
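
To make the matching rule for arrays concrete, the following sketch mirrors it on plain Scala collections: a document matches when at least one element of the array matches. The User and Repo case classes and the sample data are invented for this illustration and are not part of Casbah; MongoDB evaluates the real query on the server.

```scala
// In-memory analogy of the "repos.language" $eq "Scala" query.
// User, Repo and the sample data are hypothetical stand-ins.
case class Repo(name: String, language: String)
case class User(login: String, repos: List[Repo])

val users = List(
  User("alice", List(Repo("viz", "JavaScript"), Repo("stats", "Scala"))),
  User("bob", List(Repo("site", "JavaScript"))),
  User("carol", Nil)
)

// A user matches if at least one of their repositories matches.
val scalaUsers = users.filter(_.repos.exists(_.language == "Scala"))

println(scalaUsers.map(_.login)) // List(alice)
println(scalaUsers.size)         // 1
```

Note that alice is selected even though only one of her two repositories is written in Scala, and carol, who has no repositories, is not.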

Custom type serialization
So far, we have only tried to serialize and deserialize simple types. What if we
wanted to decode the language field in the repository array to an enumeration
rather than a string? We might, for instance, define the following enumeration:
scala> object Language extends Enumeration {
  val Scala, Java, JavaScript = Value
}
defined object Language

Casbah lets us define custom serializers tied to a specific Scala type: we can inform
Casbah that whenever it encounters an instance of the Language.Value type in a
DBObject, the instance should be passed through a custom transformer that will
convert it to, for instance, a string, before writing it to the database.
To define a custom serializer, we need to define a class that extends the Transformer
trait. This trait exposes a single method, transform(o:AnyRef):AnyRef. Let's define
a LanguageTransformer trait that transforms from Language.Value to String:
scala> import org.bson.{BSON, Transformer}
import org.bson.{BSON, Transformer}
scala> trait LanguageTransformer extends Transformer {
  def transform(o:AnyRef):AnyRef = o match {
    case l:Language.Value => l.toString
    case _ => o
  }
}
defined trait LanguageTransformer

We now need to register the trait to be used whenever an instance of type
Language.Value needs to be encoded. We can do this using the addEncodingHook method:
scala> BSON.addEncodingHook(
  classOf[Language.Value], new LanguageTransformer {})

We can now construct DBObject instances containing values of the Language
enumeration:
scala> val repoObj = DBObject(
  "github_id" -> 1234L,
  "language" -> Language.Scala
)
repoObj: DBObject = { "github_id" : 1234 , "language" : "Scala"}

What about the reverse? How do we tell Casbah to read the "language" field as
Language.Value? This is not possible with custom deserializers: "Scala" is now
stored as a string in the database. Thus, when it comes to deserialization, "Scala"
is no different from, say, "mojombo". We thus lose type information when "Scala"
is serialized.
Thus, while custom encoding hooks are useful for serialization, they are much less
useful when deserializing. A cleaner, more consistent alternative to customize both
serialization and deserialization is to use type classes. We have already covered how
to use these extensively in Chapter 5, Scala and SQL through JDBC, in the context of
serializing to and from SQL. The procedure here would be very similar:
1. Define a MongoReader[T] type class with a read(v:Any):T method.
2. Define concrete implementations of MongoReader in the MongoReader
companion object for all types of interest, such as String, Language.Value.
3. Enrich DBObject with a read[T:MongoReader] method using the pimp my
library pattern.
For instance, the implementation of MongoReader for Language.Value would be as
follows:
implicit object LanguageReader extends MongoReader[Language.Value] {
  def read(v:Any):Language.Value = v match {
    case s:String => Language.withName(s)
  }
}

We could then do the same with a MongoWriter type class. Using type classes is an
idiomatic and extensible approach to custom serialization and deserialization.
We provide a complete example of type classes in the code examples associated with
this chapter (in the typeclass directory).
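
The three steps listed above can be sketched with the standard library alone. This is a minimal illustration and not the code from the typeclass directory: Doc is a Map-based stand-in for DBObject (an assumption made so that the example needs no MongoDB driver), and passing the field name to read is our own choice.

```scala
// Sketch of the MongoReader type class. Doc is a hypothetical
// stand-in for DBObject so that no driver is required.
object MongoReaderSketch {
  type Doc = Map[String, Any]

  object Language extends Enumeration {
    val Scala, Java, JavaScript = Value
  }

  // Step 1: the type class itself.
  trait MongoReader[T] {
    def read(v: Any): T
  }

  // Step 2: concrete instances in the companion object, which is
  // part of the implicit scope for MongoReader[T].
  object MongoReader {
    implicit object StringReader extends MongoReader[String] {
      def read(v: Any): String = v match {
        case s: String => s
        case other =>
          throw new IllegalArgumentException(s"Not a string: $other")
      }
    }

    implicit object LanguageReader extends MongoReader[Language.Value] {
      def read(v: Any): Language.Value = v match {
        case s: String => Language.withName(s)
        case other =>
          throw new IllegalArgumentException(s"Not a language: $other")
      }
    }
  }

  // Step 3: enrich the document type with a read method.
  implicit class RichDoc(doc: Doc) {
    def read[T: MongoReader](key: String): T =
      implicitly[MongoReader[T]].read(doc(key))
  }
}

import MongoReaderSketch._

val repo: Doc = Map("github_id" -> 1234L, "language" -> "Scala")
val language = repo.read[Language.Value]("language") // Language.Scala
val raw = repo.read[String]("language")              // "Scala"
```

The type parameter passed to read decides which reader is used, so the same stored string can be read back either as a Language.Value or as a plain String.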

Beyond Casbah
We have only considered Casbah in this chapter. There are, however, other drivers
for MongoDB.
ReactiveMongo is a driver that focusses on asynchronous reads and writes to and from
the database. All queries return a future, forcing asynchronous behavior. This fits in
well with data streams or web applications.
Salat sits at a higher level than Casbah and aims to provide easy serialization and
deserialization of case classes.
A full list of drivers is available at
https://docs.mongodb.org/ecosystem/drivers/scala/.

Summary
In this chapter, you learned how to interact with a MongoDB database. By weaving
the constructs learned in the previous chapter—pulling information from a web
API—with those learned in this chapter, we can now build a concurrent, reactive
program for data ingestion.
In the next chapter, you will learn to build distributed, concurrent structures with
greater flexibility using Akka actors.

References
MongoDB: The Definitive Guide, by Kristina Chodorow, is a good introduction to
MongoDB. It does not cover interacting with MongoDB in Scala at all, but Casbah is
intuitive enough for anyone familiar with MongoDB.
Similarly, the MongoDB documentation (https://docs.mongodb.org/manual/)
provides an in-depth discussion of MongoDB.
Casbah itself is well-documented (http://mongodb.github.io/casbah/3.0/).
There is a Getting Started guide that is somewhat similar to this chapter and a
complete reference guide that will fill in the gaps left by this chapter.
This gist, https://gist.github.com/switzer/4218526, implements type classes
to serialize and deserialize objects in the domain model to DBObjects. The premise
is a little different from the suggested usage of type classes in this chapter: we are
converting from Scala types to AnyRef to be used as values in DBObject. However,
the two approaches are complementary: one could imagine a set of type classes to
convert from User or Repo to DBObject and another to convert from Language.
Value to AnyRef.

Concurrency with Akka
Much of this book focusses on taking advantage of multicore and distributed
architectures. In Chapter 4, Parallel Collections and Futures, you learned how to use
parallel collections to distribute batch processing problems over several threads and
how to perform asynchronous computations using futures. In Chapter 7, Web APIs,
we applied this knowledge to query the GitHub API with several concurrent threads.
Concurrency abstractions such as futures and parallel collections simplify the
enormous complexity of concurrent programming by limiting what you can do.
Parallel collections, for instance, force you to phrase your parallelization problem as
a sequence of pure functions on collections.
Actors offer a different way of thinking about concurrency. Actors are very good at
encapsulating state. Managing state shared between different threads of execution is
probably the most challenging part of developing concurrent applications, and, as
we will discover in this chapter, actors make it manageable.

GitHub follower graph
In the previous two chapters, we explored the GitHub API, learning how to query
the API and parse the results using json4s.
Let's imagine that we want to extract the GitHub follower graph: we want a program
that will start from a particular user, extract this user's followers, and then extract
their followers until we tell it to stop. The catch is that we don't know ahead of time
what URLs we need to fetch: when we download the login names of a particular
user's followers, we need to verify whether we have fetched these users previously.
If not, we add them to a queue of users whose followers we need to fetch. Algorithm
aficionados might recognize this as breadth-first search.

Let's outline how we might write this in a single-threaded way. The central
components are a set of visited users and a queue of future users to visit:
import scala.collection.mutable

val seedUser = "odersky" // the origin of the network
// Users whose URLs need to be fetched
val queue = mutable.Queue(seedUser)
// set of users that we have already fetched
// (to avoid re-fetching them)
val fetchedUsers = mutable.Set.empty[String]
while (queue.nonEmpty) {
  val user = queue.dequeue
  if (!fetchedUsers(user)) {
    val followers = fetchFollowersForUser(user)
    followers foreach { follower =>
      // add the follower to queue of people whose
      // followers we want to find.
      queue += follower
    }
    fetchedUsers += user
  }
}

Here, the fetchFollowersForUser method has signature String =>
Iterable[String] and is responsible for taking a login name, transforming it
into a URL in the GitHub API, querying the API, and extracting a list of followers
from the response. We will not implement it here, but you can find a complete
example in the chap09/single_threaded directory of the code examples for this
book (https://github.com/pbugnion/s4ds). You should have all the tools to
implement this yourself if you have read Chapter 7, Web APIs.
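
To try the traversal without touching the network, we can replace fetchFollowersForUser with a stub. The sketch below assumes a small, invented follower graph held in a Map; the crawl function and the use of a LinkedHashSet (to remember the visit order) are our own additions.

```scala
import scala.collection.mutable

// Hypothetical follower data standing in for the GitHub API.
val followerGraph = Map(
  "odersky" -> List("alice", "bob"),
  "alice" -> List("bob", "carol"),
  "bob" -> List("odersky"),
  "carol" -> Nil
)

// Stub with the String => Iterable[String] signature described above.
def fetchFollowersForUser(login: String): Iterable[String] =
  followerGraph.getOrElse(login, Nil)

// Breadth-first traversal; returns users in the order they were fetched.
def crawl(seedUser: String): List[String] = {
  val queue = mutable.Queue(seedUser)
  val fetchedUsers = mutable.LinkedHashSet.empty[String]
  while (queue.nonEmpty) {
    val user = queue.dequeue()
    if (!fetchedUsers(user)) {
      queue ++= fetchFollowersForUser(user)
      fetchedUsers += user
    }
  }
  fetchedUsers.toList
}

println(crawl("odersky")) // List(odersky, alice, bob, carol)
```

Even though bob follows odersky, creating a cycle, each user is fetched exactly once because of the fetchedUsers check.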
While this works, it will be painfully slow. The bottleneck is clearly the
fetchFollowersForUser method, in particular, the part that queries the GitHub
API. This program does not lend itself to the concurrency constructs that we
have seen earlier in the book because we need to protect the state of the program,
embodied by the user queue and set of fetched users, from race conditions. Note that
it is not just a matter of making the queue and set thread-safe. We must also keep the
two synchronized.
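
To see what keeping the two synchronized means with traditional locking, consider the following sketch. CrawlState is a hypothetical helper, not code from this book: a single lock guards both collections so that no thread can observe the queue and the set in an inconsistent state.

```scala
import scala.collection.mutable

// Hypothetical lock-based alternative to the actor design.
class CrawlState(seedUser: String) {
  private val queue = mutable.Queue(seedUser)
  private val claimed = mutable.Set.empty[String]

  // Skip over users that another worker has already claimed.
  @annotation.tailrec
  private def dequeueUnclaimed(): Option[String] =
    if (queue.isEmpty) None
    else {
      val user = queue.dequeue()
      if (claimed.add(user)) Some(user) else dequeueUnclaimed()
    }

  // Dequeuing and claiming happen under the same lock.
  def nextUser(): Option[String] = synchronized { dequeueUnclaimed() }

  // Only queue followers that nobody has claimed yet.
  def addFollowers(followers: Iterable[String]): Unit = synchronized {
    queue ++= followers.filterNot(claimed)
  }
}

val state = new CrawlState("odersky")
val first = state.nextUser()                 // Some(odersky)
state.addFollowers(List("alice", "odersky")) // odersky is filtered out
val second = state.nextUser()                // Some(alice)
val third = state.nextUser()                 // None
```

Every additional piece of shared state would need the same care, and every caller must remember to go through the lock.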

Chapter 9

Actors offer an elegant abstraction to encapsulate state. They are lightweight objects
that each perform a single task (possibly repeatedly) and communicate with each
other by passing messages. The internal state of an actor can only be changed from
within the actor itself. Importantly, actors only process messages one at a time,
effectively preventing race conditions.
By hiding program state inside actors, we can reason about the program more
effectively: if a bug is introduced that makes this state inconsistent, the culprit
will be localized entirely in that actor.

Actors as people
In the previous section, you learned that an actor encapsulates state, interacting with
the outside world through messages. Actors make concurrent programming more
intuitive because they behave a little bit like an ideal workforce.
Let's think of an actor system representing a start-up with five people. There's
Chris, the CEO, and Mark, who's in charge of marketing. Then there's Sally, who
heads the engineering team. Sally has two minions, Bob and Kevin. As every good
organization needs an organizational chart, refer to the following diagram:

Let's say that Chris receives an order. He will look at the order, decide whether it
is something that he can process himself, and if not, he will forward it to Mark or
Sally. Let's assume that the order asks for a small program, so Chris forwards the order
to Sally. Sally is very busy working on a backlog of orders so she cannot process
the order message straightaway, and it will just sit in her mailbox for a short while.
When she finally gets round to processing the order, she might decide to split the
order into several parts, some of which she will give to Kevin and some to Bob.
As Bob and Kevin complete items, they will send messages back to Sally to inform
her. When every part of the order is fulfilled, Sally will aggregate the parts together
and message either the customer directly or Chris with the results.
The task of keeping track of which jobs must be fulfilled to complete the order rests
with Sally. When she receives messages from Bob and Kevin, she must update
her list of tasks in progress and check whether every task related to this order is
complete. This sort of coordination would be more challenging with traditional
synchronized blocks: every access to the list of tasks in progress and to the list of
completed tasks would need to be synchronized. By embedding this logic in Sally,
who can only process a single message at a time, we can be sure that there will not be
race conditions.
Our start-up works well because each person is responsible for doing a single thing:
Chris either delegates to Mark or Sally, Sally breaks up orders into several parts and
assigns them to Bob and Kevin, and Bob and Kevin fulfill each part. You might think
"hold on, all the logic is embedded in Bob and Kevin, the employees at the bottom of
the ladder who do all the actual work". Actors, unlike employees, are cheap, so if the
logic embedded in an actor gets too complicated, it is easy to introduce additional
layers of delegation until tasks get simple enough.
The employees in our start-up refuse to multitask. When they get a piece of work,
they process it completely and then move on to the next task. This means that they
cannot get muddled by the complexities of multitasking. Actors, by processing a
single message at a time, greatly reduce the scope for introducing concurrency errors
such as race conditions.
More importantly, by offering an abstraction that programmers can intuitively
understand—that of human workers—Akka makes reasoning about concurrency easier.

Hello world with Akka
Let's install Akka. We add it as a dependency to our build.sbt file:
scalaVersion := "2.11.7"
libraryDependencies += "com.typesafe.akka" %% "akka-actor" % "2.4.0"

We can now import Akka as follows:
import akka.actor._

For our first foray into the world of actors, we will build an actor that echoes every
message it receives. The code examples for this section are in a directory called
chap09/hello_akka in the sample code provided with this book
(https://github.com/pbugnion/s4ds):
// EchoActor.scala
import akka.actor._
class EchoActor extends Actor with ActorLogging {
  def receive = {
    case msg:String =>
      Thread.sleep(500)
      log.info(s"Received '$msg'")
  }
}

Let's pick this example apart, starting with the class declaration. Our actor class must
extend Actor. We also add ActorLogging, a utility trait that adds the log attribute.
The Echo actor exposes a single method, receive. This is the actor's only way of
communicating with the external world. To be useful, all actors must expose a
receive method. The receive method is a partial function, typically implemented
with multiple case statements. When an actor starts processing a message, it will
match it against every case statement until it finds one that matches. It will then
execute the corresponding block.
Our echo actor accepts a single type of message, a plain string. When this message
gets processed, the actor waits for half a second and then echoes the message to the
log file.
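
Since receive is a partial function, we can explore its behavior without an actor system. In Akka, Actor.Receive is an alias for PartialFunction[Any, Unit]; the sketch below uses the standard library type directly, and the withFallback function is our own illustration.

```scala
// A stand-alone partial function with the same shape as receive.
val receive: PartialFunction[Any, Unit] = {
  case msg: String => println(s"Received '$msg'")
}

// The function is only defined for messages that match a case.
val handlesString = receive.isDefinedAt("hello") // true
val handlesInt = receive.isDefinedAt(42)         // false

// Cases are tried in order, so a catch-all case must come last.
val withFallback: PartialFunction[Any, String] = {
  case msg: String => s"string: $msg"
  case other => s"unexpected: $other"
}

println(withFallback("hello")) // string: hello
println(withFallback(42))      // unexpected: 42
```

When an actor receives a message that none of its case statements match, Akka passes the message to the actor's unhandled method rather than throwing a MatchError.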

Let's instantiate a couple of Echo actors and send them messages:
// HelloAkka.scala
import akka.actor._
object HelloAkka extends App {
  // We need an actor system before we can
  // instantiate actors
  val system = ActorSystem("HelloActors")
  // instantiate our two actors
  val echo1 = system.actorOf(Props[EchoActor], name="echo1")
  val echo2 = system.actorOf(Props[EchoActor], name="echo2")
  // Send them messages. We do this using the "!" operator
  echo1 ! "hello echo1"
  echo2 ! "hello echo2"
  echo1 ! "bye bye"
  // Give the actors time to process their messages,
  // then shut the system down to terminate the program.
  // echo1 sleeps for 500 ms on each of its two messages,
  // so we wait comfortably longer than one second.
  Thread.sleep(2000)
  system.shutdown
}

Running this gives us the following output:
[INFO] [07/19/2015 17:15:23.954] [HelloActor-akka.actor.default-dispatcher-2] [akka://HelloActor/user/echo1] Received 'hello echo1'
[INFO] [07/19/2015 17:15:23.954] [HelloActor-akka.actor.default-dispatcher-3] [akka://HelloActor/user/echo2] Received 'hello echo2'
[INFO] [07/19/2015 17:15:24.955] [HelloActor-akka.actor.default-dispatcher-2] [akka://HelloActor/user/echo1] Received 'bye bye'

Note that the echo1 and echo2 actors are clearly acting concurrently: hello echo1
and hello echo2 are logged at the same time. The second message, passed to echo1,
gets processed after the actor has finished processing hello echo1.

There are a few different things to note:
• To start instantiating actors, we must first create an actor system. There is
  typically a single actor system per application.
• The way in which we instantiate actors looks a little strange. Instead of
  calling the constructor, we create an actor properties object, Props[T]. We
  then ask the actor system to create an actor with these properties. In fact,
  we never instantiate actors with new: they are either created by calling the
  actorOf method in the actor system or a similar method from within another
  actor (more on this later).

We never call an actor's methods from outside that actor. The only way to interact
with the actor is to send messages to it. We do this using the tell operator, !. There
is thus no way to mess with an actor's internals from outside that actor (or at least,
Akka makes it difficult to mess with an actor's internals).

Case classes as messages
In our "hello world" example, we constructed an actor that is expected to receive a
string as message. Any object can be passed as a message, provided it is immutable.
It is very common to use case classes to represent messages. This is better than using
strings because of the additional type safety: the compiler will catch a typo in a case
class but not in a string.
Let's rewrite our EchoActor to accept instances of case classes as messages. We will
make it accept two different messages: EchoMessage(message) and EchoHello,
which just echoes a default message. The examples for this section and the next are in
the chap09/hello_akka_case_classes directory in the sample code provided with
this book (https://github.com/pbugnion/s4ds).
A common Akka pattern is to define the messages that an actor can receive in the
actor's companion object:
// EchoActor.scala
object EchoActor {
  case object EchoHello
  case class EchoMessage(msg:String)
}

Let's change the actor definition to accept these messages:
class EchoActor extends Actor with ActorLogging {
  import EchoActor._ // import the message definitions
  def receive = {
    case EchoHello => log.info("hello")
    case EchoMessage(s) => log.info(s)
  }
}

We can now send EchoHello and EchoMessage to our actors:
echo1 ! EchoActor.EchoHello
echo2 ! EchoActor.EchoMessage("We're learning Akka.")

Actor construction
Actor construction is a common source of difficulty for people new to Akka. Unlike
(most) ordinary objects, you never instantiate actors explicitly. You would never
write, for instance, val echo = new EchoActor. In fact, if you try this, Akka raises
an exception.
Creating actors in Akka is a two-step process: you first create a Props object, which
encapsulates the properties needed to construct an actor. The way to construct a
Props object differs depending on whether the actor takes constructor arguments.
If the constructor takes no arguments, we simply pass the actor class as a type
parameter to Props:
val echoProps = Props[EchoActor]

If we have an actor whose constructor does take arguments, we must pass these as
additional arguments when defining the Props object. Let's consider the following
actor, for instance:
class TestActor(a:String, b:Int) extends Actor { ... }

We pass the constructor arguments to the Props object as follows:
val testProps = Props(classOf[TestActor], "hello", 2)

The Props instance just embodies the configuration for creating an actor. It does
not actually create anything. To create an actor, we pass the Props instance to the
system.actorOf method, defined on the ActorSystem instance:
val system = ActorSystem("HelloActors")
val echo1 = system.actorOf(echoProps, name="hello-1")

The name parameter is optional but is useful for logging and error messages. The
value returned by .actorOf is not the actor itself: it is a reference to the actor (it
helps to think of it as an address that the actor lives at) and has the ActorRef type.
ActorRef is immutable, and it can be serialized and duplicated without affecting the
underlying actor.
There is another way to create actors besides calling actorOf on the actor system:
each actor exposes a context.actorOf method that takes a Props instance as its
argument. The context is only accessible from within the actor:
class TestParentActor extends Actor {
  val echoChild = context.actorOf(echoProps, name="hello-child")
  ...
}

The difference between an actor created from the actor system and an actor created
from another actor's context lies in the actor hierarchy: each actor has a parent. Any
actor created within another actor's context will have that actor as its parent. An
actor created by the actor system has a predefined actor, called the user guardian, as
its parent. We will understand the importance of the actor hierarchy when we study
the actor lifecycle at the end of this chapter.
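
The hierarchy shows up directly in actor paths. The following sketch, assuming the Akka version used in this chapter, creates one actor through the system and one through another actor's context and logs where each lives. ParentActor, ChildActor, and the "whoami" message are hypothetical names introduced for this illustration.

```scala
// HierarchyDemo.scala
import akka.actor._

class ChildActor extends Actor with ActorLogging {
  def receive = {
    case "whoami" =>
      log.info(s"I am ${self.path}, my parent is ${context.parent.path}")
  }
}

class ParentActor extends Actor {
  // Created through this actor's context: ParentActor is the parent.
  val child = context.actorOf(Props[ChildActor], name = "child")
  def receive = {
    case msg => child forward msg
  }
}

object HierarchyDemo extends App {
  val system = ActorSystem("HierarchyDemo")
  // Created through the system: the user guardian is the parent.
  val parent = system.actorOf(Props[ParentActor], name = "parent")
  parent ! "whoami"
  Thread.sleep(500) // crude, but enough for a demonstration
  system.shutdown()
}
```

The child should report a path ending in /user/parent/child, with /user/parent as its parent: the user segment is the user guardian mentioned above.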
A very common idiom is to define a props method in an actor's companion object
that acts as a factory method for Props instances for that actor. Let's amend the
EchoActor companion object:
object EchoActor {
  def props:Props = Props[EchoActor]
  // message case class definitions here
}

We can then instantiate the actor as follows:
val echoActor = system.actorOf(EchoActor.props)
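
The props idiom is most useful when the constructor takes arguments, as callers then never need to know the argument list. Here is a minimal sketch, assuming the Akka version used in this chapter; the Greeter actor and its Greet message are hypothetical names introduced for this illustration.

```scala
// GreeterDemo.scala
import akka.actor._

object Greeter {
  // message definitions
  case class Greet(name: String)
  // Props factory: hides the constructor arguments from callers
  def props(greeting: String, repeat: Int): Props =
    Props(classOf[Greeter], greeting, repeat)
}

class Greeter(greeting: String, repeat: Int)
    extends Actor with ActorLogging {
  import Greeter._
  def receive = {
    case Greet(name) =>
      (1 to repeat).foreach { _ => log.info(s"$greeting, $name") }
  }
}

object GreeterDemo extends App {
  val system = ActorSystem("GreeterDemo")
  val greeter = system.actorOf(Greeter.props("Hello", 2), name = "greeter")
  greeter ! Greeter.Greet("Akka")
  Thread.sleep(500) // crude, but enough for a demonstration
  system.shutdown()
}
```

If the constructor signature changes later, only the props method needs to be updated, rather than every place where the actor is created.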

Anatomy of an actor
Before diving into a full-blown application, let's look at the different components of
the actor framework and how they fit together:
• Mailbox: A mailbox is basically a queue. Each actor has its own mailbox.
  When you send a message to an actor, the message lands in its mailbox and
  does nothing until the actor takes it off the queue and passes it through its
  receive method.

• Messages: Messages make synchronization between actors possible. A
  message can have any type with the sole requirement that it should be
  immutable. In general, it is better to use case classes or case objects to gain
  the compiler's help in checking message types.
• Actor reference: When we create an actor using val echo1 =
  system.actorOf(Props[EchoActor]), echo1 has type ActorRef. An ActorRef is
  a proxy for an actor and is what the rest of the world interacts with: when
  you send a message, you send it to the ActorRef, not to the actor directly. In
  fact, you can never obtain a handle to an actor directly in Akka. An actor can
  obtain an ActorRef for itself using the .self method.
• Actor context: Each actor has a context attribute through which you can
  access methods to create or access other actors and find information about
  the outside world. We have already seen how to create new actors with
  context.actorOf(props). We can also obtain a reference to an actor's
  parent through context.parent. An actor can also stop another actor with
  context.stop(actorRef), where actorRef is a reference to the actor that
  we want to stop.
• Dispatcher: The dispatcher is the machine that actually executes the code in
  an actor. The default dispatcher uses a fork/join thread pool. Akka lets us
  use different dispatchers for different actors. Tweaking the dispatcher can be
  useful to optimize the performance and give priority to certain actors. The
  dispatcher that an actor runs on is accessible through context.dispatcher.
  Dispatchers implement the ExecutionContext interface so they can be used
  to run futures.

Follower network crawler
The end game for this chapter is to build a crawler to explore GitHub's follower
graph. We have already outlined how we can do this in a single-threaded manner
earlier in this chapter. Let's design an actor system to do this concurrently.
The moving parts in the code are the data structures managing which users have
been fetched or are being fetched. These need to be encapsulated in an actor to avoid
race conditions arising from multiple actors trying to change them concurrently. We
will therefore create a fetcher manager actor whose job is to keep track of which users
have been fetched and which users we are going to fetch next.

The part of the code that is likely to be a bottleneck is querying the GitHub API. We
therefore want to be able to scale the number of workers doing this concurrently. We
will create a pool of fetchers, actors responsible for querying the API for the followers
of a particular user. Finally, we will create an actor whose responsibility is to
interpret the API's response. This actor will forward its interpretation of the response
to another actor who will extract the followers and give them to the fetcher manager.
This is what the architecture of the program will look like:

Actor system for our GitHub API crawler

Each actor in our program performs a single task: fetchers just query the GitHub
API and the fetcher manager just distributes work to the fetchers. Akka best practice
dictates giving actors as narrow an area of responsibility as possible. This enables
better granularity when scaling out (for instance, by adding more fetcher actors, we
just parallelize the bottleneck) and better resilience: if an actor fails, it will only affect
its area of responsibility. We will explore actor failure later on in this chapter.

We will build the app in several steps, exploring the Akka toolkit as we write
the program. Let's start with the build.sbt file. Besides Akka, we will mark
scalaj-http and json4s as dependencies:
// build.sbt
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
  "org.json4s" %% "json4s-native" % "3.2.10",
  "org.scalaj" %% "scalaj-http" % "1.1.4",
  "com.typesafe.akka" %% "akka-actor" % "2.3.12"
)

Fetcher actors
The workhorse of our application is the fetcher, the actor responsible for fetching
the follower details from GitHub. In the first instance, our actor will accept a single
message, Fetch(user). It will fetch the followers corresponding to user and log
the response to screen. We will use the recipes developed in Chapter 7, Web APIs, to
query the GitHub API with an OAuth token. We will inject the token through the
actor constructor.
Let's start with the companion object. This will contain the definition of the
Fetch(user) message and two factory methods to create the Props instances. You can
find the code examples for this section in the chap09/fetchers_alone directory in the
sample code provided with this book (https://github.com/pbugnion/s4ds):
// Fetcher.scala
import akka.actor._
import scalaj.http._
import scala.concurrent.Future
object Fetcher {
  // message definitions
  case class Fetch(login:String)
  // Props factory definitions
  def props(token:Option[String]):Props =
    Props(classOf[Fetcher], token)
  def props():Props = Props(classOf[Fetcher], None)
}

Let's now define the fetcher itself. We will wrap the call to the GitHub API in a
future. This avoids a single slow request blocking the actor. When our actor receives
a Fetch request, it wraps this request into a future, sends it off, and can then process
the next message. Let's go ahead and implement our actor:
// Fetcher.scala
class Fetcher(val token:Option[String])
extends Actor with ActorLogging {
  import Fetcher._ // import message definition
  // We will need an execution context for the future.
  // Recall that the dispatcher doubles up as execution
  // context.
  import context.dispatcher
  def receive = {
    case Fetch(login) => fetchUrl(login)
  }
  private def fetchUrl(login:String):Unit = {
    val unauthorizedRequest = Http(
      s"https://api.github.com/users/$login/followers")
    val authorizedRequest = token.map { t =>
      unauthorizedRequest.header("Authorization", s"token $t")
    }
    // Prepare the request: try to use the authorized request
    // if a token was given, and fall back on an unauthorized
    // request
    val request = authorizedRequest.getOrElse(unauthorizedRequest)
    // Fetch from github
    val response = Future { request.asString }
    response.onComplete { r =>
      log.info(s"Response from $login: $r")
    }
  }
}
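
The way the fetcher chooses between the authorized and unauthorized request is a general Option idiom, which the following sketch isolates. Request is a hypothetical stand-in for the scalaj-http request type, introduced so that the example runs without any dependency; buildRequest is likewise our own helper.

```scala
// Request is a simplified, hypothetical stand-in for an HTTP request.
case class Request(url: String, headers: Map[String, String] = Map.empty) {
  def header(key: String, value: String): Request =
    copy(headers = headers + (key -> value))
}

def buildRequest(login: String, token: Option[String]): Request = {
  val unauthorized =
    Request(s"https://api.github.com/users/$login/followers")
  // map builds the authorized request only if a token is present;
  // getOrElse falls back on the plain request otherwise.
  token
    .map { t => unauthorized.header("Authorization", s"token $t") }
    .getOrElse(unauthorized)
}

val anonymous = buildRequest("odersky", None)
val authorized = buildRequest("odersky", Some("abc123"))

println(anonymous.headers)  // Map()
println(authorized.headers) // Map(Authorization -> token abc123)
```

Because the token is an Option, the absence of a token is handled without any null checks or conditionals.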

Let's instantiate an actor system and four fetchers to check whether our actor is
working as expected. We will read the GitHub token from the environment, as
described in Chapter 7, Web APIs, then create four actors and ask each one to fetch the
followers of a particular GitHub user. We wait five seconds for the requests to get
completed, and then shut the system down:
// FetcherDemo.scala
import akka.actor._
object FetcherDemo extends App {
  import Fetcher._ // Import the messages
  val system = ActorSystem("fetchers")
  // Read the github token if present.
  val token = sys.env.get("GHTOKEN")
  val fetchers = (0 until 4).map { i =>
    system.actorOf(Fetcher.props(token))
  }
  fetchers(0) ! Fetch("odersky")
  fetchers(1) ! Fetch("derekwyatt")
  fetchers(2) ! Fetch("rkuhn")
  fetchers(3) ! Fetch("tototoshi")
  Thread.sleep(5000) // Wait for API calls to finish
  system.shutdown // Shut system down
}

Let's run the code through SBT:
$ GHTOKEN="2502761..." sbt run
[INFO] [11/08/2015 16:28:06.500] [fetchers-akka.actor.default-dispatcher-2] [akka://fetchers/user/$d] Response from tototoshi: Success(HttpResponse([{"login":"akr4","id":10892,
"avatar_url":"https://avatars.githubusercontent.com/u/10892?v=3","gravatar_id":""...
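
The Success wrapper in the log output appears because the callback passed to onComplete receives a Try. The sketch below shows how we might treat the two cases separately; slowCall and describe are hypothetical helpers standing in for the HTTP request and the logging call.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

// Hypothetical stand-in for request.asString.
def slowCall(login: String): String = s"followers of $login"

// Pattern match on the Try to handle success and failure separately.
def describe(login: String, result: Try[String]): String = result match {
  case Success(body) => s"Response from $login: $body"
  case Failure(error) => s"Request for $login failed: ${error.getMessage}"
}

val response = Future { slowCall("odersky") }
response.onComplete { r => println(describe("odersky", r)) }

// Await is used here only so that a script does not exit before
// the future completes; the fetcher actor never blocks like this.
Await.ready(response, 5.seconds)
```

An exception thrown inside the Future block does not propagate to the caller: it arrives in the callback as a Failure.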

Notice how we explicitly need to shut the actor system down using
system.shutdown. The program hangs until the system is shut down. However, shutting
down the system will stop all the actors, so we need to make sure that they have
finished working. We do this by inserting a call to Thread.sleep.
Using Thread.sleep to wait until the API calls have finished to shut down the actor
system is a little crude. A better approach could be to let the actors signal back to the
system that they have completed their task. We will see examples of this pattern later
when we implement the fetcher manager actor.
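As a rough illustration of waiting for completion rather than sleeping for a fixed time, the following sketch uses plain Scala futures instead of actors. It is not the actor-based signalling that we build later in this chapter, only an analogue of the idea; fakeFetch is a hypothetical stand-in for the real API call:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object WaitForAll {
  // Hypothetical stand-in for an API call; a real fetcher
  // would issue an HTTP request here.
  def fakeFetch(login: String): Future[String] =
    Future { s"followers of $login" }

  // Combine the futures into one and block until all of them
  // have completed, failing if they exceed the timeout.
  def fetchAll(logins: List[String]): List[String] = {
    val all = Future.sequence(logins.map(fakeFetch))
    Await.result(all, 5.seconds)
  }
}
```

Here the program resumes as soon as the last result arrives, and the five seconds act only as an upper bound.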
Akka includes a feature-rich scheduler to schedule events. We can use the scheduler
to replace the call to Thread.sleep by scheduling a system shutdown five seconds
in the future. This is preferable as the scheduler does not block the calling thread,
unlike Thread.sleep. To use the scheduler, we need to import a global execution
context and the duration module:
// FetcherDemoWithScheduler.scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

We can then schedule a system shutdown by replacing our call to Thread.sleep
with the following:
system.scheduler.scheduleOnce(5.seconds) { system.shutdown }

Besides scheduleOnce, the scheduler also exposes a schedule method that lets you
schedule events to happen regularly (every two seconds, for instance). This is useful
for heartbeat checks or monitoring systems. For more information, read the API
documentation on the scheduler available at
http://doc.akka.io/docs/akka/snapshot/scala/scheduler.html.
Note that we are actually cheating a little bit here by not fetching every follower. The
response to the followers query is paginated, so we would need to fetch
several pages to fetch all the followers. Adding logic to the actor to do this is not
terribly complicated. We will ignore this for now and assume that users are capped
at 100 followers each.
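For readers who want to follow the pages, GitHub advertises the URL of the next page in the Link header of each response. The following is a minimal sketch of how that header could be parsed, assuming the header value has already been extracted as a string. The LinkHeader object and its method names are hypothetical and not part of the sample code:

```scala
object LinkHeader {
  // One entry looks like: <https://...?page=2>; rel="next"
  private val Entry = """<([^>]+)>\s*;\s*rel="([^"]+)"\s*""".r

  // Map each relation name ("next", "last", ...) to its URL.
  def parse(header: String): Map[String, String] =
    Entry.findAllMatchIn(header)
      .map(m => m.group(2) -> m.group(1))
      .toMap

  // URL of the next page, if the header advertises one.
  def nextPage(header: String): Option[String] =
    parse(header).get("next")
}
```

A fetcher could keep requesting nextPage until it returns None.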

Routing
In the previous example, we created four fetchers and dispatched messages to them,
one after the other. We have a pool of identical actors among which we distribute
tasks. Manually routing the messages to the right actor to maximize the utilization
of our pool is painful and error-prone. Fortunately, Akka provides us with several
routing strategies that we can use to distribute work among our pool of actors.
Let's rewrite the previous example with automatic routing. You can find the code
examples for this section in the chap09/fetchers_routing directory in the sample
code provided with this book (https://github.com/pbugnion/s4ds). We will
reuse the same definition of Fetcher and its companion object as we did in the
previous section.
Let's start by importing the routing package:
// FetcherDemo.scala
import akka.routing._

A router is an actor that forwards the messages that it receives to its children. The
easiest way to define a pool of actors is to tell Akka to create a router and pass it a
Props object for its children. The router will then manage the creation of the workers
directly. In our example (we will only comment on the parts that differ from the
previous example in the text, but you can find the full code in the fetchers_routing
directory with the examples for this chapter), we replace the custom Fetcher
creation code with the following:
// FetcherDemo.scala
// Create a router with 4 workers of props Fetcher.props()
val router = system.actorOf(
RoundRobinPool(4).props(Fetcher.props(token))
)

We can then send the fetch messages directly to the router. The router will route the
messages to the children in a round-robin manner:
List("odersky", "derekwyatt", "rkuhn", "tototoshi").foreach {
login => router ! Fetch(login)
}

We used a round-robin router in this example. Akka offers many different types of
routers, including routers with dynamic pool size, to cater to different types of load
balancing. Head over to the Akka documentation for a list of all the available routers,
at http://doc.akka.io/docs/akka/snapshot/scala/routing.html.
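To make the round-robin idea concrete, here is a plain-Scala model of the selection logic. This is a sketch of the general technique, not Akka's RoundRobinPool implementation, and the RoundRobin and RoundRobinDemo names are hypothetical:

```scala
// Each call to next() hands back the next worker in turn,
// wrapping around at the end of the pool.
final class RoundRobin[A](workers: Vector[A]) {
  require(workers.nonEmpty, "need at least one worker")
  private var index = 0

  def next(): A = {
    val worker = workers(index)
    index = (index + 1) % workers.size
    worker
  }
}

object RoundRobinDemo {
  // Pair each login with the index of the worker that
  // would receive it.
  def assign(logins: List[String], nWorkers: Int): List[(String, Int)] = {
    val router = new RoundRobin((0 until nWorkers).toVector)
    logins.map(login => login -> router.next())
  }
}
```

With four workers, the fifth message goes back to the first worker, regardless of how busy that worker is.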

Message passing between actors
Merely logging the API response is not very useful. To traverse the follower graph,
we must perform the following:
•

Check the return code of the response to make sure that the GitHub API was
happy with our request

•

Parse the response as JSON

•

Extract the login names of the followers and, if we have not fetched them
already, push them into the queue

You learned how to do all these things in Chapter 7, Web APIs, but not in the context
of actors.
We could just add the additional processing steps to the receive method of our
Fetcher actor: we could add further transformations to the API response by future
composition. However, having actors do several different things, and possibly failing
in several different ways, is an anti-pattern: when we learn about managing the actor
life cycle, we will see that it becomes much more difficult to reason about our actor
systems if the actors contain several bits of logic.
We will therefore use a pipeline of three different actors:
•

The fetchers, which we have already encountered, are responsible just for
fetching a URL from GitHub. They will fail if the URL is badly formatted or
they cannot access the GitHub API.

•

The response interpreter is responsible for taking the response from the
GitHub API and parsing it to JSON. If it fails at any step, it will just log
the error (in a real application, we might take different corrective actions
depending on the type of failure). If it manages to extract JSON successfully,
it will pass the JSON array to the follower extractor.

•

The follower extractor will extract the followers from the JSON array and
pass them on to the queue of users whose followers we need to fetch.

We have already built the fetchers, though we will need to modify them to forward
the API response to the response interpreter rather than just logging it.

You can find the code examples for this section in the chap09/all_workers
directory in the sample code provided with this book (https://github.com/pbugnion/s4ds).
The first step is to modify the fetchers so that, instead of logging
the response, they forward the response to the response interpreter. To be able to
forward the response to the response interpreter, the fetchers will need a reference
to this actor. We will just pass the reference to the response interpreter through the
fetcher constructor, which is now:
// Fetcher.scala
class Fetcher(
val token:Option[String],
val responseInterpreter:ActorRef)
extends Actor with ActorLogging {
...
}

We must also modify the Props factory method in the companion object:
// Fetcher.scala
def props(
token:Option[String], responseInterpreter:ActorRef
):Props = Props(classOf[Fetcher], token, responseInterpreter)

We must also modify the receive method to forward the HTTP response to the
interpreter rather than just logging it:
// Fetcher.scala
class Fetcher(...) extends Actor with ActorLogging {
...
def receive = {
case Fetch(login) => fetchFollowers(login)
}
private def fetchFollowers(login:String) {
val unauthorizedRequest = Http(
s"https://api.github.com/users/$login/followers")
val authorizedRequest = token.map { t =>
unauthorizedRequest.header("Authorization", s"token $t")
}
val request = authorizedRequest.getOrElse(unauthorizedRequest)
val response = Future { request.asString }
// Wrap the response in an InterpretResponse message and
// forward it to the interpreter.
response.onComplete { r =>
responseInterpreter !
ResponseInterpreter.InterpretResponse(login, r)
}
}
}

The response interpreter takes the response, decides if it is valid, parses it to JSON, and
forwards it to a follower extractor. The response interpreter will need a reference to
the follower extractor, which we will pass in the constructor.
Let's start by defining the ResponseInterpreter companion. It will just contain the
definition of the messages that the response interpreter can receive and a factory to
create a Props object to help with instantiation:
// ResponseInterpreter.scala
import akka.actor._
import scala.util._
import scalaj.http._
import org.json4s._
import org.json4s.native.JsonMethods._
object ResponseInterpreter {
// Messages
case class InterpretResponse(
login:String, response:Try[HttpResponse[String]]
)
// Props factory
def props(followerExtractor:ActorRef) =
Props(classOf[ResponseInterpreter], followerExtractor)
}

The body of ResponseInterpreter should feel familiar: when the actor receives a
message giving it a response to interpret, it parses it to JSON using the techniques
that you learned in Chapter 7, Web APIs. If we parse the response successfully, we
forward the parsed JSON to the follower extractor. If we fail to parse the response
(possibly because it was badly formatted), we just log the error. We could recover
from this in other ways, for instance, by re-adding this login to the queue manager to
be fetched again:
// ResponseInterpreter.scala
class ResponseInterpreter(followerExtractor:ActorRef)
extends Actor with ActorLogging {
// Import the message definitions
import ResponseInterpreter._
def receive = {
case InterpretResponse(login, r) => interpret(login, r)
}
// If the query was successful, extract the JSON response
// and pass it onto the follower extractor.
// If the query failed, or is badly formatted, log an error.
// We should also be checking error codes here.
private def interpret(
login:String, response:Try[HttpResponse[String]]
) = response match {
case Success(r) => responseToJson(r.body) match {
case Success(jsonResponse) =>
followerExtractor ! FollowerExtractor.Extract(
login, jsonResponse)
case Failure(e) =>
log.error(
s"Error parsing response to JSON for $login: $e")
}
case Failure(e) => log.error(
s"Error fetching URL for $login: $e")
}
// Try and parse the response body as JSON.
// If successful, coerce the `JValue` to a `JArray`.
private def responseToJson(responseBody:String):Try[JArray] = {
val jvalue = Try { parse(responseBody) }
jvalue.flatMap {
case a:JArray => Success(a)
case _ => Failure(new IllegalStateException(
"Incorrectly formatted JSON: not an array"))
}
}
}
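The comment in the interpret method notes that we should also be checking error codes. A minimal sketch of such a check follows. It works on a plain status code and body rather than on the HttpResponse type, and the StatusCheck object and checkStatus name are hypothetical:

```scala
import scala.util.{Try, Success, Failure}

object StatusCheck {
  // Accept only 2xx responses; turn anything else into a
  // Failure so that it can be handled like a parse error.
  def checkStatus(code: Int, body: String): Try[String] =
    if (code >= 200 && code < 300) Success(body)
    else Failure(
      new IllegalStateException(s"Unexpected HTTP status $code"))
}
```

As the result is a Try, it could be chained with responseToJson through flatMap, so that a bad status and badly formatted JSON end up in the same Failure branch.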

We now have two-thirds of our worker actors. The last link is the follower extractor.
This actor's job is simple: it takes the JArray passed to it by the response interpreter
and converts it to a list of followers. For now, we will just log this list, but when
we build our fetcher manager, the follower extractor will send messages asking the
manager to add the followers to its queue of logins to fetch.
As before, the companion just defines the messages that this actor can receive and a
Props factory method:
// FollowerExtractor.scala
import akka.actor._
import org.json4s._
import org.json4s.native.JsonMethods._
object FollowerExtractor {
// Messages
case class Extract(login:String, jsonResponse:JArray)
// Props factory method
def props = Props[FollowerExtractor]
}

The FollowerExtractor class receives Extract messages containing a JArray of
objects, each representing a follower. It extracts the login field of each follower and
logs the resulting list:
class FollowerExtractor extends Actor with ActorLogging {
import FollowerExtractor._
def receive = {
case Extract(login, followerArray) => {
val followers = extractFollowers(followerArray)
log.info(s"$login -> ${followers.mkString(", ")}")
}
}
def extractFollowers(followerArray:JArray) = for {
JObject(follower) <- followerArray
JField("login", JString(login)) <- follower
} yield login
}

Let's write a new main method to exercise all our actors:
// FetchNetwork.scala
import akka.actor._
import akka.routing._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
object FetchNetwork extends App {
import Fetcher._ // Import messages and factory method
// Get token if exists
val token = sys.env.get("GHTOKEN")
val system = ActorSystem("fetchers")
// Instantiate actors
val followerExtractor = system.actorOf(FollowerExtractor.props)
val responseInterpreter =
system.actorOf(ResponseInterpreter.props(followerExtractor))
val router = system.actorOf(RoundRobinPool(4).props(
Fetcher.props(token, responseInterpreter))
)
List("odersky", "derekwyatt", "rkuhn", "tototoshi") foreach {
login => router ! Fetch(login)
}
// schedule a shutdown
system.scheduler.scheduleOnce(5.seconds) { system.shutdown }
}

Let's run this through SBT:
$ GHTOKEN="2502761d..." sbt run
[INFO] [11/05/2015 20:09:37.048] [fetchers-akka.actor.defaultdispatcher-3] [akka://fetchers/user/$a] derekwyatt -> adulteratedjedi,
joonas, Psycojoker, trapd00r, tyru, ...
[INFO] [11/05/2015 20:09:37.050] [fetchers-akka.actor.defaultdispatcher-3] [akka://fetchers/user/$a] tototoshi -> akr4, yuroyoro,
seratch, yyuu, ...
[INFO] [11/05/2015 20:09:37.051] [fetchers-akka.actor.defaultdispatcher-3] [akka://fetchers/user/$a] odersky -> misto, gkossakowski,
mushtaq, ...
[INFO] [11/05/2015 20:09:37.052] [fetchers-akka.actor.defaultdispatcher-3] [akka://fetchers/user/$a] rkuhn -> arnbak, uzoice, jond3k,
TimothyKlim, relrod, ...

Queue control and the pull pattern
We have now defined the three worker actors in our crawler application. The next
step is to define the manager. The fetcher manager is responsible for keeping a queue
of logins to fetch as well as a set of login names that we have already seen in order to
avoid fetching the same logins more than once.
A first attempt might involve building an actor that keeps a set of users that we
have already seen and, when given a new user to fetch, just dispatches that user to a
round-robin router for fetchers. The problem with this approach is that the number of
messages in the fetchers' mailboxes would accumulate quickly: for each API query,
we are likely to get tens of followers, each of which is likely to make it back to a
fetcher's inbox. This gives us very little control over the amount of work piling up.
The first problem that this is likely to cause involves the GitHub API rate limit: even
with authentication, we are limited to 5,000 requests per hour. It would be useful to
stop queries as soon as we hit this threshold. We cannot be responsive if each fetcher
has a backlog of hundreds of users that they need to fetch.
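To put the limit in perspective, 5,000 requests per hour works out to one request every 3,600 / 5,000 = 0.72 seconds on average. The following sketch shows one simple way to track such a budget with a fixed-window counter. It only illustrates the idea and is not how the crawler in this chapter throttles itself; the RequestBudget and RateMath names are hypothetical:

```scala
// Allows at most `limit` requests in each window of
// `windowMillis` milliseconds.
final class RequestBudget(limit: Int, windowMillis: Long) {
  private var windowStart = 0L
  private var used = 0

  // Returns true and records the request if the budget for the
  // current window allows it; returns false otherwise.
  def tryAcquire(nowMillis: Long): Boolean = {
    if (nowMillis - windowStart >= windowMillis) {
      // A new window has started: reset the counter.
      windowStart = nowMillis
      used = 0
    }
    if (used < limit) { used += 1; true } else false
  }
}

object RateMath {
  // Average spacing between requests, in milliseconds, that
  // stays within the limit.
  def minSpacingMillis(limit: Int, windowMillis: Long): Long =
    windowMillis / limit
}
```

A manager holding such a budget can decline to hand out work as soon as tryAcquire returns false, which is only possible if the work has not already been pushed into the fetchers' mailboxes.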
A better alternative is to use a pull system: the fetchers request work from a central
queue when they find themselves idle. Pull systems are common in Akka when we
have a producer that produces work faster than consumers can process it (refer to
http://www.michaelpollmeier.com/akka-work-pulling-pattern/).
Conversations between the manager and fetchers will proceed as follows:
•

If the manager goes from a state of having no work to having work, it sends a
WorkAvailable message to all the fetchers.

•

Whenever a fetcher receives a WorkAvailable message or when it completes
an item of work, it sends a GiveMeWork message to the queue manager.

•

When the queue manager receives a GiveMeWork message, it ignores the
request if no work is available or it is throttled. If it has work, it sends a
Fetch(user) message to the actor.
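The manager's side of this conversation can be sketched without any actors, as a pair of functions over an immutable state. This is a simplified model under our own assumptions: it ignores throttling, and ManagerState and PullModel are hypothetical names that do not appear in the sample code:

```scala
import scala.collection.immutable.Queue

// The logins waiting to be fetched and every login seen so far.
final case class ManagerState(queue: Queue[String], seen: Set[String]) {

  // Add a login unless we have already seen it. The Boolean
  // reports whether the queue went from empty to non-empty,
  // which is when a WorkAvailable broadcast would be sent.
  def add(login: String): (ManagerState, Boolean) =
    if (seen(login)) (this, false)
    else (ManagerState(queue.enqueue(login), seen + login), queue.isEmpty)

  // Answer a GiveMeWork request: hand out the next login if
  // there is one, otherwise ignore the request.
  def giveWork: (ManagerState, Option[String]) =
    queue.dequeueOption match {
      case Some((login, rest)) => (copy(queue = rest), Some(login))
      case None                => (this, None)
    }
}

object PullModel {
  val empty: ManagerState = ManagerState(Queue.empty, Set.empty)
}
```

Each function returns the new state together with what the manager would send in response, which makes the protocol easy to check in isolation.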

Let's start by modifying our fetcher. You can find the code examples for this section
in the chap09/ghub_crawler directory in the sample code provided with this book
(https://github.com/pbugnion/s4ds). We will pass a reference to the fetcher
manager through the constructor. We need to change the companion object to add
the WorkAvailable message and the props factory to include the reference to
the manager:
// Fetcher.scala
object Fetcher {
case class Fetch(url:String)
case object WorkAvailable
def props(
token:Option[String],
fetcherManager:ActorRef,
responseInterpreter:ActorRef):Props =
Props(classOf[Fetcher],
token, fetcherManager, responseInterpreter)
}

We also need to change the receive method so that it queries the FetcherManager
asking for more work once it's done processing a request or when it receives a
WorkAvailable message.
This is the final version of the fetchers:
class Fetcher(
val token:Option[String],
val fetcherManager:ActorRef,
val responseInterpreter:ActorRef)
extends Actor with ActorLogging {
import Fetcher._
import context.dispatcher
def receive = {
case Fetch(login) => fetchFollowers(login)
case WorkAvailable =>
fetcherManager ! FetcherManager.GiveMeWork
}
private def fetchFollowers(login:String) {
val unauthorizedRequest = Http(
s"https://api.github.com/users/$login/followers")
val authorizedRequest = token.map { t =>
unauthorizedRequest.header("Authorization", s"token $t")
}
val request = authorizedRequest.getOrElse(unauthorizedRequest)
val response = Future { request.asString }
response.onComplete { r =>
responseInterpreter !
ResponseInterpreter.InterpretResponse(login, r)
fetcherManager ! FetcherManager.GiveMeWork
}
}
}

Now that we have a working definition of the fetchers, let's build the
FetcherManager. This is the most complex actor that we have built so far, and,
before we dive into building it, we need to learn a bit more about the components of
the Akka toolkit.

Accessing the sender of a message
When our fetcher manager receives a GiveMeWork request, we will need to send
work back to the correct fetcher. We can access the actor who sent a message
using the sender method, which is a method of Actor that returns the ActorRef
corresponding to the actor who sent the message currently being processed. The
case statement corresponding to GiveMeWork in the fetcher manager is therefore:
def receive = {
case GiveMeWork =>
val login = ... // get next login to fetch
sender ! Fetcher.Fetch(login)
...
}

As sender is a method, its return value will change for every new incoming message.
It should therefore only be used synchronously with the receive method. In
particular, using it in a future is dangerous:
def receive = {
case DoSomeWork =>
val work = Future { Thread.sleep(20000) ; 5 }
work.onComplete { result =>
sender ! Complete(result) // NO!
}
}

The problem is that when the future is completed 20 seconds after the message is
processed, the actor will, in all likelihood, be processing a different message so the
return value of sender will have changed. We will thus send the Complete message
to a completely different actor.

If you need to reply to a message outside of the receive method, such as when a
future completes, you should bind the value of the current sender to a variable:
def receive = {
case DoSomeWork =>
// bind the current value of sender to a val
val requestor = sender
val work = Future { Thread.sleep(20000) ; 5 }
work.onComplete { result => requestor ! Complete(result) }
}
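The hazard can be reproduced without Akka or futures. In the following plain-Scala model, sender is a method that reads mutable state, and deferred callbacks stand in for Future.onComplete. The Mailroom class and its method names are hypothetical and exist only for illustration:

```scala
import scala.collection.mutable.ListBuffer

class Mailroom {
  private var current = "nobody"
  // A method, so it is re-evaluated on every call.
  def sender: String = current

  // Callbacks that will run later, like Future.onComplete.
  private val pending = ListBuffer.empty[() => String]

  // Unsafe: the callback calls `sender` when it eventually runs.
  def receiveUnsafe(from: String): Unit = {
    current = from
    pending += (() => sender)
  }

  // Safe: the value of `sender` is bound to a val right away.
  def receiveSafe(from: String): Unit = {
    current = from
    val requestor = sender
    pending += (() => requestor)
  }

  // Run all deferred callbacks and report who each would reply to.
  def complete(): List[String] = {
    val replies = pending.map(f => f()).toList
    pending.clear()
    replies
  }
}
```

After two unsafe messages from different senders, both callbacks would reply to the second sender; with the safe variant, each callback replies to the sender of its own message.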

Stateful actors
The behavior of the fetcher manager depends on whether it has work to give out to
the fetchers:
•

If it has work to give, it needs to respond to GiveMeWork messages with a
Fetcher.Fetch message

•

If it does not have work, it must ignore the GiveMeWork messages and, if
work gets added, it must send a WorkAvailable message to the fetchers

Encoding the notion of state is straightforward in Akka. We specify different
receive methods and switch from one to the other depending on the state. We will
define the following receive methods for our fetcher manager, corresponding to
each of the states:
// receive method when the queue is empty
def receiveWhileEmpty: Receive = {
...
}
// receive method when the queue is not empty
def receiveWhileNotEmpty: Receive = {
...
}

Note that we must define the return type of the receive methods as Receive.
To switch the actor from one method to the other, we can use
context.become(methodName). Thus, for instance, when the last login name is popped
off the queue, we can transition to using the receiveWhileEmpty method with
context.become(receiveWhileEmpty). We set the initial state by assigning
receiveWhileEmpty to the receive method:
def receive = receiveWhileEmpty
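To see how behavior switching works in principle, here is a plain-Scala model in which the current behavior is a partial function held in a variable. This is a sketch of the idea only, not Akka's implementation; ToyManager, its string messages, and its counter are hypothetical:

```scala
import scala.collection.mutable.ListBuffer

class ToyManager {
  type Receive = PartialFunction[Any, Unit]

  // Record of how each "give" request was answered.
  val log = ListBuffer.empty[String]
  private var pendingWork = 0

  def whileEmpty: Receive = {
    case "add" =>
      pendingWork += 1
      become(whileNotEmpty)
    case "give" => log += "ignored"
  }

  def whileNotEmpty: Receive = {
    case "add" => pendingWork += 1
    case "give" =>
      pendingWork -= 1
      log += "work sent"
      if (pendingWork == 0) become(whileEmpty)
  }

  // Start in the empty state; `become` swaps the behavior.
  private var behavior: Receive = whileEmpty
  private def become(next: Receive): Unit = behavior = next

  // Deliver a message to whichever behavior is current.
  def send(message: Any): Unit =
    if (behavior.isDefinedAt(message)) behavior(message)
}
```

The same "give" message is ignored in one state and answered in the other, without any conditional on the amount of pending work inside the handler for the empty state.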

Follower network crawler
We are now ready to code up the remaining pieces of our network crawler. The
largest missing piece is the fetcher manager. Let's start with the companion object.
As with the worker actors, this just contains the definitions of the messages that the
actor can receive and a factory to create the Props instance:
// FetcherManager.scala
import scala.collection.mutable
import akka.actor._
object FetcherManager {
case class AddToQueue(login:String)
case object GiveMeWork
def props(token:Option[String], nFetchers:Int) =
Props(classOf[FetcherManager], token, nFetchers)
}

The manager can receive two messages: AddToQueue, which tells it to add a
username to the queue of users whose followers need to be fetched, and GiveMeWork,
emitted by the fetchers when they are unemployed.
The manager will be responsible for launching the fetchers, response interpreter, and
follower extractor, as well as maintaining an internal queue of usernames and a set of
usernames that we have seen:
// FetcherManager.scala
class FetcherManager(val token:Option[String], val nFetchers:Int)
extends Actor with ActorLogging {
import FetcherManager._
// queue of usernames whose followers we need to fetch
val fetchQueue = mutable.Queue.empty[String]
// set of users we have already fetched.
val fetchedUsers = mutable.Set.empty[String]
// Instantiate worker actors
val followerExtractor = context.actorOf(
FollowerExtractor.props(self))
val responseInterpreter = context.actorOf(
ResponseInterpreter.props(followerExtractor))
val fetchers = (0 until nFetchers).map { i =>
context.actorOf(
Fetcher.props(token, self, responseInterpreter))
}
// receive method when the actor has work:
// If we receive additional work, we just push it onto the
// queue.
// If we receive a request for work from a Fetcher,
// we pop an item off the queue. If that leaves the
// queue empty, we transition to the 'receiveWhileEmpty'
// method.
def receiveWhileNotEmpty:Receive = {
case AddToQueue(login) => queueIfNotFetched(login)
case GiveMeWork =>
val login = fetchQueue.dequeue
// send a Fetch message back to the sender.
// we can use the `sender` method to reply to a message
sender ! Fetcher.Fetch(login)
if (fetchQueue.isEmpty) {
context.become(receiveWhileEmpty)
}
}
// receive method when the actor has no work:
// if we receive work, we add it onto the queue, transition
// to a state where we have work, and notify the fetchers
// that work is available.
def receiveWhileEmpty:Receive = {
case AddToQueue(login) =>
queueIfNotFetched(login)
context.become(receiveWhileNotEmpty)
fetchers.foreach { _ ! Fetcher.WorkAvailable }
case GiveMeWork => // do nothing
}
// Start with an empty queue.
def receive = receiveWhileEmpty
def queueIfNotFetched(login:String) {
if (! fetchedUsers(login)) {
log.info(s"Pushing $login onto queue")
// or do something useful...
fetchQueue += login
fetchedUsers += login
}
}
}

We now have a fetcher manager. The rest of the code can remain the same, apart
from the follower extractor. Instead of logging follower names, it must send
AddToQueue messages to the manager. We will pass a reference to the manager at
construction time:
// FollowerExtractor.scala
import akka.actor._
import org.json4s._
import org.json4s.native.JsonMethods._
object FollowerExtractor {
// messages
case class Extract(login:String, jsonResponse:JArray)
// props factory method
def props(manager:ActorRef) =
Props(classOf[FollowerExtractor], manager)
}
class FollowerExtractor(manager:ActorRef)
extends Actor with ActorLogging {
import FollowerExtractor._
def receive = {
case Extract(login, followerArray) =>
val followers = extractFollowers(followerArray)
followers foreach { f =>
manager ! FetcherManager.AddToQueue(f)
}
}
def extractFollowers(followerArray:JArray) = for {
JObject(follower) <- followerArray
JField("login", JString(login)) <- follower
} yield login
}

The main method running all this is remarkably simple as all the code to instantiate
actors has been moved to the FetcherManager. We just need to instantiate the
manager and give it the first node in the network, and it will do the rest:
// FetchNetwork.scala
import akka.actor._
object FetchNetwork extends App {
// Get token if exists
val token = sys.env.get("GHTOKEN")
val system = ActorSystem("GithubFetcher")
val manager = system.actorOf(FetcherManager.props(token, 2))
manager ! FetcherManager.AddToQueue("odersky")
}

Notice how we do not attempt to shut down the actor system anymore. We will just
let it run, crawling the network, until we stop it or hit the API rate limit. Let's
run this through SBT:
$ GHTOKEN="2502761d..." sbt "runMain FetchNetwork"
[INFO] [11/06/2015 06:31:04.614] [GithubFetcher-akka.actor.defaultdispatcher-2] [akka://GithubFetcher/user/$a] Pushing odersky onto queue
[INFO] [11/06/2015 06:31:05.563] [GithubFetcher-akka.actor.defaultdispatcher-4] [akka://GithubFetcher/user/$a] Pushing misto onto queue
[INFO] [11/06/2015 06:31:05.563] [GithubFetcher-akka.actor.defaultdispatcher-4] [akka://GithubFetcher/user/$a] Pushing gkossakowski onto queue
^C

Our program does not actually do anything useful with the followers that it retrieves
besides logging them. We could replace the log.info call to, for instance, store the
nodes in a database or draw the graph to screen.

Fault tolerance
Real programs fail, and they fail in unpredictable ways. Akka, and the Scala
community in general, favors planning explicitly for failure rather than trying to
write infallible applications. A fault-tolerant system is a system that can continue
to operate when one or more of its components fails. The failure of an individual
subsystem does not necessarily mean the failure of the application. How does this
apply to Akka?
The actor model provides a natural unit to encapsulate failure: the actor. When an
actor throws an exception while processing a message, the default behavior is for the
actor to restart, but the exception does not leak out and affect the rest of the system.
For instance, let's introduce an arbitrary failure in the response interpreter. We will
modify the receive method to throw an exception when it is asked to interpret the
response for misto, one of Martin Odersky's followers:
// ResponseInterpreter.scala
def receive = {
case InterpretResponse("misto", r) =>
throw new IllegalStateException("custom error")
case InterpretResponse(login, r) => interpret(login, r)
}

If you rerun the code through SBT, you will notice that an error gets logged. The
program does not crash, however. It just continues as normal:
[ERROR] [11/07/2015 12:05:58.938] [GithubFetcher-akka.actor.defaultdispatcher-2] [akka://GithubFetcher/user/$a/$b] custom error
java.lang.IllegalStateException: custom error
at ResponseInterpreter$
...
[INFO] [11/07/2015 12:05:59.117] [GithubFetcher-akka.actor.defaultdispatcher-2] [akka://GithubFetcher/user/$a] Pushing samfoo onto queue

None of the followers of misto will get added to the queue: he never made it past the
ResponseInterpreter stage. Let's step through what happens when the exception
gets thrown:
•

The interpreter is sent the InterpretResponse("misto", ...) message.
This causes it to throw an exception and it dies. None of the other actors are
affected by the exception.

•

A fresh instance of the response interpreter is created with the same Props
instance as the recently deceased actor.

•

When the response interpreter has finished initializing, it gets bound to the
same ActorRef as the deceased actor. This means that, as far as the rest of the
system is concerned, nothing has changed.

•

The mailbox is tied to ActorRef rather than the actor, so the new response
interpreter will have the same mailbox as its predecessor, without the
offending message.

Thus, if, for whatever reason, our crawler crashes when fetching or parsing the
response for a user, the application will be minimally affected—we will just not fetch
this user's followers.
Any internal state that an actor carries is lost when it restarts. Thus, if, for instance, the
fetcher manager died, we would lose the current value of the queue and visited users.
The risks associated with losing the internal state can be mitigated by the following:
•

Adopting a different strategy for failure: we can, for instance, carry on
processing messages without restarting the actor in the event of failure.
Of course, this is of little use if the actor died because its internal state is
inconsistent. In the next section, we will discuss how to change the failure
recovery strategy.

•

Backing up the internal state by writing it to disk periodically and loading
from the backup on restart.

•

Protecting actors that carry critical state by ensuring that all "risky"
operations are delegated to other actors. In our crawler example, all the
interactions with external services, such as querying the GitHub API and
parsing the response, happen with actors that carry no internal state. As we
saw in the previous example, if one of these actors dies, the application is
minimally affected. By contrast, the precious fetcher manager is only allowed
to interact with sanitized inputs. This is called the error kernel pattern: code
likely to cause errors is delegated to kamikaze actors.

Custom supervisor strategies
The default strategy of restarting an actor on failure is not always what we want. In
particular, for actors that carry a lot of data, we might want to resume processing
after an exception rather than restarting the actor. Akka lets us customize this
behavior by setting a supervisor strategy in the actor's supervisor.
Recall that all actors have parents, including the top-level actors, who are children of
a special actor called the user guardian. By default, an actor's supervisor is its parent,
and it is the supervisor who decides what happens to the actor on failure.

Thus, to change how an actor reacts to failure, you must set its parent's supervisor
strategy. You do this by setting the supervisorStrategy attribute. The default
strategy is equivalent to the following:
val supervisorStrategy = OneForOneStrategy() {
case _:ActorInitializationException => Stop
case _:ActorKilledException => Stop
case _:DeathPactException => Stop
case _:Exception => Restart
}

There are two components to a supervisor strategy:
•

OneForOneStrategy determines that the strategy applies only to the actor
that failed. By contrast, we can use AllForOneStrategy, which applies the
same strategy to all the supervisees. If a single child fails, all the children will
be restarted (or stopped or resumed).

•

A partial function mapping Throwables to a Directive, which is an
instruction on what to do in response to a failure. The default strategy, for
instance, maps ActorInitializationException (which happens if the
constructor fails) to the Stop directive and (almost all) other exceptions
to Restart.

There are four directives:
•

Restart: This destroys the faulty actor and restarts it, binding the newborn
actor to the old ActorRef. This clears the internal state of the actor, which
may be a good thing (the actor might have failed because of some internal
inconsistency).

•

Resume: The actor just moves on to processing the next message in its inbox.

•

Stop: The actor stops and is not restarted. This is useful in throwaway actors
that you use to complete a single operation: if this operation fails, the actor is
not needed any more.

•

Escalate: The supervisor itself rethrows the exception, hoping that its
supervisor will know what to do with it.
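The effect of the first three directives on an actor's state can be illustrated with a small plain-Scala model. The Restart, Resume, and Stop objects below are local stand-ins defined for this sketch, not Akka's directives, and Escalate is left out because it needs a hierarchy of supervisors. The toy "actor" keeps a running total and fails on negative input:

```scala
sealed trait Directive
case object Restart extends Directive
case object Resume extends Directive
case object Stop extends Directive

object DirectiveDemo {
  // Returns the final total and whether the toy actor is
  // still alive after all the messages have been delivered.
  def run(messages: List[Int], directive: Directive): (Int, Boolean) = {
    var total = 0
    var alive = true
    messages.foreach { n =>
      if (alive) {
        try {
          if (n < 0) throw new IllegalArgumentException(s"negative: $n")
          total += n
        } catch {
          case _: IllegalArgumentException =>
            directive match {
              case Restart => total = 0     // fresh instance, state lost
              case Resume  => ()            // keep state, skip the message
              case Stop    => alive = false // no further messages
            }
        }
      }
    }
    (total, alive)
  }
}
```

For the messages 1, 2, -1, 4, restarting ends with a total of 4, resuming ends with 7, and stopping leaves the total at 3 with the actor no longer alive.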

The partial function in a supervisor strategy matches only on the exception, not on which child failed. Thus, if an actor has
children that might require different recovery strategies, it is best to create a set of
intermediate supervisor actors to supervise the different groups of children.

As an example of setting the supervisor strategy, let's tweak the FetcherManager
supervisor strategy to adopt an all-for-one strategy and stop its children when one of
them fails. We start with the relevant imports:
import akka.actor.SupervisorStrategy._

Then, we just need to set the supervisorStrategy attribute in the FetcherManager
definition:
class FetcherManager(...) extends Actor with ActorLogging {
  ...
  override val supervisorStrategy = AllForOneStrategy() {
    case _:ActorInitializationException => Stop
    case _:ActorKilledException => Stop
    case _:Exception => Stop
  }
  ...
}

If you run this through SBT, you will notice that when the code comes across the
custom exception thrown by the response interpreter, the system halts. This is
because all the actors apart from the fetcher manager are now defunct.

Life-cycle hooks
Akka lets us specify code that runs in response to specific events in an actor's life,
through life-cycle hooks. Akka defines the following hooks:
• preStart(): This runs after the actor's constructor has finished but before
  it starts processing messages. This is useful to run initialization code that
  depends on the actor being fully constructed.
• postStop(): This runs when the actor dies after it has stopped processing
  messages. This is useful to run cleanup code before terminating the actor.
• preRestart(reason: Throwable, message: Option[Any]): This is called
  just after an actor receives an order to restart. The preRestart method
  has access to the exception that was thrown and to the offending message,
  allowing for corrective action. The default behavior of preRestart is to stop
  each child and then call postStop.
• postRestart(reason:Throwable): This is called after an actor has restarted.
  The default behavior is to call preStart().

Let's use life-cycle hooks to persist the state of FetcherManager between runs of the
program. You can find the code examples for this section in the
chap09/ghub_crawler_fault_tolerant directory in the sample code provided with this book
(https://github.com/pbugnion/s4ds). This will make the fetcher manager
fault-tolerant. We will use postStop to write the current queue and set of visited users to
text files and preStart to read these text files from the disk. Let's start by importing
the libraries necessary to read and write files:
// FetcherManager.scala
import scala.io.Source
import scala.util._
import java.io._

We will store the names of the two text files in which we persist the state in the
FetcherManager companion object (a better approach would be to store them in a
configuration file):
// FetcherManager.scala
object FetcherManager {
  ...
  val fetchedUsersFileName = "fetched-users.txt"
  val fetchQueueFileName = "fetch-queue.txt"
}

In the preStart method, we load both the set of fetched users and the backlog of
users to fetch from the text files, and in the postStop method, we overwrite these
files with the new values of these data structures:
class FetcherManager(
  val token:Option[String], val nFetchers:Int
) extends Actor with ActorLogging {
  ...
  /** pre-start method: load saved state from text files */
  override def preStart {
    log.info("Running pre-start on fetcher manager")
    loadFetchedUsers
    log.info(
      s"Read ${fetchedUsers.size} visited users from source"
    )
    loadFetchQueue
    log.info(
      s"Read ${fetchQueue.size} users in queue from source"
    )
    // If the saved state contains a non-empty queue,
    // alert the fetchers so they can start working.
    if (fetchQueue.nonEmpty) {
      context.become(receiveWhileNotEmpty)
      fetchers.foreach { _ ! Fetcher.WorkAvailable }
    }
  }

  /** Dump the current state of the manager */
  override def postStop {
    log.info("Running post-stop on fetcher manager")
    saveFetchedUsers
    saveFetchQueue
  }

  /* Helper methods to load from and write to files */
  def loadFetchedUsers {
    val fetchedUsersSource = Try {
      Source.fromFile(fetchedUsersFileName)
    }
    fetchedUsersSource.foreach { s =>
      try s.getLines.foreach { l => fetchedUsers += l }
      finally s.close
    }
  }

  def loadFetchQueue {
    val fetchQueueSource = Try {
      Source.fromFile(fetchQueueFileName)
    }
    fetchQueueSource.foreach { s =>
      try s.getLines.foreach { l => fetchQueue += l }
      finally s.close
    }
  }

  def saveFetchedUsers {
    val fetchedUsersFile = new File(fetchedUsersFileName)
    val writer = new BufferedWriter(
      new FileWriter(fetchedUsersFile))
    fetchedUsers.foreach { user => writer.write(user + "\n") }
    writer.close()
  }

  def saveFetchQueue {
    val queueUsersFile = new File(fetchQueueFileName)
    val writer = new BufferedWriter(
      new FileWriter(queueUsersFile))
    fetchQueue.foreach { user => writer.write(user + "\n") }
    writer.close()
  }
  ...
}
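
The save and load helpers follow a pattern that can be tried outside an actor. The following is a stand-alone sketch using only the standard library; the StatePersistence object and its saveLines and loadLines methods are hypothetical names, not part of the crawler. Unlike the helpers in the listing, this variant closes the writer in a finally block, so the file handle is released even if a write fails.

```scala
import java.io.{BufferedWriter, File, FileWriter}
import scala.io.Source
import scala.util.Try

// Stand-alone sketch; StatePersistence, saveLines and loadLines are
// hypothetical names introduced for illustration.
object StatePersistence {
  /** Write one entry per line, closing the writer even if a write fails. */
  def saveLines(file: File, lines: Iterable[String]): Unit = {
    val writer = new BufferedWriter(new FileWriter(file))
    try lines.foreach { line => writer.write(line + "\n") }
    finally writer.close()
  }

  /** Read the entries back; a missing or unreadable file yields an empty vector. */
  def loadLines(file: File): Vector[String] =
    Try(Source.fromFile(file)).map { source =>
      try source.getLines().toVector
      finally source.close()
    }.getOrElse(Vector.empty)

  def main(args: Array[String]): Unit = {
    val file = File.createTempFile("fetch-queue", ".txt")
    saveLines(file, List("odersky", "pbugnion"))
    println(loadLines(file))
    file.delete()
  }
}
```

Treating a missing file as empty state matches the first run of the crawler, when no state has been saved yet.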

Now that we save the state of the crawler when it shuts down, we can use a better
termination condition than simply interrupting the program once
we get bored. In production, we might halt the crawler when we have enough names
in a database, for instance. In this example, we will simply let the crawler run for
30 seconds and then shut it down.
Let's modify the main method:
// FetchNetwork.scala
import akka.actor._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
object FetchNetwork extends App {
  // Get token if exists
  val token = sys.env.get("GHTOKEN")
  val system = ActorSystem("GithubFetcher")
  val manager = system.actorOf(FetcherManager.props(token, 2))
  manager ! FetcherManager.AddToQueue("odersky")
  system.scheduler.scheduleOnce(30.seconds) { system.shutdown }
}

After 30 seconds, we just call system.shutdown, which stops all the actors
recursively. This will stop the fetcher manager, calling the postStop life cycle hook.
After one run of the program, I have 2,164 names in the fetched-users.txt file.
Running it again increases this number to 3,728 users.
We could improve fault tolerance further by making the fetcher manager dump the
data structures at regular intervals while the code runs. As writing to the disk (or to a
database) carries a certain element of risk (What if the database server goes down or
the disk is full?) it would be better to delegate writing the data structures to a custom
actor rather than endangering the manager.
Our crawler has one minor problem: when the fetcher manager stops, it stops the
fetcher actors, response interpreter, and follower extractor. However, none of the
users currently going through these actors are stored. This also results in a small
number of undelivered messages at the end of the code: if the response interpreter
stops before a fetcher, the fetcher will try to deliver to a non-existent actor. This only
accounts for a small number of users. To recover these login names, we can create
a reaper actor whose job is to coordinate the killing of all the worker actors in the
correct order and harvest their internal state. This pattern is documented in a blog
post by Derek Wyatt (http://letitcrash.com/post/30165507578/shutdown-patterns-in-akka-2).

What we have not talked about
Akka is a very rich ecosystem, far too rich to do it justice in a single chapter.
There are some important parts of the toolkit that you will need, but we have not
covered them here. We will give brief descriptions, but you can refer to the Akka
documentation for more details:
• The ask operator, ?, offers an alternative to the tell operator, !, that we have
  used to send messages to actors. Unlike "tell", which just fires a message to
  an actor, the ask operator expects a response. This is useful when we need to
  ask actors questions rather than just telling them what to do. The ask pattern
  is documented at
  http://doc.akka.io/docs/akka/snapshot/scala/actors.html#Ask__Send-And-Receive-Future.
• Deathwatch allows actors to watch another actor and receive a message
  when it dies. This is useful for actors that might depend on another actor but
  not be its direct supervisor. This is documented at
  http://doc.akka.io/docs/akka/snapshot/scala/actors.html#Lifecycle_Monitoring_aka_DeathWatch.
• In our crawler, we passed references to actors explicitly through the
  constructor. We can also look up actors using the actor hierarchy with a
  syntax reminiscent of files in a filesystem. This is documented at
  http://doc.akka.io/docs/akka/snapshot/scala/actors.html#Identifying_Actors_via_Actor_Selection.
• We briefly explored how to implement stateful actors with different receive
  methods and using context.become to switch between them. Akka offers a
  more powerful alternative, based on finite state machines, to encode a more
  complex set of states and transitions:
  http://doc.akka.io/docs/akka/snapshot/scala/fsm.html.
• We have not discussed distributing actor systems across several nodes in
  this chapter. The message passing architecture works well with distributed
  setups: http://doc.akka.io/docs/akka/2.4.0/common/cluster.html.

Summary
In this chapter, you learned how to weave actors together to tackle a difficult
concurrent problem. More importantly, we saw how Akka's actor framework
encourages us to think about concurrent problems in terms of many separate chunks
of encapsulated mutable data, synchronized through message passing. Akka makes
concurrent programming easier to reason about and more fun.

References
Derek Wyatt's book, Akka Concurrency, is a fantastic introduction to Akka. It should
definitely be the first stop for anyone wanting to do serious Akka programming.
The LET IT CRASH blog (http://letitcrash.com) is the official Akka blog, and
contains many examples of idioms and patterns to solve common issues.

Distributed Batch Processing with Spark
In Chapter 4, Parallel Collections and Futures, we discovered how to use parallel
collections for "embarrassingly" parallel problems: problems that can be broken
down into a series of tasks that require no (or very little) communication between
the tasks.
Apache Spark provides behavior similar to Scala parallel collections (and much
more), but, instead of distributing tasks across different CPUs on the same computer,
it allows the tasks to be distributed across a computer cluster. This provides arbitrary
horizontal scalability, since we can simply add more computers to the cluster.
In this chapter, we will learn the basics of Apache Spark and use it to explore a set of
emails, extracting features with the view of building a spam filter. We will explore
several ways of actually building a spam filter in Chapter 12, Distributed Machine
Learning with MLlib.

Installing Spark
In previous chapters, we included dependencies by specifying them in a build.sbt
file, and relying on SBT to fetch them from the Maven Central repositories. For
Apache Spark, downloading the source code or pre-built binaries explicitly is more
common, since Spark ships with many command line scripts that greatly facilitate
launching jobs and interacting with a cluster.
Head over to http://spark.apache.org/downloads.html and download Spark
1.5.2, choosing the "pre-built for Hadoop 2.6 or later" package. You can also build
Spark from source if you need customizations, but we will stick to the pre-built
version since it requires no configuration.

Clicking Download will download a tarball, which you can unpack with the
following command:
$ tar xzf spark-1.5.2-bin-hadoop2.6.tgz

This will create a spark-1.5.2-bin-hadoop2.6 directory. To verify that Spark works
correctly, navigate to spark-1.5.2-bin-hadoop2.6/bin and launch the Spark shell
using ./spark-shell. This is just a Scala shell with the Spark libraries loaded.
You may want to add the bin/ directory to your system path. This will let you
call the scripts in that directory from anywhere on your system, without having to
reference the full path. On Linux or Mac OS, you can add a directory to the system
path by entering the following line in your shell configuration file (.bash_profile
on Mac OS, and .bashrc or .bash_profile on Linux), replacing /path/to/spark with
the location of your Spark directory:
export PATH=/path/to/spark/bin:$PATH

The changes will take effect in new shell sessions. On Windows (if you
use PowerShell), you need to enter the equivalent line in the profile.ps1 file in the
WindowsPowerShell folder in Documents, again substituting your own Spark location:
$env:Path += ";C:\path\to\spark\bin"

If this worked correctly, you should be able to open a Spark shell in any directory on
your system by just typing spark-shell in a terminal.

Acquiring the example data
In this chapter, we will explore the Ling-Spam email dataset (the original dataset
is described at http://csmining.org/index.php/ling-spam-datasets.html).
Download the dataset from http://data.scala4datascience.com/ling-spam.tar.gz
(or ling-spam.zip, depending on your preferred mode of compression),
and unpack the contents to the directory containing the code examples for this
chapter. The archive contains two directories, spam/ and ham/, containing the spam
and legitimate emails, respectively.

Chapter 10

Resilient distributed datasets
Spark expresses all computations as a sequence of transformations and actions on
distributed collections, called Resilient Distributed Datasets (RDDs). Let's explore
how RDDs work with the Spark shell. Navigate to the examples directory and
open a Spark shell as follows:
$ spark-shell
scala>

Let's start by loading an email in an RDD:
scala> val email = sc.textFile("ham/9-463msg1.txt")
email: rdd.RDD[String] = MapPartitionsRDD[1] at textFile

email is an RDD, with each element corresponding to a line in the input file. Notice
how we created the RDD by calling the textFile method on an object called sc:
scala> sc
spark.SparkContext = org.apache.spark.SparkContext@459bf87c

sc is a SparkContext instance, an object representing the entry point to the Spark
cluster (for now, just our local machine). When we start a Spark shell, a context is
created and bound to the variable sc automatically.

Let's split the email into words using flatMap:
scala> val words = email.flatMap { line => line.split("\\s") }
words: rdd.RDD[String] = MapPartitionsRDD[2] at flatMap

This will feel natural if you are familiar with collections in Scala: the email RDD
behaves just like a list of strings. Here, we split using the regular expression \s,
denoting white space characters. Instead of using flatMap explicitly, we can also
manipulate RDDs using Scala's for-comprehension syntax, which the compiler
translates to the same flatMap call:
scala> val words = for {
  line <- email
  word <- line.split("\\s")
} yield word
words: rdd.RDD[String] = MapPartitionsRDD[3] at flatMap

Let's inspect the results. We can use .take(n) to extract the first n elements of
an RDD:
scala> words.take(5)
Array[String] = Array(Subject:, tsd98, workshop, -, -)

We can also use .count to get the number of elements in an RDD:
scala> words.count
Long = 939

RDDs support many of the operations supported by collections. Let's use filter
to remove punctuation from our email. We will remove all words that contain any
non-alphanumeric character. We can do this by filtering out elements that match this
regular expression anywhere in the word: [^a-zA-Z0-9].
scala> val nonAlphaNumericPattern = "[^a-zA-Z0-9]".r
nonAlphaNumericPattern: Regex = [^a-zA-Z0-9]
scala> val filteredWords = words.filter {
  word => nonAlphaNumericPattern.findFirstIn(word) == None
}
filteredWords: rdd.RDD[String] = MapPartitionsRDD[4] at filter
scala> filteredWords.take(5)
Array[String] = Array(tsd98, workshop, 2nd, call, paper)
scala> filteredWords.count
Long = 627

In this example, we created an RDD from a text file. We can also create RDDs from
Scala iterables using the sc.parallelize method available on a Spark context:
scala> val words = "the quick brown fox jumped over the dog".split(" ")
words: Array[String] = Array(the, quick, brown, fox, ...)
scala> val wordsRDD = sc.parallelize(words)
wordsRDD: RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:23

This is useful for debugging and for trialling behavior in the shell. The counterpart to
parallelize is the .collect method, which converts an RDD to a Scala array:
scala> val wordLengths = wordsRDD.map { _.length }
wordLengths: RDD[Int] = MapPartitionsRDD[2] at map at <console>:25
scala> wordLengths.collect
Array[Int] = Array(3, 5, 5, 3, 6, 4, 3, 3)

The .collect method requires the entire RDD to fit in memory on the master node.
It is thus either used for debugging with a reduced dataset, or at the end of a pipeline
that trims down a dataset.
As you can see, RDDs offer an API much like Scala iterables. The critical difference is
that RDDs are distributed and resilient. Let's explore what this means in practice.

RDDs are immutable
You cannot change an RDD once it is created. All operations on RDDs either create
new RDDs or other Scala objects.

RDDs are lazy
When you execute operations like map and filter on a Scala collection in the
interactive shell, the REPL prints the values of the new collection to screen. The same
isn't true of Spark RDDs. This is because operations on RDDs are lazy: they are only
evaluated when needed.
Thus, when we write:
val email = sc.textFile(...)
val words = email.flatMap { line => line.split("\\s") }

we are creating an RDD, words, that knows how to build itself from its parent RDD,
email, which, in turn, knows that it needs to read a text file and split it into lines.
However, none of the commands actually happen until we force the evaluation of
the RDDs by calling an action to return a Scala object. This is most evident if we try to
read from a non-existent text file:
scala> val inp = sc.textFile("nonexistent")
inp: rdd.RDD[String] = MapPartitionsRDD[5] at textFile

We can create the RDD without a hitch. We can even define further transformations
on the RDD. The program crashes only when these transformations are finally
evaluated:
scala> inp.count // number of lines
org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: file:/Users/pascal/...

The action .count is expected to return the number of elements in our RDD as an
integer. Spark has no choice but to evaluate inp, which results in an exception.
Thus, it is probably more appropriate to think of an RDD as a pipeline of operations,
rather than a more traditional collection.
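
The same deferred behavior can be observed without Spark. The sketch below is only an analogy: it uses a view on an ordinary Scala list (LazyPipelineSketch and its fields are hypothetical names). Defining the mapped view runs nothing, and the malformed element only causes a failure when the view is forced, much like the missing input file above.

```scala
import scala.util.Try

// Plain-Scala analogy using a collection view; no Spark is involved.
object LazyPipelineSketch {
  // Counts how many times the mapping function actually runs.
  var evaluations = 0

  val lines = List("3", "5", "oops", "8")

  // Like a transformation: this only records what to do.
  val parsed = lines.view.map { s => evaluations += 1; s.toInt }

  // Nothing has been parsed at this point.
  val evaluationsBeforeAction = evaluations

  // Like an action: forcing the view runs the function, and the
  // malformed element only causes a failure now.
  val result = Try(parsed.sum)

  def main(args: Array[String]): Unit = {
    println(s"evaluations before forcing: $evaluationsBeforeAction")
    println(s"result of forcing: $result")
  }
}
```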

RDDs know their lineage
RDDs can only be constructed from stable storage (for instance, by loading data
from a file that is present on every node in the Spark cluster), or through a set of
transformations based on other RDDs. Since RDDs are lazy, they need to know
how to build themselves when needed. They do this by knowing who their parent
RDD is, and what operation they need to apply to the parent. This is a well-defined
process since the parent RDD is immutable.
The toDebugString method provides a diagram of how an RDD is constructed:
scala> filteredWords.toDebugString
(2) MapPartitionsRDD[6] at filter at <console>:27 []
 |  MapPartitionsRDD[3] at flatMap at <console>:23 []
 |  MapPartitionsRDD[1] at textFile at <console>:21 []
 |  ham/9-463msg1.txt HadoopRDD[0] at textFile at <console>:21 []

RDDs are resilient
If you run an application on a single computer, you generally don't need to worry
about hardware failure in your application: if the computer fails, your application is
doomed anyway.
Distributed architectures should, by contrast, be fault-tolerant: the failure of a single
machine should not crash the entire application. Spark RDDs are built with fault
tolerance in mind. Let's imagine that one of the worker nodes fails, causing the
destruction of some of the data associated with an RDD. Since the Spark RDD knows
how to build itself from its parent, there is no permanent data loss: the elements that
were lost can just be re-computed when needed on another computer.

RDDs are distributed
When you construct an RDD, for instance from a text file, Spark will split the RDD
into a number of partitions. Each partition will be entirely localized on a single
machine (though there is, in general, more than one partition per machine).
Many transformations on RDDs can be executed on each partition independently.
For instance, when performing a .map operation, a given element in the output RDD
depends on a single element in the parent: data does not need to be moved between
partitions. The same is true of .flatMap and .filter operations. This means that
the partition in the RDD produced by one of these operations depends on a single
partition in the parent RDD.
On the other hand, a .distinct transformation, which removes all duplicate
elements from an RDD, requires the data in a given partition to be compared to
the data in every other partition. This requires shuffling the data across the nodes.
Shuffling, especially for large datasets, is an expensive operation and should be
avoided if possible.
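
A toy model makes the difference visible. In the sketch below, which uses plain Scala collections rather than Spark, an "RDD" is simply a vector of partitions (PartitionSketch and its fields are hypothetical names). Mapping works partition by partition, whereas removing duplicates within each partition separately gives the wrong answer.

```scala
// Plain-Scala model: an "RDD" is a vector of partitions. No Spark is involved.
object PartitionSketch {
  val partitions: Vector[Vector[String]] =
    Vector(Vector("quick", "brown"), Vector("quick", "dog"))

  // map: each output partition is computed from exactly one input partition.
  val lengths: Vector[Vector[Int]] = partitions.map { p => p.map(_.length) }

  // Removing duplicates partition by partition is not enough:
  // "quick" survives once in each partition.
  val locallyDistinct: Vector[Vector[String]] = partitions.map(_.distinct)

  // A correct distinct has to bring elements from all partitions together,
  // which is the role the shuffle plays on a cluster.
  val globallyDistinct: Vector[String] = partitions.flatten.distinct

  def main(args: Array[String]): Unit = {
    println(lengths)
    println(locallyDistinct)
    println(globallyDistinct)
  }
}
```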

Transformations and actions on RDDs
The set of operations supported by an RDD can be split into two categories:
• Transformations create a new RDD from the current one. Transformations
  are lazy: they are not evaluated immediately.
• Actions force the evaluation of an RDD, and normally return a Scala object,
  rather than an RDD, or have some form of side-effect. Actions are evaluated
  immediately, triggering the execution of all the transformations that make up
  this RDD.

In the tables below, we give some examples of useful transformations and actions.
For a full, up-to-date list, consult the Spark documentation
(http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations).
For the examples in these tables, we assume that you have created an RDD with:
scala> val rdd = sc.parallelize(List("quick", "brown", "quick", "dog"))

The following table lists common transformations on an RDD. Recall that
transformations always generate a new RDD, and that they are lazy operations:
Transformation | Notes | Example (assuming rdd is { "quick", "brown", "quick", "dog" })
rdd.map(func) |  | rdd.map { _.size } // => { 5, 5, 5, 3 }
rdd.filter(pred) |  | rdd.filter { _.length < 4 } // => { "dog" }
rdd.flatMap(func) |  | rdd.flatMap { _.toCharArray } // => { 'q', 'u', 'i', 'c', 'k', 'b', 'r', 'o' … }
rdd.distinct() | Remove duplicate elements in RDD. | rdd.distinct // => { "dog", "brown", "quick" }
rdd.pipe(command, [envVars]) | Pipe through an external program. RDD elements are written, line-by-line, to the process's stdin. The output is read from stdout. | rdd.pipe("tr a-z A-Z") // => { "QUICK", "BROWN", "QUICK", "DOG" }

The following table describes common actions on RDDs. Recall that actions always
generate a Scala type or cause a side-effect, rather than creating a new RDD. Actions
force the evaluation of the RDD, triggering the execution of the transformations
underpinning the RDD.
Action | Notes | Example (assuming rdd is { "quick", "brown", "quick", "dog" })
rdd.first | First element in the RDD. | rdd.first // => quick
rdd.collect | Transform the RDD to an array (the array must be able to fit in memory on the master node). | rdd.collect // => Array[String]("quick", "brown", "quick", "dog")
rdd.count | Number of elements in the RDD. | rdd.count // => 4
rdd.countByValue | Map of element to the number of times this element occurs. The map must fit on the master node. | rdd.countByValue // => Map(quick -> 2, brown -> 1, dog -> 1)
rdd.take(n) | Return an array of the first n elements in the RDD. | rdd.take(2) // => Array(quick, brown)
rdd.takeOrdered(n:Int)(implicit ordering: Ordering[T]) | The first n elements of the RDD when sorted according to the element's default ordering, or the ordering passed as second argument. See the Scala docs for Ordering for how to define custom comparison functions (http://www.scala-lang.org/api/current/index.html#scala.math.Ordering). | rdd.takeOrdered(2) // => Array(brown, dog)
 |  | rdd.takeOrdered(2)(Ordering.by { _.size }) // => Array[String] = Array(dog, quick)
rdd.reduce(func) | Reduce the RDD according to the specified function. Uses the first element in the RDD as the base. func should be commutative and associative. | rdd.map { _.size }.reduce { _ + _ } // => 18
rdd.aggregate(zeroValue)(seqOp, combOp) | Reduction for cases where the reduction function returns a value of type different to the RDD's type. In this case, we need to provide a function for reducing within a single partition (seqOp) and a function for combining the value of two partitions (combOp). | rdd.aggregate(0) ( _ + _.size, _ + _ ) // => 18
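
The roles of seqOp and combOp in aggregate are easier to see in a small model. The sketch below uses plain Scala collections, with the dataset split by hand into two partitions; AggregateSketch and its aggregate method are hypothetical names that only mimic the shape of the RDD method, and the partitioning is an assumption made for illustration.

```scala
// Plain-Scala model of aggregate over hand-made partitions. No Spark is involved.
object AggregateSketch {
  val partitions: Vector[Vector[String]] =
    Vector(Vector("quick", "brown"), Vector("quick", "dog"))

  def aggregate[T, U](parts: Seq[Seq[T]])(zeroValue: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    // seqOp folds the elements of one partition into a value of type U.
    val perPartition = parts.map { p => p.foldLeft(zeroValue)(seqOp) }
    // combOp merges the per-partition results.
    perPartition.foldLeft(zeroValue)(combOp)
  }

  // Total number of characters: the elements are Strings, the result is an Int.
  val totalLength: Int = aggregate(partitions)(0)(_ + _.size, _ + _)

  // The same number obtained by mapping to lengths first and then reducing.
  val reduced: Int = partitions.flatten.map(_.size).reduce(_ + _)

  def main(args: Array[String]): Unit = {
    println(totalLength)
    println(reduced)
  }
}
```

Here the per-partition results are 10 and 8, which combOp adds to give 18, the same total as the map-then-reduce version.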

Persisting RDDs
We have learned that RDDs only retain the sequence of operations needed to
construct the elements, rather than the values themselves. This, of course, drastically
reduces memory usage since we do not need to keep intermediate versions of our
RDDs in memory. For instance, let's assume we want to trawl through transaction
logs to identify all the transactions that occurred on a particular account:
val allTransactions = sc.textFile("transaction.log")
val interestingTransactions = allTransactions.filter {
_.contains("Account: 123456")
}

The set of all transactions will be large, while the set of transactions on the account
of interest will be much smaller. Spark's policy of remembering how to construct a
dataset, rather than the dataset itself, means that we never have all the lines of our
input file in memory at any one time.
There are two situations in which we may want to avoid re-computing the elements
of an RDD every time we use it:
• For interactive use: we might have detected fraudulent behavior on account
  "123456", and we want to investigate how this might have arisen. We will
  probably want to perform many different exploratory calculations on this
  RDD, without having to re-read the entire log file every time. It therefore
  makes sense to persist interestingTransactions.
• When an algorithm re-uses an intermediate result, or a dataset. A canonical
  example is logistic regression. In logistic regression, we normally use an
  iterative algorithm to find the 'optimal' coefficients that minimize the loss
  function. At every step in our iterative algorithm, we must calculate the loss
  function and its gradient from the training set. We should avoid re-computing
  the training set (or re-loading it from an input file) if at all possible.

Spark provides a .persist method on RDDs to achieve this. By calling .persist on
an RDD, we tell Spark to keep the dataset in memory next time it is computed.
scala> words.persist
rdd.RDD[String] = MapPartitionsRDD[3] at filter

Spark supports different levels of persistence, which you can tune by passing
arguments to .persist:
scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel
scala> interestingTransactions.persist(
StorageLevel.MEMORY_AND_DISK)
rdd.RDD[String] = MapPartitionsRDD[3] at filter

Spark provides several persistence levels, including:
• MEMORY_ONLY: the default storage level. The RDD is stored in RAM. If the
  RDD is too big to fit in memory, parts of it will not persist, and will need to
  be re-computed on the fly.
• MEMORY_AND_DISK: As much of the RDD is stored in memory as possible. If
  the RDD is too big, it will spill over to disk. This is only worthwhile if the
  RDD is expensive to compute. Otherwise, re-computing it may be faster than
  reading from the disk.

If you persist several RDDs and run out of memory, Spark will evict the least
recently used ones from memory (either discarding them or saving them to disk,
depending on the chosen persistence level). RDDs also expose an unpersist
method to explicitly tell Spark that an RDD is not needed any more.

Persisting RDDs can have a drastic impact on performance. What and how to
persist therefore becomes very important when tuning a Spark application. Finding
the best persistence level generally requires some tinkering, benchmarking and
experimentation. The Spark documentation provides guidelines on when to use
which persistence level
(http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence),
as well as general tips on tuning memory usage
(http://spark.apache.org/docs/latest/tuning.html).
Importantly, the persist method does not force the evaluation of the RDD. It just
notifies the Spark engine that, next time the values in this RDD are computed, they
should be saved rather than discarded.
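
The cost that persistence avoids can be illustrated with plain Scala collections. The sketch below is an analogy only (PersistSketch and its fields are hypothetical names): an unmaterialized view re-runs its mapping function on every traversal, while a materialized copy is computed once and reused. One difference to keep in mind is that toVector computes the values immediately, whereas persist waits for the next action.

```scala
// Plain-Scala analogy for recomputation versus caching. No Spark is involved.
object PersistSketch {
  // Counts how many times the mapping function runs.
  var evaluations = 0

  val words = List("quick", "brown", "quick", "dog")

  // Not persisted: every traversal re-runs the mapping function.
  val lengths = words.view.map { w => evaluations += 1; w.length }
  val firstSum = lengths.sum
  val secondSum = lengths.sum
  val evaluationsWithoutCache = evaluations

  // "Persisted": materialize the values once, then reuse them.
  val cached: Vector[Int] = lengths.toVector
  val thirdSum = cached.sum
  val fourthSum = cached.sum
  val evaluationsWithCache = evaluations - evaluationsWithoutCache

  def main(args: Array[String]): Unit = {
    println(s"two traversals of the view: $evaluationsWithoutCache evaluations")
    println(s"materialize once, traverse twice: $evaluationsWithCache evaluations")
  }
}
```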

Key-value RDDs
So far, we have only considered RDDs of Scala value types. RDDs of more complex
data types support additional operations. Spark adds many operations for key-value
RDDs: RDDs whose type parameter is a tuple (K, V), for any type K and V.
Let's go back to our sample email:
scala> val email = sc.textFile("ham/9-463msg1.txt")
email: rdd.RDD[String] = MapPartitionsRDD[1] at textFile
scala> val words = email.flatMap { line => line.split("\\s") }
words: rdd.RDD[String] = MapPartitionsRDD[2] at flatMap

Let's persist the words RDD in memory to avoid having to re-read the email file
from disk repeatedly:
scala> words.persist

To access key-value operations, we just need to apply a transformation to our RDD
that creates key-value pairs. Let's use the words as keys. For now, we will just use 1
for every value:
scala> val wordsKeyValue = words.map { _ -> 1 }
wordsKeyValue: rdd.RDD[(String, Int)] = MapPartitionsRDD[32] at map
scala> wordsKeyValue.first
(String, Int) = (Subject:,1)

Key-value RDDs support several operations besides the core RDD operations.
These are added through an implicit conversion, using the "pimp my library"
pattern that we explored in Chapter 5, Scala and SQL through JDBC. These additional
transformations fall into two broad categories: by-key transformations and joins
between RDDs.
By-key transformations are operations that aggregate the values corresponding to
the same key. For instance, we can count the number of times each word appears in
our email using reduceByKey. This method takes all the values that belong to the
same key and combines them using a user-supplied function:
scala> val wordCounts = wordsKeyValue.reduceByKey { _ + _ }
wordCounts: rdd.RDD[(String, Int)] = ShuffledRDD[35] at reduceByKey
scala> wordCounts.take(5).foreach { println }
(university,6)
(under,1)
(call,3)
(paper,2)
(chasm,2)

Note that reduceByKey requires (in general) shuffling the RDD, since not every
occurrence of a given key will be in the same partition:
scala> wordCounts.toDebugString
(2) ShuffledRDD[36] at reduceByKey at <console>:30 []
 +-(2) MapPartitionsRDD[32] at map at <console>:28 []
    |  MapPartitionsRDD[7] at flatMap at <console>:23 []
    |      CachedPartitions: 2; MemorySize: 50.3 KB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
    |  MapPartitionsRDD[3] at textFile at <console>:21 []
    |      CachedPartitions: 2; MemorySize: 5.1 KB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
    |  ham/9-463msg1.txt HadoopRDD[2] at textFile at <console>:21 []

Note that key-value RDDs are not like Scala Maps: the same key can occur multiple
times, and they do not support O(1) lookup. A key-value RDD can be transformed to
a Scala map using the .collectAsMap action:
scala> wordCounts.collectAsMap
scala.collection.Map[String,Int] = Map(follow -> 2, famous -> 1...

This requires pulling the entire RDD onto the main Spark node. You therefore need
to have enough memory on the main node to house the map. This is often the last
stage in a pipeline that filters a large RDD to just the information that we need.
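
To pin down what reduceByKey computes, here is a sketch of the same semantics on an ordinary Scala list of pairs. It runs on a single machine and involves no shuffle; ReduceByKeySketch and its reduceByKey helper are hypothetical names and not the Spark method.

```scala
// Plain-Scala model of reduceByKey semantics. No Spark is involved.
object ReduceByKeySketch {
  val wordsKeyValue: List[(String, Int)] =
    List("quick", "brown", "quick", "dog").map { _ -> 1 }

  def reduceByKey[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    pairs
      // Gather all the pairs that share a key.
      .groupBy { case (key, _) => key }
      // Combine the values of each group with the user-supplied function.
      .map { case (key, group) =>
        key -> group.map { case (_, value) => value }.reduce(func)
      }

  val wordCounts: Map[String, Int] = reduceByKey(wordsKeyValue)(_ + _)

  def main(args: Array[String]): Unit =
    wordCounts.foreach(println)
}
```

The result is an ordinary Map, so each key appears exactly once, which is the property that collectAsMap relies on.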
There are many by-key operations, which we describe in the table below. For the
examples in the table, we assume that rdd is created as follows:
scala> val words = sc.parallelize(List("quick", "brown","quick", "dog"))
words: RDD[String] = ParallelCollectionRDD[25] at parallelize at <console>:21
scala> val rdd = words.map { word => (word -> word.size) }
rdd: RDD[(String, Int)] = MapPartitionsRDD[26] at map at <console>:23
scala> rdd.collect
Array[(String, Int)] = Array((quick,5), (brown,5), (quick,5), (dog,3))

Transformation | Notes | Example (assumes rdd is { quick -> 5, brown -> 5, quick -> 5, dog -> 3 })
rdd.mapValues | Apply an operation to the values. | rdd.mapValues { _ * 2 } // => { quick -> 10, brown -> 10, quick -> 10, dog -> 6 }
rdd.groupByKey | Return a key-value RDD in which values corresponding to the same key are grouped into iterables. | rdd.groupByKey // => { quick -> Iterable(5, 5), brown -> Iterable(5), dog -> Iterable(3) }
rdd.reduceByKey(func) | Return a key-value RDD in which values corresponding to the same key are combined using a user-supplied function. | rdd.reduceByKey { _ + _ } // => { quick -> 10, brown -> 5, dog -> 3 }
rdd.keys | Return an RDD of the keys. | rdd.keys // => { quick, brown, quick, dog }
rdd.values | Return an RDD of the values. | rdd.values // => { 5, 5, 5, 3 }

The second category of operations on key-value RDDs involves joining different
RDDs together by key. This is somewhat similar to SQL joins, where the keys are the
column being joined on. Let's load a spam email and apply the same transformations
we applied to our ham email:
scala> val spamEmail = sc.textFile("spam/spmsgb17.txt")
spamEmail: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[52] at
textFile at :24
scala> val spamWords = spamEmail.flatMap { _.split("\\s") }
spamWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[53] at
flatMap at :26
scala> val spamWordCounts = spamWords.map {
_ -> 1 }.reduceByKey { _ + _ }
spamWordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[55]
at reduceByKey at :30
scala> spamWordCounts.take(5).foreach { println }
(banner,3)
(package,14)
(call,1)
(country,2)
(offer,1)

Both spamWordCounts and wordCounts are key-value RDDs for which the keys
correspond to unique words in the message, and the values are the number of times
that word occurs. There will be some overlap in keys between spamWordCounts and
wordCounts, since the emails will share many of the same words. Let's do an inner
join between those two RDDs to get the words that occur in both emails:
scala> val commonWordCounts = wordCounts.join(spamWordCounts)
commonWordCounts: rdd.RDD[(String, (Int, Int))] = MapPartitionsRDD[58] at join at
:41
scala> commonWordCounts.take(5).foreach { println }
(call,(3,1))
(include,(6,2))
(minute,(2,1))
(form,(1,7))
((,(36,5))

The values in the RDD resulting from an inner join will be pairs. The first element
in the pair is the value for that key in the first RDD, and the second element is the
value for that key in the second RDD. Thus, the word call occurs three times in the
legitimate email and once in the spam email.
Spark supports all four join types. For instance, let's perform a left join:
scala> val leftWordCounts = wordCounts.leftOuterJoin(spamWordCounts)
leftWordCounts: rdd.RDD[(String, (Int, Option[Int]))] =
MapPartitionsRDD[64] at leftOuterJoin at :40
scala> leftWordCounts.take(5).foreach { println }
(call,(3,Some(1)))
(paper,(2,None))
(chasm,(2,None))
(antonio,(1,None))
(event,(3,None))

Notice that the second element in our pair has type Option[Int], to accommodate
keys absent in spamWordCounts. The word paper, for instance, occurs twice in the
legitimate email and never in the spam email. In this case, it is more useful to have
zeros to indicate absence, rather than None. Replacing None with a default value is
simple with getOrElse:
scala> val defaultWordCounts = leftWordCounts.mapValues {
case(leftValue, rightValue) => (leftValue, rightValue.getOrElse(0))
}
defaultWordCounts: org.apache.spark.rdd.RDD[(String, (Int, Int))] =
MapPartitionsRDD[...] at mapValues
scala> defaultWordCounts.take(5).foreach { println }
(call,(3,1))
(paper,(2,0))
(chasm,(2,0))
(antonio,(1,0))
(event,(3,0))

The table below lists the most common joins on key-value RDDs:
Transformation | Result (assuming rdd1 is { quick -> 1, brown -> 2, quick -> 3, dog -> 4 } and rdd2 is { quick -> 78, brown -> 79, fox -> 80 })
rdd1.join(rdd2) | { quick -> (1, 78), quick -> (3, 78), brown -> (2, 79) }
rdd1.leftOuterJoin(rdd2) | { dog -> (4, None), quick -> (1, Some(78)), quick -> (3, Some(78)), brown -> (2, Some(79)) }
rdd1.rightOuterJoin(rdd2) | { quick -> (Some(1), 78), quick -> (Some(3), 78), brown -> (Some(2), 79), fox -> (None, 80) }
rdd1.fullOuterJoin(rdd2) | { dog -> (Some(4), None), quick -> (Some(1), Some(78)), quick -> (Some(3), Some(78)), brown -> (Some(2), Some(79)), fox -> (None, Some(80)) }
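The join semantics in the table can be reproduced on plain Scala sequences. The following is a minimal single-machine sketch, not Spark's implementation: the object JoinSketch, its two helper functions and the names pairs1 and pairs2 are assumptions introduced for illustration, with the data taken from the table.

```scala
object JoinSketch {
  // Inner join: one output pair for every matching (left, right) combination.
  def join[A, B](left: Seq[(String, A)], right: Seq[(String, B)]): Seq[(String, (A, B))] =
    left.flatMap { case (key, a) =>
      right.filter(_._1 == key).map(p => (key, (a, p._2)))
    }

  // Left outer join: every left pair is kept; the right value is an Option
  // that is None when the key has no match on the right.
  def leftOuterJoin[A, B](left: Seq[(String, A)], right: Seq[(String, B)]): Seq[(String, (A, Option[B]))] =
    left.flatMap { case (key, a) =>
      val matches: Seq[Option[B]] = right.filter(_._1 == key).map(p => Some(p._2))
      val padded: Seq[Option[B]] = if (matches.isEmpty) Seq(None) else matches
      padded.map(b => (key, (a, b)))
    }

  val pairs1 = Seq("quick" -> 1, "brown" -> 2, "quick" -> 3, "dog" -> 4)
  val pairs2 = Seq("quick" -> 78, "brown" -> 79, "fox" -> 80)
}
```

Note how a repeated key on the left (quick) yields one output pair per occurrence, which is why join results, like key-value RDDs in general, are not maps.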

For a complete list of transformations, consult the API documentation for
PairRDDFunctions:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions.

Double RDDs
In the previous section, we saw that Spark adds functionality to key-value RDDs
through an implicit conversion. Similarly, Spark adds statistics functionality to RDDs
of doubles. Let's extract the word frequencies for the ham message, and convert the
values from integers to doubles:
scala> val counts = wordCounts.values.map { _.toDouble }
counts: rdd.RDD[Double] = MapPartitionsRDD[9] at map

We can then get summary statistics using the .stats action:
scala> counts.stats
org.apache.spark.util.StatCounter = (count: 397, mean: 2.365239, stdev:
5.740843, max: 72.000000, min: 1.000000)

Thus, the most common word appears 72 times. We can also use the .histogram
action to get an idea of the distribution of values:
scala> counts.histogram(5)
(Array(1.0, 15.2, 29.4, 43.6, 57.8, 72.0),Array(391, 1, 3, 1, 1))

The .histogram method returns a pair of arrays. The first array holds the bounds
of the histogram bins, and the second holds the count of elements in each bin. Thus,
391 words appear fewer than 15.2 times. The distribution of word counts is very
skewed, so a histogram with equal-width bins is not really appropriate. We
can, instead, pass an array of custom bin edges to the histogram
method. For instance, we might distribute the bins logarithmically:
scala> counts.histogram(Array(1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0,
128.0))
res13: Array[Long] = Array(264, 94, 22, 11, 1, 4, 1)
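To make the bucketing rule concrete, here is a small local sketch of counting values against custom bin edges. It assumes the edges are sorted in increasing order and that there are at least two of them; the object and function names are hypothetical, and the code illustrates the convention (every bucket is half-open except the last, which includes its upper edge) rather than Spark's actual implementation.

```scala
object HistogramSketch {
  // Count how many values fall in each bucket defined by consecutive edges.
  // Buckets are half-open [lo, hi), except the last one, which also includes
  // its upper edge. Values outside the outermost edges are ignored.
  def histogram(edges: Array[Double], values: Seq[Double]): Array[Long] = {
    val counts = Array.fill(edges.length - 1)(0L)
    val last = edges.length - 1
    for (v <- values if v >= edges.head && v <= edges(last)) {
      // Position of the first edge strictly greater than v (-1 if none).
      val idx = edges.indexWhere(_ > v)
      val bucket = if (idx == -1) last - 1 else idx - 1
      counts(bucket) += 1
    }
    counts
  }
}
```

Because word counts are integers, the first bucket [1, 2) in the output above holds exactly the words that occur once: 264 of the 397 distinct words.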

Building and running standalone programs
So far, we have interacted exclusively with Spark through the Spark shell. In the
section that follows, we will build a standalone application and launch a Spark
program either locally or on an EC2 cluster.

Running Spark applications locally
The first step is to write the build.sbt file, as you would for a
standard Scala program. The Spark binary that we downloaded needs to be run against
Scala 2.10. (You need to compile Spark from source to run against Scala 2.11. This is
not difficult to do: just follow the instructions on
http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211.)
// build.sbt file
name := "spam_mi"
scalaVersion := "2.10.5"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.4.1"
)

We then run sbt package to compile and build a jar of our program. The jar will be
built in target/scala-2.10/, and called spam_mi_2.10-0.1-SNAPSHOT.jar. You
can try this with the example code provided for this chapter.

We can then run the jar locally using the spark-submit shell script, available in the
bin/ folder in the Spark installation directory:
$ spark-submit target/scala-2.10/spam_mi_2.10-0.1-SNAPSHOT.jar
... runs the program

The resources allocated to Spark can be controlled by passing arguments to
spark-submit. Use spark-submit --help to see the full list of arguments.
If the Spark program has dependencies (for instance, on other Maven packages),
it is easiest to bundle them into the application jar using the SBT assembly plugin.
Let's imagine that our application depends on breeze-viz. The build.sbt file now
looks like:
// build.sbt
name := "spam_mi"
scalaVersion := "2.10.5"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
"org.scalanlp" %% "breeze" % "0.11.2",
"org.scalanlp" %% "breeze-viz" % "0.11.2",
"org.scalanlp" %% "breeze-natives" % "0.11.2"
)

SBT assembly is an SBT plugin that builds fat jars: jars that contain not only the
program itself, but all the dependencies for the program.
Note that we marked Spark as "provided" in the list of dependencies, which
means that Spark itself will not be included in the jar (it is provided by the Spark
environment anyway). To include the SBT assembly plugin, create a file called
assembly.sbt in the project/ directory, with the following line:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.0")

You will need to restart SBT for the changes to take effect. You can then create the
assembly jar using the assembly command in SBT. This will create a jar called
spam_mi-assembly-0.1-SNAPSHOT.jar in the target/scala-2.10 directory. You can run
this jar using spark-submit.

Reducing logging output and Spark configuration
Spark is, by default, very verbose. The default log-level is set to INFO. To avoid
missing important messages, it is useful to change the log settings to WARN. To
change the default log level system-wide, go into the conf directory in the directory
in which you installed Spark. You should find a file called
log4j.properties.template. Rename this file to log4j.properties and look for the
following line:
log4j.rootCategory=INFO, console

Change this line to:
log4j.rootCategory=WARN, console

There are several other configuration files in that directory that you can use to
alter Spark's default behavior. For a full list of configuration options, head over
to http://spark.apache.org/docs/latest/configuration.html.

Running Spark applications on EC2
Running Spark locally is useful for testing, but the whole point of using a distributed
framework is to run programs harnessing the power of several different computers.
We can set Spark up on any set of computers that can communicate with each other
using HTTP. In general, we also need to set up a distributed file system like HDFS,
so that we can share input files across the cluster. For the purpose of this example,
we will set Spark up on an Amazon EC2 cluster.
Spark comes with a shell script, ec2/spark-ec2, for setting up an EC2 cluster and
installing Spark. It will also install HDFS. You will need an account with Amazon
Web Services (AWS) to follow these examples (https://aws.amazon.com). You will
need the AWS access key and secret key, which you can access through the Account
/ Security Credentials / Access Credentials menu in the AWS web console. You
need to make these available to the spark-ec2 script through environment variables.
Inject them into your current session as follows:
$ export AWS_ACCESS_KEY_ID=ABCDEF...
$ export AWS_SECRET_ACCESS_KEY=2dEf...

You can also write these lines into the configuration script for your shell (your
.bashrc file, or equivalent), to avoid having to re-enter them every time you run
the spark-ec2 script. We discussed environment variables in Chapter 6, Slick – A
Functional Interface for SQL.

You will also need to create a key pair by clicking on Key Pairs in the EC2 web
console, creating a new key pair and downloading the certificate file. I will assume
you named the key pair test_ec2 and the certificate file test_ec2.pem. Make
sure that the key pair is created in the N. Virginia region (by choosing the correct
region in the upper right corner of the EC2 Management console), to avoid having
to specify the region explicitly in the rest of this chapter. You will need to set access
permissions on the certificate file to user-readable only:
$ chmod 400 test_ec2.pem

We are now ready to launch the cluster. Navigate to the ec2 directory and run:
$ ./spark-ec2 -k test_ec2 -i ~/path/to/certificate/test_ec2.pem -s 2
launch test_cluster

This will create a cluster called test_cluster with a master and two slaves. The
number of slaves is set through the -s command line argument. The cluster will take
a while to start up, but you can verify that the instances are launching correctly by
looking at the Instances window in the EC2 Management Console.
The setup script supports many options for customizing the type of instances, the
number of hard drives and so on. You can explore these options by passing the
--help command line option to spark-ec2.
The life cycle of the cluster can be controlled by passing different commands to the
spark-ec2 script, such as:

# shut down 'test_cluster'
$ ./spark-ec2 stop test_cluster
# start 'test_cluster'
$ ./spark-ec2 -i test_ec2.pem start test_cluster
# destroy 'test_cluster'
$ ./spark-ec2 destroy test_cluster

For more detail on using Spark on EC2, consult the official documentation at
http://spark.apache.org/docs/latest/ec2-scripts.html#running-applications.

Spam filtering
Let's put all we've learned to good use and do some data exploration for our spam
filter. We will use the Ling-Spam email dataset:
http://csmining.org/index.php/ling-spam-datasets.html. The dataset contains 2412 ham
emails and 481 spam emails, all of which were received by a mailing list on
linguistics. We will extract the words that are most informative of whether an email
is spam or ham.
The first steps in any natural language processing workflow are removing stop
words and lemmatization. Removing stop words involves filtering out very common
words such as the, this and so on. Lemmatization involves replacing different forms
of the same word with a canonical form: both colors and color would be mapped to
color, and organize, organizing and organizes would be mapped to organize. Removing
stop words and lemmatization are very challenging, and beyond the scope of this
book (if you do need to remove stop words and lemmatize a dataset, your go-to
tool should be the Stanford NLP toolkit:
http://nlp.stanford.edu/software/corenlp.shtml). Fortunately, the Ling-Spam e-mail
dataset has been cleaned and lemmatized already (which is why the text in the emails
looks strange).
When we do build the spam filter, we will use the presence of a particular word in an
email as the feature for our model. We will use a bag-of-words approach: we consider
which words appear in an email, but not the word order.
Intuitively, some words will be more important than others when deciding whether
an email is spam. For instance, an email that contains language is likely to be ham,
since the mailing list was for linguistics discussions, and language is a word unlikely
to be used by spammers. Conversely, words which are common to both message
types, for instance hello, are unlikely to be much use.
One way of quantifying the importance of a word in determining whether a message
is spam is through the Mutual Information (MI). The mutual information is the
gain in information about whether a message is ham or spam if we know that it
contains a particular word. For instance, the presence of language in a particular
email is very informative as to whether that email is spam or ham. Similarly, the
presence of the word dollar is informative since it appears often in spam messages
and only infrequently in ham messages. By contrast, the presence of the word
morning is uninformative, since it is approximately equally common in both spam
and ham messages. The formula for the mutual information between the presence of
a particular word in an email, and whether that email is spam or ham is:

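\[
\mathrm{MI}(\mathit{word}) =
\sum_{\substack{\mathit{wordPresent} \in \{\mathit{true},\, \mathit{false}\} \\ \mathit{class} \in \{\mathit{spam},\, \mathit{ham}\}}}
P(\mathit{wordPresent}, \mathit{class}) \cdot
\log_2 \frac{P(\mathit{wordPresent}, \mathit{class})}{P(\mathit{wordPresent})\, P(\mathit{class})}
\]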

where P ( wordPresent , class ) is the joint probability of an email containing a particular
word and being of that class (either ham or spam), P ( wordPresent ) is the probability
that a particular word is present in an email, and P ( class ) is the probability that any
email is of that class. The MI is commonly used in decision trees.
The derivation of the expression for the mutual information is
beyond the scope of this book. The interested reader is directed to
David MacKay's excellent Information Theory, Inference, and Learning
Algorithms, especially the chapter Dependent Random Variables.

A key component of our MI calculation is evaluating the probability that a word
occurs in spam or ham messages. The best approximation to this probability, given
our data set, is the fraction of messages a word appears in. Thus, for instance,
if language appears in 40% of messages, we will assume that the probability
P ( languagePresent ) of language being present in any message is 0.4. Similarly, if 40%
of the messages are ham, and language appears in 50% of those, we will assume that
the probability of language being present in an email, and that email being ham is
P ( languagePresent , ham ) = 0.5 × 0.4 = 0.2 .
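To see how these estimates feed into the mutual information formula, the following sketch evaluates the four-term sum for a single word in plain Scala. The object and parameter names are assumptions made for this illustration, the inputs in the usage note are invented, and a term with zero joint probability is treated as contributing zero (the usual convention that 0 log 0 = 0).

```scala
object MutualInformationSketch {
  private def log2(x: Double): Double = math.log(x) / math.log(2)

  // One term of the sum: P(X, Y) * log2(P(X, Y) / (P(X) * P(Y))).
  private def term(pXY: Double, pX: Double, pY: Double): Double =
    if (pXY == 0.0) 0.0 else pXY * log2(pXY / (pX * pY))

  // Mutual information between "word present" and the class (spam or ham),
  // given P(present | spam), P(present | ham) and P(spam).
  def mutualInformation(
      pPresentGivenSpam: Double,
      pPresentGivenHam: Double,
      pSpam: Double): Double = {
    val pHam = 1.0 - pSpam
    val pPresentAndSpam = pPresentGivenSpam * pSpam
    val pPresentAndHam = pPresentGivenHam * pHam
    val pAbsentAndSpam = (1.0 - pPresentGivenSpam) * pSpam
    val pAbsentAndHam = (1.0 - pPresentGivenHam) * pHam
    val pPresent = pPresentAndSpam + pPresentAndHam
    val pAbsent = 1.0 - pPresent
    term(pPresentAndSpam, pPresent, pSpam) +
      term(pPresentAndHam, pPresent, pHam) +
      term(pAbsentAndSpam, pAbsent, pSpam) +
      term(pAbsentAndHam, pAbsent, pHam)
  }
}
```

A word that is equally likely in both classes, for example mutualInformation(0.3, 0.3, 0.4), gives a value of essentially zero, whereas a word that appears in every spam message and in no ham message, with half the messages being spam, gives one full bit.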
Let's write a wordFractionInFiles function to calculate the fraction of messages in
which each word appears, for all the words in a given corpus. Our function will take,
as argument, a path with a shell wildcard identifying a set of files, such as ham/*,
and it will return a pair: a key-value RDD, where the keys are words and the values
are the fraction of those files in which the word occurs, together with the total
number of files. We will put the function in an object called MutualInformation.
We first give the entire code listing for this function. Don't worry if this doesn't all
make sense straight-away: we explain the tricky parts in more detail just after the
code. You may find it useful to type some of these commands in the shell, replacing
fileGlob with, for instance "ham/*":
// MutualInformation.scala
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object MutualInformation extends App {

  def wordFractionInFiles(sc:SparkContext)(fileGlob:String)
  :(RDD[(String, Double)], Long) = {

    // A set of punctuation words that need to be filtered out.
    val wordsToOmit = Set[String](
      "", ".", ",", ":", "-", "\"", "'", ")",
      "(", "@", "/", "Subject:"
    )

    val messages = sc.wholeTextFiles(fileGlob)
    // wholeTextFiles generates a key-value RDD of
    // file name -> file content

    val nMessages = messages.count()

    // Split the content of each message into a Set of unique
    // words in that message, and generate a new RDD mapping:
    // message -> word
    val message2Word = messages.flatMapValues {
      mailBody => mailBody.split("\\s").toSet
    }

    val message2FilteredWords = message2Word.filter {
      case(email, word) => ! wordsToOmit(word)
    }

    val word2Message = message2FilteredWords.map { _.swap }

    // word -> number of messages it appears in.
    val word2NumberMessages = word2Message.mapValues {
      _ => 1
    }.reduceByKey { _ + _ }

    // word -> fraction of messages it appears in
    val pPresent = word2NumberMessages.mapValues {
      _ / nMessages.toDouble
    }

    (pPresent, nMessages)
  }
}
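If you do not have a Spark shell at hand, the same pipeline can be traced on a tiny in-memory corpus. The sketch below is a local analogue under stated assumptions: the object WordFractionSketch, the function wordFraction and the file names in the usage note are hypothetical, and a Map from file name to content stands in for the output of wholeTextFiles.

```scala
object WordFractionSketch {
  // Local analogue of wordFractionInFiles: `messages` maps a file name
  // to the file content, mimicking the pairs produced by wholeTextFiles.
  def wordFraction(
      messages: Map[String, String],
      wordsToOmit: Set[String]): Map[String, Double] = {
    val nMessages = messages.size
    // One (word, 1) entry per message the word appears in: converting to a
    // Set drops repeated occurrences within the same message.
    val wordInMessage = messages.values.toSeq.flatMap { body =>
      val uniqueWords: Set[String] = body.split("\\s").toSet
      uniqueWords.filterNot(wordsToOmit).toSeq.map(word => word -> 1)
    }
    // word -> fraction of messages it appears in
    wordInMessage.groupBy(_._1).map { case (word, ones) =>
      word -> ones.size / nMessages.toDouble
    }
  }
}
```

For instance, with the two made-up messages "free money free" and "language money ." and the omitted words "" and ".", the word money gets a fraction of 1.0, while free and language each get 0.5: free is counted once even though it occurs twice in its message.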

Let's play with this function in the Spark shell. To be able to access this function from
the shell, we need to create a jar with the MutualInformation object. Write a
build.sbt file similar to the one presented in the previous section and package the code
into a jar using sbt package. Then, open a Spark shell with:
$ spark-shell --jars=target/scala-2.10/spam_mi_2.10-0.1-SNAPSHOT.jar

This will open a Spark shell with our newly created jar on the classpath. Let's run our
wordFractionInFiles method on the ham emails:
scala> import MutualInformation._
import MutualInformation._
scala> val (fractions, nMessages) = wordFractionInFiles(sc)("ham/*")
fractions: org.apache.spark.rdd.RDD[(String, Double)] =
MapPartitionsRDD[13] at mapValues
nMessages: Long = 2412

Let's get a snapshot of the fractions RDD:
scala> fractions.take(5)
Array[(String, Double)] = Array((rule-base,0.002902155887230514), (re
union,4.1459369817578774E-4), (embarrasingly,4.1459369817578774E-4),
(mller,8.291873963515755E-4), (sapore,4.1459369817578774E-4))

It would be nice to see the words that come up most often in ham messages. We
can use the .takeOrdered action to take the top values of an RDD, with a custom
ordering. .takeOrdered expects, as its second argument, an instance of the type
class Ordering[T], where T is the type parameter of our RDD: (String, Double)
in this case. Ordering[T] is a trait with a single compare(a:T, b:T) method
describing how to compare a and b. The easiest way of creating an Ordering[T] is
through the companion object's by method, which defines a key by which to compare
the elements of our RDD.
We want to order the elements in our key-value RDD by the value and, since we want
the most common words, rather than the least, we need to reverse that ordering:
scala> fractions.takeOrdered(5)(Ordering.by { - _._2 })
res0: Array[(String, Double)] = Array((language,0.6737147595356551),
(university,0.6048922056384743), (linguistic,0.5149253731343284),
(information,0.45480928689883915), ('s,0.4369817578772803))
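The same Ordering instances work on ordinary Scala collections, which makes them easy to experiment with outside Spark. The sketch below uses invented (word, count) pairs with integer counts and shows two ways of building a descending ordering: negating the key, as above, and calling reverse on an ascending ordering. All names in it are hypothetical.

```scala
object OrderingSketch {
  // Hypothetical (word, number of messages) pairs.
  val counts = Seq("language" -> 1625, "sapore" -> 1, "university" -> 1459)

  // Descending by value, obtained by negating the sort key.
  val byNegatedValue: Ordering[(String, Int)] =
    Ordering.by[(String, Int), Int](pair => -pair._2)

  // Descending by value, obtained by reversing the ascending ordering.
  val byReversedValue: Ordering[(String, Int)] =
    Ordering.by[(String, Int), Int](pair => pair._2).reverse

  val top2 = counts.sorted(byNegatedValue).take(2)
}
```

Negation is compact but only applies to numeric keys; reverse also works when the key is, say, a String.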

Unsurprisingly, language is present in 67% of ham emails, university in 60% of
ham emails and so on. A similar investigation on spam messages reveals that the
exclamation mark character ! is present in 83% of spam emails, our is present in 61%
and free in 57%.
We are now in a position to start writing the body of our application to calculate the
mutual information between each word and whether a message is spam or ham.
We will put the body of the code in the MutualInformation object, which already
contains the wordFractionInFiles method.

The first step is to create a Spark context:
// MutualInformation.scala
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
object MutualInformation extends App {
def wordFractionInFiles(sc:SparkContext)(fileGlob:String)
:(RDD[(String, Double)], Long) = {
...
}
val conf = new SparkConf().setAppName("lingSpam")
val sc = new SparkContext(conf)

Note that we did not need to do this when we were using the Spark shell because the
shell comes with a pre-built context bound to the variable sc.
We can now calculate the conditional probabilities of a message containing a
particular word given that it is spam, P ( wordPresent | spam ) . This is just the fraction
of messages containing that word in the spam corpus. This, in turn, lets us infer
the joint probability of a message containing a certain word and being spam
P ( wordPresent , spam ) = P ( wordPresent | spam ) × P ( spam ) . We will do this for all four
combinations of classes: whether any given word is present or absent in a message,
and whether that message is spam or ham:
/* Conditional probabilities RDD:
word -> P(present | spam)
*/
val (pPresentGivenSpam, nSpam) = wordFractionInFiles(sc)("spam/*")
val pAbsentGivenSpam = pPresentGivenSpam.mapValues { 1.0 - _ }
val (pPresentGivenHam, nHam) = wordFractionInFiles(sc)("ham/*")
val pAbsentGivenHam = pPresentGivenHam.mapValues { 1.0 - _ }
// pSpam is the fraction of spam messages
val nMessages = nSpam + nHam
val pSpam = nSpam / nMessages.toDouble
// pHam is the fraction of ham messages

val pHam = 1.0 - pSpam
/* pPresentAndSpam is a key-value RDD of joint probabilities
word -> P(word present, spam)
*/
val pPresentAndSpam = pPresentGivenSpam.mapValues {
_ * pSpam
}
val pPresentAndHam = pPresentGivenHam.mapValues { _ * pHam }
val pAbsentAndSpam = pAbsentGivenSpam.mapValues { _ * pSpam }
val pAbsentAndHam = pAbsentGivenHam.mapValues { _ * pHam }

We will re-use these RDDs in several places in the calculation, so let's tell Spark to
keep them in memory to avoid having to re-calculate them:
pPresentAndSpam.persist
pPresentAndHam.persist
pAbsentAndSpam.persist
pAbsentAndHam.persist

We now need to calculate the probabilities of words being present, P(wordPresent).
This is just the sum of pPresentAndSpam and pPresentAndHam, for each word. The
tricky part is that not all words are present in both the ham and spam messages.
We must therefore do a full outer join of those RDDs. This will give an RDD
mapping each word to a pair of Option[Double] values. For words absent in
either the ham or spam messages, we must use a default value. A sensible default is
P(wordPresent, spam) = (0.5 / nSpam) × P(spam) for spam messages, and the equivalent
expression for ham messages (a more rigorous approach would be to use additive
smoothing). This implies that the word would appear once if the corpus were twice
as large.
val pJoined = pPresentAndSpam.fullOuterJoin(pPresentAndHam)
val pJoinedDefault = pJoined.mapValues {
case (presentAndSpam, presentAndHam) =>
(presentAndSpam.getOrElse(0.5/nSpam * pSpam),
presentAndHam.getOrElse(0.5/nHam * pHam))
}

Note that we could also have chosen 0 as the default value. This complicates the
information gain calculation somewhat, since we cannot just take the log of a zero
value, and it seems unlikely that a particular word has exactly zero probability of
occurring in an email.
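For comparison, here is what the additive smoothing alternative mentioned earlier might look like for the fraction of messages containing a word. This is a sketch of a different technique from the one used in our program, written under the assumption that the event is binary (word present or absent); the object name, function names and the parameter alpha are introduced purely for illustration.

```scala
object SmoothingSketch {
  // Additive (Laplace) smoothing for a binary event: pretend we saw `alpha`
  // extra messages containing the word and `alpha` extra messages without it.
  def smoothedFraction(nWithWord: Long, nMessages: Long, alpha: Double): Double =
    (nWithWord + alpha) / (nMessages + 2 * alpha)

  // The simpler default used in the text for words never seen in a corpus.
  def textDefault(nMessages: Long): Double = 0.5 / nMessages
}
```

With alpha = 0.5, an unseen word in a corpus of n messages gets 0.5 / (n + 1), very close to the 0.5 / n default used here. The difference is that smoothing adjusts every word's estimate, not just the unseen ones, so no estimate is ever exactly 0 or 1.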

We can now construct an RDD mapping words to P ( wordPresent ) , the probability
that a word exists in either a spam or a ham message:
val pPresent = pJoinedDefault.mapValues {
case(presentAndSpam, presentAndHam) =>
presentAndSpam + presentAndHam
}
pPresent.persist
val pAbsent = pPresent.mapValues { 1.0 - _ }
pAbsent.persist

We now have all the RDDs that we need to calculate the mutual information
between the presence of a word in a message and whether it is ham or spam.
We need to bring them all together using the equation for the mutual information
outlined earlier.
We will start by defining a helper method that, given an RDD of joint probabilities
P(X, Y) and marginal probabilities P(X) and P(Y), calculates
\[
P(X, Y) \cdot \log \frac{P(X, Y)}{P(X)\, P(Y)}.
\]
Here, P(X) could, for instance, be the probability of a word being present in a
message, P(wordPresent), and P(Y) would be the probability that that message is
spam, P(spam):
def miTerm(
pXYs:RDD[(String, Double)],
pXs:RDD[(String, Double)],
pY: Double,
default: Double // for words absent in PXY
):RDD[(String, Double)] =
pXs.leftOuterJoin(pXYs).mapValues {
case (pX, Some(pXY)) => pXY * math.log(pXY/(pX*pY))
case (pX, None) => default * math.log(default/(pX*pY))
}

We can use our function to calculate the four terms in the mutual information sum:
val miTerms = List(
miTerm(pPresentAndSpam, pPresent, pSpam, 0.5/nSpam * pSpam),
miTerm(pPresentAndHam, pPresent, pHam, 0.5/nHam * pHam),
miTerm(pAbsentAndSpam, pAbsent, pSpam, 0.5/nSpam * pSpam),
miTerm(pAbsentAndHam, pAbsent, pHam, 0.5/nHam * pHam)
)

Finally, we just need to sum those four terms together:
val mutualInformation = miTerms.reduce {
(term1, term2) => term1.join(term2).mapValues {
case (l, r) => l + r
}
}

The RDD mutualInformation is a key-value RDD mapping each word to a measure
of how informative the presence of that word is in discerning whether a message is
spam or ham. Let's print out the twenty words that are most informative of whether
a message is ham or spam:
mutualInformation.takeOrdered(20)(Ordering.by { - _._2 })
.foreach { println }

Let's run this using spark-submit:
$ sbt package
$ spark-submit target/scala-2.10/spam_mi_2.10-0.1-SNAPSHOT.jar
(!,0.1479941771292119)
(language,0.14574624861510874)
(remove,0.11380645864246142)
(free,0.1073496947123657)
(university,0.10695975885487692)
(money,0.07531772498093084)
(click,0.06887598051593441)
(our,0.058950906866052394)
(today,0.05485248095680509)
(sell,0.05385519653184113)
(english,0.053509319455430575)
(business,0.05299311289740539)
(market,0.05248394151802276)
(product,0.05096229706182162)
(million,0.050233193237964546)
(linguistics,0.04990172586630499)
(internet,0.04974101556655623)
(company,0.04941817269989519)
(%,0.04890193809823071)
(save,0.04861393414892205)

Thus, we find that the presence of words like language, free or ! carries the
most information, because they are almost exclusively present in either just spam
messages or just ham messages. A very simple classification algorithm could just
take the top 10 (by mutual information) spam words and the top 10 ham words, and
see whether a message contains more spam words or ham words. We will explore
machine learning algorithms for classification in more depth in Chapter 12, Distributed
Machine Learning with MLlib.

Lifting the hood
In the last section of this chapter, we will discuss, very briefly, how Spark works
internally. For a more detailed discussion, see the References section at the end of
the chapter.
When you open a Spark context, either explicitly or by launching the Spark shell,
Spark starts a web UI with details of how the current task and past tasks have
executed. Let's see this in action for the example mutual information program we
wrote in the last section. To prevent the context from shutting down when the
program completes, you can insert a call to readLine as the last line of the main
method (after the call to takeOrdered). This expects input from the user, and will
therefore pause program execution until you press enter.
To access the UI, point your browser to 127.0.0.1:4040. If you have other instances
of the Spark shell running, the port may be 4041, or 4042 and so on.

The first page of the UI tells us that our application contains three jobs. A job occurs
as the result of an action. There are, indeed, three actions in our application: the first
two are called within the wordFractionInFiles function:
val nMessages = messages.count()

The last job results from the call to takeOrdered, which forces the execution of the
entire pipeline of RDD transformations that calculate the mutual information.
The web UI lets us delve deeper into each job. Click on the takeOrdered job in the
job table. You will get taken to a page that describes the job in more detail:

Of particular interest is the DAG visualization entry. This is a graph of the execution
plan to fulfill the action, and provides a glimpse of the inner workings of Spark.

When you define a job by calling an action on an RDD, Spark looks at the RDD's
lineage and constructs a graph mapping the dependencies: each RDD in the lineage
is represented by a node, with directed edges going from this RDD's parent to itself.
This type of graph is called a directed acyclic graph (DAG), and is a data structure
useful for dependency resolution. Let's explore the DAG for the takeOrdered job in
our program using the web UI. The graph is quite complex, and it is therefore easy
to get lost, so here is a simplified reproduction that only lists the RDDs bound to
variable names in the program.

As you can see, at the bottom of the graph, we have the mutualInformation RDD.
This is the RDD that we need to construct for our action. This RDD depends on the
intermediate elements in the sum, igFragment1, igFragment2, and so on. We can
work our way back through the list of dependencies until we reach the other end of
the graph: RDDs that do not depend on other RDDs, only on external sources.
Once the graph is built, the Spark engine formulates a plan to execute the job. The
plan starts with the RDDs that only have external dependencies (such as RDDs built
by loading files from disk or fetching from a database) or RDDs that already have
cached data. Each arrow along the graph is translated to a set of tasks, with each task
applying a transformation to a partition of the data.
Tasks are grouped into stages. A stage consists of a set of tasks that can all be
performed without needing an intermediate shuffle.
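The idea of resolving dependencies by walking a DAG can be illustrated with a toy model. The sketch below is not how Spark's scheduler is implemented: it records a deliberately simplified, hypothetical subset of our program's lineage as a map from each RDD name to its parents (intermediate RDDs are omitted) and produces an order in which every parent comes before the RDDs that depend on it.

```scala
object LineageSketch {
  // Toy lineage: each RDD name maps to the names of its parents.
  // Simplified for illustration; intermediate RDDs are left out.
  val parents: Map[String, List[String]] = Map(
    "pPresentGivenSpam" -> Nil,
    "pPresentGivenHam" -> Nil,
    "pPresentAndSpam" -> List("pPresentGivenSpam"),
    "pPresentAndHam" -> List("pPresentGivenHam"),
    "pPresent" -> List("pPresentAndSpam", "pPresentAndHam")
  )

  // Depth-first walk from the target: every parent is emitted before
  // the RDD that depends on it, and each RDD is emitted only once.
  def executionOrder(target: String): List[String] = {
    def visit(done: List[String], name: String): List[String] =
      if (done.contains(name)) done
      else parents(name).foldLeft(done)(visit) :+ name
    visit(Nil, target)
  }
}
```

Calling executionOrder("pPresent") lists the two root RDDs and their children before pPresent itself, which mirrors the way the plan starts from RDDs with only external dependencies.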

Data shuffling and partitions
To understand data shuffling in Spark, we first need to understand how data is
partitioned in RDDs. When we create an RDD by, for instance, loading a file from
HDFS, or reading a file in local storage, Spark has no control over what bits of data
are distributed in which partitions. This becomes a problem for key-value RDDs:
these often require knowing where occurrences of a particular key are, for instance to
perform a join. If the key can occur anywhere in the RDD, we have to look through
every partition to find the key.
To prevent this, Spark allows the definition of a partitioner on key-value RDDs.
A partitioner is an attribute of the RDD that determines which partition a particular
key lands in. When an RDD has a partitioner set, the location of a key is entirely
determined by the partitioner, and not by the RDD's history, or the number of keys.
Two different RDDs with the same partitioner will map the same key to the same
partition.


Partitions impact performance through their effect on transformations. There are two
types of transformations on key-value RDDs:
• Narrow transformations, like mapValues. In narrow transformations,
  the data to compute a partition in the child RDD resides on a single
  partition in the parent. The data processing for a narrow transformation
  can therefore be performed entirely locally, without needing to communicate
  data between nodes.

• Wide transformations, like reduceByKey. In wide transformations, the data
  to compute any single partition can reside on all the partitions in the parent.
  The RDD resulting from a wide transformation will, in general, have a
  partitioner set. For instance, the output of a reduceByKey transformation is
  hash-partitioned by default: the partition that a particular key ends up in is
  determined by hash(key) % numPartitions.
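The hash(key) % numPartitions rule can be sketched in a few lines of plain Scala. This is an illustration of the idea rather than Spark's own code: the object and method names are hypothetical, and the only subtlety it handles is that hashCode may be negative, so the remainder has to be shifted into the range of valid partition indices.

```scala
object HashPartitionSketch {
  // Map a key to a partition index in [0, numPartitions).
  // hashCode can be negative, so negative remainders are shifted up.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val rawMod = key.hashCode % numPartitions
    if (rawMod < 0) rawMod + numPartitions else rawMod
  }

  // Split key-value pairs into partitions using the function above.
  def partition[V](pairs: Seq[(String, V)], numPartitions: Int): Map[Int, Seq[(String, V)]] =
    pairs.groupBy { case (key, _) => partitionFor(key, numPartitions) }
}
```

Because the partition index depends only on the key and the number of partitions, two datasets partitioned this way will always place a given word, such as "language", at the same index.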

Thus, in our mutual information example, the RDDs pPresentAndSpam and
pPresentAndHam will have the same partition structure since they both have the
default hash partitioner. All descendent RDDs retain the same keys, all the way
down to mutualInformation. The word language, for instance, will be in the
same partition for each RDD.
Why does all this matter? If an RDD has a partitioner set, this partitioner is retained
through all subsequent narrow transformations originating from this RDD. Let's
go back to our mutual information example. The RDDs pPresentGivenHam and
pPresentGivenSpam both originate from reduceByKey operations, and they
both have string keys. They will therefore both have the same hash-partitioner
(unless we explicitly set a different partitioner). This partitioner is retained as we
construct pPresentAndSpam and pPresentAndHam. When we construct pPresent,
we perform a full outer join of pPresentAndSpam and pPresentAndHam. Since
both these RDDs have the same partitioner, the child RDD pPresent has narrow
dependencies: we can just join the first partition of pPresentAndSpam with the first
partition of pPresentAndHam, the second partition of pPresentAndSpam with the
second partition of pPresentAndHam and so on, since any string key will be hashed
to the same partition in both RDDs. By contrast, without a partitioner, we would
have to join the data in each partition of pPresentAndSpam with every partition of
pPresentAndHam. This would require sending data across the network to all the
nodes holding pPresentAndHam, a time-consuming exercise.
This process of sending the data needed to construct a child RDD across the network,
as a result of wide dependencies, is called shuffling. Much of the art of optimizing
a Spark program involves avoiding shuffles and, when a shuffle is unavoidable,
reducing the amount of data that is shuffled.
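The following plain Scala sketch (again, no Spark) illustrates why two datasets that share a partitioner can be joined partition by partition. The probabilities are made-up values, the names are hypothetical, and the join shown is an inner join for simplicity, whereas the text performs a full outer join.

```scala
object CoPartitionedJoinSketch {
  val numPartitions = 3

  // The same hash-based rule is applied to both datasets.
  def partitionFor(key: String): Int = {
    val rawMod = key.hashCode % numPartitions
    if (rawMod < 0) rawMod + numPartitions else rawMod
  }

  // Lay out a keyed dataset as a vector of partitions.
  def partition(data: Map[String, Double]): Vector[Map[String, Double]] =
    Vector.tabulate(numPartitions) { i =>
      data.filter { case (key, _) => partitionFor(key) == i }
    }

  // Hypothetical probabilities, for illustration only.
  val spam = Map("language" -> 0.2, "free" -> 0.6, "meeting" -> 0.1)
  val ham = Map("language" -> 0.5, "free" -> 0.1, "meeting" -> 0.4)

  // Narrow join: partition i of one side only ever meets partition i of the other.
  val joinedByPartition: Map[String, (Double, Double)] =
    partition(spam).zip(partition(ham)).flatMap { case (left, right) =>
      left.collect { case (key, value) if right.contains(key) => (key, (value, right(key))) }
    }.toMap

  // Reference result: look every key up in the whole of the other dataset.
  val joinedGlobally: Map[String, (Double, Double)] =
    spam.collect { case (key, value) if ham.contains(key) => (key, (value, ham(key))) }
}
```

Both results contain the same pairs, but the first never compares data from different partition indices, which is what allows the corresponding Spark join to avoid a shuffle.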


Summary
In this chapter, we explored the basics of Spark and learned how to construct
and manipulate RDDs. In the next chapter, we will learn about Spark SQL and
DataFrames, a set of implicit conversions that allow us to manipulate RDDs in a
manner similar to pandas DataFrames, and how to interact with different data
sources using Spark.

Reference
• Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei
  Zaharia, O'Reilly, provides a much more complete introduction to Spark than
  this chapter can provide. I thoroughly recommend it.

• If you are interested in learning more about information theory, I recommend
  David MacKay's book Information Theory, Inference, and Learning Algorithms.

• Information Retrieval, by Manning, Raghavan, and Schütze, describes how to
  analyze textual data (including lemmatization and stemming). An online

• On the Ling-Spam dataset, and how to analyze it:
  http://www.aueb.gr/users/ion/docs/ir_memory_based_antispam_filtering.pdf.

• This blog post delves into the Spark Web UI in more detail:
  https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html.

• This blog post, by Sandy Ryza, is the first in a two-part series discussing Spark
  internals, and how to leverage them to improve performance:
  http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/.


Spark SQL and DataFrames
In the previous chapter, we learned how to build a simple distributed application
using Spark. The data that we used took the form of a set of e-mails stored as text files.
We learned that Spark was built around the concept of resilient distributed datasets
(RDDs). We explored several types of RDDs: simple RDDs of strings, key-value
RDDs, and RDDs of doubles. In the case of key-value RDDs and RDDs of doubles,
Spark added functionality beyond that of the simple RDDs through implicit
conversions. There is one important type of RDD that we have not explored yet:
DataFrames (previously called SchemaRDD). DataFrames allow the manipulation
of objects significantly more complex than those we have explored to date.
A DataFrame is a distributed tabular data structure, and is therefore very useful
for representing and manipulating structured data. In this chapter, we will first
investigate DataFrames through the Spark shell, and then use the Ling-Spam e-mail
dataset, presented in the previous chapter, to see how DataFrames can be integrated
in a machine learning pipeline.

DataFrames – a whirlwind introduction
Let's start by opening a Spark shell:
$ spark-shell

Let's imagine that we are interested in running analytics on a set of patients to
estimate their overall health level. We have measured, for each patient, their height,
weight, age, and whether they smoke.


We might represent the readings for each patient as a case class (you might wish to
write some of this in a text editor and paste it into the Scala shell using :paste):
scala> case class PatientReadings(
  val patientId: Int,
  val heightCm: Int,
  val weightKg: Int,
  val age: Int,
  val isSmoker: Boolean
)
defined class PatientReadings

We would, typically, have many thousands of patients, possibly stored in a database
or a CSV file. We will worry about how to interact with external sources later in this
chapter. For now, let's just hard-code a few readings directly in the shell:
scala> val readings = List(
  PatientReadings(1, 175, 72, 43, false),
  PatientReadings(2, 182, 78, 28, true),
  PatientReadings(3, 164, 61, 41, false),
  PatientReadings(4, 161, 62, 43, true)
)
readings: List[PatientReadings] = List(...

We can convert readings to an RDD by using sc.parallelize:
scala> val readingsRDD = sc.parallelize(readings)
readingsRDD: RDD[PatientReadings] = ParallelCollectionRDD[0] at
parallelize at <console>:25

Note that the type parameter of our RDD is PatientReadings. Let's convert the
RDD to a DataFrame using the .toDF method:
scala> val readingsDF = readingsRDD.toDF
readingsDF: sql.DataFrame = [patientId: int, heightCm: int, weightKg:
int, age: int, isSmoker: boolean]


Chapter 11

We have created a DataFrame where each row corresponds to the readings for a
specific patient, and the columns correspond to the different features:
scala> readingsDF.show
+---------+--------+--------+---+--------+
|patientId|heightCm|weightKg|age|isSmoker|
+---------+--------+--------+---+--------+
|        1|     175|      72| 43|   false|
|        2|     182|      78| 28|    true|
|        3|     164|      61| 41|   false|
|        4|     161|      62| 43|    true|
+---------+--------+--------+---+--------+

The easiest way to create a DataFrame is to use the toDF method on an RDD. We
can convert any RDD[T], where T is a case class or a tuple, to a DataFrame. Spark
will map each attribute of the case class to a column of the appropriate type in the
DataFrame. It uses reflection to discover the names and types of the attributes.
There are several other ways of constructing DataFrames, both from RDDs and from
external sources, which we will explore later in this chapter.
DataFrames support many operations for manipulating the rows and columns.
For instance, let's add a column for the Body Mass Index (BMI). The BMI is a
common way of aggregating height and weight to decide if someone is overweight or
underweight. The formula for the BMI is:

\[ \mathrm{BMI} = \frac{\text{weight (kg)}}{\left(\text{height (m)}\right)^{2}} \]
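As a quick sanity check of the formula, here is a small plain Scala sketch that computes the BMI for the first patient in our readings (175 cm, 72 kg). The object and function names are hypothetical and exist only for this illustration.

```scala
object BmiCheck {
  // BMI = weight (kg) / height (m)^2, with the height supplied in centimeters.
  def bmi(weightKg: Double, heightCm: Double): Double = {
    val heightM = heightCm / 100.0
    weightKg / (heightM * heightM)
  }

  // Patient 1 from the readings above: 175 cm, 72 kg.
  // 72 / (1.75 * 1.75) = 72 / 3.0625, which is roughly 23.5.
  val patient1Bmi: Double = bmi(72, 175)
}
```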

Let's start by creating a column of the height in meters:
scala> val heightM = readingsDF("heightCm") / 100.0
heightM: sql.Column = (heightCm / 100.0)

heightM has data type Column, representing a column of data in a DataFrame.

Columns support many arithmetic and comparison operators that apply
element-wise across the column (similarly to Breeze vectors encountered in
Chapter 2, Manipulating Data with Breeze). Operations on columns are lazy: the
heightM column is not actually computed when defined. Let's now define a
BMI column:
scala> val bmi = readingsDF("weightKg") / (heightM*heightM)

bmi: sql.Column = (weightKg / ((heightCm / 100.0) * (heightCm / 100.0)))


It would be useful to add the bmi column to our readings DataFrame. Since
DataFrames, like RDDs, are immutable, we must define a new DataFrame that is
identical to readingsDF, but with an additional column for the BMI. We can do this
using the withColumn method, which takes, as its arguments, the name of the new
column and a Column instance:
scala> val readingsWithBmiDF = readingsDF.withColumn("BMI", bmi)
readingsWithBmiDF: sql.DataFrame = [patientId: int, heightCm: int, weightKg:
int, age: int, isSmoker: boolean, BMI: double]

All the operations we have seen so far are transformations: they define a pipeline of
operations that create new DataFrames. These transformations are executed when
we call an action, such as show:
scala> readingsWithBmiDF.show
+---------+--------+--------+---+--------+------------------+
|patientId|heightCm|weightKg|age|isSmoker|               BMI|
+---------+--------+--------+---+--------+------------------+
|        1|     175|      72| 43|   false|23.510204081632654|
|        2|     182|      78| 28|    true| 23.54788069073783|
|        3|     164|      61| 41|   false|22.679952409280194|
|        4|     161|      62| 43|    true|  23.9188302920412|
+---------+--------+--------+---+--------+------------------+

Besides creating additional columns, DataFrames also support filtering rows that
satisfy a certain predicate. For instance, we can select all smokers:
scala> readingsWithBmiDF.filter {
  readingsWithBmiDF("isSmoker")
}.show
+---------+--------+--------+---+--------+-----------------+
|patientId|heightCm|weightKg|age|isSmoker|              BMI|
+---------+--------+--------+---+--------+-----------------+
|        2|     182|      78| 28|    true|23.54788069073783|
|        4|     161|      62| 43|    true| 23.9188302920412|
+---------+--------+--------+---+--------+-----------------+


Or, to select everyone who weighs more than 70 kg:
scala> readingsWithBmiDF.filter {
  readingsWithBmiDF("weightKg") > 70
}.show
+---------+--------+--------+---+--------+------------------+
|patientId|heightCm|weightKg|age|isSmoker|               BMI|
+---------+--------+--------+---+--------+------------------+
|        1|     175|      72| 43|   false|23.510204081632654|
|        2|     182|      78| 28|    true| 23.54788069073783|
+---------+--------+--------+---+--------+------------------+

It can become cumbersome to keep repeating the DataFrame name in an expression.
Spark defines the operator $ to refer to a column in the current DataFrame. Thus, the
filter expression above could have been written more succinctly using:
scala> readingsWithBmiDF.filter { $"weightKg" > 70 }.show
+---------+--------+--------+---+--------+------------------+
|patientId|heightCm|weightKg|age|isSmoker|               BMI|
+---------+--------+--------+---+--------+------------------+
|        1|     175|      72| 43|   false|23.510204081632654|
|        2|     182|      78| 28|    true| 23.54788069073783|
+---------+--------+--------+---+--------+------------------+

The .filter method is overloaded. It accepts either a column of Boolean values, as
above, or a string containing a SQL expression that evaluates to a Boolean, the
simplest being the name of a Boolean column in the current DataFrame. Thus, to
filter our readingsWithBmiDF DataFrame to sub-select smokers, we could also have
used the following:
scala> readingsWithBmiDF.filter("isSmoker").show
+---------+--------+--------+---+--------+-----------------+
|patientId|heightCm|weightKg|age|isSmoker|              BMI|
+---------+--------+--------+---+--------+-----------------+
|        2|     182|      78| 28|    true|23.54788069073783|
|        4|     161|      62| 43|    true| 23.9188302920412|
+---------+--------+--------+---+--------+-----------------+


When comparing for equality, you must compare columns with the special
triple-equals operator:
scala> readingsWithBmiDF.filter { $"age" === 28 }.show
+---------+--------+--------+---+--------+-----------------+
|patientId|heightCm|weightKg|age|isSmoker|              BMI|
+---------+--------+--------+---+--------+-----------------+
|        2|     182|      78| 28|    true|23.54788069073783|
+---------+--------+--------+---+--------+-----------------+

Similarly, you must use !== to select rows that are not equal to a value:
scala> readingsWithBmiDF.filter { $"age" !== 28 }.show
+---------+--------+--------+---+--------+------------------+
|patientId|heightCm|weightKg|age|isSmoker|               BMI|
+---------+--------+--------+---+--------+------------------+
|        1|     175|      72| 43|   false|23.510204081632654|
|        3|     164|      61| 41|   false|22.679952409280194|
|        4|     161|      62| 43|    true|  23.9188302920412|
+---------+--------+--------+---+--------+------------------+

Aggregation operations
We have seen how to apply an operation to every row in a DataFrame to create
a new column, and we have seen how to use filters to build new DataFrames
with a sub-set of rows from the original DataFrame. The last set of operations on
DataFrames is grouping operations, equivalent to the GROUP BY statement in SQL.
Let's calculate the average BMI for smokers and non-smokers. We must first tell
Spark to group the DataFrame by a column (the isSmoker column, in this case), and
then apply an aggregation operation (averaging, in this case) to reduce each group:
scala> val smokingDF = readingsWithBmiDF.groupBy(
"isSmoker").agg(avg("BMI"))
smokingDF: org.apache.spark.sql.DataFrame = [isSmoker: boolean, AVG(BMI):
double]


This has created a new DataFrame with two columns: the grouping column and the
column over which we aggregated. Let's show this DataFrame:
scala> smokingDF.show
+--------+------------------+
|isSmoker|          AVG(BMI)|
+--------+------------------+
|    true|23.733355491389517|
|   false|23.095078245456424|
+--------+------------------+
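The two steps involved, grouping by a key and then reducing each group, can be mirrored on an ordinary Scala collection. The sketch below is a local analogy rather than Spark code; the Reading case class and object name are hypothetical, and the BMI values are copied from the readingsWithBmiDF output shown earlier.

```scala
object GroupAverageSketch {
  case class Reading(isSmoker: Boolean, bmi: Double)

  // BMI values copied from the readingsWithBmiDF output earlier in the chapter.
  val readings = List(
    Reading(isSmoker = false, bmi = 23.510204081632654),
    Reading(isSmoker = true, bmi = 23.54788069073783),
    Reading(isSmoker = false, bmi = 22.679952409280194),
    Reading(isSmoker = true, bmi = 23.9188302920412)
  )

  // Step 1: group rows by key. Step 2: reduce each group to a single value.
  val averageBmi: Map[Boolean, Double] =
    readings.groupBy(_.isSmoker).map { case (isSmoker, group) =>
      (isSmoker, group.map(_.bmi).sum / group.size)
    }
}
```

The averages agree with the smokingDF output to within floating-point rounding: roughly 23.73 for smokers and 23.10 for non-smokers.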

Besides averaging, there are several operators for performing the aggregation across
each group. We outline some of the more important ones in the table below, but,
for a full list, consult the Aggregate functions section of
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$:
Operator                       Notes

avg(column)                    Group averages of the values in the specified
                               column.

count(column)                  Number of elements in each group in the specified
                               column.

countDistinct(column, ... )    Number of distinct elements in each group. This can
                               also accept multiple columns to return the count of
                               unique elements across several columns.

first(column), last(column)    First/last element in each group

max(column), min(column)       Largest/smallest element in each group

sum(column)                    Sum of the values in each group

Each aggregation operator takes either the name of a column, as a string, or an
expression of type Column. The latter allows aggregation of compound expressions.
If we wanted the average height, in meters, of the smokers and non-smokers in our
sample, we could use:
scala> readingsDF.groupBy("isSmoker").agg {
  avg($"heightCm"/100.0)
}.show
+--------+-----------------------+
|isSmoker|AVG((heightCm / 100.0))|
+--------+-----------------------+
|    true|                  1.715|
|   false|     1.6949999999999998|
+--------+-----------------------+

We can also use compound expressions to define the column on which to group. For
instance, to count the number of patients in each age group, increasing by decade,
we can use:
scala> readingsDF.groupBy(floor($"age"/10)).agg(count("*")).show
+-----------------+--------+
|FLOOR((age / 10))|count(1)|
+-----------------+--------+
|              4.0|       3|
|              2.0|       1|
+-----------------+--------+

We have used the shorthand "*" to count every row in each group, rather than the
values in one particular column.

Joining DataFrames together
So far, we have only considered operations on a single DataFrame. Spark also offers
SQL-like joins to combine DataFrames. Let's assume that we have another DataFrame
mapping the patient ID to a (systolic) blood pressure measurement. We will assume we
have the data as a list of pairs mapping patient IDs to blood pressures:
scala> val bloodPressures = List((1 -> 110), (3 -> 100), (4 -> 125))
bloodPressures: List[(Int, Int)] = List((1,110), (3,100), (4,125))
scala> val bloodPressureRDD = sc.parallelize(bloodPressures)
bloodPressureRDD: rdd.RDD[(Int, Int)] = ParallelCollectionRDD[74] at
parallelize at <console>:24

We can construct a DataFrame from this RDD of tuples. However, unlike when
constructing DataFrames from RDDs of case classes, Spark cannot infer column
names. We must therefore pass these explicitly to .toDF:
scala> val bloodPressureDF = bloodPressureRDD.toDF(
  "patientId", "bloodPressure")
bloodPressureDF: DataFrame = [patientId: int, bloodPressure: int]
scala> bloodPressureDF.show
+---------+-------------+
|patientId|bloodPressure|
+---------+-------------+
|        1|          110|
|        3|          100|
|        4|          125|
+---------+-------------+

Let's join bloodPressureDF with readingsDF, using the patient ID as the join key:
scala> readingsDF.join(bloodPressureDF,
  readingsDF("patientId") === bloodPressureDF("patientId")
).show
+---------+--------+--------+---+--------+---------+-------------+
|patientId|heightCm|weightKg|age|isSmoker|patientId|bloodPressure|
+---------+--------+--------+---+--------+---------+-------------+
|        1|     175|      72| 43|   false|        1|          110|
|        3|     164|      61| 41|   false|        3|          100|
|        4|     161|      62| 43|    true|        4|          125|
+---------+--------+--------+---+--------+---------+-------------+

This performs an inner join: only patient IDs present in both DataFrames are included
in the result. The type of join can be passed as an extra argument to join. For
instance, we can perform a left join:
scala> readingsDF.join(bloodPressureDF,
  readingsDF("patientId") === bloodPressureDF("patientId"),
  "leftouter"
).show
+---------+--------+--------+---+--------+---------+-------------+
|patientId|heightCm|weightKg|age|isSmoker|patientId|bloodPressure|
+---------+--------+--------+---+--------+---------+-------------+
|        1|     175|      72| 43|   false|        1|          110|
|        2|     182|      78| 28|    true|     null|         null|
|        3|     164|      61| 41|   false|        3|          100|
|        4|     161|      62| 43|    true|        4|          125|
+---------+--------+--------+---+--------+---------+-------------+


Possible join types are inner, outer, leftouter, rightouter, or leftsemi. These
should all be familiar, apart from leftsemi, which corresponds to a left semi join.
This is the same as an inner join, but only the columns on the left-hand side are
retained after the join. It is thus a way to filter a DataFrame for rows which are
present in another DataFrame.
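To see how these join types differ, it can help to model them on plain Scala maps keyed by patient ID. The sketch below is a local analogy with hypothetical names, not Spark code: it uses the ages and blood pressures from the examples above, and represents the null that Spark produces for a missing right-hand value as None.

```scala
object JoinTypesSketch {
  // patientId -> age, and patientId -> blood pressure, as in the examples above.
  val ages = Map(1 -> 43, 2 -> 28, 3 -> 41, 4 -> 43)
  val bloodPressures = Map(1 -> 110, 3 -> 100, 4 -> 125)

  // inner: keep IDs present on both sides, with values from both.
  val inner: Map[Int, (Int, Int)] =
    ages.collect { case (id, age) if bloodPressures.contains(id) =>
      (id, (age, bloodPressures(id)))
    }

  // leftouter: keep every left-hand ID; the right-hand value may be missing.
  val leftOuter: Map[Int, (Int, Option[Int])] =
    ages.map { case (id, age) => (id, (age, bloodPressures.get(id))) }

  // leftsemi: keep left-hand rows that have a match, without right-hand values.
  val leftSemi: Map[Int, Int] =
    ages.filter { case (id, _) => bloodPressures.contains(id) }
}
```

Patient 2, who has no blood pressure reading, is dropped by the inner and left semi joins and kept, with a missing value, by the left outer join.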

Custom functions on DataFrames
So far, we have only used built-in functions to operate on DataFrame columns. While
these are often sufficient, we sometimes need greater flexibility. Spark lets us apply
custom transformations to every row through user-defined functions (UDFs). Let's
assume that we want to use the equation that we derived in Chapter 2, Manipulating
Data with Breeze, for the probability of a person being male, given their height and
weight. We calculated that the decision boundary was given by:

f = −0.75 + 2.48 × rescaledHeight + 2.23 × rescaledWeight
Any person with f > 0 is more likely to be male than female, given their height
and weight and the training set used for Chapter 2, Manipulating Data with Breeze
(which was based on students, so is unlikely to be representative of the population
as a whole). To convert from a height in centimeters to the normalized height,
rescaledHeight, we can use this formula:

\[ \text{rescaledHeight} = \frac{\text{height} - \overline{\text{height}}}{\sigma_{\text{height}}} = \frac{\text{height} - 171}{8.95} \]

Similarly, to convert a weight (in kilograms) to the normalized weight,
rescaledWeight, we can use:

\[ \text{rescaledWeight} = \frac{\text{weight} - \overline{\text{weight}}}{\sigma_{\text{weight}}} = \frac{\text{weight} - 65.7}{13.4} \]

The average and standard deviation of the height and weight are calculated from the
training set. Let's write a Scala function that returns whether a person is more likely
to be male, given their height and weight:
scala> def likelyMale(height:Int, weight:Int):Boolean = {
  val rescaledHeight = (height - 171.0)/8.95
  val rescaledWeight = (weight - 65.7)/13.4
  -0.75 + 2.48*rescaledHeight + 2.23*rescaledWeight > 0
}
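Before wiring this function into Spark, we can check it locally against the four patients in our readings. The sketch below repeats the decision rule so that it stands alone; the wrapping object and the patients list are hypothetical scaffolding for the check.

```scala
object LikelyMaleCheck {
  // Same decision rule as likelyMale in the text, repeated so this block stands alone.
  def likelyMale(height: Int, weight: Int): Boolean = {
    val rescaledHeight = (height - 171.0) / 8.95
    val rescaledWeight = (weight - 65.7) / 13.4
    val f = -0.75 + 2.48 * rescaledHeight + 2.23 * rescaledWeight
    f > 0
  }

  // (heightCm, weightKg) for patients 1 to 4 in the readings above.
  val patients = List((175, 72), (182, 78), (164, 61), (161, 62))

  // One prediction per patient, in the same order.
  val predictions: List[Boolean] = patients.map { case (h, w) => likelyMale(h, w) }
}
```

For patient 1, for example, the rescaled height is about 0.45 and the rescaled weight about 0.47, giving f of roughly 1.4, which is positive.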

To use this function on Spark DataFrames, we need to register it as a user-defined
function (UDF). This transforms our function, which accepts integer arguments, into
one that accepts column arguments:
scala> val likelyMaleUdf = sqlContext.udf.register(
"likelyMaleUdf", likelyMale _)
likelyMaleUdf: org.apache.spark.sql.UserDefinedFunction =
UserDefinedFunction(<function2>,BooleanType,List())

To register a UDF, we must have access to a sqlContext instance. The SQL context
provides the entry point for DataFrame operations. The Spark shell creates a SQL
context at startup, bound to the variable sqlContext, and destroys it when the shell
session is closed.
The first argument passed to the register function is the name of the UDF (we will
use the UDF name later when we write SQL statements on the DataFrame, but you
can ignore it for now). We can then use the UDF just like the built-in transformations
included in Spark:
scala> val likelyMaleColumn = likelyMaleUdf(
readingsDF("heightCm"), readingsDF("weightKg"))
likelyMaleColumn: org.apache.spark.sql.Column = UDF(heightCm,weightKg)
scala> readingsDF.withColumn("likelyMale", likelyMaleColumn).show
+---------+--------+--------+---+--------+----------+
|patientId|heightCm|weightKg|age|isSmoker|likelyMale|
+---------+--------+--------+---+--------+----------+
|        1|     175|      72| 43|   false|      true|
|        2|     182|      78| 28|    true|      true|
|        3|     164|      61| 41|   false|     false|
|        4|     161|      62| 43|    true|     false|
+---------+--------+--------+---+--------+----------+


As you can see, Spark applies the function underlying the UDF to every row in the
DataFrame. We are not limited to using UDFs to create new columns. We can also
use them in filter expressions. For instance, to select rows likely to correspond
to women:
scala> readingsDF.filter(
  ! likelyMaleUdf($"heightCm", $"weightKg")
).show
+---------+--------+--------+---+--------+
|patientId|heightCm|weightKg|age|isSmoker|
+---------+--------+--------+---+--------+
|        3|     164|      61| 41|   false|
|        4|     161|      62| 43|    true|
+---------+--------+--------+---+--------+

Using UDFs lets us define arbitrary Scala functions to transform rows, giving
tremendous additional power for data manipulation.

DataFrame immutability and persistence
DataFrames, like RDDs, are immutable. When you define a transformation on a
DataFrame, this always creates a new DataFrame. The original DataFrame cannot be
modified in place (this is notably different to pandas DataFrames, for instance).
Operations on DataFrames fall into two categories: transformations, which result in
the creation of a new DataFrame, and actions, which usually return a Scala type or
have a side-effect. Methods like filter or withColumn are transformations, while
methods like show or head are actions.
Transformations are lazy, much like transformations on RDDs. When you generate
a new DataFrame by transforming an existing DataFrame, this results in the
elaboration of an execution plan for creating the new DataFrame, but the data
itself is not transformed immediately. You can access the execution plan with the
queryExecution method.
When you call an action on a DataFrame, Spark processes the action as if it were
a regular RDD: it implicitly builds a directed acyclic graph to resolve dependencies,
processing the transformations needed to build the DataFrame on which the action
was called.
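The distinction between a lazy transformation and an action that triggers it can be illustrated with a plain Scala iterator. This is only an analogy, with hypothetical names, and says nothing about how Spark builds its execution plan: mapping over an iterator describes the work, and nothing is computed until the iterator is consumed.

```scala
object LazinessSketch {
  // Counts how many times the mapping function has actually run.
  var evaluations = 0

  // "Transformation": describes the work but does not run it yet.
  val heightsM: Iterator[Double] =
    Iterator(175, 182, 164, 161).map { heightCm =>
      evaluations += 1
      heightCm / 100.0
    }

  // Nothing has been computed at this point.
  val evaluationsBeforeAction: Int = evaluations

  // "Action": consuming the iterator forces the computation.
  val result: List[Double] = heightsM.toList

  // The mapping function has now run once per element.
  val evaluationsAfterAction: Int = evaluations
}
```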


Much like RDDs, we can persist DataFrames in memory or on disk:
scala> readingsDF.persist
readingsDF.type = [patientId: int, heightCm: int,...]

This works in the same way as persisting RDDs: the next time the DataFrame is calculated, it
will be kept in memory (provided there is enough space), rather than discarded. The
level of persistence can also be set:
scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel
scala> readingsDF.persist(StorageLevel.MEMORY_AND_DISK)
readingsDF.type = [patientId: int, heightCm: int, ...]

SQL statements on DataFrames
By now, you will have noticed that many operations on DataFrames are inspired
by SQL operations. Additionally, Spark allows us to register DataFrames as tables
and query them with SQL statements directly. We can therefore build a temporary
database as part of the program flow.
Let's register readingsDF as a temporary table:
scala> readingsDF.registerTempTable("readings")

This registers a temporary table that can be used in SQL queries. Registering a
temporary table relies on the presence of a SQL context. The temporary tables are
destroyed when the SQL context is destroyed (when we close the shell, for instance).
Let's explore what we can do with our temporary tables and the SQL context. We can
first get a list of all the tables currently registered with the context:
scala> sqlContext.tables
DataFrame = [tableName: string, isTemporary: boolean]

This returns a DataFrame. In general, all operations on a SQL context that return data
return DataFrames:
scala> sqlContext.tables.show
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
| readings|       true|
+---------+-----------+

We can query this table by passing SQL statements to the SQL context:
scala> sqlContext.sql("SELECT * FROM readings").show
+---------+--------+--------+---+--------+
|patientId|heightCm|weightKg|age|isSmoker|
+---------+--------+--------+---+--------+
|        1|     175|      72| 43|   false|
|        2|     182|      78| 28|    true|
|        3|     164|      61| 41|   false|
|        4|     161|      62| 43|    true|
+---------+--------+--------+---+--------+

Any UDFs registered with the sqlContext are available through the name given to
them when they were registered. We can therefore use them in SQL queries:
scala> sqlContext.sql("""
  SELECT
    patientId,
    likelyMaleUdf(heightCm, weightKg) AS likelyMale
  FROM readings
""").show
+---------+----------+
|patientId|likelyMale|
+---------+----------+
|        1|      true|
|        2|      true|
|        3|     false|
|        4|     false|
+---------+----------+

You might wonder why one would want to register DataFrames as temporary tables
and run SQL queries on those tables, when the same functionality is available directly
on DataFrames. The main reason is for interacting with external tools. Spark can run a
SQL engine that exposes a JDBC interface, meaning that programs that know how to
interact with a SQL database will be able to make use of the temporary tables.
We don't have the space to cover how to set up a distributed SQL engine in this
book, but you can find details in the Spark documentation
(http://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine).


Complex data types – arrays, maps, and structs
So far, all the elements in our DataFrames were simple types. DataFrames support
three additional compound types: arrays, maps, and structs.

Structs
The first compound type that we will look at is the struct. A struct is similar to a case
class: it stores a set of key-value pairs, with a fixed set of keys. If we convert an RDD
of a case class containing nested case classes to a DataFrame, Spark will convert the
nested objects to a struct.
Let's imagine that we want to serialize Lord of the Rings characters. We might use
the following object model:
case class Weapon(name:String, weaponType:String)
case class LotrCharacter(name:String, val weapon:Weapon)

We want to create a DataFrame of LotrCharacter instances. Let's create some
dummy data:
scala> val characters = List(
  LotrCharacter("Gandalf", Weapon("Glamdring", "sword")),
  LotrCharacter("Frodo", Weapon("Sting", "dagger")),
  LotrCharacter("Aragorn", Weapon("Anduril", "sword"))
)
characters: List[LotrCharacter] = List(LotrCharacter...
scala> val charactersDF = sc.parallelize(characters).toDF
charactersDF: DataFrame = [name: string, weapon: struct<name:string,weaponType:string>]
scala> charactersDF.printSchema
root
 |-- name: string (nullable = true)
 |-- weapon: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- weaponType: string (nullable = true)

scala> charactersDF.show
+-------+-----------------+
|   name|           weapon|
+-------+-----------------+
|Gandalf|[Glamdring,sword]|
|  Frodo|   [Sting,dagger]|
|Aragorn|  [Anduril,sword]|
+-------+-----------------+

The weapon attribute in the case class was converted to a struct column in the
DataFrame. To extract sub-fields from a struct, we can pass the field name to the
column's .apply method:
scala> val weaponTypeColumn = charactersDF("weapon")("weaponType")
weaponTypeColumn: org.apache.spark.sql.Column = weapon[weaponType]

We can use this derived column just as we would any other column. For instance,
let's filter our DataFrame to only contain characters who wield a sword:
scala> charactersDF.filter { weaponTypeColumn === "sword" }.show
+-------+-----------------+
|   name|           weapon|
+-------+-----------------+
|Gandalf|[Glamdring,sword]|
|Aragorn|  [Anduril,sword]|
+-------+-----------------+
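Nested fields can also be addressed by a dotted path when selecting columns by name. The following is a minimal sketch, assuming Spark is on the classpath and that the DataFrame passed in has the same shape as charactersDF; the StructFieldAccess object and weaponNames function are hypothetical helpers introduced for illustration.

```scala
import org.apache.spark.sql.DataFrame

object StructFieldAccess {
  // Select a nested field by its dotted path and collect the values to the driver.
  // Expects a DataFrame with a struct column `weapon` that has a string field `name`.
  def weaponNames(charactersDF: DataFrame): List[String] =
    charactersDF.select("weapon.name").collect().map(_.getString(0)).toList
}
```

In the Spark shell session above, we could call this as StructFieldAccess.weaponNames(charactersDF).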

Arrays
Let's return to the earlier example, and assume that, besides height, weight, and age
measurements, we also have phone numbers for our patients. Each patient might
have zero, one, or more phone numbers. We will define a new case class and new
dummy data:
scala> case class PatientNumbers(
  patientId:Int, phoneNumbers:List[String])
defined class PatientNumbers
scala> val numbers = List(
  PatientNumbers(1, List("07929123456")),
  PatientNumbers(2, List("07929432167", "07929234578")),
  PatientNumbers(3, List.empty),
  PatientNumbers(4, List("07927357862"))
)
scala> val numbersDF = sc.parallelize(numbers).toDF
numbersDF: org.apache.spark.sql.DataFrame = [patientId: int,
phoneNumbers: array<string>]

The List[String] attribute in our case class gets translated to an array<string>
data type:
scala> numbersDF.printSchema
root
 |-- patientId: integer (nullable = false)
 |-- phoneNumbers: array (nullable = true)
 |    |-- element: string (containsNull = true)

As with structs, we can construct a column for a specific index in the array. For
instance, we can select the first element in each array:
scala> val bestNumberColumn = numbersDF("phoneNumbers")(0)
bestNumberColumn: org.apache.spark.sql.Column = phoneNumbers[0]
scala> numbersDF.withColumn("bestNumber", bestNumberColumn).show
+---------+--------------------+-----------+
|patientId|        phoneNumbers| bestNumber|
+---------+--------------------+-----------+
|        1|   List(07929123456)|07929123456|
|        2|List(07929432167,...|07929432167|
|        3|              List()|       null|
|        4|   List(07927357862)|07927357862|
+---------+--------------------+-----------+


Maps
The last compound data type is the map. Maps are similar to structs inasmuch as
they store key-value pairs, but the set of keys is not fixed when the DataFrame is
created. They can thus store arbitrary key-value pairs.
Scala maps will be converted to DataFrame maps when the DataFrame is
constructed. They can then be queried in a manner similar to structs.

Interacting with data sources
A major challenge in data science or engineering is dealing with the wealth of input
and output formats for persisting data. We might receive or send data as CSV files,
JSON files, or through a SQL database, to name a few.
Spark provides a unified API for serializing and de-serializing DataFrames to and
from different data sources.

JSON files
Spark supports loading data from JSON files, provided that each line in the JSON file
corresponds to a single JSON object. Each object will be mapped to a DataFrame row.
JSON arrays are mapped to arrays, and embedded objects are mapped to structs.
This section would be a little dry without some data, so let's generate some from
the GitHub API. Unfortunately, the GitHub API does not return JSON formatted
as a single object per line. The code repository for this chapter contains a script,
FetchData.scala, which will download and format JSON entries for Martin
Odersky's repositories, saving the objects to a file named odersky_repos.json
(go ahead and change the GitHub user in FetchData.scala if you want). You can
also download a pre-constructed data file from
data.scala4datascience.com/odersky_repos.json.
Let's dive into the Spark shell and load this data into a DataFrame. Reading from a
JSON file is as simple as passing the file name to the sqlContext.read.json method:
scala> val df = sqlContext.read.json("odersky_repos.json")
df: DataFrame = [archive_url: string, assignees_url: ...]


Reading from a JSON file loads data as a DataFrame. Spark automatically infers the
schema from the JSON documents. There are many columns in our DataFrame. Let's
sub-select a few to get a more manageable DataFrame:
scala> val reposDF = df.select("name", "language", "fork", "owner")
reposDF: DataFrame = [name: string, language: string, ...]
scala> reposDF.show
+----------------+----------+-----+--------------------+
|            name|  language| fork|               owner|
+----------------+----------+-----+--------------------+
|           dotty|     Scala| true|[https://avatars....|
|        frontend|JavaScript| true|[https://avatars....|
|           scala|     Scala| true|[https://avatars....|
|      scala-dist|     Scala| true|[https://avatars....|
|scala.github.com|JavaScript| true|[https://avatars....|
|          scalax|     Scala|false|[https://avatars....|
|            sips|       CSS|false|[https://avatars....|
+----------------+----------+-----+--------------------+

Let's save the DataFrame back to JSON:
scala> reposDF.write.json("repos_short.json")

If you look at the files present in the directory in which you are running the Spark
shell, you will notice a repos_short.json directory. Inside it, you will see files
named part-000000, part-000001, and so on. When serializing JSON, each
partition of the DataFrame is serialized independently. If you are running this on
several machines, you will find parts of the serialized output on each computer.
You may, optionally, pass a mode argument to control how Spark deals with the case
of an existing repos_short.json file:
scala> import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SaveMode
scala> reposDF.write.mode(
SaveMode.Overwrite).json("repos_short.json")

Available save modes are ErrorIfExists, Append (only available for Parquet files),
Overwrite, and Ignore (do not save if the file exists already).

Parquet files
Apache Parquet is a popular file format well-suited for storing tabular data. It is
often used for serialization in the Hadoop ecosystem, since it allows for efficient
extraction of specific columns and rows without having to read the entire file.
Serialization and deserialization of Parquet files is identical to JSON, with the
substitution of json with parquet:
scala> reposDF.write.parquet("repos_short.parquet")
scala> val newDF = sqlContext.read.parquet("repos_short.parquet")
newDF: DataFrame = [name: string, language: string, fo...]
scala> newDF.show
+----------------+----------+-----+--------------------+
|            name|  language| fork|               owner|
+----------------+----------+-----+--------------------+
|           dotty|     Scala| true|[https://avatars....|
|        frontend|JavaScript| true|[https://avatars....|
|           scala|     Scala| true|[https://avatars....|
|      scala-dist|     Scala| true|[https://avatars....|
|scala.github.com|JavaScript| true|[https://avatars....|
|          scalax|     Scala|false|[https://avatars....|
|            sips|       CSS|false|[https://avatars....|
+----------------+----------+-----+--------------------+

In general, Parquet will be more space-efficient than JSON for storing large
collections of objects. Parquet is also much more efficient at retrieving specific
columns or rows, if the partition can be inferred from the row. Parquet is thus
advantageous over JSON unless you need the output to be human-readable, or
de-serializable by an external program.

Standalone programs
So far, we have been using Spark SQL and DataFrames through the Spark shell, which
creates a sqlContext for us. To use them in standalone programs, you will need to
create the SQL context explicitly, from a Spark context:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("applicationName")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

Additionally, importing the implicits object nested in sqlContext allows the
conversions of RDDs to DataFrames:
import sqlContext.implicits._

We will use DataFrames extensively in the next chapter to manipulate data to get it
ready for use with MLlib.

Summary
In this chapter, we explored Spark SQL and DataFrames. DataFrames add a
rich layer of abstraction on top of Spark's core engine, greatly facilitating the
manipulation of tabular data. Additionally, the source API allows the serialization
and de-serialization of DataFrames from a rich variety of data files.
In the next chapter, we will build on our knowledge of Spark and DataFrames to
build a spam filter using MLlib.

References
DataFrames are a relatively recent addition to Spark. There is thus still a dearth of
literature and documentation. The first port of call should be the Scaladocs, available
at: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame.
The Scaladocs for operations available on the DataFrame Column type can be found
at: http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.Column.
There is also extensive documentation on the Parquet file format:
https://parquet.apache.org.


Distributed Machine Learning with MLlib
Machine learning describes the construction of algorithms that make predictions
from data. It is a core component of most data science pipelines, and is often seen
to be the component adding the most value: the accuracy of the machine learning
algorithm determines the success of the data science endeavor. It is also, arguably,
the section of the data science pipeline that requires the most knowledge from fields
beyond software engineering: a machine learning expert will be familiar, not just
with algorithms, but also with statistics and with the business domain.
Choosing and tuning a machine learning algorithm to solve a particular problem
involves significant exploratory analysis to try and determine which features are
relevant, how features are correlated, whether there are outliers in the dataset, and so
on. Designing suitable machine learning pipelines is difficult. Add on an additional
layer of complexity resulting from the size of datasets and the need for scalability,
and you have a real challenge.
MLlib helps mitigate this difficulty. MLlib is a component of Spark that provides
machine learning algorithms on top of the core Spark libraries. It offers a set of
learning algorithms that parallelize well over distributed datasets.
MLlib has evolved into two separate layers. MLlib itself contains the core algorithms,
and ml, also called the pipeline API, defines an API for gluing algorithms together and
provides a higher level of abstraction. The two libraries differ in the data types on
which they operate: the original MLlib predates the introduction of DataFrames, and
acts mainly on RDDs of feature vectors. The pipeline API operates on DataFrames.
In this chapter, we will study the newer pipeline API, diving into MLlib only when
the functionality is missing from the pipeline API.


This chapter does not try to teach the machine learning fundamentals behind the
algorithms that we present. We assume that the reader has a good enough grasp
of machine learning tools and techniques to understand, at least superficially,
what the algorithms presented here do, and we defer to better authors for in-depth
explanations of the mechanics of statistical learning (we present several references at
the end of the chapter).
MLlib is a rich library that is evolving rapidly. This chapter does not aim to give
a complete overview of the library. We will work through the construction of
a machine learning pipeline to train a spam filter, learning about the parts of
MLlib that we need along the way. Having read this chapter, you will have an
understanding of how the different parts of the library fit together, and can use the
online documentation, or a more specialized book (see references at the end of this
chapter) to learn about the parts of MLlib not covered here.

Introducing MLlib – Spam classification
Let's introduce MLlib with a concrete example. We will look at spam classification
using the Ling-Spam dataset that we used in Chapter 10, Distributed Batch
Processing with Spark. We will create a spam filter that uses logistic regression to
estimate the probability that a given message is spam.
We will run through examples using the Spark shell, but you will find an analogous
program in LogisticRegressionDemo.scala among the examples for this chapter.
If you have not installed Spark, refer to Chapter 10, Distributed Batch Processing with
Spark, for installation instructions.
Let's start by loading the e-mails in the Ling-Spam dataset. If you have not done this
for Chapter 10, Distributed Batch Processing with Spark, download the data from
data.scala4datascience.com/ling-spam.tar.gz or
data.scala4datascience.com/ling-spam.zip, depending on whether you want a tar.gz
file or a zip file, and unpack the archive. This will create a spam directory and a ham
directory containing spam and ham messages, respectively.
Let's use the wholeTextFiles method to load spam and ham e-mails:
scala> val spamText = sc.wholeTextFiles("spam/*")
spamText: RDD[(String, String)] = spam/...
scala> val hamText = sc.wholeTextFiles("ham/*")
hamText: RDD[(String, String)] = ham/...


The wholeTextFiles method creates a key-value RDD where the keys are the file
names and the values are the contents of the files:
scala> spamText.first
(String, String) =
(file:spam/spmsga1.txt,"Subject: great part-time summer job! ...")
scala> spamText.count
Long = 481

The algorithms in the pipeline API work on DataFrames. We must therefore convert
our key-value RDDs to DataFrames. We define a new case class, LabelledDocument,
which contains a message text and a category label identifying whether a message is
spam or ham:
scala> case class LabelledDocument(
fileName:String,
text:String,
category:String
)
defined class LabelledDocument
scala> val spamDocuments = spamText.map {
case (fileName, text) =>
LabelledDocument(fileName, text, "spam")
}
spamDocuments: RDD[LabelledDocument] = MapPartitionsRDD[2] at map
scala> val hamDocuments = hamText.map {
case (fileName, text) =>
LabelledDocument(fileName, text, "ham")
}
hamDocuments: RDD[LabelledDocument] = MapPartitionsRDD[3] at map


To create models, we will need all the documents in a single DataFrame. Let's
therefore take the union of our two LabelledDocument RDDs, and transform that to
a DataFrame. The union method concatenates RDDs together:
scala> val allDocuments = spamDocuments.union(hamDocuments)
allDocuments: RDD[LabelledDocument] = UnionRDD[4] at union
scala> val documentsDF = allDocuments.toDF
documentsDF: DataFrame = [fileName: string, text: string, category:
string]

Let's do some basic checks to verify that we have loaded all the documents. We start
by persisting the DataFrame in memory to avoid having to re-create it from the raw
text files.
scala> documentsDF.persist
documentsDF.type = [fileName: string, text: string, category: string]
scala> documentsDF.show
+--------------------+--------------------+--------+
|            fileName|                text|category|
+--------------------+--------------------+--------+
|file:/Users/pasca...|Subject: great pa...|    spam|
|file:/Users/pasca...|Subject: auto ins...|    spam|
|file:/Users/pasca...|Subject: want bes...|    spam|
|file:/Users/pasca...|Subject: email 57...|    spam|
|file:/Users/pasca...|Subject: n't miss...|    spam|
|file:/Users/pasca...|Subject: amaze wo...|    spam|
|file:/Users/pasca...|Subject: help loa...|    spam|
|file:/Users/pasca...|Subject: beat irs...|    spam|
|file:/Users/pasca...|Subject: email 57...|    spam|
|file:/Users/pasca...|Subject: best , b...|    spam|
|...                                               |
+--------------------+--------------------+--------+
scala> documentsDF.groupBy("category").agg(count("*")).show
+--------+--------+
|category|COUNT(1)|
+--------+--------+
|    spam|     481|
|     ham|    2412|
+--------+--------+

Let's now split the DataFrame into a training set and a test set. We will use the test
set to validate the model that we build. For now, we will just use a single split,
training the model on 70% of the data and testing it on the remaining 30%. In the
next section, we will look at cross-validation, which provides a more rigorous way to
check the accuracy of our models.
We can achieve this 70-30 split using the DataFrame's .randomSplit method:
scala> val Array(trainDF, testDF) = documentsDF.randomSplit(
Array(0.7, 0.3))
trainDF: DataFrame = [fileName: string, text: string, category: string]
testDF: DataFrame = [fileName: string, text: string, category: string]

The .randomSplit method takes an array of weights and returns an array of
DataFrames, of approximately the size specified by the weights. For instance, we
passed weights 0.7 and 0.3, indicating that any given row has a 70% chance of
ending up in trainDF, and a 30% chance of ending up in testDF. Note that this
means the split DataFrames are not of fixed size: trainDF is approximately, but not
exactly, 70% the size of documentsDF:
scala> trainDF.count / documentsDF.count.toDouble
Double = 0.7013480815762184
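
To see why the split sizes are only approximate, here is a minimal plain-Scala sketch of weight-based splitting. It is not Spark's implementation: the RandomSplitSketch object, its split helper, and the seed are hypothetical names chosen for illustration. Each row is assigned independently, by drawing a uniform random number and comparing it with the weight of the first split.

```scala
import scala.util.Random

// Illustrative only: a plain-Scala model of a two-way random split.
object RandomSplitSketch {
  // Each row lands in the first split with probability `weight`,
  // otherwise in the second. The split sizes are therefore random.
  def split[A](rows: Seq[A], weight: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    rows.partition(_ => rng.nextDouble() < weight)
  }

  def main(args: Array[String]): Unit = {
    val rows = (1 to 2893).toVector // 481 spam + 2412 ham documents
    val (train, test) = split(rows, 0.7, seed = 42L)
    println(f"train fraction: ${train.size / rows.size.toDouble}%.4f")
    println(s"train + test = ${train.size + test.size}")
  }
}
```

Every row ends up in exactly one of the two splits, but the fraction in the first split fluctuates around 0.7 from one seed to the next.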

If you need a fixed size sample, use the DataFrame's .sample method to obtain
trainDF and filter documentsDF for rows not in trainDF.
We are now in a position to start using MLlib. Our attempt at classification will
involve performing logistic regression on term-frequency vectors: we will count how
often each word appears in each message, and use the frequency of occurrence as a
feature. Before jumping into the code, let's take a step back and discuss the structure
of machine learning pipelines.

Pipeline components
Pipelines consist of a set of components joined together such that the DataFrame
produced by one component is used as input for the next component. The
components available are split into two classes: transformers and estimators.


Transformers
Transformers transform one DataFrame into another, normally by appending one or
more columns.
The first step in our spam classification algorithm is to split each message into an
array of words. This is called tokenization. We can use the Tokenizer transformer,
provided by MLlib:
scala> import org.apache.spark.ml.feature._
import org.apache.spark.ml.feature._
scala> val tokenizer = new Tokenizer()
tokenizer: org.apache.spark.ml.feature.Tokenizer = tok_75559f60e8cf

The behavior of transformers can be customized through getters and setters.
The easiest way of obtaining a list of the parameters available is to call the
.explainParams method:
scala> println(tokenizer.explainParams)
inputCol: input column name (undefined)
outputCol: output column name (default: tok_75559f60e8cf__output)

We see that the behavior of a Tokenizer instance can be customized using two
parameters: inputCol and outputCol, describing the header of the column
containing the input (the string to be tokenized) and the output (the array of words),
respectively. We can set these parameters using the setInputCol and setOutputCol
methods.
We set inputCol to "text", since that is what the column is called in our training
and test DataFrames. We will set outputCol to "words":
scala> tokenizer.setInputCol("text").setOutputCol("words")
org.apache.spark.ml.feature.Tokenizer = tok_75559f60e8cf

In due course, we will integrate tokenizer into a pipeline, but, for now, let's just use
it to transform the training DataFrame, to verify that it works correctly.
scala> val tokenizedDF = tokenizer.transform(trainDF)
tokenizedDF: DataFrame = [fileName: string, text: string, category:
string, words: array]
scala> tokenizedDF.show
+--------------+----------------+--------+--------------------+
|      fileName|            text|category|               words|
+--------------+----------------+--------+--------------------+
|file:/Users...|Subject: auto...|    spam|[subject:, auto, ...|
|file:/Users...|Subject: want...|    spam|[subject:, want, ...|
|file:/Users...|Subject: n't ...|    spam|[subject:, n't, m...|
|file:/Users...|Subject: amaz...|    spam|[subject:, amaze,...|
|file:/Users...|Subject: help...|    spam|[subject:, help, ...|
|file:/Users...|Subject: beat...|    spam|[subject:, beat, ...|
|...                                                          |
+--------------+----------------+--------+--------------------+

The tokenizer transformer produces a new DataFrame with an additional column,
words, containing an array of the words in the text column.
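
The effect of tokenization on a single message can be sketched in plain Scala. This is a stand-in rather than the Tokenizer class itself: we assume, consistent with the lowercased "subject:" tokens in the output above, that tokenization amounts to lowercasing the text and splitting it on whitespace. The TokenizerSketch object and its tokenize method are hypothetical names.

```scala
// Illustrative only: a plain-Scala stand-in for whitespace tokenization.
object TokenizerSketch {
  // Lowercase the text, then split on single whitespace characters.
  def tokenize(text: String): List[String] =
    text.toLowerCase.split("\\s").toList

  def main(args: Array[String]): Unit =
    println(tokenize("Subject: great part-time summer job"))
}
```

Note that punctuation is not stripped: "Subject:" becomes the token "subject:", colon included.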
Clearly, we can use our tokenizer to transform any DataFrame with the correct
schema. We could, for instance, use it on the test set. Much of machine learning
involves calling the same (or a very similar) pipeline on different data sets. By
providing the pipeline abstraction, MLlib facilitates reasoning about complex
machine learning algorithms consisting of many cleaning, transformation, and
modeling components.
The next step in our pipeline is to calculate the frequency of occurrence of each
word in each message. We will eventually use these frequencies as features in our
algorithm. We will use the HashingTF transformer to transform from arrays of words
to word frequency vectors for each message.
The HashingTF transformer constructs a sparse vector of word frequencies from
input iterables. Each element in the word array gets transformed to a hash code. This
hash code is truncated to a value between 0 and a large number n, the total number
of elements in the output vector. The term frequency vector is just the number of
occurrences of the truncated hash.


Let's run through an example manually to understand how this works. We will
calculate the term frequency vector for Array("the", "dog", "jumped", "over",
"the"). Let's set n, the number of elements in the sparse output vector, to 16 for this
example. The first step is to calculate the hash code for each element in our array. We
can use the built-in ## method, which calculates a hash code for any object:
scala> val words = Array("the", "dog", "jumped", "over", "the")
words: Array[String] = Array(the, dog, jumped, over, the)
scala> val hashCodes = words.map { _.## }
hashCodes: Array[Int] = Array(114801, 99644, -1148867251, 3423444,
114801)

To transform the hash codes into valid vector indices, we take the modulo of each
hash by the size of the vector (16, in this case):
scala> val indices = hashCodes.map { code => Math.abs(code % 16) }
indices: Array[Int] = Array(1, 12, 3, 4, 1)

We can then create a mapping from indices to the number of times that index
appears:
scala> val indexFrequency = indices.groupBy(identity).mapValues {
_.size.toDouble
}
indexFrequency: Map[Int,Double] = Map(4 -> 1.0, 1 -> 2.0, 3 -> 1.0, 12 ->
1.0)

Finally, we can convert this map to a sparse vector, where the value at each element
in the vector is the frequency with which this particular index occurs:
scala> import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg._
scala> val termFrequencies = Vectors.sparse(16, indexFrequency.toSeq)
termFrequencies: linalg.Vector = (16,[1,3,4,12],[2.0,1.0,1.0,1.0])

Note that the .toString output for a sparse vector consists of three elements: the
total size of the vector, followed by two lists: the first is a series of indices, and the
second is a series of values at those indices.
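
The manual steps above can be gathered into a single function. The following is a minimal sketch in plain Scala, with no Spark dependency; the TermFrequencySketch object and its termFrequencies method are hypothetical names, and the index calculation copies the Math.abs(code % n) approach used in this walkthrough rather than whatever HashingTF does internally.

```scala
// Illustrative only: the hashing trick from the walkthrough, as one function.
object TermFrequencySketch {
  // Map each word to an index in [0, n) and count how often each index occurs.
  // Returns (index, count) pairs sorted by index, mirroring a sparse vector.
  def termFrequencies(words: Seq[String], n: Int): Seq[(Int, Double)] =
    words
      .map(word => math.abs(word.## % n))
      .groupBy(identity)
      .map { case (index, hits) => (index, hits.size.toDouble) }
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit =
    println(termFrequencies(Seq("the", "dog", "jumped", "over", "the"), 16))
}
```

For the example sentence with n = 16, this yields the pairs (1,2.0), (3,1.0), (4,1.0), and (12,1.0), matching the indices and values of the sparse vector constructed above.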


Using a sparse vector provides a compact and efficient way of representing the
frequency of occurrence of words in the message, and is exactly how HashingTF
works under the hood. The disadvantage is that the mapping from words to indices
is not necessarily unique: truncating hash codes by the length of the vector will map
different strings to the same index. This is known as a collision. The solution is to
make n large enough that the frequency of collisions is minimized.
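
A small sketch makes the effect of n on collisions concrete. It is plain Scala with hypothetical names (CollisionSketch, bucket, collisions), and it uses Math.floorMod to obtain a non-negative index, which is a variation on the Math.abs(code % n) calculation shown earlier. The single-character strings "a" and "q" have hash codes 97 and 113, which share a remainder modulo 16 but not modulo 32.

```scala
// Illustrative only: counting hash collisions for a given vector size.
object CollisionSketch {
  // Non-negative bucket index for a word in a vector of size n.
  def bucket(word: String, n: Int): Int = Math.floorMod(word.hashCode, n)

  // Number of distinct words that share a bucket with another distinct word.
  def collisions(words: Seq[String], n: Int): Int = {
    val distinctWords = words.distinct
    distinctWords.size - distinctWords.map(bucket(_, n)).distinct.size
  }

  def main(args: Array[String]): Unit = {
    val words = Seq("a", "q")      // hash codes 97 and 113
    println(collisions(words, 16)) // 1: both words fall in bucket 1
    println(collisions(words, 32)) // 0: buckets 1 and 17
  }
}
```

Doubling the vector size separates the two words, at the cost of a longer (but still sparse) vector.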
HashingTF is similar to building a hash table (for example, a Scala
map) whose keys are words and whose values are the number of times
that word occurs in the message, with one important difference: it
does not attempt to deal with hash collisions. Thus, if two words map
to the same hash, they will have the wrong frequency. There are two
advantages to using this algorithm over just constructing a hash table:
• We do not have to maintain a list of distinct words in memory.
• Each e-mail can be transformed to a vector independently of
all others: we do not have to reduce over different partitions
to get the set of keys in the map. This greatly eases applying
this algorithm to each e-mail in a distributed manner, since we
can apply the HashingTF transformation on each partition
independently.

The main disadvantage is that we must use machine learning algorithms
that can take advantage of the sparse representation efficiently. This is
the case with logistic regression, which we will use here.

As you might expect, the HashingTF transformer takes, as parameters, the input
and output columns. It also takes a parameter defining the number of distinct hash
buckets in the vector. Increasing the number of buckets decreases the number
of collisions. In practice, a value between 2^18 = 262144 and 2^20 = 1048576 is
recommended.
scala> val hashingTF = (new HashingTF()
.setInputCol("words")
.setOutputCol("features")
.setNumFeatures(1048576))
hashingTF: org.apache.spark.ml.feature.HashingTF = hashingTF_3b78eca9595c
scala> val hashedDF = hashingTF.transform(tokenizedDF)
hashedDF: DataFrame = [fileName: string, text: string, category: string,
words: array, features: vector]
scala> hashedDF.select("features").show
+--------------------+
|            features|
+--------------------+
|(1048576,[0,33,36...|
|(1048576,[0,36,40...|
|(1048576,[0,33,34...|
|(1048576,[0,33,36...|
|(1048576,[0,33,34...|
|(1048576,[0,33,34...|
+--------------------+

Each element in the features column is a sparse vector:
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val firstRow = hashedDF.select("features").first
firstRow: org.apache.spark.sql.Row = ...
scala> val Row(v:Vector) = firstRow
v: Vector = (1048576,[0,33,36,37,...],[1.0,3.0,4.0,1.0,...])

We can thus interpret our vector as follows: the word that hashes to element 33
occurs three times, the word that hashes to element 36 occurs four times, and so on.
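
To make this reading of the sparse representation explicit, here is a minimal plain-Scala model of a sparse vector. It is a sketch, not the MLlib Vector type: SparseVectorSketch and SparseVec are hypothetical names, and the example only uses the first four entries visible in the output above, leaving out the elided remainder.

```scala
// Illustrative only: a toy sparse vector with index lookup.
object SparseVectorSketch {
  // Store only non-zero entries, as parallel arrays of indices and values.
  final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) {
    require(indices.length == values.length, "indices and values must align")

    // Value at position i; positions that are not listed are implicitly zero.
    def apply(i: Int): Double = {
      val pos = indices.indexOf(i)
      if (pos >= 0) values(pos) else 0.0
    }
  }

  def main(args: Array[String]): Unit = {
    val v = SparseVec(1048576, Array(0, 33, 36, 37), Array(1.0, 3.0, 4.0, 1.0))
    println(v(33)) // 3.0: the word hashing to element 33 occurs three times
    println(v(2))  // 0.0: no word hashes to element 2
  }
}
```

Only four numbers are stored for a vector of over a million elements, which is what makes such a large number of hash buckets affordable.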

Estimators
We now have the features ready for logistic regression. The last step prior to
running logistic regression is to create the target variable. We will transform the
category column in our DataFrame to a binary 0/1 target column. Spark provides
a StringIndexer class that replaces a set of strings in a column with doubles. A
StringIndexer is not a transformer: it must first be 'fitted' to a set of categories to
calculate the mapping from string to numeric value. This introduces the second class
of components in the pipeline API: estimators.


Unlike a transformer, which works "out of the box", an estimator must be fitted to
a DataFrame. For our string indexer, the fitting process involves obtaining the list
of unique strings ("spam" and "ham") and mapping each of these to a double. The
fitting process outputs a transformer which can be used on subsequent DataFrames.
scala> val indexer = (new StringIndexer()
.setInputCol("category")
.setOutputCol("label"))
indexer: org.apache.spark.ml.feature.StringIndexer = strIdx_16db03fd0546
scala> val indexTransform = indexer.fit(trainDF)
indexTransform: StringIndexerModel = strIdx_16db03fd0546

The transformer produced by the fitting process has a labels attribute describing
the mapping it applies:
scala> indexTransform.labels
Array[String] = Array(ham, spam)

Each label will get mapped to its index in the array: thus, our transformer maps ham
to 0 and spam to 1:
scala> val labelledDF = indexTransform.transform(hashedDF)
labelledDF: org.apache.spark.sql.DataFrame = [fileName: string, text:
string, category: string, words: array, features: vector, label:
double]
scala> labelledDF.select("category", "label").distinct.show
+--------+-----+
|category|label|
+--------+-----+
|     ham|  0.0|
|    spam|  1.0|
+--------+-----+
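
The fit-then-transform pattern of the string indexer can be sketched in plain Scala. The names below (StringIndexerSketch, fit, transform) are hypothetical, and the sketch makes two simplifying assumptions: labels are ordered by descending frequency with ties broken alphabetically, and a label that was not seen during fitting maps to -1.0 instead of raising an error.

```scala
// Illustrative only: a toy estimator that learns a label-to-index mapping.
object StringIndexerSketch {
  // "Fit": order the distinct labels by descending frequency, so that the
  // most frequent label gets index 0 (ties broken alphabetically, by assumption).
  def fit(categories: Seq[String]): Array[String] =
    categories
      .groupBy(identity)
      .toSeq
      .sortBy { case (label, hits) => (-hits.size, label) }
      .map(_._1)
      .toArray

  // "Transform": replace each label by its index in the fitted array.
  // Labels unseen during fitting map to -1.0 in this simplified version.
  def transform(labels: Array[String], categories: Seq[String]): Seq[Double] =
    categories.map(c => labels.indexOf(c).toDouble)

  def main(args: Array[String]): Unit = {
    val labels = fit(Seq("ham", "spam", "ham", "ham", "spam"))
    println(labels.mkString(", "))
    println(transform(labels, Seq("spam", "ham")))
  }
}
```

Since ham messages outnumber spam messages in the training data, ham is listed first and receives index 0, which agrees with the mapping shown above.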


We now have the feature vectors and classification labels in the correct format for
logistic regression. The component for performing logistic regression is an estimator:
it is fitted to a training DataFrame to create a trained model. The model can then be
used to transform test DataFrames.
scala> import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.classification.LogisticRegression
scala> val classifier = new LogisticRegression().setMaxIter(50)
classifier: LogisticRegression = logreg_a5e921e7c1a1
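
As a reminder of what the trained model computes, here is a minimal plain-Scala sketch of logistic regression prediction on sparse features. It is not MLlib code: the LogisticSketch object, its method names, and the weights in the example are made up for illustration. The predicted probability is the sigmoid of a weighted sum of the features, and the prediction is 1.0 when that probability exceeds a threshold.

```scala
// Illustrative only: scoring a sparse feature vector with logistic regression.
object LogisticSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Features are (index, value) pairs; indices without a weight contribute zero.
  def probability(weights: Map[Int, Double], intercept: Double,
                  features: Seq[(Int, Double)]): Double =
    sigmoid(intercept + features.map { case (i, v) => weights.getOrElse(i, 0.0) * v }.sum)

  // Classify as 1.0 (spam) when the probability exceeds the threshold.
  def predict(p: Double, threshold: Double = 0.5): Double =
    if (p > threshold) 1.0 else 0.0

  def main(args: Array[String]): Unit = {
    val weights = Map(33 -> 2.0)               // hypothetical fitted weight
    val p = probability(weights, -6.0, Seq((33, 3.0)))
    println(p)                                 // sigmoid(-6 + 2 * 3) = sigmoid(0) = 0.5
    println(predict(0.6))                      // 1.0 with the default threshold
    println(predict(0.6, threshold = 0.7))     // 0.0 with a stricter threshold
  }
}
```

Fitting the estimator amounts to choosing the weights and intercept; the maxIter parameter caps the number of optimization steps spent doing so.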

The LogisticRegression estimator expects the feature column to be named
"features" and the label column (the target) to be named "label", by default.
There is no need to set these explicitly, since they match the column names set by
hashingTF and indexer. There are several parameters that can be set to control how
logistic regression works:
scala> println(classifier.explainParams)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For
alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1
penalty. (default: 0.0)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
maxIter: maximum number of iterations (>= 0) (default: 100, current: 50)
regParam: regularization parameter (>= 0) (default: 0.0)
threshold: threshold in binary classification prediction, in range [0, 1]
(default: 0.5)
tol: the convergence tolerance for iterative algorithms (default: 1.0E-6)
...

For now, we just set the maxIter parameter. We will look at the effect of
other parameters, such as regularization, later on. Let's now fit the classifier to
labelledDF:
scala> val trainedClassifier = classifier.fit(labelledDF)
trainedClassifier: LogisticRegressionModel = logreg_353d18f6a5f0


This produces a transformer that we can use on a DataFrame with a features
column. The transformer appends a prediction column and a probability
column. We can, for instance, use trainedClassifier to transform labelledDF, the
training set itself:
scala> val labelledDFWithPredictions = trainedClassifier.transform(
labelledDF)
labelledDFWithPredictions: DataFrame = [fileName: string, ...
scala> labelledDFWithPredictions.select($"label", $"prediction").show
+-----+----------+
|label|prediction|
+-----+----------+
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
|  1.0|       1.0|
+-----+----------+

A quick way of checking the performance of our model is to just count the number of
misclassified messages:
scala> labelledDFWithPredictions.filter {
$"label" !== $"prediction"
}.count
Long = 1

In this case, logistic regression managed to correctly classify every message but
one in the training set. This is perhaps unsurprising, given the large number of
features and the relatively clear demarcation between the words used in spam
and legitimate e-mails.


Of course, the real test of a model is not how well it performs on the training set, but
how well it performs on a test set. To test this, we could just push the test DataFrame
through the same stages that we used to train the model, replacing estimators with
the fitted transformer that they produced. MLlib provides the pipeline abstraction to
facilitate this: we wrap an ordered list of transformers and estimators in a pipeline.
This pipeline is then fitted to a DataFrame corresponding to the training set. The
fitting produces a PipelineModel instance, equivalent to the pipeline but with
estimators replaced by transformers, as shown in this diagram:
[Figure: pipeline.fit(trainDF) turns pipeline (Indexer, Tokenizer, HashingTF, Classifier) into fittedPipeline (IndexerModel, Tokenizer, HashingTF, ClassifierModel).]

Let's construct the pipeline for our logistic regression spam filter:
scala> import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.Pipeline
scala> val pipeline = new Pipeline().setStages(
Array(indexer, tokenizer, hashingTF, classifier)
)
pipeline: Pipeline = pipeline_7488113e284d

Once the pipeline is defined, we fit it to the DataFrame holding the training set:
scala> val fittedPipeline = pipeline.fit(trainDF)
fittedPipeline: org.apache.spark.ml.PipelineModel = pipeline_089525c6f100


When fitting a pipeline to a DataFrame, estimators and transformers are treated
differently:
• Transformers are applied to the DataFrame and copied, as is, into the
pipeline model.
• Estimators are fitted to the DataFrame, producing a transformer. The
transformer is then applied to the DataFrame, and appended to the
pipeline model.
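
The two rules above can be sketched as a fold over the pipeline stages. The following plain-Scala model is an illustration, not the Pipeline class: the data is a Vector[Double] rather than a DataFrame, and all names (PipelineSketch, Stage, fitPipeline, transformAll) are hypothetical.

```scala
// Illustrative only: how fitting a pipeline treats transformers and estimators.
object PipelineSketch {
  type Data = Vector[Double]

  sealed trait Stage
  final case class Transformer(transform: Data => Data) extends Stage
  final case class Estimator(fit: Data => Transformer) extends Stage

  // Walk through the stages, threading the transformed training data along.
  // Transformers are copied as is; estimators are fitted first, and the
  // resulting transformer is both applied and recorded in the fitted model.
  def fitPipeline(stages: Seq[Stage], training: Data): Seq[Transformer] = {
    val (fitted, _) = stages.foldLeft((Vector.empty[Transformer], training)) {
      case ((acc, data), t: Transformer) => (acc :+ t, t.transform(data))
      case ((acc, data), e: Estimator) =>
        val t = e.fit(data)
        (acc :+ t, t.transform(data))
    }
    fitted
  }

  // Applying the fitted model just chains the transformers.
  def transformAll(model: Seq[Transformer], data: Data): Data =
    model.foldLeft(data)((d, t) => t.transform(d))

  def main(args: Array[String]): Unit = {
    val double = Transformer(_.map(_ * 2))
    val center = Estimator { d =>
      val mean = d.sum / d.size
      Transformer(_.map(_ - mean))
    }
    val model = fitPipeline(Seq(double, center), Vector(1.0, 2.0, 3.0))
    println(transformAll(model, Vector(10.0))) // Vector(16.0): 10 * 2 - 4
  }
}
```

The centering stage learns its mean (4.0) from the doubled training data, and the fitted model then reuses that mean on new data without refitting, just as the fitted pipeline reuses the trained indexer and classifier on the test set.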

We can now apply the pipeline model to the test set:
scala> val testDFWithPredictions = fittedPipeline.transform(testDF)
testDFWithPredictions: DataFrame = [fileName: string, ...

This has added a prediction column to the DataFrame with the predictions of our
logistic regression model. To measure the performance of our algorithm, we calculate
the classification error on the test set:
scala> testDFWithPredictions.filter {
$"label" !== $"prediction"
}.count
Long = 20

Thus, our naive logistic regression algorithm, with no model selection or
regularization, misclassifies 2.3% of e-mails. You may, of course, get slightly
different results, since the train-test split was random.
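
The 2.3% figure is simply the number of misclassified rows divided by the size of the test set. Assuming the same split as earlier, the training fraction of 0.7013 on 2,893 documents corresponds to 2,029 training rows, leaving 864 test rows, and 20 / 864 is roughly 0.023. A minimal plain-Scala sketch of the calculation follows; ErrorRateSketch and errorRate are hypothetical names.

```scala
// Illustrative only: classification error from (label, prediction) pairs.
object ErrorRateSketch {
  // Fraction of pairs in which the prediction disagrees with the label.
  def errorRate(pairs: Seq[(Double, Double)]): Double =
    if (pairs.isEmpty) 0.0
    else pairs.count { case (label, prediction) => label != prediction } / pairs.size.toDouble

  def main(args: Array[String]): Unit = {
    // 20 misclassified messages out of an assumed 864 test rows.
    val pairs = Seq.fill(20)((1.0, 0.0)) ++ Seq.fill(844)((0.0, 0.0))
    println(f"${errorRate(pairs) * 100}%.1f%%")
  }
}
```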
Let's save the training and test DataFrames, with predictions, as parquet files:
scala> import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SaveMode
scala> (labelledDFWithPredictions
.select("fileName", "label", "prediction", "probability")
.write.mode(SaveMode.Overwrite)
.parquet("transformedTrain.parquet"))
scala> (testDFWithPredictions
.select("fileName", "label", "prediction", "probability")
.write.mode(SaveMode.Overwrite)
.parquet("transformedTest.parquet"))


In spam classification, a false positive is considerably worse than a false
negative: it is much worse to classify a legitimate message as spam,
than it is to let a spam message through. To account for this, we could
increase the threshold for classification: only messages that score, for
instance, 0.7 or above would get classified as spam. This raises the
obvious question of choosing the right threshold. One way to do this
would be to investigate the false positive rate incurred in the test set for
different thresholds, and choose the lowest threshold that gives us an
acceptable false positive rate. A good way of visualizing this is to use
ROC curves, which we will investigate in the next section.
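The threshold search described in this note can be sketched in plain Scala on a local collection of (score, label) pairs. The object and method names, and the convention that a label of 1.0 means spam and 0.0 means ham, are our assumptions for illustration; this is not part of the MLlib API.

```scala
// Sketch: choose the lowest threshold with an acceptable false positive rate.
object ThresholdChoice {
  // (spam score, label), where label 1.0 is spam and 0.0 is ham.
  type ScoreLabel = (Double, Double)

  // Fraction of ham messages scoring at or above the threshold.
  def falsePositiveRate(scoresLabels: Seq[ScoreLabel], threshold: Double): Double = {
    val hamScores = scoresLabels.collect { case (score, label) if label == 0.0 => score }
    if (hamScores.isEmpty) 0.0
    else hamScores.count(_ >= threshold).toDouble / hamScores.size
  }

  // The false positive rate can only fall as the threshold rises, so the
  // first acceptable candidate in ascending order is the lowest one.
  def lowestAcceptable(
      scoresLabels: Seq[ScoreLabel],
      candidates: Seq[Double],
      maxRate: Double): Option[Double] =
    candidates.sorted.find(t => maxRate >= falsePositiveRate(scoresLabels, t))
}
```

The result is an Option because none of the candidate thresholds may be strict enough to reach the requested false positive rate.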

Evaluation
Unfortunately, the functionality for evaluating model quality in the pipeline API
remains limited, as of version 1.5.2. Logistic regression does output a summary
containing several evaluation metrics (available through the summary attribute on the
trained model), but these are calculated on the training set. In general, we want to
evaluate the performance of the model both on the training set and on a separate
test set. We will therefore dive down to the underlying MLlib layer to access
evaluation metrics.
MLlib provides a module, org.apache.spark.mllib.evaluation,
with a set of classes for assessing the quality of a model. We will use the
BinaryClassificationMetrics class here, since spam classification is a binary
classification problem. Other evaluation classes provide metrics for multi-class
models, regression models and ranking models.
As in the previous section, we will illustrate the concepts in the shell, but you will
find analogous code in the ROC.scala script in the code examples for this chapter.
We will use breeze-viz to plot curves, so, when starting the shell, we must ensure that
the relevant libraries are on the classpath. We will use SBT assembly, as described
in Chapter 10, Distributed Batch Processing with Spark (specifically, the Building and
running standalone programs section), to create a JAR with the required dependencies.
We will then pass this JAR to the Spark shell, allowing us to import breeze-viz. Let's
write a build.sbt file that declares a dependency on breeze-viz:
// build.sbt
name := "spam_filter"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.5.2" % "provided",
  "org.scalanlp" %% "breeze" % "0.11.2",
  "org.scalanlp" %% "breeze-viz" % "0.11.2",
  "org.scalanlp" %% "breeze-natives" % "0.11.2"
)

Package the dependencies into a jar with:
$ sbt assembly

This will create a jar called spam_filter-assembly-0.1-SNAPSHOT.jar in the
target/scala-2.10/ directory. To include this jar in the Spark shell, re-start the
shell with the --jars command line argument:
$ spark-shell --jars=target/scala-2.10/spam_filter-assembly-0.1-SNAPSHOT.jar

To verify that the packaging worked correctly, try to import breeze.plot:
scala> import breeze.plot._
import breeze.plot._

Let's load the test set, with predictions, which we created in the previous section and
saved as a parquet file:
scala> val testDFWithPredictions = sqlContext.read.parquet(
"transformedTest.parquet")
testDFWithPredictions: org.apache.spark.sql.DataFrame = [fileName:
string, label: double, prediction: double, probability: vector]

The BinaryClassificationMetrics object expects an RDD[(Double, Double)]
object of pairs of scores (the probability assigned by the classifier that a particular
e-mail is spam) and labels (whether an e-mail is actually spam). We can extract this
RDD from our DataFrame:
scala> import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.Vector
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val scoresLabels = testDFWithPredictions.select(
  "probability", "label").map {
  case Row(probability:Vector, label:Double) =>
    (probability(1), label)
}
org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[3] at map at :23
scala> scoresLabels.take(5).foreach(println)
(0.9999999967713409,1.0)
(0.9999983827108793,1.0)
(0.9982059900606365,1.0)
(0.9999790713978142,1.0)
(0.9999999999999272,1.0)

We can now construct the BinaryClassificationMetrics instance:
scala> import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import mllib.evaluation.BinaryClassificationMetrics
scala> val bm = new BinaryClassificationMetrics(scoresLabels)
bm: BinaryClassificationMetrics = mllib.evaluation.BinaryClassificationMetrics@254ed9ba

The BinaryClassificationMetrics objects contain many useful metrics for
evaluating the performance of a classification model. We will look at the receiver
operating characteristic (ROC) curve.

ROC Curves
Imagine gradually decreasing, from 1.0, the probability threshold at
which we assume a particular e-mail is spam. Clearly, when the threshold
is set to 1.0, no e-mails will get classified as spam. This means that there
will be no false positives (ham messages which we incorrectly classify
as spam), but it also means that there will be no true positives (spam
messages that we correctly identify as spam): all spam e-mails will be
incorrectly identified as ham.
As we gradually lower the probability threshold at which we assume a
particular e-mail is spam, our spam filter will, hopefully, start identifying
a large fraction of e-mails as spam. The vast majority of these will, if our
algorithm is well-designed, be real spam. Thus, our rate of true positives
increases. As we lower the threshold further, we start classifying
messages about which we are less sure as spam. This will increase the
number of messages correctly identified as spam, but it will also increase
the number of false positives.
The ROC curve plots, for each threshold value, the fraction of true
positives against the fraction of false positives. In the best case, the curve
is always 1: this happens when all spam messages are given a score of 1.0,
and all ham messages are given a score of 0.0. By contrast, the worst case
happens when the curve is a diagonal P(true positive) = P(false positive),
which occurs when our algorithm does no better than random. In general,
ROC curves fall somewhere in between, forming a convex shell above the
diagonal. The deeper this shell, the better our algorithm.

(left) ROC curve for a model performing much better than random: the
curve reaches very high true positive rates for a low false positive rate.
(middle) ROC curve for a model performing significantly better than
random.
(right) ROC curve for a model performing only marginally better than
random: the true positive rate is only marginally larger than the rate
of false positives, for any given threshold, meaning that nearly half the
examples are misclassified.
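To make the construction concrete, here is a minimal sketch that traces an ROC curve from a local collection of (score, label) pairs. It is not how MLlib computes the curve; the Roc object is a hypothetical helper, a message is classified as spam when its score is at or above the threshold, and we assume that the data contains at least one spam (label 1.0) and one ham (label 0.0) message.

```scala
// Sketch: ROC points for a small, in-memory set of (score, label) pairs.
object Roc {
  // (false positive rate, true positive rate) at a given threshold.
  def point(scoresLabels: Seq[(Double, Double)], threshold: Double): (Double, Double) = {
    val (spam, ham) = scoresLabels.partition { case (_, label) => label == 1.0 }
    val truePositiveRate = spam.count(_._1 >= threshold).toDouble / spam.size
    val falsePositiveRate = ham.count(_._1 >= threshold).toDouble / ham.size
    (falsePositiveRate, truePositiveRate)
  }

  // Lower the threshold through every distinct score, starting from the
  // corner (0, 0) where nothing is classified as spam.
  def curve(scoresLabels: Seq[(Double, Double)]): Seq[(Double, Double)] = {
    val thresholds = scoresLabels.map(_._1).distinct.sorted.reverse
    (0.0, 0.0) +: thresholds.map(t => point(scoresLabels, t))
  }
}
```

Each step down in threshold moves the curve up when the newly admitted message is spam, and to the right when it is ham, which is why a good classifier rises steeply before drifting right.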

We can calculate an array of points on the ROC curve using the .roc method on our
BinaryClassificationMetrics instance. This returns an RDD[(Double, Double)]
of (false positive, true positive) fractions for each threshold value. We can collect this as
an array:
scala> val rocArray = bm.roc.collect
rocArray: Array[(Double, Double)] = Array((0.0,0.0),
(0.0,0.16793893129770993), ...

Of course, an array of numbers is not very enlightening, so let's plot the ROC curve
with breeze-viz. We start by transforming our array of pairs into two arrays, one of
false positives and one of true positives:
scala> val falsePositives = rocArray.map { _._1 }
falsePositives: Array[Double] = Array(0.0, 0.0, 0.0, 0.0, 0.0, ...
scala> val truePositives = rocArray.map { _._2 }
truePositives: Array[Double] = Array(0.0, 0.16793893129770993,
0.19083969465...

Let's plot these two arrays:
scala> import breeze.plot._
import breeze.plot._
scala> val f = Figure()
f: breeze.plot.Figure = breeze.plot.Figure@3aa746cd
scala> val p = f.subplot(0)
p: breeze.plot.Plot = breeze.plot.Plot@5ed1438a
scala> p += plot(falsePositives, truePositives)
p += plot(falsePositives, truePositives)
scala> p.xlabel = "false positives"
p.xlabel: String = false positives
scala> p.ylabel = "true positives"
p.ylabel: String = true positives
scala> p.title = "ROC"
p.title: String = ROC
scala> f.refresh

The ROC curve hits 1.0 for a small value of x: that is, we retrieve all true positives at
the cost of relatively few false positives. To visualize the curve more accurately, it is
instructive to limit the range on the x-axis from 0 to 0.1.
scala> p.xlim = (0.0, 0.1)
p.xlim: (Double, Double) = (0.0,0.1)

We also need to tell breeze-viz to use appropriate tick spacing, which requires going
down to the JFreeChart layer underlying breeze-viz:
scala> import org.jfree.chart.axis.NumberTickUnit
import org.jfree.chart.axis.NumberTickUnit
scala> p.xaxis.setTickUnit(new NumberTickUnit(0.01))
scala> p.yaxis.setTickUnit(new NumberTickUnit(0.1))

We can now save the graph:
scala> f.saveas("roc.png")

This produces the following graph, stored in roc.png:

ROC curve for spam classification with logistic regression.
Note that we have limited the false positive axis at 0.1

By looking at the graph, we see that we can filter out 85% of spam without a single
false positive. Of course, we would need a larger test set to really validate this
assumption.
A graph is useful to really understand the behavior of a model. Sometimes, however,
we just want to have a single measure of the quality of a model. The area under the
ROC curve can be a good such metric:
scala> bm.areaUnderROC
res21: Double = 0.9983061235861147

This can be interpreted as follows: given any two messages randomly drawn
from the test set, one of which is ham, and one of which is spam, there is a 99.8%
probability that the model assigned a greater likelihood of spam to the spam
message than to the ham message.
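This pairwise interpretation can be checked directly on a small local collection. The sketch below is a hypothetical helper, not an MLlib method: it compares every spam score with every ham score and counts a tie as half a win, a common convention that we assume here. It also assumes at least one message of each class, and it is only practical for small data sets since it enumerates all pairs.

```scala
// Sketch: area under the ROC curve as the probability that a randomly chosen
// spam message outscores a randomly chosen ham message.
object PairwiseAuc {
  def auc(scoresLabels: Seq[(Double, Double)]): Double = {
    val spam = scoresLabels.collect { case (score, label) if label == 1.0 => score }
    val ham = scoresLabels.collect { case (score, label) if label == 0.0 => score }
    // 1.0 when the spam message wins, 0.5 for a tie, 0.0 otherwise.
    val outcomes = spam.flatMap { s =>
      ham.map { h => if (s > h) 1.0 else if (s == h) 0.5 else 0.0 }
    }
    outcomes.sum / outcomes.size
  }
}
```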
Other useful measures of model quality are the precision and recall for
particular thresholds, or the F1 score. All of these are provided by the
BinaryClassificationMetrics instance. The API documentation lists the
methods available: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.
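As a reminder of what these measures mean, the following sketch computes precision, recall and the F1 score at a single threshold for a local collection of (score, label) pairs. It is an illustration, not the MLlib implementation: ThresholdMetrics and Metrics are hypothetical names, and reporting a precision of 1.0 when no message is classified as spam is a convention we assume. It also assumes that the data contains at least one spam message.

```scala
// Sketch: precision, recall and F1 at one threshold.
object ThresholdMetrics {
  case class Metrics(precision: Double, recall: Double, f1: Double)

  def at(scoresLabels: Seq[(Double, Double)], threshold: Double): Metrics = {
    val predictedSpam = scoresLabels.filter(_._1 >= threshold)
    val truePositives = predictedSpam.count(_._2 == 1.0).toDouble
    val actualSpam = scoresLabels.count(_._2 == 1.0).toDouble
    // Precision: fraction of messages flagged as spam that really are spam.
    val precision =
      if (predictedSpam.isEmpty) 1.0 else truePositives / predictedSpam.size
    // Recall: fraction of real spam that was flagged.
    val recall = truePositives / actualSpam
    // F1: harmonic mean of precision and recall.
    val f1 =
      if (precision + recall == 0.0) 0.0
      else 2 * precision * recall / (precision + recall)
    Metrics(precision, recall, f1)
  }
}
```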

Regularization in logistic regression
One of the dangers of machine learning is over-fitting: the algorithm captures not
only the signal in the training set, but also the statistical noise that results from the
finite size of the training set.
A way to mitigate over-fitting in logistic regression is to use regularization: we
impose a penalty for large values of the parameters when optimizing. We can do this
by adding a penalty to the cost function that is proportional to the magnitude of the
parameters. Formally, we re-write the logistic regression cost function (described in
Chapter 2, Manipulating Data with Breeze) as:

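\[ \mathrm{Cost}(\mathit{params}) = \mathrm{Cost}_{LR}(\mathit{params}) + \lambda \, \lVert \mathit{params} \rVert_n \]

where \(\mathrm{Cost}_{LR}\) is the normal logistic regression cost function:

\[ \mathrm{Cost}_{LR}(\mathit{params}) = \sum_i \mathit{target}_i \times (\mathit{params} \cdot \mathit{training}_i) - \log\left[ \exp(\mathit{params} \cdot \mathit{training}_i) + 1 \right] \]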

Here, params is the vector of parameters, \(\mathit{training}_i\) is the vector of features for the ith
training example, and \(\mathit{target}_i\) is 1 if the ith training example is spam, and 0 otherwise.
This is identical to the logistic regression cost-function introduced in Chapter 2,
Manipulating data with Breeze, apart from the addition of the regularization term
\(\lambda \lVert \mathit{params} \rVert_n\), where \(\lVert \mathit{params} \rVert_n\) is the \(L_n\) norm of the parameter vector. The most
common value of n is 2, in which case \(\lVert \mathit{params} \rVert_2\) is just the magnitude of the parameter vector:

\[ \lVert \mathit{params} \rVert_2 = \sqrt{\sum_i \mathit{params}_i^2} \]
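The regularization term is simple to compute by hand. The following sketch, with hypothetical names of our own choosing, evaluates the L_n norm of a parameter vector and the resulting penalty; it only illustrates the formula above and makes no claim about how MLlib evaluates the penalty internally.

```scala
// Sketch: the L_n norm of a parameter vector and the penalty lambda * norm.
object Regularization {
  // L_n norm: (sum over i of |params_i|^n)^(1/n).
  def norm(params: Seq[Double], n: Int): Double =
    math.pow(params.map(p => math.pow(math.abs(p), n)).sum, 1.0 / n)

  // Penalty added to the cost function; n defaults to 2, the most common choice.
  def penalty(params: Seq[Double], lambda: Double, n: Int = 2): Double =
    lambda * norm(params, n)
}
```

For the vector (3, -4), the L2 norm is 5 and the L1 norm is 7, so with λ = 0.1 the L2 penalty is 0.5. Doubling every parameter doubles the penalty, which is what pushes the optimizer towards smaller parameter values.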

The additional regularization term drives the algorithm to reduce the magnitude of
the parameter vector. When using regularization, features must all have comparable
magnitude. This is commonly achieved by normalizing the features. The logistic
regression estimator provided by MLlib normalizes all features by default. This can
be turned off with the setStandardization parameter.
Spark has two hyperparameters that can be tweaked to control regularization:
• The type of regularization, set with the elasticNetParam parameter. A
value of 0 indicates L2 regularization, and a value of 1 indicates L1
regularization.
• The degree of regularization (λ in the cost function), set with the regParam
parameter. A high value of the regularization parameter indicates a strong
regularization. In general, the greater the danger of over-fitting, the larger the
regularization parameter ought to be.
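For reference, values of elasticNetParam between 0 and 1 mix the two types of regularization. To the best of our knowledge, the Spark documentation on linear methods writes the combined penalty as follows, with α standing for elasticNetParam and λ for regParam. Treat this as a sketch to check against the documentation for your Spark version; note in particular that the L2 term enters squared and halved, unlike in the simplified cost function above.

```latex
\[
\mathrm{penalty}(\mathit{params}) =
  \lambda \left( \alpha \, \lVert \mathit{params} \rVert_1
  + \frac{1 - \alpha}{2} \, \lVert \mathit{params} \rVert_2^2 \right)
\]
```

Setting α = 0 leaves only the L2 term, and setting α = 1 leaves only the L1 term, which matches the two extreme values described above.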

Let's create a new logistic regression instance that uses regularization:
scala> val lrWithRegularization = (new LogisticRegression()
.setMaxIter(50))
lrWithRegularization: LogisticRegression = logreg_16b65b325526
scala> lrWithRegularization.setElasticNetParam(0)
lrWithRegularization.type = logreg_1e3584a59b3a

To choose the appropriate value of λ , we fit the pipeline to the training set and
calculate the classification error on the test set for several values of λ . Further on in
the chapter, we will learn about cross-validation in MLlib, which provides a much
more rigorous way of choosing hyper-parameters.
scala> val lambdas = Array(0.0, 1.0E-12, 1.0E-10, 1.0E-8)
lambdas: Array[Double] = Array(0.0, 1.0E-12, 1.0E-10, 1.0E-8)
scala> lambdas foreach { lambda =>
lrWithRegularization.setRegParam(lambda)
val pipeline = new Pipeline().setStages(
Array(indexer, tokenizer, hashingTF, lrWithRegularization))
val model = pipeline.fit(trainDF)
val transformedTest = model.transform(testDF)
val classificationError = transformedTest.filter {
$"prediction" !== $"label"
}.count
println(s"$lambda => $classificationError")
}
0.0 => 20
1.0E-12 => 20
1.0E-10 => 20
1.0E-8 => 23

For our example, we see that L2 regularization does not improve classification
accuracy: very small values of λ leave the number of errors unchanged, and the
largest value that we tried increases it.

Cross-validation and model selection
In the previous example, we validated our approach by withholding 30% of the data
when training, and testing on this subset. This approach is not particularly rigorous:
the exact result changes depending on the random train-test split. Furthermore, if we
wanted to test several different hyperparameters (or different models) to choose the
best one, we would, unwittingly, choose the model that best reflects the specific rows
in our test set, rather than the population as a whole.
This can be overcome with cross-validation. We have already encountered
cross-validation in Chapter 4, Parallel Collections and Futures. In that chapter,
we used random subsample cross-validation, where we created the train-test
split randomly.
In this chapter, we will use k-fold cross-validation: we split the training set into k
parts (where, typically, k is 10 or 3) and use k-1 parts as the training set and the last as
the test set. The train/test cycle is repeated k times, keeping a different part as test set
each time.
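The splitting scheme can be sketched on a local collection as follows. KFold is a hypothetical helper, not the code used by MLlib; for brevity, it assigns rows to folds by index, whereas a real implementation would normally shuffle or randomly assign the rows first.

```scala
// Sketch: k-fold splits of an in-memory collection.
object KFold {
  // Returns one (training set, test set) pair per fold.
  def splits[A](rows: Seq[A], k: Int): Seq[(Seq[A], Seq[A])] = {
    val indexed = rows.zipWithIndex
    (0 until k).map { fold =>
      // Rows whose index falls in this fold are held out for testing.
      val (test, train) = indexed.partition { case (_, i) => i % k == fold }
      (train.map(_._1), test.map(_._1))
    }
  }
}
```

Every row appears in exactly one test set and in k-1 training sets, so the cross-validation score averages over k models, each evaluated on data that it has not seen.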
Cross-validation is commonly used to choose the best set of hyperparameters for
a model. To illustrate choosing suitable hyperparameters, we will go back to our
regularized logistic regression example. Instead of intuiting the hyper-parameters
ourselves, we will choose the hyper-parameters that give us the best cross-validation
score.
We will explore setting both the regularization type (through elasticNetParam)
and the degree of regularization (through regParam). A crude, but effective way
to find good values of the parameters is to perform a grid search: we calculate the
cross-validation score for every pair of values of the regularization parameters
of interest.
We can build a grid of parameters using MLlib's ParamGridBuilder.
scala> import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
scala> val paramGridBuilder = new ParamGridBuilder()
paramGridBuilder: ParamGridBuilder = ParamGridBuilder@1dd694d0

To add hyper-parameters over which to optimize to the grid, we use the addGrid
method:
scala> val lambdas = Array(0.0, 1.0E-12, 1.0E-10, 1.0E-8)
Array[Double] = Array(0.0, 1.0E-12, 1.0E-10, 1.0E-8)
scala> val elasticNetParams = Array(0.0, 1.0)
elasticNetParams: Array[Double] = Array(0.0, 1.0)
scala> paramGridBuilder.addGrid(
lrWithRegularization.regParam, lambdas).addGrid(
lrWithRegularization.elasticNetParam, elasticNetParams)
paramGridBuilder.type = ParamGridBuilder@1dd694d0

Once all the dimensions are added, we can just call the build method on the builder
to build the grid:
scala> val paramGrid = paramGridBuilder.build
paramGrid: Array[org.apache.spark.ml.param.ParamMap] =
Array({
logreg_f7dfb27bed7d-elasticNetParam: 0.0,
logreg_f7dfb27bed7d-regParam: 0.0
}, {
logreg_f7dfb27bed7d-elasticNetParam: 1.0,
logreg_f7dfb27bed7d-regParam: 0.0
} ...)
scala> paramGrid.length
Int = 8

As we can see, the grid is just a one-dimensional array of sets of parameters to pass
to the logistic regression model prior to fitting.
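Conceptually, building the grid amounts to taking the Cartesian product of the values along each dimension. The sketch below reproduces this with plain Scala maps; ToyGrid is a hypothetical stand-in, and the string keys merely mimic the parameter names, so this is not the ParamGridBuilder implementation.

```scala
// Sketch: a parameter grid as the Cartesian product of its dimensions.
object ToyGrid {
  // Each grid point maps a parameter name to the value to try.
  def build(dimensions: Seq[(String, Seq[Double])]): Seq[Map[String, Double]] =
    dimensions.foldLeft(Seq(Map.empty[String, Double])) {
      case (grid, (name, values)) =>
        // Extend every existing point with every value of the new dimension.
        grid.flatMap(point => values.map(value => point + (name -> value)))
    }
}
```

With four values for regParam and two for elasticNetParam, this yields 4 × 2 = 8 grid points, in agreement with the length of paramGrid above. The number of points grows multiplicatively with each dimension, which is why grid search becomes expensive quickly.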
The next step in setting up the cross-validation pipeline is to define a metric
for comparing model performance. Earlier in the chapter, we saw how to use
BinaryClassificationMetrics to estimate the quality of a model. Unfortunately,
the BinaryClassificationMetrics class is part of the core MLlib API, rather than
the new pipeline API, and is thus not (easily) compatible. The pipeline API offers
a BinaryClassificationEvaluator class instead. This class works directly on
DataFrames, and thus fits perfectly into the pipeline API flow:
scala> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
scala> val evaluator = new BinaryClassificationEvaluator()
evaluator: BinaryClassificationEvaluator = binEval_64b08538f1a2
scala> println(evaluator.explainParams)
labelCol: label column name (default: label)
metricName: metric name in evaluation (areaUnderROC|areaUnderPR)
(default: areaUnderROC)
rawPredictionCol: raw prediction (a.k.a. confidence) column name
(default: rawPrediction)

From the parameter list, we see that the BinaryClassificationEvaluator
class supports two metrics: the area under the ROC curve, and the area under
the precision-recall curve. It expects, as input, a DataFrame containing a label
column (the model truth) and a rawPrediction column (the column containing the
classifier's raw confidence scores for each class, ham and spam).
We now have all the parameters we need to run cross-validation. We first build the
pipeline, and then pass the pipeline, the evaluator and the array of parameters over
which to run the cross-validation to an instance of CrossValidator:
scala> val pipeline = new Pipeline().setStages(
Array(indexer, tokenizer, hashingTF, lrWithRegularization))
pipeline: Pipeline = pipeline_3ed29f72a4cc
scala> val crossval = (new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(3))
crossval: CrossValidator = cv_5ebfa1143a9d

We will now fit crossval to trainDF:
scala> val cvModel = crossval.fit(trainDF)
cvModel: CrossValidatorModel = cv_5ebfa1143a9d

This step can take a fairly long time (over an hour on a single machine). It creates
a transformer, cvModel, corresponding to the pipeline fitted with the hyper-parameters
that achieved the best cross-validation score on trainDF. We can use it to calculate
the classification error on the test DataFrame:
scala> cvModel.transform(testDF).filter {
$"prediction" !== $"label"
}.count
Long = 20

Cross-validation has therefore resulted in a model that performs identically to the
original, naive logistic regression model with no hyper-parameter tuning. cvModel also
contains a list of the evaluation scores for each set of parameters in the parameter grid:
scala> cvModel.avgMetrics
Array[Double] = Array(0.996427805316161, ...)

The easiest way to relate this to the hyper-parameters is to zip it with
cvModel.getEstimatorParamMaps. This gives us a list of (hyperparameter values,
cross-validation score) pairs:

scala> val params2score = cvModel.getEstimatorParamMaps.zip(
cvModel.avgMetrics)
Array[(ml.param.ParamMap,Double)] = Array(({
logreg_8f107aabb304-elasticNetParam: 0.0,
logreg_8f107aabb304-regParam: 0.0
},0.996427805316161),...
scala> params2score.foreach {
case (params, score) =>
val lambda = params(lrWithRegularization.regParam)
val elasticNetParam = params(
lrWithRegularization.elasticNetParam)
val l2Orl1 = if(elasticNetParam == 0.0) "L2" else "L1"
println(s"$l2Orl1, $lambda => $score")
}
L2, 0.0 => 0.996427805316161
L1, 0.0 => 0.996427805316161
L2, 1.0E-12 => 0.9964278053175655
L1, 1.0E-12 => 0.9961429402772803
L2, 1.0E-10 => 0.9964382546369551
L1, 1.0E-10 => 0.9962223090037103
L2, 1.0E-8 => 0.9964159754613495
L1, 1.0E-8 => 0.9891008277659763

The best set of hyper-parameters corresponds to L2 regularization with a
regularization parameter of 1E-10, though this only corresponds to a tiny
improvement in AUC.
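Rather than reading the best entry off the printed list, we can ask Scala for it. The sketch below copies the scores printed above into a local sequence (the BestScore object is a hypothetical wrapper for illustration) and selects the entry with the highest score using maxBy. In the shell, calling maxBy(_._2) on params2score should achieve the same thing on the real array of pairs.

```scala
// Sketch: pick the best hyper-parameters from the printed scores.
object BestScore {
  // (regularization type, lambda, cross-validation score), copied from the
  // output printed above.
  val scores: Seq[(String, Double, Double)] = Seq(
    ("L2", 0.0, 0.996427805316161),
    ("L1", 0.0, 0.996427805316161),
    ("L2", 1.0E-12, 0.9964278053175655),
    ("L1", 1.0E-12, 0.9961429402772803),
    ("L2", 1.0E-10, 0.9964382546369551),
    ("L1", 1.0E-10, 0.9962223090037103),
    ("L2", 1.0E-8, 0.9964159754613495),
    ("L1", 1.0E-8, 0.9891008277659763)
  )

  // The entry with the highest area under the ROC curve.
  val best: (String, Double, Double) = scores.maxBy(_._3)
}
```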
This completes our spam filter example. We have successfully trained a spam filter
for this particular Ling-Spam dataset. To obtain better results, one could experiment
with better feature extraction: we could remove stop words or use TF-IDF vectors,
rather than just term frequency vectors as features, and we could add additional
features like the length of messages, or even n-grams. We could also experiment
with non-linear algorithms, such as random forest. All of these steps would be
straightforward to add to the pipeline.

Beyond logistic regression
We have concentrated on logistic regression in this chapter, but MLlib offers many
alternative algorithms that will capture non-linearity in the data more effectively.
The consistency of the pipeline API makes it easy to try out different algorithms
and see how they perform. The pipeline API offers decision trees, random forest
and gradient boosted trees for classification, as well as a simple feed-forward neural
network, which is still experimental. It offers lasso and ridge regression and decision
trees for regression, as well as PCA for dimensionality reduction.
The lower level MLlib API also offers principal component analysis for
dimensionality reduction, several clustering methods including k-means and latent
Dirichlet allocation and recommender systems using alternating least squares.

Summary
MLlib tackles the challenge of devising scalable machine learning algorithms
head-on. In this chapter, we used it to train a simple scalable spam filter. MLlib is
a vast, rapidly evolving library. The best way to learn more about what it can offer
is to try and port code that you might have written using another library (such as
scikit-learn).
In the next chapter, we will look at how to build web APIs and interactive
visualizations to share our results with the rest of the world.

References
The best reference is the online documentation, including:
• The pipeline API: http://spark.apache.org/docs/latest/mllib-guide.html#sparkml-high-level-apis-for-ml-pipelines
• A full list of transformers: http://spark.apache.org/docs/latest/ml-features.html

Advanced Analytics with Spark, by Sandy Ryza, Uri Laserson, Sean Owen and Josh Wills
provides a detailed and up-to-date introduction to machine learning with Spark.

There are several books that introduce machine learning in more detail than we can
here. We have mentioned The Elements of Statistical Learning, by Friedman, Tibshirani
and Hastie several times in this book. It is one of the most complete introductions to
the mathematical underpinnings of machine learning currently available.
Andrew Ng's Machine Learning course on https://www.coursera.org/
provides a good introduction to machine learning. It uses Octave/MATLAB as the
programming language, but should be straightforward to adapt to Breeze and Scala.

Web APIs with Play
In the first 12 chapters of this book, we introduced basic tools and libraries for
anyone wanting to build data science applications: we learned how to interact with
SQL and MongoDB databases, how to build fast batch processing applications using
Spark, how to apply state-of-the-art machine learning algorithms using MLlib, and
how to build modular concurrent applications in Akka.
In the last chapters of this book, we will branch out to look at a web framework:
Play. You might wonder why a web framework would feature in a data science
book; surely such topics are best left to software engineers or web developers. Data
scientists, however, rarely exist in a vacuum. They often need to communicate results
or insights to stakeholders. As compelling as an ROC curve may be to someone
well versed in statistics, it may not carry as much weight with less technical people.
Indeed, it can be much easier to sell insights when they are accompanied by an
engaging visualization.
Many modern interactive data visualization applications are web applications
running in a web browser. Often, these involve D3.js, a JavaScript library for
building data-driven web pages. In this chapter and the next, we will look at
integrating D3 with Scala.
Writing a web application is a complex endeavor. We will split this task over this
chapter and the next. In this chapter, we will learn how to write a REST API that we
can use as backend for our application, or query in its own right. In the next chapter,
we will look at integrating front-end code with Play to query the API exposed by the
backend and display it using D3. We assume at least a basic familiarity with HTTP in
this chapter: you should have read Chapter 7, Web APIs, at least.

Many data scientists or aspiring data scientists are unlikely to be familiar with the
inner workings of web technologies. Learning how to build complex websites or
web APIs can be daunting. This chapter therefore starts with a general discussion
of dynamic websites and the architecture of web applications. If you are already
familiar with server-side programming and with web frameworks, you can easily
skip over the first few sections.

Client-server applications
A website works through the interaction between two computers: the client and
the server. If you enter the URL www.github.com/pbugnion/s4ds/graphs in a
web browser, your browser queries one of the GitHub servers. The server will look
through its database for information concerning the repository that you are interested
in. It will serve this information as HTML, CSS, and JavaScript to your computer.
Your browser is then responsible for interpreting this response in the correct way.
If you look at the URL in question, you will notice that there are several graphs
on that page. Unplug your internet connection and you can still interact with
the graphs. All the information necessary for interacting with the graphs was
transferred, as JavaScript, when you loaded that webpage. When you play with the
graphs, the CPU cycles necessary to make those changes happen are spent on your
computer, not a GitHub server. The code is executed client-side. Conversely, when
you request information about a new repository, that request is handled by a GitHub
server. It is said to be handled server-side.
A web framework like Play can be used on the server. For client-side code, we can
only use a language that the client browser will understand: HTML for the layout,
CSS for the styling and JavaScript, or languages that can compile to JavaScript, for
the logic.

Introduction to web frameworks
This section is a brief introduction to how modern web applications are designed. Go
ahead and skip it if you already feel comfortable writing backend code.
Loosely, a web framework is a set of tools and code libraries for building web
applications. To understand what a web framework provides, let's take a step back
and think about what you would need to do if you did not have one.
You want to write a program that listens on port 80 and sends HTML (or JSON or
XML) back to clients that request it. This is simple if you are serving the same file
back to every client: just load the HTML from file when you start the server, and
send it to clients who request it.

Chapter 13

So far, so good. But what if you now want to customize the HTML based on the
client request? You might choose to respond differently based on part of the URL
that the client put in their browser, or based on specific elements in the HTTP request.
For instance, the product page on amazon.com is different to the payment page. You
need to write code to parse the URL and the request, and then route the request to
the relevant handler.
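To give a feel for what such routing code involves, here is a minimal hand-written router in plain Scala. The routes and the strings returned are hypothetical, and this is not how Play defines routes; it only illustrates the parse-then-dispatch step that a framework's router takes off your hands.

```scala
// Sketch: route a request to a handler based on the HTTP method and URL path.
object ToyRouter {
  def route(method: String, path: String): String = {
    // "/products/42" becomes List("products", "42"); "/" becomes Nil.
    val segments = path.split("/").filter(_.nonEmpty).toList
    (method, segments) match {
      case ("GET", Nil) => "home page"
      case ("GET", "products" :: id :: Nil) => s"product page for $id"
      case ("GET", "payment" :: Nil) => "payment page"
      case _ => "404 Not Found"
    }
  }
}
```

Even this toy version has to decide what to do with unknown paths and unsupported methods; a real router must also handle query strings, typed path parameters and URL encoding.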
You might now want to customize the HTML returned dynamically, based on
specific elements of the request. The page for every product on amazon.com follows
the same outline, but specific elements are different. It would be wasteful to store the
entire HTML content for every product. A better way is to store the details for each
product in a database and inject them into an HTML template when a client requests
information on that product. You can do this with a template processor. Of course,
writing a good template processor is difficult.
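A naive template processor takes only a few lines, which helps to show where the difficulty lies. In the sketch below, the {{key}} placeholder syntax and the ToyTemplate object are our own inventions for illustration, and are unrelated to Play's template engine.

```scala
// Sketch: inject values into an HTML template by replacing {{key}} placeholders.
object ToyTemplate {
  def render(template: String, values: Map[String, String]): String =
    values.foldLeft(template) { case (html, (key, value)) =>
      html.replace("{{" + key + "}}", value)
    }
}
```

This version does not escape HTML in the injected values, does not report placeholders left unfilled, and supports neither loops nor conditionals. These are precisely the features that make a good template processor hard to write.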
You might deploy your web application and realize that it cannot handle the traffic
directed to it. You decide that handlers responding to client requests should run
asynchronously. You now have to deal with concurrency.
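With Scala futures, an asynchronous handler might look like the following sketch. The handler body and the AsyncHandlers object are hypothetical, and we use the global execution context for simplicity, whereas a real framework manages its own thread pools.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Sketch: handlers that return immediately with a Future of the response.
object AsyncHandlers {
  // The body runs on a thread from the execution context, so the caller
  // is free to accept the next request straight away.
  def handle(request: String): Future[String] =
    Future { s"response to $request" }

  // Several requests can be in flight at once; Future.sequence collects
  // the responses in the order of the requests.
  def handleAll(requests: Seq[String]): Future[Seq[String]] =
    Future.sequence(requests.map(handle))
}
```

Because the handlers may now run concurrently, any state that they share must be safe to access from several threads at once. That is the price of the improved scalability.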
A web framework essentially provides the wires to bind everything together. Besides
bundling an HTTP server, most frameworks will have a router that automatically
routes a request, based on the URL, to the correct handler. In most cases, the handler
will run asynchronously, giving you much better scalability. Many frameworks
have a template processor that lets you write HTML (or sometimes JSON or XML)
templates intuitively. Some web frameworks also provide functionality for accessing
a database, for parsing JSON or XML, for formulating HTTP requests and for
localization and internationalization.

Model-View-Controller architecture
Many web frameworks impose program architectures: it is difficult to provide
wires to bind disparate components together without making some assumptions
about what those components are. The Model-View-Controller (MVC) architecture
is particularly popular on the Web, and it is the architecture the Play framework
assumes. Let's look at each component in turn:
• The model is the data underlying the application. For example, I expect
the application underlying GitHub has models for users, repositories,
organizations, pull requests and so on. In the Play framework, a model is
often an instance of a case class. The core responsibility of the model is to
remember the current state of the application.
• Views are representations of a model or a set of models on the screen.
• The controller handles client interactions, possibly changing the model.
For instance, if you star a project on GitHub, the controller will update the
relevant models. Controllers normally carry very little application state:
remembering things is the job of the models.

MVC architecture: the state of the application is provided by the model. The view provides
a visual representation of the model to the user, and the controller handles logic: what to do when
the user presses a button or submits a form.

The MVC architecture works well because it decouples the user interface from the
underlying data and structures the flow of actions: a controller can update the model
state or the view, a model can send signals to the view to tell it to update, and the
view merely displays that information. The model carries no information related to
the user interface. This separation of concerns results in an easier mental model of
information flow, better encapsulation and greater testability.
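The division of labour can be sketched in a few lines of plain Scala. The Repository model, the view function and the star controller action below are hypothetical examples that echo the GitHub illustration; they do not use Play's classes.

```scala
// Sketch: the three MVC roles as a case class and two functions.
object ToyMvc {
  // Model: remembers the state of the application.
  case class Repository(name: String, stars: Int)

  // View: a representation of the model; it holds no state of its own.
  def view(repo: Repository): String =
    s"<p>${repo.name}: ${repo.stars} stars</p>"

  // Controller: handles a user action by producing an updated model.
  def star(repo: Repository): Repository =
    repo.copy(stars = repo.stars + 1)
}
```

Since the view is a pure function of the model and the controller only transforms models, each piece can be tested in isolation, which illustrates the testability benefit mentioned above.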

Single page applications
The client-server duality adds a degree of complication to the elegant MVC
architecture. Where should the model reside? What about the controller?
Traditionally, the model and the controller ran almost entirely on the server,
which just pushed the relevant HTML view to the client.
The growth in client-side JavaScript frameworks, such as AngularJS, has resulted in a
gradual shift to putting more code in the client. Both the controller and a temporary
version of the model typically run client-side. The server just functions as a web API:
if, for instance, the user updates the model, the controller will send an HTTP request
to the server informing it of the change.
It then makes sense to think of the program running server-side and the one running
client-side as two separate applications: the server persists data in databases, for
instance, and provides a programmatic interface to this data, usually as a web service
returning JSON or XML data. The client-side program maintains its own model and
controller, and polls the server whenever it needs a new model, or whenever it needs
to inform the server that the persistent view of the model should be changed.
Taken to the extreme, this results in Single-Page Applications. In a single-page
application, the first time the client requests a page from the server, it receives
the HTML and the JavaScript necessary to build the framework for the entire
application. If the client needs further data from the server, it polls the server's
API. This data is returned as JSON or XML.
This might seem a little complicated in the abstract, so let's think how the Amazon
website might be structured as a single-page application. We will just concern
ourselves with the products page here, since that's complicated enough. Let's imagine
that you are on the home page, and you hit a link for a particular product. The
application running on your computer knows how to display products, for instance
through an HTML template. The JavaScript also has a prototype for the model, such as:
{
  product_id: undefined,
  product_name: undefined,
  product_price: undefined,
  ...
}

What it's currently missing is knowledge of what data to put in those fields for the
product you have just selected: there is no way that information could have been
sent to your computer when the website loaded, since there was no way to know
what product you might click on (and sending information about every product
would be prohibitively costly). So the Amazon client sends a request to the server
for information on that product. The Amazon server replies with a JSON object (or
maybe XML). The client then updates its model with that information. When the
update is complete, an event is fired to update the view:

Client-server communications in a single-page application: when the client first accesses
the website, it receives HTML, CSS and JavaScript files that contain the entire logic for
the application. From then on, the client only uses the server as an API when it requests
additional data. The application running in the user's web browser and the one running on the
server are nearly independent. The only coupling is through the structure of the API
exposed by the server.

Building an application
In this chapter and the next, we will build a single-page application that relies on an
API written in Play. We will build a webpage that looks like this:

The user enters the name of someone on GitHub and can view a list of their
repositories and a chart summarizing which languages they use. You can find the
application deployed at app.scala4datascience.com. Go ahead and give it a whirl.
To get a glimpse of the innards, type app.scala4datascience.com/api/repos/odersky.
This returns a JSON array like:
[{"name":"dotty","language":"Scala","is_fork":true,"size":14653},
{"name":"frontend","language":"JavaScript","is_fork":true,"size":392},
{"name":"legacy-svn-scala","language":"Scala","is_fork":true,"size":296706},
...

We will build the API in this chapter, and write the front-end code in the next chapter.

The Play framework
The Play framework is a web framework built on top of Akka. It has a proven track
record in industry, and is thus a reliable choice for building scalable web applications.
Play is an opinionated web framework: it expects you to follow the MVC architecture,
and it has a strong opinion about the tools you should be using. It comes bundled
with its own JSON and XML parsers, with its own tools for accessing external APIs,
and with recommendations for how to access databases.
Web applications are much more complex than the command line scripts we
have been developing in this book, because there are many more components: the
backend code, routing information, HTML templates, JavaScript files, images, and
so on. The Play framework makes strong assumptions about the directory structure
for your project. Building that structure from scratch is both mind-numbingly boring
and easy to get wrong. Fortunately, we can use Typesafe activator to bootstrap
the project (you can also download the code from the Git repository at
https://github.com/pbugnion/s4ds, but I encourage you to start the project from a basic
activator structure and code along instead, using the finished version as an example).
Typesafe activator is a custom version of SBT that includes templates to get Scala
programmers up and running quickly. To install activator, you can either download
a JAR from https://www.typesafe.com/activator/download, or, on Mac OS,
install it via Homebrew:
$ brew install typesafe-activator

You can then launch the activator console from the terminal. If you downloaded
activator:
$ ./path/to/activator/activator new

Or, if you installed via Homebrew:
$ activator new

This starts a new project in the current directory. It starts by asking what template
you want to start with. Choose play-scala. It then asks for a name for your
application. I chose ghub-display, but go ahead and be creative!
Let's explore the newly created project structure (I have only retained the most
important files):
├── app
│   ├── controllers
│   │   └── Application.scala
│   └── views
│       ├── main.scala.html
│       └── index.scala.html
├── build.sbt
├── conf
│   ├── application.conf
│   └── routes
├── project
│   ├── build.properties
│   └── plugins.sbt
├── public
│   ├── images
│   │   └── favicon.png
│   ├── javascripts
│   │   └── hello.js
│   └── stylesheets
│       └── main.css
└── test
    ├── ApplicationSpec.scala
    └── IntegrationSpec.scala

Let's run the app:
$ ./activator
[ghub-display] $ run

Head over to your browser and navigate to the URL 127.0.0.1:9000/. The page
may take a few seconds to load. Once it is loaded, you should see a default page that
says Your application is ready.
Before we modify anything, let's walk through how this happens. When you ask
your browser to take you to 127.0.0.1:9000/, your browser sends an HTTP request
to the server listening at that address (in this case, the Netty server bundled with
Play). The request is a GET request for the route /. The Play framework looks in
conf/routes to see if it has a route satisfying /:
$ cat conf/routes
# Home page
GET     /                     controllers.Application.index
...

We see that the conf/routes file does contain the route / for GET requests. The
second part of that line, controllers.Application.index, is the name of a Scala
function to handle that route (more on that in a moment). Let's experiment. Change
the route end-point to /hello. Refresh your browser without changing the URL.
This will trigger recompilation of the application. You should now see an error page:

The error page tells you that the app does not have an action for the route / any more.
If you navigate to 127.0.0.1:9000/hello, you should see the landing page again.
Besides learning a little of how routing works, we have also learned two things about
developing Play applications:
• In development mode, code gets recompiled when you refresh your browser
  and there have been code changes

• Compilation and runtime errors get propagated to the web page

Let's change the route back to /. There is a lot more to say on routing, but it can wait
till we start building our application.
The conf/routes file tells the Play framework to use the method
controllers.Application.index to handle requests to /. Let's look at the
Application.scala file in app/controllers, where the index method is defined:
// app/controllers/Application.scala
package controllers

import play.api._
import play.api.mvc._

class Application extends Controller {
  def index = Action {
    Ok(views.html.index("Your new application is ready."))
  }
}

We see that controllers.Application.index refers to the method index in
the class Application. This method has return type Action. An Action is just a
function that maps HTTP requests to responses. Before explaining this in more
detail, let's change the action to:
def index = Action {
  Ok("hello, world")
}

Refresh your browser and you should see the landing page replaced with "hello,
world". By having our action return Ok("hello, world"), we are asking Play
to return an HTTP response with status code 200 (indicating that the request was
successful) and the body "hello, world".
Let's go back to the original content of index:
Action {
  Ok(views.html.index("Your new application is ready."))
}

We can see that this calls the method views.html.index. This might appear strange,
because there is no views package anywhere. However, if you look at the app/views
directory, you will notice two files: index.scala.html and main.scala.html. These
are templates, which, at compile time, get transformed into Scala functions. Let's
have a look at main.scala.html:
// app/views/main.scala.html
@(title: String)(content: Html)
<!DOCTYPE html>
<html>
  <head>
    <title>@title</title>
  </head>
  <body>
    @content
  </body>
</html>

At compile time, this template is compiled to a function
main(title:String)(content:Html) in the package views.html. Notice that the function's
package and name come from the template file name, and the function arguments come
from the first line of the template. The template contains embedded @title and @content
values, which get filled in by the arguments to the function. Let's experiment with
this in a Scala console:
$ activator console
scala> import views.html._
import views.html._
scala> val title = "hello"
title: String = hello
scala> val content = new play.twirl.api.Html("World")
content: play.twirl.api.Html = World
scala> main(title)(content)
res8: play.twirl.api.HtmlFormat.Appendable =
<!DOCTYPE html>
<html>
  <head>
    <title>hello</title>
  </head>
  <body>
    World
  </body>
</html>

We can call views.html.main, just like we would call a normal Scala function. The
arguments we pass in get embedded in the correct place, as defined by the template
in views/main.scala.html.
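To demystify what the template compiler produces, it can help to write by hand something of the same shape. The following is only a sketch: TemplateSketch is a hypothetical name, and it uses plain strings rather than Twirl's Html type so that it runs without Play on the classpath. The generated code is considerably more involved, but it is conceptually a curried function like this one.

```scala
// A hand-written stand-in for a compiled template: a curried function
// from arguments to markup. Strings replace Twirl's Html type here.
object TemplateSketch {
  def main(title: String)(content: String): String =
    s"""<html>
       |  <head><title>$title</title></head>
       |  <body>$content</body>
       |</html>""".stripMargin
}
```

Calling TemplateSketch.main("hello")("World") returns the markup with both arguments embedded, mirroring the console session above.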

This concludes our introductory tour of Play. Let's briefly go over what we have
learnt: when a request reaches the Play server, the server reads the URL and the
HTTP verb and checks that these exist in its conf/routes file. It will then pass the
request to the Action defined by the controller for that route. This Action returns
an HTTP response that gets fed back to the browser. In constructing the response, the
Action may make use of a template, which, as far as it is concerned, is just a function
(arguments list) => String or (arguments list) => HTML.
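The request cycle just described can be modelled in a few lines of plain Scala. This is a toy sketch, not Play's implementation: Request, Result, Action and the routes map are simplified, hypothetical stand-ins for the real types.

```scala
// A toy model of the request cycle. These types only imitate Play's.
object RequestCycleSketch {
  final case class Request(verb: String, path: String)
  final case class Result(status: Int, body: String)

  // An action is just a function from a request to a result.
  type Action = Request => Result

  val index: Action = _ => Result(200, "Your new application is ready.")

  // The routes table maps (verb, end-point) pairs to actions.
  val routes: Map[(String, String), Action] = Map(
    ("GET", "/") -> index
  )

  // Look the request up in the routes table; fall back to a 404.
  def handle(request: Request): Result =
    routes.get((request.verb, request.path)) match {
      case Some(action) => action(request)
      case None => Result(404, s"No route for ${request.verb} ${request.path}")
    }
}
```

With this model, a GET request for / produces a 200 response, while a GET request for /hello produces a 404, much like the error page we saw when we renamed the route.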

Dynamic routing
Routing, as we saw, is the mapping of HTTP requests to Scala handlers. Routes are
stored in conf/routes. A route is defined by an HTTP verb, followed by the endpoint, followed by a Scala function:
// verb   // end-point   // Scala handler
GET       /              controllers.Application.index

We learnt to add new routes by just adding lines to the routes file. We are not
limited to static routes, however. The Play framework lets us include wild cards in
routes. The value of the wild card can be passed as an argument to the controller.
To see how this works, let's create a controller that takes the name of a person as
argument. In the Application class in app/controllers, add:
// app/controllers/Application.scala
class Application extends Controller {
  ...
  def hello(name:String) = Action {
    Ok(s"hello, $name")
  }
}

We can now define a route handled by this controller:
// conf/routes
GET /hello/:name            controllers.Application.hello(name)

If you now point your browser to 127.0.0.1:9000/hello/Jim, you will see hello,
Jim appear on the screen.

Any string between : and the following / is treated as a wild card: it will match any
combination of characters. The value of the wild card can be passed to the controller.
Note that the wild card can appear anywhere in the URL, and there can be more than
one wild card. The following are all valid route definitions, for instance:
GET /hello/person-:name     controllers.Application.hello(name)
// ... matches /hello/person-Jim
GET /hello/:name/picture    controllers.Application.pictureFor(name)
// ... matches /hello/Jim/picture
GET /hello/:first/:last     controllers.Application.hello(first, last)
// ... matches /hello/john/doe

There are many other options for selecting routes and passing arguments to
the controller. Consult the documentation for the Play framework for a full
discussion on the routing possibilities:
https://www.playframework.com/documentation/2.4.x/ScalaRouting.
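If you are curious how wild card matching might work, the following sketch matches a pattern against a path one segment at a time. It is emphatically not how Play's router is implemented: RouteSketch and matchRoute are hypothetical names, and the sketch only supports wild cards that span a whole segment (so it would not handle /hello/person-:name).

```scala
// A sketch of segment-by-segment wild card matching. Hypothetical code,
// unrelated to Play's actual router.
object RouteSketch {
  // Split a path into non-empty segments: "/hello/Jim" gives List("hello", "Jim").
  private def segments(path: String): List[String] =
    path.split("/").toList.filter(_.nonEmpty)

  // Returns the wild card bindings if `path` matches `pattern`, None otherwise.
  def matchRoute(pattern: String, path: String): Option[Map[String, String]] = {
    val patternSegments = segments(pattern)
    val pathSegments = segments(path)
    if (patternSegments.length != pathSegments.length) None
    else
      patternSegments.zip(pathSegments).foldLeft(Option(Map.empty[String, String])) {
        // A segment starting with ':' is a wild card: bind it to the actual value.
        case (Some(bound), (p, actual)) if p.startsWith(":") =>
          Some(bound + (p.drop(1) -> actual))
        // A static segment must match exactly.
        case (Some(bound), (p, actual)) if p == actual => Some(bound)
        // Anything else (including an earlier failure) means no match.
        case _ => None
      }
  }
}
```

The bindings returned by matchRoute play the role of the arguments that Play passes to the controller method named in the route.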
URL design
It is generally considered best practice to leave the URL as simple
as possible. The URL should reflect the hierarchical structure of the
information of the website, rather than the underlying implementation.
GitHub is a very good example of this: its URLs make intuitive sense.
For instance, the URL for the repository for this book is:
https://github.com/pbugnion/s4ds
To access the issues page for that repository, add /issues to the route.
To access the first issue, add /1 to that route. These are called semantic
URLs (https://en.wikipedia.org/wiki/Semantic_URL).

Actions
We have talked about routes, and how to pass parameters to controllers. Let's now
talk about what we can do with the controller.
The method defined in the route must return a play.api.mvc.Action instance.
The Action type is a thin wrapper around the type Request[A] => Result, where
Request[A] identifies an HTTP request and Result is an HTTP response.

Composing the response
An HTTP response, as we saw in Chapter 7, Web APIs, is composed of:
• The status code (such as 200 for a successful response, or 404 for a
  missing page).

• The response headers, a key-value list indicating metadata related to the
  response.

• The response body. This can be HTML for web pages, or JSON, XML or
  plain text (or many other formats). This is generally the bit that we are really
  interested in.

The Play framework defines a play.api.mvc.Result object that symbolizes a
response. The object contains a header attribute with the status code and the
headers, and a body attribute containing the body.
The simplest way to generate a Result is to use one of the factory methods in
play.api.mvc.Results. We have already seen the Ok method, which generates a response
with status code 200:
def hello(name:String) = Action {
  Ok(s"hello, $name")
}

Let's take a step back and open a Scala console so we can understand how this works:
$ activator console
scala> import play.api.mvc._
import play.api.mvc._
scala> val res = Results.Ok("hello, world")
res: play.api.mvc.Result = Result(200, Map(Content-Type -> text/plain;
charset=utf-8))
scala> res.header.status
Int = 200
scala> res.header.headers
Map[String,String] = Map(Content-Type -> text/plain; charset=utf-8)
scala> res.body
play.api.libs.iteratee.Enumerator[Array[Byte]] = play.api.libs.iteratee.
Enumerator$$anon$18@5fb83873

We can see how the Results.Ok(...) creates a Result object with status 200 and
(in this case), a single header denoting the content type. The body is a bit more
complicated: it is an enumerator that can be pushed onto the output stream when
needed. The enumerator contains the argument passed to Ok: "hello, world", in
this case.
There are many factory methods in Results for returning different status codes.
Some of the more relevant ones are:
• Action { Results.NotFound }

• Action { Results.BadRequest("bad request") }

• Action { Results.InternalServerError("error") }

• Action { Results.Forbidden }

• Action { Results.Redirect("/home") }

For a full list of Result factories, consult the API documentation for Results
(https://www.playframework.com/documentation/2.4.x/api/scala/index.html#play.api.mvc.Results).
We have, so far, been limiting ourselves to passing strings as the content of the Ok
result: Ok("hello, world"). We are not, however, limited to passing strings. We
can pass a JSON object:
scala> import play.api.libs.json._
import play.api.libs.json._
scala> val jsonObj = Json.obj("hello" -> "world")
jsonObj: play.api.libs.json.JsObject = {"hello":"world"}
scala> Results.Ok(jsonObj)
play.api.mvc.Result = Result(200, Map(Content-Type -> application/json;
charset=utf-8))

We will cover interacting with JSON in more detail when we start building the
API. We can also pass HTML as the content. This is most commonly the case when
returning a view:
scala> val htmlObj = views.html.index("hello")
htmlObj: play.twirl.api.HtmlFormat.Appendable =
...
scala> Results.Ok(htmlObj)
play.api.mvc.Result = Result(200, Map(Content-Type -> text/html;
charset=utf-8))

Note how the Content-Type header is set based on the type of content passed to Ok.
The Ok factory uses the Writeable type class to convert its argument to the body of the
response. Thus, any content type for which a Writeable type class exists can be used
as argument to Ok. If you are unfamiliar with type classes, you might want to read the
Looser coupling with type classes section in Chapter 5, Scala and SQL through JDBC.
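The mechanism can be imitated without Play. In the sketch below, ToBody, Html, Response and ok are hypothetical names standing in for Writeable, Twirl's Html, Result and Ok; Play's real Writeable also deals with character encoding and byte streams, which we ignore here.

```scala
// A dependency-free sketch of the idea behind Writeable.
// All names are hypothetical stand-ins for Play's types.
object WriteableSketch {
  final case class Html(markup: String)
  final case class Response(status: Int, contentType: String, body: String)

  // The type class: how to turn an A into a body, and which content type to use.
  trait ToBody[A] {
    def contentType: String
    def render(a: A): String
  }

  object ToBody {
    implicit val stringToBody: ToBody[String] = new ToBody[String] {
      val contentType = "text/plain; charset=utf-8"
      def render(a: String): String = a
    }

    implicit val htmlToBody: ToBody[Html] = new ToBody[Html] {
      val contentType = "text/html; charset=utf-8"
      def render(a: Html): String = a.markup
    }
  }

  // The factory accepts any A for which a ToBody[A] instance can be found.
  def ok[A](content: A)(implicit toBody: ToBody[A]): Response =
    Response(200, toBody.contentType, toBody.render(content))
}
```

The compiler picks the instance from the static type of the argument, which is why passing a string and passing HTML to the same factory produce different Content-Type headers. Passing a type with no instance, such as an Int in this sketch, is a compile-time error rather than a runtime surprise.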

Understanding and parsing the request
We now know how to formulate (basic) responses. The other half of the equation
is the HTTP request. Recall that an Action is just a function mapping Request =>
Result. We can access the request using:
def hello(name:String) = Action { request =>
  ...
}

One of the reasons for needing a reference to the request is to access parameters in
the query string. Let's modify the Hello, <name> example that we wrote earlier to,
optionally, include a title in the query string. Thus, a URL could be formatted as
/hello/Jim?title=Dr. The request instance exposes the getQueryString method
for accessing specific keys in the query string. This method returns an Option[String]:
Some(value) if the key is present in the query, or None otherwise. We can re-write our
hello controller as:
def hello(name:String) = Action { request =>
  val title = request.getQueryString("title")
  val titleString = title.map { _ + " " }.getOrElse("")
  Ok(s"Hello, $titleString$name")
}

Try this out by accessing the URL 127.0.0.1:9000/hello/Odersky?title=Dr in
your browser. The browser should display Hello, Dr Odersky.
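The Option handling in this controller can be tried out in isolation. In the sketch below, a plain Map stands in for the request's query string; QuerySketch and greeting are hypothetical names, not part of Play.

```scala
// The same Option logic, with a Map standing in for the query string.
object QuerySketch {
  def greeting(name: String, query: Map[String, String]): String = {
    val title: Option[String] = query.get("title") // None if the key is absent
    val titleString = title.map(_ + " ").getOrElse("")
    s"Hello, $titleString$name"
  }
}
```

When the title key is missing, map is never applied and getOrElse supplies the empty string, so the greeting degrades gracefully to Hello, followed by the name.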

We have, so far, been concentrating on GET requests. These do not have a body.
Other types of HTTP request, most commonly POST requests, do contain a body.
Play lets the user pass body parsers when defining the action. The request body will be
passed through the body parser, which will convert it from a byte stream to a Scala
type. As a very simple example, let's define a new route that accepts POST requests:
POST    /hello                controllers.Application.helloPost

We will apply the predefined parse.text body parser to the incoming request body.
This converts the body of the request to a string. The helloPost controller looks like:
def helloPost = Action(parse.text) { request =>
  Ok("Hello. You told me: " + request.body)
}

You cannot test POST requests easily in the browser. You can use cURL
instead. cURL is a command line utility for dispatching HTTP requests.
It is installed by default on Mac OS and should be available via the
package manager on Linux distributions. The following will send a
POST request with "I think that Scala is great" in the body:
$ curl --data "I think that Scala is great" \
    --header "Content-type:text/plain" 127.0.0.1:9000/hello

This prints the following line to the terminal:
Hello. You told me: I think that Scala is great

There are several types of built-in body parsers:
• parse.file(new File("filename.txt")) will save the body to a file.

• parse.json will parse the body as JSON (we will learn more about
  interacting with JSON in the next section).

• parse.xml will parse the body as XML.

• parse.urlFormEncoded will parse the body as returned by submitting an
  HTML form. The request.body attribute is a Scala map from String to
  Seq[String], mapping each form element to its value(s).

For a full list of body parsers, the best source is the Scala API documentation for
play.api.mvc.BodyParsers.parse available at:
https://www.playframework.com/documentation/2.5.x/api/scala/index.html#play.api.mvc.BodyParsers$parse$.
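To get a feel for what a body parser does, here is a simplified sketch of form parsing written with the standard library only. FormParser is a hypothetical name: Play's own parsers are asynchronous and also handle errors and size limits, none of which is attempted here.

```scala
import java.net.URLDecoder

// A sketch of a body parser: raw bytes in, a Scala value out.
// FormParser is hypothetical and far simpler than Play's parsers.
object FormParser {
  // Parse "a=1&b=2&a=3" into Map("a" -> Seq("1", "3"), "b" -> Seq("2")).
  def urlFormEncoded(body: Array[Byte]): Map[String, Seq[String]] = {
    val text = new String(body, "UTF-8")
    val pairs = text.split("&").toSeq.filter(_.nonEmpty).map { pair =>
      val (key, value) = pair.span(_ != '=')
      // `value` still starts with '=' when present, hence drop(1).
      (URLDecoder.decode(key, "UTF-8"), URLDecoder.decode(value.drop(1), "UTF-8"))
    }
    // Group the values by key, preserving their order of appearance.
    pairs.groupBy(_._1).map { case (key, kvs) => key -> kvs.map(_._2) }
  }
}
```

The result has the same shape as the request.body produced by parse.urlFormEncoded: a map from each form element to the sequence of values submitted for it.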

Interacting with JSON
JSON, as we discovered in previous chapters, is becoming the de-facto language for
communicating structured data over HTTP. If you develop a web application or a
web API, it is likely that you will have to consume or emit JSON, or both.
In Chapter 7, Web APIs, we learned how to parse JSON through json4s. The Play
framework includes its own JSON parser and emitter. Fortunately, it behaves in
much the same way as json4s.
Let's imagine that we are building an API that summarizes information about
GitHub repositories. Our API will emit a JSON array listing a user's repositories
when queried about a specific user (much like the GitHub API, but with just a subset
of fields).
Let's start by defining a model for the repository. In Play applications, models are
normally stored in the folder app/models, in the models package:
// app/models/Repo.scala
package models

case class Repo (
  name:String,
  language:String,
  isFork: Boolean,
  size: Long
)

Let's add a route to our application that serves arrays of repos for a particular user.
In conf/routes, add the following line:
// conf/routes
GET     /api/repos/:username  controllers.Api.repos(username)

Let's now implement the framework for the controller. We will create a new
controller for our API, imaginatively called Api. For now, we will just have the
controller return dummy data. This is what the code looks like (we will explain the
details shortly):
// app/controllers/Api.scala
package controllers

import play.api._
import play.api.mvc._
import play.api.libs.json._
import models.Repo

class Api extends Controller {

  // Some dummy data.
  val data = List[Repo](
    Repo("dotty", "Scala", true, 14315),
    Repo("frontend", "JavaScript", true, 392)
  )

  // Typeclass for converting Repo -> JSON
  implicit val writesRepos = new Writes[Repo] {
    def writes(repo:Repo) = Json.obj(
      "name" -> repo.name,
      "language" -> repo.language,
      "is_fork" -> repo.isFork,
      "size" -> repo.size
    )
  }

  // The controller
  def repos(username:String) = Action {
    val repoArray = Json.toJson(data)
    // toJson(data) relies on existence of
    // `Writes[List[Repo]]` type class in scope
    Ok(repoArray)
  }
}

If you point your web browser to 127.0.0.1:9000/api/repos/odersky, you
should now see the following JSON array:
[{"name":"dotty","language":"Scala","is_fork":true,"size":14315},
{"name":"frontend","language":"JavaScript","is_fork":true,"size":392}]

The only tricky part of this code is the conversion from Repo to JSON. We call
Json.toJson on data, an instance of type List[Repo]. The toJson method relies on the
existence of a type class Writes[T] for the type T passed to it.

The Play framework makes extensive use of type classes to define how to convert
models to specific formats. Recall that we learnt how to write type classes in the
context of SQL and MongoDB. The Play framework's expectations are very similar:
for the Json.toJson method to work on an instance of type Repo, there must be a
Writes[Repo] implementation available that specifies how to transform Repo objects
to JSON.
In the Play framework, the Writes[T] type class defines a single method:
trait Writes[T] {
  def writes(obj:T):JsValue
}

Writes instances for built-in simple types and for collections are already built into
the Play framework, so we do not need to worry about defining Writes[Boolean],
for instance.

The Writes[Repo] instance is commonly defined either directly in the controller, if it
is just used for that controller, or in the Repo companion object, where it can be used
across several controllers. For simplicity, we just embedded it in the controller.
Note how type-classes allow for separation of concerns. The model just defines the Repo
type, without attaching any behavior. The Writes[Repo] type class just knows how to
convert from a Repo instance to JSON, but knows nothing of the context in which it is
used. Finally, the controller just knows how to create a JSON HTTP response.
Congratulations, you have just defined a web API that returns JSON! In the next
section, we will learn how to fetch data from the GitHub web API to avoid constantly
returning the same array.
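You may wonder how defining Writes[Repo] is enough to serialize a List[Repo]. The sketch below imitates the pattern without Play: JsonWrites, toJson and the string-based output are hypothetical simplifications (Play builds JsValue trees and escapes its output, which this sketch does not). It also shows the alternative mentioned above of placing the instance in the Repo companion object.

```scala
// A dependency-free imitation of the Writes pattern. Hypothetical names;
// no escaping is performed, so this is for illustration only.
object WritesSketch {
  trait JsonWrites[T] {
    def writes(value: T): String
  }

  object JsonWrites {
    // An instance for lists, derived from the instance for their elements.
    implicit def listWrites[T](implicit w: JsonWrites[T]): JsonWrites[List[T]] =
      new JsonWrites[List[T]] {
        def writes(values: List[T]): String =
          values.map(w.writes).mkString("[", ",", "]")
      }
  }

  def toJson[T](value: T)(implicit w: JsonWrites[T]): String = w.writes(value)

  final case class Repo(name: String, language: String, isFork: Boolean, size: Long)

  object Repo {
    // Defined in the companion object, so it is found wherever Repo is used.
    implicit val repoWrites: JsonWrites[Repo] = new JsonWrites[Repo] {
      def writes(repo: Repo): String = {
        def quoted(s: String): String = "\"" + s + "\""
        List(
          quoted("name") + ":" + quoted(repo.name),
          quoted("language") + ":" + quoted(repo.language),
          quoted("is_fork") + ":" + repo.isFork,
          quoted("size") + ":" + repo.size
        ).mkString("{", ",", "}")
      }
    }
  }
}
```

When toJson is called with a List[Repo], the compiler assembles the required instance by combining listWrites with repoWrites. Play's built-in collection instances work along similar lines.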

Querying external APIs and consuming JSON
So far, we have learnt how to provide the user with a dummy JSON array of
repositories in response to a request to /api/repos/:username. In this section, we will
replace the dummy data with the user's actual repositories, downloaded from GitHub.
In Chapter 7, Web APIs, we learned how to query the GitHub API using Scala's
Source.fromURL method and scalaj-http. It should come as no surprise that
the Play framework implements its own library for interacting with external
web services.

Let's edit the Api controller to fetch information about a user's repositories from
GitHub, rather than using dummy data. When called with a username as argument,
the controller will:
1. Send a GET request to the GitHub API for that user's repositories.
2. Interpret the response, converting the body from a JSON object to a
List[Repo].
3. Convert from the List[Repo] to a JSON array, forming the response.
We start by giving the full code listing before explaining the thornier parts in detail:
// app/controllers/Api.scala
package controllers

import play.api._
import play.api.mvc._
import play.api.libs.ws.WS // query external APIs
import play.api.Play.current
import play.api.libs.json._ // parsing JSON
import play.api.libs.functional.syntax._
import play.api.libs.concurrent.Execution.Implicits.defaultContext

import models.Repo

class Api extends Controller {

  // type class for Repo -> Json conversion
  implicit val writesRepo = new Writes[Repo] {
    def writes(repo:Repo) = Json.obj(
      "name" -> repo.name,
      "language" -> repo.language,
      "is_fork" -> repo.isFork,
      "size" -> repo.size
    )
  }

  // type class for Github Json -> Repo conversion
  implicit val readsRepoFromGithub:Reads[Repo] = (
    (JsPath \ "name").read[String] and
    (JsPath \ "language").read[String] and
    (JsPath \ "fork").read[Boolean] and
    (JsPath \ "size").read[Long]
  )(Repo.apply _)

  // controller
  def repos(username:String) = Action.async {
    // GitHub URL
    val url = s"https://api.github.com/users/$username/repos"
    val response = WS.url(url).get() // compose get request
    // "response" is a Future
    response.map { r =>
      // executed when the request completes
      if (r.status == 200) {
        // extract a list of repos from the response body
        val reposOpt = Json.parse(r.body).validate[List[Repo]]
        reposOpt match {
          // if the extraction was successful:
          case JsSuccess(repos, _) => Ok(Json.toJson(repos))
          // If there was an error during the extraction
          case _ => InternalServerError
        }
      }
      else {
        // GitHub returned something other than 200
        NotFound
      }
    }
  }
}

If you have written all this, point your browser to, for instance,
127.0.0.1:9000/api/repos/odersky to see the list of repositories owned by Martin Odersky:
[{"name":"dotty","language":"Scala","is_fork":true,"size":14653},
{"name":"frontend","language":"JavaScript","is_fork":true,"size":392},...

This code sample is a lot to take in, so let's break it down.

Calling external web services
The first step in querying external APIs is to import the WS object, which defines
factory methods for creating HTTP requests. These factory methods rely on a reference
to an implicit Play application in the namespace. The easiest way to ensure this is the
case is to import play.api.Play.current, a reference to the current application.
Let's ignore the readsRepoFromGithub type class for now and jump straight to the
controller body. The URL that we want to hit with a GET request is
"https://api.github.com/users/$username/repos", with the appropriate value for $username.
We create a GET request with WS.url(url).get(). We can also add headers to an
existing request. For instance, to specify the content type, we could have written:
WS.url(url).withHeaders("Content-Type" ->
"application/json").get()

We can use headers to pass a GitHub OAuth token using:
val token = "2502761d..."
WS.url(url).withHeaders("Authorization" -> s"token $token").get()

To formulate a POST request, rather than a GET request, replace the final .get()
with .post(data). Here, data can be JSON, XML or a string.
Adding .get or .post fires the request, returning a Future[WSResponse]. You
should, by now, be familiar with futures. By writing response.map { r => ... },
we specify a transformation to be executed on the future result, when it returns. The
transformation verifies the response's status, returning NotFound if the status code of
the response is anything but 200.
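If futures are still a little hazy, the following sketch reproduces the shape of this transformation using only the standard library. FakeResponse and fetch are hypothetical stand-ins for WSResponse and WS.url(...).get(), and no network request is made.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Transforming a future response with map. FakeResponse and fetch are
// hypothetical stand-ins for Play's WSResponse and WS client.
object FutureSketch {
  final case class FakeResponse(status: Int, body: String)

  // Pretend to query a remote service; the future completes on another thread.
  def fetch(url: String): Future[FakeResponse] = Future {
    if (url.endsWith("/odersky/repos")) FakeResponse(200, "[]")
    else FakeResponse(404, "")
  }

  // map registers a transformation that runs once the response arrives.
  def describe(url: String): Future[String] =
    fetch(url).map { r =>
      if (r.status == 200) s"OK: ${r.body}" else "NotFound"
    }

  // Blocking is for demonstration only; a Play action would return the future.
  def describeBlocking(url: String): String =
    Await.result(describe(url), 5.seconds)
}
```

The important point is that describe returns immediately with a Future[String]. This is also why the controller uses Action.async: it hands Play a future result rather than blocking a thread while GitHub responds.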

Parsing JSON
If the status code is 200, the callback parses the response body to JSON and converts
the parsed JSON to a List[Repo] instance. We already know how to convert from
a Repo object to JSON using the Writes[Repo] type class. The converse, going
from JSON to a Repo object, is a little more challenging, because we have to account
for incorrectly formatted JSON. To this effect, the Play framework provides the
.validate[T] method on JSON objects. This method tries to convert the JSON to an
instance of type T, returning JsSuccess if the JSON is well-formatted, or JsError
otherwise (similar to Scala's Try object). The .validate method relies on the
existence of a type class Reads[Repo]. Let's experiment with a Scala console:
$ activator console
scala> import play.api.libs.json._
import play.api.libs.json._
scala> val s = """
{ "name": "dotty", "size": 150, "language": "Scala", "fork": true }
"""
s: String = "
{ "name": "dotty", "size": 150, "language": "Scala", "fork": true }
"
scala> val parsedJson = Json.parse(s)
parsedJson: play.api.libs.json.JsValue = {"name":"dotty","size":150,"lang
uage":"Scala","fork":true}

Using Json.parse converts a string to an instance of JsValue, the super-type for
JSON instances. We can access specific fields in parsedJson using XPath-like syntax
(if you are not familiar with XPath-like syntax, you might want to read Chapter 6,
Slick – A Functional Interface for SQL):
scala> parsedJson \ "name"
play.api.libs.json.JsLookupResult = JsDefined("dotty")

XPath-like lookups return an instance with type JsLookupResult. This takes one of two
values: either JsDefined, if the path is valid, or JsUndefined if it is not:
scala> parsedJson \ "age"
play.api.libs.json.JsLookupResult = JsUndefined('age' is undefined on
object: {"name":"dotty","size":150,"language":"Scala","fork":true})

To go from a JsLookupResult instance to a String in a type-safe way, we can use the
.validate[String] method:
scala> (parsedJson \ "name").validate[String]
play.api.libs.json.JsResult[String] = JsSuccess(dotty,)

The .validate[T] method returns either JsSuccess if the JsDefined instance
could be successfully cast to T, or JsError otherwise. To illustrate the latter, let's try
validating this as an Int:
scala> (parsedJson \ "name").validate[Int]
play.api.libs.json.JsResult[Int] = JsError(List((,List(ValidationError(
List(error.expected.jsnumber),WrappedArray())))))

Calling .validate on an instance of type JsUndefined also results in a JsError:
scala> (parsedJson \ "age").validate[Int]
play.api.libs.json.JsResult[Int] = JsError(List((,List(ValidationError
(List('age' is undefined on object: {"name":"dotty","size":150,
"language":"Scala","fork":true}),WrappedArray())))))

To convert from an instance of JsResult[T] to an instance of type T, we can use
pattern matching:
scala> val name = (parsedJson \ "name").validate[String] match {
  case JsSuccess(n, _) => n
  case JsError(e) => throw new IllegalStateException(
    s"Error extracting name: $e")
}
name: String = dotty

We can now use .validate to cast JSON to simple types in a type-safe manner.
But, in the code example, we used .validate[Repo]. This works provided a
Reads[Repo] type class is implicitly available in the namespace.
The most common way of defining Reads[T] type classes is through a DSL
provided in play.api.libs.functional.syntax._. The DSL works by chaining together
operations that return either JsSuccess or JsError. Discussing exactly how this DSL
works is outside the scope of this chapter (see, for instance, the Play framework
documentation page on JSON combinators:
https://www.playframework.com/documentation/2.4.x/ScalaJsonCombinators). We will
stick to discussing the syntax:
scala> import play.api.libs.functional.syntax._
import play.api.libs.functional.syntax._
scala> import models.Repo
import models.Repo
scala> implicit val readsRepoFromGithub:Reads[Repo] = (
  (JsPath \ "name").read[String] and
  (JsPath \ "language").read[String] and
  (JsPath \ "fork").read[Boolean] and
  (JsPath \ "size").read[Long]
)(Repo.apply _)
readsRepoFromGithub: play.api.libs.json.Reads[models.Repo] = play.api.libs.json.Reads$$anon$8@a198ddb

The Reads type class is defined in two stages. The first chains together read[T]
methods with and, combining successes and errors. The second uses the apply
method of the companion object of a case class (or Tuple instance) to construct the
object, provided the first stage completed successfully. Now that we have defined the
type class, we can call validate[Repo] on a JsValue object:
scala> val repoOpt = parsedJson.validate[Repo]
play.api.libs.json.JsResult[models.Repo] = JsSuccess(Repo(dotty,Scala,true,150),)

We can then use pattern matching to extract the Repo object from the JsSuccess
instance:
scala> val JsSuccess(repo, _) = repoOpt
repo: models.Repo = Repo(dotty,Scala,true,150)

We have, so far, only talked about validating single repos. The Play framework
defines type classes for collection types, so, provided Reads[Repo] is defined,
Reads[List[Repo]] will also be defined.
Now that we understand how to extract Scala objects from JSON, let's get back to
the code. If we manage to successfully convert the repositories to a List[Repo], we
emit it again as JSON. Of course, converting from GitHub's JSON representation
of a repository to a Scala object, and from that Scala object directly to our JSON
representation of the object, might seem convoluted. However, if this were a real
application, we would have additional logic. We could, for instance, store repos
in a cache, and try and fetch from that cache instead of querying the GitHub API.
Converting from JSON to Scala objects as early as possible decouples the code that
we write from the way GitHub returns repositories.

Asynchronous actions
The last new element of the code sample is the call to Action.async, rather than
just Action. Recall that an Action instance is a thin wrapper around a Request
=> Result method. Our code, however, returns a Future[Result] rather than a
Result. When that is the case, use Action.async to construct the action, rather
than Action directly. Using Action.async tells the Play framework that the code
creating the Action is asynchronous.

Creating APIs with Play: a summary
In the last section, we deployed an API that responds to GET requests. Since this is a
lot to take in, let's summarize how to go about API creation:
1. Define appropriate routes in /conf/routes, using wildcards in the URL
as needed.
2. Create Scala case classes in /app/models to represent the models used by
the API.
3. Create Writes[T] methods to write models to JSON or XML so that they can
be returned by the API.
4. Bind the routes to controllers. If the controllers need to do more than a trivial
amount of work, wrap the work in a future to avoid blocking the server.
There are many more useful components of the Play framework that you are likely
to need, such as using Slick to access SQL databases. We do not, unfortunately,
have time to cover these in this introduction. The Play framework has
extensive, well-written documentation that will fill the gaping holes in this tutorial.

REST APIs: best practice
As the Internet matures, REST (representational state transfer) APIs are emerging as
the most reliable design pattern for web APIs. An API is described as RESTful if it
follows these guiding principles:
• The API is designed as a set of resources. For instance, the GitHub API
  provides information about users, repositories, followers, etc. Each user, or
  repository, is a specific resource. Each resource can be addressed through a
  different HTTP end-point.
• The URLs should be simple and should identify the resource clearly. For
  instance, api.github.com/users/odersky is simple and tells us clearly that
  we should expect information about the user Martin Odersky.
• There is no world resource that contains all the information about the system.
  Instead, top-level resources contain links to more specialized resources. For
  instance, the user resource in the GitHub API contains links to that user's
  repositories and that user's followers, rather than having all that information
  embedded in the user resource directly.
• The API should be discoverable. The response to a request for a specific
  resource should contain URLs for related resources. When you query the
  user resource on GitHub, the response contains the URL for accessing that
  user's followers, repositories, etc. The client should use the URLs provided by
  the API, rather than attempting to construct them client-side. This makes the
  client less brittle to changes in the API.
• There should be as little state maintained on the server as possible. For
  instance, when querying the GitHub API, we must pass the authentication
  token with every request, rather than expecting our authentication status to
  be remembered on the server. Having each interaction be independent of the
  history provides much better scalability: if any interaction can be handled by
  any server, load balancing is much easier.
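The discoverability principle can be illustrated with a small client-side sketch. Everything here is an assumption made for illustration: fakeApi is an in-memory stand-in for a server, get stands in for an HTTP GET, and the sample data is invented. The point is only that the client reads the repositories URL from the user resource instead of building it by string concatenation.

```javascript
// Hypothetical in-memory stand-in for an HTTP API: maps URLs to JSON resources.
const fakeApi = {
  "/users/odersky": {
    login: "odersky",
    repos_url: "/users/odersky/repos",
    followers_url: "/users/odersky/followers"
  },
  "/users/odersky/repos": [{ name: "dotty" }, { name: "scala" }],
  "/users/odersky/followers": [{ login: "alice" }]
};

// Stand-in for an HTTP GET; a real client would issue a network request here.
function get(url) {
  if (!(url in fakeApi)) { throw new Error("404: " + url); }
  return fakeApi[url];
}

// Discoverable style: follow the link embedded in the user resource rather
// than concatenating "/users/" + login + "/repos" on the client.
function repoNames(userUrl) {
  const user = get(userUrl);
  return get(user.repos_url).map(function (repo) { return repo.name; });
}

console.log(repoNames("/users/odersky"));
```

If the server later moved the repositories resource to a different path, only the repos_url value in its responses would change, and repoNames would keep working unmodified.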

Summary
In this chapter, we introduced the Play framework as a tool for building web APIs.
We built an API that returns a JSON array of a user's GitHub repositories. In the
next chapter, we will build on this API and construct a single-page application to
represent this data graphically.

References
• These pages give information on semantic URLs:
  https://en.wikipedia.org/wiki/Semantic_URL and
  http://apiux.com/2013/04/03/url-design-restful-web-services/.
• For a much more in-depth discussion of the Play framework, I suggest Play
  Framework Essentials by Julien Richard-Foy.
• REST in Practice: Hypermedia and Systems Architecture, by Jim Webber, Savas
  Parastatidis and Ian Robinson, describes how to architect REST APIs.


Visualization with D3 and the Play Framework
In the previous chapter, we learned about the Play framework, a web framework
for Scala. We built an API that returns a JSON array describing a user's GitHub
repositories.
In this chapter, we will construct a fully-fledged web application that displays a
table and a chart describing a user's repositories. We will learn to integrate D3.js,
a JavaScript library for building data-driven web pages, with the Play framework.
This will set you on the path to building compelling interactive visualizations that
showcase results obtained with machine learning.
This chapter assumes that you are familiar with HTML, CSS, and JavaScript.
We present references at the end of the chapter. You should also have read the
previous chapter.

GitHub user data
We will build a single-page application that uses, as its backend, the API developed
in the previous chapter. The application contains a form where the user enters the
login name for a GitHub account. The application queries the API to get a list of
repositories for that user and displays them on the screen as both a table and a pie
chart summarizing programming language use for that user.

To see a live version of the application, head over to
http://app.scala4datascience.com.

Do I need a backend?
In the previous chapter, we learned about the client-server model that underpins
how the internet works: when you enter a website URL in your browser, the server
serves HTML, CSS, and JavaScript to your browser, which then renders it in the
appropriate manner.
What does this all mean for you? Arguably the second question that you should
be asking yourself when building a web application is whether you need to do any
server-side processing (right after "is this really going to be worth the effort?"). Could
you just create an HTML web-page with some JavaScript?

You can get away without a backend if the data needed to build the whole
application is small enough: typically a few megabytes. If your application needs
more data than that, you will need a backend to transfer just the data the client
currently needs.
Surprisingly, you can often build visualizations without a backend: while data
science is accustomed to dealing with terabytes of data, the goal of the data science
process is often condensing these huge data sets to a few meaningful numbers.
Having a backend also lets you include logic invisible to the client. If you need
to validate a password, you clearly cannot send the code to do that to the client
computer: it needs to happen out of sight, on the server.
If your application is small enough and you do not need to do any server-side
processing, stop reading this chapter, brush up on your JavaScript if you have to,
and forget about Scala for now. Not having to worry about building a backend will
make your life easier.
Clearly, however, we do not have that freedom for the application that we want
to build: the user could enter the name of anyone on GitHub. Finding information
about that user requires a backend with access to tremendous storage and querying
capacity (which we simulate by just forwarding the request to the GitHub API and
re-interpreting the response).

JavaScript dependencies through web-jars
One of the challenges of developing web applications is that we are writing two
quasi-separate programs: the server-side program and the client-side program. These
generally require different technologies. In particular, for any but the most trivial
application, we must keep track of JavaScript libraries, and integrate processing the
JavaScript code (for instance, for minification) in the build process.

The Play framework manages JavaScript dependencies through web-jars. These are
just JavaScript libraries packaged as jars. They are deployed on Maven Central,
which means that we can just add them as dependencies to our build.sbt file. For
this application, we will need the following JavaScript libraries:
• Require.js, a library for writing modular JavaScript
• JQuery
• Bootstrap
• Underscore.js, a library that adds many functional constructs and
  client-side templating
• D3, the graph plotting library
• NVD3, a graph library built on top of D3

If you are planning on coding up the examples provided in this chapter, the easiest
approach is to start from the code for the previous chapter (you can download
the code for Chapter 13, Web APIs with Play, from GitHub:
https://github.com/pbugnion/s4ds/tree/master/chap13). We will assume this as a
starting point from here onwards.
Let's include the dependencies on the web-jars in the build.sbt file:
libraryDependencies ++= Seq(
"org.webjars" % "requirejs" % "2.1.22",
"org.webjars" % "jquery" % "2.1.4",
"org.webjars" % "underscorejs" % "1.8.3",
"org.webjars" % "nvd3" % "1.8.1",
"org.webjars" % "d3js" % "3.5.6",
"org.webjars" % "bootstrap" % "3.3.6"
)

Fetch the modules by running activator update. Once you have done this, you
will notice the JavaScript libraries in target/web/public/main/lib.

Towards a web application: HTML templates
In the previous chapter, we briefly saw how to construct HTML templates by
interleaving Scala snippets in an HTML file. We saw that templates are compiled
to Scala functions, and we learned how to call these functions from the controllers.

In single-page applications, the majority of the logic governing what is actually
displayed in the browser resides in the client-side JavaScript, not in the server.
The pages served by the server contain the bare-bones HTML framework.
Let's create the HTML layout for our application. We will save this in
app/views/index.scala.html. The template will just contain the layout for the
application, but will not contain any information about any user's repositories. To
fetch that information, the application will have to query the API developed in the
previous chapter. The template does not take any parameters, since all the dynamic
HTML generation will happen client-side.
We use the Bootstrap grid layout to control the HTML layout. If you are not familiar
with Bootstrap layouts, consult the documentation at
http://getbootstrap.com/css/#grid-example-basic.
// app/views/index.scala.html



Github User display






Github user search

In the HTML head, we link the CSS stylesheets that we need for the application.
Instead of specifying the path explicitly, we use the
@routes.Assets.versioned(...) function. This resolves to a URI corresponding to
the location where the assets are stored post-compilation. The argument passed to
the function should be the path from target/web/public/main to the asset you need.
We want to serve the compiled version of this view when the user accesses the route
/ on our server. We therefore need to add this route to conf/routes:
# conf/routes
GET    /    controllers.Application.index

The route is served by the index function in the Application controller. All this
controller needs to do is serve the index view:
// app/controllers/Application.scala
package controllers

import play.api._
import play.api.mvc._

class Application extends Controller {
  def index = Action {
    Ok(views.html.index())
  }
}

Start the Play framework by running activator run in the root directory of the
application and point your web browser to 127.0.0.1:9000/. You should see the
framework for our web application. Of course, the application does not do anything
yet, since we have not written any of the JavaScript logic.

Modular JavaScript through RequireJS
The simplest way of injecting JavaScript libraries into the namespace is to add
them to the HTML framework via script tags in the HTML header. For instance, to add
JQuery, we would add a script tag that loads it to the head of the document.
While this works, it does not scale well to large applications, since every library
gets imported into the global namespace. Modern client-side JavaScript frameworks
such as AngularJS provide an alternative way of defining and loading modules that
preserves encapsulation. We will use RequireJS.
In a nutshell, RequireJS lets us encapsulate JavaScript modules through functions.
For instance, if we wanted to write a module example that contains a function for
hiding a div, we would define the module as follows:
// example.js
define(["jquery", "underscore"], function($, _) {

  // hide a div
  function hide(div_name) {
    $(div_name).hide() ;
  }

  // what the module exports.
  return { "hide": hide }
}) ;

We encapsulate our module as a callback passed to a function called define. The
define function takes two arguments: a list of dependencies, and a function
definition. The define function binds the dependencies to the arguments list of the
callback: in this case, functions in JQuery will be bound to $ and functions in
Underscore will be bound to _. This creates a module which exposes whatever the
callback function returns. In this case, we export the hide function, binding it to
the name "hide". Our example module thus exposes the hide function.
To load this module, we pass it as a dependency to the module in which we want to
use it:
define(["example"], function(example) {

  function hide_all() {
    example.hide("#top") ;
    example.hide("#bottom") ;
  }

  return { "hide_all": hide_all } ;
});

Notice how the functions in example are encapsulated, rather than existing in the
global namespace. We call them through the example. prefix (for instance,
example.hide). Furthermore, any functions or variables defined internally to the
example module remain private.
Sometimes, we want JavaScript code to exist outside of modules. This is often the
case for the script that bootstraps the application. For these, replace define with
require:
require(["jquery", "example"], function($, example) {
  $(document).ready(function() {
    example.hide("#header") ;
  });
}) ;

Now that we have an overview of RequireJS, how do we use it in the Play framework?
The first step is to add the dependency on the RequireJS web jar, which we have done.
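To make the define and require mechanism less magical, here is a toy module registry that runs in plain JavaScript. It is not how RequireJS is implemented (RequireJS locates modules by file path and loads them asynchronously); toyDefine, toyRequire, and the two sample modules are hypothetical names used only to show how dependencies get bound to callback arguments and how module internals stay private.

```javascript
// Toy module registry: NOT RequireJS, just the same idea in miniature.
const registry = {};

// Register a named module: look up its dependencies, call the factory with
// them, and store whatever the factory returns as the module's exports.
function toyDefine(name, deps, factory) {
  const resolved = deps.map(function (dep) {
    if (!(dep in registry)) { throw new Error("Unknown module: " + dep); }
    return registry[dep];
  });
  registry[name] = factory.apply(null, resolved);
}

// Run a callback with the requested modules, without registering anything.
function toyRequire(deps, callback) {
  const resolved = deps.map(function (dep) { return registry[dep]; });
  return callback.apply(null, resolved);
}

toyDefine("counter", [], function () {
  var count = 0; // private: only reachable through the exported function
  return { increment: function () { count += 1; return count; } };
});

toyDefine("doubleCounter", ["counter"], function (counter) {
  return {
    incrementTwice: function () {
      counter.increment();
      return counter.increment();
    }
  };
});

const result = toyRequire(["doubleCounter"], function (doubleCounter) {
  return doubleCounter.incrementTwice();
});
console.log(result); // 2
```

The variable count never leaks into the global namespace: other code can only reach it through the increment function that the counter module chose to export.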
The Play framework also adds a RequireJS SBT plugin (https://github.com/sbt/sbt-rjs),
which should be installed by default if you used the play-scala activator. If this
is missing, it can be added with the following line in plugins.sbt:
// project/plugins.sbt
addSbtPlugin("com.typesafe.sbt" % "sbt-rjs" % "1.0.7")

We also need to add the plugin to the list of stages. This allows the plugin to
manipulate the JavaScript assets when packaging the application as a jar. Add the
following line to build.sbt:
pipelineStages := Seq(rjs)

You will need to restart the activator for the changes to take effect.
We are now ready to use RequireJS in our application. We can use it by adding a line
to the head section of our view, index.scala.html, that loads require.js and names
main.js as the entry point. When the view is compiled, this is resolved to a script
tag with a data-main attribute. The argument passed to data-main is the entry point
for our application. When RequireJS loads, it will execute main.js. That script must
therefore bootstrap our application. In particular, it should contain a
configuration object for RequireJS, to make it aware of where all the libraries are.

Bootstrapping the applications
When we linked require.js to our application, we told it to use main.js as our entry
point. To test that this works, let's start by entering a dummy main.js. JavaScript
files in Play applications go in /public/javascripts:
// public/javascripts/main.js
require([], function() {
  console.log("hello, JavaScript");
});

To verify that this worked, head to 127.0.0.1:9000 and open the browser console. You
should see "hello, JavaScript" in the console.
Let's now write a more useful main.js. We will start by configuring RequireJS,
giving it the location of modules we will use in our application. Unfortunately,
NVD3, the graph library that we use, does not play very well with RequireJS so we
have to use an ugly hack to make it work.
This complicates our main.js file somewhat:
// public/javascripts/main.js
(function (requirejs) {
  'use strict';

  // -- RequireJS config --
  requirejs.config({
    // path to the web jars. These definitions allow us
    // to use "jquery", rather than "../lib/jquery/jquery",
    // when defining module dependencies.
    paths: {
      "jquery": "../lib/jquery/jquery",
      "underscore": "../lib/underscorejs/underscore",
      "d3": "../lib/d3js/d3",
      "nvd3": "../lib/nvd3/nv.d3",
      "bootstrap": "../lib/bootstrap/js/bootstrap"
    },

    shim: {
      // hack to get nvd3 to work with requirejs.
      // see this question on Stack Overflow:
      // http://stackoverflow.com/questions/13157704/how-to-integrate-d3-with-require-js#comment32647365_13171592
      nvd3: {
        deps: ["d3.global"],
        exports: "nv"
      },
      bootstrap : { deps :['jquery'] }
    }
  }) ;
})(requirejs) ;

// hack to get nvd3 to work with requirejs.
// see this question on Stack Overflow:
// http://stackoverflow.com/questions/13157704/how-to-integrate-d3-with-require-js#comment32647365_13171592
define("d3.global", ["d3"], function(d3global) {
  d3 = d3global;
});

require([], function() {
  // Our application
  console.log("hello, JavaScript");
}) ;

Now that we have the configuration in place, we can dig into the JavaScript part of
the application.

Client-side program architecture
The basic idea is simple: the user searches for the name of someone on GitHub in the
input box. When they enter a name, we fire a request to the API designed in the
previous chapter. When the response from the API returns, the program binds that
response to a model and emits an event notifying that the model has been changed.
The views listen for this event and refresh from the model in response.

Designing the model
Let's start by defining the client-side model. The model holds information regarding
the repos of the user currently displayed. It gets filled in after the first search.
// public/javascripts/model.js
define([], function(){
  return {
    ghubUser: "", // last name that was searched for
    exists: true, // does that person exist on github?
    repos: [] // list of repos
  } ;
});

To see a populated value of the model, head to the complete application example on
app.scala4datascience.com, open a JavaScript console in your browser, search for a
user (for example, odersky) in the application and type the following in the console:
> require(["model"], function(model) { console.log(model) ; })
{ghubUser: "odersky", exists: true, repos: Array}
> require(["model"], function(model) { console.log(model.repos[0]); })
{name: "dotty", language: "Scala", is_fork: true, size: 14653}

These import the "model" module, bind it to the variable model, and then print
information to the console.

The event bus
We need a mechanism for informing the views when the model is updated, since the
views need to refresh from the new model. This is commonly handled through events in
web applications. JQuery lets us bind callbacks to specific events. The callback is
executed when that event occurs.
For instance, to bind a callback to the event "custom-event", enter the following in
a JavaScript console:
> $(window).on("custom-event", function() {
  console.log("custom event received") ;
});

We can fire the event using:
> $(window).trigger("custom-event");
custom event received

Events in JQuery require an event bus, a DOM element on which the event is
registered. In this case, we used the window DOM element as our event bus, but any
JQuery element would have served. Centralizing event definitions to a single module
is helpful.
We will, therefore, create an events module containing two functions: trigger, which
triggers an event (specified by a string) and on, which binds a callback to a
specific event:
// public/javascripts/events.js
define(["jquery"], function($) {

  var bus = $(window) ; // widget to use as an event bus

  function trigger(eventType) {
    $(bus).trigger(eventType) ;
  }

  function on(eventType, f) {
    $(bus).on(eventType, f) ;
  }

  return {
    "trigger": trigger,
    "on": on
  } ;
});

We can now emit and receive events using the events module. You can test this out
in a JavaScript console on the live version of the application (at
app.scala4datascience.com). Let's start by registering a listener:
> require(["events"], function(events) {
  // register event listener
  events.on("hello_event", function() {
    console.log("Received event") ;
  }) ;
});

If we now trigger the event "hello_event", the listener prints "Received event":
> require(["events"], function(events) {
  // trigger the event
  events.trigger("hello_event") ;
}) ;

Using events allows us to decouple the controller from the views. The controller
does not need to know anything about the views, and vice-versa. The controller just
needs to emit a "model_updated" event when the model is updated, and the views need
to refresh from the model when they receive that event.

AJAX calls through JQuery
We can now write the controller for our application. When the user enters a name in
the text input, we query the API, update the model and trigger a model_updated event.
We use JQuery's $.getJSON function to query our API. This function takes a URL as
its first argument, and a callback as its second argument. The API call is
asynchronous: $.getJSON returns immediately after execution. All request processing
must, therefore, be done in the callback. The callback is called if the request is
successful, but we can define additional handlers that are always called, or called
on failure.
Let's try this out in the browser console (either your own, if you are running the
API developed in the previous chapter, or on app.scala4datascience.com). Recall that
the API is listening to the end-point /api/repos/:user:
> $.getJSON("/api/repos/odersky", function(data) {
  console.log("API response:");
  console.log(data);
  console.log(data[0]);
}) ;
{readyState: 1, getResponseHeader: function, ...}
API response:
[Object, Object, Object, Object, Object, ...]
{name: "dotty", language: "Scala", is_fork: true, size: 14653}

getJSON returns immediately. A few tenths of a second later, the API responds, at
which point the response gets fed through the callback.
The callback only gets executed on success. It takes, as its argument, the JSON
object returned by the API. To bind a callback that is executed when the API request
fails, call the .fail method on the return value of getJSON:
> $.getJSON("/api/repos/junk123456", function(data) {
  console.log("called on success");
}).fail(function() {
  console.log("called on failure") ;
}) ;
{readyState: 1, getResponseHeader: function, ...}
called on failure

We can also use the .always method on the return value of getJSON to specify a
callback that is executed, whether the API query was successful or not.
Now that we know how to use $.getJSON to query our API, we can write the controller.
The controller listens for changes to the #user-selection input field. When a change
occurs, it fires an AJAX request to the API for information on that user. It binds a
callback which updates the model when the API replies with a list of repositories.
We will define a controller module that exports a single function, initialize, that
creates the event listeners:
// public/javascripts/controller.js
define(["jquery", "events", "model"], function($, events, model) {

  function initialize() {
    $("#user-selection").change(function() {

      var user = $("#user-selection").val() ;
      console.log("Fetching information for " + user) ;

      // Change cursor to a 'wait' symbol
      // while we wait for the API to respond
      $("*").css({"cursor": "wait"}) ;

      $.getJSON("/api/repos/" + user, function(data) {
        // Executed on success
        model.exists = true ;
        model.repos = data ;
      }).fail(function() {
        // Executed on failure
        model.exists = false ;
        model.repos = [] ;
      }).always(function() {
        // Always executed
        model.ghubUser = user ;

        // Restore cursor
        $("*").css({"cursor": "initial"}) ;

        // Tell the rest of the application
        // that the model has been updated.
        events.trigger("model_updated") ;
      });
    }) ;
  } ;

  return { "initialize": initialize };

});

Our controller module just exposes the initialize method. Once the initialization is
performed, the controller interacts with the rest of the application through event
listeners. We will call the controller's initialize method in main.js. Currently,
the last lines of that file are just an empty require block. Let's import our
controller and initialize it:
// public/javascripts/main.js
require(["controller"], function(controller) {
  controller.initialize();
});

To test that this works, we can bind a dummy listener to the "model_updated" event.
For instance, we could log the current model to the browser JavaScript console with
the following snippet (which you can write directly in the JavaScript console):
> require(["events", "model"],
function(events, model) {
  events.on("model_updated", function () {
    console.log("model_updated event received");
    console.log(model);
  });
});

If you then search for a user, the model will be printed to the console.
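The flow from controller to model to view can be simulated without a browser. The sketch below is an illustration under stated assumptions, not the application's code: makeBus is a hypothetical plain-JavaScript replacement for the JQuery-based events module, handleResponse stands in for the $.getJSON callbacks, and rendered records what a view would have drawn.

```javascript
// Minimal event bus (hypothetical stand-in for the JQuery-based events module).
function makeBus() {
  const listeners = {};
  return {
    on: function (eventType, f) {
      (listeners[eventType] = listeners[eventType] || []).push(f);
    },
    trigger: function (eventType) {
      (listeners[eventType] || []).forEach(function (f) { f(); });
    }
  };
}

const events = makeBus();
const model = { ghubUser: "", exists: true, repos: [] };
const rendered = []; // what a view would have drawn, recorded for inspection

// "View": knows only the model and the event name, not the controller.
events.on("model_updated", function () {
  rendered.push(model.exists ? model.repos.length + " repo(s)" : "Not found");
});

// "Controller": updates the model, then announces the change.
// A null repos argument simulates a failed API call.
function handleResponse(user, repos) {
  model.ghubUser = user;
  model.exists = repos !== null;
  model.repos = repos || [];
  events.trigger("model_updated");
}

handleResponse("odersky", [{ name: "dotty" }]); // simulated successful call
handleResponse("junk123456", null);             // simulated failed call
console.log(rendered);
```

Neither side holds a reference to the other: adding a second view is just one more events.on call, and the controller does not change.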
We now have the controller in place. The last step is writing the views. [ 362 ] Chapter 14 Response views If the request fails, we just display Not found in the response div. This part is the easiest to code up, so let's do that first. We define an initialize method that generates the view. The view then listens for the "model_updated" event, which is fired by the controller after it updates the model. Once the initialization is complete, the only way to interact with the response view is through "model_updated" events: // public/javascripts/responseView.js define(["jquery", "model", "events"], function($, model, events) { var failedResponseHtml = "
Not found
" ; function initialize() { events.on("model_updated", function() { if (model.exists) { // success – we will fill this in later. console.log("model exists") } else { // failure – the user entered // is not a valid GitHub login $("#response").html(failedResponseHtml) ; } }) ; } return { "initialize": initialize } ; }); To bootstrap the view, we must call the initialize function from main.js. Just add a dependency on responseView in the require block, and call responseView. initialize(). With these modifications, the final require block in main.js is: // public/javascripts/main.js require(["controller", "responseView"], function(controller, responseView) { controller.initialize(); responseView.initialize() ; }) ; [ 363 ] Visualization with D3 and the Play Framework You can check that this all works by entering junk in the user input to deliberately cause the API request to fail. When the user enters a valid GitHub login name and the API returns a list of repos, we must display those on the screen. We display a table and a pie chart that aggregates the repository sizes by language. We will define the pie chart and the table in two separate modules, called repoGraph.js and repoTable.js. Let's assume those exist for now and that they expose a build method that accepts a model and the name of a div in which to appear. Let's update the code for responseView to accommodate the user entering a valid GitHub user name: // public/javascripts/responseView.js define(["jquery", "model", "events", "repoTable", "repoGraph"], function($, model, events, repoTable, repoGraph) { // HTHML to inject when the model represents a valid user var successfulResponseHtml = "
" + "
" ; // HTML to inject when the model is for a non-existent user var failedResponseHtml = "
Not found
" ; function initialize() { events.on("model_updated", function() { if (model.exists) { $("#response").html(successfulResponseHtml) ; repoTable.build(model, "#response-table") ; repoGraph.build(model, "#response-graph") ; } else { $("#response").html(failedResponseHtml) ; } }) ; } return { "initialize": initialize } ; }); [ 364 ] Chapter 14 Let's walk through what happens in the event of a successful API call. We inject the following bit of HTML in the #response div: var successfulResponseHtml = "
" + "
" ; This adds two HTML divs, one for the table of repositories, and the other for the graph. We use Bootstrap classes to split the response div vertically. Let's now turn our attention to the table view, which needs to expose a single build method, as described in the previous section. We will just display the repositories in an HTML table. We will use Underscore templates to build the table dynamically. Underscore templates work much like string interpolation in Scala: we define a template with placeholders. Let's try this in a browser console: > require(["underscore"], function(_) { var myTemplate = _.template( "Hello, <%= title %> <%= name %>!" ) ; }); This creates a myTemplate function which accepts an object with attributes title and name: > require(["underscore"], function(_) { var myTemplate = _.template( ... ); var person = { title: "Dr.", name: "Odersky" } ; console.log(myTemplate(person)) ; }); Underscore templates thus provide a convenient mechanism for formatting an object as a string. We will create a template for each row in our table, and pass the model for each repository to the template: // public/javascripts/repoTable.js define(["underscore", "jquery"], function(_, $) { // Underscore template for each row var rowTemplate = _.template("" + "<%= name %>" + "<%= language %>" + "<%= size %>" + "") ; // template for the table [ 365 ] Visualization with D3 and the Play Framework var repoTable = _.template( "" + "" + "" + "" + "" + "" + "" + "<%= tbody %>" + "" + "
") ; // table header cells: Name, Language, Size

      // Builds a table for a model
      function build(model, divName) {
        var tbody = "" ;
        _.each(model.repos, function(repo) {
          tbody += rowTemplate(repo) ;
        }) ;
        var table = repoTable({tbody: tbody}) ;
        $(divName).html(table) ;
      }

      return { "build": build } ;
    }) ;

Drawing plots with NVD3

D3 is a library that offers low-level components for building interactive visualizations in JavaScript. By offering the low-level components, it gives a huge degree of flexibility to the developer. The learning curve can, however, be quite steep. In this example, we will use NVD3, a library which provides pre-made graphs for D3. This can greatly speed up initial development. We will place the code in the file repoGraph.js and expose a single method, build, which takes, as arguments, a model and a div and draws a pie chart in that div. The pie chart will aggregate language use across all the user's repositories.

The code for generating a pie chart is nearly identical to the example given in the NVD3 documentation, available at http://nvd3.org/examples/pie.html. The data passed to the graph must be available as an array of objects. Each object must contain a label field and a size field. The label field identifies the language, and the size field is the total size of all the repositories for that user written in that language. The following would be a valid data array:

    [
      { label: "Scala", size: 1234 },
      { label: "Python", size: 4567 }
    ]

To get the data in this format, we must aggregate sizes across the repositories written in a particular language in our model. We write the generateDataFromModel function to transform the repos array in the model to an array suitable for NVD3. The crux of the aggregation is performed by a call to Underscore's groupBy method, to group repositories by language. This method works much like Scala's groupBy method.
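For readers more at home in Scala than in JavaScript, the same group-then-sum aggregation can be sketched with the standard collections. This is only an illustration of the idea, not part of the application: the Repo and Slice case classes and the generateData name are hypothetical stand-ins for the entries of model.repos and for the objects NVD3 expects.

```scala
object RepoAggregation {
  // Hypothetical stand-in for one entry of model.repos
  final case class Repo(name: String, language: String, size: Long)

  // Hypothetical stand-in for one { label, size } object passed to the chart
  final case class Slice(label: String, size: Long)

  // Group the repositories by language, then sum the sizes within each group.
  // The result is sorted by label only to make the output order deterministic.
  def generateData(repos: Seq[Repo]): Seq[Slice] =
    repos
      .groupBy(_.language)
      .map { case (language, inLanguage) =>
        Slice(language, inLanguage.map(_.size).sum)
      }
      .toSeq
      .sortBy(_.label)
}
```

The groupBy call plays the role of _.groupBy in the JavaScript version, and the map over the resulting (language, repositories) pairs corresponds to the _.map and _.reduce steps.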
With this in mind, the generateDataFromModel function is:

    // public/javascripts/repoGraph.js

    define(["underscore", "d3", "nvd3"],
    function(_, d3, nv) {

      // Aggregate the repo size by language.
      // Returns an array of objects like:
      // [ { label: "Scala", size: 1245},
      //   { label: "Python", size: 432 } ]
      function generateDataFromModel(model) {

        // Build an initial object mapping each
        // language to the repositories written in it
        var language2Repos = _.groupBy(model.repos,
          function(repo) { return repo.language ; }) ;

        // Map each { "language": [ list of repos ], ...}
        // pairs to a single document { "language": totalSize }
        // where totalSize is the sum of the individual repos.
        var plotObjects = _.map(language2Repos,
          function(repos, language) {
            var sizes = _.map(repos,
              function(repo) { return repo.size; });
            // Sum over the sizes using 'reduce'
            var totalSize = _.reduce(sizes,
              function(memo, size) { return memo + size; }, 0) ;
            return { label: language, size: totalSize } ;
          }) ;

        return plotObjects;
      }

We can now build the pie chart, using NVD3's addGraph method:

      // Build the chart.
      function build(model, divName) {
        var transformedModel = generateDataFromModel(model) ;
        nv.addGraph(function() {
          var height = 350;
          var width = 350;
          var chart = nv.models.pieChart()
            .x(function (d) { return d.label ; })
            .y(function (d) { return d.size ;})
            .width(width)
            .height(height) ;

          d3.select(divName).append("svg")
            .datum(transformedModel)
            .transition()
            .duration(350)
            .attr('width', width)
            .attr('height', height)
            .call(chart) ;

          return chart ;
        });
      }

      return { "build" : build } ;
    });

This was the last component of our application. Point your browser to 127.0.0.1:9000 and you should see the application running. Congratulations! We have built a fully-functioning single-page web application.

Summary

In this chapter, we learned how to write a fully-featured web application with the Play framework. Congratulations on making it this far.
Building web applications is likely to push many data scientists beyond their comfort zone, but knowing enough about the web to build basic applications will allow you to share your results in a compelling, engaging manner, as well as facilitate communication with software engineers and web developers.

This concludes our whistle-stop tour of Scala libraries. Over the course of this book, we have learned how to tackle linear algebra and optimization problems efficiently using Breeze, how to insert and query data in SQL databases in a functional manner, and both how to interact with web APIs and how to create them. We have reviewed some of the tools available to the data scientist for writing concurrent or parallel applications, from parallel collections and futures to Spark via Akka. We have seen how pervasive these constructs are in Scala libraries, from futures in the Play framework to Akka as the backbone of Spark. If you have read this far, pat yourself on the back.

This book gives you the briefest of introductions to the libraries it covers, hopefully just enough to give you a taste of what each tool is good for, what you could accomplish with it, and how it fits in the wider Scala ecosystem. If you decide to use any of these in your data science pipeline, you will need to read the documentation in more detail, or a more complete reference book. The references listed at the end of each chapter should provide a good starting point.

Both Scala and data science are evolving rapidly. Do not stay wedded to a particular toolkit or concept. Remain on top of current developments and, above all, remain pragmatic: find the right tool for the right job. Scala and the libraries discussed here will often be that tool, but not always: sometimes, a shell command or a short Python script will be more effective. Remember also that programming skills are but one aspect of the data scientist's body of knowledge.
Even if you want to specialize in the engineering side of data science, learn about the problem domain and the mathematical underpinnings of machine learning.

Most importantly, if you have taken the time to read this book, it is likely that you view programming and data science as more than a day job. Coding in Scala can be satisfying and rewarding, so have fun and be awesome!

References

There are thousands of HTML and CSS tutorials dotted around the web. A simple Google search will give you a much better idea of the resources available than any list of references I can provide.

Mike Bostock's website has a wealth of beautiful D3 visualizations: http://bost.ocks.org/mike/. To understand a bit more about D3, I recommend Scott Murray's Interactive Data Visualization for the Web.

You may also wish to consult the references given in the previous chapter for reference books on the Play framework and designing REST APIs.

Pattern Matching and Extractors

Pattern matching is a powerful tool for control flow in Scala. It is often underused and under-estimated by people coming to Scala from imperative languages.

Let's start with a few examples of pattern matching before diving into the theory. We start by defining a tuple:

    scala> val names = ("Pascal", "Bugnion")
    names: (String, String) = (Pascal,Bugnion)

We can use pattern matching to extract the elements of this tuple and bind them to variables:

    scala> val (firstName, lastName) = names
    firstName: String = Pascal
    lastName: String = Bugnion

We just extracted the two elements of the names tuple, binding them to the variables firstName and lastName. Notice how the left-hand side defines a pattern that the right-hand side must match: we are declaring that the variable names must be a two-element tuple.
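As a small side-by-side sketch of what the pattern buys us, the snippet below binds the same two values once with a tuple pattern and once by position. The object name and the firstByPosition and lastByPosition names are purely illustrative.

```scala
object TupleDestructuring {
  val names: (String, String) = ("Pascal", "Bugnion")

  // With a pattern: the shape of the tuple and both bindings
  // are stated in a single declaration.
  val (firstName, lastName) = names

  // Without a pattern: each element is pulled out by position.
  val firstByPosition: String = names._1
  val lastByPosition: String = names._2
}
```

Both forms yield the same values; the pattern version simply states the expected shape and the bindings together.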
To make the pattern more specific, we could also have specified the expected types of the elements in the tuple:

    scala> val (firstName:String, lastName:String) = names
    firstName: String = Pascal
    lastName: String = Bugnion

What happens if the pattern on the left-hand side does not match the right-hand side?

    scala> val (firstName, middleName, lastName) = names
    <console>:13: error: constructor cannot be instantiated to expected type;
     found   : (T1, T2, T3)
     required: (String, String)
           val (firstName, middleName, lastName) = names

This results in a compile error. Other types of pattern matching failures result in runtime errors.

Pattern matching is very expressive. To achieve the same behavior without pattern matching, you would have to do the following explicitly:

• Verify that the variable names is a two-element tuple
• Extract the first element and bind it to firstName
• Extract the second element and bind it to lastName

If we expect certain elements in the tuple to have specific values, we can verify this as part of the pattern match. For instance, we can verify that the first element of the names tuple matches "Pascal":

    scala> val ("Pascal", lastName) = names
    lastName: String = Bugnion

Besides tuples, we can also match on Scala collections:

    scala> val point = Array(1, 2, 3)
    point: Array[Int] = Array(1, 2, 3)

    scala> val Array(x, y, z) = point
    x: Int = 1
    y: Int = 2
    z: Int = 3

Notice the similarity between this pattern matching and array construction:

    scala> val point = Array(x, y, z)
    point: Array[Int] = Array(1, 2, 3)

Syntactically, Scala expresses pattern matching as the reverse process to instance construction. We can think of pattern matching as the deconstruction of an object, binding the object's constituent parts to variables.

When matching against collections, one is sometimes only interested in matching the first element, or the first few elements, and discarding the rest of the collection, whatever its length.
The operator _* will match against any number of elements:

    scala> val Array(x, _*) = point
    x: Int = 1

By default, the part of the pattern matched by the _* operator is not bound to a variable. We can capture it as follows:

    scala> val Array(x, xs @ _*) = point
    x: Int = 1
    xs: Seq[Int] = Vector(2, 3)

Besides tuples and collections, we can also match against case classes. Let's start by defining a case class representing a name:

    scala> case class Name(first: String, last: String)
    defined class Name

    scala> val name = Name("Martin", "Odersky")
    name: Name = Name(Martin,Odersky)

We can match against instances of Name in much the same way we matched against tuples:

    scala> val Name(firstName, lastName) = name
    firstName: String = Martin
    lastName: String = Odersky

All these patterns can also be used in match statements:

    scala> def greet(name:Name) = name match {
      case Name("Martin", "Odersky") => "An honor to meet you"
      case Name(first, "Bugnion") => "Wow! A family member!"
      case Name(first, last) => s"Hello, $first"
    }
    greet: (name: Name)String

Pattern matching in for comprehensions

Pattern matching is useful in for comprehensions for extracting items from a collection that match a specific pattern. Let's build a collection of Name instances:

    scala> val names = List(Name("Martin", "Odersky"),
      Name("Derek", "Wyatt"))
    names: List[Name] = List(Name(Martin,Odersky), Name(Derek,Wyatt))

We can use pattern matching to extract the internals of the class in a for-comprehension:

    scala> for { Name(first, last) <- names } yield first
    List[String] = List(Martin, Derek)

So far, nothing terribly ground-breaking. But what if we wanted to extract the surname of everyone whose first name is "Martin"?

    scala> for { Name("Martin", last) <- names } yield last
    List[String] = List(Odersky)

Writing Name("Martin", last) <- names extracts the elements of names that match the pattern.
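One way to build intuition for this filtering behavior is to compare it with collect, which applies a partial function and keeps only the elements on which it is defined. The sketch below assumes the Scala 2 semantics used throughout this book; newer Scala 3 compilers may warn about a refutable pattern in a generator unless it is written with a leading case. The object and value names are illustrative.

```scala
object ForPatternSketch {
  final case class Name(first: String, last: String)

  val names = List(Name("Martin", "Odersky"), Name("Derek", "Wyatt"))

  // Pattern in the generator: elements that do not match are skipped
  // rather than raising a MatchError.
  val viaFor: List[String] =
    for { Name("Martin", last) <- names } yield last

  // The same filter-and-extract step written with collect
  // and a partial function.
  val viaCollect: List[String] =
    names.collect { case Name("Martin", last) => last }
}
```

Both expressions keep only the entries whose first name is "Martin" and yield their surnames.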
You might think that this is a contrived example, and it is, but the examples in Chapter 7, Web APIs demonstrate the usefulness and versatility of this language pattern, for instance, for extracting specific fields from JSON objects.

Pattern matching internals

If you define a case class, as we saw with Name, you get pattern matching against the constructor for free. You should be using case classes to represent your data as much as possible, thus reducing the need to implement your own pattern matching. It is nevertheless useful to understand how pattern matching works.

When you create a case class, Scala automatically builds a companion object:

    scala> case class Name(first: String, last: String)
    defined class Name

    scala> Name.<tab>
    apply          asInstanceOf   curried        isInstanceOf
    toString       tupled         unapply

The method used (internally) for pattern matching is unapply. This method takes, as argument, an object and returns Option[T], where T is a tuple of the values of the case class.

    scala> val name = Name("Martin", "Odersky")
    name: Name = Name(Martin,Odersky)

    scala> Name.unapply(name)
    Option[(String, String)] = Some((Martin,Odersky))

The unapply method is an extractor. It plays the opposite role of the constructor: it takes an object and extracts the list of parameters needed to construct that object. When you write val Name(firstName, lastName), or when you use Name as a case in a match statement, Scala calls Name.unapply on what you are matching against. A value of Some[(String, String)] implies a pattern match, while a value of None implies that the pattern fails.

To write custom extractors, you just need an object with an unapply method. While unapply normally resides in the companion object of a class that you are deconstructing, this need not be the case. In fact, it does not need to correspond to an existing class at all.
For instance, let's define a NonZeroDouble extractor that matches any non-zero double:

    scala> object NonZeroDouble {
      def unapply(d:Double):Option[Double] = {
        if (d == 0.0) { None } else { Some(d) }
      }
    }
    defined object NonZeroDouble

    scala> val NonZeroDouble(denominator) = 5.5
    denominator: Double = 5.5

    scala> val NonZeroDouble(denominator) = 0.0
    scala.MatchError: 0.0 (of class java.lang.Double)
      ... 43 elided

We defined an extractor for NonZeroDouble, despite the absence of a corresponding NonZeroDouble class.

This NonZeroDouble extractor would be useful in a match expression. For instance, let's define a safeDivision function that returns a default value when the denominator is zero:

    scala> def safeDivision(numerator:Double,
      denominator:Double, fallBack:Double) =
        denominator match {
          case NonZeroDouble(d) => numerator / d
          case _ => fallBack
        }
    safeDivision: (numerator: Double, denominator: Double, fallBack: Double)Double

    scala> safeDivision(5.0, 2.0, 100.0)
    Double = 2.5

    scala> safeDivision(5.0, 0.0, 100.0)
    Double = 100.0

This is a trivial example because the NonZeroDouble.unapply method is so simple, but you can hopefully see how useful and expressive extractors become when the test is more complex. Defining custom extractors lets you define powerful control flow constructs to leverage match statements. More importantly, they enable the client using the extractors to think about control flow declaratively: the client can declare that they need a NonZeroDouble, rather than instructing the compiler to check whether the value is zero.

Extracting sequences

The previous section explains extraction from case classes, and how to write custom extractors, but it does not explain how extraction works on sequences:

    scala> val Array(a, b) = Array(1, 2)
    a: Int = 1
    b: Int = 2

Rather than relying on an unapply method, sequences rely on an unapplySeq method defined in the companion object.
This is expected to return an Option[Seq[A]]:

    scala> Array.unapplySeq(Array(1, 2))
    Option[IndexedSeq[Int]] = Some(Vector(1, 2))

Let's write an example. We will write an extractor for Breeze vectors (which do not currently support pattern matching). To avoid clashing with the DenseVector companion object, we will write our unapplySeq in a separate object, called DV. All our unapplySeq method needs to do is convert its argument to a Scala Vector instance. To avoid muddying the concepts with generics, we will write this implementation for DenseVector[Double] only:

    scala> import breeze.linalg._
    import breeze.linalg._

    scala> object DV {
      // Just need to convert to a Scala vector.
      def unapplySeq(v:DenseVector[Double]) = Some(v.toScalaVector)
    }
    defined object DV

Let's try our new extractor implementation:

    scala> val vec = DenseVector(1.0, 2.0, 3.0)
    vec: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 2.0, 3.0)

    scala> val DV(x, y, z) = vec
    x: Double = 1.0
    y: Double = 2.0
    z: Double = 3.0

Summary

Pattern matching is a powerful tool for control flow. It encourages the programmer to think declaratively: declare that you expect a variable to match a certain pattern, rather than explicitly tell the computer how to check that it matches this pattern. This can save many lines of code and enhance clarity.

Reference

For an overview of pattern matching in Scala, there is no better reference than Programming in Scala, by Martin Odersky, Bill Venners, and Lex Spoon. An online version of the first edition is available at: https://www.artima.com/pins1ed/case-classes-and-pattern-matching.html.

Daniel Westheide's blog covers slightly more advanced Scala constructs, and is a very useful read: http://danielwestheide.com/blog/2012/11/21/the-neophytes-guide-to-scala-part-1-extractors.html.
Module 2

Scala Data Analysis Cookbook

Navigate the world of data analysis, visualization, and machine learning with over 100 hands-on Scala recipes

1
Getting Started with Breeze

In this chapter, we will cover the following recipes:

• Getting Breeze: the linear algebra library
• Working with vectors
• Working with matrices
• Vectors and matrices with randomly distributed values
• Reading and writing CSV files

Introduction

This chapter gives you a quick overview of one of the most popular data analysis libraries in Scala, how to get it, and its most frequently used functions and data structures.

We will be focusing on Breeze in this first chapter, which is one of the most popular and powerful linear algebra libraries. Spark MLlib, which we will be seeing in the subsequent chapters, builds on top of Breeze and Spark, and provides a powerful framework for scalable machine learning.

Getting Breeze: the linear algebra library

In simple terms, Breeze (http://www.scalanlp.org) is a Scala library that extends the Scala collection library to provide support for vectors and matrices, along with a large set of functions for manipulating them. We could safely compare Breeze to NumPy (http://www.numpy.org/) in Python terms. Breeze forms the foundation of MLlib, the machine learning library in Spark, which we will explore in later chapters.

In this first recipe, we will see how to pull the Breeze libraries into our project using Scala Build Tool (SBT). We will also see a brief history of Breeze to better appreciate why it could be considered as the "go to" linear algebra library in Scala.

For all our recipes, we will be using Scala 2.10.4 along with Java 1.7. I wrote the examples using the Scala IDE, but please feel free to use your favorite IDE.

How to do it...

Let's add the Breeze dependencies into our build.sbt so that we can start playing with them in the subsequent recipes.
The Breeze dependencies are just two: breeze (the core library) and breeze-natives.

1. Under a brand new folder (which will be our project root), create a new file called build.sbt.

2. Next, add the breeze libraries to the project dependencies:

    organization := "com.packt"

    name := "chapter1-breeze"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.scalanlp" %% "breeze" % "0.11.2",
      // Optional - the 'why' is explained in the There's more section
      "org.scalanlp" %% "breeze-natives" % "0.11.2"
    )

3. From that folder, issue a sbt compile command in order to fetch all your dependencies.

You could import the project into Eclipse using sbt eclipse after installing the sbteclipse plugin (https://github.com/typesafehub/sbteclipse/). For IntelliJ IDEA, you just need to import the project by pointing to the root folder where your build.sbt file is.

There's more...

Let's look into the details of what the breeze and breeze-natives library dependencies we added bring to us.

The org.scalanlp.breeze dependency

Breeze has a long history in that it isn't written from scratch in Scala. Without the native dependency, Breeze leverages the power of netlib-java, which has a Java-compiled version of the FORTRAN reference implementation of BLAS/LAPACK. netlib-java also provides gentle wrappers over the Java-compiled library. This means that we can still work without the native dependency, but the performance won't be great: the best we can get out of this FORTRAN-translated library is the performance of the FORTRAN reference implementation itself. For serious number crunching with the best performance, we should add the breeze-natives dependency too.

The org.scalanlp.breeze-natives package

With the natives dependency added, Breeze looks for the machine-specific implementations of the BLAS/LAPACK libraries.
The good news is that there are open source and (vendor provided) commercial implementations for most popular processors and GPUs. The most popular open source implementations include ATLAS (http://math-atlas.sourceforge.net) and OpenBLAS (http://www.openblas.net/).

If you are running a Mac, you are in luck: native BLAS libraries come out of the box on Macs. Installing native BLAS on Ubuntu / Debian involves just running the following commands:

    sudo apt-get install libatlas3-base libopenblas-base
    sudo update-alternatives --config libblas.so.3
    sudo update-alternatives --config liblapack.so.3

For Windows, please refer to the installation instructions on https://github.com/xianyi/OpenBLAS/wiki/Installation-Guide.

Working with vectors

There are subtle yet powerful differences between Breeze vectors and Scala's own scala.collection.Vector. As we'll see in this recipe, Breeze vectors have a lot of functions that are linear algebra specific. The more important thing to note is that Breeze's vector is a Scala wrapper over netlib-java, and most calls to the vector's API are delegated to it.

Vectors are one of the core components in Breeze. They are containers of homogeneous data. In this recipe, we'll first see how to create vectors and then move on to various data manipulation functions to modify those vectors.
This recipe has been organized in the form of the following sub-recipes:

• Creating vectors:
    - Creating a vector from values
    - Creating a zero vector
    - Creating a vector out of a function
    - Creating a vector of linearly spaced values
    - Creating a vector with values in a specific range
    - Creating an entire vector with a single value
    - Slicing a sub-vector from a bigger vector
    - Creating a Breeze vector from a Scala vector
• Vector arithmetic:
    - Scalar operations
    - Calculating the dot product of a vector
    - Creating a new vector by adding two vectors together
• Appending vectors and converting a vector of one type to another:
    - Concatenating two vectors
    - Converting a vector of int to a vector of double
• Computing basic statistics:
    - Mean and variance
    - Standard deviation
    - Find the largest value
    - Finding the sum, square root and log of all the values in the vector

Getting ready

In order to run the code, you could either use the Scala REPL or use the Worksheet feature available in the Eclipse Scala plugin (or Scala IDE) or in IntelliJ IDEA. These options are suggested because of their quick turnaround time.

How to do it...

Let's look at each of the above sub-recipes in detail. For easier reference, the output of the respective command is shown as well. All the classes that are being used in this recipe are from the breeze.linalg package. So, an "import breeze.linalg._" statement at the top of your file would be perfect.

Creating vectors

Let's look at the various ways we could construct vectors. Most of these construction mechanisms are through the apply method of the vector. There are two different flavors of vector, breeze.linalg.DenseVector and breeze.linalg.SparseVector, and the choice between them depends on the use case. The general rule of thumb is that if you have data that is at least 20 percent zeroes, you are better off choosing SparseVector, although the 20 percent threshold itself varies with the use case.
Constructing a vector from values

• Creating a dense vector from values: Creating a DenseVector from values is just a matter of passing the values to the apply method:

    val dense=DenseVector(1,2,3,4,5)
    println (dense) //DenseVector(1, 2, 3, 4, 5)

• Creating a sparse vector from values: Creating a SparseVector from values is also through passing the values to the apply method:

    val sparse=SparseVector(0.0, 1.0, 0.0, 2.0, 0.0)
    println (sparse) //SparseVector((0,0.0), (1,1.0), (2,0.0), (3,2.0), (4,0.0))

Notice how the SparseVector stores values against the index.

Obviously, there are simpler ways to create a vector instead of just throwing all the data into its apply method.

Creating a zero vector

Calling the vector's zeros function would create a zero vector. While the numeric types would return a 0, the object types would return null and the Boolean types would return false:

    val denseZeros=DenseVector.zeros[Double](5) //DenseVector(0.0, 0.0, 0.0, 0.0, 0.0)

    val sparseZeros=SparseVector.zeros[Double](5) //SparseVector()

Not surprisingly, the SparseVector does not allocate any memory for the contents of the vector. However, the creation of the SparseVector object itself is accounted for in the memory.

Creating a vector out of a function

The tabulate function in vector is an interesting and useful function. It accepts a size argument just like the zeros function but it also accepts a function that we could use to populate the values for the vector. The function could be anything ranging from a random number generator to a naïve index based generator, which we have implemented here.
Notice how the return value of the function (Int) could be converted into a vector of Double by using the type parameter:

    val denseTabulate=DenseVector.tabulate[Double](5)(index=>index*index) //DenseVector(0.0, 1.0, 4.0, 9.0, 16.0)

Creating a vector of linearly spaced values

The linspace function in breeze.linalg creates a new Vector[Double] of linearly spaced values between two arbitrary numbers. Not surprisingly, it accepts three arguments: the start, the end, and the total number of values that we would like to generate. Please note that both the start and the end values are included in the generated vector:

    val spaceVector=breeze.linalg.linspace(2, 10, 5)
    //DenseVector(2.0, 4.0, 6.0, 8.0, 10.0)

Creating a vector with values in a specific range

The range function in a vector has two variants. The plain vanilla function accepts a start and end value (start inclusive):

    val allNosTill10=DenseVector.range(0, 10)
    //DenseVector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

The other variant is an overloaded function that accepts a "step" value:

    val evenNosTill20=DenseVector.range(0, 20, 2)
    // DenseVector(0, 2, 4, 6, 8, 10, 12, 14, 16, 18)

Just like the range function, which has all the arguments as integers, there is also a rangeD function that takes the start, stop, and the step parameters as Double:

    val rangeD=DenseVector.rangeD(0.5, 20, 2.5)
    // DenseVector(0.5, 3.0, 5.5, 8.0, 10.5, 13.0, 15.5)

Creating an entire vector with a single value

Filling an entire vector with the same value is child's play. We just say how big the vector is going to be and what value it should hold. That's it.

    val denseJust2s=DenseVector.fill(10, 2)
    // DenseVector(2, 2, 2, 2, 2, 2, 2, 2, 2, 2)

Slicing a sub-vector from a bigger vector

Choosing a part of the vector from a previous vector is just a matter of calling the slice method on the bigger vector. The parameters to be passed are the start index, end index, and an optional "step" parameter.
The step parameter adds the step value for every iteration until it reaches the end index. Note that the end index is excluded in the sub-vector:

    val allNosTill10=DenseVector.range(0, 10)
    //DenseVector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

    val fourThroughSevenIndexVector= allNosTill10.slice(4, 7)
    //DenseVector(4, 5, 6)

    val twoThroughNineSkip2IndexVector= allNosTill10.slice(2, 9, 2)
    //DenseVector(2, 4, 6)

Creating a Breeze Vector from a Scala Vector

A Breeze vector object's apply method could even accept a Scala Vector as a parameter and construct a vector out of it:

    val vectFromArray=DenseVector(collection.immutable.Vector(1,2,3,4))
    // DenseVector(Vector(1, 2, 3, 4))

Note that, as the printed output shows, the result holds the Scala Vector as its single element. To obtain one Breeze element per value, convert the collection to an array first, for example DenseVector(Vector(1,2,3,4).toArray).

Vector arithmetic

Now let's look at the basic arithmetic that we could do on vectors with scalars and vectors.

Scalar operations

Operations with scalars work just as we would expect, propagating the value to each element in the vector.

Adding a scalar to each element of the vector is done using the + function (surprise!):

    val inPlaceValueAddition=evenNosTill20 +2
    //DenseVector(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

Similarly, the other basic arithmetic operations (subtraction, multiplication, and division) involve calling the respective functions named after the universally accepted symbols (-, *, and /):

    //Scalar subtraction
    val inPlaceValueSubtraction=evenNosTill20 -2
    //DenseVector(-2, 0, 2, 4, 6, 8, 10, 12, 14, 16)

    //Scalar multiplication
    val inPlaceValueMultiplication=evenNosTill20 *2
    //DenseVector(0, 4, 8, 12, 16, 20, 24, 28, 32, 36)

    //Scalar division
    val inPlaceValueDivision=evenNosTill20 /2
    //DenseVector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

Calculating the dot product of two vectors

Each vector object has a function called dot, which accepts another vector of the same length as a parameter.
Let's fill in just 2s to a new vector of length 5:

    val justFive2s=DenseVector.fill(5, 2)
    //DenseVector(2, 2, 2, 2, 2)

We'll create another vector from 0 to 5 with a step value of 1 (a fancy way of saying 0 through 4):

    val zeroThrough4=DenseVector.range(0, 5, 1)
    //DenseVector(0, 1, 2, 3, 4)

Here's the dot function:

    val dotVector=zeroThrough4.dot(justFive2s)
    //Int = 20

As you would expect, the function complains if we pass in a vector of a different length as a parameter to the dot product: Breeze throws an IllegalArgumentException. The full exception message is:

    java.lang.IllegalArgumentException: Vectors must be the same length!

Creating a new vector by adding two vectors together

The + function is overloaded to accept a vector as well as the scalar we saw previously. The operation does a corresponding element-by-element addition and creates a new vector:

    val evenNosTill20=DenseVector.range(0, 20, 2)
    //DenseVector(0, 2, 4, 6, 8, 10, 12, 14, 16, 18)

    val denseJust2s=DenseVector.fill(10, 2)
    //DenseVector(2, 2, 2, 2, 2, 2, 2, 2, 2, 2)

    val additionVector=evenNosTill20 + denseJust2s
    // DenseVector(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

There's an interesting behavior encapsulated in the addition though. Assuming you try to add two vectors of different lengths, if the first vector is smaller and the second vector larger, the resulting vector would be the size of the first vector and the rest of the elements in the second vector would be ignored!
    val fiveLength=DenseVector(1,2,3,4,5)
    //DenseVector(1, 2, 3, 4, 5)

    val tenLength=DenseVector.fill(10, 20)
    //DenseVector(20, 20, 20, 20, 20, 20, 20, 20, 20, 20)

    fiveLength+tenLength
    //DenseVector(21, 22, 23, 24, 25)

On the other hand, if the first vector is larger and the second vector smaller, it would result in an ArrayIndexOutOfBoundsException:

    tenLength+fiveLength
    // java.lang.ArrayIndexOutOfBoundsException: 5

Appending vectors and converting a vector of one type to another

Let's briefly see how to append two vectors and convert vectors of one numeric type to another.

Concatenating two vectors

There are two variants of concatenation. There is a vertcat function that just vertically concatenates an arbitrary number of vectors; the size of the result is the sum of the sizes of all the vectors combined:

    val justFive2s=DenseVector.fill(5, 2)
    //DenseVector(2, 2, 2, 2, 2)

    val zeroThrough4=DenseVector.range(0, 5, 1)
    //DenseVector(0, 1, 2, 3, 4)

    val concatVector=DenseVector.vertcat(zeroThrough4, justFive2s)
    //DenseVector(0, 1, 2, 3, 4, 2, 2, 2, 2, 2)

No surprise here. There is also the horzcat method that places the second vector horizontally next to the first vector, thus forming a matrix.

    val concatVector1=DenseVector.horzcat(zeroThrough4, justFive2s)
    //breeze.linalg.DenseMatrix[Int]
    0  2
    1  2
    2  2
    3  2
    4  2

While dealing with vectors of different length, the vertcat function happily arranges the second vector at the bottom of the first vector. Not surprisingly, the horzcat function throws an exception: java.lang.IllegalArgumentException, meaning all vectors must be of the same size!

Converting a vector of Int to a vector of Double

The conversion of one type of vector into another is not automatic in Breeze.
However, there is a simple way to achieve this:

    val evenNosTill20Double=breeze.linalg.convert(evenNosTill20, Double)

Computing basic statistics

Other than the creation and the arithmetic operations that we saw previously, there are some interesting summary statistics operations that are available in the library. Let's look at them now.

This needs imports of breeze.linalg._, breeze.numerics._, and breeze.stats._. The operations in the Other operations section aim to simulate NumPy's UFuncs, or universal functions.

Now, let's briefly look at how to calculate some basic summary statistics for a vector.

Mean and variance

The mean and variance of a vector can be calculated by calling the meanAndVariance universal function in the breeze.stats package. Note that this needs a vector of Double:

    meanAndVariance(evenNosTill20Double)
    //MeanAndVariance(9.0,36.666666666666664,10)

As you may have guessed, converting an Int vector to a Double vector and calculating the mean and variance for that vector could be merged into a one-liner:

    meanAndVariance(convert(evenNosTill20, Double))

Standard deviation

Calling stddev on a Double vector gives the standard deviation:

    stddev(evenNosTill20Double)
    //Double = 6.0553007081949835

Finding the largest value in a vector

The max universal function inside the breeze.linalg package helps us find the maximum value in a vector:

    val intMaxOfVectorVals=max(evenNosTill20)
    //18

Finding the sum, square root and log of all the values in the vector

The same as with max, the sum universal function inside the breeze.linalg package calculates the sum of the vector:

    val intSumOfVectorVals=sum(evenNosTill20)
    //90

The functions sqrt, log, and various other universal functions in the breeze.numerics package calculate the square root and (natural) log values of all the individual elements inside the vector.

The sqrt function

    val sqrtOfVectorVals=sqrt(evenNosTill20)
    // DenseVector(0.0, 1.4142135623730951, 2.0, 2.449489742783178, 2.8284271247461903, 3.1622776601683795, 3.4641016151377544, 3.7416573867739413, 4.0, 4.242640687119285)

The log function

    val log2VectorVals=log(evenNosTill20)
    // DenseVector(-Infinity, 0.6931471805599453, 1.3862943611198906, 1.791759469228055, 2.0794415416798357, 2.302585092994046, 2.4849066497880004, 2.6390573296152584, 2.772588722239781, 2.8903717578961645)

Working with matrices

As we discussed in the Working with vectors recipe, you could use the Eclipse or IntelliJ IDEA Scala worksheets for a faster turnaround time.

How to do it...

There are a variety of functions available on a matrix. In this recipe, we will look at some details around:

- Creating matrices:
    - Creating a matrix from values
    - Creating a zero matrix
    - Creating a matrix out of a function
    - Creating an identity matrix
    - Creating a matrix from random numbers
    - Creating from a Scala collection
- Matrix arithmetic:
    - Addition
    - Multiplication (also element-wise)
- Appending and conversion:
    - Concatenating a matrix vertically
    - Concatenating a matrix horizontally
    - Converting a matrix of Int to a matrix of Double
- Data manipulation operations:
    - Getting column vectors
    - Getting row vectors
    - Getting values inside the matrix
    - Getting the inverse and transpose of a matrix
- Computing basic statistics:
    - Mean and variance
    - Standard deviation
    - Finding the largest value
    - Finding the sum, square root and log of all the values in the matrix
    - Calculating the eigenvectors and eigenvalues of a matrix

Creating matrices

Let's first see how to create a matrix.

Creating a matrix from values

The simplest way to create a matrix is to pass in the values in a row-wise fashion into the apply function of the matrix object:

    val simpleMatrix=DenseMatrix((1,2,3),(11,12,13),(21,22,23))
    //Returns a DenseMatrix[Int]
    1   2   3
    11  12  13
    21  22  23

There's also a sparse version of the matrix, the Compressed Sparse Column Matrix (CSCMatrix):

    val sparseMatrix=CSCMatrix((1,0,0),(11,0,0),(0,0,23))
    //Returns a CSCMatrix[Int]
    (0,0) 1
    (1,0) 11
    (2,2) 23

Breeze's CSCMatrix stores its data in the compressed sparse column format; the printed output simply lists each non-zero value against its (row, column) position.

Creating a zero matrix

Creating a zero matrix is just a matter of calling the matrix's zeros function. The first integer parameter indicates the rows and the second parameter indicates the columns:

    val denseZeros=DenseMatrix.zeros[Double](5,4)
    //Returns a DenseMatrix[Double]
    0.0  0.0  0.0  0.0
    0.0  0.0  0.0  0.0
    0.0  0.0  0.0  0.0
    0.0  0.0  0.0  0.0
    0.0  0.0  0.0  0.0

    val compressedSparseMatrix=CSCMatrix.zeros[Double](5,4)
    //Returns a CSCMatrix[Double] = 5 x 4 CSCMatrix

Notice how the sparse matrix doesn't allocate any memory for the values in the zero value matrix.

Creating a matrix out of a function

The tabulate function in a matrix is very similar to the vector's version. It accepts the row and column sizes (5 and 4 in the example). It also accepts a function that we could use to populate the values for the matrix. In our example, we generated the values of the matrix by just multiplying the row and column index:

    val denseTabulate=DenseMatrix.tabulate[Double](5,4)((firstIdx,secondIdx)=>firstIdx*secondIdx)
    //Returns a DenseMatrix[Double] =
    0.0  0.0  0.0  0.0
    0.0  1.0  2.0  3.0
    0.0  2.0  4.0  6.0
    0.0  3.0  6.0  9.0
    0.0  4.0  8.0  12.0

The type parameter is needed only if you would like to convert the type of the matrix from an Int to a Double.
So, the following call without the parameter would just return an Int matrix:

    val denseTabulate=DenseMatrix.tabulate(5,4)((firstIdx,secondIdx)=>firstIdx*secondIdx)
    0  0  0  0
    0  1  2  3
    0  2  4  6
    0  3  6  9
    0  4  8  12

Creating an identity matrix

The eye function of the matrix generates an identity square matrix with the given dimension (in the example's case, 3):

    val identityMatrix=DenseMatrix.eye[Int](3)
    //Returns a DenseMatrix[Int]
    1  0  0
    0  1  0
    0  0  1

Creating a matrix from random numbers

The rand function in the matrix generates a matrix of a given dimension (4 rows * 4 columns in our case) with random values between 0 and 1. We'll have an in-depth look into random number generated vectors and matrices in a subsequent recipe.

    val randomMatrix=DenseMatrix.rand(4, 4)
    //Returns DenseMatrix[Double]
    0.09762565779429777   0.19428193961985674  0.01089176285376725  0.2660579009292807
    0.9662568115400412    0.3957540854393169   0.718377391997945    0.8230367668470933
    0.9080090988364429    0.26722019105654415  0.7697780247035393   0.49887760321635066
    3.326843165250004E-4  0.7682752255172411   0.447925644082819    0.8195838733418965

Creating from a Scala collection

We could create a matrix out of a Scala array too. The constructor of the matrix accepts three arguments: the rows, the columns, and an array with values for the dimensions. Note that the data from the array is picked up to construct the matrix in column-first order:

    val vectFromArray=new DenseMatrix(2,2,Array(2,3,4,5))
    //Returns DenseMatrix[Int]
    2  4
    3  5

If there are more values than the number of values required by the dimensions of the matrix, the rest of the values are ignored.
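
To make the column-first order concrete, here is a minimal plain-Scala sketch of the indexing rule. It does not use Breeze, and ColumnMajorSketch and toRows are hypothetical names introduced only for this illustration; the assumption is that element (row, column) is read from position column * rows + row of the flat array, which is consistent with the output shown above.

```scala
// Illustration only: a plain-Scala model of column-first layout, not Breeze's implementation.
object ColumnMajorSketch {
  // Returns the matrix as a sequence of rows, reading `data` column by column.
  // Values beyond rows * cols are never read, so extra trailing values are ignored.
  def toRows(rows: Int, cols: Int, data: Array[Int]): Vector[Vector[Int]] =
    Vector.tabulate(rows, cols)((r, c) => data(c * rows + r))
}
```

Calling ColumnMajorSketch.toRows(2, 2, Array(2,3,4,5)) yields the rows (2, 4) and (3, 5), and appending extra values to the array does not change the result.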
Note how (6,7) is ignored in the array:

    val vectFromArray=new DenseMatrix(2,2,Array(2,3,4,5,6,7))
    //DenseMatrix[Int]
    2  4
    3  5

However, if fewer values are present in the array than what is required by the dimensions of the matrix, then the constructor call would throw an ArrayIndexOutOfBoundsException:

    val vectFromArrayIobe=new DenseMatrix(2,2,Array(2,3,4))
    //throws java.lang.ArrayIndexOutOfBoundsException: 3

Matrix arithmetic

Now let's look at the basic arithmetic that we could do using matrices. Let's consider a simple 3*3 simpleMatrix and a corresponding identity matrix:

    val simpleMatrix=DenseMatrix((1,2,3),(11,12,13),(21,22,23))
    //DenseMatrix[Int]
    1   2   3
    11  12  13
    21  22  23

    val identityMatrix=DenseMatrix.eye[Int](3)
    //DenseMatrix[Int]
    1  0  0
    0  1  0
    0  0  1

Addition

Adding two matrices will result in a matrix whose corresponding elements are summed up.

    val additionMatrix=identityMatrix + simpleMatrix
    // Returns DenseMatrix[Int]
    2   2   3
    11  13  13
    21  22  24

Multiplication

Now, as you would expect, multiplying a matrix with its identity should give you the matrix itself:

    val simpleTimesIdentity=simpleMatrix * identityMatrix
    //Returns DenseMatrix[Int]
    1   2   3
    11  12  13
    21  22  23

Breeze also has an alternative element-by-element operation that has the format of prefixing the operator with a colon, for example, :+, :-, :*, and so on. Check out what happens when we do an element-wise multiplication of the identity matrix and the simple matrix:

    val elementWiseMulti=identityMatrix :* simpleMatrix
    //DenseMatrix[Int]
    1  0   0
    0  12  0
    0  0   23

Appending and conversion

Let's briefly see how to append two matrices and convert matrices of one numeric type to another.

Concatenating matrices – vertically

Similar to vectors, matrix has a vertcat function, which vertically concatenates an arbitrary number of matrices; the row size of the resulting matrix is the sum of the row sizes of all matrices combined:

    val vertConcatMatrix=DenseMatrix.vertcat(identityMatrix, simpleMatrix)
    //DenseMatrix[Int]
    1   0   0
    0   1   0
    0   0   1
    1   2   3
    11  12  13
    21  22  23

Attempting to concatenate matrices with different numbers of columns would, as expected, throw an IllegalArgumentException:

    java.lang.IllegalArgumentException: requirement failed: Not all matrices have the same number of columns

Concatenating matrices – horizontally

Not surprisingly, the horzcat function concatenates the matrices horizontally; the column size of the resulting matrix is the sum of the column sizes of all the matrices:

    val horzConcatMatrix=DenseMatrix.horzcat(identityMatrix, simpleMatrix)
    // DenseMatrix[Int]
    1  0  0  1   2   3
    0  1  0  11  12  13
    0  0  1  21  22  23

Similar to the vertical concatenation, attempting to concatenate a matrix of a different row size would throw an IllegalArgumentException:

    java.lang.IllegalArgumentException: requirement failed: Not all matrices have the same number of rows

Converting a matrix of Int to a matrix of Double

The conversion of one type of matrix to another is not automatic in Breeze. However, there is a simple way to achieve this:

    import breeze.linalg.convert
    val simpleMatrixAsDouble=convert(simpleMatrix, Double)
    // DenseMatrix[Double] =
    1.0   2.0   3.0
    11.0  12.0  13.0
    21.0  22.0  23.0

Data manipulation operations

Let's create a simple 2*2 matrix that will be used for the rest of this section:

    val simpleMatrix=DenseMatrix((4.0,7.0),(3.0,-5.0))
    //DenseMatrix[Double] =
    4.0  7.0
    3.0  -5.0

Getting column vectors out of the matrix

The first column vector could be retrieved by passing in the column parameter as 0 and using :: in order to say that we are interested in all the rows.

    val firstVector=simpleMatrix(::,0)
    //DenseVector(4.0, 3.0)

Getting the second column vector and so on is achieved by passing the correct zero-indexed column number:

    val secondVector=simpleMatrix(::,1)
    //DenseVector(7.0, -5.0)

Alternatively, you could explicitly pass in the range of rows to be extracted:

    val firstVectorByCols=simpleMatrix(0 to 1,0)
    //DenseVector(4.0, 3.0)

While explicitly stating the range (as in 0 to 1), we have to be careful not to exceed the matrix size. For example, the following attempt to select 3 columns (0 through 2) on a 2 * 2 matrix would throw an ArrayIndexOutOfBoundsException:

    val errorTryingToSelect3ColumnsOn2By2Matrix=simpleMatrix(0,0 to 2)
    //java.lang.ArrayIndexOutOfBoundsException

Getting row vectors out of the matrix

If we would like to get a row vector, all we need to do is play with the row and column parameters again. As expected, the result is the transpose of a column vector, which is simply a row vector. Like the column vector, we could either explicitly state our columns or pass in a wildcard (::) to cover the entire range of columns:

    val firstRowStatingCols=simpleMatrix(0,0 to 1)
    //Transpose(DenseVector(4.0, 7.0))
    val firstRowAllCols=simpleMatrix(0,::)
    //Transpose(DenseVector(4.0, 7.0))

Getting the second row vector is achieved by passing the second row (1) and all the columns (::):

    val secondRow=simpleMatrix(1,::)
    //Transpose(DenseVector(3.0, -5.0))

Getting values inside the matrix

If we are just interested in a single value within the matrix, we pass in its exact row and column number. For example, to get the value at the first row and first column of the matrix:

    val firstRowFirstCol=simpleMatrix(0,0)
    //Double = 4.0

Getting the inverse and transpose of a matrix

Getting the inverse and the transpose of a matrix is a little counter-intuitive in Breeze.
Let's consider the same matrix that we dealt with earlier:

    val simpleMatrix=DenseMatrix((4.0,7.0),(3.0,-5.0))

On the one hand, transpose is a function on the matrix object itself, like so:

    val transpose=simpleMatrix.t
    4.0  3.0
    7.0  -5.0

inverse, on the other hand, is a universal function under the breeze.linalg package:

    val inverse=inv(simpleMatrix)
    0.12195121951219512  0.17073170731707318
    0.07317073170731708  -0.0975609756097561

Let's multiply the matrix by its inverse and confirm whether the product is an identity matrix:

    simpleMatrix * inverse
    1.0                     0.0
    -5.551115123125783E-17  1.0

As expected, the result is indeed an identity matrix, up to the rounding errors of floating point arithmetic.

Computing basic statistics

Now, just like vectors, let's briefly look at how to calculate some basic summary statistics for a matrix.

This needs imports of breeze.linalg._, breeze.numerics._, and breeze.stats._. The operations in the "Other operations" section aim to simulate NumPy's UFuncs, or universal functions.

Mean and variance

The mean and variance of a matrix can be calculated by calling the meanAndVariance universal function in the breeze.stats package.
Note that this needs a matrix of Double:

    meanAndVariance(simpleMatrixAsDouble)
    // MeanAndVariance(12.0,75.75,9)

Alternatively, converting an Int matrix to a Double matrix and calculating the mean and variance for that matrix could be merged into a one-liner:

    meanAndVariance(convert(simpleMatrix, Double))

Standard deviation

Calling stddev on a Double matrix gives the standard deviation:

    stddev(simpleMatrixAsDouble)
    //Double = 8.703447592764606

Next up, let's look at some basic aggregation operations:

    val simpleMatrix=DenseMatrix((1,2,3),(11,12,13),(21,22,23))

Finding the largest value in a matrix

The (apply method of the) max object (a universal function) inside the breeze.linalg package will help us do that:

    val intMaxOfMatrixVals=max(simpleMatrix)
    //23

Finding the sum, square root and log of all the values in the matrix

The same as with max, the sum object inside the breeze.linalg package calculates the sum of all the matrix elements:

    val intSumOfMatrixVals=sum(simpleMatrix)
    //108

The functions sqrt, log, and various other objects (universal functions) in the breeze.numerics package calculate the square root and (natural) log values of all the individual values inside the matrix.

Sqrt

    val sqrtOfMatrixVals=sqrt(simpleMatrix)
    //DenseMatrix[Double] =
    1.0              1.4142135623730951  1.7320508075688772
    3.3166247903554  3.4641016151377544  3.605551275463989
    4.58257569495584 4.69041575982343    4.795831523312719

Log

    val log2MatrixVals=log(simpleMatrix)
    //DenseMatrix[Double]
    0.0                 0.6931471805599453  1.0986122886681098
    2.3978952727983707  2.4849066497880004  2.5649493574615367
    3.044522437723423   3.091042453358316   3.1354942159291497

Calculating the eigenvectors and eigenvalues of a matrix

Calculating eigenvectors is straightforward in Breeze. Let's consider our simpleMatrix from the previous section:

    val simpleMatrix=DenseMatrix((4.0,7.0),(3.0,-5.0))

Calling the breeze.linalg.eig universal function on a matrix returns a breeze.linalg.eig.DenseEig object that encapsulates the eigenvectors and eigenvalues:

    val denseEig=eig(simpleMatrix)

This line of code returns the following:

    Eig(
    DenseVector(5.922616289332565, -6.922616289332565),
    DenseVector(0.0, 0.0),
    0.9642892971721949   -0.5395744865143975
    0.26485118719604456  0.8419378679586305)

We could extract the eigenvectors and eigenvalues by calling the corresponding functions on the returned Eig reference:

    val eigenVectors=denseEig.eigenvectors
    //DenseMatrix[Double] =
    0.9642892971721949   -0.5395744865143975
    0.26485118719604456  0.8419378679586305

The two eigenvalues corresponding to the two eigenvectors could be captured using the eigenvalues function on the Eig object:

    val eigenValues=denseEig.eigenvalues
    //DenseVector[Double] = DenseVector(5.922616289332565, -6.922616289332565)

Let's validate the eigenvalues and the vectors:

1. Let's multiply the matrix with the first eigenvector:

    val matrixToEigVector=simpleMatrix*denseEig.eigenvectors(::,0)
    //DenseVector(5.7111154990610915, 1.568611955536362)

2. Then let's multiply the first eigenvalue with the first eigenvector. The resulting vector will be the same, up to a marginal floating point error:

    val vectorToEigValue=denseEig.eigenvectors(::,0) * denseEig.eigenvalues(0)
    //DenseVector(5.7111154990610915, 1.5686119555363618)

How it works...

The same as with vectors, the initialization of Breeze matrices is achieved by way of the apply method or one of the various methods in the matrix's companion object. Various other operations are provided by way of polymorphic functions available in the breeze.numerics, breeze.linalg and breeze.stats packages.

Vectors and matrices with randomly distributed values

The breeze.stats.distributions package supplements the random number generator that is built into Scala. Scala's default generator just provides the ability to get the random values one by one using the "next" methods.
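
As a point of comparison, the following minimal sketch shows the one-value-at-a-time style of the standard library's scala.util.Random. NextMethodsSketch and uniformValues are hypothetical names used only for this illustration; building a collection means calling a "next" method repeatedly ourselves.

```scala
import scala.util.Random

// Illustration only: collecting values from Scala's built-in generator by hand.
object NextMethodsSketch {
  // Draws `n` values one at a time from a seeded generator and collects them.
  def uniformValues(n: Int, seed: Long): Vector[Double] = {
    val rng = new Random(seed)
    Vector.fill(n)(rng.nextDouble()) // each call yields a single value in [0, 1)
  }
}
```

Using a fixed seed, as in NextMethodsSketch.uniformValues(5, 42L), makes the sequence repeatable across runs.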
Random number generators in Breeze provide the ability to build vectors and matrices out of these generators. In this recipe, we'll briefly see three of the most common distributions of random numbers.

We will cover the following sub-recipes:

- Creating vectors with uniformly distributed random values
- Creating vectors with normally distributed random values
- Creating vectors with random values that have a Poisson distribution
- Creating a matrix with uniformly random values
- Creating a matrix with normally distributed random values
- Creating a matrix with random values that have a Poisson distribution

How it works...

Before we delve into how to create the vectors and matrices out of random numbers, let's create instances of the most common random number distributions. All these generators are under the breeze.stats.distributions package:

    //Uniform distribution with low being 0 and high being 10
    val uniformDist=Uniform(0,10)
    //Gaussian distribution with mean being 5 and standard deviation being 1
    val gaussianDist=Gaussian(5,1)
    //Poisson distribution with mean being 5
    val poissonDist=Poisson(5)

We could actually directly sample from these generators. Given any distribution we created previously, we could sample either a single value or a sequence of values:

    //Samples a single value
    println (uniformDist.sample())
    //eg. 9.151191360491392

    //Returns a sample vector of the size that is passed in as a parameter
    println (uniformDist.sample(2))
    //eg. Vector(6.001980062275654, 6.210874664967401)

Creating vectors with uniformly distributed random values

With no generator parameter, the DenseVector.rand method accepts a parameter for the length of the vector to be returned.
The result is a vector (of length 10) with uniformly distributed values between 0 and 1:

    val uniformWithoutSize=DenseVector.rand(10)
    println ("uniformWithoutSize \n"+ uniformWithoutSize)
    //DenseVector(0.1235038023750481, 0.3120595941786264, 0.3575638744660876, 0.5640844223813524, 0.5336149399548831, 0.1338053814330793, 0.9099684427908603, 0.38690724148973166, 0.22561993631651522, 0.45120359622713657)

The DenseVector.rand method optionally accepts a distribution object and generates random values using that input distribution. The following lines generate a vector of 10 uniformly distributed random values between 0 and 10:

    val uniformDist=Uniform(0,10)
    val uniformVectInRange=DenseVector.rand(10, uniformDist)
    println ("uniformVectInRange \n"+uniformVectInRange)
    //DenseVector(1.5545833905907314, 6.172564377264846, 8.45578509265587, 7.683763574965107, 8.018688137742062, 4.5876187984930406, 3.274758584944064, 2.3873947264259954, 2.139988841403757, 8.314112884416943)

Creating vectors with normally distributed random values

In place of the uniformDist generator, we could also pass the previously created Gaussian generator, which is configured to yield a distribution that has a mean of 5 and a standard deviation of 1:

    val gaussianVector=DenseVector.rand(10, gaussianDist)
    println ("gaussianVector \n"+gaussianVector)
    //DenseVector(4.235655596913547, 5.535011377545014, 6.201428236839494, 6.046289604188366, 4.319709374229152, 4.2379652913447154, 2.957868021601233, 3.96371080427211, 4.351274306757224, 5.445022658876723)

Creating vectors with random values that have a Poisson distribution

Similarly, by passing the previously created Poisson random number generator, a vector of values that has a mean of 5 could be generated:

    val poissonVector=DenseVector.rand(10, poissonDist)
    println ("poissonVector \n"+poissonVector)
    //DenseVector(5, 5, 7, 11, 7, 6, 6, 6, 6, 6)

We saw how easy it is to create a vector of random values.
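
For intuition about what the three generators used above produce, here is a rough plain-Scala sketch built on scala.util.Random. It is not how Breeze implements its distributions: DistributionSketch and its methods are hypothetical names, and the Poisson sampler uses Knuth's multiplication method, swapped in here only because it is short.

```scala
import scala.util.Random

// Illustration only: hand-rolled samplers, not Breeze's Uniform, Gaussian or Poisson classes.
object DistributionSketch {
  // Uniform on [low, high): rescale a value drawn from [0, 1).
  def uniform(rng: Random, low: Double, high: Double): Double =
    low + (high - low) * rng.nextDouble()

  // Gaussian: shift and scale a standard normal draw.
  def gaussian(rng: Random, mean: Double, stdDev: Double): Double =
    mean + stdDev * rng.nextGaussian()

  // Poisson via Knuth's multiplication method (reasonable for small means).
  def poisson(rng: Random, mean: Double): Int = {
    val limit = math.exp(-mean)
    var count = 0
    var product = rng.nextDouble()
    while (product > limit) {
      count += 1
      product *= rng.nextDouble()
    }
    count
  }

  // Builds a vector of `n` samples; `draw` is re-evaluated for every element.
  def sample[A](n: Int)(draw: => A): Vector[A] = Vector.fill(n)(draw)
}
```

For example, DistributionSketch.sample(10)(DistributionSketch.poisson(rng, 5)) plays the same role as DenseVector.rand(10, poissonDist), only with a plain Scala Vector as the result.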
Now, let's proceed to create a matrix of random values. Just as we used DenseVector.rand to generate vectors with random values, we'll use the DenseMatrix.rand function to generate a matrix of random values.

Creating a matrix with uniformly random values

DenseMatrix.rand defaults to the uniform distribution and generates a matrix of random values given the row and the column parameters. However, if we would like to have a distribution within a range, then as with vectors, we could use the optional parameter:

    //Uniform distribution, creates a 3 * 3 matrix with random values from 0 to 1
    val uniformMat=DenseMatrix.rand(3, 3)
    println ("uniformMat \n"+uniformMat)
    0.4492155777289115  0.9098840386699856    0.8203022252988292
    0.0888975848853315  0.009677790736892788  0.6058885905934237
    0.6201415814136939  0.7017492438727635    0.08404147915159443

    //Creates a 3 * 3 matrix with uniformly distributed random values with low being 0 and high being 10
    val uniformMatrixInRange=DenseMatrix.rand(3,3, uniformDist)
    println ("uniformMatrixInRange \n"+uniformMatrixInRange)
    7.592014659345548  8.164652560340933    6.966445294464401
    8.35949395084735   3.442654641743763    3.6761640240938442
    9.42626645215854   0.23658921372298636  7.327120138868571

Creating a matrix with normally distributed random values

Just as with vectors, in place of the uniformDist generator, we could also pass the previously created Gaussian generator to the rand function to generate a matrix of random values that has a mean of 5 and a standard deviation of 1:

    //Creates a 3 * 3 matrix with normally distributed random values with mean being 5 and standard deviation being 1
    val gaussianMatrix=DenseMatrix.rand(3, 3,gaussianDist)
    println ("gaussianMatrix \n"+gaussianMatrix)
    5.724540885605018   5.647051873430568  5.337906135107098
    6.2228893721489875  4.799561665187845  5.12469779489833
    5.136960834730864   5.176410360757703  5.262707072950913

Creating a matrix with random values that have a Poisson distribution

Similarly, by passing the previously created Poisson random number generator, a matrix of random values that has a mean of 5 could be generated:

    //Creates a 3 * 3 matrix with Poisson distribution with mean being 5
    val poissonMatrix=DenseMatrix.rand(3, 3,poissonDist)
    println ("poissonMatrix \n"+poissonMatrix)
    4  11  3
    6  6   5
    6  4   2

Reading and writing CSV files

Reading and writing a CSV file in Breeze is really a breeze. We just have two functions in the breeze.linalg package to play with. They are very intuitively named csvread and csvwrite.

In this recipe, you'll see how to:

1. Read a CSV file into a matrix
2. Save selected columns of a matrix into a new matrix
3. Write the newly created matrix into a CSV file
4. Extract a vector out of the matrix
5. Write the vector into a CSV

How it works...

There are just two functions that we need to remember in order to read and write data from and to CSV files. The signatures of the functions are pretty straightforward too:

    csvread(file, separator, quote, escape, skipLines)
    csvwrite(file, mat, separator, quote, escape, skipLines)

Let's look at the parameters by order of importance:

- file: java.io.File: Represents the file location.
- separator: Defaults to a comma so as to represent a CSV. Could be overridden when needed.
- skipLines: This is the number of lines to be skipped while reading the file. Generally, if there is a header, we pass skipLines=1.
- mat: While writing, this is the matrix object that is being written.
- quote: This defaults to double quotes. It is a character that implies that the value inside is one single value.
- escape: This defaults to a backslash. It is a character used to escape special characters.

Let's see these in action. For the sake of clarity, I have skipped the quote and the escape parameters while calling the csvread and csvwrite functions.
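
To make the roles of separator and skipLines concrete, here is a simplified plain-Scala sketch of a numeric CSV reader. It is not Breeze's csvread: CsvParamsSketch and parse are hypothetical names, the input is a string rather than a file, and quote and escape handling are deliberately left out.

```scala
// Illustration only: a stripped-down reader showing what separator and skipLines control.
object CsvParamsSketch {
  // Parses numeric CSV text into rows of doubles.
  // `separator` splits the fields; the first `skipLines` lines (for example, a header) are dropped.
  def parse(text: String, separator: Char = ',', skipLines: Int = 0): Vector[Vector[Double]] =
    text.split("\n").toVector
      .drop(skipLines)
      .filter(_.trim.nonEmpty) // ignore blank lines such as a trailing newline
      .map(_.split(separator).toVector.map(_.trim.toDouble))
}
```

With a pipe-separated input that starts with a header line, CsvParamsSketch.parse(text, separator='|', skipLines=1) returns only the numeric rows.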
For this recipe, we will do three things:

- Read a CSV file as a matrix
- Extract a sub-matrix out of the read matrix
- Write the matrix

1. Read the CSV as a matrix: Let's use the csvread function to read a CSV file into a 100*3 matrix. We'll also skip the header while reading and print the first six rows as a sample:

    import java.io.File
    import breeze.linalg._

    val usageMatrix=csvread(file=new File("WWWusage.csv"), separator=',', skipLines=1)
    //print the first six rows (0 to 5 is inclusive)
    println ("Usage matrix \n"+ usageMatrix(0 to 5,::))

    Output :
    1.0  1.0  88.0
    2.0  2.0  84.0
    3.0  3.0  85.0
    4.0  4.0  85.0
    5.0  5.0  84.0
    6.0  6.0  85.0

2. Extract a sub-matrix out of the read matrix: For the sake of generating a sub-matrix, let's skip the first column and save the second and the third columns into a new matrix. Let's call it firstColumnSkipped:

    val firstColumnSkipped= usageMatrix(::, 1 to usageMatrix.cols-1)
    //Sample some data so as to ensure we are fine
    println ("First Column skipped \n"+ firstColumnSkipped(0 to 5, ::))

    Output :
    1.0  88.0
    2.0  84.0
    3.0  85.0
    4.0  85.0
    5.0  84.0
    6.0  85.0

3. Write the matrix: As a final step, let's write the firstColumnSkipped matrix to a new CSV file named firstColumnSkipped.csv:

    //Write this modified matrix to a file
    csvwrite(file=new File ("firstColumnSkipped.csv"), mat=firstColumnSkipped, separator=',')

2 Getting Started with Apache Spark DataFrames

In this chapter, we will cover the following recipes:

- Getting Apache Spark
- Creating a DataFrame from CSV
- Manipulating DataFrames
- Creating a DataFrame from Scala case classes

Introduction

Apache Spark is a cluster computing platform that claims to run about 10 times faster than Hadoop. In general terms, we could consider it as a means to run our complex logic over massive amounts of data at a blazingly fast speed. The other good thing about Spark is that the programs that we write are much smaller than the typical MapReduce classes that we write for Hadoop.
So, not only do our programs run faster but it also takes less time to write them.

Spark has four major higher level tools built on top of the Spark Core: Spark Streaming, Spark MLlib (machine learning), Spark SQL (an SQL interface for accessing the data), and GraphX (for graph processing). The Spark Core is the heart of Spark. Spark provides higher level abstractions in Scala, Java, and Python for data representation, serialization, scheduling, metrics, and so on.

At the risk of stating the obvious, a DataFrame is one of the primary data structures used in data analysis. It is just like an RDBMS table that organizes all your attributes into columns and all your observations into rows. It's a great way to store and play with heterogeneous data. In this chapter, we'll talk about DataFrames in Spark.

Getting Apache Spark

In this recipe, we'll take a look at how to bring Spark into our project (using SBT) and how Spark works internally.

The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter1-spark-csv/build.sbt.

How to do it...

Let's now throw some Spark dependencies into our build.sbt file so that we can start playing with them in subsequent recipes. For now, we'll just focus on three of them: Spark Core, Spark SQL, and Spark MLlib. We'll take a look at a host of other Spark dependencies as we proceed further in this book:

1. Under a brand new folder (which will be your project root), create a new file called build.sbt.
2. Next, let's add the Spark libraries to the project dependencies.
3. Note that Spark 1.4.x requires Scala 2.10.x.
This becomes the first section of our build.sbt:

    organization := "com.packt"

    name := "chapter1-spark-csv"

    scalaVersion := "2.10.4"

    val sparkVersion="1.4.1"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion,
      "org.apache.spark" %% "spark-sql" % sparkVersion,
      "org.apache.spark" %% "spark-mllib" % sparkVersion
    )

Creating a DataFrame from CSV

In this recipe, we'll look at how to create a new DataFrame from a delimiter-separated values file.

The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter1-spark-csv/src/main/scala/com/packt/scaladata/spark/csv/DataFrameCSV.scala.

How to do it...

This recipe involves four steps:

1. Add the spark-csv support to our project.
2. Create a Spark Config object that gives information on the environment that we are running Spark in.
3. Create a Spark context that serves as an entry point into Spark. Then, we proceed to create an SQLContext from the Spark context.
4. Load the CSV using the SQLContext.

5. CSV support isn't first-class in Spark, but it is available through an external library from Databricks. So, let's go ahead and add that to our build.sbt. After adding the spark-csv dependency, our complete build.sbt looks like this:

    organization := "com.packt"

    name := "chapter1-spark-csv"

    scalaVersion := "2.10.4"

    val sparkVersion="1.4.1"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion,
      "org.apache.spark" %% "spark-sql" % sparkVersion,
      "com.databricks" %% "spark-csv" % "1.0.3"
    )

6. SparkConf holds all of the information required to run this Spark "cluster." For this recipe, we are running locally, and we intend to use only two cores in the machine: local[2].
More details about this can be found in the There's more… section of this recipe:

    import org.apache.spark.SparkConf
    val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")

When we say that the master of this run is "local," we mean that we are running Spark in standalone mode. We'll see what "standalone" mode means in the There's more… section.

7. Initialize the Spark context with the Spark configuration. This is the core entry point for doing anything with Spark:

    import org.apache.spark.SparkContext
    val sc = new SparkContext(conf)

The easiest way to query data in Spark is by using SQL queries:

    import org.apache.spark.sql.SQLContext
    val sqlContext=new SQLContext(sc)

8. Now, let's load our pipe-separated file. students is of type org.apache.spark.sql.DataFrame:

    import com.databricks.spark.csv._
    val students=sqlContext.csvFile(filePath="StudentData.csv", useHeader=true, delimiter='|')

How it works...

The csvFile function of sqlContext accepts the full filePath of the file to be loaded. If the CSV has a header, then the useHeader flag will read the first row as column names. The delimiter flag defaults to a comma, but you can override the character as needed.

Instead of using the csvFile function, we could also use the load function available in SQLContext. The load function accepts the format of the file (in our case, it is CSV) and options as a Map. We can specify the same parameters that we specified earlier using a Map, like this:

    val options=Map("header"->"true", "path"->"ModifiedStudent.csv")
    val newStudents=sqlContext.load("com.databricks.spark.csv",options)

There's more…

As we saw earlier, we ran the Spark program in standalone mode. In standalone mode, the Driver program (the brain) and the Worker nodes all get crammed into a single JVM. In our example, we set master to local[2], which means that we intend to run Spark in standalone mode and request it to use only two cores in the machine.
Spark can be run in three different modes:

• Standalone
• Standalone cluster, using its in-built cluster manager
• Using external cluster managers, such as Apache Mesos and YARN

In Chapter 6, Scaling Up, we have dedicated explanations and recipes for how to run Spark on its in-built cluster manager and on Mesos and YARN. In a clustered environment, Spark runs a Driver program along with a number of Worker nodes. As the name indicates, the Driver program houses the brain of the program, which is our main program. The Worker nodes have the data and perform various transformations on it.

Manipulating DataFrames

In the previous recipe, we saw how to create a DataFrame. The next natural step, after creating DataFrames, is to play with the data inside them. Besides the numerous functions that help us do that, there are other useful functions that help us sample the data, print the schema of the data, and so on. We'll take a look at them one by one in this recipe.

The code and the sample file for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter1-spark-csv/src/main/scala/com/packt/scaladata/spark/csv/DataFrameCSV.scala.

How to do it...

Now, let's see how we can manipulate DataFrames using the following subrecipes:

• Printing the schema of the DataFrame
• Sampling data in the DataFrame
• Selecting specific columns in the DataFrame
• Filtering data by condition
• Sorting data in the frame
• Renaming columns
• Treating the DataFrame as a relational table to execute SQL queries
• Saving the DataFrame as a file

Printing the schema of the DataFrame

After creating the DataFrame from various sources, we would obviously want to quickly check its schema. The printSchema function lets us do just that. It prints our column names and the data types to the default output stream:

1. Let's load a sample DataFrame from the StudentData.csv file:

    //Now, let's load our pipe-separated file
    //students is of type org.apache.spark.sql.DataFrame
    val students=sqlContext.csvFile(filePath="StudentData.csv", useHeader=true, delimiter='|')

2. Let's print the schema of this DataFrame:

    students.printSchema

Output

    root
     |-- id: string (nullable = true)
     |-- studentName: string (nullable = true)
     |-- phone: string (nullable = true)
     |-- email: string (nullable = true)

Sampling the data in the DataFrame

The next logical thing that we would like to do is to check whether our data got loaded into the DataFrame correctly. There are a few ways of sampling the data in the newly created DataFrame:

• Using the show method. This is the simplest way. There are two variants of the show method, as explained here:

    ◦ One with an integer parameter that specifies the number of rows to be sampled.
    ◦ The second is without the integer parameter. In it, the number of rows defaults to 20.

What distinguishes the show method from the other functions that sample data is that it displays the rows along with the headers and prints the output directly to the default output stream (console):

    //Sample n records along with headers
    students.show(3)

    //Sample 20 records along with headers
    students.show()

    //Output of show(3)
    +--+-----------+--------------+--------------------+
    |id|studentName|         phone|               email|
    +--+-----------+--------------+--------------------+
    | 1|      Burke|1-300-746-8446|ullamcorper.velit...|
    | 2|      Kamal|1-668-571-5046|pede.Suspendisse@...|
    | 3|       Olga|1-956-311-1686|Aenean.eget.metus...|
    +--+-----------+--------------+--------------------+

• Using the head method. This method also accepts an integer parameter representing the number of rows to be fetched. The head method returns an array of rows. To print these rows, we can pass the println method to the foreach function of the array:

    //Sample the first 5 records
    students.head(5).foreach(println)

If you are not a great fan of head, you can use the take function, which is common across all Scala sequences. The take method is just an alias of the head method and delegates all its calls to head:

    //Alias of head
    students.take(5).foreach(println)

    //Output
    [1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]
    [2,Kamal,1-668-571-5046,pede.Suspendisse@interdumenim.edu]
    [3,Olga,1-956-311-1686,Aenean.eget.metus@dictumcursusNunc.edu]
    [4,Belle,1-246-894-6340,vitae.aliquet.nec@neque.co.uk]
    [5,Trevor,1-300-527-4967,dapibus.id@acturpisegestas.net]

Selecting DataFrame columns

As you have seen, all DataFrame columns have names. The select function helps us pick and choose specific columns from a previously existing DataFrame and form a completely new one out of them:

• Selecting a single column: Let's say that you would like to select only the email column from a DataFrame. Since DataFrames are immutable, the selection returns a new DataFrame:

    val emailDataFrame:DataFrame=students.select("email")

Now, we have a new DataFrame called emailDataFrame, which has only the e-mail column as its contents. Let's sample and check whether that is true:

    emailDataFrame.show(3)

    //Output
    +--------------------+
    |               email|
    +--------------------+
    |ullamcorper.velit...|
    |pede.Suspendisse@...|
    |Aenean.eget.metus...|
    +--------------------+

• Selecting more than one column: The select function actually accepts an arbitrary number of column names, which means that you can easily select more than one column from your source DataFrame:

    val studentEmailDF=students.select("studentName", "email")

The only requirement is that each string parameter must be a valid column name. Otherwise, an org.apache.spark.sql.AnalysisException exception is thrown. The printSchema function serves as a quick reference for the column names.
Let's sample and check whether we have indeed selected the studentName and email columns in the new DataFrame:

    studentEmailDF.show(3)

Output

    +-----------+--------------------+
    |studentName|               email|
    +-----------+--------------------+
    |      Burke|ullamcorper.velit...|
    |      Kamal|pede.Suspendisse@...|
    |       Olga|Aenean.eget.metus...|
    +-----------+--------------------+

Filtering data by condition

Now that we have seen how to select columns from a DataFrame, let's see how to filter the rows of a DataFrame based on conditions. For row-based filtering, we can treat the DataFrame as a normal Scala collection and filter the data based on a condition. In all of these examples, I have added the show method at the end for clarity:

1. Filtering based on a column value:

    //Print the first 7 records that have a student id greater than 5
    students.filter("id > 5").show(7)

Output

    +--+-----------+--------------+--------------------+
    |id|studentName|         phone|               email|
    +--+-----------+--------------+--------------------+
    | 6|     Laurel|1-691-379-9921|adipiscing@consec...|
    | 7|       Sara|1-608-140-1995|Donec.nibh@enimEt...|
    | 8|     Kaseem|1-881-586-2689|cursus.et.magna@e...|
    | 9|        Lev|1-916-367-5608|Vivamus.nisi@ipsu...|
    |10|       Maya|1-271-683-2698|accumsan.convalli...|
    |11|        Emi|1-467-270-1337|        est@nunc.com|
    |12|      Caleb|1-683-212-0896|Suspendisse@Quisq...|
    +--+-----------+--------------+--------------------+

Notice that even though the id field is inferred as a String type, the numerical comparison is done correctly. On the other hand, students.filter("email > 'c'") compares strings and would give back all the rows whose e-mail sorts after 'c' in lexicographic order.

2. Filtering based on an empty column value. The following filter selects all students without names:

    students.filter("studentName =''").show(7)

Output

    +--+-----------+--------------+--------------------+
    |id|studentName|         phone|               email|
    +--+-----------+--------------+--------------------+
    |21|           |1-598-439-7549|consectetuer.adip...|
    |32|           |1-184-895-9602|accumsan.laoreet@...|
    |45|           |1-245-752-0481|Suspendisse.eleif...|
    |83|           |1-858-810-2204|sociis.natoque@eu...|
    |94|           |1-443-410-7878|Praesent.eu.nulla...|
    +--+-----------+--------------+--------------------+

3. Filtering based on more than one condition. This filter shows all records whose student name is empty or is the literal string NULL:

    students.filter("studentName ='' OR studentName = 'NULL'").show(7)

Output

    +--+-----------+--------------+--------------------+
    |id|studentName|         phone|               email|
    +--+-----------+--------------+--------------------+
    |21|           |1-598-439-7549|consectetuer.adip...|
    |32|           |1-184-895-9602|accumsan.laoreet@...|
    |33|       NULL|1-105-503-0141|Donec@Inmipede.co.uk|
    |45|           |1-245-752-0481|Suspendisse.eleif...|
    |83|           |1-858-810-2204|sociis.natoque@eu...|
    |94|           |1-443-410-7878|Praesent.eu.nulla...|
    +--+-----------+--------------+--------------------+

We are just limiting the output to seven records using the show(7) function.

4. Filtering based on SQL-like conditions. This filter gets the entries of all students whose names start with the letter 'M':
    students.filter("SUBSTR(studentName,0,1) ='M'").show(7)

Output

    +--+-----------+--------------+--------------------+
    |id|studentName|         phone|               email|
    +--+-----------+--------------+--------------------+
    |10|       Maya|1-271-683-2698|accumsan.convalli...|
    |19|    Malachi|1-608-637-2772|Proin.mi.Aliquam@...|
    |24|    Marsden|1-477-629-7528|Donec.dignissim.m...|
    |37|      Maggy|1-910-887-6777|facilisi.Sed.nequ...|
    |61|     Maxine|1-422-863-3041|aliquet.molestie....|
    |77|      Maggy|1-613-147-4380| pellentesque@mi.net|
    |97|    Maxwell|1-607-205-1273|metus.In@musAenea...|
    +--+-----------+--------------+--------------------+

Sorting data in the frame

Using the sort function, we can order the DataFrame by a particular column:

1. Ordering by a column in descending order:

    students.sort(students("studentName").desc).show(10)

Output

    +--+-----------+--------------+--------------------+
    |id|studentName|         phone|               email|
    +--+-----------+--------------+--------------------+
    |50|      Yasir|1-282-511-4445|eget.odio.Aliquam...|
    |52|       Xena|1-527-990-8606|in.faucibus.orci@...|
    |86|     Xandra|1-677-708-5691|libero@arcuVestib...|
    |43|     Wynter|1-440-544-1851|amet.risus.Donec@...|
    |31|    Wallace|1-144-220-8159| lorem.lorem@non.net|
    |66|      Vance|1-268-680-0857|pellentesque@netu...|
    |41|     Tyrone|1-907-383-5293|non.bibendum.sed@...|
    | 5|     Trevor|1-300-527-4967|dapibus.id@acturp...|
    |65|      Tiger|1-316-930-7880|nec@mollisnoncurs...|
    |15|      Tarik|1-398-171-2268|turpis@felisorci.com|
    +--+-----------+--------------+--------------------+

2. Ordering by more than one column (ascending):

    students.sort("studentName", "id").show(10)

Output

    +--+-----------+--------------+--------------------+
    |id|studentName|         phone|               email|
    +--+-----------+--------------+--------------------+
    |21|           |1-598-439-7549|consectetuer.adip...|
    |32|           |1-184-895-9602|accumsan.laoreet@...|
    |45|           |1-245-752-0481|Suspendisse.eleif...|
    |83|           |1-858-810-2204|sociis.natoque@eu...|
    |94|           |1-443-410-7878|Praesent.eu.nulla...|
    |91|       Abel|1-530-527-7467|    urna@veliteu.edu|
    |69|       Aiko|1-682-230-7013|turpis.vitae.puru...|
    |47|       Alma|1-747-382-6775|    nec.enim@non.org|
    |26|      Amela|1-526-909-2605| in@vitaesodales.edu|
    |16|      Amena|1-878-250-3129|lorem.luctus.ut@s...|
    +--+-----------+--------------+--------------------+

Alternatively, the orderBy alias of the sort function can be used to achieve this. Also, the sort order of each column can be specified individually by obtaining the columns through the DataFrame's apply method:

    students.sort(students("studentName").desc, students("id").asc).show(10)

Renaming columns

If we don't like the column names of the source DataFrame and wish to change them to something nice and meaningful, we can do that using the as function while selecting the columns. In this example, we rename the "studentName" column to "name" and retain the "email" column's name as is:

    val copyOfStudents=students.select(students("studentName").as("name"), students("email"))
    copyOfStudents.show()

Output

    +--------+--------------------+
    |    name|               email|
    +--------+--------------------+
    |   Burke|ullamcorper.velit...|
    |   Kamal|pede.Suspendisse@...|
    |    Olga|Aenean.eget.metus...|
    |   Belle|vitae.aliquet.nec...|
    |  Trevor|dapibus.id@acturp...|
    |  Laurel|adipiscing@consec...|
    |    Sara|Donec.nibh@enimEt...|

Treating the DataFrame as a relational table

The real power of DataFrames lies in the fact that we can treat them like relational tables and query them using SQL. This involves two simple steps:

1. Register the students DataFrame as a table with the name "students" (or any other name):

    students.registerTempTable("students")

2. Query it using regular SQL:

    val dfFilteredBySQL=sqlContext.sql("select * from students where studentName!='' order by email desc")
    dfFilteredBySQL.show(7)

    id  studentName  phone           email
    87  Selma        1-601-330-4409  vulputate.velit@p
    96  Channing     1-984-118-7533  viverra.Donec.tem
    4   Belle        1-246-894-6340  vitae.aliquet.nec
    78  Finn         1-213-781-6969  vestibulum.massa@
    53  Kasper       1-155-575-9346  velit.eget@pedeCu
    63  Dylan        1-417-943-8961  vehicula.aliquet@
    35  Cadman       1-443-642-5919  ut.lacus@adipisci

The lifetime of the temporary table is tied to the life of the SQLContext that was used to create the DataFrame.

Joining two DataFrames

Now that we have seen how to register a DataFrame as a table, let's see how to perform SQL-like join operations on DataFrames.

Inner join

An inner join is the default join, and it returns only those rows that match on both DataFrames for the given condition:

    val students1=sqlContext.csvFile(filePath="StudentPrep1.csv", useHeader=true, delimiter='|')
    val students2=sqlContext.csvFile(filePath="StudentPrep2.csv", useHeader=true, delimiter='|')
    val studentsJoin=students1.join(students2, students1("id")===students2("id"))
    studentsJoin.show(studentsJoin.count.toInt)

The output is as follows:

[Screenshot of the inner join output not reproduced]

Right outer join

A right outer join returns the matching rows and, in addition, the unmatched rows that are available in the right-hand-side DataFrame.
We can see from the following output that the entry with ID 999 from the right-hand-side DataFrame is now shown:

    val studentsRightOuterJoin=students1.join(students2, students1("id")===students2("id"), "right_outer")
    studentsRightOuterJoin.show(studentsRightOuterJoin.count.toInt)

Left outer join

Similar to a right outer join, a left outer join returns not only the matching rows, but also the additional unmatched rows of the left-hand-side DataFrame:

    val studentsLeftOuterJoin=students1.join(students2, students1("id")===students2("id"), "left_outer")
    studentsLeftOuterJoin.show(studentsLeftOuterJoin.count.toInt)

Saving the DataFrame as a file

As the next step, let's save a DataFrame in a file store. The load function, which we used in an earlier recipe, has a similar-looking counterpart called save. This involves two steps:

1. Create a map containing the various options that you would like the save method to use. In this case, we specify the filename and ask it to have a header:

    val options=Map("header"->"true", "path"->"ModifiedStudent.csv")

To keep it interesting, let's choose column names from the source DataFrame. In this example, we pick the studentName and email columns and change the studentName column's name to just name:

    val copyOfStudents=students.select(students("studentName").as("name"), students("email"))

2. Finally, save this new DataFrame with the headers in a file named ModifiedStudent.csv:

    copyOfStudents.save("com.databricks.spark.csv", SaveMode.Overwrite, options)

The second argument is a little interesting. We can choose Overwrite (as we did here), Append, Ignore, or ErrorIfExists. Overwrite, as the name implies, overwrites the file if it already exists; Ignore skips the write if the file exists; ErrorIfExists complains if the file already exists; and Append adds the new rows to the existing contents. Throwing an error is the default behavior.
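The four save modes can be compared side by side with a small experiment. The sketch below is not the recipe's code: to stay independent of the CSV library, it assumes the built-in Parquet writer and the df.write API that arrived in Spark 1.4, and it uses two made-up rows and a hypothetical scratch directory.

```scala
import java.nio.file.Files
import scala.util.Try
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

object SaveModeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("saveModeSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Two made-up rows, just enough to have something to write.
    val df = sc.parallelize(Seq((1, "Burke"), (2, "Kamal"))).toDF("id", "name")

    // Hypothetical scratch location; the target itself does not exist yet.
    val path = Files.createTempDirectory("savemode").resolve("students.parquet").toString

    // First write succeeds because nothing exists at the path.
    df.write.mode(SaveMode.ErrorIfExists).parquet(path)

    // Second write with the same mode is expected to fail.
    val second = Try(df.write.mode(SaveMode.ErrorIfExists).parquet(path))
    println("second ErrorIfExists write failed: " + second.isFailure)

    // Ignore leaves the existing data untouched.
    df.write.mode(SaveMode.Ignore).parquet(path)
    println("rows after Ignore: " + sqlContext.read.parquet(path).count())

    // Append adds the same rows again.
    df.write.mode(SaveMode.Append).parquet(path)
    println("rows after Append: " + sqlContext.read.parquet(path).count())

    // Overwrite replaces whatever was there.
    df.write.mode(SaveMode.Overwrite).parquet(path)
    println("rows after Overwrite: " + sqlContext.read.parquet(path).count())

    sc.stop()
  }
}
```

Whether a particular data source honors every mode depends on that source, so it is worth running a check like this against the format you actually intend to write.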
The output of saving copyOfStudents with the save method looks like this:

[Screenshot of the save output not reproduced]

Creating a DataFrame from Scala case classes

In this recipe, we'll see how to create a new DataFrame from Scala case classes.

The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter1-spark-csv/src/main/scala/com/packt/scaladata/spark/csv/DataFrameFromCaseClasses.scala.

How to do it...

1. We create a new entity called Employee with the id and name fields, like this:

    case class Employee(id:Int, name:String)

Similar to the previous recipe, we create SparkContext and SQLContext:

    val conf = new SparkConf().setAppName("colRowDataFrame").setMaster("local[2]")

    //Initialize Spark context with Spark configuration. This is the core entry point to do anything with Spark
    val sc = new SparkContext(conf)

    //The easiest way to query data in Spark is to use SQL queries.
    val sqlContext=new SQLContext(sc)

2. We can source these employee objects from a variety of sources, such as an RDBMS data source, but for the sake of this example, we construct a list of employees, as follows:

    val listOfEmployees =List(Employee(1,"Arun"), Employee(2, "Jason"), Employee (3, "Abhi"))

3. The next step is to pass the listOfEmployees to the createDataFrame function of SQLContext. That's it! We now have a DataFrame. When we try to print the schema using the printSchema method of the DataFrame, we will see that the DataFrame has two columns, with names id and name, as defined in the case class:

    //Pass in the Employees into the `createDataFrame` function.
    val empFrame=sqlContext.createDataFrame(listOfEmployees)
    empFrame.printSchema

Output:

    root
     |-- id: integer (nullable = false)
     |-- name: string (nullable = true)

As you might have guessed, the schema of the DataFrame is inferred from the case class using reflection.

4. We can give a column of the DataFrame a name other than the one specified in the case class using the withColumnRenamed function, as shown here:

    val empFrameWithRenamedColumns=sqlContext.createDataFrame(listOfEmployees).withColumnRenamed("id", "empId")
    empFrameWithRenamedColumns.printSchema

Output:

    root
     |-- empId: integer (nullable = false)
     |-- name: string (nullable = true)

5. Let's query the DataFrame using Spark's first-class SQL support. Before that, however, we'll have to register the DataFrame as a table. The registerTempTable function, as we saw in the previous recipe, helps us achieve this. With the following command, we will have registered the DataFrame as a table by the name "employeeTable":

    empFrameWithRenamedColumns.registerTempTable("employeeTable")

6. Now, for the actual query. Let's arrange the DataFrame in descending order of names:

    val sortedByNameEmployees=sqlContext.sql("select * from employeeTable order by name desc")
    sortedByNameEmployees.show()

Output:

    +-----+-----+
    |empId| name|
    +-----+-----+
    |    2|Jason|
    |    1| Arun|
    |    3| Abhi|
    +-----+-----+

How it works...

The createDataFrame function accepts a sequence of scala.Product. Scala case classes extend Product, and therefore they qualify. That said, we can also use a sequence of tuples to create a DataFrame, since tuples implement Product too:

    val mobiles=sqlContext.createDataFrame(Seq((1,"Android"), (2, "iPhone")))
    mobiles.printSchema
    mobiles.show()

Output:

    //Schema
    root
     |-- _1: integer (nullable = false)
     |-- _2: string (nullable = true)

    //Data
    +--+-------+
    |_1|     _2|
    +--+-------+
    | 1|Android|
    | 2| iPhone|
    +--+-------+

Of course, you can rename the columns using withColumnRenamed.
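What case classes and tuples have in common can be seen without Spark at all. The following plain-Scala sketch treats both as scala.Product and walks their fields by position. The describe helper is a made-up name for illustration, and the sketch does not show how Spark itself derives the schema (that relies on reflection over field names and types):

```scala
// A case class like the one used in this recipe.
case class Employee(id: Int, name: String)

object ProductSketch {
  // Lists the fields of any Product by position.
  def describe(p: Product): String =
    (0 until p.productArity).map(i => p.productElement(i)).mkString("[", ",", "]")

  def main(args: Array[String]): Unit = {
    val employee: Product = Employee(1, "Arun") // case classes extend Product
    val mobile: Product = (1, "Android")        // so do tuples (a Tuple2 here)

    println(employee.productArity) // 2
    println(describe(employee))    // [1,Arun]
    println(describe(mobile))      // [1,Android]
  }
}
```

The tuple carries no field names, which is consistent with the generic _1 and _2 column names seen in the schema above.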
3
Loading and Preparing Data – DataFrame

In this chapter, we will cover the following recipes:

• Loading more than 22 features into classes
• Loading JSON into DataFrames
• Storing data as Parquet files
• Using the Avro data model in Parquet
• Loading from RDBMS
• Preparing data in DataFrames

Introduction

In previous chapters, we saw how to import data from a CSV file to Breeze and Spark DataFrames. However, the source data that is to be analyzed usually comes in a variety of formats. Spark, with its DataFrame API, provides a uniform API that can be used to represent any source (or multiple sources). In this chapter, we'll focus on the various input formats that we can load from in Spark. Towards the end of this chapter, we'll also briefly see some data preparation recipes.

Loading more than 22 features into classes

Case classes have an inherent limitation. They can hold only 22 attributes (Catch-22, if you will). While a reasonable percentage of datasets would fit in that budget, in many cases, the limitation of 22 features in a dataset is a huge turnoff. In this recipe, we'll take a sample Student dataset (http://archive.ics.uci.edu/ml/datasets/Student+Performance), which has 33 features, and we'll see how we can work around this.

The 22-field limit is resolved in Scala version 2.11. However, Spark 1.4 uses Scala 2.10.

How to do it...

Case classes in Scala cannot go beyond encapsulating 22 fields because the companion objects that are generated (during compilation) for these case classes cannot find the matching FunctionN and TupleN classes.
Let's take the example of the Employee case class that we created in Chapter 2, Getting Started with Apache Spark DataFrames:

    case class Employee(id:Int, name:String)

When we look at its decompiled companion object, we notice that for the two constructor parameters of the case class, the companion object extends AbstractFunction2, and its unapply method, the method that gets invoked when we pattern-match against a case class, returns a Tuple2. The problem we face is that the Scala library defines these classes only up to Tuple22 and Function22 (probably because outside the data analysis world, having an entity object with even 10 fields is not a great idea). However, there is a simple yet powerful workaround, and we will be seeing it in this recipe.

We saw in Chapter 2, Getting Started with Apache Spark DataFrames (in the Creating a DataFrame from Scala case classes recipe), that the requirement for creating a DataFrame using SQLContext.createDataFrame from a collection of classes is that the class must extend scala.Product. So, what we intend to do is write our own class that extends scala.Product.

This recipe consists of four steps:

1. Creating SQLContext from SparkContext and Config.
2. Creating a Student class that extends Product and overrides the necessary functions.
3. Constructing an RDD of the Student classes from the sample dataset (student-mat.csv).
4. Creating a DataFrame from the RDD, followed by printing the schema and sampling the data.

Refer to the How it works… section of this recipe for a basic introduction to RDD.

The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/tree/master/chapter3-data-loading.

Let's now cover these steps in detail:

1. Creating SQLContext: As with our recipes from the previous chapter, we construct SparkContext from SparkConfig and then create an SQLContext from SparkContext:

    val conf=new SparkConf().setAppName("DataWith33Atts").setMaster("local[2]")
    val sc=new SparkContext(conf)
    val sqlContext=new SQLContext(sc)

2. Creating the Student class: Our next step is to create a simple Scala class that declares its constructor parameters, and make it extend Product. Making a class extend Product requires us to override two functions from scala.Product and one function from scala.Equals (which scala.Product, in turn, extends). The implementation of each of these functions is pretty straightforward.

Refer to the API docs of Product (http://www.scala-lang.org/api/2.10.4/index.html#scala.Product) and Equals (http://www.scala-lang.org/api/2.10.4/index.html#scala.Equals) for more details.

Firstly, let's make our Student class declare its fields and extend Product:

    class Student (school:String, sex:String, age:Int, address:String,
        famsize:String, pstatus:String, medu:Int, fedu:Int,
        mjob:String, fjob:String, reason:String, guardian:String,
        traveltime:Int, studytime:Int, failures:Int,
        schoolsup:String, famsup:String, paid:String, activities:String,
        nursery:String, higher:String, internet:String, romantic:String,
        famrel:Int, freetime:Int, goout:Int, dalc:Int, walc:Int,
        health:Int, absences:Int, g1:Int, g2:Int, g3:Int) extends Product{

Next, let's implement these three functions after briefly looking at what they are expected to do:

◦ productArity():Int: This returns the number of attributes. In our case, it's 33. So, our implementation looks like this:

    override def productArity: Int = 33

◦ productElement(n:Int):Any: Given an index, this returns the attribute. As protection, we also have a default case, which throws an IndexOutOfBoundsException exception:

    @throws(classOf[IndexOutOfBoundsException])
    override def productElement(n: Int): Any = n match {
      case 0 => school
      case 1 => sex
      case 2 => age
      case 3 => address
      case 4 => famsize
      case 5 => pstatus
      case 6 => medu
      case 7 => fedu
      case 8 => mjob
      case 9 => fjob
      case 10 => reason
      case 11 => guardian
      case 12 => traveltime
      case 13 => studytime
      case 14 => failures
      case 15 => schoolsup
      case 16 => famsup
      case 17 => paid
      case 18 => activities
      case 19 => nursery
      case 20 => higher
      case 21 => internet
      case 22 => romantic
      case 23 => famrel
      case 24 => freetime
      case 25 => goout
      case 26 => dalc
      case 27 => walc
      case 28 => health
      case 29 => absences
      case 30 => g1
      case 31 => g2
      case 32 => g3
      case _ => throw new IndexOutOfBoundsException(n.toString())
    }

◦ canEqual (that:Any):Boolean: This is the last of the three functions, and it serves as a boundary condition when an equality check is being done against this class:

    override def canEqual(that: Any): Boolean = that.isInstanceOf[Student]

3. Constructing an RDD of students from the student-mat.csv file: Now that we have our Student class ready, let's convert the "student-mat.csv" input file into an RDD of Student objects:

    val rddOfStudents=convertCSVToStudents("student-mat.csv", sc)

    def convertCSVToStudents(filePath: String, sc: SparkContext): RDD[Student] = {
      val rddOfStudents: RDD[Student] = sc.textFile(filePath).flatMap(eachLine => Student(eachLine))
      rddOfStudents
    }

As you can see, we have an apply method for Student that accepts a String and returns an Option[Student]. We use flatMap to filter out the None values, thereby resulting in an RDD[Student].

Let's look at the Student companion object's apply function. It's a very simple function that takes a String, splits it on semicolons into an array, and then passes the parameters to the Student's constructor.
The method returns None if there is an error:

    import scala.util.{Try, Success, Failure}

    object Student {
      def apply(str: String): Option[Student] = {
        //A few values have extra double quotes around them
        val paramArray = str.split(";").map(param => param.replaceAll("\"", ""))
        Try(
          new Student(paramArray(0), paramArray(1), paramArray(2).toInt,
            paramArray(3), paramArray(4), paramArray(5),
            paramArray(6).toInt, paramArray(7).toInt,
            paramArray(8), paramArray(9), paramArray(10), paramArray(11),
            paramArray(12).toInt, paramArray(13).toInt, paramArray(14).toInt,
            paramArray(15), paramArray(16), paramArray(17), paramArray(18),
            paramArray(19), paramArray(20), paramArray(21), paramArray(22),
            paramArray(23).toInt, paramArray(24).toInt, paramArray(25).toInt,
            paramArray(26).toInt, paramArray(27).toInt, paramArray(28).toInt,
            paramArray(29).toInt, paramArray(30).toInt, paramArray(31).toInt,
            paramArray(32).toInt)) match {
          case Success(student) => Some(student)
          case Failure(throwable) => {
            println (throwable.getMessage())
            None
          }
        }
      }
    }

4. Creating a DataFrame, printing the schema, and sampling: Finally, we create a DataFrame from RDD[Student]. Converting an RDD[T] to a DataFrame of the same type is just a matter of calling the toDF() function.

You are required to import sqlContext.implicits._. Optionally, you can use the createDataFrame method of sqlContext too. The toDF() function is overloaded so as to accept custom column names while converting to a DataFrame.
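To make the overloaded toDF() concrete, here is a minimal sketch using a small RDD of tuples rather than the Student data. It assumes the Spark 1.4 setup from this chapter; the application name and the column names id and platform are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ToDFNamesSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("toDFNamesSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // brings toDF() into scope for RDDs of Products

    // A tiny RDD of tuples; tuples are Products, so the implicit conversion applies.
    val mobiles = sc.parallelize(Seq((1, "Android"), (2, "iPhone")))

    // Without arguments, the columns get the positional names _1 and _2.
    mobiles.toDF().printSchema()

    // With arguments, the columns take the names supplied by the caller.
    mobiles.toDF("id", "platform").printSchema()

    sc.stop()
  }
}
```

The number of names passed to toDF has to match the number of columns, so for a wide class such as Student it is usually simpler to let the constructor parameter names drive the schema.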
We then print the schema using the DataFrame's printSchema() method and sample data for confirmation using the show() method:

    import sqlContext.implicits._

    //Create DataFrame
    val studentDFrame = rddOfStudents.toDF()
    studentDFrame.printSchema()
    studentDFrame.show()

The following is the output of the preceding code:

    root
     |-- school: string (nullable = true)
     |-- sex: string (nullable = true)
     |-- age: integer (nullable = false)
     |-- address: string (nullable = true)
     |-- famsize: string (nullable = true)
     |-- pstatus: string (nullable = true)
     |-- medu: integer (nullable = false)
     |-- fedu: integer (nullable = false)
     |-- mjob: string (nullable = true)
     |-- fjob: string (nullable = true)
     |-- reason: string (nullable = true)
     |-- guardian: string (nullable = true)
     |-- traveltime: integer (nullable = false)
     |-- studytime: integer (nullable = false)
     |-- failures: integer (nullable = false)
     |-- schoolsup: string (nullable = true)
     |-- famsup: string (nullable = true)
     |-- paid: string (nullable = true)
     |-- activities: string (nullable = true)
     |-- nursery: string (nullable = true)
     |-- higher: string (nullable = true)
     |-- internet: string (nullable = true)
     |-- romantic: string (nullable = true)
     |-- famrel: integer (nullable = false)
     |-- freetime: integer (nullable = false)
     |-- goout: integer (nullable = false)
     |-- dalc: integer (nullable = false)
     |-- walc: integer (nullable = false)
     |-- health: integer (nullable = false)
     |-- absences: integer (nullable = false)
     |-- g1: integer (nullable = false)
     |-- g2: integer (nullable = false)
     |-- g3: integer (nullable = false)

How it works...

The foundation of Spark is the Resilient Distributed Dataset (RDD). From a programmer's perspective, the composability of RDDs just like a regular Scala collection is a huge advantage. An RDD wraps three vital (and two subsidiary) pieces of information that help in the reconstruction of data. This enables fault tolerance.
The other major advantage is that while RDDs can be composed into hugely complex graphs using RDD operations, the entire flow of data itself is not very difficult to reason about.

Other than optional optimization attributes (such as data location), at its core, an RDD just wraps three vital pieces of information:

• The dependent/parent RDD (empty if not available)
• The number of partitions
• The function that needs to be applied to each element of the RDD

In simple words, RDDs are just collections of data elements that can exist in memory or on disk. These data elements must be serializable in order to be moved across multiple machines (or be serialized to disk). The number of partitions or blocks of data is primarily determined by the source of the input data (say, if the data is in HDFS, then each block would translate to a single partition), but there are also other ways of playing around with the number of partitions.

So, the number of partitions could be any of these:

• Dictated by the input data itself, for example, the number of blocks in the case of reading files from HDFS
• The number set by the spark.default.parallelism parameter (set while starting the cluster)
• The number set by calling repartition or coalesce on the RDD itself

Note that currently, for all our recipes, we are running our Spark application in the self-contained single JVM mode. While the programs work just fine, we are not yet exploiting the distributed nature of the RDDs. In Chapter 6, Scaling Up, we'll explore how to bundle and deploy our Spark application on a variety of cluster managers: YARN, Spark standalone clusters, and Mesos.
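The last way of controlling the partition count can be tried directly in the single-JVM setup used so far. The following sketch assumes only spark-core on the classpath; the application name and the numbers are arbitrary:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partitionSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Ask for four partitions explicitly when building the RDD.
    val numbers = sc.parallelize(1 to 100, 4)
    println(numbers.partitions.length)                // 4

    // repartition shuffles the data into the requested number of partitions.
    println(numbers.repartition(8).partitions.length) // 8

    // coalesce merges existing partitions, by default without a shuffle.
    println(numbers.coalesce(2).partitions.length)    // 2

    // Both calls returned new RDDs; the original is unchanged.
    println(numbers.partitions.length)                // 4

    sc.stop()
  }
}
```

Because RDDs are immutable, repartition and coalesce never alter the RDD they are called on; the new partitioning applies only to the RDD they return.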
There's more…

In the previous chapter, we created a DataFrame from a List of Employee case classes:

    val listOfEmployees =List(Employee(1,"Arun"), Employee(2, "Jason"), Employee (3, "Abhi"))
    val empFrame=sqlContext.createDataFrame(listOfEmployees)

However, in this recipe, we loaded a file as an RDD[String], transformed each line into a Student object, and finally converted the result into a DataFrame. There are subtle, yet powerful, differences between these approaches.

In the first approach (converting a List of case classes into a DataFrame), we have the entire collection in the memory of the driver (we'll look at drivers and workers in Chapter 6, Scaling Up). Except when experimenting with Spark, we rarely have our dataset as an in-memory collection of case classes. We generally have it as a text file or read it from a database. Also, having to hold the entire collection on a single machine before converting it into a distributed dataset (RDD) will sooner or later become a memory issue.

In this recipe, we loaded a file as an RDD[String] that, on a cluster, is distributed across the worker nodes, and then parsed each String into a Student object, turning the RDD[String] into an RDD[Student]. So, each worker node that holds some partitions of the dataset handles the computation of transforming its Strings into Student objects, while making the resulting dataset conform to a fixed schema enforced by the class itself. Since the computation and the data itself are distributed, we don't need to worry about a single machine requiring a lot of memory to store the entire dataset.

Loading JSON into DataFrames

JSON has become the most common text-based data representation format these days. In this recipe, we'll see how to load data represented as JSON into our DataFrame. To make it more interesting, let's have our JSON in HDFS instead of our local filesystem.
The Hadoop Distributed File System (HDFS) is a highly distributed filesystem that is both scalable and fault tolerant. It is a critical part of the Hadoop ecosystem and is inspired by the Google File System paper (http://research.google.com/archive/gfs.html). More details about the architecture and communication protocols of HDFS can be found at http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.

How to do it…

In this recipe, we'll see three subrecipes:

• How to create a schema-inferred DataFrame from JSON using sqlContext.jsonFile
• Alternatively, if we prefer to preprocess the input file before parsing it as JSON, how to read the input file as text and convert it into a DataFrame using sqlContext.jsonRDD
• Finally, how to declare an explicit schema and use it to create a DataFrame

Reading a JSON file using SQLContext.jsonFile

This recipe consists of three steps:

1. Storing our JSON (profiles.json) in HDFS: A copy of the data file is added to our project repository, and it can be downloaded from https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter3-data-loading/profiles.json:

    hadoop fs -mkdir -p /data/scalada
    hadoop fs -put profiles.json /data/scalada/profiles.json
    hadoop fs -ls /data/scalada
    -rw-r--r--   1 Gabriel supergroup     176948 2015-05-16 22:13 /data/scalada/profiles.json

The following screenshot shows the HDFS file explorer available at http://localhost:50070, which confirms that our upload is successful:

2. Creating contexts: We do the regular stuff, that is, create SparkConf, SparkContext, and then SQLContext:

    val conf = new SparkConf().setAppName("DataFromJSON").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

3. Creating a DataFrame from JSON: In this step, we use the jsonFile function of SQLContext to create a DataFrame. This is very similar to the sqlContext.csvFile function that we used in Chapter 2, Getting Started with Apache Spark DataFrames. There's just one thing that we need to watch out for here: our .json file should be formatted as one record per line. It is unusual to store JSON this way considering that it is a structured format, but the jsonFile function treats every single line as one record. If the file is not laid out like this, parsing fails with a scala.MatchError:

    val dFrame = sqlContext.jsonFile("hdfs://localhost:9000/data/scalada/profiles.json")

That's it! We are done! Let's just print the schema and sample the data:

    dFrame.printSchema()
    dFrame.show()

The following screenshot shows the schema that is inferred from the JSON file. Note that the age is now resolved as long and tags are resolved as an array of string, as you can see here:

The next screenshot shows you a sample of the dataset:

Reading a text file and converting it to JSON RDD

In the previous section, we saw how we can directly import a text file containing JSON records as a DataFrame using sqlContext.jsonFile. Now, we'll see an alternate approach, wherein we construct an RDD[String] from the same profiles.json file and then convert it into a DataFrame. This has a distinct advantage over the previous approach: we can preprocess the lines and have more control over the schema instead of relying on the one that is inferred:

    val strRDD = sc.textFile("hdfs://localhost:9000/data/scalada/profiles.json")
    val jsonDf = sqlContext.jsonRDD(strRDD)
    jsonDf.printSchema()

The following is the output of the preceding command:

Explicitly specifying your schema

Using jsonRDD and letting it resolve the schema by itself is clean and simple. However, it gives less control over the types; for example, we would like the age field to be an Integer and not a Long. Similarly, the `registered` column is inferred as a String while it is actually a Timestamp. In order to achieve this, let's go ahead and declare our own schema.
We do this by constructing a StructType made up of StructFields:

    val profilesSchema = StructType(
      Seq(
        StructField("_id", StringType, true),
        StructField("about", StringType, true),
        StructField("address", StringType, true),
        StructField("age", IntegerType, true),
        StructField("company", StringType, true),
        StructField("email", StringType, true),
        StructField("eyeColor", StringType, true),
        StructField("favoriteFruit", StringType, true),
        StructField("gender", StringType, true),
        StructField("name", StringType, true),
        StructField("phone", StringType, true),
        StructField("registered", TimestampType, true),
        StructField("tags", ArrayType(StringType), true)
      )
    )

    val jsonDfWithSchema = sqlContext.jsonRDD(strRDD, profilesSchema)
    jsonDfWithSchema.printSchema() //Has timestamp
    jsonDfWithSchema.show()

Another advantage of specifying our own schema is that not all the columns need to be specified in the StructType. We just need to specify the columns that we are interested in, and only those columns will be available in the target DataFrame. Also, any column that is declared in the schema but is not available in the dataset will be filled in with null values.

The following is the output. We can see that the registered field is considered to have a timestamp data type and age an integer data type:

Finally, just for kicks, let's fire a filter query based on the timestamp. This involves three steps:

1. Register the DataFrame as a temporary table for querying, as has been done several times in previous recipes. The following line of code registers a table by the name of profilesTable:

    jsonDfWithSchema.registerTempTable("profilesTable")

2. Let's fire away our filter query. The following query returns all profiles that have been registered after August 26, 2014. Since the registered field is a timestamp, we require an additional minor step of casting the parameter into a TIMESTAMP:

    val filterCount = sqlContext.sql("select * from profilesTable where registered > CAST('2014-08-26 00:00:00' AS TIMESTAMP)").count

3. Let's print the count:

    println("Filtered based on timestamp count : " + filterCount) //106

There's more…

If you aren't comfortable with having the schema in the code and would like to save the schema in a file, it's just a one-liner for you:

    import scala.reflect.io.File
    import scala.io.Source

    //Writes schema as JSON to file
    File("profileSchema.json").writeAll(profilesSchema.json)

Obviously, you would want to reconstruct the schema from JSON, and that's also a one-liner:

    val loadedSchema = DataType.fromJson(Source.fromFile("profileSchema.json").mkString)

Let's check whether loadedSchema and profilesSchema encapsulate the same schema by doing an equality check on their JSON:

    println("ProfileSchema == loadedSchema :" + (loadedSchema.json == profilesSchema.json))

The output is shown as follows:

    ProfileSchema == loadedSchema :true

If we would like to eyeball the JSON, we have a nice method called prettyJson that formats it:

    //Print loaded schema
    println(loadedSchema.prettyJson)

The output is as follows:

Storing data as Parquet files

Parquet (https://parquet.apache.org/) is rapidly becoming the go-to data storage format in the world of big data because of the distinct advantages it offers:

• It has a column-based representation of data. This is better represented in a picture, as follows:

As you can see in the preceding screenshot, Parquet stores data in chunks of rows, say 100 rows. In Parquet terms, these are called RowGroups. Each of these RowGroups has chunks of columns inside them (or column chunks). Column chunks can hold more than a single unit of data for a particular column (as represented in the blue box in the first column). For example, Jai, Suri, and Dhina form a single chunk even though they are composed of three single units of data for Name. Another unique feature is that these column chunks (groups of a single column's information) can be read independently. Let's consider the following image:

We can see that the items of column data are stored next to each other in a sequence. Since our queries are focused on just a few columns (a projection) most of the time and not on the entire table, this storage mechanism enables us to retrieve data much faster than reading the entire row data that is stored and then filtering for columns. Also, with Spark's in-memory computations, the memory requirements are reduced in this way.

• The second advantage is that very little is needed to transition from the existing data models that we already use to represent the data. While Parquet has its own native object model, we are pretty much free to choose Avro, ProtoBuf, Thrift, and a variety of existing object models, and use an intermediate converter to serialize our data in Parquet. Most of these converters are readily available in the Parquet-MR project (https://github.com/Parquet/parquet-mr).

In this recipe, we'll cover the following steps:

1. Load a simple CSV file and convert it into a DataFrame.
2. Save it as a Parquet file.
3. Install Parquet tools.
4. Use the tools to inspect the Parquet file.
5. Enable compression for the Parquet file.

The entire code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/tree/master/chapter3-data-loading-parquet.
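The benefit of the column-oriented layout described in the introduction to this recipe can be pictured with a toy model in plain Scala. This is only a sketch and does not use Parquet: the names Person, RowStore, and ColumnStore are hypothetical, and the age and city values are made up (only the three names come from the text).

```scala
// Toy model of row-oriented versus column-oriented storage. This is not Parquet;
// every type name and the age/city values are hypothetical, chosen for illustration.
object ColumnarSketch {
  final case class Person(name: String, age: Int, city: String)

  // Row layout: all the fields of one record are stored together.
  final case class RowStore(rows: Vector[Person]) {
    // Projecting one column still has to visit every complete record.
    def names: Vector[String] = rows.map(_.name)
  }

  // Column layout: all the values of one column are stored together,
  // loosely like a column chunk inside a row group.
  final case class ColumnStore(names: Vector[String], ages: Vector[Int], cities: Vector[String])

  def toColumns(rows: Vector[Person]): ColumnStore =
    ColumnStore(rows.map(_.name), rows.map(_.age), rows.map(_.city))

  val people: Vector[Person] = Vector(
    Person("Jai", 31, "Chennai"),
    Person("Suri", 28, "Madurai"),
    Person("Dhina", 35, "Salem"))

  def main(args: Array[String]): Unit = {
    val columns = toColumns(people)
    // A projection on name reads one contiguous vector and nothing else.
    println(columns.names.mkString(", ")) // Jai, Suri, Dhina
  }
}
```

In the row layout, a projection on name drags the age and city of every record along with it; in the column layout, those values are never touched.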
How to do it…

Before we dive into the steps, let's briefly look at our build.sbt file, specifically the library dependencies and Avro settings (which we'll talk about in the following sections):

    organization := "com.packt"

    name := "chapter3-data-loading-parquet"

    scalaVersion := "2.10.4"

    val sparkVersion = "1.4.1"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion,
      "org.apache.spark" %% "spark-sql" % sparkVersion,
      "org.apache.spark" %% "spark-mllib" % sparkVersion,
      "org.apache.spark" %% "spark-hive" % sparkVersion,
      "org.apache.avro" % "avro" % "1.7.7",
      "org.apache.parquet" % "parquet-avro" % "1.8.1",
      "com.twitter" %% "chill-avro" % "0.6.0"
    )

    resolvers ++= Seq(
      "Apache HBase" at "https://repository.apache.org/content/repositories/releases",
      "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/",
      "Twitter" at "http://maven.twttr.com/"
    )

    fork := true

    seq( sbtavro.SbtAvro.avroSettings : _*)

    (stringType in avroConfig) := "String"

    javaSource in sbtavro.SbtAvro.avroConfig <<= (sourceDirectory in Compile)(_ / "java")

Now that we have build.sbt out of the way, let's go ahead and look at the code behind each of the listed steps.

Load a simple CSV file, convert it to case classes, and create a DataFrame from it

We can actually create a DataFrame directly from CSV using the com.databricks spark-csv package, as we saw in Chapter 2, Getting Started with Apache Spark DataFrames, but for this recipe, we'll just tokenize the CSV and create case classes from it. The input CSV has a header row. So, the conversion process involves skipping the first row.

The class file that we will discuss in this section is https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter3-data-loading-parquet/src/main/scala/com/packt/dataload/ParquetCaseClassMain.scala.
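The tokenize-and-skip-the-header conversion can be tried on an ordinary Scala collection before running it on an RDD. The following is a minimal sketch in plain Scala with no Spark involved; the Student case class is a stand-in whose field names are assumed, the guard against short lines is an addition for illustration, and the sample lines mirror the pipe-delimited layout of StudentData.csv.

```scala
// Plain-Scala sketch of the header-skipping conversion; no Spark required.
object CsvToStudentSketch {
  // Stand-in for the recipe's Student class; the field names are assumed.
  final case class Student(id: String, name: String, phone: String, email: String)

  // None for the header line or a line with too few fields, Some(Student) for a data line.
  def parseLine(line: String): Option[Student] = {
    val data = line.split("\\|")
    if (data.length < 4 || data(0) == "id") None
    else Some(Student(data(0), data(1), data(2), data(3)))
  }

  // flatMap drops every None and unwraps every Some, so the header simply disappears.
  def parseAll(lines: Seq[String]): Seq[Student] = lines.flatMap(line => parseLine(line))

  val sampleLines: Seq[String] = Seq(
    "id|name|phone|email",
    "1|Burke|1-300-746-8446|ullamcorper.velit.in@ametnullaDonec.co.uk",
    "2|Kamal|1-668-571-5046|pede.Suspendisse@interdumenim.edu")

  def main(args: Array[String]): Unit =
    parseAll(sampleLines).foreach(println)
}
```

Because the header is recognized by its content rather than by its position, the same function works unchanged when the lines are spread over several partitions.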
There are just two interesting things that you might notice in the code:

• Some Parquet-producing systems, such as Impala, binary encode the strings. In order to work around this issue, we set the following configuration, which says that if binary data is seen, it should be treated as a string:

    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

• Instead of using sqlContext.createDataFrame, we just call toDF() on the RDD[Student]. The sqlContext.implicits object has a number of implicit conversions that help us convert an RDD[T] to a DataFrame directly. The only requirement for us, as expected, is to import the implicits:

    import sqlContext.implicits._

The rest of the code is the same as we saw earlier:

    val conf = new SparkConf().setAppName("CaseClassToParquet").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    //Treat binary encoded values as Strings
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

    import sqlContext.implicits._

    //Convert each line into Student
    val rddOfStudents = convertCSVToStudents("StudentData.csv", sc)

    //Convert RDD[Student] to a Dataframe using sqlContext.implicits
    val studentDFrame = rddOfStudents.toDF()

The convertCSVToStudents method, which converts each line into a Student object, looks like this:

    def convertCSVToStudents(filePath: String, sc: SparkContext): RDD[Student] = {
      val rddOfStudents: RDD[Student] = sc.textFile(filePath).flatMap(line => {
        val data = line.split("\\|")
        if (data(0) == "id") None else Some(Student(data(0), data(1), data(2), data(3)))
      })
      rddOfStudents
    }

Save it as a Parquet file

This is just a one-liner once we have the DataFrame. This can be done using either the saveAsParquetFile or the save method. If you wish to save it in a Hive table (https://hive.apache.org/), then there is also a saveAsTable method for you:

    //Save DataFrame as Parquet using saveAsParquetFile
    studentDFrame.saveAsParquetFile("studentPq.parquet")

    //OR

    //Save DataFrame as Parquet using the save method
    studentDFrame.save("studentPq.parquet", "parquet", SaveMode.Overwrite)

The save method allows the usage of SaveMode, which has the following alternatives: Append, ErrorIfExists, Ignore, or Overwrite.

The save methods create a directory in the location that you specify (here, we simply store it in our project directory). The directory holds the files that represent the serialized data. It is not entirely human readable, but you may notice that the data of a single column is stored together.

Just as we do for the rest of the recipes, let's read the file and sample the data for confirmation:

    //Read data for confirmation
    val pqDFrame = sqlContext.parquetFile("studentPq.parquet")
    pqDFrame.show()

The following is the output:

Install Parquet tools

Other than using the printSchema method of the DataFrame to inspect the schema, we can use some interesting Parquet tools provided as part of the Parquet project to get a variety of other information.

parquet-tools is a subproject of Parquet and is available at https://github.com/Parquet/parquet-mr/tree/master/parquet-tools.

Since Spark 1.4.1 uses Parquet 1.6.0rc3, we'll need to download that version of the tools from the Maven repository. The executables and the JARs can be downloaded as one bundle from https://repo1.maven.org/maven2/com/twitter/parquet-tools/1.6.0rc3/parquet-tools-1.6.0rc3-bin.tar.gz.

Using the tools to inspect the Parquet file

Let's put the tools into action.
Specifically, we'll do three things in this step:

• Display the schema in Parquet format
• Display the meta information that is stored in Parquet's footer
• Sample the data using head and cat

• Displaying the schema: This can be achieved by calling the parquet-tools command with schema and the Parquet file as the parameters. As an example, let's print the schema using one of the part files:

    bash-3.2$ parquet-tools-1.6.0rc3/parquet-tools schema part-r-00000-20a8b58c-fe1d-43e7-b148-f874b78eb5ec.gz.parquet
    message root {
      optional binary id (UTF8);
      optional binary name (UTF8);
      optional binary phone (UTF8);
      optional binary email (UTF8);
    }

We see that the schema is indeed available in Parquet format and is derived from our case classes.

• Displaying the meta information of a particular Parquet file: As we saw earlier, meta information is stored in the footer. Let's print it to see it. We see that the extra information has the schema that is specific to the data model we used. This information is used when the data is deserialized. The meta parameter of parquet-tools will help achieve this:

    bash-3.2$ parquet-tools-1.6.0rc3/parquet-tools meta part-r-00000-20a8b58c-fe1d-43e7-b148-f874b78eb5ec.gz.parquet
    creator:     parquet-mr version 1.6.0rc3 (build d4d5a07ec9bd262ca1e93c309f1d7d4a74ebda4c)
    extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"string","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true, [more]...

    file schema: root
    ------------------------------------------------------------
    id:          OPTIONAL BINARY O:UTF8 R:0 D:1
    name:        OPTIONAL BINARY O:UTF8 R:0 D:1
    phone:       OPTIONAL BINARY O:UTF8 R:0 D:1
    email:       OPTIONAL BINARY O:UTF8 R:0 D:1

    row group 1: RC:50 TS:3516
    ------------------------------------------------------------
    id:          BINARY GZIP DO:0 FPO:4 SZ:140/326/2.33 VC:50 ENC:RLE,BIT_PACKED,PLAIN
    name:        BINARY GZIP DO:0 FPO:144 SZ:313/483/1.54 VC:50 ENC:RLE,BIT_PACKED,PLAIN
    phone:       BINARY GZIP DO:0 FPO:457 SZ:454/961/2.12 VC:50 ENC:RLE,BIT_PACKED,PLAIN
    email:       BINARY GZIP DO:0 FPO:911 SZ:929/1746/1.88 VC:50 ENC:RLE,BIT_PACKED,PLAIN

• Sampling data using head and cat: Let's now have a sneak peek at the first few rows of the data. The head command will help us do that. It accepts an additional -n parameter, where you can specify the number of records to be displayed:

    bash-3.2$ parquet-tools-1.6.0rc3/parquet-tools head -n 2 part-r-00001.parquet

The preceding command will display only two rows because of the additional -n 2 parameter. The following is the output of this command:

    id = 1
    name = Burke
    phone = 1-300-746-8446
    email = ullamcorper.velit.in@ametnullaDonec.co.uk

    id = 2
    name = Kamal
    phone = 1-668-571-5046
    email = pede.Suspendisse@interdumenim.edu

Optionally, if you wish to display all the records in the file, you can use the cat parameter with the parquet-tools command:

    parquet-tools cat part-r-00001.parquet

Enable compression for the Parquet file

As you can see from the meta information, the data is gzipped by default.
In order to use Snappy compression, all that we need to do is set a configuration on our SQLContext (actually the SQLConf of SQLContext).

There's just one catch with regard to enabling Lempel–Ziv–Oberhumer (LZO) compression: we are required to install native-lzo on all the machines where this data is stored. Otherwise, we get a "native-lzo library not available" error message.

Let's enable Snappy (http://google.github.io/snappy/) compression by setting the Parquet compression codec configuration parameter to snappy:

    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

After running the program, let's use the parquet-tools meta command to verify it:

    parquet-tools meta part-r-00000-aee54b77-288e-44b2-8f36-53b38a489e8d.snappy.parquet

Using the Avro data model in Parquet

Parquet is a highly efficient columnar storage format, but it is also relatively new. Avro (https://avro.apache.org) is a widely used row-based storage format. This recipe showcases how we can retain the older and flexible Avro schema in our code but still use the Parquet format during storage.

The Parquet MR project (yes, the one that has the Parquet tools we saw in the previous recipe) has converters for almost all the popular data formats. These model converters take your format and convert it into Parquet format before persisting it.

How to do it…

In this recipe, we'll use the Avro data model and serialize the data in a Parquet file. The recipe involves the following steps:

1. Create the Avro model.
2. Generate Avro objects using the sbt-avro plugin.
3. Construct the RDD of your generated object (StudentAvro) from Students.csv.
4. Save the RDD[StudentAvro] in a Parquet file.
5. Read the file back for verification.
6. Use Parquet-tools to verify.

Creation of the Avro model

The Avro schema is defined using JSON. In our case, we'll just use the same Student.csv as the input file.
So, let's code the four fields (id, name, phone, and email) in the schema:

    {"namespace": "studentavro.avro",
     "type": "record",
     "name": "StudentAvro",
     "fields": [
         {"name": "id", "type": ["string", "null"]},
         {"name": "name", "type": ["string", "null"]},
         {"name": "phone", "type": ["string", "null"]},
         {"name": "email", "type": ["string", "null"]}
     ]
    }

Probably, you are already familiar with Avro, or you have already understood the schema just by taking a look at it, but let me bore you with some explanation of the schema anyway.

The namespace and name attributes in the JSON translate into the package name and class name in our world, respectively. So, our generated class will have the fully qualified name studentavro.avro.StudentAvro. The "record" (of the type attribute) is one of the complex types in Avro (http://avro.apache.org/docs/1.7.6/spec.html#schema_complex). Let me rephrase this. A record roughly translates to a class in Java/Scala. It is at the topmost level in the schema hierarchy. A record can have multiple fields encapsulated inside it, and these fields can be primitives (https://avro.apache.org/docs/1.7.7/spec.html#schema_primitive) or other complex types. The last bit about the type having an array of types is interesting ("type": ["string", "null"]). It just means that the field can be of more than one type. In Avro terms, it is called a union.

Now that we are done with the schema, let's save this file with an extension of .avsc. I have saved it as student.avsc in the src/main/avro directory.

Generation of Avro objects using the sbt-avro plugin

The next step is to generate a class from the schema. The reason we stored the Avro schema file in the src/main/avro folder is this: we'll be using an sbt-avro plugin (https://github.com/cavorite/sbt-avro) to generate a Java class from the schema.
Configuring the plugin is as easy as configuring any other plugin for SBT:

• Let's add the plugin to project/plugins.sbt:

    addSbtPlugin("com.cavorite" % "sbt-avro" % "0.3.2")

• Add the default settings of the plugin to our build.sbt:

    seq( sbtavro.SbtAvro.avroSettings : _*)

• Let's generate the Java class now. We can do this by calling sbt avro:generate. You can see the generated Java file at target/scala-2.10/src_managed/main/compiled_avro/studentavro/avro/StudentAvro.java.

• We also need the following library dependencies. Finally, let's perform an SBT compile so that the rest of the project picks up the generated Java file:

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion,
      "org.apache.spark" %% "spark-sql" % sparkVersion,
      "org.apache.spark" %% "spark-mllib" % sparkVersion,
      "org.apache.spark" %% "spark-hive" % sparkVersion,
      "org.apache.avro" % "avro" % "1.7.7",
      "org.apache.parquet" % "parquet-avro" % "1.8.1",
      "com.twitter" %% "chill-avro" % "0.6.0"
    )

    sbt compile

Constructing an RDD of our generated object from Students.csv

This step is very similar to the previous recipe in the sense that we use the convertCSVToStudents function to generate an RDD of StudentAvro objects. Also, since this isn't a Scala class and the generated Java object comes with a builder inside it, we use the builder to construct the object fluently (http://en.wikipedia.org/wiki/Fluent_interface):

    val conf = new SparkConf().setAppName("AvroModelToParquet").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

    val rddOfStudents = convertCSVToStudents("StudentData.csv", sc)

    //The CSV has a header row. Skip the line whose first field is "id"
    def convertCSVToStudents(filePath: String, sc: SparkContext): RDD[StudentAvro] = {
      val rddOfStudents: RDD[StudentAvro] = sc.textFile(filePath).flatMap(eachLine => {
        val data = eachLine.split("\\|")
        if (data(0) == "id") None
        else Some(StudentAvro.newBuilder()
          .setId(data(0))
          .setName(data(1))
          .setPhone(data(2))
          .setEmail(data(3)).build())
      })
      rddOfStudents
    }

Saving RDD[StudentAvro] in a Parquet file

This is a tricky step and involves multiple substeps. Let's decipher this step backwards. We fall back to RDD[StudentAvro] in this example instead of a DataFrame because DataFrames can be constructed only from an RDD of case classes (or classes that extend Product, as we saw earlier in this chapter) or from RDD[org.apache.spark.sql.Row]. If you prefer to use DataFrames, you can read the CSV as an array of values, and use RowFactory.create for each array of values. Once an RDD[Row] is available, we can use sqlContext.createDataFrame to convert it to a DataFrame:

• In order to save the RDD as a Hadoop SequenceFile, we can use saveAsNewAPIHadoopFile. A sequence file is simply a flat file that holds key-value pairs. We could have chosen one of the Student attributes as a key, but for the sake of it, let's have it as a Void in this example.

To represent a pair (key-value) in Spark, we use PairRDD (https://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions). Not surprisingly, saveAsNewAPIHadoopFile is available only for PairRDDs. To convert the existing RDD[StudentAvro] to a PairRDD[Void,StudentAvro], we use the map function:

    val pairRddOfStudentsWithNullKey = rddOfStudents.map(each => (null, each))

• Spark uses Java serialization by default to serialize the RDD to be distributed across the cluster. However, the Avro model doesn't implement the java.io.Serializable interface, and hence it won't be able to leverage Java serialization.
That's no reason to worry, however, because Spark provides another serialization mechanism called Kryo, which can be up to 10x faster. The only downside is that we need to explicitly register our serialization candidates:

    val conf = new SparkConf().setAppName("AvroModelToParquet").setMaster("local[2]")
    conf.set("spark.kryo.registrator", classOf[StudentAvroRegistrator].getName)
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

So, we say using the "spark.serializer" configuration that we intend to use KryoSerializer, and using "spark.kryo.registrator" that our registrator is StudentAvroRegistrator. As you may expect, what the registrator does is register our StudentAvro class as a candidate for Kryo serialization. The twitter-chill project (https://github.com/twitter/chill) provides a nice extension to delegate the Kryo serializer to use the Avro serialization:

    class StudentAvroRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[StudentAvro], AvroSerializer.SpecificRecordBinarySerializer[StudentAvro])
      }
    }

• The intent of this recipe is to write a Parquet file, but the data model (schema) is Avro. Since we are going to write this down as a sequence file, we'll be using a bunch of Hadoop APIs. The org.apache.hadoop.mapreduce.OutputFormat specifies the output format of the file that we are going to write, and as expected, we use ParquetOutputFormat (this is available in the parquet-hadoop subproject in the parquet-mr project). There are two things that an OutputFormat requires:

  ◦ The WriteSupport class, which knows how to convert the Avro data model to the actual format. This is achieved with the following line:

    ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])

  ◦ The schema needs to be written to the footer of the Parquet file too. The schema of StudentAvro is accessible by using the getClassSchema function. This line of code achieves that:

    AvroParquetOutputFormat.setSchema(job, StudentAvro.getClassSchema)

Now, what's that job parameter doing here in these two lines of code? The job object is just an instance of org.apache.hadoop.mapreduce.Job:

    val job = new Job()

When we call the setWriteSupportClass and setSchema methods of ParquetOutputFormat and AvroParquetOutputFormat, the resulting configuration is captured inside the JobConf encapsulated inside the Job object. We'll be using this job configuration while saving the data in a sequence file.

• Finally, we save the file by calling saveAsNewAPIHadoopFile. The save method requires a bunch of parameters, each of which we have already discussed. The first parameter is the filename, followed by the key and the value classes. The fourth is the OutputFormat of the file, and finally comes the job configuration itself:

    pairRddOfStudentsWithNullKey.saveAsNewAPIHadoopFile("studentAvroPq",
      classOf[Void],
      classOf[StudentAvro],
      classOf[AvroParquetOutputFormat],
      job.getConfiguration())

We saw the entire program in bits and pieces, so for the sake of completeness, let's see it as a whole:

    object ParquetAvroSchemaMain extends App {

      val conf = new SparkConf().setAppName("AvroModelToParquet").setMaster("local[2]")
      conf.set("spark.kryo.registrator", classOf[StudentAvroRegistrator].getName)
      conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

      val job = new Job()

      val sc = new SparkContext(conf)
      val sqlContext = new SQLContext(sc)
      sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

      val rddOfStudents = convertCSVToStudents("StudentData.csv", sc)

      ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport])
      AvroParquetOutputFormat.setSchema(job, StudentAvro.getClassSchema)

      val pairRddOfStudentsWithNullKey = rddOfStudents.map(each => (null, each))

      pairRddOfStudentsWithNullKey.saveAsNewAPIHadoopFile("studentAvroPq",
        classOf[Void],
        classOf[StudentAvro],
        classOf[AvroParquetOutputFormat],
        job.getConfiguration())

      //The CSV has a header row. Skip the line whose first field is "id"
      def convertCSVToStudents(filePath: String, sc: SparkContext): RDD[StudentAvro] = {
        val rddOfStudents: RDD[StudentAvro] = sc.textFile(filePath).flatMap(eachLine => {
          val data = eachLine.split("\\|")
          if (data(0) == "id") None
          else Some(StudentAvro.newBuilder()
            .setId(data(0))
            .setName(data(1))
            .setPhone(data(2))
            .setEmail(data(3)).build())
        })
        rddOfStudents
      }
    }

    class StudentAvroRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[StudentAvro], AvroSerializer.SpecificRecordBinarySerializer[StudentAvro])
      }
    }

Reading the file back for verification

As always, let's read the file back for confirmation. The function to be called for this is newAPIHadoopFile, which accepts a similar set of parameters as saveAsNewAPIHadoopFile: the name of the file, InputFormat, the key class, the value class, and finally the job configuration. Note that we are using newAPIHadoopFile instead of the parquetFile method that we used previously. This is because we are reading from a Hadoop sequence file:

    //Reading the file back for confirmation.
    ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[StudentAvro]])

    val readStudentsPair = sc.newAPIHadoopFile("studentAvroPq",
      classOf[AvroParquetInputFormat[StudentAvro]],
      classOf[Void],
      classOf[StudentAvro],
      job.getConfiguration())

    val justStudentRDD: RDD[StudentAvro] = readStudentsPair.map(_._2)
    val studentsAsString = justStudentRDD.take(5).mkString("\n")
    println(studentsAsString)

This is the output:

Using Parquet tools for verification

We'll also use Parquet tools to confirm that the schema that is stored in the Parquet file is indeed an Avro schema:

    /Users/Gabriel/Dropbox/arun/ScalaDataAnalysis/git/parquet-mr/parquet-tools/target/parquet-tools-1.6.0rc3/parquet-tools meta /Users/Gabriel/Dropbox/arun/ScalaDataAnalysis/Code/scaladataanalysisCB-tower/chapter3-data-loading-parquet/studentAvroPq

Yup! Looks like it is! The extra section in meta does confirm that the Avro schema is stored:

    creator: parquet-mr
    extra:   parquet.avro.schema = {"type":"record","name":"StudentAvro","namespace":"studentavro.avro","fields":[{"name":"id","type":[{"type":"string","avro.java.string":"Stri [more]...

Loading from RDBMS

As the final recipe on loading, let's try to load data from an RDBMS data source, which is MySQL in our case. This recipe assumes that you have already installed MySQL on your machine.

How to do it…

Let's go through the prerequisite steps first. If you already have a MySQL table to play with, you can safely ignore this step. We are just going to create a new database and a table and load some sample data into it.

The prerequisite step (optional):

1. Creating a database and a table: This is achieved in MySQL by using the create database and the create table DDL:

    create database scalada;
    use scalada
    CREATE TABLE student (
      id varchar(20),
      `name` varchar(200),
      phone varchar(50),
      email varchar(200),
      PRIMARY KEY (id));

2. Loading data into the table: Let's dump some data into the table.
I wrote a very simple app to do this. Alternatively, you can use the load data infile command if you have "local-infile=1" enabled on your server and the client. Refer to https://dev.mysql.com/doc/refman/5.1/en/load-data.html for details about this command.

The program loads the StudentData.csv file that we saw in Chapter 2, Getting Started with Apache Spark DataFrames, when we looked at how to use DataFrames with Spark using the databricks spark-csv connector. Then, for each line, the data is inserted into the table using a plain old JDBC insert. As you might have already figured out, we need to add the MySQL connector java dependency to our build.sbt too:

    "mysql" % "mysql-connector-java" % "5.1.34"

Here is the program:

    import java.sql.DriverManager
    import com.databricks.spark.csv.CsvContext
    import com.typesafe.config.ConfigFactory
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object LoadDataIntoMySQL extends App {
      val conf = new SparkConf().setAppName("LoadDataIntoMySQL").setMaster("local[2]")
      val config = ConfigFactory.load()
      val sc = new SparkContext(conf)
      val sqlContext = new SQLContext(sc)

      val students = sqlContext.csvFile(filePath = "StudentData.csv", useHeader = true, delimiter = '|')

      // One connection and one batch per partition
      students.foreachPartition { iter =>
        val conn = DriverManager.getConnection(config.getString("mysql.connection.url"))
        val statement = conn.prepareStatement("insert into scalada.student (id, name, phone, email) values (?,?,?,?) ")
        for (eachRow <- iter) {
          statement.setString(1, eachRow.getString(0))
          statement.setString(2, eachRow.getString(1))
          statement.setString(3, eachRow.getString(2))
          statement.setString(4, eachRow.getString(3))
          statement.addBatch()
        }
        statement.executeBatch()
        conn.close()
        println("All rows inserted successfully")
      }
    }

A "select * from scalada.student" on the MySQL client should confirm this:

[Screenshot: result of the select query in the MySQL client]

Steps for loading RDBMS data into DataFrame:

The recommended approach to loading data from RDBMS databases is using the SQLContext's load method:

1. Creating the Spark and SQLContext: You may have already become familiar with this step by looking at the previous recipes:

    val conf = new SparkConf().setAppName("DataFromRDBMS").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

2. Constructing a map of options: This map is expected to have not only the driver and the connection URL, but also the query to be invoked in order to load the data. In this example, we'll store the parameter values in an external Typesafe config file and load the values into our program. The Typesafe application.conf is located at src/main/resources as per standard SBT/Maven conventions.

[Screenshot: contents of application.conf]

Now let's look at the code that constructs the map:

    val config = ConfigFactory.load()
    val options = Map(
      "driver" -> config.getString("mysql.driver"),
      "url" -> config.getString("mysql.connection.url"),
      "dbtable" -> "(select * from student) as student",
      "partitionColumn" -> "id",
      "lowerBound" -> "1",
      "upperBound" -> "100",
      "numPartitions" -> "2")

The first three parameters are straightforward. The numPartitions parameter specifies the number of partitions for this job, and partitionColumn specifies the column in the table on which the job is partitioned. The lowerBound and upperBound parameters are values of the "id" field. The amount of data to be handled by a single partition is calculated using the number of partitions and the lower and upper bounds.

3. Loading data from the table: The load function of SQLContext expects two parameters. The first one specifies "jdbc" as the source of the data, and the second parameter is the options that we constructed in step 2.
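To get a feel for how the bounds and the number of partitions interact, here is a minimal sketch in plain Scala that derives one WHERE-clause predicate per partition. It is only an approximation of what Spark does internally, and the exact rule may differ between Spark versions; the object name JdbcPartitionSketch and the partitionPredicates helper are hypothetical and not part of Spark's API.

```scala
object JdbcPartitionSketch {
  // Hypothetical helper (not a Spark API): approximates how a numeric
  // partition column is split into per-partition WHERE predicates.
  def partitionPredicates(column: String, lower: Long, upper: Long, numPartitions: Int): List[String] = {
    require(numPartitions >= 2, "this sketch only covers the multi-partition case")
    require(lower <= upper, "lowerBound must not exceed upperBound")
    // Width of each partition's value range
    val stride = upper / numPartitions - lower / numPartitions
    (0 until numPartitions).toList.map { i =>
      val from = lower + i * stride
      val to = from + stride
      if (i == 0) s"$column < $to"                          // first partition is open below
      else if (i == numPartitions - 1) s"$column >= $from"  // last partition is open above
      else s"$column >= $from AND $column < $to"
    }
  }
}
```

With the options used in this recipe, JdbcPartitionSketch.partitionPredicates("id", 1, 100, 2) yields the two predicates "id < 51" and "id >= 51". Note that in this sketch the bounds only decide where the ranges are cut; rows below the lower bound or above the upper bound still end up in the first or the last partition rather than being filtered out.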
Let's now print the schema and show the first 20 rows, as we always do:

    val dFrame = sqlContext.load("jdbc", options)
    dFrame.printSchema()
    dFrame.show()

This is the output:

    root
     |-- id: string (nullable = false)
     |-- name: string (nullable = true)
     |-- phone: string (nullable = true)
     |-- email: string (nullable = true)
     |-- gender: string (nullable = true)

We can see that the schema of the DataFrame is derived from the MySQL table definition: the id field, which is the primary key, is marked as not nullable.

The output is as follows:

[Screenshot: the first 20 rows of the DataFrame]

Preparing data in DataFrames

Other than filtering, conversions, and transformations with DataFrames (which we saw in Chapter 2, Getting Started with Apache Spark DataFrames), let's see a few more data preparation tricks in this recipe. We'll also be looking at specific data preparation in Chapter 5, Learning from Data, where we will focus on using various machine learning algorithms.

How to do it...

While preprocessing data, we may be required to:

• Merge two different datasets
• Perform set operations on two datasets
• Sort the DataFrame by casting an attribute value
• Choose a member from one dataset over another based on a predicate
• Parse arbitrary date/time inputs

We'll use the StudentPrep1.csv and StudentPrep2.csv datasets for the first four tasks, and for the last one, we'll use StrangeDate.json, a JSON-based dataset. The CSV and the JSON datasets are chosen primarily for convenience; the input data could be anything.

The StudentPrep1.csv dataset is shown in this screenshot:

[Screenshot: contents of StudentPrep1.csv]

The StudentPrep2.csv dataset is shown in the following screenshot:

[Screenshot: contents of StudentPrep2.csv]

The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/tree/master/chapter3-data-loading.
Let's convert them into DataFrames using the databricks/spark-csv library, which we used in Chapter 2, Getting Started with Apache Spark DataFrames, when we talked about loading DataFrames from CSV:

    import com.databricks.spark.csv.CsvContext

    val students1 = sqlContext.csvFile(filePath = "StudentPrep1.csv", useHeader = true, delimiter = '|')
    val students2 = sqlContext.csvFile(filePath = "StudentPrep2.csv", useHeader = true, delimiter = '|')

1. Merging datasets: The DataFrame provides a convenient way to merge another DataFrame: unionAll. The unionAll function accepts another DataFrame as an argument. Not surprisingly, the merged DataFrame keeps any duplicates:

    val allStudents = students1.unionAll(students2)
    allStudents.show(allStudents.count().toInt)

The output is shown as follows:

[Screenshot: all rows of the merged DataFrame]

2. Performing set operations: Just like unionAll, the DataFrame has functions for various set operations. The intersection of two DataFrames just entails calling the intersect function:

    val intersection = students1.intersect(students2)
    intersection.foreach(println)

    [7,Sara,1-608-140-1995,Donec.nibh@enimEtiamimperdiet.edu]
    [8,Kaseem,1-881-586-2689,cursus.et.magna@euismod.org]
    [10,Maya,1-271-683-2698,accumsan.convallis@ornarelectusjusto.edu]
    [9,Lev,1-916-367-5608,Vivamus.nisi@ipsumdolor.com]

Deriving the difference of one DataFrame from another is done by calling the except() function with another DataFrame as the parameter:

    val subtraction = students1.except(students2)
    subtraction.foreach(println)

Here is the output:

    [6,Laurel,1-691-379-9921,adipiscing@consectetueripsum.edu]
    [4,Belle,1-246-894-6340,vitae.aliquet.nec@neque.co.uk]
    [2,Kamal,1-668-571-5046,pede.Suspendisse@interdumenim.edu]
    [5,Trevor,1-300-527-4967,dapibus.id@acturpisegestas.net]
    [3,Olga,1-956-311-1686,Aenean.eget.metus@dictumcursusNunc.edu]
    [1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]

If there are duplicates in the data, the distinct function removes them and returns a DataFrame with only unique rows:

    val distinctStudents = allStudents.distinct
    distinctStudents.foreach(println)
    println(distinctStudents.count())

The following is the output:

    [4,BelleDifferentName,1-246-894-6340,vitae.aliquet.nec@neque.co.uk]
    [1,Burke,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]
    [2,KamalDifferentName,1-668-571-5046,pede.Suspendisse@interdumenim.edu]
    [999,LevUniqueToSecondRDD,1-916-367-5608,Vivamus.nisi@ipsumdolor.com]
    [1,BurkeDifferentName,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]
    [2,Kamal,1-668-571-5046,pede.Suspendisse@interdumenim.edu]
    [3,Olga,1-956-311-1686,Aenean.eget.metus@dictumcursusNunc.edu]
    [7,Sara,1-608-140-1995,Donec.nibh@enimEtiamimperdiet.edu]
    [8,Kaseem,1-881-586-2689,cursus.et.magna@euismod.org]
    [5,Trevor,1-300-527-4967,dapibus.id@acturpisegestas.net]
    [4,Belle,1-246-894-6340,vitae.aliquet.nec@neque.co.uk]
    [6,Laurel,1-691-379-9921,adipiscing@consectetueripsum.edu]
    [6,LaurelInvalidPhone,000000000,adipiscing@consectetueripsum.edu]
    [9,Lev,1-916-367-5608,Vivamus.nisi@ipsumdolor.com]
    [3,Olga,1-956-311-1686,Aenean.eget.metus@dictumcursusNuncDifferentEmail.edu]
    [5,Trevor,1-300-527-4967,dapibusDifferentEmail.id@acturpisegestas.net]
    [10,Maya,1-271-683-2698,accumsan.convallis@ornarelectusjusto.edu]

    Count output: 17

3. Sorting the DataFrame by casting an attribute value: Sometimes, our DataFrame infers an integer attribute as a string. Since DataFrames are immutable, the correct way of converting an attribute from one type to another is by creating another DataFrame. In this recipe, we'll not only cast one attribute type to another, but also sort the DataFrame based on that attribute.
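To see why the cast matters before sorting, here is a small, Spark-free sketch using plain Scala collections. The sample IDs and the object name CastBeforeSortSketch are made up for illustration; the point is only that string ordering is lexicographic, while integer ordering is numeric.

```scala
object CastBeforeSortSketch {
  // Hypothetical sample of student IDs as they would arrive from a CSV column read as strings
  val idsAsStrings: List[String] = List("10", "9", "2", "999", "1")

  // Sorting strings compares character by character, so "10" sorts before "2"
  val sortedAsStrings: List[String] = idsAsStrings.sorted

  // Converting to Int first gives the numeric order that we actually want
  val sortedAsInts: List[Int] = idsAsStrings.map(_.toInt).sorted
}
```

Here, sortedAsStrings is List("1", "10", "2", "9", "999"), whereas sortedAsInts is List(1, 2, 9, 10, 999). The same effect applies to a DataFrame column that holds numbers as strings, which is why the id attribute is cast to an integer before sorting.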
The simplest way to achieve this is by using a Spark SQL expression:

    val sortedCols = allStudents.selectExpr("cast(id as int) as id", "studentName", "phone", "email").sort("id")
    println("sorting")
    sortedCols.show(sortedCols.count.toInt)

The output is shown here:

[Screenshot: the merged DataFrame sorted by the integer id]

4. Choosing a member from one dataset over another based on a predicate: Let's assume that for a given student ID across two different datasets, you would like to pick only the row that has the longer name (or matches some other predicate). The result would be just one row per ID. This involves three mini-steps:

  1. Map the merged DataFrame (created using unionAll) and produce an RDD of pairs with the key as the ID (or any other field based on which you would like to merge):

    val idStudentPairs = allStudents.rdd.map(eachRow => (eachRow.getString(0), eachRow))

  2. The next step is to use a function called reduceByKey. It accepts a function that takes two rows and returns a single row. In our case, we simply write the logic to choose the row with the longer name:

    //Removes duplicates by id and holds on to the row with the longest name
    val longestNameRdd = idStudentPairs.reduceByKey((row1, row2) =>
      if (row1.getString(1).length() > row2.getString(1).length()) row1 else row2
    )

  3. Let's print the output:

    longestNameRdd.values.foreach(println)

The output is as follows:

    [4,BelleDifferentName,1-246-894-6340,vitae.aliquet.nec@neque.co.uk]
    [8,Kaseem,1-881-586-2689,cursus.et.magna@euismod.org]
    [6,LaurelInvalidPhone,000000000,adipiscing@consectetueripsum.edu]
    [2,KamalDifferentName,1-668-571-5046,pede.Suspendisse@interdumenim.edu]
    [7,Sara,1-608-140-1995,Donec.nibh@enimEtiamimperdiet.edu]
    [5,Trevor,1-300-527-4967,dapibusDifferentEmail.id@acturpisegestas.net]
    [9,Lev,1-916-367-5608,Vivamus.nisi@ipsumdolor.com]
    [3,Olga,1-956-311-1686,Aenean.eget.metus@dictumcursusNuncDifferentEmail.edu]
    [999,LevUniqueToSecondRDD,1-916-367-5608,Vivamus.nisi@ipsumdolor.com]
    [1,BurkeDifferentName,1-300-746-8446,ullamcorper.velit.in@ametnullaDonec.co.uk]
    [10,Maya,1-271-683-2698,accumsan.convallis@ornarelectusjusto.edu]

5. Parsing arbitrary date/time inputs and converting an array into a comma-separated string: While preparing data, we often find that dates and times, in particular, appear in some crazy formats. As always, our aim is to standardize them. For this subrecipe, we'll be using a JSON file that looks like this:

[Screenshot: contents of StrangeDate.json]

As we saw in previous recipes, we could bring in this JSON as a DataFrame, but the date format isn't ISO 8601, which means that it won't be considered a timestamp and would be treated as a plain string. In this subrecipe, let's see how to convert such a string into a date.

Note: Performing arbitrary transformations is a work in progress (https://issues.apache.org/jira/browse/SPARK-4190).

This subrecipe involves four steps:

  1. Import the JSON as a text file:

    val stringRDD = sc.textFile("StrangeDate.json")

  2. Create a new org.joda.time.format.DateTimeFormat for the specific pattern that our input is in:

    val formatter = DateTimeFormat.forPattern("MM/dd/yyyy HH:mm:ss")

  3. Add json4s as a dependency in build.sbt:

    "org.json4s" % "json4s-core_2.10" % "3.2.11",
    "org.json4s" % "json4s-jackson_2.10" % "3.2.11"

  4. For each line of the JSON string, parse the string and convert it to an org.json4s.JValue. The advantage of JValue is that we can traverse the JSON object with XPath-like expressions. Chaining the compact and render functions will help us convert a JValue to a String.
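Before wiring everything together, the date-parsing and tag-joining parts of this step can be tried in isolation. The following is a minimal sketch that swaps Joda-Time for the standard java.time API (so it assumes Java 8 or later) and needs no extra dependency. The object name DateParsingSketch, its two helper functions, and the sample values are hypothetical.

```scala
import java.sql.Timestamp
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object DateParsingSketch {
  // Same pattern as the Joda formatter in step 2, expressed with java.time
  private val formatter = DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm:ss")

  // Parses a non-ISO date string into a java.sql.Timestamp
  def toTimestamp(raw: String): Timestamp =
    Timestamp.valueOf(LocalDateTime.parse(raw, formatter))

  // Flattens a list of tags into a single comma-separated string
  def joinTags(tags: List[String]): String = tags.mkString(",")
}
```

For example, DateParsingSketch.toTimestamp("01/02/2015 10:30:45") gives the timestamp 2015-01-02 10:30:45.0, and DateParsingSketch.joinTags(List("a", "b")) gives "a,b". The recipe itself sticks with Joda-Time and json4s.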
In the following code, we extract the name field as is, convert an array of tags into a comma-separated string using the extract function, parse the date string using the DateTimeFormat that we created earlier, and construct a Timestamp object. Finally, we yield a case class called JsonDateModel out of the for comprehension that wraps around the name, date, and tags:

    case class JsonDateModel(name: String, dob: Timestamp, tags: String)

    import org.json4s._
    import org.json4s.jackson.JsonMethods._
    implicit val formats = DefaultFormats

    val dateModelRDD = for {
      json <- stringRDD
      jsonValue = parse(json)
      name = compact(render(jsonValue \ "name"))
      // compact(render(...)) keeps the surrounding quotes, so strip them before parsing
      dateAsString = compact(render(jsonValue \ "dob")).replace("\"", "")
      date = new Timestamp(formatter.parseDateTime(dateAsString).getMillis())
      tags = render(jsonValue \ "tags").extract[List[String]].mkString(",")
    } yield JsonDateModel(name, date, tags)

After that, we construct a DataFrame out of this case class RDD and print the schema to confirm:

    import sqlContext.implicits._
    val df = dateModelRDD.toDF()
    df.printSchema()
    df.show(df.count.toInt)

The output is as follows:

• Schema:

[Screenshot: schema of the DataFrame]

• Data:

[Screenshot: rows of the DataFrame]

4 Data Visualization

In this chapter, we will cover the following recipes:

• Visualizing using Zeppelin
• Creating scatter plots with Bokeh-Scala
• Creating a time series MultiPlot with Bokeh-Scala

Introduction

In all honesty, free / open source data visualization tools in Scala aren't that rich compared to those in other mature data analysis languages, such as R or Python. We might partly attribute this to the lack of rich charting frameworks in Java, and visualization has never been a strong point for big data analytics. That said, Scala (or more specifically the Hadoop world, including Spark) is catching up with the presence of the Apache incubator project Zeppelin and the highly active Scala bindings (https://github.com/bokeh/bokeh-scala) for the Bokeh project (http://bokeh.pydata.org/en/latest/).
With R becoming a first-class citizen in Spark (SparkR DataFrames are available from 1.4 onwards), Spark gains R's visualization options in addition to the existing Python ones. As a side note, all existing Java libraries are accessible from Scala. Hence, we are free to borrow any visualization library from Java.

Visualizing using Zeppelin

Apache Zeppelin is a nifty web-based tool that helps us visualize and explore large datasets. From a technical standpoint, Apache Zeppelin is a web application on steroids. We aim to use this application to render some neat, interactive, and shareable graphs and charts.

The interesting part of Zeppelin is that it has a bunch of built-in interpreters: ones that can interpret and invoke all API functions in Spark (with a SparkContext) and Spark SQL (with a SQLContext). The other built-in interpreters are for Hive, Flink, Markdown, and Scala. It also has the ability to run remote interpreters (outside of Zeppelin's own JVM) via Thrift. To look at the list of built-in interpreters, you can go through conf/interpreter.json in the Zeppelin installation directory. Alternatively, you can view and customize the interpreters from http://localhost:8080/#/interpreter once you start the Zeppelin daemon.

How to do it...

In this recipe, we'll be using the built-in SparkContext and SQLContext inside Zeppelin and transform data using Spark. At the end, we'll register the transformed data as a table and use Spark SQL to query the data and visualize it.

The list of subrecipes in this section is as follows:

• Installing Zeppelin
• Customizing Zeppelin's server and websocket port
• Visualizing data on HDFS – parameterizing inputs
• Using custom functions during visualization
• Adding external dependencies to Zeppelin
• Pointing to an external Spark cluster

Installing Zeppelin

Zeppelin (http://zeppelin-project.org/) doesn't have a binary bundle yet.
However, just as its project site claims, it is pretty easy to build from source. We only need to run a few commands to install it on our local machine. At the end of this recipe, we'll take a look at how to point our Zeppelin to an external Spark master:

    git clone https://github.com/apache/incubator-zeppelin.git
    cd incubator-zeppelin
    mvn clean package -Pspark-1.4 -Dhadoop.version=2.2.0 -Phadoop-2.2 -DskipTests

Once built, we can start the Zeppelin daemon using the following command:

    bin/zeppelin-daemon.sh start

To stop the daemon, we can use this command:

    bin/zeppelin-daemon.sh stop

If the build fails with the following error, check rat.txt; you will most likely find that it complains about your data file:

    Failed to execute goal org.apache.rat:apache-rat-plugin:0.11:check (verify.rat) on project zeppelin: Too many files with unapproved license: 3

Simply move your data file to a different location and initiate the build again.

Customizing Zeppelin's server and websocket port

Zeppelin runs on port 8080 by default, and it has a websocket port enabled at the +1 port (8081) by default. We can customize the ports by copying conf/zeppelin-site.xml.template to conf/zeppelin-site.xml and changing the ports and various other properties, if necessary. Since the Spark standalone cluster master web UI also runs on 8080, when we are running Zeppelin on the same machine as the Spark master, we have to change the ports to avoid conflicts.

For now, let's change the port to 8180. In order for this to take effect, let's restart Zeppelin using bin/zeppelin-daemon.sh restart.

Visualizing data on HDFS – parameterizing inputs

Once we start the daemon, we can point our browser to http://localhost:8080 (change the port as per your modified port configuration) to view the Zeppelin UI. Zeppelin organizes its contents as notes and paragraphs. A note is simply a list of all the paragraphs on a single web page.
Using data from HDFS simply means that we point to the HDFS location instead of the local filesystem location.

Before we consume the file from HDFS, let's quickly check the Spark version that Zeppelin uses. This can be achieved by issuing sc.version in a paragraph. Here, sc is an implicit variable representing the SparkContext inside Zeppelin, which simply means that we need not programmatically create a SparkContext within Zeppelin.

Next, let's load the profiles.json sample data file, convert it into a DataFrame, and print the schema and the first 20 rows (show) for verification. Finally, let's also register the DataFrame as a table. Just like the implicit variable for SparkContext, SQLContext is represented by the sqlc implicit variable inside Zeppelin:

    val profilesJsonRdd = sqlc.jsonFile("hdfs://localhost:9000/data/scalada/profiles.json")
    val profileDF = profilesJsonRdd.toDF()
    profileDF.printSchema()
    profileDF.show()
    profileDF.registerTempTable("profiles")

The output looks like this:

[Screenshot: schema and first 20 rows of the profiles DataFrame in Zeppelin]

Be careful not to explicitly create a SQLContext or SparkContext. If we create a SQLContext explicitly and register our temporary tables with it, they won't be accessible from the SQL queries that we execute. We'll get this error:

    no such table List ([YOUR TEMP TABLE NAME])

Let's now run a simple query to understand eye colors and their counts for men in the dataset:

    %sql select eyeColor, count(eyeColor) as count from profiles where gender='male' group by eyeColor

The %sql at the beginning of the paragraph indicates to Zeppelin that we are about to execute a Spark SQL query in this paragraph.

Now, if we wish to share this chart with someone or link it to an external website, we can do so by clicking on the gear icon in this paragraph and then clicking on Link this paragraph, as shown in the following screenshot:

[Screenshot: the Link this paragraph option under the gear icon]

We can actually parameterize the input for gender instead of altering our query every time.
This is achieved by the use of ${PARAMETER PLACEHOLDER}:

    %sql select eyeColor, count(eyeColor) as count from profiles where gender="${gender}" group by eyeColor

Finally, if parameterizing using free-form text isn't enough, we can use a dropdown instead:

    %sql select eyeColor, count(eyeColor) as count from profiles where gender="${gender=male,male|female}" group by eyeColor

Running custom functions

While Spark SQL doesn't support a range of functions as wide as ANSI SQL does, it has an easy and powerful mechanism for registering a normal Scala function and using it inside the SQL context.

Let's say we would like to find out how many profiles fall under each age group. We have a simple function called ageGroup. Given an age, it returns a string representing the age group:

    def ageGroup(age: Long) = {
      val buckets = Array("0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "61-70", "71-80", "81-90", "91-100", ">100")
      // Ages 1-10 map to index 0, 11-20 to index 1, and so on; anything above 100 lands in the last bucket
      buckets(math.min((age.toInt - 1) / 10, buckets.length - 1))
    }

Now, in order to register this function to be used inside Spark SQL, all that we need to do is give it a name and call the register method of the SQLContext's user-defined function object:

    sqlc.udf.register("ageGroup", (age: Long) => ageGroup(age))

Let's fire our query and see the function in action:

    %sql select ageGroup(age) as group, count(1) as total from profiles where gender='${gender=male,male|female}' group by ageGroup(age) order by group

Here is the output:

[Screenshot: chart of profile counts per age group]

Adding external dependencies to Zeppelin

Sooner or later, we will depend on external libraries other than those that come bundled with Zeppelin, say for an efficient CSV import or RDBMS data import. Let's see how to load a MySQL database driver and visualize data from a table.

In order to load the mysql-connector-java driver, we just need to specify the group ID, artifact ID, and version number, and the JAR gets downloaded from the Maven repository.
%dep indicates that the paragraph adds a dependency, and the z implicit variable represents the Zeppelin context:

    %dep
    z.load("mysql:mysql-connector-java:5.1.35")

If we would like to point to our enterprise Maven repository or some other custom repository, we can add them by calling the addRepo method of the Zeppelin context available via the same z implicit variable:

    %dep
    z.addRepo("RepoName").url("RepoURL")

Alternatively, we can load the JAR from the local filesystem using the overloaded load method:

    %dep
    z.load("/path/to.jar")

The only thing that we need to watch out for while using %dep is that the dependency paragraph should be run before any paragraph that uses the libraries being loaded. So, it is generally advised to load the dependencies at the top of the Notebook.

Let's see the use in action:

• Loading the dependency:

[Screenshot: the %dep paragraph loading the MySQL connector]

Once we have loaded the dependencies, we need to construct the options required to connect to the MySQL database:

    val props = scala.collection.mutable.Map[String, String]()
    props += ("driver" -> "com.mysql.jdbc.Driver")
    props += ("url" -> "jdbc:mysql://localhost/scalada?user=root&password=orange123")
    props += ("dbtable" -> "(select id, name, phone, email, gender from scalada.student) as students")
    props += ("partitionColumn" -> "id")
    props += ("lowerBound" -> "0")
    props += ("upperBound" -> "100")
    props += ("numPartitions" -> "2")

• Using the connection to create a DataFrame:

    import scala.collection.JavaConverters._
    val studentDf = sqlContext.load("jdbc", props.asJava)
    studentDf.printSchema()
    studentDf.show()
    studentDf.registerTempTable("students")

• Visualizing the data:

[Screenshot: chart built from the students table]

Pointing to an external Spark cluster

Running Zeppelin with built-in Spark is all good, but in most cases, we'll be executing the Spark jobs initiated by Zeppelin on a cluster of workers. Achieving this is pretty simple; we need to configure Zeppelin to point its Spark master property to an external Spark master URL.
We'll be looking at how to install and run a Spark cluster on AWS, or a truly distributed cluster, in a later chapter (Chapter 6, Scaling Up), but for this example, I have a simple standalone external Spark cluster running on my local machine. Please note that we will have to run Zeppelin on a different port because the Zeppelin UI port conflicts with the Spark standalone cluster master web UI over 8080.

For this example, let's download the Spark source for 1.4.1 and build it for Hadoop version 2.2:

    build/mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package

Similarly, let's download the Zeppelin incubator source and build it, specifying the Hadoop version to be 2.2:

    mvn clean install -Pspark-1.4 -Dhadoop.version=2.2.0 -Phadoop-2.2 -DskipTests -Pyarn

Let's bring up the Spark cluster. From inside your Spark source directory, execute this:

    sbin/start-all.sh

Finally, let's modify conf/interpreter.json and conf/zeppelin-env.sh to point the master property to the host on which the Spark master is running. In this case, it will be my localhost, with the port being 7077, which is the default master port.

1. The conf/interpreter.json file:

[Screenshot: the master property in conf/interpreter.json]

2. The conf/zeppelin-env.sh file:

[Screenshot: the master setting in conf/zeppelin-env.sh]

Now, when we rerun Spark SQL from Zeppelin, we can see that the job runs on the external Spark instance, as shown here:

[Screenshot: the job listed in the external Spark master web UI]

Creating scatter plots with Bokeh-Scala

While Zeppelin is powerful enough to quickly execute our Spark SQL queries and visualize data, it is still an evolving platform. In this section, we'll take a brief look at Bokeh, a popular visualization framework in Python, and use its (also fast evolving) Scala bindings.

Breeze also has a visualization API called breeze-viz, which is built on JFreeChart. Unfortunately, at the time of writing this book, the API is not actively maintained, and therefore we won't be discussing it here.

The power of Zeppelin lies in the ability to share and view graphics on the browser.
This is made possible by the D3.js JavaScript visualization library that backs it. Bokeh is also backed by another JavaScript visualization library, called BokehJS. The Scala bindings library (bokeh-scala) not only gives an easier way to construct glyphs (lines, circles, and so on) out of Scala objects, but also translates glyphs into a format that is understandable by the BokehJS JavaScript components.

There is a warning here: the Bokeh-Scala bindings are still evolving and act at a lower level. Sometimes, this is more cumbersome than its Python counterpart. That said, I am still sure that we all would be able to appreciate the amazing graphs that we can create right out of Scala.

How to do it...

In this recipe, we will be creating a scatter plot using iris data (https://archive.ics.uci.edu/ml/datasets/Iris), which has the length and width attributes of flowers belonging to three different species of the same plant. Drawing a scatter plot on this dataset involves a series of interesting substeps.

For the purpose of representing the iris data in a Breeze matrix, I have naïvely transformed the species categories into numbers:

• Iris setosa: 0
• Iris versicolor: 1
• Iris virginica: 2

This is available in irisNumeric.csv. Later, we'll see how we can load the original iris data (iris.data) into a Spark DataFrame and use that as a source for plotting.

For the sake of clarity, let's define what the various terms in Bokeh actually mean:

• Glyph: All geometric shapes that we can think of (circles, squares, lines, and so on) are glyphs. This is just the UI representation and doesn't hold any data. All the properties related to this object just help us modify the UI properties: color, x, y, width, and so on.

• Plot: A plot is like a canvas on which we arrange various objects relevant to the visualization, such as the legend, x and y axes, grid, tools, and obviously, the core of the graph, which is the data itself. We construct various accessory objects and finally add them to the list of renderers in the plot object.

• Document: The document is the component that does the actual rendering. It accepts the plot as an argument, and when we call the save method in the document, it uses all the child renderers in the plot object and constructs a JSON from the wrapped elements. This JSON is eventually read by the BokehJS widgets to render the data in a visually pleasing manner. More than one plot can be rendered in the document by adding it to a grid plot (we'll look at how this is done in the next recipe, Creating a time series MultiPlot with Bokeh-Scala).

A plot is a composition of multiple widgets/glyphs. Building one consists of a series of steps:

1. Preparing our data.
2. Creating the Plot and Document objects.
3. Creating a point (marker object) and a renderer for it.
4. Setting the x and y axes' data range for the plot.
5. Drawing the x and the y axes.
6. Viewing the marker objects with varying colors.
7. Adding grid lines.
8. Adding a legend to the plot.

Preparing our data

Bokeh plots require our data to be in a format that Bokeh understands, but this is really easy to do. All that we need to do is create a new source object that inherits from ColumnDataSource. The other options are AjaxDataSource and RemoteDataSource. So, let's overlay our Breeze data source on ColumnDataSource:

    import java.io.File
    import breeze.linalg._

    object IrisSource extends ColumnDataSource {
      // Species code -> color; used later when plotting
      private val colormap = Map[Int, Color](0 -> Color.Red, 1 -> Color.Green, 2 -> Color.Blue)
      // Reads the CSV file into a Breeze matrix
      private val iris = csvread(file = new File("irisNumeric.csv"), separator = ',')

      val sepalLength = column(iris(::, 0))
      val sepalWidth = column(iris(::, 1))
      val petalLength = column(iris(::, 2))
      val petalWidth = column(iris(::, 3))
      val species = column(iris(::, 4))
    }

The csvread call just reads irisNumeric.csv using the Breeze library. The color map is something that we'll be using later while plotting.
The purpose of this map is to translate each species of flower into a different color. The final piece is where we convert the Breeze matrix into ColumnDataSource. As required by ColumnDataSource, we select and map specific columns in the Breeze matrix to corresponding columns.

Creating Plot and Document objects

Let's give our image the title Iris Petal Length vs Width and create a document object so that we can save the final HTML under the name IrisBokehBreeze.html. Since we haven't specified the full path of the target file in the save method, the file will be saved in the same directory as the project itself:

    val plot = new Plot().title("Iris Petal Length vs Width")
    val document = new Document(plot)
    val file = document.save("IrisBokehBreeze.html")
    println(s"Saved the chart as ${file.url}")

Creating a marker object

Our plot has neither data nor any glyphs. Let's first create a marker object that marks the data point. There are a variety of marker objects to choose from: Asterisk, Circle, CircleCross, CircleX, Cross, Diamond, DiamondCross, InvertedTriangle, PlainX, Square, SquareCross, SquareX, and Triangle. Let's choose Diamond for our purpose:

    // Brings petalLength, petalWidth, and the other columns into scope
    import IrisSource._

    val diamond = new Diamond()
      .x(petalLength)
      .y(petalWidth)
      .fill_color(Color.Blue)
      .fill_alpha(0.5)
      .size(5)

    val dataPointRenderer = new GlyphRenderer().data_source(IrisSource).glyph(diamond)

While constructing the marker object, other than the UI attributes, we also say what the x and the y coordinates for it are. Note that we have also specified that the color of this marker is blue. We'll change that in a while using the color map.

Setting the X and Y axes' data range for the plot

The plot needs to know what the x and y data ranges of the plot are before rendering.
Let's do that by creating two DataRange objects and setting them to the plot:

    val xRange = new DataRange1d().sources(petalLength :: Nil)
    val yRange = new DataRange1d().sources(petalWidth :: Nil)
    plot.x_range(xRange).y_range(yRange)

Let's try and run the first cut of this program. The following is the output:

We see that this needs a lot of work. Let's do it bit by bit.

Drawing the x and the y axes

Let's now draw the axes, set their bounds, and add them to the plot's renderers. We also need to let the plot know which location each axis belongs to:

    //X and Y Axis
    val xAxis = new LinearAxis().plot(plot).axis_label("Petal Length").bounds((1.0, 7.0))
    val yAxis = new LinearAxis().plot(plot).axis_label("Petal Width").bounds((0.0, 2.5))
    plot.below <<= (listRenderer => (xAxis :: listRenderer))
    plot.left <<= (listRenderer => (yAxis :: listRenderer))

    //Add the renderer to the plot
    plot.renderers := List(xAxis, yAxis, dataPointRenderer)

Here is the output:

Viewing flower species with varying colors

All the data points are marked with blue as of now, but we would really like to differentiate the species visually. This is a simple two-step process:

1. Add new derived data (speciesColor) into our ColumnDataSource to hold colors that represent the species:

    object IrisSource extends ColumnDataSource {
      private val colormap = Map[Int, Color](0 -> Color.Red, 1 -> Color.Green, 2 -> Color.Blue)
      private val iris = csvread(file = new File("irisNumeric.csv"), separator = ',')

      val sepalLength = column(iris(::, 0))
      val sepalWidth = column(iris(::, 1))
      val petalLength = column(iris(::, 2))
      val petalWidth = column(iris(::, 3))
      // The species column is still needed, because speciesColor is derived from it
      val species = column(iris(::, 4))
      val speciesColor = column(species.value.map(v => colormap(v.round.toInt)))
    }

So, we assign red to Iris setosa, green to Iris versicolor, and blue to Iris virginica.

2. Modify the diamond marker to take this as input instead of accepting a static blue:

    val diamond = new Diamond()
      .x(petalLength)
      .y(petalWidth)
      .fill_color(speciesColor)
      .fill_alpha(0.5)
      .size(10)

The output is as follows:

It looks fairly okay now. Let's add some tools to the image. Bokeh has some nice tools that can be attached to the image: BoxSelectTool, BoxZoomTool, CrosshairTool, HoverTool, LassoSelectTool, PanTool, PolySelectTool, PreviewSaveTool, ResetTool, ResizeTool, SelectTool, TapTool, TransientSelectTool, and WheelZoomTool. Let's add a few of them for fun:

    val panTool = new PanTool().plot(plot)
    val wheelZoomTool = new WheelZoomTool().plot(plot)
    val previewSaveTool = new PreviewSaveTool().plot(plot)
    val resetTool = new ResetTool().plot(plot)
    val resizeTool = new ResizeTool().plot(plot)
    val crosshairTool = new CrosshairTool().plot(plot)

    plot.tools := List(panTool, wheelZoomTool, previewSaveTool, resetTool, resizeTool, crosshairTool)

Adding grid lines

While we have the crosshair tool, which helps us locate the exact x and y values of a particular data point, it would be nice to have a data grid too. Let's add two data grids, one for the x axis and one for the y axis:

    val xAxis = new LinearAxis().plot(plot).axis_label("Petal Length").bounds((1.0, 7.0))
    val yAxis = new LinearAxis().plot(plot).axis_label("Petal Width").bounds((0.0, 2.5))

    val xgrid = new Grid().plot(plot).axis(xAxis).dimension(0)
    val ygrid = new Grid().plot(plot).axis(yAxis).dimension(1)

Next, let's add the grids to the plot renderer list too:

    plot.renderers := List(xAxis, yAxis, dataPointRenderer, xgrid, ygrid)

Adding a legend to the plot

This step is a bit tricky in the Scala binding of Bokeh due to the lack of high-level graphing objects, such as scatter. For now, let's cook up our own legend. The legends property of the Legend object accepts a list of tuples, each pairing a label with its GlyphRenderers.
Let's explicitly create three GlyphRenderers wrapping diamonds of three colors, which represent the species. We then add them to the plot:

    val setosa = new Diamond().fill_color(Color.Red).size(10).fill_alpha(0.5)
    val setosaGlyphRnd = new GlyphRenderer().glyph(setosa)
    val versicolor = new Diamond().fill_color(Color.Green).size(10).fill_alpha(0.5)
    val versicolorGlyphRnd = new GlyphRenderer().glyph(versicolor)
    val virginica = new Diamond().fill_color(Color.Blue).size(10).fill_alpha(0.5)
    val virginicaGlyphRnd = new GlyphRenderer().glyph(virginica)

    val legends = List("setosa" -> List(setosaGlyphRnd),
      "versicolor" -> List(versicolorGlyphRnd),
      "virginica" -> List(virginicaGlyphRnd))

    val legend = new Legend().orientation(LegendOrientation.TopLeft).plot(plot).legends(legends)

    plot.renderers := List(xAxis, yAxis, dataPointRenderer, xgrid, ygrid, legend, setosaGlyphRnd, virginicaGlyphRnd, versicolorGlyphRnd)

The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter4-visualization/src/main/scala/com/packt/scalada/viz/breeze.

Creating a time series MultiPlot with Bokeh-Scala

In this second recipe on plotting using Bokeh, we'll see how to plot a time series graph with a dataset borrowed from https://archive.ics.uci.edu/ml/datasets/Dow+Jones+Index. We will also see how to plot multiple charts in a single document.

How to do it...

We'll be using only two fields from the dataset: the closing price of the stock at the end of the week, and the last business day of the week. Our dataset is comma separated. Let's take a look at some samples, as shown here:

Preparing our data

In contrast to the previous recipe, where we used the Breeze matrix to construct the Bokeh ColumnDataSource, we'll use the Spark DataFrame to construct the source this time. The getSource method accepts a ticker (MSFT-Microsoft and CAT-Caterpillar) and a SQLContext.
It runs a Spark SQL query, fetches the data from the table, and constructs a ColumnDataSource from it:

    import org.apache.spark.sql.SQLContext
    import org.joda.time.format.DateTimeFormat

    object StockSource {
      val formatter = DateTimeFormat.forPattern("MM/dd/yyyy")

      def getSource(ticker: String, sqlContext: SQLContext) = {
        val stockDf = sqlContext.sql(s"select stock, date, close from stocks where stock= '$ticker'")
        stockDf.cache()

        // Convert each date into epoch milliseconds, as a Double
        val dateData: Array[Double] = stockDf.select("date").collect.map(eachRow => formatter.parseDateTime(eachRow.getString(0)).getMillis().toDouble)
        // Drop the leading character (the currency symbol) before parsing the price
        val closeData: Array[Double] = stockDf.select("close").collect.map(eachRow => eachRow.getString(0).drop(1).toDouble)

        object source extends ColumnDataSource {
          val date = column(dateData)
          val close = column(closeData)
        }
        source
      }
    }

Earlier, we constructed SQLContext and registered the dataset as a table, like this:

    // csvFile is provided by the spark-csv package
    import com.databricks.spark.csv._

    val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val stocks = sqlContext.csvFile(filePath = "dow_jones_index.data", useHeader = true, delimiter = ',')
    stocks.registerTempTable("stocks")

The only tricky thing that we do here is convert the date value into milliseconds. This is because each plotted point requires a Double. We use the Joda-Time API to achieve this.

Creating a plot

Let's go ahead and create the Plot object from the source:

    //Create Plot
    val plot = new Plot().title(ticker).x_range(xdr).y_range(ydr).width(800).height(400)

Let's have our image's title as the ticker name of the stock and create a Document object so that we can save the final HTML by the name ClosingPrices.html:

    val msDocument = new Document(microsoftPlot)
    val msHtml = msDocument.save("ClosingPrices.html")

Creating a line that joins all the data points

As we saw earlier with the Diamond marker, we'll have to pass the x and the y positions of the data points.
Also, we will need to wrap the Line glyph into a renderer so that we can add it to Plot:

    val line = new Line().x(date).y(close).line_color(color).line_width(2)
    val lineGlyph = new GlyphRenderer().data_source(source).glyph(line)

Setting the x and y axes' data range for the plot

The plot needs to know what the x and y data ranges of the plot are before rendering. Let's do that by creating two DataRange objects and setting them to the plot:

    val xdr = new DataRange1d().sources(List(date))
    val ydr = new DataRange1d().sources(List(close))
    plot.x_range(xdr).y_range(ydr)

Drawing the axes and the grids

Drawing the axes and the grids is the same as before. We add some labels to the axes, format the display of the x axis, and then add them to the Plot:

    val xformatter = new DatetimeTickFormatter().formats(Map(DatetimeUnits.Months -> List("%b %Y")))
    val xaxis = new DatetimeAxis().plot(plot).formatter(xformatter).axis_label("Month")
    val yaxis = new LinearAxis().plot(plot).axis_label("Price")
    plot.below <<= (xaxis :: _)
    plot.left <<= (yaxis :: _)
    val xgrid = new Grid().plot(plot).dimension(0).axis(xaxis)
    val ygrid = new Grid().plot(plot).dimension(1).axis(yaxis)

Adding tools

As before, let's create some tools and attach them to the plot:

    //Tools
    val panTool = new PanTool().plot(plot)
    val wheelZoomTool = new WheelZoomTool().plot(plot)
    val previewSaveTool = new PreviewSaveTool().plot(plot)
    val resetTool = new ResetTool().plot(plot)
    val resizeTool = new ResizeTool().plot(plot)
    val crosshairTool = new CrosshairTool().plot(plot)

    plot.tools := List(panTool, wheelZoomTool, previewSaveTool, resetTool, resizeTool, crosshairTool)

Adding a legend to the plot

Since we already have the Glyph renderer for the line, all we need to do is add it to the legend.
The properties of the line automatically propagate to the legend:

    //Legend
    val legends = List(ticker -> List(lineGlyph))
    val legend = new Legend().plot(plot).legends(legends)

Next, let's add all the renderers that we created before to the plot:

    plot.renderers <<= (xaxis :: yaxis :: xgrid :: ygrid :: lineGlyph :: legend :: _)

As the final step, let's try plotting multiple plots in the same document.

Multiple plots in the document

Creating multiple plots in the same document is child's play. All that we need to do is create all our plots and then add them into a grid. Finally, instead of passing our individual plot object into the document, we pass in GridPlot:

    val children = List(List(microsoftPlot, bofaPlot), List(caterPillarPlot, mmmPlot))
    val grid = new GridPlot().children(children)

    val document = new Document(grid)
    val html = document.save("DJClosingPrices.html")

The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter4-visualization/src/main/scala/com/packt/scalada/viz/breeze.

In this chapter, we explored two methods of visualization and built some basic graphs and charts using Scala. As I mentioned earlier, the visualization libraries in Scala are actively being developed and cannot yet be compared to the advanced visualizations that can be generated using R or, for that matter, Tableau.

5
Learning from Data

In this chapter, we will cover the following recipes:

• Predicting continuous values using linear regression
• Binary classification using LogisticRegression and SVM
• Binary classification using LogisticRegression with the Pipeline API
• Clustering using K-means
• Feature reduction using principal component analysis

Introduction

In previous chapters, we saw how to load, prepare, and visualize data. Now, let's start doing some interesting stuff with it. In this chapter, we'll be looking into applying various machine learning techniques on top of it.
We'll look at a few examples for the two broad classifications of machine learning techniques: supervised and unsupervised learning. Before that, however, let's briefly see what these terms mean.

Supervised and unsupervised learning

If you are reading this book, you probably already know what supervised and unsupervised learning are, but for the sake of completeness, let's briefly summarize what they mean.

In supervised learning, we train the algorithms with labeled data. Labeled data is nothing but input data along with the outcome variable. For example, if our intention is to predict whether a website is about news, we would prepare a sample dataset of website content with "news" and "not news" as labels. This dataset is called the training dataset.

With supervised learning, our end goal is to use the training dataset and come up with a function that maps our input variables to an output variable with the least margin of error. We call input variables (or x variables) features or explanatory variables, and the output variable (also known as the y variable or label) the target or dependent variable. In the news website example, the text content in the website would be the input variable and "news" or "not news" would be the target variable. The function, along with its parameters (or weights or theta), is our hypothesis, or model.

In the case of unsupervised learning, we aim to find a structure within the data: groups, and relationships among these groups or among the participants of a group. Unlike supervised learning, we have no labels or outcome information for the data, or even for a subset of it. An example would be to see whether there are similar buying patterns among a group of people (which helps cross-selling) or to see which group of people is more likely to buy pizza from our newly opened store.
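To make the supervised learning vocabulary concrete, here is a minimal sketch in plain Scala. It is only an illustration: the names Example and LinearModel and the toy numbers are assumptions made for this sketch, not types or data from Spark or from the recipes.

```scala
object HypothesisSketch {
  // One labeled training example: the input features (x) plus the known outcome (y).
  final case class Example(features: Vector[Double], label: Double)

  // A linear hypothesis: the weights (theta) and the intercept are its parameters.
  final case class LinearModel(weights: Vector[Double], intercept: Double) {
    // Maps the input variables to a predicted output variable.
    def predict(features: Vector[Double]): Double =
      weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical training set generated by y = 2 * x + 1.
    val training = Seq(Example(Vector(1.0), 3.0), Example(Vector(2.0), 5.0))
    val model = LinearModel(weights = Vector(2.0), intercept = 1.0)

    // The error of the hypothesis on each training example.
    training.foreach { e =>
      println(s"x = ${e.features}, y = ${e.label}, error = ${model.predict(e.features) - e.label}")
    }
  }
}
```

Learning, in this vocabulary, means searching for the weights and intercept that keep these errors as small as possible across the training dataset.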
Gradient descent

With supervised learning, in order for the algorithm to learn the relationship between the input and the output features, we provide a set of manually curated values for the target variable (y) against a set of input variables (x). We call it the training set. The learning algorithm then has to go over our training set, perform some optimization, and come up with a model that has the least cost, that is, the least deviation from the true values. So technically, we have two algorithms for every learning problem: an algorithm that comes up with the function and (an initial set of) weights for each of the x features, and a supporting algorithm (also called the cost minimization or optimization algorithm) that looks at our function parameters (feature weights) and tries to minimize the cost as much as possible.

There are a variety of cost minimization algorithms, but one of the most popular is gradient descent. Imagine gradient descent as climbing down a mountain. The height of the mountain represents the cost, and the plain represents the feature weights. The highest point is your function with the maximum cost, and the lowest point has the least cost. Therefore, our intention is to walk down the mountain. What gradient descent does is as follows: for every single step of a particular size (the step size) that it takes down the slope, it goes through the entire dataset (!) and updates all the values of the weights for the x features. This goes on until it reaches a state where the cost is at its minimum. This flavor of gradient descent, in which it sees all of the data per iteration and updates all the parameters during every iteration, is called batch gradient descent. The trouble with using this algorithm against the size of the data that Spark aims to handle is that going through millions of rows per iteration is definitely not optimal.
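The batch flavor described above can be sketched in a few lines of plain Scala. This is a simplified illustration for a single feature with the mean squared error as the cost; the object name, the toy data, and the chosen step size are assumptions for this sketch, and this is not how Spark implements its optimizer.

```scala
object BatchGradientDescentSketch {
  // Fits y ≈ w * x + b by batch gradient descent on the mean squared error.
  def fit(xs: Array[Double], ys: Array[Double],
          stepSize: Double, iterations: Int): (Double, Double) = {
    val n = xs.length
    var w = 0.0
    var b = 0.0
    for (_ <- 1 to iterations) {
      // Every iteration walks over the entire dataset to compute the gradient.
      var gradW = 0.0
      var gradB = 0.0
      for (i <- 0 until n) {
        val error = (w * xs(i) + b) - ys(i)
        gradW += 2.0 * error * xs(i) / n
        gradB += 2.0 * error / n
      }
      // One step down the slope: all parameters are updated once per full pass.
      w -= stepSize * gradW
      b -= stepSize * gradB
    }
    (w, b)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical toy data generated by y = 2 * x + 1.
    val xs = Array(0.0, 1.0, 2.0, 3.0)
    val ys = xs.map(x => 2.0 * x + 1.0)
    val (w, b) = fit(xs, ys, stepSize = 0.1, iterations = 2000)
    println(f"w = $w%.3f, b = $b%.3f")
  }
}
```

Note the inner loop: each of the 2,000 parameter updates reads every row, which is exactly the cost that becomes prohibitive on large datasets.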
So, Spark uses a variant of gradient descent, called Stochastic Gradient Descent (SGD), wherein the parameters are updated for each training example as it looks at them one by one. In this way, it starts making progress almost immediately, and therefore the computational effort is considerably reduced. The SGD settings can be customized using the optimizer attribute inside each of the ML algorithms. We'll look at this in detail in the recipes.

In the following recipes, we'll be looking at linear regression, logistic regression, and support vector machines as examples of supervised learning, and at K-means clustering and dimensionality reduction using Principal Component Analysis (PCA) as examples of unsupervised learning. We'll also briefly look at the Stanford NLP toolkit and Scala NLP's Epic, popular natural language processing libraries, as examples of fitting a third-party library into Spark jobs.

Predicting continuous values using linear regression

At the risk of stating the obvious, linear regression aims to find the relationship between an output (y) and an input (x) using a mathematical model that is linear in the input variables. The output variable, y, is a continuous numerical value. If we have more than one input/explanatory variable (x), as in the example that we are going to see, we call it multiple linear regression. The dataset that we'll use for this recipe, for lack of creativity, is lifted from the UCI website at http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/. This dataset has 1599 instances of various red wines, their chemical composition, and their quality. We'll use it to predict the quality of a red wine.

How to do it...

Let's summarize the steps:

1. Importing the data.
2. Converting each instance into a LabeledPoint.
3. Preparing the training and test data.
4. Scaling the features.
5. Training the model.
6. Predicting against the test data.
7. Evaluating the model.
8. Regularizing the parameters.
9. Mini batching.

The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/tree/master/chapter5-learning/src/main/scala/com/packt/scalada/learning/LinearRegressionWine.scala.

"Step zero" of this process is the creation of SparkConf and the SparkContext. There is nothing fancy here:

    val conf = new SparkConf().setAppName("linearRegressionWine").setMaster("local[2]")
    val sc = new SparkContext(conf)

Importing the data

We then import the semicolon-separated text file. We map each line into an Array[String] by splitting each of them using semicolons. We end up with RDD[Array[String]]:

    val rdd = sc.textFile("winequality-red.csv").map(line => line.split(";"))

Converting each instance into a LabeledPoint

As we discussed earlier, supervised learning requires training data to be provided. We are also required to test the model that we create for its accuracy against another set of data, the test data. If we have two different datasets for this, we can import them separately and mark them as a training set and a test set. In our example, we'll use a single dataset and split it into training and test sets.

Each of our training samples has the following format: the last field is the quality of the wine, a rating from 1 to 10, and the first 11 fields are the properties of the wine. So, from our perspective, the quality of the wine is the y variable (the output) and the rest of them are x variables (input). Now, let's represent this in a format that Spark understands: a LabeledPoint. A LabeledPoint is a simple wrapper around the input features (our x variables) and the known value (y variable) for these x input values:

    val dataPoints = rdd.map(row => new LabeledPoint(row.last.toDouble, Vectors.dense(row.take(row.length-1).map(str => str.toDouble))))

The first parameter to the constructor of the LabeledPoint is the label (y variable), and the second parameter is a vector of input variables.
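The conversion above can be tried on a single row without starting a SparkContext. The following is a minimal sketch, assuming the Spark MLlib JAR is on the classpath; the object name LabeledPointSketch and the parse helper are hypothetical, and the sample row is only an illustration of the dataset's layout (11 properties followed by the quality).

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object LabeledPointSketch {
  // Parses one semicolon-separated row: the last field is the label (y),
  // and everything before it is a feature (x).
  def parse(line: String): LabeledPoint = {
    val fields = line.split(";").map(_.trim.toDouble)
    LabeledPoint(fields.last, Vectors.dense(fields.init))
  }

  def main(args: Array[String]): Unit = {
    // Illustrative row in the dataset's layout.
    val row = "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5"
    val point = parse(row)
    println(s"label = ${point.label}, number of features = ${point.features.size}")
  }
}
```

A helper like this would fail on a non-numeric line such as a header row, so any such line has to be filtered out before the conversion.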
Preparing the training and test data

As we discussed earlier, we can have two different independent datasets for training and testing. However, it is a common practice to split a single dataset into training and test datasets. In this recipe, we will be splitting the dataset into training and test sets in the ratio of 80:20, with each of the elements being selected randomly. This random shuffling of data is one of the prerequisites for better performance of the SGD too:

    val dataSplit = dataPoints.randomSplit(Array(0.8, 0.2))
    val trainingSet = dataSplit(0)
    val testSet = dataSplit(1)

Scaling the features

Running quick summary statistics reveals that our features aren't in the same range:

    val featureVector = rdd.map(row => Vectors.dense(row.take(row.length-1).map(str => str.toDouble)))
    // Column-wise summary statistics (org.apache.spark.mllib.stat.Statistics)
    val stats = Statistics.colStats(featureVector)
    print(s"Max : ${stats.max}, Min : ${stats.min}, and Mean : ${stats.mean} and Variance : ${stats.variance}")
    println("Min " + stats.min)
    println("Max " + stats.max)

Here is the output:

    Min [4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4]
    Max [15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9]
    Variance : [3.031416388997815,0.0320623776515516,0.03794748313440582,1.987897132985963,0.002215142653300991,109.41488383305895,1082.1023725325845,3.56202945332629E-6,0.02383518054541292,0.02873261612976197,1.135647395000472]

It is always recommended that the input variables have a mean of 0. This is easily achieved with the help of the StandardScaler built into the Spark ML library itself. The one thing that we have to watch out for here is that we have to scale the training and the test sets uniformly. The way we do it is by fitting a scaler on trainingSet and using the same scaler to scale the test set. Another side note is that feature scaling helps with faster convergence in SGD:

    val scaler = new StandardScaler(withMean = true, withStd = true).fit(trainingSet.map(dp => dp.features))
    val scaledTrainingSet = trainingSet.map(dp => new LabeledPoint(dp.label, scaler.transform(dp.features))).cache()
    val scaledTestSet = testSet.map(dp => new LabeledPoint(dp.label, scaler.transform(dp.features))).cache()

Training the model

The next step is to use our training data to create a model. This just involves creating an instance of LinearRegressionWithSGD and passing in a few parameters: one for the LinearRegression algorithm and two for the SGD. The SGD parameters can be accessed through the optimizer attribute inside LinearRegressionWithSGD:

• setIntercept: While predicting, we are more interested in the slope. This setting will force the algorithm to find the intercept too.
• optimizer.setNumIterations: This determines the number of iterations that our algorithm needs to go through on the training set before finalizing the hypothesis. A rule of thumb for this number is 10^6 divided by the number of instances in your dataset. In our case, we'll set it to 1000.
• setStepSize: This tells the gradient descent algorithm how big a step it needs to take during every iteration while it tries to reduce the cost. Setting this parameter is really tricky because we would like the SGD to take bigger steps in the beginning and smaller steps towards the convergence. Setting a fixed small number would slow down the algorithm, and setting a fixed bigger number would not give us a function that is a reasonable minimum. The way Spark handles our setStepSize input parameter is as follows: it divides the input parameter by the square root of the iteration number. So initially, our step size is large, and as we go further down, it becomes smaller and smaller. The default step size parameter is 1.

    val regression = new LinearRegressionWithSGD().setIntercept(true)
    regression.optimizer.setNumIterations(1000).setStepSize(0.1)

    //Let's create a model out of our training examples.
    val model = regression.run(scaledTrainingSet)

Predicting against test data

This step is just a one-liner. We use the resulting model to predict the output (y) based on the features of the test set:

    val predictions: RDD[Double] = model.predict(scaledTestSet.map(point => point.features))

Evaluating the model

Let's evaluate our model against one of the most popular regression evaluation metrics: mean squared error. Let's get the actual values that our test data has (the y variable prepared manually) and then compare them with the predictions from our model:

    val actuals: RDD[Double] = scaledTestSet.map(_.label)

Mean squared error

The mean squared error is given by this formula, where y_i is the actual value, \hat{y}_i is the predicted value, and n is the number of values:

$$ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 $$

So, we take the differences between the actual and the predicted values (the errors), square them, and calculate the sum of them all. We then divide this sum by the number of values, thereby calculating the mean:

    val predictsAndActuals: RDD[(Double, Double)] = predictions.zip(actuals)

    val sumSquaredErrors = predictsAndActuals.map { case (pred, act) =>
      println(s"act, pred and difference $act, $pred ${act-pred}")
      math.pow(act-pred, 2)
    }.sum()

    val meanSquaredError = sumSquaredErrors / scaledTestSet.count

    println(s"SSE is $sumSquaredErrors")
    println(s"MSE is $meanSquaredError")

Here is the output:

    SSE is 162.21647197365706
    MSE is 0.49607483783992984

In our example, we selected all the features that are present in our dataset. Later, we'll take a look at dimensionality reduction, which helps us reduce the number of features while still maintaining the variance of the dataset at a reasonably high level.

Regularizing the parameters

Before we see what regularization is, let's briefly see what overfitting is. A model is said to be overfit (or to have high variance) when it memorizes the training set. The result of this is that the algorithm fails to generalize and therefore performs badly with unseen datasets.
One way to solve the problem of overfitting is to manually select the important features that will be used to create the model, but for a large-dimensional dataset, it is hard to decide which ones to keep and which ones to throw away.

The other popular option is to retain all the features but reduce the magnitudes of the feature weights. Thus, even with a model that is complex (with higher degree polynomials), if the feature weights are really small, the resulting model would be simple. In other words, given two equally (or almost equally) performing models, with one model being complex (with higher degree polynomial) and the other model being simple, regularization chooses the simple model. The reasoning behind this is that simpler models have a higher probability of predicting unseen data well (also known as generalization).

The Spark MLlib comes with implementations for the most common L1 and L2 regularizations. As a side note, LinearRegressionWithSGD, by default, uses a SimpleUpdater, which does not regularize the parameters. Interestingly, Spark has implementations of regression algorithms that are based on top of the L1 and L2 updaters; they are called Lasso (which uses the L1 updater) and Ridge (which uses the L2 updater by default).

While the L1 regularizer offers some feature selection when the dataset that we have is sparse (or if the number of rows in the dataset is smaller than the number of features), most of the time, it is recommended to use the L2 regularizer. The new Pipeline API also has out-of-the-box support for ElasticNet regularization, which uses both the L1 and L2 regularizations internally.

Now, let's go over the code:

    // stepSize is a Double so that fractional values such as 0.1 can be passed in
    def algorithm(algo: String, iterations: Int, stepSize: Double) = algo match {
      case "linear" => {
        val algo = new LinearRegressionWithSGD()
        algo.setIntercept(true).optimizer.setNumIterations(iterations).setStepSize(stepSize)
        algo
      }
      case "lasso" => {
        val algo = new LassoWithSGD()
        algo.setIntercept(true).optimizer.setNumIterations(iterations).setStepSize(stepSize)
        algo
      }
      case "ridge" => {
        val algo = new RidgeRegressionWithSGD()
        algo.setIntercept(true).optimizer.setNumIterations(iterations).setStepSize(stepSize)
        algo
      }
    }

As discussed earlier, LassoWithSGD wraps an L1 updater and RidgeRegressionWithSGD wraps an L2 updater. From a code perspective, all that we need to do is change the name of the class. The optimizer (gradient descent) now accepts a regularization parameter that penalizes larger parameters for the features. The default value of the regularization parameter is 0.01 in Spark. A regularization parameter that is too small leaves the model free to overfit, and one that is too large results in underfitting.

The following output shows the error metrics for the three variants; in this run, the two regularized models score almost identically to plain linear regression:

    ************** Printing metrics for Linear Regression with SGD *****************
    SSE is 132.39124792957116
    MSE is 0.4124337941731189
    ************** Printing metrics for Lasso Regression with SGD *****************
    SSE is 132.3943810653321
    MSE is 0.4124435547206608
    ************** Printing metrics for Ridge Regression with SGD *****************
    SSE is 132.44011034123344
    MSE is 0.4125860135240917

Mini batching

Instead of going through our dataset one by one (in the case of SGD), or seeing the entire dataset for every iteration (in the case of batch gradient descent) while updating the parameter vector, we can settle for something in the middle. With the mini batch fraction parameter, for every single iteration, the SGD considers that fraction of the dataset to process for the parameter update. Let's set the batch size to 5 percent:

    algo.setIntercept(true).optimizer.setNumIterations(iterations).setStepSize(stepSize).setRegParam(0.001).setMiniBatchFraction(0.05)

The results are as follows:

    ************** Printing metrics for Linear Regression with SGD *****************
    SSE is 112.96958667767147
    MSE is 0.3574986920179477
    SST is 183.05305027649794
    Residual sum of squares is 0.38285875866568087
    ************** Printing metrics for Lasso Regression with SGD *****************
    SSE is 112.95392101963424
    MSE is 0.35744911715074124
    SST is 183.05305027649794
    Residual sum of squares is 0.3829443385454675
    ************** Printing metrics for Ridge Regression with SGD *****************
    SSE is 112.9218089913291
    MSE is 0.3573474968080035
    SST is 183.05305027649794
    Residual sum of squares is 0.3831197632557175

Note that the value printed as "Residual sum of squares" equals 1 - SSE/SST, that is, the coefficient of determination (R squared) rather than a sum of squares.

The advantage that we get from using mini batches is better performance than plain SGD without batches. This is because with plain SGD, for every iteration, only one example is considered to update the parameters. However, with mini batches, we consider a batch of examples. That said, the improvement in the mean squared error from the previous run is not the result of using batches, but just a feature of SGD: it roams around the minima and does not converge at a fixed point.

Binary classification using LogisticRegression and SVM

Unlike linear regression, wherein we predicted continuous values for the outcome (the y variable), logistic regression and the Support Vector Machine (SVM) are used to predict just one out of the n possibilities for the outcome (the y variable). If the outcome is one of two possibilities, then the classification is called a binary classification.

Logistic regression, when used for binary classification, looks at each data point and estimates the probability of that data point falling under the positive case. If the probability is less than a threshold, then the outcome is negative (or 0); otherwise, the outcome is positive (or 1).
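The probability-and-threshold rule just described can be sketched in plain Scala. This is a minimal illustration, assuming a threshold of 0.5 and hypothetical weights; the object and function names are made up for this sketch and are not part of Spark's API.

```scala
object LogisticThresholdSketch {
  // The logistic (sigmoid) function squashes any real number into (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Estimated probability that a data point falls under the positive case.
  def probability(weights: Array[Double], intercept: Double, features: Array[Double]): Double =
    sigmoid(weights.zip(features).map { case (w, x) => w * x }.sum + intercept)

  // Below the threshold the outcome is negative (0); otherwise it is positive (1).
  def classify(p: Double, threshold: Double = 0.5): Int = if (p < threshold) 0 else 1

  def main(args: Array[String]): Unit = {
    // Hypothetical model parameters and data point, for illustration only.
    val weights = Array(1.5, -2.0)
    val intercept = 0.25
    val p = probability(weights, intercept, Array(2.0, 0.5))
    println(s"probability = $p, predicted class = ${classify(p)}")
  }
}
```

Moving the threshold away from 0.5 trades false positives against false negatives without retraining the model.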
As with any other supervised learning technique, we will be providing training examples for logistic regression. We then add a bit of code for feature extraction and let the algorithm create a model that encapsulates the probability of each of the features belonging to one of the binary outcomes.

What SVM tries to do is map all of the training data as points in the feature space. The algorithm comes up with a hyperplane that separates the positive and negative training examples in such a way that the distance (margin band) between them is maximum. This is better illustrated with a diagram:

When a new and unseen data point comes up for prediction, the algorithm checks which side of the hyperplane that point falls on. The label corresponding to that side is predicted as the label for the input point.

How to do it...

Both the implementations of LogisticRegression and SVM in Spark use L2 regularization by default, but we are free to switch to L1 by setting the updater explicitly.

In this recipe, we'll classify a spam/ham dataset (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) against three variants of classification algorithms:

• Logistic regression with SGD as the optimization algorithm
• Logistic regression with BFGS as the optimization algorithm
• Support vector machine with SGD as the optimization algorithm

The BFGS optimization algorithm has the benefit of converging to the minimum faster than SGD. Also, for BFGS, we don't need to work out an optimal learning rate.

Let's summarize the steps:

1. Importing the data.
2. Tokenizing the data and converting it into LabeledPoints.
3. Factoring the Inverse Document Frequency (IDF).
4. Preparing the training and test data.
5. Constructing the algorithm.
6. Training the model and predicting the test data.
7. Evaluating the model.
The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter5-learning/src/main/scala/com/packt/scalada/learning/BinaryClassificationSpam.scala.

Importing the data

As usual, our input data is in the form of a text file: SMSSpamCollection. The data file looks like this:

[Figure: sample lines from the SMSSpamCollection file]

As we can see, the label and the data are separated by a tab. So, while reading each line, we split the label and the content, and then populate a simple case class named Document. This Document class is just a temporary placeholder. In the next step, we'll convert these documents into LabeledPoints:

//Frankly, we could make this a tuple but this looks neat
case class Document(label: String, content: String)

val docs = sc.textFile("SMSSpamCollection").map(line => {
  val words = line.split("\t")
  Document(words.head.trim(), words.tail.mkString(" "))
})

Tokenizing the data and converting it into LabeledPoints

For the tokenization, instead of relying on the tokenizer provided inside Spark, we'll see how to plug in two external NLP libraries: the Stanford CoreNLP and Scala NLP's Epic libraries. These are two popular NLP libraries, one from the Java world and the other from Scala.

However, one thing that we ought to watch out for while using external libraries is that the instantiation of these APIs, and therefore the creation of heavyweight objects required for the use of these APIs (such as a tokenizer), should be done at the partition level. If we do it at the level of a closure, such as a map over an RDD, we'll end up creating a new instance of the API object for every single record.

In the case of Epic, we just split the documents into sentences and then tokenize them into words. We also add two more restrictions.
Only those tokens that consist entirely of letters or digits will be considered, and the tokens should be at least two characters long:

import epic.preprocess.TreebankTokenizer
import epic.preprocess.MLSentenceSegmenter
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

//Use Scala NLP - Epic
val labeledPointsUsingEpicRdd: RDD[LabeledPoint] = docs.mapPartitions { docIter =>
  //Heavyweight objects are created once per partition, not once per record
  val segmenter = MLSentenceSegmenter.bundled().get
  val tokenizer = new TreebankTokenizer()
  val hashingTf = new HashingTF(5000)
  docIter.map { doc =>
    val sentences = segmenter(doc.content)
    val tokens = sentences.flatMap(sentence => tokenizer(sentence))
    //consider only features that are letters or digits and cut off all words that are less than 2 characters
    val filteredTokens = tokens.toList
      .filter(token => token.forall(_.isLetterOrDigit))
      .filter(_.length() > 1)
    new LabeledPoint(if (doc.label == "ham") 0 else 1, hashingTf.transform(filteredTokens))
  }
}.cache()

MLSentenceSegmenter splits the paragraph into sentences. The sentences are then split into terms (or words) using the tokenizer. HashingTF hashes each term to an index in a fixed-size vector and records its frequency of occurrence. Finally, to construct a LabeledPoint for each document, we convert these terms into a term frequency vector for that document using the transform function of HashingTF. Also, we restrict the number of features (the size of the term frequency vector) to 5,000 by setting numFeatures in HashingTF.

With Stanford CoreNLP, the process is a little more involved, in the sense that we reduce the tokens to lemmas (https://en.wikipedia.org/wiki/Lemmatisation).
In order to do this, we create an NLP pipeline that splits sentences, tokenizes, and finally reduces the tokens to lemmas:

import java.util.Properties
import scala.collection.JavaConverters._
import scala.io.Source
import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

def corePipeline(): StanfordCoreNLP = {
  val props = new Properties()
  props.put("annotators", "tokenize, ssplit, pos, lemma")
  new StanfordCoreNLP(props)
}

def lemmatize(nlp: StanfordCoreNLP, content: String): List[String] = {
  //We are required to prepare the text as 'annotatable' before we annotate :-)
  val document = new Annotation(content)
  //Annotate
  nlp.annotate(document)
  //Extract all sentences
  val sentences = document.get(classOf[SentencesAnnotation]).asScala
  //Extract lemmas from sentences
  val lemmas = sentences.flatMap { sentence =>
    val tokens = sentence.get(classOf[TokensAnnotation]).asScala
    tokens.map(token => token.getString(classOf[LemmaAnnotation]))
  }
  //Only lemmas made of letters or digits are considered, and only those with a length of at least 2
  lemmas.toList.filter(lemma => lemma.forall(_.isLetterOrDigit)).filter(_.length() > 1)
}

val labeledPointsUsingStanfordNLPRdd: RDD[LabeledPoint] = docs.mapPartitions { docIter =>
  val corenlp = corePipeline()
  //Materialize the stop words as a Set; a bare Iterator would be exhausted after the first lookup
  val stopwords = Source.fromFile("stopwords.txt").getLines().toSet
  val hashingTf = new HashingTF(5000)
  docIter.map { doc =>
    val lemmas = lemmatize(corenlp, doc.content)
    //remove all the stopwords from the lemma list (filterNot returns a new list, so keep the result)
    val filteredLemmas = lemmas.filterNot(lemma => stopwords.contains(lemma))
    //Generates a term frequency vector from the features
    val features = hashingTf.transform(filteredLemmas)
    //example : List(until, jurong, point, crazy, available, only, in, bugi, great, world, la, buffet, Cine, there, get, amore, wat)
    new LabeledPoint(if (doc.label.equals("ham")) 0 else 1, features)
  }
}.cache()

Factoring the inverse document frequency

With HashingTF, we have a map of terms along with their frequency of occurrence in the documents. Now, the problem with taking this metric is that common words such as "the" and "a" get higher rankings compared to rare words.
The inverse document frequency (IDF) measures how many of the documents a term occurs in and gives a higher weight to a term that is uncommon. We'll now factor in the inverse document frequency so that we have the TF-IDF score (https://en.wikipedia.org/wiki/Tf–idf) for each term. This is easily achievable in Spark with org.apache.spark.mllib.feature.IDFModel. We extract all term frequencies from the LabeledPoints and pass them to the transform function of IDFModel to generate the TF-IDF:

val labeledPointsUsingStanfordNLPRdd = getLabeledPoints(docs, "STANFORD")
val lpTfIdf = withIdf(labeledPointsUsingStanfordNLPRdd).cache()

def withIdf(lPoints: RDD[LabeledPoint]): RDD[LabeledPoint] = {
  val hashedFeatures = lPoints.map(lp => lp.features)
  val idf: IDF = new IDF()
  val idfModel: IDFModel = idf.fit(hashedFeatures)
  val tfIdf: RDD[Vector] = idfModel.transform(hashedFeatures)
  //Pair every original point with its TF-IDF vector, keeping the original label
  val lpTfIdf = lPoints.zip(tfIdf).map {
    case (originalLPoint, tfIdfVector) => new LabeledPoint(originalLPoint.label, tfIdfVector)
  }
  lpTfIdf
}

Prepare the training and test data

Our dataset has a skewed distribution of spam and ham. To make sure that the 80:20 split into training and test data preserves this distribution, we first separate the data into spam and ham subsets and then split each subset in the 80:20 ratio. At the end of this, the training data and the test data are in a 4:1 ratio for both the spam and the ham samples. The spam and ham counts in our dataset are 747 and 4827, respectively:

//Split dataset
val spamPoints = lpTfIdf.filter(point => point.label == 1).randomSplit(Array(0.8, 0.2))
val hamPoints = lpTfIdf.filter(point => point.label == 0).randomSplit(Array(0.8, 0.2))

println("Spam count:" + (spamPoints(0).count) + "::" + (spamPoints(1).count))
println("Ham count:" + (hamPoints(0).count) + "::" + (hamPoints(1).count))

val trainingSpamSplit = spamPoints(0)
val testSpamSplit = spamPoints(1)
val trainingHamSplit = hamPoints(0)
val testHamSplit = hamPoints(1)

val trainingSplit = trainingSpamSplit ++ trainingHamSplit
val testSplit = testSpamSplit ++ testHamSplit

Constructing the algorithm

Now that we have our training and test sets, the next obvious step is to train a model out of these examples. Let's create instances of the three variants of the algorithms that we would like to experiment with:

val logisticWithSGD = getAlgorithm("logsgd", 100, 1, 0.001)
val logisticWithBfgs = getAlgorithm("logbfgs", 100, Double.NaN, 0.001)
val svmWithSGD = getAlgorithm("svm", 100, 1, 0.001)

def getAlgorithm(algo: String, iterations: Int, stepSize: Double, regParam: Double) = algo match {
  case "logsgd" => {
    val algo = new LogisticRegressionWithSGD()
    algo.setIntercept(true).optimizer.setNumIterations(iterations).setStepSize(stepSize).setRegParam(regParam)
    algo
  }
  case "logbfgs" => {
    val algo = new LogisticRegressionWithLBFGS()
    algo.setIntercept(true).optimizer.setNumIterations(iterations).setRegParam(regParam)
    algo
  }
  case "svm" => {
    val algo = new SVMWithSGD()
    algo.setIntercept(true).optimizer.setNumIterations(iterations).setStepSize(stepSize).setRegParam(regParam)
    algo
  }
}

Notice that the stepSize parameter isn't set for logistic regression with BFGS.

Training the model and predicting the test data

Like linear regression, training and predicting the labels for the test set is just a matter of calling the run and predict methods of the classification algorithm.

Soon after the prediction is done, the next logical step is to evaluate the model. In order to generate metrics for this, we extract the predicted and the actual labels. Our runClassification function trains the model using the training data and makes predictions against the test data. It then zips the predicted and the actual outcomes into a value called predictsAndActuals.
This value is returned from the function. The runClassification function accepts a GeneralizedLinearAlgorithm as the parameter, which is the parent of LinearRegressionWithSGD, LogisticRegressionWithSGD, and SVMWithSGD:

val logisticWithSGDPredictsActuals = runClassification(logisticWithSGD, trainingSplit, testSplit)
val logisticWithBfgsPredictsActuals = runClassification(logisticWithBfgs, trainingSplit, testSplit)
val svmWithSGDPredictsActuals = runClassification(svmWithSGD, trainingSplit, testSplit)

def runClassification(algorithm: GeneralizedLinearAlgorithm[_ <: GeneralizedLinearModel],
    trainingData: RDD[LabeledPoint],
    testData: RDD[LabeledPoint]): RDD[(Double, Double)] = {
  val model = algorithm.run(trainingData)
  val predicted = model.predict(testData.map(point => point.features))
  val actuals = testData.map(point => point.label)
  val predictsAndActuals: RDD[(Double, Double)] = predicted.zip(actuals)
  predictsAndActuals
}

Evaluating the model

For generating the metrics, Spark has some inbuilt APIs. The two most common metrics used to evaluate a classification model are the area under curve and the confusion matrix. The org.apache.spark.mllib.evaluation.BinaryClassificationMetrics gives us the area under the curve, and org.apache.spark.mllib.evaluation.MulticlassMetrics gives us the confusion matrix. We also calculate the simple accuracy measure manually using the values of predicted and actuals. The accuracy is simply the result of dividing the correctly classified count of the test dataset by the total count of the test dataset.
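The accuracy calculation and the four cells of a binary confusion matrix can be sketched without Spark on a plain collection of (predicted, actual) pairs. The names `SimpleMetrics` and `Confusion` are hypothetical and exist only for this illustration; the sketch assumes 1.0 marks the positive class (spam) and 0.0 the negative class (ham).

```scala
object SimpleMetrics {
  // Each pair is (predicted label, actual label); 1.0 is positive, 0.0 is negative.
  type Pairs = Seq[(Double, Double)]

  // Correctly classified count divided by the total count.
  def accuracy(pairs: Pairs): Double = {
    require(pairs.nonEmpty, "need at least one prediction")
    pairs.count { case (predicted, actual) => predicted == actual }.toDouble / pairs.size
  }

  // True negatives, false positives, false negatives, and true positives.
  final case class Confusion(tn: Int, fp: Int, fn: Int, tp: Int)

  def confusion(pairs: Pairs): Confusion = Confusion(
    tn = pairs.count { case (p, a) => p == 0.0 && a == 0.0 },
    fp = pairs.count { case (p, a) => p == 1.0 && a == 0.0 },
    fn = pairs.count { case (p, a) => p == 0.0 && a == 1.0 },
    tp = pairs.count { case (p, a) => p == 1.0 && a == 1.0 }
  )
}
```

With a skewed dataset such as this one, accuracy alone can look flattering, because a model that always predicts the majority class already scores high; the confusion matrix makes that failure mode visible.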
Refer to https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification for more details:

def calculateMetrics(predictsAndActuals: RDD[(Double, Double)], algorithm: String): Unit = {
  val accuracy = 1.0 * predictsAndActuals.filter(predActs => predActs._1 == predActs._2).count() / predictsAndActuals.count()
  val binMetrics = new BinaryClassificationMetrics(predictsAndActuals)
  println(s"************** Printing metrics for $algorithm ***************")
  println(s"Area under ROC ${binMetrics.areaUnderROC}")
  //println(s"Accuracy $accuracy")
  val metrics = new MulticlassMetrics(predictsAndActuals)
  val f1 = metrics.fMeasure
  println(s"F1 $f1")
  println(s"Precision : ${metrics.precision}")
  println(s"Confusion Matrix \n${metrics.confusionMatrix}")
  println(s"************** ending metrics for $algorithm *****************")
}

As we can see from the output, LogisticRegressionWithSGD and SVMWithSGD have a slightly bigger area under curve than LogisticRegressionWithLBFGS, which means that the two models perform a tad bit better.
This is a sample output (your output could vary):

************** Printing metrics for Logistic Regression with SGD ***************
Area under ROC 0.9208860759493671
Accuracy 0.9769585253456221
Confusion Matrix
927.0  0.0
25.0   133.0
************** ending metrics for Logistic Regression with SGD *****************
************** Printing metrics for SVM with SGD ***************
Area under ROC 0.9318656156156157
Precision : 0.9784845650140318
Confusion Matrix
921.0  4.0
19.0   125.0
************** ending metrics for SVM with SGD *****************
************** Printing metrics for Logistic Regression with BFGS ***************
Area under ROC 0.8790559620074445
Accuracy 0.9596136962247586
Confusion Matrix
971.0  9.0
37.0   122.0
************** ending metrics for Logistic Regression with BFGS *****************

Binary classification using LogisticRegression with Pipeline API

Earlier, with the spam example on binary classification, we saw how we prepared the data, separated it into training and test data, trained the model, and evaluated it against test data before we finally arrived at the metrics. This series of steps can be abstracted in a simplified manner using Spark's Pipeline API.

In this recipe, we'll take a look at how to use the Pipeline API to solve the same classification problem. Imagine the pipeline to be a factory assembly line where things happen one after another. In our case, we'll pass our raw unprocessed data through various processors before we finally feed the data into the classifier.

How to do it...

In this recipe, we'll classify the same spam/ham dataset (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) first using the plain Pipeline, and then using a cross-validator to select the best model for us given a grid of parameters.

Let's summarize the steps:
1. Importing and splitting data as test and training sets.
2. Constructing the participants of the Pipeline.
3. Preparing a pipeline and training a model.
4. Predicting against test data.
5. Evaluating the model without cross-validation.
6. Constructing parameters for cross-validation.
7. Constructing a cross-validator and fitting the best model.
8. Evaluating a model with cross-validation.

The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter5-learning/src/main/scala/com/packt/scalada/learning/BinaryClassificationSpamPipeline.scala.

Importing and splitting data as test and training sets

This process is a little different from the previous recipe, in the sense that we don't construct LabeledPoint now. Instead of an RDD of LabeledPoint, the pipeline requires a DataFrame. So, we convert each line of text into a Document object (with the label and the content) and then convert RDD[Document] into a DataFrame by calling the toDF() function on the RDD:

case class Document(label: Double, content: String)

val docs = sc.textFile("SMSSpamCollection").map(line => {
  val words = line.split("\t")
  val label = if (words.head.trim() == "spam") 1.0 else 0.0
  Document(label, words.tail.mkString(" "))
})

//Split dataset
val spamPoints = docs.filter(doc => doc.label == 1.0).randomSplit(Array(0.8, 0.2))
val hamPoints = docs.filter(doc => doc.label == 0.0).randomSplit(Array(0.8, 0.2))

println("Spam count:" + (spamPoints(0).count) + "::" + (spamPoints(1).count))
println("Ham count:" + (hamPoints(0).count) + "::" + (hamPoints(1).count))

val trainingSpamSplit = spamPoints(0)
val testSpamSplit = spamPoints(1)
val trainingHamSplit = hamPoints(0)
val testHamSplit = hamPoints(1)

val trainingSplit = trainingSpamSplit ++ trainingHamSplit
val testSplit = testSpamSplit ++ testHamSplit

import sqlContext.implicits._
val trainingDFrame = trainingSplit.toDF()
val testDFrame = testSplit.toDF()

Construct the participants of the Pipeline

In order to arrange the pipeline, we need to construct its participants.
There are five participants (or pipeline stages) listed for this pipeline, and we have to line them up in the right order:
• Tokenizer: This disintegrates the sentence into tokens
• HashingTF: This creates a term frequency vector from the terms
• IDF: This rescales the term frequency vector by the inverse document frequency
• VectorAssembler: This combines the TF-IDF vector and the label vector to form a single vector, which will form the input features for the classification algorithm
• LogisticRegression: This is the classification algorithm itself

Let's construct these first:

val tokenizer = new Tokenizer().setInputCol("content").setOutputCol("tokens")
val hashingTf = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("tf")
val idf = new IDF().setInputCol(hashingTf.getOutputCol).setOutputCol("tfidf")
val assembler = new VectorAssembler().setInputCols(Array("tfidf", "label")).setOutputCol("features")
val logisticRegression = new LogisticRegression().setFeaturesCol("features").setLabelCol("label").setMaxIter(10)

When the DataFrame built from RDD[Document] is run against the first pipeline stage, that is, Tokenizer, the "content" field of the Document is taken as the input column, and the output of the tokenizer is a bag of words that is captured in the "tokens" output column. HashingTF takes the "tokens" and converts them into a TF vector. Notice that the input column of HashingTF is the same as the output column from the previous stage. IDF takes the tf vector and returns a tf-idf vector. VectorAssembler merges the tf-idf vector and the label to form a single vector. This will be used as an input to the classification algorithm.

Finally, for the LogisticRegression stage, we specify the features column and the label column. However, if the input DataFrame has a column named "label" with a Double type and "features" of type Vector, there is no need to explicitly mention that.
So, in our case, since we have "label" as an attribute of the Document case class and the output column of the HashingTF is named "features", there is no need for us to specify them explicitly. The following code would work just fine:

val logisticRegression = new LogisticRegression().setMaxIter(10)

Internally, this implementation of LogisticRegression constructs LabeledPoints for each instance of the data, and uses some advanced optimization algorithms to derive a model from the training data.

At every stage, each of these transformations occurs against the input DataFrame of that particular stage, and the transformed DataFrame gets passed along until the final stage.

Preparing a pipeline and training a model

As the next step, we just need to form a pipeline out of the various pipeline stages that we constructed in the previous step. We then train a model by calling the pipeline.fit function:

val pipeline = new Pipeline()
pipeline.setStages(Array(tokenizer, hashingTf, logisticRegression))
val model = pipeline.fit(trainingDFrame)

If you are getting java.lang.IllegalArgumentException: requirement failed: Column label must be of type DoubleType but was actually StringType, it just means that your label isn't of the Double type.

Predicting against test data

Using the newly constructed model to predict the data is just a matter of calling the transform method of the model. Then, we also extract the actual label and the predicted value to calculate the metrics:

val predictsAndActualsNoCV: RDD[(Double, Double)] = model.transform(testDFrame)
  .map(r => (r.getAs[Double]("label"), r.getAs[Double]("prediction")))
  .cache

Evaluating a model without cross-validation

Cross-validation is a multiple-iteration model validation technique in which our training and test sets are split into different partitions. The entire dataset is split into subsets, and for each iteration, analysis is done on one subset and validation on a different subset.
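The splitting scheme behind cross-validation can be sketched on a plain Scala collection. This is not how Spark's CrossValidator is implemented; `KFold.split` is a hypothetical helper that assigns elements to folds by index, and it assumes the data has already been shuffled.

```scala
object KFold {
  // Returns k (training, validation) pairs. Every element lands in exactly one
  // validation fold and in the training set of the other k - 1 folds.
  // Assumption for this sketch: the caller has already shuffled the data.
  def split[A](data: Seq[A], k: Int): Seq[(Seq[A], Seq[A])] = {
    require(k >= 2 && k <= data.size, "k must be between 2 and the number of elements")
    val indexed = data.zipWithIndex
    (0 until k).map { fold =>
      val (validation, training) = indexed.partition { case (_, i) => i % k == fold }
      (training.map(_._1), validation.map(_._1))
    }
  }
}
```

A model is trained on each training part and scored on the matching validation part, and the scores are averaged across the k folds to compare candidate parameter settings.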
For this recipe, we'll run the algorithm first without cross-validation, and then with cross-validation.

Firstly, we'll use the same validation metric and method that we used in the previous recipe. We will simply calculate the area under the ROC curve, the precision, and the confusion matrix:

def calculateMetrics(predictsAndActuals: RDD[(Double, Double)], algorithm: String): Unit = {
  val accuracy = 1.0 * predictsAndActuals.filter(predActs => predActs._1 == predActs._2).count() / predictsAndActuals.count()
  val binMetrics = new BinaryClassificationMetrics(predictsAndActuals)
  println(s"************** Printing metrics for $algorithm ***************")
  println(s"Area under ROC ${binMetrics.areaUnderROC}")
  println(s"Accuracy $accuracy")
  val metrics = new MulticlassMetrics(predictsAndActuals)
  println(s"Precision : ${metrics.precision}")
  println(s"Confusion Matrix \n${metrics.confusionMatrix}")
  println(s"************** ending metrics for $algorithm *****************")
}

A sample output of this pipeline without cross-validation is as follows:

************** Printing metrics for Without Cross validation ***************
Area under ROC 0.9676924738149228
Accuracy 0.9656357388316151
Confusion Matrix
993.0  36.0
4.0    131.0
************** ending metrics for Without Cross validation *****************

Constructing parameters for cross-validation

Before we use the cross-validator to choose the best model that fits the data, we would want to provide each of the parameters a set of alternate values that the validator can choose from.
The way we provide alternate values is in the form of a parameter grid:

val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTf.numFeatures, Array(1000, 5000, 10000))
  .addGrid(logisticRegression.regParam, Array(1, 0.1, 0.03, 0.01))
  .build()

So, we say that the number of features (the size of the term frequency vector) that we want HashingTF to generate could be one of 1,000, 5,000, and 10,000, and the regularization parameter for logistic regression could be one of 1, 0.1, 0.03, and 0.01. Thus, in essence, we are passing a 3 x 4 grid, that is, 12 parameter combinations.

Constructing cross-validator and fit the best model

Next, we construct a cross-validator and pass in the following parameters:
• The parameter grid that we constructed in the previous step.
• The pipeline that we constructed in step 3.
• An evaluator for the cross-validator to decide which model is better.
• The number of folds. Say, if we set the number of folds to 10, the training data would be split into 10 blocks. For each of the 10 iterations, a different block would be held out as the cross-validation set, and the other nine would form the training set:

val crossValidator = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(10)

We finally let the cross-validator run against the training dataset and derive the best model out of it. Contrast the following line with pipeline.fit, where we skipped cross-validation:

val bestModel = crossValidator.fit(trainingDFrame)

Evaluating the model with cross-validation

Now, let's evaluate the model that is generated against the actual test data set (rather than the test dataset that the cross-validator uses internally):

val predictsAndActualsWithCV: RDD[(Double, Double)] = bestModel.transform(testDFrame)
  .map(r => (r.getAs[Double]("label"), r.getAs[Double]("prediction")))
  .cache
calculateMetrics(predictsAndActualsWithCV, "Cross validation")

A sample output of this pipeline with cross-validation is as follows:

************** Printing metrics for Cross validation ***************
Area under ROC 0.9968220338983051
Accuracy 0.994579945799458
Confusion Matrix
938.0  6.0
0.0    163.0
************** ending metrics for Cross validation *****************

As we can see, the area under ROC is far better for this model than for any of our previously generated models.

Clustering using K-means

Clustering is a class of unsupervised learning algorithms wherein the dataset is partitioned into a finite number of clusters in such a way that the points within a cluster are similar to each other in some way. This, intuitively, also means that the points of two different clusters should be dissimilar.

K-means is one of the popular clustering algorithms, and in this recipe, we'll be looking at how Spark implements K-means and how to use the algorithm to cluster a sample dataset. Since the number of clusters is a crucial input for the K-means algorithm, we'll also see the most common method of arriving at the optimal number of clusters for the data.

How to do it...

Spark provides two initialization modes for cluster center (centroid) initialization: the original Lloyd's method (https://en.wikipedia.org/wiki/K-means_clustering), and a parallelizable and scalable variant of K-means++ (https://en.wikipedia.org/wiki/K-means%2B%2B). K-means++ itself is a variant of the original K-means and differs in the way in which the initial centroids of the clusters are picked. We can switch between the original and the parallelized K-means++ versions by passing KMeans.RANDOM or KMeans.K_MEANS_PARALLEL as the initialization mode. Let's first look at the details of the implementation.
KMeans.RANDOM

In the regular K-means (the KMeans.RANDOM initialization mode in the case of Spark), the algorithm randomly selects k points (equal to the number of clusters that we expect to see) and marks them as cluster centers (centroids). Then it iteratively does the following:
• It marks all the points as belonging to a cluster based on the distance between a point and its nearest centroid.
• The mean of all the points in a cluster is calculated. This mean is now set as the new centroid of that cluster.
• The rest of the data points are reassigned their clusters based on this new centroid.

Since we generally deal with more than one feature in a dataset, each instance of the data and the centroids are vectors. In Spark, we represent them as org.apache.spark.mllib.linalg.Vector.

KMeans.K_MEANS_PARALLEL

Scalable K-means or K-means|| is a variant of K-means++. Let's look at what these variants of K-means actually do.

K-means++

Instead of choosing all the centroids randomly, the K-means++ algorithm does the following:
1. It chooses the first centroid randomly (uniform).
2. It calculates the squared distance of each of the remaining points from the current centroid.
3. A probability is attached to each of these points based on how far they are. The farther the centroid candidate is, the higher its probability.
4. We choose the second centroid from the distribution that we have in step 3.
5. On the ith iteration, we have 1+i clusters. Find the new centroid by going over the entire dataset and forming a distribution out of these points based on how far they are from all the precomputed centroids.

These steps are repeated over k-1 iterations until k centroids are selected. K-means++ is known for considerably increasing the quality of centroids. However, as we see, in order to select the initial set of centroids, the algorithm goes through the entire dataset k times. Unfortunately, with a large dataset, this becomes a problem.
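The K-means++ seeding steps above can be sketched in plain Scala on an in-memory collection. This is a minimal illustration of the sampling idea, not Spark's implementation; the object name `KMeansPlusPlus`, the `Point` alias, and the `chooseCentroids` signature are assumptions made for this example.

```scala
import scala.util.Random

object KMeansPlusPlus {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def chooseCentroids(points: IndexedSeq[Point], k: Int, rng: Random): IndexedSeq[Point] = {
    require(k >= 1 && k <= points.size, "k must be between 1 and the number of points")
    // Step 1: the first centroid is picked uniformly at random.
    var centroids = Vector(points(rng.nextInt(points.size)))
    while (centroids.size < k) {
      // Steps 2 and 3: weight each point by its squared distance to the nearest chosen centroid.
      val weights = points.map(p => centroids.map(c => squaredDistance(p, c)).min)
      // Step 4: sample one index with probability proportional to its weight.
      val target = rng.nextDouble() * weights.sum
      val cumulative = weights.scanLeft(0.0)(_ + _).tail
      val idx = cumulative.indexWhere(_ > target)
      // Fallback for the degenerate case where every weight is zero.
      val chosen = if (idx >= 0) idx else points.size - 1
      centroids = centroids :+ points(chosen)
    }
    centroids
  }
}
```

Each pass of the loop scans every point once, which is the "goes through the entire dataset k times" cost that motivates the parallel variant.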
K-means||

With K-means parallel (K-means||), for each iteration, instead of choosing a single point after calculating the probability distribution of each of the points in the dataset, a lot more points are chosen. In the case of Spark, the number of samples that are chosen per step is 2 * k. Once these initial centroid candidates are selected, a K-means++ is run against these data points (instead of going through the entire dataset).

Let's now look at the most important parameters that are passed to the algorithm.

Max iterations

There are worst-case scenarios for both random and parallel. In the case of random, since the points in K-means are chosen at random, there is a distinct possibility that the model identifies two centroids from the same cluster. Say with k=3, there is a possibility of two clusters becoming a part of a single cluster and a single cluster being separated into two. A similar case applies to K-means++ with a bad choice of the initial set of centroids.

The following figure shows that though we can see three clusters, a bad choice of centroids separates a single cluster into two and makes two clusters one:

[Figure: three visible clusters partitioned incorrectly because of a bad choice of initial centroids]

To solve this problem, we run the same algorithm with a different set of randomly initialized centroids. In Spark 1.x MLlib, the number of such runs is set through the runs parameter, while the maxIterations parameter caps the number of iterations within each run. For every run, the squared distance between the centroid and the points in the cluster is calculated. This will be the cost of the model. The run with the least cost is chosen and returned. The metric that Spark uses to calculate the distance is the Euclidean distance.

Epsilon

How does the K-means algorithm know when to stop? There will always be a small distance that the centroid can move if the clusters aren't separated by a huge margin. If all the centroids have moved by a distance less than the epsilon parameter, it's the cue to the algorithm that it has converged. In other words, the epsilon is nothing but a convergence threshold.
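To make the roles of the iteration cap and epsilon concrete, here is a minimal single-machine sketch of the assign-and-update loop. It is an illustration under stated assumptions rather than Spark's code: `LloydKMeans`, `run`, and `cost` are hypothetical names, the initial centroids are passed in by the caller, and a centroid that attracts no points simply keeps its position.

```scala
object LloydKMeans {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  private def mean(points: Seq[Point]): Point =
    Array.tabulate(points.head.length)(d => points.map(_(d)).sum / points.size)

  // Returns the final centroids and the number of iterations actually used.
  def run(points: Seq[Point], initial: Seq[Point], maxIterations: Int, epsilon: Double): (Seq[Point], Int) = {
    var centroids = initial
    var iteration = 0
    var converged = false
    while (iteration < maxIterations && !converged) {
      // Assignment: group every point under the index of its nearest centroid.
      val clusters = points.groupBy(p => centroids.indices.minBy(i => squaredDistance(p, centroids(i))))
      // Update: each centroid becomes the mean of its points (unchanged if it has none).
      val updated = centroids.indices.map(i => clusters.get(i).map(mean).getOrElse(centroids(i)))
      // Convergence: stop once every centroid has moved by less than epsilon.
      converged = centroids.zip(updated).forall { case (before, after) =>
        math.sqrt(squaredDistance(before, after)) < epsilon
      }
      centroids = updated
      iteration += 1
    }
    (centroids, iteration)
  }

  // Cost: sum of squared distances from each point to its nearest centroid.
  def cost(points: Seq[Point], centroids: Seq[Point]): Double =
    points.map(p => centroids.map(c => squaredDistance(p, c)).min).sum
}
```

Running this from several different starting centroids and keeping the result with the lowest cost is the usual guard against the bad initializations described above.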
Now that we have the parameters that need to be passed to the K-means algorithm out of the way, let's look at the steps needed to run this algorithm to find the clusters:
1. Importing the data and converting it into a vector.
2. Feature scaling the data.
3. Deriving the number of clusters.
4. Constructing the model.
5. Evaluating the model.

The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter5-learning/src/main/scala/com/packt/scalada/learning/KMeansClusteringIris.scala.

Importing the data and converting it into a vector

As usual, our input data is in the form of a text file: iris.data. Since we are clustering, we can ignore the label (species) in the data. The data file looks like this:

[Figure: sample lines from the iris.data file]

val data = sc.textFile("iris.data").map(line => {
  val dataArray = line.split(",").take(4)
  Vectors.dense(dataArray.map(_.toDouble))
})

Feature scaling the data

When we look at the summary statistics of the data, the data looks alright, but it is always advisable to perform feature scaling before running K-means:

val stats = Statistics.colStats(data)
println("Statistics before scaling")
print(s"Max : ${stats.max}, Min : ${stats.min}, and Mean : ${stats.mean} and Variance : ${stats.variance}")

Here are the statistics before scaling:

Max : [7.9,4.4,6.9,2.5]
Min : [4.3,2.0,1.0,0.1]
Mean : [5.843333333333332,3.0540000000000003,3.7586666666666666,1.1986666666666668]
Variance : [0.685693512304251,0.18800402684563744,3.113179418344516,0.5824143176733783]

We run the data through StandardScaler and cache the resulting RDD. Since K-means goes through the dataset multiple times, caching the data is strongly recommended to avoid recomputation:

//Scale data
val scaler = new StandardScaler(withMean = true, withStd = true).fit(data)
val scaledData = scaler.transform(data).cache()

The following are the statistics after scaling:

Max : [2.483698580557868,3.1042842692548858,1.7803768862629268,1.7051890410833728]
Min : [-1.8637802962695154,-2.4308436996988485,-1.5634973589465175,-1.4396268133736672]
Mean : [1.6653345369377348E-15,7.216449660063518E-16,-1.1102230246251565E-16,-3.3306690738754696E-16]
Variance : [0.9999999999999997,1.0000000000000007,1.0000000000000013,0.9999999999999997]

Deriving the number of clusters

Many times, we already know the number of clusters that are there in the dataset. But at times, if we aren't sure, the general method is to plot the number of clusters against the cost and watch out for the point from which the cost stops falling drastically. If the data is large, running the entire set of data just to obtain the number of clusters is computationally expensive. Instead, we can take a random sample and come up with the k value. In this example, we have taken a random 20% sample, but the sample percentage depends entirely on your dataset:

//Take a sample to come up with the number of clusters
val sampleData = scaledData.sample(false, 0.2).cache()

//Decide number of clusters
val clusterCost = (1 to 7).map { noOfClusters =>
  val kmeans = new KMeans()
    .setK(noOfClusters)
    .setMaxIterations(5)
    .setInitializationMode(KMeans.K_MEANS_PARALLEL) //KMeans||
  val model = kmeans.run(sampleData)
  (noOfClusters, model.computeCost(sampleData))
}

println("Cluster cost on sample data")
clusterCost.foreach(println)

When we plot this, we can see that beyond three clusters, the cost does not reduce drastically.
This point is called an elbow bend, as shown here:

Constructing the model

Now that we have figured out the number of clusters, let's run the algorithm against the entire dataset:

    //Let's do the real run for 3 clusters
    val kmeans = new KMeans()
      .setK(3)
      .setMaxIterations(5)
      .setInitializationMode(KMeans.K_MEANS_PARALLEL) //KMeans||
    val model = kmeans.run(scaledData)

Evaluating the model

The last step is to evaluate the model by printing its cost. The cost is the sum of the squared distances between each point and the centroid of the cluster it is assigned to. For a given number of clusters, a lower cost therefore indicates a tighter fit:

    //Cost
    println("Total cost " + model.computeCost(scaledData))
    printClusterCenters(model)

    def printClusterCenters(model: KMeansModel) {
      //Cluster centers
      val clusterCenters: Array[Vector] = model.clusterCenters
      println("Cluster centers")
      clusterCenters.foreach(println)
    }

Here is the output:

    Total cost 34.98320617204239
    Cluster centers
    [-0.011357501034038157,0.8699705596441868,0.3756258413625911,0.3106129627676019]
    [1.1635361185919766,0.1532643388373168,0.999796072473665,1.0261947088710572]
    [-1.0111913832028123,0.839494408624649,-1.3005214861029282,1.250937862106244]

Feature reduction using principal component analysis

As the curse of dimensionality (https://en.wikipedia.org/wiki/Curse_of_dimensionality) reminds us, a large number of features is computationally expensive. One way of reducing the number of features is to manually choose the features to ignore. However, identifying duplicate features (the same information represented differently) or highly correlated features is laborious when we have a huge number of features. Dimensionality reduction is aimed at reducing the number of features in the data while still retaining its variability. Say we have a dataset of housing prices and there are two features that represent the area of the house in feet and meters; we can always drop one of these two.
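The housing example above can be made concrete with a small check for redundant features. The following is a minimal sketch in plain Scala (no Spark), assuming hypothetical area values invented for illustration; the object and method names are not part of the recipe's code. It computes the Pearson correlation between two columns; a value very close to 1 or -1 suggests that one of the two columns carries no extra information.

```scala
object RedundantFeatures {
  /** Pearson correlation coefficient between two equally sized columns. */
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.nonEmpty, "columns must be non-empty and of equal length")
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    // Sum of the products of the deviations from the mean
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = xs.map(x => math.pow(x - meanX, 2)).sum
    val varY = ys.map(y => math.pow(y - meanY, 2)).sum
    cov / math.sqrt(varX * varY)
  }

  // Hypothetical house areas, for illustration only.
  val areaSqFeet: Seq[Double] = Seq(850.0, 1200.0, 1500.0, 2100.0)
  // The same areas in another unit: a constant multiple of the first column.
  val areaSqMeters: Seq[Double] = areaSqFeet.map(_ * 0.092903)

  def main(args: Array[String]): Unit =
    println(s"correlation = ${pearson(areaSqFeet, areaSqMeters)}")
}
```

A pairwise check like this does not scale to thousands of features, which is exactly the situation where PCA becomes attractive. Note that a constant column has zero variance, so the sketch returns NaN for it.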
Dimensionality reduction is very useful when dealing with text, where the number of features easily runs into a few thousand. In this recipe, we'll be looking into Principal Component Analysis (PCA) as a means to reduce the dimensions of data that is meant for both supervised and unsupervised learning.

How to do it...

As we have seen earlier, the only difference between the data for supervised and unsupervised learning is that the training and the test data for supervised learning have labels attached to them. This brings in a little complication, considering that we are interested only in reducing the dimensions of the feature vector and would like to retain the labels as they are.

Dimensionality reduction of data for supervised learning

The only thing that we have to watch out for while reducing the dimensions of data to be used as training data for supervised learning is that PCA must be applied on the training data only. The test set must not be used to extract the components. Using test data for PCA would bleed the information in the test data into the components. This may result in higher accuracy numbers while testing, but the model could perform poorly on unseen production data.

Choosing the smallest number of components that still retains a sufficiently high variance is made easier by the singular value vector available in the SingularValueDecomposition object. The singular values, available as svd.s, indicate how much of the variance each component captures. The first component will be the most important (by contributing the highest variance), and the importance will slowly diminish.

In order to come up with the probable number of dimensions, we can watch out for the difference and the extent to which the singular values diminish. Alternatively, we can just use simple heuristics and come up with a reasonable number if the features extend to a few thousand:

    val dimensionDecidingSample = new RowMatrix((trainingSplit.randomSplit(Array(0.8, 0.2))(1)).map(lp => lp.features))
    val svd = dimensionDecidingSample.computeSVD(500, computeU = false)
    val sum = svd.s.toArray.sum

    //Calculate the number of principal components which retains a variance of 95%
    val featureRange = (0 to 500)
    val placeholder = svd.s.toArray.zip(featureRange).foldLeft(0.0) {
      case (cum, (curr, component)) =>
        val percent = (cum + curr) / sum
        println(s"Component and percent ${component + 1} :: $percent :::: Singular value is : $curr")
        cum + curr
    }

The steps that are involved are as follows:

1. Mean-normalizing the training data.
2. Extracting the principal components.
3. Preparing the labeled data.
4. Preparing the test data.
5. Classifying and evaluating the metrics.

The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter5-learning/src/main/scala/com/packt/scalada/learning/PCASpam.scala.

Mean-normalizing the training data

It is highly recommended that the data be centered before running PCA on it. We achieve this using the fit and transform functions of StandardScaler. However, since the scaler in Spark accepts a DenseVector as the argument, we'll use the Vectors.dense factory method to convert the features in the labeled point into a DenseVector:

    val docs = sc.textFile("SMSSpamCollection").map(line => {
      val words = line.split("\t")
      Document(words.head.trim(), words.tail.mkString(" "))
    }).cache()

    val labeledPointsWithTf = getLabeledPoints(docs)
    val lpTfIdf = withIdf(labeledPointsWithTf).cache()

    //Split dataset
    val spamPoints = lpTfIdf.filter(point => point.label == 1).randomSplit(Array(0.8, 0.2))
    val hamPoints = lpTfIdf.filter(point => point.label == 0).randomSplit(Array(0.8, 0.2))

    val trainingSpamSplit = spamPoints(0)
    val trainingHamSplit = hamPoints(0)
    val trainingData = trainingSpamSplit ++ trainingHamSplit

    val unlabeledTrainData = trainingData.map(lpoint => Vectors.dense(lpoint.features.toArray)).cache()

    //Scale data - Does not support scaling of SparseVector.
    val scaler = new StandardScaler(withMean = true, withStd = false).fit(unlabeledTrainData)
    val scaledTrainingData = scaler.transform(unlabeledTrainData).cache()

Extracting the principal components

The computePrincipalComponents function is available in RowMatrix. So, we wrap our scaled training data into a RowMatrix and then extract 100 principal components out of it (as shown earlier, the number 100 is based on a run against a sample set of data and on investigating the singular value vector of the SVD). Our training data is currently a 4419 x 5000 matrix: 4419 instances of data by 5000 features, a limit that we set while generating the term frequency using HashingTF. We then multiply this training matrix (4419 x 5000) by the principal component matrix (5000 x 100) to arrive at a 4419 x 100 matrix: 4419 instances of data by 100 features (principal components). We can extract the feature vectors from this matrix by calling rows:

    val trainMatrix = new RowMatrix(scaledTrainingData)
    val pcomp: Matrix = trainMatrix.computePrincipalComponents(100)
    val reducedTrainingData = trainMatrix.multiply(pcomp).rows.cache()

Preparing the labeled data

Now that we have reduced the data fifty-fold, the next step is to use this reduced data in our algorithm to see how it fares. The classification algorithm (in this case, LogisticRegressionWithLBFGS) requires an RDD of LabeledPoints. To construct the LabeledPoint, we extract the label from the original trainingData and the feature vector from the dimension-reduced dataset:

    val reducedTrainingSplit = trainingData.zip(reducedTrainingData).map {
      case (labeled, reduced) => new LabeledPoint(labeled.label, reduced)
    }

Preparing the test data

Before predicting our test data against the algorithm, we need to bring the test data to the same dimension as the training data.
This is achieved by multiplying the test matrix by the principal component matrix. As discussed earlier, we just need to make sure that we don't compute the principal components afresh here:

    val unlabeledTestData = testSplit.map(lpoint => lpoint.features)
    val testMatrix = new RowMatrix(unlabeledTestData)
    val reducedTestData = testMatrix.multiply(pcomp).rows.cache()
    val reducedTestSplit = testSplit.zip(reducedTestData).map {
      case (labeled, reduced) => new LabeledPoint(labeled.label, reduced)
    }

Classify and evaluate the metrics

The final step is to classify and evaluate the results of the algorithm. This step is the same as the classification recipe that we saw earlier. From the output, we can see that we not only reduced the number of features from 5,000 to 100, but also managed to maintain the accuracy of the algorithm at the same levels:

    val logisticWithBFGS = getAlgorithm(10, 1, 0.001)
    val logisticWithBFGSPredictsActuals = runClassification(logisticWithBFGS, reducedTrainingSplit, reducedTestSplit)
    calculateMetrics(logisticWithBFGSPredictsActuals, "Logistic with BFGS")

    def getAlgorithm(iterations: Int, stepSize: Double, regParam: Double) = {
      val algo = new LogisticRegressionWithLBFGS()
      algo.setIntercept(true).optimizer.setNumIterations(iterations).setRegParam(regParam)
      algo
    }

    def runClassification(algorithm: GeneralizedLinearAlgorithm[_ <: GeneralizedLinearModel], trainingData: RDD[LabeledPoint], testData: RDD[LabeledPoint]): RDD[(Double, Double)] = {
      val model = algorithm.run(trainingData)
      println("predicting")
      val predicted = model.predict(testData.map(point => point.features))
      val actuals = testData.map(point => point.label)
      val predictsAndActuals: RDD[(Double, Double)] = predicted.zip(actuals)
      println(predictsAndActuals.collect)
      predictsAndActuals
    }

    def calculateMetrics(predictsAndActuals: RDD[(Double, Double)], algorithm: String) {
      val accuracy = 1.0 * predictsAndActuals.filter(predActs => predActs._1 == predActs._2).count() / predictsAndActuals.count()
      val binMetrics = new BinaryClassificationMetrics(predictsAndActuals)
      println(s"************** Printing metrics for $algorithm ***************")
      println(s"Area under ROC ${binMetrics.areaUnderROC}")
      println(s"Accuracy $accuracy")
      val metrics = new MulticlassMetrics(predictsAndActuals)
      println(s"Precision : ${metrics.precision}")
      println(s"Confusion Matrix \n${metrics.confusionMatrix}")
      println(s"************** ending metrics for $algorithm *****************")
    }

This is the output. With the area under the ROC staying at around the same level (roughly 94%), we have considerably reduced the time of the run by reducing the number of features fifty-fold:

    ************** Printing metrics for Logistic with BFGS ***************
    Area under ROC 0.9428948576675849
    Accuracy 0.9829136690647482
    Confusion Matrix
    965.0  3.0
    16.0   128.0
    ************** ending metrics for Logistic with BFGS *****************

Note that the entire code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter5-learning/src/main/scala/com/packt/scalada/learning/PCASpam.scala.

Dimensionality reduction of data for unsupervised learning

Unlike reducing the dimensions of data with labels, reducing the dimensionality of data for unsupervised learning is very simple. We just apply PCA to the entire dataset. This helps a lot in improving the performance of algorithms such as K-means, where every point lives in a high-dimensional feature space and the entire dataset must be visited multiple times. Fewer features mean fewer dimensions and less data to be held in memory.
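Returning briefly to the output of the supervised recipe above, the printed accuracy can be cross-checked by hand from the confusion matrix. The following is a minimal sketch in plain Scala; the object and method names are illustrative only, and it assumes that the rows of the matrix are the actual labels and the columns the predicted labels, in ascending label order (0 = ham, 1 = spam).

```scala
object ConfusionMatrixMetrics {
  // Counts copied from the confusion matrix printed in the recipe output.
  // Assumption: rows are actual labels, columns are predicted labels.
  val matrix: Array[Array[Double]] = Array(
    Array(965.0, 3.0),
    Array(16.0, 128.0)
  )

  /** Fraction of all predictions that fall on the diagonal. */
  def accuracy(m: Array[Array[Double]]): Double = {
    val correct = m.indices.map(i => m(i)(i)).sum
    val total = m.map(_.sum).sum
    correct / total
  }

  /** Recall of one class: correct predictions divided by the total of its actual row. */
  def recall(m: Array[Array[Double]], label: Int): Double =
    m(label)(label) / m(label).sum

  def main(args: Array[String]): Unit = {
    println(s"Accuracy    ${accuracy(matrix)}")
    println(s"Spam recall ${recall(matrix, 1)}")
  }
}
```

The diagonal holds 965 + 128 = 1093 correct predictions out of 1112, which agrees with the accuracy printed by calculateMetrics. Under the stated row/column assumption, the per-class recall also shows that the errors are concentrated in the much smaller spam class, something a single accuracy figure hides.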
For this recipe, we use the iris.data file that we used for clustering earlier. The dataset has only four features, so it isn't a great candidate for dimensionality reduction as such. However, the process of reducing dimensions for unlabeled data is the same as for any other dataset.

The steps that are involved are as follows:

1. Mean-normalizing the training data.
2. Extracting the principal components.
3. Arriving at the number of components.
4. Evaluating the metrics.

The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter5-learning/src/main/scala/com/packt/scalada/learning/PCAIris.scala.

Mean-normalizing the training data

As we saw earlier, mean-normalizing (centering) the data is a must before reducing dimensions:

    val scaler = new StandardScaler(withMean = true, withStd = false).fit(data)
    val scaledData = scaler.transform(data).cache()

Extracting the principal components

As we saw earlier, to compute the principal components, we need to wrap our scaled training data into a RowMatrix. We then multiply the matrix by the principal component matrix to arrive at the reduced matrix. We can extract the feature vectors from this matrix by calling rows:

    val matrix = new RowMatrix(scaledData)
    val pcomp: Matrix = matrix.computePrincipalComponents(3)
    val reducedData = matrix.multiply(pcomp).rows

Arriving at the number of components

While we would like to have the smallest possible number of components, the other goal is to retain as much of the variance in the data as possible. In this case, a run against three components was made, and we could see that by holding on to just two components out of the four, we retained about 85% of the variance.
However, since we wanted to retain close to 95%, three components were chosen:

    val svd = matrix.computeSVD(3)
    val sum = svd.s.toArray.sum
    svd.s.toArray.zipWithIndex.foldLeft(0.0) {
      case (cum, (curr, component)) =>
        val percent = (cum + curr) / sum
        println(s"Component and percent ${component + 1} :: $percent :::: Singular value is : $curr")
        cum + curr
    }

The output is as follows:

    Component and percent 1 :: 0.6893434455825798 :::: Singular value is : 25.089863978899867
    Component and percent 2 :: 0.8544090583609627 :::: Singular value is : 6.0078525425063365
    Component and percent 3 :: 0.9483881906752903 :::: Singular value is : 3.4205353829523646
    Component and percent 4 :: 1.0 :::: Singular value is : 1.878502340103494

Evaluating the metrics

After we have reduced the dimensions of the data from four to three (!?), for fun, we run the data against a range of one to seven clusters to see the elbow bend. When we compare the results of this with the K-means clustering without dimensionality reduction, the results look practically the same:

    val clusterCost = (1 to 7).map { noOfClusters =>
      val kmeans = new KMeans()
        .setK(noOfClusters)
        .setMaxIterations(5)
        .setInitializationMode(KMeans.K_MEANS_PARALLEL) //KMeans||
      val model = kmeans.run(reducedData)
      (noOfClusters, model.computeCost(reducedData))
    }

Here is the output. The following screenshot shows the cost across various numbers of clusters:

Here is a screenshot that shows strikingly similar results for the elbow bend:

In this chapter, we first saw the difference between supervised and unsupervised learning. Then we explored a sample of machine learning algorithms in Spark: LinearRegression for predicting continuous values, LogisticRegression and SVM for classification, K-means for clustering, and finally PCA for dimensionality reduction. There are plenty of other algorithms in Spark, and more algorithms are being added to Spark with every version, both batch and streaming.
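As a recap of the component-selection step in the PCA recipe above, the cumulative-share calculation can be isolated from Spark and turned into a small helper. This is a minimal sketch in plain Scala; the object and method names are hypothetical, and the singular values are simply copied from the Iris output shown earlier. It mirrors the recipe in accumulating the singular values directly; a more conventional measure of explained variance would accumulate their squares instead.

```scala
object ComponentSelection {
  /** Cumulative share of the total for each leading component, as in the recipe's foldLeft. */
  def cumulativeShare(singularValues: Seq[Double]): Seq[Double] = {
    require(singularValues.nonEmpty, "at least one singular value is needed")
    // Running totals: s1, s1 + s2, s1 + s2 + s3, ...
    val running = singularValues.scanLeft(0.0)(_ + _).tail
    running.map(_ / running.last)
  }

  /** Smallest number of leading components whose cumulative share reaches the threshold
    * (0 if the threshold is never reached, that is, if it is greater than 1). */
  def componentsFor(singularValues: Seq[Double], threshold: Double): Int =
    cumulativeShare(singularValues).indexWhere(_ >= threshold) + 1

  // Singular values copied from the output shown earlier for the Iris data.
  val irisSingularValues: Seq[Double] =
    Seq(25.089863978899867, 6.0078525425063365, 3.4205353829523646, 1.878502340103494)

  def main(args: Array[String]): Unit = {
    cumulativeShare(irisSingularValues).zipWithIndex.foreach {
      case (share, i) => println(s"${i + 1} component(s) :: $share")
    }
    println(s"Components needed for 85%: ${componentsFor(irisSingularValues, 0.85)}")
  }
}
```

With these values, two components clear an 85% threshold, three clear 90%, and a strict 95% threshold would require all four, which is why the choice of threshold deserves as much attention as the decomposition itself.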
6 Scaling Up

In this chapter, we will cover the following recipes:

- Building the Uber JAR
- Submitting jobs to the Spark cluster (local)
- Running the Spark standalone cluster on EC2
- Running the Spark job on Mesos (local)
- Running the Spark job on YARN (local)

Introduction

In this chapter, we'll be looking at how to bundle our Spark application and deploy it on various distributed environments. As we discussed earlier in Chapter 3, Loading and Preparing Data – DataFrame, the foundation of Spark is the RDD. From a programmer's perspective, the composability of RDDs, much like that of a regular Scala collection, is a huge advantage. An RDD wraps three vital (and two subsidiary) pieces of information that help in reconstructing the data. This enables fault tolerance. The other major advantage is that while the processing of RDDs could be composed into hugely complex graphs using RDD operations, the entire flow of data itself is not very difficult to reason about.

Other than optional optimization attributes, such as data location, an RDD at its core wraps only three vital pieces of information:

- The dependent/parent RDD (empty if not available)
- The number of partitions
- The function that needs to be applied to each element of the RDD

Spark spawns one task per partition. So, a partition is the basic unit of parallelism in Spark. The number of partitions could be any of these:

- Dictated by the number of blocks in the case of reading files
- A number set by the spark.default.parallelism parameter (set while starting the cluster)
- A number set by calling repartition or coalesce on the RDD

So far, we have just run our Spark application in the self-contained single JVM mode. While the programs work just fine, we have not yet exploited the distributed nature of the RDDs.

As always, all the code snippets for this chapter can be downloaded from https://github.com/arunma/ScalaDataAnalysisCookbook/tree/master/chapter6-scalingup.
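To see why the three pieces of information listed above are enough to rebuild lost data, consider the following toy model. It is a minimal sketch in plain Scala and is not Spark's implementation; ToyRdd, fromSeq, and every other name here are hypothetical and exist only for illustration. Each dataset records its parent, its number of partitions, and a function that can compute any one partition on demand.

```scala
object ToyLineage {
  /**
   * A toy stand-in for an RDD (NOT Spark's implementation): it records the
   * parent, the number of partitions, and the function that computes a partition.
   */
  final class ToyRdd[T](
      val parent: Option[ToyRdd[_]],
      val numPartitions: Int,
      val compute: Int => Iterator[T]) {

    /** Derives a child dataset; nothing is computed until a partition is requested. */
    def map[U](f: T => U): ToyRdd[U] =
      new ToyRdd[U](Some(this), numPartitions, p => compute(p).map(f))

    /** Computes every partition in turn (Spark would run one task per partition). */
    def collect(): Seq[T] = (0 until numPartitions).flatMap(p => compute(p))
  }

  /** Source dataset: splits `data` into `numPartitions` roughly equal slices. */
  def fromSeq[T](data: Seq[T], numPartitions: Int): ToyRdd[T] = {
    require(numPartitions > 0, "numPartitions must be positive")
    val sliceSize = math.max(1, math.ceil(data.size.toDouble / numPartitions).toInt)
    val slices = data.grouped(sliceSize).toVector
    new ToyRdd[T](None, numPartitions, p => slices.lift(p).getOrElse(Seq.empty).iterator)
  }

  def main(args: Array[String]): Unit = {
    val doubled = fromSeq(1 to 10, numPartitions = 3).map(_ * 2)
    // "Losing" partition 1 is harmless: it can be rebuilt from the lineage alone.
    println(doubled.compute(1).toList)
    println(doubled.collect())
  }
}
```

Because a child dataset only stores a function over its parent's partitions, any single partition can be recomputed independently of the others. That is the essence of lineage-based fault tolerance, and it is also why one task per partition is a natural unit of parallelism.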
Building the Uber JAR

The first step for deploying our Spark application on a cluster is to bundle it into a single Uber JAR, also known as the assembly JAR. In this recipe, we'll be looking at how to use the SBT assembly plugin to generate the assembly JAR. We'll be using this assembly JAR in subsequent recipes when we run Spark in distributed mode. We could alternatively set dependent JARs using the spark.driver.extraClassPath property (https://spark.apache.org/docs/1.3.1/configuration.html#runtime-environment). However, for a large number of dependent JARs, this is inconvenient.

How to do it...

The goal of building the assembly JAR is to build a single, fat JAR that contains all dependencies and our Spark application. Refer to the following screenshot, which shows the innards of an assembly JAR. You can see not only the application's files in the JAR, but also all the packages and files of the dependent libraries:

The assembly JAR can easily be built in SBT using the SBT assembly plugin (https://github.com/sbt/sbt-assembly). In order to install the sbt-assembly plugin, let's add the following line to our project/assembly.sbt:

    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

Next, the most common issue that we face while trying to build the assembly JAR (or Uber JAR) is the problem of duplicates: duplicate transitive dependency JARs, or simply duplicate files located at the same location (such as MANIFEST.MF) in different bundled JARs. The easiest way to figure out where the duplicates come from is to install the sbt-dependency-graph plugin (https://github.com/jrudolph/sbt-dependency-graph) and check which two trees bring in the conflicting JAR.

In order to add the sbt-dependency-graph plugin, let's add the following line to our project/plugins.sbt:

    addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.7.5")

Let's try to build the Uber JAR using sbt assembly.
When we issue this command from the root of the project, we get an error that tells us that we have duplicate files in our JAR. Let's see an example of a duplicate error message that we might face:

    deduplicate: different file contents found in the following:
    /Users/Gabriel/.ivy2/cache/org.apache.xmlbeans/xmlbeans/jars/xmlbeans-2.3.0.jar:org/w3c/dom/DOMStringList.class
    /Users/Gabriel/.ivy2/cache/xml-apis/xml-apis/jars/xml-apis-1.4.01.jar:org/w3c/dom/DOMStringList.class
    deduplicate: different file contents found in the following:
    /Users/Gabriel/.ivy2/cache/org.apache.xmlbeans/xmlbeans/jars/xmlbeans-2.3.0.jar:org/w3c/dom/TypeInfo.class
    /Users/Gabriel/.ivy2/cache/xml-apis/xml-apis/jars/xml-apis-1.4.01.jar:org/w3c/dom/TypeInfo.class
    deduplicate: different file contents found in the following:
    /Users/Gabriel/.ivy2/cache/org.apache.xmlbeans/xmlbeans/jars/xmlbeans-2.3.0.jar:org/w3c/dom/UserDataHandler.class
    /Users/Gabriel/.ivy2/cache/xml-apis/xml-apis/jars/xml-apis-1.4.01.jar:org/w3c/dom/UserDataHandler.class

This happens most commonly if:

- Two different libraries in our sbt dependencies depend on the same external library (or libraries that have bundled the classes with the same package)
- We have explicitly stated the transitive dependency as a separate dependency in sbt

Whatever the case, it is always recommended to go through the entire dependency tree to trim it down.

Transitive dependency stated explicitly in the SBT dependency

A simpler way is to export the dependency tree in an ASCII tree format and eyeball it to find the two instances where the xmlbeans JAR is referred to. The sbt dependency graph plugin lets us do that. Once we have installed the plugin as per the instructions, we can export and inspect the dependency tree:

    sbt dependency-tree > deptree.txt

The tree can also be visualized as a real graph (however, this lacks the text search capabilities). The sbt dependency graph plugin helps us with that too.
We can export the same tree as a .dot file using this command:

    sbt dependency-dot > depdot.dot

It outputs a depdot.dot file in our target directory, which can be opened using Graphviz (http://www.graphviz.org/). Refer to the following screenshot to see what the visualization of a .dot file in Graphviz looks like:

As we can see in lines 96 and 573 of the dependency tree (refer to https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter6-scalingup/depgraph_xmlbeans_duplicate.txt; its screenshot is given), there are two instances of the import of xmlbeans: once in the tree that leads to org.scalanlp:epic-parser-en-span_2.10:2015.2.19, and once in the tree that leads to org.scalanlp:epic_2.10:0.3.1.

If you look at the second level of the epic-parser library's tree, you will realize that it is the epic library itself. So, we can resolve this error by removing org.scalanlp:epic_2.10:0.3.1 from the list of dependencies in our build.sbt file.

Two different libraries depend on the same external library

Even after we have removed the epic library, we still see some issues with the xercesImpl and xml-apis JARs. When we analyze the dependency tree, we see that two dependent libraries of epic depend on xerces, the XML API, and the Scala library itself!

We notice that the epic library has a dependency on the Scala library, but we also know that the Scala library should already be available on the master and the worker nodes.
We can exclude the Scala library altogether from getting bundled using the assemblyOption key:

    assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

Next, in order to exclude the xml-apis library from the epic library, we use the exclude function:

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
      "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
      "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
      "com.databricks" %% "spark-csv" % "1.0.3",
      ("org.scalanlp" % "epic-parser-en-span_2.10" % "2015.2.19").exclude("xml-apis", "xml-apis")
    )

As for the rest of the conflicting files, we can use the assembly plugin's merge strategy to resolve the conflict. Since we are merging the contents of multiple JARs, there is a distinct possibility of a similarly named file being available on the same path, for example, MANIFEST.MF. The sbt-assembly plugin provides various strategies to resolve conflicts if the contents of the file in the same location don't match. The default strategy is to throw an error, but we can customize the strategy to suit our needs. In the merge strategy, we append the contents of application.conf if there are multiple conf files in the JARs, use the first matching class/file in the order of the class path for the org.cyberneko.html package, and discard all the manifest files.
For all others, we apply the default strategy:

    assemblyMergeStrategy in assembly := {
      case "application.conf" => MergeStrategy.concat
      case PathList("org", "cyberneko", "html", xs @ _*) => MergeStrategy.first
      case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
      case f => (assemblyMergeStrategy in assembly).value(f)
    }

The entire build.sbt looks like this:

    organization := "com.packt"

    name := "chapter6-scalingup"

    scalaVersion := "2.10.4"

    val sparkVersion = "1.4.1"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
      "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
      "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
      "com.databricks" %% "spark-csv" % "1.0.3",
      ("org.scalanlp" % "epic-parser-en-span_2.10" % "2015.2.19").exclude("xml-apis", "xml-apis")
    )

    assemblyJarName in assembly := "scalada-learning-assembly.jar"

    assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

    assemblyMergeStrategy in assembly := {
      case "application.conf" => MergeStrategy.concat
      case PathList("org", "cyberneko", "html", xs @ _*) => MergeStrategy.first
      case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
      case f => (assemblyMergeStrategy in assembly).value(f)
    }

So finally, when we do an sbt assembly, scalada-learning-assembly.jar is created. If you would like the JAR name to be picked up from the build.sbt file's name and version, just delete the assemblyJarName key from build.sbt:

    > sbt clean assembly

Submitting jobs to the Spark cluster (local)

There are multiple components involved in running Spark in distributed mode. In the self-contained application mode (the main program that we have run throughout this book so far), all of these components run on a single JVM.
The following diagram elaborates the various components and their functions in running the Scala program in distributed mode:

As a first step, the RDD graph that we construct using the various operations on our RDD (map, filter, join, and so on) is passed to the Directed Acyclic Graph (DAG) scheduler. The DAG scheduler optimizes the flow and converts all RDD operations into groups of tasks called stages. Generally, all tasks before a shuffle are wrapped into a stage. Consider operations in which there is a one-to-one mapping between tasks; for example, a map or filter operator yields one output for every input. If there is a map on an element of an RDD followed by a filter, the two are generally pipelined to form a single task that can be executed by a single worker, not to mention the benefits of data locality. Relating this to traditional Hadoop MapReduce, where data is written to the disk at every stage, helps us really appreciate the Spark lineage graph.

These shuffle-separated stages are then passed to the task scheduler, which splits them into tasks and submits them to the cluster manager. Spark comes bundled with a simple cluster manager that can receive the tasks and run them against a set of worker nodes. However, Spark applications can also be run on popular cluster managers, such as Mesos and YARN.

With YARN/Mesos, we can run multiple executors on the same worker node. Besides, YARN and Mesos can host non-Spark jobs in their cluster along with Spark jobs.

In the Spark standalone cluster, prior to Spark 1.4, the number of executors per worker node per application was limited to 1. However, we could increase the number of worker instances per worker node using the SPARK_WORKER_INSTANCES parameter. With Spark 1.4 (https://issues.apache.org/jira/browse/SPARK-1706), we are able to run multiple executors on the same node, just as in Mesos/YARN.
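The pipelining of a map followed by a filter, described above, can be illustrated without a cluster. The following is a minimal sketch that uses plain Scala iterators as an analogy for how operations with one-to-one dependencies are fused within a task; the object and method names are hypothetical. It records the order in which elements are touched, which shows that each element flows through both operations before the next element is read and that no intermediate collection is built.

```scala
object PipeliningSketch {
  /** Applies map then filter lazily and records the order in which elements are touched. */
  def pipelined(input: Seq[Int]): (List[Int], List[String]) = {
    val log = scala.collection.mutable.ListBuffer.empty[String]
    val result = input.iterator
      .map { x => log += s"map($x)"; x * 10 }       // nothing runs yet: iterators are lazy
      .filter { y => log += s"filter($y)"; y > 10 } // still lazy
      .toList                                       // pulling the results drives both steps
    (result, log.toList)
  }

  def main(args: Array[String]): Unit = {
    val (result, log) = pipelined(Seq(1, 2, 3))
    println(result)
    println(log.mkString(" -> "))
  }
}
```

The log alternates between map and filter entries rather than listing all the map calls first. A shuffle breaks this pattern, because an output partition then depends on many input partitions, which is why the scheduler draws stage boundaries there.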
If we intend to run multiple worker instances within a single machine, we must ensure that we configure the SPARK_WORKER_CORES property to limit the number of cores that can be used by each worker. The default is all!

In this recipe, we will be deploying the Spark application on a standalone cluster running on a single machine. For all the recipes in this chapter, we'll be using the binary classification app that we built in the previous chapter as a deployment candidate. This recipe assumes that you have some knowledge of the concepts of HDFS and basic operations on them.

How to do it...

Submitting a Spark job to the local cluster involves the following steps:

1. Downloading Spark.
2. Running HDFS in pseudo-distributed mode.
3. Running the Spark master and slave locally.
4. Pushing data into HDFS.
5. Submitting the Spark application on the cluster.

Downloading Spark

Throughout this book, we have been using Spark version 1.4.1, as we can see in our build.sbt. Now, let's head over to the download page (https://spark.apache.org/downloads.html) and download the spark-1.4.1-bin-hadoop2.6.tgz bundle, as shown here:

Running HDFS in pseudo-distributed mode

Instead of loading the file from the local filesystem for our Spark application, let's have the file stored away in HDFS. In order to do this, let's have a locally running pseudo-distributed cluster (https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation) of Hadoop 2.6.0.

After formatting our name node using bin/hdfs namenode -format and bringing up our data node and name node using sbin/start-dfs.sh, let's confirm that all the processes that we need are running properly. We do this using jps.
The following screenshot shows what you are expected to see once you start the dfs daemon:

Running the Spark master and slave locally

In order to submit our assembly JAR to a Spark cluster, we have to first bring up the Spark master and worker nodes. All that we need to do to run Spark on the local machine is go to the downloaded (and extracted) spark folder and run sbin/start-all.sh from the spark home directory. This will bring up the Master and a Worker node of Spark. The Master's web UI is accessible on port 8080. We use this port to check the status of the job. The default service port of the Master is 7077. We'll be using this port to submit our assembly JAR as a job to the Spark cluster.

Let's confirm the running of the Master and the Worker nodes using jps:

Pushing data into HDFS

This just involves running the mkdir and put commands on HDFS:

    bash-3.2$ hadoop fs -mkdir /scalada
    bash-3.2$ hadoop fs -put /Users/Gabriel/Apps/SMSSpamCollection /scalada/
    bash-3.2$ hadoop fs -ls /scalada
    Found 1 items
    -rw-r--r--   1 Gabriel supergroup     477907 2015-07-18 16:59 /scalada/SMSSpamCollection

We can also confirm this via the HDFS web interface on port 50070 by going to Utilities | Browse the file system, as shown here:

Submitting the Spark application on the cluster

Before we submit the Spark application to be run against the local cluster, let's change the classification program (BinaryClassificationSpam) to point to the HDFS location:

    val docs = sc.textFile("hdfs://localhost:9000/scalada/SMSSpamCollection").map(line => {
      val words = line.split("\t")
      Document(words.head.trim(), words.tail.mkString(" "))
    })

By default, Spark 1.4.1 uses Hadoop 2.2.0. Now that we are trying to run the job on Hadoop 2.6.0, and are using the Spark binary prebuilt for Hadoop 2.6 and later, let's change build.sbt to reflect that:

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
      "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
      "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
      "com.databricks" %% "spark-csv" % "1.0.3",
      "org.apache.hadoop" % "hadoop-client" % "2.6.0",
      ("org.scalanlp" % "epic-parser-en-span_2.10" % "2015.2.19").exclude("xml-apis", "xml-apis")
    )

Run sbt clean assembly to build the Uber JAR, and then submit it to the cluster like this:

    ./bin/spark-submit \
      --class com.packt.scalada.learning.BinaryClassificationSpam \
      --master spark://localhost:7077 \
      --executor-memory 2G \
      --total-executor-cores 2 \
      /target/scala-2.10/scalada-learning-assembly.jar

Here is the output. The following screenshot shows that we have successfully run our classification job on a Spark cluster, as against the standalone app that we used in the previous chapter:

Running the Spark Standalone cluster on EC2

The easiest way to create a Spark cluster and run our Spark jobs in a truly distributed mode is to use Amazon EC2 instances. The ec2 folder inside the Spark installation directory wraps all the scripts and libraries that we need to create a cluster. Let's quickly go through the steps involved in creating our first distributed cluster.

This recipe assumes that you have a basic understanding of the Amazon EC2 ecosystem, specifically how to spawn a new EC2 instance.

How to do it...

We'll have to ensure that we have the access key and the Privacy Enhanced Mail (PEM) files for AWS before proceeding with the steps. In fact, we are required to have these before launching any EC2 instance if we intend to log in to the machines.

Creating the AccessKey and pem file

Instructions for creating a key pair and the pem key are available at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html. Anyway, the following are the relevant screenshots.

Select Security Credentials from the user menu, like this:

Click on the Users menu and create an access key, as shown in the following screenshot. Download the credentials. We'll be using them to create the EC2 instances for the Spark master and the worker nodes:

The key pair can be created from inside the EC2 instances page using the Key Pairs menu, as shown in the next screenshot. Your browser will automatically download the pem file once you create a pair:

Once you have the pem file, ensure that its file permission is 400. Otherwise, an error message stating that your pem file's permissions are too open will be shown:

    chmod 400 spark.pem

Launching and running our Spark application involves the following steps:

1. Setting the environment variables.
2. Running the launch script.
3. Verifying installation.
4. Making changes to the code.
5. Transferring the data and job files.
6. Loading the dataset into HDFS.
7. Running the job.
8. Destroying the cluster.

Setting the environment variables

As the first step, let's export the access and the secret access keys as environment variables.
The ec2 script for launching our instances will read these variables:

    export AWS_ACCESS_KEY_ID=[YOUR ACCESS KEY ID]
    export AWS_SECRET_ACCESS_KEY=[YOUR SECRET ACCESS KEY]

I have also copied the pem file to the Spark installation root directory, just to make the launch command shorter (by not specifying the entire path of the pem file), as marked here:

Running the launch script

Now that we have the access key (and the secret key) exported and the pem file in the root folder, let's spawn a new cluster:

    cd spark-1.4.1-bin-hadoop2.6
    ./ec2/spark-ec2 --key-pair=scalada --identity-file=scalada.pem --slaves=2 --instance-type=m3.medium --hadoop-major-version=2 launch scalada-cluster

The parameters represent the following:

• key-pair: This is the name of the EC2 key pair (the one created from the Key Pairs menu) that is installed on the instances.
• identity-file: This is the location of the pem file, which holds the private key of that key pair.
• slaves: This is the number of worker nodes.
• instance-type: This is one of the AWS instance types (http://aws.amazon.com/ec2/instance-types/). M3 medium has one core and 3.75 GB of memory.
• hadoop-major-version: This is the version of Hadoop that we want Spark to be bundled with. The Spark version itself is derived from our local installation (which is 1.4.1).

We can also confirm this from the EC2 console, as shown in the following screenshot:

Verifying installation

Let's log in to the Master to see the services that are running on each node:

    ssh -i scalada.pem root@ec2-54-161-176-58.compute-1.amazonaws.com

Doing a jps on the master node shows that the Spark Master, the HDFS name node, and the Secondary name node are running on the Spark master node, as depicted in this screenshot:

Similarly, on the worker nodes, we see that the Spark Worker and the HDFS data nodes are running, as follows:

Making changes to the code

There is one small change required in our code in order to make it run on this cluster: the location of the dataset in HDFS. Hard-coding the URL, however, is not the recommended way of doing it; the URL should be sourced from an external configuration file:

    val conf = new SparkConf().setAppName("BinaryClassificationSpamEc2")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val docs = sc.textFile("hdfs://ec2-54-159-166-156.compute-1.amazonaws.com:9000/scalada/SMSSpamCollection").map(line => {
      val words = line.split("\t")
      Document(words.head.trim(), words.tail.mkString(" "))
    })

Transferring the data and job files

As the next step, from the directory where you have the pem file, let's copy the dataset and the assembly JAR to the master node for execution:

    scp -i scalada.pem /chapter5-learning/SMSSpamCollection root@ec2-54-161-176-58.compute-1.amazonaws.com:~/.
    scp -i scalada.pem /chapter6-scalingup/target/scala-2.10/scalada-learning-assembly.jar root@ec2-54-161-176-58.compute-1.amazonaws.com:~/.

An ls on the home folder of the master confirms this, as shown in the following screenshot:

Loading the dataset into HDFS

Now that we have uploaded our dataset to the master's local folder, let's push it to HDFS. As we saw earlier when we verified the installation, the Spark EC2 script creates and runs an HDFS cluster for us.
Let's go to the ephemeral-hdfs folder in the root and format the filesystem. Note that the files in this HDFS, as the name indicates, will be wiped off upon restarting the cluster. Ideally, we should be installing a separate HDFS cluster on these nodes instead of depending on the ephemeral installation that was created by the Spark EC2 script.

Just as in our previous recipe, let's push the SMSSpamCollection dataset into the /scalada folder in HDFS:

    [root@ip-10-150-76-158 ephemeral-hdfs]$ ./bin/hdfs namenode -format
    [root@ip-10-150-76-158 ephemeral-hdfs]$ ./bin/hadoop fs -mkdir /scalada
    [root@ip-10-150-76-158 ephemeral-hdfs]$ ./bin/hadoop fs -put ../SMSSpamCollection /scalada/
    [root@ip-10-150-76-158 ephemeral-hdfs]$ ./bin/hadoop fs -ls /scalada
    Found 1 items
    -rw-r--r--   3 root supergroup     477907 2015-08-08 05:24 /scalada/SMSSpamCollection

Running the job

As with the previous recipe, we'll use the spark-submit script to submit the job to the cluster. Let's enter the Spark home directory (/root/spark) and execute the following lines:

    ./bin/spark-submit \
      --class com.packt.scalada.learning.BinaryClassificationSpamEc2 \
      --master spark://ec2-54-161-176-58.compute-1.amazonaws.com:7077 \
      --executor-memory 2G \
      --total-executor-cores 2 \
      ../scalada-learning-assembly.jar

We can see that the job runs on both worker nodes of the cluster, as shown in this screenshot:

We can also see the various stages of this job from the Stages tab, as shown in the following screenshot:

Not surprisingly, the accuracy measure is approximately the same, except that now we can use this cluster to handle much bigger data.

Destroying the cluster

Finally, if you would like to destroy the cluster, you can use the same ec2 script with the destroy action.
From your local Spark installation directory, execute this line:

    ./ec2/spark-ec2 destroy scalada-cluster

Running the Spark Job on Mesos (local)

Unlike the Spark standalone cluster manager, which can run only Spark apps, Mesos is a cluster manager that can run a wide variety of applications, including Python, Ruby, or Java EE applications. It can also run Spark jobs. In fact, it is one of the popular go-to cluster managers for Spark. In this recipe, we'll see how to deploy our Spark application on the Mesos cluster. The prerequisite for this recipe is a running HDFS cluster.

How to do it...

Running a Spark job on Mesos is very similar to running it against the standalone cluster. It involves the following steps:

1. Installing Mesos.
2. Starting the Mesos master and slave.
3. Uploading the Spark binary package and the dataset to HDFS.
4. Running the job.

Installing Mesos

Download Mesos on the local machine by following the instructions at http://mesos.apache.org/gettingstarted/.

After you have installed the OS-specific tools needed to build Mesos, you have to run the configure and make commands (with root privileges) to build Mesos. This will take a long time unless you pass -j <number of cores> V=0 to your make command, as shown here:

As a side note, just like Spark, the ec2 folder inside the Mesos installation directory provides scripts to spawn a new EC2 Mesos cluster.

Starting the Mesos master and slave

Now that we have Mesos installed, the next step is to start the Mesos master and slave:

    bash-3.2$ pwd
    /Users/Gabriel/Apps/mesos-0.22.1/build
    bash-3.2$ sudo ./bin/mesos-master.sh --ip=127.0.0.1 --work_dir=/var/lib/mesos

In another terminal window, let's bring up a worker node:

    Gabriel@Gabriels-MacBook-Pro ~/A/m/build> pwd
    /Users/Gabriel/Apps/mesos-0.22.1/build
    Gabriel@Gabriels-MacBook-Pro ~/A/m/build> ./bin/mesos-slave.sh --master=127.0.0.1:5050

We can now look at the Mesos status page at http://127.0.0.1:5050, and this is what we will see:

Uploading the Spark binary package and the dataset to HDFS

Mesos requires that all worker nodes have Spark installed on the machines. We can achieve this either by configuring the spark.mesos.executor.home property in the Spark configuration, or by simply uploading the entire Spark tar bundle to HDFS and making it available to the Mesos workers:

    ./bin/hadoop fs -mkdir /scalada
    ./bin/hadoop fs -put /Users/Gabriel/Apps/spark-1.4.1-bin-hadoop2.6.tgz /scalada/spark-1.4.1-bin-hadoop2.6.tgz

Let's set the Spark binary as the executor URI:

    export SPARK_EXECUTOR_URI=hdfs://localhost:9000/scalada/spark-1.4.1-bin-hadoop2.6.tgz

Also, let's upload the dataset to HDFS:

    ./bin/hadoop fs -mkdir /scalada
    ./bin/hadoop fs -put /Users/Gabriel/Apps/SMSSpamCollection /scalada/

Running the job

There is one thing that we need to do before running the program itself: configure the location of the libmesos native library.
This file can be found in the /usr/local/lib folder as libmesos.so or libmesos.dylib, depending on your operating system:

    export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos-0.22.1.dylib

Now, let's use cd to enter the Spark installation directory, and then run the job:

    cd /Users/Gabriel/Apps/spark-1.4.1-bin-hadoop2.6
    export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos-0.22.1.dylib
    ./bin/spark-submit \
      --class com.packt.scalada.learning.BinaryClassificationSpamMesos \
      --master mesos://localhost:5050 \
      --executor-memory 2G \
      --total-executor-cores 2 \
      /chapter6-scalingup/target/scala-2.10/scalada-learning-assembly.jar

As you can see in the following screenshot, the tasks run fine on this single-worker-node cluster:

The next screenshot shows the list of tasks that are already completed:

Running the Spark Job on YARN (local)

Hadoop has a long history, and in most cases, organizations have already invested in the Hadoop infrastructure before they move their MR jobs to Spark. Unlike the Spark standalone cluster manager, which can run only Spark jobs, and Mesos, which can run a variety of applications, YARN treats Hadoop jobs as first-class citizens. At the same time, it can run Spark jobs as well. This means that when a team decides to replace some of their MR jobs with Spark jobs, they can use the same cluster manager to run them. In this recipe, we'll see how to deploy our Spark application on the YARN cluster manager.

How to do it...

Running a Spark job on YARN is very similar to running it against a Spark standalone cluster. It involves the following steps:

1. Installing the Hadoop cluster.
2. Starting HDFS and YARN.
3. Pushing the Spark assembly and dataset to HDFS.
4. Running the Spark Job in the yarn-client mode.
5. Running the Spark Job in the yarn-cluster mode.
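
From the point of view of spark-submit, the cluster managers covered in this chapter differ mainly in the value passed to --master. The following is a minimal sketch that collects the forms used in these recipes (as of Spark 1.4.1) in one place. The names ClusterManager, Standalone, Mesos, YarnClient, YarnCluster, and MasterUrls are hypothetical and exist only for this illustration; they are not part of Spark:

```scala
// Illustrative only: these types are NOT part of Spark. They just name the
// --master values that this chapter passes to spark-submit (Spark 1.4.1).
sealed trait ClusterManager
case class Standalone(host: String, port: Int = 7077) extends ClusterManager
case class Mesos(host: String, port: Int = 5050) extends ClusterManager
case object YarnClient extends ClusterManager
case object YarnCluster extends ClusterManager

object MasterUrls {
  // Builds the string handed to spark-submit's --master option.
  def masterUrl(cm: ClusterManager): String = cm match {
    case Standalone(host, port) => s"spark://$host:$port"
    case Mesos(host, port)      => s"mesos://$host:$port"
    // The YARN forms carry no host or port in these recipes.
    case YarnClient             => "yarn-client"
    case YarnCluster            => "yarn-cluster"
  }
}
```

For example, MasterUrls.masterUrl(Standalone("localhost")) yields the spark://localhost:7077 value used for the local standalone cluster, while the two YARN values carry no host at all.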

Installing the Hadoop cluster

While the setup of the cluster itself is beyond the scope of this recipe, for the sake of completeness, let's quickly look at the relevant site XML configurations that were made while setting up a single-node pseudo-distributed cluster on a local machine. Refer to http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php for the complete details on how to set up a local YARN/HDFS cluster.

The core-site.xml file:

    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:54310</value>
    </property>

The mapred-site.xml file:

    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
    </property>

The hdfs-site.xml file:

    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>

Starting HDFS and YARN

Once the setup of the cluster is done, let's format HDFS and start the cluster (dfs and yarn).

Format namenode:

    hdfs namenode -format

Start both HDFS and YARN:

    sbin/start-all.sh

Let's confirm that the services are running through jps, and this is what we should see:

Pushing Spark assembly and dataset to HDFS

Ideally, when we do a spark-submit, YARN should be able to pick our spark-assembly JAR (or Uber JAR) and upload it to HDFS. However, this doesn't happen correctly and results in the following error:

    Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

In order to work around this issue, let's upload our spark-assembly JAR manually to HDFS and change our conf/spark-env.sh to reflect the location. The Hadoop config directory should also be specified in spark-env.sh.

Uploading the Spark assembly to HDFS:

    hadoop fs -mkdir /sparkbinary
    hadoop fs -put /Users/Gabriel/Apps/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar /sparkbinary/
    hadoop fs -ls /sparkbinary

Uploading the Spam dataset to HDFS:

    hadoop fs -mkdir /scalada
    hadoop fs -put ~/SMSSpamCollection /scalada/
    hadoop fs -ls /scalada

Entries in spark-env.sh:

    HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
    SPARK_EXECUTOR_URI=hdfs://localhost:9000/sparkbinary/spark-assembly-1.4.1-hadoop2.6.0.jar

Before we submit our Spark job to the YARN cluster, let's confirm that our setup is fine using the Spark shell. The Spark shell is a wrapper around the Scala REPL, with the Spark libraries set in the classpath. Configuring HADOOP_CONF_DIR to point to the Hadoop config directory ensures that Spark will now use YARN to run its jobs. There are two modes in which we can run a Spark job on YARN, namely yarn-client and yarn-cluster, and we'll explore both of them. But before we do that, to validate our configuration, we'll launch the Spark shell with the master set to yarn-client. After a rain of logs, we should be able to see a Scala prompt. This confirms that our configuration is good:

    bin/spark-shell --master yarn-client

Running a Spark job in yarn-client mode

Now that we have confirmed that the shell loads up fine against the YARN master, let's head over to deploying our Spark job on YARN.

In the yarn-client mode, the driver program resides on the client side and the YARN worker nodes are used only to execute the job. All of the brains of the application reside in the client JVM, which polls the application master for the status. The application master does nothing except watch for failures of the executor nodes and request resources from the resource manager accordingly.
This also means that the client (our driver JVM) needs to run as long as the application executes:

    ./bin/spark-submit \
      --class com.packt.scalada.learning.BinaryClassificationSpamYarn \
      --master yarn-client \
      --executor-memory 1G \
      ~/scalada-learning-assembly.jar

As we see from the YARN console, our job is running fine. Here is a screenshot that shows this:

Finally, we can see the output on the client JVM (the driver) itself:

Running Spark job in yarn-cluster mode

In the yarn-cluster mode, the client JVM does very little: it just submits the application and polls the Application master for status. The driver program itself runs on the Application master, which now has all the brains of the program. Unlike the yarn-client mode, the user logs won't be displayed on the client JVM because the driver, which consolidates the results, is executing inside the YARN cluster:

    ./bin/spark-submit \
      --class com.packt.scalada.learning.BinaryClassificationSpamYarn \
      --master yarn-cluster \
      --executor-memory 1g \
      ~/scalada-learning-assembly.jar

As expected, the client JVM indicates that the job has run successfully. It doesn't, however, show the user logs. The following screenshot shows the final status of our client and the cluster mode runs:

The actual output of this program is inside the Hadoop user logs. We can either go to the logs directory of Hadoop, or check it out from the Hadoop console itself by clicking on the application link and then on the logs link in the console. As you can see in the following screenshot, the stdout file shows our embarrassing println commands:

In this chapter, we took an example Spark application and deployed it on a Spark standalone cluster manager, YARN, and Mesos. Along the way, we touched upon the internals of these cluster managers.
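
Each deployment in this chapter hard-codes the HDFS URL of the dataset in the program, and the EC2 recipe notes that the URL should rather come from external configuration. A minimal sketch of one way to do that follows, using only the JDK's java.util.Properties. The property key scalada.dataset.url, the environment variable SCALADA_DATASET_URL, and the DatasetLocation object are assumptions made up for this illustration; the fallback is the local URL used in the standalone recipe:

```scala
import java.io.FileInputStream
import java.util.Properties

object DatasetLocation {
  // Hypothetical names for this sketch: neither the key nor the
  // environment variable is defined by Spark or by the recipes.
  val PropertyKey = "scalada.dataset.url"
  val EnvKey = "SCALADA_DATASET_URL"

  // Fallback: the local HDFS URL used in the standalone recipe.
  val DefaultUrl = "hdfs://localhost:9000/scalada/SMSSpamCollection"

  // Precedence: properties file entry, then environment variable, then default.
  def resolve(props: Properties, env: Map[String, String]): String =
    Option(props.getProperty(PropertyKey))
      .orElse(env.get(EnvKey))
      .getOrElse(DefaultUrl)

  // Loads a standard Java .properties file from the given path.
  def loadProperties(path: String): Properties = {
    val props = new Properties()
    val in = new FileInputStream(path)
    try props.load(in) finally in.close()
    props
  }
}
```

With this in place, the driver could call sc.textFile(DatasetLocation.resolve(props, sys.env)) instead of embedding a host name, so the same assembly JAR could be submitted to the local, EC2, Mesos, and YARN setups without recompiling.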

7 Going Further

In this chapter, we will cover the following recipes:

• Using Spark Streaming to subscribe to a Twitter stream
• Using Spark as an ETL tool (pulling data from ElasticSearch and publishing it to Kafka)
• Using StreamingLogisticRegression to classify a Twitter stream using Kafka as a training stream
• Using GraphX to analyze Twitter data
• Watching other Scala libraries of interest

Introduction

So far, the entire book has concentrated a little on Breeze and a lot on Spark, specifically DataFrames and machine learning. However, there are a whole lot of other libraries, both in Java and Scala, that could be leveraged while analyzing data from Scala. This chapter goes a little further into Spark's other components, streaming and GraphX. Note that each recipe in this chapter feeds into the next recipe. All the code related to this chapter can be downloaded from https://github.com/arunma/ScalaDataAnalysisCookbook/tree/master/chapter7-goingfurther.

Using Spark Streaming to subscribe to a Twitter stream

Just like all the other components of Spark, Spark Streaming is scalable and fault-tolerant; the difference is that it handles a continuous stream of data rather than the large, static datasets that Spark generally processes. The way that Spark Streaming approaches streaming is unique in the sense that it accumulates the incoming stream into small batches and then processes them as mini-batches, an approach usually called micro-batching. The component that receives the stream of data and splits it into time-bound windows of batches is called the receiver. Once these batches are received, Spark takes them up, converts them into RDDs, and processes the RDDs in the same way as static datasets. The regular framework components such as the driver and executor stay the same. In terms of Spark Streaming, a DStream, or Discretized stream, is just a continuous stream of RDDs.
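
The micro-batching idea can be sketched without Spark at all. The following is a simplified illustration, not Spark's implementation: it groups timestamped events into consecutive fixed-length intervals, which is conceptually what happens before each interval's data becomes one RDD of the DStream. The Event type and the MicroBatching object are hypothetical names introduced only for this sketch, and intervals that receive no events are simply omitted here:

```scala
object MicroBatching {
  // Hypothetical event type: arrival time in milliseconds plus a payload.
  case class Event(timestampMs: Long, payload: String)

  // Groups events into fixed-length batches, keyed by the batch start time
  // and ordered by time. Intervals without events are left out of the result.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => (e.timestampMs / batchMs) * batchMs) // start of the interval
      .toSeq
      .sortBy(_._1)
      .map { case (start, es) => (start, es.sortBy(_.timestampMs).map(_.payload)) }
}
```

With a batch length of 5000 ms, events arriving at 100 ms and 4900 ms fall into the batch starting at 0, while an event at 5000 ms opens the next batch. In Spark Streaming, the batch length corresponds to the duration passed when the streaming context is constructed.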

Also, just like SQLContext served as an entry point to use SQL in Spark, there's StreamingContext, which serves as an entry point for Spark Streaming. In this recipe, we will subscribe to a Twitter stream and index (store) the tweets into ElasticSearch (https://www.elastic.co/).

How to do it...

The prerequisite to run this recipe is to have a running ElasticSearch instance on your machine.

1. Running ElasticSearch: Running an instance of ElasticSearch is as simple as it gets. Just download the installable from https://www.elastic.co/downloads/elasticsearch and run bin/elasticsearch. This recipe uses version 1.7.1, the latest at the time of writing.

2. Creating a Twitter app: In order to subscribe to tweets, Twitter requires us to create a Twitter app. Let's quickly set up a Twitter app in order to get the consumer key and the secret key. Visit https://apps.twitter.com/ using your login and click Create New App.

We will be using the consumer key, the consumer secret key, the access token, and the access secret in our application.

3. Adding Spark Streaming and the Twitter dependency: There are two dependencies that need to be added here, the spark-streaming and the spark-streaming-twitter libraries:

    "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
    "org.apache.spark" %% "spark-streaming-twitter" % sparkVersion

4. Creating a Twitter stream: Creating a Twitter stream is super easy in Spark. We just need to use TwitterUtils.createStream for this. TwitterUtils wraps around the twitter4j library (http://twitter4j.org/en/index.html) to provide first-class support in Spark. TwitterUtils.createStream expects a few parameters. Let's construct them one by one.

- StreamingContext: StreamingContext can be constructed by passing in SparkContext and the time window of the batch:

    val streamingContext = new StreamingContext(sc, Seconds(5))

- OAuth authorization: The access and the consumer keys that comprise the OAuth credentials need to be passed in order to subscribe to the Twitter stream:

    val builder = new ConfigurationBuilder()
      .setOAuthConsumerKey(consumerKey)
      .setOAuthConsumerSecret(consumerSecret)
      .setOAuthAccessToken(accessToken)
      .setOAuthAccessTokenSecret(accessTokenSecret)
      .setUseSSL(true)
    val twitterAuth = Some(new OAuthAuthorization(builder.build()))

- Filter criteria: You are free to skip this parameter if your intention is to subscribe to (a sample of) the universe of tweets. For this recipe, we'll add some filter criteria:

    val filter = List("fashion", "tech", "startup", "spark")

- StorageLevel: This is where our received objects that come in batches need to be stored. The default is memory with the capability to overflow to disk.

Once these are constructed, let's construct the Twitter stream itself:

    val stream = TwitterUtils.createStream(streamingContext, twitterAuth, filter, StorageLevel.MEMORY_AND_DISK)

5. Saving the stream to ElasticSearch: Writing the tweets to ElasticSearch involves three steps:

1. Adding the ElasticSearch-Spark dependency: Let's add the appropriate version of ElasticSearch Spark to our build.sbt:

    "org.elasticsearch" %% "elasticsearch-spark" % "2.1.0"

2. Configuring the ElasticSearch server location in the Spark configuration: ElasticSearch has a subproject called elasticsearch-spark that makes ElasticSearch a first-class citizen in the Spark world. The org.elasticsearch.spark package exposes some convenient functions that convert a case class to JSON (deriving types) and index it into ElasticSearch. The package also provides implicits that add functions to save an RDD into ElasticSearch and to load data from ElasticSearch as an RDD.
We'll be looking at those functions shortly. The ElasticSearch target node URL can be specified in the Spark configuration. By default, it points to localhost and port 9200. If required, we could customize it:

    //Default is localhost. Point to ES node when required
    val conf = new SparkConf()
      .setAppName("TwitterStreaming")
      .setMaster("local[2]")
      .set(ConfigurationOptions.ES_NODES, "localhost")
      .set(ConfigurationOptions.ES_PORT, "9200")

3. Converting the stream into a case class: If we are not interested in pushing the data to ElasticSearch and are interested only in printing some values in twitter4j.Status, stream.foreachRDD will help us iterate through each RDD[Status]. However, in this recipe, we will be extracting some data from twitter4j.Status and pushing it to ElasticSearch. For this purpose, a case class, SimpleStatus, is created. The reason why we are extracting data out as a case class is that twitter4j.Status has far more information than we want to index:

    case class SimpleStatus(id: String, content: String, date: Date,
      hashTags: Array[String] = Array[String](), urls: Array[String] = Array[String](),
      user: String, userName: String, userFollowerCount: Long)

The twitter4j.Status is converted to SimpleStatus using a convertToSimple function that extracts only the required information:

    def convertToSimple(status: twitter4j.Status): SimpleStatus = {
      val hashTags: Array[String] = status.getHashtagEntities().map(eachHT => eachHT.getText())
      val urlArray = if (status.getURLEntities != null) status.getURLEntities().foldLeft((Array[String]()))((r, c) => (r :+ c.getExpandedURL())) else Array[String]()
      val user = status.getUser()
      val utcDate = new Date(dateTimeZone.convertLocalToUTC(status.getCreatedAt.getTime, false))
      SimpleStatus(id = status.getId.toString, content = status.getText(), date = utcDate,
        hashTags = hashTags, urls = urlArray,
        user = user.getScreenName(), userName = user.getName, userFollowerCount = user.getFollowersCount)
    }

Once we map the twitter4j.Status to SimpleStatus, we have a DStream[SimpleStatus]. We can now iterate over its RDDs using foreachRDD and push every RDD[SimpleStatus] to ElasticSearch's "spark" index. "twstatus" is the index type. In RDBMS terms, an index is like a database schema and the index type is like a table:

    stream.map(convertToSimple).foreachRDD { statusRdd =>
      println(statusRdd)
      statusRdd.saveToEs("spark/twstatus")
    }

We could confirm the indexing by pointing to ElasticSearch's spark index using Sense, a must-have Chrome plugin for ElasticSearch, or simply by performing a curl request:

    curl -XGET "http://localhost:9200/spark/_search" -d'
    {
      "query": {
        "match_all": {}
      }
    }'

The Sense plugin for Chrome can be downloaded from the Chrome store at: https://chrome.google.com/webstore/detail/sense-beta/lhjgkmllcaadmopgmanpapmpjgmfcfig?hl=en.

Using Spark as an ETL tool

In the previous recipe, we subscribed to a Twitter stream and stored it in ElasticSearch. Another common source of streaming is Kafka, a distributed message broker. In fact, it's a distributed log of messages, which in simple terms means that there can be multiple brokers that have the messages partitioned among them.

In this recipe, we'll be reading the data that we ingested into ElasticSearch in the previous recipe and publishing the messages into Kafka. Soon after we publish the data to Kafka, we'll be subscribing to Kafka using the Spark Streaming API. While this recipe demonstrates treating ElasticSearch data as an RDD and publishing to Kafka using a KryoSerializer, its true intent is to prepare for running a streaming classification algorithm against Twitter, which is our next recipe.

How to do it...

Let's look at the various steps involved in doing this.

1. Setting up Kafka: This recipe uses Kafka version 0.8.2.1 for Scala 2.10, which can be downloaded from https://www.apache.org/dyn/closer.cgi?path=/kafka/0.8.2.1/kafka_2.10-0.8.2.1.tgz.

Once downloaded, let's extract it, start the Kafka server, and create a Kafka topic through three commands from inside our Kafka home directory:

1. Starting Zookeeper: Kafka uses Zookeeper (https://zookeeper.apache.org/) to hold coordination information between Kafka servers. It also holds the commit offset information of the data so that if a Kafka node fails, it knows where to resume from. The Zookeeper data directory and the client port (default 2181) are specified in zookeeper.properties. The zookeeper-server-start.sh script expects this file to be passed as a parameter:

    bin/zookeeper-server-start.sh config/zookeeper.properties

2. Starting the Kafka server: Again, in order to start Kafka, the configuration file to be passed to it is server.properties. The server.properties file, among many things, specifies the port on which the Kafka server listens (9092) and the Zookeeper port it needs to connect to (2181). This is passed to the kafka-server-start.sh startup script:

    bin/kafka-server-start.sh config/server.properties

3. Creating a Kafka topic: In really simple terms, a topic can be compared to a JMS topic with the difference that there could be multiple publishers as well as a single subscriber in Kafka. Since we are running Kafka in a non-replicated and non-partitioned mode using just one Kafka server, the topic named twtopic (Twitter topic) is created with a replication factor of 1 and the number of partitions is 1 as well:

    bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic twtopic

2. Pulling data from ElasticSearch: The next step is to pull the data from ElasticSearch and treat it as a Spark DataFrame. Other than the optional setting in the Spark configuration to point to the correct host and port, this is just a one-liner. The configuration change (if needed) is:

    //Default is localhost. Point to ES node when required
    val conf = new SparkConf()
      .setAppName("KafkaStreamProducerFromES")
      .setMaster("local[2]")
      .set(ConfigurationOptions.ES_NODES, "localhost")
      .set(ConfigurationOptions.ES_PORT, "9200")

The following line queries the "spark/twstatus" index (that we published to in the last recipe) for all documents and extracts the data into a DataFrame. Optionally, you can pass in a query as a second argument (for example, "?q=fashion"):

    val twStatusDf = sqlContext.esDF("spark/twstatus")

Let's try to sample the DataFrame using show():

    twStatusDf.show()

The output is:

3. Preparing data to be published to Kafka: Before we do this step, let's go over what we aim to achieve from it. As we discussed at the beginning of the recipe, we will be running a classification algorithm against streaming data in the next recipe. As you know, any supervised learning algorithm requires a training dataset. Instead of manually curating the dataset, we will build it in a very primitive fashion by marking all the tweets that have the word fashion in them as belonging to the fashion class and the rest of the tweets as not belonging to the fashion class.
We will just take the content of the tweet and convert it into a case class called LabeledContent (similar to LabeledPoint in Spark MLlib):

    case class LabeledContent(label: Double, content: Array[String])

LabeledContent only has two fields:

- label: This indicates whether the tweet is about fashion or not (1.0 if the tweet is on fashion and 0.0 if it is not)
- content: This holds a space-tokenized version of the tweet itself

    def convertToLabeledContentRdd(twStatusDf: DataFrame) = {
      //Convert the content alone to a (label, content) pair
      val labeledPointRdd = twStatusDf.map { row =>
        val content = row.getAs[String]("content").toLowerCase()
        val tokens = content.split(" ") //A very primitive space based tokenizer
        val labeledContent = if (content.contains("fashion")) LabeledContent(1, tokens) else LabeledContent(0, tokens)
        println(labeledContent.label, content)
        labeledContent
      }
      labeledPointRdd
    }

4. Publishing data to Kafka using KryoSerializer: Now that we have the publish candidate (LabeledContent) ready, let's publish it to the Kafka topic. This involves just three steps.

- Constructing the connection and transport properties: In properties, we configure the Kafka server location and port, and register the serializer that we use to serialize LabeledContent:

    val properties = Map[String, Object](ProducerConfig.BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092").asJava

- Constructing the Kafka producer using the connection properties and the key and value serializer: The next step is to construct a Kafka producer using the properties we constructed earlier. The producer also needs a key and a value serializer. Since we don't have a key for our message, we fall back to Kafka's default, which fills in a hashcode that we aren't interested in on receipt.

    val producer = new KafkaProducer[String, Array[Byte]](properties, new StringSerializer, new ByteArraySerializer)

- Sending data to the Kafka topic using the send method: We then serialize LabeledContent using KryoSerializer and send it to the Kafka topic "twtopic" (the one that we created earlier) using the producer.send method. The only purpose of using a KryoSerializer here is to speed up the serialization process:

    val serializedPoint = KryoSerializer.serialize(lContent)
    producer.send(new ProducerRecord[String, Array[Byte]]("twtopic", serializedPoint))

For the KryoSerializer, we use Twitter's chill library (https://github.com/twitter/chill), which provides an easier abstraction over the serialization for Scala.

The actual KryoSerializer is just five lines of code:

    object KryoSerializer {
      private val kryoPool = ScalaKryoInstantiator.defaultPool
      def serialize[T](anObject: T): Array[Byte] = kryoPool.toBytesWithClass(anObject)
      def deserialize[T](bytes: Array[Byte]): T = kryoPool.fromBytes(bytes).asInstanceOf[T]
    }

The dependency for Twitter chill that needs to be added to our build.sbt is:

    "com.twitter" %% "chill" % "0.7.0"

The entire publishing method looks like this:

    def publishToKafka(labeledPointRdd: RDD[LabeledContent]): Unit = {
      //One producer per partition, created on the executor that processes it
      labeledPointRdd.foreachPartition { iterator =>
        val properties = Map[String, Object](
          ProducerConfig.BOOTSTRAP_SERVERS_CONFIG -> "localhost:9092",
          "serializer.class" -> "kafka.serializer.DefaultEncoder").asJava
        val producer = new KafkaProducer[String, Array[Byte]](properties, new StringSerializer, new ByteArraySerializer)
        iterator.foreach { lContent =>
          val serializedPoint = KryoSerializer.serialize(lContent)
          producer.send(new ProducerRecord[String, Array[Byte]]("twtopic", serializedPoint))
        }
        //Not in the original listing: flush pending sends and release the connection
        producer.close()
      }
    }

5. Confirming receipt in Kafka: We could confirm whether the data is in Kafka using the JMX MBeans exposed by it. We'll use JConsole UI to explore MBeans.
As you can see, the count of the messages is 24849, which matches the ElasticSearch document count (that was published in the previous recipe).

Using StreamingLogisticRegression to classify a Twitter stream using Kafka as a training stream

In the previous recipe, we published all the tweets that were stored in ElasticSearch to a Kafka topic. In this recipe, we'll subscribe to the Kafka stream and train a classification model out of it. We will later use this trained model to classify a live Twitter stream.

How to do it...

This is a really small recipe that is composed of three steps:

1. Subscribing to a Kafka stream: There are two ways to subscribe to a Kafka stream and we'll be using the DirectStream method, which is faster. Just like Twitter streaming, Spark has first-class support for subscribing to a Kafka stream. This is achieved by adding the spark-streaming-kafka dependency. Let's add it to our build.sbt file:

    "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion

The subscription process is more or less the reverse of the publishing process, even in terms of the properties that we pass to Kafka:

    val topics = Set("twtopic")
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "localhost:9092")

Once the properties are constructed, we subscribe to twtopic using KafkaUtils.createDirectStream:

    val kafkaStream = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](streamingContext, kafkaParams, topics).repartition(2)

With the stream at hand, let's reconstruct LabeledContent out of it. We can do that through KryoSerializer's deserialize function. The type parameter is passed explicitly; otherwise it is inferred as Nothing and the call fails with a ClassCastException at runtime. The body of the map is completed in the next step:

    val trainingStream = kafkaStream.map { case (key, value) =>
      val labeledContent = KryoSerializer.deserialize[LabeledContent](value)

2. Training the classification model: Now that we are receiving the LabeledContent objects from the Kafka stream, let's train our classification model out of them.
We will use StreamingLogisticRegressionWithSGD for this, which, as the name indicates, is a streaming version of the LogisticRegressionWithSGD algorithm we saw in Chapter 5, Learning from Data. In order to train the model, we have to construct a LabeledPoint, which is a pair of a label (represented as a double) and a feature vector. Since this is text, we'll use the HashingTF's transform function to generate the feature vector for us:

    val hashingTf = new HashingTF(5000)

    val kafkaStream = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](streamingContext, kafkaParams, topics).repartition(2)

    val trainingStream = kafkaStream.map { case (key, value) =>
      //Pass the type explicitly so that it is not inferred as Nothing
      val labeledContent = KryoSerializer.deserialize[LabeledContent](value)
      val vector = hashingTf.transform(labeledContent.content)
      LabeledPoint(labeledContent.label, vector)
    }

trainingStream is now a stream of LabeledPoint, which we will be using to train our model:

    val model = new StreamingLogisticRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(5000))
      .setNumIterations(25).setStepSize(0.1).setRegParam(0.001)

    model.trainOn(trainingStream)

Since we specified the maximum number of features in our HashingTF to be 5000, we set the initial weights to be 0 for all 5,000 features. The rest of the parameters are the same as for the regular LogisticRegressionWithSGD algorithm that trains on a static dataset.

3. Classifying a live Twitter stream: Now that we have the model in hand, let's use it to predict whether the incoming stream of tweets is about fashion or not. The Twitter setup in this section is the same as in the first recipe, where we subscribed to a Twitter stream:

    val filter = List("fashion", "tech", "startup", "spark")
    val twitterStream = TwitterUtils.createStream(streamingContext, twitterAuth, filter, StorageLevel.MEMORY_AND_DISK)

The crucial part is the invocation of model.predictOnValues, which gives us the predicted label.
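The feature vectors used for both training and prediction come from the hashing trick: each token is mapped to one of a fixed number of buckets and the bucket counts become the features. The following is a minimal plain-Scala sketch of that idea, not Spark's actual HashingTF implementation; the object name HashingTrickSketch, the helper nonNegativeMod, and the sparse Map representation are assumptions made for illustration only.

```scala
object HashingTrickSketch {
  // Maps any hash code (possibly negative) to a bucket index in [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Hypothetical stand-in for a term-frequency transform:
  // returns a sparse vector as a map from bucket index to token count.
  // Two different tokens may collide in the same bucket; their counts then add up.
  def transform(tokens: Seq[String], numFeatures: Int): Map[Int, Double] =
    tokens
      .groupBy(token => nonNegativeMod(token.hashCode, numFeatures))
      .map { case (index, group) => index -> group.size.toDouble }

  // Same primitive space-based tokenizer as in the recipe.
  val demoTokens: Seq[String] = "spark loves fashion and fashion loves spark".split(" ").toSeq
  val demoFeatures: Map[Int, Double] = transform(demoTokens, 5000)
}
```

Whatever the collisions, the counts always sum to the number of tokens and every index stays below the number of features, which is why the initial weight vector can be sized up front.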
Once the predictions are made, we save them as text files in a local directory. This is not the best way to do it; we would probably want to push this data to some appendable data source instead.

    val contentAndFeatureVector = twitterStream.map { status =>
      val tokens = status.getText().toLowerCase().split(" ")
      val vector = hashingTf.transform(tokens)
      (status.getText(), vector)
    }

    val contentAndPrediction = model.predictOnValues(contentAndFeatureVector)

    //Not the best way to store the results. Creates a whole lot of files
    contentAndPrediction.saveAsTextFiles("predictions", "txt")

In order to consolidate the predictions that are spread over multiple files, a really simple aggregation command was used:

    find predictions* -name "part*" | xargs cat >> output.txt

The results are reasonable considering that the training dataset itself was not labeled in a very scientific way. Also, the tokenization is just space-based, the data is not scaled, and IDF was not used.

Using GraphX to analyze Twitter data

GraphX is Spark's approach to graphs and computation against graphs. In this recipe, we will see a preview of what is possible with the GraphX component in Spark.

How to do it...

Now that we have the Twitter data stored in the ElasticSearch index, we will perform the following tasks on this data using a graph:

1. Convert the ElasticSearch data into a Spark Graph.
2. Sample vertices, edges, and triplets in the graph.
3. Find the top group of connected hashtags (connected component).
4. List all the hashtags in that component.

1. Converting the ElasticSearch data into a graph: This involves two steps:

   1. Converting ElasticSearch data into a DataFrame: This step, like we saw in an earlier recipe, is just a one-liner:

    def convertElasticSearchDataToDataFrame(sqlContext: SQLContext) = {
      val twStatusDf = sqlContext.esDF("spark/twstatus")
      twStatusDf
    }

   2. Converting DataFrame to a graph: Spark Graph construction requires an RDD of vertices and an RDD of edges. Let's construct them one by one. The vertex RDD is an RDD of tuples, each holding a vertex ID and a vertex property. In our case, we'll just use a primitive hash code of the hashtag as the vertex ID and the hashtag itself as the property:

    val verticesRdd: RDD[(Long, String)] = df.flatMap { tweet =>
      val hashTags = tweet.getAs[Buffer[String]]("hashTags")
      hashTags.map { tag =>
        val lowercaseTag = tag.toLowerCase()
        val tagHashCode = lowercaseTag.hashCode().toLong
        (tagHashCode, lowercaseTag)
      }
    }

For the edges, we construct an RDD[Edge], which wraps a pair of vertex IDs and a property. In our case, we use the first URL (if present) as the property of the edge (we aren't using it in this recipe, so an empty string would also be fine). Since a tweet can have multiple hashtags, we use the combinations function to choose pairs and then connect each pair with an edge:

    val edgesRdd: RDD[Edge[String]] = df.flatMap { row =>
      val hashTags = row.getAs[Buffer[String]]("hashTags")
      val urls = row.getAs[Buffer[String]]("urls")
      val topUrl = if (urls.length > 0) urls(0) else ""
      val combinations = hashTags.combinations(2)
      combinations.map { combs =>
        val firstHash = combs(0).toLowerCase().hashCode.toLong
        val secondHash = combs(1).toLowerCase().hashCode.toLong
        Edge(firstHash, secondHash, topUrl)
      }
    }

Finally, we construct the graph using both RDDs:

    val graph = Graph(verticesRdd, edgesRdd)

2. Sampling vertices, edges, and triplets in the graph: Now that we have our graph constructed, let's sample and see what the vertices, edges, and triplets of the graph look like. A triplet is a representation of an edge and the two vertices connected by that edge:

    graph.vertices.take(20).foreach(println)

    graph.edges.take(20).foreach(println)

    graph.triplets.take(20).foreach(println)

3. Finding the top group of connected hashtags (connected component): As you know, a graph is made of vertices and edges. A connected component of a graph is a subgraph in which every vertex can be reached from every other vertex through one or more edges. If two vertices are not connected, either directly or indirectly through other vertices, they do not belong to the same connected component. GraphX's graph.connectedComponents provides a graph of all the vertices along with their component IDs:

    val connectedComponents = graph.connectedComponents.cache()

Let's take the component ID with the maximum number of vertices and then extract the vertices (and eventually the hashtags) that belong to that component:

    import scala.collection._ //countByValue returns a scala.collection.Map

    val ccCounts: Map[VertexId, Long] = connectedComponents.vertices.map { case (_, vertexId) => vertexId }.countByValue

    //Get the top component Id and count
    val topComponent: (VertexId, Long) = ccCounts.toSeq.sortBy { case (componentId, count) => count }.reverse.head

Since topComponent just has the component ID, in order to fetch the hashtags of the top component, we need a representation that maps each hashtag to a component ID. This is achieved by joining the graph's vertices to the connectedComponents vertices:

    //RDD of HashTag-Component Id pair. Joins using vertexId
    val hashtagComponentRdd: VertexRDD[(String, VertexId)] = graph.vertices.innerJoin(connectedComponents.vertices) {
      case (vertexId, hashTag, componentId) => (hashTag, componentId)
    }

Now that we have componentId and hashTag, let's keep only the hashtags of the top component ID:

    val topComponentHashTags = hashtagComponentRdd
      .filter { case (vertexId, (hashTag, componentId)) => (componentId == topComponent._1) }
      .map { case (vertexId, (hashTag, componentId)) => hashTag }
    topComponentHashTags

The entire method looks like this:

    def getHashTagsOfTopConnectedComponent(graph: Graph[String, String]): RDD[String] = {
      //Get all the connected components
      val connectedComponents = graph.connectedComponents.cache()

      import scala.collection._
      val ccCounts: Map[VertexId, Long] = connectedComponents.vertices.map { case (_, vertexId) => vertexId }.countByValue

      //Get the top component Id and count
      val topComponent: (VertexId, Long) = ccCounts.toSeq.sortBy { case (componentId, count) => count }.reverse.head

      //RDD of HashTag-Component Id pair. Joins using vertexId
      val hashtagComponentRdd: VertexRDD[(String, VertexId)] = graph.vertices.innerJoin(connectedComponents.vertices) {
        case (vertexId, hashTag, componentId) => (hashTag, componentId)
      }

      //Filter the vertices that belong to the top component alone
      val topComponentHashTags = hashtagComponentRdd
        .filter { case (vertexId, (hashTag, componentId)) => (componentId == topComponent._1) }
        .map { case (vertexId, (hashTag, componentId)) => hashTag }
      topComponentHashTags
    }

4. Listing all the hashtags in that component: Saving the hashtags to a file is as simple as calling saveAsTextFile. The repartition(1) is done just so that we have a single output file. Alternatively, you could use collect() to bring all the data to the driver and inspect it:

    def saveTopTags(topTags: RDD[String]): Unit = {
      topTags.repartition(1).saveAsTextFile("topTags.txt")
    }

The number of hashtags in the top connected component for our run was 7,320. This shows that in our sample stream there are about 7,320 interrelated tags that are related to fashion.
They could be synonyms, closely related, or remotely related to fashion.

In this chapter, we briefly touched upon Spark Streaming, streaming ML, and GraphX. Please note that this is by no means an exhaustive list of recipes for these topics; it aims to just provide a taste of what streaming and GraphX in Spark can do.

Module 3
Scala for Machine Learning
Leverage Scala and Machine Learning to construct and study systems that can learn from data

Getting Started

It is critical for any computer scientist to understand the different classes of machine learning algorithms and be able to select the ones that are relevant to the domain of their expertise and dataset. However, the application of these algorithms represents a small fraction of the overall effort needed to extract an accurate, well-performing model from input data. A common data mining workflow consists of the following sequential steps:

1. Loading the data.
2. Preprocessing, analyzing, and filtering the input data.
3. Discovering patterns, affinities, clusters, and classes.
4. Selecting the model features and the appropriate machine learning algorithm(s).
5. Refining and validating the model.
6. Improving the computational performance of the implementation.

As we will emphasize throughout this book, each stage of the process is critical to building the right model. This first chapter introduces you to the taxonomy of machine learning algorithms, the tools and frameworks used in the book, and a simple application of logistic regression to get your feet wet.

Mathematical notation for the curious

Each chapter contains a small section dedicated to the formulation of the algorithms for those interested in the mathematical concepts behind the science and art of machine learning. These sections are optional and defined within a tip box.
For example, the mathematical expression of the mean and the variance of a variable X mentioned in a tip box will be as follows:

Mean value of a variable X = {x}, with n observations x_1, ..., x_n, is defined as:

    $$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$

The variance of a variable X = {x} is defined as:

    $$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$

Why machine learning?

The explosion in the number of digital devices generates an ever-increasing amount of data. The best analogy I can find to describe the need, desire, and urgency to extract knowledge from large datasets is the process of extracting a precious metal from a mine, and in some cases, extracting blood from a stone.

Knowledge is quite often defined as a model that can be constantly updated or tweaked as new data comes into play. Models are domain-specific, ranging from credit risk assessment, face recognition, maximization of quality of service, classification of pathological symptoms of disease, optimization of computer networks, and security intrusion detection, to customers' online behavior and purchase history.

Machine learning problems are categorized as classification, prediction, optimization, and regression.

Classification

The purpose of classification is to extract knowledge from historical data. For instance, a classifier can be built to identify a disease from a set of symptoms. The scientist collects information regarding the body temperature (continuous variable), congestion (discrete variable with values HIGH, MEDIUM, and LOW), and the actual diagnosis (flu). This dataset is used to create a model such as IF temperature > 102 AND congestion = HIGH THEN patient has the flu (probability 0.72), which doctors can use in their diagnosis.

Prediction

Once the model is extracted and validated against the past data, it can be used to draw inference from future data. A doctor collects symptoms from a patient, such as body temperature and nasal congestion, and anticipates the state of his/her health.
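To make the two steps concrete, the flu rule quoted in the classification example can be written down and then applied to a new patient. This is only an illustrative sketch: the names FluRuleSketch, Symptoms, Congestion, and fluProbability are hypothetical and do not come from the book's source code, and the rule is hardcoded rather than learned from data.

```scala
object FluRuleSketch {
  // Discrete variable with the three levels mentioned in the text.
  sealed trait Congestion
  case object Low extends Congestion
  case object Medium extends Congestion
  case object High extends Congestion

  // One observation: a continuous variable and a discrete variable.
  case class Symptoms(temperature: Double, congestion: Congestion)

  // The extracted model:
  // IF temperature > 102 AND congestion = HIGH THEN flu (probability 0.72).
  // None means that the rule does not apply to this patient.
  def fluProbability(symptoms: Symptoms): Option[Double] =
    if (symptoms.temperature > 102.0 && symptoms.congestion == High) Some(0.72)
    else None
}
```

Prediction then amounts to calling FluRuleSketch.fluProbability on the symptoms of a patient who was not part of the historical dataset.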
Optimization

Some global optimization problems are intractable using traditional linear and non-linear optimization methods. Machine learning techniques improve the chances that the optimization method converges toward a solution (intelligent search). You can imagine that fighting the spread of a new virus requires optimizing a process that may evolve over time as more symptoms and cases are uncovered.

Regression

Regression is a technique that is particularly suitable for a continuous model. Linear (least squares), polynomial, and logistic regressions are among the most commonly used techniques to fit a parametric model, or function, y = f(x_j), to a dataset. Regression is sometimes regarded as a specialized case of classification for which the output variables are continuous instead of categorical.

Why Scala?

Like most functional languages, Scala provides developers and scientists with a toolbox to implement iterative computations that can be easily woven dynamically into a coherent dataflow. To some extent, Scala can be regarded as an extension of the popular MapReduce model for distributed computation of large amounts of data. Among the capabilities of the language, the following features are deemed essential to machine learning and statistical analysis.

Abstraction

Monoids and monads are important concepts in functional programming. Monads are derived from category and group theory and allow developers to create high-level abstractions, as illustrated in Twitter's Algebird (https://github.com/twitter/algebird) and the ScalaNLP Breeze (https://github.com/dlwh/breeze) libraries.

A monoid defines a binary operation op on a dataset T with the properties of closure, identity, and associativity.

Let's consider the + operation defined for a set T using the following monoidal representation:

    trait Monoid[T] {
      def zero: T
      def op(a: T, b: T): T
    }

Monoids are associative operations.
For instance, if ts1, ts2, and ts3 are three time series, then the property ts1 + (ts2 + ts3) = (ts1 + ts2) + ts3 is true. The associativity of a monoid operator is critical with regard to the parallelization of computational workflows.

Monads are structures that can be seen either as containers by programmers or as a generalization of monoids. The collections bundled with the Scala standard library (list, map, and so on) are constructed as monads [1:1]. Monads provide the ability for those collections to perform the following functions:

1. Create the collection.
2. Transform the elements of the collection.
3. Flatten nested collections.

A common categorical representation of a monad in Scala is a trait, Monad, parameterized with a container type M:

    trait Monad[M[_]] {
      def apply[T](a: T): M[T]
      def flatMap[T, U](m: M[T])(f: T => M[U]): M[U]
    }

Monads allow those collections or containers to be chained to generate a workflow. This property is applicable to any scientific computation [1:2].

Scalability

As seen previously, monoids and monads enable parallelization and chaining of data processing functions by leveraging the Scala higher-order methods. In terms of implementation, actors are the core elements that make Scala scalable. Actors act as coroutines, managing the underlying pool of threads, and communicate by passing asynchronous messages. Distributed computing frameworks such as Akka and Spark extend the capabilities of the Scala standard library to support computation on very large datasets. Akka and Spark are described in detail in the last chapter of this book [1:3].

In a nutshell, a workflow is implemented as a sequence of activities or computational tasks. Those tasks consist of higher-order Scala methods such as flatMap, map, fold, reduce, collect, join, or filter applied to a large collection of observations. Scala allows these observations to be partitioned by executing those tasks through a cluster of actors.
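The link between associativity and partitioning can be sketched in a few lines of plain Scala. The example below is an illustration under stated assumptions, not code from the book: the object MonoidSketch, the seriesSum instance, and the combine and combinePartitioned helpers are hypothetical names, and the partitions are processed sequentially here, standing in for work that could be handed to separate workers.

```scala
object MonoidSketch {
  trait Monoid[T] {
    def zero: T
    def op(a: T, b: T): T
  }

  // Element-wise addition of time series of a fixed length.
  // Floating-point addition is only approximately associative in general;
  // the values used in the usage example are chosen so that the sums are exact.
  def seriesSum(length: Int): Monoid[Vector[Double]] = new Monoid[Vector[Double]] {
    val zero: Vector[Double] = Vector.fill(length)(0.0)
    def op(a: Vector[Double], b: Vector[Double]): Vector[Double] =
      a.zip(b).map { case (x, y) => x + y }
  }

  // Combine all the elements in a single pass.
  def combine[T](xs: Seq[T])(m: Monoid[T]): T =
    xs.foldLeft(m.zero)(m.op)

  // Combine each partition independently, then combine the partial results.
  // Associativity and the identity element guarantee the same answer.
  def combinePartitioned[T](xs: Seq[T], partitionSize: Int)(m: Monoid[T]): T =
    xs.grouped(partitionSize)
      .map(partition => partition.foldLeft(m.zero)(m.op))
      .foldLeft(m.zero)(m.op)
}
```

For three series Vector(1.0, 2.0, 3.0), Vector(10.0, 20.0, 30.0), and Vector(100.0, 200.0, 300.0), both helpers return the same element-wise sum regardless of how the series are grouped into partitions.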
Scala also supports message dispatching and routing of messages between local and remote actors. Engineers can decide to execute a workflow either locally or distributed across CPU cores and servers with little or no code change.

Deployment of a workflow as a distributed computation

In this diagram, a controller, that is, the master node, manages the sequence of tasks 1 to 4, similar to a scheduler. These tasks are actually executed over multiple worker nodes that are implemented by the Scala actors. The master node exchanges messages with the workers to manage the state of the execution of the workflow as well as its reliability. High availability of these tasks is implemented through a hierarchy of supervising actors.

Configurability

Scala supports dependency injection using a combination of abstract variables, self-referenced composition, and stackable traits. One of the most commonly used dependency injection patterns, the cake pattern, is used throughout this book to create dynamic computation workflows and plots.

Maintainability

Scala embeds Domain Specific Languages (DSL) natively. DSLs are syntactic layers built on top of Scala native libraries. DSLs allow software developers to abstract computation in terms that are easily understood by scientists. The best-known application of DSLs is the emulation of the syntax used in the MATLAB program, which data scientists are familiar with.

Computation on demand

Lazy methods and values allow developers to execute functions and allocate computing resources on demand. The Spark framework relies on lazy variables and methods to chain Resilient Distributed Datasets (RDD).

Model categorization

A model can be predictive, descriptive, or adaptive.

Predictive models discover patterns in historical data and extract fundamental trends and relationships between factors. They are used to predict and classify future events or observations.
Predictive analytics is used in a variety of fields such as marketing, insurance, and pharmaceuticals. Predictive models are created through supervised learning using a preselected training set.

Descriptive models attempt to find unusual patterns or affinities in data by grouping observations into clusters with similar properties. These models define the first level in knowledge discovery. They are generated through unsupervised learning.

A third category of models, known as adaptive modeling, is generated through reinforcement learning. Reinforcement learning consists of one or several decision-making agents that recommend and possibly execute actions in an attempt to solve a problem, optimize an objective function, or resolve constraints.

Taxonomy of machine learning algorithms

The purpose of machine learning is to teach computers to execute tasks without human intervention. An increasing number of applications such as genomics, social networking, advertising, or risk analysis generate a very large amount of data that can be analyzed or mined to extract knowledge or provide insight into a process, a customer, or an organization. Ultimately, machine learning algorithms consist of identifying and validating models to optimize a performance criterion using historical, present, and future data [1:4]. Data mining is the process of extracting or identifying patterns in a dataset.

Unsupervised learning

The goal of unsupervised learning is to discover patterns of regularities and irregularities in a set of observations. The process, known as density estimation in statistics, is broken down into two categories: discovery of data clusters and discovery of latent factors. The methodology consists of processing input data to understand patterns, similar to the natural learning process in infants or animals.
Unsupervised learning does not require labeled data and is therefore easy to implement and execute, because no expertise is needed to validate an output. However, it is possible to label the output of a clustering algorithm and use it for future classification.

Clustering

The purpose of data clustering is to partition a collection of data into a number of clusters or data segments. Practically, a clustering algorithm is used to organize observations into clusters by minimizing the distance between observations within a cluster and maximizing the distance between clusters. A clustering algorithm consists of the following steps:

1. Creating a model by making an assumption on the input data.
2. Selecting the objective function or goal of the clustering.
3. Evaluating one or more algorithms to optimize the objective function.

Data clustering is also known as data segmentation or data partitioning.

Dimension reduction

Dimension reduction techniques aim at finding the smallest but most relevant set of features that models the dataset reliably. There are many reasons for reducing the number of features or parameters in a model, from avoiding overfitting to reducing computation costs.

There are many ways to classify the different techniques used to extract knowledge from data using unsupervised learning. The following taxonomy breaks down these techniques according to their purpose, although the list is far from being exhaustive, as shown in the following diagram:

Supervised learning

The best analogy for supervised learning is function approximation or curve fitting. In its simplest form, supervised learning attempts to extract a relation or function f: x → y from a training set {x, y}. Supervised learning is typically more accurate and reliable than other learning strategies. However, a domain expert may be required to label (tag) data as a training set for certain types of problems.
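The curve-fitting analogy can be illustrated with the simplest possible case: fitting a straight line y = slope * x + intercept to a training set of (x, y) pairs by ordinary least squares. This is a minimal sketch and not the book's implementation; the object LineFitSketch and its fit and predict functions are hypothetical names chosen for the example.

```scala
object LineFitSketch {
  // Returns (slope, intercept) minimizing the sum of squared residuals
  // of y = slope * x + intercept over the training set.
  def fit(training: Seq[(Double, Double)]): (Double, Double) = {
    require(training.size >= 2, "at least two observations are needed")
    val n = training.size.toDouble
    val meanX = training.map(_._1).sum / n
    val meanY = training.map(_._2).sum / n
    // Covariance of x and y, and variance of x (both up to the same factor n).
    val sxy = training.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = training.map { case (x, _) => (x - meanX) * (x - meanX) }.sum
    require(sxx > 0.0, "x values must not all be identical")
    val slope = sxy / sxx
    (slope, meanY - slope * meanX)
  }

  // Applies the learned function f to a new observation x.
  def predict(model: (Double, Double))(x: Double): Double =
    model._1 * x + model._2
}
```

With the labeled training set (0, 1), (1, 3), (2, 5), (3, 7), the fit recovers the function y = 2x + 1, which can then be applied to values of x that were not in the training set.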
Supervised machine learning algorithms can be broken into two categories:

• Generative models
• Discriminative models

Generative models

In order to simplify the description of statistics formulas, we adopt the following simplification: the probability of an event X is the same as the probability of the discrete random variable X to have a value x, p(X) = p(X=x). The notation of joint probability (resp. conditional probability) becomes p(X, Y) = p(X=x, Y=y) (resp. p(X|Y) = p(X=x | Y=y)).

Generative models attempt to fit a joint probability distribution, p(X,Y), of two events (or random variables), X and Y, representing two sets of observed and hidden (latent) variables x and y. Discriminative models learn the conditional probability p(Y|X) of an event or random variable Y of hidden variables y, given an event or random variable X of observed variables x. Generative models are commonly introduced through the Bayes' rule. The conditional probability of an event Y, given an event X, is computed as the product of the conditional probability of the event X, given the event Y, and the probability of the event Y, normalized by the probability of event X [1:5].

Joint probability (if X and Y are independent):

    $$p(X, Y) = p(X)\,p(Y)$$

Conditional probability:

    $$p(Y|X) = \frac{p(X, Y)}{p(X)}$$

The Bayes' rule:

    $$p(Y|X) = \frac{p(X|Y)\,p(Y)}{p(X)}$$

The Bayes' rule is the foundation of the Naïve Bayes classifier, which is the topic of Chapter 5, Naïve Bayes Classifiers.

Discriminative models

Contrary to generative models, discriminative models compute the conditional probability p(Y|X) directly, using the same algorithm for training and classification.

Generative and discriminative models have their respective advantages and drawbacks. Novice data scientists learn to match the appropriate algorithm to each problem through experimentation. Here is a brief guideline describing which type of model makes sense according to the objective or criteria of the project:

| Objective | Generative models | Discriminative models |
| Accuracy | Highly dependent on the training set. | Probability estimates tend to be more accurate. |
| Modeling requirements | There is a need to model both observed and hidden variables, which requires a significant amount of training. | The quality of the training set does not have to be as rigorous as for generative models. |
| Computation cost | This is usually low. For example, any graphical method derived from the Bayes' rule has low overhead. | Most algorithms rely on the optimization of a convex function, which introduces significant performance overhead. |
| Constraints | These models assume some degree of independence among the model features. | Most discriminative algorithms accommodate dependencies between features. |

We can further refine the taxonomy of supervised learning algorithms by segregating between sequential and random variables for generative models and breaking down discriminative methods as applied to continuous processes (regression) and discrete processes (classification):

Reinforcement learning

Reinforcement learning is not as well understood as supervised and unsupervised learning outside the realms of robotics or game strategy. However, since the 1990s, genetic-algorithm-based classifiers have become increasingly popular for solving problems that require collaboration with a domain expert. For some types of applications, reinforcement learning algorithms output a set of recommended actions for the adaptive system to execute. In their simplest form, these algorithms compute or estimate the best course of action. Most complex systems based on reinforcement learning establish and update policies that can be vetoed by an expert. The foremost challenges that developers of reinforcement learning systems face are that the recommended action or policy may depend on partially observable states, and how to deal with uncertainty.

Genetic algorithms are not usually considered part of the reinforcement learning toolbox.
However, advanced models such as learning classifier systems use genetic algorithms to classify and reward the rules and policies.

As with the two previous learning strategies, reinforcement learning models can be categorized as Markovian or evolutionary:

This is a brief overview of machine learning algorithms with a suggested taxonomy. There are almost as many ways to introduce machine learning as there are data and computer scientists. We encourage you to browse through the list of references at the end of the book and find the documentation appropriate to your level of interest and understanding.

Tools and frameworks

Before getting your hands dirty, you need to download and deploy a minimum set of tools and libraries so as not to reinvent the wheel. A few key components have to be installed in order to compile and run the source code described throughout the book. We focus on open source and commonly available libraries, although you are invited to experiment with equivalent tools of your choice. The learning curve for the frameworks described here is minimal.

Java

The code described in the book has been tested with JDK 1.7.0_45 and JDK 1.8.0_25 on Windows x64 and Mac OS X x64. You need to install the Java Development Kit if you have not already done so. Finally, the environment variables JAVA_HOME, PATH, and CLASSPATH have to be updated accordingly.

Scala

The code has been tested with Scala 2.10.4. We recommend using Scala version 2.10.3 or higher and SBT 0.13 or higher. Let's assume that the Scala runtime (REPL) and libraries have been properly installed and that the environment variables SCALA_HOME and PATH have been updated.

The description and installation instructions of the Scala plugin for Eclipse are available at http://scala-ide.org/docs/user/gettingstarted.html. You can also download the Scala plugin for IntelliJ IDEA from the JetBrains website at http://confluence.jetbrains.com/display/SCA/.
The ubiquitous simple build tool (sbt) will be our primary build engine. The syntax of the build file sbt/build.sbt conforms to version 0.13, and is used to compile and assemble the source code presented throughout this book.

Apache Commons Math

Apache Commons Math is a Java library for numerical processing, algebra, statistics, and optimization [1:6].

Description

This is a lightweight library that provides developers with a foundation of small, ready-to-use Java classes that can be easily woven into a machine learning problem. The examples used throughout the book require version 3.3 or higher.

The main components of Apache Commons Math are:

• Functions, differentiation, and integral and ordinary differential equations
• Statistical distributions
• Linear and nonlinear optimization
• Dense and sparse vectors and matrices
• Curve fitting, correlation, and regression

For more information, visit http://commons.apache.org/proper/commons-math.

Licensing

The library is distributed under the Apache License 2.0; the terms are available at http://www.apache.org/licenses/LICENSE-2.0.

Installation

The installation and deployment of the Commons Math library are quite simple:

1. Go to the download page, http://commons.apache.org/proper/commons-math/download_math.cgi.
2. Download the latest .jar files in the Binaries section, commons-math3-3.3-bin.zip (for version 3.3, for instance).
3. Unzip and install the .jar files.
4. Add commons-math3-3.3.jar to classpath as follows:
   ° For Mac OS X, use the command export CLASSPATH=$CLASSPATH:/Commons_Math_path/commons-math3-3.3.jar
   ° For Windows, navigate to System property | Advanced system settings | Advanced | Environment variables…, then edit the entry of the CLASSPATH variable
5. Add the commons-math3-3.3.jar file to your IDE environment if needed (that is, for Eclipse, navigate to Project | Properties | Java Build Path | Libraries | Add External JARs).

You can also download commons-math3-3.3-src.zip from the Source section.
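As an alternative to managing the classpath by hand, the library can be declared as a managed dependency in the sbt build file. The fragment below is a sketch that assumes the artifact is resolved from Maven Central under the coordinates org.apache.commons:commons-math3 and that version 3.3 is the one you want; it is not taken from the build file that ships with the book.

```scala
// build.sbt fragment (assumption: dependency resolved from Maven Central)
// Commons Math is a plain Java library, so a single % is used between
// the organization and the artifact name (no Scala version suffix).
libraryDependencies += "org.apache.commons" % "commons-math3" % "3.3"
```

With this line in place, sbt downloads the .jar file on the next compile, and the manual CLASSPATH steps are only needed when running outside sbt.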
JFreeChart

JFreeChart is an open source chart and plotting Java library, widely used in the Java programmer community. It was originally created by David Gilbert [1:7].

Description

The library supports a variety of configurable plots and charts (scatter, dial, pie, area, bar, box and whisker, stacked, and 3D). We use JFreeChart to display the output of data processing and algorithms throughout the book, but you are encouraged to explore this great library on your own, as time permits.

Licensing

It is distributed under the terms of the GNU Lesser General Public License (LGPL), which permits its use in proprietary applications.

Installation

To install and deploy JFreeChart, perform the following steps:
1. Visit http://www.jfree.org/jfreechart.
2. Download the latest version from SourceForge at http://sourceforge.net/projects/jfreechart/files.
3. Unzip and install the .jar file.
4. Add jfreechart-1.0.17.jar (for version 1.0.17) to the classpath as follows:
   ° For Mac OS, update the classpath by using export CLASSPATH=$CLASSPATH:/JFreeChart_path/jfreechart-1.0.17.jar
   ° For Windows, go to System property | Advanced system settings | Advanced | Environment variables… and then edit the entry of the CLASSPATH variable
5. Add the jfreechart-1.0.17.jar file to your IDE environment, if needed.

Other libraries and frameworks

Libraries and tools that are specific to a single chapter are introduced along with the topic. Scalable frameworks are presented in the last chapter along with the instructions to download them. Libraries related to the conditional random fields and support vector machines are described in the respective chapters.

Why not use Scala algebra and numerical libraries?

Libraries such as Breeze, ScalaNLP, and Algebird are great Scala frameworks for linear algebra, numerical analysis, and machine learning. They provide even the most seasoned Scala programmer with a high-quality layer of abstraction.
However, this book is designed as a tutorial that allows developers to write algorithms from the ground up using simple common Java libraries [1:8].

Source code

The Scala programming language is used to implement and evaluate the machine learning techniques presented in this book. Only a subset of the source code used to implement the techniques is presented in the book. The formal implementation of these algorithms is available on the website of Packt Publishing (http://www.packtpub.com).

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Context versus view bounds

Most Scala classes discussed in the book are parameterized with the type associated with a discrete/categorical value (Int) or a continuous value (Double). Upper type bounds would require that any type used by the client code has Int or Double as its upper bound:

    class MyClassInt[T <: Int]
    class MyClassFloat[T <: Double]

Such a design forces the client to inherit from simple types and to deal with covariance and contravariance for container types [1:9]. For this book, view bounds are used instead, because they only require an implicit conversion to the parameterized type to be defined:

    class MyClassFloat[T <% Double]
    implicit def T2Double(t: T): Double

Presentation

For the sake of readability of the implementation of algorithms, all nonessential code such as error checking, comments, exceptions, or imports is omitted. The following code elements are discarded in the code snippets presented in the book:
• Code comments
• Validation of class parameters and method arguments:

    class BaumWelchEM(val lambda: HMMLambda ...) {
      require(lambda != null, "Lambda model is undefined")

• Exceptions and an exception handler:

    try { .. }
    catch { case e: ArrayIndexOutOfBoundsException => println(e.toString) }

• Nonessential annotation:

    @inline def mean = ..

• Logging and debugging code:

    m_logger.debug( …)

• Private and nonessential methods

Primitives and implicits

The algorithms presented in this book share the same primitive types, generic operators, and implicit conversions.

Primitive types

For the sake of readability of the code, the following primitive types will be used:

    type XY = (Double, Double)
    type XYTSeries = Array[(Double, Double)]
    type DMatrix[T] = Array[Array[T]]
    type DVector[T] = Array[T]
    type DblMatrix = DMatrix[Double]
    type DblVector = Array[Double]

The types have the behavior (methods) of their primitive counterpart (array). However, adding new functionality to vectors, matrices, and time series requires classes in their own right. These classes will be introduced in the next chapter.

Type conversions

Implicit conversion is an important feature of the Scala programming language because it allows developers to specify a type conversion for an entire library in a single place. Here are a few of the implicit type conversions used throughout the book:

    implicit def int2Double(n: Int): Double = n.toDouble
    implicit def vectorT2DblVector[T <% Double](vt: DVector[T]): DblVector =
      vt.map(t => t.toDouble)
    implicit def double2DblVector(x: Double): DblVector = Array[Double](x)
    implicit def dblPair2DbLVector(x: (Double, Double)): DblVector =
      Array[Double](x._1, x._2)
    implicit def dblPairs2DblRows(x: (Double, Double)): DblMatrix =
      Array[Array[Double]](Array[Double](x._1, x._2))
    ...

Library-specific conversion

The conversion between the primitive types listed here and the types introduced in a particular library (such as Apache Commons Math) is declared in later chapters, the first time those libraries are used.
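The type aliases and conversions above can be exercised in isolation. The following is a minimal sketch rather than the book's code: it keeps only three of the aliases, and it replaces the view bound T <% Double with an explicit implicit parameter (an assumption made here so that the snippet also compiles with recent compilers, which no longer accept view bounds). The object name Primitives and the helper toDblVector are hypothetical.

```scala
import scala.language.implicitConversions

object Primitives {
  // Subset of the type aliases listed in the text
  type XY = (Double, Double)
  type DblVector = Array[Double]
  type DblMatrix = Array[Array[Double]]

  // Element-wise conversion; the implicit parameter plays the role
  // of the view bound T <% Double (assumption of this sketch)
  def toDblVector[T](vt: Array[T])(implicit ev: T => Double): DblVector =
    vt.map(t => ev(t))

  // Implicit conversion from a pair of doubles to a two-element vector
  implicit def dblPair2DblVector(x: XY): DblVector = Array(x._1, x._2)

  // An Int array widened through an explicitly supplied conversion
  val ints: DblVector = toDblVector(Array(1, 2, 3))((n: Int) => n.toDouble)

  // The pair is converted implicitly because a DblVector is expected
  val pair: DblVector = (0.5, 1.5)
}
```

Declaring the conversions in one object keeps the call sites free of boilerplate, at the cost of making the conversion invisible where it is applied.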
Operators

Lastly, some operations are applied by multiple machine learning or preprocessing algorithms. They need to be defined implicitly. The operation on a pair of a vector of arbitrary type and a vector of Double is defined as follows:

    def Op[T <% Double](v: DVector[T], w: DblVector, op: (T, Double) => Double): DblVector =
      v.zipWithIndex.map(x => op(x._1, w(x._2)))

It is also convenient to define the following operators, which are not included in the Scala standard library:

    implicit def /(v: DblVector, n: Int): DblVector = v.map(x => x/n)
    implicit def /(m: DblMatrix, col: Int, z: Double): DblMatrix = {
      (0 until m(col).size).foreach(i => m(col)(i) /= z)
      m
    }

We won't have to redefine the types, conversions, and operators from now on.

Immutability

It is usually a good idea to reduce the number of states of an object. A method invocation transitions an object from one state to another. The larger the number of methods or states, the more cumbersome the testing process becomes.

There is no point in creating a model that is not defined (trained). Therefore, it makes sense to perform the training of a model as part of the constructor of the class that implements it. As a result, the only public methods of a machine learning algorithm are:
• Classification or prediction
• Validation
• Retrieval of model parameters (weights, latent variables, hidden states, and so on), if needed

Performance of Scala iterators

The evaluation of the performance of Scala high-order iterative methods is beyond the scope of this book. However, it is important to be aware of the trade-off of each method. The for loop construct is to be avoided as a counting iterator except if it is used in conjunction with yield. It is designed to implement the for-comprehension monad (map-flatMap). The source code presented in this book uses the while and foreach constructs. Scala reducer methods reduce and fold are also frequently used for their efficiency.
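To make the constructs mentioned above concrete, the sketch below computes the same sum of squares with while, foreach, foldLeft, and for with yield. The object and function names are hypothetical, and no timing claim is made; the snippet only shows that the four forms are interchangeable in terms of results.

```scala
object Iterations {
  // while loop with an explicit counter
  def sumSqrWhile(xs: Array[Double]): Double = {
    var i = 0
    var sum = 0.0
    while (i < xs.length) { sum += xs(i) * xs(i); i += 1 }
    sum
  }

  // foreach over the elements, accumulating into a local variable
  def sumSqrForeach(xs: Array[Double]): Double = {
    var sum = 0.0
    xs.foreach(x => sum += x * x)
    sum
  }

  // foldLeft threads the accumulator through the traversal without mutable state
  def sumSqrFold(xs: Array[Double]): Double =
    xs.foldLeft(0.0)((s, x) => s + x * x)

  // for with yield is rewritten by the compiler into map, followed here by a reduction
  def sumSqrFor(xs: Array[Double]): Double =
    (for (x <- xs) yield x * x).sum
}
```

The for/yield form allocates an intermediate array before the reduction, which is one reason to prefer a fold or a while loop inside tight numerical code.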
Let's kick the tires

This final section introduces the key elements of the training and classification workflow. A test case using a simple logistic regression is used to illustrate each step of the computational workflow.

Overview of computational workflows

In its simplest form, a computational workflow to perform runtime processing of a dataset is composed of the following stages:
1. Loading the dataset from files, databases, or any streaming devices.
2. Splitting the dataset for parallel data processing.
3. Preprocessing data using filtering techniques, analysis of variance, and applying penalty and normalization functions whenever necessary.
4. Applying the model, either a set of clusters or classes, to classify new data.
5. Assessing the quality of the model.

A similar sequence of tasks is used to extract a model from a training dataset:
1. Loading the dataset from files, databases, or any streaming devices.
2. Splitting the dataset for parallel data processing.
3. Applying filtering techniques, analysis of variance, and penalty and normalization functions to the raw dataset whenever necessary.
4. Selecting the training, testing, and validation set from the cleansed input data.
5. Extracting key features, establishing affinity between a similar group of observations using clustering techniques or supervised learning algorithms.
6. Reducing the number of features to a manageable set of attributes to avoid overfitting the training set.
7. Validating and tuning the model by iterating over steps 5, 6, and 7 until the error meets the criteria.
8. Storing the model into a file or database to be loaded for runtime processing of new observations.

Data clustering and data classification can be performed independently of each other or as part of a workflow that uses clustering techniques as a preprocessing stage of the training phase of a supervised learning algorithm.
Data clustering does not require a model to be extracted from a training set, while classification can be performed only if a model has been built from the training set. The following image gives an overview of training and classification:

Figure: A generic data flow for training and running a model

This diagram is an overview of a typical data mining processing pipeline. The first phase consists of extracting the model through clustering or training of a supervised learning algorithm. The model is then validated against test data, for which the source is the same as the training set but with different observations. Once the model is created and validated, it can be used to classify real-time data or predict future behavior. In practice, real-world workflows are more complex and need to be dynamically configurable to allow experimentation with different models. Several alternative classifiers can be used to perform a regression, and different filtering algorithms are applied against input data depending on the latent noise in the raw data.

Writing a simple workflow

This book relies on financial data to experiment with different learning strategies. The objective of the exercise is to build a model that can discriminate between volatile and nonvolatile trading sessions. For this first example, we select a simplified version of the logistic regression as our classifier as we treat a stock-price-volume action as a continuous or pseudo-continuous process.

Logistic regression

Logistic regression is treated in depth in Chapter 6, Regression and Regularization. The model treated in this example is a simple binary classifier using logistic regression for two-dimensional observations.
The classification of trading sessions according to their volatility proceeds as follows:
• Select a dataset
• Load the dataset
• Preprocess the dataset
• Display data
• Create the model through training
• Classify new data

Selecting a dataset

Throughout the book, we will rely on financial data to evaluate and discuss the merit of different data processing and machine learning methods. In this example, the data is extracted from Yahoo! Finance using the CSV format with the following fields:
• Date
• Price at open
• Highest price in session
• Lowest price in session
• Price at session close
• Volume
• Adjusted price at session close

Let's create a simple program that loads the content of the file, executes some simple preprocessing functions, and creates a simple model. We selected the CSCO stock price between January 1, 2012 and December 1, 2013 as our data input.

Let's consider two variables, price and volume, as illustrated by the following screenshot. The top graph displays the variation of the price of Cisco stock over time and the bottom bar chart represents the daily trading volume on Cisco stock over time:

Figure: Price-Volume action for the Cisco stock

Loading the dataset

The first step is loading the dataset from a local file, as shown here (large datasets are typically loaded from a database or a distributed filesystem such as the Hadoop Distributed File System (HDFS)):

    def load(fileName: String): Option[XYTSeries] = {
      val src = Source.fromFile(fileName)
      val fields = src.getLines.map( _.split(CSV_DELIM)).toArray //1
      val cols = fields.drop(1) //2
      val data = transform(cols)
      src.close //3
      Some(data)
    }

The transform method will be described in the next section. The data file is opened through an invocation of the Source.fromFile method, and then the fields are extracted through a map (line 1). The header (first) row is removed with a call to drop (line 2).
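The loading steps above can be tried without a file on disk. The following is a minimal sketch, not the book's loader: it parses CSV content from any scala.io.Source, drops the header row, and closes the source in a finally clause so that the handle is released even if parsing fails. The object name CsvLoader, the delimiter constant, and the sample content are assumptions made for the illustration.

```scala
import scala.io.Source

object CsvLoader {
  val CsvDelim = ","  // assumed delimiter

  // Two-line sample in the Yahoo! Finance layout described in the text
  val sample: String =
    "Date,Open,High,Low,Close,Volume,Adj Close\n" +
    "3/9/2011,14.78,15.08,14.20,14.91,4.79E+08,14.88"

  // Splits every line into fields and skips the header row.
  // toArray forces the lazy iterator before the source is closed.
  def parse(src: Source): Array[Array[String]] =
    try src.getLines().map(_.split(CsvDelim)).toArray.drop(1)
    finally src.close()
}
```

A call such as CsvLoader.parse(Source.fromString(CsvLoader.sample)) returns a single row of seven fields; passing Source.fromFile(fileName) instead reads from disk.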
Data extraction

The Source.fromFile.getLines.map invocation pipeline returns an iterator, which needs to be converted into an array to store the information in memory.

The file has to be closed to avoid leaking the file handle (line 3).

Code readability

A long pipeline of Scala high-order methods makes the code quite difficult to read. It is recommended to break down long chains of method calls. The following code is an example of a long chain of method calls:

    val cols = Source.fromFile(fileName).getLines.map(_.split(CSV_DELIM)).toArray.drop(1)

We can break down such method calls into several steps as follows:

    val lines = Source.fromFile(fileName).getLines
    val fields = lines.map(_.split(CSV_DELIM)).toArray
    val cols = fields.drop(1)

We strongly encourage you to consult the excellent guide Effective Scala, written by Marius Eriksen from Twitter. This is definitely a must-read for any Scala developer [1:10].

Preprocessing the dataset

The next step is to normalize the data in the range [-0.5, 0.5] to be trained by the logistic binary classifier. It is time to introduce a no-nonsense statistics class.

Basic statistics

We select the computation of mean and standard deviation of the two time series as the first step of the preprocessing phase. The computation of these statistics can be implemented by the reduce methods reduceLeft and foldLeft:

    val mean = price.reduceLeft( _ + _ )/price.size
    val s2 = price.foldLeft(0.0)((s,x) => s+(x-mean)*(x-mean))
    val stdDev = Math.sqrt(s2/(price.size-1))

However, this implementation has one major drawback: the dataset (price in this example) has to be traversed for each method (mean, stdDev, min, max, and so on).
One solution is to create a class that computes the counters in a single traversal and the statistics on demand, using lazy values:

    class Stats[T <% Double](private val values: DVector[T]) {
      class _Stats(var minValue: Double, var maxValue: Double,
                   var sum: Double, var sumSqr: Double)

      // Single traversal that updates all the counters
      val stats = {
        val _stats = new _Stats(Double.MaxValue, Double.MinValue, 0.0, 0.0)
        values.foreach(x => {
          if(x < _stats.minValue) _stats.minValue = x
          if(x > _stats.maxValue) _stats.maxValue = x
          _stats.sum += x
          _stats.sumSqr += x*x
        })
        _stats
      }

      lazy val mean = stats.sum/values.size
      lazy val variance = (stats.sumSqr - mean*mean*values.size)/(values.size-1)
      lazy val stdDev = if(variance < ZERO_EPS) ZERO_EPS else Math.sqrt(variance)
      lazy val min = stats.minValue
      lazy val max = stats.maxValue
    }

We made the statistics object generic by using the view bounds T <% Double, which assumes a conversion from type T to Double. By defining the statistics as counters (minimum value, maximum value, sum of values, and sum of square values) and accumulating these values into a statistics object, we limit the number of traversals of the dataset to one and, therefore, avoid the recomputation of these statistics for the existing dataset each time new data is added.

The code illustrates the use and benefit of lazy values in Scala. The mean is computed only if and when needed.

Normalization and Gauss distribution

Statistics are usually used to normalize data into a probability value [0, 1] as required by most classification or clustering algorithms.
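The single-pass statistics and the rescaling into [0, 1] mentioned above can be sketched together. This is an illustration under stated assumptions, not the book's Stats class: it is restricted to Array[Double], it accumulates the counters with foldLeft into an immutable case class, and the names StatsSketch and Counters are hypothetical.

```scala
object StatsSketch {
  // Counters accumulated in one traversal: min, max, sum, and sum of squares
  final case class Counters(min: Double, max: Double, sum: Double, sumSqr: Double)

  final class Stats(values: Array[Double]) {
    require(values.length > 1, "Stats needs at least two values")

    // One foldLeft updates every counter for each element
    private val c = values.foldLeft(
      Counters(Double.MaxValue, -Double.MaxValue, 0.0, 0.0)) { (acc, x) =>
      Counters(math.min(acc.min, x), math.max(acc.max, x),
               acc.sum + x, acc.sumSqr + x * x)
    }

    // Derived statistics are evaluated on first access only
    lazy val mean: Double = c.sum / values.length
    lazy val variance: Double =
      (c.sumSqr - mean * mean * values.length) / (values.length - 1)
    lazy val stdDev: Double = math.sqrt(math.max(variance, 0.0))

    def min: Double = c.min
    def max: Double = c.max

    // Min-max scaling into [0, 1]; assumes max > min
    def normalize: Array[Double] = {
      val range = max - min
      values.map(x => (x - min) / range)
    }
  }
}
```

For the values 1, 2, 3, 4 the mean is 2.5, the sample variance is 5/3, and normalize maps the smallest value to 0 and the largest to 1.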
It is logical to add the normalization method to the Stats class, as we have already extracted the min and max values:

    def normalize: DblVector = {
      val range = max - min
      values.map(x => (x - min)/range)
    }

The same approach is used to compute the normal (Gauss) density of each value:

    def gauss: DblVector = values.map(x => {
      val y = x - mean
      INV_SQRT_2PI/stdDev*Math.exp(-0.5*y*y/variance)
    })

The price action chart has a very interesting characteristic. At a closer look, a sudden change in price and increase in volume occurs about every three months or so. Experienced investors will undoubtedly recognize that those price-volume patterns are related to the release of quarterly earnings of Cisco. Such regular but unpredictable patterns can be a source of concern or opportunity if risk can be managed. The strong reaction of the stock price to the release of corporate earnings may scare some long-term investors while enticing day traders.

The following graph visualizes the potential correlation between sudden price change (volatility) and heavy trading volume:

Figure: Correlation price-volume action for the Cisco stock

Let's try to correlate the volatility of the stock price with volume. For the sake of this exercise, we define the volatility as the maximum variation of the stock price within each trading session: the relative difference between the highest price during the trading session and the lowest price during the session.

The YahooFinancials enumeration extracts historical stock prices and session volume from a CSV file. For example, the volatility is extracted from the CSV fields of each line in the CSV file as follows:

    object YahooFinancials extends Enumeration {
      type YahooFinancials = Value
      val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, ADJ_CLOSE = Value
      val volatility = (fs: Array[String]) => fs(HIGH.id).toDouble - fs(LOW.id).toDouble
      …
    }

The transform method uses the YahooFinancials enumeration to generate the input data for the model:

    def transform(cols: Array[Array[String]]): XYTSeries = {
      val volatility = Stats[Double](cols.map(YahooFinancials.volatility)).normalize
      val volume = Stats[Double](cols.map(YahooFinancials.volume)).normalize
      volatility.zip(volume)
    }

The volatility and volume data is normalized using the Stats.normalize method defined earlier.

Plotting data

Although charting is not the primary goal of this book, we think that you will benefit from a brief introduction to JFreeChart. The skeleton code to generate a scatter plot is rather simple. The most relevant code is the transformation of the XYTSeries into JFreeChart's graphical XYSeries:

    val xLegend = "Session Volatility"
    val yLegend = "Session Volume"

    def display(xy: XYTSeries, w: Int, h: Int): Unit = {
      val series = new XYSeries("CSCO 2012-2013 Stock")
      xy.foreach(x => series.add(x._1, x._2))
      val seriesCollection = new XYSeriesCollection
      seriesCollection.addSeries(series)
      … // plot rendering code
      val chart = ChartFactory.createScatterPlot(xLegend, xLegend, yLegend,
        seriesCollection, PlotOrientation.VERTICAL, true, false, false)
      createFrame("Logistic Regression", chart)
    }

Visualization

The JFreeChart library is introduced as a robust charting tool. The visualization of the results of a computation is beyond the scope of this book. The code related to plots and charts is omitted from the book in order to keep the code snippets concise and dedicated to machine learning. On a few occasions, output data is formatted as a CSV file so that it can be imported into a spreadsheet.
Here is an example of a plot using the ScatterPlot.display method:

    val plot = new ScatterPlot(("CSCO 2012-2013", "Session High - Low", "Session Volume"),
      new BlackPlotTheme)
    plot.display(volatility_vol.filter( _._1 < 0.5), 250, 340)

Figure: Scatter plot of volatility and volume for the Cisco stock

There is a level of correlation between session volume and session volatility. We can use this information to classify trading sessions by their volatility.

Creating a model (learning)

The objective of the training is to build a model that can discriminate between volatile and nonvolatile trading sessions. For the sake of the exercise, session volatility has been defined as the difference between the session high price and the session low price. Volatility and trading volume constitute the two input variables of the model.

Logistic regression is commonly used in statistical inference. The following implementation of the binary logistic regression classifier exposes a single method, classify, to comply with our desire to reduce the complexity and life cycle of objects. The model parameters, weights, are computed during training when the LogBinRegression class/model is instantiated. As mentioned earlier, the sections of the code nonessential to the understanding of the algorithm are omitted:

    class LogBinRegression(val labels: DVector[(XY, Double)], val maxIters: Int,
                           val eta: Double, val eps: Double) {
      val dim = 3
      val weights = train

      // Returns None if the model could not be trained
      def classify(xy: XY): Option[(Boolean, Double)] = weights.map(w => {
        val likelihood = sigmoid(w(0) + xy._1*w(1) + xy._2*w(2))
        (likelihood > 0.5, likelihood)
      })

The training method, train, consists of iterating through the computation of the weights using a simple gradient descent.
The method computes the weights and returns an option, so the model is either trained and ready for runtime classification or nonexistent (None):

    def train: Option[DblVector] = {
      val w = Array.fill(dim)(Random.nextDouble-1.0)
      Range(0, maxIters).find(_ => {
        val deltaW = labels.foldLeft(Array.fill(dim)(0.0))((dw, lbl) => {
          val y = sigmoid(w(0) + w(1)*lbl._1._1 + w(2)*lbl._1._2)
          dw.map(dx => dx + (lbl._2 - y)*(lbl._1._1 + lbl._1._2))
        })
        val nextW = Array.fill(dim)(0.0)
          .zipWithIndex
          .map(nw => w(nw._2)+eta*deltaW(nw._2))
        val diff = Math.abs(nextW.sum - w.sum)
        nextW.copyToArray(w); diff < eps
      }) match {
        case Some(iters) => Some(w)
        case None => { … }
      }
    }

    def sigmoid(x: Double): Double = 1.0/(1.0 + Math.exp(-x))

The iteration is encapsulated in the Scala find method, which exits when the algorithm converges (diff < eps). The model parameters, weights, are set to None if the maximum number of iterations is reached.

The training method, train, iterates across the set of observations by computing the gradient between the predicted and observed values. In our simplistic approach, the gradient is computed as a linear function of the sigmoid of the sum of the product of the weight and training observations. As for any optimization problem, the initialization of the solution vector, weights, is critical. We choose to initialize the weights with random values, although in practice, you would use a more deterministic approach to initialize the model parameters.

In order to train the model, we need to label data. The process consists of tagging every trading session as volatile or nonvolatile according to the observations (relative session volatility and session volume). The labeling process is usually quite cumbersome; therefore, let's generate the labels automatically.
A trading session is considered volatile if its volatility and volume are both greater than 60 percent of the maximum relative volatility and volume:

    val labels = volatilityVol.zip(volatilityVol.map(x =>
      if(x._1 > 0.3 && x._2 > 0.3) 1.0 else 0.0))

Automated labeling

Although quite convenient, automated creation of training labels is not without risk because it may mislabel singular observations. This technique is used in this test for convenience but it is not recommended unless a domain expert reviews the labels manually.

The model is created (trained) by a simple instantiation of the logistic binary classifier:

    val logit = new LogBinRegression(labels, 300, 0.00005, 0.02)

The training run is configured with a maximum of 300 iterations, a gradient step (eta) of 0.00005, and a convergence criterion (eps) of 0.02.

Classify the data

Finally, the model can be tested with a fresh dataset, not related to the training set:

    Date,Open,High,Low,Close,Volume,Adj Close
    3/9/2011,14.78,15.08,14.20,14.91,4.79E+08,14.88
    11/17/2009,10.78,10.90,10.62,10.84,3901987,10.85

It is just a matter of executing the classification method (exceptions, conditions on method arguments, and returned values are omitted):

    val testData = load("resources/data/chap1/CSCO2.csv")
    logit.classify(testData(0)) match {
      case Some(topCategory) => Display.show(topCategory)
      case None => { … }
    }
    logit.classify(testData(1)) match {
      case Some(topCategory) => Display.show(topCategory)
      case None => { … }
    }

The result of the classification is (true,0.516) for the first sample and (false,0.1180) for the second sample.

Validation

The simple classification, in this test case, is provided for illustrating the runtime application of the model. It does not constitute a validation of the model by any stretch of the imagination. The next chapter digs into validation metrics and methodology.
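For readers who want to experiment outside the book's code base, here is a self-contained sketch of a two-feature binary logistic regression. It is not the LogBinRegression class shown earlier: it uses the textbook per-weight gradient of the log-likelihood (bias, first feature, second feature) instead of the simplified update above, it starts from zero weights, and it runs a fixed number of iterations, so its outputs will not match the figures quoted in the text. The names LogisticSketch, train, and classify are hypothetical.

```scala
object LogisticSketch {
  type XY = (Double, Double)

  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // Batch gradient ascent on the log-likelihood.
  // Weights are (bias, w1, w2); zero initialization is an assumption of the sketch.
  def train(labels: Array[(XY, Double)], maxIters: Int, eta: Double): Array[Double] = {
    var w = Array(0.0, 0.0, 0.0)
    var iter = 0
    while (iter < maxIters) {
      // Accumulate (label - prediction) times each input, with 1.0 for the bias
      val grad = labels.foldLeft(Array(0.0, 0.0, 0.0)) { case (g, ((x1, x2), t)) =>
        val err = t - sigmoid(w(0) + w(1) * x1 + w(2) * x2)
        Array(g(0) + err, g(1) + err * x1, g(2) + err * x2)
      }
      w = w.zip(grad).map { case (wi, gi) => wi + eta * gi }
      iter += 1
    }
    w
  }

  // Returns the predicted class and the estimated probability of the positive class
  def classify(w: Array[Double], xy: XY): (Boolean, Double) = {
    val p = sigmoid(w(0) + w(1) * xy._1 + w(2) * xy._2)
    (p > 0.5, p)
  }
}
```

On a small, linearly separable set of normalized (volatility, volume) pairs, a few thousand iterations with a step of 0.5 are enough for the sketch to separate the two classes; real data calls for the validation techniques introduced in the next chapter.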
Summary

We hope you enjoyed this introduction to machine learning and how to leverage your existing skills in Scala programming to create a simple regression program to predict stock price/volume action. Here are the highlights of this introductory chapter:
• From monadic composition and high-order collection methods for parallelization to configurability to reusability patterns, Scala is the perfect fit to implement and leverage data mining and machine learning algorithms for large-scale projects
• There are many steps to create and apply a machine learning model
• The implementation of the logistic binary classifier presented as part of the test case is simple enough to encourage you to learn how to write and apply more advanced machine learning algorithms

To the delight of Scala programming aficionados, the next chapter will dig deeper into building a flexible workflow by leveraging traits and dependency injection.

Hello World!

In the first chapter, you were acquainted with some rudimentary concepts regarding data processing, clustering, and classification. This chapter is dedicated to the creation and maintenance of a flexible end-to-end workflow to train and classify data. The first section of the chapter introduces a data-centric (functional) approach to create number-crunching applications.

You will learn how to:
• Apply the concept of monadic design to create dynamic workflows
• Leverage some of Scala's advanced functional features, such as dependency injection, to build portable computational workflows
• Take into account the bias-variance trade-off in selecting a model
• Overcome overfitting in modeling
• Break down data into training, test, and validation sets
• Implement model validation in Scala using precision, recall, and F score

Modeling

Data is the lifeline of any scientist, and the selection of data providers is critical in developing or evaluating any statistical inference or machine learning algorithm.
A model by any other name

We briefly introduced the concept of a model in the Model categorization section in Chapter 1, Getting Started.

What constitutes a model? Wikipedia provides a reasonably good definition of a model as understood by scientists [2:1]:

A scientific model seeks to represent empirical objects, phenomena, and physical processes in a logical and objective way. … Models that are rendered in software allow scientists to leverage computational power to simulate, visualize, manipulate and gain intuition about the entity, phenomenon, or process being represented.

In statistics and probability theory, a model describes data that one might observe from a system to express any form of uncertainty and noise. A model allows us to infer rules, make predictions, and learn from data.

A model is composed of features, also known as attributes or variables, and a set of relations between those features. For instance, the model represented by the function f(x, y) = x.sin(2y) has two features, x and y, and a relation, f. These two features are assumed to be independent. If the model is subject to a constraint such as f(x, y) < 20, then the conditional independence is no longer valid.

An astute Scala programmer would associate a model to a monoid for which the set is a group of observations and the operator is the function implementing the model. If it walks like a monoid and quacks like a monoid, then it is a monoid.

Models come in a variety of shapes and forms:
• Parametric: This consists of functions and equations (for example, y = sin(2t + w))
• Differential: This consists of ordinary and partial differential equations (for example, dy = 2x.dx)
• Probabilistic: This consists of probability distributions (for example, p(x|c) = exp(k.logx - x)/x!)
• Graphical: This consists of graphs that abstract out the conditional independence between variables (for example, p(x,y|c) = p(x|c).p(y|c))
• Directed graphs: This consists of temporal and spatial relationships (for example, a scheduler)
• Numerical method: This consists of finite elements and methods such as Newton-Raphson
• Chemistry: This consists of formula and components (for example, H2O, Fe + Cl2 = FeCl3, and so on)
• Taxonomy: This consists of a semantic definition and relationship of concepts (for example, APG/Eudicots/Rosids/Huaceae/Malvales)
• Grammar and lexicon: This consists of a syntactic representation of documents (for example, Scala programming language)
• Inference logic: This consists of a distribution pattern such as IF (stock vol > 1.5 * average) AND rsi > 80 THEN…

Model versus design

The confusion between model and design is quite common in computer science, the reason being that these terms have different meanings for different people depending on the subject. The following metaphors should help with your understanding of these two concepts:
• Modeling: This is describing something you know. A model makes an assumption, which becomes an assertion if proven correct (for example, the US population, p, increases by 1.2 percent a year, dp/dt = 0.012p).
• Designing: This is manipulating representation for things you don't know. Designing can be seen as the exploration phase of modeling (for example, what are the features that contribute to the growth of the US population? Birth rate? Immigration? Economic conditions? Social policies?).

Selecting a model's features

The selection of a model's features is the process of discovering and documenting the minimum set of variables required to build the model. Scientists make the assumption that data contains many redundant or irrelevant features. Redundant features provide no information beyond what the selected features already give, and irrelevant features provide no useful information.
Selecting features consists of two consecutive steps:
1. Searching for new feature subsets.
2. Evaluating these feature subsets using a scoring mechanism.

The process of evaluating each possible subset of features to find the one that maximizes the objective function or minimizes the error rate is computationally intractable for large datasets. A model with n features requires 2^n - 1 evaluations.

Extracting features

An observation is a set of indirect measurements of hidden, also known as latent, variables, which may be noisy or contain a high degree of correlation and redundancies. Using raw observations in a classification task would very likely produce inaccurate classes. Using all features from the observation also incurs a high computation cost.

The purpose of extracting features is to reduce the number of variables or dimensions of the model by eliminating redundant or irrelevant features. The features are extracted by transforming the original set of observations into a smaller set at the risk of losing some vital information embedded in the original set.

Designing a workflow

A data scientist has many options in selecting and implementing a classification or clustering algorithm.

Firstly, a mathematical or statistical model is to be selected to extract knowledge from the raw input data or the output of an upstream data transformation. The selection of the model is constrained by the following parameters:
• Business requirements such as accuracy of results
• Availability of training data and algorithms
• Access to a domain or subject-matter expert

Secondly, the engineer has to select a computational and deployment framework suitable for the amount of data to be processed.
The computational context is to be defined by the following parameters:

• Available resources such as machines, CPU, memory, or I/O bandwidth
• Implementation strategy such as iterative versus recursive computation or caching
• Requirements for the responsiveness of the overall process such as duration of computation or display of intermediate results

The following diagram illustrates the selection process to define the data transformation for each computation in the workflow:

[Figure: Statistical and computation modeling for machine-learning applications. The diagram shows two selection steps, "Select a statistical or mathematical model" and "Select a computational framework", leading from observations, labels, and context to code and learned model parameters.]

Domain expertise, data science, and software engineering

A domain or subject-matter expert is a person with authoritative or credited expertise in a particular area or topic. A chemist is an expert in the domain of chemistry and possibly related fields.

A data scientist solves problems related to data in a variety of fields such as biological sciences, health care, marketing, or finance. Data and text mining, signal processing, statistical analysis, and modeling using machine learning algorithms are some of the activities performed by a data scientist.

A software developer performs all the tasks related to creation of software applications, including analysis, design, coding, testing, and deployment.
The parameters of a data transformation may need to be reconfigured according to the output of the upstream data transformation. Scala's higher-order functions are particularly suitable for implementing configurable data transformations.

The computational framework

The objective is to create a framework flexible and reusable enough to accommodate different workflows and support all types of machine learning algorithms from preprocessing, data smoothing, and classification to validation.

Scala provides us with a rich toolbox that includes monadic design, design patterns, and dependency injections using traits. The following diagram describes the three levels of complexity for creating the framework:

[Figure: Hierarchical design of a monadic workflow. The three levels are the pipe operator, the monadic data transformation, and dependency injection (Cake pattern).]

The first step is to define a trait and a method that describes the transformation of data by a computation unit (element of the workflow).

The pipe operator

Data transformation is the foundation of any workflow for processing and classifying a dataset, training and validating a model, and displaying results. The objective is to define a symbolic representation of the transformation of different types of data without exposing the internal state of the algorithm implementing the data transformation. The pipe operator is used as the signature of a data transformation:

    trait PipeOperator[-T, +U] {
      def |> (data: T): Option[U]
    }

F# reference
The notation |> as the signature of the transform or pipe operator is borrowed from the F# language [2:2]. The data transformation indeed implements a function, and therefore, has the same variance signature as Function[-T, +R] of Scala.

The |> operator transforms data of the type T into data of the type U and returns an option to handle internal errors and exceptions.
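To make the contract of the pipe operator concrete, here is a minimal sketch of two operators. The names PipeOperatorDemo, safeLog, describe, and describeInt are hypothetical and introduced only for illustration; the trait is repeated so that the snippet compiles on its own.

```scala
// The trait from the text, repeated so the sketch is self-contained
trait PipeOperator[-T, +U] {
  def |>(data: T): Option[U]
}

// Hypothetical container object; not part of the book's library
object PipeOperatorDemo {
  // Natural logarithm that reports an invalid input as None instead of throwing
  val safeLog: PipeOperator[Double, Double] = new PipeOperator[Double, Double] {
    def |>(x: Double): Option[Double] =
      if (x > 0.0) Some(math.log(x)) else None
  }

  // An operator accepting any input ...
  val describe: PipeOperator[Any, String] = new PipeOperator[Any, String] {
    def |>(data: Any): Option[String] = Some(data.toString)
  }

  // ... can stand in for an operator on Int because T is contravariant
  val describeInt: PipeOperator[Int, String] = describe
}
```

The caller pattern-matches on the returned option, for example PipeOperatorDemo.safeLog |> 2.0 match { case Some(y) => … case None => … }, so a failed transformation never propagates as an exception.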
Advanced Scala idioms
The next two sections introduce a monadic representation of the data transformation and one implementation of the dependency injection to create a dynamic workflow as an alternative to the delimited continuation pattern. Although these topics may interest advanced Scala developers, they are not required to understand any of the techniques or procedures described in this book.

Monadic data transformation

The next step is to create a monadic design to implement the pipe operator. Let's use a monadic design to wrap _fct, a data transformation function (also known as operator), with the most commonly used Scala higher-order methods:

    class _FCT[+T](val _fct: T) {
      def map[U](c: T => U): _FCT[U] = new _FCT[U](c(_fct))
      def flatMap[U](f: T => _FCT[U]): _FCT[U] = f(_fct)
      def filter(p: T => Boolean): _FCT[T] =
        if (p(_fct)) new _FCT[T](_fct) else zeroFCT(_fct)
      def reduceLeft[U](f: (U, T) => U)(implicit c: T => U): U = f(c(_fct), _fct)
      def foldLeft[U](zero: U)(f: (U, T) => U)(implicit c: T => U): U = f(c(_fct), _fct)
      def foreach(p: T => Unit): Unit = p(_fct)
    }

The methods of the _FCT class represent a subset of the traditional Scala higher-order methods for collections [2:3]. The _FCT class is to be used internally. Arguments are validated by subclasses or containers.

Finally, the Transform class takes a PipeOperator instance as an argument and automatically invokes its operator:

    class Transform[-T, +U](val op: PipeOperator[T, U])
        extends _FCT[Function[T, Option[U]]](op.|>) {
      def |>(data: T): Option[U] = _fct(data)
    }

You may wonder about the reason behind the monadic representation of a data transformation, Transform. You can create any algorithm by just implementing the PipeOperator trait, after all. The reason is that Transform has a richer protocol (methods) and enables developers to create a complex workflow as an alternative to the delimited continuation.
The following code snippet illustrates a generic function composition or data transformation composition using the monadic approach:

    val op = new PipeOperator[Int, Double] {
      def |> (n: Int): Option[Double] = Some(Math.sin(n.toDouble))
    }
    def g(f: Int => Option[Double]): (Int => Long) = {
      (n: Int) => {
        f(n) match {
          case Some(x) => x.toLong
          case None => -1L
        }
      }
    }
    val gof = new Transform[Int, Double](op).map(g(_))

This code extends op, an existing transformation, with another function, g. As stated in the Presentation section under Source code in Chapter 1, Getting Started, code related to exceptions, error checking, and validation of arguments is omitted (refer to the Format of code snippets section in Appendix A, Basic Concepts).

Dependency injection

This section presents the key constructs behind the Cake pattern. A workflow composed of configurable data transformations requires a dynamic modularization (substitution) of the different stages of the workflow. The Cake pattern is an advanced class composition pattern that uses mix-in traits to meet the demands of a configurable computation workflow. It is also known as stackable modification traits [2:4].

This is not an in-depth analysis of the stackable trait injection and self-reference in Scala. There are a few interesting articles on dependency injection that are worth a look [2:5].

Java relies on packages tightly coupled with the directory structure and prefix to modularize the code base. Scala provides developers with a flexible and reusable approach to create and organize modules: traits. Traits can be nested, mixed with classes, stacked, and inherited.

Dependency injection is a fancy name for a reverse lookup and binding to dependencies. Let's consider a simple application that requires data preprocessing, classification, and validation. A simple implementation using traits looks like this:

    val myApp = new Classification with Validation with PreProcessing {
      val filter = ..
    }

If, at a later stage, you need to use an unsupervised clustering algorithm instead of a classifier, then the application has to be rewired:

    val myApp = new Clustering with Validation with PreProcessing {
      val filter = ..
    }

This approach results in code duplication and lack of flexibility. Moreover, the filter class member needs to be redefined for each new class in the composition of the application. The problem arises when there is a dependency between traits used in the composition of the application. Let's consider the case for which the filter depends on the validation methodology.

Mixins linearization [2:6]
The linearization or invocation of methods between mixins follows a right-to-left pattern:
• Trait B extends A
• Trait C extends A
• Class M extends N with C with B
The Scala compiler implements the linearization as follows:
M => B => C => A => N

Although you can define filter as an abstract value, it still has to be redefined each time a new validation type is introduced. The solution is to use the self type in the definition of the newly composed PreProcessingWithValidation trait:

    trait PreProcessingWithValidation extends PreProcessing {
      self: Validation =>
      val filter = ..
    }

The application can then be simply composed as:

    val myApp = new Classification with PreProcessingWithValidation {
      val validation: Validation
    }

Overriding def with val
It is advantageous to declare a member of a base trait as a method rather than as a value. Contrary to a value, which locks the implementation, a method can return a different value for each invocation, and a subtype remains free to implement it either as a method or as a value:

    trait Validator {
      def validation = …
    }
    trait MyValidator extends Validator {
      override val validation = …
    }

In Scala, a method declaration can be overridden by a value definition, not vice versa.

Let's adapt and generalize this pattern to construct a boilerplate template in order to create dynamic computational workflows.
The first step is to generate different modules to encapsulate different types of data transformation.

Workflow modules

The data transformation defined by the PipeOperator instance is dynamically injected into the module by initializing the abstract value. Let's define three parameterized modules representing the preprocessing, processing, and post-processing stages of a workflow:

    trait PreprocModule[-T, +U] { val preProc: PipeOperator[T, U] }
    trait ProcModule[-T, +U] { val proc: PipeOperator[T, U] }
    trait PostprocModule[-T, +U] { val postProc: PipeOperator[T, U] }

The modules (traits) contain only a single abstract value. One characteristic of the Cake pattern is to enforce strict modularity by initializing the abstract values with the type encapsulated in the module, as follows:

    trait ProcModule[-T, +U] {
      val proc: PipeOperator[T, U]
      class Classification[-T, +U] extends PipeOperator[T, U] { }
    }

One of the objectives in building the framework is allowing developers to create data transformations (inherited from PipeOperator) independently from any workflow. Under these constraints, strict modularity is not an option.

Scala traits versus Java packages
There is a major difference between Scala and Java in terms of modularity. Java packages constrain developers into following a strict syntax requirement; for instance, the source file has the same name as the class it contains. Scala modules based on stackable traits are far more flexible.

The workflow factory

The next step is to wire the different modules into a workflow. This is achieved by using the self reference to the stack of the three traits defined in the previous paragraph.
Here is an implementation of the said self reference:

    class Workflow[T, U, V, W] {
      self: PreprocModule[T, U] with ProcModule[U, V] with PostprocModule[V, W] =>
      def |> (data: T): Option[W] = {
        preProc |> data match {
          case Some(input) => {
            proc |> input match {
              case Some(output) => postProc |> output
              case None => { … }
            }
          }
          case None => { … }
        }
      }
    }

Quite simple indeed! If you need only two modules, you can either create a workflow with a stack of two traits or initialize the third with the PipeOperator identity:

    def identity[T] = new PipeOperator[T, T] {
      override def |> (data: T): Option[T] = Some(data)
    }

Let's test the wiring with the following simple data transformations:

    class Sampler(val samples: Int) extends PipeOperator[Double => Double, DblVector] {
      override def |> (f: Double => Double): Option[DblVector] =
        Some(Array.tabulate(samples)(n => f(n.toDouble/samples)))
    }
    class Normalizer extends PipeOperator[DblVector, DblVector] {
      override def |> (data: DblVector): Option[DblVector] =
        Some(Stats[Double](data).normalize)
    }
    class Reducer extends PipeOperator[DblVector, Int] {
      override def |> (data: DblVector): Option[Int] =
        Range(0, data.size).find(data(_) == 1.0)
    }

The first operator, Sampler, samples a function, f, at intervals of 1/samples over the interval [0, 1]. The second operator, Normalizer, normalizes the data over the range [0, 1] using the Stats class introduced in the Basic statistics section in Chapter 1, Getting Started. The last operator, Reducer, extracts the index of the largest sample (value 1.0) using the Scala collection method, find.
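The three operators depend on DblVector and Stats, which are defined elsewhere in the book. The following sketch wires the same stack in a self-contained form under two assumptions: DblVector is taken to be Array[Double], and the normalizer is replaced by a plain min-max scaling that stands in for Stats. The object name WorkflowDemo is hypothetical, and the nested pattern matching is written with flatMap for brevity.

```scala
trait PipeOperator[-T, +U] { def |>(data: T): Option[U] }

trait PreprocModule[-T, +U] { val preProc: PipeOperator[T, U] }
trait ProcModule[-T, +U] { val proc: PipeOperator[T, U] }
trait PostprocModule[-T, +U] { val postProc: PipeOperator[T, U] }

// The self type requires the three modules to be mixed in at instantiation
class Workflow[T, U, V, W] {
  self: PreprocModule[T, U] with ProcModule[U, V] with PostprocModule[V, W] =>
  // Each stage runs only if the previous one returned a value
  def |>(data: T): Option[W] =
    (preProc |> data).flatMap(u => proc |> u).flatMap(v => postProc |> v)
}

// Hypothetical demo object; DblVector = Array[Double] is an assumption
object WorkflowDemo {
  type DblVector = Array[Double]

  class Sampler(samples: Int) extends PipeOperator[Double => Double, DblVector] {
    def |>(f: Double => Double): Option[DblVector] =
      Some(Array.tabulate(samples)(n => f(n.toDouble / samples)))
  }

  // Min-max scaling to [0, 1]; a stand-in for the book's Stats class
  class Normalizer extends PipeOperator[DblVector, DblVector] {
    def |>(data: DblVector): Option[DblVector] =
      if (data.isEmpty) None
      else {
        val lo = data.min
        val hi = data.max
        if (hi == lo) None else Some(data.map(x => (x - lo) / (hi - lo)))
      }
  }

  // Index of the first sample equal to 1.0, that is, the maximum after scaling
  class Reducer extends PipeOperator[DblVector, Int] {
    def |>(data: DblVector): Option[Int] = data.indices.find(data(_) == 1.0)
  }

  val dataflow = new Workflow[Double => Double, DblVector, DblVector, Int]
      with PreprocModule[Double => Double, DblVector]
      with ProcModule[DblVector, DblVector]
      with PostprocModule[DblVector, Int] {
    val preProc: PipeOperator[Double => Double, DblVector] = new Sampler(100)
    val proc: PipeOperator[DblVector, DblVector] = new Normalizer
    val postProc: PipeOperator[DblVector, Int] = new Reducer
  }
}
```

In this sketch, a constant input function makes the normalizer return None, and the workflow as a whole then returns None without invoking the last stage.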
A picture is worth a thousand words; the following UML class diagram illustrates the workflow factory design pattern:

[Figure: UML class diagram of the workflow factory. Workflow is composed of PreprocModule, ProcModule, and PostprocModule, whose abstract values preProc, proc, and postProc are of type PipeOperator and are implemented by Sampler, Normalizer, and Reducer.]

Finally, the workflow is instantiated by dynamically initializing the abstract values, preProc, proc, and postProc, with a transformation of the type PipeOperator as long as the signature (input and output types) matches the parameterized types defined in each module (lines marked as 1):

    val dataflow = new Workflow[Double => Double, DblVector, DblVector, Int]
        with PreprocModule[Double => Double, DblVector]
        with ProcModule[DblVector, DblVector]
        with PostprocModule[DblVector, Int] {
      val preProc: PipeOperator[Double => Double, DblVector] = new Sampler(100) //1
      val proc: PipeOperator[DblVector, DblVector] = new Normalizer //1
      val postProc: PipeOperator[DblVector, Int] = new Reducer //1
    }
    dataflow |> ((x: Double) => Math.log(x+1.0)+Random.nextDouble) match {
      case Some(index) => …

Scala's strong type checking catches any inconsistent data types at compilation time. It reduces the development cycle because runtime errors are more difficult to track down.

Examples of workflow components

It is difficult to build an example of workflow using classes and algorithms introduced later in the book. The modularization of the preprocessing and clustering stages is briefly described here to illustrate the encapsulation of algorithms described throughout the book within a workflow.

The preprocessing module

The following examples of a workflow module use the time series class, XTSeries, which is used throughout the book:

    class XTSeries[T](label: String, arr: Array[T])

The XTSeries class takes an identifier, a label, and an array of parameterized values, arr, as parameters, and is formally described in Chapter 3, Data Preprocessing.
The preprocessing algorithms such as moving average or discrete Fourier filters are encapsulated into a preprocessing module using a combination of abstract value and inheritance:

    trait PreprocessingModule[T] {
      val preprocessor: Preprocessing[T] //1
      abstract class Preprocessing[T] { //2
        def execute(xt: XTSeries[T]): Unit
      }
      abstract class MovingAverage[T] extends Preprocessing[T]
          with PipeOperator[XTSeries[T], XTSeries[Double]] { //3
        override def execute(xt: XTSeries[T]): Unit = this |> xt match {
          case Some(filteredData) => …
          case None => …
        }
      }
      class SimpleMovingAverage[@specialized(Double) T <% Double](period: Int)(implicit num: Numeric[T])
          extends MovingAverage[T] {
        override def |> (xt: XTSeries[T]): Option[XTSeries[Double]] = …
      }
      class DFTFir[T <% Double](g: Double => Double) extends Preprocessing[T]
          with PipeOperator[XTSeries[T], XTSeries[Double]] {
        override def execute(xt: XTSeries[T]): Unit = this |> xt match {
          case Some(filteredData) => …
          case None => …
        }
        override def |> (xt: XTSeries[T]): Option[XTSeries[Double]] = …
      }
    }

The preprocessing module, PreprocessingModule, defines preprocessor, an abstract value, that is initialized at runtime (line 1). The Preprocessing class is defined as a high-level abstract class with a generic execution function: execute (line 2). The preprocessing algorithms, in this case the moving average filter, MovingAverage, and the discrete Fourier filter, DFTFir, are defined as a class hierarchy with the base type Preprocessing. Each filtering class also implements PipeOperator so it can be woven into a simpler data transformation workflow (line 3).

The preprocessing algorithms are described in the next chapter.
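Because XTSeries is only introduced in the next chapter, the following reduced sketch replays the structure of the module on plain Vector[Double] data. It is an illustration under simplifying assumptions rather than the book's implementation: the type parameter, the execute method, and the Fourier filter are left out, and the name PreprocessingDemo is hypothetical.

```scala
trait PipeOperator[-T, +U] { def |>(data: T): Option[U] }

// Reduced module: Vector[Double] stands in for XTSeries[T]
trait PreprocessingModule {
  // Abstract value, initialized by whoever mixes in or extends the module
  val preprocessor: Preprocessing

  abstract class Preprocessing extends PipeOperator[Vector[Double], Vector[Double]]

  // Unweighted moving average over a sliding window of length period
  class SimpleMovingAverage(period: Int) extends Preprocessing {
    def |>(xs: Vector[Double]): Option[Vector[Double]] =
      if (period <= 0 || xs.size < period) None
      else Some(xs.sliding(period).map(w => w.sum / period).toVector)
  }
}

// The filter is selected where the module is instantiated, not inside it
object PreprocessingDemo extends PreprocessingModule {
  val preprocessor: Preprocessing = new SimpleMovingAverage(3)
}
```

Swapping the filter only requires a different initialization of preprocessor; the code that calls PreprocessingDemo.preprocessor |> data is unchanged.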
The clustering module

The encapsulation of clustering techniques is the second example of a module for dependency-injection-based workflow:

    trait ClusteringModule[T] {
      type EMOutput = List[(Double, DblVector, DblVector)]
      val clustering: Clustering[T] //1
      abstract class Clustering[T] {
        def execute(xt: XTSeries[Array[T]]): Unit
      }
      class KMeans[T <% Double](K: Int, maxIters: Int, distance: (DblVector, Array[T]) => Double)(implicit order: Ordering[T], m: Manifest[T])
          extends Clustering[T]
          with PipeOperator[XTSeries[Array[T]], List[Cluster[T]]] {
        override def |> (xt: XTSeries[Array[T]]): Option[List[Cluster[T]]] = …
        override def execute(xt: XTSeries[Array[T]]): Unit = this |> xt match {
          case Some(clusters) => …
          case None => …
        }
      }
      class MultivariateEM[T <% Double](K: Int) extends Clustering[T]
          with PipeOperator[XTSeries[Array[T]], EMOutput] {
        override def |> (xt: XTSeries[Array[T]]): Option[EMOutput] = …
        override def execute(xt: XTSeries[Array[T]]): Unit = this |> xt match {
          case Some(emOutput) => …
          case None => …
        }
      }
    }

The ClusteringModule clustering module defines an abstract value, clustering, which is initialized at runtime (line 1). The two clustering algorithms, KMeans and Expectation-Maximization, MultivariateEM, inherit the Clustering base class. The clustering technique can be used in:

• A dependency-injection-based workflow by overriding execute
• A simpler data transformation flow by overriding PipeOperator (|>)

The clustering techniques are described in Chapter 4, Unsupervised Learning.

Dependency-injection-based workflow versus data transformation
The data transformation PipeOperator trades flexibility for simplicity. The design proposed for preprocessing and clustering techniques allows you to use both approaches. The techniques presented in the book implement the basic data transformation, PipeOperator, in order to keep the implementation of these techniques as simple as possible.
Assessing a model

Evaluating a model is an essential part of the workflow. There is no point in creating the most sophisticated model if you do not have the tools to assess its quality. The validation process consists of defining some quantitative reliability criteria, setting a strategy such as an N-Fold cross-validation scheme, and selecting the appropriate labeled data.

Validation

The purpose of this section is to create a Scala class to be used in future chapters for validating models. For starters, the validation process relies on a set of metrics to quantify the fitness of a model generated through training.

Key metrics

Let's consider a simple classification model with two classes defined as positive (respectively, negative), represented in black (respectively, white) in the following diagram. Data scientists use the following terminology:

• True positives (TP): These are observations that are correctly labeled as belonging to the positive class (white dots on a dark background)
• True negatives (TN): These are observations that are correctly labeled as belonging to the negative class (black dots on a light background)
• False positives (FP): These are observations incorrectly labeled as belonging to the positive class (white dots on a dark background)
• False negatives (FN): These are observations incorrectly labeled as belonging to the negative class (black dots on a light background)

[Figure: Categorization of validation results into true negatives (TN), false negatives (FN), false positives (FP), and true positives (TP)]

This simplistic representation can be extended to classification problems that involve more than two classes. For instance, false positives are defined as observations incorrectly labeled that belong to any class other than the correct one. These four factors are used for evaluating accuracy, precision, recall, and F and G measures:

• Accuracy: Represented as ac, this is the percentage of observations correctly classified.
• Precision: Represented as p, this is the percentage of observations correctly classified as positive in the group that the classifier has declared positive.
• Recall: Represented as r, this is the percentage of observations labeled as positive that are correctly classified.
• F-Measure or F-score F1: This is the score of a test's accuracy that strikes a balance between precision and recall. It is computed as the harmonic mean of the precision and recall with values ranging between 0 (worst score) and 1 (best score).
• G-measure: Represented as G, this is similar to the F-measure but is computed as the geometric mean of precision p and recall r.

$$ac = \frac{TP + TN}{TP + TN + FP + FN} \qquad p = \frac{TP}{TP + FP} \qquad r = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2pr}{p + r} \qquad G = \sqrt{pr}$$

Implementation

Let's implement the validation formulas using the same trait-based modular design used in creating the preprocessor and classifier modules. The Validation trait defines the signature for the validation of a classification model: the computation of the F1 statistics and the precision-recall pair:

    trait Validation {
      def f1: Double
      def precisionRecall: (Double, Double)
    }

Let's provide a default implementation of the Validation trait with the F1Validation class. In the tradition of Scala programming, the class is immutable; it computes the counters for TP, TN, FP, and FN when the class is instantiated.
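Before turning to the class itself, the five formulas can be checked numerically with a small sketch. The Counts case class and the MetricsDemo object are hypothetical helpers introduced here for illustration only, and the counts used in the usage note are made-up values, not results from the book.

```scala
// Hypothetical helper that evaluates the formulas from the four counters
object MetricsDemo {
  final case class Counts(tp: Int, tn: Int, fp: Int, fn: Int) {
    def accuracy: Double = (tp + tn).toDouble / (tp + tn + fp + fn)
    def precision: Double = tp.toDouble / (tp + fp)
    def recall: Double = tp.toDouble / (tp + fn)
    // Harmonic mean of precision and recall
    def f1: Double = 2.0 * precision * recall / (precision + recall)
    // Geometric mean of precision and recall
    def g: Double = math.sqrt(precision * recall)
  }
}
```

With the illustrative counts TP = 40, TN = 45, FP = 10, and FN = 5, the accuracy is 85/100 = 0.85, the precision is 40/50 = 0.8, and the recall is 40/45, roughly 0.889. The F1 score simplifies to 2TP/(2TP + FP + FN) = 80/95, roughly 0.842, which lies between the precision and the recall, as expected from a harmonic mean.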
The class takes two parameters:

• The array of actual versus expected class: actualExpected
• The target class for true positive observations: tpClass

    class F1Validation(actualExpected: Array[(Int, Int)], tpClass: Int) extends Validation {
      val counts = actualExpected.foldLeft(new Counter[Label])((cnt, oSeries) =>
        cnt + classify(oSeries._1, oSeries._2))

      lazy val accuracy = {
        val num = counts(TP) + counts(TN)
        num.toDouble/counts.foldLeft(0)((s, kv) => s + kv._2)
      }
      lazy val precision = counts(TP).toDouble/(counts(TP) + counts(FP))
      lazy val recall = counts(TP).toDouble/(counts(TP) + counts(FN))

      override def f1: Double = 2.0*precision*recall/(precision + recall)
      override def precisionRecall: (Double, Double) = (precision, recall)

      def classify(actual: Int, expected: Int): Label = {
        if (actual == expected) { if (actual == tpClass) TP else TN }
        else { if (actual == tpClass) FP else FN }
      }
    }

The precision and recall variables are defined as lazy so they are computed only once, when they are either accessed for the first time or the f1 and precisionRecall functions are invoked. The class is independent of the selected machine learning algorithm, the training, the labeling process, and the type of observations.

Contrary to Java, which defines an enumerator as a class of types, Scala requires enumerators to be singletons that inherit the functionality of the Enumeration class:

    object Label extends Enumeration {
      type Label = Value
      val TP, TN, FP, FN = Value
    }

K-fold cross-validation

It is quite common that the labeled dataset used for both training and validation is not large enough. The solution is to break the original labeled dataset into K data groups. The data scientist creates K training-validation datasets by selecting one of the groups as a validation set then combining all other remaining groups into a training set as illustrated in the next diagram. The process is known as the K-fold cross validation [2:7].

[Figure: K-fold cross-validation. The dataset is split into segments S1, S2, S3, S4, ..., SK; the training set combines S1, S2, S4, ..., SK and the validation set is S3.]

The third segment is used as validation data and all other dataset segments except S3 are combined into a single training set. This process is applied to each segment of the original labeled dataset.

Bias-variance decomposition

There is an obvious challenge in creating a model that fits both the training set and subsequent observations to be classified during the validation phase.

If the model tightly fits the observations selected for training, there is a high probability that new observations may not be correctly classified. This is usually the case when the model is complex. This model is characterized as having a low bias with a high variance. Such a scenario can be attributed to the fact that the scientist is overly confident that the observations he or she selected for training are representative of the real world.

The probability of a new observation being classified as belonging to a positive class increases as the selected model fits the training set loosely. In this case, the model is characterized as having a high bias with a low variance.

The bias, variance, and mean squared error (MSE) of the distribution are defined by the following formulas.

Variance and bias for a true model, θ, where \hat{\theta} is the estimate of θ:

$$var(\hat{\theta}) = E\left[\left(\hat{\theta} - E[\hat{\theta}]\right)^2\right] \qquad bias(\hat{\theta}) = \hat{\theta} - \theta$$

Mean square error:

$$MSE = var(\hat{\theta}) + bias(\hat{\theta})^2$$

Let's illustrate the concept of bias, variance, and mean square error with an example. At this stage, most of the machine learning techniques have not been introduced yet. Therefore, the example will emulate multiple models fEst: Double => Double generated from non-overlapping training sets. These models are evaluated against a test/validation dataset that is emulated by a model, emul. The BiasVarianceEmulator emulator class takes the emulator function and the size of the validation test, nValues, as parameters.
It merely implements the formula to compute the bias and variance for each of the fEst models:

    class BiasVarianceEmulator[T <% Double](emul: Double => Double, nValues: Int) {
      def fit(fEst: List[Double => Double]): Option[XYTSeries] = {
        val rf = Range(0, fEst.size)
        val meanFEst = Array.tabulate(nValues)( x =>
          rf.foldLeft(0.0)((s, n) => s+fEst(n)(x))/fEst.size) // 1
        val r = Range(0, nValues)
        Some(fEst.map(fe => {
          r.foldLeft((0.0, 0.0))((s, x) => {
            val diff = (fe(x) - meanFEst(x))/fEst.size // 2
            (s._1 + diff*diff, s._2 + Math.abs(fe(x)-emul(x)))
          })
        }).toArray)
      }
    }

The fit method computes the variance and bias for each of the fEst models generated from training. First, the mean of all the models is computed (line 1), and then used in the computation of the variance and bias. The method returns a tuple (variance, bias) for each of the fEst models. Let's apply the emulator to three nonlinear regression models evaluated against validation data:

$$y = \frac{x}{5}, \qquad y = 0.0003x^2 + 0.18x \qquad \text{and} \qquad y = \frac{x}{5}\left(1 + \sin\left(\frac{x}{20}\right)\right)$$

The client code for the emulator consists of defining the emul emulator function, and a list, fEst, of three models defined as tuples of (function, descriptor) of type (Double=>Double, String). The fit method is called on the model functions extracted through a map, as shown in the following code:

    val emul = (x: Double) => 0.2*x*(1.0 + Math.sin(x*0.05))
    val fEst = List[(Double=>Double, String)] (
      ((x: Double) => 0.2*x, "y=x/5"),
      ((x: Double) => 0.0003*x*x + 0.18*x, "y=3e-4.x^2+0.18x"),
      ((x: Double) => 0.2*x*(1+Math.sin(x*0.05)), "y=x(1+sin(x/20))/5"))
    val emulator = new BiasVarianceEmulator[Double](emul, 200)
    emulator.fit(fEst.map( _._1)) match {
      case Some(varBias) => show(varBias)
      case None => …
    }

The JFreeChart library is used to display the test dataset and the three model functions.
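The emulator depends on XYTSeries and on a show function defined elsewhere. As a self-contained variant, the following sketch returns a plain list of (variance, bias) pairs for the same three models. The object name BiasVarianceDemo and the Model alias are hypothetical, and the normalization is an assumption of this sketch: both quantities are averaged over the nValues sample points, which differs from the scaling used in the listing above.

```scala
// Hypothetical, self-contained variant of the emulator
object BiasVarianceDemo {
  type Model = Double => Double

  // Returns one (variance, absolute bias) pair per model
  def fit(emul: Model, models: List[Model], nValues: Int): List[(Double, Double)] = {
    val xs = Vector.tabulate(nValues)(_.toDouble)
    // Mean prediction of all the models at each sample point
    val meanModel = xs.map(x => models.map(f => f(x)).sum / models.size)
    models.map { f =>
      // Mean squared deviation from the mean model
      val variance =
        xs.zip(meanModel).map { case (x, m) => math.pow(f(x) - m, 2) }.sum / nValues
      // Mean absolute deviation from the emulated validation data
      val bias = xs.map(x => math.abs(f(x) - emul(x))).sum / nValues
      (variance, bias)
    }
  }

  val emul: Model = x => 0.2 * x * (1.0 + math.sin(x * 0.05))
  val models: List[Model] = List(
    x => 0.2 * x,
    x => 0.0003 * x * x + 0.18 * x,
    x => 0.2 * x * (1.0 + math.sin(x * 0.05))
  )
  val varBias: List[(Double, Double)] = fit(emul, models, 200)
}
```

Because the third model is the same function as emul, its bias is zero by construction, which gives a quick sanity check on the computation.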
[Figure: Fitting models to dataset]

The variance-bias trade-off is illustrated in the following scatter chart using the absolute value of the bias:

[Figure: Scatter chart of variance versus absolute bias for the three models]

The more complex the function, the lower the bias is. A low bias is usually, but not always, associated with a high variance. The most complex function y=x(1+sin(x/20))/5 has by far the highest variance and the lowest bias. The more complex model matches fairly well with the training dataset. As expected, the mean square error reflects the ability of each of the three models to fit the test data.

[Figure: Mean square error bar chart]

The low bias of the complex model reflects in its ability to predict new observations correctly. Its MSE is therefore low, as expected. Complex models with low bias and high variance are said to overfit. Models with high bias and low variance are said to underfit.

Overfitting

The methodology presented in the example can be applied to any classification and regression model. The list of models with low variance includes constant functions and models independent of the training set. High-degree polynomials, complex functions, and deep neural networks have high variance. Linear regression applied to linear data has a low bias, while linear regression applied to nonlinear data has a higher bias [2:8].
Overfitting affects all aspects of the modeling process negatively, for example:

• It is a sure sign of an overly complex model, which is difficult to debug and consumes computation resources
• It makes the model represent minor fluctuations and noise
• It may discover irrelevant relationships between observed and latent features
• It has poor predictive performance

However, there are well-proven solutions to reduce overfitting [2:9]:

• Increasing the size of the training set whenever possible
• Reducing noise in labeled and input data through filtering
• Decreasing the number of features using techniques such as principal components analysis
• Modeling observable and latent noise using filtering techniques such as Kalman or autoregressive models
• Reducing inductive bias in a training set by applying cross-validation
• Penalizing extreme values for some of the model's features using regularization techniques

Summary

In this chapter, we established the framework for the different data processing units that will be introduced in this book. There is a very good reason why the topics of model validation and overfitting are explored early on in this book. There is no point in building models and selecting algorithms if we do not have a methodology to evaluate their relative merits.

In this chapter, you were introduced to:

• The versatility and cleanness of the Cake pattern in Scala as an effective scaffolding tool for data processing
• The concept of pipe operator for data conversion
• A robust methodology to validate machine learning models
• The challenge in fitting models to both training and real-world data

The next chapter will address the problem of overfitting by penalizing outliers, modeling, and eliminating noise in data.

Data Preprocessing

Real-world data is usually noisy and inconsistent with missing observations. No classification, regression, or clustering model can extract relevant information from unprocessed data.
Data preprocessing consists of cleaning, filtering, transforming, and normalizing raw observations using statistics in order to correlate features or groups of features, identify trends and models, and filter out noise. The purpose of cleansing raw data is twofold:

• Extract some basic knowledge from raw datasets
• Evaluate the quality of data and generate clean datasets for unsupervised or supervised learning

You should not underestimate the power of traditional statistical analysis methods to infer and classify information from textual or unstructured data.

In this chapter, you will learn how to:

• Apply commonly used moving average techniques to detect long-term trends in a time series
• Identify market and sector cycles using discrete Fourier series
• Leverage the Kalman filter to extract the state of a dynamic system from incomplete and noisy observations

Time series

The overwhelming majority of examples used to illustrate the different machine learning algorithms in this book process time series or sequential, ordered, or unordered data.

Each library has its own container type to manipulate datasets. The challenge is to define all possible conversions between types from different libraries needed to implement a large variety of machine learning models. Such a strategy may result in a combinatorial explosion of implicit conversions. A solution consists of creating a generic class to manage conversion from and to any type used by a third-party library.

scala.collection.JavaConversions._
Scala provides a standard package to convert collection types from Scala to Java and vice versa.
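As a brief illustration of such a conversion, the following sketch round-trips a Java list through a Scala collection. It uses scala.jdk.CollectionConverters, which provides explicit asScala and asJava methods in Scala 2.13 and later, where the JavaConversions object named above is no longer available (it was deprecated in Scala 2.12 and removed in 2.13). The object name ConversionDemo and the sample values are made up for the example.

```scala
import scala.jdk.CollectionConverters._

// Hypothetical demo object; the values are arbitrary
object ConversionDemo {
  // A Java container, as a third-party Java library might return it
  val javaList = new java.util.ArrayList[Double]()
  javaList.add(1.0)
  javaList.add(2.5)
  javaList.add(4.0)

  // Java to Scala: wrap with asScala, then copy into an immutable Vector
  val scalaVector: Vector[Double] = javaList.asScala.toVector

  // Scala to Java: expose a transformed Scala sequence as a java.util.List
  val backToJava: java.util.List[Double] = scalaVector.map(_ * 2.0).asJava
}
```

The explicit asScala and asJava calls make each conversion visible at the call site, which helps to contain the combinatorial explosion of implicit conversions mentioned above.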
The generic data transformation, DT, can be used to transform any XTSeries time series:

    class DT[T,U] extends PipeOperator[XTSeries[T], XTSeries[U]] {
      override def |> : PartialFunction[XTSeries[T], XTSeries[U]]
    }

Let's consider the simple case of using a Java library, the Apache Commons Math framework, and JFreeChart for visualization, and define a parameterized time series class, XTSeries[T]. The |> data transformation converts a time series of values of type T, XTSeries[T], into a time series of values of type U, XTSeries[U]. The following diagram provides an overview of type conversion in data transformation:

[Figure: type conversions around the |> transform from XTSeries[T] to XTSeries[U], involving Java arrays (int[], double[]), Scala collections (List, Vector, Array), org.apache.commons.math3 (RealMatrix, RealVector, Array2DRowRealMatrix, ArrayRealVector), org.scalaml.core.Types (DblVector, DblMatrix), and org.jfree.data (KeyedValues, Values)]

Let's create the XTSeries class. As a container, the class should implement the Scala higher-order collection functions such as map, foreach, or zip. The class should support at least conversion to the DblVector and DblMatrix types introduced in the first chapter. Here is a partial implementation of the XTSeries class.
Comments, exceptions, argument validations, and debugging code are omitted in the code:

    class XTSeries[@specialized(Double) T](label: String, arr: Array[T]) { // 1
      def apply(n: Int): T = arr.apply(n)

      @implicitNotFound("Undefined conversion to DblVector") // 2
      def toDblVector(implicit f: T => Double): DblVector = arr.map(f(_))

      @implicitNotFound("Undefined conversion to DblMatrix") // 2
      def toDblMatrix(implicit fv: T => DblVector): DblMatrix = arr.map(fv(_))

      def + (n: Int, t: T)(implicit f: (T,T) => T): T = f(arr(n), t)
      def head: T = arr.head // 3
      def drop(n: Int): XTSeries[T] = XTSeries(label, arr.drop(n))
      def map[U: ClassTag](f: T => U): XTSeries[U] = XTSeries[U](label, arr.map(x => f(x)))
      def foreach(f: T => Unit) = arr.foreach(f) // 3
      def sortWith(lt: (T,T) => Boolean): XTSeries[T] = XTSeries[T](label, arr.sortWith(lt))
      def max(implicit cmp: Ordering[T]): T = arr.max // 4
      def min(implicit cmp: Ordering[T]): T = arr.min
      …
    }

The class takes an optional label and an invariant array of the parameterized type T. The annotation @specialized (line 1) instructs the compiler to generate two versions of the class:
• A generic XTSeries[T] class that exploits all the implicit conversions required to perform operations on time series of a generic type
• An optimized XTSeries[Double] class that bypasses the conversion and offers the client code a faster implementation

The conversion to DblVector (resp. DblMatrix) relies on the implicit conversion of elements to type Double (resp. DblVector) (line 2). The @implicitNotFound annotation instructs the compiler to emit the specified error message if no implicit conversion is detected. The conversion methods are used to implement the implicit conversion introduced in the previous section. These methods are defined in the singleton org.scalaml.core.Types.CommonsMath library.
The following code shows the implementation of the conversion methods:

    object Types {
      object CommonMath {
        implicit def series2DblVector[T](xt: XTSeries[T])(implicit f: T => Double): DblVector =
          xt.toDblVector(f)
        implicit def series2DblMatrix[T](xt: XTSeries[T])(implicit f: T => DblVector): DblMatrix =
          xt.toDblMatrix(f)
        …
      }
    }

The XTSeries code snippet exposes a subset of the Scala higher-order collection methods (line 3) applied to the time series. The computation of the minimum and maximum values in the time series requires that the cmp ordering/compare method be defined for the elements of the type T (line 4). Let's put our versatile XTSeries class to use in creating a basic preprocessing data transformation, starting with the ubiquitous moving average techniques.

Moving averages
Moving averages provide data analysts and scientists with a basic predictive model. Despite its simplicity, the moving average method is widely used in the technical analysis of financial markets to define a dynamic level of support and resistance for the price of a given security.

Let's consider a time series x_t = x(t) and a function f(x_{t-p+1}, ..., x_t) that reduces the last p observations into a value or average. The prediction or estimation of the observation at t+1 is defined by the following formula:

$$\tilde{x}_{t+1} = f(x_{t-p+1}, \ldots, x_t)$$

Here, f is an average reducing function applied to the previous p data points.

The simple moving average
The simple moving average, a smoothing method, is the simplest form of the moving average algorithms [3:1].
The simple moving average of a time series {x_t} with a period p estimates the value at time t as the average of the last p observations:

$$\tilde{x}_t = \frac{1}{p} \sum_{j=t-p+1}^{t} x_j$$

The computation is implemented iteratively using the following formula (1):

$$\tilde{x}_t = \tilde{x}_{t-1} + \frac{x_t - x_{t-p}}{p} \quad \forall t \ge p; \qquad \tilde{x}_t = 0 \quad \forall t < p$$

Here, \tilde{x}_t is the estimate or simple moving average value at time t.

Let's build a class hierarchy of moving average algorithms, with the abstract parameterized class MovingAverage[T <% Double] as its root. We use the generic time series class, XTSeries[T], introduced in the first section and the generic pipe operator, |>, introduced in the previous chapter:

    abstract class MovingAverage[T <% Double]
      extends PipeOperator[XTSeries[T], XTSeries[Double]]

The pipe operator for the SimpleMovingAverage class implements the iterative formula (1) for the computation of the simple moving average. The override keyword is omitted:

    class SimpleMovingAverage[@specialized(Double) T <% Double](val period: Int)
        (implicit num: Numeric[T]) extends MovingAverage[T] {

      def |> : PartialFunction[XTSeries[T], XTSeries[Double]] = {
        case xt: XTSeries[T] if(xt != null && xt.size > 0) => {
          // Pairs (x(t-p), x(t)) used by the iterative formula
          val slider = xt.take(xt.size-period).zip(xt.drop(period)) //1
          val a0 = xt.take(period).toArray.sum/period //2
          var a: Double = a0
          val z = Array[Array[Double]](
            Array.fill(period)(0.0),
            Array(a0),
            slider.map(x => {
              a += (x._2 - x._1)/period
              a
            })
          ).flatten //3
          XTSeries[Double](z)
        }
      }
    }

The class is parameterized for the type of elements of the input time series. After all, we do not have control over the source of the input data. The type for the elements of the output time series is Double. The class has a type T and is specialized for the Double type for faster processing. The implicitly defined num: Numeric[T] is required by the arithmetic operators sum and / (line 2).
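The iterative formula (1) can also be sketched on a plain Array[Double], independently of XTSeries and PipeOperator. The sketch below is an illustration under stated assumptions, not the book's implementation: the object name SmaSketch is hypothetical, and the first average is stored at index period - 1 (the position of the last observation it covers), with zeros before it.

```scala
// Hypothetical, self-contained sketch of the iterative simple moving average.
object SmaSketch {
  /** Simple moving average of xs; entries before index period - 1 are set to 0.0. */
  def sma(xs: Array[Double], period: Int): Array[Double] = {
    require(period > 0 && period <= xs.length, "period must be in [1, xs.length]")
    val out = new Array[Double](xs.length)
    // Average of the first `period` observations
    var avg = xs.take(period).sum / period
    out(period - 1) = avg
    // Iterative update: add the newest observation, drop the oldest one
    for (t <- period until xs.length) {
      avg += (xs(t) - xs(t - period)) / period
      out(t) = avg
    }
    out
  }
}
```

For example, SmaSketch.sma(Array(1.0, 2.0, 3.0, 4.0, 5.0), 3) averages the windows (1, 2, 3), (2, 3, 4), and (3, 4, 5). Each update costs a constant number of operations regardless of the period.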
The implementation of SimpleMovingAverage has a few interesting elements. First, the set of observations is duplicated and the index in the clone is shifted by p observations before being zipped with the original into an array of pairs of values, slider (line 1):

[Figure: The sliding algorithm to compute moving averages. The series X0 ... Xn-1 is paired with a copy shifted by p observations (sliding pairs) to produce the moving averages 0, ..., 0, ap, ..., ai, ..., an]

The average value is initialized with the average of the first p data points. The first p values of the trend are initialized as an array of p zero values. It is concatenated with the first average value and the array containing the remaining average values. Finally, the array of three arrays is flattened (flatten) into a single array containing the average values (line 3).

The weighted moving average
The weighted moving average method extends the simple moving average by computing the weighted average of the last p observations [3:2]. The weights α_j are assigned to each of the last p data points x_j, and are normalized by the sum of the weights.

The weighted moving average of a series {x_t} with a period p and a normalized weights distribution {α_j} is given by the following formula (2):

$$\tilde{x}_t = \sum_{j=t-p+1}^{t} \alpha_{j-t+p-1}\, x_j; \qquad \sum_{j=0}^{p-1} \alpha_j = 1$$

Here, \tilde{x}_t is the estimate or weighted moving average value at time t.

The implementation of the WeightedMovingAverage class requires the computation of the last p data points.
There is no simple iterative formula to compute the weighted moving average at time t+1 using the moving average at time t:

    class WeightedMovingAverage[@specialized(Double) T <% Double](val weights: DblVector)
        extends MovingAverage[T] {

      def |> : PartialFunction[XTSeries[T], XTSeries[Double]] = {
        case xt: XTSeries[T] if(xt != null && xt.size > 1) => {
          val smoothed = Range(weights.size, xt.size).map(i => {
            xt.toArray.slice(i - weights.size, i)
              .zip(weights)
              .foldLeft(0.0)((s, x) => s + x._1*x._2)
          }) //1
          XTSeries[Double](Array.fill(weights.size)(0.0) ++ smoothed) //2
        }
      }
    }

As with the simple moving average, an array of p initial values set to 0 is concatenated (line 2) with the weighted moving average values computed using a map (line 1). The period for the weighted moving average is implicitly defined as weights.size.

The exponential moving average
The exponential moving average is widely used in financial analysis and marketing surveys because it favors the latest values. The older the value, the less impact it has on the moving average value at time t [3:3].

The exponential moving average on a series {x_t} and a smoothing factor α is computed by the following iterative formula:

$$\tilde{x}_t = (1 - \alpha)\, \tilde{x}_{t-1} + \alpha\, x_t \quad \forall t > 0; \qquad \tilde{x}_0 = x_0$$

Here, \tilde{x}_t is the value of the exponential average at t.

The implementation of the ExpMovingAverage class is rather simple.
There are two constructors, one for a user-defined smoothing factor and one for the Nyquist period, p, used to compute the smoothing factor alpha = 2/(p+1):

    class ExpMovingAverage[@specialized(Double) T <% Double](val alpha: Double)
        extends MovingAverage[T] {

      def |> : PartialFunction[XTSeries[T], XTSeries[Double]] = {
        case xt: XTSeries[T] if(xt != null && xt.size > 1) => {
          val alpha_1 = 1-alpha
          var y: Double = xt(0)
          xt.map(x => {
            val z = x*alpha + y*alpha_1
            y = z
            z
          })
        }
      }
    }

The version of the constructor that uses the Nyquist period p is implemented using the Scala apply method. Note the floating-point literal 2.0, which avoids an integer division:

    def apply[T <% Double](nyquist: Int): ExpMovingAverage[T] =
      new ExpMovingAverage[T](2.0/(nyquist + 1))

Let's compare the results generated from these three moving average methods with the original price. We use a data source, DataSource, to load the historical daily closing stock price of Bank of America (BAC), and a data sink, DataSink, to store the results. The DataSource and DataSink classes are defined in the Data extraction section in Appendix A, Basic Concepts. The comparison of results can be done using the following code:

    val p_2 = p >> 1
    val w = Array.tabulate(p)(n =>
      if(n == p_2) 1.0 else 1.0/(Math.abs(n-p_2)+1)) //1
    val weights = w map { _ / w.sum } //2

    val src = DataSource("resources/data/chap3/BAC.csv", false) //3
    val price = src |> YahooFinancials.adjClose //4
    val sMvAve = SimpleMovingAverage(p)
    val wMvAve = WeightedMovingAverage(weights)
    val eMvAve = ExpMovingAverage(p)

    val results = price :: sMvAve.|>(price) :: wMvAve.|>(price) ::
      eMvAve.|>(price) :: List[XTSeries[Double]]() //5
    val outFile = "output/chap3/mvaverage" + p.toString + ".csv"
    DataSink[Double](outFile) |> results //6

The coefficients for the weighted moving average are generated (line 1) and normalized (line 2). The trading data regarding the ticker symbol, BAC, is loaded from the Yahoo! Finance CSV file (line 3) and extracted with the adjClose extractor of YahooFinancials (line 4).
The smoothed data generated by each of the moving average techniques are concatenated into a list of time series (line 5). Finally, the content is formatted and dumped into a file, outFile, using a DataSink instance (line 6).

The weighted moving average method relies on a symmetric distribution of normalized weights computed by a function passed as an argument of the generic tabulate method. Note that the original price time series is generated if a specific moving average cannot be computed. The following graph is an example of a symmetric filter for weighted moving averages:

[Figure: symmetric filter for weighted moving averages]

The three moving average techniques are applied to the price of the stock of Bank of America (BAC) over 200 trading days. Both the simple and weighted moving averages use a period of 11 trading days. The exponential moving average method uses a smoothing factor of 2/(11+1) = 0.1667.

[Figure: 11-day moving averages of the historical stock price of Bank of America]

The three techniques filter the noise out of the original historical price time series. The exponential moving average reacts to a sudden price fluctuation despite the fact that the smoothing factor is low. If you increase the period to 51 trading days, or about two calendar months, the simple and weighted moving averages generate a smoother time series than the exponential moving average with a smoothing factor of 2/(p+1) = 0.038.

[Figure: 51-day moving averages of the historical stock price of Bank of America]

You are invited to experiment further with different smoothing factors and weight distributions. You will be able to confirm the following basic rule: as the period of the moving average increases, noise with decreasing frequencies is eliminated. In other words, the window of allowed frequencies is shrinking. The moving average acts as a low-pass filter that allows only lower frequencies. Fine-tuning the period or smoothing factor is time consuming.
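To make the weighted and exponential variants of this section easier to experiment with, here is a minimal sketch on plain arrays. It is an illustration under stated assumptions rather than the book's classes: the object MovingAverageSketch and its methods are hypothetical, the weights are normalized inside wma, and the first weights.length - 1 outputs are set to 0.0.

```scala
// Hypothetical, self-contained sketch of the weighted and exponential moving averages.
object MovingAverageSketch {
  /** Weighted moving average; weights(0) applies to the oldest point of each window. */
  def wma(xs: Array[Double], weights: Array[Double]): Array[Double] = {
    val p = weights.length
    require(p > 0 && p <= xs.length, "weights.length must be in [1, xs.length]")
    val total = weights.sum
    require(total != 0.0, "weights must not sum to zero")
    val w = weights.map(_ / total) // normalize so that the weights sum to 1
    Array.tabulate(xs.length) { t =>
      if (t < p - 1) 0.0
      else (0 until p).map(j => w(j) * xs(t - p + 1 + j)).sum
    }
  }

  /** Exponential moving average: y(0) = x(0), y(t) = (1 - alpha) * y(t-1) + alpha * x(t). */
  def ema(xs: Array[Double], alpha: Double): Array[Double] = {
    require(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]")
    if (xs.isEmpty) Array.empty[Double]
    else xs.tail.scanLeft(xs.head)((prev, x) => (1 - alpha) * prev + alpha * x)
  }

  /** Smoothing factor derived from a period p, alpha = 2/(p+1). */
  def alphaFromPeriod(p: Int): Double = 2.0 / (p + 1)
}
```

With a period of 11, alphaFromPeriod returns 2/12, the smoothing factor quoted for the 11-day comparison. The weighted variant recomputes each window from scratch, which reflects the absence of a simple iterative formula.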
Spectral analysis, or more specifically, Fourier analysis, transforms the time series into a sequence of frequencies, which is a time series in the frequency domain.

Fourier analysis
The purpose of spectral density estimation is to measure the amplitude of a signal or a time series according to its frequency [3:4]. The spectral density is estimated by detecting periodicities in the dataset. A scientist can better understand a signal or time series by analyzing its harmonics.

The spectral theory
Spectral analysis for time series should not be confused with spectral theory, a subset of linear algebra that studies eigenfunctions on Hilbert and Banach spaces. Harmonic and Fourier analyses are regarded as a subset of spectral theory.

The fast Fourier transform (FFT) is the most commonly used frequency analysis algorithm [3:5]. Let's explore the concept behind the discrete Fourier series and the Fourier transform as well as their benefits as applied to financial markets. Fourier analysis approximates any generic function as the sum of trigonometric functions, sine and cosine. The decomposition into basic trigonometric functions is known as a Fourier transform [3:6].

Discrete Fourier transform (DFT)
A time series {x_k} can be represented as a discrete real-time domain function f, x = f(t). In the early 19th century, Jean Baptiste Joseph Fourier demonstrated that any continuous periodic function f could be represented as a linear combination of sine and cosine functions. The discrete Fourier transform (DFT) is a linear transformation that converts a time series into a list of coefficients of a finite combination of complex or real trigonometric functions, ordered by their frequencies. The frequency ω of each trigonometric function defines one of the harmonics of the signal. The space that represents signal amplitude versus frequency of the signal is known as the frequency domain.
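To make the definition concrete, the following sketch evaluates the textbook complex DFT sum directly, with a cost that grows as the square of the series length. It is only an illustration: the object NaiveDftSketch is hypothetical, it uses the common convention X_k = Σ_n x_n·exp(-2πi·k·n/N), and it is not the FFT used in the remainder of the chapter.

```scala
// Hypothetical, self-contained sketch of a direct (non-fast) discrete Fourier transform.
object NaiveDftSketch {
  /** Returns the (real, imaginary) parts of each coefficient X_k, k = 0 until N. */
  def dft(xs: Array[Double]): Array[(Double, Double)] = {
    val n = xs.length
    Array.tabulate(n) { k =>
      var re = 0.0
      var im = 0.0
      for (t <- 0 until n) {
        val angle = 2.0 * math.Pi * k * t / n
        re += xs(t) * math.cos(angle)
        im -= xs(t) * math.sin(angle)
      }
      (re, im)
    }
  }

  /** Amplitude of each harmonic. */
  def magnitudes(xs: Array[Double]): Array[Double] =
    dft(xs).map { case (re, im) => math.hypot(re, im) }
}
```

For a pure cosine with three cycles over 64 samples, the amplitude concentrates in the harmonic k = 3 (and its mirror image k = 61), while the other coefficients are numerically zero.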
The generic DFT transforms a time series into a sequence of frequencies defined as complex numbers ω = a + j.φ (j² = -1), for which a is the amplitude of the frequency and φ is the phase.

This section is dedicated to the real DFT that converts a time series into an ordered sequence of frequencies with real values.

Real discrete Fourier transform
A periodic function f can be represented as an infinite combination of sine and cosine functions:

$$f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} a_k \cos(kx) + \sum_{k=1}^{\infty} b_k \sin(kx)$$

The Fourier cosine transform of a function f is defined as:

$$F_c(f, k) = \int_{-\infty}^{\infty} \cos(2\pi k x)\, f(x)\, dx$$

The discrete real cosine series of an even function, f(-x) = f(x), is defined as:

$$f(x) = \frac{a_0}{2} + \sum_{k=1}^{2N-3} a_k \cos(kx) \quad \text{where} \quad a_k = \frac{2}{\pi} \int_{0}^{\pi} f(t) \cos(kt)\, dt$$

The Fourier sine transform of a function is defined as:

$$F_s(f, k) = \int_{-\infty}^{\infty} \sin(2\pi k x)\, f(x)\, dx$$

The discrete real sine series of an odd function, f(-x) = -f(x), is defined as:

$$f(x) = \sum_{k=1}^{2N-3} b_k \sin(kx) \quad \text{where} \quad b_k = \frac{2}{\pi} \int_{0}^{\pi} f(t) \sin(kt)\, dt$$

The computation of the Fourier trigonometric series is time consuming, with an asymptotic time complexity of O(n²). Several attempts have been made to make the computation as efficient as possible. The most common numerical algorithm used to compute the Fourier series is the fast Fourier transform created by J. W. Cooley and J. Tukey [3:7]. The algorithm, called Radix-2, recursively breaks down the Fourier transform for a time series of N data points into any combination of N1 and N2 sized segments such that N = N1 N2. Ultimately, the discrete Fourier transform is applied to the deepest-nested segments.

The Cooley-Tukey algorithm
I encourage you to implement the Radix-2 Cooley-Tukey algorithm in Scala using tail recursion.

The Radix-2 implementation requires that the number of data points is N = 2^n for the sine transform and N = 2^n + 1 for the cosine transform.
There are two approaches to meet this constraint:
• Reduce the actual number of points to the next lower radix, 2^n < N
• Extend the original time series by padding it with 0 to the next higher radix, N < 2^(n+1)

Padding the original time series is the preferred option because it does not affect the original set of observations.

Let's define a base class, DTransform[T], for all the fast Fourier transforms, parameterized with a type that is view bounded to the Double type (Double, Float, and so on). The first step is to implement the padding method, common to all the Fourier transforms:

    trait DTransform[T] extends PipeOperator[XTSeries[T], XTSeries[Double]] {
      // Number of zeros to append to reach 2^n (even) or 2^n + 1 observations
      def padSize(xtSz: Int, even: Boolean=true): Int = {
        val sz = if( even ) xtSz else xtSz-1
        if( (sz & (sz-1)) == 0) 0
        else {
          var bitPos = 0
          do {
            bitPos += 1
          } while( (sz >> bitPos) > 0)
          (if(even) (1<<bitPos) else (1<<bitPos)+1) - xtSz
        }
      }

      def pad(xt: XTSeries[T], even: Boolean=true)(implicit f: T => Double): DblVector = {
        val newSize = padSize(xt.size, even)
        val arr: DblVector = xt
        if( newSize > 0) arr ++ Array.fill(newSize)(0.0) else arr
      }
    }

The while loop
Scala developers prefer Scala higher-order methods on collections to implement iterative computation. However, nothing prevents you from using a traditional while loop if either readability or performance is an issue.

The fast implementation of the padding method, pad, detects whether the number of observations, N, is a power of 2 by using the bit operator & to evaluate whether N & (N-1) is zero. The next highest radix is extracted by computing the number of bit shifts in N. The code illustrates the effective use of implicit conversion to make the code readable. The arr: DblVector = xt assignment triggers a conversion defined in the XTSeries companion object.

The next step is to write the DFT class for the real discrete transforms, sine and cosine, by subclassing DTransform.
The purpose of the class is to select the appropriate Fourier series, pad the time series to the next power of 2 if necessary, and invoke the FastSineTransformer and FastCosineTransformer classes of the Apache Commons Math library [3:8] introduced in the first chapter:

    class DFT[@specialized(Double) T <% Double] extends DTransform[T] {
      def |> : PartialFunction[XTSeries[T], XTSeries[Double]] = {
        case xt: XTSeries[T] if(xt != null && xt.length > 0) =>
          XTSeries[Double](fwrd(xt)._2)
      }

      // Selects the sine series if the first value is (close to) zero, cosine otherwise
      def fwrd(xt: XTSeries[T]): (RealTransformer, DblVector) = {
        val rdt = if(Math.abs(xt.head) < DFT_EPS)
            new FastSineTransformer(DstNormalization.STANDARD_DST_I)
          else
            new FastCosineTransformer(DctNormalization.STANDARD_DCT_I)
        (rdt, rdt.transform(pad(xt, xt.head == 0.0), TransformType.FORWARD))
      }
    }

The discrete Fourier sine series requires that the first value of the time series is 0.0. This implementation automates the selection of the appropriate series by evaluating xt.head. This example uses the standard formulation of the cosine and sine transformation, defined by the DctNormalization.STANDARD_DCT_I argument. The orthogonal normalization, which normalizes the frequency by a factor of 1/sqrt(2(N-1)), where N is the size of the time series, generates a cleaner frequency spectrum at a higher computation cost.

@specialized
The @specialized(Double) annotation is used to instruct the Scala compiler to generate a specialized and more efficient version of the class for the type Double. The drawback of specialization is the duplication of byte code as the specialized version coexists with the parameterized classes [3:9].
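The padding rule used by the transforms above (2^n observations for the sine transform, 2^n + 1 for the cosine transform) can be sketched independently of XTSeries and Apache Commons Math. The names below (PaddingSketch, paddedLength, and so on) are hypothetical, and the sketch uses a plain while loop on a candidate power of 2 rather than the bit-position count of padSize.

```scala
// Hypothetical, self-contained sketch of zero-padding to a radix-2 compatible length.
object PaddingSketch {
  /** True if n is a power of 2, using the n & (n - 1) bit trick. */
  def isPowerOfTwo(n: Int): Boolean = n > 0 && (n & (n - 1)) == 0

  /** Smallest power of 2 greater than or equal to n. */
  def nextPowerOfTwo(n: Int): Int = {
    require(n <= (1 << 30), "n is too large for a 32-bit power of 2")
    var p = 1
    while (p < n) p <<= 1
    p
  }

  /** Target length: 2^k for the sine transform, 2^k + 1 for the cosine transform. */
  def paddedLength(n: Int, sine: Boolean): Int = {
    require(n > 0, "the time series must not be empty")
    if (sine) nextPowerOfTwo(n) else nextPowerOfTwo(n - 1) + 1
  }

  /** Appends zeros until the series reaches the target length. */
  def pad(xs: Array[Double], sine: Boolean): Array[Double] =
    xs ++ Array.fill(paddedLength(xs.length, sine) - xs.length)(0.0)
}
```

A series of 1,000 observations is extended to 1,024 points for the sine transform and to 1,025 points for the cosine transform, while a series that already has a valid length is returned unchanged.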
In order to illustrate the different concepts behind DFTs, let's consider the case of a time series generated by a sequence h of sinusoidal functions:

    val _T = 1.0/1024
    val h = (x: Double) =>
      2.0*Math.cos(2.0*Math.PI*_T*x) +
      Math.cos(5.0*Math.PI*_T*x) +
      Math.cos(15.0*Math.PI*_T*x)/3

As the signal is synthetically created, we can select the size of the time series to avoid padding. The first value in the time series is not zero, so the number of observations is 2^n + 1. The data generated by the function h is plotted as follows:

[Figure: Example of the sinusoidal time series]

Let's extract the frequencies spectrum for the time series generated by the function h. The data points are created by tabulating the function h. The frequencies spectrum is computed with a simple invocation of the pipe operator on the instance of the DFT class:

    val rawOut = "output/chap3/raw.csv"
    val smoothedOut = "output/chap3/smoothed.csv"

    // Sample h at the integer abscissas 0 to 1024 (x is converted to Double)
    val values = Array.tabulate(1025)(x => h(x))
    DataSink[Double](rawOut) |> values //1

    val smoothed = DFT[Double] |> XTSeries[Double](values) //2
    DataSink[Double](smoothedOut) |> smoothed

The first data sink (the type DataSink) stores the original time series into a CSV file (line 1). The DFT instance extracts the frequencies spectrum and formats it as a time series (line 2). Finally, a second sink saves it into another CSV file.

Data sinks and spreadsheets
In this particular case, the results of the discrete Fourier transform are dumped into a CSV file so that it can be loaded into a spreadsheet. Some spreadsheets support a set of filtering techniques that can be used to validate the result of the example. A simpler alternative would be to use JFreeChart.

The spectrum of the time series, plotted for the first 32 points, clearly shows three frequencies at k=2, 5, and 15. This is expected because the original signal is composed of three sinusoidal functions.
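The spectrum can be cross-checked without Apache Commons Math by evaluating the cosine transform sum directly. The sketch below assumes the unnormalized DCT-I convention, y_k = ½[x_0 + (-1)^k·x_{N-1}] + Σ_{n=1}^{N-2} x_n·cos(π·n·k/(N-1)), and samples h at the integer abscissas 0 to 1024. The object DctSketch and the helper dct1 are hypothetical names introduced for this illustration.

```scala
// Hypothetical, self-contained cross-check of the three-frequency spectrum.
object DctSketch {
  val T: Double = 1.0 / 1024

  // Same three-cosine signal as the function h in the text
  def h(x: Double): Double =
    2.0 * math.cos(2.0 * math.Pi * T * x) +
      math.cos(5.0 * math.Pi * T * x) +
      math.cos(15.0 * math.Pi * T * x) / 3

  /** Coefficient k of the unnormalized DCT-I of xs (assumed convention, see lead-in). */
  def dct1(xs: Array[Double], k: Int): Double = {
    val m = xs.length - 1
    val sign = if (k % 2 == 0) 1.0 else -1.0
    val inner = (1 until m).map(n => xs(n) * math.cos(math.Pi * n * k / m)).sum
    0.5 * (xs(0) + sign * xs(m)) + inner
  }

  // 1025 = 2^10 + 1 observations, so no padding is needed for the cosine transform
  val values: Array[Double] = Array.tabulate(1025)(n => h(n.toDouble))

  // First 32 harmonics
  val spectrum: Array[Double] = Array.tabulate(32)(k => dct1(values, k))
}
```

Under these assumptions, the only non-negligible coefficients among the first 32 harmonics are those at k = 2, 5, and 15.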
The amplitudes of these frequencies are 1024/1, 1024/2, and 1024/6, respectively. The following plot represents the first 32 harmonics for the time series:

[Figure: Frequency spectrum for a three-frequency sinusoidal]

The next step is to use the frequencies spectrum to create a low-pass filter using DFT. There are many algorithms to implement a low-pass or band-pass filter in the time domain, from autoregressive models to the Butterworth algorithm. However, the fast Fourier transform is still a very popular technique to smooth signals and extract trends.

Big Data
A DFT for a large time series can be very computation intensive. One option is to treat the time series as a continuous signal and sample it using the Nyquist frequency. The Nyquist frequency is half of the sampling rate of a continuous signal.

DFT-based filtering
The purpose of this section is to introduce, describe, and implement a noise filtering mechanism that leverages the discrete Fourier transform. The idea is quite simple: the forward and inverse Fourier transforms are used sequentially to convert the time series from the time domain to the frequency domain and back. The only input you need to supply is a function G that modifies the sequence of frequencies. Multiplying the frequencies spectrum by G in the frequency domain is equivalent to a convolution of the time series with the filter in the time domain. Mathematically, the convolution is defined as follows:

Convolution
The convolution of two functions f and g is defined as:

$$(f * g)(x) = \int_{-\infty}^{\infty} f(t)\, g(x - t)\, dt$$

DFT convolution
One important property of the Fourier transform is that the convolution of two signals is implemented as the product of their respective spectra:

$$\mathcal{F}(f * g) = \mathcal{F}(f) \cdot \mathcal{F}(g)$$

Let's apply the property to the discrete Fourier transform.
If a time series {x_i} has a frequency spectrum {ω_{x,j}} and a filter f is defined in the frequency domain as {ω_{f,j}}, then the convolution is defined as:

$$\mathcal{F}(x * f)_k = \sum_{j=0}^{N-1} \omega_{x,j}\, \omega_{f,k-j}$$

Let's apply the convolution to our filtering problem. The filtering algorithm using the discrete Fourier transform consists of five steps:
1. Pad the time series to enable the discrete sine or cosine transform.
2. Generate the ordered sequence of frequencies using the forward transform.
3. Select the filter function g in the frequency domain and a cutoff frequency.
4. Convolute the sequence of frequencies with the filter function g.
5. Generate the filtered signal in the time domain by applying the inverse DFT transform to the convoluted frequencies.

[Figure: Diagram of a discrete Fourier filter. The raw time series f(t) goes through the forward Fourier transform F(ω), the filter F*(ω) = F(ω).G(ω), and the inverse Fourier transform to produce the filtered time series f*(t)]

The most commonly used low-pass filters are known as the sinc and sinc2 functions, defined as a rectangular function and a triangular function, respectively. The simplest low-pass filter is implemented by a sinc function that returns 1 for frequencies below a cutoff frequency, fC, and 0 if the frequency is higher:

    def sinc(f: Double, fC: Double): Double =
      if(Math.abs(f) < fC) 1.0 else 0.0

    def sinc2(f: Double, fC: Double): Double =
      if(f*f < fC) 1.0 else 0.0

The filtering computation is implemented as a data transformation (pipe operator |>). The DFTFir class inherits from the DFT class in order to reuse the fwrd forward transform function. As usual, exception and validation code is omitted. The frequency domain function g is an attribute of the filter. The g function takes the frequency cutoff value fC as the second argument. The two filters sinc and sinc2 defined above are examples of filtering functions.
    class DFTFir[T <% Double](val g: (Double, Double) => Double, val fC: Double)
      extends DFT[T]

The pipe operator implements the filtering functionality:

    def |> : PartialFunction[XTSeries[T], XTSeries[Double]] = {
      case xt: XTSeries[T] if(xt != null && xt.size > 2) => {
        val spectrum = fwrd(xt) //1
        val cutOff = fC*spectrum._2.size
        val filtered = spectrum._2.zipWithIndex
          .map(x => x._1*g(x._2, cutOff)) //2
        XTSeries[Double](
          spectrum._1.transform(filtered, TransformType.INVERSE)) //3
      }
    }

The filtering process follows three steps:
1. Compute the discrete Fourier forward transformation (sine or cosine), fwrd.
2. Apply the filter function through a Scala map method.
3. Apply the inverse transform on the frequencies.

Let's evaluate the impact of the cutoff values on the filtered data. The implementation of the test program consists of invoking the DFT filter pipe operator and writing results into a CSV file. The code reuses the generation function h introduced in the previous paragraph:

    val price = src |> YahooFinancials.adjClose
    val filter = new DFTFir[Double](sinc, 4.0)
    val filteredPrice = filter |> price

Filtering out the noise is accomplished by selecting the cutoff value between any of the three harmonics with the respective frequencies of 2, 5, and 15. The original and the two filtered time series are plotted on the following graph:

[Figure: Plotting of the discrete Fourier filter-based smoothing]

As you would expect, the low-pass filter with a cutoff value of 12 removes the noise with the highest frequencies. The filter with the cutoff value 4 cancels out the second harmonic (low-frequency noise), leaving only the main trend cycle.

Detection of market cycles
Using the discrete Fourier transform to generate the frequencies spectrum of a periodical time series is easy. However, what about real-world signals such as the time series representing the historical price of a stock?
The purpose of the next exercise is to detect the long-term cycle(s), if any, of the overall stock market by applying the discrete Fourier transform to the quote of the S&P 500 index between January 1, 2009, and December 31, 2013, as illustrated in the following graph:

[Figure: Historical S&P 500 index prices]

The first step is to apply the DFT to extract a spectrum for the S&P 500 historical prices, as shown in the following graph, with the first 32 harmonics:

[Figure: Frequencies spectrum for historical S&P index]

The frequency domain chart highlights some interesting characteristics regarding the S&P 500 historical prices:
• Both positive and negative amplitudes are present, as you would expect in a time series with complex values. The cosine series contributes to the positive amplitudes while the sine series affects both positive and negative amplitudes (cos(x - π/2) = sin(x)).
• The decay of the amplitude along the frequencies is steep enough to warrant further analysis beyond the first harmonic, which represents the main trend.

The next step is to apply a pass-band filter technique to the S&P 500 historical data in order to identify short-term trends with lower periodicity. A low-pass filter is limited to reducing or canceling out the noise in the raw data. In this case, a pass-band filter using a range or window of frequencies is appropriate to isolate the frequency or the group of frequencies that characterize a specific cycle. The sinc function introduced in the previous section to implement a low-pass filter is modified to enforce the pass band within a window, [w1, w2], as follows:

    def sinc(f: Double, w: (Double, Double)): Double =
      if(Math.abs(f) > w._1 && Math.abs(f) < w._2) 1.0 else 0.0

Let's define a DFT-based pass-band filter with a window of width 4, w=(i, i+4), with i ranging between 2 and 20. Applying the window [4, 8] isolates the impact of the second harmonic on the price curve.
As we eliminate the main upward trend with frequencies less than 4, all filtered data varies within a short range relative to the main trend. The following graph shows the output of this filter:

[Figure: The output of a pass-band DFT filter range 4-8 on the historical S&P index]

In this case, we filter the S&P 500 index around the third group of harmonics with frequencies ranging from 18 to 22; the signal is converted into a familiar sinusoidal function, as shown here:

[Figure: The output of a pass-band DFT filter range 18-22 on the historical S&P index]

There is a possible rational explanation for the shape of the S&P 500 data filtered by a pass band with a frequency of 20, as illustrated in the previous plot; the S&P 500 historical data plot shows that the frequency of the fluctuation in the middle of the uptrend (trading sessions 620 to 770) increases significantly. This phenomenon can be explained by the fact that the S&P 5