Spark: The Definitive Guide: Big Data Processing Made Simple

Spark: The Definitive Guide
by Bill Chambers and Matei Zaharia
Copyright © 2018 Databricks. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online
editions are also available for most titles (http://oreilly.com/safari). For more information, contact
our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Nicole Tache
Production Editor: Justin Billing
Copyeditor: Octal Publishing, Inc., Chris Edwards, and Amanda Kersey
Proofreader: Jasmine Kwityn
Indexer: Judith McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
February 2018: First Edition
Revision History for the First Edition
2018-02-08: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491912218 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Spark: The Definitive Guide,
the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. Apache, Spark and
Apache Spark are trademarks of the Apache Software Foundation.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all
responsibility for errors or omissions, including without limitation responsibility for damages
resulting from the use of or reliance on this work. Use of the information and instructions contained in
this work is at your own risk. If any code samples or other technology this work contains or describes
is subject to open source licenses or the intellectual property rights of others, it is your responsibility
to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-91221-8
Preface
Welcome to this first edition of Spark: The Definitive Guide! We are excited to bring you the most
complete resource on Apache Spark today, focusing especially on the new generation of Spark APIs
introduced in Spark 2.0.
Apache Spark is currently one of the most popular systems for large-scale data processing, with APIs
in multiple programming languages and a wealth of built-in and third-party libraries. Although the
project has existed for multiple years—first as a research project started at UC Berkeley in 2009,
then at the Apache Software Foundation since 2013—the open source community is continuing to
build more powerful APIs and high-level libraries over Spark, so there is still a lot to write about the
project. We decided to write this book for two reasons. First, we wanted to present the most
comprehensive book on Apache Spark, covering all of the fundamental use cases with easy-to-run
examples. Second, we especially wanted to explore the higher-level “structured” APIs that were
finalized in Apache Spark 2.0—namely DataFrames, Datasets, Spark SQL, and Structured Streaming
—which older books on Spark don’t always include. We hope this book gives you a solid foundation
to write modern Apache Spark applications using all the available tools in the project.
In this preface, we’ll tell you a little bit about our background, and explain who this book is for and
how we have organized the material. We also want to thank the numerous people who helped edit and
review this book, without whom it would not have been possible.
About the Authors
Both of the book’s authors have been involved in Apache Spark for a long time, so we are very
excited to be able to bring you this book.
Bill Chambers started using Spark in 2014 on several research projects. Currently, Bill is a Product
Manager at Databricks where he focuses on enabling users to write various types of Apache Spark
applications. Bill also regularly blogs about Spark and presents at conferences and meetups on the
topic. Bill holds a Master’s in Information Management and Systems from the UC Berkeley School of
Information.
Matei Zaharia started the Spark project in 2009, during his time as a PhD student at UC Berkeley.
Matei worked with other Berkeley researchers and external collaborators to design the core Spark
APIs and grow the Spark community, and has continued to be involved in new initiatives such as the
structured APIs and Structured Streaming. In 2013, Matei and other members of the Berkeley Spark
team co-founded Databricks to further grow the open source project and provide commercial
offerings around it. Today, Matei continues to work as Chief Technologist at Databricks, and also
holds a position as an Assistant Professor of Computer Science at Stanford University, where he does
research on large-scale systems and AI. Matei received his PhD in Computer Science from UC
Berkeley in 2013.
Who This Book Is For
We designed this book mainly for data scientists and data engineers looking to use Apache Spark. The
two roles have slightly different needs, but in reality, most application development covers a bit of
both, so we think the material will be useful in both cases. Specifically, in our minds, the data
scientist workload focuses more on interactively querying data to answer questions and build
statistical models, while the data engineer job focuses on writing maintainable, repeatable production
applications—either to use the data scientists’ models in practice, or just to prepare data for further
analysis (e.g., building a data ingest pipeline). However, we often see with Spark that these roles
blur. For instance, data scientists are able to package production applications without too much hassle
and data engineers use interactive analysis to understand and inspect their data to build and maintain
pipelines.
While we tried to provide everything data scientists and engineers need to get started, there are some
things we didn’t have space to focus on in this book. First, this book does not include in-depth
introductions to some of the analytics techniques you can use in Apache Spark, such as machine
learning. Instead, we show you how to invoke these techniques using libraries in Spark, assuming you
already have a basic background in machine learning. Many full, standalone books exist to cover
these techniques in formal detail, so we recommend starting with those if you want to learn about
these areas. Second, this book focuses more on application development than on operations and
administration (e.g., how to manage an Apache Spark cluster with dozens of users). Nonetheless, we
have tried to include comprehensive material on monitoring, debugging, and configuration in Parts V
and VI of the book to help engineers get their application running efficiently and tackle day-to-day
maintenance. Finally, this book places less emphasis on the older, lower-level APIs in Spark—
specifically RDDs and DStreams—to introduce most of the concepts using the newer, higher-level
structured APIs. Thus, the book may not be the best fit if you need to maintain an old RDD or
DStream application, but should be a great introduction to writing new applications.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as
variable or function names, databases, data types, environment variables, statements, and
keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
TIP
This element signifies a tip or suggestion.
NOTE
This element signifies a general note.
WARNING
This element indicates a warning or caution.
Using Code Examples
We’re very excited to have designed this book so that all of the code content is runnable on real data.
We wrote the whole book using Databricks notebooks and have posted the data and related material
on GitHub. This means that you can run and edit all the code as you follow along, or copy it into
working code in your own applications.
We tried to use real data wherever possible to illustrate the challenges you’ll run into while building
large-scale data applications. Finally, we also include several larger standalone applications in the
book’s GitHub repository for examples that it does not make sense to show inline in the text.
The GitHub repository will remain a living document as we update based on Spark’s progress. Be
sure to follow updates there.
This book is here to help you get your job done. In general, if example code is offered with this book,
you may use it in your programs and documentation. You do not need to contact us for permission
unless you’re reproducing a significant portion of the code. For example, writing a program that uses
several chunks of code from this book does not require permission. Selling or distributing a CD-
ROM of examples from O’Reilly books does require permission. Answering a question by citing this
book and quoting example code does not require permission. Incorporating a significant amount of
example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author,
publisher, and ISBN. For example: “Spark: The Definitive Guide by Bill Chambers and Matei
Zaharia (O’Reilly). Copyright 2018 Databricks, Inc., 978-1-491-91221-8.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to
contact us at permissions@oreilly.com.
O’Reilly Safari
Safari (formerly Safari Books Online) is a membership-based training and reference platform for
enterprise, government, educators, and individuals.
Members have access to thousands of books, training videos, Learning Paths, interactive tutorials,
and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business
Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que,
Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann,
IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones
& Bartlett, and Course Technology, among others.
For more information, please visit http://oreilly.com/safari.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website at
http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
A huge number of people made this book possible.
First, we would like to thank our employer, Databricks, for allocating time for us to work on this
book. Without the support of the company, this book would not have been possible. In particular, we
would like to thank Ali Ghodsi, Ion Stoica, and Patrick Wendell for their support.
Additionally, there are numerous people that read drafts of the book and individual chapters. Our
reviewers were best-in-class, and provided invaluable feedback.
These reviewers, in alphabetical order by last name, are:
Lynn Armstrong
Mikio Braun
Jules Damji
Denny Lee
Alex Thomas
In addition to the formal book reviewers, there were numerous other Spark users, contributors, and
committers who read over specific chapters or helped formulate how topics should be discussed. In
alphabetical order by last name, the people who helped are:
Sameer Agarwal
Bagrat Amirbekian
Michael Armbrust
Joseph Bradley
Tathagata Das
Hossein Falaki
Wenchen Fan
Sue Ann Hong
Yin Huai
Tim Hunter
Xiao Li
Cheng Lian
Xiangrui Meng
Kris Mok
Josh Rosen
Srinath Shankar
Takuya Ueshin
Herman van Hövell
Reynold Xin
Philip Yang
Burak Yavuz
Shixiong Zhu
Lastly, we would like to thank friends, family, and loved ones. Without their support, patience, and
encouragement, we would not have been able to write the definitive guide to Spark.
Part I. Gentle Overview of Big Data and
Spark
Chapter 1. What Is Apache Spark?
Apache Spark is a unified computing engine and a set of libraries for parallel data processing on
computer clusters. As of this writing, Spark is the most actively developed open source engine for
this task, making it a standard tool for any developer or data scientist interested in big data. Spark
supports multiple widely used programming languages (Python, Java, Scala, and R), includes
libraries for diverse tasks ranging from SQL to streaming and machine learning, and runs anywhere
from a laptop to a cluster of thousands of servers. This makes it an easy system to start with and
scale up to big data processing at incredibly large scale.
Figure 1-1 illustrates all the components and libraries Spark offers to end-users.
Figure 1-1. Spark’s toolkit
You’ll notice the categories roughly correspond to the different parts of this book. That should really
come as no surprise; our goal here is to educate you on all aspects of Spark, and Spark is composed
of a number of different components.
Given that you’re reading this book, you might already know a little bit about Apache Spark and what
it can do. Nonetheless, in this chapter, we want to briefly cover the overriding philosophy behind
Spark as well as the context it was developed in (why is everyone suddenly excited about parallel
data processing?) and its history. We will also outline the first few steps to running Spark.
Apache Spark’s Philosophy
Let’s break down our description of Apache Spark—a unified computing engine and set of libraries
for big data—into its key components:
Unified
Spark’s key driving goal is to offer a unified platform for writing big data applications. What do
we mean by unified? Spark is designed to support a wide range of data analytics tasks, ranging
from simple data loading and SQL queries to machine learning and streaming computation, over
the same computing engine and with a consistent set of APIs. The main insight behind this goal is
that real-world data analytics tasks—whether they are interactive analytics in a tool such as a
Jupyter notebook, or traditional software development for production applications—tend to
combine many different processing types and libraries.
Spark’s unified nature makes these tasks both easier and more efficient to write. First, Spark
provides consistent, composable APIs that you can use to build an application out of smaller
pieces or out of existing libraries. It also makes it easy for you to write your own analytics
libraries on top. However, composable APIs are not enough: Spark’s APIs are also designed to
enable high performance by optimizing across the different libraries and functions composed
together in a user program. For example, if you load data using a SQL query and then evaluate a
machine learning model over it using Spark’s ML library, the engine can combine these steps into
one scan over the data. The combination of general APIs and high-performance execution, no
matter how you combine them, makes Spark a powerful platform for interactive and production
applications.
Spark’s focus on defining a unified platform is the same idea behind unified platforms in other
areas of software. For example, data scientists benefit from a unified set of libraries (e.g., Python
or R) when doing modeling, and web developers benefit from unified frameworks such as
Node.js or Django. Before Spark, no open source systems tried to provide this type of unified
engine for parallel data processing, meaning that users had to stitch together an application out of
multiple APIs and systems. Thus, Spark quickly became the standard for this type of development.
Over time, Spark has continued to expand its built-in APIs to cover more workloads. At the same
time, the project’s developers have continued to refine its theme of a unified engine. In particular,
one major focus of this book will be the “structured APIs” (DataFrames, Datasets, and SQL) that
were finalized in Spark 2.0 to enable more powerful optimization under user applications.
Computing engine
At the same time that Spark strives for unification, it carefully limits its scope to a computing
engine. By this, we mean that Spark handles loading data from storage systems and performing
computation on it, not permanent storage as the end itself. You can use Spark with a wide variety
of persistent storage systems, including cloud storage systems such as Azure Storage and Amazon
S3, distributed file systems such as Apache Hadoop, key-value stores such as Apache Cassandra,
and message buses such as Apache Kafka. However, Spark neither stores data long term itself,
nor favors one system over another. The key motivation here is that most data already resides in a mix of
storage systems. Data is expensive to move, so Spark focuses on performing computations over the
data, no matter where it resides. In user-facing APIs, Spark works hard to make these storage
systems look largely similar so that applications do not need to worry about where their data is.
Spark’s focus on computation makes it different from earlier big data software platforms such as
Apache Hadoop. Hadoop included both a storage system (the Hadoop file system, designed for
low-cost storage over clusters of commodity servers) and a computing system (MapReduce),
which were closely integrated together. However, this choice makes it difficult to run one of the
systems without the other. More important, this choice also makes it a challenge to write
applications that access data stored anywhere else. Although Spark runs well on Hadoop storage,
today it is also used broadly in environments for which the Hadoop architecture does not make
sense, such as the public cloud (where storage can be purchased separately from computing) or
streaming applications.
Libraries
Spark’s final component is its libraries, which build on its design as a unified engine to provide a
unified API for common data analysis tasks. Spark supports both standard libraries that ship with
the engine as well as a wide array of external libraries published as third-party packages by the
open source communities. Today, Spark’s standard libraries are actually the bulk of the open
source project: the Spark core engine itself has changed little since it was first released, but the
libraries have grown to provide more and more types of functionality. Spark includes libraries for
SQL and structured data (Spark SQL), machine learning (MLlib), stream processing (Spark
Streaming and the newer Structured Streaming), and graph analytics (GraphX). Beyond these
libraries, there are hundreds of open source external libraries ranging from connectors for various
storage systems to machine learning algorithms. One index of external libraries is available at
spark-packages.org.
Context: The Big Data Problem
Why do we need a new engine and programming model for data analytics in the first place? As with
many trends in computing, this is due to changes in the economic factors that underlie computer
applications and hardware.
For most of their history, computers became faster every year through processor speed increases: the
new processors each year could run more instructions per second than the previous year’s. As a
result, applications also automatically became faster every year, without any changes needed to their
code. This trend led to a large and established ecosystem of applications building up over time, most
of which were designed to run only on a single processor. These applications rode the trend of
improved processor speeds to scale up to larger computations and larger volumes of data over time.
Unfortunately, this trend in hardware stopped around 2005: due to hard limits in heat dissipation,
hardware developers stopped making individual processors faster, and switched toward adding more
parallel CPU cores all running at the same speed. This change meant that suddenly applications
needed to be modified to add parallelism in order to run faster, which set the stage for new
programming models such as Apache Spark.
On top of that, the technologies for storing and collecting data did not slow down appreciably in
2005, when processor speeds did. The cost to store 1 TB of data continues to drop by roughly two
times every 14 months, meaning that it is very inexpensive for organizations of all sizes to store large
amounts of data. Moreover, many of the technologies for collecting data (sensors, cameras, public
datasets, etc.) continue to drop in cost and improve in resolution. For example, camera technology
continues to improve in resolution and drop in cost per pixel every year, to the point where a 12-
megapixel webcam costs only $3 to $4; this has made it inexpensive to collect a wide range of visual
data, whether from people filming video or automated sensors in an industrial setting. Moreover,
cameras are themselves the key sensors in other data collection devices, such as telescopes and even
gene-sequencing machines, driving the cost of these technologies down as well.
The end result is a world in which collecting data is extremely inexpensive—many organizations
today even consider it negligent not to log data of possible relevance to the business—but processing
it requires large, parallel computations, often on clusters of machines. Moreover, in this new world,
the software developed in the past 50 years cannot automatically scale up, and neither can the
traditional programming models for data processing applications, creating the need for new
programming models. It is this world that Apache Spark was built for.
History of Spark
Apache Spark began at UC Berkeley in 2009 as the Spark research project, which was first published
the following year in a paper entitled “Spark: Cluster Computing with Working Sets” by Matei
Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, and Ion Stoica of the UC Berkeley
AMPlab. At the time, Hadoop MapReduce was the dominant parallel programming engine for
clusters, being the first open source system to tackle data-parallel processing on clusters of thousands
of nodes. The AMPlab had worked with multiple early MapReduce users to understand the benefits
and drawbacks of this new programming model, and was therefore able to synthesize a list of
problems across several use cases and begin designing more general computing platforms. In
addition, Zaharia had also worked with Hadoop users at UC Berkeley to understand their needs for
the platform—specifically, teams that were doing large-scale machine learning using iterative
algorithms that need to make multiple passes over the data.
Across these conversations, two things were clear. First, cluster computing held tremendous
potential: at every organization that used MapReduce, brand new applications could be built using the
existing data, and many new groups began using the system after its initial use cases. Second,
however, the MapReduce engine made it both challenging and inefficient to build large applications.
For example, the typical machine learning algorithm might need to make 10 or 20 passes over the
data, and in MapReduce, each pass had to be written as a separate MapReduce job, which had to be
launched separately on the cluster and load the data from scratch.
To address this problem, the Spark team first designed an API based on functional programming that
could succinctly express multistep applications. The team then implemented this API over a new
engine that could perform efficient, in-memory data sharing across computation steps. The team also
began testing this system with both Berkeley and external users.
The first version of Spark supported only batch applications, but soon enough another compelling use
case became clear: interactive data science and ad hoc queries. By simply plugging the Scala
interpreter into Spark, the project could provide a highly usable interactive system for running queries
on hundreds of machines. The AMPlab also quickly built on this idea to develop Shark, an engine that
could run SQL queries over Spark and enable interactive use by analysts as well as data scientists.
Shark was first released in 2011.
After these initial releases, it quickly became clear that the most powerful additions to Spark would
be new libraries, and so the project began to follow the “standard library” approach it has today. In
particular, different AMPlab groups started MLlib, Spark Streaming, and GraphX. They also ensured
that these APIs would be highly interoperable, enabling writing end-to-end big data applications in
the same engine for the first time.
In 2013, the project had grown to widespread use, with more than 100 contributors from more than 30
organizations outside UC Berkeley. The AMPlab contributed Spark to the Apache Software
Foundation as a long-term, vendor-independent home for the project. The early AMPlab team also
launched a company, Databricks, to harden the project, joining the community of other companies and
organizations contributing to Spark. Since that time, the Apache Spark community released Spark 1.0
in 2014 and Spark 2.0 in 2016, and continues to make regular releases, bringing new features into the
project.
Finally, Spark’s core idea of composable APIs has also been refined over time. Early versions of
Spark (before 1.0) largely defined this API in terms of functional operations—parallel operations
such as maps and reduces over collections of Java objects. Beginning with 1.0, the project added
Spark SQL, a new API for working with structured data—tables with a fixed data format that is not
tied to Java’s in-memory representation. Spark SQL enabled powerful new optimizations across
libraries and APIs by understanding both the data format and the user code that runs on it in more
detail. Over time, the project added a plethora of new APIs that build on this more powerful
structured foundation, including DataFrames, machine learning pipelines, and Structured Streaming, a
high-level, automatically optimized streaming API. In this book, we will spend a significant amount of
time explaining these next-generation APIs, most of which are marked as production-ready.
The Present and Future of Spark
Spark has been around for a number of years but continues to gain in popularity and use cases. Many
new projects within the Spark ecosystem continue to push the boundaries of what’s possible with the
system. For example, a new high-level streaming engine, Structured Streaming, was introduced in
2016. This technology is a huge part of companies solving massive-scale data challenges, from
technology companies like Uber and Netflix using Spark’s streaming and machine learning tools, to
institutions like NASA, CERN, and the Broad Institute of MIT and Harvard applying Spark to
scientific data analysis.
Spark will continue to be a cornerstone of companies doing big data analysis for the foreseeable
future, especially given that the project is still developing quickly. Any data scientist or engineer who
needs to solve big data problems probably needs a copy of Spark on their machine—and hopefully, a
copy of this book on their bookshelf!
Running Spark
This book contains an abundance of Spark-related code, and it’s essential that you’re prepared to run
it as you learn. For the most part, you’ll want to run the code interactively so that you can experiment
with it. Let’s go over some of your options before we begin working with the coding parts of the
book.
You can use Spark from Python, Java, Scala, R, or SQL. Spark itself is written in Scala and runs on
the Java Virtual Machine (JVM), so to run Spark either on your laptop or on a cluster, all you
need is an installation of Java. If you want to use the Python API, you will also need a Python
interpreter (version 2.7 or later). If you want to use R, you will need a version of R on your machine.
There are two options we recommend for getting started with Spark: downloading and installing
Apache Spark on your laptop, or running a web-based version in Databricks Community Edition, a
free cloud environment for learning Spark that includes the code in this book. We explain both of
those options next.
Downloading Spark Locally
If you want to download and run Spark locally, the first step is to make sure that you have Java
installed on your machine (available as java), as well as a Python version if you would like to use
Python. Next, visit the project’s official download page, select the package type of “Pre-built for
Hadoop 2.7 and later,” and click “Direct Download.” This downloads a compressed TAR file, or
tarball, that you will then need to extract. The majority of this book was written using Spark 2.2, so
downloading version 2.2 or later should be a good starting point.
Downloading Spark for a Hadoop cluster
Spark can run locally without any distributed storage system, such as Apache Hadoop. However, if
you would like to connect the Spark version on your laptop to a Hadoop cluster, make sure you
download the right Spark version for that Hadoop version, which can be chosen at
http://spark.apache.org/downloads.html by selecting a different package type. We discuss how
Spark runs on clusters and the Hadoop file system in later chapters, but at this point we recommend
just running Spark on your laptop to start out.
NOTE
In Spark 2.2, the developers also added the ability to install Spark for Python via pip install pyspark. This functionality
came out as this book was being written, so we weren’t able to include all of the relevant instructions.
Building Spark from source
We won’t cover this in the book, but you can also build and configure Spark from source. You can
select a source package on the Apache download page to get just the source and follow the
instructions in the README file for building.
After you’ve downloaded Spark, you’ll want to open a command-line prompt and extract the package.
In our case, we’re installing Spark 2.2. The following is a code snippet that you can run on any Unix-
style command line to extract the file you downloaded from Spark and move into the directory:
cd ~/Downloads
tar -xf spark-2.2.0-bin-hadoop2.7.tgz
cd spark-2.2.0-bin-hadoop2.7
Note that Spark has a large number of directories and files within the project. Don’t be intimidated!
Most of these directories are relevant only if you’re reading source code. The next section will cover
the most important directories—the ones that let us launch Spark’s different consoles for interactive
use.
Launching Spark’s Interactive Consoles
You can start an interactive shell in Spark for several different programming languages. The majority
of this book is written with Python, Scala, and SQL in mind; thus, those are our recommended starting
points.
Launching the Python console
You’ll need Python 2 or 3 installed in order to launch the Python console. From Spark’s home
directory, run the following code:
./bin/pyspark
After you’ve done that, type “spark” and press Enter. You’ll see the SparkSession object printed,
which we cover in Chapter 2.
Launching the Scala console
To launch the Scala console, you will need to run the following command:
./bin/spark-shell
After you’ve done that, type “spark” and press Enter. As in Python, you’ll see the SparkSession
object, which we cover in Chapter 2.
Launching the SQL console
Parts of this book will cover a large amount of Spark SQL. For those, you might want to start the SQL
console. We’ll revisit some of the more relevant details after we actually cover these topics in the
book.
./bin/spark-sql
Running Spark in the Cloud
If you would like to have a simple, interactive notebook experience for learning Spark, you might
prefer using Databricks Community Edition. Databricks, as we mentioned earlier, is a company
founded by the Berkeley team that started Spark, and offers a free community edition of its cloud
service as a learning environment. The Databricks Community Edition includes a copy of all the data
and code examples for this book, making it easy to quickly run any of them. To use the Databricks
Community Edition, follow the instructions at https://github.com/databricks/Spark-The-Definitive-
Guide. You will be able to use Scala, Python, SQL, or R from a web browser–based interface to run
and visualize results.
Data Used in This Book
We’ll use a number of data sources in this book for our examples. If you want to run the code locally,
you can download them from the official code repository for this book as described at
https://github.com/databricks/Spark-The-Definitive-Guide. In short, you will download the data,
put it in a folder, and then run the code snippets in this book!
Chapter 2. A Gentle Introduction to Spark
Now that our history lesson on Apache Spark is complete, it's time to begin using and applying it!
This chapter presents a gentle introduction to Spark, in which we will walk through the core
architecture of a cluster, a Spark Application, and Spark's structured APIs using DataFrames and SQL.
Along the way we will touch on Spark's core terminology and concepts so that you can begin using
Spark right away. Let's get started with some basic background information.
Spark's Basic Architecture
Typically, when you think of a “computer,” you think about one machine sitting on your desk at home
or at work. This machine works perfectly well for watching movies or working with spreadsheet
software. However, as many users likely experience at some point, there are some things that your
computer is not powerful enough to perform. One particularly challenging area is data processing.
Single machines do not have enough power and resources to perform computations on huge amounts
of information (or the user probably does not have the time to wait for the computation to finish). A
cluster, or group, of computers pools the resources of many machines together, giving us the ability
to use all the cumulative resources as if they were a single computer. Now, a group of machines alone
is not powerful; you need a framework to coordinate work across them. Spark does just that,
managing and coordinating the execution of tasks on data across a cluster of computers.
The cluster of machines that Spark will use to execute tasks is managed by a cluster manager like
Spark's standalone cluster manager, YARN, or Mesos. We then submit Spark Applications to these
cluster managers, which will grant resources to our application so that we can complete our work.
Spark Applications
Spark Applications consist of a driver process and a set of executor processes. The driver process
runs your main() function, sits on a node in the cluster, and is responsible for three things:
maintaining information about the Spark Application; responding to a user's program or input; and
analyzing, distributing, and scheduling work across the executors (discussed momentarily). The
driver process is absolutely essential—it’s the heart of a Spark Application and maintains all
relevant information during the lifetime of the application.
The executors are responsible for actually carrying out the work that the driver assigns them. This
means that each executor is responsible for only two things: executing code assigned to it by the
driver, and reporting the state of the computation on that executor back to the driver node.
Figure 2-1 demonstrates how the cluster manager controls physical machines and allocates resources
to Spark Applications. This can be one of three core cluster managers: Spark's standalone cluster
manager, YARN, or Mesos. This means that there can be multiple Spark Applications running on a
cluster at the same time. We will discuss cluster managers more in Part IV.
Figure 2-1. The architecture of a Spark Application
In Figure 2-1, we can see the driver on the left and four executors on the right. In this diagram, we
removed the concept of cluster nodes. The user can specify how many executors should fall on each
node through configurations.
NOTE
Spark, in addition to its cluster mode, also has a local mode. The driver and executors are simply processes, which means
that they can live on the same machine or different machines. In local mode, the driver and executors run (as threads) on
your individual computer instead of a cluster. We wrote this book with local mode in mind, so you should be able to run
everything on a single machine.
Here are the key points to understand about Spark Applications at this point:
Spark employs a cluster manager that keeps track of the resources available.
The driver process is responsible for executing the driver program's commands across the
executors to complete a given task.
The executors, for the most part, will always be running Spark code. However, the driver can be
“driven” from a number of different languages through Spark's language APIs. Let's take a look at
those in the next section.
Spark's Language APIs
Spark's language APIs make it possible for you to run Spark code using various programming
languages. For the most part, Spark presents some core “concepts” in every language; these concepts
are then translated into Spark code that runs on the cluster of machines. If you use just the Structured
APIs, you can expect all languages to have similar performance characteristics. Here’s a brief
rundown:
Scala
Spark is primarily written in Scala, making it Spark's “default” language. This book will include
Scala code examples wherever relevant.
Java
Even though Spark is written in Scala, Spark's authors have been careful to ensure that you can
write Spark code in Java. This book will focus primarily on Scala but will provide Java
examples where relevant.
Python
Python supports nearly all constructs that Scala supports. This book will include Python code
examples whenever we include Scala code examples and a Python API exists.
SQL
Spark supports a subset of the ANSI SQL 2003 standard. This makes it easy for analysts and non-
programmers to take advantage of the big data powers of Spark. This book includes SQL code
examples wherever relevant.
R
Spark has two commonly used R libraries: one as a part of Spark core (SparkR) and another as an
R community-driven package (sparklyr). We cover both of these integrations in Chapter 32.
Figure 2-2 presents a simple illustration of this relationship.
Figure 2-2. The relationship between the SparkSession and Spark's Language API
Each language API maintains the same core concepts that we described earlier. There is a
SparkSession object available to the user, which is the entrance point to running Spark code. When
using Spark from Python or R, you don't write explicit JVM instructions; instead, you write Python
and R code that Spark translates into code that it then can run on the executor JVMs.
Spark's APIs
Although you can drive Spark from a variety of languages, what it makes available in those languages
is worth mentioning. Spark has two fundamental sets of APIs: the low-level “unstructured” APIs, and
the higher-level structured APIs. We discuss both in this book, but these introductory chapters will
focus primarily on the higher-level structured APIs.
Starting Spark
Thus far, we covered the basic concepts of Spark Applications. This has all been conceptual in
nature. When we actually go about writing our Spark Application, we are going to need a way to send
user commands and data to it. We do that by first creating a SparkSession.
NOTE
To do this, we will start Spark's local mode, just like we did in Chapter 1. This means running ./bin/spark-shell to access
the Scala console to start an interactive session. You can also start the Python console by using ./bin/pyspark. This starts
an interactive Spark Application. There is also a process for submitting standalone applications to Spark called spark-
submit, whereby you can submit a precompiled application to Spark. We’ll show you how to do that in Chapter 3.
When you start Spark in this interactive mode, you implicitly create a SparkSession that manages the
Spark Application. When you start it through a standalone application, you must create the
SparkSession object yourself in your application code.
The SparkSession
As discussed in the beginning of this chapter, you control your Spark Application through a driver
process called the SparkSession. The SparkSession instance is the way Spark executes user-defined
manipulations across the cluster. There is a one-to-one correspondence between a SparkSession and
a Spark Application. In Scala and Python, the variable is available as spark when you start the
console. Let's go ahead and look at the SparkSession in both Scala and Python:
spark
In Scala, you should see something like the following:
res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@...
In Python you'll see something like this:
<pyspark.sql.session.SparkSession at 0x7efda4c1ccd0>
Let's now perform the simple task of creating a range of numbers. This range of numbers is just like a
named column in a spreadsheet:
// in Scala
val myRange = spark.range(1000).toDF("number")
# in Python
myRange = spark.range(1000).toDF("number")
You just ran your first Spark code! We created a DataFrame with one column containing 1,000 rows
with values from 0 to 999. This range of numbers represents a distributed collection. When run on a
cluster, each part of this range of numbers exists on a different executor. This is a Spark DataFrame.
DataFrames
A DataFrame is the most common Structured API and simply represents a table of data with rows and
columns. The list that defines the columns and the types within those columns is called the schema.
You can think of a DataFrame as a spreadsheet with named columns. Figure 2-3 illustrates the
fundamental difference: a spreadsheet sits on one computer in one specific location, whereas a Spark
DataFrame can span thousands of computers. The reason for putting the data on more than one
computer should be intuitive: either the data is too large to fit on one machine or it would simply take
too long to perform that computation on one machine.
Figure 2-3. Distributed versus single-machine analysis
The DataFrame concept is not unique to Spark. R and Python both have similar concepts. However,
Python/R DataFrames (with some exceptions) exist on one machine rather than multiple machines.
This limits what you can do with a given DataFrame to the resources that exist on that specific
machine. However, because Spark has language interfaces for both Python and R, it’s quite easy to
convert Pandas (Python) DataFrames to Spark DataFrames, and R DataFrames to Spark DataFrames.
NOTE
Spark has several core abstractions: Datasets, DataFrames, SQL Tables, and Resilient Distributed Datasets (RDDs). These
different abstractions all represent distributed collections of data. The easiest and most efficient are DataFrames, which are
available in all languages. We cover Datasets at the end of Part II, and RDDs in Part III.
Partitions
To allow every executor to perform work in parallel, Spark breaks up the data into chunks called
partitions. A partition is a collection of rows that sit on one physical machine in your cluster. A
DataFrame’s partitions represent how the data is physically distributed across the cluster of machines
during execution. If you have one partition, Spark will have a parallelism of only one, even if you
have thousands of executors. If you have many partitions but only one executor, Spark will still have a
parallelism of only one because there is only one computation resource.
An important thing to note is that with DataFrames you do not (for the most part) manipulate partitions
manually or individually. You simply specify high-level transformations of data in the physical
partitions, and Spark determines how this work will actually execute on the cluster. Lower-level
APIs do exist (via the RDD interface), and we cover those in Part III.
Transformations
In Spark, the core data structures are immutable, meaning they cannot be changed after they're
created. This might seem like a strange concept at first: if you cannot change it, how are you supposed
to use it? To “change” a DataFrame, you need to instruct Spark how you would like to modify it to do
what you want. These instructions are called transformations. Let's perform a simple transformation
to find all even numbers in our current DataFrame:
// in Scala
val divisBy2 = myRange.where("number % 2 = 0")
# in Python
divisBy2 = myRange.where("number % 2 = 0")
Notice that these return no output. This is because we specified only an abstract transformation, and
Spark will not act on transformations until we call an action (we discuss this shortly).
Transformations are the core of how you express your business logic using Spark. There are two
types of transformations: those that specify narrow dependencies, and those that specify wide
dependencies.
Transformations consisting of narrow dependencies (we’ll call them narrow transformations) are
those for which each input partition will contribute to only one output partition. In the preceding code
snippet, the where statement specifies a narrow dependency, where only one partition contributes to
at most one output partition, as you can see in Figure 2-4.
Figure 2-4. A narrow dependency
A wide dependency (or wide transformation) style transformation will have input partitions
contributing to many output partitions. You will often hear this referred to as a shuffle whereby Spark
will exchange partitions across the cluster. With narrow transformations, Spark will automatically
perform an operation called pipelining, meaning that if we specify multiple filters on DataFrames,
they'll all be performed in-memory. The same cannot be said for shuffles. When we perform a shuffle,
Spark writes the results to disk. Wide transformations are illustrated in Figure 2-5.
Figure 2-5. A wide dependency
You'll see a lot of discussion about shuffle optimization across the web because it's an important
topic, but for now, all you need to understand is that there are two kinds of transformations. You now
can see how transformations are simply ways of specifying different series of data manipulation. This
leads us to a topic called lazy evaluation.
Lazy Evaluation
Lazy evaluation means that Spark will wait until the very last moment to execute the graph of
computation instructions. In Spark, instead of modifying the data immediately when you express some
operation, you build up a plan of transformations that you would like to apply to your source data. By
waiting until the last minute to execute the code, Spark compiles this plan from your raw DataFrame
transformations to a streamlined physical plan that will run as efficiently as possible across the
cluster. This provides immense benefits because Spark can optimize the entire data flow from end to
end. An example of this is something called predicate pushdown on DataFrames. If we build a large
Spark job but specify a filter at the end that only requires us to fetch one row from our source data,
the most efficient way to execute this is to access the single record that we need. Spark will actually
optimize this for us by pushing the filter down automatically.
Actions
Transformations allow us to build up our logical transformation plan. To trigger the computation, we
run an action. An action instructs Spark to compute a result from a series of transformations. The
simplest action is count, which gives us the total number of records in the DataFrame:
divisBy2.count()
The output of the preceding code should be 500. Of course, count is not the only action. There are
three kinds of actions:
Actions to view data in the console
Actions to collect data to native objects in the respective language
Actions to write to output data sources
In specifying this action, we started a Spark job that runs our filter transformation (a narrow
transformation), then an aggregation (a wide transformation) that performs the counts on a per
partition basis, and then a collect, which brings our result to a native object in the respective
language. You can see all of this by inspecting the Spark UI, a tool included in Spark with which you
can monitor the Spark jobs running on a cluster.
Spark UI
You can monitor the progress of a job through the Spark web UI. The Spark UI is available on port
4040 of the driver node. If you are running in local mode, this will be http://localhost:4040. The
Spark UI displays information on the state of your Spark jobs, their environment, and cluster state. It's
very useful, especially for tuning and debugging. Figure 2-6 shows an example UI for a Spark job
where two stages containing nine tasks were executed.
Figure 2-6. The Spark UI
This chapter will not go into detail about Spark job execution and the Spark UI. We will cover that in
Chapter 18. At this point, all you need to understand is that a Spark job represents a set of
transformations triggered by an individual action, and you can monitor that job from the Spark UI.
An End-to-End Example
In the previous example, we created a DataFrame of a range of numbers; not exactly groundbreaking
big data. In this section, we will reinforce everything we learned previously in this chapter with a
more realistic example, and explain step by step what is happening under the hood. We’ll use Spark
to analyze some flight data from the United States Bureau of Transportation Statistics.
Inside of the CSV folder, you'll see that we have a number of files. There's also a number of other
folders with different file formats, which we discuss in Chapter 9. For now, lets focus on the CSV
files.
Each file has a number of rows within it. These files are CSV files, meaning that they're a semi-structured data format, with each row in the file representing a row in our future DataFrame:
$ head /data/flight-data/csv/2015-summary.csv
DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States,Romania,15
United States,Croatia,1
United States,Ireland,344
Spark includes the ability to read and write from a large number of data sources. To read this data,
we will use a DataFrameReader that is associated with our SparkSession. In doing so, we will
specify the file format as well as any options we want to specify. In our case, we want to do
something called schema inference, which means that we want Spark to take a best guess at what the
schema of our DataFrame should be. We also want to specify that the first row is the header in the
file, so we’ll specify that as an option, too.
To get the schema information, Spark reads in a little bit of the data and then attempts to parse the
types in those rows according to the types available in Spark. You also have the option of strictly
specifying a schema when you read in data (which we recommend in production scenarios):
// in Scala
val flightData2015 = spark
.read
.option("inferSchema", "true")
.option("header", "true")
.csv("/data/flight-data/csv/2015-summary.csv")
# in Python
flightData2015 = spark\
.read\
.option("inferSchema", "true")\
.option("header", "true")\
.csv("/data/flight-data/csv/2015-summary.csv")
Each of these DataFrames (in Scala and Python) has a set of columns with an unspecified number of
rows. The reason the number of rows is unspecified is that reading data is a transformation, and
is therefore a lazy operation. Spark peeked at only a couple of rows of data to try to guess what types
each column should be. Figure 2-7 provides an illustration of the CSV file being read into a
DataFrame and then being converted into a local array or list of rows.
Figure 2-7. Reading a CSV file into a DataFrame and converting it to a local array or list of rows
If we perform the take action on the DataFrame, we will be able to see the same results that we saw
before when we used the command line:
flightData2015.take(3)
Array([United States,Romania,15], [United States,Croatia...
Let's specify some more transformations! Now, let's sort our data according to the count column,
which is an integer type. Figure 2-8 illustrates this process.
NOTE
Remember, sort does not modify the DataFrame. We use sort as a transformation that returns a new DataFrame by
transforming the previous DataFrame. Let’s illustrate what’s happening when we call take on that resulting DataFrame
(Figure 2-8).
Figure 2-8. Reading, sorting, and collecting a DataFrame
Nothing happens to the data when we call sort because it's just a transformation. However, we can
see that Spark is building up a plan for how it will execute this across the cluster by looking at the
explain plan. We can call explain on any DataFrame object to see the DataFrame’s lineage (or how
Spark will execute this query):
flightData2015.sort("count").explain()
== Physical Plan ==
*Sort [count#195 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#195 ASC NULLS FIRST, 200)
+- *FileScan csv [DEST_COUNTRY_NAME#193,ORIGIN_COUNTRY_NAME#194,count#195] ...
Congratulations, you've just read your first explain plan! Explain plans are a bit arcane, but with a bit
of practice it becomes second nature. You can read explain plans from top to bottom, the top being the
end result, and the bottom being the source(s) of data. In this case, take a look at the first keywords.
You will see sort, exchange, and FileScan. That’s because the sort of our data is actually a wide
transformation because rows will need to be compared with one another. Don't worry too much about
understanding everything about explain plans at this point; they can simply be helpful tools for debugging
and improving your knowledge as you progress with Spark.
Now, just like we did before, we can specify an action to kick off this plan. However, before doing
that, we’re going to set a configuration. By default, when we perform a shuffle, Spark outputs 200
shuffle partitions. Let's set this value to 5 to reduce the number of output partitions from the
shuffle:
spark.conf.set("spark.sql.shuffle.partitions", "5")
flightData2015.sort("count").take(2)
... Array([United States,Singapore,1], [Moldova,United States,1])
Figure 2-9 illustrates this operation. Notice that in addition to the logical transformations, we include
the physical partition count, as well.
Figure 2-9. The process of logical and physical DataFrame manipulation
The logical plan of transformations that we build up defines a lineage for the DataFrame so that at any
given point in time, Spark knows how to recompute any partition by performing all of the operations
it had before on the same input data. This sits at the heart of Spark's programming model: functional
programming where the same inputs always result in the same outputs when the transformations on
that data stay constant.
We do not manipulate the physical data; instead, we configure physical execution characteristics
through things like the shuffle partitions parameter that we set a few moments ago. We ended up with
five output partitions because that's the value we specified for the shuffle partitions configuration. You can change
this to help control the physical execution characteristics of your Spark jobs. Go ahead and
experiment with different values and see the number of partitions yourself. In experimenting with
different values, you should see drastically different runtimes. Remember that you can monitor the job
progress by navigating to the Spark UI on port 4040 to see the physical and logical execution
characteristics of your jobs.
DataFrames and SQL
We worked through a simple transformation in the previous example; let's now work through a more
complex one and follow along in both DataFrames and SQL. Spark can run the same transformations,
regardless of the language, in the exact same way. You can express your business logic in SQL or
DataFrames (either in R, Python, Scala, or Java) and Spark will compile that logic down to an
underlying plan (that you can see in the explain plan) before actually executing your code. With Spark
SQL, you can register any DataFrame as a table or view (a temporary table) and query it using pure
SQL. There is no performance difference between writing SQL queries and writing DataFrame code;
they both “compile” to the same underlying plan that we specify in DataFrame code.
You can make any DataFrame into a table or view with one simple method call:
flightData2015.createOrReplaceTempView("flight_data_2015")
Now we can query our data in SQL. To do so, we’ll use the spark.sql function (remember, spark is
our SparkSession variable) that conveniently returns a new DataFrame. Although this might seem a
bit circular in logic (a SQL query against a DataFrame returns another DataFrame), it's actually
quite powerful. This makes it possible for you to specify transformations in the manner most
convenient to you at any given point in time and not sacrifice any efficiency to do so! To understand
that this is happening, let's take a look at two explain plans:
// in Scala
val sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1)
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
""")
val dataFrameWay = flightData2015
.groupBy('DEST_COUNTRY_NAME)
.count()
sqlWay.explain
dataFrameWay.explain
# in Python
sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1)
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
""")
dataFrameWay = flightData2015\
.groupBy("DEST_COUNTRY_NAME")\
.count()
sqlWay.explain()
dataFrameWay.explain()
== Physical Plan ==
*HashAggregate(keys=[DEST_COUNTRY_NAME#182], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#182, 5)
+- *HashAggregate(keys=[DEST_COUNTRY_NAME#182], functions=[partial_count(1)])
+- *FileScan csv [DEST_COUNTRY_NAME#182] ...
== Physical Plan ==
*HashAggregate(keys=[DEST_COUNTRY_NAME#182], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#182, 5)
+- *HashAggregate(keys=[DEST_COUNTRY_NAME#182], functions=[partial_count(1)])
+- *FileScan csv [DEST_COUNTRY_NAME#182] ...
Notice that these plans compile to the exact same underlying plan!
Let's pull out some interesting statistics from our data. One thing to understand is that DataFrames
(and SQL) in Spark already have a huge number of manipulations available. There are hundreds of
functions that you can use and import to help you resolve your big data problems faster. We will use
the max function to establish the maximum number of flights to and from any given location. This just
scans each value in the relevant column in the DataFrame and checks whether it's greater than the
previous values that have been seen. This is a transformation, because we are effectively filtering
down to one row. Let's see what that looks like:
spark.sql("SELECT max(count) from flight_data_2015").take(1)
// in Scala
import org.apache.spark.sql.functions.max
flightData2015.select(max("count")).take(1)
# in Python
from pyspark.sql.functions import max
flightData2015.select(max("count")).take(1)
Great, that's a simple example that gives a result of 370,002. Let's perform something a bit more
complicated and find the top five destination countries in the data. This is our first multi-transformation
query, so we'll take it step by step. Let's begin with a fairly straightforward SQL
aggregation:
// in Scala
val maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")
maxSql.show()
# in Python
maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")
maxSql.show()
+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
| United States| 411352|
| Canada| 8399|
| Mexico| 7140|
| United Kingdom| 2025|
| Japan| 1548|
+-----------------+-----------------+
Now, let's move to the DataFrame syntax that is semantically similar but slightly different in
implementation and ordering. But, as we mentioned, the underlying plans for both of them are the
same. Let’s run the queries and see their results as a sanity check:
// in Scala
import org.apache.spark.sql.functions.desc
flightData2015
.groupBy("DEST_COUNTRY_NAME")
.sum("count")
.withColumnRenamed("sum(count)", "destination_total")
.sort(desc("destination_total"))
.limit(5)
.show()
# in Python
from pyspark.sql.functions import desc
flightData2015\
.groupBy("DEST_COUNTRY_NAME")\
.sum("count")\
.withColumnRenamed("sum(count)", "destination_total")\
.sort(desc("destination_total"))\
.limit(5)\
.show()
+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
| United States| 411352|
| Canada| 8399|
| Mexico| 7140|
| United Kingdom| 2025|
| Japan| 1548|
+-----------------+-----------------+
Now there are seven steps that take us all the way back to the source data. You can see this in the
explain plan on those DataFrames. Figure 2-10 shows the set of steps that we perform in “code.” The
true execution plan (the one visible in explain) will differ from that shown in Figure 2-10 because
of optimizations in the physical execution; however, the illustration is as good of a starting point as
any. This execution plan is a directed acyclic graph (DAG) of transformations, each resulting in a
new immutable DataFrame, on which we call an action to generate a result.
Figure 2-10. The entire DataFrame transformation flow
The first step is to read in the data. We defined the DataFrame previously but, as a reminder, Spark
does not actually read it in until an action is called on that DataFrame or one derived from the
original DataFrame.
The second step is our grouping; technically when we call groupBy, we end up with a
RelationalGroupedDataset, which is a fancy name for a DataFrame that has a grouping specified
but needs the user to specify an aggregation before it can be queried further. We basically specified
that we’re going to be grouping by a key (or set of keys) and that now we’re going to perform an
aggregation over each one of those keys.
Therefore, the third step is to specify the aggregation. Let's use the sum aggregation method. This
takes as input a column expression or, simply, a column name. The result of the sum method call is a
new DataFrame. You'll see that it has a new schema but that it does know the type of each column.
It’s important to reinforce (again!) that no computation has been performed. This is simply another
transformation that we’ve expressed, and Spark is simply able to trace our type information through
it.
The fourth step is a simple renaming. We use the withColumnRenamed method that takes two
arguments, the original column name and the new column name. Of course, this doesn't perform
computation: this is just another transformation!
The fifth step sorts the data such that if we were to take results off of the top of the DataFrame, they
would have the largest values in the destination_total column.
You likely noticed that we had to import a function to do this, the desc function. You might also have
noticed that desc does not return a string but a Column. In general, many DataFrame methods will
accept strings (as column names) or Column types or expressions. Columns and expressions are
actually the exact same thing.
Penultimately, we'll specify a limit. This just specifies that we only want to return the first five values
in our final DataFrame instead of all the data.
The last step is our action! Now we actually begin the process of collecting the results of our
DataFrame, and Spark will give us back a list or array in the language that we’re executing. To
reinforce all of this, let's look at the explain plan for the previous query:
// in Scala
flightData2015
.groupBy("DEST_COUNTRY_NAME")
.sum("count")
.withColumnRenamed("sum(count)", "destination_total")
.sort(desc("destination_total"))
.limit(5)
.explain()
# in Python
flightData2015\
.groupBy("DEST_COUNTRY_NAME")\
.sum("count")\
.withColumnRenamed("sum(count)", "destination_total")\
.sort(desc("destination_total"))\
.limit(5)\
.explain()
== Physical Plan ==
TakeOrderedAndProject(limit=5, orderBy=[destination_total#16194L DESC], outpu...
+- *HashAggregate(keys=[DEST_COUNTRY_NAME#7323], functions=[sum(count#7325L)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#7323, 5)
+- *HashAggregate(keys=[DEST_COUNTRY_NAME#7323], functions=[partial_sum...
+- InMemoryTableScan [DEST_COUNTRY_NAME#7323, count#7325L]
+- InMemoryRelation [DEST_COUNTRY_NAME#7323, ORIGIN_COUNTRY_NA...
+- *Scan csv [DEST_COUNTRY_NAME#7578,ORIGIN_COUNTRY_NAME...
Although this explain plan doesn't match our exact "conceptual plan," all of the pieces are there. You
can see the limit statement as well as the orderBy (in the first line). You can also see how our
aggregation happens in two phases, in the partial_sum calls. This is because summing a list of
numbers is commutative and associative, so Spark can perform the sum partition by partition. Of course, we can see
how we read in the DataFrame, as well.
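The partial_sum/sum pair in the plan can be pictured with a plain-Python sketch (conceptual only, not Spark's implementation): each partition first computes per-key partial sums, and the exchange then merges the partials into the final totals.

```python
from collections import defaultdict

def partial_sums(partition):
    """Phase 1 (partial_sum): each partition sums its own rows per key."""
    acc = defaultdict(int)
    for key, count in partition:
        acc[key] += count
    return dict(acc)

def merge_partials(partials):
    """Phase 2: after the exchange, per-key partial sums are combined."""
    total = defaultdict(int)
    for part in partials:
        for key, value in part.items():
            total[key] += value
    return dict(total)

partitions = [
    [("United States", 15), ("Canada", 1)],
    [("United States", 10), ("Ireland", 3)],
]
result = merge_partials([partial_sums(p) for p in partitions])
# result == {"United States": 25, "Canada": 1, "Ireland": 3}
```

Because addition is associative and commutative, the final answer does not depend on how the rows were partitioned.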
Naturally, we don't always need to collect the data. We can also write it out to any data source that
Spark supports. For instance, suppose we want to store the information in a database like PostgreSQL
or write it out to another file.
Conclusion
This chapter introduced the basics of Apache Spark. We talked about transformations and actions, and
how Spark lazily executes a DAG of transformations in order to optimize the execution plan on
DataFrames. We also discussed how data is organized into partitions and set the stage for working
with more complex transformations. In Chapter 3 we take you on a tour of the vast Spark ecosystem
and look at some more advanced concepts and tools that are available in Spark, from streaming to
machine learning.
Chapter 3. A Tour of Spark’s Toolset
In Chapter 2, we introduced Spark's core concepts, like transformations and actions, in the context of
Spark's Structured APIs. These simple conceptual building blocks are the foundation of Apache
Spark's vast ecosystem of tools and libraries (Figure 3-1). Spark is composed of these primitives—
the lower-level APIs and the Structured APIs—and then a series of standard libraries for additional
functionality.
Figure 3-1. Spark's toolset
Spark's libraries support a variety of different tasks, from graph analysis and machine learning to
streaming and integrations with a host of computing and storage systems. This chapter presents a
whirlwind tour of much of what Spark has to offer, including some of the APIs we have not yet
covered and a few of the main libraries. For each section, you will find more detailed information in
other parts of this book; our purpose here is to provide you with an overview of what's possible.
This chapter covers the following:
Running production applications with spark-submit
Datasets: type-safe APIs for structured data
Structured Streaming
Machine learning and advanced analytics
Resilient Distributed Datasets (RDDs): Spark's low-level APIs
SparkR
The third-party package ecosystem
After you've taken the tour, you'll be able to jump to the corresponding parts of the book to find
answers to your questions about particular topics.
Running Production Applications
Spark makes it easy to develop and create big data programs. Spark also makes it easy to turn your
interactive exploration into production applications with spark-submit, a built-in command-line
tool. spark-submit does one thing: it lets you send your application code to a cluster and launch it to
execute there. Upon submission, the application will run until it exits (completes the task) or
encounters an error. You can do this with all of Spark's supported cluster managers, including
Standalone, Mesos, and YARN.
spark-submit offers several controls with which you can specify the resources your application
needs as well as how it should be run and its command-line arguments.
You can write applications in any of Sparks supported languages and then submit them for execution.
The simplest example is running an application on your local machine. We’ll show this by running a
sample Scala application that comes with Spark, using the following command in the directory where
you downloaded Spark:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local \
./examples/jars/spark-examples_2.11-2.2.0.jar 10
This sample application calculates the digits of pi to a certain level of estimation. Here, we’ve told
spark-submit that we want to run on our local machine, which class and which JAR we would like
to run, and some command-line arguments for that class.
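The bundled SparkPi example estimates π by Monte Carlo sampling. Stripped of the Spark parallelism, the core computation looks roughly like this plain-Python sketch:

```python
import random

def estimate_pi(num_samples, seed=42):
    """Throw random darts at the unit square; the fraction that lands
    inside the quarter circle approximates pi / 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))  # roughly 3.14
```

The command-line argument we passed (10) controls how many parallel slices of this sampling work Spark runs.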
We can also run a Python version of the application using the following command:
./bin/spark-submit \
--master local \
./examples/src/main/python/pi.py 10
By changing the master argument of spark-submit, we can also submit the same application to a
cluster running Spark's standalone cluster manager, Mesos, or YARN.
spark-submit will come in handy to run many of the examples we’ve packaged with this book. In
the rest of this chapter, we'll go through examples of some APIs that we haven't yet seen in our
introduction to Spark.
Datasets: Type-Safe Structured APIs
The first API we'll describe is a type-safe version of Spark's structured API called Datasets, for
writing statically typed code in Java and Scala. The Dataset API is not available in Python and R,
because those languages are dynamically typed.
Recall that DataFrames, which we saw in the previous chapter, are a distributed collection of objects
of type Row that can hold various types of tabular data. The Dataset API gives users the ability to
assign a Java/Scala class to the records within a DataFrame and manipulate it as a collection of typed
objects, similar to a Java ArrayList or Scala Seq. The APIs available on Datasets are type-safe,
meaning that you cannot accidentally view the objects in a Dataset as being of another class than the
class you put in initially. This makes Datasets especially attractive for writing large applications,
with which multiple software engineers must interact through well-defined interfaces.
The Dataset class is parameterized with the type of object contained inside: Dataset<T> in Java
and Dataset[T] in Scala. For example, a Dataset[Person] will be guaranteed to contain objects of
class Person. As of Spark 2.0, the supported types are classes following the JavaBean pattern in
Java and case classes in Scala. These types are restricted because Spark needs to be able to
automatically analyze the type T and create an appropriate schema for the tabular data within your
Dataset.
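Spark's analysis of the type T has a loose analogy in Python's dataclasses (a sketch of the idea only, not how Spark builds schemas; Python's int stands in for the Scala BigInt):

```python
from dataclasses import dataclass, fields

@dataclass
class Flight:
    DEST_COUNTRY_NAME: str
    ORIGIN_COUNTRY_NAME: str
    count: int  # stand-in for the Scala BigInt

# Like Spark analyzing T, we can read the field names and types
# off the class via reflection to derive a tabular schema.
schema = {f.name: (f.type if isinstance(f.type, str) else f.type.__name__)
          for f in fields(Flight)}
# {"DEST_COUNTRY_NAME": "str", "ORIGIN_COUNTRY_NAME": "str", "count": "int"}
```

This is why the supported types are restricted to JavaBeans and case classes: their fields are statically enumerable, so a schema can be derived mechanically.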
One great thing about Datasets is that you can use them only when you need or want to. For instance,
in the following example, we’ll define our own data type and manipulate it via arbitrary map and
filter functions. After we’ve performed our manipulations, Spark can automatically turn it back into a
DataFrame, and we can manipulate it further by using the hundreds of functions that Spark includes.
This makes it easy to drop down to a lower level, perform type-safe coding when necessary, and move
higher up to SQL for more rapid analysis. Here is a small example showing how you can use both
type-safe functions and DataFrame-like SQL expressions to quickly write business logic:
// in Scala
case class Flight(DEST_COUNTRY_NAME: String,
ORIGIN_COUNTRY_NAME: String,
count: BigInt)
val flightsDF = spark.read
.parquet("/data/flight-data/parquet/2010-summary.parquet/")
val flights = flightsDF.as[Flight]
One final advantage is that when you call collect or take on a Dataset, it will collect objects of the
proper type in your Dataset, not DataFrame Rows. This makes it easy to get type safety and securely
perform manipulation in a distributed and a local manner without code changes:
// in Scala
flights
.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
.map(flight_row => flight_row)
.take(5)
flights
.take(5)
.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
.map(fr => Flight(fr.DEST_COUNTRY_NAME, fr.ORIGIN_COUNTRY_NAME, fr.count + 5))
We cover Datasets in depth in Chapter 11.
Structured Streaming
Structured Streaming is a high-level API for stream processing that became production-ready in
Spark 2.2. With Structured Streaming, you can take the same operations that you perform in batch
mode using Sparks structured APIs and run them in a streaming fashion. This can reduce latency and
allow for incremental processing. The best thing about Structured Streaming is that it allows you to
rapidly extract value out of streaming systems with virtually no code changes. It also
makes it easy to conceptualize because you can write your batch job as a way to prototype it and then
you can convert it to a streaming job. The way all of this works is by incrementally processing that
data.
Let's walk through a simple example of how easy it is to get started with Structured Streaming. For
this, we will use a retail dataset, one that has specific dates and times for us to be able to use. We
will use the “by-day” set of files, in which one file represents one day of data.
We put it in this format to simulate data being produced in a consistent and regular manner by a
different process. This is retail data so imagine that these are being produced by retail stores and sent
to a location where they will be read by our Structured Streaming job.
It’s also worth sharing a sample of the data so you can reference what the data looks like:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17...
536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kin...
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850...
To ground this, let's first analyze the data as a static dataset and create a DataFrame to do so. We'll
also create a schema from this static dataset (there are ways of using schema inference with streaming
that we will touch on in Part V):
// in Scala
val staticDataFrame = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/data/retail-data/by-day/*.csv")
staticDataFrame.createOrReplaceTempView("retail_data")
val staticSchema = staticDataFrame.schema
# in Python
staticDataFrame = spark.read.format("csv")\
.option("header", "true")\
.option("inferSchema", "true")\
.load("/data/retail-data/by-day/*.csv")
staticDataFrame.createOrReplaceTempView("retail_data")
staticSchema = staticDataFrame.schema
Because we're working with time–series data, it's worth mentioning how we might go about grouping
and aggregating our data. In this example we’ll take a look at the sale hours during which a given
customer (identified by CustomerId) makes a large purchase. For example, let’s add a total cost
column and see on what days a customer spent the most.
The window function will include all data from each day in the aggregation. It's simply a window
over the time–series column in our data. This is a helpful tool for manipulating dates and timestamps
because we can specify our requirements in a more human form (via intervals), and Spark will group
all of them together for us:
// in Scala
import org.apache.spark.sql.functions.{window, column, desc, col}
staticDataFrame
.selectExpr(
"CustomerId",
"(UnitPrice * Quantity) as total_cost",
"InvoiceDate")
.groupBy(
col("CustomerId"), window(col("InvoiceDate"), "1 day"))
.sum("total_cost")
.show(5)
# in Python
from pyspark.sql.functions import window, column, desc, col
staticDataFrame\
.selectExpr(
"CustomerId",
"(UnitPrice * Quantity) as total_cost",
"InvoiceDate")\
.groupBy(
col("CustomerId"), window(col("InvoiceDate"), "1 day"))\
.sum("total_cost")\
.show(5)
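Outside of Spark, the same "1 day" windowed aggregation can be sketched in a few lines of plain Python (illustrative only; the sample rows are made up to match the retail schema):

```python
from collections import defaultdict
from datetime import datetime

# Made-up rows in the retail layout: (CustomerId, Quantity, UnitPrice, InvoiceDate)
rows = [
    ("17850", 6, 2.55, "2010-12-01 08:26:00"),
    ("17850", 6, 3.39, "2010-12-01 08:26:00"),
    ("17450", 2, 5.00, "2010-12-02 10:00:00"),
]

totals = defaultdict(float)
for customer, qty, price, ts in rows:
    # A "1 day" window buckets each timestamp by its calendar date.
    day = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date()
    totals[(customer, day)] += qty * price
# customer 17850 on 2010-12-01 totals 6*2.55 + 6*3.39 = 35.64
```

Spark's window function generalizes this bucketing to arbitrary (and sliding) intervals, expressed in human-readable form.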
It’s worth mentioning that you can also run this as SQL code, just as we saw in the previous chapter.
Here's a sample of the output that you'll see:
+----------+--------------------+------------------+
|CustomerId| window| sum(total_cost)|
+----------+--------------------+------------------+
| 17450.0|[2011-09-20 00:00...| 71601.44|
...
| null|[2011-12-08 00:00...|31975.590000000007|
+----------+--------------------+------------------+
The null values represent the fact that we don't have a CustomerId for some transactions.
That's the static DataFrame version; there shouldn't be any big surprises in there if you're familiar
with the syntax.
Because you're likely running this in local mode, it's a good practice to set the number of shuffle
partitions to something that's going to be a better fit for local mode. This configuration specifies the
number of partitions that should be created after a shuffle. By default, the value is 200, but because
there aren't many executors on this machine, it's worth reducing this to 5. We did this same operation
in Chapter 2, so if you don't remember why this is important, feel free to flip back to review.
spark.conf.set("spark.sql.shuffle.partitions", "5")
Now that we've seen how that works, let's take a look at the streaming code! You'll notice that very
little actually changes about the code. The biggest change is that we used readStream instead of read.
Additionally, you'll notice the maxFilesPerTrigger option, which simply specifies the number of files
we should read in at once. This is to make our demonstration more "streaming," and in a production
scenario this would probably be omitted.
// in Scala
val streamingDataFrame = spark.readStream
.schema(staticSchema)
.option("maxFilesPerTrigger", 1)
.format("csv")
.option("header", "true")
.load("/data/retail-data/by-day/*.csv")
# in Python
streamingDataFrame = spark.readStream\
.schema(staticSchema)\
.option("maxFilesPerTrigger", 1)\
.format("csv")\
.option("header", "true")\
.load("/data/retail-data/by-day/*.csv")
Now we can see whether our DataFrame is streaming:
streamingDataFrame.isStreaming // returns true
Let's set up the same business logic as the previous DataFrame manipulation. We'll perform a
summation in the process:
// in Scala
val purchaseByCustomerPerHour = streamingDataFrame
.selectExpr(
"CustomerId",
"(UnitPrice * Quantity) as total_cost",
"InvoiceDate")
.groupBy(
$"CustomerId", window($"InvoiceDate", "1 day"))
.sum("total_cost")
# in Python
purchaseByCustomerPerHour = streamingDataFrame\
.selectExpr(
"CustomerId",
"(UnitPrice * Quantity) as total_cost",
"InvoiceDate")\
.groupBy(
col("CustomerId"), window(col("InvoiceDate"), "1 day"))\
.sum("total_cost")
This is still a lazy operation, so we will need to call a streaming action to start the execution of this
data flow.
Streaming actions are a bit different from our conventional static action because we’re going to be
populating data somewhere instead of just calling something like count (which doesn't make any
sense on a stream anyway). The action we will use will output to an in-memory table that we will
update after each trigger. In this case, each trigger is based on an individual file (the read option that
we set). Spark will mutate the data in the in-memory table such that we will always have the highest
value as specified in our previous aggregation:
// in Scala
purchaseByCustomerPerHour.writeStream
.format("memory") // memory = store in-memory table
.queryName("customer_purchases") // the name of the in-memory table
.outputMode("complete") // complete = all the counts should be in the table
.start()
# in Python
purchaseByCustomerPerHour.writeStream\
.format("memory")\
.queryName("customer_purchases")\
.outputMode("complete")\
.start()
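The trigger-by-trigger behavior of complete output mode can be pictured as a plain-Python sketch (conceptual only, with made-up per-file aggregates): each micro-batch folds into a running state, and the whole table is re-emitted after every trigger.

```python
# Running aggregation state, keyed by (customer, window).
state = {}

def process_trigger(micro_batch):
    """Fold one micro-batch (here, one file's aggregates) into the state and
    return the complete, updated table -- what "complete" output mode emits."""
    for (customer, window), cost in micro_batch:
        state[(customer, window)] = state.get((customer, window), 0.0) + cost
    return dict(state)

file1 = [(("17850", "2010-12-01"), 35.64)]
file2 = [(("17850", "2010-12-01"), 10.0), (("17450", "2010-12-02"), 5.0)]

process_trigger(file1)
table = process_trigger(file2)
# table[("17850", "2010-12-01")] has grown to 45.64 after the second trigger
```

This is why the in-memory table's composition keeps changing as files arrive: each trigger merges new data into the running totals.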
When we start the stream, we can run queries against it to debug what our result will look like if we
were to write this out to a production sink:
// in Scala
spark.sql("""
SELECT *
FROM customer_purchases
ORDER BY `sum(total_cost)` DESC
""")
.show(5)
# in Python
spark.sql("""
SELECT *
FROM customer_purchases
ORDER BY `sum(total_cost)` DESC
""")\
.show(5)
You'll notice that the composition of our table changes as we read in more data! With each file, the
results might or might not be changing based on the data. Naturally, because we’re grouping
customers, we hope to see an increase in the top customer purchase amounts over time (and do for a
period of time!). Another option you can use is to write the results out to the console:
purchaseByCustomerPerHour.writeStream
.format("console")
.queryName("customer_purchases_2")
.outputMode("complete")
.start()
You shouldn't use either of these streaming methods in production, but they do make for convenient
demonstration of Structured Streamings power. Notice how this window is built on event time, as
well, not the time at which Spark processes the data. This was one of the shortcomings of Spark
Streaming that Structured Streaming has resolved. We cover Structured Streaming in depth in Part V.
Machine Learning and Advanced Analytics
Another popular aspect of Spark is its ability to perform large-scale machine learning with a built-in
library of machine learning algorithms called MLlib. MLlib allows for preprocessing, munging,
training of models, and making predictions at scale on data. You can even use models trained in
MLlib to make predictions in Structured Streaming. Spark provides a sophisticated machine learning
API for performing a variety of machine learning tasks, from classification to regression, and
clustering to deep learning. To demonstrate this functionality, we will perform some basic clustering
on our data using a standard algorithm called k-means.
WHAT IS K-MEANS?
k-means is a clustering algorithm in which k centers are randomly assigned within the data. The points closest to each
center are then "assigned" to a class, and the center of the assigned points is computed. This center point is called the
centroid. We then label the points closest to that centroid to the centroid's class, and shift the centroid to the new center of
that cluster of points. We repeat this process for a finite number of iterations or until convergence (our center points stop
changing).
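The loop just described can be sketched in a few lines of plain Python (a one-dimensional toy, not MLlib's implementation):

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Toy one-dimensional k-means: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Shift each centroid to the mean of its assigned points.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

points = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
print(kmeans(points, k=2))  # two centroids, near 1.0 and 10.0
```

MLlib runs the same idea over distributed, multi-dimensional feature vectors.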
Spark includes a number of preprocessing methods out of the box. To demonstrate these methods, we
will begin with some raw data and build up transformations to get the data into the right format,
at which point we can actually train our model and then serve predictions:
staticDataFrame.printSchema()
root
|-- InvoiceNo: string (nullable = true)
|-- StockCode: string (nullable = true)
|-- Description: string (nullable = true)
|-- Quantity: integer (nullable = true)
|-- InvoiceDate: timestamp (nullable = true)
|-- UnitPrice: double (nullable = true)
|-- CustomerID: double (nullable = true)
|-- Country: string (nullable = true)
Machine learning algorithms in MLlib require that data is represented as numerical values. Our
current data is represented by a variety of different types, including timestamps, integers, and strings.
Therefore we need to transform this data into some numerical representation. In this instance, we’ll
use several DataFrame transformations to manipulate our date data:
// in Scala
import org.apache.spark.sql.functions.date_format
val preppedDataFrame = staticDataFrame
.na.fill(0)
.withColumn("day_of_week", date_format($"InvoiceDate", "EEEE"))
.coalesce(5)
# in Python
from pyspark.sql.functions import date_format, col
preppedDataFrame = staticDataFrame\
.na.fill(0)\
.withColumn("day_of_week", date_format(col("InvoiceDate"), "EEEE"))\
.coalesce(5)
We are also going to need to split the data into training and test sets. In this instance, we are going to
do this manually by the date on which a certain purchase occurred; however, we could also use
MLlib’s transformation APIs to create a training and test set via train validation splits or cross
validation (these topics are covered at length in Part VI):
// in Scala
val trainDataFrame = preppedDataFrame
.where("InvoiceDate < '2011-07-01'")
val testDataFrame = preppedDataFrame
.where("InvoiceDate >= '2011-07-01'")
# in Python
trainDataFrame = preppedDataFrame\
.where("InvoiceDate < '2011-07-01'")
testDataFrame = preppedDataFrame\
.where("InvoiceDate >= '2011-07-01'")
Because this is a time–series set of data, we split by an arbitrary date in the dataset. Although this
might not be the optimal split for our training and test sets, for the purposes of this example it will
work just fine. We'll see that this splits our dataset roughly in half:
trainDataFrame.count()
testDataFrame.count()
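The date-based split amounts to two complementary filters; a plain-Python sketch of the same idea (with made-up rows):

```python
from datetime import date

def split_by_date(rows, cutoff):
    """Time-based split: rows before the cutoff train, the rest test."""
    train = [r for r in rows if r["InvoiceDate"] < cutoff]
    test = [r for r in rows if r["InvoiceDate"] >= cutoff]
    return train, test

rows = [
    {"InvoiceDate": date(2011, 3, 1), "total": 10.0},
    {"InvoiceDate": date(2011, 9, 1), "total": 20.0},
]
train, test = split_by_date(rows, date(2011, 7, 1))
# one row lands in each half
```

Splitting by time (rather than randomly) avoids training on data from the future, which matters for time–series problems.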
Note that these transformations are DataFrame transformations, which we cover extensively in Part II.
Spark's MLlib also provides a number of transformations with which we can automate some of our
general transformations. One such transformer is a StringIndexer:
// in Scala
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("day_of_week")
.setOutputCol("day_of_week_index")
# in Python
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer()\
.setInputCol("day_of_week")\
.setOutputCol("day_of_week_index")
This will turn our days of weeks into corresponding numerical values. For example, Spark might
represent Saturday as 6, and Monday as 1. However, with this numbering scheme, we are implicitly
stating that Saturday is greater than Monday (by pure numerical values). This is obviously incorrect.
To fix this, we therefore need to use a OneHotEncoder to encode each of these values as their own
column. These Boolean flags state whether that day of week is the relevant day of the week:
// in Scala
import org.apache.spark.ml.feature.OneHotEncoder
val encoder = new OneHotEncoder()
.setInputCol("day_of_week_index")
.setOutputCol("day_of_week_encoded")
# in Python
from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder()\
.setInputCol("day_of_week_index")\
.setOutputCol("day_of_week_encoded")
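The combined effect of indexing then one-hot encoding can be sketched in plain Python (an illustration of the idea, not MLlib's implementation; the sketch mimics ordering labels by frequency but, unlike MLlib's encoder, keeps every category as its own flag):

```python
def string_index(values):
    """Fit step: map each distinct label to an index, most frequent first."""
    freq = {}
    for v in values:
        freq[v] = freq.get(v, 0) + 1
    ordered = sorted(freq, key=lambda v: (-freq[v], v))
    return {v: i for i, v in enumerate(ordered)}

def one_hot(index, size):
    """Encode an index as one Boolean flag per category, so no numeric
    ordering between categories is implied."""
    return [1 if i == index else 0 for i in range(size)]

days = ["Monday", "Saturday", "Monday", "Sunday"]
mapping = string_index(days)   # {"Monday": 0, "Saturday": 1, "Sunday": 2}
encoded = [one_hot(mapping[d], len(mapping)) for d in days]
# "Monday" becomes [1, 0, 0]; "Saturday" becomes [0, 1, 0]
```

Note the two-step shape: the fit step must first see all the data to learn the label-to-index mapping, after which the transform step is purely mechanical.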
Each of these will result in a set of columns that we will “assemble” into a vector. All machine
learning algorithms in Spark take as input a Vector type, which must be a set of numerical values:
// in Scala
import org.apache.spark.ml.feature.VectorAssembler
val vectorAssembler = new VectorAssembler()
.setInputCols(Array("UnitPrice", "Quantity", "day_of_week_encoded"))
.setOutputCol("features")
# in Python
from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler()\
.setInputCols(["UnitPrice", "Quantity", "day_of_week_encoded"])\
.setOutputCol("features")
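What the assembler does is mechanical: it concatenates the chosen columns (flattening any vector-valued column) into one feature vector per row. A plain-Python sketch of that effect, using the example's column names:

```python
def assemble(row, input_cols):
    """Concatenate scalar and vector-valued columns into one flat feature vector."""
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, list):   # an already-encoded vector column
            features.extend(float(v) for v in value)
        else:                         # a plain numeric column
            features.append(float(value))
    return features

row = {"UnitPrice": 2.55, "Quantity": 6, "day_of_week_encoded": [0, 1, 0]}
vector = assemble(row, ["UnitPrice", "Quantity", "day_of_week_encoded"])
# vector == [2.55, 6.0, 0.0, 1.0, 0.0]
```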
Here, we have three key features: the price, the quantity, and the day of week. Next, we’ll set this up
into a pipeline so that any future data we need to transform can go through the exact same process:
// in Scala
import org.apache.spark.ml.Pipeline
val transformationPipeline = new Pipeline()
.setStages(Array(indexer, encoder, vectorAssembler))
# in Python
from pyspark.ml import Pipeline
transformationPipeline = Pipeline()\
.setStages([indexer, encoder, vectorAssembler])
Preparing for training is a two-step process. We first need to fit our transformers to this dataset. We
cover this in depth in Part VI, but basically our StringIndexer needs to know how many unique
values there are to be indexed. After those exist, encoding is easy but Spark must look at all the
distinct values in the column to be indexed in order to store those values later on:
// in Scala
val fittedPipeline = transformationPipeline.fit(trainDataFrame)
# in Python
fittedPipeline = transformationPipeline.fit(trainDataFrame)
After we fit the training data, we are ready to take that fitted pipeline and use it to transform all of our
data in a consistent and repeatable way:
// in Scala
val transformedTraining = fittedPipeline.transform(trainDataFrame)
# in Python
transformedTraining = fittedPipeline.transform(trainDataFrame)
At this point, it’s worth mentioning that we could have included our model training in our pipeline.
We chose not to in order to demonstrate a use case for caching the data. Instead, we’re going to
perform some hyperparameter tuning on the model because we do not want to repeat the exact same
transformations over and over again; specifically, we’ll use caching, an optimization that we discuss
in more detail in Part IV. This will put a copy of the intermediately transformed dataset into memory,
allowing us to repeatedly access it at much lower cost than running the entire pipeline again. If you're
curious to see how much of a difference this makes, skip this line and run the training without caching
the data. Then try it after caching; you'll see the results are significant:
transformedTraining.cache()
We now have a training set; it's time to train the model. First we'll import the relevant model that
we’d like to use and instantiate it:
// in Scala
import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans()
.setK(20)
.setSeed(1L)
# in Python
from pyspark.ml.clustering import KMeans
kmeans = KMeans()\
.setK(20)\
.setSeed(1)
In Spark, training machine learning models is a two-phase process. First, we initialize an untrained
model, and then we train it. There are always two types for every algorithm in MLlib’s DataFrame
API. They follow the naming pattern of Algorithm, for the untrained version, and AlgorithmModel
for the trained version. In our example, this is KMeans and then KMeansModel.
Estimators in MLlib’s DataFrame API share roughly the same interface that we saw earlier with our
preprocessing transformers like the StringIndexer. This should come as no surprise because it
makes training an entire pipeline (which includes the model) simple. For our purposes here, we want
to do things a bit more step by step, so we chose to not do this in this example:
// in Scala
val kmModel = kmeans.fit(transformedTraining)
# in Python
kmModel = kmeans.fit(transformedTraining)
After we train this model, we can compute the cost according to some success metrics on our training
set. The resulting cost on this dataset is actually quite high, which is likely due to the fact that we did
not properly preprocess and scale our input data, which we cover in depth in Chapter 25:
kmModel.computeCost(transformedTraining)
// in Scala
val transformedTest = fittedPipeline.transform(testDataFrame)
# in Python
transformedTest = fittedPipeline.transform(testDataFrame)
kmModel.computeCost(transformedTest)
Naturally, we could continue to improve this model, layering more preprocessing as well as
performing hyperparameter tuning to ensure that we're getting a good model. We leave that discussion
for Part VI.
Lower-Level APIs
Spark includes a number of lower-level primitives to allow for arbitrary Java and Python object
manipulation via Resilient Distributed Datasets (RDDs). Virtually everything in Spark is built on top
of RDDs. As we will discuss in Chapter 4, DataFrame operations are built on top of RDDs and
compile down to these lower-level tools for convenient and extremely efficient distributed execution.
There are some things that you might use RDDs for, especially when you're reading or manipulating
raw data, but for the most part you should stick to the Structured APIs. RDDs are lower level than
DataFrames because they reveal physical execution characteristics (like partitions) to end users.
One thing that you might use RDDs for is to parallelize raw data that you have stored in memory on
the driver machine. For instance, let's parallelize some simple numbers into an RDD and then convert
it to a DataFrame so that we can use it with other DataFrames:
// in Scala
spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF()
# in Python
from pyspark.sql import Row
spark.sparkContext.parallelize([Row(1), Row(2), Row(3)]).toDF()
RDDs are available in Scala as well as Python. However, they're not equivalent. This differs from
the DataFrame API (where the execution characteristics are the same) due to some underlying
implementation details. We cover lower-level APIs, including RDDs, in Part IV. As end users, you
shouldn't need to use RDDs much to perform most tasks unless you're maintaining older Spark code.
There are basically no instances in modern Spark for which you should be using RDDs instead of the
structured APIs, beyond manipulating some very raw, unprocessed, and unstructured data.
SparkR
SparkR is a tool for running R on Spark. It follows the same principles as all of Spark's other
language bindings. To use SparkR, you simply import it into your environment and run your code. It's
all very similar to the Python API except that it follows R's syntax instead of Python's. For the most
part, almost everything available in Python is available in SparkR:
# in R
library(SparkR)
sparkDF <- read.df("/data/flight-data/csv/2015-summary.csv",
source = "csv", header="true", inferSchema = "true")
take(sparkDF, 5)
# in R
collect(orderBy(sparkDF, "count"), 20)
R users can also use other R libraries like the pipe operator in magrittr to make Spark transformations
a bit more R-like. This can make it easy to use with other libraries like ggplot for more sophisticated
plotting:
# in R
library(magrittr)
sparkDF %>%
orderBy(desc(sparkDF$count)) %>%
groupBy("ORIGIN_COUNTRY_NAME") %>%
count() %>%
limit(10) %>%
collect()
We will not include R code samples as we do in Python, because almost every concept throughout
this book that applies to Python also applies to SparkR. The only difference will be syntax. We cover
SparkR and sparklyr in Part VII.
Spark's Ecosystem and Packages
One of the best parts about Spark is the ecosystem of packages and tools that the community has
created. Some of these tools even move into the core Spark project as they mature and become widely
used. As of this writing, the list of packages is rather long, numbering over 300—and more are added
frequently. You can find the largest index of Spark Packages at spark-packages.org, where any user
can publish to this package repository. There are also various other projects and packages that you
can find on the web; for example, on GitHub.
Conclusion
We hope this chapter showed you the sheer variety of ways in which you can apply Spark to your own
business and technical challenges. Spark's simple, robust programming model makes it easy to apply
to a large number of problems, and the vast array of packages that have sprung up around it, created by
hundreds of different people, are a true testament to Spark's ability to robustly tackle a number of
business problems and challenges. As the ecosystem and community grow, it's likely that more and
more packages will continue to crop up. We look forward to seeing what the community has in store!
The rest of this book will provide deeper dives into the product areas in Figure 3-1.
You may read the rest of the book any way that you prefer; we find that most people hop from area to
area as they hear terminology or want to apply Spark to certain problems they're facing.
Part II. Structured APIs—DataFrames,
SQL, and Datasets
Chapter 4. Structured API Overview
This part of the book will be a deep dive into Spark's Structured APIs. The Structured APIs are a tool
for manipulating all sorts of data, from unstructured log files to semi-structured CSV files and highly
structured Parquet files. These APIs refer to three core types of distributed collection APIs:
Datasets
DataFrames
SQL tables and views
Although they are distinct parts of the book, the majority of the Structured APIs apply to both batch
and streaming computation. This means that when you work with the Structured APIs, it should be
simple to migrate from batch to streaming (or vice versa) with little to no effort. We’ll cover
streaming in detail in Part V.
The Structured APIs are the fundamental abstraction that you will use to write the majority of your
data flows. Thus far in this book, we have taken a tutorial-based approach, meandering our way
through much of what Spark has to offer. This part offers a more in-depth exploration. In this chapter,
we’ll introduce the fundamental concepts that you should understand: the typed and untyped APIs (and
their differences); what the core terminology is; and, finally, how Spark actually takes your Structured
API data flows and executes them on the cluster. We will then provide more specific task-based
information for working with certain types of data or data sources.
NOTE
Before proceeding, let’s review the fundamental concepts and definitions that we covered in Part I. Spark is a distributed
programming model in which the user specifies transformations. Multiple transformations build up a directed acyclic graph
of instructions. An action begins the process of executing that graph of instructions, as a single job, by breaking it down into
stages and tasks to execute across the cluster. The logical structures that we manipulate with transformations and actions
are DataFrames and Datasets. To create a new DataFrame or Dataset, you call a transformation. To start computation or
convert to native language types, you call an action.
DataFrames and Datasets
Part I discussed DataFrames. Spark has two notions of structured collections: DataFrames and
Datasets. We will touch on the (nuanced) differences shortly, but let’s define what they both represent
first.
DataFrames and Datasets are (distributed) table-like collections with well-defined rows and
columns. Each column must have the same number of rows as all the other columns (although you can
use null to specify the absence of a value) and each column has type information that must be
consistent for every row in the collection. To Spark, DataFrames and Datasets represent immutable,
lazily evaluated plans that specify what operations to apply to data residing at a location to generate
some output. When we perform an action on a DataFrame, we instruct Spark to perform the actual
transformations and return the result. These represent plans of how to manipulate rows and columns
to compute the user's desired result.
NOTE
Tables and views are basically the same thing as DataFrames. We just execute SQL against them instead of DataFrame
code. We cover all of this in Chapter 10, which focuses specifically on Spark SQL.
To add a bit more specificity to these definitions, we need to talk about schemas, which are the way
you define the types of data you're storing in this distributed collection.
Schemas
A schema defines the column names and types of a DataFrame. You can define schemas manually or
read a schema from a data source (often called schema on read). Schemas consist of types, meaning
that you need a way of specifying what lies where.
Overview of Structured Spark Types
Spark is effectively a programming language of its own. Internally, Spark uses an engine called
Catalyst that maintains its own type information through the planning and processing of work. In
doing so, it opens up a wide variety of execution optimizations that make significant differences.
Spark types map directly to the different language APIs that Spark maintains and there exists a lookup
table for each of these in Scala, Java, Python, SQL, and R. Even if we use Spark's Structured APIs
from Python or R, the majority of our manipulations will operate strictly on Spark types, not Python
types. For example, the following code does not perform addition in Scala or Python; it actually
performs addition purely in Spark:
// in Scala
val df = spark.range(500).toDF("number")
df.select(df.col("number") + 10)
# in Python
df = spark.range(500).toDF("number")
df.select(df["number"] + 10)
This addition operation happens because Spark will convert an expression written in an input
language to Spark's internal Catalyst representation of that same type information. It will then operate
on that internal representation. We touch on why this is the case momentarily, but before we can, we
need to discuss Datasets.
DataFrames Versus Datasets
In essence, within the Structured APIs, there are two more APIs, the “untyped” DataFrames and the
“typed” Datasets. To say that DataFrames are untyped is slightly inaccurate; they have types, but
Spark maintains them completely and only checks whether those types line up with those specified in the
schema at runtime. Datasets, on the other hand, check whether types conform to the specification at
compile time. Datasets are only available to Java Virtual Machine (JVM)–based languages (Scala
and Java) and we specify types with case classes or Java beans.
For the most part, you're likely to work with DataFrames. To Spark (in Scala), DataFrames are
simply Datasets of type Row. The Row type is Spark's internal representation of its optimized in-
memory format for computation. This format makes for highly specialized and efficient computation
because rather than using JVM types, which can cause high garbage-collection and object
instantiation costs, Spark can operate on its own internal format without incurring any of those costs.
To Spark (in Python or R), there is no such thing as a Dataset: everything is a DataFrame and
therefore we always operate on that optimized format.
NOTE
The internal Catalyst format is well covered in numerous Spark presentations. Given that this book is intended for a more
general audience, we’ll refrain from going into the implementation. If you’re curious, there are some excellent talks by Josh
Rosen and Herman van Hovell, both of Databricks, about their work in the development of Spark's Catalyst engine.
Understanding DataFrames, Spark types, and schemas takes some time to digest. What you need to
know is that when you're using DataFrames, you're taking advantage of Spark's optimized internal
format. This format applies the same efficiency gains to all of Spark's language APIs. If you need
strict compile-time checking, read Chapter 11 to learn more about it.
Let's move on to some friendlier and more approachable concepts: columns and rows.
Columns
Columns represent a simple type like an integer or string, a complex type like an array or map, or a
null value. Spark tracks all of this type information for you and offers a variety of ways with which
you can transform columns. Columns are discussed extensively in Chapter 5, but for the most part you
can think about Spark Column types as columns in a table.
Rows
A row is nothing more than a record of data. Each record in a DataFrame must be of type Row, as we
can see when we collect the following DataFrames. We can create rows from SQL, from
Resilient Distributed Datasets (RDDs), from data sources, or manually from scratch. Here, we
create one by using a range:
// in Scala
spark.range(2).toDF().collect()
# in Python
spark.range(2).collect()
These both result in an array of Row objects.
Spark Types
We mentioned earlier that Spark has a large number of internal type representations. We include a
handy reference table on the next several pages so that you can easily see which type, in
your specific language, lines up with the type in Spark.
Before getting to those tables, let's talk about how we instantiate, or declare, a column to be of a
certain type.
To work with the correct Scala types, use the following:
import org.apache.spark.sql.types._
val b = ByteType
To work with the correct Java types, you should use the factory methods in the following package:
import org.apache.spark.sql.types.DataTypes;
ByteType x = DataTypes.ByteType;
Python types at times have certain requirements, which you can see listed in Table 4-1, as do Scala
and Java, which you can see listed in Tables 4-2 and 4-3, respectively. To work with the correct
Python types, use the following:
from pyspark.sql.types import *
b = ByteType()
The following tables provide the detailed type information for each of Spark's language bindings.
Table 4-1. Python type reference
ByteType. Value type: int or long. Note: numbers are converted to 1-byte signed integers at runtime; ensure that they are within the range –128 to 127. API: ByteType()
ShortType. Value type: int or long. Note: numbers are converted to 2-byte signed integers at runtime; ensure that they are within the range –32768 to 32767. API: ShortType()
IntegerType. Value type: int or long. Note: Python has a lenient definition of “integer”; numbers that are too large will be rejected by Spark SQL if you use IntegerType(). It's best practice to use LongType. API: IntegerType()
LongType. Value type: long. Note: numbers are converted to 8-byte signed integers at runtime; ensure that they are within the range –9223372036854775808 to 9223372036854775807. Otherwise, convert data to decimal.Decimal and use DecimalType. API: LongType()
FloatType. Value type: float. Note: numbers are converted to 4-byte single-precision floating-point numbers at runtime. API: FloatType()
DoubleType. Value type: float. API: DoubleType()
DecimalType. Value type: decimal.Decimal. API: DecimalType()
StringType. Value type: string. API: StringType()
BinaryType. Value type: bytearray. API: BinaryType()
BooleanType. Value type: bool. API: BooleanType()
TimestampType. Value type: datetime.datetime. API: TimestampType()
DateType. Value type: datetime.date. API: DateType()
ArrayType. Value type: list, tuple, or array. API: ArrayType(elementType, [containsNull]). Note: the default value of containsNull is True.
MapType. Value type: dict. API: MapType(keyType, valueType, [valueContainsNull]). Note: the default value of valueContainsNull is True.
StructType. Value type: list or tuple. API: StructType(fields). Note: fields is a list of StructFields; fields with the same name are not allowed.
StructField. Value type: the value type in Python of the data type of this field (for example, int for a StructField with the data type IntegerType). API: StructField(name, dataType, [nullable]). Note: the default value of nullable is True.
Table 4-2. Scala type reference
ByteType. Value type: Byte. API: ByteType
ShortType. Value type: Short. API: ShortType
IntegerType. Value type: Int. API: IntegerType
LongType. Value type: Long. API: LongType
FloatType. Value type: Float. API: FloatType
DoubleType. Value type: Double. API: DoubleType
DecimalType. Value type: java.math.BigDecimal. API: DecimalType
StringType. Value type: String. API: StringType
BinaryType. Value type: Array[Byte]. API: BinaryType
BooleanType. Value type: Boolean. API: BooleanType
TimestampType. Value type: java.sql.Timestamp. API: TimestampType
DateType. Value type: java.sql.Date. API: DateType
ArrayType. Value type: scala.collection.Seq. API: ArrayType(elementType, [containsNull]). Note: the default value of containsNull is true.
MapType. Value type: scala.collection.Map. API: MapType(keyType, valueType, [valueContainsNull]). Note: the default value of valueContainsNull is true.
StructType. Value type: org.apache.spark.sql.Row. API: StructType(fields). Note: fields is an Array of StructFields; fields with the same name are not allowed.
StructField. Value type: the value type in Scala of the data type of this field (for example, Int for a StructField with the data type IntegerType). API: StructField(name, dataType, [nullable]). Note: the default value of nullable is true.
Table 4-3. Java type reference
ByteType. Value type: byte or Byte. API: DataTypes.ByteType
ShortType. Value type: short or Short. API: DataTypes.ShortType
IntegerType. Value type: int or Integer. API: DataTypes.IntegerType
LongType. Value type: long or Long. API: DataTypes.LongType
FloatType. Value type: float or Float. API: DataTypes.FloatType
DoubleType. Value type: double or Double. API: DataTypes.DoubleType
DecimalType. Value type: java.math.BigDecimal. API: DataTypes.createDecimalType() or DataTypes.createDecimalType(precision, scale)
StringType. Value type: String. API: DataTypes.StringType
BinaryType. Value type: byte[]. API: DataTypes.BinaryType
BooleanType. Value type: boolean or Boolean. API: DataTypes.BooleanType
TimestampType. Value type: java.sql.Timestamp. API: DataTypes.TimestampType
DateType. Value type: java.sql.Date. API: DataTypes.DateType
ArrayType. Value type: java.util.List. API: DataTypes.createArrayType(elementType), in which case containsNull is true, or DataTypes.createArrayType(elementType, containsNull)
MapType. Value type: java.util.Map. API: DataTypes.createMapType(keyType, valueType), in which case valueContainsNull is true, or DataTypes.createMapType(keyType, valueType, valueContainsNull)
StructType. Value type: org.apache.spark.sql.Row. API: DataTypes.createStructType(fields). Note: fields is a List or an array of StructFields; two fields with the same name are not allowed.
StructField. Value type: the value type in Java of the data type of this field (for example, int for a StructField with the data type IntegerType). API: DataTypes.createStructField(name, dataType, nullable)
It’s worth keeping in mind that the types might change over time as Spark SQL continues to grow, so
you may want to reference Spark's documentation for future updates. Of course, all of these types are
great, but you almost never work with purely static DataFrames. You will always manipulate and
transform them. Therefore it’s important that we give you an overview of the execution process in the
Structured APIs.
Overview of Structured API Execution
This section demonstrates how Structured API code is actually executed across a cluster. This will help you
understand (and potentially debug) the process of writing and executing code on clusters, so let's
walk through the execution of a single structured API query from user code to executed code. Here's
an overview of the steps:
1. Write DataFrame/Dataset/SQL Code.
2. If valid code, Spark converts this to a Logical Plan.
3. Spark transforms this Logical Plan to a Physical Plan, checking for optimizations along the
way.
4. Spark then executes this Physical Plan (RDD manipulations) on the cluster.
To execute code, we must write code. This code is then submitted to Spark either through the console
or via a submitted job. This code then passes through the Catalyst Optimizer, which decides how the
code should be executed and lays out a plan for doing so before, finally, the code is run and the result
is returned to the user. Figure 4-1 shows the process.
Figure 4-1. The Catalyst Optimizer
Logical Planning
The first phase of execution is meant to take user code and convert it into a logical plan. Figure 4-2
illustrates this process.
Figure 4-2. The structured API logical planning process
This logical plan only represents a set of abstract transformations that do not refer to executors or
drivers; it exists purely to convert the user's set of expressions into the most optimized version. It does
this by converting user code into an unresolved logical plan. This plan is unresolved because
although your code might be valid, the tables or columns that it refers to might or might not exist.
Spark uses the catalog, a repository of all table and DataFrame information, to resolve columns and
tables in the analyzer. The analyzer might reject the unresolved logical plan if the required table or
column name does not exist in the catalog. If the analyzer can resolve it, the result is passed through
the Catalyst Optimizer, a collection of rules that attempt to optimize the logical plan by pushing down
predicates or selections. Packages can extend Catalyst to include their own rules for domain-
specific optimizations.
Physical Planning
After successfully creating an optimized logical plan, Spark then begins the physical planning
process. The physical plan, often called a Spark plan, specifies how the logical plan will execute on
the cluster by generating different physical execution strategies and comparing them through a cost
model, as depicted in Figure 4-3. An example of the cost comparison might be choosing how to
perform a given join by looking at the physical attributes of a given table (how big the table is or how
big its partitions are).
Figure 4-3. The physical planning process
Physical planning results in a series of RDDs and transformations. This result is why you might have
heard Spark referred to as a compiler—it takes queries in DataFrames, Datasets, and SQL and
compiles them into RDD transformations for you.
Execution
Upon selecting a physical plan, Spark runs all of this code over RDDs, the lower-level programming
interface of Spark (which we cover in Part III). Spark performs further optimizations at runtime,
generating native Java bytecode that can remove entire tasks or stages during execution. Finally, the
result is returned to the user.
Conclusion
In this chapter, we covered Spark's Structured APIs and how Spark transforms your code into what
will physically execute on the cluster. In the chapters that follow, we cover core concepts and how to
use the key functionality of the Structured APIs.
Chapter 5. Basic Structured Operations
In Chapter 4, we introduced the core abstractions of the Structured API. This chapter moves away
from the architectural concepts and toward the tactical tools you will use to manipulate DataFrames
and the data within them. This chapter focuses exclusively on fundamental DataFrame operations and
avoids aggregations, window functions, and joins. These are discussed in subsequent chapters.
Definitionally, a DataFrame consists of a series of records (like rows in a table) that are of type Row,
and a number of columns (like columns in a spreadsheet) that represent a computation expression that
can be performed on each individual record in the Dataset. Schemas define the name as well as the
type of data in each column. Partitioning of the DataFrame defines the layout of the DataFrame or
Dataset’s physical distribution across the cluster. The partitioning scheme defines how that is
allocated. You can set this to be based on values in a certain column or nondeterministically.
Let's create a DataFrame with which we can work:
// in Scala
val df = spark.read.format("json")
.load("/data/flight-data/json/2015-summary.json")
# in Python
df = spark.read.format("json").load("/data/flight-data/json/2015-summary.json")
We discussed that a DataFrame will have columns, and we use a schema to define them. Let's take a
look at the schema on our current DataFrame:
df.printSchema()
Schemas tie everything together, so they're worth belaboring.
Schemas
A schema defines the column names and types of a DataFrame. We can either let a data source define
the schema (called schema-on-read) or we can define it explicitly ourselves.
WARNING
Deciding whether you need to define a schema prior to reading in your data depends on your use case. For ad hoc analysis,
schema-on-read usually works just fine (although at times it can be a bit slow with plain-text file formats like CSV or
JSON). However, this can also lead to precision issues like a long type incorrectly set as an integer when reading in a file.
When using Spark for production Extract, Transform, and Load (ETL), it is often a good idea to define your schemas
manually, especially when working with untyped data sources like CSV and JSON because schema inference can vary
depending on the type of data that you read in.
Let's begin with a simple file, which we saw in Chapter 4, and let the semi-structured nature of line-
delimited JSON define the structure. This is flight data from the United States Bureau of
Transportation Statistics:
// in Scala
spark.read.format("json").load("/data/flight-data/json/2015-summary.json").schema
Scala returns the following:
org.apache.spark.sql.types.StructType = ...
StructType(StructField(DEST_COUNTRY_NAME,StringType,true),
StructField(ORIGIN_COUNTRY_NAME,StringType,true),
StructField(count,LongType,true))
# in Python
spark.read.format("json").load("/data/flight-data/json/2015-summary.json").schema
Python returns the following:
StructType(List(StructField(DEST_COUNTRY_NAME,StringType,true),
StructField(ORIGIN_COUNTRY_NAME,StringType,true),
StructField(count,LongType,true)))
A schema is a StructType made up of a number of fields, StructFields, each of which has a name, a
type, and a Boolean flag that specifies whether the column can contain missing or null values. Users
can also optionally specify associated metadata with a column. The metadata is a way of storing
information about the column (Spark uses this in its machine learning library).
Schemas can contain other StructTypes (Spark's complex types). We will see this in Chapter 6
when we discuss working with complex types. If the types in the data (at runtime) do not match the
schema, Spark will throw an error. The example that follows shows how to create and enforce a
specific schema on a DataFrame.
// in Scala
import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType}
import org.apache.spark.sql.types.Metadata
val myManualSchema = StructType(Array(
StructField("DEST_COUNTRY_NAME", StringType, true),
StructField("ORIGIN_COUNTRY_NAME", StringType, true),
StructField("count", LongType, false,
Metadata.fromJson("{\"hello\":\"world\"}"))
))
val df = spark.read.format("json").schema(myManualSchema)
.load("/data/flight-data/json/2015-summary.json")
Here’s how to do the same in Python:
# in Python
from pyspark.sql.types import StructField, StructType, StringType, LongType
myManualSchema = StructType([
StructField("DEST_COUNTRY_NAME", StringType(), True),
StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
StructField("count", LongType(), False, metadata={"hello":"world"})
])
df = spark.read.format("json").schema(myManualSchema)\
.load("/data/flight-data/json/2015-summary.json")
As discussed in Chapter 4, we cannot simply set types via the per-language types because Spark
maintains its own type information. Let’s now discuss what schemas define: columns.
Columns and Expressions
Columns in Spark are similar to columns in a spreadsheet, R dataframe, or pandas DataFrame. You
can select, manipulate, and remove columns from DataFrames and these operations are represented as
expressions.
To Spark, columns are logical constructions that simply represent a value computed on a per-record
basis by means of an expression. This means that to have a real value for a column, we need to have a
row; and to have a row, we need to have a DataFrame. You cannot manipulate an individual column
outside the context of a DataFrame; you must use Spark transformations within a DataFrame to modify
the contents of a column.
Columns
There are a lot of different ways to construct and refer to columns but the two simplest ways are by
using the col or column functions. To use either of these functions, you pass in a column name:
// in Scala
import org.apache.spark.sql.functions.{col, column}
col("someColumnName")
column("someColumnName")
# in Python
from pyspark.sql.functions import col, column
col("someColumnName")
column("someColumnName")
We will stick to using col throughout this book. As mentioned, this column might or might not exist in
our DataFrames. Columns are not resolved until we compare the column names with those we are
maintaining in the catalog. Column and table resolution happens in the analyzer phase, as discussed
in Chapter 4.
NOTE
We just mentioned two different ways of referring to columns. Scala has some unique language features that allow for
more shorthand ways of referring to columns. The following bits of syntactic sugar perform the exact same thing, namely
creating a column, but provide no performance improvement:
// in Scala
$"myColumn"
'myColumn
The $ allows us to designate a string as a special string that should refer to an expression. The tick mark (') is a special
thing called a symbol; this is a Scala-specific construct of referring to some identifier. They both perform the same thing
and are shorthand ways of referring to columns by name. You’ll likely see all of the aforementioned references when you
read different people's Spark code. We leave it to you to use whatever is most comfortable and maintainable for you and
those with whom you work.
Explicit column references
If you need to refer to a specific DataFrame’s column, you can use the col method on the specific
DataFrame. This can be useful when you are performing a join and need to refer to a specific column
in one DataFrame that might share a name with another column in the joined DataFrame. We will see
this in Chapter 8. As an added benefit, Spark does not need to resolve this column itself (during the
analyzer phase) because we did that for Spark:
df.col("count")
Expressions
We mentioned earlier that columns are expressions, but what is an expression? An expression is a set
of transformations on one or more values in a record in a DataFrame. Think of it like a function that
takes as input one or more column names, resolves them, and then potentially applies more
expressions to create a single value for each record in the dataset. Importantly, this “single value” can
actually be a complex type like a Map or Array. We’ll see more of the complex types in Chapter 6.
In the simplest case, an expression, created via the expr function, is just a DataFrame column
reference: expr("someCol") is equivalent to col("someCol").
Columns as expressions
Columns provide a subset of expression functionality. If you use col() and want to perform
transformations on that column, you must perform those on that column reference. When using an
expression, the expr function can actually parse transformations and column references from a string,
and these can subsequently be passed into further transformations. Let's look at some examples.
expr("someCol - 5") is the same transformation as performing col("someCol") - 5, or even
expr("someCol") - 5. That's because Spark compiles these to a logical tree specifying the order of
operations. This might be a bit confusing at first, but remember a couple of key points:
Columns are just expressions.
Columns and transformations of those columns compile to the same logical plan as parsed
expressions.
Let's ground this with an example:
(((col("someCol") + 5) * 200) - 6) < col("otherCol")
Figure 5-1 shows an overview of that logical tree.
Figure 5-1. A logical tree
This might look familiar because it's a directed acyclic graph. This graph is represented equivalently
by the following code:
// in Scala
import org.apache.spark.sql.functions.expr
expr("(((someCol + 5) * 200) - 6) < otherCol")
# in Python
from pyspark.sql.functions import expr
expr("(((someCol + 5) * 200) - 6) < otherCol")
This is an extremely important point to reinforce. Notice how the previous expression is actually
valid SQL code, as well, just like you might put in a SELECT statement? That's because this SQL
expression and the previous DataFrame code compile to the same underlying logical tree prior to
execution. This means that you can write your expressions as DataFrame code or as SQL expressions
and get the exact same performance characteristics. This is discussed in Chapter 4.
Accessing a DataFrame’s columns
Sometimes, you'll need to see a DataFrame's columns, which you can do by using something like
printSchema; however, if you want to programmatically access columns, you can use the columns
property to see all columns on a DataFrame:
spark.read.format("json").load("/data/flight-data/json/2015-summary.json")
.columns
Records and Rows
In Spark, each row in a DataFrame is a single record. Spark represents this record as an object of
type Row. Spark manipulates Row objects using column expressions in order to produce usable values.
Row objects internally represent arrays of bytes. The byte array interface is never shown to users
because we only use column expressions to manipulate them.
You'll notice that commands that return individual rows to the driver will always return one or more Row
types when we are working with DataFrames.
NOTE
We use lowercase row and record interchangeably in this chapter, with a focus on the latter. A capitalized Row refers to
the Row object.
Let's see a row by calling first on our DataFrame:
df.first()
Creating Rows
You can create rows by manually instantiating a Row object with the values that belong in each
column. It’s important to note that only DataFrames have schemas. Rows themselves do not have
schemas. This means that if you create a Row manually, you must specify the values in the same order
as the schema of the DataFrame to which they might be appended (we will see this when we discuss
creating DataFrames):
// in Scala
import org.apache.spark.sql.Row
val myRow = Row("Hello", null, 1, false)
# in Python
from pyspark.sql import Row
myRow = Row("Hello", None, 1, False)
Accessing data in rows is equally easy: you simply specify the position that you would like. In Scala
or Java, you must either use the helper methods or explicitly coerce the values. However, in Python or
R, the value will automatically be coerced into the correct type:
// in Scala
myRow(0) // type Any
myRow(0).asInstanceOf[String] // String
myRow.getString(0) // String
myRow.getInt(2) // Int
# in Python
myRow[0]
myRow[2]
You can also explicitly return a set of data in the corresponding Java Virtual Machine (JVM) objects
by using the Dataset APIs. This is covered in Chapter 11.
DataFrame Transformations
Now that we have briefly defined the core parts of a DataFrame, we will move on to manipulating
DataFrames. When working with individual DataFrames, there are some fundamental objectives.
These break down into several core operations, as depicted in Figure 5-2:
We can add rows or columns
We can remove rows or columns
We can transform a row into a column (or vice versa)
We can change the order of rows based on the values in columns
Figure 5-2. Different kinds of transformations
Luckily, we can translate all of these into simple transformations, the most common being those that
take one column, change it row by row, and then return our results.
Creating DataFrames
As we saw previously, we can create DataFrames from raw data sources. This is covered extensively
in Chapter 9; however, we will use them now to create an example DataFrame (for illustration
purposes later in this chapter, we will also register this as a temporary view so that we can query it
with SQL and show off basic transformations in SQL, as well):
// in Scala
val df = spark.read.format("json")
.load("/data/flight-data/json/2015-summary.json")
df.createOrReplaceTempView("dfTable")
# in Python
df = spark.read.format("json").load("/data/flight-data/json/2015-summary.json")
df.createOrReplaceTempView("dfTable")
We can also create DataFrames on the fly by taking a set of rows and converting them to a DataFrame.
// in Scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType}
val myManualSchema = new StructType(Array(
new StructField("some", StringType, true),
new StructField("col", StringType, true),
new StructField("names", LongType, false)))
val myRows = Seq(Row("Hello", null, 1L))
val myRDD = spark.sparkContext.parallelize(myRows)
val myDf = spark.createDataFrame(myRDD, myManualSchema)
myDf.show()
NOTE
In Scala, we can also take advantage of Spark's implicits in the console (and if you import them in your JAR code) by
running toDF on a Seq type. This does not play well with null types, so it’s not necessarily recommended for production use
cases.
// in Scala
val myDF = Seq(("Hello", 2, 1L)).toDF("col1", "col2", "col3")
# in Python
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, StringType, LongType
myManualSchema = StructType([
StructField("some", StringType(), True),
StructField("col", StringType(), True),
StructField("names", LongType(), False)
])
myRow = Row("Hello", None, 1)
myDf = spark.createDataFrame([myRow], myManualSchema)
myDf.show()
Giving an output of:
+-----+----+-----+
| some| col|names|
+-----+----+-----+
|Hello|null| 1|
+-----+----+-----+
Now that you know how to create DataFrames, let's take a look at the most useful methods that
you're going to be using: the select method when you're working with columns or expressions, and
the selectExpr method when you're working with expressions in strings. Naturally, some
transformations are not specified as methods on columns; therefore, there exists a group of functions
found in the org.apache.spark.sql.functions package.
With these three tools, you should be able to solve the vast majority of transformation challenges that
you might encounter in DataFrames.
select and selectExpr
select and selectExpr allow you to do the DataFrame equivalent of SQL queries on a table of
data:
-- in SQL
SELECT * FROM dataFrameTable
SELECT columnName FROM dataFrameTable
SELECT columnName * 10, otherColumn, someOtherCol as c FROM dataFrameTable
In the simplest possible terms, you can use them to manipulate columns in your DataFrames. Let’s
walk through some examples on DataFrames to talk about some of the different ways of approaching
this problem. The easiest way is just to use the select method and pass in the column names as
strings with which you would like to work:
// in Scala
df.select("DEST_COUNTRY_NAME").show(2)
# in Python
df.select("DEST_COUNTRY_NAME").show(2)
-- in SQL
SELECT DEST_COUNTRY_NAME FROM dfTable LIMIT 2
Giving an output of:
+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
| United States|
| United States|
+-----------------+
You can select multiple columns by using the same style of query; just add more column name strings
to your select method call:
// in Scala
df.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME").show(2)
# in Python
df.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME").show(2)
-- in SQL
SELECT DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME FROM dfTable LIMIT 2
Giving an output of:
+-----------------+-------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|
+-----------------+-------------------+
| United States| Romania|
| United States| Croatia|
+-----------------+-------------------+
As discussed in “Columns and Expressions”, you can refer to columns in a number of different ways;
all you need to keep in mind is that you can use them interchangeably:
// in Scala
import org.apache.spark.sql.functions.{expr, col, column}
df.select(
df.col("DEST_COUNTRY_NAME"),
col("DEST_COUNTRY_NAME"),
column("DEST_COUNTRY_NAME"),
'DEST_COUNTRY_NAME,
$"DEST_COUNTRY_NAME",
expr("DEST_COUNTRY_NAME"))
.show(2)
# in Python
from pyspark.sql.functions import expr, col, column
df.select(
expr("DEST_COUNTRY_NAME"),
col("DEST_COUNTRY_NAME"),
column("DEST_COUNTRY_NAME"))\
.show(2)
One common error is attempting to mix Column objects and strings. For example, the following code
will result in a compiler error:
df.select(col("DEST_COUNTRY_NAME"), "DEST_COUNTRY_NAME")
As we’ve seen thus far, expr is the most flexible reference that we can use. It can refer to a plain
column or a string manipulation of a column. To illustrate, let’s change the column name, and then
change it back by using the AS keyword and then the alias method on the column:
// in Scala
df.select(expr("DEST_COUNTRY_NAME AS destination")).show(2)
# in Python
df.select(expr("DEST_COUNTRY_NAME AS destination")).show(2)
-- in SQL
SELECT DEST_COUNTRY_NAME as destination FROM dfTable LIMIT 2
This changes the column name to “destination.” You can further manipulate the result of your
expression as another expression:
// in Scala
df.select(expr("DEST_COUNTRY_NAME as destination").alias("DEST_COUNTRY_NAME"))
.show(2)
# in Python
df.select(expr("DEST_COUNTRY_NAME as destination").alias("DEST_COUNTRY_NAME"))\
.show(2)
The preceding operation changes the column name back to its original name.
Because select followed by a series of expr is such a common pattern, Spark has a shorthand for
doing this efficiently: selectExpr. This is probably the most convenient interface for everyday use:
// in Scala
df.selectExpr("DEST_COUNTRY_NAME as newColumnName", "DEST_COUNTRY_NAME").show(2)
# in Python
df.selectExpr("DEST_COUNTRY_NAME as newColumnName", "DEST_COUNTRY_NAME").show(2)
This opens up the true power of Spark. We can treat selectExpr as a simple way to build up
complex expressions that create new DataFrames. In fact, we can add any valid non-aggregating SQL
statement, and as long as the columns resolve, it will be valid! Here’s a simple example that adds a
new column withinCountry to our DataFrame that specifies whether the destination and origin are
the same:
// in Scala
df.selectExpr(
"*", // include all original columns
"(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry")
.show(2)
# in Python
df.selectExpr(
"*", # all original columns
"(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry")\
.show(2)
-- in SQL
SELECT *, (DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry
FROM dfTable
LIMIT 2
Giving an output of:
+-----------------+-------------------+-----+-------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|withinCountry|
+-----------------+-------------------+-----+-------------+
| United States| Romania| 15| false|
| United States| Croatia| 1| false|
+-----------------+-------------------+-----+-------------+
With selectExpr, we can also specify aggregations over the entire DataFrame by taking
advantage of the built-in functions. These look just like what we have been showing so far:
// in Scala
df.selectExpr("avg(count)", "count(distinct(DEST_COUNTRY_NAME))").show(2)
# in Python
df.selectExpr("avg(count)", "count(distinct(DEST_COUNTRY_NAME))").show(2)
-- in SQL
SELECT avg(count), count(distinct(DEST_COUNTRY_NAME)) FROM dfTable LIMIT 2
Giving an output of:
+-----------+---------------------------------+
| avg(count)|count(DISTINCT DEST_COUNTRY_NAME)|
+-----------+---------------------------------+
|1770.765625| 132|
+-----------+---------------------------------+
Converting to Spark Types (Literals)
Sometimes, we need to pass explicit values into Spark that are just a value (rather than a new
column). This might be a constant value or something we’ll need to compare to later on. The way we
do this is through literals. This is basically a translation from a given programming language’s literal
value to one that Spark understands. Literals are expressions and you can use them in the same way:
// in Scala
import org.apache.spark.sql.functions.lit
df.select(expr("*"), lit(1).as("One")).show(2)
# in Python
from pyspark.sql.functions import lit
df.select(expr("*"), lit(1).alias("One")).show(2)
In SQL, literals are just the specific value:
-- in SQL
SELECT *, 1 as One FROM dfTable LIMIT 2
Giving an output of:
+-----------------+-------------------+-----+---+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|One|
+-----------------+-------------------+-----+---+
| United States| Romania| 15| 1|
| United States| Croatia| 1| 1|
+-----------------+-------------------+-----+---+
This will come up when you need to check whether a value is greater than some constant or
other programmatically created variable.
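To see why a wrapper like lit is needed at all, here is a tiny plain-Python sketch (not Spark’s actual internals, just an illustration of the idea): column expressions describe computations over rows rather than holding values, so a raw constant must be lifted into the same expression world before it can be combined with them.

```python
# Toy sketch of expression nodes, assumed for illustration only.
class Expr:
    """A tiny expression node: either a column reference or a literal."""
    def __init__(self, kind, value):
        self.kind, self.value = kind, value

    def eval(self, row):
        # Column references look up their value in the row; literals are constant.
        return row[self.value] if self.kind == "col" else self.value

    def gt(self, other):
        # Comparisons combine Expr with Expr, which is why constants need lit().
        return lambda row: self.eval(row) > other.eval(row)

def col(name):
    return Expr("col", name)

def lit(value):
    return Expr("lit", value)

row = {"count": 15}
predicate = col("count").gt(lit(10))
print(predicate(row))  # True: 15 > 10
```

The real lit does the analogous lifting: it turns a language-native value into a Column expression that Spark can evaluate per row.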
Adding Columns
There’s also a more formal way of adding a new column to a DataFrame, and that’s by using the
withColumn method on our DataFrame. For example, let’s add a column that just adds the number
one as a column:
// in Scala
df.withColumn("numberOne", lit(1)).show(2)
# in Python
df.withColumn("numberOne", lit(1)).show(2)
-- in SQL
SELECT *, 1 as numberOne FROM dfTable LIMIT 2
Giving an output of:
+-----------------+-------------------+-----+---------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|numberOne|
+-----------------+-------------------+-----+---------+
| United States| Romania| 15| 1|
| United States| Croatia| 1| 1|
+-----------------+-------------------+-----+---------+
Let’s do something a bit more interesting and make it an actual expression. In the next example, we’ll
set a Boolean flag for when the origin country is the same as the destination country:
// in Scala
df.withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME == DEST_COUNTRY_NAME"))
.show(2)
# in Python
df.withColumn("withinCountry", expr("ORIGIN_COUNTRY_NAME == DEST_COUNTRY_NAME"))\
.show(2)
Notice that the withColumn function takes two arguments: the column name and the expression that
will create the value for that given row in the DataFrame. Interestingly, we can also rename a column
this way. The SQL syntax is the same as we had previously, so we can omit it in this example:
df.withColumn("Destination", expr("DEST_COUNTRY_NAME")).columns
Resulting in:
... DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count, Destination
Renaming Columns
Although we can rename a column in the manner that we just described, another alternative is to use
the withColumnRenamed method. This will rename the column with the name of the string in the first
argument to the string in the second argument:
// in Scala
df.withColumnRenamed("DEST_COUNTRY_NAME", "dest").columns
# in Python
df.withColumnRenamed("DEST_COUNTRY_NAME", "dest").columns
... dest, ORIGIN_COUNTRY_NAME, count
Reserved Characters and Keywords
One thing that you might come across is reserved characters like spaces or dashes in column names.
Handling these means escaping column names appropriately. In Spark, we do this by using backtick
(`) characters. Let’s use withColumn, which you just learned about, to create a column with reserved
characters. We’ll show two examples: in the one shown here, we don’t need escape characters, but
in the next one, we do:
// in Scala
import org.apache.spark.sql.functions.expr
val dfWithLongColName = df.withColumn(
"This Long Column-Name",
expr("ORIGIN_COUNTRY_NAME"))
# in Python
dfWithLongColName = df.withColumn(
"This Long Column-Name",
expr("ORIGIN_COUNTRY_NAME"))
We don’t need escape characters here because the first argument to withColumn is just a string for the
new column name. In this example, however, we need to use backticks because we’re referencing a
column in an expression:
// in Scala
dfWithLongColName.selectExpr(
"`This Long Column-Name`",
"`This Long Column-Name` as `new col`")
.show(2)
# in Python
dfWithLongColName.selectExpr(
"`This Long Column-Name`",
"`This Long Column-Name` as `new col`")\
.show(2)
dfWithLongColName.createOrReplaceTempView("dfTableLong")
-- in SQL
SELECT `This Long Column-Name`, `This Long Column-Name` as `new col`
FROM dfTableLong LIMIT 2
We can refer to columns with reserved characters (and not escape them) if we’re doing an explicit
string-to-column reference, which is interpreted as a literal instead of an expression. We only need to
escape expressions that use reserved characters or keywords. The following two examples both
result in the same DataFrame:
// in Scala
dfWithLongColName.select(col("This Long Column-Name")).columns
# in Python
dfWithLongColName.select(expr("`This Long Column-Name`")).columns
Case Sensitivity
By default Spark is case insensitive; however, you can make Spark case sensitive by setting the
configuration:
-- in SQL
set spark.sql.caseSensitive=true
Removing Columns
Now that we’ve created this column, let’s take a look at how we can remove columns from
DataFrames. You likely already noticed that we can do this by using select. However, there is also a
dedicated method called drop:
df.drop("ORIGIN_COUNTRY_NAME").columns
We can drop multiple columns by passing in multiple columns as arguments:
dfWithLongColName.drop("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME")
Changing a Column’s Type (cast)
Sometimes, we might need to convert from one type to another; for example, if we have a set of
StringType that should be integers. We can convert columns from one type to another by casting the
column from one type to another. For instance, let’s convert our count column from an integer to a
Long type:
df.withColumn("count2", col("count").cast("long"))
-- in SQL
SELECT *, cast(count as long) AS count2 FROM dfTable
Filtering Rows
To filter rows, we create an expression that evaluates to true or false. Rows for which the
expression evaluates to false are then filtered out. The most common way to do this with DataFrames is to
create either an expression as a String or build an expression by using a set of column manipulations.
There are two methods to perform this operation: you can use where or filter and they both will
perform the same operation and accept the same argument types when used with DataFrames. We will
stick to where because of its familiarity to SQL; however, filter is valid as well.
NOTE
When using the Dataset API from either Scala or Java, filter also accepts an arbitrary function that Spark will apply to
each record in the Dataset. See Chapter 11 for more information.
The following filters are equivalent, and the results are the same in Scala and Python:
df.filter(col("count") < 2).show(2)
df.where("count < 2").show(2)
-- in SQL
SELECT * FROM dfTable WHERE count < 2 LIMIT 2
Giving an output of:
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
| United States| Croatia| 1|
| United States| Singapore| 1|
+-----------------+-------------------+-----+
Instinctively, you might want to put multiple filters into the same expression. Although this is possible,
it is not always useful, because Spark automatically performs all filtering operations at the same time
regardless of the filter ordering. This means that if you want to specify multiple AND filters, just
chain them sequentially and let Spark handle the rest:
// in Scala
df.where(col("count") < 2).where(col("ORIGIN_COUNTRY_NAME") =!= "Croatia")
.show(2)
# in Python
df.where(col("count") < 2).where(col("ORIGIN_COUNTRY_NAME") != "Croatia")\
.show(2)
-- in SQL
SELECT * FROM dfTable WHERE count < 2 AND ORIGIN_COUNTRY_NAME != "Croatia"
LIMIT 2
Giving an output of:
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
| United States| Singapore| 1|
| Moldova| United States| 1|
+-----------------+-------------------+-----+
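Because Spark flattens chained AND filters into one operation anyway, the two styles are interchangeable. A plain-Python sketch (not Spark code) of that equivalence, using rows shaped like the flight data above:

```python
# Sample rows: (DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count)
rows = [
    ("United States", "Croatia", 1),
    ("United States", "Singapore", 1),
    ("Moldova", "United States", 1),
    ("United States", "Romania", 15),
]

# Style 1: chain the filters sequentially, as the text recommends.
chained = [r for r in rows if r[2] < 2]
chained = [r for r in chained if r[1] != "Croatia"]

# Style 2: one combined AND predicate.
combined = [r for r in rows if r[2] < 2 and r[1] != "Croatia"]

print(chained == combined)  # True: both keep exactly the same rows
```

Both keep precisely the rows satisfying every predicate, regardless of ordering, which is why chaining is safe to prefer for readability.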
Getting Unique Rows
A very common use case is to extract the unique or distinct values in a DataFrame. These values can
be in one or more columns. The way we do this is by using the distinct method on a DataFrame,
which allows us to deduplicate any rows that are in that DataFrame. For instance, let’s get the unique
origins in our dataset. This, of course, is a transformation that will return a new DataFrame with only
unique rows:
// in Scala
df.select("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").distinct().count()
# in Python
df.select("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").distinct().count()
-- in SQL
SELECT COUNT(DISTINCT(ORIGIN_COUNTRY_NAME, DEST_COUNTRY_NAME)) FROM dfTable
Results in 256.
// in Scala
df.select("ORIGIN_COUNTRY_NAME").distinct().count()
# in Python
df.select("ORIGIN_COUNTRY_NAME").distinct().count()
-- in SQL
SELECT COUNT(DISTINCT ORIGIN_COUNTRY_NAME) FROM dfTable
Results in 125.
Random Samples
Sometimes, you might just want to sample some random records from your DataFrame. You can do
this by using the sample method on a DataFrame, which makes it possible for you to specify a
fraction of rows to extract from a DataFrame and whether you’d like to sample with or without
replacement:
// in Scala
val seed = 5
val withReplacement = false
val fraction = 0.5
df.sample(withReplacement, fraction, seed).count()
# in Python
seed = 5
withReplacement = False
fraction = 0.5
df.sample(withReplacement, fraction, seed).count()
Giving an output of 126.
Random Splits
Random splits can be helpful when you need to break up your DataFrame into random “splits” of
the original DataFrame. This is often used with machine learning algorithms to create training,
validation, and test sets. In this next example, we’ll split our DataFrame into two different
DataFrames by setting the weights by which we will split the DataFrame (these are the arguments to
the function). Because this method is designed to be randomized, we will also specify a seed (just
replace seed with a number of your choosing in the code block). It’s important to note that if you don’t
specify proportions for each DataFrame that add up to one, they will be normalized so that they do:
// in Scala
val dataFrames = df.randomSplit(Array(0.25, 0.75), seed)
dataFrames(0).count() > dataFrames(1).count() // False
# in Python
dataFrames = df.randomSplit([0.25, 0.75], seed)
dataFrames[0].count() > dataFrames[1].count() # False
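The normalization mentioned above is simple arithmetic: each weight is divided by the sum of all weights. A plain-Python sketch (an assumed illustration of that step, not Spark code):

```python
# Normalize split weights so they sum to one, as randomSplit does
# when the supplied proportions do not already add up to 1.
def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

print(normalize([0.25, 0.75]))  # unchanged: already sums to 1
print(normalize([1, 3]))        # [0.25, 0.75]: same split, unnormalized form
```

So passing weights [1, 3] produces the same 25/75 split as [0.25, 0.75].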
Concatenating and Appending Rows (Union)
As you learned in the previous section, DataFrames are immutable. This means users cannot append
to DataFrames, because that would be changing them. To append to a DataFrame, you must union the
original DataFrame with the new DataFrame. This just concatenates the two DataFrames. To
union two DataFrames, you must be sure that they have the same schema and number of columns;
otherwise, the union will fail.
WARNING
Unions are currently performed based on location, not on the schema. This means that columns will not automatically line
up the way you think they might.
// in Scala
import org.apache.spark.sql.Row
val schema = df.schema
val newRows = Seq(
Row("New Country", "Other Country", 5L),
Row("New Country 2", "Other Country 3", 1L)
)
val parallelizedRows = spark.sparkContext.parallelize(newRows)
val newDF = spark.createDataFrame(parallelizedRows, schema)
df.union(newDF)
.where("count = 1")
.where($"ORIGIN_COUNTRY_NAME" =!= "United States")
.show() // get all of them and we'll see our new rows at the end
In Scala, you must use the =!= operator so that you don’t just compare the unevaluated column
expression to a string but instead to the evaluated one:
# in Python
from pyspark.sql import Row
schema = df.schema
newRows = [
Row("New Country", "Other Country", 5),
Row("New Country 2", "Other Country 3", 1)
]
parallelizedRows = spark.sparkContext.parallelize(newRows)
newDF = spark.createDataFrame(parallelizedRows, schema)
# in Python
df.union(newDF)\
.where("count = 1")\
.where(col("ORIGIN_COUNTRY_NAME") != "United States")\
.show()
Giving the output of:
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
| United States| Croatia| 1|
...
| United States| Namibia| 1|
| New Country 2| Other Country 3| 1|
+-----------------+-------------------+-----+
As expected, you’ll need to use this new DataFrame reference in order to refer to the DataFrame with
the newly appended rows. A common way to do this is to make the DataFrame into a view or register
it as a table so that you can reference it more dynamically in your code.
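Because unions are resolved by position rather than by name, as the warning above notes, misordered columns silently land in the wrong place. A plain-Python sketch (not Spark code) of that pitfall:

```python
# Rows are tuples in the column order (DEST, ORIGIN, count).
left = [("United States", "Romania", 15)]

# Suppose the second dataset's author *meant* (ORIGIN, DEST, count)...
right = [("Other Country", "New Country", 5)]

# ...but a positional union just appends rows; position 0 is still
# interpreted as DEST, with no name-based realignment.
unioned = left + right
print(unioned[1][0])  # "Other Country" lands in the DEST position
```

In Spark the cure is the same as in this sketch: make sure both DataFrames list their columns in the same order (for example with a select) before calling union.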
Sorting Rows
When we sort the values in a DataFrame, we always want to sort with either the largest or smallest
values at the top of a DataFrame. There are two equivalent operations to do this, sort and orderBy,
which work the exact same way. They accept both column expressions and strings, as well as multiple
columns. The default is to sort in ascending order:
// in Scala
df.sort("count").show(5)
df.orderBy("count", "DEST_COUNTRY_NAME").show(5)
df.orderBy(col("count"), col("DEST_COUNTRY_NAME")).show(5)
# in Python
df.sort("count").show(5)
df.orderBy("count", "DEST_COUNTRY_NAME").show(5)
df.orderBy(col("count"), col("DEST_COUNTRY_NAME")).show(5)
To more explicitly specify sort direction, you need to use the asc and desc functions if operating on a
column. These allow you to specify the order in which a given column should be sorted:
// in Scala
import org.apache.spark.sql.functions.{desc, asc}
df.orderBy(expr("count desc")).show(2)
df.orderBy(desc("count"), asc("DEST_COUNTRY_NAME")).show(2)
# in Python
from pyspark.sql.functions import desc, asc
df.orderBy(expr("count desc")).show(2)
df.orderBy(col("count").desc(), col("DEST_COUNTRY_NAME").asc()).show(2)
-- in SQL
SELECT * FROM dfTable ORDER BY count DESC, DEST_COUNTRY_NAME ASC LIMIT 2
An advanced tip is to use asc_nulls_first, desc_nulls_first, asc_nulls_last, or
desc_nulls_last to specify where you would like your null values to appear in an ordered
DataFrame.
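As a plain-Python sketch of what nulls-last ascending ordering means (not Spark code, just the semantics): null values are pushed past every non-null value instead of being compared against them.

```python
# Mimic asc_nulls_last: sort non-null values normally, send None to the end.
values = [3, None, 1, 2, None]

asc_nulls_last = sorted(
    values,
    # The key is a pair: "is this None?" sorts False (non-null) first,
    # then the value itself orders the non-null entries.
    key=lambda v: (v is None, v if v is not None else 0),
)

print(asc_nulls_last)  # [1, 2, 3, None, None]
```

Flipping the first key component (or reversing the sort) gives the nulls-first variants.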
For optimization purposes, it’s sometimes advisable to sort within each partition before another set of
transformations. You can use the sortWithinPartitions method to do this:
// in Scala
spark.read.format("json").load("/data/flight-data/json/*-summary.json")
.sortWithinPartitions("count")
# in Python
spark.read.format("json").load("/data/flight-data/json/*-summary.json")\
.sortWithinPartitions("count")
We will discuss this more when we look at tuning and optimization in Part III.
Limit
Oftentimes, you might want to restrict what you extract from a DataFrame; for example, you might
want just the top ten of some DataFrame. You can do this by using the limit method:
// in Scala
df.limit(5).show()
# in Python
df.limit(5).show()
-- in SQL
SELECT * FROM dfTable LIMIT 5
// in Scala
df.orderBy(expr("count desc")).limit(6).show()
# in Python
df.orderBy(expr("count desc")).limit(6).show()
-- in SQL
SELECT * FROM dfTable ORDER BY count desc LIMIT 6
Repartition and Coalesce
Another important optimization opportunity is to partition the data according to some frequently
filtered columns, which control the physical layout of data across the cluster including the partitioning
scheme and the number of partitions.
Repartition will incur a full shuffle of the data, regardless of whether one is necessary. This means
that you should typically only repartition when the future number of partitions is greater than your
current number of partitions or when you are looking to partition by a set of columns:
// in Scala
df.rdd.getNumPartitions // 1
# in Python
df.rdd.getNumPartitions() # 1
// in Scala
df.repartition(5)
# in Python
df.repartition(5)
If you know that you’re going to be filtering by a certain column often, it can be worth repartitioning
based on that column:
// in Scala
df.repartition(col("DEST_COUNTRY_NAME"))
# in Python
df.repartition(col("DEST_COUNTRY_NAME"))
You can optionally specify the number of partitions you would like, too:
// in Scala
df.repartition(5, col("DEST_COUNTRY_NAME"))
# in Python
df.repartition(5, col("DEST_COUNTRY_NAME"))
Coalesce, on the other hand, will not incur a full shuffle and will try to combine partitions. This
operation will shuffle your data into five partitions based on the destination country name, and then
coalesce them (without a full shuffle):
// in Scala
df.repartition(5, col("DEST_COUNTRY_NAME")).coalesce(2)
# in Python
df.repartition(5, col("DEST_COUNTRY_NAME")).coalesce(2)
Collecting Rows to the Driver
As discussed in previous chapters, Spark maintains the state of the cluster in the driver. There are
times when you’ll want to collect some of your data to the driver in order to manipulate it on your
local machine.
Thus far, we have not explicitly defined this operation. However, we have used several different
methods that are effectively all the same: collect gets all data from the entire DataFrame, take
selects the first N rows, and show prints out a number of rows nicely.
// in Scala
val collectDF = df.limit(10)
collectDF.take(5) // take works with an Integer count
collectDF.show() // this prints it out nicely
collectDF.show(5, false)
collectDF.collect()
# in Python
collectDF = df.limit(10)
collectDF.take(5) # take works with an Integer count
collectDF.show() # this prints it out nicely
collectDF.show(5, False)
collectDF.collect()
There’s an additional way of collecting rows to the driver in order to iterate over the entire dataset.
The method toLocalIterator collects partitions to the driver as an iterator. This method allows you
to iterate over the entire dataset partition-by-partition in a serial manner:
collectDF.toLocalIterator()
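The three collection methods differ mainly in how much data is resident on the driver at once. A plain-Python sketch of their semantics (not Spark code; the partition list is a stand-in for a partitioned DataFrame):

```python
from itertools import islice, chain

partitions = [[1, 2, 3], [4, 5], [6]]  # stand-in for a 3-partition DataFrame

def take(n):
    # take: pull only the first n rows to the driver.
    return list(islice(chain.from_iterable(partitions), n))

def collect():
    # collect: materialize every row on the driver at once.
    return list(chain.from_iterable(partitions))

def to_local_iterator():
    # toLocalIterator: stream one partition at a time, serially.
    for part in partitions:
        yield from part

print(take(5))    # [1, 2, 3, 4, 5]
print(collect())  # [1, 2, 3, 4, 5, 6]
print(list(to_local_iterator()) == collect())  # True: same rows, streamed
```

This is why toLocalIterator can survive datasets that would crash collect, as long as each individual partition fits on the driver.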
WARNING
Any collection of data to the driver can be a very expensive operation! If you have a large dataset and call collect, you
can crash the driver. If you use toLocalIterator and have very large partitions, you can easily crash the driver node and
lose the state of your application. This is also expensive because it operates on a one-by-one basis, instead of running
computation in parallel.
Conclusion
This chapter covered basic operations on DataFrames. You learned the simple concepts and tools that
you will need to be successful with Spark DataFrames. Chapter 6 covers in much greater detail all of
the different ways in which you can manipulate the data in those DataFrames.
Chapter 6. Working with Different Types of
Data
Chapter 5 presented basic DataFrame concepts and abstractions. This chapter covers building
expressions, which are the bread and butter of Spark’s structured operations. We also review working
with a variety of different kinds of data, including the following:
Booleans
Numbers
Strings
Dates and timestamps
Handling null
Complex types
User-defined functions
Where to Look for APIs
Before we begin, it’s worth explaining where you as a user should look for transformations. Spark is
a growing project, and any book (including this one) is a snapshot in time. One of our priorities in this
book is to teach where, as of this writing, you should look to find functions to transform your data.
Following are the key places to look:
DataFrame (Dataset) Methods
This is actually a bit of a trick because a DataFrame is just a Dataset of Row types, so you’ll
actually end up looking at the Dataset methods, which are available at this link.
Dataset submodules like DataFrameStatFunctions and DataFrameNaFunctions have more
methods that solve specific sets of problems. DataFrameStatFunctions, for example, holds a
variety of statistically related functions, whereas DataFrameNaFunctions refers to functions that are
relevant when working with null data.
Column Methods
These were introduced for the most part in Chapter 5. They hold a variety of general column-
related methods like alias or contains. You can find the API Reference for Column methods
here.
org.apache.spark.sql.functions contains a variety of functions for a range of different data
types. Often, you’ll see the entire package imported because they are used so frequently. You can find
SQL and DataFrame functions here.
Now, this may feel a bit overwhelming, but have no fear: the majority of these functions are ones that
you will find in SQL and analytics systems. All of these tools exist to achieve one purpose: to
transform rows of data in one format or structure to another. This might create more rows or reduce
the number of rows available. To begin, let’s read in the DataFrame that we’ll be using for this
analysis:
// in Scala
val df = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/data/retail-data/by-day/2010-12-01.csv")
df.printSchema()
df.createOrReplaceTempView("dfTable")
# in Python
df = spark.read.format("csv")\
.option("header", "true")\
.option("inferSchema", "true")\
.load("/data/retail-data/by-day/2010-12-01.csv")
df.printSchema()
df.createOrReplaceTempView("dfTable")
Here’s the result of the schema and a small sample of the data:
root
|-- InvoiceNo: string (nullable = true)
|-- StockCode: string (nullable = true)
|-- Description: string (nullable = true)
|-- Quantity: integer (nullable = true)
|-- InvoiceDate: timestamp (nullable = true)
|-- UnitPrice: double (nullable = true)
|-- CustomerID: double (nullable = true)
|-- Country: string (nullable = true)
+---------+---------+--------------------+--------+-------------------+----...
|InvoiceNo|StockCode| Description|Quantity| InvoiceDate|Unit...
+---------+---------+--------------------+--------+-------------------+----...
| 536365| 85123A|WHITE HANGING HEA...| 6|2010-12-01 08:26:00| ...
| 536365| 71053| WHITE METAL LANTERN| 6|2010-12-01 08:26:00| ...
...
| 536367| 21755|LOVE BUILDING BLO...| 3|2010-12-01 08:34:00| ...
| 536367| 21777|RECIPE BOX WITH M...| 4|2010-12-01 08:34:00| ...
+---------+---------+--------------------+--------+-------------------+----...
Converting to Spark Types
One thing you’ll see us do throughout this chapter is convert native types to Spark types. We do this
by using the first function that we introduce here, the lit function. This function converts a type in
another language to its corresponding Spark representation. Here’s how we can convert a couple of
different kinds of Scala and Python values to their respective Spark types:
// in Scala
import org.apache.spark.sql.functions.lit
df.select(lit(5), lit("five"), lit(5.0))
# in Python
from pyspark.sql.functions import lit
df.select(lit(5), lit("five"), lit(5.0))
There’s no equivalent function necessary in SQL, so we can use the values directly:
-- in SQL
SELECT 5, "five", 5.0
Working with Booleans
Booleans are essential when it comes to data analysis because they are the foundation for all filtering.
Boolean statements consist of four elements: and, or, true, and false. We use these simple structures
to build logical statements that evaluate to either true or false. These statements are often used as
conditional requirements: a row of data must either pass the test (evaluate to true) or it will be
filtered out.
Let’s use our retail dataset to explore working with Booleans. We can specify equality as well as
less-than or greater-than:
// in Scala
import org.apache.spark.sql.functions.col
df.where(col("InvoiceNo").equalTo(536365))
.select("InvoiceNo", "Description")
.show(5, false)
WARNING
Scala has some particular semantics regarding the use of == and ===. In Spark, if you want to filter by equality you should
use === (equal) or =!= (not equal). You can also use the not function and the equalTo method.
// in Scala
import org.apache.spark.sql.functions.col
df.where(col("InvoiceNo") === 536365)
.select("InvoiceNo", "Description")
.show(5, false)
Python keeps a more conventional notation:
# in Python
from pyspark.sql.functions import col
df.where(col("InvoiceNo") != 536365)\
.select("InvoiceNo", "Description")\
.show(5, False)
+---------+-----------------------------+
|InvoiceNo|Description |
+---------+-----------------------------+
|536366 |HAND WARMER UNION JACK |
...
|536367 |POPPY'S PLAYHOUSE KITCHEN |
+---------+-----------------------------+
Another option—and probably the cleanest—is to specify the predicate as an expression in a string.
This is valid for Python or Scala. Note that this also gives you access to another way of expressing
“does not equal”:
df.where("InvoiceNo = 536365")
.show(5, false)
df.where("InvoiceNo <> 536365")
.show(5, false)
We mentioned that you can specify Boolean expressions with multiple parts when you use and or or.
In Spark, you should always chain together and filters as a sequential filter.
The reason for this is that even if Boolean statements are expressed serially (one after the other),
Spark will flatten all of these filters into one statement and perform the filter at the same time,
creating the and statement for us. Although you can specify your statements explicitly by using and if
you like, they’re often easier to understand and to read if you specify them serially. or statements
need to be specified in the same statement:
// in Scala
val priceFilter = col("UnitPrice") > 600
val descripFilter = col("Description").contains("POSTAGE")
df.where(col("StockCode").isin("DOT")).where(priceFilter.or(descripFilter))
.show()
# in Python
from pyspark.sql.functions import instr
priceFilter = col("UnitPrice") > 600
descripFilter = instr(df.Description, "POSTAGE") >= 1
df.where(df.StockCode.isin("DOT")).where(priceFilter | descripFilter).show()
-- in SQL
SELECT * FROM dfTable WHERE StockCode in ("DOT") AND(UnitPrice > 600 OR
instr(Description, "POSTAGE") >= 1)
+---------+---------+--------------+--------+-------------------+---------+...
|InvoiceNo|StockCode| Description|Quantity| InvoiceDate|UnitPrice|...
+---------+---------+--------------+--------+-------------------+---------+...
| 536544| DOT|DOTCOM POSTAGE| 1|2010-12-01 14:32:00| 569.77|...
| 536592| DOT|DOTCOM POSTAGE| 1|2010-12-01 17:06:00| 607.49|...
+---------+---------+--------------+--------+-------------------+---------+...
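The parentheses in that SQL statement matter: AND binds more tightly than OR, so removing them changes which rows pass. A plain-Python check of the precedence (Python’s and/or bind the same way) using assumed sample rows:

```python
# One row that is DOT-coded, cheap, but mentions POSTAGE.
is_dot, price_high, has_postage = True, False, True

with_parens    = is_dot and (price_high or has_postage)  # the intended filter
without_parens = is_dot and price_high or has_postage    # AND evaluated first

print(with_parens, without_parens)  # True True: they agree on this row...

# ...but not on every row: a non-DOT row mentioning POSTAGE slips
# through the unparenthesized version.
is_dot = False
print(is_dot and (price_high or has_postage))  # False
print(is_dot and price_high or has_postage)    # True
```

The same care applies when building the equivalent Column expressions: group the or branch explicitly, as priceFilter.or(descripFilter) does above.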
Boolean expressions are not reserved just for filters. To filter a DataFrame, you can also just specify a
Boolean column:
// in Scala
val DOTCodeFilter = col("StockCode") === "DOT"
val priceFilter = col("UnitPrice") > 600
val descripFilter = col("Description").contains("POSTAGE")
df.withColumn("isExpensive", DOTCodeFilter.and(priceFilter.or(descripFilter)))
.where("isExpensive")
.select("unitPrice", "isExpensive").show(5)
# in Python
from pyspark.sql.functions import instr
DOTCodeFilter = col("StockCode") == "DOT"
priceFilter = col("UnitPrice") > 600
descripFilter = instr(col("Description"), "POSTAGE") >= 1
df.withColumn("isExpensive", DOTCodeFilter & (priceFilter | descripFilter))\
.where("isExpensive")\
.select("unitPrice", "isExpensive").show(5)
-- in SQL
SELECT UnitPrice, (StockCode = 'DOT' AND
(UnitPrice > 600 OR instr(Description, "POSTAGE") >= 1)) as isExpensive
FROM dfTable
WHERE (StockCode = 'DOT' AND
(UnitPrice > 600 OR instr(Description, "POSTAGE") >= 1))
Notice how we did not need to specify our filter as an expression and how we could use a column
name without any extra work.
If you're coming from a SQL background, all of these statements should seem quite familiar. Indeed,
all of them can be expressed as a where clause. In fact, it's often easier to just express filters as SQL
statements than using the programmatic DataFrame interface, and Spark SQL allows us to do this
without paying any performance penalty. For example, the following two statements are equivalent:
// in Scala
import org.apache.spark.sql.functions.{expr, not, col}
df.withColumn("isExpensive", not(col("UnitPrice").leq(250)))
.filter("isExpensive")
.select("Description", "UnitPrice").show(5)
df.withColumn("isExpensive", expr("NOT UnitPrice <= 250"))
.filter("isExpensive")
.select("Description", "UnitPrice").show(5)
# in Python
from pyspark.sql.functions import expr
df.withColumn("isExpensive", expr("NOT UnitPrice <= 250"))\
.where("isExpensive")\
.select("Description", "UnitPrice").show(5)
WARNING
One “gotcha” that can come up is if you’re working with null data when creating Boolean expressions. If there is a null in
your data, you’ll need to treat things a bit differently. Here’s how you can ensure that you perform a null-safe equivalence
test:
df.where(col("Description").eqNullSafe("hello")).show()
Although not currently available (Spark 2.2), IS [NOT] DISTINCT FROM will be coming in Spark
2.3 to do the same thing in SQL.
Working with Numbers
When working with big data, the second most common task you will do after filtering things is
counting things. For the most part, we simply need to express our computation, and that should be
valid assuming that we’re working with numerical data types.
To fabricate a contrived example, let's imagine that we found out that we mis-recorded the quantity in
our retail dataset and the true quantity is equal to (the current quantity * the unit price)² + 5. This will
introduce our first numerical function, the pow function, which raises a column to the expressed
power:
// in Scala
import org.apache.spark.sql.functions.{expr, pow}
val fabricatedQuantity = pow(col("Quantity") * col("UnitPrice"), 2) + 5
df.select(expr("CustomerId"), fabricatedQuantity.alias("realQuantity")).show(2)
# in Python
from pyspark.sql.functions import expr, pow
fabricatedQuantity = pow(col("Quantity") * col("UnitPrice"), 2) + 5
df.select(expr("CustomerId"), fabricatedQuantity.alias("realQuantity")).show(2)
+----------+------------------+
|CustomerId| realQuantity|
+----------+------------------+
| 17850.0|239.08999999999997|
| 17850.0| 418.7156|
+----------+------------------+
Notice that we were able to multiply our columns together because they were both numerical.
Naturally we can add and subtract as necessary. In fact, we can do all of this as a SQL expression, as
well:
// in Scala
df.selectExpr(
"CustomerId",
"(POWER((Quantity * UnitPrice), 2.0) + 5) as realQuantity").show(2)
# in Python
df.selectExpr(
"CustomerId",
"(POWER((Quantity * UnitPrice), 2.0) + 5) as realQuantity").show(2)
-- in SQL
SELECT customerId, (POWER((Quantity * UnitPrice), 2.0) + 5) as realQuantity
FROM dfTable
Another common numerical task is rounding. If you'd like to just round to a whole number, oftentimes
you can cast the value to an integer and that will work just fine. However, Spark also has more
detailed functions for performing this explicitly and to a certain level of precision. In the following
example, we round to one decimal place:
// in Scala
import org.apache.spark.sql.functions.{round, bround}
df.select(round(col("UnitPrice"), 1).alias("rounded"), col("UnitPrice")).show(5)
By default, the round function rounds up if you're exactly in between two numbers. You can round
down by using bround:
// in Scala
import org.apache.spark.sql.functions.lit
df.select(round(lit("2.5")), bround(lit("2.5"))).show(2)
# in Python
from pyspark.sql.functions import lit, round, bround
df.select(round(lit("2.5")), bround(lit("2.5"))).show(2)
-- in SQL
SELECT round(2.5), bround(2.5)
+-------------+--------------+
|round(2.5, 0)|bround(2.5, 0)|
+-------------+--------------+
| 3.0| 2.0|
| 3.0| 2.0|
+-------------+--------------+
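The difference comes from the rounding modes involved: round resolves ties by rounding up (HALF_UP), whereas bround rounds ties to the nearest even digit (HALF_EVEN, "banker's rounding"), which is why bround(2.5) is 2. A quick plain-Python sketch of those two modes using the decimal module (the helper names here are ours, for illustration):

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

def half_up(x):
    # Ties round away from zero: 2.5 -> 3
    return int(Decimal(x).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

def half_even(x):
    # Ties round to the nearest even digit: 2.5 -> 2, 3.5 -> 4
    return int(Decimal(x).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))

print(half_up("2.5"), half_even("2.5"))   # 3 2
print(half_up("3.5"), half_even("3.5"))   # 4 4
```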
Another numerical task is to compute the correlation of two columns. For example, we can compute
the Pearson correlation coefficient for two columns to see if cheaper things are typically bought in
greater quantities. We can do this through a function as well as through the DataFrame statistic
methods:
// in Scala
import org.apache.spark.sql.functions.{corr}
df.stat.corr("Quantity", "UnitPrice")
df.select(corr("Quantity", "UnitPrice")).show()
# in Python
from pyspark.sql.functions import corr
df.stat.corr("Quantity", "UnitPrice")
df.select(corr("Quantity", "UnitPrice")).show()
-- in SQL
SELECT corr(Quantity, UnitPrice) FROM dfTable
+-------------------------+
|corr(Quantity, UnitPrice)|
+-------------------------+
| -0.04112314436835551|
+-------------------------+
Another common task is to compute summary statistics for a column or set of columns. We can use the
describe method to achieve exactly this. This will take all numeric columns and calculate the count,
mean, standard deviation, min, and max. You should use this primarily for viewing in the console
because the schema might change in the future:
// in Scala
df.describe().show()
# in Python
df.describe().show()
+-------+------------------+------------------+------------------+
|summary| Quantity| UnitPrice| CustomerID|
+-------+------------------+------------------+------------------+
| count| 3108| 3108| 1968|
| mean| 8.627413127413128| 4.151946589446603|15661.388719512195|
| stddev|26.371821677029203|15.638659854603892|1854.4496996893627|
| min| -24| 0.0| 12431.0|
| max| 600| 607.49| 18229.0|
+-------+------------------+------------------+------------------+
If you need these exact numbers, you can also perform this as an aggregation yourself by importing the
functions and applying them to the columns that you need:
// in Scala
import org.apache.spark.sql.functions.{count, mean, stddev_pop, min, max}
# in Python
from pyspark.sql.functions import count, mean, stddev_pop, min, max
There are a number of statistical functions available in the StatFunctions package (accessible using
stat, as we see in the code block below). These are DataFrame methods that you can use to calculate
a variety of different things. For instance, you can calculate either exact or approximate quantiles of
your data using the approxQuantile method:
// in Scala
val colName = "UnitPrice"
val quantileProbs = Array(0.5)
val relError = 0.05
df.stat.approxQuantile("UnitPrice", quantileProbs, relError) // 2.51
# in Python
colName = "UnitPrice"
quantileProbs = [0.5]
relError = 0.05
df.stat.approxQuantile("UnitPrice", quantileProbs, relError) # 2.51
You also can use this to see a cross-tabulation or frequent item pairs (be careful, this output will be
large and is omitted for this reason):
// in Scala
df.stat.crosstab("StockCode", "Quantity").show()
# in Python
df.stat.crosstab("StockCode", "Quantity").show()
// in Scala
df.stat.freqItems(Seq("StockCode", "Quantity")).show()
# in Python
df.stat.freqItems(["StockCode", "Quantity"]).show()
As a last note, we can also add a unique ID to each row by using the function
monotonically_increasing_id. This function generates a unique value for each row, starting with
0:
// in Scala
import org.apache.spark.sql.functions.monotonically_increasing_id
df.select(monotonically_increasing_id()).show(2)
# in Python
from pyspark.sql.functions import monotonically_increasing_id
df.select(monotonically_increasing_id()).show(2)
There are functions added with every release, so check the documentation for more methods. For
instance, there are some random data generation tools (e.g., rand(), randn()) with which you can
randomly generate data; however, there are potential determinism issues when doing so. (You can
find discussions about these challenges on the Spark mailing list.) There are also a number of more
advanced tasks like bloom filtering and sketching algorithms available in the stat package that we
mentioned (and linked to) at the beginning of this chapter. Be sure to search the API documentation for
more information and functions.
Working with Strings
String manipulation shows up in nearly every data flow, and it's worth explaining what you can do
with strings. You might be manipulating log files, performing regular expression extraction or
substitution, checking for simple string existence, or making all strings uppercase or lowercase.
Let's begin with the last task because it's the most straightforward. The initcap function will
capitalize every word in a given string when that word is separated from another by a space.
// in Scala
import org.apache.spark.sql.functions.{initcap}
df.select(initcap(col("Description"))).show(2, false)
# in Python
from pyspark.sql.functions import initcap
df.select(initcap(col("Description"))).show()
-- in SQL
SELECT initcap(Description) FROM dfTable
+----------------------------------+
|initcap(Description) |
+----------------------------------+
|White Hanging Heart T-light Holder|
|White Metal Lantern |
+----------------------------------+
As just mentioned, you can convert strings to uppercase and lowercase, as well:
// in Scala
import org.apache.spark.sql.functions.{lower, upper}
df.select(col("Description"),
lower(col("Description")),
upper(lower(col("Description")))).show(2)
# in Python
from pyspark.sql.functions import lower, upper
df.select(col("Description"),
lower(col("Description")),
upper(lower(col("Description")))).show(2)
-- in SQL
SELECT Description, lower(Description), Upper(lower(Description)) FROM dfTable
+--------------------+--------------------+-------------------------+
| Description| lower(Description)|upper(lower(Description))|
+--------------------+--------------------+-------------------------+
|WHITE HANGING HEA...|white hanging hea...| WHITE HANGING HEA...|
| WHITE METAL LANTERN| white metal lantern| WHITE METAL LANTERN|
+--------------------+--------------------+-------------------------+
Another trivial task is adding or removing spaces around a string. You can do this by using ltrim,
rtrim, trim, lpad, and rpad:
// in Scala
import org.apache.spark.sql.functions.{lit, ltrim, rtrim, rpad, lpad, trim}
df.select(
ltrim(lit(" HELLO ")).as("ltrim"),
rtrim(lit(" HELLO ")).as("rtrim"),
trim(lit(" HELLO ")).as("trim"),
lpad(lit("HELLO"), 3, " ").as("lp"),
rpad(lit("HELLO"), 10, " ").as("rp")).show(2)
# in Python
from pyspark.sql.functions import lit, ltrim, rtrim, rpad, lpad, trim
df.select(
ltrim(lit(" HELLO ")).alias("ltrim"),
rtrim(lit(" HELLO ")).alias("rtrim"),
trim(lit(" HELLO ")).alias("trim"),
lpad(lit("HELLO"), 3, " ").alias("lp"),
rpad(lit("HELLO"), 10, " ").alias("rp")).show(2)
-- in SQL
SELECT
ltrim(' HELLLOOOO '),
rtrim(' HELLLOOOO '),
trim(' HELLLOOOO '),
lpad('HELLOOOO ', 3, ' '),
rpad('HELLOOOO ', 10, ' ')
FROM dfTable
+---------+---------+-----+---+----------+
| ltrim| rtrim| trim| lp| rp|
+---------+---------+-----+---+----------+
|HELLO | HELLO|HELLO| HE|HELLO |
|HELLO | HELLO|HELLO| HE|HELLO |
+---------+---------+-----+---+----------+
Note that if lpad or rpad takes a number less than the length of the string, it will always remove
values from the right side of the string.
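A plain-Python sketch of that truncation behavior (the helper here is ours, not a Spark API): when the target length is shorter than the input, only the leftmost characters survive, and otherwise the pad characters are prepended:

```python
def lpad(s, length, pad):
    # When the target length is shorter than the string, Spark-style
    # lpad/rpad truncate: only the first `length` characters are kept.
    if length <= len(s):
        return s[:length]
    # Otherwise, prepend pad characters until the target length is reached.
    return (pad * length)[:length - len(s)] + s

print(lpad("HELLO", 3, " "))   # HEL
print(lpad("HELLO", 10, " "))  # '     HELLO'
```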
Regular Expressions
Probably one of the most frequently performed tasks is searching for the existence of one string in
another or replacing all mentions of a string with another value. This is often done with a tool called
regular expressions that exists in many programming languages. Regular expressions give the user an
ability to specify a set of rules to use to either extract values from a string or replace them with some
other values.
Spark takes advantage of the complete power of Java regular expressions. The Java regular
expression syntax departs slightly from other programming languages, so it is worth reviewing before
putting anything into production. There are two key functions in Spark that you'll need in order to
perform regular expression tasks: regexp_extract and regexp_replace. These functions extract
values and replace values, respectively.
Let's explore how to use the regexp_replace function to substitute color names in our
description column:
// in Scala
import org.apache.spark.sql.functions.regexp_replace
val simpleColors = Seq("black", "white", "red", "green", "blue")
val regexString = simpleColors.map(_.toUpperCase).mkString("|")
// the | signifies `OR` in regular expression syntax
df.select(
regexp_replace(col("Description"), regexString, "COLOR").alias("color_clean"),
col("Description")).show(2)
# in Python
from pyspark.sql.functions import regexp_replace
regex_string = "BLACK|WHITE|RED|GREEN|BLUE"
df.select(
regexp_replace(col("Description"), regex_string, "COLOR").alias("color_clean"),
col("Description")).show(2)
-- in SQL
SELECT
regexp_replace(Description, 'BLACK|WHITE|RED|GREEN|BLUE', 'COLOR') as
color_clean, Description
FROM dfTable
+--------------------+--------------------+
| color_clean| Description|
+--------------------+--------------------+
|COLOR HANGING HEA...|WHITE HANGING HEA...|
| COLOR METAL LANTERN| WHITE METAL LANTERN|
+--------------------+--------------------+
Another task might be to replace given characters with other characters. Building this as a regular
expression could be tedious, so Spark also provides the translate function to replace these values.
This is done at the character level and will replace all instances of a character with the indexed
character in the replacement string:
// in Scala
import org.apache.spark.sql.functions.translate
df.select(translate(col("Description"), "LEET", "1337"), col("Description"))
.show(2)
# in Python
from pyspark.sql.functions import translate
df.select(translate(col("Description"), "LEET", "1337"),col("Description"))\
.show(2)
-- in SQL
SELECT translate(Description, 'LEET', '1337'), Description FROM dfTable
+----------------------------------+--------------------+
|translate(Description, LEET, 1337)| Description|
+----------------------------------+--------------------+
| WHI73 HANGING H3A...|WHITE HANGING HEA...|
| WHI73 M37A1 1AN73RN| WHITE METAL LANTERN|
+----------------------------------+--------------------+
We can also perform something similar, like pulling out the first mentioned color:
// in Scala
import org.apache.spark.sql.functions.regexp_extract
val regexString = simpleColors.map(_.toUpperCase).mkString("(", "|", ")")
// the | signifies OR in regular expression syntax
df.select(
regexp_extract(col("Description"), regexString, 1).alias("color_clean"),
col("Description")).show(2)
# in Python
from pyspark.sql.functions import regexp_extract
extract_str = "(BLACK|WHITE|RED|GREEN|BLUE)"
df.select(
regexp_extract(col("Description"), extract_str, 1).alias("color_clean"),
col("Description")).show(2)
-- in SQL
SELECT regexp_extract(Description, '(BLACK|WHITE|RED|GREEN|BLUE)', 1),
Description
FROM dfTable
+-------------+--------------------+
| color_clean| Description|
+-------------+--------------------+
| WHITE|WHITE HANGING HEA...|
| WHITE| WHITE METAL LANTERN|
+-------------+--------------------+
Sometimes, rather than extracting values, we simply want to check for their existence. We can do this
with the contains method on each column. This will return a Boolean declaring whether the value
you specify is in the column's string:
// in Scala
val containsBlack = col("Description").contains("BLACK")
val containsWhite = col("DESCRIPTION").contains("WHITE")
df.withColumn("hasSimpleColor", containsBlack.or(containsWhite))
.where("hasSimpleColor")
.select("Description").show(3, false)
In Python and SQL, we can use the instr function:
# in Python
from pyspark.sql.functions import instr
containsBlack = instr(col("Description"), "BLACK") >= 1
containsWhite = instr(col("Description"), "WHITE") >= 1
df.withColumn("hasSimpleColor", containsBlack | containsWhite)\
.where("hasSimpleColor")\
.select("Description").show(3, False)
-- in SQL
SELECT Description FROM dfTable
WHERE instr(Description, 'BLACK') >= 1 OR instr(Description, 'WHITE') >= 1
+----------------------------------+
|Description |
+----------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER|
|WHITE METAL LANTERN |
|RED WOOLLY HOTTIE WHITE HEART. |
+----------------------------------+
This is trivial with just two values, but it becomes more complicated when there are many values.
Let's work through this in a more rigorous way and take advantage of Spark's ability to accept a
dynamic number of arguments. When we convert a list of values into a set of arguments and pass them
into a function, we use a language feature called varargs. Using this feature, we can effectively
unravel an array of arbitrary length and pass it as arguments to a function. This, coupled with select,
makes it possible for us to create arbitrary numbers of columns dynamically:
// in Scala
val simpleColors = Seq("black", "white", "red", "green", "blue")
val selectedColumns = simpleColors.map(color => {
col("Description").contains(color.toUpperCase).alias(s"is_$color")
}):+expr("*") // could also append this value
df.select(selectedColumns:_*).where(col("is_white").or(col("is_red")))
.select("Description").show(3, false)
+----------------------------------+
|Description |
+----------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER|
|WHITE METAL LANTERN |
|RED WOOLLY HOTTIE WHITE HEART. |
+----------------------------------+
We can also do this quite easily in Python. In this case, we're going to use a different function,
locate, that returns the integer location (1-based) of the substring. We then convert that to a Boolean
before using it in the same way:
# in Python
from pyspark.sql.functions import expr, locate
simpleColors = ["black", "white", "red", "green", "blue"]
def color_locator(column, color_string):
return locate(color_string.upper(), column)\
.cast("boolean")\
.alias("is_" + color_string)
selectedColumns = [color_locator(df.Description, c) for c in simpleColors]
selectedColumns.append(expr("*")) # has to be a Column type
df.select(*selectedColumns).where(expr("is_white OR is_red"))\
.select("Description").show(3, False)
This simple feature can often help you programmatically generate columns or Boolean filters in a way
that is simple to understand and extend. We could extend this to calculating the smallest common
denominator for a given input value, or whether a number is a prime.
Working with Dates and Timestamps
Dates and times are a constant challenge in programming languages and databases. It’s always
necessary to keep track of timezones and ensure that formats are correct and valid. Spark does its best
to keep things simple by focusing explicitly on two kinds of time-related information. There are dates,
which focus exclusively on calendar dates, and timestamps, which include both date and time
information. Spark, as we saw with our current dataset, will make a best effort to correctly identify
column types, including dates and timestamps when we enable inferSchema. We can see that this
worked quite well with our current dataset because it was able to identify and read our date format
without us having to provide some specification for it.
As we hinted earlier, working with dates and timestamps closely relates to working with strings
because we often store our timestamps or dates as strings and convert them into date types at runtime.
This is less common when working with databases and structured data but much more common when
we are working with text and CSV files. We will experiment with that shortly.
WARNING
There are a lot of caveats, unfortunately, when working with dates and timestamps, especially when it comes to timezone
handling. In version 2.1 and before, Spark parsed according to the machine's timezone if timezones were not explicitly
specified in the value that you are parsing. You can set a session-local timezone if necessary by setting the
spark.sql.session.timeZone SQL configuration. This should be set according to the Java TimeZone format.
df.printSchema()
root
|-- InvoiceNo: string (nullable = true)
|-- StockCode: string (nullable = true)
|-- Description: string (nullable = true)
|-- Quantity: integer (nullable = true)
|-- InvoiceDate: timestamp (nullable = true)
|-- UnitPrice: double (nullable = true)
|-- CustomerID: double (nullable = true)
|-- Country: string (nullable = true)
Spark will read dates and times on a best-effort basis. However, sometimes there will be
no getting around working with strangely formatted dates and times. The key to understanding the
transformations that you are going to need to apply is to ensure that you know exactly what type and
format you have at each given step of the way. Another common "gotcha" is that Spark's
TimestampType class supports only second-level precision, which means that if you're going to be
working with milliseconds or microseconds, you'll need to work around this problem by potentially
operating on them as longs. Any more precision when coercing to a TimestampType will be
removed.
Spark can be a bit particular about what format you have at any given point in time. It's important to
be explicit when parsing or converting to ensure that there are no issues in doing so. At the end of the
day, Spark is working with Java dates and timestamps and therefore conforms to those standards.
Let's begin with the basics and get the current date and the current timestamp:
// in Scala
import org.apache.spark.sql.functions.{current_date, current_timestamp}
val dateDF = spark.range(10)
.withColumn("today", current_date())
.withColumn("now", current_timestamp())
dateDF.createOrReplaceTempView("dateTable")
# in Python
from pyspark.sql.functions import current_date, current_timestamp
dateDF = spark.range(10)\
.withColumn("today", current_date())\
.withColumn("now", current_timestamp())
dateDF.createOrReplaceTempView("dateTable")
dateDF.printSchema()
root
|-- id: long (nullable = false)
|-- today: date (nullable = false)
|-- now: timestamp (nullable = false)
Now that we have a simple DataFrame to work with, let's add and subtract five days from today.
These functions take a column and then the number of days to either add or subtract as the arguments:
// in Scala
import org.apache.spark.sql.functions.{date_add, date_sub}
dateDF.select(date_sub(col("today"), 5), date_add(col("today"), 5)).show(1)
# in Python
from pyspark.sql.functions import date_add, date_sub
dateDF.select(date_sub(col("today"), 5), date_add(col("today"), 5)).show(1)
-- in SQL
SELECT date_sub(today, 5), date_add(today, 5) FROM dateTable
+------------------+------------------+
|date_sub(today, 5)|date_add(today, 5)|
+------------------+------------------+
| 2017-06-12| 2017-06-22|
+------------------+------------------+
Another common task is to take a look at the difference between two dates. We can do this with the
datediff function that will return the number of days in between two dates. Most often we just care
about the days, and because the number of days varies from month to month, there also exists a
function, months_between, that gives you the number of months between two dates:
// in Scala
import org.apache.spark.sql.functions.{datediff, months_between, to_date}
dateDF.withColumn("week_ago", date_sub(col("today"), 7))
.select(datediff(col("week_ago"), col("today"))).show(1)
dateDF.select(
to_date(lit("2016-01-01")).alias("start"),
to_date(lit("2017-05-22")).alias("end"))
.select(months_between(col("start"), col("end"))).show(1)
# in Python
from pyspark.sql.functions import datediff, months_between, to_date
dateDF.withColumn("week_ago", date_sub(col("today"), 7))\
.select(datediff(col("week_ago"), col("today"))).show(1)
dateDF.select(
to_date(lit("2016-01-01")).alias("start"),
to_date(lit("2017-05-22")).alias("end"))\
.select(months_between(col("start"), col("end"))).show(1)
-- in SQL
SELECT to_date('2016-01-01'), months_between('2016-01-01', '2017-01-01'),
datediff('2016-01-01', '2017-01-01')
FROM dateTable
+-------------------------+
|datediff(week_ago, today)|
+-------------------------+
| -7|
+-------------------------+
+--------------------------+
|months_between(start, end)|
+--------------------------+
| -16.67741935|
+--------------------------+
Notice that we introduced a new function: the to_date function. The to_date function allows you to
convert a string to a date, optionally with a specified format. We specify our format using the Java
SimpleDateFormat standard, which will be important to reference if you use this function:
// in Scala
import org.apache.spark.sql.functions.{to_date, lit}
spark.range(5).withColumn("date", lit("2017-01-01"))
.select(to_date(col("date"))).show(1)
# in Python
from pyspark.sql.functions import to_date, lit
spark.range(5).withColumn("date", lit("2017-01-01"))\
.select(to_date(col("date"))).show(1)
Spark will not throw an error if it cannot parse the date; rather, it will just return null. This can be a
bit tricky in larger pipelines because you might be expecting your data in one format and getting it in
another. To illustrate, let's take a look at a date format that has switched from year-month-day to
year-day-month. Spark will fail to parse this date and silently return null instead:
dateDF.select(to_date(lit("2016-20-12")),to_date(lit("2017-12-11"))).show(1)
+-------------------+-------------------+
|to_date(2016-20-12)|to_date(2017-12-11)|
+-------------------+-------------------+
| null| 2017-12-11|
+-------------------+-------------------+
We find this to be an especially tricky situation for bugs because some dates might match the correct
format, whereas others do not. In the previous example, notice how the second date appears as
December 11th instead of the correct day, November 12th. Spark doesn't throw an error because it
cannot know whether the days are mixed up or whether that specific row is incorrect.
Let's fix this pipeline, step by step, and come up with a robust way to avoid these issues entirely. The
first step is to remember that we need to specify our date format according to the Java
SimpleDateFormat standard.
We will use two functions to fix this: to_date and to_timestamp. The former optionally expects a
format, whereas the latter requires one:
// in Scala
import org.apache.spark.sql.functions.to_date
val dateFormat = "yyyy-dd-MM"
val cleanDateDF = spark.range(1).select(
to_date(lit("2017-12-11"), dateFormat).alias("date"),
to_date(lit("2017-20-12"), dateFormat).alias("date2"))
cleanDateDF.createOrReplaceTempView("dateTable2")
# in Python
from pyspark.sql.functions import to_date
dateFormat = "yyyy-dd-MM"
cleanDateDF = spark.range(1).select(
to_date(lit("2017-12-11"), dateFormat).alias("date"),
to_date(lit("2017-20-12"), dateFormat).alias("date2"))
cleanDateDF.createOrReplaceTempView("dateTable2")
-- in SQL
SELECT to_date(date, 'yyyy-dd-MM'), to_date(date2, 'yyyy-dd-MM'), to_date(date)
FROM dateTable2
+----------+----------+
| date| date2|
+----------+----------+
|2017-11-12|2017-12-20|
+----------+----------+
Now let's use an example of to_timestamp, which always requires a format to be specified:
// in Scala
import org.apache.spark.sql.functions.to_timestamp
cleanDateDF.select(to_timestamp(col("date"), dateFormat)).show()
# in Python
from pyspark.sql.functions import to_timestamp
cleanDateDF.select(to_timestamp(col("date"), dateFormat)).show()
-- in SQL
SELECT to_timestamp(date, 'yyyy-dd-MM'), to_timestamp(date2, 'yyyy-dd-MM')
FROM dateTable2
+----------------------------------+
|to_timestamp(`date`, 'yyyy-dd-MM')|
+----------------------------------+
| 2017-11-12 00:00:00|
+----------------------------------+
Casting between dates and timestamps is simple in all languages—in SQL, we would do it in the
following way:
-- in SQL
SELECT cast(to_date("2017-01-01", "yyyy-dd-MM") as timestamp)
After we have our date or timestamp in the correct format and type, comparing between them is
actually quite easy. We just need to be sure to either use a date/timestamp type or specify our string
according to the right format of yyyy-MM-dd if we’re comparing a date:
cleanDateDF.filter(col("date2") > lit("2017-12-12")).show()
One minor point is that we can also set this as a string, which Spark parses to a literal:
cleanDateDF.filter(col("date2") > "2017-12-12").show()
WARNING
Implicit type casting is an easy way to shoot yourself in the foot, especially when dealing with null values or dates in
different timezones or formats. We recommend that you parse them explicitly instead of relying on implicit conversions.
Working with Nulls in Data
As a best practice, you should always use nulls to represent missing or empty data in your
DataFrames. Spark can optimize working with null values more than it can if you use empty strings or
other values. The primary way of interacting with null values, at DataFrame scale, is to use the .na
subpackage on a DataFrame. There are also several functions for performing operations and
explicitly specifying how Spark should handle null values. For more information, see Chapter 5
(where we discuss ordering), and also refer back to “Working with Booleans”.
WARNING
Nulls are a challenging part of all programming, and Spark is no exception. In our opinion, being explicit is always better
than being implicit when handling null values. For instance, in this part of the book, we saw how we can define columns as
nullable. However, this comes with a catch. When we declare a column as not allowing null values, that is not
actually enforced. To reiterate, when you define a schema in which all columns are declared to not have null values, Spark
will not enforce that and will happily let null values into that column. The nullable signal is simply to help Spark SQL optimize
for handling that column. If you have null values in columns that should not have null values, you can get an incorrect result
or see strange exceptions that can be difficult to debug.
There are two things you can do with null values: you can explicitly drop nulls or you can fill them
with a value (globally or on a per-column basis). Let's experiment with each of these now.
Coalesce
Spark includes the coalesce function, which allows you to select the first non-null value from a set
of columns. In this case, there are no null values, so it simply returns the first column:
// in Scala
import org.apache.spark.sql.functions.coalesce
df.select(coalesce(col("Description"), col("CustomerId"))).show()
# in Python
from pyspark.sql.functions import coalesce
df.select(coalesce(col("Description"), col("CustomerId"))).show()
ifnull, nullIf, nvl, and nvl2
There are several other SQL functions that you can use to achieve similar things. ifnull allows you
to select the second value if the first is null, and defaults to the first. Alternatively, you could use
nullif, which returns null if the two values are equal and otherwise returns the first. nvl
returns the second value if the first is null, but defaults to the first. Finally, nvl2 returns the second
value if the first is not null; otherwise, it will return the last specified value (else_value in the
following example):
-- in SQL
SELECT
ifnull(null, 'return_value'),
nullif('value', 'value'),
nvl(null, 'return_value'),
nvl2('not_null', 'return_value', "else_value")
FROM dfTable LIMIT 1
+------------+----+------------+------------+
| a| b| c| d|
+------------+----+------------+------------+
|return_value|null|return_value|return_value|
+------------+----+------------+------------+
Naturally, we can use these in select expressions on DataFrames, as well.
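The semantics of these four helpers can be sketched in plain Python (the function bodies below are ours, for illustration; in Spark you would call the SQL functions themselves):

```python
def ifnull(a, b):   # second value if the first is null
    return b if a is None else a

def nullif(a, b):   # null if the two values are equal, else the first
    return None if a == b else a

def nvl(a, b):      # same semantics as ifnull
    return b if a is None else a

def nvl2(a, b, c):  # b if a is not null, otherwise c
    return b if a is not None else c

print(ifnull(None, "return_value"))                    # return_value
print(nullif("value", "value"))                        # None
print(nvl(None, "return_value"))                       # return_value
print(nvl2("not_null", "return_value", "else_value"))  # return_value
```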
drop
The simplest function is drop, which removes rows that contain nulls. The default is to drop any row
in which any value is null:
df.na.drop()
df.na.drop("any")
In SQL, we have to do this column by column:
-- in SQL
SELECT * FROM dfTable WHERE Description IS NOT NULL
Specifying "any" as an argument drops a row if any of the values are null. Specifying "all" drops the
row only if all values are null or NaN for that row:
df.na.drop("all")
We can also apply this to certain sets of columns by passing in an array of columns:
// in Scala
df.na.drop("all", Seq("StockCode", "InvoiceNo"))
# in Python
df.na.drop("all", subset=["StockCode", "InvoiceNo"])
fill
Using the fill function, you can fill one or more columns with a set of values. This can be done by
specifying a value and a set of columns to apply it to.
For example, to fill all null values in columns of type String, you might specify the following:
df.na.fill("All Null values become this string")
We could do the same for columns of type Integer by using df.na.fill(5:Integer), or for columns
of type Double by using df.na.fill(5:Double). To specify columns, we just pass in an array of
column names, as we did in the previous example:
// in Scala
df.na.fill(5, Seq("StockCode", "InvoiceNo"))
# in Python
df.na.fill(5, subset=["StockCode", "InvoiceNo"])
We can also do this with a Scala Map, where the key is the column name and the value is the
value we would like to use to fill null values:
// in Scala
val fillColValues = Map("StockCode" -> 5, "Description" -> "No Value")
df.na.fill(fillColValues)
# in Python
fill_cols_vals = {"StockCode": 5, "Description" : "No Value"}
df.na.fill(fill_cols_vals)
replace
In addition to replacing null values like we did with drop and fill, there are more flexible options
that you can use with more than just null values. Probably the most common use case is to replace all
values in a certain column according to their current value. The only requirement is that this value be
the same type as the original value:
// in Scala
df.na.replace("Description", Map("" -> "UNKNOWN"))
# in Python
df.na.replace([""], ["UNKNOWN"], "Description")
Ordering
As we discussed in Chapter 5, you can use asc_nulls_first, desc_nulls_first,
asc_nulls_last, or desc_nulls_last to specify where you would like your null values to appear
in an ordered DataFrame.
Working with Complex Types
Complex types can help you organize and structure your data in ways that make more sense for the
problem that you are hoping to solve. There are three kinds of complex types: structs, arrays, and
maps.
Structs
You can think of structs as DataFrames within DataFrames. A worked example will illustrate this
more clearly. We can create a struct by wrapping a set of columns in parentheses in a query:
df.selectExpr("(Description, InvoiceNo) as complex", "*")
df.selectExpr("struct(Description, InvoiceNo) as complex", "*")
// in Scala
import org.apache.spark.sql.functions.struct
val complexDF = df.select(struct("Description", "InvoiceNo").alias("complex"))
complexDF.createOrReplaceTempView("complexDF")
# in Python
from pyspark.sql.functions import struct
complexDF = df.select(struct("Description", "InvoiceNo").alias("complex"))
complexDF.createOrReplaceTempView("complexDF")
We now have a DataFrame with a column complex. We can query it just as we might another
DataFrame; the only difference is that we use dot syntax to do so, or the column method getField:
complexDF.select("complex.Description")
complexDF.select(col("complex").getField("Description"))
We can also query all values in the struct by using *. This brings all of the columns up to the top-level
DataFrame:
complexDF.select("complex.*")
-- in SQL
SELECT complex.* FROM complexDF
Arrays
To define arrays, let's work through a use case. With our current data, our objective is to take every
single word in our Description column and convert it into a row in our DataFrame.
The first task is to turn our Description column into a complex type: an array.
split
We do this by using the split function and specifying the delimiter:
// in Scala
import org.apache.spark.sql.functions.split
df.select(split(col("Description"), " ")).show(2)
# in Python
from pyspark.sql.functions import split
df.select(split(col("Description"), " ")).show(2)
-- in SQL
SELECT split(Description, ' ') FROM dfTable
+---------------------+
|split(Description, )|
+---------------------+
| [WHITE, HANGING, ...|
| [WHITE, METAL, LA...|
+---------------------+
This is quite powerful because Spark allows us to manipulate this complex type as if it were another column.
We can also query the values of the array using Python-like syntax:
// in Scala
df.select(split(col("Description"), " ").alias("array_col"))
.selectExpr("array_col[0]").show(2)
# in Python
df.select(split(col("Description"), " ").alias("array_col"))\
.selectExpr("array_col[0]").show(2)
-- in SQL
SELECT split(Description, ' ')[0] FROM dfTable
This gives us the following result:
+------------+
|array_col[0]|
+------------+
| WHITE|
| WHITE|
+------------+
Array Length
We can determine the array's length by querying for its size:
// in Scala
import org.apache.spark.sql.functions.size
df.select(size(split(col("Description"), " "))).show(2) // shows 5 and 3
# in Python
from pyspark.sql.functions import size
df.select(size(split(col("Description"), " "))).show(2) # shows 5 and 3
array_contains
We can also see whether this array contains a value:
// in Scala
import org.apache.spark.sql.functions.array_contains
df.select(array_contains(split(col("Description"), " "), "WHITE")).show(2)
# in Python
from pyspark.sql.functions import array_contains
df.select(array_contains(split(col("Description"), " "), "WHITE")).show(2)
-- in SQL
SELECT array_contains(split(Description, ' '), 'WHITE') FROM dfTable
This gives us the following result:
+--------------------------------------------+
|array_contains(split(Description, ), WHITE)|
+--------------------------------------------+
| true|
| true|
+--------------------------------------------+
However, this does not solve our current problem. To convert a complex type into a set of rows (one
per value in our array), we need to use the explode function.
explode
The explode function takes a column that consists of arrays and creates one row (with the rest of the
values duplicated) per value in the array. Figure 6-1 illustrates the process.
Figure 6-1. Exploding a column of text
// in Scala
import org.apache.spark.sql.functions.{split, explode}
df.withColumn("splitted", split(col("Description"), " "))
.withColumn("exploded", explode(col("splitted")))
.select("Description", "InvoiceNo", "exploded").show(2)
# in Python
from pyspark.sql.functions import split, explode
df.withColumn("splitted", split(col("Description"), " "))\
.withColumn("exploded", explode(col("splitted")))\
.select("Description", "InvoiceNo", "exploded").show(2)
-- in SQL
SELECT Description, InvoiceNo, exploded
FROM (SELECT *, split(Description, ' ') as splitted FROM dfTable)
LATERAL VIEW explode(splitted) as exploded
This gives us the following result:
+--------------------+---------+--------+
| Description|InvoiceNo|exploded|
+--------------------+---------+--------+
|WHITE HANGING HEA...| 536365| WHITE|
|WHITE HANGING HEA...| 536365| HANGING|
+--------------------+---------+--------+
Maps
Maps are created by using the map function and key-value pairs of columns. You then can select them
just like you might select from an array:
// in Scala
import org.apache.spark.sql.functions.map
df.select(map(col("Description"), col("InvoiceNo")).alias("complex_map")).show(2)
# in Python
from pyspark.sql.functions import create_map
df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map"))\
.show(2)
-- in SQL
SELECT map(Description, InvoiceNo) as complex_map FROM dfTable
WHERE Description IS NOT NULL
This produces the following result:
+--------------------+
| complex_map|
+--------------------+
|Map(WHITE HANGING...|
|Map(WHITE METAL L...|
+--------------------+
You can query them by using the proper key. A missing key returns null:
// in Scala
df.select(map(col("Description"), col("InvoiceNo")).alias("complex_map"))
.selectExpr("complex_map['WHITE METAL LANTERN']").show(2)
# in Python
df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map"))\
.selectExpr("complex_map['WHITE METAL LANTERN']").show(2)
This gives us the following result:
+--------------------------------+
|complex_map[WHITE METAL LANTERN]|
+--------------------------------+
| null|
| 536365|
+--------------------------------+
You can also explode map types, which will turn them into columns:
// in Scala
df.select(map(col("Description"), col("InvoiceNo")).alias("complex_map"))
.selectExpr("explode(complex_map)").show(2)
# in Python
df.select(create_map(col("Description"), col("InvoiceNo")).alias("complex_map"))\
.selectExpr("explode(complex_map)").show(2)
This gives us the following result:
+--------------------+------+
| key| value|
+--------------------+------+
|WHITE HANGING HEA...|536365|
| WHITE METAL LANTERN|536365|
+--------------------+------+
Working with JSON
Spark has some unique support for working with JSON data. You can operate directly on strings of
JSON in Spark and parse from JSON or extract JSON objects. Let's begin by creating a JSON
column:
// in Scala
val jsonDF = spark.range(1).selectExpr("""
'{"myJSONKey" : {"myJSONValue" : [1, 2, 3]}}' as jsonString""")
# in Python
jsonDF = spark.range(1).selectExpr("""
'{"myJSONKey" : {"myJSONValue" : [1, 2, 3]}}' as jsonString""")
You can use get_json_object to query a JSON object inline, be it a dictionary or an array. You can
use json_tuple if the object has only one level of nesting:
// in Scala
import org.apache.spark.sql.functions.{get_json_object, json_tuple}
jsonDF.select(
get_json_object(col("jsonString"), "$.myJSONKey.myJSONValue[1]") as "column",
json_tuple(col("jsonString"), "myJSONKey")).show(2)
# in Python
from pyspark.sql.functions import get_json_object, json_tuple
jsonDF.select(
get_json_object(col("jsonString"), "$.myJSONKey.myJSONValue[1]").alias("column"),
json_tuple(col("jsonString"), "myJSONKey")).show(2)
Here's the equivalent using a SQL expression:
jsonDF.selectExpr(
"get_json_object(jsonString, '$.myJSONKey.myJSONValue[1]') as column").show(2)
This results in the following table:
+------+--------------------+
|column| c0|
+------+--------------------+
| 2|{"myJSONValue":[1...|
+------+--------------------+
You can also turn a StructType into a JSON string by using the to_json function:
// in Scala
import org.apache.spark.sql.functions.to_json
df.selectExpr("(InvoiceNo, Description) as myStruct")
.select(to_json(col("myStruct")))
# in Python
from pyspark.sql.functions import to_json
df.selectExpr("(InvoiceNo, Description) as myStruct")\
.select(to_json(col("myStruct")))
This function also accepts a dictionary (map) of parameters that are the same as the JSON data
source. You can use the from_json function to parse this (or other JSON data) back in. This naturally
requires you to specify a schema, and optionally you can specify a map of options, as well:
// in Scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
val parseSchema = new StructType(Array(
new StructField("InvoiceNo",StringType,true),
new StructField("Description",StringType,true)))
df.selectExpr("(InvoiceNo, Description) as myStruct")
.select(to_json(col("myStruct")).alias("newJSON"))
.select(from_json(col("newJSON"), parseSchema), col("newJSON")).show(2)
# in Python
from pyspark.sql.functions import from_json
from pyspark.sql.types import *
parseSchema = StructType((
StructField("InvoiceNo",StringType(),True),
StructField("Description",StringType(),True)))
df.selectExpr("(InvoiceNo, Description) as myStruct")\
.select(to_json(col("myStruct")).alias("newJSON"))\
.select(from_json(col("newJSON"), parseSchema), col("newJSON")).show(2)
This gives us the following result:
+----------------------+--------------------+
|jsontostructs(newJSON)| newJSON|
+----------------------+--------------------+
| [536365,WHITE HAN...|{"InvoiceNo":"536...|
| [536365,WHITE MET...|{"InvoiceNo":"536...|
+----------------------+--------------------+
User-Defined Functions
One of the most powerful things that you can do in Spark is define your own functions. These user-
defined functions (UDFs) make it possible for you to write your own custom transformations using
Python or Scala and even use external libraries. UDFs can take and return one or more columns as
input. Spark UDFs are incredibly powerful because you can write them in several different
programming languages; you do not need to create them in an esoteric format or domain-specific
language. They're just functions that operate on the data, record by record. By default, these functions
are registered as temporary functions to be used in that specific SparkSession or Context.
Although you can write UDFs in Scala, Python, or Java, there are performance considerations that you
should be aware of. To illustrate this, we're going to walk through exactly what happens when you
create a UDF, pass it into Spark, and then execute code using that UDF.
The first step is the actual function. We'll create a simple one for this example: a power3
function that takes a number and raises it to the power of three:
// in Scala
val udfExampleDF = spark.range(5).toDF("num")
def power3(number:Double):Double = number * number * number
power3(2.0)
# in Python
udfExampleDF = spark.range(5).toDF("num")
def power3(double_value):
  return double_value ** 3
power3(2.0)
In this trivial example, we can see that our functions work as expected. We are able to provide an
individual input and produce the expected result (with this simple test case). Thus far, our
expectations for the input are high: it must be a specific type and cannot be a null value (see "Working
with Nulls in Data").
Now that we’ve created these functions and tested them, we need to register them with Spark so that
we can use them on all of our worker machines. Spark will serialize the function on the driver and
transfer it over the network to all executor processes. This happens regardless of language.
When you use the function, there are essentially two different things that occur. If the function is
written in Scala or Java, you can use it within the Java Virtual Machine (JVM). This means that there
will be little performance penalty aside from the fact that you can't take advantage of code generation
capabilities that Spark has for built-in functions. There can be performance issues if you create or use
a lot of objects; we cover that in the section on optimization in Chapter 19.
If the function is written in Python, something quite different happens. Spark starts a Python process
on the worker, serializes all of the data to a format that Python can understand (remember, it was in
the JVM earlier), executes the function row by row on that data in the Python process, and then finally
returns the results of the row operations to the JVM and Spark. Figure 6-2 provides an overview of
the process.
Figure 6-2. Executing a Python UDF in Spark
WARNING
Starting this Python process is expensive, but the real cost is in serializing the data to Python. This is costly for two reasons:
it is an expensive computation, but also, after the data enters Python, Spark cannot manage the memory of the worker. This
means that you could potentially cause a worker to fail if it becomes resource constrained (because both the JVM and
Python are competing for memory on the same machine). We recommend that you write your UDFs in Scala or Java—the
small amount of time it should take you to write the function in Scala will always yield significant speed ups, and on top of
that, you can still use the function from Python!
Now that you have an understanding of the process, let's work through an example. First, we need to
register the function to make it available as a DataFrame function:
// in Scala
import org.apache.spark.sql.functions.udf
val power3udf = udf(power3(_:Double):Double)
We can use that just like any other DataFrame function:
// in Scala
udfExampleDF.select(power3udf(col("num"))).show()
The same applies to Python—first, we register it:
# in Python
from pyspark.sql.functions import udf
power3udf = udf(power3)
Then, we can use it in our DataFrame code:
# in Python
from pyspark.sql.functions import col
udfExampleDF.select(power3udf(col("num"))).show(2)
+-----------+
|power3(num)|
+-----------+
| 0|
| 1|
+-----------+
At this juncture, we can use this only as a DataFrame function. That is to say, we can't use it within a
string expression, only on an expression. However, we can also register this UDF as a Spark SQL
function. This is valuable because it makes it simple to use this function within SQL as well as across
languages.
Let's register the function in Scala:
// in Scala
spark.udf.register("power3", power3(_:Double):Double)
udfExampleDF.selectExpr("power3(num)").show(2)
Because this function is registered with Spark SQL—and we’ve learned that any Spark SQL function
or expression is valid to use as an expression when working with DataFrames—we can turn around
and use the UDF that we wrote in Scala, in Python. However, rather than using it as a DataFrame
function, we use it as a SQL expression:
# in Python
udfExampleDF.selectExpr("power3(num)").show(2)
# registered in Scala
We can also register our Python function to be available as a SQL function and use that in any
language, as well.
One thing we can also do to ensure that our functions work correctly is specify a return type.
As we saw at the beginning of this section, Spark manages its own type information, which does not
align exactly with Python's types. Therefore, it's a best practice to define the return type for your
function when you define it, even though Spark does not require it.
If you specify a type that doesn't align with the actual type returned by the function, Spark will not
throw an error but will simply return null to designate a failure. You can see this if you switch
the return type of the following function to be a DoubleType:
# in Python
from pyspark.sql.types import IntegerType, DoubleType
spark.udf.register("power3py", power3, DoubleType())
# in Python
udfExampleDF.selectExpr("power3py(num)").show(2)
# registered via Python
This is because range creates integers. When integers are operated on in Python, Python won't
convert them into floats (the corresponding type to Spark's double type), so we see null. We
can remedy this by ensuring that our Python function returns a float instead of an integer, and the
function will behave correctly.
Naturally, we can use either of these from SQL, too, after we register them:
-- in SQL
SELECT power3(12), power3py(12) -- doesn't work because of return type
When you want to optionally return a value from a UDF, you should return None in Python and an
Option type in Scala.
Hive UDFs
As a last note, you can also use UDF/UDAF creation via Hive syntax. To allow for this, you first
must enable Hive support when you create your SparkSession (via
SparkSession.builder().enableHiveSupport()). Then you can register UDFs in SQL. This is
only supported with precompiled Scala and Java packages, so you'll need to specify them as a
dependency:
-- in SQL
CREATE TEMPORARY FUNCTION myFunc AS 'com.organization.hive.udf.FunctionName'
Additionally, you can register this as a permanent function in the Hive Metastore by removing
TEMPORARY.
Conclusion
This chapter demonstrated how easy it is to extend Spark SQL to your own purposes and do so in a
way that is not some esoteric, domain-specific language but rather simple functions that are easy to
test and maintain without even using Spark! This is an amazingly powerful tool that you can use to
specify sophisticated business logic that can run on five rows on your local machine or on terabytes
of data on a 100-node cluster!
Chapter 7. Aggregations
Aggregating is the act of collecting something together and is a cornerstone of big data analytics. In an
aggregation, you will specify a key or grouping and an aggregation function that specifies how you
should transform one or more columns. This function must produce one result for each group, given
multiple input values. Spark's aggregation capabilities are sophisticated and mature, with a variety of
different use cases and possibilities. In general, you use aggregations to summarize numerical data
usually by means of some grouping. This might be a summation, a product, or simple counting. Also,
with Spark you can aggregate any kind of value into an array, list, or map, as we will see in
“Aggregating to Complex Types.”
In addition to working with any type of values, Spark also allows us to create the following grouping
types:
The simplest grouping is to just summarize a complete DataFrame by performing an
aggregation in a select statement.
A “group by” allows you to specify one or more keys as well as one or more aggregation
functions to transform the value columns.
A “window” gives you the ability to specify one or more keys as well as one or more
aggregation functions to transform the value columns. However, the rows input to the
function are somehow related to the current row.
A “grouping set,” which you can use to aggregate at multiple different levels. Grouping sets
are available as a primitive in SQL and via rollups and cubes in DataFrames.
A “rollup” makes it possible for you to specify one or more keys as well as one or more
aggregation functions to transform the value columns, which will be summarized
hierarchically.
A “cube” allows you to specify one or more keys as well as one or more aggregation
functions to transform the value columns, which will be summarized across all combinations
of columns.
Each grouping returns a RelationalGroupedDataset on which we specify our aggregations.
NOTE
An important thing to consider is how exact you need an answer to be. When performing calculations over big data, it can
be quite expensive to get an exact answer to a question, and it's often much cheaper to simply request an approximation to a
reasonable degree of accuracy. You'll note that we mention some approximation functions throughout the book and
oftentimes this is a good opportunity to improve the speed and execution of your Spark jobs, especially for interactive and ad
hoc analysis.
Let's begin by reading in our data on purchases, repartitioning the data to have far fewer partitions
(because we know it's a small volume of data stored in a lot of small files), and caching the results
for rapid access:
// in Scala
val df = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/data/retail-data/all/*.csv")
.coalesce(5)
df.cache()
df.createOrReplaceTempView("dfTable")
# in Python
df = spark.read.format("csv")\
.option("header", "true")\
.option("inferSchema", "true")\
.load("/data/retail-data/all/*.csv")\
.coalesce(5)
df.cache()
df.createOrReplaceTempView("dfTable")
Here’s a sample of the data so that you can reference the output of some of the functions:
+---------+---------+--------------------+--------+--------------+---------+-----
|InvoiceNo|StockCode| Description|Quantity| InvoiceDate|UnitPrice|Cu...
+---------+---------+--------------------+--------+--------------+---------+-----
| 536365| 85123A|WHITE HANGING... | 6|12/1/2010 8:26| 2.55| ...
| 536365| 71053|WHITE METAL... | 6|12/1/2010 8:26| 3.39| ...
...
| 536367| 21755|LOVE BUILDING BLO...| 3|12/1/2010 8:34| 5.95| ...
| 536367| 21777|RECIPE BOX WITH M...| 4|12/1/2010 8:34| 7.95| ...
+---------+---------+--------------------+--------+--------------+---------+-----
As mentioned, basic aggregations apply to an entire DataFrame. The simplest example is the count
method:
df.count() == 541909
If you've been reading this book chapter by chapter, you know that count is actually an action as
opposed to a transformation, and so it returns immediately. You can use count to get an idea of the
total size of your dataset, but another common pattern is to use it to cache an entire DataFrame in
memory, just like we did in this example.
Now, this method is a bit of an outlier because it exists as a method (in this case) as opposed to a
function and is eagerly evaluated instead of being a lazy transformation. In the next section, we will see
count used as a lazy function, as well.
Aggregation Functions
All aggregations are available as functions, in addition to the special cases that can appear on
DataFrames or via .stat, like we saw in Chapter 6. You can find most aggregation functions in the
org.apache.spark.sql.functions package.
NOTE
There are some gaps between the available SQL functions and the functions that we can import in Scala and Python. This
changes every release, so it’s impossible to include a definitive list. This section covers the most common functions.
count
The first function worth going over is count, except in this example it will perform as a
transformation instead of an action. In this case, we can do one of two things: specify a specific
column to count, or count all of the columns by using count(*) or count(1), which counts every
row as the literal one, as shown in this example:
// in Scala
import org.apache.spark.sql.functions.count
df.select(count("StockCode")).show() // 541909
# in Python
from pyspark.sql.functions import count
df.select(count("StockCode")).show() # 541909
-- in SQL
SELECT COUNT(*) FROM dfTable
WARNING
There are a number of gotchas when it comes to null values and counting. For instance, when performing a count(*),
Spark will count null values (including rows containing all nulls). However, when counting an individual column, Spark will
not count the null values.
countDistinct
Sometimes, the total number is not relevant; rather, it’s the number of unique groups that you want. To
get this number, you can use the countDistinct function. This is a bit more relevant for individual
columns:
// in Scala
import org.apache.spark.sql.functions.countDistinct
df.select(countDistinct("StockCode")).show() // 4070
# in Python
from pyspark.sql.functions import countDistinct
df.select(countDistinct("StockCode")).show() # 4070
-- in SQL
SELECT COUNT(DISTINCT StockCode) FROM dfTable
approx_count_distinct
Often, we find ourselves working with large datasets and the exact distinct count is irrelevant. There
are times when an approximation to a certain degree of accuracy will work just fine, and for that, you
can use the approx_count_distinct function:
// in Scala
import org.apache.spark.sql.functions.approx_count_distinct
df.select(approx_count_distinct("StockCode", 0.1)).show() // 3364
# in Python
from pyspark.sql.functions import approx_count_distinct
df.select(approx_count_distinct("StockCode", 0.1)).show() # 3364
-- in SQL
SELECT approx_count_distinct(StockCode, 0.1) FROM dfTable
You will notice that approx_count_distinct took another parameter with which you can specify
the maximum estimation error allowed. In this case, we specified a rather large error and thus receive
an answer that is quite far off but does complete more quickly than countDistinct. You will see
much greater performance gains with larger datasets.
first and last
You can get the first and last values from a DataFrame by using these two obviously named functions.
This will be based on the rows in the DataFrame, not on the values in the DataFrame:
// in Scala
import org.apache.spark.sql.functions.{first, last}
df.select(first("StockCode"), last("StockCode")).show()
# in Python
from pyspark.sql.functions import first, last
df.select(first("StockCode"), last("StockCode")).show()
-- in SQL
SELECT first(StockCode), last(StockCode) FROM dfTable
+-----------------------+----------------------+
|first(StockCode, false)|last(StockCode, false)|
+-----------------------+----------------------+
| 85123A| 22138|
+-----------------------+----------------------+
min and max
To extract the minimum and maximum values from a DataFrame, use the min and max functions:
// in Scala
import org.apache.spark.sql.functions.{min, max}
df.select(min("Quantity"), max("Quantity")).show()
# in Python
from pyspark.sql.functions import min, max
df.select(min("Quantity"), max("Quantity")).show()
-- in SQL
SELECT min(Quantity), max(Quantity) FROM dfTable
+-------------+-------------+
|min(Quantity)|max(Quantity)|
+-------------+-------------+
| -80995| 80995|
+-------------+-------------+
sum
Another simple task is to add all the values in a column using the sum function:
// in Scala
import org.apache.spark.sql.functions.sum
df.select(sum("Quantity")).show() // 5176450
# in Python
from pyspark.sql.functions import sum
df.select(sum("Quantity")).show() # 5176450
-- in SQL
SELECT sum(Quantity) FROM dfTable
sumDistinct
In addition to summing a total, you also can sum a distinct set of values by using the sumDistinct
function:
// in Scala
import org.apache.spark.sql.functions.sumDistinct
df.select(sumDistinct("Quantity")).show() // 29310
# in Python
from pyspark.sql.functions import sumDistinct
df.select(sumDistinct("Quantity")).show() # 29310
-- in SQL
SELECT SUM(DISTINCT Quantity) FROM dfTable -- 29310
avg
Although you can calculate the average by dividing sum by count, Spark provides an easier way to get
that value via the avg or mean functions. In this example, we use alias in order to more easily reuse
these columns later:
// in Scala
import org.apache.spark.sql.functions.{sum, count, avg, expr}
df.select(
count("Quantity").alias("total_transactions"),
sum("Quantity").alias("total_purchases"),
avg("Quantity").alias("avg_purchases"),
expr("mean(Quantity)").alias("mean_purchases"))
.selectExpr(
"total_purchases/total_transactions",
"avg_purchases",
"mean_purchases").show()
# in Python
from pyspark.sql.functions import sum, count, avg, expr
df.select(
count("Quantity").alias("total_transactions"),
sum("Quantity").alias("total_purchases"),
avg("Quantity").alias("avg_purchases"),
expr("mean(Quantity)").alias("mean_purchases"))\
.selectExpr(
"total_purchases/total_transactions",
"avg_purchases",
"mean_purchases").show()
+--------------------------------------+----------------+----------------+
|(total_purchases / total_transactions)| avg_purchases| mean_purchases|
+--------------------------------------+----------------+----------------+
| 9.55224954743324|9.55224954743324|9.55224954743324|
+--------------------------------------+----------------+----------------+
NOTE
You can also average all the distinct values by specifying distinct. In fact, most aggregate functions support doing so only on
distinct values.
Variance and Standard Deviation
Calculating the mean naturally brings up questions about the variance and standard deviation. These
are both measures of the spread of the data around the mean. The variance is the average of the
squared differences from the mean, and the standard deviation is the square root of the variance. You
can calculate these in Spark by using their respective functions. However, something to note is that
Spark has both the formula for the sample standard deviation as well as the formula for the population
standard deviation. These are fundamentally different statistical formulae, and we need to
differentiate between them. By default, Spark uses the sample formula if you call the variance or
stddev functions.
You can also specify these explicitly or refer to the population standard deviation or variance:
// in Scala
import org.apache.spark.sql.functions.{var_pop, stddev_pop}
import org.apache.spark.sql.functions.{var_samp, stddev_samp}
df.select(var_pop("Quantity"), var_samp("Quantity"),
stddev_pop("Quantity"), stddev_samp("Quantity")).show()
# in Python
from pyspark.sql.functions import var_pop, stddev_pop
from pyspark.sql.functions import var_samp, stddev_samp
df.select(var_pop("Quantity"), var_samp("Quantity"),
stddev_pop("Quantity"), stddev_samp("Quantity")).show()
-- in SQL
SELECT var_pop(Quantity), var_samp(Quantity),
stddev_pop(Quantity), stddev_samp(Quantity)
FROM dfTable
+------------------+------------------+--------------------+-------------------+
| var_pop(Quantity)|var_samp(Quantity)|stddev_pop(Quantity)|stddev_samp(Quan...|
+------------------+------------------+--------------------+-------------------+
|47559.303646609056|47559.391409298754| 218.08095663447796| 218.081157850...|
+------------------+------------------+--------------------+-------------------+
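If the distinction is still unclear, the two formulas differ only in the divisor. Here is a minimal pure-Python sketch (the helper function and sample data are made up for illustration; this is not Spark code):

```python
import math

def variances(xs):
    """Return (population variance, sample variance) for a list of numbers."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations from the mean
    var_pop = ss / n                       # population: divide by n
    var_samp = ss / (n - 1)                # sample: divide by n - 1 (Bessel's correction)
    return var_pop, var_samp

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
vp, vs = variances(xs)
print(vp, vs)          # 4.0 and 32/7 ≈ 4.571
print(math.sqrt(vp))   # population standard deviation: 2.0
```

With a large number of rows the two results converge, which is why the difference in the Spark output above only shows up several digits in.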
Skewness and Kurtosis
Skewness and kurtosis are both measurements of extreme points in your data. Skewness measures the
asymmetry of the values in your data around the mean, whereas kurtosis is a measure of the tails of the
data. Both are relevant specifically when modeling your data as a probability distribution of a
random variable. Although we won't go into the math behind them here, you can look up the
definitions quite easily on the internet. You can calculate them by using the functions:
// in Scala
import org.apache.spark.sql.functions.{skewness, kurtosis}
df.select(skewness("Quantity"), kurtosis("Quantity")).show()
# in Python
from pyspark.sql.functions import skewness, kurtosis
df.select(skewness("Quantity"), kurtosis("Quantity")).show()
-- in SQL
SELECT skewness(Quantity), kurtosis(Quantity) FROM dfTable
+-------------------+------------------+
| skewness(Quantity)|kurtosis(Quantity)|
+-------------------+------------------+
|-0.2640755761052562|119768.05495536952|
+-------------------+------------------+
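For reference, a common definition of each (the standardized third moment for skewness, and excess kurtosis for the tails) can be sketched in pure Python. This is illustrative only and may differ in small details from Spark's implementation:

```python
def skewness(xs):
    """Third standardized moment: asymmetry of the data around the mean."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def kurtosis(xs):
    """Excess kurtosis: tail heaviness relative to a normal distribution."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3

sym = [1, 2, 3, 4, 5]
print(skewness(sym))   # 0.0 for a symmetric sample
print(kurtosis(sym))   # negative: flatter tails than a normal
```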
Covariance and Correlation
We discussed single column aggregations, but some functions compare the interactions of the values
in two difference columns together. Two of these functions are cov and corr, for covariance and
correlation, respectively. Correlation measures the Pearson correlation coefficient, which is scaled
between –1 and +1. The covariance is scaled according to the inputs in the data.
Like the var function, covariance can be calculated either as the sample covariance or the population
covariance. Therefore it can be important to specify which formula you want to use. Correlation has
no notion of this and therefore does not have calculations for population or sample. Here’s how they
work:
// in Scala
import org.apache.spark.sql.functions.{corr, covar_pop, covar_samp}
df.select(corr("InvoiceNo", "Quantity"), covar_samp("InvoiceNo", "Quantity"),
covar_pop("InvoiceNo", "Quantity")).show()
# in Python
from pyspark.sql.functions import corr, covar_pop, covar_samp
df.select(corr("InvoiceNo", "Quantity"), covar_samp("InvoiceNo", "Quantity"),
covar_pop("InvoiceNo", "Quantity")).show()
-- in SQL
SELECT corr(InvoiceNo, Quantity), covar_samp(InvoiceNo, Quantity),
covar_pop(InvoiceNo, Quantity)
FROM dfTable
+-------------------------+-------------------------------+---------------------+
|corr(InvoiceNo, Quantity)|covar_samp(InvoiceNo, Quantity)|covar_pop(InvoiceN...|
+-------------------------+-------------------------------+---------------------+
| 4.912186085635685E-4| 1052.7280543902734| 1052.7...|
+-------------------------+-------------------------------+---------------------+
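As with variance, the sample and population covariance differ only in the divisor, and the Pearson correlation is just the covariance normalized by the two standard deviations, which is what pins it between -1 and +1. A pure-Python sketch of both (illustrative helpers and data, not Spark code):

```python
import math

def covariances(xs, ys):
    """Return (population covariance, sample covariance)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    s = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return s / n, s / (n - 1)

def pearson_corr(xs, ys):
    """Pearson correlation: covariance scaled into [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(covariances(xs, ys))     # population and sample covariance
print(pearson_corr(xs, ys))    # 1.0: ys is an exact linear function of xs
```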
Aggregating to Complex Types
In Spark, you can perform aggregations not just of numerical values using formulas; you can also
perform them on complex types. For example, you can collect a list of the values present in a given
column, or only the unique values by collecting to a set.
You can use the result for more programmatic access later in the pipeline, or pass the
entire collection to a user-defined function (UDF):
// in Scala
import org.apache.spark.sql.functions.{collect_set, collect_list}
df.agg(collect_set("Country"), collect_list("Country")).show()
# in Python
from pyspark.sql.functions import collect_set, collect_list
df.agg(collect_set("Country"), collect_list("Country")).show()
-- in SQL
SELECT collect_set(Country), collect_list(Country) FROM dfTable
+--------------------+---------------------+
|collect_set(Country)|collect_list(Country)|
+--------------------+---------------------+
|[Portugal, Italy,...| [United Kingdom, ...|
+--------------------+---------------------+
Grouping
Thus far, we have performed only DataFrame-level aggregations. A more common task is to perform
calculations based on groups in the data. This is typically done on categorical data for which we
group our data on one column and perform some calculations on the other columns that end up in that
group.
The best way to explain this is to begin performing some groupings. The first will be a count, just as
we did before. We will group by each unique invoice number and get the count of items on that
invoice. Note that this returns another DataFrame and is lazily performed.
We do this grouping in two phases. First we specify the column(s) on which we would like to group,
and then we specify the aggregation(s). The first step returns a RelationalGroupedDataset, and the
second step returns a DataFrame.
As mentioned, we can specify any number of columns on which we want to group:
df.groupBy("InvoiceNo", "CustomerId").count().show()
-- in SQL
SELECT InvoiceNo, CustomerId, count(*) FROM dfTable GROUP BY InvoiceNo, CustomerId
+---------+----------+-----+
|InvoiceNo|CustomerId|count|
+---------+----------+-----+
| 536846| 14573| 76|
...
| C544318| 12989| 1|
+---------+----------+-----+
Grouping with Expressions
As we saw earlier, counting is a bit of a special case because it exists as a method. Usually, though,
we prefer to use the count function. Rather than passing that function as an expression into a select
statement, we specify it within agg. This makes it possible for you to pass in arbitrary expressions
that just need to have some aggregation specified. You can even do things like alias a column after
transforming it for later use in your data flow:
// in Scala
import org.apache.spark.sql.functions.{count, expr}
df.groupBy("InvoiceNo").agg(
count("Quantity").alias("quan"),
expr("count(Quantity)")).show()
# in Python
from pyspark.sql.functions import count, expr
df.groupBy("InvoiceNo").agg(
count("Quantity").alias("quan"),
expr("count(Quantity)")).show()
+---------+----+---------------+
|InvoiceNo|quan|count(Quantity)|
+---------+----+---------------+
| 536596| 6| 6|
...
| C542604| 8| 8|
+---------+----+---------------+
Grouping with Maps
Sometimes, it can be easier to specify your transformations as a series of Maps for which the key is
the column, and the value is the aggregation function (as a string) that you would like to perform. You
can reuse multiple column names if you specify them inline, as well:
// in Scala
df.groupBy("InvoiceNo").agg("Quantity"->"avg", "Quantity"->"stddev_pop").show()
# in Python
df.groupBy("InvoiceNo").agg(expr("avg(Quantity)"),expr("stddev_pop(Quantity)"))\
.show()
-- in SQL
SELECT avg(Quantity), stddev_pop(Quantity), InvoiceNo FROM dfTable
GROUP BY InvoiceNo
+---------+------------------+--------------------+
|InvoiceNo| avg(Quantity)|stddev_pop(Quantity)|
+---------+------------------+--------------------+
| 536596| 1.5| 1.1180339887498947|
...
| C542604| -8.0| 15.173990905493518|
+---------+------------------+--------------------+
Window Functions
You can also use window functions to carry out some unique aggregations by computing some
aggregation on a specific "window" of data, which you define by using a reference to the current row.
This window specification determines which rows will be passed in to this function. Now this is a
bit abstract and probably similar to a standard group-by, so let's differentiate them a bit more.
A group-by takes data, and every row can go only into one grouping. A window function calculates a
return value for every input row of a table based on a group of rows, called a frame. Each row can
fall into one or more frames. A common use case is to take a look at a rolling average of some value
for which each row represents one day. If you were to do this, each row would end up in seven
different frames. We cover defining frames a little later, but for your reference, Spark supports three
kinds of window functions: ranking functions, analytic functions, and aggregate functions.
Figure 7-1 illustrates how a given row can fall into multiple frames.
Figure 7-1. Visualizing window functions
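The rolling-average example can be sketched in plain Python: for a frame covering the current row and the rows before it, each value contributes to several consecutive frames. This is an illustration of frame membership only, not how Spark computes windows:

```python
def rolling_mean(values, window):
    """Rolling average over a frame of the current row and the
    (window - 1) rows before it. Each row contributes to up to
    `window` different frames, one per position it can occupy."""
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        out.append(sum(frame) / len(frame))
    return out

print(rolling_mean([10, 20, 30, 40], window=3))
# [10.0, 15.0, 20.0, 30.0]
```

With window=7 (one row per day), every fully interior row participates in exactly seven frames, matching the rolling-average example above.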
To demonstrate, we will add a date column that converts our invoice date into a column containing
only date information (not time information, too):
// in Scala
import org.apache.spark.sql.functions.{col, to_date}
val dfWithDate = df.withColumn("date", to_date(col("InvoiceDate"),
"MM/d/yyyy H:mm"))
dfWithDate.createOrReplaceTempView("dfWithDate")
# in Python
from pyspark.sql.functions import col, to_date
dfWithDate = df.withColumn("date", to_date(col("InvoiceDate"), "MM/d/yyyy H:mm"))
dfWithDate.createOrReplaceTempView("dfWithDate")
The first step to a window function is to create a window specification. Note that the partition by
is unrelated to the partitioning scheme concept that we have covered thus far. It's just a similar
concept that describes how we will be breaking up our group. The ordering determines the ordering
within a given partition, and, finally, the frame specification (the rowsBetween statement) states
which rows will be included in the frame based on its reference to the current input row. In the
following example, we look at all previous rows up to the current row:
// in Scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col
val windowSpec = Window
.partitionBy("CustomerId", "date")
.orderBy(col("Quantity").desc)
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
# in Python
from pyspark.sql.window import Window
from pyspark.sql.functions import desc
windowSpec = Window\
.partitionBy("CustomerId", "date")\
.orderBy(desc("Quantity"))\
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
Now we want to use an aggregation function to learn more about each specific customer. An example
might be establishing the maximum purchase quantity over all time. To answer this, we use the same
aggregation functions that we saw earlier by passing a column name or expression. In addition, we
indicate the window specification that defines to which frames of data this function will apply:
// in Scala
import org.apache.spark.sql.functions.max
val maxPurchaseQuantity = max(col("Quantity")).over(windowSpec)
# in Python
from pyspark.sql.functions import max
maxPurchaseQuantity = max(col("Quantity")).over(windowSpec)
You will notice that this returns a column (or, rather, an expression). We can now use this in a DataFrame
select statement. Before doing so, though, we will create the purchase quantity rank. To do that we use
the dense_rank function to determine which date had the maximum purchase quantity for every
customer. We use dense_rank as opposed to rank to avoid gaps in the ranking sequence when there
are tied values (or, in our case, duplicate rows):
// in Scala
import org.apache.spark.sql.functions.{dense_rank, rank}
val purchaseDenseRank = dense_rank().over(windowSpec)
val purchaseRank = rank().over(windowSpec)
# in Python
from pyspark.sql.functions import dense_rank, rank
purchaseDenseRank = dense_rank().over(windowSpec)
purchaseRank = rank().over(windowSpec)
This also returns a column that we can use in select statements. Now we can perform a select to
view the calculated window values:
// in Scala
import org.apache.spark.sql.functions.col
dfWithDate.where("CustomerId IS NOT NULL").orderBy("CustomerId")
.select(
col("CustomerId"),
col("date"),
col("Quantity"),
purchaseRank.alias("quantityRank"),
purchaseDenseRank.alias("quantityDenseRank"),
maxPurchaseQuantity.alias("maxPurchaseQuantity")).show()
# in Python
from pyspark.sql.functions import col
dfWithDate.where("CustomerId IS NOT NULL").orderBy("CustomerId")\
.select(
col("CustomerId"),
col("date"),
col("Quantity"),
purchaseRank.alias("quantityRank"),
purchaseDenseRank.alias("quantityDenseRank"),
maxPurchaseQuantity.alias("maxPurchaseQuantity")).show()
-- in SQL
SELECT CustomerId, date, Quantity,
rank(Quantity) OVER (PARTITION BY CustomerId, date
ORDER BY Quantity DESC NULLS LAST
ROWS BETWEEN
UNBOUNDED PRECEDING AND
CURRENT ROW) as rank,
dense_rank(Quantity) OVER (PARTITION BY CustomerId, date
ORDER BY Quantity DESC NULLS LAST
ROWS BETWEEN
UNBOUNDED PRECEDING AND
CURRENT ROW) as dRank,
max(Quantity) OVER (PARTITION BY CustomerId, date
ORDER BY Quantity DESC NULLS LAST
ROWS BETWEEN
UNBOUNDED PRECEDING AND
CURRENT ROW) as maxPurchase
FROM dfWithDate WHERE CustomerId IS NOT NULL ORDER BY CustomerId
+----------+----------+--------+------------+-----------------+---------------+
|CustomerId| date|Quantity|quantityRank|quantityDenseRank|maxP...Quantity|
+----------+----------+--------+------------+-----------------+---------------+
| 12346|2011-01-18| 74215| 1| 1| 74215|
| 12346|2011-01-18| -74215| 2| 2| 74215|
| 12347|2010-12-07| 36| 1| 1| 36|
| 12347|2010-12-07| 30| 2| 2| 36|
...
| 12347|2010-12-07| 12| 4| 4| 36|
| 12347|2010-12-07| 6| 17| 5| 36|
| 12347|2010-12-07| 6| 17| 5| 36|
+----------+----------+--------+------------+-----------------+---------------+
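The difference between rank and dense_rank visible in the output (17 versus 5 for the tied rows) is easiest to see in a plain-Python sketch over one already-sorted partition (illustrative only):

```python
def ranks(sorted_vals):
    """Return (rank, dense_rank) lists for values already sorted in
    the window's order. rank skips positions after ties, so the next
    distinct value gets its row position; dense_rank never skips."""
    rank, dense = [], []
    for i, v in enumerate(sorted_vals):
        if i > 0 and v == sorted_vals[i - 1]:
            rank.append(rank[-1])      # tie: repeat previous rank
            dense.append(dense[-1])
        else:
            rank.append(i + 1)                           # position-based
            dense.append(dense[-1] + 1 if dense else 1)  # counts distinct values
    return rank, dense

r, d = ranks([36, 30, 30, 12])
print(r)  # [1, 2, 2, 4]  <- gap after the tie
print(d)  # [1, 2, 2, 3]  <- no gap
```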
Grouping Sets
Thus far in this chapter, we've seen simple group-by expressions that we can use to aggregate on a set
of columns with the values in those columns. However, sometimes we want something a bit more
complete: an aggregation across multiple groups. We achieve this by using grouping sets. Grouping
sets are a low-level tool for combining sets of aggregations together. They give you the ability to
create arbitrary aggregations in your group-by statements.
Let's work through an example to gain a better understanding. Here, we would like to get the total
quantity of all stock codes and customers. To do so, we'll use the following SQL expression:
// in Scala
val dfNoNull = dfWithDate.na.drop()
dfNoNull.createOrReplaceTempView("dfNoNull")
# in Python
dfNoNull = dfWithDate.na.drop()
dfNoNull.createOrReplaceTempView("dfNoNull")
-- in SQL
SELECT CustomerId, stockCode, sum(Quantity) FROM dfNoNull
GROUP BY customerId, stockCode
ORDER BY CustomerId DESC, stockCode DESC
+----------+---------+-------------+
|CustomerId|stockCode|sum(Quantity)|
+----------+---------+-------------+
| 18287| 85173| 48|
| 18287| 85040A| 48|
| 18287| 85039B| 120|
...
| 18287| 23269| 36|
+----------+---------+-------------+
You can do the exact same thing by using a grouping set:
-- in SQL
SELECT CustomerId, stockCode, sum(Quantity) FROM dfNoNull
GROUP BY customerId, stockCode GROUPING SETS((customerId, stockCode))
ORDER BY CustomerId DESC, stockCode DESC
+----------+---------+-------------+
|CustomerId|stockCode|sum(Quantity)|
+----------+---------+-------------+
| 18287| 85173| 48|
| 18287| 85040A| 48|
| 18287| 85039B| 120|
...
| 18287| 23269| 36|
+----------+---------+-------------+
WARNING
Grouping sets depend on null values for aggregation levels. If you do not filter out null values, you will get incorrect results.
This applies to cubes, rollups, and grouping sets.
Simple enough, but what if you also want to include the total number of items, regardless of customer
or stock code? With a conventional group-by statement, this would be impossible. But, it’s simple
with grouping sets: we simply specify that we would like to aggregate at that level, as well, in our
grouping set. This is, effectively, the union of several different groupings together:
-- in SQL
SELECT CustomerId, stockCode, sum(Quantity) FROM dfNoNull
GROUP BY customerId, stockCode GROUPING SETS((customerId, stockCode),())
ORDER BY CustomerId DESC, stockCode DESC
+----------+---------+-------------+
|customerId|stockCode|sum(Quantity)|
+----------+---------+-------------+
| 18287| 85173| 48|
| 18287| 85040A| 48|
| 18287| 85039B| 120|
...
| 18287| 23269| 36|
+----------+---------+-------------+
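Conceptually, a grouping set query is just a union of ordinary group-bys, one per grouping set, with the aggregated-away keys shown as null. A pure-Python sketch of that semantics (the helper and data are made up for illustration):

```python
from collections import defaultdict

def grouping_sets(rows, keys, sets):
    """Emulate GROUP BY ... GROUPING SETS: run one group-by per
    grouping set, replacing each unused key with None (SQL null),
    and union the results into one output."""
    out = defaultdict(int)
    for gset in sets:
        for row in rows:
            key = tuple(row[k] if k in gset else None for k in keys)
            out[key] += row["qty"]
    return dict(out)

rows = [
    {"cust": "a", "stock": "x", "qty": 2},
    {"cust": "a", "stock": "y", "qty": 3},
    {"cust": "b", "stock": "x", "qty": 5},
]
# per (cust, stock) combination, plus a () grand-total level
print(grouping_sets(rows, ["cust", "stock"],
                    [("cust", "stock"), ()]))
# {('a', 'x'): 2, ('a', 'y'): 3, ('b', 'x'): 5, (None, None): 10}
```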
The GROUPING SETS operator is only available in SQL. To perform the same in DataFrames, you use
the rollup and cube operators—which allow us to get the same results. Let’s go through those.
Rollups
Thus far, we've been looking at explicit groupings. When we set our grouping keys of multiple
columns, Spark looks at those as well as the actual combinations that are visible in the dataset. A
rollup is a multidimensional aggregation that performs a variety of group-by style calculations for us.
Let's create a rollup that looks across time (with our new Date column) and space (with the Country
column) and creates a new DataFrame that includes the grand total over all dates, the grand total for
each date in the DataFrame, and the subtotal for each country on each date in the DataFrame:
// in Scala
import org.apache.spark.sql.functions.sum
val rolledUpDF = dfNoNull.rollup("Date", "Country").agg(sum("Quantity"))
.selectExpr("Date", "Country", "`sum(Quantity)` as total_quantity")
.orderBy("Date")
rolledUpDF.show()
# in Python
from pyspark.sql.functions import sum
rolledUpDF = dfNoNull.rollup("Date", "Country").agg(sum("Quantity"))\
.selectExpr("Date", "Country", "`sum(Quantity)` as total_quantity")\
.orderBy("Date")
rolledUpDF.show()
+----------+--------------+--------------+
| Date| Country|total_quantity|
+----------+--------------+--------------+
| null| null| 5176450|
|2010-12-01|United Kingdom| 23949|
|2010-12-01| Germany| 117|
|2010-12-01| France| 449|
...
|2010-12-03| France| 239|
|2010-12-03| Italy| 164|
|2010-12-03| Belgium| 528|
+----------+--------------+--------------+
Now where you see the null values is where you'll find the grand totals. A null in both rollup
columns specifies the grand total across both of those columns:
rolledUpDF.where("Country IS NULL").show()
rolledUpDF.where("Date IS NULL").show()
+----+-------+--------------+
|Date|Country|total_quantity|
+----+-------+--------------+
|null| null| 5176450|
+----+-------+--------------+
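Conceptually, a rollup over n columns is shorthand for n + 1 hierarchical grouping sets, from all columns down to the empty (grand-total) set. A small illustrative sketch:

```python
def rollup_sets(keys):
    """A rollup over keys k1..kn is the grouping sets
    (k1..kn), (k1..kn-1), ..., (k1,), (): each level drops the
    trailing key, which is why rollup treats keys hierarchically."""
    return [tuple(keys[:i]) for i in range(len(keys), -1, -1)]

print(rollup_sets(["Date", "Country"]))
# [('Date', 'Country'), ('Date',), ()]
```

The `('Date',)` level produces the per-date totals with Country null, and `()` produces the single grand-total row seen above.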
Cube
A cube takes the rollup to a level deeper. Rather than treating elements hierarchically, a cube does the
same thing across all dimensions. This means that it won't just compute totals by date over the entire
time period, but also by country. To pose this as a question again, can you make a table that includes the
following?
The total across all dates and countries
The total for each date across all countries
The total for each country on each date
The total for each country across all dates
The method call is quite similar, but instead of calling rollup, we call cube:
// in Scala
dfNoNull.cube("Date", "Country").agg(sum(col("Quantity")))
.select("Date", "Country", "sum(Quantity)").orderBy("Date").show()
# in Python
from pyspark.sql.functions import sum
dfNoNull.cube("Date", "Country").agg(sum(col("Quantity")))\
.select("Date", "Country", "sum(Quantity)").orderBy("Date").show()
+----+--------------------+-------------+
|Date| Country|sum(Quantity)|
+----+--------------------+-------------+
|null| Japan| 25218|
|null| Portugal| 16180|
|null| Unspecified| 3300|
|null| null| 5176450|
|null| Australia| 83653|
...
|null| Norway| 19247|
|null| Hong Kong| 4769|
|null| Spain| 26824|
|null| Czech Republic| 592|
+----+--------------------+-------------+
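Conceptually, whereas a rollup uses only hierarchical prefixes of the keys, a cube expands to every subset of the grouping keys, which is exactly the four-bullet list above. A small illustrative sketch:

```python
from itertools import combinations

def cube_sets(keys):
    """A cube over keys is the grouping sets for every subset of the
    keys, not just the hierarchical prefixes a rollup uses."""
    return [tuple(c) for r in range(len(keys), -1, -1)
            for c in combinations(keys, r)]

print(cube_sets(["Date", "Country"]))
# [('Date', 'Country'), ('Date',), ('Country',), ()]
```

The extra `('Country',)` level, absent from the rollup, is what produces the per-country rows with a null Date in the output above.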
This is a quick and easily accessible summary of nearly all of the information in our table, and it's a
great way to create a quick summary table that others can use later on.
Grouping Metadata
Sometimes when using cubes and rollups, you want to be able to query the aggregation levels so that
you can easily filter them down accordingly. We can do this by using the grouping_id, which gives
us a column specifying the level of aggregation that we have in our result set. The query in the
example that follows returns four distinct grouping IDs:
Table 7-1. Purpose of grouping IDs
Grouping ID | Description
3 | Appears for the highest-level aggregation, which gives us the total quantity regardless of customerId and stockCode.
2 | Appears for all aggregations of individual stock codes. This gives us the total quantity per stock code, regardless of customer.
1 | Gives us the total quantity on a per-customer basis, regardless of item purchased.
0 | Gives us the total quantity for individual customerId and stockCode combinations.
This is a bit abstract, so it's well worth trying out to understand the behavior yourself:
// in Scala
import org.apache.spark.sql.functions.{grouping_id, sum, expr}
dfNoNull.cube("customerId", "stockCode").agg(grouping_id(), sum("Quantity"))
.orderBy(expr("grouping_id()").desc)
.show()
+----------+---------+-------------+-------------+
|customerId|stockCode|grouping_id()|sum(Quantity)|
+----------+---------+-------------+-------------+
| null| null| 3| 5176450|
| null| 23217| 2| 1309|
| null| 90059E| 2| 19|
...
+----------+---------+-------------+-------------+
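One way to think about these IDs is as a bitmask over the grouping keys, with a key's bit set when that key has been aggregated away (and so appears as null in the output). The sketch below reproduces the values in Table 7-1; it is an illustration of the pattern, not Spark's source:

```python
def grouping_id(keys, gset):
    """Bitmask over the grouping keys: a key's bit is 1 when the key
    is aggregated away, 0 when it is part of the grouping set.
    The first key is the most significant bit."""
    gid = 0
    for k in keys:
        gid = (gid << 1) | (0 if k in gset else 1)
    return gid

keys = ["customerId", "stockCode"]
print(grouping_id(keys, ()))                           # 3: grand total
print(grouping_id(keys, ("stockCode",)))               # 2: per stock code
print(grouping_id(keys, ("customerId",)))              # 1: per customer
print(grouping_id(keys, ("customerId", "stockCode")))  # 0: both keys kept
```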
Pivot
Pivots make it possible for you to convert a row into a column. For example, in our current data we
have a Country column. With a pivot, we can aggregate according to some function for each of those
given countries and display them in an easy-to-query way:
// in Scala
val pivoted = dfWithDate.groupBy("date").pivot("Country").sum()
# in Python
pivoted = dfWithDate.groupBy("date").pivot("Country").sum()
This DataFrame will now have a column for every combination of country and numeric variable, as
well as a column specifying the date. For example, for USA we have the columns
USA_sum(Quantity), USA_sum(UnitPrice), and USA_sum(CustomerID): one for each
numeric column in our dataset (because we just performed an aggregation over all of them).
Here’s an example query and result from this data:
pivoted.where("date > '2011-12-05'").select("date", "`USA_sum(Quantity)`").show()
+----------+-----------------+
| date|USA_sum(Quantity)|
+----------+-----------------+
|2011-12-06| null|
|2011-12-09| null|
|2011-12-08| -196|
|2011-12-07| null|
+----------+-----------------+
All of these columns could also be calculated with single groupings, but the value of a pivot comes
down to how you would like to explore the data. If a certain column has low enough cardinality, it
can be useful to transform it into columns so that users can see the schema and immediately know
what to query for.
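Conceptually, a pivot is a group-by whose second key becomes column headers rather than rows. A pure-Python sketch of the idea (the helper and data are made up for illustration):

```python
from collections import defaultdict

def pivot_sum(rows, index, pivot_col, value_col):
    """Turn the distinct values of pivot_col into columns, summing
    value_col for each (index value, pivot value) pair."""
    out = defaultdict(lambda: defaultdict(int))
    for row in rows:
        out[row[index]][row[pivot_col]] += row[value_col]
    return {k: dict(v) for k, v in out.items()}

rows = [
    {"date": "d1", "Country": "USA", "Quantity": 3},
    {"date": "d1", "Country": "UK",  "Quantity": 7},
    {"date": "d2", "Country": "USA", "Quantity": -2},
]
print(pivot_sum(rows, "date", "Country", "Quantity"))
# {'d1': {'USA': 3, 'UK': 7}, 'd2': {'USA': -2}}
```

A missing (date, country) pair simply has no entry, which corresponds to the null cells in the Spark output above.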
User-Defined Aggregation Functions
User-defined aggregation functions (UDAFs) are a way for users to define their own aggregation
functions based on custom formulae or business rules. You can use UDAFs to compute custom
calculations over groups of input data (as opposed to single rows). Spark maintains a single
AggregationBuffer to store intermediate results for every group of input data.
To create a UDAF, you must inherit from the UserDefinedAggregateFunction base class and
implement the following methods:
inputSchema represents input arguments as a StructType
bufferSchema represents intermediate UDAF results as a StructType
dataType represents the return DataType
deterministic is a Boolean value that specifies whether this UDAF will return the same
result for a given input
initialize allows you to initialize values of an aggregation buffer
update describes how you should update the internal buffer based on a given row
merge describes how two aggregation buffers should be merged
evaluate will generate the final result of the aggregation
The following example implements a BoolAnd, which will inform us whether all the rows (for a
given column) are true; if they're not, it will return false:
// in Scala
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
class BoolAnd extends UserDefinedAggregateFunction {
def inputSchema: org.apache.spark.sql.types.StructType =
StructType(StructField("value", BooleanType) :: Nil)
def bufferSchema: StructType = StructType(
StructField("result", BooleanType) :: Nil
)
def dataType: DataType = BooleanType
def deterministic: Boolean = true
def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = true
}
def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
buffer(0) = buffer.getAs[Boolean](0) && input.getAs[Boolean](0)
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1(0) = buffer1.getAs[Boolean](0) && buffer2.getAs[Boolean](0)
}
def evaluate(buffer: Row): Any = {
buffer(0)
}
}
Now, we simply instantiate our class and/or register it as a function:
// in Scala
val ba = new BoolAnd
spark.udf.register("booland", ba)
import org.apache.spark.sql.functions._
spark.range(1)
.selectExpr("explode(array(TRUE, TRUE, TRUE)) as t")
.selectExpr("explode(array(TRUE, FALSE, TRUE)) as f", "t")
.select(ba(col("t")), expr("booland(f)"))
.show()
+----------+----------+
|booland(t)|booland(f)|
+----------+----------+
| true| false|
+----------+----------+
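The buffer lifecycle that Spark drives (initialize per group, update per row, merge partial buffers across partitions, evaluate once) can be mimicked in a few lines of plain Python. This is an illustration of the contract only, not runnable Spark code:

```python
class BoolAnd:
    """Sketch of the UDAF buffer lifecycle: one mutable buffer per
    group, updated row by row, with partial buffers merged across
    partitions before a single final evaluate."""
    def initialize(self):
        return [True]                    # the aggregation buffer
    def update(self, buffer, value):
        buffer[0] = buffer[0] and value  # fold one input row in
        return buffer
    def merge(self, b1, b2):
        b1[0] = b1[0] and b2[0]          # combine two partial buffers
        return b1
    def evaluate(self, buffer):
        return buffer[0]                 # final result for the group

agg = BoolAnd()
# two "partitions" producing partial buffers that are then merged
p1 = agg.initialize()
for v in [True, True]:
    agg.update(p1, v)
p2 = agg.initialize()
for v in [True, False]:
    agg.update(p2, v)
print(agg.evaluate(agg.merge(p1, p2)))  # False
```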
UDAFs are currently available only in Scala or Java. However, in Spark 2.3, you will also be able to
call Scala or Java UDFs and UDAFs by registering the function just as we showed in the UDF section
in Chapter 6. For more information, go to SPARK-19439.
Conclusion
This chapter walked through the different types and kinds of aggregations that you can perform in
Spark. You learned about everything from simple groupings to window functions, as well as rollups and cubes.
Chapter 8 discusses how to perform joins to combine different data sources together.
Chapter 8. Joins
Chapter 7 covered aggregating single datasets, which is helpful, but more often than not, your Spark
applications are going to bring together a large number of different datasets. For this reason, joins are
an essential part of nearly all Spark workloads. Spark's ability to talk to different data sources means
that you can tap into a variety of data sources across your company. This chapter covers not just
what joins exist in Spark and how to use them, but some of the basic internals so that you can think
about how Spark actually goes about executing the join on the cluster. This basic knowledge can help
you avoid running out of memory and tackle problems that you could not solve before.
Join Expressions
A join brings together two sets of data, the left and the right, by comparing the value of one or more
keys of the left and right and evaluating the result of a join expression that determines whether Spark
should bring together the left set of data with the right set of data. The most common join expression,
an equi-join, compares whether the specified keys in your left and right datasets are equal. If they
are equal, Spark will combine the left and right datasets; rows whose keys do not
match are discarded. Spark also allows for much more
sophisticated join policies in addition to equi-joins. We can even use complex types and, for example,
check whether a key exists within an array when performing a join.
Join Types
Whereas the join expression determines whether two rows should join, the join type determines what
should be in the result set. There are a variety of different join types available in Spark for you to use:
Inner joins (keep rows with keys that exist in the left and right datasets)
Outer joins (keep rows with keys in either the left or right datasets)
Left outer joins (keep rows with keys in the left dataset)
Right outer joins (keep rows with keys in the right dataset)
Left semi joins (keep the rows in the left, and only the left, dataset where the key appears in
the right dataset)
Left anti joins (keep the rows in the left, and only the left, dataset where they do not appear
in the right dataset)
Natural joins (perform a join by implicitly matching the columns between the two datasets
with the same names)
Cross (or Cartesian) joins (match every row in the left dataset with every row in the right
dataset)
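Before looking at each join type in detail, the semantics of a few of them can be sketched over plain Python lists of dictionaries (illustrative only; Spark's distributed execution is of course very different):

```python
def inner_join(left, right, key):
    """Keep only row pairs whose keys match, merging their columns."""
    return [{**l, **r} for l in left for r in right
            if l[key] == r[key]]

def left_semi(left, right, key):
    """Keep left rows whose key appears on the right; no right columns."""
    keys = {r[key] for r in right}
    return [l for l in left if l[key] in keys]

def left_anti(left, right, key):
    """Keep left rows whose key does NOT appear on the right."""
    keys = {r[key] for r in right}
    return [l for l in left if l[key] not in keys]

left = [{"id": 0, "name": "a"}, {"id": 1, "name": "b"}]
right = [{"id": 0, "dept": "x"}]
print(inner_join(left, right, "id"))
# [{'id': 0, 'name': 'a', 'dept': 'x'}]
print(left_semi(left, right, "id"))   # [{'id': 0, 'name': 'a'}]
print(left_anti(left, right, "id"))   # [{'id': 1, 'name': 'b'}]
```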
If you have ever interacted with a relational database system, or even an Excel spreadsheet, the
concept of joining different datasets together should not be too abstract. Let's move on to showing
examples of each join type. This will make it easy to understand exactly how you can apply these to
your own problems. To do this, let's create some simple datasets that we can use in our examples:
// in Scala
val person = Seq(
(0, "Bill Chambers", 0, Seq(100)),
(1, "Matei Zaharia", 1, Seq(500, 250, 100)),
(2, "Michael Armbrust", 1, Seq(250, 100)))
.toDF("id", "name", "graduate_program", "spark_status")
val graduateProgram = Seq(
(0, "Masters", "School of Information", "UC Berkeley"),
(2, "Masters", "EECS", "UC Berkeley"),
(1, "Ph.D.", "EECS", "UC Berkeley"))
.toDF("id", "degree", "department", "school")
val sparkStatus = Seq(
(500, "Vice President"),
(250, "PMC Member"),
(100, "Contributor"))
.toDF("id", "status")
# in Python
person = spark.createDataFrame([
(0, "Bill Chambers", 0, [100]),
(1, "Matei Zaharia", 1, [500, 250, 100]),
(2, "Michael Armbrust", 1, [250, 100])])\
.toDF("id", "name", "graduate_program", "spark_status")
graduateProgram = spark.createDataFrame([
(0, "Masters", "School of Information", "UC Berkeley"),
(2, "Masters", "EECS", "UC Berkeley"),
(1, "Ph.D.", "EECS", "UC Berkeley")])\
.toDF("id", "degree", "department", "school")
sparkStatus = spark.createDataFrame([
(500, "Vice President"),
(250, "PMC Member"),
(100, "Contributor")])\
.toDF("id", "status")
Next, let's register these as tables so that we can use them throughout the chapter:
person.createOrReplaceTempView("person")
graduateProgram.createOrReplaceTempView("graduateProgram")
sparkStatus.createOrReplaceTempView("sparkStatus")
Inner Joins
Inner joins evaluate the keys in both of the DataFrames or tables and include (and join together) only
the rows that evaluate to true. In the following example, we join the graduateProgram DataFrame
with the person DataFrame to create a new DataFrame:
// in Scala
val joinExpression = person.col("graduate_program") === graduateProgram.col("id")
# in Python
joinExpression = person["graduate_program"] == graduateProgram['id']
Keys that do not exist in both DataFrames will not show up in the resulting DataFrame. For example, the
following expression would result in zero rows in the resulting DataFrame:
// in Scala
val wrongJoinExpression = person.col("name") === graduateProgram.col("school")
# in Python
wrongJoinExpression = person["name"] == graduateProgram["school"]
Inner joins are the default join, so we just need to specify our left DataFrame and join the right in the
JOIN expression:
person.join(graduateProgram, joinExpression).show()
-- in SQL
SELECT * FROM person JOIN graduateProgram
ON person.graduate_program = graduateProgram.id
+---+----------------+----------------+---------------+---+-------+----------+---
| id| name|graduate_program| spark_status| id| degree|department|...
+---+----------------+----------------+---------------+---+-------+----------+---
| 0| Bill Chambers| 0| [100]| 0|Masters| School...|...
| 1| Matei Zaharia| 1|[500, 250, 100]| 1| Ph.D.| EECS|...
| 2|Michael Armbrust| 1| [250, 100]| 1| Ph.D.| EECS|...
+---+----------------+----------------+---------------+---+-------+----------+---
We can also specify this explicitly by passing in a third parameter, the joinType:
// in Scala
var joinType = "inner"
# in Python
joinType = "inner"
person.join(graduateProgram, joinExpression, joinType).show()
-- in SQL
SELECT * FROM person INNER JOIN graduateProgram
ON person.graduate_program = graduateProgram.id
+---+----------------+----------------+---------------+---+-------+--------------
| id| name|graduate_program| spark_status| id| degree| department...
+---+----------------+----------------+---------------+---+-------+--------------
| 0| Bill Chambers| 0| [100]| 0|Masters| School...
| 1| Matei Zaharia| 1|[500, 250, 100]| 1| Ph.D.| EECS...
| 2|Michael Armbrust| 1| [250, 100]| 1| Ph.D.| EECS...
+---+----------------+----------------+---------------+---+-------+--------------
Outer Joins
Outer joins evaluate the keys in both of the DataFrames or tables and include (and join together) the
rows that evaluate to true or false. If there is no equivalent row in the left or right DataFrame,
Spark will insert null:
joinType = "outer"
person.join(graduateProgram, joinExpression, joinType).show()
-- in SQL
SELECT * FROM person FULL OUTER JOIN graduateProgram
ON person.graduate_program = graduateProgram.id
+----+----------------+----------------+---------------+---+-------+-------------
| id| name|graduate_program| spark_status| id| degree| departmen...
+----+----------------+----------------+---------------+---+-------+-------------
| 1| Matei Zaharia| 1|[500, 250, 100]| 1| Ph.D.| EEC...
| 2|Michael Armbrust| 1| [250, 100]| 1| Ph.D.| EEC...
|null| null| null| null| 2|Masters| EEC...
| 0| Bill Chambers| 0| [100]| 0|Masters| School...
+----+----------------+----------------+---------------+---+-------+-------------
Left Outer Joins
Left outer joins evaluate the keys in both of the DataFrames or tables and include all rows from the
left DataFrame, as well as any rows in the right DataFrame that have a match in the left DataFrame. If
there is no equivalent row in the right DataFrame, Spark will insert null:
joinType = "left_outer"
graduateProgram.join(person, joinExpression, joinType).show()
-- in SQL
SELECT * FROM graduateProgram LEFT OUTER JOIN person
ON person.graduate_program = graduateProgram.id
+---+-------+----------+-----------+----+----------------+----------------+---
| id| degree|department| school| id| name|graduate_program|...
+---+-------+----------+-----------+----+----------------+----------------+---
| 0|Masters| School...|UC Berkeley| 0| Bill Chambers| 0|...
| 2|Masters| EECS|UC Berkeley|null| null| null|...
| 1| Ph.D.| EECS|UC Berkeley| 2|Michael Armbrust| 1|...
| 1| Ph.D.| EECS|UC Berkeley| 1| Matei Zaharia| 1|...
+---+-------+----------+-----------+----+----------------+----------------+---
Right Outer Joins
Right outer joins evaluate the keys in both of the DataFrames or tables and include all rows from the
right DataFrame, as well as any rows in the left DataFrame that have a match in the right DataFrame.
If there is no equivalent row in the left DataFrame, Spark will insert null:
joinType = "right_outer"
person.join(graduateProgram, joinExpression, joinType).show()
-- in SQL
SELECT * FROM person RIGHT OUTER JOIN graduateProgram
ON person.graduate_program = graduateProgram.id
+----+----------------+----------------+---------------+---+-------+------------+
| id| name|graduate_program| spark_status| id| degree| department|
+----+----------------+----------------+---------------+---+-------+------------+
| 0| Bill Chambers| 0| [100]| 0|Masters|School of...|
|null| null| null| null| 2|Masters| EECS|
| 2|Michael Armbrust| 1| [250, 100]| 1| Ph.D.| EECS|
| 1| Matei Zaharia| 1|[500, 250, 100]| 1| Ph.D.| EECS|
+----+----------------+----------------+---------------+---+-------+------------+
Left Semi Joins
Semi joins are a bit of a departure from the other joins. They do not actually include any values from
the right DataFrame. They only compare values to see if the value exists in the second DataFrame. If
the value does exist, those rows will be kept in the result, even if there are duplicate keys in the left
DataFrame. Think of left semi joins as filters on a DataFrame, as opposed to the function of a
conventional join:
joinType = "left_semi"
graduateProgram.join(person, joinExpression, joinType).show()
+---+-------+--------------------+-----------+
| id| degree| department| school|
+---+-------+--------------------+-----------+
| 0|Masters|School of Informa...|UC Berkeley|
| 1| Ph.D.| EECS|UC Berkeley|
+---+-------+--------------------+-----------+
// in Scala
val gradProgram2 = graduateProgram.union(Seq(
(0, "Masters", "Duplicated Row", "Duplicated School")).toDF())
gradProgram2.createOrReplaceTempView("gradProgram2")
# in Python
gradProgram2 = graduateProgram.union(spark.createDataFrame([
(0, "Masters", "Duplicated Row", "Duplicated School")]))
gradProgram2.createOrReplaceTempView("gradProgram2")
gradProgram2.join(person, joinExpression, joinType).show()
-- in SQL
SELECT * FROM gradProgram2 LEFT SEMI JOIN person
ON gradProgram2.id = person.graduate_program
+---+-------+--------------------+-----------------+
| id| degree| department| school|
+---+-------+--------------------+-----------------+
| 0|Masters|School of Informa...| UC Berkeley|
| 1| Ph.D.| EECS| UC Berkeley|
| 0|Masters| Duplicated Row|Duplicated School|
+---+-------+--------------------+-----------------+
Left Anti Joins
Left anti joins are the opposite of left semi joins. Like left semi joins, they do not actually include any
values from the right DataFrame. They only compare values to see if the value exists in the second
DataFrame. However, rather than keeping the values that exist in the second DataFrame, they keep
only the values that do not have a corresponding key in the second DataFrame. Think of anti joins as a
NOT IN SQL-style filter:
joinType = "left_anti"
graduateProgram.join(person, joinExpression, joinType).show()
-- in SQL
SELECT * FROM graduateProgram LEFT ANTI JOIN person
ON graduateProgram.id = person.graduate_program
+---+-------+----------+-----------+
| id| degree|department| school|
+---+-------+----------+-----------+
| 2|Masters| EECS|UC Berkeley|
+---+-------+----------+-----------+
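Because semi and anti joins act as filters, their semantics are easy to sketch in plain Python. The rows below mirror the example tables; this is a sketch of the semantics only, not of how Spark executes these joins:

```python
# graduateProgram rows modeled as (id, degree, department, school)
grad_programs = [
    (0, "Masters", "School of Information", "UC Berkeley"),
    (2, "Masters", "EECS", "UC Berkeley"),
    (1, "Ph.D.", "EECS", "UC Berkeley"),
]
# The graduate_program values from person; duplicates collapse in a set.
person_grad_ids = {0, 1, 1}

# Left semi join: keep left rows whose key appears on the right.
left_semi = [row for row in grad_programs if row[0] in person_grad_ids]

# Left anti join: keep left rows whose key does NOT appear on the right.
left_anti = [row for row in grad_programs if row[0] not in person_grad_ids]

print([r[0] for r in left_semi])  # ids 0 and 1 are kept
print([r[0] for r in left_anti])  # id 2 has no matching person
```

Note that neither result contains any columns from the right side, exactly as in the Spark output above.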
Natural Joins
Natural joins make implicit guesses at the columns on which you would like to join. They find
matching columns and return the results. Left, right, and outer natural joins are all supported.
WARNING
Implicit is always dangerous! The following query will give us incorrect results because the two DataFrames/tables share a
column name (id), but it means different things in the datasets. You should always use this join with caution.
-- in SQL
SELECT * FROM graduateProgram NATURAL JOIN person
Cross (Cartesian) Joins
The last of our joins are cross-joins or cartesian products. Cross-joins in simplest terms are inner
joins that do not specify a predicate. Cross-joins will join every single row in the left DataFrame to
every single row in the right DataFrame. This will cause an absolute explosion in the number of rows
contained in the resulting DataFrame. If you have 1,000 rows in each DataFrame, the cross-join of
these will result in 1,000,000 (1,000 x 1,000) rows. For this reason, you must very explicitly state
that you want a cross-join by using the cross join keyword:
joinType = "cross"
graduateProgram.join(person, joinExpression, joinType).show()
-- in SQL
SELECT * FROM graduateProgram CROSS JOIN person
ON graduateProgram.id = person.graduate_program
+---+-------+----------+-----------+---+----------------+----------------+-------
| id| degree|department| school| id| name|graduate_program|spar...
+---+-------+----------+-----------+---+----------------+----------------+-------
| 0|Masters| School...|UC Berkeley| 0| Bill Chambers| 0| ...
| 1| Ph.D.| EECS|UC Berkeley| 2|Michael Armbrust| 1| [2...
| 1| Ph.D.| EECS|UC Berkeley| 1| Matei Zaharia| 1|[500...
+---+-------+----------+-----------+---+----------------+----------------+-------
If you truly intend to have a cross-join, you can call that out explicitly:
person.crossJoin(graduateProgram).show()
-- in SQL
SELECT * FROM graduateProgram CROSS JOIN person
+---+----------------+----------------+---------------+---+-------+-------------+
| id| name|graduate_program| spark_status| id| degree| departm...|
+---+----------------+----------------+---------------+---+-------+-------------+
| 0| Bill Chambers| 0| [100]| 0|Masters| School...|
...
| 1| Matei Zaharia| 1|[500, 250, 100]| 0|Masters| School...|
...
| 2|Michael Armbrust| 1| [250, 100]| 0|Masters| School...|
...
+---+----------------+----------------+---------------+---+-------+-------------+
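The row explosion is easy to demonstrate with plain Python (a sketch of the arithmetic, not of Spark's execution):

```python
from itertools import product

left = [f"L{i}" for i in range(1000)]
right = [f"R{i}" for i in range(1000)]

# A cross join pairs every left row with every right row,
# so the result has len(left) * len(right) rows.
n_rows = sum(1 for _ in product(left, right))
print(n_rows)  # 1,000,000
```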
WARNING
You should use cross-joins only if you are absolutely, 100 percent sure that this is the join you need. There is a reason why
you need to be explicit when defining a cross-join in Spark. They’re dangerous! Advanced users can set the session-level
configuration spark.sql.crossJoin.enabled to true in order to allow cross-joins without warnings or without Spark trying
to perform another join for you.
Challenges When Using Joins
When performing joins, there are some specific challenges and some common questions that arise.
The rest of the chapter will provide answers to these common questions and then explain how, at a
high level, Spark performs joins. This will hint at some of the optimizations that we are going to
cover in later parts of this book.
Joins on Complex Types
Even though this might seem like a challenge, it’s actually not. Any expression is a valid join
expression, assuming that it returns a Boolean:
// in Scala
import org.apache.spark.sql.functions.expr
person.withColumnRenamed("id", "personId")
.join(sparkStatus, expr("array_contains(spark_status, id)")).show()
# in Python
from pyspark.sql.functions import expr
person.withColumnRenamed("id", "personId")\
.join(sparkStatus, expr("array_contains(spark_status, id)")).show()
-- in SQL
SELECT * FROM
(select id as personId, name, graduate_program, spark_status FROM person)
INNER JOIN sparkStatus ON array_contains(spark_status, id)
+--------+----------------+----------------+---------------+---+--------------+
|personId| name|graduate_program| spark_status| id| status|
+--------+----------------+----------------+---------------+---+--------------+
| 0| Bill Chambers| 0| [100]|100| Contributor|
| 1| Matei Zaharia| 1|[500, 250, 100]|500|Vice President|
| 1| Matei Zaharia| 1|[500, 250, 100]|250| PMC Member|
| 1| Matei Zaharia| 1|[500, 250, 100]|100| Contributor|
| 2|Michael Armbrust| 1| [250, 100]|250| PMC Member|
| 2|Michael Armbrust| 1| [250, 100]|100| Contributor|
+--------+----------------+----------------+---------------+---+--------------+
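The same membership predicate can be sketched in plain Python over the example rows. This is a sketch of the join semantics only; it shows why the predicate produces one output row per matching (person, status) pair:

```python
# person rows as (id, name, graduate_program, spark_status)
persons = [
    (0, "Bill Chambers", 0, [100]),
    (1, "Matei Zaharia", 1, [500, 250, 100]),
    (2, "Michael Armbrust", 1, [250, 100]),
]
# sparkStatus rows as (id, status)
statuses = [(500, "Vice President"), (250, "PMC Member"), (100, "Contributor")]

# Any Boolean predicate can drive a join; here the predicate is
# "the status id is contained in the person's spark_status array".
joined = [
    (pid, name, sid, status)
    for (pid, name, _, spark_status) in persons
    for (sid, status) in statuses
    if sid in spark_status
]
print(len(joined))  # 6 matching rows, as in the Spark output above
```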
Handling Duplicate Column Names
One of the tricky things that come up in joins is dealing with duplicate column names in your results
DataFrame. In a DataFrame, each column has a unique ID within Spark’s SQL engine, Catalyst. This
unique ID is purely internal and not something that you can directly reference. This makes it quite
difficult to refer to a specific column when you have a DataFrame with duplicate column names.
This can occur in two distinct situations:
The join expression that you specify does not remove one key from one of the input
DataFrames and the keys have the same column name
Two columns on which you are not performing the join have the same name
Let’s create a problem dataset that we can use to illustrate these problems:
// in Scala
val gradProgramDupe = graduateProgram.withColumnRenamed("id", "graduate_program")
val joinExpr = gradProgramDupe.col("graduate_program") === person.col(
"graduate_program")
Note that there are now two graduate_program columns, even though we joined on that key:
person.join(gradProgramDupe, joinExpr).show()
The challenge arises when we refer to one of these columns:
person.join(gradProgramDupe, joinExpr).select("graduate_program").show()
Given the previous code snippet, we will receive an error. In this particular example, Spark
generates this message:
org.apache.spark.sql.AnalysisException: Reference 'graduate_program' is
ambiguous, could be: graduate_program#40, graduate_program#1079.;
Approach 1: Different join expression
When you have two keys that have the same name, probably the easiest fix is to change the join
expression from a Boolean expression to a string or sequence. This automatically removes one of the
columns for you during the join:
person.join(gradProgramDupe, "graduate_program").select("graduate_program").show()
Approach 2: Dropping the column after the join
Another approach is to drop the offending column after the join. When doing this, we need to refer to
the column via the original source DataFrame. We can do this if the join uses the same key names or if
the source DataFrames have columns that simply have the same name:
person.join(gradProgramDupe, joinExpr).drop(person.col("graduate_program"))
.select("graduate_program").show()
val joinExpr = person.col("graduate_program") === graduateProgram.col("id")
person.join(graduateProgram, joinExpr).drop(graduateProgram.col("id")).show()
This is an artifact of Spark’s SQL analysis process in which an explicitly referenced column will
pass analysis because Spark has no need to resolve the column. Notice how the column uses the .col
method instead of a column function. That allows us to implicitly specify that column by its specific
ID.
Approach 3: Renaming a column before the join
We can avoid this issue altogether if we rename one of our columns before the join:
val gradProgram3 = graduateProgram.withColumnRenamed("id", "grad_id")
val joinExpr = person.col("graduate_program") === gradProgram3.col("grad_id")
person.join(gradProgram3, joinExpr).show()
How Spark Performs Joins
To understand how Spark performs joins, you need to understand the two core resources at play: the
node-to-node communication strategy and the per-node computation strategy. These internals are
likely irrelevant to your business problem. However, comprehending how Spark performs joins can
mean the difference between a job that completes quickly and one that never completes at all.
Communication Strategies
Spark approaches cluster communication in two different ways during joins. It either incurs a shuffle
join, which results in an all-to-all communication or a broadcast join. Keep in mind that there is a lot
more detail than we’re letting on at this point, and that’s intentional. Some of these internal
optimizations are likely to change over time with new improvements to the cost-based optimizer and
improved communication strategies. For this reason, we’re going to focus on the high-level examples
to help you understand exactly what’s going on in some of the more common scenarios, and let you
take advantage of some of the low-hanging fruit that you can use right away to try to speed up some of
your workloads.
The core foundation of our simplified view of joins is that in Spark you will have either a big table or
a small table. Although this is obviously a spectrum (and things do happen differently if you have a
“medium-sized table”), it can help to be binary about the distinction for the sake of this explanation.
Big table–to–big table
When you join a big table to another big table, you end up with a shuffle join, such as the one
illustrated in Figure 8-1.
Figure 8-1. Joining two big tables
In a shuffle join, every node talks to every other node and they share data according to which node
has a certain key or set of keys (on which you are joining). These joins are expensive because the
network can become congested with traffic, especially if your data is not partitioned well.
This join describes taking a big table of data and joining it to another big table of data. An example of
this might be a company that receives billions of messages every day from the Internet of Things, and
needs to identify the day-over-day changes that have occurred. The way to do this is by joining on
deviceId, messageType, and date in one column, and date - 1 day in the other column.
In Figure 8-1, DataFrame 1 and DataFrame 2 are both large DataFrames. This means that all worker
nodes (and potentially every partition) will need to communicate with one another during the entire
join process (with no intelligent partitioning of data).
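The day-over-day key construction described above can be sketched with plain Python dates; the device and message values here are illustrative, not from the book's dataset:

```python
from datetime import date, timedelta

# (deviceId, messageType, date) records from two days of a message feed.
today_msgs = [("device-1", "telemetry", date(2018, 2, 8))]
yesterday_msgs = [("device-1", "telemetry", date(2018, 2, 7))]

# Join today's (deviceId, messageType, date) against
# yesterday's (deviceId, messageType, date + 1 day), i.e. subtract a day
# from today's key so it lines up with yesterday's.
matches = [
    (t, y)
    for t in today_msgs
    for y in yesterday_msgs
    if (t[0], t[1], t[2] - timedelta(days=1)) == (y[0], y[1], y[2])
]
print(len(matches))  # the two records line up day over day
```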
Big table–to–small table
When the table is small enough to fit into the memory of a single worker node, with some breathing
room of course, we can optimize our join. Although we can use a big table–to–big table
communication strategy, it can often be more efficient to use a broadcast join. What this means is that
we will replicate our small DataFrame onto every worker node in the cluster (be it located on one
machine or many). Now this sounds expensive. However, what this does is prevent us from
performing the all-to-all communication during the entire join process. Instead, we perform it only
once at the beginning and then let each individual worker node perform the work without having to
wait or communicate with any other worker node, as is depicted in Figure 8-2.
Figure 8-2. A broadcast join
At the beginning of this join, there will be a large communication, just like in the previous type of join.
However, immediately after that first communication, there will be no further communication between nodes. This
means that joins will be performed on every single node individually, making CPU the biggest
bottleneck. For our current set of data, we can see that Spark has automatically set this up as a
broadcast join by looking at the explain plan:
val joinExpr = person.col("graduate_program") === graduateProgram.col("id")
person.join(graduateProgram, joinExpr).explain()
== Physical Plan ==
*BroadcastHashJoin [graduate_program#40], [id#5....
:- LocalTableScan [id#38, name#39, graduate_progr...
+- BroadcastExchange HashedRelationBroadcastMode(....
+- LocalTableScan [id#56, degree#57, departmen....
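Conceptually, the BroadcastHashJoin in this plan builds a hash table from the small (broadcast) side and then probes it with each row of the big side. A minimal plain-Python sketch of that idea using the example data (an illustration, not Spark's implementation):

```python
# Small side: replicated to every worker, then hashed on the join key.
grad_programs = {0: "Masters", 1: "Ph.D.", 2: "Masters"}  # id -> degree

# Big side: each partition probes the hash table locally, with no shuffle.
persons = [("Bill Chambers", 0), ("Matei Zaharia", 1), ("Michael Armbrust", 1)]

joined = [
    (name, prog_id, grad_programs[prog_id])
    for (name, prog_id) in persons
    if prog_id in grad_programs  # inner join: unmatched rows are dropped
]
print(joined)
```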
With the DataFrame API, we can also explicitly give the optimizer a hint that we would like to use a
broadcast join by using the correct function around the small DataFrame in question. In this example,
this results in the same plan we just saw; however, this is not always the case:
import org.apache.spark.sql.functions.broadcast
val joinExpr = person.col("graduate_program") === graduateProgram.col("id")
person.join(broadcast(graduateProgram), joinExpr).explain()
The SQL interface also includes the ability to provide hints to perform joins. These are not enforced,
however, so the optimizer might choose to ignore them. You can set one of these hints by using a
special comment syntax. MAPJOIN, BROADCAST, and BROADCASTJOIN all do the same thing and are all
supported:
-- in SQL
SELECT /*+ MAPJOIN(graduateProgram) */ * FROM person JOIN graduateProgram
ON person.graduate_program = graduateProgram.id
This doesn’t come for free either: if you try to broadcast something too large, you can crash your
driver node (because that collect is expensive). This is likely an area for optimization in the future.
Little table–to–little table
When performing joins with small tables, it’s usually best to let Spark decide how to join them. You
can always force a broadcast join if you’re noticing strange behavior.
Conclusion
In this chapter, we discussed joins, probably one of the most common use cases. One thing we did not
mention but is important to consider is if you partition your data correctly prior to a join, you can end
up with much more efficient execution because even if a shuffle is planned, if data from two different
DataFrames is already located on the same machine, Spark can avoid the shuffle. Experiment with
some of your data and try partitioning beforehand to see if you can notice the increase in speed when
performing those joins. In Chapter 9, we will discuss Spark’s data source APIs. There are additional
implications when you decide what order joins should occur in. Because some joins act as filters, this
can be a low-hanging improvement in your workloads, as you are guaranteed to reduce data
exchanged over the network.
The next chapter will depart from user manipulation, as we’ve seen in the last several chapters, and
touch on reading and writing data using the Structured APIs.
Chapter 9. Data Sources
This chapter formally introduces the variety of other data sources that you can use with Spark out of
the box as well as the countless other sources built by the greater community. Spark has six “core”
data sources and hundreds of external data sources written by the community. The ability to read and
write from all different kinds of data sources and for the community to create its own contributions is
arguably one of Spark’s greatest strengths. Following are Spark’s core data sources:
CSV
JSON
Parquet
ORC
JDBC/ODBC connections
Plain-text files
As mentioned, Spark has numerous community-created data sources. Here’s just a small sample:
Cassandra
HBase
MongoDB
AWS Redshift
XML
And many, many others
The goal of this chapter is to give you the ability to read and write from Spark’s core data sources
and know enough to understand what you should look for when integrating with third-party data
sources. To achieve this, we will focus on the core concepts that you need to be able to recognize and
understand.
The Structure of the Data Sources API
Before proceeding with how to read and write from certain formats, let’s visit the overall
organizational structure of the data source APIs.
Read API Structure
The core structure for reading data is as follows:
DataFrameReader.format(...).option("key", "value").schema(...).load()
We will use this format to read from all of our data sources. format is optional because by default
Spark will use the Parquet format. option allows you to set key-value configurations to parameterize
how you will read data. Lastly, schema is optional if the data source provides a schema or if you
intend to use schema inference. Naturally, there are some required options for each format, which we
will discuss when we look at each format.
NOTE
There is a lot of shorthand notation in the Spark community, and the data source read API is no exception. We try to be
consistent throughout the book while still revealing some of the shorthand notation along the way.
Basics of Reading Data
The foundation for reading data in Spark is the DataFrameReader. We access this through the
SparkSession via the read attribute:
spark.read
After we have a DataFrame reader, we specify several values:
The format
The schema
The read mode
A series of options
The format, options, and schema each return a DataFrameReader that can undergo further
transformations and are all optional, except for one. Each data source has a specific set of
options that determine how the data is read into Spark (we cover these options shortly). At a
minimum, you must supply the DataFrameReader a path from which to read.
Here’s an example of the overall layout:
spark.read.format("csv")
.option("mode", "FAILFAST")
.option("inferSchema", "true")
.option("path", "path/to/file(s)")
.schema(someSchema)
.load()
There are a variety of ways in which you can set options; for example, you can build a map and pass
in your configurations. For now, we’ll stick to the simple and explicit way that you just saw.
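The chaining works because each builder method returns the reader itself. A toy sketch of that fluent pattern in plain Python (this mimics, but is not, Spark's DataFrameReader):

```python
class ToyReader:
    """A minimal stand-in for the DataFrameReader chaining pattern."""

    def __init__(self):
        # format defaults to Parquet, as it does in Spark.
        self.fmt, self.opts, self.schema_ = "parquet", {}, None

    def format(self, fmt):
        self.fmt = fmt
        return self  # returning self is what makes chaining possible

    def option(self, key, value):
        self.opts[key] = value
        return self

    def options(self, **kwargs):  # the "build a map and pass it in" variant
        self.opts.update(kwargs)
        return self

    def schema(self, s):
        self.schema_ = s
        return self

reader = ToyReader().format("csv").options(mode="FAILFAST", inferSchema="true")
print(reader.fmt, reader.opts)
```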
Read modes
Reading data from an external source naturally entails encountering malformed data, especially when
working with only semi-structured data sources. Read modes specify what will happen when Spark
does come across malformed records. Table 9-1 lists the read modes.
Table 9-1. Spark’s read modes
Read mode Description
permissive Sets all fields to null when it encounters a corrupted record and places all corrupted records in a string column
called _corrupt_record
dropMalformed Drops the row that contains malformed records
failFast Fails immediately upon encountering malformed records
The default is permissive.
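The three modes can be sketched as a plain-Python decision function; this illustrates the semantics in Table 9-1, not Spark's implementation:

```python
def read_record(fields, expected, mode="permissive"):
    """Toy sketch of Spark's read modes: decide what to do with a
    record that has the wrong number of fields."""
    if len(fields) == expected:
        return fields
    if mode == "permissive":
        # Null out the fields; keep the raw record in _corrupt_record.
        return {"_corrupt_record": ",".join(fields)}
    if mode == "dropMalformed":
        return None  # silently drop the row
    if mode == "failFast":
        raise ValueError("malformed record: " + ",".join(fields))

good = read_record(["US", "FR", "10"], expected=3)
kept = read_record(["US", "FR"], expected=3, mode="permissive")
dropped = read_record(["US", "FR"], expected=3, mode="dropMalformed")
print(good, kept, dropped)
```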
Write API Structure
The core structure for writing data is as follows:
DataFrameWriter.format(...).option(...).partitionBy(...).bucketBy(...).sortBy(
...).save()
We will use this format to write to all of our data sources. format is optional because by default,
Spark will use the Parquet format. option, again, allows us to configure how to write out our given
data. partitionBy, bucketBy, and sortBy work only for file-based data sources; you can use them
to control the specific layout of files at the destination.
Basics of Writing Data
The foundation for writing data is quite similar to that of reading data. Instead of the
DataFrameReader, we have the DataFrameWriter. Because we always need to write out some
given data source, we access the DataFrameWriter on a per-DataFrame basis via the write
attribute:
// in Scala
dataFrame.write
After we have a DataFrameWriter, we specify three values: the format, a series of options, and
the save mode. At a minimum, you must supply a path. We will cover the potential for options, which
vary from data source to data source, shortly.
// in Scala
dataframe.write.format("csv")
.option("mode", "OVERWRITE")
.option("dateFormat", "yyyy-MM-dd")
.option("path", "path/to/file(s)")
.save()
Save modes
Save modes specify what will happen if Spark finds data at the specified location (assuming all else
equal). Table 9-2 lists the save modes.
Table 9-2. Spark’s save modes
Save mode Description
append Appends the output files to the list of files that already exist at that location
overwrite Will completely overwrite any data that already exists there
errorIfExists Throws an error and fails the write if data or files already exist at the specified location
ignore If data or files exist at the location, do nothing with the current DataFrame
The default is errorIfExists. This means that if Spark finds data at the location to which you’re
writing, it will fail the write immediately.
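Save-mode semantics reduce to a small decision: does data already exist at the target, and if so, what do we do? A plain-Python sketch of Table 9-2 (not Spark's implementation):

```python
import os
import tempfile

def choose_action(path, mode="errorIfExists"):
    """Toy sketch of save-mode semantics."""
    if not os.path.exists(path):
        return "write"  # nothing there yet: every mode just writes
    return {
        "append": "append",       # add files next to the existing ones
        "overwrite": "overwrite", # replace whatever is there
        "ignore": "skip",         # leave existing data, do nothing
        "errorIfExists": "fail",  # the default: refuse to write
    }[mode]

with tempfile.TemporaryDirectory() as d:
    fresh = os.path.join(d, "new-output")
    print(choose_action(fresh))             # no data there yet: write
    print(choose_action(d, mode="ignore"))  # data exists: skip
```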
We’ve largely covered the core concepts that you’re going to need when using data sources, so now
let’s dive into each of Spark’s native data sources.
CSV Files
CSV stands for comma-separated values. This is a common text file format in which each line
represents a single record, and commas separate each field within a record. CSV files, while seeming
well structured, are actually one of the trickiest file formats you will encounter because not many
assumptions can be made in production scenarios about what they contain or how they are structured.
For this reason, the CSV reader has a large number of options. These options give you the ability to
work around issues like certain characters needing to be escaped—for example, commas inside of
columns when the file is also comma-delimited or null values labeled in an unconventional way.
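Python's standard csv module illustrates the embedded-comma problem: a value containing the delimiter must be quoted, or the row silently gains an extra column:

```python
import csv
import io

# Write a row where one value contains a comma; the writer quotes it.
buf = io.StringIO()
csv.writer(buf).writerow(["United States", "St. Martin, FR", "10"])
line = buf.getvalue().strip()
print(line)  # the embedded comma is protected by quotes

# Parse the line back: still three fields, not four.
fields = next(csv.reader(io.StringIO(line)))
print(fields)
```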
CSV Options
Table 9-3 presents the options available in the CSV reader.
Table 9-3. CSV data source options
Read/write | Key | Potential values | Default | Description
Both | sep | Any single string character | , | The single character that is used as the separator for each field and value.
Both | header | true, false | false | A Boolean flag that declares whether the first line in the file(s) are the names of the columns.
Read | escape | Any string character | \ | The character Spark should use to escape other characters in the file.
Read | inferSchema | true, false | false | Specifies whether Spark should infer column types when reading the file.
Read | ignoreLeadingWhiteSpace | true, false | false | Declares whether leading spaces from values being read should be skipped.
Read | ignoreTrailingWhiteSpace | true, false | false | Declares whether trailing spaces from values being read should be skipped.
Both | nullValue | Any string character | "" (empty string) | Declares what character represents a null value in the file.
Both | nanValue | Any string character | NaN | Declares what character represents a NaN or missing character in the CSV file.
Both | positiveInf | Any string or character | Inf | Declares what character(s) represent a positive infinite value.
Both | negativeInf | Any string or character | -Inf | Declares what character(s) represent a negative infinite value.
Both | compression or codec | none, uncompressed, bzip2, deflate, gzip, lz4, or snappy | none | Declares what compression codec Spark should use to read or write the file.
Both | dateFormat | Any string or character that conforms to Java’s SimpleDateFormat | yyyy-MM-dd | Declares the date format for any columns that are date type.
Both | timestampFormat | Any string or character that conforms to Java’s SimpleDateFormat | yyyy-MM-dd’T’HH:mm:ss.SSSZZ | Declares the timestamp format for any columns that are timestamp type.
Read | maxColumns | Any integer | 20480 | Declares the maximum number of columns in the file.
Read | maxCharsPerColumn | Any integer | 1000000 | Declares the maximum number of characters in a column.
Read | escapeQuotes | true, false | true | Declares whether Spark should escape quotes that are found in lines.
Read | maxMalformedLogPerPartition | Any integer | 10 | Sets the maximum number of malformed rows Spark will log for each partition. Malformed records beyond this number will be ignored.
Write | quoteAll | true, false | false | Specifies whether all values should be enclosed in quotes, as opposed to just escaping values that have a quote character.
Read | multiLine | true, false | false | This option allows you to read multiline CSV files where each logical row in the CSV file might span multiple rows in the file itself.
Reading CSV Files
To read a CSV file, like any other format, we must first create a DataFrameReader for that specific
format. Here, we specify the format to be CSV:
spark.read.format("csv")
After this, we have the option of specifying a schema as well as modes as options. Let’s set a couple
of options, some that we saw from the beginning of the book and others that we haven’t seen yet.
We’ll set the header to true for our CSV file, the mode to be FAILFAST, and inferSchema to true:
// in Scala
spark.read.format("csv")
.option("header", "true")
.option("mode", "FAILFAST")
.option("inferSchema", "true")
.load("some/path/to/file.csv")
As mentioned, we can use the mode to specify how much tolerance we have for malformed data. For
example, we can use these modes and the schema that we created in Chapter 5 to ensure that our
file(s) conform to the data that we expected:
// in Scala
import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType}
val myManualSchema = new StructType(Array(
new StructField("DEST_COUNTRY_NAME", StringType, true),
new StructField("ORIGIN_COUNTRY_NAME", StringType, true),
new StructField("count", LongType, false)
))
spark.read.format("csv")
.option("header", "true")
.option("mode", "FAILFAST")
.schema(myManualSchema)
.load("/data/flight-data/csv/2010-summary.csv")
.show(5)
Things get tricky when we don’t expect our data to be in a certain format, but it comes in that way,
anyhow. For example, let’s take our current schema and change all column types to LongType. This
does not match the actual schema, but Spark has no problem with us doing this. The problem will
only manifest itself when Spark actually reads the data. As soon as we start our Spark job, it will
immediately fail (after we execute a job) due to the data not conforming to the specified schema:
// in Scala
val myManualSchema = new StructType(Array(
new StructField("DEST_COUNTRY_NAME", LongType, true),
new StructField("ORIGIN_COUNTRY_NAME", LongType, true),
new StructField("count", LongType, false) ))
spark.read.format("csv")
.option("header", "true")
.option("mode", "FAILFAST")
.schema(myManualSchema)
.load("/data/flight-data/csv/2010-summary.csv")
.take(5)
In general, Spark will fail only at job execution time rather than DataFrame definition time—even if,
for example, we point to a file that does not exist. This is due to lazy evaluation, a concept we
learned about in Chapter 2.
Writing CSV Files
Just as with reading data, there are a variety of options (listed in Table 9-3) for writing data when we
write CSV files. This is a subset of the reading options because many do not apply when writing data
(like maxColumns and inferSchema). Here’s an example:
// in Scala
val csvFile = spark.read.format("csv")
.option("header", "true").option("mode", "FAILFAST").schema(myManualSchema)
.load("/data/flight-data/csv/2010-summary.csv")
# in Python
csvFile = spark.read.format("csv")\
.option("header", "true")\
.option("mode", "FAILFAST")\
.option("inferSchema", "true")\
.load("/data/flight-data/csv/2010-summary.csv")
For instance, we can take our CSV file and write it out as a TSV file quite easily:
// in Scala
csvFile.write.format("csv").mode("overwrite").option("sep", "\t")
.save("/tmp/my-tsv-file.tsv")
# in Python
csvFile.write.format("csv").mode("overwrite").option("sep", "\t")\
.save("/tmp/my-tsv-file.tsv")
When you list the destination directory, you can see that my-tsv-file is actually a folder with numerous
files within it:
$ ls /tmp/my-tsv-file.tsv/
/tmp/my-tsv-file.tsv/part-00000-35cf9453-1943-4a8c-9c82-9f6ea9742b29.csv
This actually reflects the number of partitions in our DataFrame at the time we write it out. If we
were to repartition our data before then, we would end up with a different number of files. We
discuss this trade-off at the end of this chapter.
JSON Files
Those coming from the world of JavaScript are likely familiar with JavaScript Object Notation, or
JSON, as it’s commonly called. There are some catches when working with this kind of data that are
worth considering before we jump in. In Spark, when we refer to JSON files, we refer to line-
delimited JSON files. This contrasts with files that have a large JSON object or array per file.
The line-delimited versus multiline trade-off is controlled by a single option: multiLine. When you
set this option to true, you can read an entire file as one JSON object, and Spark will go through the
work of parsing that into a DataFrame. Line-delimited JSON is actually a much more stable format
because it allows you to append new records to a file (rather than having to read in an entire
file and then write it out), which is why we recommend that you use it. Another key reason for the
popularity of line-delimited JSON is that JSON objects have structure, and JavaScript (on which
JSON is based) has at least basic types. This makes it easier to work with because Spark can make
more assumptions on our behalf about the data. You'll notice that there are significantly fewer options
than we saw for CSV because the objects encode so much of that information themselves.
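To make the layout difference concrete, here is a small sketch in plain Python (no Spark needed; the records are invented for illustration):

```python
import json

# The same two records in both layouts.
records = [
    {"DEST_COUNTRY_NAME": "United States", "count": 15},
    {"DEST_COUNTRY_NAME": "Ireland", "count": 344},
]

# Multiline JSON: one array spanning the whole file. Appending a
# record means parsing and rewriting the entire document.
multiline = json.dumps(records, indent=2)

# Line-delimited JSON: one object per line. Appending a record is a
# simple write to the end of the file, and each line parses on its
# own -- this is what Spark's JSON reader expects by default
# (multiLine=false).
line_delimited = "\n".join(json.dumps(r) for r in records)

parsed = [json.loads(line) for line in line_delimited.splitlines()]
assert parsed == records
```

Because every line is independent, a reader can also split a large line-delimited file across many tasks without first parsing the whole document.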
JSON Options
Table 9-4 lists the options available for the JSON object, along with their descriptions.
Table 9-4. JSON data source options

Read/write | Key | Potential values | Default | Description
Both | compression or codec | none, uncompressed, bzip2, deflate, gzip, lz4, or snappy | none | Declares what compression codec Spark should use to read or write the file.
Both | dateFormat | Any string or character that conforms to Java's SimpleDateFormat | yyyy-MM-dd | Declares the date format for any columns that are date type.
Both | timestampFormat | Any string or character that conforms to Java's SimpleDateFormat | yyyy-MM-dd'T'HH:mm:ss.SSSZZ | Declares the timestamp format for any columns that are timestamp type.
Read | primitiveAsString | true, false | false | Infers all primitive values as string type.
Read | allowComments | true, false | false | Ignores Java/C++-style comments in JSON records.
Read | allowUnquotedFieldNames | true, false | false | Allows unquoted JSON field names.
Read | allowSingleQuotes | true, false | true | Allows single quotes in addition to double quotes.
Read | allowNumericLeadingZeros | true, false | false | Allows leading zeros in numbers (e.g., 00012).
Read | allowBackslashEscapingAnyCharacter | true, false | false | Allows accepting quoting of all characters using the backslash quoting mechanism.
Read | columnNameOfCorruptRecord | Any string | Value of spark.sql.columnNameOfCorruptRecord | Allows renaming the new field containing the malformed string created by permissive mode. This will override the configuration value.
Read | multiLine | true, false | false | Allows for reading in non-line-delimited JSON files.
Now, reading a line-delimited JSON file varies only in the format and the options that we specify:
spark.read.format("json")
Reading JSON Files
Let's look at an example of reading a JSON file and compare the options that we're seeing:
// in Scala
spark.read.format("json").option("mode", "FAILFAST").schema(myManualSchema)
.load("/data/flight-data/json/2010-summary.json").show(5)
# in Python
spark.read.format("json").option("mode", "FAILFAST")\
.option("inferSchema", "true")\
.load("/data/flight-data/json/2010-summary.json").show(5)
Writing JSON Files
Writing JSON files is just as simple as reading them, and, as you might expect, the data source does
not matter. Therefore, we can reuse the CSV DataFrame that we created earlier to be the source for
our JSON file. This, too, follows the rules that we specified before: one file per partition will be
written out, and the entire DataFrame will be written out as a folder. It will also have one JSON
object per line:
// in Scala
csvFile.write.format("json").mode("overwrite").save("/tmp/my-json-file.json")
# in Python
csvFile.write.format("json").mode("overwrite").save("/tmp/my-json-file.json")
$ ls /tmp/my-json-file.json/
/tmp/my-json-file.json/part-00000-tid-543....json
Parquet Files
Parquet is an open source column-oriented data store that provides a variety of storage optimizations,
especially for analytics workloads. It provides columnar compression, which saves storage space
and allows for reading individual columns instead of entire files. It is a file format that works
exceptionally well with Apache Spark and is in fact the default file format. We recommend writing
data out to Parquet for long-term storage because reading from a Parquet file will always be more
efficient than JSON or CSV. Another advantage of Parquet is that it supports complex types. This
means that if your column is an array (which would fail with a CSV file, for example), map, or struct,
you'll still be able to read and write that file without issue. Here's how to specify Parquet as the read
format:
spark.read.format("parquet")
Reading Parquet Files
Parquet has very few options because it enforces its own schema when storing data. Thus, all you
need to set is the format and you are good to go. We can set the schema if we have strict requirements
for what our DataFrame should look like. Oftentimes this is not necessary because we can use
schema on read, which is similar to the inferSchema with CSV files. However, with Parquet files,
this method is more powerful because the schema is built into the file itself (so no inference needed).
Here are some simple examples reading from Parquet:
// in Scala
spark.read.format("parquet")
.load("/data/flight-data/parquet/2010-summary.parquet").show(5)
# in Python
spark.read.format("parquet")\
.load("/data/flight-data/parquet/2010-summary.parquet").show(5)
Parquet options
As we just mentioned, there are very few Parquet options—precisely two, in fact—because it has a
well-defined specification that aligns closely with the concepts in Spark. Table 9-5 presents the
options.
WARNING
Even though there are only two options, you can still encounter problems if you’re working with incompatible Parquet files.
Be careful when you write out Parquet files with different versions of Spark (especially older ones) because this can cause
significant headache.
Table 9-5. Parquet data source options

Read/Write | Key | Potential values | Default | Description
Write | compression or codec | none, uncompressed, bzip2, deflate, gzip, lz4, or snappy | none | Declares what compression codec Spark should use to read or write the file.
Read | mergeSchema | true, false | Value of the configuration spark.sql.parquet.mergeSchema | You can incrementally add columns to newly written Parquet files in the same table/folder. Use this option to enable or disable this feature.
Writing Parquet Files
Writing Parquet is as easy as reading it. We simply specify the location for the file. The same
partitioning rules apply:
// in Scala
csvFile.write.format("parquet").mode("overwrite")
.save("/tmp/my-parquet-file.parquet")
# in Python
csvFile.write.format("parquet").mode("overwrite")\
.save("/tmp/my-parquet-file.parquet")
ORC Files
ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is
optimized for large streaming reads, but with integrated support for finding required rows quickly.
ORC actually has no options for reading in data because Spark understands the file format quite well.
An often-asked question is: What is the difference between ORC and Parquet? For the most part,
they're quite similar; the fundamental difference is that Parquet is further optimized for use with
Spark, whereas ORC is further optimized for Hive.
Reading ORC Files
Here’s how to read an ORC file into Spark:
// in Scala
spark.read.format("orc").load("/data/flight-data/orc/2010-summary.orc").show(5)
# in Python
spark.read.format("orc").load("/data/flight-data/orc/2010-summary.orc").show(5)
Writing ORC Files
At this point in the chapter, you should feel pretty comfortable taking a guess at how to write ORC
files. It really follows the exact same pattern that we have seen so far, in which we specify the format
and then save the file:
// in Scala
csvFile.write.format("orc").mode("overwrite").save("/tmp/my-orc-file.orc")
# in Python
csvFile.write.format("orc").mode("overwrite").save("/tmp/my-orc-file.orc")
SQL Databases
SQL data sources are one of the more powerful connectors because there are a variety of systems to
which you can connect (as long as that system speaks SQL). For instance, you can connect to a MySQL
database, a PostgreSQL database, or an Oracle database. You also can connect to SQLite, which is
what we'll do in this example. Of course, databases aren't just a set of raw files, so there are more
options to consider regarding how you connect to the database. Namely, you're going to need to
consider things like authentication and connectivity (you'll need to determine whether the network
of your Spark cluster is connected to the network of your database system).
To avoid the distraction of setting up a database for the purposes of this book, we provide a reference
sample that runs on SQLite. We can skip a lot of these details by using SQLite, because it can work
with minimal setup on your local machine with the limitation of not being able to work in a
distributed setting. If you want to work through these examples in a distributed setting, you'll want to
connect to another kind of database.
A PRIMER ON SQLITE
SQLite is the most used database engine in the entire world, and for good reason. It's powerful,
fast, and easy to understand. This is because a SQLite database is just a file. That's going to make
it very easy for you to get up and running because we include the source file in the official
repository for this book. Simply download that file to your local machine, and you will be able to
read from it and write to it. We're using SQLite, but all of the code here works with more
traditional relational databases, as well, like MySQL. The primary difference is in the properties
that you include when you connect to the database. When we're working with SQLite, there's no
notion of user or password.
WARNING
Although SQLite makes for a good reference example, it's probably not what you want to use in production. Also, SQLite
will not necessarily work well in a distributed setting because of its requirement to lock the entire database on write. The
example we present here will work in a similar way using MySQL or PostgreSQL, as well.
To read and write from these databases, you need to do two things: include the Java Database
Connectivity (JDBC) driver for your particular database on the Spark classpath, and provide the
proper JAR for the driver itself. For example, to be able to read and write from PostgreSQL, you
might run something like this:
./bin/spark-shell \
--driver-class-path postgresql-9.4.1207.jar \
--jars postgresql-9.4.1207.jar
Just as with our other sources, there are a number of options that are available when reading from and
writing to SQL databases. Only some of these are relevant for our current example, but Table 9-6 lists
all of the options that you can set when working with JDBC databases.
Table 9-6. JDBC data source options

url: The JDBC URL to which to connect. The source-specific connection properties can be specified in the URL; for example, jdbc:postgresql://localhost/test?user=fred&password=secret.

dbtable: The JDBC table to read. Note that anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses.

driver: The class name of the JDBC driver to use to connect to this URL.

partitionColumn, lowerBound, upperBound: If any one of these options is specified, then all others must be set as well. In addition, numPartitions must be specified. These properties describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric column from the table in question. Notice that lowerBound and upperBound are used only to decide the partition stride, not for filtering the rows in the table. Thus, all rows in the table will be partitioned and returned. These options apply only to reading.

numPartitions: The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.

fetchsize: The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers that default to a low fetch size (e.g., Oracle with 10 rows). This option applies only to reading.

batchsize: The JDBC batch size, which determines how many rows to insert per round trip. This can help performance on JDBC drivers. This option applies only to writing. The default is 1000.

isolationLevel: The transaction isolation level, which applies to the current connection. It can be one of NONE, READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, or SERIALIZABLE, corresponding to standard transaction isolation levels defined by JDBC's Connection object. The default is READ_UNCOMMITTED. This option applies only to writing. For more information, refer to the documentation in java.sql.Connection.

truncate: This is a JDBC writer-related option. When SaveMode.Overwrite is enabled, Spark truncates an existing table instead of dropping and re-creating it. This can be more efficient, and it prevents the table metadata (e.g., indices) from being removed. However, it will not work in some cases, such as when the new data has a different schema. The default is false. This option applies only to writing.

createTableOptions: This is a JDBC writer-related option. If specified, this option allows setting of database-specific table and partition options when creating a table (e.g., CREATE TABLE t (name string) ENGINE=InnoDB). This option applies only to writing.

createTableColumnTypes: The database column data types to use instead of the defaults when creating the table. Data type information should be specified in the same format as CREATE TABLE columns syntax (e.g., "name CHAR(64), comments VARCHAR(1024)"). The specified types should be valid Spark SQL data types. This option applies only to writing.
Reading from SQL Databases
When it comes to reading a file, SQL databases are no different from the other data sources that we
looked at earlier. As with those sources, we specify the format and options, and then load in the data:
// in Scala
val driver = "org.sqlite.JDBC"
val path = "/data/flight-data/jdbc/my-sqlite.db"
val url = s"jdbc:sqlite:${path}"
val tablename = "flight_info"
# in Python
driver = "org.sqlite.JDBC"
path = "/data/flight-data/jdbc/my-sqlite.db"
url = "jdbc:sqlite:" + path
tablename = "flight_info"
After you have defined the connection properties, you can test your connection to the database itself to
ensure that it is functional. This is an excellent troubleshooting technique to confirm that your
database is available to (at the very least) the Spark driver. This is much less relevant for SQLite
because that is a file on your machine, but if you were using something like MySQL, you could test the
connection with the following:
import java.sql.DriverManager
val connection = DriverManager.getConnection(url)
connection.isClosed()
connection.close()
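The snippet above is Scala and goes through java.sql.DriverManager. If you want to sanity-check a SQLite file from Python before involving Spark, a rough equivalent uses the standard library sqlite3 module. Note this only confirms that the driver machine can open the file, not that Spark's JDBC path works; the path below is a throwaway temp file standing in for the book's my-sqlite.db:

```python
import os
import sqlite3
import tempfile

# Hypothetical local path standing in for the book's my-sqlite.db.
path = os.path.join(tempfile.mkdtemp(), "my-sqlite.db")

# Connecting creates the file if it does not exist; a successful
# connect-and-query round trip confirms the file is reachable and
# the SQLite engine can read it.
connection = sqlite3.connect(path)
assert connection.execute("SELECT 1").fetchone() == (1,)
connection.close()
```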
If this connection succeeds, you're good to go. Let's go ahead and read the DataFrame from the SQL
table:
// in Scala
val dbDataFrame = spark.read.format("jdbc").option("url", url)
.option("dbtable", tablename).option("driver", driver).load()
# in Python
dbDataFrame = spark.read.format("jdbc").option("url", url)\
.option("dbtable", tablename).option("driver", driver).load()
SQLite has rather simple configurations (no users, for example). Other databases, like PostgreSQL,
require more configuration parameters. Let’s perform the same read that we just performed, except
using PostgreSQL this time:
// in Scala
val pgDF = spark.read
.format("jdbc")
.option("driver", "org.postgresql.Driver")
.option("url", "jdbc:postgresql://database_server")
.option("dbtable", "schema.tablename")
.option("user", "username").option("password","my-secret-password").load()
# in Python
pgDF = spark.read.format("jdbc")\
.option("driver", "org.postgresql.Driver")\
.option("url", "jdbc:postgresql://database_server")\
.option("dbtable", "schema.tablename")\
.option("user", "username").option("password", "my-secret-password").load()
As we create this DataFrame, it is no different from any other: you can query it, transform it, and join
it without issue. You'll also notice that there is already a schema. That's because Spark
gathers this information from the table itself and maps the types to Spark data types. Let's get only the
distinct locations to verify that we can query it as expected:
dbDataFrame.select("DEST_COUNTRY_NAME").distinct().show(5)
+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
| Anguilla|
| Russia|
| Paraguay|
| Senegal|
| Sweden|
+-----------------+
Awesome, we can query the database! Before we proceed, there are a couple of nuanced details that
are worth understanding.
Query Pushdown
First, Spark makes a best-effort attempt to filter data in the database itself before creating the
DataFrame. For example, in the previous sample query, we can see from the query plan that it selects
only the relevant column name from the table:
dbDataFrame.select("DEST_COUNTRY_NAME").distinct().explain
== Physical Plan ==
*HashAggregate(keys=[DEST_COUNTRY_NAME#8108], functions=[])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#8108, 200)
+- *HashAggregate(keys=[DEST_COUNTRY_NAME#8108], functions=[])
+- *Scan JDBCRelation(flight_info) [numPartitions=1] ...
Spark can actually do better than this on certain queries. For example, if we specify a filter on our
DataFrame, Spark will push that filter down into the database. We can see this in the explain plan
under PushedFilters.
// in Scala
dbDataFrame.filter("DEST_COUNTRY_NAME in ('Anguilla', 'Sweden')").explain
# in Python
dbDataFrame.filter("DEST_COUNTRY_NAME in ('Anguilla', 'Sweden')").explain()
== Physical Plan ==
*Scan JDBCRel... PushedFilters: [*In(DEST_COUNTRY_NAME, [Anguilla,Sweden])],
...
Spark can't translate all of its own functions into the functions available in the SQL database in which
you're working. Therefore, sometimes you're going to want to pass an entire query into your SQL
database that will return the results as a DataFrame. Now, this might seem like it's a bit complicated, but it's
actually quite straightforward. Rather than specifying a table name, you just specify a SQL query. Of
course, you do need to specify this in a special way; you must wrap the query in parentheses and
rename it to something (in this case, I just gave it the same table name):
// in Scala
val pushdownQuery = """(SELECT DISTINCT(DEST_COUNTRY_NAME) FROM flight_info)
AS flight_info"""
val dbDataFrame = spark.read.format("jdbc")
.option("url", url).option("dbtable", pushdownQuery).option("driver", driver)
.load()
# in Python
pushdownQuery = """(SELECT DISTINCT(DEST_COUNTRY_NAME) FROM flight_info)
AS flight_info"""
dbDataFrame = spark.read.format("jdbc")\
.option("url", url).option("dbtable", pushdownQuery).option("driver", driver)\
.load()
Now when you query this table, you'll actually be querying the results of that query. We can see this
in the explain plan. Spark doesn't even know about the actual schema of the table, just the one that
results from our previous query:
dbDataFrame.explain()
== Physical Plan ==
*Scan JDBCRelation(
(SELECT DISTINCT(DEST_COUNTRY_NAME)
FROM flight_info) as flight_info
) [numPartitions=1] [DEST_COUNTRY_NAME#788] ReadSchema: ...
Reading from databases in parallel
All throughout this book, we have talked about partitioning and its importance in data processing.
Spark has an underlying algorithm that can read multiple files into one partition, or conversely, read
multiple partitions out of one file, depending on the file size and the “splitability” of the file type and
compression. The same flexibility that exists with files also exists with SQL databases, except that
you must configure it a bit more manually. What you can configure, as seen in the previous options, is
the ability to specify a maximum number of partitions to allow you to limit how much you are reading
and writing in parallel:
// in Scala
val dbDataFrame = spark.read.format("jdbc")
.option("url", url).option("dbtable", tablename).option("driver", driver)
.option("numPartitions", 10).load()
# in Python
dbDataFrame = spark.read.format("jdbc")\
.option("url", url).option("dbtable", tablename).option("driver", driver)\
.option("numPartitions", 10).load()
In this case, this will still remain as one partition because there is not too much data. However, this
configuration can help you ensure that you do not overwhelm the database when reading and writing
data:
dbDataFrame.select("DEST_COUNTRY_NAME").distinct().show()
There are several other optimizations that unfortunately only seem to be available under another API set. You
can explicitly push predicates down into SQL databases through the connection itself. This
optimization allows you to control the physical location of certain data in certain partitions by
specifying predicates. That's a mouthful, so let's look at a simple example. We only need data from
two countries in our data: Anguilla and Sweden. We could filter these down and have them pushed
into the database, but we can also go further by having them arrive in their own partitions in Spark.
We do that by specifying a list of predicates when we create the data source:
// in Scala
val props = new java.util.Properties
props.setProperty("driver", "org.sqlite.JDBC")
val predicates = Array(
"DEST_COUNTRY_NAME = 'Sweden' OR ORIGIN_COUNTRY_NAME = 'Sweden'",
"DEST_COUNTRY_NAME = 'Anguilla' OR ORIGIN_COUNTRY_NAME = 'Anguilla'")
spark.read.jdbc(url, tablename, predicates, props).show()
spark.read.jdbc(url, tablename, predicates, props).rdd.getNumPartitions // 2
# in Python
props = {"driver":"org.sqlite.JDBC"}
predicates = [
"DEST_COUNTRY_NAME = 'Sweden' OR ORIGIN_COUNTRY_NAME = 'Sweden'",
"DEST_COUNTRY_NAME = 'Anguilla' OR ORIGIN_COUNTRY_NAME = 'Anguilla'"]
spark.read.jdbc(url, tablename, predicates=predicates, properties=props).show()
spark.read.jdbc(url,tablename,predicates=predicates,properties=props)\
.rdd.getNumPartitions() # 2
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
| Sweden| United States| 65|
| United States| Sweden| 73|
| Anguilla| United States| 21|
| United States| Anguilla| 20|
+-----------------+-------------------+-----+
If you specify predicates that are not disjoint, you can end up with lots of duplicate rows. Here’s an
example set of predicates that will result in duplicate rows:
// in Scala
val props = new java.util.Properties
props.setProperty("driver", "org.sqlite.JDBC")
val predicates = Array(
"DEST_COUNTRY_NAME != 'Sweden' OR ORIGIN_COUNTRY_NAME != 'Sweden'",
"DEST_COUNTRY_NAME != 'Anguilla' OR ORIGIN_COUNTRY_NAME != 'Anguilla'")
spark.read.jdbc(url, tablename, predicates, props).count() // 510
# in Python
props = {"driver":"org.sqlite.JDBC"}
predicates = [
"DEST_COUNTRY_NAME != 'Sweden' OR ORIGIN_COUNTRY_NAME != 'Sweden'",
"DEST_COUNTRY_NAME != 'Anguilla' OR ORIGIN_COUNTRY_NAME != 'Anguilla'"]
spark.read.jdbc(url, tablename, predicates=predicates, properties=props).count()
Partitioning based on a sliding window
Let's take a look to see how we can partition based on predicates. In this example, we'll partition
based on our numerical count column. Here, we specify a minimum and a maximum for both the first
partition and last partition. Anything outside of these bounds will be in the first partition or final
partition. Then, we set the number of partitions we would like total (this is the level of parallelism).
Spark then queries our database in parallel and returns numPartitions partitions. We simply modify
the upper and lower bounds in order to place certain values in certain partitions. No filtering is taking
place like we saw in the previous example:
// in Scala
val colName = "count"
val lowerBound = 0L
val upperBound = 348113L // this is the max count in our database
val numPartitions = 10
# in Python
colName = "count"
lowerBound = 0
upperBound = 348113  # this is the max count in our database
numPartitions = 10
This will distribute the intervals equally from low to high:
// in Scala
spark.read.jdbc(url,tablename,colName,lowerBound,upperBound,numPartitions,props)
.count() // 255
# in Python
spark.read.jdbc(url, tablename, column=colName, properties=props,
lowerBound=lowerBound, upperBound=upperBound,
numPartitions=numPartitions).count() # 255
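As a rough sketch of what happens under the hood, the reader computes a stride of (upperBound - lowerBound) / numPartitions and generates one WHERE clause per partition. The following is a simplified model, not Spark's exact implementation (for instance, Spark's real first clause also routes NULL values into the first partition):

```python
def partition_clauses(column, lower_bound, upper_bound, num_partitions):
    """Simplified model of how a JDBC reader splits a numeric column
    into per-partition WHERE clauses. The first and last clauses are
    open-ended, so rows outside the bounds are still read (the bounds
    set the stride; they do not filter rows)."""
    stride = (upper_bound - lower_bound) // num_partitions
    clauses = []
    for i in range(num_partitions):
        lo = lower_bound + i * stride
        hi = lower_bound + (i + 1) * stride
        if i == 0:
            clauses.append(f"{column} < {hi}")
        elif i == num_partitions - 1:
            clauses.append(f"{column} >= {lo}")
        else:
            clauses.append(f"{column} >= {lo} AND {column} < {hi}")
    return clauses

clauses = partition_clauses("count", 0, 348113, 10)
assert len(clauses) == 10
assert clauses[0] == "count < 34811"
assert clauses[-1] == "count >= 313299"
```

Each clause becomes a separate query issued by a separate task, which is why shifting the bounds shifts which values land in which partition without dropping any rows.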
Writing to SQL Databases
Writing out to SQL databases is just as easy as before. You simply specify the URI and write out the
data according to the specified write mode that you want. In the following example, we specify
overwrite, which overwrites the entire table. We’ll use the CSV DataFrame that we defined earlier in
order to do this:
// in Scala
val newPath = "jdbc:sqlite://tmp/my-sqlite.db"
csvFile.write.mode("overwrite").jdbc(newPath, tablename, props)
# in Python
newPath = "jdbc:sqlite://tmp/my-sqlite.db"
csvFile.write.jdbc(newPath, tablename, mode="overwrite", properties=props)
Lets look at the results:
// in Scala
spark.read.jdbc(newPath, tablename, props).count() // 255
# in Python
spark.read.jdbc(newPath, tablename, properties=props).count() # 255
Of course, we can append to this new table just as easily:
// in Scala
csvFile.write.mode("append").jdbc(newPath, tablename, props)
# in Python
csvFile.write.jdbc(newPath, tablename, mode="append", properties=props)
Notice that count increases:
// in Scala
spark.read.jdbc(newPath, tablename, props).count() // 765
# in Python
spark.read.jdbc(newPath, tablename, properties=props).count() # 765
Text Files
Spark also allows you to read in plain-text files. Each line in the file becomes a record in the
DataFrame. It is then up to you to transform it accordingly. As an example of how you would do this,
suppose that you need to parse some Apache log files to some more structured format, or perhaps you
want to parse some plain text for natural-language processing. Text files make a great argument for the
Dataset API due to its ability to take advantage of the flexibility of native types.
Reading Text Files
Reading text files is straightforward: you simply specify the type to be textFile. With textFile,
partitioned directory names are ignored. To read and write text files according to partitions, you
should use text, which respects partitioning on reading and writing:
spark.read.textFile("/data/flight-data/csv/2010-summary.csv")
.selectExpr("split(value, ',') as rows").show()
+--------------------+
| rows|
+--------------------+
|[DEST_COUNTRY_NAM...|
|[United States, R...|
...
|[United States, A...|
|[Saint Vincent an...|
|[Italy, United St...|
+--------------------+
Writing Text Files
When you write a text file, you need to be sure to have only one string column; otherwise, the write
will fail:
csvFile.select("DEST_COUNTRY_NAME").write.text("/tmp/simple-text-file.txt")
If you perform some partitioning when performing your write (we'll discuss partitioning in the next
couple of pages), you can write more columns. However, those columns will manifest as directories
in the folder to which you're writing, instead of columns on every single file:
// in Scala
csvFile.limit(10).select("DEST_COUNTRY_NAME", "count")
.write.partitionBy("count").text("/tmp/five-csv-files2.csv")
# in Python
csvFile.limit(10).select("DEST_COUNTRY_NAME", "count")\
.write.partitionBy("count").text("/tmp/five-csv-files2py.csv")
Advanced I/O Concepts
We saw previously that we can control the parallelism of files that we write by controlling the
partitions prior to writing. We can also control specific data layout by controlling two things:
bucketing and partitioning (discussed momentarily).
Splittable File Types and Compression
Certain file formats are fundamentally “splittable.” This can improve speed because it makes it
possible for Spark to avoid reading an entire file, and access only the parts of the file necessary to
satisfy your query. Additionally, if you're using something like the Hadoop Distributed File System
(HDFS), splitting a file can provide further optimization if that file spans multiple blocks. In
conjunction with this is a need to manage compression. Not all compression schemes are splittable.
How you store your data is of immense consequence when it comes to making your Spark jobs run
smoothly. We recommend Parquet with gzip compression.
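As a rough guide, here is a sketch of how common codecs behave when applied to a raw (non-container) file; verify the details against your Spark version's documentation before relying on them:

```python
# Whether a codec permits parallel reads when applied to one raw file.
# Container formats like Parquet compress per column chunk internally,
# so the file stays splittable regardless of the codec chosen -- which
# is part of why Parquet plus gzip works well despite gzip itself not
# being splittable.
splittable_when_raw = {
    "uncompressed": True,
    "bzip2": True,    # block-oriented; readers can resume at block boundaries
    "gzip": False,    # one DEFLATE stream; must be read start to finish
    "snappy": False,  # raw snappy has no framing for random access
}

assert splittable_when_raw["gzip"] is False
assert splittable_when_raw["bzip2"] is True
```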
Reading Data in Parallel
Multiple executors cannot necessarily read from the same file at the same time, but they can read
different files at the same time. In general, this means that when you read from a folder with multiple
files in it, each one of those files will become a partition in your DataFrame and be read in by
available executors in parallel (with the remaining queueing up behind the others).
Writing Data in Parallel
The number of files or data written is dependent on the number of partitions the DataFrame has at the
time you write out the data. By default, one file is written per partition of the data. This means that
although we specify a “file,” it's actually a number of files within a folder, with the name of the
specified file, with one file per partition that is written.
For example, the following code
csvFile.repartition(5).write.format("csv").save("/tmp/multiple.csv")
will end up with five files inside of that folder, as you can see from the list call:
ls /tmp/multiple.csv
/tmp/multiple.csv/part-00000-767df509-ec97-4740-8e15-4e173d365a8b.csv
/tmp/multiple.csv/part-00001-767df509-ec97-4740-8e15-4e173d365a8b.csv
/tmp/multiple.csv/part-00002-767df509-ec97-4740-8e15-4e173d365a8b.csv
/tmp/multiple.csv/part-00003-767df509-ec97-4740-8e15-4e173d365a8b.csv
/tmp/multiple.csv/part-00004-767df509-ec97-4740-8e15-4e173d365a8b.csv
Partitioning
Partitioning is a tool that allows you to control what data is stored (and where) as you write it. When
you write a file to a partitioned directory (or table), you basically encode a column as a folder. What
this allows you to do is skip lots of data when you go to read it in later, allowing you to read in only
the data relevant to your problem instead of having to scan the complete dataset. These are supported
for all file-based data sources:
// in Scala
csvFile.limit(10).write.mode("overwrite").partitionBy("DEST_COUNTRY_NAME")
.save("/tmp/partitioned-files.parquet")
# in Python
csvFile.limit(10).write.mode("overwrite").partitionBy("DEST_COUNTRY_NAME")\
.save("/tmp/partitioned-files.parquet")
Upon writing, you get a list of folders in your Parquet “file”:
$ ls /tmp/partitioned-files.parquet
...
DEST_COUNTRY_NAME=Costa Rica/
DEST_COUNTRY_NAME=Egypt/
DEST_COUNTRY_NAME=Equatorial Guinea/
DEST_COUNTRY_NAME=Senegal/
DEST_COUNTRY_NAME=United States/
Each of these will contain Parquet files that contain that data where the previous predicate was true:
$ ls /tmp/partitioned-files.parquet/DEST_COUNTRY_NAME=Senegal/
part-00000-tid.....parquet
This is probably the lowest-hanging optimization that you can use when you have a table that readers
frequently filter by before manipulating. For instance, date is particularly common for a partition
because, downstream, often we want to look at only the previous week's data (instead of scanning the
entire list of records). This can provide massive speedups for readers.
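To see why this helps, here is a toy model (plain Python; the folder names mirror the listing style above and are invented for illustration) of how a reader can prune partition folders by name before opening any files:

```python
# Folder names produced by a partitioned write, one per column value.
folders = [
    "DEST_COUNTRY_NAME=Costa Rica",
    "DEST_COUNTRY_NAME=Egypt",
    "DEST_COUNTRY_NAME=Senegal",
    "DEST_COUNTRY_NAME=United States",
]

def prune(folders, column, value):
    """Toy model of partition pruning: an equality filter on the
    partition column selects folders by name, so files in the other
    folders are never opened at all."""
    prefix = f"{column}="
    return [f for f in folders
            if f.startswith(prefix) and f[len(prefix):] == value]

matched = prune(folders, "DEST_COUNTRY_NAME", "Senegal")
assert matched == ["DEST_COUNTRY_NAME=Senegal"]
```

The filter is answered from directory names alone, which is exactly the "skip lots of data" effect described above.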
Bucketing
Bucketing is another file organization approach with which you can control the data that is
specifically written to each file. This can help avoid shuffles later when you go to read the data
because data with the same bucket ID will all be grouped together into one physical partition. This
means that the data is prepartitioned according to how you expect to use that data later on, meaning
you can avoid expensive shuffles when joining or aggregating.
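Conceptually, the assignment works like this sketch (plain Python; Spark uses its own internal hash function, so this only illustrates the idea, not the exact bucket layout):

```python
import hashlib

def bucket_id(value, num_buckets):
    """Toy model of bucket assignment: a deterministic hash of the
    bucketing column modulo the bucket count. Equal values always
    land in the same bucket, and therefore in the same file."""
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % num_buckets

# Equal values map to the same bucket, so a later join or aggregation
# on this column finds matching rows co-located and needs no shuffle.
assert bucket_id(42, 10) == bucket_id(42, 10)
assert 0 <= bucket_id("anything", 10) < 10
```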
Rather than partitioning on a specific column (which might write out a ton of directories), it's
probably worthwhile to explore bucketing the data instead. This will create a certain number of files
and organize our data into those “buckets”:
val numberBuckets = 10
val columnToBucketBy = "count"
csvFile.write.format("parquet").mode("overwrite")
.bucketBy(numberBuckets, columnToBucketBy).saveAsTable("bucketedFiles")
$ ls /user/hive/warehouse/bucketedfiles/
part-00000-tid-1020575097626332666-8....parquet
part-00000-tid-1020575097626332666-8....parquet
part-00000-tid-1020575097626332666-8....parquet
...
Bucketing is supported only for Spark-managed tables. For more information on bucketing and
partitioning, watch this talk from Spark Summit 2017.
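Conceptually, each row’s bucket is chosen by hashing the bucket column and taking the result modulo the number of buckets, so equal values always land in the same file. The following plain-Python sketch uses the built-in hash function as a stand-in for Spark’s actual Murmur3-based hash (illustrative only):

```python
number_buckets = 10

def bucket_id(value, num_buckets):
    # Stand-in for Spark's Murmur3-based hash partitioning
    return hash(value) % num_buckets

# Equal bucket-column values always map to the same bucket, which is why a
# later join or aggregation on that column can avoid a shuffle.
assert bucket_id(15, number_buckets) == bucket_id(15, number_buckets)
assert all(0 <= bucket_id(c, number_buckets) < number_buckets
           for c in [1, 2, 3, 15, 348])
```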
Writing Complex Types
As we covered in Chapter 6, Spark has a variety of different internal types. Although Spark can work
with all of these types, not every single type works well with every data file format. For instance,
CSV files do not support complex types, whereas Parquet and ORC do.
Managing File Size
Managing file sizes is important not so much for writing data but for reading it later on. When
you’re writing lots of small files, there’s a significant metadata overhead that you incur managing all
of those files. Spark especially does not do well with small files, although many file systems (like
HDFS) don’t handle lots of small files well, either. You might hear this referred to as the “small file
problem.” The opposite is also true: you don’t want files that are too large either, because it becomes
inefficient to have to read entire blocks of data when you need only a few rows.
Spark 2.2 introduced a new method for controlling file sizes in a more automatic way. We saw
previously that the number of output files is a derivative of the number of partitions we had at write
time (and the partitioning columns we selected). Now, you can take advantage of another tool in order
to limit output file sizes so that you can target an optimum file size. You can use the
maxRecordsPerFile option and specify a number of your choosing. This allows you to better control
file sizes by controlling the number of records that are written to each file. For example, if you set an
option for a writer as df.write.option("maxRecordsPerFile", 5000), Spark will ensure that
files will contain at most 5,000 records.
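The effect of this option can be sketched in a few lines of plain Python: one partition’s records are simply split into chunks of at most the configured size, one chunk per output file (a conceptual illustration, not Spark’s code):

```python
def split_into_files(records, max_records_per_file):
    # Yield one chunk per output file, each at most max_records_per_file long
    for start in range(0, len(records), max_records_per_file):
        yield records[start:start + max_records_per_file]

# 12,000 records with maxRecordsPerFile = 5,000 produce three files
files = list(split_into_files(list(range(12000)), 5000))
assert [len(f) for f in files] == [5000, 5000, 2000]
```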
Conclusion
In this chapter we discussed the variety of options available to you for reading and writing data in
Spark. This covers nearly everything you’ll need to know as an everyday user of Spark. For the
curious, there are ways of implementing your own data source; however, we omitted instructions for
how to do this because the API is currently evolving to better support Structured Streaming. If you’re
interested in seeing how to implement your own custom data source, the Cassandra Connector is
well organized and maintained and could provide a reference for the adventurous.
In Chapter 10, we discuss Spark SQL and how it interoperates with everything else we’ve seen so far
in the Structured APIs.
Chapter 10. Spark SQL
Spark SQL is arguably one of the most important and powerful features in Spark. This chapter
introduces the core concepts in Spark SQL that you need to understand. This chapter will not rewrite
the ANSI-SQL specification or enumerate every single kind of SQL expression. If you read any other
parts of this book, you will notice that we try to include SQL code wherever we include DataFrame
code to make it easy to cross-reference with code samples. Other examples are available in the
appendix and reference sections.
In a nutshell, with Spark SQL you can run SQL queries against views or tables organized into
databases. You also can use system functions or define user functions and analyze query plans in
order to optimize their workloads. This integrates directly into the DataFrame and Dataset API, and
as we saw in previous chapters, you can choose to express some of your data manipulations in SQL
and others in DataFrames and they will compile to the same underlying code.
What Is SQL?
SQL or Structured Query Language is a domain-specific language for expressing relational
operations over data. It is used in all relational databases, and many “NoSQL” databases create their
own SQL dialect in order to make working with their databases easier. SQL is everywhere, and even
though tech pundits have prophesied its death, it is an extremely resilient data tool that many businesses
depend on. Spark implements a subset of ANSI SQL:2003. This SQL standard is one that is available
in the majority of SQL databases, and this support means that Spark successfully runs the popular
benchmark TPC-DS.
Big Data and SQL: Apache Hive
Before Spark’s rise, Hive was the de facto big data SQL access layer. Originally developed at
Facebook, Hive became an incredibly popular tool across industry for performing SQL operations on
big data. In many ways it helped propel Hadoop into different industries because analysts could run
SQL queries. Although Spark began as a general processing engine with Resilient Distributed
Datasets (RDDs), a large cohort of users now use Spark SQL.
Big Data and SQL: Spark SQL
With the release of Spark 2.0, its authors created a superset of Hive’s support, writing a native SQL
parser that supports both ANSI-SQL as well as HiveQL queries. This, along with its unique
interoperability with DataFrames, makes it a powerful tool for all sorts of companies. For example,
in late 2016, Facebook announced that it had begun running Spark workloads and seeing large
benefits in doing so. In the words of the blog post’s authors:
We challenged Spark to replace a pipeline that decomposed to hundreds of Hive jobs into a
single Spark job. Through a series of performance and reliability improvements, we were able to
scale Spark to handle one of our entity ranking data processing use cases in production…. The
Spark-based pipeline produced significant performance improvements (4.5–6x CPU, 3–4x
resource reservation, and ~5x latency) compared with the old Hive-based pipeline, and it has
been running in production for several months.
The power of Spark SQL derives from several key facts: SQL analysts can now take advantage of
Spark’s computation abilities by plugging into the Thrift Server or Spark’s SQL interface, whereas
data engineers and scientists can use Spark SQL where appropriate in any data flow. This unifying
API allows for data to be extracted with SQL, manipulated as a DataFrame, passed into one of Spark
MLlib’s large-scale machine learning algorithms, written out to another data source, and everything in
between.
NOTE
Spark SQL is intended to operate as an online analytic processing (OLAP) database, not an online transaction processing
(OLTP) database. This means that it is not intended to perform extremely low-latency queries. Even though support for in-
place modifications is sure to be something that comes up in the future, it’s not something that is currently available.
Spark’s Relationship to Hive
Spark SQL has a great relationship with Hive because it can connect to Hive metastores. The Hive
metastore is the way in which Hive maintains table information for use across sessions. With Spark
SQL, you can connect to your Hive metastore (if you already have one) and access table metadata to
reduce file listing when accessing information. This is popular for users who are migrating from a
legacy Hadoop environment and beginning to run all their workloads using Spark.
The Hive metastore
To connect to the Hive metastore, there are several properties that you’ll need. First, you need to set
the metastore version (spark.sql.hive.metastore.version) to correspond to the proper Hive
metastore that you’re accessing. By default, this value is 1.2.1. You also need to set
spark.sql.hive.metastore.jars if you’re going to change the way that the
HiveMetastoreClient is initialized. Spark uses the default versions, but you can also specify
Maven repositories or a classpath in the standard format for the Java Virtual Machine (JVM). In
addition, you might need to supply proper class prefixes in order to communicate with different
databases that store the Hive metastore. You’ll set these as shared prefixes that both Spark and Hive
will share (spark.sql.hive.metastore.sharedPrefixes).
If you’re connecting to your own metastore, it’s worth checking the documentation for further updates
and more information.
How to Run Spark SQL Queries
Spark provides several interfaces to execute SQL queries.
Spark SQL CLI
The Spark SQL CLI is a convenient tool with which you can make basic Spark SQL queries in local
mode from the command line. Note that the Spark SQL CLI cannot communicate with the Thrift JDBC
server. To start the Spark SQL CLI, run the following in the Spark directory:
./bin/spark-sql
You configure Hive by placing your hive-site.xml, core-site.xml, and hdfs-site.xml files in conf/. For
a complete list of all available options, you can run ./bin/spark-sql --help.
Spark’s Programmatic SQL Interface
In addition to setting up a server, you can also execute SQL in an ad hoc manner via any of Spark’s
language APIs. You can do this via the method sql on the SparkSession object. This returns a
DataFrame, as we will see later in this chapter. For example, in Python or Scala, we can run the
following:
spark.sql("SELECT 1 + 1").show()
The command spark.sql("SELECT 1 + 1") returns a DataFrame that we can then evaluate
programmatically. Just like other transformations, this will not be executed eagerly but lazily. This is
an immensely powerful interface because there are some transformations that are much simpler to
express in SQL code than in DataFrames.
You can express multiline queries quite simply by passing a multiline string into the function. For
example, you could execute something like the following code in Python or Scala:
spark.sql("""SELECT user_id, department, first_name FROM professors
WHERE department IN
(SELECT name FROM department WHERE created_date >= '2016-01-01')""")
Even more powerful, you can completely interoperate between SQL and DataFrames, as you see fit.
For instance, you can create a DataFrame, manipulate it with SQL, and then manipulate it again as a
DataFrame. It’s a powerful abstraction that you will likely find yourself using quite a bit:
// in Scala
spark.read.json("/data/flight-data/json/2015-summary.json")
.createOrReplaceTempView("some_sql_view") // DF => SQL
spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count)
FROM some_sql_view GROUP BY DEST_COUNTRY_NAME
""")
.where("DEST_COUNTRY_NAME like 'S%'").where("`sum(count)` > 10")
.count() // SQL => DF
# in Python
spark.read.json("/data/flight-data/json/2015-summary.json")\
.createOrReplaceTempView("some_sql_view") # DF => SQL
spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count)
FROM some_sql_view GROUP BY DEST_COUNTRY_NAME
""")\
.where("DEST_COUNTRY_NAME like 'S%'").where("`sum(count)` > 10")\
.count() # SQL => DF
Spark SQL Thrift JDBC/ODBC Server
Spark provides a Java Database Connectivity (JDBC) interface by which either you or a remote
program connects to the Spark driver in order to execute Spark SQL queries. A common use case
might be for a business analyst to connect business intelligence software like Tableau to Spark. The
Thrift JDBC/Open Database Connectivity (ODBC) server implemented here corresponds to the
HiveServer2 in Hive 1.2.1. You can test the JDBC server with the beeline script that comes with
either Spark or Hive 1.2.1.
To start the JDBC/ODBC server, run the following in the Spark directory:
./sbin/start-thriftserver.sh
This script accepts all bin/spark-submit command-line options. To see all available options for
configuring this Thrift Server, run ./sbin/start-thriftserver.sh --help. By default, the server
listens on localhost:10000. You can override this through environment variables or system
properties.
For environment configuration, use this:
export HIVE_SERVER2_THRIFT_PORT=<listening-port>
export HIVE_SERVER2_THRIFT_BIND_HOST=<listening-host>
./sbin/start-thriftserver.sh \
--master <master-uri> \
...
For system properties:
./sbin/start-thriftserver.sh \
--hiveconf hive.server2.thrift.port=<listening-port> \
--hiveconf hive.server2.thrift.bind.host=<listening-host> \
--master <master-uri>
...
You can then test this connection by running the following commands:
./bin/beeline
beeline> !connect jdbc:hive2://localhost:10000
Beeline will ask you for a username and password. In nonsecure mode, simply type the username on
your machine and a blank password. For secure mode, follow the instructions given in the beeline
documentation.
Catalog
The highest level abstraction in Spark SQL is the Catalog. The Catalog is an abstraction for the
storage of metadata about the data stored in your tables as well as other helpful things like databases,
tables, functions, and views. The catalog is available in the
org.apache.spark.sql.catalog.Catalog package and contains a number of helpful functions for
doing things like listing tables, databases, and functions. We will talk about all of these things shortly.
The Catalog is fairly self-explanatory to users, so we will omit the code samples here, but it’s really
just another programmatic interface to Spark SQL. This chapter shows only the SQL being executed;
thus, if you’re using the programmatic interface, keep in mind that you need to wrap everything in a
spark.sql function call to execute the relevant code.
Tables
To do anything useful with Spark SQL, you first need to define tables. Tables are logically equivalent
to a DataFrame in that they are a structure of data against which you run commands. We can join
tables, filter them, aggregate them, and perform different manipulations that we saw in previous
chapters. The core difference between tables and DataFrames is this: you define DataFrames in the
scope of a programming language, whereas you define tables within a database. This means that when
you create a table (assuming you never changed the database), it will belong to the default database.
We discuss databases more fully later on in the chapter.
An important thing to note is that in Spark 2.X, tables always contain data. There is no notion of a
temporary table, only a view, which does not contain data. This is important because if you go to drop
a table, you can risk losing the data when doing so.
Spark-Managed Tables
One important note is the concept of managed versus unmanaged tables. Tables store two important
pieces of information: the data within the tables as well as the data about the tables, that is, the
metadata. You can have Spark manage the metadata for a set of files as well as for the data. When
you define a table from files on disk, you are defining an unmanaged table. When you use
saveAsTable on a DataFrame, you are creating a managed table for which Spark will keep track of
all of the relevant information.
This will read your table and write it out to a new location in Spark format. You can see this reflected
in the new explain plan. In the explain plan, you will also notice that this writes to the default Hive
warehouse location. You can set this by setting the spark.sql.warehouse.dir configuration to the
directory of your choosing when you create your SparkSession. By default, Spark sets this to
/user/hive/warehouse.
Note that when you list tables, a database is also listed. Spark also has databases, which we will
discuss later in this chapter, but for now you should keep in mind that you can also see tables in a
specific database by using the query SHOW TABLES IN databaseName, where databaseName
represents the name of the database that you want to query. If you are running on a new cluster or in
local mode, this should return zero results.
Creating Tables
You can create tables from a variety of sources. Something fairly unique to Spark is the capability of
reusing the entire Data Source API within SQL. This means that you do not need to define a table and
then load data into it; Spark lets you create one on the fly. You can even specify all sorts of
sophisticated options when you read in a file. For example, here’s a simple way to read in the flight
data we worked with in previous chapters:
CREATE TABLE flights (
DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count LONG)
USING JSON OPTIONS (path '/data/flight-data/json/2015-summary.json')
USING AND STORED AS
The specification of the USING syntax in the previous example is of significant importance. If you
do not specify the format, Spark will default to a Hive SerDe configuration. This has performance
implications for future readers and writers because Hive SerDes are much slower than Spark’s
native serialization. Hive users can also use the STORED AS syntax to specify that this should be a
Hive table.
You can also add comments to certain columns in a table, which can help other developers understand
the data in the tables:
CREATE TABLE flights_csv (
DEST_COUNTRY_NAME STRING,
ORIGIN_COUNTRY_NAME STRING COMMENT "remember, the US will be most prevalent",
count LONG)
USING csv OPTIONS (header true, path '/data/flight-data/csv/2015-summary.csv')
It is possible to create a table from a query as well:
CREATE TABLE flights_from_select USING parquet AS SELECT * FROM flights
In addition, you can specify to create a table only if it does not currently exist:
NOTE
In this example, we are creating a Hive-compatible table because we did not explicitly specify the format via USING. We
can also do the following:
CREATE TABLE IF NOT EXISTS flights_from_select
AS SELECT * FROM flights
Finally, you can control the layout of the data by writing out a partitioned dataset, as we saw in
Chapter 9:
CREATE TABLE partitioned_flights USING parquet PARTITIONED BY (DEST_COUNTRY_NAME)
AS SELECT DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count FROM flights LIMIT 5
These tables will be available in Spark even across sessions; temporary tables do not currently exist
in Spark. You must create a temporary view, which we demonstrate later in this chapter.
Creating External Tables
As we mentioned in the beginning of this chapter, Hive was one of the first big data SQL systems, and
Spark SQL is completely compatible with Hive SQL (HiveQL) statements. One of the use cases that
you might encounter is to port your legacy Hive statements to Spark SQL. Luckily, you can, for the
most part, just copy and paste your Hive statements directly into Spark SQL. For example, in the
example that follows, we create an unmanaged table: Spark will manage the table’s metadata;
however, the files are not managed by Spark at all. You create this table by using the CREATE
EXTERNAL TABLE statement.
You can create an external table from files that already exist on disk by running the following command:
CREATE EXTERNAL TABLE hive_flights (
DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count LONG)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/data/flight-data-hive/'
You can also create an external table from a select clause:
CREATE EXTERNAL TABLE hive_flights_2
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/flight-data-hive/' AS SELECT * FROM flights
Inserting into Tables
Insertions follow the standard SQL syntax:
INSERT INTO flights_from_select
SELECT DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count FROM flights LIMIT 20
You can optionally provide a partition specification if you want to write only into a certain partition.
Note that a write will respect a partitioning scheme, as well (which may cause the above query to run
quite slowly); however, it will add additional files only into the end partitions:
INSERT INTO partitioned_flights
PARTITION (DEST_COUNTRY_NAME="UNITED STATES")
SELECT count, ORIGIN_COUNTRY_NAME FROM flights
WHERE DEST_COUNTRY_NAME='UNITED STATES' LIMIT 12
Describing Table Metadata
We saw earlier that you can add a comment when creating a table. You can view this by describing
the table metadata, which will show us the relevant comment:
DESCRIBE TABLE flights_csv
You can also see the partitioning scheme for the data by using the following (note, however, that this
works only on partitioned tables):
SHOW PARTITIONS partitioned_flights
Refreshing Table Metadata
Maintaining table metadata is an important task to ensure that you’re reading from the most recent set
of data. There are two commands to refresh table metadata. REFRESH TABLE refreshes all cached
entries (essentially, files) associated with the table. If the table were previously cached, it would be
cached lazily the next time it is scanned:
REFRESH table partitioned_flights
Another related command is REPAIR TABLE, which refreshes the partitions maintained in the catalog
for that given table. This command’s focus is on collecting new partition information—an example
might be writing out a new partition manually and the need to repair the table accordingly:
MSCK REPAIR TABLE partitioned_flights
Dropping Tables
You cannot delete tables: you can only “drop” them. You can drop a table by using the DROP keyword.
If you drop a managed table (e.g., flights_csv), both the data and the table definition will be
removed:
DROP TABLE flights_csv;
WARNING
Dropping a table deletes the data in the table, so you need to be very careful when doing this.
If you try to drop a table that does not exist, you will receive an error. To drop a table only if it
already exists, use DROP TABLE IF EXISTS:
DROP TABLE IF EXISTS flights_csv;
WARNING
This deletes the data in the table, so exercise caution when doing this.
Dropping unmanaged tables
If you are dropping an unmanaged table (e.g., hive_flights), no data will be removed but you will
no longer be able to refer to this data by the table name.
Caching Tables
Just like DataFrames, you can cache and uncache tables. You simply specify which table you would
like using the following syntax:
CACHE TABLE flights
Here’s how you uncache them:
UNCACHE TABLE FLIGHTS
Views
Now that you have created a table, another thing that you can define is a view. A view specifies a set of
transformations on top of an existing table—basically just saved query plans, which can be
convenient for organizing or reusing your query logic. Spark has several different notions of views.
Views can be global, set to a database, or per session.
Creating Views
To an end user, views are displayed as tables, except rather than rewriting all of the data to a new
location, they simply perform a transformation on the source data at query time. This might be a
filter, select, or potentially an even larger GROUP BY or ROLLUP. For instance, in the following
example, we create a view in which the destination is United States in order to see only those
flights:
CREATE VIEW just_usa_view AS
SELECT * FROM flights WHERE dest_country_name = 'United States'
Like tables, you can create temporary views that are available only during the current session and are
not registered to a database:
CREATE TEMP VIEW just_usa_view_temp AS
SELECT * FROM flights WHERE dest_country_name = 'United States'
Or, it can be a global temp view. Global temp views are resolved regardless of database and are
viewable across the entire Spark application, but they are removed at the end of the session:
CREATE GLOBAL TEMP VIEW just_usa_global_view_temp AS
SELECT * FROM flights WHERE dest_country_name = 'United States'
SHOW TABLES
You can also specify that you would like to overwrite a view if one already exists by using the
keywords shown in the sample that follows. We can overwrite both temp views and regular views:
CREATE OR REPLACE TEMP VIEW just_usa_view_temp AS
SELECT * FROM flights WHERE dest_country_name = 'United States'
Now you can query this view just as if it were another table:
SELECT * FROM just_usa_view_temp
A view is effectively a transformation and Spark will perform it only at query time. This means that it
will only apply that filter after you actually go to query the table (and not earlier). Effectively, views
are equivalent to creating a new DataFrame from an existing DataFrame.
In fact, you can see this by comparing the query plans generated by Spark DataFrames and Spark
SQL. In DataFrames, we would write the following:
val flights = spark.read.format("json")
.load("/data/flight-data/json/2015-summary.json")
val just_usa_df = flights.where("dest_country_name = 'United States'")
just_usa_df.selectExpr("*").explain
In SQL, we would write (querying from our view) this:
EXPLAIN SELECT * FROM just_usa_view
Or, equivalently:
EXPLAIN SELECT * FROM flights WHERE dest_country_name = 'United States'
Due to this fact, you should feel comfortable writing your logic either in DataFrames or in SQL,
whichever is most comfortable and maintainable for you.
Dropping Views
You can drop views in the same way that you drop tables; you simply specify that what you intend to
drop is a view instead of a table. The main difference between dropping a view and dropping a table
is that with a view, no underlying data is removed, only the view definition itself:
DROP VIEW IF EXISTS just_usa_view;
Databases
Databases are a tool for organizing tables. As mentioned earlier, if you do not define one, Spark will
use the default database. Any SQL statements that you run from within Spark (including DataFrame
commands) execute within the context of a database. This means that if you change the database, any
user-defined tables will remain in the previous database and will need to be queried differently.
WARNING
This can be a source of confusion, especially if you’re sharing the same context or session with your coworkers, so be sure
to set your databases appropriately.
You can see all databases by using the following command:
SHOW DATABASES
Creating Databases
Creating databases follows the same patterns you’ve seen previously in this chapter; however, here
you use the CREATE DATABASE keywords:
CREATE DATABASE some_db
Setting the Database
You might want to set a database to perform a certain query. To do this, use the USE keyword
followed by the database name:
USE some_db
After you set this database, all queries will try to resolve table names to this database. Queries that
were working just fine might now fail or yield different results because you are in a different
database:
SHOW tables
SELECT * FROM flights -- fails with table/view not found
However, you can query different databases by using the correct prefix:
SELECT * FROM default.flights
You can see what database you’re currently using by running the following command:
SELECT current_database()
You can, of course, switch back to the default database:
USE default;
Dropping Databases
Dropping or removing databases is equally as easy: you simply use the DROP DATABASE keyword:
DROP DATABASE IF EXISTS some_db;
Select Statements
Queries in Spark support the following ANSI SQL requirements (here we list the layout of the
SELECT expression):
SELECT [ALL|DISTINCT] named_expression[, named_expression, ...]
FROM relation[, relation, ...]
[lateral_view[, lateral_view, ...]]
[WHERE boolean_expression]
[aggregation [HAVING boolean_expression]]
[ORDER BY sort_expressions]
[CLUSTER BY expressions]
[DISTRIBUTE BY expressions]
[SORT BY sort_expressions]
[WINDOW named_window[, WINDOW named_window, ...]]
[LIMIT num_rows]
named_expression:
: expression [AS alias]
relation:
| join_relation
| (table_name|query|relation) [sample] [AS alias]
: VALUES (expressions)[, (expressions), ...]
[AS (column_name[, column_name, ...])]
expressions:
: expression[, expression, ...]
sort_expressions:
: expression [ASC|DESC][, expression [ASC|DESC], ...]
case…when…then Statements
Oftentimes, you might need to conditionally replace values in your SQL queries. You can do this by
using a case...when...then...end style statement. This is essentially the equivalent of
programmatic if statements:
SELECT
CASE WHEN DEST_COUNTRY_NAME = 'UNITED STATES' THEN 1
WHEN DEST_COUNTRY_NAME = 'Egypt' THEN 0
ELSE -1 END
FROM partitioned_flights
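Row by row, this CASE expression behaves like a chained if/elif; the equivalent per-row logic in plain Python would be:

```python
def label(dest_country_name):
    # Mirrors: CASE WHEN ... THEN 1 WHEN ... THEN 0 ELSE -1 END
    if dest_country_name == "UNITED STATES":
        return 1
    elif dest_country_name == "Egypt":
        return 0
    else:
        return -1

assert [label(c) for c in ["UNITED STATES", "Egypt", "Senegal"]] == [1, 0, -1]
```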
Advanced Topics
Now that we have defined where data lives and how to organize it, let’s move on to querying it. A SQL
query is a SQL statement requesting that some set of commands be run. SQL statements can define
manipulations, definitions, or controls. The most common case is manipulations, which are the
focus of this book.
Complex Types
Complex types are a departure from standard SQL and an incredibly powerful feature that standard
SQL lacks. Understanding how to manipulate them appropriately in SQL is essential.
There are three core complex types in Spark SQL: structs, lists, and maps.
Structs
Structs are more akin to maps. They provide a way of creating or querying nested data in Spark. To
create one, you simply need to wrap a set of columns (or expressions) in parentheses:
CREATE VIEW IF NOT EXISTS nested_data AS
SELECT (DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME) as country, count FROM flights
Now, you can query this data to see what it looks like:
SELECT * FROM nested_data
You can even query individual columns within a struct—all you need to do is use dot syntax:
SELECT country.DEST_COUNTRY_NAME, count FROM nested_data
If you like, you can also select all the subvalues from a struct by using the struct’s name and select all
of the subcolumns. Although these aren’t truly subcolumns, it does provide a simpler way to think
about them because we can do everything that we like with them as if they were a column:
SELECT country.*, count FROM nested_data
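In plain-Python terms, a struct behaves like a named record nested inside a row: dot syntax is attribute access, and country.* expands every field back into top-level columns (an illustrative analogy, not Spark internals):

```python
from collections import namedtuple

# The struct (DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME) as a nested record
Country = namedtuple("Country", ["DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME"])
row = {"country": Country("United States", "Romania"), "count": 15}

# country.DEST_COUNTRY_NAME ~ dot syntax into the struct
assert row["country"].DEST_COUNTRY_NAME == "United States"

# country.* ~ expanding all of the struct's fields into columns
assert dict(row["country"]._asdict()) == {
    "DEST_COUNTRY_NAME": "United States",
    "ORIGIN_COUNTRY_NAME": "Romania",
}
```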
Lists
If you’re familiar with lists in programming languages, Spark SQL lists will feel familiar. There are
several ways to create an array or list of values. You can use the collect_list function, which
creates a list of values. You can also use the function collect_set, which creates an array without
duplicate values. These are both aggregation functions and therefore can be specified only in
aggregations:
SELECT DEST_COUNTRY_NAME as new_name, collect_list(count) as flight_counts,
collect_set(ORIGIN_COUNTRY_NAME) as origin_set
FROM flights GROUP BY DEST_COUNTRY_NAME
You can, however, also create an array manually within a column, as shown here:
SELECT DEST_COUNTRY_NAME, ARRAY(1, 2, 3) FROM flights
You can also query lists by position by using a Python-like array query syntax:
SELECT DEST_COUNTRY_NAME as new_name, collect_list(count)[0]
FROM flights GROUP BY DEST_COUNTRY_NAME
You can also do things like convert an array back into rows. You do this by using the explode
function. To demonstrate, let’s create a new view as our aggregation:
CREATE OR REPLACE TEMP VIEW flights_agg AS
SELECT DEST_COUNTRY_NAME, collect_list(count) as collected_counts
FROM flights GROUP BY DEST_COUNTRY_NAME
Now let’s explode the complex type to one row in our result for every value in the array. The
DEST_COUNTRY_NAME will duplicate for every value in the array, performing the exact opposite of the
original collect and returning us to the original DataFrame:
SELECT explode(collected_counts), DEST_COUNTRY_NAME FROM flights_agg
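The collect_list/explode round trip can be mimicked with ordinary Python collections: group the counts into one list per destination, then flatten each list back into rows, duplicating the key (made-up rows, illustrative only):

```python
from collections import defaultdict

rows = [("United States", 15), ("United States", 1), ("Egypt", 3)]

# collect_list: one list of counts per DEST_COUNTRY_NAME
collected = defaultdict(list)
for dest, count in rows:
    collected[dest].append(count)

# explode: one output row per list element, with the key duplicated
exploded = [(count, dest)
            for dest, counts in collected.items()
            for count in counts]
assert sorted(exploded) == sorted([(15, "United States"),
                                   (1, "United States"),
                                   (3, "Egypt")])
```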
Functions
In addition to complex types, Spark SQL provides a variety of sophisticated functions. You can find
most of these functions in the DataFrames function reference; however, it is worth understanding how
to find these functions in SQL, as well. To see a list of functions in Spark SQL, you use the SHOW
FUNCTIONS statement:
SHOW FUNCTIONS
You can also more specifically indicate whether you would like to see the system functions (i.e.,
those built into Spark) as well as user functions:
SHOW SYSTEM FUNCTIONS
User functions are those defined by you or someone else sharing your Spark environment. These are
the same user-defined functions that we talked about in earlier chapters (we will discuss how to
create them later on in this chapter):
SHOW USER FUNCTIONS
You can filter all SHOW commands by passing a string with wildcard (*) characters. Here, we can see
all functions that begin with “s”:
SHOW FUNCTIONS "s*";
Optionally, you can include the LIKE keyword, although this is not necessary:
SHOW FUNCTIONS LIKE "collect*";
Even though listing functions is certainly useful, often you might want to know more about specific
functions themselves. To do this, use the DESCRIBE keyword, which returns the documentation for a
specific function.
User-defined functions
As we saw in Chapters 3 and 4, Spark gives you the ability to define your own functions and use them
in a distributed manner. You can define functions, just as you did before, writing the function in the
language of your choice and then registering it appropriately:
def power3(number:Double):Double = number * number * number
spark.udf.register("power3", power3(_:Double):Double)
SELECT count, power3(count) FROM flights
You can also register functions through the Hive CREATE TEMPORARY FUNCTION syntax.
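As a sketch of that syntax, assuming you have a Hive UDF compiled and available on the classpath (the function name and implementing class here are hypothetical):

```sql
CREATE TEMPORARY FUNCTION myPower3 AS 'com.example.Power3UDF'
```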
Subqueries
With subqueries, you can specify queries within other queries. This makes it possible for you to
specify some sophisticated logic within your SQL. In Spark, there are two fundamental subqueries.
Correlated subqueries use some information from the outer scope of the query in order to supplement
information in the subquery. Uncorrelated subqueries include no information from the outer scope.
Each of these queries can return one (scalar subquery) or more values. Spark also includes support
for predicate subqueries, which allow for filtering based on values.
Uncorrelated predicate subqueries
For example, let’s take a look at a predicate subquery. This example is composed of two
uncorrelated queries. The first query is just to get the top five country destinations based on the data
we have:
SELECT dest_country_name FROM flights
GROUP BY dest_country_name ORDER BY sum(count) DESC LIMIT 5
This gives us the following result:
+-----------------+
|dest_country_name|
+-----------------+
| United States|
| Canada|
| Mexico|
| United Kingdom|
| Japan|
+-----------------+
Now we place this subquery inside of the filter and check to see if our origin country exists in that
list:
SELECT * FROM flights
WHERE origin_country_name IN (SELECT dest_country_name FROM flights
GROUP BY dest_country_name ORDER BY sum(count) DESC LIMIT 5)
This query is uncorrelated because it does not include any information from the outer scope of the
query. It’s a query that you can run on its own.
Correlated predicate subqueries
Correlated predicate subqueries allow you to use information from the outer scope in your inner
query. For example, if you want to see whether you have a flight that will take you back from your
destination country, you could do so by checking whether there is a flight that has the destination
country as an origin and a flight that had the origin country as a destination:
SELECT * FROM flights f1
WHERE EXISTS (SELECT 1 FROM flights f2
WHERE f1.dest_country_name = f2.origin_country_name)
AND EXISTS (SELECT 1 FROM flights f2
WHERE f2.dest_country_name = f1.origin_country_name)
EXISTS just checks for some existence in the subquery and returns true if there is a value. You can flip
this by placing the NOT operator in front of it. This would be equivalent to finding a flight to a
destination from which you won’t be able to return!
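For instance, negating the first predicate in the query above returns flights whose destination country never appears as an origin (running against the same flights table as before):

```sql
SELECT * FROM flights f1
WHERE NOT EXISTS (SELECT 1 FROM flights f2
WHERE f1.dest_country_name = f2.origin_country_name)
```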
Uncorrelated scalar queries
Using uncorrelated scalar queries, you can bring in some supplemental information that you might not
have previously. For example, if you wanted to include the maximum value as its own column from
the entire counts dataset, you could do this:
SELECT *, (SELECT max(count) FROM flights) AS maximum FROM flights
Miscellaneous Features
There are some features in Spark SQL that don’t quite fit in previous sections of this chapter, so
we’re going to include them here in no particular order. These can be relevant when performing
optimizations or debugging your SQL code.
Configurations
There are several Spark SQL application configurations, which we list in Table 10-1. You can set
these either at application initialization or over the course of application execution (like we have
seen with shuffle partitions throughout this book).
Table 10-1. Spark SQL configurations
Property Name Default Meaning
spark.sql.inMemoryColumnarStorage.compressed true
When set to true, Spark SQL automatically selects a
compression codec for each column based on statistics of the
data.
spark.sql.inMemoryColumnarStorage.batchSize 10000
Controls the size of batches for columnar caching. Larger
batch sizes can improve memory utilization and compression,
but risk OutOfMemoryErrors (OOMs) when caching data.
spark.sql.files.maxPartitionBytes 134217728
(128 MB)
The maximum number of bytes to pack into a single partition
when reading files.
spark.sql.files.openCostInBytes 4194304
(4 MB)
The estimated cost to open a file, measured by the number of
bytes that could be scanned in the same time. This is used
when putting multiple files into a partition. It is better to
overestimate; that way, the partitions with small files will be
faster than partitions with bigger files (which are scheduled
first).
spark.sql.broadcastTimeout 300 Timeout in seconds for the broadcast wait time in broadcast
joins.
spark.sql.autoBroadcastJoinThreshold 10485760
(10 MB)
Configures the maximum size in bytes for a table that will be
broadcast to all worker nodes when performing a join. You
can disable broadcasting by setting this value to -1. Note that
currently statistics are supported only for Hive Metastore
tables for which the command ANALYZE TABLE COMPUTE
STATISTICS noscan has been run.
spark.sql.shuffle.partitions 200 Configures the number of partitions to use when shuffling data
for joins or aggregations.
Setting Configuration Values in SQL
We talk about configurations in Chapter 15, but as a preview, it’s worth mentioning how to set
configurations from SQL. Naturally, you can only set Spark SQL configurations that way, but here’s
how you can set shuffle partitions:
SET spark.sql.shuffle.partitions=20
Conclusion
It should be clear from this chapter that Spark SQL and DataFrames are very closely related and that
you should be able to use nearly all of the examples throughout this book with only small syntactical
tweaks. This chapter illustrated more of the Spark SQL–related specifics. Chapter 11 focuses on a
new concept: Datasets that allow for type-safe structured transformations.
Chapter 11. Datasets
Datasets are the foundational type of the Structured APIs. We already worked with DataFrames,
which are Datasets of type Row, and are available across Spark’s different languages. Datasets are a
strictly Java Virtual Machine (JVM) language feature that work only with Scala and Java. Using
Datasets, you can define the object that each row in your Dataset will consist of. In Scala, this will be
a case class object that essentially defines a schema that you can use, and in Java, you will define a
Java Bean. Experienced users often refer to Datasets as the “typed set of APIs” in Spark. For more
information, see Chapter 4.
In Chapter 4, we discussed that Spark has types like StringType, BigIntType, StructType, and so
on. Those Spark-specific types map to types available in each of Spark’s languages like String,
Integer, and Double. When you use the DataFrame API, you do not create strings or integers, but
Spark manipulates the data for you by manipulating the Row object. In fact, if you use Scala or Java,
all “DataFrames” are actually Datasets of type Row. To efficiently support domain-specific objects, a
special concept called an “Encoder” is required. The encoder maps the domain-specific type T to
Sparks internal type system.
For example, given a class Person with two fields, name (string) and age (int), an encoder directs
Spark to generate code at runtime to serialize the Person object into a binary structure. When using
DataFrames or the “standard” Structured APIs, this binary structure will be a Row. When we want to
create our own domain-specific objects, we specify a case class in Scala or a JavaBean in Java.
Spark will allow us to manipulate this object (in place of a Row) in a distributed manner.
When you use the Dataset API, for every row it touches, Spark converts from its internal Row
format to the domain-specific object you specified (a case class or Java class). This conversion slows
down your operations but can provide more flexibility. You will notice a hit in performance, but it is
a far different order of magnitude from what you might see from something like a user-defined
function (UDF) in Python, because the performance costs are not as extreme as switching
programming languages. Still, it is an important thing to keep in mind.
When to Use Datasets
You might ponder, if I am going to pay a performance penalty when I use Datasets, why should I use
them at all? If we had to condense this down into a canonical list, here are a couple of reasons:
When the operation(s) you would like to perform cannot be expressed using DataFrame
manipulations
When you want or need type-safety, and you’re willing to accept the cost of performance to
achieve it
Let’s explore these in more detail. There are some operations that cannot be expressed using the
Structured APIs we have seen in the previous chapters. Although these are not particularly common,
you might have a large set of business logic that you’d like to encode in one specific function instead
of in SQL or DataFrames. This is an appropriate use for Datasets. Additionally, the Dataset API is
type-safe. Operations that are not valid for their types, say subtracting two string types, will fail at
compilation time, not at runtime. If correctness and bulletproof code are your highest priority, at the cost
of some performance, this can be a great choice for you. This does not protect you from malformed
data but can allow you to more elegantly handle and organize it.
Another potential time for which you might want to use Datasets is when you would like to reuse a
variety of transformations of entire rows between single-node workloads and Spark workloads. If
you have some experience with Scala, you might notice that Spark’s APIs reflect those of Scala
Sequence Types, but they operate in a distributed fashion. In fact, Martin Odersky, the inventor of
Scala, said just that in 2015 at Spark Summit Europe. Due to this, one advantage of using Datasets is
that if you define all of your data and transformations as accepting case classes it is trivial to reuse
them for both distributed and local workloads. Additionally, when you collect your DataFrames to
local disk, they will be of the correct class and type, sometimes making further manipulation easier.
Probably the most popular use case is to use DataFrames and Datasets in tandem, manually trading off
between performance and type safety when it is most relevant for your workload. This might be at the
end of a large, DataFrame-based extract, transform, and load (ETL) transformation when you’d like
to collect data to the driver and manipulate it by using single-node libraries, or it might be at the
beginning of a transformation when you need to perform per-row parsing before performing filtering
and further manipulation in Spark SQL.
Creating Datasets
Creating Datasets is somewhat of a manual operation, requiring you to know and define the schemas
ahead of time.
In Java: Encoders
Java Encoders are fairly simple: you specify your class and then you’ll encode it when you
come upon your DataFrame (which is of type Dataset<Row>):
import org.apache.spark.sql.Encoders;
public class Flight implements Serializable {
  String DEST_COUNTRY_NAME;
  String ORIGIN_COUNTRY_NAME;
  Long count;
}
Dataset<Flight> flights = spark.read
.parquet("/data/flight-data/parquet/2010-summary.parquet/")
.as(Encoders.bean(Flight.class));
In Scala: Case Classes
To create Datasets in Scala, you define a Scala case class. A case class is a regular class that
has the following characteristics:
Immutable
Decomposable through pattern matching
Allows for comparison based on structure instead of reference
Easy to use and manipulate
These traits make it rather valuable for data analysis because it is quite easy to reason about a case
class. Probably the most important features are that case classes are immutable and allow for
comparison by structure instead of reference.
Here’s how the Scala documentation describes it:
Immutability frees you from needing to keep track of where and when things are mutated
Comparison-by-value allows you to compare instances as if they were primitive values—no
more uncertainty regarding whether instances of a class are compared by value or reference
Pattern matching simplifies branching logic, which leads to less bugs and more readable
code.
These advantages carry over to their usage within Spark, as well.
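These properties can be demonstrated in plain Scala, without Spark at all (the class and values here are purely for illustration):

```scala
case class City(name: String, population: Long)

val a = City("Oslo", 700000L)
val b = City("Oslo", 700000L)

a == b // true: compared by structure, not by reference

// Immutability: "modifying" a field produces a new instance via copy
val bigger = a.copy(population = 800000L)

// Pattern matching decomposes the instance into its fields
a match {
  case City(name, pop) => s"$name has $pop people"
}
```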
To begin creating a Dataset, let’s define a case class for one of our datasets:
case class Flight(DEST_COUNTRY_NAME: String,
ORIGIN_COUNTRY_NAME: String, count: BigInt)
Now that we have defined a case class, it will represent a single record in our dataset. More
succinctly, we now have a Dataset of Flights. This doesn’t define any methods for us, simply the
schema. When we read in our data, we’ll get a DataFrame. However, we simply use the as method to
cast it to our specified row type:
val flightsDF = spark.read
.parquet("/data/flight-data/parquet/2010-summary.parquet/")
val flights = flightsDF.as[Flight]
Actions
Even though we can see the power of Datasets, what’s important to understand is that actions like
collect, take, and count apply whether we are using Datasets or DataFrames:
flights.show(2)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
| United States| Romania| 1|
| United States| Ireland| 264|
+-----------------+-------------------+-----+
You’ll also notice that when we actually go to access one of the case classes, we don’t need to do
any type coercion; we simply specify the named attribute of the case class and get back not just the
expected value but the expected type, as well:
flights.first.DEST_COUNTRY_NAME // United States
Transformations
Transformations on Datasets are the same as those that we saw on DataFrames. Any transformation
that you read about in this section is valid on a Dataset, and we encourage you to look through the
specific sections on relevant aggregations or joins.
In addition to those transformations, Datasets allow us to specify more complex and strongly typed
transformations than we could perform on DataFrames alone because we manipulate raw Java Virtual
Machine (JVM) types. To illustrate this raw object manipulation, let’s filter the Dataset that you just
created.
Filtering
Let’s look at a simple example: a function that accepts a Flight and returns a
Boolean value that describes whether the origin and destination are the same. This is not a UDF (at
least, in the way that Spark SQL defines UDF) but a generic function.
TIP
You’ll notice in the following example that we’re going to create a function to define this filter. This is an important
difference from what we have done thus far in the book. By specifying a function, we are forcing Spark to evaluate this
function on every row in our Dataset. This can be very resource intensive. For simple filters it is always preferred to write
SQL expressions. This will greatly reduce the cost of filtering out the data while still allowing you to manipulate it as a
Dataset later on:
def originIsDestination(flight_row: Flight): Boolean = {
return flight_row.ORIGIN_COUNTRY_NAME == flight_row.DEST_COUNTRY_NAME
}
We can now pass this function into the filter method, specifying that for each row it should verify that
this function returns true; in the process, it will filter our Dataset down accordingly:
flights.filter(flight_row => originIsDestination(flight_row)).first()
The result is:
Flight = Flight(United States,United States,348113)
As we saw earlier, this function does not need to execute in Spark code at all. Similar to our UDFs,
we can use it and test it on data on our local machines before using it within Spark.
For example, this dataset is small enough for us to collect to the driver (as an Array of Flights) on
which we can operate and perform the exact same filtering operation:
flights.collect().filter(flight_row => originIsDestination(flight_row))
The result is:
Array[Flight] = Array(Flight(United States,United States,348113))
We can see that we get the exact same answer as before.
Mapping
Filtering is a simple transformation, but sometimes you need to map one value to another value. We
did this with our function in the previous example: it accepts a flight and returns a Boolean, but other
times we might actually need to perform something more sophisticated like extract a value, compare a
set of values, or something similar.
The simplest example is manipulating our Dataset such that we extract one value from each row. This
is effectively performing a DataFrame-like select on our Dataset. Let’s extract the destination:
val destinations = flights.map(f => f.DEST_COUNTRY_NAME)
Notice that we end up with a Dataset of type String. That is because Spark already knows the JVM
type that this result should return and allows us to benefit from compile-time checking if, for some
reason, it is invalid.
We can collect this and get back an array of strings on the driver:
val localDestinations = destinations.take(5)
This might feel trivial and unnecessary; we can do the majority of this right on DataFrames. We in fact
recommend that you do this because you gain so many benefits from doing so. You will gain
advantages like code generation that are simply not possible with arbitrary user-defined functions.
However, this can come in handy with much more sophisticated row-by-row manipulation.
Joins
Joins, as we covered earlier, apply just the same as they did for DataFrames. However, Datasets also
provide a more sophisticated method, the joinWith method. joinWith is roughly equal to a co-group
(in RDD terminology) and you basically end up with two nested Datasets inside of one. Each column
represents one Dataset and these can be manipulated accordingly. This can be useful when you need
to maintain more information in the join or perform some more sophisticated manipulation on the
entire result, like an advanced map or filter.
Lets create a fake flight metadata dataset to demonstrate joinWith:
case class FlightMetadata(count: BigInt, randomData: BigInt)
val flightsMeta = spark.range(500).map(x => (x, scala.util.Random.nextLong))
.withColumnRenamed("_1", "count").withColumnRenamed("_2", "randomData")
.as[FlightMetadata]
val flights2 = flights
.joinWith(flightsMeta, flights.col("count") === flightsMeta.col("count"))
Notice that we end up with a Dataset of a sort of key-value pair, in which each row represents a
Flight and the FlightMetadata. We can, of course, query these as a Dataset or a DataFrame with
complex types:
flights2.selectExpr("_1.DEST_COUNTRY_NAME")
We can collect them just as we did before:
flights2.take(2)
Array[(Flight, FlightMetadata)] = Array((Flight(United States,Romania,1),...
Of course, a “regular” join would work quite well, too, although you’ll notice in this case that we end
up with a DataFrame (and thus lose our JVM type information):
val flights2 = flights.join(flightsMeta, Seq("count"))
We can always define another Dataset to gain this back. It’s also important to note that there are no
problems joining a DataFrame and a Dataset—we end up with the same result:
val flights2 = flights.join(flightsMeta.toDF(), Seq("count"))
Grouping and Aggregations
Grouping and aggregations follow the same fundamental standards that we saw in the previous
aggregation chapter, so groupBy, rollup, and cube still apply, but these return DataFrames instead of
Datasets (you lose type information):
flights.groupBy("DEST_COUNTRY_NAME").count()
This often is not too big of a deal, but if you want to keep type information around there are other
groupings and aggregations that you can perform. An excellent example is the groupByKey method.
This allows you to group by a specific key in the Dataset and get a typed Dataset in return. This
function, however, doesn’t accept a specific column name but rather a function. This makes it
possible for you to specify more sophisticated grouping functions that are much more akin to
something like this:
flights.groupByKey(x => x.DEST_COUNTRY_NAME).count()
Although this provides flexibility, it’s a trade-off because now we are introducing JVM types as well
as functions that cannot be optimized by Spark. This means that you will see a performance difference
and we can see this when we inspect the explain plan. In the following, you can see that we are
effectively appending a new column to the DataFrame (the result of our function) and then performing
the grouping on that:
flights.groupByKey(x => x.DEST_COUNTRY_NAME).count().explain
== Physical Plan ==
*HashAggregate(keys=[value#1396], functions=[count(1)])
+- Exchange hashpartitioning(value#1396, 200)
+- *HashAggregate(keys=[value#1396], functions=[partial_count(1)])
+- *Project [value#1396]
+- AppendColumns <function1>, newInstance(class ...
[staticinvoke(class org.apache.spark.unsafe.types.UTF8String, ...
+- *FileScan parquet [D...
After we perform a grouping with a key on a Dataset, we can operate on the Key Value Dataset with
functions that will manipulate the groupings as raw objects:
def grpSum(countryName:String, values: Iterator[Flight]) = {
values.dropWhile(_.count < 5).map(x => (countryName, x))
}
flights.groupByKey(x => x.DEST_COUNTRY_NAME).flatMapGroups(grpSum).show(5)
+--------+--------------------+
| _1| _2|
+--------+--------------------+
|Anguilla|[Anguilla,United ...|
|Paraguay|[Paraguay,United ...|
| Russia|[Russia,United St...|
| Senegal|[Senegal,United S...|
| Sweden|[Sweden,United St...|
+--------+--------------------+
def grpSum2(f:Flight):Integer = {
1
}
flights.groupByKey(x => x.DEST_COUNTRY_NAME).mapValues(grpSum2).count().take(5)
We can even create new manipulations and define how groups should be reduced:
def sum2(left:Flight, right:Flight) = {
Flight(left.DEST_COUNTRY_NAME, null, left.count + right.count)
}
flights.groupByKey(x => x.DEST_COUNTRY_NAME).reduceGroups((l, r) => sum2(l, r))
.take(5)
It should be straightforward enough to understand that this is a more expensive process than
aggregating immediately after scanning, especially because it ends up in the same end result:
flights.groupBy("DEST_COUNTRY_NAME").count().explain
== Physical Plan ==
*HashAggregate(keys=[DEST_COUNTRY_NAME#1308], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#1308, 200)
+- *HashAggregate(keys=[DEST_COUNTRY_NAME#1308], functions=[partial_count(1)])
+- *FileScan parquet [DEST_COUNTRY_NAME#1308] Batched: tru...
This should motivate using Datasets, with their user-defined encoding, surgically and only where it
makes sense. This might be at the beginning of a big data pipeline or at the end of one.
Conclusion
In this chapter, we covered the basics of Datasets and provided some motivating examples. Although
short, this chapter actually teaches you basically all that you need to know about Datasets and how to
use them. It can be helpful to think of them as a blend between the higher-level Structured APIs and
the low-level RDD APIs, which is the topic of Chapter 12.
Part III. Low-Level APIs
Chapter 12. Resilient Distributed Datasets
(RDDs)
The previous part of the book covered Spark’s Structured APIs. You should heavily favor these APIs
in almost all scenarios. That being said, there are times when higher-level manipulation will not meet
the business or engineering problem you are trying to solve. For those cases, you might need to use
Spark’s lower-level APIs, specifically the Resilient Distributed Dataset (RDD), the SparkContext,
and distributed shared variables like accumulators and broadcast variables. The chapters that follow
in this part cover these APIs and how to use them.
WARNING
If you are brand new to Spark, this is not the place to start. Start with the Structured APIs; you’ll be more productive more
quickly!
What Are the Low-Level APIs?
There are two sets of low-level APIs: there is one for manipulating distributed data (RDDs), and
another for distributing and manipulating distributed shared variables (broadcast variables and
accumulators).
When to Use the Low-Level APIs?
You should generally use the lower-level APIs in three situations:
You need some functionality that you cannot find in the higher-level APIs; for example, if you
need very tight control over physical data placement across the cluster.
You need to maintain some legacy codebase written using RDDs.
You need to do some custom shared variable manipulation. We will discuss shared variables
more in Chapter 14.
Those are the reasons why you should use these lower-level tools, but it’s still helpful to understand
these tools because all Spark workloads compile down to these fundamental primitives. When you’re
calling a DataFrame transformation, it actually just becomes a set of RDD transformations. This
understanding can make your task easier as you begin debugging more and more complex workloads.
Even if you are an advanced developer hoping to get the most out of Spark, we still recommend
focusing on the Structured APIs. However, there are times when you might want to “drop down” to
some of the lower-level tools to complete your task. You might need to drop down to these APIs to
use some legacy code, implement some custom partitioner, or update and track the value of a variable
over the course of a data pipeline’s execution. These tools give you more fine-grained control at the
expense of safeguarding you from shooting yourself in the foot.
How to Use the Low-Level APIs?
A SparkContext is the entry point for low-level API functionality. You access it through the
SparkSession, which is the tool you use to perform computation across a Spark cluster. We discuss
this further in Chapter 15 but for now, you simply need to know that you can access a SparkContext
via the following call:
spark.sparkContext
About RDDs
RDDs were the primary API in the Spark 1.X series and are still available in 2.X, but they are not as
commonly used. However, as we’ve pointed out earlier in this book, virtually all Spark code you run,
whether DataFrames or Datasets, compiles down to an RDD. The Spark UI, covered in the next part
of the book, also describes job execution in terms of RDDs. Therefore, it will behoove you to have at
least a basic understanding of what an RDD is and how to use it.
In short, an RDD represents an immutable, partitioned collection of records that can be operated on in
parallel. Unlike DataFrames though, where each record is a structured row containing fields with a
known schema, in RDDs the records are just Java, Scala, or Python objects of the programmer’s
choosing.
RDDs give you complete control because every record in an RDD is just a Java or Python object.
You can store anything you want in these objects, in any format you want. This gives you great power,
but not without potential issues. Every manipulation and interaction between values must be defined
by hand, meaning that you must “reinvent the wheel” for whatever task you are trying to carry out.
Also, optimizations are going to require much more manual work, because Spark does not understand
the inner structure of your records as it does with the Structured APIs. For instance, Spark’s
Structured APIs automatically store data in an optimized, compressed binary format, so to achieve the
same space-efficiency and performance, you’d also need to implement this type of format inside your
objects and all the low-level operations to compute over it. Likewise, optimizations like reordering
filters and aggregations that occur automatically in Spark SQL need to be implemented by hand. For
this reason and others, we highly recommend using the Spark Structured APIs when possible.
The RDD API is similar to the Dataset, which we saw in the previous part of the book, except that
RDDs are not stored in, or manipulated with, the structured data engine. However, it is trivial to
convert back and forth between RDDs and Datasets, so you can use both APIs to take advantage of
each API’s strengths and weaknesses. We’ll show how to do this throughout this part of the book.
Types of RDDs
If you look through Spark’s API documentation, you will notice that there are lots of subclasses of
RDD. For the most part, these are internal representations that the DataFrame API uses to create
optimized physical execution plans. As a user, however, you will likely only be creating two types of
RDDs: the “generic” RDD type or a key-value RDD that provides additional functions, such as
aggregating by key. For your purposes, these will be the only two types of RDDs that matter. Both just
represent a collection of objects, but key-value RDDs have special operations as well as a concept of
custom partitioning by key.
Let’s formally define RDDs. Internally, each RDD is characterized by five main properties:
A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a Partitioner for key-value RDDs (e.g., to say that the RDD is hash-
partitioned)
Optionally, a list of preferred locations on which to compute each split (e.g., block locations
for a Hadoop Distributed File System [HDFS] file)
NOTE
The Partitioner is probably one of the core reasons why you might want to use RDDs in your code. Specifying your own
custom Partitioner can give you significant performance and stability improvements if you use it correctly. This is
discussed in more depth in Chapter 13 when we introduce Key–Value Pair RDDs.
These properties determine all of Spark’s ability to schedule and execute the user program. Different
kinds of RDDs implement their own versions of each of the aforementioned properties, allowing you
to define new data sources.
RDDs follow the exact same Spark programming paradigms that we saw in earlier chapters. They
provide transformations, which evaluate lazily, and actions, which evaluate eagerly, to manipulate
data in a distributed fashion. These work the same way as transformations and actions on DataFrames
and Datasets. However, there is no concept of “rows” in RDDs; individual records are just raw
Java/Scala/Python objects, and you manipulate those manually instead of tapping into the repository
of functions that you have in the structured APIs.
The RDD APIs are available in Python as well as Scala and Java. For Scala and Java, the
performance is for the most part the same, with the largest costs incurred in manipulating raw
objects. Python, however, can lose a substantial amount of performance when using RDDs. Running
Python RDDs equates to running Python user-defined functions (UDFs) row by row, just as we saw in
Chapter 6: we serialize the data to the Python process, operate on it in Python, and then serialize it
back to the Java Virtual Machine (JVM). This causes a high overhead for Python RDD manipulations.
Even though many people ran production code with them in the past, we recommend building on the
Structured APIs in Python and only dropping down to RDDs if absolutely necessary.
When to Use RDDs?
In general, you should not manually create RDDs unless you have a very, very specific reason for
doing so. They are a much lower-level API that provides a lot of power but also lacks a lot of the
optimizations that are available in the Structured APIs. For the vast majority of use cases,
DataFrames will be more efficient, more stable, and more expressive than RDDs.
The most likely reason you’ll want to use RDDs is that you need fine-grained control
over the physical distribution of data (custom partitioning of data).
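As a brief preview of what that looks like (assuming an active SparkSession named spark; Chapter 13 covers this in depth):

```scala
import org.apache.spark.HashPartitioner

// A key-value RDD built from a local collection
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Explicitly control placement: records with the same key land in the same partition
val partitioned = pairs.partitionBy(new HashPartitioner(4))
partitioned.getNumPartitions // 4
```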
Datasets and RDDs of Case Classes
We noticed this question on the web and found it to be an interesting one: what is the difference
between RDDs of Case Classes and Datasets? The difference is that Datasets can still take advantage
of the wealth of functions and optimizations that the Structured APIs have to offer. With Datasets, you
do not need to choose between only operating on JVM types or on Spark types; you can choose
whichever is easiest to do or most flexible. You get the best of both worlds.
Creating RDDs
Now that we discussed some key RDD properties, let’s begin applying them so that you can better
understand how to use them.
Interoperating Between DataFrames, Datasets, and RDDs
One of the easiest ways to get RDDs is from an existing DataFrame or Dataset. Converting these to an
RDD is simple: just use the rdd method on any of these data types. You'll notice that if you do a
conversion from a Dataset[T] to an RDD, you'll get the appropriate native type T back (remember,
this applies only to Scala and Java):
// in Scala: converts a Dataset[Long] to RDD[Long]
spark.range(500).rdd
Because Python doesn't have Datasets—it has only DataFrames—you will get an RDD of type Row:
# in Python
spark.range(10).rdd
To operate on this data, you will need to convert this Row object to the correct data type or extract
values out of it, as shown in the example that follows. This is now an RDD of type Row:
// in Scala
spark.range(10).toDF().rdd.map(rowObject => rowObject.getLong(0))
# in Python
spark.range(10).toDF("id").rdd.map(lambda row: row[0])
You can use the same methodology to create a DataFrame or Dataset from an RDD. All you need to
do is call the toDF method on the RDD:
// in Scala
spark.range(10).rdd.toDF()
# in Python
spark.range(10).rdd.toDF()
This command creates an RDD of type Row. The Row type is the internal Catalyst format that Spark uses to
represent data in the Structured APIs. This functionality makes it possible for you to jump between the
Structured and low-level APIs as it suits your use case. (We talk about this in Chapter 13.)
The RDD API will feel quite similar to the Dataset API in Chapter 11 because the two are extremely
similar to each other (RDDs are a lower-level representation of Datasets that lack a lot of
the convenient functionality and interfaces of the Structured APIs).
From a Local Collection
To create an RDD from a collection, you will need to use the parallelize method on a
SparkContext (within a SparkSession). This turns a single node collection into a parallel collection.
When creating this parallel collection, you can also explicitly state the number of partitions into
which you would like to distribute this array. In this case, we are creating two partitions:
// in Scala
val myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple"
  .split(" ")
val words = spark.sparkContext.parallelize(myCollection, 2)
# in Python
myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple"\
  .split(" ")
words = spark.sparkContext.parallelize(myCollection, 2)
An additional feature is that you can then name this RDD to show up in the Spark UI according to a
given name:
// in Scala
words.setName("myWords")
words.name // myWords
# in Python
words.setName("myWords")
words.name() # myWords
From Data Sources
Although you can create RDDs from data sources or text files, it's often preferable to use the Data
Source APIs. RDDs do not have a notion of “Data Source APIs” like DataFrames do; they primarily
define their dependency structures and lists of partitions. The Data Source API that we saw in
Chapter 9 is almost always a better way to read in data. That being said, you can also read data as
RDDs using sparkContext. For example, let's read a text file line by line:
spark.sparkContext.textFile("/some/path/withTextFiles")
This creates an RDD for which each record in the RDD represents a line in that text file or files.
Alternatively, you can read in data for which each text file should become a single record. The use
case here would be where each file consists of a large JSON object or some document
that you will operate on as an individual record:
spark.sparkContext.wholeTextFiles("/some/path/withTextFiles")
In this RDD, the name of the file is the first object and the value of the text file is the second string
object.
Manipulating RDDs
You manipulate RDDs in much the same way that you manipulate DataFrames. As mentioned, the core
difference is that you manipulate raw Java or Scala objects instead of Spark types. There is also a
dearth of "helper" methods or functions that you can draw upon to simplify calculations. Rather, you
must define each filter, map function, aggregation, and any other manipulation that you want as a
function.
To demonstrate some data manipulation, let’s use the simple RDD (words) we created previously to
define some more details.
Transformations
For the most part, many transformations mirror the functionality that you find in the Structured APIs.
Just as you do with DataFrames and Datasets, you specify transformations on one RDD to create
another. In doing so, we define an RDD as a dependency to another along with some manipulation of
the data contained in that RDD.
distinct
A distinct method call on an RDD removes duplicates from the RDD:
words.distinct().count()
This gives a result of 10.
filter
Filtering is equivalent to creating a SQL-like where clause. You can look through our records in the
RDD and see which ones match some predicate function. This function just needs to return a Boolean
type to be used as a filter function. The input should be whatever your given row is. In this next
example, we filter the RDD to keep only the words that begin with the letter "S":
// in Scala
def startsWithS(individual:String) = {
  individual.startsWith("S")
}
# in Python
def startsWithS(individual):
  return individual.startswith("S")
Now that we've defined the function, let's filter the data. This should feel quite familiar if you read
Chapter 11 because we simply use a function that operates record by record in the RDD. The function
is defined to work on each record in the RDD individually:
// in Scala
words.filter(word => startsWithS(word)).collect()
# in Python
words.filter(lambda word: startsWithS(word)).collect()
This gives a result of Spark and Simple. We can see, like the Dataset API, that this returns native
types. That is because we never coerce our data into type Row, nor do we need to convert the data
after collecting it.
map
Mapping is again the same operation that you can read about in Chapter 11. You specify a function
that returns the value that you want, given the correct input. You then apply that, record by record.
Let's perform something similar to what we just did. In this example, we'll map the current word to
the word, its starting letter, and whether the word begins with “S.”
Notice in this instance that we define our functions completely inline using the relevant lambda
syntax:
// in Scala
val words2 = words.map(word => (word, word(0), word.startsWith("S")))
# in Python
words2 = words.map(lambda word: (word, word[0], word.startswith("S")))
You can subsequently filter on this by selecting the relevant Boolean value in a new function:
// in Scala
words2.filter(record => record._3).take(5)
# in Python
words2.filter(lambda record: record[2]).take(5)
This returns a tuple of “Spark,” “S,” and “true,” as well as “Simple,” “S,” and “True.”
flatMap
flatMap provides a simple extension of the map function we just looked at. Sometimes, each current
row should return multiple rows, instead. For example, you might want to take your set of words and
flatMap it into a set of characters. Because each word has multiple characters, you should use
flatMap to expand it. flatMap requires that the output of the map function be an iterable that can be
expanded:
// in Scala
words.flatMap(word => word.toSeq).take(5)
# in Python
words.flatMap(lambda word: list(word)).take(5)
This yields S, p, a, r, k.
sort
To sort an RDD you must use the sortBy method, and just like any other RDD operation, you do this
by specifying a function to extract a value from the objects in your RDDs and then sort based on that.
For instance, the following example sorts by word length from longest to shortest:
// in Scala
words.sortBy(word => word.length() * -1).take(2)
# in Python
words.sortBy(lambda word: len(word) * -1).take(2)
Random Splits
We can also randomly split an RDD into an Array of RDDs by using the randomSplit method,
which accepts an Array of weights and a random seed:
// in Scala
val fiftyFiftySplit = words.randomSplit(Array[Double](0.5, 0.5))
# in Python
fiftyFiftySplit = words.randomSplit([0.5, 0.5])
This returns an array of RDDs that you can manipulate individually.
Actions
Just as we do with DataFrames and Datasets, we specify actions to kick off our specified
transformations. Actions either collect data to the driver or write to an external data source.
reduce
You can use the reduce method to specify a function to “reduce” an RDD of any kind of value to one
value. For instance, given a set of numbers, you can reduce this to its sum by specifying a function that
takes as input two values and reduces them into one. If you have experience in functional
programming, this should not be a new concept:
// in Scala
spark.sparkContext.parallelize(1 to 20).reduce(_ + _) // 210
# in Python
spark.sparkContext.parallelize(range(1, 21)).reduce(lambda x, y: x + y) # 210
You can also use this to get something like the longest word in our set of words that we defined a
moment ago. The key is just to define the correct function:
// in Scala
def wordLengthReducer(leftWord:String, rightWord:String): String = {
  if (leftWord.length > rightWord.length)
    return leftWord
  else
    return rightWord
}
words.reduce(wordLengthReducer)
# in Python
def wordLengthReducer(leftWord, rightWord):
  if len(leftWord) > len(rightWord):
    return leftWord
  else:
    return rightWord
words.reduce(wordLengthReducer)
This reducer is a good example because you can get one of two outputs. Because the reduce
operation on the partitions is not deterministic, you can have either “definitive” or “processing” (both
of length 10) as the “left” word. This means that sometimes you can end up with one, whereas other
times you end up with the other.
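The order dependence is easy to demonstrate without Spark at all. Here is a plain-Python sketch of the same tie (not Spark code): functools.reduce applied to the same ten words in two different orders keeps a different ten-letter word.

```python
from functools import reduce

def word_length_reducer(left, right):
    # Keeps the longer word; on a tie the right-hand word wins,
    # which is why the result depends on arrival order.
    return left if len(left) > len(right) else right

words = ["Spark", "The", "Definitive", "Guide", ":", "Big",
         "Data", "Processing", "Made", "Simple"]

# Partition results arrive at the final reduce in no guaranteed order,
# so either ten-letter word can win the tie.
print(reduce(word_length_reducer, words))                  # Processing
print(reduce(word_length_reducer, list(reversed(words))))  # Definitive
```

In Spark, the ordering in which partition results reach the final reduce plays the role of the list order here.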
count
This method is fairly self-explanatory. Using it, you could, for example, count the number of rows in
the RDD:
words.count()
countApprox
Even though the return signature for this method is a bit strange, it's quite sophisticated. This is an
approximation of the count method we just looked at, but it must execute within a timeout (and it can
return incomplete results if it exceeds the timeout).
The confidence is the probability that the error bounds of the result will contain the true value. That
is, if countApprox were called repeatedly with confidence 0.9, we would expect 90% of the results
to contain the true count. The confidence must be in the range [0,1], or an exception will be thrown:
val confidence = 0.95
val timeoutMilliseconds = 400
words.countApprox(timeoutMilliseconds, confidence)
countApproxDistinct
There are two implementations of this, both based on streamlib’s implementation of “HyperLogLog in
Practice: Algorithmic Engineering of a State-of-the-Art Cardinality Estimation Algorithm.”
In the first implementation, the argument we pass into the function is the relative accuracy. Smaller
values create counters that require more space. The value must be greater than 0.000017:
words.countApproxDistinct(0.05)
With the other implementation you have a bit more control; you specify the relative accuracy based on
two parameters: one for “regular” data and another for a sparse representation.
The two arguments are p and sp, where p is precision and sp is sparse precision. The relative
accuracy is approximately 1.054 / sqrt(2^p). Setting a nonzero sp (sp > p) can reduce the memory
consumption and increase accuracy when the cardinality is small. Both values are integers:
words.countApproxDistinct(4, 10)
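As a quick sanity check of that accuracy formula (a plain-Python calculation, not a Spark call), here is what the relative accuracy works out to for a couple of precisions:

```python
import math

def hll_relative_accuracy(p):
    # Approximate relative standard error of a HyperLogLog estimate
    # with precision p (i.e., 2^p registers): 1.054 / sqrt(2^p).
    return 1.054 / math.sqrt(2 ** p)

print(round(hll_relative_accuracy(4), 4))   # 0.2635 -> roughly 26% error
print(round(hll_relative_accuracy(14), 4))  # 0.0082 -> under 1% error
```

This is why small p values (small counters) trade space for accuracy.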
countByValue
This method counts the number of occurrences of each value in a given RDD. However, it does so by
finally loading the result set into the memory of the driver. You should use this method only if the
resulting map is expected to be small, because the entire thing is loaded into the driver's memory.
Thus, this method makes sense only in a scenario in which either the total number of rows is low or
the number of distinct items is low:
words.countByValue()
countByValueApprox
This does the same thing as the previous function, but it does so as an approximation. This must
execute within the specified timeout (first parameter) (and can return incomplete results if it exceeds
the timeout).
The confidence is the probability that the error bounds of the result will contain the true value. That
is, if countByValueApprox were called repeatedly with confidence 0.9, we would expect 90% of the
results to contain the true count. The confidence must be in the range [0,1], or an exception will be thrown:
words.countByValueApprox(1000, 0.95)
first
The first method returns the first value in the dataset:
words.first()
max and min
max and min return the maximum and minimum values, respectively:
spark.sparkContext.parallelize(1 to 20).max()
spark.sparkContext.parallelize(1 to 20).min()
take
take and its derivative methods take a number of values from your RDD. This works by first
scanning one partition and then using the results from that partition to estimate the number of
additional partitions needed to satisfy the limit.
There are many variations on this function, such as takeOrdered, takeSample, and top. You can use
takeSample to specify a fixed-size random sample from your RDD. You can specify whether this
should be done by using withReplacement, the number of values, as well as the random seed. top is
effectively the opposite of takeOrdered in that it selects the top values according to the implicit
ordering:
words.take(5)
words.takeOrdered(5)
words.top(5)
val withReplacement = true
val numberToTake = 6
val randomSeed = 100L
words.takeSample(withReplacement, numberToTake, randomSeed)
Saving Files
Saving files means writing to plain-text files. With RDDs, you cannot actually “save” to a data source
in the conventional sense. You must iterate over the partitions in order to save the contents of each
partition to some external database. This is a low-level approach that reveals the underlying
operation that is being performed in the higher-level APIs. Spark will take each partition and write
it out to the destination.
saveAsTextFile
To save to a text file, you just specify a path and optionally a compression codec:
words.saveAsTextFile("file:/tmp/bookTitle")
To set a compression codec, we must import the proper codec from Hadoop. You can find these in the
org.apache.hadoop.io.compress library:
// in Scala
import org.apache.hadoop.io.compress.BZip2Codec
words.saveAsTextFile("file:/tmp/bookTitleCompressed", classOf[BZip2Codec])
SequenceFiles
Spark originally grew out of the Hadoop ecosystem, so it has a fairly tight integration with a variety
of Hadoop tools. A sequenceFile is a flat file consisting of binary key–value pairs. It is extensively
used in MapReduce as an input/output format.
Spark can write to sequenceFiles using the saveAsObjectFile method or by explicitly writing
key–value pairs, as described in Chapter 13:
words.saveAsObjectFile("/tmp/my/sequenceFilePath")
Hadoop Files
There are a variety of different Hadoop file formats to which you can save. These allow you to
specify classes, output formats, Hadoop configurations, and compression schemes. (For information
on these formats, read Hadoop: The Definitive Guide [O'Reilly, 2015].) These formats are largely
irrelevant unless you're working deeply in the Hadoop ecosystem or with some legacy MapReduce
jobs.
Caching
The same principles apply for caching RDDs as for DataFrames and Datasets. You can either cache
or persist an RDD. By default, cache and persist handle data only in memory. We can name the RDD
if we use the setName function that we referenced previously in this chapter:
words.cache()
We can specify a storage level as any of the storage levels in the singleton object:
org.apache.spark.storage.StorageLevel, which are combinations of memory only; disk only;
and separately, off heap.
We can subsequently query for this storage level (we talk about storage levels when we discuss
persistence in Chapter 20):
// in Scala
words.getStorageLevel
# in Python
words.getStorageLevel()
Checkpointing
One feature not available in the DataFrame API is the concept of checkpointing. Checkpointing is the
act of saving an RDD to disk so that future references to this RDD point to those intermediate
partitions on disk rather than recomputing the RDD from its original source. This is similar to caching
except that it is stored only on disk, not in memory. This can be helpful when performing iterative
computation, similar to the use cases for caching:
spark.sparkContext.setCheckpointDir("/some/path/for/checkpointing")
words.checkpoint()
Now, when we reference this RDD, it will derive from the checkpoint instead of the source data. This
can be a helpful optimization.
Pipe RDDs to System Commands
The pipe method is probably one of Spark's more interesting methods. With pipe, you can return an
RDD created by piping elements to a forked external process. The resulting RDD is computed by
executing the given process once per partition. All elements of each input partition are written to a
process’s stdin as lines of input separated by a newline. The resulting partition consists of the
process’s stdout output, with each line of stdout resulting in one element of the output partition. A
process is invoked even for empty partitions.
The print behavior can be customized by providing two functions.
We can use a simple example and pipe each partition to the command wc. Each row will be passed in
as a new line, so if we perform a line count, we will get the number of lines, one per partition:
words.pipe("wc -l").collect()
In this case, we got five lines per partition.
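What pipe does per partition is straightforward to emulate in plain Python with subprocess (a sketch of the semantics, assuming a Unix wc on the PATH; this is not how Spark implements it): each partition's elements are written to the process's stdin as lines, and its stdout lines become the output elements.

```python
import subprocess

def pipe_partition(elements, command):
    # Write each element to the forked process's stdin as one line and
    # return its stdout, one output element per line.
    proc = subprocess.run(command, input="\n".join(elements) + "\n",
                          capture_output=True, text=True, shell=True)
    return [line.strip() for line in proc.stdout.splitlines()]

# Two partitions of five words each, as in our words RDD
partitions = [["Spark", "The", "Definitive", "Guide", ":"],
              ["Big", "Data", "Processing", "Made", "Simple"]]
result = [out for part in partitions for out in pipe_partition(part, "wc -l")]
print(result)  # ['5', '5']: wc -l counted five input lines per partition
```

Spark additionally runs the process once per partition on the executors, even for empty partitions.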
mapPartitions
The previous command revealed that Spark operates on a per-partition basis when it comes to
actually executing code. You also might have noticed earlier that the return signature of a map function
on an RDD is actually MapPartitionsRDD. This is because map is just a row-wise alias for
mapPartitions, which makes it possible for you to map an individual partition (represented as an
iterator). That's because physically on the cluster we operate on each partition individually (and not a
specific row). A simple example creates the value “1” for every partition in our data, and the sum of
the following expression will count the number of partitions we have:
// in Scala
words.mapPartitions(part => Iterator[Int](1)).sum() // 2
# in Python
words.mapPartitions(lambda part: [1]).sum() # 2
Naturally, this means that we operate on a per-partition basis, which allows us to perform an operation
on an entire partition. This is valuable for performing something on an entire subdataset of your
RDD. You can gather all values of a partition class or group into one partition and then operate on
that entire group using arbitrary functions and controls. An example use case of this would be that you
could pipe this through some custom machine learning algorithm and train an individual model for that
company's portion of the dataset. A Facebook engineer has an interesting demonstration of their
particular implementation of the pipe operator with a similar use case demonstrated at Spark Summit
East 2017.
Other functions similar to mapPartitions include mapPartitionsWithIndex. With this you specify
a function that accepts an index (within the partition) and an iterator that goes through all items within
the partition. The partition index is the partition number in your RDD, which identifies where each
record in our dataset sits (and potentially allows you to debug). You might use this to test whether
your map functions are behaving correctly:
// in Scala
def indexedFunc(partitionIndex:Int, withinPartIterator: Iterator[String]) = {
  withinPartIterator.toList.map(
    value => s"Partition: $partitionIndex => $value").iterator
}
words.mapPartitionsWithIndex(indexedFunc).collect()
# in Python
def indexedFunc(partitionIndex, withinPartIterator):
  return ["partition: {} => {}".format(partitionIndex, x)
    for x in withinPartIterator]
words.mapPartitionsWithIndex(indexedFunc).collect()
foreachPartition
Although mapPartitions needs a return value to work properly, this next function does not.
foreachPartition simply iterates over all the partitions of the data. The difference is that the
function has no return value. This makes it great for doing something with each partition, like writing
it out to a database. In fact, this is how many data source connectors are written. You can create your
own text file source if you want by specifying outputs to the temp directory with a random ID:
words.foreachPartition { iter =>
  import java.io._
  import scala.util.Random
  val randomFileName = new Random().nextInt()
  val pw = new PrintWriter(new File(s"/tmp/random-file-${randomFileName}.txt"))
  while (iter.hasNext) {
    pw.write(iter.next())
  }
  pw.close()
}
You'll find these two files if you scan your /tmp directory.
glom
glom is an interesting function that takes every partition in your dataset and converts it to an array.
This can be useful if you're going to collect the data to the driver and want to have an array for each
partition. However, this can cause serious stability issues, because if you have large partitions or a
large number of partitions, it's simple to crash the driver.
In the following example, you can see that we get two partitions and each word falls into one partition
each:
// in Scala
spark.sparkContext.parallelize(Seq("Hello", "World"), 2).glom().collect()
// Array(Array(Hello), Array(World))
# in Python
spark.sparkContext.parallelize(["Hello", "World"], 2).glom().collect()
# [['Hello'], ['World']]
Conclusion
In this chapter, you saw the basics of the RDD APIs, including single RDD manipulation. Chapter 13
touches on more advanced RDD concepts, such as joins and key-value RDDs.
Chapter 13. Advanced RDDs
Chapter 12 explored the basics of single RDD manipulation. You learned how to create RDDs and
why you might want to use them. In addition, we discussed map, filter, reduce, and how to create
functions to transform single RDD data. This chapter covers the advanced RDD operations and
focuses on key–value RDDs, a powerful abstraction for manipulating data. We also touch on some
more advanced topics like custom partitioning, a reason you might want to use RDDs in the first
place. With a custom partitioning function, you can control exactly how data is laid out on the cluster
and manipulate that individual partition accordingly. Before we get there, let’s summarize the key
topics we will cover:
Aggregations and key–value RDDs
Custom partitioning
RDD joins
NOTE
This set of APIs has been around since, essentially, the beginning of Spark, and there are a ton of examples all across the
web on this set of APIs. This makes it trivial to search and find examples that will show you how to use these operations.
Let's use the same dataset we used in the last chapter:
// in Scala
val myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple"
  .split(" ")
val words = spark.sparkContext.parallelize(myCollection, 2)
# in Python
myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple"\
  .split(" ")
words = spark.sparkContext.parallelize(myCollection, 2)
Key-Value Basics (Key-Value RDDs)
There are many methods on RDDs that require you to put your data in a key–value format. A hint that
this is required is that the method will include <some-operation>ByKey. Whenever you see ByKey
in a method name, it means that you can perform this only on a PairRDD type. The easiest way is to
just map over your current RDD to a basic key–value structure. This means having two values in each
record of your RDD:
// in Scala
words.map(word => (word.toLowerCase, 1))
# in Python
words.map(lambda word: (word.lower(), 1))
keyBy
The preceding example demonstrated a simple way to create a key. However, you can also use the
keyBy function to achieve the same result by specifying a function that creates the key from your
current value. In this case, you are keying by the first letter in the word. Spark then keeps the record
as the value for the keyed RDD:
// in Scala
val keyword = words.keyBy(word => word.toLowerCase.toSeq(0).toString)
# in Python
keyword = words.keyBy(lambda word: word.lower()[0])
Mapping over Values
After you have a set of key–value pairs, you can begin manipulating them as such. If we have a tuple,
Spark will assume that the first element is the key, and the second is the value. When in this format,
you can explicitly choose to map over the values (and ignore the individual keys). Of course, you
could do this manually, but this can help prevent errors when you know that you are just going to
modify the values:
// in Scala
keyword.mapValues(word => word.toUpperCase).collect()
# in Python
keyword.mapValues(lambda word: word.upper()).collect()
Here’s the output in Python:
[('s', 'SPARK'),
('t', 'THE'),
('d', 'DEFINITIVE'),
('g', 'GUIDE'),
(':', ':'),
('b', 'BIG'),
('d', 'DATA'),
('p', 'PROCESSING'),
('m', 'MADE'),
('s', 'SIMPLE')]
(The values in Scala are the same but omitted for brevity.)
You can flatMap over the rows, as we saw in Chapter 12, to expand the number of rows so that each
row represents a character. In the following example, we will omit the output, but it would simply be
each character as we converted the words into arrays of characters:
// in Scala
keyword.flatMapValues(word => word.toUpperCase).collect()
# in Python
keyword.flatMapValues(lambda word: word.upper()).collect()
Extracting Keys and Values
When we are in the key–value pair format, we can also extract the specific keys or values by using
the following methods:
// in Scala
keyword.keys.collect()
keyword.values.collect()
# in Python
keyword.keys().collect()
keyword.values().collect()
lookup
One interesting task you might want to do with an RDD is look up the result for a particular key. Note
that there is no enforcement mechanism with respect to there being only one key for each input, so if
we look up "s", we are going to get both values associated with that key—"Spark" and "Simple":
keyword.lookup("s")
sampleByKey
There are two ways to sample an RDD by a set of keys: via an approximation or exactly.
Both operations can sample with or without replacement, as well as by a fraction for a given
key. This is done via simple random sampling with one pass over the RDD, which produces a sample
of a size that's approximately equal to the sum of math.ceil(numItems * samplingRate) over all
key values:
// in Scala
val distinctChars = words.flatMap(word => word.toLowerCase.toSeq).distinct
.collect()
import scala.util.Random
val sampleMap = distinctChars.map(c => (c, new Random().nextDouble())).toMap
words.map(word => (word.toLowerCase.toSeq(0), word))
.sampleByKey(true, sampleMap, 6L)
.collect()
# in Python
import random
distinctChars = words.flatMap(lambda word: list(word.lower())).distinct()\
.collect()
sampleMap = dict(map(lambda c: (c, random.random()), distinctChars))
words.map(lambda word: (word.lower()[0], word))\
.sampleByKey(True, sampleMap, 6).collect()
The sampleByKeyExact method differs from sampleByKey in that it makes additional passes over the
RDD to create a sample size that's exactly equal to the sum of math.ceil(numItems * samplingRate)
over all key values with a 99.99% confidence. When sampling without replacement, you need one additional pass
over the RDD to guarantee sample size; when sampling with replacement, you need two additional
passes:
// in Scala
words.map(word => (word.toLowerCase.toSeq(0), word))
.sampleByKeyExact(true, sampleMap, 6L).collect()
Aggregations
You can perform aggregations on plain RDDs or on PairRDDs, depending on the method that you are
using. Let’s use some of our datasets to demonstrate this:
// in Scala
val chars = words.flatMap(word => word.toLowerCase.toSeq)
val KVcharacters = chars.map(letter => (letter, 1))
def maxFunc(left:Int, right:Int) = math.max(left, right)
def addFunc(left:Int, right:Int) = left + right
val nums = sc.parallelize(1 to 30, 5)
# in Python
chars = words.flatMap(lambda word: word.lower())
KVcharacters = chars.map(lambda letter: (letter, 1))
def maxFunc(left, right):
  return max(left, right)
def addFunc(left, right):
  return left + right
nums = sc.parallelize(range(1,31), 5)
After you have this, you can do something like countByKey, which counts the items per key.
countByKey
You can count the number of elements for each key, collecting the results to a local Map. You can also
do this with an approximation, which makes it possible for you to specify a timeout and confidence
when using Scala or Java:
// in Scala
val timeout = 1000L //milliseconds
val confidence = 0.95
KVcharacters.countByKey()
KVcharacters.countByKeyApprox(timeout, confidence)
# in Python
KVcharacters.countByKey()
Understanding Aggregation Implementations
There are several ways to create your key–value PairRDDs; however, the implementation is actually
quite important for job stability. Let's compare the two fundamental choices, groupBy and reduce.
We’ll do these in the context of a key, but the same basic principles apply to the groupBy and reduce
methods.
groupByKey
Looking at the API documentation, you might think groupByKey with a map over each grouping is the
best way to sum up the counts for each key:
// in Scala
KVcharacters.groupByKey().map(row => (row._1, row._2.reduce(addFunc))).collect()
# in Python
KVcharacters.groupByKey().map(lambda row: (row[0], reduce(addFunc, row[1])))\
.collect()
# note this is Python 2, reduce must be imported from functools in Python 3
However, this is, for the majority of cases, the wrong way to approach the problem. The fundamental
issue here is that each executor must hold all values for a given key in memory before applying the
function to them. Why is this problematic? If you have massive key skew, some partitions might be
completely overloaded with a ton of values for a given key, and you will get OutOfMemoryErrors.
This obviously doesn't cause an issue with our current dataset, but it can cause serious problems at
scale. This is not guaranteed to happen, but it can happen.
There are use cases when groupByKey does make sense. If you have consistent value sizes for each
key and know that they will fit in the memory of a given executor, you're going to be just fine. It's just
good to know exactly what you're getting yourself into when you do this. There is a preferred
approach for additive use cases: reduceByKey.
reduceByKey
Because we are performing a simple count, a much more stable approach is to perform the same
flatMap and then just perform a map to map each letter instance to the number one, and then perform
a reduceByKey with a summation function in order to collect back the array. This implementation is
much more stable because the reduce happens within each partition first and doesn't need to put
everything in memory. Additionally, the shuffle moves only one already-reduced value per key and
partition rather than every individual record; most of the work happens at each worker before the
final reduce. This greatly enhances the speed at which you can perform the operation as well as the
stability of the operation:
KVcharacters.reduceByKey(addFunc).collect()
Here’s the result of the operation:
Array((d,4), (p,3), (t,3), (b,1), (h,1), (n,2),
...
(a,4), (i,7), (k,1), (u,1), (o,1), (g,3), (m,2), (c,1))
The reduceByKey method returns an RDD of a group (the key) and sequence of elements that are not
guaranteed to have an ordering. Therefore, this method is completely appropriate when our workload is
associative but inappropriate when the order matters.
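The memory behavior behind this advice can be sketched in plain Python (an illustration of the two strategies, not Spark code): the groupByKey path buffers every value for a key before reducing, while the reduceByKey path folds each value into one running total per key as it arrives.

```python
from collections import defaultdict

pairs = [(letter, 1) for letter in "sparkthedefinitiveguide"]

# groupByKey-style: every value for a key is materialized before reducing
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)       # one stored entry per occurrence
grouped_counts = {k: sum(v) for k, v in groups.items()}

# reduceByKey-style: only one running value per key is ever held
counts = defaultdict(int)
for key, value in pairs:
    counts[key] += value            # folded immediately, nothing buffered

print(grouped_counts == dict(counts))  # True: same result, less memory held
```

With a heavily skewed key, the first dictionary's list grows with the number of occurrences, whereas the second holds a single integer per key regardless of skew.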
Other Aggregation Methods
There exist a number of advanced aggregation methods. For the most part, these are largely
implementation details depending on your specific workload. We find it very rare that users come
across this sort of workload (or need to perform this kind of operation) in modern-day Spark. There
just aren't that many reasons for using these extremely low-level tools when you can perform much
simpler aggregations using the Structured APIs. These functions give you very specific, very
low-level control over exactly how a given aggregation is performed on the cluster of machines.
aggregate
Another function is aggregate. This function requires a null (start) value and then requires you to
specify two different functions. The first aggregates within partitions; the second aggregates across
partitions. The start value is used at both aggregation levels:
// in Scala
nums.aggregate(0)(maxFunc, addFunc)
# in Python
nums.aggregate(0, maxFunc, addFunc)
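As a local illustration of these two levels, here is a hedged pure-Python sketch of aggregate's semantics; nums, maxFunc, and addFunc mirror the variables defined earlier in the chapter, and the partitioning shown is illustrative:

```python
def aggregate(partitions, zero, seq_op, comb_op):
    # First aggregate within each partition, then combine the partial
    # results across partitions; the zero value seeds both levels.
    partials = []
    for part in partitions:
        acc = zero
        for v in part:
            acc = seq_op(acc, v)
        partials.append(acc)
    total = zero
    for p in partials:
        total = comb_op(total, p)
    return total

max_func = lambda x, y: max(x, y)   # within-partition function
add_func = lambda x, y: x + y       # across-partition function
nums = [[1, 2, 3], [4, 5], [6]]     # an illustrative partitioning of 1..6
print(aggregate(nums, 0, max_func, add_func))  # 3 + 5 + 6 = 14
```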
aggregate does have some performance implications because it performs the final aggregation on
the driver. If the results from the executors are too large, they can take down the driver with an
OutOfMemoryError. There is another method, treeAggregate, that does the same thing as aggregate
(at the user level) but does so in a different way. It basically “pushes down” some of the
subaggregations (creating a tree from executor to executor) before performing the final aggregation on
the driver. Having multiple levels can help you ensure that the driver does not run out of memory in
the process of the aggregation. These tree-based implementations are often used to improve stability
in certain operations:
// in Scala
val depth = 3
nums.treeAggregate(0)(maxFunc, addFunc, depth)
# in Python
depth = 3
nums.treeAggregate(0, maxFunc, addFunc, depth)
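The tree strategy can be sketched locally (a hedged simulation, not Spark's actual implementation): per-partition partials are merged pairwise for a few rounds before the driver performs the final combine:

```python
def tree_aggregate(partials, comb_op, depth=2):
    # Merge per-partition partial results pairwise for up to `depth` rounds
    # ("executor to executor") before the final combine on the "driver".
    level = list(partials)
    for _ in range(depth):
        if len(level) <= 1:
            break
        level = [comb_op(level[i], level[i + 1]) if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
    result = level[0]
    for p in level[1:]:
        result = comb_op(result, p)
    return result

partials = [3, 5, 6, 2]  # illustrative per-partition results
print(tree_aggregate(partials, lambda x, y: x + y))  # 16
```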
aggregateByKey
This function does the same as aggregate but instead of doing it partition by partition, it does it by
key. The start value and functions follow the same properties:
// in Scala
KVcharacters.aggregateByKey(0)(addFunc, maxFunc).collect()
# in Python
KVcharacters.aggregateByKey(0, addFunc, maxFunc).collect()
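A pure-Python sketch of the by-key variant (the partitions and pairs are illustrative; addFunc runs within partitions and maxFunc merges the per-partition accumulators, as in the example above):

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    # seq_op folds values into a per-partition accumulator for each key;
    # comb_op merges accumulators for the same key across partitions.
    partials = []
    for part in partitions:
        local = {}
        for k, v in part:
            local[k] = seq_op(local.get(k, zero), v)
        partials.append(local)
    result = {}
    for local in partials:
        for k, acc in local.items():
            result[k] = comb_op(result[k], acc) if k in result else acc
    return result

parts = [[("a", 1), ("a", 1), ("b", 1)], [("a", 1), ("b", 1)]]
print(aggregate_by_key(parts, 0, lambda x, y: x + y, max))  # {'a': 2, 'b': 1}
```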
combineByKey
Instead of specifying an aggregation function, you can specify a combiner. This combiner operates on
a given key and merges the values according to some function. It then merges the different outputs of
the combiners to give us our result. We can specify the number of output partitions or a custom
output partitioner as well:
// in Scala
val valToCombiner = (value:Int) => List(value)
val mergeValuesFunc = (vals:List[Int], valToAppend:Int) => valToAppend :: vals
val mergeCombinerFunc = (vals1:List[Int], vals2:List[Int]) => vals1 ::: vals2
// now we define these as function variables
val outputPartitions = 6
KVcharacters
  .combineByKey(
    valToCombiner,
    mergeValuesFunc,
    mergeCombinerFunc,
    outputPartitions)
  .collect()
# in Python
def valToCombiner(value):
    return [value]
def mergeValuesFunc(vals, valToAppend):
    vals.append(valToAppend)
    return vals
def mergeCombinerFunc(vals1, vals2):
    return vals1 + vals2
outputPartitions = 6
KVcharacters\
  .combineByKey(
    valToCombiner,
    mergeValuesFunc,
    mergeCombinerFunc,
    outputPartitions)\
  .collect()
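The three functions fit together as follows; here is a hedged pure-Python sketch of the semantics (the partitions are illustrative, and values are appended rather than prepended as in the Scala version):

```python
def combine_by_key(partitions, create, merge_value, merge_combiners):
    # Build a combiner per key within each partition, then merge the
    # combiners for the same key across partitions.
    partials = []
    for part in partitions:
        local = {}
        for k, v in part:
            local[k] = merge_value(local[k], v) if k in local else create(v)
        partials.append(local)
    result = {}
    for local in partials:
        for k, comb in local.items():
            result[k] = merge_combiners(result[k], comb) if k in result else comb
    return result

parts = [[("a", 1), ("a", 2)], [("a", 3), ("b", 4)]]
out = combine_by_key(parts,
                     lambda v: [v],                # valToCombiner
                     lambda vals, v: vals + [v],   # mergeValuesFunc
                     lambda v1, v2: v1 + v2)       # mergeCombinerFunc
print(out)  # {'a': [1, 2, 3], 'b': [4]}
```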
foldByKey
foldByKey merges the values for each key using an associative function and a neutral “zero value,”
which can be added to the result an arbitrary number of times, and must not change the result (e.g., 0
for addition, or 1 for multiplication):
// in Scala
KVcharacters.foldByKey(0)(addFunc).collect()
# in Python
KVcharacters.foldByKey(0, addFunc).collect()
CoGroups
CoGroups give you the ability to group together up to three key–value RDDs in Scala and
two in Python. This joins the given values by key. This is effectively just a group-based join on an
RDD. When doing this, you can also specify a number of output partitions or a custom partitioning
function to control exactly how this data is distributed across the cluster (we talk about partitioning
functions later on in this chapter):
// in Scala
import scala.util.Random
val distinctChars = words.flatMap(word => word.toLowerCase.toSeq).distinct
val charRDD = distinctChars.map(c => (c, new Random().nextDouble()))
val charRDD2 = distinctChars.map(c => (c, new Random().nextDouble()))
val charRDD3 = distinctChars.map(c => (c, new Random().nextDouble()))
charRDD.cogroup(charRDD2, charRDD3).take(5)
# in Python
import random
distinctChars = words.flatMap(lambda word: word.lower()).distinct()
charRDD = distinctChars.map(lambda c: (c, random.random()))
charRDD2 = distinctChars.map(lambda c: (c, random.random()))
charRDD.cogroup(charRDD2).take(5)
The result is a group with our key on one side, and all of the relevant values on the other side.
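The grouping semantics can be sketched in plain Python (an illustration with hypothetical inputs, not Spark code): every key that appears in either dataset is paired with the list of its values from each side.

```python
def cogroup(rdd1, rdd2):
    # Group the values from both datasets by key; keys missing from one
    # side get an empty list for that side.
    keys = {k for k, _ in rdd1} | {k for k, _ in rdd2}
    return {k: ([v for k1, v in rdd1 if k1 == k],
                [v for k2, v in rdd2 if k2 == k]) for k in sorted(keys)}

left = [("a", 1), ("b", 2), ("a", 3)]
right = [("a", 9), ("c", 7)]
print(cogroup(left, right))
# {'a': ([1, 3], [9]), 'b': ([2], []), 'c': ([], [7])}
```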
Joins
RDDs have much the same joins as we saw in the Structured APIs, although RDDs are much more
involved to use. They all follow the same basic format: the two RDDs we would like to join, and,
optionally, either the number of output partitions or the custom partitioning function to which they
should output. We'll talk about partitioning functions later on in this chapter.
Inner Join
We’ll demonstrate an inner join now. Notice how we are setting the number of output partitions we
would like to see:
// in Scala
val keyedChars = distinctChars.map(c => (c, new Random().nextDouble()))
val outputPartitions = 10
KVcharacters.join(keyedChars).count()
KVcharacters.join(keyedChars, outputPartitions).count()
# in Python
keyedChars = distinctChars.map(lambda c: (c, random.random()))
outputPartitions = 10
KVcharacters.join(keyedChars).count()
KVcharacters.join(keyedChars, outputPartitions).count()
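The inner-join semantics can be checked locally (a sketch with hypothetical inputs): each value for a key on the left is paired with each value for the same key on the right, and keys missing from either side drop out.

```python
def inner_join(rdd1, rdd2):
    # Index the right side by key, then pair up matching values.
    right_by_key = {}
    for k, v in rdd2:
        right_by_key.setdefault(k, []).append(v)
    return [(k, (lv, rv)) for k, lv in rdd1 for rv in right_by_key.get(k, [])]

left = [("a", 1), ("b", 2), ("a", 3)]
right = [("a", "x"), ("c", "y")]
print(inner_join(left, right))  # [('a', (1, 'x')), ('a', (3, 'x'))]
```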
We won't provide an example for the other joins, but they all follow the same basic format. You can
learn about the following join types at the conceptual level in Chapter 8:
fullOuterJoin
leftOuterJoin
rightOuterJoin
cartesian (This, again, is very dangerous! It does not accept a join key and can have a
massive output.)
zips
The final type of join isn't really a join at all, but it does combine two RDDs, so it's worth labeling it
as a join. zip allows you to “zip” together two RDDs, assuming that they have the same length. This
creates a PairRDD. The two RDDs must have the same number of partitions as well as the same
number of elements:
// in Scala
val numRange = sc.parallelize(0 to 9, 2)
words.zip(numRange).collect()
# in Python
numRange = sc.parallelize(range(10), 2)
words.zip(numRange).collect()
This gives us the following result, an array of keys zipped to the values:
[('Spark', 0),
('The', 1),
('Definitive', 2),
('Guide', 3),
(':', 4),
('Big', 5),
('Data', 6),
('Processing', 7),
('Made', 8),
('Simple', 9)]
Controlling Partitions
With RDDs, you have control over exactly how data is physically distributed across the cluster. Some
of these methods are basically the same as what we have in the Structured APIs, but the key
addition (that does not exist in the Structured APIs) is the ability to specify a partitioning function
(formally a custom Partitioner, which we discuss later when we look at basic methods).
coalesce
coalesce effectively collapses partitions on the same worker in order to avoid a shuffle of the data
when repartitioning. For instance, our words RDD currently has two partitions; we can collapse it to
one partition by using coalesce without bringing about a shuffle of the data:
// in Scala
words.coalesce(1).getNumPartitions // 1
# in Python
words.coalesce(1).getNumPartitions() # 1
repartition
The repartition operation allows you to repartition your data up or down but performs a shuffle
across nodes in the process. Increasing the number of partitions can increase the level of parallelism
when operating in map- and filter-type operations:
words.repartition(10) // gives us 10 partitions
repartitionAndSortWithinPartitions
This operation gives you the ability to repartition as well as specify the ordering of each one of those
output partitions. We’ll omit the example because the documentation for it is good, but both the
partitioning and the key comparisons can be specified by the user.
Custom Partitioning
This ability is one of the primary reasons you'd want to use RDDs. Custom partitioners are not
available in the Structured APIs because they don't really have a logical counterpart. They're a
low-level implementation detail that can have a significant effect on whether your jobs run successfully.
The canonical example that motivates custom partitioning is PageRank, whereby we seek
to control the layout of the data on the cluster and avoid shuffles. In our shopping dataset, this might
mean partitioning by each customer ID (we'll get to this example in a moment).
In short, the sole goal of custom partitioning is to even out the distribution of your data across the
cluster so that you can work around problems like data skew.
If you're going to use custom partitioners, you should drop down to RDDs from the Structured APIs,
apply your custom partitioner, and then convert back to a DataFrame or Dataset. This way, you get
the best of both worlds, only dropping down to custom partitioning when you need to.
To perform custom partitioning you need to implement your own class that extends Partitioner. You
need to do this only when you have lots of domain knowledge about your problem space—if you're
just looking to partition on a value or even a set of values (columns), it's worth just doing it in the
DataFrame API.
Lets dive into an example:
// in Scala
val df = spark.read.option("header", "true").option("inferSchema", "true")
.csv("/data/retail-data/all/")
val rdd = df.coalesce(10).rdd
# in Python
df = spark.read.option("header", "true").option("inferSchema", "true")\
.csv("/data/retail-data/all/")
rdd = df.coalesce(10).rdd
df.printSchema()
Spark has two built-in partitioners that you can leverage in the RDD API: a HashPartitioner for
discrete values and a RangePartitioner for continuous values. Spark's Structured APIs will already
use these, although we can use the same thing in RDDs:
// in Scala
import org.apache.spark.HashPartitioner
rdd.map(r => r(6)).take(5).foreach(println)
val keyedRDD = rdd.keyBy(row => row(6).asInstanceOf[Int].toDouble)
keyedRDD.partitionBy(new HashPartitioner(10)).take(10)
Although the hash and range partitioners are useful, they're fairly rudimentary. At times, you will need
to perform some very low-level partitioning because you're working with very large data and
significant key skew. Key skew simply means that some keys have many, many more values than other
keys. You want to break up these keys as much as possible to improve parallelism and prevent
OutOfMemoryErrors during the course of execution.
One instance might be that you need to partition more finely only when a key matches a certain
format. For instance, we might know that there are two customers in your dataset that always crash
your analysis and we need to break them up further than other customer IDs. In fact, these two are so
skewed that they need to be operated on alone, whereas all of the others can be lumped into large
groups. This is obviously a bit of a caricatured example, but you might see similar situations in your
data, as well:
// in Scala
import org.apache.spark.Partitioner
class DomainPartitioner extends Partitioner {
  def numPartitions = 3
  def getPartition(key: Any): Int = {
    val customerId = key.asInstanceOf[Double].toInt
    if (customerId == 17850.0 || customerId == 12583.0) {
      return 0
    } else {
      return new java.util.Random().nextInt(2) + 1
    }
  }
}
keyedRDD
  .partitionBy(new DomainPartitioner).map(_._1).glom().map(_.toSet.toSeq.length)
  .take(5)
After you run this, you will see the count of results in each partition. The second two numbers will
vary, because we're distributing them randomly (as you will see when we do the same in Python), but
the same principles apply:
# in Python
def partitionFunc(key):
    import random
    if key == 17850 or key == 12583:
        return 0
    else:
        return random.randint(1,2)
keyedRDD = rdd.keyBy(lambda row: row[6])
keyedRDD\
  .partitionBy(3, partitionFunc)\
  .map(lambda x: x[0])\
  .glom()\
  .map(lambda x: len(set(x)))\
  .take(5)
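The routing logic can be checked without a cluster; this is a hedged local simulation of what the glom()/len(set(...)) pipeline reports (the batch of keys is illustrative):

```python
import random

def partition_func(key):
    # Route the two heavily skewed customer IDs to their own partition;
    # spread all other keys randomly across the remaining two partitions.
    if key == 17850 or key == 12583:
        return 0
    return random.randint(1, 2)

# Assign a batch of keys to 3 partitions and count the distinct keys in each.
keys = [17850, 12583] + list(range(10000, 10020))
partitions = [set(), set(), set()]
for k in keys:
    partitions[partition_func(k)].add(k)
counts = [len(p) for p in partitions]
# counts[0] is always 2 (the skewed IDs); the other two counts vary run to run
```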
This custom key distribution logic is available only at the RDD level. Of course, this is a simple
example, but it does show the power of using arbitrary logic to distribute the data around the cluster
in a physical manner.
Custom Serialization
The last advanced topic worth discussing is Kryo serialization. Any object (or function) that you hope
to parallelize must be serializable:
// in Scala
class SomeClass extends Serializable {
  var someValue = 0
  def setSomeValue(i:Int) = {
    someValue = i
    this
  }
}
sc.parallelize(1 to 10).map(num => new SomeClass().setSomeValue(num))
The default serialization can be quite slow. Spark can use the Kryo library (version 2) to serialize
objects more quickly. Kryo is significantly faster and more compact than Java serialization (often as
much as 10x), but does not support all serializable types and requires you to register the classes
you'll use in the program in advance for best performance.
You can use Kryo by initializing your job with a SparkConf and setting the value of
"spark.serializer" to "org.apache.spark.serializer.KryoSerializer" (we discuss this in
the next part of the book). This setting configures the serializer used for shuffling data between
worker nodes and serializing RDDs to disk. The only reason Kryo is not the default is because of the
custom registration requirement, but we recommend trying it in any network-intensive application.
Since Spark 2.0.0, Spark internally uses the Kryo serializer when shuffling RDDs with simple types,
arrays of simple types, or string type.
Spark automatically includes Kryo serializers for the many commonly used core Scala classes
covered in the AllScalaRegistrar from the Twitter chill library.
To register your own custom classes with Kryo, use the registerKryoClasses method:
// in Scala
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
Conclusion
In this chapter we discussed many of the more advanced topics regarding RDDs. Of particular note
was the section on custom partitioning, which gives you very fine-grained control over exactly how
your data is laid out on the cluster. In Chapter 14, we discuss another of Spark's low-level tools:
distributed variables.
Chapter 14. Distributed Shared Variables
In addition to the Resilient Distributed Dataset (RDD) interface, the second kind of low-level API in
Spark is two types of “distributed shared variables”: broadcast variables and accumulators. These
are variables you can use in your user-defined functions (e.g., in a map function on an RDD or a
DataFrame) that have special properties when running on a cluster. Specifically, accumulators let
you add together data from all the tasks into a shared result (e.g., to implement a counter so you can
see how many of your job’s input records failed to parse), while broadcast variables let you save a
large value on all the worker nodes and reuse it across many Spark actions without re-sending it to
the cluster. This chapter discusses some of the motivation for each of these variable types as well as
how to use them.
Broadcast Variables
Broadcast variables are a way you can share an immutable value efficiently around the cluster
without encapsulating that variable in a function closure. The normal way to use a variable in your
driver node inside your tasks is to simply reference it in your function closures (e.g., in a map
operation), but this can be inefficient, especially for large variables such as a lookup table or a
machine learning model. The reason for this is that when you use a variable in a closure, it must be
deserialized on the worker nodes many times (one per task). Moreover, if you use the same variable
in multiple Spark actions and jobs, it will be re-sent to the workers with every job instead of once.
This is where broadcast variables come in. Broadcast variables are shared, immutable variables that
are cached on every machine in the cluster instead of serialized with every single task. The canonical
use case is to pass around a large lookup table that fits in memory on the executors and use that in a
function, as illustrated in Figure 14-1.
Figure 14-1. Broadcast variables
For example, suppose that you have a list of words or values:
// in Scala
val myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple"
.split(" ")
val words = spark.sparkContext.parallelize(myCollection, 2)
# in Python
my_collection = "Spark The Definitive Guide : Big Data Processing Made Simple"\
.split(" ")
words = spark.sparkContext.parallelize(my_collection, 2)
You would like to supplement your list of words with other information that you have, which is many
kilobytes, megabytes, or potentially even gigabytes in size. This is technically a right join if we
thought about it in terms of SQL:
// in Scala
val supplementalData = Map("Spark" -> 1000, "Definitive" -> 200,
"Big" -> -300, "Simple" -> 100)
# in Python
supplementalData = {"Spark":1000, "Definitive":200,
"Big":-300, "Simple":100}
We can broadcast this structure across Spark and reference it by using suppBroadcast. This value is
immutable and is lazily replicated across all nodes in the cluster when we trigger an action:
// in Scala
val suppBroadcast = spark.sparkContext.broadcast(supplementalData)
# in Python
suppBroadcast = spark.sparkContext.broadcast(supplementalData)
We reference this variable via the value method, which returns the exact value that we had earlier.
This method is accessible within serialized functions without having to serialize the data. This can
save you a great deal of serialization and deserialization costs because Spark transfers data more
efficiently around the cluster using broadcasts:
// in Scala
suppBroadcast.value
# in Python
suppBroadcast.value
Now we could transform our RDD using this value. In this instance, we will create a key–value pair
according to the value we might have in the map. If we lack the value, we will simply replace it with
0:
// in Scala
words.map(word => (word, suppBroadcast.value.getOrElse(word, 0)))
.sortBy(wordPair => wordPair._2)
.collect()
# in Python
words.map(lambda word: (word, suppBroadcast.value.get(word, 0)))\
.sortBy(lambda wordPair: wordPair[1])\
.collect()
This returns the following value in Python and the same values in an array type in Scala:
[('Big', -300),
('The', 0),
...
('Definitive', 200),
('Spark', 1000)]
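The same transformation can be checked locally in plain Python (a sketch of what the broadcast lookup computes, independent of Spark):

```python
# Local equivalent of the broadcast lookup: pair each word with its
# supplemental value (defaulting to 0) and sort by that value.
supplemental_data = {"Spark": 1000, "Definitive": 200, "Big": -300, "Simple": 100}
words_list = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
pairs = sorted(((w, supplemental_data.get(w, 0)) for w in words_list),
               key=lambda p: p[1])
print(pairs[0], pairs[-1])  # ('Big', -300) ('Spark', 1000)
```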
The only difference between this and passing it into the closure is that we have done this in a much
more efficient manner. (Naturally, this depends on the amount of data and the number of executors;
for very small data, in the low kilobytes, on small clusters, it might not be.) Although this small
dictionary probably is not too large of a cost, if you have a much larger value, the cost of serializing
the data for every task can be quite significant.
One thing to note is that we used this in the context of an RDD; we can also use this in a UDF or in a
Dataset and achieve the same result.
Accumulators
Accumulators (Figure 14-2), Spark's second type of shared variable, are a way of updating a value
inside of a variety of transformations and propagating that value to the driver node in an efficient and
fault-tolerant way.
Figure 14-2. Accumulator variable
Accumulators provide a mutable variable that a Spark cluster can safely update on a per-row basis.
You can use these for debugging purposes (say to track the values of a certain variable per partition in
order to intelligently use it over time) or to create low-level aggregation. Accumulators are variables
that are “added” to only through an associative and commutative operation and can therefore be
efficiently supported in parallel. You can use them to implement counters (as in MapReduce) or sums.
Spark natively supports accumulators of numeric types, and programmers can add support for new
types.
For accumulator updates performed inside actions only, Spark guarantees that each task's update to
the accumulator will be applied only once, meaning that restarted tasks will not update the value. In
transformations, you should be aware that each task's update can be applied more than once if tasks
or job stages are reexecuted.
Accumulators do not change the lazy evaluation model of Spark. If an accumulator is being updated
within an operation on an RDD, its value is updated only once that RDD is actually computed (e.g.,
when you call an action on that RDD or an RDD that depends on it). Consequently, accumulator
updates are not guaranteed to be executed when made within a lazy transformation like map().
Accumulators can be both named and unnamed. Named accumulators will display their running
results in the Spark UI, whereas unnamed ones will not.
Basic Example
Let's experiment by performing a custom aggregation on the Flight dataset that we created earlier in
the book. In this example, we will use the Dataset API as opposed to the RDD API, but the extension
is quite similar:
// in Scala
case class Flight(DEST_COUNTRY_NAME: String,
ORIGIN_COUNTRY_NAME: String, count: BigInt)
val flights = spark.read
.parquet("/data/flight-data/parquet/2010-summary.parquet")
.as[Flight]
# in Python
flights = spark.read\
.parquet("/data/flight-data/parquet/2010-summary.parquet")
Now let's create an accumulator that will count the number of flights to or from China. Even though
we could do this in a fairly straightforward manner in SQL, many things might not be so
straightforward. Accumulators provide a programmatic way for us to do these sorts of counts. The
following demonstrates creating an unnamed accumulator:
// in Scala
import org.apache.spark.util.LongAccumulator
val accUnnamed = new LongAccumulator
spark.sparkContext.register(accUnnamed)
# in Python
accChina = spark.sparkContext.accumulator(0)
Our use case fits a named accumulator a bit better. There are two ways to do this: a short-hand
method and a long-hand one. The simplest is to use the SparkContext. Alternatively, we can
instantiate the accumulator and register it with a name:
// in Scala
val accChina = new LongAccumulator
val accChina2 = spark.sparkContext.longAccumulator("China")
spark.sparkContext.register(accChina, "China")
We specify the name of the accumulator in the string value that we pass into the function, or as the
second parameter into the register function. Named accumulators will display in the Spark UI,
whereas unnamed ones will not.
The next step is to define the way we add to our accumulator. This is a fairly straightforward
function:
// in Scala
def accChinaFunc(flight_row: Flight) = {
  val destination = flight_row.DEST_COUNTRY_NAME
  val origin = flight_row.ORIGIN_COUNTRY_NAME
  if (destination == "China") {
    accChina.add(flight_row.count.toLong)
  }
  if (origin == "China") {
    accChina.add(flight_row.count.toLong)
  }
}
# in Python
def accChinaFunc(flight_row):
    destination = flight_row["DEST_COUNTRY_NAME"]
    origin = flight_row["ORIGIN_COUNTRY_NAME"]
    if destination == "China":
        accChina.add(flight_row["count"])
    if origin == "China":
        accChina.add(flight_row["count"])
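To see what this function accumulates, here is a hedged local stand-in for Spark's LongAccumulator run over a few hypothetical flight rows (the rows are illustrative, not the real dataset):

```python
class LongAccumulator:
    # A minimal local stand-in for Spark's LongAccumulator: just "add"
    # and "value", with adds assumed commutative and associative.
    def __init__(self):
        self._value = 0
    def add(self, n):
        self._value += n
    @property
    def value(self):
        return self._value

acc_china = LongAccumulator()

def acc_china_func(row):
    if row["DEST_COUNTRY_NAME"] == "China":
        acc_china.add(row["count"])
    if row["ORIGIN_COUNTRY_NAME"] == "China":
        acc_china.add(row["count"])

rows = [  # hypothetical flight rows
    {"DEST_COUNTRY_NAME": "China", "ORIGIN_COUNTRY_NAME": "United States", "count": 5},
    {"DEST_COUNTRY_NAME": "United States", "ORIGIN_COUNTRY_NAME": "China", "count": 3},
    {"DEST_COUNTRY_NAME": "India", "ORIGIN_COUNTRY_NAME": "Japan", "count": 7},
]
for r in rows:
    acc_china_func(r)
print(acc_china.value)  # 8
```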
Now, let's iterate over every row in our flights dataset via the foreach method. The reason for this is
that foreach is an action, and Spark can provide guarantees that hold only inside of actions. The
foreach method will run once for each row in the input DataFrame (assuming that we did not filter it)
and will run our function against each row, incrementing the accumulator accordingly:
// in Scala
flights.foreach(flight_row => accChinaFunc(flight_row))
# in Python
flights.foreach(lambda flight_row: accChinaFunc(flight_row))
This will complete fairly quickly, but if you navigate to the Spark UI, you can see the relevant value,
on a per-Executor level, even before querying it programmatically, as demonstrated in Figure 14-3.
Figure 14-3. Executor Spark UI
Of course, we can query it programmatically, as well. To do this, we use the value property:
// in Scala
accChina.value // 953
# in Python
accChina.value # 953
Custom Accumulators
Although Spark does provide some default accumulator types, sometimes you might want to build
your own custom accumulator. To do this, you need to subclass the AccumulatorV2 class. There are
several abstract methods that you need to implement, as you can see in the example that follows. In
this example, we will add only values that are even to the accumulator. Although this is again
simplistic, it should show you how easy it is to build up your own accumulators:
// in Scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.util.AccumulatorV2
val arr = ArrayBuffer[BigInt]()
class EvenAccumulator extends AccumulatorV2[BigInt, BigInt] {
  private var num:BigInt = 0
  def reset(): Unit = {
    this.num = 0
  }
  def add(intValue: BigInt): Unit = {
    if (intValue % 2 == 0) {
      this.num += intValue
    }
  }
  def merge(other: AccumulatorV2[BigInt,BigInt]): Unit = {
    this.num += other.value
  }
  def value():BigInt = {
    this.num
  }
  def copy(): AccumulatorV2[BigInt,BigInt] = {
    new EvenAccumulator
  }
  def isZero():Boolean = {
    this.num == 0
  }
}
val acc = new EvenAccumulator
sc.register(acc, "evenAcc")
// in Scala
acc.value // 0
flights.foreach(flight_row => acc.add(flight_row.count))
acc.value // 31390
If you are predominantly a Python user, you can also create your own custom accumulators by
subclassing AccumulatorParam and using it as we saw in the previous example.
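The contract itself can be sketched in plain Python (this is a local illustration of the required methods, not an actual pyspark AccumulatorParam subclass):

```python
class EvenAccumulator:
    # A plain-Python sketch of the AccumulatorV2 contract from the Scala
    # example: reset, add, merge, copy, and a zero check, accumulating
    # only even values.
    def __init__(self):
        self.num = 0
    def reset(self):
        self.num = 0
    def add(self, value):
        if value % 2 == 0:
            self.num += value
    def merge(self, other):
        self.num += other.num
    def copy(self):
        return EvenAccumulator()
    def is_zero(self):
        return self.num == 0

acc = EvenAccumulator()
for n in [1, 2, 3, 4, 5, 6]:
    acc.add(n)
print(acc.num)  # 2 + 4 + 6 = 12
```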
Conclusion
In this chapter, we covered distributed variables. These can be helpful tools for optimizations or for
debugging. In Chapter 15, we define how Spark runs on a cluster to better understand when these can
be helpful.
Part IV. Production Applications
Chapter 15. How Spark Runs on a Cluster
Thus far in the book, we focused on Spark's properties as a programming interface. We have
discussed how the structured APIs take a logical operation, break it up into a logical plan, and
convert that to a physical plan that actually consists of Resilient Distributed Dataset (RDD)
operations that execute across the cluster of machines. This chapter focuses on what happens when
Spark goes about executing that code. We discuss this in an implementation-agnostic way—this
depends on neither the cluster manager that you're using nor the code that you're running. At the end
of the day, all Spark code runs the same way.
This chapter covers several key topics:
The architecture and components of a Spark Application
The life cycle of a Spark Application inside and outside of Spark
Important low-level execution properties, such as pipelining
What it takes to run a Spark Application, as a segue into Chapter 16.
Let's begin with the architecture.
The Architecture of a Spark Application
In Chapter 2, we discussed some of the high-level components of a Spark Application. Let’s review
those again:
The Spark driver
The driver is the process “in the driver seat” of your Spark Application. It is the controller of the
execution of a Spark Application and maintains all of the state of the Spark cluster (the state and
tasks of the executors). It must interface with the cluster manager in order to actually get physical
resources and launch executors. At the end of the day, this is just a process on a physical machine
that is responsible for maintaining the state of the application running on the cluster.
The Spark executors
Spark executors are the processes that perform the tasks assigned by the Spark driver. Executors
have one core responsibility: take the tasks assigned by the driver, run them, and report back their
state (success or failure) and results. Each Spark Application has its own separate executor
processes.
The cluster manager
The Spark Driver and Executors do not exist in a void, and this is where the cluster manager
comes in. The cluster manager is responsible for maintaining a cluster of machines that will run
your Spark Application(s). Somewhat confusingly, a cluster manager will have its own “driver”
(sometimes called master) and “worker” abstractions. The core difference is that these are tied to
physical machines rather than processes (as they are in Spark). Figure 15-1 shows a basic cluster
setup. The machine on the left of the illustration is the Cluster Manager Driver Node. The circles
represent daemon processes running on and managing each of the individual worker nodes. There
is no Spark Application running as of yet—these are just the processes from the cluster manager.
Figure 15-1. A cluster driver and worker (no Spark Application yet)
When it comes time to actually run a Spark Application, we request resources from the cluster
manager to run it. Depending on how our application is configured, this can include a place to run the
Spark driver or might be just resources for the executors for our Spark Application. Over the course
of Spark Application execution, the cluster manager will be responsible for managing the underlying
machines that our application is running on.
Spark currently supports three cluster managers: a simple built-in standalone cluster manager, Apache
Mesos, and Hadoop YARN. However, this list will continue to grow, so be sure to check the
documentation for your favorite cluster manager.
Now that we've covered the basic components of an application, let's walk through one of the first
choices you will need to make when running your applications: choosing the execution mode.
Execution Modes
An execution mode gives you the power to determine where the aforementioned resources are
physically located when you go to run your application. You have three modes to choose from:
Cluster mode
Client mode
Local mode
We will walk through each of these in detail using Figure 15-1 as a template. In the following section,
rectangles with solid borders represent the Spark driver process, whereas those with dotted borders
represent the executor processes.
Cluster mode
Cluster mode is probably the most common way of running Spark Applications. In cluster mode, a
user submits a pre-compiled JAR, Python script, or R script to a cluster manager. The cluster manager
then launches the driver process on a worker node inside the cluster, in addition to the executor
processes. This means that the cluster manager is responsible for maintaining all Spark Application
related processes. Figure 15-2 shows that the cluster manager placed our driver on a worker node
and the executors on other worker nodes.
Figure 15-2. Sparks cluster mode
Client mode
Client mode is nearly the same as cluster mode except that the Spark driver remains on the client
machine that submitted the application. This means that the client machine is responsible for
maintaining the Spark driver process, and the cluster manager maintains the executor processes. In
Figure 15-3, we are running the Spark Application from a machine that is not colocated on the cluster.
These machines are commonly referred to as gateway machines or edge nodes. In Figure 15-3, you
can see that the driver is running on a machine outside of the cluster but that the workers are located
on machines in the cluster.
Figure 15-3. Sparks client mode
Local mode
Local mode is a significant departure from the previous two modes: it runs the entire Spark
Application on a single machine. It achieves parallelism through threads on that single machine. This
is a common way to learn Spark, to test your applications, or experiment iteratively with local
development. However, we do not recommend using local mode for running production applications.
The Life Cycle of a Spark Application (Outside Spark)
This chapter has thus far covered the vocabulary necessary for discussing Spark Applications. It's
now time to talk about the overall life cycle of Spark Applications from “outside” the actual Spark
code. We will do this with an illustrated example of an application run with spark-submit
(introduced in Chapter 3). We assume that a cluster is already running with four nodes, a driver (not a
Spark driver but cluster manager driver) and three worker nodes. The actual cluster manager does not
matter at this point: this section uses the vocabulary from the previous section to walk through a step-
by-step Spark Application life cycle from initialization to program exit.
NOTE
This section also makes use of illustrations and follows the same notation that we introduced previously. Additionally, we
now introduce lines that represent network communication. Darker arrows represent communication by Spark or Spark-
related processes, whereas dashed lines represent more general communication (like cluster management communication).
Client Request
The first step is for you to submit an actual application. This will be a pre-compiled JAR or library.
At this point, you are executing code on your local machine and you're going to make a request to the
cluster manager driver node (Figure 15-4). Here, we are explicitly asking for resources for the Spark
driver process only. We assume that the cluster manager accepts this offer and places the driver onto
a node in the cluster. The client process that submitted the original job exits and the application is off
and running on the cluster.
Figure 15-4. Requesting resources for a driver
To do this, you'll run something like the following command in your terminal:
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode cluster \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
Launch
Now that the driver process has been placed on the cluster, it begins running user code (Figure 15-5).
This code must include a SparkSession that initializes a Spark cluster (e.g., driver + executors). The
SparkSession will subsequently communicate with the cluster manager (the darker line), asking it to
launch Spark executor processes across the cluster (the lighter lines). The number of executors and
their relevant configurations are set by the user via the command-line arguments in the original
spark-submit call.
Figure 15-5. Launching the Spark Application
The cluster manager responds by launching the executor processes (assuming all goes well) and sends
the relevant information about their locations to the driver process. After everything is hooked up
correctly, we have a “Spark Cluster” as you likely think of it today.
Execution
Now that we have a “Spark Cluster,” Spark goes about its merry way executing code, as shown in
Figure 15-6. The driver and the workers communicate among themselves, executing code and moving
data around. The driver schedules tasks onto each worker, and each worker responds with the status
of those tasks and success or failure. (We cover these details shortly.)
Figure 15-6. Application execution
Completion
After a Spark Application completes, the driver process exits with either success or failure
(Figure 15-7). The cluster manager then shuts down the executors in that Spark cluster for the driver.
At this point, you can see the success or failure of the Spark Application by asking the cluster
manager for this information.
Figure 15-7. Shutting down the application
The Life Cycle of a Spark Application (Inside Spark)
We just examined the life cycle of a Spark Application outside of user code (basically the
infrastructure that supports Spark), but it’s arguably more important to talk about what happens within
Spark when you run an application. This is “user-code” (the actual code that you write that defines
your Spark Application). Each application is made up of one or more Spark jobs. Spark jobs within
an application are executed serially (unless you use threading to launch multiple actions in parallel).
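The threading pattern mentioned above can be sketched in plain Python with a thread pool. This is an illustrative sketch, not Spark code: run_action is a hypothetical stand-in for a blocking Spark action such as df.count(), but the pattern of submitting actions from separate threads is the same:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a Spark action (e.g., df.count());
# in a real application, each call would block until its job finishes.
def run_action(name):
    return f"{name} finished"

# Launching actions from separate threads lets the Spark scheduler
# run the resulting jobs concurrently instead of serially.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(run_action, n) for n in ("job-a", "job-b")]
    results = [f.result() for f in futures]

print(results)
```

In a real application, each thread would hold a reference to the shared SparkSession and call an action on its own DataFrame.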
The SparkSession
The first step of any Spark Application is creating a SparkSession. In many interactive modes, this is
done for you, but in an application, you must do it manually.
Some of your legacy code might use the new SparkContext pattern. This should be avoided in favor
of the builder method on the SparkSession, which more robustly instantiates the Spark and SQL
Contexts and ensures that there is no context conflict, given that there might be multiple libraries
trying to create a session in the same Spark Application:
// Creating a SparkSession in Scala
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Databricks Spark Example")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .getOrCreate()
# Creating a SparkSession in Python
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("Word Count")\
    .config("spark.some.config.option", "some-value")\
    .getOrCreate()
After you have a SparkSession, you should be able to run your Spark code. From the
SparkSession, you can access all of the low-level and legacy contexts and configurations,
as well. Note that the SparkSession class was only added in Spark 2.X. Older code you might find
would instead directly create a SparkContext and a SQLContext for the structured APIs.
The SparkContext
A SparkContext object within the SparkSession represents the connection to the Spark cluster.
This class is how you communicate with some of Spark's lower-level APIs, such as RDDs. It is
commonly stored as the variable sc in older examples and documentation. Through a SparkContext,
you can create RDDs, accumulators, and broadcast variables, and you can run code on the cluster.
For the most part, you should not need to explicitly initialize a SparkContext; you should just be
able to access it through the SparkSession. If you do want to, you should create it in the most general
way, through the getOrCreate method:
// in Scala
import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()
THE SPARKSESSION, SQLCONTEXT, AND HIVECONTEXT
In previous versions of Spark, the SQLContext and HiveContext provided the ability to work
with DataFrames and Spark SQL and were commonly stored as the variable sqlContext in
examples, documentation, and legacy code. As a historical point, Spark 1.X had effectively two
contexts, the SparkContext and the SQLContext, which each served different purposes.
The former focused on more fine-grained control of Spark's central abstractions, whereas the
latter focused on the higher-level tools like Spark SQL. In Spark 2.X, the community combined
the two APIs into the centralized SparkSession that we have today. However, both of these APIs
still exist and you can access them via the SparkSession. It is important to note that you should
never need to use the SQLContext and rarely need to use the SparkContext.
After you initialize your SparkSession, it's time to execute some code. As we know from previous
chapters, all Spark code compiles down to RDDs. Therefore, in the next section, we will take some
logical instructions (a DataFrame job) and walk through, step by step, what happens over time.
Logical Instructions
As you saw in the beginning of the book, Spark code essentially consists of transformations and
actions. How you build these is up to you—whether it’s through SQL, low-level RDD manipulation,
or machine learning algorithms. Understanding how we take declarative instructions like DataFrames
and convert them into physical execution plans is an important step toward understanding how Spark runs
on a cluster. In this section, be sure to run the code in a fresh environment (a new Spark shell) to follow
along with the job, stage, and task numbers.
Logical instructions to physical execution
We mentioned this in Part II, but it’s worth reiterating so that you can better understand how Spark
takes your code and actually runs the commands on the cluster. We will walk through some more
code, line by line, explaining what's happening behind the scenes so that you can walk away with a
better understanding of your Spark Applications. In later chapters, when we discuss monitoring, we
will perform a more detailed tracking of a Spark job through the Spark UI. In this example,
we’ll take a simpler approach. We are going to do a three-step job: using a simple DataFrame, we’ll
repartition it, perform a value-by-value manipulation, and then aggregate some values and collect the
final result.
NOTE
This code was written and runs with Spark 2.2 in Python (you'll get the same result in Scala, so we've omitted it). The
number of jobs is unlikely to change drastically, but there might be improvements to Spark's underlying optimizations that
change physical execution strategies.
# in Python
df1 = spark.range(2, 10000000, 2)
df2 = spark.range(2, 10000000, 4)
step1 = df1.repartition(5)
step12 = df2.repartition(6)
step2 = step1.selectExpr("id * 5 as id")
step3 = step2.join(step12, ["id"])
step4 = step3.selectExpr("sum(id)")
step4.collect() # 2500000000000
When you run this code, you can see that the action triggers one complete Spark job. Let's take a
look at the explain plan to ground our understanding of the physical execution plan. We can also access
this information on the SQL tab (after we actually run a query) in the Spark UI:
step4.explain()
== Physical Plan ==
*HashAggregate(keys=[], functions=[sum(id#15L)])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_sum(id#15L)])
      +- *Project [id#15L]
         +- *SortMergeJoin [id#15L], [id#10L], Inner
            :- *Sort [id#15L ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(id#15L, 200)
            :     +- *Project [(id#7L * 5) AS id#15L]
            :        +- Exchange RoundRobinPartitioning(5)
            :           +- *Range (2, 10000000, step=2, splits=8)
            +- *Sort [id#10L ASC NULLS FIRST], false, 0
               +- Exchange hashpartitioning(id#10L, 200)
                  +- Exchange RoundRobinPartitioning(6)
                     +- *Range (2, 10000000, step=4, splits=8)
When you call collect (or any action), you trigger the execution of a Spark job, which consists of
stages and tasks. If you are running this on your local machine, go to localhost:4040 to see the
Spark UI. We will follow along on the “Jobs” tab, eventually jumping to stages and
tasks as we proceed to further levels of detail.
A Spark Job
In general, there should be one Spark job for one action. Actions always return results. Each job
breaks down into a series of stages, the number of which depends on how many shuffle operations
need to take place.
This job breaks down into the following stages and tasks:
Stage 1 with 8 Tasks
Stage 2 with 8 Tasks
Stage 3 with 6 Tasks
Stage 4 with 5 Tasks
Stage 5 with 200 Tasks
Stage 6 with 1 Task
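Since each stage runs one task per partition it processes, the task counts above can be reproduced with a little arithmetic. A minimal sketch, where the partition counts are taken from the job we just ran:

```python
# Number of tasks per stage equals the number of partitions that
# stage processes (partition counts taken from the job above).
stage_partitions = {
    "stage 1: range for df1": 8,    # range() creates 8 partitions by default
    "stage 2: range for df2": 8,
    "stage 3: repartition df2": 6,  # df2.repartition(6)
    "stage 4: repartition df1": 5,  # step1 = df1.repartition(5)
    "stage 5: join + partial agg": 200,  # spark.sql.shuffle.partitions default
    "stage 6: final aggregation": 1,
}

total_tasks = sum(stage_partitions.values())
print(total_tasks)  # 228 tasks across the whole job
```

The rest of this section explains where each of these partition counts comes from.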
I hope you're at least somewhat confused about how we got to these numbers so that we can take the
time to better understand what is going on!
Stages
Stages in Spark represent groups of tasks that can be executed together to compute the same operation
on multiple machines. In general, Spark will try to pack as much work as possible (i.e., as many
transformations as possible inside your job) into the same stage, but the engine starts new stages after
operations called shuffles. A shuffle represents a physical repartitioning of the data—for example,
sorting a DataFrame, or grouping data that was loaded from a file by key (which requires sending
records with the same key to the same node). This type of repartitioning requires coordinating across
executors to move data around. Spark starts a new stage after each shuffle, and keeps track of what
order the stages must run in to compute the final result.
In the job we looked at earlier, the first two stages correspond to the range that you perform in order
to create your DataFrames. By default when you create a DataFrame with range, it has eight
partitions. The next step is the repartitioning. This changes the number of partitions by shuffling the
data. These DataFrames are shuffled into six partitions and five partitions, corresponding to the
number of tasks in stages 3 and 4.
Stages 3 and 4 operate on each of those DataFrames, and the end of the stage represents the join (a
shuffle). Suddenly, we have 200 tasks. This is because of a Spark SQL configuration. The
spark.sql.shuffle.partitions default value is 200, which means that when there is a shuffle
performed during execution, it outputs 200 shuffle partitions by default. You can change this value,
and the number of output partitions will change.
TIP
We cover the number of partitions in a bit more detail in Chapter 19 because it's such an important parameter. This value
should be set according to the number of cores in your cluster to ensure efficient execution. Here's how to set it:
spark.conf.set("spark.sql.shuffle.partitions", 50)
A good rule of thumb is that the number of partitions should be larger than the number of executors on
your cluster, potentially by multiple factors depending on the workload. If you are running code on
your local machine, it would behoove you to set this value lower because your local machine is
unlikely to be able to execute that number of tasks in parallel. This is more of a default for a cluster in
which there might be many more executor cores to use. Regardless of the number of partitions, that
entire stage is computed in parallel. The final step aggregates those partitions individually, then brings
them all together into a single partition before sending the final result to the driver. We'll see this
configuration several times over the course of this part of the book.
Tasks
Stages in Spark consist of tasks. Each task corresponds to a combination of blocks of data and a set
of transformations that will run on a single executor. If there is one big partition in our dataset, we
will have one task. If there are 1,000 little partitions, we will have 1,000 tasks that can be executed in
parallel. A task is just a unit of computation applied to a unit of data (the partition). Partitioning your
data into a greater number of partitions means that more can be executed in parallel. This is not a
panacea, but it is a simple place to begin with optimization.
Execution Details
Tasks and stages in Spark have some important properties that are worth reviewing before we close
out this chapter. First, Spark automatically pipelines stages and tasks that can be done together, such
as a map operation followed by another map operation. Second, for all shuffle operations, Spark
writes the data to stable storage (e.g., disk), and can reuse it across multiple jobs. We’ll discuss these
concepts in turn because they will come up when you start inspecting applications through the Spark
UI.
Pipelining
An important part of what makes Spark an “in-memory computation tool” is that unlike the tools that
came before it (e.g., MapReduce), Spark performs as many steps as it can at one point in time before
writing data to memory or disk. One of the key optimizations that Spark performs is pipelining, which
occurs at and below the RDD level. With pipelining, any sequence of operations that feed data
directly into each other, without needing to move it across nodes, is collapsed into a single stage of
tasks that do all the operations together. For example, if you write an RDD-based program that does a
map, then a filter, then another map, these will result in a single stage of tasks that immediately read
each input record, pass it through the first map, pass it through the filter, and pass it through the last
map function if needed. This pipelined version of the computation is much faster than writing the
intermediate results to memory or disk after each step. The same kind of pipelining happens for a
DataFrame or SQL computation that does a select, filter, and select.
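The fusion just described can be mimicked in plain Python with a generator: each record flows through the map, filter, and map steps in turn, with no intermediate collection materialized between steps. This is an illustrative sketch of the idea, not Spark code:

```python
# Pipelined map -> filter -> map: each record passes through all
# three steps before the next record is read, and no intermediate
# list is written out between steps.
def pipelined(records):
    for r in records:
        r = r * 2          # first map
        if r % 3 == 0:     # filter
            yield r + 1    # second map

result = list(pipelined(range(10)))
print(result)  # [1, 7, 13, 19]
```

Spark's pipelining works the same way conceptually: the fused stage reads each input record once and applies every pipelined operation to it before moving on.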
From a practical point of view, pipelining will be transparent to you as you write an application—the
Spark runtime will automatically do it—but you will see it if you ever inspect your application
through the Spark UI or through its log files, where you will see that multiple RDD or DataFrame
operations were pipelined into a single stage.
Shuffle Persistence
The second property you'll sometimes see is shuffle persistence. When Spark needs to run an
operation that has to move data across nodes, such as a reduce-by-key operation (where input data for
each key needs to first be brought together from many nodes), the engine can't perform pipelining
anymore, and instead it performs a cross-network shuffle. Spark always executes shuffles by first
having the “source” tasks (those sending data) write shuffle files to their local disks during their
execution stage. Then, the stage that does the grouping and reduction launches and runs tasks that fetch
their corresponding records from each shuffle file and performs that computation (e.g., fetches and
processes the data for a specific range of keys). Saving the shuffle files to disk lets Spark run this
stage later in time than the source stage (e.g., if there are not enough executors to run both at the same
time), and also lets the engine re-launch reduce tasks on failure without rerunning all the input tasks.
One side effect you'll see for shuffle persistence is that running a new job over data that's already
been shuffled does not rerun the “source” side of the shuffle. Because the shuffle files were already
written to disk earlier, Spark knows that it can use them to run the later stages of the job, and it need
not redo the earlier ones. In the Spark UI and logs, you will see the pre-shuffle stages marked as
“skipped”. This automatic optimization can save time in a workload that runs multiple jobs over the
same data, but of course, for even better performance you can perform your own caching with the
DataFrame or RDD cache method, which lets you control exactly which data is saved and where.
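The mechanism described above can be sketched as a toy simulation, with in-memory dicts standing in for the per-partition shuffle files on disk. All names here are illustrative, not Spark APIs:

```python
from collections import defaultdict

def write_shuffle_files(records, num_partitions):
    """'Source' tasks: hash-partition (key, value) records into shuffle 'files'."""
    files = defaultdict(list)
    for key, value in records:
        files[hash(key) % num_partitions].append((key, value))
    return files

def reduce_partition(files, partition_id):
    """'Reduce' task: fetch one partition's records and sum values per key."""
    totals = defaultdict(int)
    for key, value in files[partition_id]:
        totals[key] += value
    return dict(totals)

records = [("a", 1), ("b", 2), ("a", 3)]
shuffle_files = write_shuffle_files(records, num_partitions=2)

# Because the shuffle "files" persist, any reduce task can be rerun
# (e.g., after a failure) without recomputing the source records.
merged = {}
for pid in range(2):
    merged.update(reduce_partition(shuffle_files, pid))
print(merged)  # {'a': 4, 'b': 2}
```

In real Spark, the source tasks write these files to local disk, and reduce tasks fetch their partition's records over the network; the "skipped" stages in the UI are stages whose shuffle files already exist.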
You'll quickly grow accustomed to this behavior after you run some Spark actions on aggregated data
and inspect them in the UI.
Conclusion
In this chapter, we discussed what happens to Spark Applications when we go to execute them on a
cluster: how the cluster actually goes about running that code, as well as what happens
within Spark Applications during the process. At this point, you should feel quite comfortable
understanding what happens within and outside of a Spark Application. This will give you a starting
point for debugging your applications. Chapter 16 will discuss writing Spark Applications and the
things you should consider when doing so.
Chapter 16. Developing Spark
Applications
In Chapter 15, you learned about how Spark runs your code on the cluster. We’ll now show you how
easy it is to develop a standalone Spark application and deploy it on a cluster. We’ll do this using a
simple template that shares some easy tips for how to structure your applications, including setting up
build tools and unit testing. This template is available in the book's code repository. The template is
not strictly necessary, because writing applications from scratch isn't hard, but it helps. Let's get
started with our first application.
Writing Spark Applications
Spark Applications are the combination of two things: a Spark cluster and your code. In this case, the
cluster will be local mode and the application will be one that is pre-defined. Let's walk through an
application in each language.
A Simple Scala-Based App
Scala is Spark's “native” language and naturally makes for a great way to write applications. It's
really no different than writing a Scala application.
TIP
Scala can seem intimidating, depending on your background, but it's worth learning if only to understand Spark just a bit
better. Additionally, you do not need to learn all the language's ins and outs; begin with the basics and you'll see that it's
easy to be productive in Scala in no time. Using Scala will also open up a lot of doors. With a little practice, it's not too
difficult to do code-level tracing through Spark's codebase.
You can build applications using sbt or Apache Maven, two Java Virtual Machine (JVM)–based build
tools. As with any build tool, they each have their own quirks, but it's probably easiest to begin with
sbt. You can download, install, and learn about sbt on the sbt website. You can install Maven from its
respective website, as well.
To configure an sbt build for our Scala application, we specify a build.sbt file to manage the package
information. Inside the build.sbt file, there are a few key things to include:
Project metadata (package name, package versioning information, etc.)
Where to resolve dependencies
Dependencies needed for your library
There are many more options that you can specify; however, they are beyond the scope of this book
(you can find information about this on the web and in the sbt documentation). There are also some
books on the subject that can serve as a helpful reference once you've gone beyond anything
nontrivial. Here's what a sample Scala build.sbt file might look like (and the one that we include in
the template). Notice how we must specify the Scala version as well as the Spark version:
name := "example"
organization := "com.databricks"
version := "0.1-SNAPSHOT"
scalaVersion := "2.11.8"
// Spark Information
val sparkVersion = "2.2.0"
// allows us to include spark packages
resolvers += "bintray-spark-packages" at
  "https://dl.bintray.com/spark-packages/maven/"
resolvers += "Typesafe Simple Repository" at
  "http://repo.typesafe.com/typesafe/simple/maven-releases/"
resolvers += "MavenRepository" at
  "https://mvnrepository.com/"
libraryDependencies ++= Seq(
  // spark core
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion
  // the rest of the file is omitted for brevity
)
Now that we’ve defined the build file, we can actually go about adding code to our project. We’ll use
the standard Scala project structure, which you can find in the sbt reference manual (this is the same
directory structure as Maven projects):
src/
  main/
    resources/
      <files to include in main jar here>
    scala/
      <main Scala sources>
    java/
      <main Java sources>
  test/
    resources/
      <files to include in test jar here>
    scala/
      <test Scala sources>
    java/
      <test Java sources>
We put the source code in the Scala and Java directories. In this case, we put something like the
following in a file; this initializes the SparkSession, runs the application, and then exits:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

object DataFrameExample extends Serializable {
  def main(args: Array[String]) = {
    val pathToDataFolder = args(0)
    // start up the SparkSession
    // along with explicitly setting a given config
    val spark = SparkSession.builder().appName("Spark Example")
      .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
      .getOrCreate()
    // udf registration (someUDF is defined elsewhere in the template)
    spark.udf.register("myUDF", someUDF(_:String):String)
    val df = spark.read.json(pathToDataFolder + "data.json")
    val manipulated = df.groupBy(expr("myUDF(group)")).sum().collect()
      .foreach(x => println(x))
  }
}
Notice how we defined a main class that we can run from the command line when we use spark-
submit to submit it to our cluster for execution.
Now that we have our project set up and have added some code to it, it's time to build it. We can use
sbt assembly (via the sbt-assembly plugin) to build an “uber-jar” or “fat-jar” that contains all of the dependencies in one JAR.
This can be simple for some deployments but cause complications (especially dependency conflicts)
for others. A lighter-weight approach is to run sbt package, which will gather all of your
dependencies into the target folder but will not package all of them into one big JAR.
Running the application
The target folder contains the JAR that we can use as an argument to spark-submit. After building
the Scala package, you end up with something that you can spark-submit on your local machine by
using the following code (this snippet assumes that the $SPARK_HOME environment
variable is set; you could replace $SPARK_HOME with the exact directory that contains your downloaded
version of Spark):
$SPARK_HOME/bin/spark-submit \
--class com.databricks.example.DataFrameExample \
--master local \
target/scala-2.11/example_2.11-0.1-SNAPSHOT.jar "hello"
Writing Python Applications
Writing PySpark Applications is really no different than writing normal Python applications or
packages. It's quite similar to writing command-line applications in particular. Spark doesn't have a
build concept, just Python scripts, so to run an application, you simply execute the script against the
cluster.
To facilitate code reuse, it is common to package multiple Python files into egg or ZIP files of Spark
code. To include those files, you can use the --py-files argument of spark-submit to add .py, .zip,
or .egg files to be distributed with your application.
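A minimal sketch of building such a ZIP with the standard library; the module name and file contents here are hypothetical, and the resulting archive would be passed to spark-submit via --py-files:

```python
import os
import tempfile
import zipfile

# Create a throwaway helper module to package (hypothetical content).
workdir = tempfile.mkdtemp()
module_path = os.path.join(workdir, "helpers.py")
with open(module_path, "w") as f:
    f.write("def double(x):\n    return x * 2\n")

# Bundle it into a ZIP that spark-submit can ship to the executors:
#   spark-submit --py-files deps.zip main.py
zip_path = os.path.join(workdir, "deps.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.write(module_path, arcname="helpers.py")

with zipfile.ZipFile(zip_path) as zf:
    names = zf.namelist()
print(names)  # ['helpers.py']
```

Once distributed this way, the executors can import helpers just as if the module were installed locally.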
When it's time to run your code, you create the equivalent of a “Scala/Java main class” in Python:
a script that builds the SparkSession. This is the script that we
will pass as the main argument to spark-submit:
# in Python
from __future__ import print_function

if __name__ == '__main__':
    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
        .master("local") \
        .appName("Word Count") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
    print(spark.range(5000).where("id > 500").selectExpr("sum(id)").collect())
When you do this, you're going to get a SparkSession that you can pass around your application. It is
best practice to pass around this variable at runtime rather than instantiating it within every Python
class.
One helpful tip when developing in Python is to use pip to specify PySpark as a dependency. You can
do this by running the command pip install pyspark. This allows you to use it in a way that you
might use other Python packages. This makes for very helpful code completion in many editors, as
well. This is brand new in Spark 2.2, so it might take a version or two to be completely production
ready, but Python is very popular in the Spark community, and it's sure to be a cornerstone of Spark's
future.
Running the application
After you've written your code, it's time to submit it for execution. (We're executing the same code
that we have in the project template.) You just need to call spark-submit with that information:
$SPARK_HOME/bin/spark-submit --master local pyspark_template/main.py
Writing Java Applications
Writing Java Spark Applications is, if you squint, the same as writing Scala applications. The core
differences involve how you specify your dependencies.
This example assumes that you are using Maven to specify your dependencies. In this case, you'll use
the following format. In Maven, you must add the Spark Packages repository so that you can fetch
dependencies from those locations:
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.1.0</version>
  </dependency>
  <dependency>
    <groupId>graphframes</groupId>
    <artifactId>graphframes</artifactId>
    <version>0.4.0-spark2.1-s_2.11</version>
  </dependency>
</dependencies>
<repositories>
  <!-- list of other repositories -->
  <repository>
    <id>SparkPackagesRepo</id>
    <url>http://dl.bintray.com/spark-packages/maven</url>
  </repository>
</repositories>
Naturally, you follow the same directory structure as in the Scala project version (seeing as they both
conform to the Maven specification). We then just follow the relevant Java examples to actually build
and execute the code. Now we can create a simple example that specifies a main class for us to
execute against (more on this at the end of the chapter):
import org.apache.spark.sql.SparkSession;

public class SimpleExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession
        .builder()
        .getOrCreate();
    spark.range(1, 2000).count();
  }
}
We then package it by using mvn package (you need to have Maven installed to do so).
Running the application
This operation is going to be the exact same as running the Scala application (or the Python
application, for that matter). Simply use spark-submit:
$SPARK_HOME/bin/spark-submit \
--class com.databricks.example.SimpleExample \
--master local \
target/spark-example-0.1-SNAPSHOT.jar "hello"
Testing Spark Applications
You now know what it takes to write and run a Spark Application, so let's move on to a less exciting
but still very important topic: testing. Testing Spark Applications relies on a couple of key principles
and tactics that you should keep in mind as you're writing your applications.
Strategic Principles
Testing your data pipelines and Spark Applications is just as important as actually writing them. This
is because you want to ensure that they are resilient to future change, in data, logic, and output. In this
section, we’ll first discuss what you might want to test in a typical Spark Application, then discuss
how to organize your code for easy testing.
Input data resilience
Being resilient to different kinds of input data is fundamental to how you write
your data pipelines. The data will change because the business needs will change. Therefore, your
Spark Applications and pipelines should be resilient to at least some degree of change in the input
data, or otherwise ensure that failures are handled in a graceful and resilient way. For the most
part, this means being smart about writing your tests to handle edge cases of different inputs and
making sure that the pager only goes off when it's something that is truly important.
Business logic resilience and evolution
The business logic in your pipelines will likely change, just as the input data does. Even more
importantly, you want to be sure that what you're deducing from the raw data is what you actually
think you're deducing. This means that you'll need to do robust logical testing with realistic data
to ensure that you're actually getting what you want out of it. One thing to be wary of here is trying to
write a bunch of “Spark Unit Tests” that just test Spark's functionality. You don't want to be doing
that; instead, you want to be testing your business logic and ensuring that the complex business
pipeline that you set up is actually doing what you think it should be doing.
Resilience in output and atomicity
Assuming that you're prepared for departures in the structure of input data and that your business
logic is well tested, you now want to ensure that your output structure is what you expect. This means
you will need to gracefully handle output schema resolution. It's not often that data is simply dumped
in some location, never to be read again; most of your Spark pipelines are probably feeding other
Spark pipelines. For this reason, you're going to want to make certain that your downstream
consumers understand the “state” of the data. This could mean how frequently it's updated, as well as
whether the data is “complete” (e.g., there is no late data) or that there won't be any last-minute
corrections to the data.
All of the aforementioned issues are principles that you should be thinking about as you build your
data pipelines (actually, regardless of whether you're using Spark). This strategic thinking is
important for laying down the foundation for the system that you would like to build.
Tactical Takeaways
Although strategic thinking is important, let's talk a bit more in detail about some of the tactics that
you can actually use to make your application easy to test. The highest-value approach is to verify that
your business logic is correct by employing proper unit testing, and to ensure that you're resilient to
changing input data or have structured it so that schema evolution will not become unwieldy in the
future. How to do this largely falls on you as the developer, because it will vary
according to your business domain and domain expertise.
Managing SparkSessions
Testing your Spark code using a unit test framework like JUnit or ScalaTest is relatively easy because
of Spark's local mode: just create a local mode SparkSession as part of your test harness to run it.
However, to make this work well, you should try to perform dependency injection as much as
possible when managing SparkSessions in your code. That is, initialize the SparkSession only once
and pass it around to relevant functions and classes at runtime in a way that makes it easy to substitute
during testing. This makes it much easier to test each individual function with a dummy SparkSession
in unit tests.
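The dependency-injection pattern can be sketched as follows. The business logic receives the session as a parameter rather than creating one, so a test can pass in a stand-in object. FakeSession and its read_count method are hypothetical stubs for illustration, not real Spark APIs:

```python
# Business logic receives the session rather than creating it,
# so tests can substitute any object with a compatible interface.
def load_user_count(spark, path):
    return spark.read_count(path)

# Hypothetical stub standing in for a real (local-mode) SparkSession.
class FakeSession:
    def __init__(self, counts):
        self.counts = counts

    def read_count(self, path):
        return self.counts[path]

# The unit test exercises the function with the fake, no cluster needed.
fake = FakeSession({"/data/users": 42})
count = load_user_count(fake, "/data/users")
print(count)  # 42
```

In production, the same load_user_count function would simply be called with the real SparkSession created once at startup.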
Which Spark API to Use?
Spark offers several choices of APIs, ranging from SQL to DataFrames and Datasets, and each of
these can have different impacts for maintainability and testability of your application. To be perfectly
honest, the right API depends on your team and its needs: some teams and projects will need the less
strict SQL and DataFrame APIs for speed of development, while others will want to use type-safe
Datasets or RDDs.
In general, we recommend documenting and testing the input and output types of each function
regardless of which API you use. The type-safe API automatically enforces a minimal contract for
your function that makes it easy for other code to build on it. If your team prefers to use DataFrames
or SQL, then spend some time to document and test what each function returns and what types of
inputs it accepts to avoid surprises later, as in any dynamically typed programming language. While
the lower-level RDD API is also statically typed, we recommend going into it only if you need low-
level features such as partitioning that are not present in Datasets, which should not be very common;
the Dataset API allows more performance optimizations and is likely to provide even more of them in
the future.
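For instance, even without type safety you can state and check a function's contract in a plain unit test. The following sketch is hypothetical (no Spark required; the same rows could just as well come from a DataFrame's collect()):

```python
# Illustrative sketch: documenting and testing the input/output contract
# of a function used in a dynamically typed pipeline. The function and
# data are hypothetical, not from the book.

def summarize_purchases(rows):
    """Input:  list of dicts with keys "customer" (str) and "amount" (float).
    Output: dict mapping customer -> total amount spent."""
    totals = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

# The documented contract, enforced as a test:
assert summarize_purchases(
    [{"customer": "a", "amount": 1.5}, {"customer": "a", "amount": 2.5}]
) == {"a": 4.0}
```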
A similar set of considerations applies to which programming language to use for your application:
there certainly is no right answer for every team, but depending on your needs, each language will
provide different benefits. We generally recommend using statically typed languages like Scala and
Java for larger applications or those where you want to be able to drop into low-level code to fully
control performance, but Python and R may be significantly better in other cases—for example, if you
need to use some of their other libraries. Spark code should easily be testable in the standard unit
testing frameworks in every language.
Connecting to Unit Testing Frameworks
To unit test your code, we recommend using the standard frameworks in your language (e.g., JUnit or
ScalaTest), and setting up your test harnesses to create and clean up a SparkSession for each test.
Different frameworks offer different mechanisms to do this, such as “before” and “after” methods. We
have included some sample unit testing code in the application templates for this chapter.
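As a sketch of such a harness, the unittest example below uses setUp/tearDown as the “before” and “after” hooks. To keep it runnable without a Spark install, a stub stands in for the real SparkSession; the commented lines show what a real harness would typically call (our assumption, not code from the book's templates):

```python
import unittest

class StubSparkSession:
    # Stand-in for pyspark.sql.SparkSession so this sketch runs anywhere.
    # A real harness would instead do something like:
    #   from pyspark.sql import SparkSession
    #   spark = SparkSession.builder.master("local[2]").appName("tests").getOrCreate()
    def __init__(self):
        self.stopped = False
    def stop(self):
        self.stopped = True

class BusinessLogicTest(unittest.TestCase):
    def setUp(self):
        # "before" hook: create one local-mode session per test
        self.spark = StubSparkSession()

    def tearDown(self):
        # "after" hook: release the session's resources
        self.spark.stop()

    def test_session_is_usable(self):
        self.assertFalse(self.spark.stopped)
```

ScalaTest offers the same shape through its BeforeAndAfter trait, with before { } and after { } blocks.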
Connecting to Data Sources
As much as possible, you should make sure your testing code does not connect to production data
sources, so that developers can easily run it in isolation if these data sources change. One easy way to
make this happen is to have all your business logic functions take DataFrames or Datasets as input
instead of directly connecting to various sources; after all, subsequent code will work the same way
no matter what the data source was. If you are using the structured APIs in Spark, another way to
make this happen is named tables: you can simply register some dummy datasets (e.g., loaded from a
small text file or from in-memory objects) under various table names and go from there.
The Development Process
The development process with Spark Applications is similar to development workflows that you
have probably already used. First, you might maintain a scratch space, such as an interactive
notebook or some equivalent thereof, and then as you build key components and algorithms, you move
them to a more permanent location like a library or package. The notebook experience is one that we
often recommend (and are using to write this book) because of its simplicity in experimentation.
There are also some tools, such as Databricks, that allow you to run notebooks as production
applications as well.
When running on your local machine, the spark-shell and its various language-specific
implementations are probably the best way to develop applications. For the most part, the shell is for
interactive applications, whereas spark-submit is for production applications on your Spark cluster.
You can use the shell to interactively run Spark, just as we showed you at the beginning of this book.
This is the mode with which you will run PySpark, Spark SQL, and SparkR. When you download
Spark, you will find the various ways of starting these shells in the bin folder. Simply run
spark-shell (for Scala), spark-sql, pyspark, and sparkR.
After you've finished your application and created a package or script to run, spark-submit will
become your best friend for submitting this job to a cluster.
Launching Applications
The most common way to run Spark Applications is through spark-submit. Previously in this
chapter, we showed you how to run spark-submit; you simply specify your options, the application
JAR or script, and the relevant arguments:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar-or-script> \
[application-arguments]
You can always specify whether to run in client or cluster mode when you submit a Spark job with
spark-submit. However, you should almost always favor running in cluster mode (or in client mode
on the cluster itself) to reduce latency between the executors and the driver.
When submitting Python applications, pass a .py file in the place of a JAR, and add Python .zip, .egg, or
.py files to the search path with --py-files.
For reference, Table 16-1 lists all of the available spark-submit options, including those that are
particular to some cluster managers. To enumerate all these options yourself, run spark-submit with
--help.
Table 16-1. Spark submit help text

--master MASTER_URL
  spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE
  Whether to launch the driver program locally (“client”) or on one of the worker machines inside the cluster (“cluster”) (Default: client).
--class CLASS_NAME
  Your application's main class (for Java/Scala apps).
--name NAME
  A name for your application.
--jars JARS
  Comma-separated list of local JARs to include on the driver and executor classpaths.
--packages
  Comma-separated list of Maven coordinates of JARs to include on the driver and executor classpaths. Will search the local Maven repo, then Maven Central and any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version.
--exclude-packages
  Comma-separated list of groupId:artifactId pairs to exclude while resolving the dependencies provided in --packages, to avoid dependency conflicts.
--repositories
  Comma-separated list of additional remote repositories to search for the Maven coordinates given with --packages.
--py-files PY_FILES
  Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.
--files FILES
  Comma-separated list of files to be placed in the working directory of each executor.
--conf PROP=VALUE
  Arbitrary Spark configuration property.
--properties-file FILE
  Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM
  Memory for driver (e.g., 1000M, 2G) (Default: 1024M).
--driver-java-options
  Extra Java options to pass to the driver.
--driver-library-path
  Extra library path entries to pass to the driver.
--driver-class-path
  Extra class path entries to pass to the driver. Note that JARs added with --jars are automatically included in the classpath.
--executor-memory MEM
  Memory per executor (e.g., 1000M, 2G) (Default: 1G).
--proxy-user NAME
  User to impersonate when submitting the application. This argument does not work with --principal / --keytab.
--help, -h
  Show this help message and exit.
--verbose, -v
  Print additional debug output.
--version
  Print the version of the current Spark.
There are some deployment-specific configurations as well (see Table 16-2).
Table 16-2. Deployment-specific configurations

--driver-cores NUM (Standalone, cluster mode)
  Cores for driver (Default: 1).
--supervise (Standalone/Mesos, cluster mode)
  If given, restarts the driver on failure.
--kill SUBMISSION_ID (Standalone/Mesos, cluster mode)
  If given, kills the driver specified.
--status SUBMISSION_ID (Standalone/Mesos, cluster mode)
  If given, requests the status of the driver specified.
--total-executor-cores NUM (Standalone/Mesos, either mode)
  Total cores for all executors.
--executor-cores NUM (Standalone/YARN, either mode)
  Number of cores per executor (Default: 1 in YARN mode, or all available cores on the worker in standalone mode).
--driver-cores NUM (YARN)
  Number of cores used by the driver, only in cluster mode (Default: 1).
--queue QUEUE_NAME (YARN, either mode)
  The YARN queue to submit to (Default: “default”).
--num-executors NUM (YARN, either mode)
  Number of executors to launch (Default: 2). If dynamic allocation is enabled, the initial number of executors will be at least NUM.
--archives ARCHIVES (YARN, either mode)
  Comma-separated list of archives to be extracted into the working directory of each executor.
--principal PRINCIPAL (YARN, either mode)
  Principal to be used to log in to the KDC while running on secure HDFS.
--keytab KEYTAB (YARN, either mode)
  The full path to the file that contains the keytab for the principal specified above. This keytab will be copied to the node running the Application Master via the Secure Distributed Cache, for renewing the login tickets and the delegation tokens periodically.
Application Launch Examples
We already covered some local-mode application examples previously in this chapter, but it's worth
looking at how we use some of the aforementioned options, as well. Spark also includes several
examples and demonstration applications in the examples directory that is included when you
download Spark. If you're stuck on how to use certain parameters, simply try them first on your local
machine and use the SparkPi class as the main class:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
replace/with/path/to/examples.jar \
1000
The following snippet does the same for Python. Run it from the Spark directory to submit a Python
application (all in one script) to the standalone cluster manager. You can
also set the same executor limits as in the preceding example:
./bin/spark-submit \
--master spark://207.184.161.138:7077 \
examples/src/main/python/pi.py \
1000
You can change this to run in local mode as well by setting the master to local or local[*] to run on
all the cores on your machine. You will also need to change replace/with/path/to/examples.jar to
point to the examples JAR built for the Scala and Spark versions you are running.
Configuring Applications
Spark includes a number of different configurations, some of which we covered in Chapter 15. There
are many different configurations, depending on what you're hoping to achieve. This section covers
those details. For the most part, this information is included for reference and is probably worth
skimming only, unless you're looking for something in particular. The majority of configurations fall
into the following categories:
Application properties
Runtime environment
Shuffle behavior
Spark UI
Compression and serialization
Memory management
Execution behavior
Networking
Scheduling
Dynamic allocation
Security
Encryption
Spark SQL
Spark streaming
SparkR
Spark provides three locations to configure the system:
Spark properties control most application parameters and can be set by using a SparkConf
object
Java system properties
Hardcoded configuration files
There are several templates that you can use, which you can find in the /conf directory available in
the root of the Spark home folder. You can set these properties as hardcoded variables in your
applications or by specifying them at runtime. You can use environment variables to set per-machine
settings, such as the IP address, through the conf/spark-env.sh script on each node. Lastly, you can
configure logging through log4j.properties.
The SparkConf
The SparkConf manages all of our application configurations. You create one by importing and
instantiating it, as shown in the example that follows. After you create it, the SparkConf is immutable for
that specific Spark Application:
// in Scala
import org.apache.spark.SparkConf
val conf = new SparkConf().setMaster("local[2]").setAppName("DefinitiveGuide")
.set("some.conf", "to.some.value")
# in Python
from pyspark import SparkConf
conf = SparkConf().setMaster("local[2]").setAppName("DefinitiveGuide")\
.set("some.conf", "to.some.value")
You use the SparkConf to configure individual Spark Applications with Spark properties. These
Spark properties control how the Spark Application runs and how the cluster is configured. The
example that follows configures the local cluster to have two threads and specifies the application
name that shows up in the Spark UI.
You can configure these at runtime, as you saw previously in this chapter through command-line
arguments. This is helpful when starting a Spark Shell that will automatically include a basic Spark
Application for you; for instance:
./bin/spark-submit --name "DefinitiveGuide" --master local[4] ...
Note that when setting duration-based properties, you should use the following formats:
25ms (milliseconds)
5s (seconds)
10m or 10min (minutes)
3h (hours)
5d (days)
1y (years)
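As an illustration of the accepted shapes, the tiny validator below is our own sketch (Spark parses these strings internally; this is not a Spark API):

```python
import re

# Illustrative validator for the duration-string shapes listed above.
# This is our own sketch, not part of Spark; it covers only the units
# shown in this section.
DURATION_PATTERN = re.compile(r"^\d+(ms|s|m|min|h|d|y)$")

def is_valid_duration(value):
    return DURATION_PATTERN.match(value) is not None

assert is_valid_duration("25ms")
assert is_valid_duration("10min")
assert is_valid_duration("3h")
assert not is_valid_duration("10 minutes")  # no spaces, no long unit names
```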
Application Properties
Application properties are those that you set either from spark-submit or when you create your
Spark Application. They define basic application metadata as well as some execution characteristics.
Table 16-3 presents a list of current application properties.
Table 16-3. Application properties

spark.app.name (Default: none)
  The name of your application. This will appear in the UI and in log data.
spark.driver.cores (Default: 1)
  Number of cores to use for the driver process, only in cluster mode.
spark.driver.maxResultSize (Default: 1g)
  Limit of total size of serialized results of all partitions for each Spark action (e.g., collect). Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size exceeds this limit. Having a high limit can cause OutOfMemoryErrors in the driver (depends on spark.driver.memory and memory overhead of objects in the JVM). Setting a proper limit can protect the driver from OutOfMemoryErrors.
spark.driver.memory (Default: 1g)
  Amount of memory to use for the driver process, where SparkContext is initialized (e.g., 1g, 2g). Note: in client mode, this must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, set this through the --driver-memory command-line option or in your default properties file.
spark.executor.memory (Default: 1g)
  Amount of memory to use per executor process (e.g., 2g, 8g).
spark.extraListeners (Default: none)
  A comma-separated list of classes that implement SparkListener; when initializing SparkContext, instances of these classes will be created and registered with Spark's listener bus. If a class has a single-argument constructor that accepts a SparkConf, that constructor will be called; otherwise, a zero-argument constructor will be called. If no valid constructor can be found, the SparkContext creation will fail with an exception.
spark.logConf (Default: false)
  Logs the effective SparkConf as INFO when a SparkContext is started.
spark.master (Default: none)
  The cluster manager to connect to. See the list of allowed master URLs.
spark.submit.deployMode (Default: none)
  The deploy mode of the Spark driver program, either “client” or “cluster”, which means to launch the driver program locally (“client”) or remotely (“cluster”) on one of the nodes inside the cluster.
spark.log.callerContext (Default: none)
  Application information that will be written into the YARN RM log/HDFS audit log when running on YARN/HDFS. Its length depends on the Hadoop configuration hadoop.caller.context.max.size. It should be concise, and typically can have up to 50 characters.
spark.driver.supervise (Default: false)
  If true, restarts the driver automatically if it fails with a non-zero exit status. Only has effect in Spark standalone mode or Mesos cluster deploy mode.
You can ensure that you've correctly set these values by checking the application's web UI on port
4040 of the driver, on the “Environment” tab. Only values explicitly specified through
spark-defaults.conf, SparkConf, or the command line will appear. For all other configuration properties,
you can assume the default value is used.
Runtime Properties
Although less common, there are times when you might also need to configure the runtime
environment of your application. Due to space limitations, we cannot include the entire configuration
set here. Refer to the relevant table on the Runtime Environment in the Spark documentation. These
properties allow you to configure extra classpaths and Python paths for both drivers and executors,
Python worker configurations, as well as miscellaneous logging properties.
Execution Properties
These configurations are some of the most relevant for you to configure because they give you finer-
grained control on actual execution. Due to space limitations, we cannot include the entire
configuration set here. Refer to the relevant table on Execution Behavior in the Spark documentation.
The most common configurations to change are spark.executor.cores (to control the number of
available cores) and spark.files.maxPartitionBytes (maximum partition size when reading
files).
Configuring Memory Management
There are times when you might need to manually manage the memory options to try to optimize
your applications. Many of these are not particularly relevant for end users because they involve a lot
of legacy concepts or fine-grained controls that were obviated in Spark 2.X because of automatic
memory management. Due to space limitations, we cannot include the entire configuration set here.
Refer to the relevant table on Memory Management in the Spark documentation.
Configuring Shuffle Behavior
We’ve emphasized how shuffles can be a bottleneck in Spark jobs because of their high
communication overhead. Therefore, there are a number of low-level configurations for controlling
shuffle behavior. Due to space limitations, we cannot include the entire configuration set here. Refer
to the relevant table on Shuffle Behavior in the Spark documentation.
Environmental Variables
You can configure certain Spark settings through environment variables, which are read from the
conf/spark-env.sh script in the directory where Spark is installed (or conf/spark-env.cmd on
Windows). In Standalone and Mesos modes, this file can give machine-specific information such as
hostnames. It is also sourced when running local Spark Applications or submission scripts.
Note that conf/spark-env.sh does not exist by default when Spark is installed. However, you can copy
conf/spark-env.sh.template to create it. Be sure to make the copy executable.
The following variables can be set in spark-env.sh:
JAVA_HOME
Location where Java is installed (if it's not on your default PATH).
PYSPARK_PYTHON
Python binary executable to use for PySpark in both driver and workers (default is python2.7 if
available; otherwise, python). Property spark.pyspark.python takes precedence if it is set.
PYSPARK_DRIVER_PYTHON
Python binary executable to use for PySpark in driver only (default is PYSPARK_PYTHON).
Property spark.pyspark.driver.python takes precedence if it is set.
SPARKR_DRIVER_R
R binary executable to use for SparkR shell (default is R). Property spark.r.shell.command
takes precedence if it is set.
SPARK_LOCAL_IP
IP address of the machine to which to bind.
SPARK_PUBLIC_DNS
Hostname your Spark program will advertise to other machines.
In addition to the variables just listed, there are also options for setting up the Spark standalone cluster
scripts, such as number of cores to use on each machine and maximum memory. Because spark-env.sh
is a shell script, you can set some of these programmatically; for example, you might compute
SPARK_LOCAL_IP by looking up the IP of a specific network interface.
NOTE
When running Spark on YARN in cluster mode, you need to set environment variables by using the
spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file. Environment
variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode. See
the YARN-related Spark Properties for more information.
Job Scheduling Within an Application
Within a given Spark Application, multiple parallel jobs can run simultaneously if they were
submitted from separate threads. By job, in this section, we mean a Spark action and any tasks that
need to run to evaluate that action. Spark's scheduler is fully thread-safe and supports this use case to
enable applications that serve multiple requests (e.g., queries for multiple users).
By default, Spark's scheduler runs jobs in FIFO fashion. If the jobs at the head of the queue don't
need to use the entire cluster, later jobs can begin to run right away, but if the jobs at the head of the
queue are large, later jobs might be delayed significantly.
It is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks
between jobs in a round-robin fashion so that all jobs get a roughly equal share of cluster resources.
This means that short jobs submitted while a long job is running can begin receiving resources right
away and still achieve good response times without waiting for the long job to finish. This mode is
best for multiuser settings.
To enable the fair scheduler, set the spark.scheduler.mode property to FAIR when configuring a
SparkContext.
The fair scheduler also supports grouping jobs into pools, and setting different scheduling options, or
weights, for each pool. This can be useful to create a high-priority pool for more important jobs or to
group the jobs of each user together and give users equal shares regardless of how many concurrent
jobs they have instead of giving jobs equal shares. This approach is modeled after the Hadoop Fair
Scheduler.
Without any intervention, newly submitted jobs go into a default pool, but job pools can be set by
adding the spark.scheduler.pool local property to the SparkContext in the thread that's
submitting them. This is done as follows (assuming sc is your SparkContext):
sc.setLocalProperty("spark.scheduler.pool", "pool1")
After setting this local property, all jobs submitted within this thread will use this pool name. The
setting is per-thread to make it easy to have a thread run multiple jobs on behalf of the same user. If
you'd like to clear the pool that a thread is associated with, set it to null.
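The per-thread behavior can be sketched in pure Python with threading.local, which plays the role that Spark's thread-local properties play on the SparkContext (an illustrative stand-in only, not Spark code):

```python
import threading

# Pure-Python sketch of per-thread pool assignment. In Spark,
# sc.setLocalProperty stores the value in thread-local state on the
# SparkContext; here a threading.local stands in for that state.
_local = threading.local()

def set_local_property(key, value):
    props = getattr(_local, "props", None)
    if props is None:
        props = _local.props = {}
    if value is None:
        props.pop(key, None)  # setting null clears the association
    else:
        props[key] = value

def get_local_property(key):
    return getattr(_local, "props", {}).get(key)

# Two threads, each submitting its "jobs" into its own pool:
results = {}

def submit_jobs(pool_name):
    set_local_property("spark.scheduler.pool", pool_name)
    # every job submitted from this thread now belongs to pool_name
    results[pool_name] = get_local_property("spark.scheduler.pool")

threads = [threading.Thread(target=submit_jobs, args=(p,))
           for p in ("pool1", "pool2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert results == {"pool1": "pool1", "pool2": "pool2"}
```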
Conclusion
This chapter covered a lot about Spark Applications; we learned how to write, test, run, and
configure them in all of Spark's languages. In Chapter 17, we talk about deploying and the cluster
management options you have when it comes to running Spark Applications.
Chapter 17. Deploying Spark
This chapter explores the infrastructure you need in place for you and your team to be able to run
Spark Applications:
Cluster deployment choices
Sparks different cluster managers
Deployment considerations and configuring deployments
For the most part, Spark should work similarly with all the supported cluster managers; however,
customizing the setup means understanding the intricacies of each of the cluster management systems.
The hard part is deciding on the cluster manager (or choosing a managed service). Although we
would be happy to include all the minute details about how you can configure different clusters with
different cluster managers, it's simply impossible for this book to provide hyper-specific details for
every situation in every single environment. The goal of this chapter, therefore, is not to discuss each
of the cluster managers in full detail, but rather to look at their fundamental differences and to provide
a reference for a lot of the material already available on the Spark website. Unfortunately, there is no
easy answer to “which is the easiest cluster manager to run” because it varies so much by use case,
experience, and resources. The Spark documentation site offers a lot of detail about deploying Spark
with actionable examples. We do our best to discuss the most relevant points.
As of this writing, Spark has three officially supported cluster managers:
Standalone mode
Hadoop YARN
Apache Mesos
These cluster managers maintain a set of machines onto which you can deploy Spark Applications.
Naturally, each of these cluster managers has an opinionated view toward management, and so there
are trade-offs and semantics that you will need to keep in mind. However, they all run Spark
applications the same way (as covered in Chapter 16). Let's begin with the first point: where to
deploy your cluster.
Where to Deploy Your Cluster to Run Spark Applications
There are two high-level options for where to deploy Spark clusters: deploy in an on-premises
cluster or in the public cloud. This choice is consequential and is therefore worth discussing.
On-Premises Cluster Deployments
Deploying Spark to an on-premises cluster is sometimes a reasonable option, especially for
organizations that already manage their own datacenters. As with everything else, there are trade-offs
to this approach. An on-premises cluster gives you full control over the hardware used, meaning you
can optimize performance for your specific workload. However, it also introduces some challenges,
especially when it comes to data analytics workloads like Spark. First, with on-premises deployment,
your cluster is fixed in size, whereas the resource demands of data analytics workloads are often
elastic. If you make your cluster too small, it will be hard to launch the occasional very large
analytics query or training job for a new machine learning model, whereas if you make it large, you
will have resources sitting idle. Second, for on-premises clusters, you need to select and operate your
own storage system, such as a Hadoop file system or scalable key-value store. This includes setting
up georeplication and disaster recovery if required.
If you are going to deploy on-premises, the best way to combat the resource utilization problem is to
use a cluster manager that allows you to run many Spark applications and dynamically reassign
resources between them, or even allows non-Spark applications on the same cluster. All of Sparks
supported cluster managers allow multiple concurrent applications, but YARN and Mesos have better
support for dynamic sharing and also additionally support non-Spark workloads. Handling resource
sharing is likely going to be the biggest difference your users see day to day with Spark on-premises
versus in the cloud: in public clouds, it's easy to give each application its own cluster of exactly the
required size for just the duration of that job.
For storage, you have several different options, but covering all the trade-offs and operational details
in depth would probably require its own book. The most common storage systems used for Spark are
distributed file systems such as Hadoop’s HDFS and key-value stores such as Apache Cassandra.
Streaming message bus systems such as Apache Kafka are also often used for ingesting data. All these
systems have varying degrees of support for management, backup, and georeplication, sometimes
built into the system and sometimes only through third-party commercial tools. Before choosing a
storage option, we recommend evaluating the performance of its Spark connector and evaluating the
available management tools.
Spark in the Cloud
While early big data systems were designed for on-premises deployment, the cloud is now an
increasingly common platform for deploying Spark. The public cloud has several advantages when it
comes to big data workloads. First, resources can be launched and shut down elastically, so you can
run that occasional “monster” job that takes hundreds of machines for a few hours without having to
pay for them all the time. Even for normal operation, you can choose a different type of machine and
cluster size for each application to optimize its cost performance—for example, launch machines with
Graphics Processing Units (GPUs) just for your deep learning jobs. Second, public clouds include
low-cost, georeplicated storage that makes it easier to manage large amounts of data.
Many companies looking to migrate to the cloud imagine they'll run their applications in the same
way that they run their on-premises clusters. All the major cloud providers (Amazon Web Services
[AWS], Microsoft Azure, Google Cloud Platform [GCP], and IBM Bluemix) include managed
Hadoop clusters for their customers, which provide HDFS for storage as well as Apache Spark. This
is actually not a great way to run Spark in the cloud, however, because by using a fixed-size cluster
and file system, you are not going to be able to take advantage of elasticity. Instead, it is generally a
better idea to use global storage systems that are decoupled from a specific cluster, such as Amazon
S3, Azure Blob Storage, or Google Cloud Storage and spin up machines dynamically for each Spark
workload. With decoupled compute and storage, you will be able to pay for computing resources only
when needed, scale them up dynamically, and mix different hardware types. Basically, keep in mind
that running Spark in the cloud need not mean migrating an on-premises installation to virtual
machines: you can run Spark natively against cloud storage to take full advantage of the cloud's
elasticity, cost-saving benefits, and management tools without having to manage an on-premises
computing stack within your cloud environment.
Several companies provide “cloud-native” Spark-based services, and all installations of Apache
Spark can of course connect to cloud storage. Databricks, the company started by the Spark team from
UC Berkeley, is one example of a service provider built specifically for Spark in the cloud.
Databricks provides a simple way to run Spark workloads without the heavy baggage of a Hadoop
installation. The company provides a number of features for running Spark more efficiently in the
cloud, such as auto-scaling, auto-termination of clusters, and optimized connectors to cloud storage,
as well as a collaborative environment for working on notebooks and standalone jobs. The company
also provides a free Community Edition for learning Spark where you can run notebooks on a small
cluster and share them live with others. A fun fact is that this entire book was written using the free
Community Edition of Databricks, because we found the integrated Spark notebooks, live
collaboration, and cluster management the easiest way to produce and test this content.
If you run Spark in the cloud, much of the content in this chapter might not be relevant because you can
often create a separate, short-lived Spark cluster for each job you execute. In that case, the standalone
cluster manager is likely the easiest to use. However, you may still want to read this content if you'd
like to share a longer-lived cluster among many applications, or to install Spark on virtual machines
yourself.
Cluster Managers
Unless you are using a high-level managed service, you will have to decide on the cluster manager to
use for Spark. Spark supports the three aforementioned cluster managers: standalone clusters, Hadoop
YARN, and Mesos. Let's review each of these.
Standalone Mode
Spark's standalone cluster manager is a lightweight platform built specifically for Apache Spark
workloads. Using it, you can run multiple Spark Applications on the same cluster. It also provides
simple interfaces for doing so but can scale to large Spark workloads. The main disadvantage of
standalone mode is that it's more limited than the other cluster managers: in particular, your cluster
can only run Spark. It's probably the best starting point if you just want to quickly get Spark running
on a cluster, however, and you do not have experience using YARN or Mesos.
Starting a standalone cluster
Starting a standalone cluster requires provisioning the machines for doing so. That means starting
them up, ensuring that they can talk to one another over the network, and getting the version of Spark
you would like to run on those sets of machines. After that, there are two ways to start the cluster: by
hand or using built-in launch scripts.
Let's first launch a cluster by hand. The first step is to start the master process on the machine that we
want that to run on, using the following command:
$SPARK_HOME/sbin/start-master.sh
When we run this command, the cluster manager master process will start up on that machine. Once
started, the master prints out a spark://HOST:PORT URI. You use this when you start each of the
worker nodes of the cluster, and you can use it as the master argument to your SparkSession on
application initialization. You can also find this URI on the master's web UI, which is
http://master-ip-address:8080 by default. With that URI, start the worker nodes by logging in to each machine and
running the following script using the URI you just received from the master node. The master
machine must be available on the network of the worker nodes you are using, and the port must be
open on the master node, as well:
$SPARK_HOME/sbin/start-slave.sh <master-spark-URI>
As soon as you've run that on another machine, you have a Spark cluster running! This process is
naturally a bit manual; thankfully, there are scripts that can help to automate it.
Cluster launch scripts
You can configure cluster launch scripts that can automate the launch of standalone clusters. To do
this, create a file called conf/slaves in your Spark directory that will contain the hostnames of all the
machines on which you intend to start Spark workers, one per line. If this file does not exist,
everything will launch locally. When you go to actually start the cluster, the master machine will
access each of the worker machines via Secure Shell (SSH). By default, SSH is run in parallel and
requires that you configure password-less access (using a private key). If you do not have a
password-less setup, you can set the environment variable SPARK_SSH_FOREGROUND and serially
provide a password for each worker.
After you set up this file, you can launch or stop your cluster by using the following shell scripts,
based on Hadoop’s deploy scripts, and available in $SPARK_HOME/sbin:
$SPARK_HOME/sbin/start-master.sh
Starts a master instance on the machine on which the script is executed.
$SPARK_HOME/sbin/start-slaves.sh
Starts a slave instance on each machine specified in the conf/slaves file.
$SPARK_HOME/sbin/start-slave.sh
Starts a slave instance on the machine on which the script is executed.
$SPARK_HOME/sbin/start-all.sh
Starts both a master and a number of slaves as described earlier.
$SPARK_HOME/sbin/stop-master.sh
Stops the master that was started via the sbin/start-master.sh script.
$SPARK_HOME/sbin/stop-slaves.sh
Stops all slave instances on the machines specified in the conf/slaves file.
$SPARK_HOME/sbin/stop-all.sh
Stops both the master and the slaves as described earlier.
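To make the launch-script flow concrete, here is a minimal sketch; the worker hostnames are hypothetical, and the commands that actually touch a cluster are left commented out:

```shell
# conf/slaves is a plain-text file listing one worker hostname per line.
# The hostnames below are hypothetical; we write to /tmp for illustration.
cat > /tmp/slaves <<'EOF'
worker1.example.com
worker2.example.com
EOF

# With the file copied into place, start-all.sh launches the master here
# and a worker on each listed host (assumes password-less SSH):
# cp /tmp/slaves $SPARK_HOME/conf/slaves
# $SPARK_HOME/sbin/start-all.sh
```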
Standalone cluster configurations
Standalone clusters have a number of configurations that you can use to tune your application. These
control everything from what happens to old files on each worker for terminated applications to the
workers' core and memory resources. These are controlled via environment variables or via
application properties. Due to space limitations, we cannot include the entire configuration set here.
Refer to the relevant table on Standalone Environment Variables in the Spark documentation.
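As a brief illustration, worker resources are commonly capped through environment variables in conf/spark-env.sh; the variable names below come from the standalone-mode documentation, and the values are purely illustrative:

```shell
# Illustrative conf/spark-env.sh fragment (written to /tmp for this sketch):
cat > /tmp/spark-env.sh <<'EOF'
SPARK_WORKER_CORES=8              # cores this worker offers to applications
SPARK_WORKER_MEMORY=24g           # total memory this worker offers
SPARK_WORKER_DIR=/var/spark/work  # scratch space for application work dirs
EOF
```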
Submitting applications
After you create the cluster, you can submit applications to it using the spark:// URI of the master.
You can do this either on the master node itself or from another machine using spark-submit. There are
some specific command-line arguments for standalone mode, which we covered in “Launching
Applications”.
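A submission against a standalone master might look like the following sketch; the master host, class name, and JAR path are placeholders:

```shell
# Assemble a spark-submit invocation for a standalone cluster.
# master-ip-address, com.example.MyApp, and the JAR path are placeholders.
CMD="spark-submit \
  --master spark://master-ip-address:7077 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  /path/to/my-app.jar"
echo "$CMD"
```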
Spark on YARN
Hadoop YARN is a framework for job scheduling and cluster resource management. Even though
Spark is often (mis)classified as a part of the “Hadoop Ecosystem,” in reality, Spark has little to do
with Hadoop. Spark does natively support the Hadoop YARN cluster manager but it requires nothing
from Hadoop itself.
You can run your Spark jobs on Hadoop YARN by specifying the master as yarn in the spark-
submit command-line arguments. Just like with standalone mode, there are a number of knobs that
you are able to tune according to what you would like the cluster to do. The number of knobs is
naturally larger than that of Spark's standalone mode because Hadoop YARN is a generic scheduler
for a large number of different execution frameworks.
Setting up a YARN cluster is beyond the scope of this book, but there are some great books on the
topic as well as managed services that can simplify this experience.
Submitting applications
When submitting applications to YARN, the core difference from other deployments is that --master
will become yarn as opposed to the master node IP, as it is in standalone mode. Instead, Spark will find
the YARN configuration files using the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR.
Once you have set those environment variables to your Hadoop installation's configuration directory,
you can just run spark-submit like we saw in Chapter 16.
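Putting that together, a YARN submission might look like this sketch (the configuration path, class, and JAR are placeholders):

```shell
# Tell Spark where to find the Hadoop/YARN configuration files;
# /etc/hadoop/conf is common but varies by installation.
export HADOOP_CONF_DIR=/etc/hadoop/conf

# --master is simply "yarn"; the resource manager's address is read from
# the configuration directory rather than passed on the command line.
CMD="spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyApp /path/to/my-app.jar"
echo "$CMD"
```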
NOTE
There are two deployment modes that you can use to launch Spark on YARN. As discussed in previous chapters, cluster
mode has the Spark driver as a process managed by the YARN cluster, and the client can exit after creating the application.
In client mode, the driver will run in the client process and therefore YARN will be responsible only for granting executor
resources to the application, not maintaining the master node. Also of note is that in cluster mode, Spark doesn’t necessarily
run on the same machine on which you’re executing. Therefore libraries and external jars must be distributed manually or
through the --jars command-line argument.
There are a few YARN-specific properties that you can set by using spark-submit. These allow you
to control priority queues and things like keytabs for security. We covered these in “Launching
Applications” in Chapter 16.
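For instance, spark-submit accepts --queue to target a YARN scheduler queue and --principal/--keytab for Kerberos logins; the queue name, principal, and keytab path below are placeholders:

```shell
# Illustrative YARN-specific submission options (all values are placeholders).
CMD="spark-submit --master yarn --deploy-mode cluster \
  --queue production \
  --principal spark_user@EXAMPLE.COM \
  --keytab /etc/security/keytabs/spark_user.keytab \
  --class com.example.MyApp /path/to/my-app.jar"
echo "$CMD"
```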
Configuring Spark on YARN Applications
Deploying Spark as YARN applications requires you to understand the variety of different
configurations and their implications for your Spark applications. This section covers some best
practices for basic configurations and includes references to some of the important configurations for
running your Spark applications.
Hadoop configurations
If you plan to read and write from HDFS using Spark, you need to include two Hadoop configuration
files on Spark's classpath: hdfs-site.xml, which provides default behaviors for the HDFS client; and
core-site.xml, which sets the default file system name. The location of these configuration files varies
across Hadoop versions, but a common location is inside of /etc/hadoop/conf. Some tools also create
these configurations on the fly, so it's important to understand how your managed service
might be deploying them.
To make these files visible to Spark, set HADOOP_CONF_DIR in $SPARK_HOME/spark-env.sh to a
location containing the configuration files or as an environment variable when you go to spark-
submit your application.
Application properties for YARN
There are a number of Hadoop-related configurations that largely don't have much to do with Spark
itself; rather, they concern running or securing YARN in a way that influences how Spark runs. Due
to space limitations, we cannot include the configuration set here. Refer to the relevant table on
YARN Configurations in the Spark documentation.
Spark on Mesos
Apache Mesos is another clustering system that Spark can run on. A fun fact about Mesos is that the
project was also started by many of the original authors of Spark, including one of the authors of this
book. In the Mesos project’s own words:
Apache Mesos abstracts CPU, memory, storage, and other compute resources away from
machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily
be built and run effectively.
For the most part, Mesos intends to be a datacenter-scale cluster manager that manages not just short-
lived applications like Spark, but long-running applications like web applications or other resource
interfaces. Mesos is the heaviest-weight cluster manager, simply because you would likely choose it
only if your organization already has a large-scale deployment of Mesos, but it makes
for a good cluster manager nonetheless.
Mesos is a large piece of infrastructure, and unfortunately there's simply too much information for us
to cover how to deploy and maintain Mesos clusters. There are many great books on the subject,
including Dipa Dubhashi and Akhil Das's Mastering Mesos (O'Reilly, 2016). The goal here is
to bring up some of the considerations that you'll need to think about when running Spark
Applications on Mesos.
For instance, one common thing you will hear about Spark on Mesos is fine-grained versus coarse-
grained mode. Historically Mesos supported a variety of different modes (fine-grained and coarse-
grained), but at this point, it supports only coarse-grained scheduling (fine-grained has been
deprecated). Coarse-grained mode means that each Spark executor runs as a single Mesos task. Spark
executors are sized according to the following application properties:
spark.executor.memory
spark.executor.cores
spark.cores.max/spark.executor.cores
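The last line is a ratio rather than a property: the number of executors Spark launches follows from dividing spark.cores.max by spark.executor.cores. With the illustrative values below, for example, Spark would run 8 / 2 = 4 two-core executors, each as one Mesos task:

```shell
# Illustrative coarse-grained sizing for Mesos (host and values are placeholders):
CMD="spark-submit --master mesos://HOST:5050 \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=2 \
  --conf spark.cores.max=8 \
  --class com.example.MyApp /path/to/my-app.jar"
echo "$CMD"
```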
Submitting applications
Submitting applications to a Mesos cluster is similar to doing so for Spark's other cluster managers.
For the most part you should favor cluster mode when using Mesos. Client mode requires some extra
configuration on your part, especially with regard to distributing resources around the cluster.
For instance, in client mode, the driver needs extra configuration information in spark-env.sh to work
with Mesos.
In spark-env.sh set some environment variables:
export MESOS_NATIVE_JAVA_LIBRARY=<path to libmesos.so>
This path is typically <prefix>/lib/libmesos.so where the prefix is /usr/local by default. On Mac OS
X, the library is called libmesos.dylib instead of libmesos.so:
export SPARK_EXECUTOR_URI=<URL of spark-2.2.0.tar.gz uploaded above>
Finally, set the Spark Application property spark.executor.uri to <URL of spark-2.2.0.tar.gz>.
Now, when starting a Spark application against the cluster, pass a mesos:// URL
as the master when creating a SparkContext, and set that property as a parameter in your SparkConf
variable or the initialization of a SparkSession:
// in Scala
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder
  .master("mesos://HOST:5050")
  .appName("my app")
  .config("spark.executor.uri", "<path to spark-2.2.0.tar.gz uploaded above>")
  .getOrCreate()
Submitting cluster mode applications is fairly straightforward and follows the same spark-submit
structure you read about before. We covered these in “Launching Applications”.
Configuring Mesos
Just like any other cluster manager, there are a number of ways that we can configure our Spark
Applications when they're running on Mesos. Due to space limitations, we cannot include the entire
configuration set here. Refer to the relevant table on Mesos Configurations in the Spark
documentation.
Secure Deployment Configurations
Spark also provides some low-level ability to make your applications run more securely, especially
in untrusted environments. Note that the majority of this setup will happen outside of Spark. These
configurations are primarily network-based to help Spark run in a more secure manner. This means
authentication, network encryption, and setting TLS and SSL configurations. Due to space limitations,
we cannot include the entire configuration set here. Refer to the relevant table on Security
Configurations in the Spark documentation.
Cluster Networking Configurations
Just as shuffles are important, there can be some things worth tuning on the network. This can also be
helpful when performing custom deployment configurations for your Spark clusters, for example when
you need to use proxies between certain nodes. If you're looking to increase Spark's performance,
these should not be the first configurations you tune, but they may come up in custom deployment
scenarios. Due to space limitations, we cannot include the entire configuration set here. Refer to the
relevant table on Networking Configurations in the Spark documentation.
Application Scheduling
Spark has several facilities for scheduling resources between computations. First, recall that, as
described earlier in the book, each Spark Application runs an independent set of executor processes.
Cluster managers provide the facilities for scheduling across Spark applications. Second, within each
Spark application, multiple jobs (i.e., Spark actions) may be running concurrently if they were
submitted by different threads. This is common if your application is serving requests over the
network. Spark includes a fair scheduler to schedule resources within each application. We
introduced this topic in the previous chapter.
If multiple users need to share your cluster and run different Spark Applications, there are different
options to manage allocation, depending on the cluster manager. The simplest option, available on all
cluster managers, is static partitioning of resources. With this approach, each application is given a
maximum amount of resources that it can use, and holds onto those resources for the entire duration. In
spark-submit there are a number of properties that you can set to control the resource allocation of
a particular application. Refer to Chapter 16 for more information. In addition, dynamic allocation
(described next) can be turned on to let applications scale up and down dynamically based on their
current number of pending tasks. If, instead, you want users to be able to share memory and executor
resources in a fine-grained manner, you can launch a single Spark Application and use thread
scheduling within it to serve multiple requests in parallel.
Dynamic allocation
If you would like to run multiple Spark Applications on the same cluster, Spark provides a
mechanism to dynamically adjust the resources your application occupies based on the workload.
This means that your application can give resources back to the cluster if they are no longer used, and
request them again later when there is demand. This feature is particularly useful if multiple
applications share resources in your Spark cluster.
This feature is disabled by default and available on all coarse-grained cluster managers; that is,
standalone mode, YARN mode, and Mesos coarse-grained mode. There are two requirements for
using this feature. First, your application must set spark.dynamicAllocation.enabled to true.
Second, you must set up an external shuffle service on each worker node in the same cluster and set
spark.shuffle.service.enabled to true in your application. The purpose of the external shuffle
service is to allow executors to be removed without deleting shuffle files written by them. This is set
up differently for each cluster manager and is described in the job scheduling configuration. Due to
space limitations, we cannot include the configuration set for dynamic allocation. Refer to the
relevant table on Dynamic Allocation Configurations.
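The two application-side requirements reduce to a pair of configuration flags; this sketch shows them at submit time (the application details are placeholders, and the shuffle service itself must still be started on each worker):

```shell
# Enable dynamic allocation plus the external shuffle service flag.
CMD="spark-submit --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --class com.example.MyApp /path/to/my-app.jar"
echo "$CMD"
```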
Miscellaneous Considerations
There are several other topics to consider when deploying Spark applications that may affect your choice
of cluster manager and its setup. These are just things that you should think about when comparing
different deployment options.
One of the more important considerations is the number and type of applications you intend to be
running. For instance, YARN is great for HDFS-based applications but is not commonly used for
much else. Additionally, it’s not well designed to support the cloud, because it expects information to
be available on HDFS. Also, compute and storage are largely coupled together, meaning that scaling
your cluster involves scaling both storage and compute instead of just one or the other. Mesos does
improve on this a bit conceptually, and it supports a wide range of application types, but it still
requires pre-provisioning machines and, in some sense, requires buy-in at a much larger scale. For
instance, it doesn't really make sense to have a Mesos cluster for only running Spark Applications.
Spark standalone mode is the lightest-weight cluster manager and is relatively simple to understand
and take advantage of, but then you're going to be building more application management
infrastructure that you could get much more easily by using YARN or Mesos.
Another challenge is managing different Spark versions. Your hands are largely tied if you want to try
to run a variety of different applications running different Spark versions, and unless you use a well-
managed service, you're going to need to spend a fair amount of time either managing different setup
scripts for different Spark services or removing the ability for your users to use a variety of different
Spark applications.
Regardless of the cluster manager that you choose, you're going to want to consider how you're going
to set up logging, store logs for future reference, and allow end users to debug their applications.
These are more “out of the box” for YARN or Mesos and might need some tweaking if you're using
standalone.
One thing you might want to consider—or that might influence your decision making—is maintaining
a metastore in order to maintain metadata about your stored datasets, such as a table catalog. We saw
how this comes up in Spark SQL when we are creating and maintaining tables. Maintaining an
Apache Hive metastore, a topic beyond the scope of this book, might be something that's worth doing
to facilitate more productive, cross-application referencing to the same datasets.
Depending on your workload, it might be worth considering using Spark's external shuffle service.
Typically Spark stores shuffle blocks (shuffle output) on a local disk on that particular node. An
external shuffle service allows for storing those shuffle blocks so that they are available to all
executors, meaning that you can arbitrarily kill executors and still have their shuffle outputs available
to other applications.
Finally, you're going to need to configure at least a basic monitoring solution to help users debug
their Spark jobs running on their clusters. This is going to vary across cluster management
options, and we touch on some of the things that you might want to set up in Chapter 18.
Conclusion
This chapter looked at the world of configuration options that you have when choosing how to deploy
Spark. Although most of the information is irrelevant to the majority of users, it is worth mentioning if
you're performing more advanced use cases. This chapter is by no means exhaustive, however: there
are other configurations that we have omitted that control even lower-level behavior. You can find these
in the Spark documentation or in the Spark source code. Chapter 18 talks about some of the options that
we have when monitoring Spark Applications.
Chapter 18. Monitoring and Debugging
This chapter covers the key details you need to monitor and debug your Spark Applications. To do
this, we will walk through the Spark UI with an example query designed to help you understand how
to trace your own jobs through the execution life cycle. The example we’ll look at will also help you
understand how to debug your jobs and where errors are likely to occur.
The Monitoring Landscape
At some point, you'll need to monitor your Spark jobs to understand where issues are occurring in
them. It's worth reviewing the different things that we can actually monitor and outlining some of the
options for doing so. Let's review the components we can monitor (see Figure 18-1).
Spark Applications and Jobs
The first thing you'll want to begin monitoring when either debugging or just understanding better
how your application executes against the cluster is the Spark UI and the Spark logs. These report
information about the applications currently running at the level of concepts in Spark, such as
RDDs and query plans. We talk in detail about how to use these Spark monitoring tools throughout
this chapter.
JVM
Spark runs the executors in individual Java Virtual Machines (JVMs). Therefore, the next level of
detail would be to monitor the individual virtual machines (VMs) to better understand how your
code is running. JVM utilities such as jstack for providing stack traces, jmap for creating heap
dumps, jstat for reporting time-series statistics, and jconsole for visually exploring various JVM
properties are useful for those comfortable with JVM internals. You can also use a tool like
jvisualvm to help profile Spark jobs. Some of this information is provided in the Spark UI, but for
very low-level debugging, the aforementioned tools can come in handy.
OS/Machine
The JVMs run on a host operating system (OS), and it's important to monitor the state of those
machines to ensure that they are healthy. This includes monitoring things like CPU, network, and
I/O. These are often reported in cluster-level monitoring solutions; however, there are more
specific tools that you can use, including dstat, iostat, and iotop.
Cluster
Naturally, you can monitor the cluster on which your Spark Application(s) will run. This might be
a YARN, Mesos, or standalone cluster. Usually it's important to have some sort of monitoring
solution here because, somewhat obviously, if your cluster is not working, you should probably
know pretty quickly. Some popular cluster-level monitoring tools include Ganglia and
Prometheus.
Figure 18-1. Components of a Spark application that you can monitor
What to Monitor
After that brief tour of the monitoring landscape, let's discuss how we can go about monitoring and
debugging our Spark Applications. There are two main things you will want to monitor: the processes
running your application (at the level of CPU usage, memory usage, etc.), and the query execution
inside it (e.g., jobs and tasks).
Driver and Executor Processes
When you're monitoring a Spark application, you're definitely going to want to keep an eye on the
driver. This is where all of the state of your application lives, and you'll need to be sure it's running
in a stable manner. If you could monitor only one machine or a single JVM, it would definitely be the
driver. With that being said, understanding the state of the executors is also extremely important for
monitoring individual Spark jobs. To help with this challenge, Spark has a configurable metrics
system based on the Dropwizard Metrics Library. The metrics system is configured via a
configuration file that Spark expects to be present at $SPARK_HOME/conf/metrics.properties. A
custom file location can be specified by changing the spark.metrics.conf configuration property.
These metrics can be output to a variety of different sinks, including cluster monitoring solutions like
Ganglia.
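As a small illustration, a metrics.properties file that reports every instance's metrics to the console might look like the following; the sink class name appears in the metrics.properties.template shipped with Spark:

```shell
# Illustrative metrics.properties (written to /tmp for this sketch):
cat > /tmp/metrics.properties <<'EOF'
# Report metrics from all instances to the console every 10 seconds
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds
EOF
```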
Queries, Jobs, Stages, and Tasks
Although the driver and executor processes are important to monitor, sometimes you need to debug
what's going on at the level of a specific query. Spark provides the ability to dive into queries, jobs,
stages, and tasks. (We learned about these in Chapter 15.) This information allows you to know
exactly what's running on the cluster at a given time. When you are performance tuning or
debugging, this is where you are most likely to start.
Now that we know what we want to monitor, let's look at the two most common ways of doing so: the
Spark logs and the Spark UI.
Spark Logs
One of the most detailed ways to monitor Spark is through its log files. Naturally, strange events in
Spark's logs, or in the logging that you added to your Spark Application, can help you take note of
exactly where jobs are failing or what is causing a failure. If you use the application template
provided with the book, the logging framework we set up in the template will allow your application
logs to show up alongside Spark's own logs, making them very easy to correlate. One challenge, however,
is that Python won't be able to integrate directly with Spark's Java-based logging library. Using
Python's logging module or even simple print statements will still print the results to standard error,
however, and make them easy to find.
To change Sparks log level, simply run the following command:
spark.sparkContext.setLogLevel("INFO")
This will allow you to read the logs, and if you use our application template, you can log your own
relevant information along with these logs, allowing you to inspect both your own application and
Spark. The logs themselves will be printed to standard error when running a local mode application,
or saved to files by your cluster manager when running Spark on a cluster. Refer to each cluster
manager's documentation about how to find them—typically, they are available through the cluster
manager's web UI.
You won't always find the answer you need simply by searching logs, but it can help you pinpoint the
given problem that you're encountering and possibly add new log statements in your application to
better understand it. It's also convenient to collect logs over time in order to reference them in the
future. For instance, if your application crashes, you'll want to debug why, without access to the now-
crashed application. You may also want to ship logs off the machine they were written on to hold onto
them if a machine crashes or gets shut down (e.g., if running in the cloud).
The Spark UI
The Spark UI provides a visual way to monitor applications while they are running, as well as metrics
about your Spark workload, at the Spark and JVM level. Every running SparkContext launches a
web UI, by default on port 4040, that displays useful information about the application. When you run
Spark in local mode, for example, just navigate to http://localhost:4040 to see the UI. If you're
running multiple applications, they will launch web UIs on increasing port numbers (4041, 4042, …).
Cluster managers will also link to each application's web UI from their own UI.
Figure 18-2 shows all of the tabs available in the Spark UI.
Figure 18-2. Spark UI tabs
These tabs are accessible for each of the things that we'd like to monitor. For the most part, each of
these should be self-explanatory:
The Jobs tab refers to Spark jobs.
The Stages tab pertains to individual stages (and their relevant tasks).
The Storage tab includes information about the data that is currently cached in our Spark
Application.
The Environment tab contains relevant information about the configurations and current
settings of the Spark application.
The SQL tab refers to our Structured API queries (including SQL and DataFrames).
The Executors tab provides detailed information about each executor running our
application.
Let's walk through an example of how you can drill down into a given query. Open a new Spark shell,
run the following code, and we will trace its execution through the Spark UI:
# in Python
spark.read\
  .option("header", "true")\
  .csv("/data/retail-data/all/online-retail-dataset.csv")\
  .repartition(2)\
  .selectExpr("instr(Description, 'GLASS') >= 1 as is_glass")\
  .groupBy("is_glass")\
  .count()\
  .collect()
This results in three rows of various values. The code kicks off a SQL query, so let's navigate to the
SQL tab, where you should see something similar to Figure 18-3.
Figure 18-3. The SQL tab
The first thing you see is aggregate statistics about this query:
Submitted Time: 2017/04/08 16:24:41
Duration: 2 s
Succeeded Jobs: 2
These will become important in a minute, but first let's take a look at the Directed Acyclic Graph
(DAG) of Spark stages. Each blue box in these tabs represents a stage of Spark tasks. The entire group
of these stages represents our Spark job. Let's take a look at each stage in detail so that we can better
understand what is going on at each level, starting with Figure 18-4.
Figure 18-4. Stage one
The box on top, labeled WholeStageCodegen, represents a full scan of the CSV file. The box below
that represents a shuffle that we forced when we called repartition. This turned our original dataset
(of a yet-to-be-specified number of partitions) into two partitions.
The next step is our projection (selecting/adding/filtering columns) and the aggregation. Notice that in
Figure 18-5 the number of output rows is six. This conveniently lines up with the number of rows in the
final output (three) multiplied by the number of partitions at aggregation time (two). This is because
Spark performs an aggregation for each partition (in this case a hash-based aggregation) before
shuffling the data around in preparation for the final stage.
Figure 18-5. Stage two
The last stage is the aggregation of the subaggregations that we saw happen on a per-partition basis in
the previous stage. We combine those two partitions into the final three rows that are the output of our
total query (Figure 18-6).
Figure 18-6. Stage three
Let's look further into the job's execution. On the Jobs tab, next to Succeeded Jobs, click 2. As
Figure 18-7 demonstrates, our job breaks down into three stages (which correspond to what we saw
on the SQL tab).
Figure 18-7. The Jobs tab
These stages have more or less the same information as what’s shown in Figure 18-6, but clicking the
label for one of them will show the details for a given stage. In this example, three stages ran, with
eight, two, and then two hundred tasks each. Before diving into the stage detail, let's review why this
is the case.
The first stage has eight tasks. CSV files are splittable, and Spark broke up the work to be distributed
relatively evenly between the different cores on the machine. This happens at the cluster level and
points to an important optimization: how you store your files. The following stage has two tasks
because we explicitly called a repartition to move the data into two partitions. The last stage has 200
tasks because the default shuffle partitions value is 200.
Now that we've reviewed how we got here, click the stage with eight tasks to see the next level of detail,
as shown in Figure 18-8.
Figure 18-8. Spark tasks
Spark provides a lot of detail about what this job did when it ran. Toward the top, notice the
Summary Metrics section. This provides a synopsis of statistics regarding various metrics. What you
want to be on the lookout for is uneven distributions of the values (we touch on this in Chapter 19). In
this case, everything looks very consistent; there are no wide swings in the distribution of values. In
the table at the bottom, we can also examine metrics on a per-executor basis (one executor for every
core on this particular machine, in this case). This can help identify whether a particular executor is struggling
with its workload.
Spark also makes available a set of more detailed metrics, as shown in Figure 18-8, which are
probably not relevant to the large majority of users. To view those, click Show Additional Metrics,
and then either choose (De)select All or select individual metrics, depending on what you want to
see.
You can repeat this basic analysis for each stage that you want to analyze. We leave that as an
exercise for the reader.
Other Spark UI tabs
The remaining Spark tabs, Storage, Environment, and Executors, are fairly self-explanatory. The
Storage tab shows information about the cached RDDs/DataFrames on the cluster. This can help you
see if certain data has been evicted from the cache over time. The Environment tab shows you
information about the Runtime Environment, including information about Scala and Java as well as
the various Spark Properties that you configured on your cluster.
Configuring the Spark user interface
There are a number of configurations that you can set regarding the Spark UI. Many of them are
networking configurations such as enabling access control. Others let you configure how the Spark UI
will behave (e.g., how many jobs, stages, and tasks are stored). Due to space limitations, we cannot
include the entire configuration set here. Consult the relevant table on Spark UI Configurations in the
Spark documentation.
Spark REST API
In addition to the Spark UI, you can also access Sparks status and metrics via a REST API. This is is
available at http://localhost:4040/api/v1 and is a way of building visualizations and monitoring
tools on top of Spark itself. For the most part this API exposes the same information presented in the
web UI, except that it doesn't include any of the SQL-related information. This can be a useful tool if
you would like to build your own reporting solution based on the information available in the Spark
UI. Due to space limitations, we cannot include the list of API endpoints here. Consult the relevant
table on REST API Endpoints in the Spark documentation.
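As a minimal sketch, you can query the API with nothing but the Scala standard library, assuming an application is running with its UI on the default port 4040 (the endpoint path follows the v1 API; adjust host and port for your deployment):

```scala
// Query the monitoring REST API of a locally running Spark application.
// Assumes the default UI port 4040; the response is JSON text.
import scala.io.Source

val base = "http://localhost:4040/api/v1"

// List all applications known to this UI.
val apps = Source.fromURL(s"$base/applications").mkString
println(apps)
```

From the application ID returned here, you can drill down into jobs, stages, and executors with further GET requests under the same base path.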
Spark UI History Server
Normally, the Spark UI is only available while a SparkContext is running, so how can you get to it
after your application crashes or ends? To do this, Spark includes a tool called the Spark History
Server that allows you to reconstruct the Spark UI and REST API, provided that the application was
configured to save an event log. You can find up-to-date information about how to use this tool in the
Spark documentation.
To use the history server, you first need to configure your application to store event logs to a certain
location. You can do this by enabling spark.eventLog.enabled and setting the event log location with
the configuration spark.eventLog.dir. Then, once you have stored the events, you can run the
history server as a standalone application, and it will automatically reconstruct the web UI based on
these logs. Some cluster managers and cloud services also configure logging automatically and run a
history server by default.
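As a sketch, the two event-log settings can be set when building the SparkConf; the log directory below is a hypothetical HDFS path, and any location that the history server can also read (HDFS, S3, or a shared filesystem) works:

```scala
import org.apache.spark.SparkConf

// Write event logs so the history server can later reconstruct the UI.
// The directory is a hypothetical example; point it at storage the
// history server can also read.
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs://namenode/shared/spark-logs")

// The standalone history server is then started separately (e.g., via
// sbin/start-history-server.sh), with spark.history.fs.logDirectory
// pointing at the same location.
```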
There are a number of other configurations for the history server. Due to space limitations, we cannot
include the entire configuration set here. Refer to the relevant table on Spark History Server
Configurations in the Spark documentation.
Debugging and Spark First Aid
The previous sections defined some core “vital signs”—that is, things that we can monitor to check
the health of a Spark Application. For the remainder of the chapter we’re going to take a “first aid”
approach to Spark debugging: We’ll review some signs and symptoms of problems in your Spark
jobs, including signs that you might observe (e.g., slow tasks) as well as symptoms from Spark itself
(e.g., OutOfMemoryError). There are many issues that may affect Spark jobs, so it's impossible to
cover everything. But we will discuss some of the more common Spark issues you may encounter. In
addition to the signs and symptoms, we’ll also look at some potential treatments for these issues.
Most of the recommendations about fixing issues refer to the configuration tools discussed in
Chapter 16.
Spark Jobs Not Starting
This issue can arise frequently, especially when you're just getting started with a fresh deployment or
environment.
Signs and symptoms
Spark jobs don't start.
The Spark UI doesn't show any nodes on the cluster except the driver.
The Spark UI seems to be reporting incorrect information.
Potential treatments
This mostly occurs when your cluster or your application's resource demands are not configured
properly. Spark, in a distributed setting, does make some assumptions about networks, file systems,
and other resources. During the process of setting up the cluster, you likely configured something
incorrectly, and now the node that runs the driver cannot talk to the executors. This might be because
you didn't specify what IP and port is open or didn't open the correct one. This is most likely a
cluster level, machine, or configuration issue. Another option is that your application requested more
resources per executor than your cluster manager currently has free, in which case the driver will be
waiting forever for executors to be launched.
Ensure that machines can communicate with one another on the ports that you expect. Ideally,
you should open up all ports between the worker nodes unless you have more stringent
security constraints.
Ensure that your Spark resource configurations are correct and that your cluster manager is
properly set up for Spark. Try running a simple application first to see if that works. One
common issue may be that you requested more memory per executor than the cluster manager
has free to allocate, so check how much it is reporting free (in its UI) and your spark-
submit memory configuration.
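One simple sanity check is to request modest, explicit resources and confirm that executors register in the UI before scaling up. A minimal sketch (the values are illustrative, not recommendations):

```scala
import org.apache.spark.SparkConf

// Request small, explicit resources so you can compare them directly
// against what the cluster manager reports as free. Values are
// illustrative placeholders.
val conf = new SparkConf()
  .setAppName("sanity-check")
  .set("spark.executor.memory", "1g")
  .set("spark.executor.cores", "1")
```

If even this small request never launches executors, the problem is almost certainly networking or cluster-manager setup rather than your application.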
Errors Before Execution
This can happen when you're developing a new application and have previously run code on this
cluster, but now some new code won't work.
Signs and symptoms
Commands don't run at all and output large error messages.
You check the Spark UI and no jobs, stages, or tasks seem to run.
Potential treatments
After checking and confirming that the Spark UI environment tab shows the correct information for
your application, it's worth double-checking your code. Many times, there might be a simple typo or
incorrect column name that is preventing the Spark job from compiling into its underlying Spark plan
(when using the DataFrame API).
You should take a look at the error returned by Spark to confirm that there isn't an issue in
your code, such as providing the wrong input file path or field name.
Double-check to verify that the cluster has the network connectivity that you expect between
your driver, your workers, and the storage system you are using.
There might be issues with libraries or classpaths that are causing the wrong version of a
library to be loaded for accessing storage. Try simplifying your application until you get a
smaller version that reproduces the issue (e.g., just reading one dataset).
Errors During Execution
This kind of issue occurs when you already are working on a cluster or parts of your Spark
Application run before you encounter an error. This can be a part of a scheduled job that runs at some
interval or a part of some interactive exploration that seems to fail after some time.
Signs and symptoms
One Spark job runs successfully on the entire cluster but the next one fails.
A step in a multistep query fails.
A scheduled job that ran yesterday is failing today.
A difficult-to-parse error message.
Potential treatments
Check to see if your data exists or is in the format that you expect. This can change over time
or some upstream change may have had unintended consequences on your application.
If an error quickly pops up when you run a query (i.e., before tasks are launched), it is most
likely an analysis error while planning the query. This means that you likely misspelled a
column name referenced in the query or that a column, view, or table you referenced does not
exist.
Read through the stack trace to try to find clues about what components are involved (e.g.,
what operator and stage it was running in).
Try to isolate the issue by progressively double-checking input data and ensuring the data
conforms to your expectations. Also try removing logic until you can isolate the problem in a
smaller version of your application.
If a job runs tasks for some time and then fails, it could be due to a problem with the input
data itself, wherein the schema might be specified incorrectly or a particular row does not
conform to the expected schema. For instance, sometimes your schema might specify that the
data contains no nulls but your data does actually contain nulls, which can cause certain
transformations to fail.
It’s also possible that your own code for processing the data is crashing, in which case Spark
will show you the exception thrown by your code. In this case, you will see a task marked as
“failed” on the Spark UI, and you can also view the logs on that machine to understand what
it was doing when it failed. Try adding more logs inside your code to figure out which data
record was being processed.
Slow Tasks or Stragglers
This issue is quite common when optimizing applications, and can occur either due to work not being
evenly distributed across your machines (“skew”), or due to one of your machines being slower than
the others (e.g., due to a hardware problem).
Signs and symptoms
Any of the following are possible symptoms of this issue:
Spark stages seem to execute until there are only a handful of tasks left. Those tasks then take
a long time.
These slow tasks show up in the Spark UI and occur consistently on the same dataset(s).
These occur in stages, one after the other.
Scaling up the number of machines given to the Spark Application doesn't really help—some
tasks still take much longer than others.
In the Spark metrics, certain executors are reading and writing much more data than others.
Potential treatments
Slow tasks are often called “stragglers.” There are many reasons they may occur, but most often the
source of this issue is that your data is partitioned unevenly into DataFrame or RDD partitions. When
this happens, some executors might have much more work to do than others. One
particularly common case is that you use a group-by-key operation and one of the keys just has more
data than others. In this case, when you look at the Spark UI, you might see that the shuffle data for
some nodes is much larger than for others.
Try increasing the number of partitions to have less data per partition.
Try repartitioning by another combination of columns. For example, stragglers can come up
when you partition by a skewed ID column, or a column where many values are null. In the
latter case, it might make sense to first filter out the null values.
Try increasing the memory allocated to your executors if possible.
Monitor the executor that is having trouble and see if it is the same machine across jobs; you
might also have an unhealthy executor or machine in your cluster—for example, one whose
disk is nearly full.
If this issue is associated with a join or an aggregation, see “Slow Joins” or “Slow
Aggregations”.
Check whether your user-defined functions (UDFs) are wasteful in their object allocation or
business logic. Try to convert them to DataFrame code if possible.
Ensure that your UDFs or User-Defined Aggregate Functions (UDAFs) are running on a
small enough batch of data. Oftentimes an aggregation can pull a lot of data into memory for
a common key, leading to that executor having to do a lot more work than others.
Turning on speculation, which we discuss in “Slow Reads and Writes”, will have Spark run
a second copy of tasks that are extremely slow. This can be helpful if the issue is due to a
faulty node because the task will get to run on a faster one. Speculation does come at a cost,
however, because it consumes additional resources. In addition, for some storage systems
that use eventual consistency, you could end up with duplicate output data if your writes are
not idempotent. (We discussed speculation configurations in Chapter 17.)
Another common issue can arise when you're working with Datasets. Because Datasets
perform a lot of object instantiation to convert records to Java objects for UDFs, they can
cause a lot of garbage collection. If you're using Datasets, look at the garbage collection
metrics in the Spark UI to see if they're consistent with the slow tasks.
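Several of the treatments above are one-line configuration changes. A minimal sketch (the partition count is an illustrative value, and the idempotent-writes caveat about speculation applies):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Run a second copy of extremely slow tasks; only safe if your
  // writes are idempotent on your storage system.
  .set("spark.speculation", "true")
  // More partitions after shuffles means less data per task; 200 is
  // the default, so this illustrative value spreads work more thinly.
  .set("spark.sql.shuffle.partitions", "400")
```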
Stragglers can be one of the most difficult issues to debug, simply because there are so many possible
causes. However, in all likelihood, the cause will be some kind of data skew, so definitely begin by
checking the Spark UI for imbalanced amounts of data across tasks.
Slow Aggregations
If you have a slow aggregation, start by reviewing the issues in the “Slow Tasks” section before
proceeding. Having tried those, you might continue to see the same problem.
Signs and symptoms
Slow tasks during a groupBy call.
Jobs after the aggregation are slow, as well.
Potential treatments
Unfortunately, this issue can't always be solved. Sometimes, the data in your job just has some
skewed keys, and the operation you want to run on them needs to be slow.
Increasing the number of partitions, prior to an aggregation, might help by reducing the
number of different keys processed in each task.
Increasing executor memory can help alleviate this issue, as well. If a single key has lots of
data, this will allow its executor to spill to disk less often and finish faster, although it may
still be much slower than executors processing other keys.
If you find that tasks after the aggregation are also slow, this means that your dataset might
have remained unbalanced after the aggregation. Try inserting a repartition call to
partition it randomly.
Ensuring that all filters and SELECT statements that can be are above the aggregation can help
to ensure that you're working only on the data that you need to be working on and nothing
else. Spark's query optimizer will automatically do this for the structured APIs.
Ensure null values are represented correctly (using Spark's concept of null) and not as
some default value like " " or "EMPTY". Spark often optimizes for skipping nulls early in the
job when possible, but it can't do so for your own placeholder values.
Some aggregation functions are also just inherently slower than others. For instance,
collect_list and collect_set are very slow aggregation functions because they must
return all the matching objects to the driver, and should be avoided in performance-critical
code.
Slow Joins
Joins and aggregations are both shuffles, so they share some of the same general symptoms as well as
treatments.
Signs and symptoms
A join stage seems to be taking a long time. This can be one task or many tasks.
Stages before and after the join seem to be operating normally.
Potential treatments
Many joins can be optimized (manually or automatically) to other types of joins. We covered
how to select different join types in Chapter 8.
Experimenting with different join orderings can really help speed up jobs, especially if some
of those joins filter out a large amount of data; do those first.
Partitioning a dataset prior to joining can be very helpful for reducing data movement across
the cluster, especially if the same dataset will be used in multiple join operations. It's worth
experimenting with different prejoin partitioning. Keep in mind, again, that this isn't “free”
and does come at the cost of a shuffle.
Slow joins can also be caused by data skew. There's not always a lot you can do here, but
sizing up the Spark application and/or increasing the size of executors can help, as described
in earlier sections.
Ensuring that all filters and select statements that can be are above the join can help to ensure
that you're working only on the data that you need for the join.
Ensure that null values are handled correctly (that you're using null) and not some default
value like " " or "EMPTY", as with aggregations.
Sometimes Spark can't properly plan for a broadcast join if it doesn't know any statistics
about the input DataFrame or table. If you know that one of the tables that you are joining is
small, you can try to force a broadcast (as discussed in Chapter 8), or use Spark's statistics
collection commands to let it analyze the table.
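Both of these last options can be sketched in a few lines; the DataFrame and table names here are hypothetical stand-ins for your own:

```scala
import org.apache.spark.sql.functions.broadcast

// Hint that the smaller side should be broadcast to every executor,
// avoiding a shuffle of the large side. largeDF and smallDF are
// hypothetical DataFrames.
val joined = largeDF.join(broadcast(smallDF), Seq("id"))

// Alternatively, collect statistics on a cataloged table so Spark's
// planner can choose the broadcast on its own.
spark.sql("ANALYZE TABLE small_table COMPUTE STATISTICS")
```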
Slow Reads and Writes
Slow I/O can be difficult to diagnose, especially with networked file systems.
Signs and symptoms
Slow reading of data from a distributed file system or external system.
Slow writes from network file systems or blob storage.
Potential treatments
Turning on speculation (set spark.speculation to true) can help with slow reads and
writes. This will launch additional tasks with the same operation in an attempt to see
whether it's just some transient issue in the first task. Speculation is a powerful tool and
works well with consistent file systems. However, it can cause duplicate data writes with
some eventually consistent cloud services, such as Amazon S3, so check whether it is
supported by the storage system connector you are using.
Ensuring sufficient network connectivity can be important—your Spark cluster may simply
not have enough total network bandwidth to get to your storage system.
For distributed file systems such as HDFS running on the same nodes as Spark, make sure
Spark sees the same hostnames for nodes as the file system. This will enable Spark to do
locality-aware scheduling, which you will be able to see in the “locality” column in the
Spark UI. We’ll talk about locality a bit more in the next chapter.
Driver OutOfMemoryError or Driver Unresponsive
This is usually a pretty serious issue because it will crash your Spark Application. It often happens
due to collecting too much data back to the driver, making it run out of memory.
Signs and symptoms
Spark Application is unresponsive or crashed.
OutOfMemoryErrors or garbage collection messages in the driver logs.
Commands take a very long time to run or don't run at all.
Interactivity is very low or non-existent.
Memory usage is high for the driver JVM.
Potential treatments
There are a variety of potential reasons for this happening, and diagnosis is not always
straightforward.
Your code might have tried to collect an overly large dataset to the driver node using
operations such as collect.
You might be using a broadcast join where the data to be broadcast is too big. Use Spark's
maximum broadcast join configuration to better control the size it will broadcast.
A long-running application generated a large number of objects on the driver and is unable to
release them. Java’s jmap tool can be useful to see what objects are filling most of the
memory of your driver JVM by printing a histogram of the heap. However, take note that
jmap will pause that JVM while running.
Increase the driver's memory allocation if possible to let it work with more data.
Issues with JVMs running out of memory can happen if you are using another language
binding, such as Python, due to data conversion between the two requiring too much memory
in the JVM. Try to see whether your issue is specific to your chosen language and bring back
less data to the driver node, or write it to a file instead of bringing it back as in-memory
objects.
If you are sharing a SparkContext with other users (e.g., through the SQL JDBC server and
some notebook environments), ensure that people aren't trying to do something that might be
causing large amounts of memory allocation in the driver (like working with overly large arrays
in their code or collecting large datasets).
Executor OutOfMemoryError or Executor Unresponsive
Spark applications can sometimes recover from this automatically, depending on the true underlying
issue.
Signs and symptoms
OutOfMemoryErrors or garbage collection messages in the executor logs. You can find these
in the Spark UI.
Executors that crash or become unresponsive.
Slow tasks on certain nodes that never seem to recover.
Potential treatments
Try increasing the memory available to executors and the number of executors.
Try increasing PySpark worker size via the relevant Python configurations.
Look for garbage collection error messages in the executor logs. Some of the tasks that are
running, especially if you're using UDFs, can be creating lots of objects that need to be
garbage collected. Repartition your data to increase parallelism, reduce the amount of
records per task, and ensure that all executors are getting the same amount of work.
Ensure that null values are handled correctly (that you're using null) and not some default
value like " " or "EMPTY", as we discussed earlier.
This is more likely to happen with RDDs or with Datasets because of object instantiations.
Try using fewer UDFs and more of Spark's structured operations when possible.
Use Java monitoring tools such as jmap to get a histogram of heap memory usage on your
executors, and see which classes are taking up the most space.
If executors are being placed on nodes that also have other workloads running on them, such
as a key-value store, try to isolate your Spark jobs from other jobs.
Unexpected Nulls in Results
Signs and symptoms
Unexpected null values after transformations.
Scheduled production jobs that used to work no longer work, or no longer produce the right
results.
Potential treatments
It’s possible that your data format has changed without adjusting your business logic. This
means that code that worked before is no longer valid.
Use an accumulator to try to count records of certain types, as well as parsing or processing
errors where you skip a record. This can be helpful because you might think that you're
parsing data of a certain format, but some of the data doesn't follow that format. Most often, users will place
the accumulator in a UDF when they are parsing their raw data into a more controlled format
and perform the counts there. This allows you to count valid and invalid records and then
operate accordingly after the fact.
Ensure that your transformations actually result in valid query plans. Spark SQL sometimes
does implicit type coercions that can cause confusing results. For instance, the SQL
expression SELECT 5*"23" results in 115 because the string “23” converts to the numeric
value 23, but the expression SELECT 5 * " " results in null because casting the
empty string to an integer gives null. Make sure that your intermediate datasets have the
schema you expect them to (try using printSchema on them), and look for any CAST
operations in the final query plan.
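Run in a spark-shell, the coercion above is easy to reproduce (spark is the session the shell provides):

```scala
// Implicit coercions in Spark SQL: double-quoted literals inside the
// SQL are strings, which Spark coerces to numbers where it can.
spark.sql("""SELECT 5 * "23" AS product""").show()  // the string coerces to a number
spark.sql("""SELECT 5 * " "  AS product""").show()  // blank string casts to null

// Inspect the inferred result type rather than assuming it.
spark.sql("""SELECT 5 * "23" AS product""").printSchema()
```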
No Space Left on Disk Errors
Signs and symptoms
You see “no space left on disk” errors and your jobs fail.
Potential treatments
The easiest way to alleviate this, of course, is to add more disk space. You can do this by
sizing up the nodes that you're working on or attaching external storage in a cloud
environment.
If you have a cluster with limited storage space, some nodes may run out first due to skew.
Repartitioning the data as described earlier may help here.
There are also a number of storage configurations with which you can experiment. Some of
these determine how long logs should be kept on the machine before being removed. For
more information, see the Spark executor logs rolling configurations in Chapter 16.
Try manually removing some old log files or old shuffle files from the machine(s) in
question. This can help alleviate some of the issue although obviously it's not a permanent
fix.
Serialization Errors
Signs and symptoms
You see serialization errors and your jobs fail.
Potential treatments
This is very uncommon when working with the Structured APIs, but you might be trying to
perform some custom logic on executors with UDFs or RDDs and either the task that you're
trying to serialize to these executors or the data you are trying to share cannot be serialized.
This often happens when you're working with either some code or data that cannot be
serialized into a UDF or function, or if you're working with strange data types that cannot be
serialized. If you are using (or intend to use) Kryo serialization, verify that you're
actually registering your classes so that they are indeed serialized.
Try not to refer to any fields of the enclosing object in your UDFs when creating UDFs
inside a Java or Scala class. This can cause Spark to try to serialize the whole enclosing
object, which may not be possible. Instead, copy the relevant fields to local variables in the
same scope as the closure and use those.
Conclusion
This chapter covered some of the main tools that you can use to monitor and debug your Spark jobs
and applications, as well as the most common issues we see and their resolutions. As with debugging
any complex software, we recommend taking a principled, step-by-step approach to debug issues.
Add logging statements to figure out where your job is crashing and what type of data arrives at each
stage, try to isolate the problem to the smallest piece of code possible, and work up from there. For
data skew issues, which are unique to parallel computing, use Spark's UI to get a quick overview of
how much work each task is doing. In Chapter 19, we discuss performance tuning in particular and
various tools you can use for that.
Chapter 19. Performance Tuning
Chapter 18 covered the Spark user interface (UI) and basic first-aid for your Spark Application.
Using the tools outlined in that chapter, you should be able to ensure that your jobs run reliably.
However, sometimes you'll also need them to run faster or more efficiently for a variety of reasons.
That's what this chapter is about. Here, we present a discussion of some of the performance choices
that are available to make your jobs run faster.
Just as with monitoring, there are a number of different levels that you can try to tune at. For instance,
if you had an extremely fast network, that would make many of your Spark jobs faster because
shuffles are so often one of the costlier steps in a Spark job. Most likely, you won't have much ability
to control such things; therefore, we’re going to discuss the things you can control through code
choices or configuration.
There are a variety of different parts of Spark jobs that you might want to optimize, and it’s valuable
to be specific. Following are some of the areas:
Code-level design choices (e.g., RDDs versus DataFrames)
Data at rest
Joins
Aggregations
Data in flight
Individual application properties
Inside of the Java Virtual Machine (JVM) of an executor
Worker nodes
Cluster and deployment properties
This list is by no means exhaustive, but it does at least ground the conversation and the topics that we
cover in this chapter. Additionally, there are two ways of trying to achieve the execution
characteristics that we would like out of Spark jobs. We can either do so indirectly by setting
configuration values or changing the runtime environment. These should improve things across Spark
Applications or across Spark jobs. Alternatively, we can try to directly change execution
characteristics or design choices at the individual Spark job, stage, or task level. These kinds of fixes
are very specific to that one area of our application and therefore have limited overall impact. There
are numerous things that lie on both sides of the indirect versus direct divide, and we will draw lines
in the sand accordingly.
One of the best things you can do to figure out how to improve performance is to implement good
monitoring and job history tracking. Without this information, it can be difficult to know whether
you're really improving job performance.
Indirect Performance Enhancements
As discussed, there are a number of indirect enhancements that you can perform to help your Spark
jobs run faster. We’ll skip the obvious ones like “improve your hardware” and focus more on the
things within your control.
Design Choices
Although good design choices seem like a somewhat obvious way to optimize performance, we often
don't prioritize this step in the process. When designing your applications, making good design
choices is very important because it not only helps you to write better Spark applications but also to
get them to run in a more stable and consistent manner over time and in the face of external changes or
variations. We’ve already discussed some of these topics earlier in the book, but we’ll summarize
some of the fundamental ones again here.
Scala versus Java versus Python versus R
This question is nearly impossible to answer in the general sense because a lot will depend on your
use case. For instance, if you want to perform some single-node machine learning after performing a
large ETL job, we might recommend running your Extract, Transform, and Load (ETL) code as
SparkR code and then using R’s massive machine learning ecosystem to run your single-node machine
learning algorithms. This gives you the best of both worlds and takes advantage of the strength of R as
well as the strength of Spark without sacrifices. As we mentioned numerous times, Spark's Structured
APIs are consistent across languages in terms of speed and stability. That means that you should code
with whatever language you are most comfortable using or is best suited for your use case.
Things do get a bit more complicated when you need to include custom transformations that cannot be
created in the Structured APIs. These might manifest themselves as RDD transformations or user-
defined functions (UDFs). If you're going to do this, R and Python are not necessarily the best choice
simply because of how this is actually executed. It’s also more difficult to provide stricter guarantees
of types and manipulations when you're defining functions that jump across languages. We find that
using Python for the majority of the application, and porting some of it to Scala or writing specific
UDFs in Scala as your application evolves, is a powerful technique—it allows for a nice balance
between overall usability, maintainability, and performance.
DataFrames versus SQL versus Datasets versus RDDs
This question also comes up frequently. The answer is simple. Across all languages, DataFrames,
Datasets, and SQL are equivalent in speed. This means that if you're using DataFrames in any of these
languages, performance is equal. However, if you're going to be defining UDFs, you'll take a
performance hit writing those in Python or R, and to some extent a lesser performance hit in Java and
Scala. If you want to optimize for pure performance, it would behoove you to try and get back to
DataFrames and SQL as quickly as possible. Although all DataFrame, SQL, and Dataset code
compiles down to RDDs, Spark's optimization engine will write “better” RDD code than you can
manually and certainly do it with orders of magnitude less effort. Additionally, you will lose out on
new optimizations that are added to Spark's SQL engine every release.
Lastly, if you want to use RDDs, we definitely recommend using Scala or Java. If that’s not possible,
we recommend that you restrict the “surface area” of RDDs in your application to the bare minimum.
That's because when Python runs RDD code, it serializes a lot of data to and from the Python
process. This is very expensive to run over very big data and can also decrease stability.
Although it isn't exactly relevant to performance tuning, it's important to note that there are also some
gaps in what functionality is supported in each of Spark's languages. We discussed this in Chapter 16.
Object Serialization in RDDs
In Part III, we briefly discussed the serialization libraries that can be used within RDD
transformations. When you're working with custom data types, you're going to want to serialize them
using Kryo because it's both more compact and much more efficient than Java serialization. However,
this does come at the inconvenience of registering the classes that you will be using in your
application.
You can use Kryo serialization by setting spark.serializer to
org.apache.spark.serializer.KryoSerializer. You will also need to explicitly register the
classes that you want to serialize with the Kryo serializer via the
spark.kryo.classesToRegister configuration. There are also a number of advanced parameters
for controlling this in greater detail that are described in the Kryo documentation.
To register your classes, use the SparkConf that you just created and pass in the names of your
classes:
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
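Putting the serializer setting and the registration together, a minimal configuration might look like the following (MyClass1 and MyClass2 stand in for your own custom types):

```scala
import org.apache.spark.SparkConf

// Hypothetical custom classes used inside RDD transformations.
class MyClass1
class MyClass2

// Switch to Kryo and register the classes it will serialize.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
```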
Cluster Configurations
This area has huge potential benefits but is probably one of the more difficult to prescribe because of
the variation across hardware and use cases. In general, monitoring how the machines themselves are
performing will be the most valuable approach toward optimizing your cluster configurations,
especially when it comes to running multiple applications (whether they are Spark or not) on a single
cluster.
Cluster/application sizing and sharing
This somewhat comes down to a resource sharing and scheduling problem; however, there are a lot of
options for how you want to share resources at the cluster level or at the application level. Take a
look at the configurations listed at the end of Chapter 16 as well as some configurations in
Chapter 17.
Dynamic allocation
Spark provides a mechanism to dynamically adjust the resources your application occupies based on
the workload. This means that your application can give resources back to the cluster if they are no
longer used, and request them again later when there is demand. This feature is particularly useful if
multiple applications share resources in your Spark cluster. This feature is disabled by default and
available on all coarse-grained cluster managers; that is, standalone mode, YARN mode, and Mesos
coarse-grained mode. If you'd like to enable this feature, you should set
spark.dynamicAllocation.enabled to true. The Spark documentation presents a number of
individual parameters that you can tune.
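As a rough sketch, dynamic allocation might be enabled through a handful of properties in spark-defaults.conf; the executor bounds and timeout below are illustrative placeholders, not recommendations:

```
spark.dynamicAllocation.enabled              true
# Dynamic allocation relies on the external shuffle service
spark.shuffle.service.enabled                true
# Illustrative bounds; tune these for your workload
spark.dynamicAllocation.minExecutors         2
spark.dynamicAllocation.maxExecutors         20
spark.dynamicAllocation.executorIdleTimeout  60s
```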
Scheduling
Over the course of the previous chapters, we discussed a number of different potential optimizations
that you can take advantage of to either help Spark jobs run in parallel with scheduler pools or help
Spark applications run in parallel with something like dynamic allocation or setting max-executor-
cores. Scheduling optimizations do involve some research and experimentation, and unfortunately
there are no super-quick fixes beyond setting spark.scheduler.mode to FAIR to allow better
sharing of resources across multiple users, or setting --max-executor-cores, which specifies the
maximum number of executor cores that your application will need. Specifying this value can ensure
that your application does not take up all the resources on the cluster. You can also change the default,
depending on your cluster manager, by setting the configuration spark.cores.max to a default of
your choice. Cluster managers also provide some scheduling primitives that can be helpful when
optimizing multiple Spark Applications, as discussed in Chapters 16 and 17.
Data at Rest
More often than not, when you're saving data it will be read many times as other folks in your
organization access the same datasets in order to run different analyses. Making sure that you're
storing your data for effective reads later on is absolutely essential to successful big data projects.
This involves choosing your storage system, choosing your data format, and taking advantage of
features such as data partitioning in some storage formats.
File-based long-term data storage
There are a number of different file formats available, from simple comma-separated values (CSV)
files and binary blobs, to more sophisticated formats like Apache Parquet. One of the easiest ways to
optimize your Spark jobs is to follow best practices when storing data and choose the most efficient
storage format possible.
Generally you should always favor structured, binary types to store your data, especially when you'll
be accessing it frequently. Although CSV files seem well structured, they're very slow to
parse, and often also full of edge cases and pain points. For instance, improperly escaped newline
characters can often cause a lot of trouble when reading a large number of files. The most efficient
file format you can generally choose is Apache Parquet. Parquet stores data in binary files with
column-oriented storage, and also tracks some statistics about each file that make it possible to
quickly skip data not needed for a query. It is well integrated with Spark through the built-in Parquet
data source.
Splittable file types and compression
Whatever file format you choose, you should make sure it is “splittable”, which means that different
tasks can read different parts of the file in parallel. We saw why this is important in Chapter 18.
When we read in the file, all cores were able to do part of the work. That's because the file was
splittable. If we didn't use a splittable file type—say something like a malformed JSON file—we're
going to need to read in the entire file on a single machine, greatly reducing parallelism.
The main place splittability comes in is compression formats. A ZIP file or TAR archive cannot be
split, which means that even if we have 10 files in a ZIP file and 10 cores, only one core can read in
that data because we cannot parallelize access to the ZIP file. This is a poor use of resources. In
contrast, files compressed using gzip, bzip2, or lz4 are generally splittable if they were written by a
parallel processing framework like Hadoop or Spark. For your own input data, the simplest way to
make it splittable is to upload it as separate files, ideally each no larger than a few hundred
megabytes.
Table partitioning
We discussed table partitioning in Chapter 9, and will only use this section as a reminder. Table
partitioning refers to storing files in separate directories based on a key, such as the date field in the
data. Storage managers like Apache Hive support this concept, as do many of Sparks built-in data
sources. Partitioning your data correctly allows Spark to skip many irrelevant files when it only
requires data with a specific range of keys. For instance, if users frequently filter by “date” or
“customerId” in their queries, partition your data by those columns. This will greatly reduce the
amount of data that most queries must read, and therefore dramatically increase speed.
The one downside of partitioning, however, is that if you partition at too fine a granularity, it can
result in many small files, and a great deal of overhead trying to list all the files in the storage system.
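For example, a date-partitioned table can be declared directly in Spark SQL; the table and columns here are hypothetical:

```sql
-- Hypothetical table partitioned by date; Spark writes each date's rows
-- into its own directory, e.g. .../date=2017-01-01/
CREATE TABLE purchases (
  customerId INT,
  amount DOUBLE,
  date STRING
)
USING parquet
PARTITIONED BY (date)
```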
Bucketing
We also discussed bucketing in Chapter 9, but to recap, the essence is that bucketing your data allows
Spark to “pre-partition” data according to how joins or aggregations are likely to be performed by
readers. This can improve performance and stability because data can be consistently distributed
across partitions as opposed to skewed into just one or two. For instance, if joins are frequently
performed on a column immediately after a read, you can use bucketing to ensure that the data is well
partitioned according to those values. This can help prevent a shuffle before a join and therefore help
speed up data access. Bucketing generally works hand-in-hand with partitioning as a second way of
physically splitting up data.
The number of files
In addition to organizing your data into buckets and partitions, you'll also want to consider the
number of files and the size of files that you're storing. If there are lots of small files, you're going to
pay a price listing and fetching each of those individual files. For instance, if you're reading data
from the Hadoop Distributed File System (HDFS), this data is managed in blocks that are up to 128 MB
in size (by default). This means that if you have 30 files of 5 MB each, you're going to have to potentially
request 30 blocks, even though the same data could have fit into 2 blocks (150 MB total).
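The arithmetic behind this example can be sketched in plain Python; the helper function is illustrative, and only the 128 MB default comes from HDFS:

```python
import math

HDFS_BLOCK_MB = 128  # default HDFS block size

def blocks_needed(file_sizes_mb):
    # Each file occupies at least one block, and blocks are not shared
    # across files, so many small files inflate the block count.
    return sum(max(1, math.ceil(size / HDFS_BLOCK_MB)) for size in file_sizes_mb)

print(blocks_needed([5] * 30))  # 30 files of 5 MB each -> 30 blocks
print(blocks_needed([150]))     # one 150 MB file       -> 2 blocks
```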
Although there is not necessarily a panacea for how you want to store your data, the trade-off can be
summarized as follows. Having lots of small files is going to make the scheduler work much harder to
locate the data and launch all of the read tasks. This can increase the network and scheduling
overhead of the job. Having fewer large files eases the load on the scheduler, but it will also make
tasks run longer. In this case, though, you can always launch more tasks than there are input files if you
want more parallelism—Spark will split each file across multiple tasks assuming you are using a
splittable format. In general, we recommend sizing your files so that they each contain at least a few
tens of megabytes of data.
One way of controlling data partitioning when you write your data is through a write option
introduced in Spark 2.2. To control how many records go into each file, you can specify the
maxRecordsPerFile option to the write operation.
Data locality
Another aspect that can be important in shared cluster environments is data locality. Data locality
basically specifies a preference for certain nodes that hold certain data, rather than having to
exchange these blocks of data over the network. If you run your storage system on the same nodes as
Spark, and the system supports locality hints, Spark will try to schedule tasks close to each input
block of data. For example, HDFS provides this option. There are several configurations that
affect locality, but it will generally be used by default if Spark detects that it is using a local storage
system. You will also see data-reading tasks marked as “local” in the Spark web UI.
Statistics collection
Spark includes a cost-based query optimizer that plans queries based on the properties of the input
data when using the structured APIs. However, to allow the cost-based optimizer to make these sorts
of decisions, you need to collect (and maintain) statistics about your tables that it can use. There are
two kinds of statistics: table-level and column-level statistics. Statistics collection is available only
on named tables, not on arbitrary DataFrames or RDDs.
To collect table-level statistics, you can run the following command:
ANALYZE TABLE table_name COMPUTE STATISTICS
To collect column-level statistics, you can name the specific columns:
ANALYZE TABLE table_name COMPUTE STATISTICS FOR
COLUMNS column_name1, column_name2, ...
Column-level statistics are slower to collect, but provide more information for the cost-based
optimizer to use about those data columns. Both types of statistics can help with joins, aggregations,
filters, and a number of other potential things (e.g., automatically choosing when to do a broadcast
join). This is a fast-growing part of Spark, so different optimizations based on statistics will likely be
added in the future.
NOTE
You can follow the progress of cost-based optimization on its JIRA issue. You can also read through the design document
on SPARK-16026 to learn more about this feature. This is an active area of development in Spark at the time of writing.
Shuffle Configurations
Configuring Spark's external shuffle service (discussed in Chapters 16 and 17) can often increase
performance because it allows nodes to read shuffle data from remote machines even when the
executors on those machines are busy (e.g., with garbage collection). This does come at the cost of
complexity and maintenance, however, so it might not be worth it in your deployment. Beyond
configuring this external service, there are also a number of configurations for shuffles, such as the
number of concurrent connections per executor, although these usually have good defaults.
In addition, for RDD-based jobs, the serialization format has a large impact on shuffle performance:
always prefer Kryo over Java serialization, as described in “Object Serialization in RDDs”.
Furthermore, for all jobs, the number of partitions of a shuffle matters. If you have too few partitions,
then too few nodes will be doing work and there may be skew, but if you have too many partitions,
there is an overhead to launching each one that may start to dominate. Try to aim for at least a few
tens of megabytes of data per output partition in your shuffle.
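As a hedged rule of thumb, you can back into a shuffle partition count from a target partition size; the 50 MB target and the function below are illustrative, not a Spark API:

```python
def suggested_shuffle_partitions(total_shuffle_bytes, target_mb=50):
    # Aim for a few tens of megabytes per shuffle output partition.
    target_bytes = target_mb * 1024 * 1024
    return max(1, total_shuffle_bytes // target_bytes)

# ~100 GB of shuffle data at ~50 MB per partition
print(suggested_shuffle_partitions(100 * 1024**3))  # 2048
```

You would then set spark.sql.shuffle.partitions (or the parallelism of your RDD operations) to a value in that neighborhood and measure.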
Memory Pressure and Garbage Collection
During the course of running Spark jobs, the executor or driver machines may struggle to complete
their tasks because of a lack of sufficient memory or “memory pressure.” This may occur when an
application takes up too much memory during execution or when garbage collection runs too
frequently or is slow to run as large numbers of objects are created in the JVM and subsequently
garbage collected as they are no longer used. One strategy for easing this issue is to ensure that you're
using the Structured APIs as much as possible. These will not only increase the efficiency with which
your Spark jobs execute, but will also greatly reduce memory pressure, because JVM objects
are never realized and Spark SQL simply performs the computation on its internal format.
The Spark documentation includes some great pointers on tuning garbage collection for RDD- and
UDF-based applications, and we paraphrase the following sections from that information.
Measuring the impact of garbage collection
The first step in garbage collection tuning is to gather statistics on how frequently garbage collection
occurs and the amount of time it takes. You can do this by adding -verbose:gc
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps to Spark's JVM options using the
spark.executor.extraJavaOptions configuration parameter. The next time you run your Spark
job, you will see messages printed in the worker's logs each time a garbage collection occurs. These
logs will be on your cluster's worker nodes (in the stdout files in their work directories), not in the
driver.
Garbage collection tuning
To further tune garbage collection, you first need to understand some basic information about memory
management in the JVM:
Java heap space is divided into two regions: Young and Old. The Young generation is meant
to hold short-lived objects whereas the Old generation is intended for objects with longer
lifetimes.
The Young generation is further divided into three regions: Eden, Survivor1, and Survivor2.
Here’s a simplified description of the garbage collection procedure:
1. When Eden is full, a minor garbage collection is run on Eden and objects that are alive from
Eden and Survivor1 are copied to Survivor2.
2. The Survivor regions are swapped.
3. If an object is old enough or if Survivor2 is full, that object is moved to Old.
4. Finally, when Old is close to full, a full garbage collection is invoked. This involves tracing
through all the objects on the heap, deleting the unreferenced ones, and moving the others to
fill up unused space, so it is generally the slowest garbage collection operation.
The goal of garbage collection tuning in Spark is to ensure that only long-lived cached datasets are
stored in the Old generation and that the Young generation is sufficiently sized to store all short-lived
objects. This will help avoid full garbage collections to collect temporary objects created during task
execution. Here are some steps that might be useful.
Gather garbage collection statistics to determine whether it is being run too often. If a full garbage
collection is invoked multiple times before a task completes, it means that there isn't enough memory
available for executing tasks, so you should decrease the amount of memory Spark uses for caching
(spark.memory.fraction).
If there are too many minor collections but not many major garbage collections, allocating more
memory for Eden would help. You can set the size of the Eden to be an over-estimate of how much
memory each task will need. If the size of Eden is determined to be E, you can set the size of the
Young generation using the option -Xmn=4/3*E. (The scaling up by 4/3 is to account for space used
by survivor regions, as well.)
As an example, if your task is reading data from HDFS, the amount of memory used by the task can be
estimated by using the size of the data block read from HDFS. Note that the size of a decompressed
block is often two or three times the size of the block. So if you want to have three or four tasks'
worth of working space, and the HDFS block size is 128 MB, we can estimate the size of Eden to be
4 × 3 × 128 MB.
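The estimate works out as follows (a sketch; the decompression factor of 3 and the four-task working set come from the example above):

```python
def eden_estimate_mb(tasks=4, decompression_factor=3, hdfs_block_mb=128):
    # Working set: each task holds roughly one decompressed HDFS block.
    return tasks * decompression_factor * hdfs_block_mb

eden = eden_estimate_mb()   # 4 * 3 * 128 = 1536 MB
young = eden * 4 // 3       # -Xmn sized at 4/3 * E to leave survivor space
print(eden, young)          # 1536 2048
```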
Try the G1GC garbage collector with -XX:+UseG1GC. It can improve performance in some situations
in which garbage collection is a bottleneck and you don't have a way to reduce it further by sizing the
generations. Note that with large executor heap sizes, it can be important to increase the G1 region
size with -XX:G1HeapRegionSize.
Monitor how the frequency and time taken by garbage collection changes with the new settings.
Our experience suggests that the effect of garbage collection tuning depends on your application and
the amount of memory available. There are many more tuning options described online, but at a high
level, managing how frequently full garbage collection takes place can help in reducing the overhead.
You can specify garbage collection tuning flags for executors by setting
spark.executor.extraJavaOptions in a job’s configuration.
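For instance, the GC logging and G1 flags mentioned above might be combined into a single extraJavaOptions setting at submit time; the region size and the application JAR name here are placeholders:

```
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:G1HeapRegionSize=16m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  my-app.jar
```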
Direct Performance Enhancements
In the previous section, we touched on some general performance enhancements that apply to all jobs.
Be sure to skim the previous couple of pages before jumping to this section and the solutions here.
The solutions here are intended as “band-aids” of sorts for issues with specific stages or jobs, but
they require inspecting and optimizing each stage or job separately.
Parallelism
The first thing you should do whenever trying to speed up a specific stage is to increase the degree of
parallelism. In general, we recommend having at least two or three tasks per CPU core in your cluster
if the stage processes a large amount of data. You can set this via the spark.default.parallelism
property as well as tuning the spark.sql.shuffle.partitions according to the number of cores in
your cluster.
Improved Filtering
Another frequent source of performance enhancements is moving filters to the earliest part of your
Spark job that you can. Sometimes, these filters can be pushed into the data sources themselves and
this means that you can avoid reading and working with data that is irrelevant to your end result.
Enabling partitioning and bucketing also helps achieve this. Always look to filter as much data
as you can early on, and you'll find that your Spark jobs will almost always run faster.
Repartitioning and Coalescing
Repartition calls can incur a shuffle. However, doing so can optimize the overall execution of a
job by balancing data across the cluster, so they can be worth it. In general, you should try to shuffle
the least amount of data possible. For this reason, if you're reducing the number of overall partitions
in a DataFrame or RDD, first try the coalesce method, which will not perform a shuffle but rather merge
partitions on the same node into one partition. The slower repartition method will also shuffle
data across the network to achieve even load balancing. Repartitions can be particularly helpful when
performing joins or prior to a cache call. Remember that repartitioning is not free, but it can improve
overall application performance and parallelism of your jobs.
Custom partitioning
If your jobs are still slow or unstable, you might want to explore performing custom partitioning at the
RDD level. This allows you to define a custom partition function that will organize the data across
the cluster to a finer level of precision than is available at the DataFrame level. This is very rarely
necessary, but it is an option. For more information, see Part III.
User-Defined Functions (UDFs)
In general, avoiding UDFs is a good optimization opportunity. UDFs are expensive because they force
representing data as objects in the JVM and sometimes do this multiple times per record in a query.
You should try to use the Structured APIs as much as possible to perform your manipulations simply
because they are going to perform the transformations in a much more efficient manner than you can
do in a high-level language. There is also ongoing work to make data available to UDFs in batches,
such as the Vectorized UDF extension for Python that gives your code multiple records at once using a
Pandas data frame. We discussed UDFs and their costs in Chapter 18.
Temporary Data Storage (Caching)
In applications that reuse the same datasets over and over, one of the most useful optimizations is
caching. Caching will place a DataFrame, table, or RDD into temporary storage (either memory or
disk) across the executors in your cluster, and make subsequent reads faster. Although caching might
sound like something we should do all the time, it's not always a good thing to do. That's because
caching data incurs a serialization, deserialization, and storage cost. For example, if you are only
going to process a dataset once (in a later transformation), caching it will only slow you down.
The use case for caching is simple: as you work with data in Spark, either within an interactive
session or a standalone application, you will often want to reuse a certain dataset (e.g., a DataFrame
or RDD). For example, in an interactive data science session, you might load and clean your data and
then reuse it to try multiple statistical models. Or in a standalone application, you might run an
iterative algorithm that reuses the same dataset. You can tell Spark to cache a dataset using the cache
method on DataFrames or RDDs.
Caching is a lazy operation, meaning that things will be cached only as they are accessed. The RDD
API and the Structured API differ in how they actually perform caching, so let's review the gory
details before going over the storage levels. When we cache an RDD, we cache the actual, physical
data (i.e., the bits). When this data is accessed again, Spark returns the proper data. This is
done through the RDD reference. However, in the Structured API, caching is done based on the
physical plan. This means that we effectively store the physical plan as our key (as opposed to the
object reference) and perform a lookup prior to the execution of a Structured job. This can cause
confusion because sometimes you might be expecting to access raw data but because someone else
already cached the data, you're actually accessing their cached version. Keep that in mind when using
this feature.
There are different storage levels that you can use to cache your data, specifying what type of storage
to use. Table 19-1 lists the levels.
Table 19-1. Data cache storage levels

MEMORY_ONLY
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions
will not be cached and will be recomputed on the fly each time they’re needed. This is the default level.

MEMORY_AND_DISK
Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the
partitions that don’t fit on disk, and read them from there when they’re needed.

MEMORY_ONLY_SER (Java and Scala)
Store RDD as serialized Java objects (one byte array per partition). This is generally more
space-efficient than deserialized objects, especially when using a fast serializer, but more
CPU-intensive to read.

MEMORY_AND_DISK_SER (Java and Scala)
Similar to MEMORY_ONLY_SER, but spill partitions that don’t fit in memory to disk instead of recomputing
them on the fly each time they’re needed.

DISK_ONLY
Store the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
Same as the previous levels, but replicate each partition on two cluster nodes.

OFF_HEAP (experimental)
Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to
be enabled.
For more information on these options, take a look at Configuring Memory Management.
Figure 19-1 presents a simple illustration of the process. We load an initial DataFrame from a CSV
file and then derive some new DataFrames from it using transformations. We can avoid having to
recompute the original DataFrame (i.e., load and parse the CSV file) many times by adding a line to
cache it along the way.
Figure 19-1. A cached DataFrame
Now let's walk through the code:
# in Python
# Original loading code that does *not* cache DataFrame
DF1 = spark.read.format("csv")\
.option("inferSchema", "true")\
.option("header", "true")\
.load("/data/flight-data/csv/2015-summary.csv")
DF2 = DF1.groupBy("DEST_COUNTRY_NAME").count().collect()
DF3 = DF1.groupBy("ORIGIN_COUNTRY_NAME").count().collect()
DF4 = DF1.groupBy("count").count().collect()
You'll see here that we have our “lazily” created DataFrame (DF1), along with three other
DataFrames that access data in DF1. All of our downstream DataFrames share that common parent
(DF1) and will repeat the same work when we perform the preceding code. In this case, it's just
reading and parsing the raw CSV data, but that can be a fairly intensive process, especially for large
datasets.
On my machine, those commands take a second or two to run. Luckily, caching can help speed things
up. When we ask for a DataFrame to be cached, Spark will save the data in memory or on disk the
first time it computes it. Then, when any other queries come along, they'll just refer to the one stored
in memory as opposed to the original file. You do this using the DataFrame’s cache method:
DF1.cache()
DF1.count()
We used count above to eagerly cache the data (basically perform an action to force Spark to store
it in memory), because caching itself is lazy—the data is cached only the first time you run an
action on the DataFrame. Now that the data is cached, the previous commands will be faster, as we
can see by running the following code:
# in Python
DF2 = DF1.groupBy("DEST_COUNTRY_NAME").count().collect()
DF3 = DF1.groupBy("ORIGIN_COUNTRY_NAME").count().collect()
DF4 = DF1.groupBy("count").count().collect()
When we ran this code, it cut the time by more than half! This might not seem that wild, but picture a
large dataset or one that requires a lot of computation to create (not just reading in a file). The savings
can be immense. It's also great for iterative machine learning workloads because they'll often need to
access the same data a number of times, which we'll see shortly.
The cache command in Spark always places data in memory by default, caching only part of the
dataset if the cluster's total memory is full. For more control, there is also a persist method that
takes a StorageLevel object to specify where to cache the data: in memory, on disk, or both.
Joins
Joins are a common area for optimization. The biggest weapon you have when it comes to optimizing
joins is simply educating yourself about what each join does and how it's performed. This will help
you the most. Additionally, equi-joins are the easiest for Spark to optimize at this point and therefore
should be preferred wherever possible. Beyond that, simple things like trying to use the filtering
ability of inner joins by changing join ordering can yield large speedups. Additionally, using
broadcast join hints can help Spark make intelligent planning decisions when it comes to creating
query plans, as described in Chapter 8. Avoiding Cartesian joins or even full outer joins is often low-
hanging fruit for stability and optimizations because these can often be optimized into different
filtering style joins when you look at the entire data flow instead of just that one particular job area.
Lastly, following some of the other sections in this chapter can have a significant effect on joins. For
example, collecting statistics on tables prior to a join will help Spark make intelligent join decisions.
Additionally, bucketing your data appropriately can also help Spark avoid large shuffles when joins
are performed.
Aggregations
For the most part, there are not too many ways that you can optimize specific aggregations beyond
filtering data before the aggregation and having a sufficiently high number of partitions. However, if
you're using RDDs, controlling exactly how these aggregations are performed (e.g., using
reduceByKey when possible over groupByKey) can be very helpful and improve the speed and
stability of your code.
Broadcast Variables
We touched on broadcast joins and variables in previous chapters, and these are a good option for
optimization. The basic premise is that if some large piece of data will be used across multiple UDF
calls in your program, you can broadcast it to save just a single read-only copy on each node and
avoid re-sending this data with each job. For example, broadcast variables may be useful to save a
lookup table or a machine learning model. You can also broadcast arbitrary objects by creating
broadcast variables using your SparkContext, and then simply refer to those variables in your tasks,
as we discussed in Chapter 14.
Conclusion
There are many different ways to optimize the performance of your Spark Applications and make
them run faster and at a lower cost. In general, the main things you'll want to prioritize are (1) reading
as little data as possible through partitioning and efficient binary formats, (2) making sure there is
sufficient parallelism and no data skew on the cluster using partitioning, and (3) using high-level
APIs such as the Structured APIs as much as possible to take advantage of already optimized code. As with any
other software optimization work, you should also make sure you are optimizing the right operations
for your job: the Spark monitoring tools described in Chapter 18 will let you see which stages are
taking the longest time and focus your efforts on those. Once you have identified the work that you
believe can be optimized, the tools in this chapter will cover the most important performance
optimization opportunities for the majority of users.
Part V. Streaming
Chapter 20. Stream Processing
Fundamentals
Stream processing is a key requirement in many big data applications. As soon as an application
computes something of value—say, a report about customer activity, or a new machine learning model
—an organization will want to compute this result continuously in a production setting. As a result,
organizations of all sizes are starting to incorporate stream processing, often even in the first version
of a new application.
Luckily, Apache Spark has a long history of high-level support for streaming. In 2012, the project
incorporated Spark Streaming and its DStreams API, one of the first APIs to enable stream processing
using high-level functional operators like map and reduce. Hundreds of organizations now use
DStreams in production for large real-time applications, often processing terabytes of data per hour.
Much like the Resilient Distributed Dataset (RDD) API, however, the DStreams API is based on
relatively low-level operations on Java/Python objects that limit opportunities for higher-level
optimization. Thus, in 2016, the Spark project added Structured Streaming, a new streaming API built
directly on DataFrames that supports both rich optimizations and significantly simpler integration
with other DataFrame and Dataset code. The Structured Streaming API was marked as stable in
Apache Spark 2.2, and has also seen swift adoption throughout the Spark community.
In this book, we will focus only on the Structured Streaming API, which integrates directly with the
DataFrame and Dataset APIs we discussed earlier in the book and is the framework of choice for
writing new streaming applications. If you are interested in DStreams, many other books cover that
API, including several dedicated books on Spark Streaming only, such as Learning Spark Streaming
by Francois Garillot and Gerard Maas (O’Reilly, 2017). Much as with RDDs versus DataFrames,
however, Structured Streaming offers a superset of the majority of the functionality of DStreams, and
will often perform better due to code generation and the Catalyst optimizer.
Before we discuss the streaming APIs in Spark, let's more formally define streaming and batch
processing. This chapter will discuss some of the core concepts in this area that we will need
throughout this part of the book. It won't be a dissertation on this topic, but will cover enough of the
concepts to let you make sense of systems in this space.
What Is Stream Processing?
Stream processing is the act of continuously incorporating new data to compute a result. In stream
processing, the input data is unbounded and has no predetermined beginning or end. It simply forms a
series of events that arrive at the stream processing system (e.g., credit card transactions, clicks on a
website, or sensor readings from Internet of Things [IoT] devices). User applications can then
compute various queries over this stream of events (e.g., tracking a running count of each type of
event or aggregating them into hourly windows). The application will output multiple versions of the
result as it runs, or perhaps keep it up to date in an external “sink” system such as a key-value store.
Naturally, we can compare streaming to batch processing, in which the computation runs on a fixed-
input dataset. Oftentimes, this might be a large-scale dataset in a data warehouse that contains all the
historical events from an application (e.g., all website visits or sensor readings for the past month).
Batch processing also takes a query to compute, similar to stream processing, but only computes the
result once.
Although streaming and batch processing sound different, in practice, they often need to work
together. For example, streaming applications often need to join input data against a dataset written
periodically by a batch job, and the output of streaming jobs is often files or tables that are queried in
batch jobs. Moreover, any business logic in your applications needs to work consistently across
streaming and batch execution: for example, if you have custom code to compute a user’s billing
amount, it would be harmful to get a different result when running it in a streaming versus batch
fashion! To handle these needs, Structured Streaming was designed from the beginning to interoperate
easily with the rest of Spark, including batch applications. Indeed, the Structured Streaming
developers coined the term continuous applications to capture end-to-end applications that consist of
streaming, batch, and interactive jobs all working on the same data to deliver an end product.
Structured Streaming is focused on making it simple to build such applications in an end-to-end
fashion instead of only handling stream-level per-record processing.
Stream Processing Use Cases
We defined stream processing as the incremental processing of unbounded datasets, but that’s a
strange way to motivate a use case. Before we get into the advantages and disadvantages of streaming,
let’s explain why you might want to use streaming. We’ll describe six common use cases with varying
requirements from the underlying stream processing system.
Notifications and alerting
Probably the most obvious streaming use case involves notifications and alerting. Given some series
of events, a notification or alert should be triggered if some sort of event or series of events occurs.
This doesn’t necessarily imply autonomous or preprogrammed decision making; alerting can also be
used to notify a human counterpart of some action that needs to be taken. An example might be sending
an alert to an employee at a fulfillment center that they need to get a certain item from a location in the
warehouse and ship it to a customer. In either case, the notification needs to happen quickly.
Real-time reporting
Many organizations use streaming systems to run real-time dashboards that any employee can look at.
For example, this book’s authors leverage Structured Streaming every day to run real-time reporting
dashboards throughout Databricks (where both of us work). We use these dashboards
to monitor total platform usage, system load, uptime, and even usage of new features as they are
rolled out, among other applications.
Incremental ETL
One of the most common streaming applications is to reduce the latency companies must endure while
getting information into a data warehouse—in short, “my batch job, but streaming.” Spark batch
jobs are often used for Extract, Transform, and Load (ETL) workloads that turn raw data into a
structured format like Parquet to enable efficient queries. Using Structured Streaming, these jobs can
incorporate new data within seconds, enabling users to query it faster downstream. In this use case, it
is critical that data is processed exactly once and in a fault-tolerant manner: we don’t want to lose
any input data before it makes it to the warehouse, and we don’t want to load the same data twice.
Moreover, the streaming system needs to make updates to the data warehouse transactionally so as not
to confuse the queries running on it with partially written data.
Update data to serve in real time
Streaming systems are frequently used to compute data that gets served interactively by another
application. For example, a web analytics product such as Google Analytics might continuously track
the number of visits to each page, and use a streaming system to keep these counts up to date. When
users interact with the product’s UI, this web application queries the latest counts. Supporting this use
case requires that the streaming system can perform incremental updates to a key–value store (or
other serving system) as a sink, and often also that these updates are transactional, as in the ETL
case, to avoid corrupting the data in the application.
Real-time decision making
Real-time decision making on a streaming system involves analyzing new inputs and responding to
them automatically using business logic. An example use case would be a bank that wants to
automatically verify whether a new transaction on a customer’s credit card represents fraud based on
their recent history, and deny the transaction if the charge is determined fraudulent. This decision needs
to be made in real time while processing each transaction, so developers could implement this
business logic in a streaming system and run it against the stream of transactions. This type of
application will likely need to maintain a significant amount of state about each user to track their
current spending patterns, and automatically compare this state against each new transaction.
Online machine learning
A close derivative of the real-time decision-making use case is online machine learning. In this
scenario, you might want to train a model on a combination of streaming and historical data from
multiple users. An example might be more sophisticated than the aforementioned credit card
transaction use case: rather than reacting with hardcoded rules based on one customer’s behavior, the
company may want to continuously update a model from all customers’ behavior and test each
transaction against it. This is the most challenging use case of the bunch for stream processing systems
because it requires aggregation across multiple customers, joins against static datasets, integration
with machine learning libraries, and low-latency response times.
Advantages of Stream Processing
Now that we’ve seen some use cases for streaming, let’s crystallize some of the advantages of stream
processing. For the majority of use cases, batch processing is much simpler to understand,
troubleshoot, and write applications in. Additionally, the ability to process data in batch allows
for vastly higher data processing throughput than many streaming systems. However, stream
processing is essential in two cases. First, stream processing enables lower latency: when your
application needs to respond quickly (on a timescale of minutes, seconds, or milliseconds), you will
need a streaming system that can keep state in memory to get acceptable performance. Many of the
decision making and alerting use cases we described fall into this camp. Second, stream processing
can also be more efficient in updating a result than repeated batch jobs, because it automatically
incrementalizes the computation. For example, if we want to compute web traffic statistics over the
past 24 hours, a naively implemented batch job might scan all the data each time it runs, always
processing 24 hours’ worth of data. In contrast, a streaming system can remember state from the
previous computation and only count the new data. If you tell the streaming system to update your
report every hour, for example, it would only need to process 1 hour’s worth of data each time (the
new data since the last report). In a batch system, you would have to implement this kind of
incremental computation by hand to get the same performance, resulting in a lot of extra work that the
streaming system will automatically give you out of the box.
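To make the incrementalization advantage concrete, here is a minimal plain-Python sketch (not Spark code; the names are invented for illustration) contrasting a batch job that rescans all of its input with a streaming-style operator that carries state between updates:

```python
# Plain-Python sketch contrasting batch recomputation with incremental update.
# (Illustrative only; these names are invented and are not Spark APIs.)

def batch_count(events):
    """Naive batch job: rescans every event on each run."""
    return len(events)

class IncrementalCounter:
    """Streaming-style operator: keeps state and processes only new data."""
    def __init__(self):
        self.count = 0  # state remembered between updates

    def update(self, new_events):
        self.count += len(new_events)  # touches only the new records
        return self.count

hour1 = ["visit"] * 3
hour2 = ["visit"] * 2

# The batch job must rescan all five events to produce the report.
assert batch_count(hour1 + hour2) == 5

# The streaming operator sees each hour of data exactly once.
counter = IncrementalCounter()
counter.update(hour1)
assert counter.update(hour2) == 5
```

Both produce the same answer; the difference is that the streaming version touched each record once, while the batch version rescans everything on every run.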
Challenges of Stream Processing
We discussed motivations and advantages of stream processing, but as you likely know, there’s never
a free lunch. Let’s discuss some of the challenges of operating on streams.
To ground this example, let’s imagine that our application receives input messages from a sensor
(e.g., inside a car) that report its value at different times. We then want to search within this stream
for certain values, or certain patterns of values. One specific challenge is that the input records might
arrive at our application out of order: due to delays and retransmissions, for example, we might
receive the following sequence of updates in this arrival order, where the time field shows the time
when each value was actually measured:
{value: 1, time: "2017-04-07T00:00:00"}
{value: 2, time: "2017-04-07T01:00:00"}
{value: 5, time: "2017-04-07T02:00:00"}
{value: 10, time: "2017-04-07T01:30:00"}
{value: 7, time: "2017-04-07T03:00:00"}
In any data processing system, we can construct logic to perform some action based on receiving the
single value of “5.” In a streaming system, we can also respond to this individual event quickly.
However, things become more complicated if you want only to trigger some action based on a
specific sequence of values received, say, 2 then 10 then 5. In the case of batch processing, this is not
particularly difficult because we can simply sort all the events we have by the time field to see that 10
did come between 2 and 5. However, this is harder for stream processing systems. The reason is that
the streaming system is going to receive each event individually, and will need to track some state
across events to remember the 2 and 5 events and realize that the 10 event was between them. The
need to remember such state over the stream creates more challenges. For instance, what if you have a
massive data volume (e.g., millions of sensor streams) and the state itself is massive? What if a
machine in the system fails, losing some state? What if the load is imbalanced and one machine is
slow? And how can your application signal downstream consumers when analysis for some event is
“done” (e.g., the pattern 2-10-5 did not occur)? Should it wait a fixed amount of time or remember
some state indefinitely? All of these challenges and others—such as making the input and the output of
the system transactional—can come up when you want to deploy a streaming application.
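To see why such state is needed, here is a small plain-Python sketch (invented for illustration, not a real streaming API) that buffers out-of-order events and checks for the 2-then-10-then-5 pattern in event-time order:

```python
# Plain-Python sketch of the state needed to detect the event-time pattern
# 2 -> 10 -> 5 in an out-of-order stream. (Illustrative only; not a real API.)

class PatternDetector:
    def __init__(self, pattern):
        self.pattern = pattern
        self.buffer = []  # state: every event seen so far, in arrival order

    def on_event(self, value, time):
        """Buffer the event, then check the pattern in event-time order."""
        self.buffer.append((time, value))
        ordered = [v for _, v in sorted(self.buffer)]
        n = len(self.pattern)
        return any(ordered[i:i + n] == self.pattern
                   for i in range(len(ordered) - n + 1))

detector = PatternDetector([2, 10, 5])
arrivals = [(1, "2017-04-07T00:00:00"),
            (2, "2017-04-07T01:00:00"),
            (5, "2017-04-07T02:00:00"),
            (10, "2017-04-07T01:30:00"),  # late: measured before the 5 arrived
            (7, "2017-04-07T03:00:00")]

matches = [detector.on_event(v, t) for v, t in arrivals]
# The pattern only becomes visible once the late value 10 shows up.
assert matches == [False, False, False, True, True]
```

Note that the buffer grows without bound; deciding when it is safe to forget such state is exactly one of the challenges listed above.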
To summarize, the challenges we described in the previous paragraph, plus a couple of others, are as
follows:
Processing out-of-order data based on application timestamps (also called event time)
Maintaining large amounts of state
Supporting high data throughput
Processing each event exactly once despite machine failures
Handling load imbalance and stragglers
Responding to events at low latency
Joining with external data in other storage systems
Determining how to update output sinks as new events arrive
Writing data transactionally to output systems
Updating your applications business logic at runtime
Each of these topics is an active area of research and development in large-scale streaming systems.
To understand how different streaming systems have tackled these challenges, we describe a few of
the most common design concepts you will see across them.
Stream Processing Design Points
To support the stream processing challenges we described, including high throughput, low latency,
and out-of-order data, there are multiple ways to design a streaming system. We describe the most
common design options here, before describing Spark’s choices in the next section.
Record-at-a-Time Versus Declarative APIs
The simplest way to design a streaming API would be to just pass each event to the application and
let it react using custom code. This is the approach that many early streaming systems, such as Apache
Storm, implemented, and it has an important place when applications need full control over the
processing of data. Streaming systems that provide this kind of record-at-a-time API just give the user a
collection of “plumbing” to connect together into an application. However, the downside of these
systems is that most of the complicating factors we described earlier, such as maintaining state, are
solely governed by the application. For example, with a record-at-a-time API, you are responsible
for tracking state over longer time periods, dropping it after some time to clear up space, and
responding differently to duplicate events after a failure. Programming these systems correctly can be
quite challenging. At their core, low-level APIs require deep expertise to develop and maintain.
As a result, many newer streaming systems provide declarative APIs, where your application
specifies what to compute but not how to compute it in response to each new event and how to
recover from failure. Spark’s original DStreams API, for example, offered a functional API based on
operations like map, reduce, and filter on streams. Internally, the DStream API automatically tracked
how much data each operator had processed, saved any relevant state reliably, and recovered the
computation from failure when needed. Systems such as Google Dataflow and Apache Kafka Streams
provide similar, functional APIs. Spark’s Structured Streaming actually takes this concept even
further, switching from functional operations to relational (SQL-like) ones that enable even richer
automatic optimization of the execution without programming effort.
Event Time Versus Processing Time
For the systems with declarative APIs, a second concern is whether the system natively supports
event time. Event time is the idea of processing data based on timestamps inserted into each record at
the source, as opposed to the time when the record is received at the streaming application (which is
called processing time). In particular, when using event time, records may arrive at the system out of
order (e.g., if they traveled back on different network paths), and different sources may also be out of
sync with each other (some records may arrive later than other records for the same event time). If
your application collects data from remote sources that may be delayed, such as mobile phones or IoT
devices, event-time processing is crucial: without it, you will miss important patterns when some
data is late. In contrast, if your application only processes local events (e.g., ones generated in the
same datacenter), you may not need sophisticated event-time processing.
When using event time, several issues become common concerns across applications, including
tracking state in a manner that allows the system to incorporate late events, and determining when it is
safe to output a result for a given time window in event time (i.e., when the system is likely to have
received all the input up to that point). Because of this, many declarative systems, including
Structured Streaming, have “native” support for event time integrated into all their APIs, so that these
concerns can be handled automatically across your whole program.
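The difference between the two notions of time can be sketched in a few lines of plain Python (a toy illustration, not Spark code): grouping the same records into hourly windows by processing time versus by the event time carried in each record gives different answers as soon as one record arrives late.

```python
# Plain Python: grouping the same readings into hourly windows by event time
# versus by processing (arrival) time. (Toy illustration, not Spark code.)
from collections import Counter

# (event_hour, arrival_hour) per record; the second record was measured in
# hour 1 but arrived late, during hour 2.
records = [(1, 1), (1, 2), (2, 2)]

by_event_time = Counter(event for event, _ in records)
by_processing_time = Counter(arrival for _, arrival in records)

assert by_event_time == {1: 2, 2: 1}       # late record lands where it belongs
assert by_processing_time == {1: 1, 2: 2}  # late record miscounted into hour 2
```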
Continuous Versus Micro-Batch Execution
The final design decision you will often see come up is about continuous versus micro-batch
execution. In continuous processing-based systems, each node in the system is continually listening to
messages from other nodes and outputting new updates to its child nodes. For example, suppose that
your application implements a map-reduce computation over several input streams. In a continuous
processing system, each of the nodes implementing map would read records one by one from an input
source, compute its function on them, and send them to the appropriate reducer. The reducer would
then update its state whenever it gets a new record. The key idea is that this happens on each
individual record, as illustrated in Figure 20-1.
Figure 20-1. Continuous processing
Continuous processing has the advantage of offering the lowest possible latency when the total input
rate is relatively low, because each node responds immediately to a new message. However,
continuous processing systems generally have lower maximum throughput, because they incur a
significant amount of overhead per-record (e.g., calling the operating system to send a packet to a
downstream node). In addition, continuous systems generally have a fixed topology of operators that
cannot be moved at runtime without stopping the whole system, which can introduce load balancing
issues.
In contrast, micro-batch systems wait to accumulate small batches of input data (say, 500 ms’ worth),
then process each batch in parallel using a distributed collection of tasks, similar to the execution of a
batch job in Spark. Micro-batch systems can often achieve high throughput per node because they
leverage the same optimizations as batch systems (e.g., vectorized processing), and do not incur any
extra per-record overhead, as illustrated in Figure 20-2.
Figure 20-2. Micro-batch
Thus, they need fewer nodes to process the same rate of data. Micro-batch systems can also use
dynamic load balancing techniques to handle changing workloads (e.g., increasing or decreasing the
number of tasks). The downside, however, is a higher base latency due to waiting to accumulate a
micro-batch. In practice, the streaming applications that are large-scale enough to need to distribute
their computation tend to prioritize throughput, so Spark has traditionally implemented micro-batch
processing. In Structured Streaming, however, there is an active development effort to also support a
continuous processing mode beneath the same API.
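As a toy illustration (plain Python, batching by record count rather than by a time interval as a real engine would), a micro-batch engine accumulates input and folds each batch into its running state:

```python
# Plain-Python sketch of micro-batch execution. (Illustrative only; real
# engines batch by a time interval, e.g. 500 ms, rather than by record count.)

def micro_batch_run(stream, batch_size):
    """Split the input into micro-batches and fold each one into running state,
    the way a micro-batch engine runs one small batch job per trigger."""
    state = 0
    batches = [stream[i:i + batch_size]
               for i in range(0, len(stream), batch_size)]
    for batch in batches:
        # The whole batch is processed together, amortizing per-record overhead.
        state += sum(batch)
    return state, len(batches)

total, num_batches = micro_batch_run([1, 2, 3, 4, 5], batch_size=2)
assert total == 15
assert num_batches == 3  # [1, 2], [3, 4], [5]
```

The trade-off in the text falls out of this structure: larger batches mean fewer, cheaper invocations (higher throughput) but a longer wait before any record in a batch is processed (higher latency).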
When choosing between these two execution modes, the main factors you should keep in mind are
your desired latency and total cost of operation (TCO). Micro-batch systems can comfortably deliver
latencies from 100 ms to a second, depending on the application. Within this regime, they will
generally require fewer nodes to achieve the same throughput, and hence lower operational cost
(including lower maintenance cost due to less frequent node failures). For much lower latencies, you
should consider a continuous processing system, or using a micro-batch system in conjunction with a
fast serving layer to provide low-latency queries (e.g., outputting data into MySQL or Apache
Cassandra, where it can be served to clients in milliseconds).
Spark’s Streaming APIs
We covered some high-level design approaches to stream processing, but thus far we have not
discussed Spark’s APIs in detail. Spark includes two streaming APIs, as we discussed at the
beginning of this chapter. The earlier DStream API in Spark Streaming is purely micro-batch
oriented. It has a declarative (functional-based) API but no support for event time. The newer
Structured Streaming API adds higher-level optimizations, event time, and support for continuous
processing.
The DStream API
Spark’s original DStream API has been used broadly for stream processing since its first release in
2012. For example, DStreams was the most widely used processing engine in Datanami’s 2016
survey. Many companies use and operate Spark Streaming at scale in production today due to its high-
level API and simple exactly-once semantics. Interactions with RDD code, such as joins
with static data, are also natively supported in Spark Streaming. Operating Spark Streaming isnt
much more difficult than operating a normal Spark cluster. However, the DStreams API has several
limitations. First, it is based purely on Java/Python objects and functions, as opposed to the richer
concept of structured tables in DataFrames and Datasets. This limits the engine’s opportunity to
perform optimizations. Second, the API is purely based on processing time—to handle event-time
operations, applications need to implement them on their own. Finally, DStreams can only operate in
a micro-batch fashion, and exposes the duration of micro-batches in some parts of its API, making it
difficult to support alternative execution modes.
Structured Streaming
Structured Streaming is a higher-level streaming API built from the ground up on Spark’s Structured
APIs. It is available in all the environments where structured processing runs, including Scala, Java,
Python, R, and SQL. Like DStreams, it is a declarative API based on high-level operations, but by
building on the structured data model introduced in the previous part of the book, Structured
Streaming can perform more types of optimizations automatically. However, unlike DStreams,
Structured Streaming has native support for event-time data (all of its windowing operators
automatically support it). As of Apache Spark 2.2, the system only runs in a micro-batch model, but
the Spark team at Databricks has announced an effort called Continuous Processing to add a
continuous execution mode. This should become an option for users in Spark 2.3.
More fundamentally, beyond simplifying stream processing, Structured Streaming is also designed to
make it easy to build end-to-end continuous applications using Apache Spark that combine streaming,
batch, and interactive queries. For example, Structured Streaming does not use a separate API from
DataFrames: you simply write a normal DataFrame (or SQL) computation and launch it on a stream.
Structured Streaming will automatically update the result of this computation in an incremental
fashion as data arrives. This is a major help when writing end-to-end data applications: developers
do not need to maintain a separate streaming version of their batch code, possibly for a different
execution system, and risk having these two versions of the code fall out of sync. As another example,
Structured Streaming can output data to standard sinks usable by Spark SQL, such as Parquet tables,
making it easy to query your stream state from other Spark applications. In future versions of
Apache Spark, we expect more and more components of the project to integrate with Structured
Streaming, including online learning algorithms in MLlib.
In general, Structured Streaming is meant to be an easier-to-use and higher-performance evolution of
Spark Streaming’s DStream API, so we will focus solely on this new API in this book. Many of the
concepts, such as building a computation out of a graph of transformations, also apply to DStreams,
but we leave the exposition of that to other books.
Conclusion
This chapter covered the basic concepts and ideas that you’re going to need to understand stream
processing. The design approaches introduced in this chapter should clarify how you can evaluate
streaming systems for a given application. You should also feel comfortable understanding what
trade-offs the authors of DStreams and Structured Streaming have made, and why the direct support
for DataFrame programs is a big help when using Structured Streaming: there is no need to duplicate
your application logic. In the upcoming chapters, we’ll dive right into Structured Streaming to
understand how to use it.
Chapter 21. Structured Streaming Basics
Now that we have covered a brief overview of stream processing, let’s dive right into Structured
Streaming. In this chapter, we will, again, state some of the key concepts behind Structured Streaming
and then apply them with some code examples that show how easy the system is to use.
Structured Streaming Basics
Structured Streaming, as we discussed at the end of Chapter 20, is a stream processing framework
built on the Spark SQL engine. Rather than introducing a separate API, Structured Streaming uses the
existing structured APIs in Spark (DataFrames, Datasets, and SQL), meaning that all the operations
you are familiar with there are supported. Users express a streaming computation in the same way
they’d write a batch computation on static data. After specifying this computation and a streaming
destination, the Structured Streaming engine will take care of running your query incrementally and
continuously as new data arrives into the system. These logical instructions for the computation are
then executed using the same Catalyst engine discussed in Part II of this book, including query
optimization, code generation, etc. Beyond the core structured processing engine, Structured
Streaming includes a number of features specifically for streaming. For instance, Structured
Streaming ensures end-to-end, exactly-once processing as well as fault-tolerance through
checkpointing and write-ahead logs.
The main idea behind Structured Streaming is to treat a stream of data as a table to which data is
continuously appended. The job then periodically checks for new input data, processes it, updates some
internal state located in a state store if needed, and updates its result. A cornerstone of the API is that
you should not have to change your query’s code when doing batch or stream processing—you should
have to specify only whether to run that query in a batch or streaming fashion. Internally, Structured
Streaming will automatically figure out how to “incrementalize” your query, i.e., update its result
efficiently whenever new data arrives, and will run it in a fault-tolerant fashion.
Figure 21-1. Structured streaming input
In simplest terms, Structured Streaming is “your DataFrame, but streaming.” This makes it very easy
to get started using streaming applications. You probably already have the code for them! There are
some limits to the types of queries Structured Streaming will be able to run, however, as well as
some new concepts you have to think about that are specific to streaming, such as event-time and out-
of-order data. We will discuss these in this and the following chapters.
Finally, by integrating with the rest of Spark, Structured Streaming enables users to build what we
call continuous applications. A continuous application is an end-to-end application that reacts to data
in real time by combining a variety of tools: streaming jobs, batch jobs, joins between streaming and
offline data, and interactive ad-hoc queries. Because most streaming jobs today are deployed within
the context of a larger continuous application, the Spark developers sought to make it easy to specify
the whole application in one framework and get consistent results across these different portions of it.
For example, you can use Structured Streaming to continuously update a table that users query
interactively with Spark SQL, serve a machine learning model trained by MLlib, or join streams with
offline data in any of Spark’s data sources—applications that would be much more complex to build
using a mix of different tools.
Core Concepts
Now that we introduced the high-level idea, let’s cover some of the important concepts in a
Structured Streaming job. One thing you will hopefully find is that there aren’t many. That’s because
Structured Streaming is designed to be simple. Read some other big data streaming books and you’ll
notice that they begin by introducing terminology like distributed stream processing topologies for
skewed data reducers (a caricature, but accurate) and other complex verbiage. Spark’s goal is to
handle these concerns automatically and give users a simple way to run any Spark computation on a
stream.
Transformations and Actions
Structured Streaming maintains the same concept of transformations and actions that we have seen
throughout this book. The transformations available in Structured Streaming are, with a few
restrictions, the exact same transformations that we saw in Part II. The restrictions usually involve
some types of queries that the engine cannot incrementalize yet, although some of the limitations are
being lifted in new versions of Spark. There is generally only one action available in Structured
Streaming: that of starting a stream, which will then run continuously and output results.
Input Sources
Structured Streaming supports several input sources for reading in a streaming fashion. As of Spark
2.2, the supported input sources are as follows:
Apache Kafka 0.10
Files on a distributed file system like HDFS or S3 (Spark will continuously read new files
in a directory)
A socket source for testing
We discuss these in depth later in this chapter, but it’s worth mentioning that the authors of Spark are
working on a stable source API so that you can build your own streaming connectors.
Sinks
Just as sources allow you to get data into Structured Streaming, sinks specify the destination for the
result set of that stream. Sinks and the execution engine are also responsible for reliably tracking the
exact progress of data processing. Here are the supported output sinks as of Spark 2.2:
Apache Kafka 0.10
Almost any file format
A foreach sink for running arbitrary computation on the output records
A console sink for testing
A memory sink for debugging
We discuss these in more detail later in the chapter when we discuss sources.
Output Modes
Defining a sink for our Structured Streaming job is only half of the story. We also need to define how
we want Spark to write data to that sink. For instance, do we only want to append new information?
Do we want to update rows as we receive more information about them over time (e.g., updating the
click count for a given web page)? Do we want to completely overwrite the result set every single
time (i.e., always write a file with the complete click counts for all pages)? To do this, we define an
output mode, similar to how we define output modes in the static Structured APIs.
The supported output modes are as follows:
Append (only add new records to the output sink)
Update (update changed records in place)
Complete (rewrite the full output)
One important detail is that certain queries, and certain sinks, only support certain output modes, as
we will discuss later in the book. For example, suppose that your job is just performing a map on a
stream. The output data will grow indefinitely as new records arrive, so it would not make sense to
use Complete mode, which requires writing all the data to a new file at once. In contrast, if you are
doing an aggregation into a limited number of keys, Complete and Update modes would make sense,
but Append would not, because the values of some keys need to be updated over time.
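A plain-Python sketch (illustrative only, not Spark code) of the three modes applied to a running word count after one more batch of data is processed:

```python
# Plain-Python sketch of the three output modes applied to a running word
# count after one more micro-batch is processed. (Illustrative only.)

previous = {"a": 1, "b": 2}
updated  = {"a": 1, "b": 3, "c": 1}  # result after the new records

# Complete: rewrite the full result every time.
complete_output = updated

# Update: write only rows that changed or are new.
update_output = {k: v for k, v in updated.items() if previous.get(k) != v}

# Append: write only brand-new rows. For an aggregation like this, "b" keeps
# changing after being appended, which is why Append is not allowed here.
append_output = {k: v for k, v in updated.items() if k not in previous}

assert complete_output == {"a": 1, "b": 3, "c": 1}
assert update_output == {"b": 3, "c": 1}
assert append_output == {"c": 1}
```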
Triggers
Whereas output modes define how data is output, triggers define when data is output—that is, when
Structured Streaming should check for new input data and update its result. By default, Structured
Streaming will look for new input records as soon as it has finished processing the last group of input
data, giving the lowest latency possible for new results. However, this behavior can lead to writing
many small output files when the sink is a set of files. Thus, Spark also supports triggers based on
processing time (only look for new data at a fixed interval). In the future, other types of triggers may
also be supported.
Event-Time Processing
Structured Streaming also has support for event-time processing (i.e., processing data based on
timestamps included in the record that may arrive out of order). There are two key ideas that you will
need to understand here for the moment; we will talk about both of these in much more depth in the
next chapter, so don’t worry if you’re not perfectly clear on them at this point.
Event-time data
Event time refers to time fields that are embedded in your data. This means that rather than processing
data according to the time it reaches your system, you process it according to the time that it was
generated, even if records arrive out of order at the streaming application due to slow uploads or
network delays. Expressing event-time processing is simple in Structured Streaming. Because the
system views the input data as a table, the event time is just another field in that table, and your
application can do grouping, aggregation, and windowing using standard SQL operators. However,
under the hood, Structured Streaming can take some special actions when it knows that one of your
columns is an event-time field, including optimizing query execution or determining when it is safe to
forget state about a time window. Many of these actions can be controlled using watermarks.
Watermarks
Watermarks are a feature of streaming systems that allow you to specify how late they expect to see
data in event time. For example, in an application that processes logs from mobile devices, one might
expect logs to be up to 30 minutes late due to upload delays. Systems that support event time,
including Structured Streaming, usually allow setting watermarks to limit how long they need to
remember old data. Watermarks can also be used to control when to output a result for a particular
event time window (e.g., waiting until the watermark for it has passed).
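The bookkeeping can be sketched in plain Python (an invented toy class, not Structured Streaming’s actual API): the watermark trails the largest event time seen by the allowed lateness, and records older than the watermark are considered too late.

```python
# Plain-Python sketch of watermark bookkeeping: the watermark trails the
# largest event time seen by an allowed lateness; records older than the
# watermark may be dropped and their state forgotten. (Illustrative only.)

class Watermark:
    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = 0

    def observe(self, event_time):
        """Advance the watermark; return True if the record is still on time."""
        self.max_event_time = max(self.max_event_time, event_time)
        return event_time >= self.watermark()

    def watermark(self):
        return self.max_event_time - self.allowed_lateness

# Allow records up to 30 (minutes) late.
wm = Watermark(allowed_lateness=30)
assert wm.observe(60)       # first record; watermark is now 60 - 30 = 30
assert wm.observe(40)       # 40 >= 30: late, but within the allowance
assert wm.observe(120)      # watermark advances to 90
assert not wm.observe(50)   # 50 < 90: too late, its window was finalized
```

This also shows how a watermark bounds state: once the watermark passes a window’s end, the system can safely output that window’s result and discard its state.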
Structured Streaming in Action
Let's get to an applied example of how you might use Structured Streaming. For our examples, we're
going to be working with the Heterogeneity Human Activity Recognition Dataset. The data consists of
smartphone and smartwatch sensor readings from a variety of devices—specifically, the
accelerometer and gyroscope, sampled at the highest possible frequency supported by the devices.
Readings from these sensors were recorded while users performed activities like biking, sitting,
standing, walking, and so on. There are several different smartphones and smartwatches used, and
nine total users. You can download the data here, in the activity data folder.
TIP
This Dataset is fairly large. If it’s too large for your machine, you can remove some of the files and it will work just fine.
Let's read in the static version of the dataset as a DataFrame:
// in Scala
val static = spark.read.json("/data/activity-data/")
val dataSchema = static.schema
# in Python
static = spark.read.json("/data/activity-data/")
dataSchema = static.schema
Here’s the schema:
root
|-- Arrival_Time: long (nullable = true)
|-- Creation_Time: long (nullable = true)
|-- Device: string (nullable = true)
|-- Index: long (nullable = true)
|-- Model: string (nullable = true)
|-- User: string (nullable = true)
|-- _corrupt_record: string (nullable = true)
|-- gt: string (nullable = true)
|-- x: double (nullable = true)
|-- y: double (nullable = true)
|-- z: double (nullable = true)
Here’s a sample of the DataFrame:
+-------------+------------------+--------+-----+------+----+--------+-----+-----
| Arrival_Time| Creation_Time| Device|Index| Model|User|_c...ord| gt| x
|1424696634224|142469663222623685|nexus4_1| 62|nexus4| a| null|stand|-0...
...
|1424696660715|142469665872381726|nexus4_1| 2342|nexus4| a| null|stand|-0...
+-------------+------------------+--------+-----+------+----+--------+-----+-----
You can see that the preceding sample includes a number of timestamp columns as well as model, user,
and device information. The gt field specifies what activity the user was doing at that time.
Next, let’s create a streaming version of the same Dataset, which will read each input file in the
dataset one by one as if it was a stream.
Streaming DataFrames are largely the same as static DataFrames. We create them within Spark
applications and then perform transformations on them to get our data into the correct format.
Basically, all of the transformations that are available in the static Structured APIs apply to Streaming
DataFrames. However, one small difference is that Structured Streaming does not let you perform
schema inference without explicitly enabling it. You can enable schema inference by setting
the configuration spark.sql.streaming.schemaInference to true. Given that fact, we will read
the schema from one file (that we know has a valid schema) and pass the dataSchema object from
our static DataFrame to our streaming DataFrame. As mentioned, you should avoid relying on
inferred schemas in a production scenario, where your data may (accidentally) change out from under you:
// in Scala
val streaming = spark.readStream.schema(dataSchema)
.option("maxFilesPerTrigger", 1).json("/data/activity-data")
# in Python
streaming = spark.readStream.schema(dataSchema).option("maxFilesPerTrigger", 1)\
.json("/data/activity-data")
NOTE
We discuss maxFilesPerTrigger a little later in this chapter, but essentially it allows you to control how quickly Spark
will read all of the files in the folder. By specifying a low value, we're artificially limiting the flow of the stream to one
file per trigger. This helps us demonstrate how Structured Streaming runs incrementally in our example, but probably isn't
something you'd use in production.
Just like with other Spark APIs, streaming DataFrame creation and execution is lazy. In particular, we
can now specify transformations on our streaming DataFrame before finally calling an action to start
the stream. In this case, we’ll show one simple transformation—we will group and count data by the
gt column, which is the activity being performed by the user at that point in time:
// in Scala
val activityCounts = streaming.groupBy("gt").count()
# in Python
activityCounts = streaming.groupBy("gt").count()
Because this code is being run in local mode on a small machine, we are going to set the number of
shuffle partitions to a small value to avoid creating too many:
spark.conf.set("spark.sql.shuffle.partitions", 5)
Now that we have set up our transformation, we need only specify an action to start the query. As
mentioned previously in the chapter, we will specify an output destination, or output sink, for the
result of this query. For this basic example, we are going to write to a memory sink, which keeps an
in-memory table of the results.
In the process of specifying this sink, we’re going to need to define how Spark will output that data. In
this example, we use the complete output mode. This mode rewrites all of the keys along with their
counts after every trigger:
// in Scala
val activityQuery = activityCounts.writeStream.queryName("activity_counts")
.format("memory").outputMode("complete")
.start()
# in Python
activityQuery = activityCounts.writeStream.queryName("activity_counts")\
.format("memory").outputMode("complete")\
.start()
We are now writing out our stream! You'll notice that we set a unique query name to represent this
stream, in this case activity_counts. We specified our format as an in-memory table and we set the
output mode.
When we run the preceding code, we also want to include the following line:
activityQuery.awaitTermination()
After this code is executed, the streaming computation will have started in the background. The query
object is a handle to that active streaming query, and we must specify that we would like to wait for
the termination of the query using activityQuery.awaitTermination() to prevent the driver
process from exiting while the query is active. We will omit this in future parts of the book for
readability, but it must be included in your production applications; otherwise, your stream won't be
able to run.
Spark lists this stream, and other active ones, under the active streams in our SparkSession. We can
see a list of those streams by running the following:
spark.streams.active
Spark also assigns each stream a UUID, so if need be you could iterate through the list of running
streams and select this one. In this case, we assigned it to a variable, so that's not necessary.
Now that this stream is running, we can experiment with the results by querying the in-memory table
that it maintains with the current output of our streaming aggregation. This table will be called
activity_counts, the same as the stream. To see the current data in this output table, we simply
need to query it! We’ll do this in a simple loop that will print the results of the streaming query every
second:
// in Scala
for( i <- 1 to 5 ) {
spark.sql("SELECT * FROM activity_counts").show()
Thread.sleep(1000)
}
# in Python
from time import sleep
for x in range(5):
spark.sql("SELECT * FROM activity_counts").show()
sleep(1)
As the preceding queries run, you should see the counts for each activity change over time. For
instance, the first show call displays the following result (because we queried it while the stream was
reading the first file):
+---+-----+
| gt|count|
+---+-----+
+---+-----+
A subsequent show call shows the following result—note that the result will probably vary when
you're running this code yourself because you will likely start it at a different time:
+----------+-----+
| gt|count|
+----------+-----+
| sit| 8207|
...
| null| 6966|
| bike| 7199|
+----------+-----+
With this simple example, the power of Structured Streaming should become clear. You can take the
same operations that you use in batch and run them on a stream of data with very few code changes
(essentially just specifying that it's a stream). The rest of this chapter touches on some of the details
about the various manipulations, sources, and sinks that you can use with Structured Streaming.
Transformations on Streams
Streaming transformations, as we mentioned, include almost all static DataFrame transformations that
you already saw in Part II. All select, filter, and simple transformations are supported, as are all
DataFrame functions and individual column manipulations. The limitations arise with transformations
that do not make sense in the context of streaming data. For example, as of Apache Spark 2.2, users
cannot sort streams that are not aggregated, and cannot perform multiple levels of aggregation
without using stateful processing (covered in the next chapter). These limitations may be lifted as
Structured Streaming continues to develop, so we encourage you to check the documentation of your
version of Spark for updates.
Selections and Filtering
All select and filter transformations are supported in Structured Streaming, as are all DataFrame
functions and individual column manipulations. We show a simple example using selections and
filtering below. In this case, because we are not updating any keys over time, we will use the Append
output mode, so that new results are appended to the output table:
// in Scala
import org.apache.spark.sql.functions.expr
val simpleTransform = streaming.withColumn("stairs", expr("gt like '%stairs%'"))
.where("stairs")
.where("gt is not null")
.select("gt", "model", "arrival_time", "creation_time")
.writeStream
.queryName("simple_transform")
.format("memory")
.outputMode("append")
.start()
# in Python
from pyspark.sql.functions import expr
simpleTransform = streaming.withColumn("stairs", expr("gt like '%stairs%'"))\
.where("stairs")\
.where("gt is not null")\
.select("gt", "model", "arrival_time", "creation_time")\
.writeStream\
.queryName("simple_transform")\
.format("memory")\
.outputMode("append")\
.start()
Aggregations
Structured Streaming has excellent support for aggregations. You can specify arbitrary aggregations,
as you saw in the Structured APIs. For example, you can use a more exotic aggregation, like a cube,
on the phone model and activity and the average x, y, z accelerations of our sensor (jump back to
Chapter 7 in order to see potential aggregations that you can run on your stream):
// in Scala
val deviceModelStats = streaming.cube("gt", "model").avg()
.drop("avg(Arrival_time)")
.drop("avg(Creation_Time)")
.drop("avg(Index)")
.writeStream.queryName("device_counts").format("memory").outputMode("complete")
.start()
# in Python
deviceModelStats = streaming.cube("gt", "model").avg()\
.drop("avg(Arrival_time)")\
.drop("avg(Creation_Time)")\
.drop("avg(Index)")\
.writeStream.queryName("device_counts").format("memory")\
.outputMode("complete")\
.start()
Querying that table allows us to see the results:
SELECT * FROM device_counts
+----------+------+------------------+--------------------+--------------------+
| gt| model| avg(x)| avg(y)| avg(z)|
+----------+------+------------------+--------------------+--------------------+
| sit| null|-3.682775300344...|1.242033094787975...|-4.22021191297611...|
| stand| null|-4.415368069618...|-5.30657295890281...|2.264837548081631...|
...
| walk|nexus4|-0.007342235359...|0.004341030525168...|-6.01620400184307...|
|stairsdown|nexus4|0.0309175199508...|-0.02869185568293...| 0.11661923308518365|
...
+----------+------+------------------+--------------------+--------------------+
In addition to these aggregations on raw columns in the dataset, Structured Streaming has special
support for columns that represent event time, including watermark support and windowing. We will
discuss these in more detail in Chapter 22.
NOTE
As of Spark 2.2, the one limitation of aggregations is that multiple "chained" aggregations (aggregations on streaming
aggregations) are not supported. However, you can achieve this by writing out to an intermediate sink of data,
like Kafka or a file sink. This will change in the future as the Structured Streaming community adds this functionality.
Joins
As of Apache Spark 2.2, Structured Streaming supports joining streaming DataFrames to static
DataFrames. Spark 2.3 will add the ability to join multiple streams together. You can do multiple
column joins and supplement streaming data with that from static data sources:
// in Scala
val historicalAgg = static.groupBy("gt", "model").avg()
val deviceModelStats = streaming.drop("Arrival_Time", "Creation_Time", "Index")
.cube("gt", "model").avg()
.join(historicalAgg, Seq("gt", "model"))
.writeStream.queryName("device_counts").format("memory").outputMode("complete")
.start()
# in Python
historicalAgg = static.groupBy("gt", "model").avg()
deviceModelStats = streaming.drop("Arrival_Time", "Creation_Time", "Index")\
.cube("gt", "model").avg()\
.join(historicalAgg, ["gt", "model"])\
.writeStream.queryName("device_counts").format("memory")\
.outputMode("complete")\
.start()
In Spark 2.2, full outer joins, left joins with the stream on the right side, and right joins with the
stream on the left are not supported. Structured Streaming also does not yet support stream-to-stream
joins, but this is also a feature under active development.
Input and Output
This section dives deeper into the details of how sources, sinks, and output modes work in Structured
Streaming. Specifically, we discuss how, when, and where data flows into and out of the system. As
of this writing, Structured Streaming supports several sources and sinks, including Apache Kafka,
files, and several sources and sinks for testing and debugging. More sources may be added over time,
so be sure to check the documentation for the most up-to-date information. We discuss the source and
sink for a particular storage system together in this chapter, but in reality you can mix and match them
(e.g., use a Kafka input source with a file sink).
Where Data Is Read and Written (Sources and Sinks)
Structured Streaming supports several production sources and sinks (files and Apache Kafka), as
well as some debugging tools like the memory table sink. We mentioned these at the beginning of the
chapter, but now lets cover the details of each one.
File source and sink
Probably the simplest source you can think of is the file source. It's easy to reason about and
understand. While essentially any file format should work, the ones that we see in practice are
Parquet, text, JSON, and CSV.
The only difference between using the file source/sink and Spark's static file source is that with
streaming, we can control the number of files that we read in during each trigger via the
maxFilesPerTrigger option that we saw earlier.
Keep in mind that any files you add into an input directory for a streaming job need to appear in it
atomically. Otherwise, Spark will process partially written files before you have finished writing them. On file
systems that show partial writes, such as local files or HDFS, this is best done by writing the file in
an external directory and moving it into the input directory when finished. On Amazon S3, objects
normally only appear once fully written.
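For example, on a local file system you can get this "appear atomically" behavior by writing the file to a scratch location on the same file system and then renaming it into the watched input directory in one step (a rename within a file system is atomic). This is a minimal sketch of that pattern; the directory name activity-input is hypothetical:

```python
import json
import os
import tempfile

def publish_atomically(records, input_dir):
    """Write a file outside the watched directory, then move it in atomically."""
    os.makedirs(input_dir, exist_ok=True)
    # Stage the file on the same file system so the rename stays atomic...
    fd, staging_path = tempfile.mkstemp(suffix=".json", dir=".")
    with os.fdopen(fd, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    # ...then rename it into the input directory in a single atomic step.
    final_path = os.path.join(input_dir, os.path.basename(staging_path))
    os.rename(staging_path, final_path)
    return final_path

path = publish_atomically([{"gt": "walk", "x": 0.01}], "activity-input")
print(os.path.exists(path))   # True
```

A streaming job watching activity-input will only ever see the finished file, never a half-written one.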
Kafka source and sink
Apache Kafka is a distributed publish-and-subscribe system for streams of data. Kafka lets you
publish and subscribe to streams of records like you might do with a message queue—these are
stored as streams of records in a fault-tolerant way. Think of Kafka like a distributed buffer. Kafka
lets you store streams of records in categories that are referred to as topics. Each record in Kafka
consists of a key, a value, and a timestamp. Topics consist of immutable sequences of records for
which the position of a record in a sequence is called an offset. Reading data is called subscribing to
a topic and writing data is as simple as publishing to a topic.
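A toy plain-Python sketch of that log abstraction (hypothetical class, not the Kafka client API): each topic partition is an append-only sequence, an offset is simply a record's position in it, and a subscriber reads forward from whatever offset it last reached.

```python
class TopicPartition:
    """A toy append-only log: offsets are simply list indices."""

    def __init__(self):
        self.records = []

    def publish(self, key, value):
        # Publishing appends the record and returns its offset.
        offset = len(self.records)
        self.records.append((key, value))
        return offset

    def read_from(self, offset):
        # A subscriber resumes from its last processed offset.
        return list(enumerate(self.records[offset:], start=offset))

partition = TopicPartition()
partition.publish("deviceA", "reading-1")
partition.publish("deviceA", "reading-2")
partition.publish("deviceB", "reading-3")

# A consumer that has already processed offsets 0 and 1 resumes at offset 2.
print(partition.read_from(2))   # [(2, ('deviceB', 'reading-3'))]
```

Because records are immutable and ordered, an offset is all a consumer (or Spark's checkpointing) needs to remember to resume exactly where it left off.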
Spark allows you to read from Kafka with both batch and streaming DataFrames.
As of Spark 2.2, Structured Streaming supports Kafka version 0.10. This too is likely to expand in the
future, so be sure to check the documentation for more information about the Kafka versions
available. There are only a few options that you need to specify when you read from Kafka.
Reading from the Kafka Source
To read, you first need to choose one of the following options: assign, subscribe, or
subscribePattern. Only one of these can be present as an option when you go to read from Kafka.
assign is a fine-grained way of specifying not just the topic but also the topic partitions from which
you would like to read. This is specified as a JSON string {"topicA":[0,1],"topicB":[2,4]}.
subscribe and subscribePattern are ways of subscribing to one or more topics, either by
specifying a list of topics (the former) or via a pattern (the latter).
Second, you will need to specify the kafka.bootstrap.servers that Kafka provides to connect to
the service.
After you have specified those, there are several other options you can set:
startingOffsets and endingOffsets
The start point when a query is started, either earliest, which is from the earliest offsets;
latest, which is just from the latest offsets; or a JSON string specifying a starting offset for each
TopicPartition. In the JSON, -2 can be used as an offset to refer to earliest, and -1 to latest. For
example, the JSON specification could be {"topicA":{"0":23,"1":-1},"topicB":
{"0":-2}}. This applies only when a new streaming query is started; resuming will
always pick up from where the query left off. Newly discovered partitions during a query will
start at earliest. endingOffsets uses the same format to specify where a query should end, and applies only to batch queries.
failOnDataLoss
Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are
out of range). This might be a false alarm, so you can disable it when it doesn't work as you
expect. The default is true.
maxOffsetsPerTrigger
The total number of offsets to read in a given trigger.
There are also options for setting Kafka consumer timeouts, fetch retries, and intervals.
To read from Kafka, do the following in Structured Streaming:
// in Scala
// Subscribe to 1 topic
val ds1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
// Subscribe to multiple topics
val ds2 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1,topic2")
.load()
// Subscribe to a pattern of topics
val ds3 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.load()
Python is quite similar:
# in Python
# Subscribe to 1 topic
df1 = spark.readStream.format("kafka")\
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
.option("subscribe", "topic1")\
.load()
# Subscribe to multiple topics
df2 = spark.readStream.format("kafka")\
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
.option("subscribe", "topic1,topic2")\
.load()
# Subscribe to a pattern
df3 = spark.readStream.format("kafka")\
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
.option("subscribePattern", "topic.*")\
.load()
Each row in the source will have the following schema:
key: binary
value: binary
topic: string
partition: int
offset: long
timestamp: long
Each message in Kafka is likely to be serialized in some way. Using native Spark functions in the
Structured APIs, or a user-defined function (UDF), you can parse the message into a more structured
format suitable for analysis. A common pattern is to use JSON or Avro to read and write to Kafka.
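As a plain-Python illustration of that deserialization step (in a real Spark job you would express it on the value column with built-in functions or a UDF, not driver-side code like this), each Kafka value arrives as raw bytes and must be decoded before analysis. The sample records below are hypothetical:

```python
import json

# Kafka delivers key and value as raw bytes; the consumer decides the encoding.
raw_records = [
    {"key": b"user-a", "value": b'{"gt": "stand", "x": -0.02}', "offset": 0},
    {"key": b"user-a", "value": b'{"gt": "bike", "x": 0.11}', "offset": 1},
]

def parse_record(record):
    """Deserialize one Kafka-style record into a structured row."""
    return {
        "key": record["key"].decode("utf-8"),
        "offset": record["offset"],
        **json.loads(record["value"]),   # JSON-encoded payload becomes columns
    }

rows = [parse_record(r) for r in raw_records]
print(rows[1]["gt"])   # bike
```

The same shape of transformation, expressed over the binary value column, is what turns the raw Kafka schema above into an analyzable table.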
Writing to the Kafka Sink
Writing to Kafka is largely the same as reading from it, with fewer parameters. You'll
still need to specify the Kafka bootstrap servers, but the only other option you will need to supply is
either a column with the topic specification or that value supplied as an option. For example, the following
writes are equivalent:
// in Scala
ds1.selectExpr("topic", "CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream.format("kafka")
  .option("checkpointLocation", "/to/HDFS-compatible/dir")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .start()
ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream.format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("checkpointLocation", "/to/HDFS-compatible/dir")
  .option("topic", "topic1")
  .start()
# in Python
df1.selectExpr("topic", "CAST(key AS STRING)", "CAST(value AS STRING)")\
.writeStream\
.format("kafka")\
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
.option("checkpointLocation", "/to/HDFS-compatible/dir")\
.start()
df1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")\
.writeStream\
.format("kafka")\
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")\
.option("checkpointLocation", "/to/HDFS-compatible/dir")\
.option("topic", "topic1")\
.start()
Foreach sink
The foreach sink is akin to foreachPartition in the Dataset API. This operation allows arbitrary
operations to be computed on a per-partition basis, in parallel. It is available in Scala and Java
initially, but it will likely be ported to other languages in the future. To use the foreach sink, you
must implement the ForeachWriter interface, which is available in the Scala/Java documents and
contains three methods: open, process, and close. The relevant methods will be called whenever
there is a sequence of rows generated as output after a trigger.
Here are some important details:
The writer must be Serializable, as if it were a UDF or a Dataset map function.
The three methods (open, process, close) will be called on each executor.
The writer must do all of its initialization (like opening connections or starting transactions)
only in the open method. A common source of errors is initialization that occurs outside of the
open method (say, in the class that you're using); that happens on the driver instead of the
executor.
Because the Foreach sink runs arbitrary user code, one key issue you must consider when using it is
fault tolerance. If Structured Streaming asked your sink to write some data, but then crashed, it cannot
know whether your original write succeeded. Therefore, the API provides some additional
parameters to help you achieve exactly-once processing.
First, the open call on your ForeachWriter receives two parameters that uniquely identify the set of
rows that need to be acted on. The version parameter is a monotonically increasing ID that increases
on a per-trigger basis, and partitionId is the ID of the partition of the output in your task. Your
open method should return whether to process this set of rows. If you track your sink's output
externally and see that this set of rows was already output (e.g., you find the last version and
partitionId written in your storage system), you can return false from open to skip processing this
set of rows. Otherwise, return true. Your ForeachWriter will be opened again for each trigger's
worth of data to write.
Next, the process method will be called for each record in the data, assuming your open method
returned true. This is fairly straightforward—just process or write your data.
Finally, whenever open is called, the close method is also called (unless the node crashed before
that), regardless of whether open returned true. If Spark witnessed an error during processing, the
close method receives that error. It is your responsibility to clean up any open resources during
close.
Together, the ForeachWriter interface effectively lets you implement your own sink, including your
own logic for tracking which triggers’ data has been written or safely overwriting it on failures. We
show an example of passing a ForeachWriter below:
//in Scala
datasetOfString.writeStream.foreach(new ForeachWriter[String] {
  def open(partitionId: Long, version: Long): Boolean = {
    // open a database connection; return true to process this batch
    true
  }
  def process(record: String): Unit = {
    // write string to connection
  }
  def close(errorOrNull: Throwable): Unit = {
    // close the connection
  }
}).start()
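The exactly-once bookkeeping described above can be sketched in plain Python (the class and its storage are hypothetical, not part of Spark's API; Spark's actual close method receives only the error, since the ids were passed to open). The writer records each (version, partitionId) pair it completes and refuses to reprocess one it has already seen:

```python
class IdempotentWriter:
    """Mimics ForeachWriter's open/process/close contract with dedup tracking."""

    def __init__(self, committed):
        self.committed = committed   # set of (version, partition_id) already written
        self.buffer = None

    def open(self, partition_id, version):
        # Skip batches this sink has already durably written.
        if (version, partition_id) in self.committed:
            return False
        self.buffer = []
        return True

    def process(self, record):
        self.buffer.append(record)

    def close(self, error, partition_id, version):
        # Only mark the batch committed if no error occurred.
        if error is None and self.buffer is not None:
            self.committed.add((version, partition_id))

committed = set()
writer = IdempotentWriter(committed)

if writer.open(partition_id=0, version=1):
    for rec in ["a", "b"]:
        writer.process(rec)
writer.close(None, partition_id=0, version=1)

# A retry of the same trigger's data is skipped.
print(writer.open(partition_id=0, version=1))   # False
```

This is exactly the contract that lets Spark replay a failed trigger safely: a replayed batch carries the same version and partitionId, so a well-behaved sink writes it at most once.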
Sources and sinks for testing
Spark also includes several test sources and sinks that you can use for prototyping or debugging your
streaming queries (these should be used only during development and not in production scenarios,
because they do not provide end-to-end fault tolerance for your application):
Socket source
The socket source allows you to send data to your streams via TCP sockets. To start one, specify
a host and port to read data from. Spark will open a new TCP connection to read from that
address. The socket source should not be used in production because the socket sits on the driver
and does not provide end-to-end fault-tolerance guarantees.
Here is a short example of setting up this source to read from localhost:9999:
// in Scala
val socketDF = spark.readStream.format("socket")
.option("host", "localhost").option("port", 9999).load()
# in Python
socketDF = spark.readStream.format("socket")\
.option("host", "localhost").option("port", 9999).load()
If you'd like to actually write data to this application, you will need to run a server that listens on
port 9999. On Unix-like systems, you can do this using the NetCat utility, which will let you type
text into the first connection that is opened to port 9999. Run the command below before starting
your Spark application, then write into it:
nc -lk 9999
The socket source will return a table of text strings, one per line in the input data.
Console sink
The console sink allows you to write some of your streaming query's output to the console. This is
useful for debugging but is not fault-tolerant. Writing out to the console simply prints
some rows of your streaming query to the console. This supports both the append and complete output
modes:
activityCounts.writeStream.format("console").outputMode("complete").start()
Memory sink
The memory sink is a simple sink for testing your streaming system. It's similar to the console
sink except that rather than printing to the console, it collects the data on the driver and then makes
the data available as an in-memory table for interactive querying. This sink is not
fault tolerant, and you shouldn't use it in production, but it is great for testing and querying your
stream during development. This supports both the append and complete output modes:
// in Scala
activityCounts.writeStream.format("memory").queryName("my_device_table")
  .outputMode("complete").start()
If you do want to output data to a table for interactive SQL queries in production, the authors
recommend using the Parquet file sink on a distributed file system (e.g., S3). You can then query the
data from any Spark application.
How Data Is Output (Output Modes)
Now that you know where your data can go, let’s discuss how the result Dataset will look when it
gets there. This is what we call the output mode. As we mentioned, they're the same concept as save
modes on static DataFrames. There are three modes supported by Structured Streaming. Let’s look at
each of them.
Append mode
Append mode is the default behavior and the simplest to understand. When new rows are added to the
result table, they will be output to the sink based on the trigger (explained next) that you specify. This
mode ensures that each row is output once (and only once), assuming that you have a fault-tolerant
sink. When you use append mode with event-time and watermarks (covered in Chapter 22), only the
final result will output to the sink.
Complete mode
Complete mode will output the entire state of the result table to your output sink. This is useful when
you're working with stateful data for which all rows are expected to change over time, or when the
sink you are writing to does not support row-level updates. Think of it as the state of the stream at the
time the previous batch ran.
Update mode
Update mode is similar to complete mode except that only the rows that are different from the
previous write are written out to the sink. Naturally, your sink must support row-level updates to
support this mode. If the query doesn't contain aggregations, this is equivalent to append mode.
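A toy illustration of the difference between the last two modes: given the result table before and after a trigger, complete mode emits every row, whereas update mode emits only the rows whose values changed. The counts here are hypothetical:

```python
def complete_output(new_table):
    # Complete mode: rewrite the entire result table every trigger.
    return dict(new_table)

def update_output(old_table, new_table):
    # Update mode: emit only keys that are new or whose value changed.
    return {k: v for k, v in new_table.items() if old_table.get(k) != v}

before = {"sit": 8207, "bike": 7199}
after = {"sit": 8207, "bike": 7340, "walk": 120}

print(complete_output(after))         # {'sit': 8207, 'bike': 7340, 'walk': 120}
print(update_output(before, after))   # {'bike': 7340, 'walk': 120}
```

The sink requirements follow directly: a sink that can only append or overwrite whole outputs suits complete mode, while update mode needs a sink that can modify individual rows in place.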
When can you use each mode?
Structured Streaming limits your use of each mode to queries where it makes sense. For example, if
your query just does a map operation, Structured Streaming will not allow complete mode, because
this would require it to remember all input records since the start of the job and rewrite the whole
output table. This requirement is bound to get prohibitively expensive as the job runs. We will discuss
when each mode is supported in more detail in the next chapter, once we also cover event-time
processing and watermarks. If your chosen mode is not available, Spark Streaming will throw an
exception when you start your stream.
Here’s a handy table from the documentation that lays all of this out. Keep in mind that this will
change in the future, so you'll want to check the documentation for the most up-to-date version.
Table 21-1 shows when you can use each output mode.
Table 21-1. Structured Streaming output modes as of Spark 2.2

Queries with aggregation (aggregation on event-time with watermark)
Supported output modes: Append, Update, Complete
Notes: Append mode uses the watermark to drop old aggregation state. This means that as new rows are brought into the table, Spark will only keep around rows that are below the "watermark". Update mode also uses the watermark to remove old aggregation state. By definition, complete mode does not drop old aggregation state, since this mode preserves all data in the Result Table.

Queries with aggregation (other aggregations)
Supported output modes: Complete, Update
Notes: Since no watermark is defined (it is only defined in the other category), old aggregation state is not dropped. Append mode is not supported because aggregates can update, thus violating the semantics of this mode.

Queries with mapGroupsWithState
Supported output modes: Update

Queries with flatMapGroupsWithState (append operation mode)
Supported output modes: Append
Notes: Aggregations are allowed after flatMapGroupsWithState.

Queries with flatMapGroupsWithState (update operation mode)
Supported output modes: Update
Notes: Aggregations are not allowed after flatMapGroupsWithState.

Other queries
Supported output modes: Append, Update
Notes: Complete mode is not supported because it is infeasible to keep all unaggregated data in the Result Table.
When Data Is Output (Triggers)
To control when data is output to our sink, we set a trigger. By default, Structured Streaming will
output data as soon as the previous trigger completes processing. You can use triggers to ensure that you
do not overwhelm your output sink with too many updates or to try and control file sizes in the output.
Currently, there is one periodic trigger type, based on processing time, as well as a “once” trigger to
manually run a processing step once. More triggers will likely be added in the future.
Processing time trigger
For the processing time trigger, we simply specify a duration as a string (you may also use a
Duration in Scala or TimeUnit in Java). We’ll show the string format below.
// in Scala
import org.apache.spark.sql.streaming.Trigger
activityCounts.writeStream.trigger(Trigger.ProcessingTime("100 seconds"))
.format("console").outputMode("complete").start()
# in Python
activityCounts.writeStream.trigger(processingTime='100 seconds')\
.format("console").outputMode("complete").start()
The ProcessingTime trigger will wait for multiples of the given duration in order to output data. For
example, with a trigger duration of one minute, the trigger will fire at 12:00, 12:01, 12:02, and so on.
If a trigger time is missed because the previous processing has not yet completed, then Spark will
wait until the next trigger point (i.e., the next minute), rather than firing immediately after the previous
processing completes.
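That alignment rule can be stated in a few lines of illustrative arithmetic (the function name is hypothetical, not a Spark API): the next fire time is the first interval boundary strictly after the moment processing finishes.

```python
import math

def next_trigger(finished_at, interval):
    """First interval boundary strictly after finished_at (times in seconds)."""
    return math.floor(finished_at / interval + 1) * interval

# With a one-minute (60s) trigger: processing that runs long and ends at t=130s
# does not fire immediately; the stream waits for the t=180s boundary.
print(next_trigger(130, 60))   # 180
print(next_trigger(59, 60))    # 60
```

In other words, slow triggers skip boundaries rather than shifting them, so the firing schedule stays aligned to wall-clock multiples of the interval.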
Once trigger
You can also just run a streaming job once by setting that as the trigger. This might seem like a weird
case, but it’s actually extremely useful in both development and production. During development, you
can test your application on just one trigger's worth of data at a time. During production, the Once
trigger can be used to run your job manually at a low rate (e.g., import new data into a summary table
just occasionally). Because Structured Streaming still fully tracks all the input files processed and the
state of the computation, this is easier than writing your own custom logic to track this in a batch job,
and saves a lot of resources over running a continuous job 24/7:
// in Scala
import org.apache.spark.sql.streaming.Trigger
activityCounts.writeStream.trigger(Trigger.Once())
.format("console").outputMode("complete").start()
# in Python
activityCounts.writeStream.trigger(once=True)\
.format("console").outputMode("complete").start()
Streaming Dataset API
One final thing to note about Structured Streaming is that you are not limited to just the DataFrame
API for streaming. You can also use Datasets to perform the same computation, but in a type-safe
manner. You can turn a streaming DataFrame into a Dataset the same way you did with a static one.
As before, the Dataset’s elements need to be Scala case classes or Java bean classes. Other than that,
the DataFrame and Dataset operators work as they did in a static setting, and will also turn into a
streaming execution plan when run on a stream.
Here’s an example using the same dataset that we used in Chapter 11:
// in Scala
case class Flight(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String,
count: BigInt)
val dataSchema = spark.read
.parquet("/data/flight-data/parquet/2010-summary.parquet/")
.schema
val flightsDF = spark.readStream.schema(dataSchema)
.parquet("/data/flight-data/parquet/2010-summary.parquet/")
val flights = flightsDF.as[Flight]
def originIsDestination(flight_row: Flight): Boolean = {
return flight_row.ORIGIN_COUNTRY_NAME == flight_row.DEST_COUNTRY_NAME
}
flights.filter(flight_row => originIsDestination(flight_row))
.groupByKey(x => x.DEST_COUNTRY_NAME).count()
.writeStream.queryName("device_counts").format("memory").outputMode("complete")
.start()
Conclusion
It should be clear that Structured Streaming presents a powerful way to write streaming applications.
Taking a batch job you already run and turning it into a streaming job with almost no code changes is
both simple and extremely helpful from an engineering standpoint if you need to have this job interact
closely with the rest of your data processing application. Chapter 22 dives into two advanced
streaming-related concepts: event-time processing and stateful processing. Then, after that,
Chapter 23 addresses what you need to do to run Structured Streaming in production.
Chapter 22. Event-Time and Stateful
Processing
Chapter 21 covered the core concepts and basic APIs; this chapter dives into event-time and stateful
processing. Event-time processing is a hot topic because we analyze information with respect to the
time that it was created, not the time at which it was processed. The key idea behind this style of processing is that over the
lifetime of the job, Spark will maintain relevant state that it can update over the course of the job
before outputting it to the sink.
Let's cover these concepts in greater detail before we begin working with code to show how they work.
Event Time
Event time is an important topic to cover discretely because Spark's DStream API does not support
processing information with respect to event time. At a high level, in stream-processing systems
there are effectively two relevant times for each event: the time at which it actually occurred (event
time), and the time that it was processed or reached the stream-processing system (processing time).
Event time
Event time is the time that is embedded in the data itself. It is most often, though not required to
be, the time that an event actually occurs. This is important to use because it provides a more
robust way of comparing events against one another. The challenge here is that event data can be
late or out of order. This means that the stream processing system must be able to handle out-of-
order or late data.
Processing time
Processing time is the time at which the stream-processing system actually receives data. This is
usually less important than event time because when it's processed is largely an implementation
detail. Processing time can't ever be out of order because it's a property of the streaming system at a
certain time (not of an external system, like event time).
Those explanations are nice and abstract, so let's use a more tangible example. Suppose that we have
a datacenter located in San Francisco. An event occurs in two places at the same time: one in
Ecuador, the other in Virginia (see Figure 22-1).
Figure 22-1. Event Time Across the World
Due to the location of the datacenter, the event in Virginia is likely to show up in our datacenter
before the event in Ecuador. If we were to analyze this data based on processing time, it would
appear that the event in Virginia occurred before the event in Ecuador: something that we know to be
wrong. However, if we were to analyze the data based on event time (largely ignoring the time at
which it's processed), we would see that these events occurred at the same time.
As we mentioned, the fundamental idea is that the order of the series of events in the processing
system does not guarantee an ordering in event time. This can be somewhat unintuitive, but is worth
reinforcing. Computer networks are unreliable. That means that events can be dropped, slowed down,
repeated, or be sent without issue. Because individual events are not guaranteed to suffer one fate or
the other, we must acknowledge that any number of things can happen to these events on the way from
the source of the information to our stream processing system. For this reason, we need to operate on
event time and look at the overall stream with reference to this information contained in the data
rather than on when it arrives in the system. This means that we hope to compare events based on the
time at which those events occurred.
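To make the distinction concrete, here is a small sketch (plain Python; the records and field names are invented for illustration) showing that ordering by processing time can contradict ordering by event time:

```python
# Three illustrative events: the Ecuador event and the first Virginia event
# occur at the same event time, but Ecuador's takes longer to reach us.
events = [
    {"site": "Virginia", "event_time": 100, "processing_time": 102},
    {"site": "Ecuador",  "event_time": 100, "processing_time": 109},
    {"site": "Virginia", "event_time": 105, "processing_time": 106},
]

by_processing = [e["site"] for e in sorted(events, key=lambda e: e["processing_time"])]
# By processing time, both Virginia events appear before the Ecuador event,
# even though Ecuador's event happened at the same moment as the first one.
print(by_processing)  # ['Virginia', 'Virginia', 'Ecuador']
```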
Stateful Processing
The other topic we need to cover in this chapter is stateful processing. Actually, we already
demonstrated this many times in Chapter 21. Stateful processing is only necessary when you need to
use or update intermediate information (state) over longer periods of time (in either a microbatch or a
record-at-a-time approach). This can happen when you are using event time or when you are
performing an aggregation on a key, whether that involves event time or not.
For the most part, when you're performing stateful operations, Spark handles all of this complexity
for you. For example, when you specify a grouping, Structured Streaming maintains and updates the
information for you; you simply specify the logic. When performing a stateful operation, Spark stores
the intermediate information in a state store. Spark's current state store implementation is an in-
memory state store that is made fault tolerant by storing intermediate state to the checkpoint directory.
Arbitrary Stateful Processing
The stateful processing capabilities described above are sufficient to solve many streaming problems.
However, there are times when you need fine-grained control over what state should be stored, how it
is updated, and when it should be removed, either explicitly or via a time-out. This is called arbitrary
(or custom) stateful processing, and Spark allows you to store essentially whatever information you
like over the course of the processing of a stream. This provides immense flexibility and power and
allows some complex business logic to be handled quite easily. Just as we did before, let's
ground this with some examples:
You'd like to record information about user sessions on an ecommerce site. For instance, you
might want to track what pages users visit over the course of this session in order to provide
recommendations in real time during their next session. Naturally, these sessions have
completely arbitrary start and stop times that are unique to that user.
Your company would like to report on errors in the web application, but only if five events
occur during a user's session. You could do this with count-based windows that only emit a
result if five events of some type occur.
You'd like to deduplicate records over time. To do so, you're going to need to keep track of
every record that you see before deduplicating it.
Now that we’ve explained the core concepts that we’re going to need in this chapter, lets cover all of
this with some examples that you can follow along with and explain some of the important caveats
that you need to consider when processing in this manner.
Event-Time Basics
Let's begin with the same dataset from the previous chapter. When working with event time, it's just
another column in our dataset, and that's really all we need to concern ourselves with; we simply use
that column, as demonstrated here:
// in Scala
spark.conf.set("spark.sql.shuffle.partitions", 5)
val static = spark.read.json("/data/activity-data")
val streaming = spark
.readStream
.schema(static.schema)
.option("maxFilesPerTrigger", 10)
.json("/data/activity-data")
# in Python
spark.conf.set("spark.sql.shuffle.partitions", 5)
static = spark.read.json("/data/activity-data")
streaming = spark\
.readStream\
.schema(static.schema)\
.option("maxFilesPerTrigger", 10)\
.json("/data/activity-data")
streaming.printSchema()
root
|-- Arrival_Time: long (nullable = true)
|-- Creation_Time: long (nullable = true)
|-- Device: string (nullable = true)
|-- Index: long (nullable = true)
|-- Model: string (nullable = true)
|-- User: string (nullable = true)
|-- gt: string (nullable = true)
|-- x: double (nullable = true)
|-- y: double (nullable = true)
|-- z: double (nullable = true)
In this dataset, there are two time-based columns. The Creation_Time column defines when an event
was created, whereas the Arrival_Time defines when an event hit our servers somewhere upstream.
We will use Creation_Time in this chapter. This example reads from a file but, as we saw in the
previous chapter, it would be simple to change it to Kafka if you already have a cluster up and
running.
Windows on Event Time
The first step in event-time analysis is to convert the timestamp column into the proper Spark SQL
timestamp type. Our current column is Unix time in nanoseconds (represented as a long), so we're
going to have to do a little manipulation to get it into the proper format:
// in Scala
val withEventTime = streaming.selectExpr(
"*",
"cast(cast(Creation_Time as double)/1000000000 as timestamp) as event_time")
# in Python
withEventTime = streaming.selectExpr(
"*",
"cast(cast(Creation_Time as double)/1000000000 as timestamp) as event_time")
We’re now prepared to do arbitrary operations on event time! Note how this experience is just like
we’d do in batch operations—there’s no special API or DSL. We simply use columns, just like we
might in batch, the aggregation, and we’re working with event time.
Tumbling Windows
The simplest operation is simply to count the number of occurrences of an event in a given window.
Figure 22-2 depicts the process when performing a simple summation based on the input data and a
key.
Figure 22-2. Tumbling Windows
We’re performing an aggregation of keys over a window of time. We update the result table
(depending on the output mode) when every trigger runs, which will operate on the data received
since the last trigger. In the case of our actual dataset (and Figure 22-2), we’ll do so in 10-minute
windows without any overlap between them (each, and only one event can fall into one window).
This will update in real time, as well, meaning that if new events were being added upstream to our
system, Structured Streaming would update those counts accordingly. This is the complete output
mode, Spark will output the entire result table regardless of whether we’ve seen the entire dataset:
// in Scala
import org.apache.spark.sql.functions.{window, col}
withEventTime.groupBy(window(col("event_time"), "10 minutes")).count()
.writeStream
.queryName("events_per_window")
.format("memory")
.outputMode("complete")
.start()
# in Python
from pyspark.sql.functions import window, col
withEventTime.groupBy(window(col("event_time"), "10 minutes")).count()\
.writeStream\
.queryName("pyevents_per_window")\
.format("memory")\
.outputMode("complete")\
.start()
Now we’re writing out to the in-memory sink for debugging, so we can query it with SQL after we
have the stream running:
SELECT * FROM events_per_window
This shows us something like the following result, depending on the amount of data processed when
you had run the query:
+---------------------------------------------+-----+
|window |count|
+---------------------------------------------+-----+
|[2015-02-23 10:40:00.0,2015-02-23 10:50:00.0]|11035|
|[2015-02-24 11:50:00.0,2015-02-24 12:00:00.0]|18854|
...
|[2015-02-23 13:40:00.0,2015-02-23 13:50:00.0]|20870|
|[2015-02-23 11:20:00.0,2015-02-23 11:30:00.0]|9392 |
+---------------------------------------------+-----+
For reference, here’s the schema we get from the previous query:
root
|-- window: struct (nullable = false)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- count: long (nullable = false)
Notice how window is actually a struct (a complex type). Using this, we can query the struct for
the start and end times of a particular window.
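Conceptually, assigning an event to a tumbling window is just flooring its timestamp to the window size. The following sketch (plain Python; our own helper, not Spark's implementation) shows the arithmetic behind the 10-minute buckets above:

```python
from datetime import datetime, timedelta

def tumbling_window(ts: datetime, size: timedelta):
    """Floor ts to the window size; each event lands in exactly one window."""
    epoch = datetime(1970, 1, 1)
    size_s = size.total_seconds()
    start_s = ((ts - epoch).total_seconds() // size_s) * size_s
    start = epoch + timedelta(seconds=start_s)
    return (start, start + size)

w = tumbling_window(datetime(2015, 2, 23, 10, 43, 17), timedelta(minutes=10))
# -> the window [2015-02-23 10:40:00, 2015-02-23 10:50:00)
```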
Importantly, we can also perform an aggregation on multiple columns, including the event-time
column. Just as we saw in the previous chapter, we can even perform these aggregations using
methods like cube. Although we won't repeat the multi-key aggregation here, this applies to any
window-style aggregation (or stateful computation) we would like:
// in Scala
import org.apache.spark.sql.functions.{window, col}
withEventTime.groupBy(window(col("event_time"), "10 minutes"), col("User")).count()
.writeStream
.queryName("events_per_window")
.format("memory")
.outputMode("complete")
.start()
# in Python
from pyspark.sql.functions import window, col
withEventTime.groupBy(window(col("event_time"), "10 minutes"), "User").count()\
.writeStream\
.queryName("pyevents_per_window")\
.format("memory")\
.outputMode("complete")\
.start()
Sliding windows
The previous example was a simple count in a given window. Another approach is to
decouple the window from its starting time; Figure 22-3 illustrates what we mean.
Figure 22-3. Sliding Windows
In the figure, we are running a sliding window that looks at an hour's increment of data but updates
the state every 10 minutes. This means that we will update the values over time and will always
include the last hour of data. In the code example below, we use 10-minute windows, starting every five
minutes; therefore, each event will fall into two different windows. You can tweak this further
according to your needs:
// in Scala
import org.apache.spark.sql.functions.{window, col}
withEventTime.groupBy(window(col("event_time"), "10 minutes", "5 minutes"))
.count()
.writeStream
.queryName("events_per_window")
.format("memory")
.outputMode("complete")
.start()
# in Python
from pyspark.sql.functions import window, col
withEventTime.groupBy(window(col("event_time"), "10 minutes", "5 minutes"))\
.count()\
.writeStream\
.queryName("pyevents_per_window")\
.format("memory")\
.outputMode("complete")\
.start()
Naturally, we can query the in-memory table:
SELECT * FROM events_per_window
This query gives us the following result. Note that the starting times for each window are now in 5-
minute intervals instead of 10, as in the previous query:
+---------------------------------------------+-----+
|window |count|
+---------------------------------------------+-----+
|[2015-02-23 14:15:00.0,2015-02-23 14:25:00.0]|40375|
|[2015-02-24 11:50:00.0,2015-02-24 12:00:00.0]|56549|
...
|[2015-02-24 11:45:00.0,2015-02-24 11:55:00.0]|51898|
|[2015-02-23 10:40:00.0,2015-02-23 10:50:00.0]|33200|
+---------------------------------------------+-----+
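The two-windows-per-event behavior follows directly from the arithmetic. Here is a minimal sketch (plain Python; our own helper, not Spark's implementation) that computes every sliding window containing a given timestamp:

```python
from datetime import datetime, timedelta

def sliding_windows(ts: datetime, size: timedelta, slide: timedelta):
    """Return every [start, end) window of length `size`, starting every
    `slide`, that contains ts. For 10-minute windows every 5 minutes,
    each event falls into exactly two windows."""
    epoch = datetime(1970, 1, 1)
    secs = (ts - epoch).total_seconds()
    size_s, slide_s = size.total_seconds(), slide.total_seconds()
    start_s = (secs // slide_s) * slide_s   # latest window start <= ts
    windows = []
    while start_s > secs - size_s:          # window still covers ts
        start = epoch + timedelta(seconds=start_s)
        windows.append((start, start + size))
        start_s -= slide_s
    return sorted(windows)

ws = sliding_windows(datetime(2015, 2, 23, 10, 43, 17),
                     timedelta(minutes=10), timedelta(minutes=5))
# -> two windows, starting at 10:35 and 10:40
```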
Handling Late Data with Watermarks
The preceding examples are great, but they have a flaw: we never specified how late we expect to
see data. This means that Spark is going to need to store intermediate data forever, because we
never specified a watermark, a time after which we don't expect to see any more data. This applies to
all stateful processing that operates on event time. We must specify this watermark in order to age out
data in the stream (and, therefore, state) so that we don't overwhelm the system over a long period of
time.
Concretely, a watermark is an amount of time following a given event or set of events after which we
do not expect to see any more data from that time. We know this can happen due to delays on the
network, devices that lose a connection, or any number of other issues. In the DStreams API, there
was no robust way to handle late data in this way—if an event occurred at a certain time but did not
make it to the processing system by the time the batch for a given window started, it would show up
in other processing batches. Structured Streaming remedies this. In event-time and stateful processing,
a given window's state or set of data is decoupled from a processing window. That means that as
more events come in, Structured Streaming will continue to update a window with more information.
Let's return to our event-time example from the beginning of the chapter, shown now in
Figure 22-4.
Figure 22-4. Event Time Watermarking
In this example, let’s imagine that we frequently see some amount of delay from our customers in
Latin America. Therefore, we specify a watermark of 10 minutes. When doing this, we instruct Spark
that any event that occurs more than 10 “event-time” minutes past a previous event should be ignored.
Conversely, this also states that we expect to see every event within 10 minutes. After that, Spark
should remove intermediate state and, depending on the output mode, do something with the result. As
mentioned at the beginning of the chapter, we need to specify watermarks because if we did not, we'd
need to keep all of our windows around forever, expecting them to be updated forever. This brings us
to the core question when working with event time: "how late do I expect to see data?" The answer to
this question will be the watermark that you'll configure for your data.
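The bookkeeping behind a watermark can be sketched in a few lines (plain Python; a simplified model, not Spark's implementation): track the maximum event time seen, subtract the configured delay, and treat anything older as too late:

```python
class Watermark:
    """Simplified watermark: trails the max event time seen by a fixed delay.
    State for windows ending before the watermark can be finalized and dropped."""

    def __init__(self, delay_s: float):
        self.delay_s = delay_s
        self.max_event_time = float("-inf")

    def observe(self, event_time_s: float):
        self.max_event_time = max(self.max_event_time, event_time_s)

    def value(self) -> float:
        return self.max_event_time - self.delay_s

    def is_late(self, event_time_s: float) -> bool:
        return event_time_s < self.value()

wm = Watermark(delay_s=600)            # a 10-minute watermark
for t in [1000, 1200, 2000]:           # event times as they arrive
    wm.observe(t)
# watermark is now 2000 - 600 = 1400; an event stamped 1300 is too late
```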
Returning to our dataset, if we know that we typically see data as produced downstream in minutes
but we have seen delays in events up to five hours after they occur (perhaps the user lost cell phone
connectivity), we’d specify the watermark in the following way:
// in Scala
import org.apache.spark.sql.functions.{window, col}
withEventTime
.withWatermark("event_time", "5 hours")
.groupBy(window(col("event_time"), "10 minutes", "5 minutes"))
.count()
.writeStream
.queryName("events_per_window")
.format("memory")
.outputMode("complete")
.start()
# in Python
from pyspark.sql.functions import window, col
withEventTime\
.withWatermark("event_time", "5 hours")\
.groupBy(window(col("event_time"), "10 minutes", "5 minutes"))\
.count()\
.writeStream\
.queryName("pyevents_per_window")\
.format("memory")\
.outputMode("complete")\
.start()
It’s pretty amazing, but almost nothing changed about our query. We essentially just added another
configuration. Now, Structured Streaming will wait until 30 minutes after the final timestamp of this
10-minute rolling window before it finalizes the result of that window. We can query our table and
see the intermediate results because we’re using complete mode—theyll be updated over time. In
append mode, this information wont be output until the window closes.
SELECT * FROM events_per_window
+---------------------------------------------+-----+
|window |count|
+---------------------------------------------+-----+
|[2015-02-23 14:15:00.0,2015-02-23 14:25:00.0]|9505 |
|[2015-02-24 11:50:00.0,2015-02-24 12:00:00.0]|13159|
...
|[2015-02-24 11:45:00.0,2015-02-24 11:55:00.0]|12021|
|[2015-02-23 10:40:00.0,2015-02-23 10:50:00.0]|7685 |
+---------------------------------------------+-----+
At this point, you really know all that you need to know about handling late data. Spark does all of the
heavy lifting for you. Just to reinforce the point, if you do not specify how late you think you will see
data, then Spark will maintain that data in memory forever. Specifying a watermark allows it to free
those objects from memory, allowing your stream to continue running for a long time.
Dropping Duplicates in a Stream
One of the more difficult operations in record-at-a-time systems is removing duplicates from the
stream. Almost by definition, you must operate on a batch of records at a time in order to find
duplicates; there's a high coordination overhead in the processing system. Deduplication is an
important tool in many applications, especially when messages might be delivered multiple times by
upstream systems. A perfect example of this is Internet of Things (IoT) applications that have
upstream producers generating messages in nonstable network environments, and the same message
might end up being sent multiple times. Your downstream applications and aggregations should be
able to assume that there is only one of each message.
Essentially, Structured Streaming makes it easy to take message systems that provide at-least-once
semantics, and convert them into exactly-once by dropping duplicate messages as they come in, based
on arbitrary keys. To deduplicate data, Spark will maintain a number of user-specified keys and
ensure that duplicates are ignored.
WARNING
Just like other stateful processing applications, you need to specify a watermark to ensure that the maintained state does not
grow infinitely over the course of your stream.
Let's begin the deduplication process. The goal here will be to count the number of events per
user after removing duplicate events. Notice how you need to specify the event-time column as a
deduplication column along with the column you should deduplicate on. The core assumption is that
duplicate events will have the same timestamp as well as identifier; in this model, rows with two
different timestamps are two different records:
// in Scala
import org.apache.spark.sql.functions.expr
withEventTime
.withWatermark("event_time", "5 seconds")
.dropDuplicates("User", "event_time")
.groupBy("User")
.count()
.writeStream
.queryName("deduplicated")
.format("memory")
.outputMode("complete")
.start()
# in Python
from pyspark.sql.functions import expr
withEventTime\
.withWatermark("event_time", "5 seconds")\
.dropDuplicates(["User", "event_time"])\
.groupBy("User")\
.count()\
.writeStream\
.queryName("pydeduplicated")\
.format("memory")\
.outputMode("complete")\
.start()
The result will be similar to the following and will continue to update over time as more data is read
by your stream:
+----+-----+
|User|count|
+----+-----+
| a| 8085|
| b| 9123|
| c| 7715|
| g| 9167|
| h| 7733|
| e| 9891|
| f| 9206|
| d| 8124|
| i| 9255|
+----+-----+
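The watermark's role in bounding deduplication state can be sketched as follows (plain Python; a simplified model of the idea, not Spark's implementation): remember (key, timestamp) pairs, drop repeats, and evict pairs older than the watermark so the seen set cannot grow forever:

```python
def dedup_stream(events, delay_s):
    """events: iterable of (key, event_time_s). Emits each distinct
    (key, event_time_s) pair once; watermark-expired entries are evicted."""
    seen = set()
    max_event_time = float("-inf")
    out = []
    for key, t in events:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - delay_s
        seen = {(k, et) for (k, et) in seen if et >= watermark}  # age out state
        if t >= watermark and (key, t) not in seen:
            seen.add((key, t))
            out.append((key, t))
    return out

print(dedup_stream([("a", 10), ("a", 10), ("a", 15), ("b", 10)], delay_s=100))
# [('a', 10), ('a', 15), ('b', 10)]
```

Note that, as in Spark's model, a repeated (key, timestamp) pair is dropped, while the same key with a different timestamp counts as a new record.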
Arbitrary Stateful Processing
The first section of this chapter demonstrated how Spark maintains information and updates windows
based on our specifications. But things differ when you have more complex concepts of windows; this
is where arbitrary stateful processing comes in. This section includes several examples of different
use cases along with examples that show you how you might go about setting up your business logic.
Stateful processing is available only in Scala in Spark 2.2. This will likely change in the future.
When performing stateful processing, you might want to do the following:
Create windows based on counts of a given key
Emit an alert if there is a certain number of events within a certain time frame
Maintain user sessions for an indeterminate amount of time and save those sessions to
perform some analysis on later
At the end of the day, there are two things you will want to do when performing this style of
processing:
Map over groups in your data, operate on each group of data, and generate at most a single
row for each group. The relevant API for this use case is mapGroupsWithState.
Map over groups in your data, operate on each group of data, and generate one or more rows
for each group. The relevant API for this use case is flatMapGroupsWithState.
When we say “operate” on each group of data, that means that you can arbitrarily update each group
independent of any other group of data. This means that you can define arbitrary window types that
don't conform to the tumbling or sliding windows we saw previously in the chapter. One important
benefit of this style of processing is control over configuring time-outs on
state. With windows and watermarks, it's very simple: you simply time out a window when the
watermark passes the window start. This doesn't apply to arbitrary stateful processing, because you
manage the state based on user-defined concepts. Therefore, you need to properly time out your state.
Let's discuss this a bit more.
Time-Outs
As mentioned in Chapter 21, a time-out specifies how long you should wait before timing out some
intermediate state. The time-out type is a global parameter across all groups, but the actual time-out
duration or timestamp is configured on a per-group basis. Time-outs can be based either on processing time
(GroupStateTimeout.ProcessingTimeTimeout) or on event time
(GroupStateTimeout.EventTimeTimeout). When using time-outs, check for a time-out first, before
processing the values. You can get this information by checking the state.hasTimedOut flag or by
checking whether the values iterator is empty. You need to set some state (i.e., state must be defined,
not removed) for time-outs to be set.
With a time-out based on processing time, you can set the time-out duration by calling
GroupState.setTimeoutDuration (we’ll see code examples of this later in this section of the
chapter). The time-out will occur when the clock has advanced by the set duration. Guarantees
provided by this time-out with a duration of D ms are as follows:
Time-out will never occur before the clock time has advanced by D ms
Time-out will occur eventually when there is a trigger in the query (i.e., after D ms). So there
is no strict upper bound on when the time-out will occur. For example, the trigger
interval of the query will affect when the time-out actually occurs. If there is no data in the
stream (for any group) for a while, there won't be any trigger, and the time-out function call
will not occur until there is data.
Because the processing time time-out is based on the clock time, it is affected by the variations in the
system clock. This means that time zone changes and clock skew are important variables to consider.
With a time-out based on event time, the user also must specify the event-time watermark in the query
using watermarks. When set, data older than the watermark is filtered out. As the developer, you can
set the timestamp that the watermark should reference by setting a time-out timestamp using the
GroupState.setTimeoutTimestamp(...) API. The time-out would occur when the watermark
advances beyond the set timestamp. Naturally, you can control the time-out delay by either specifying
longer watermarks or simply updating the time-out as you process your stream. Because you can do
this in arbitrary code, you can do it on a per-group basis. The guarantee provided by this time-out is
that it will never occur before the watermark has exceeded the set time-out.
Similar to processing-time time-outs, there is no strict upper bound on the delay before the time-out
actually occurs: the watermark can advance only when there is data in the stream, and the event time
of the data has actually advanced.
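The check-the-time-out-first pattern described above can be sketched like this (plain Python; GroupState here is a stand-in for Spark's Scala GroupState, and the function shape is illustrative):

```python
class GroupState:
    """Stand-in for Spark's GroupState: holds per-group state plus a
    flag indicating whether this invocation is a time-out call."""

    def __init__(self, state=None, timed_out=False):
        self.state = state
        self.has_timed_out = timed_out

def update_group(key, values, group_state):
    # Check for a time-out before touching the values: on a time-out
    # call, no new values arrive and we should emit/clean up the state.
    if group_state.has_timed_out:
        final = group_state.state
        group_state.state = None            # remove the expired state
        return ("expired", key, final)
    for v in values:                        # fold new events into state
        group_state.state = (group_state.state or 0) + v
    return ("updated", key, group_state.state)

print(update_group("u1", [1, 2], GroupState()))                     # ('updated', 'u1', 3)
print(update_group("u1", [], GroupState(state=3, timed_out=True)))  # ('expired', 'u1', 3)
```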
NOTE
We mentioned this a few moments ago, but it’s worth reinforcing. Although time-outs are important, they might not always
function as you expect. For instance, as of this writing, Structured Streaming does not have asynchronous job execution,
which means that Spark will not output data (or time out data) between the time that an epoch finishes and the next one
starts, because it is not processing any data at that time. Also, if a processing batch of data has no records (keep in mind
this is a batch, not a group), there are no updates and there cannot be an event-time time-out. This might change in future
versions.
Output Modes
One last “gotcha” when working with this sort of arbitrary stateful processing is the fact that not all
output modes discussed in Chapter 21 are supported. This is sure to change as Spark continues to
change, but, as of this writing, mapGroupsWithState supports only the update output mode, whereas
flatMapGroupsWithState supports append and update. Append mode means that only after the
time-out (meaning the watermark has passed) will data show up in the result set. This does not
happen automatically; it is your responsibility to output the proper row or rows.
Please see Table 21-1 to see which output modes can be used when.
mapGroupsWithState
Our first example of stateful processing uses a feature called mapGroupsWithState. This is similar
to a user-defined aggregation function that takes as input an update set of data and then resolves it
down to a specific key with a set of values. There are several things you're going to need to define
along the way:
Three class definitions: an input definition, a state definition, and optionally an output
definition.
A function to update the state based on a key, an iterator of events, and a previous state.
A time-out parameter (as described in the time-outs section).
With these objects and definitions, you can control arbitrary state by creating it, updating it over time,
and removing it. Let's begin with an example of simply updating the key based on a certain amount of
state, and then move on to more complex things like sessionization.
Because we’re working with sensor data, lets find the first and last timestamp that a given user
performed one of the activities in the dataset. This means that the key we will be grouping on (and
mapping on) is a user and activity combination.
NOTE
When you use mapGroupsWithState, the output of the stream will contain only one row per key (or group) at all times. If
you would like each group to have multiple outputs, you should use flatMapGroupsWithState (covered shortly).
Let's establish the input, state, and output definitions:
case class InputRow(user:String, timestamp:java.sql.Timestamp, activity:String)
case class UserState(user:String,
var activity:String,
var start:java.sql.Timestamp,
var end:java.sql.Timestamp)
For readability, set up the function that defines how you will update your state based on a given row:
def updateUserStateWithEvent(state:UserState, input:InputRow):UserState = {
if (Option(input.timestamp).isEmpty) {
return state
}
if (state.activity == input.activity) {
if (input.timestamp.after(state.end)) {
state.end = input.timestamp
}
if (input.timestamp.before(state.start)) {
state.start = input.timestamp
}
} else {
if (input.timestamp.after(state.end)) {
state.start = input.timestamp
state.end = input.timestamp
state.activity = input.activity
}
}
state
}
Now, write the function that defines the way state is updated based on an epoch of rows:
import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode, GroupState}
def updateAcrossEvents(user:String,
inputs: Iterator[InputRow],
oldState: GroupState[UserState]):UserState = {
var state:UserState = if (oldState.exists) oldState.get else UserState(user,
"",
new java.sql.Timestamp(6284160000000L),
new java.sql.Timestamp(6284160L)
)
// we simply specify an old date that we can compare against and
// immediately update based on the values in our data
for (input <- inputs) {
state = updateUserStateWithEvent(state, input)
oldState.update(state)
}
state
}
Now that we have that, it's time to start the query by passing in the relevant information. The one thing
you're going to have to add when you specify mapGroupsWithState is whether you need to time
out a given group's state. This just gives you a mechanism to control what should be done with state
that receives no update after a certain amount of time. In this case, you want to maintain state
indefinitely, so specify that Spark should not time out.
Use the update output mode so that you get updates on the user activity:
import org.apache.spark.sql.streaming.GroupStateTimeout
withEventTime
.selectExpr("User as user",
"cast(Creation_Time/1000000000 as timestamp) as timestamp", "gt as activity")
.as[InputRow]
.groupByKey(_.user)
.mapGroupsWithState(GroupStateTimeout.NoTimeout)(updateAcrossEvents)
.writeStream
.queryName("events_per_window")
.format("memory")
.outputMode("update")
.start()
SELECT * FROM events_per_window order by user, start
Here’s a sample of our result set:
+----+--------+--------------------+--------------------+
|user|activity| start| end|
+----+--------+--------------------+--------------------+
| a| bike|2015-02-23 13:30:...|2015-02-23 14:06:...|
| a| bike|2015-02-23 13:30:...|2015-02-23 14:06:...|
...
| d| bike|2015-02-24 13:07:...|2015-02-24 13:42:...|
+----+--------+--------------------+--------------------+
An interesting aspect of our data is that the last activity performed at any given time is “bike.” This is
likely related to how the experiment was run, with each participant performing the same
activities in order.
EXAMPLE: COUNT-BASED WINDOWS
Typical window operations are built from start and end times for which all events that fall in
between those two points contribute to the counting or summation that you're performing.
However, there are times when instead of creating windows based on time, you'd rather create
them based on a number of events regardless of state and event times, and perform some
aggregation on that window of data. For example, we may want to compute a value for every 500
events received, regardless of when they are received.
The next example analyzes the activity dataset from this chapter and outputs the average reading
of each device periodically, creating a window based on the count of events and outputting it
each time it has accumulated 500 events for that device. You define two case classes for this task:
the input row format (which is simply a device and a timestamp); and the state and output rows
(which contain the current count of records collected, device ID, and an array of readings for the
events in the window).
Here are our various, self-describing case class definitions:
case class InputRow(device: String, timestamp: java.sql.Timestamp, x: Double)
case class DeviceState(device: String, var values: Array[Double],
var count: Int)
case class OutputRow(device: String, previousAverage: Double)
Now, you can define the function to update the individual state based on a single input row. You
could write this inline or in a number of other ways, but this example makes it easy to see exactly
how you update based on a given row:
def updateWithEvent(state:DeviceState, input:InputRow):DeviceState = {
state.count += 1
// maintain an array of the x-axis values
state.values = state.values ++ Array(input.x)
state
}
Now it's time to define the function that updates across a series of input rows. Notice in the
example that follows that we have a specific key, the iterator of inputs, and the old state, and we
update that old state over time as we receive new events. This, in turn, will return our output
rows with the updates on a per-device level based on the number of counts it sees. This case is
quite straightforward: after a given number of events, you update the state and reset it. You then
create an output row. You can see this row in the output table:
import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode,
GroupState}
def updateAcrossEvents(device:String, inputs: Iterator[InputRow],
oldState: GroupState[DeviceState]):Iterator[OutputRow] = {
inputs.toSeq.sortBy(_.timestamp.getTime).toIterator.flatMap { input =>
val state = if (oldState.exists) oldState.get
else DeviceState(device, Array(), 0)
val newState = updateWithEvent(state, input)
if (newState.count >= 500) {
// One of our windows is complete; replace our state with an empty
// DeviceState and output the average for the past 500 items from
// the old state
oldState.update(DeviceState(device, Array(), 0))
Iterator(OutputRow(device,
newState.values.sum / newState.values.length.toDouble))
}
else {
// Update the current DeviceState object in place and output no
// records
oldState.update(newState)
Iterator()
}
}
}
Now you can run your stream. You will notice that you need to explicitly state the output mode,
which is append. You also need to set a GroupStateTimeout. This time-out specifies the amount
of time you want to wait before a window should be output as complete (even if it did not reach
the required count). In this case, set an infinite time-out, meaning that if a device never reaches the
required 500-count threshold, it will maintain that state forever as “incomplete” and not output it
to the result table.
By specifying both of those parameters you can pass in the updateAcrossEvents function and
start the stream:
import org.apache.spark.sql.streaming.GroupStateTimeout
withEventTime
.selectExpr("Device as device",
"cast(Creation_Time/1000000000 as timestamp) as timestamp", "x")
.as[InputRow]
.groupByKey(_.device)
.flatMapGroupsWithState(OutputMode.Append,
GroupStateTimeout.NoTimeout)(updateAcrossEvents)
.writeStream
.queryName("count_based_device")
.format("memory")
.outputMode("append")
.start()
After you start the stream, it's time to query it. Here are the results:
SELECT * FROM count_based_device
+--------+--------------------+
| device| previousAverage|
+--------+--------------------+
|nexus4_1| 4.660034012E-4|
|nexus4_1|0.001436279298199...|
...
|nexus4_1|1.049804683999999...|
|nexus4_1|-0.01837188737960...|
+--------+--------------------+
You can see the values change over each of those windows as you append new data to the result
set.
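Outside of Spark, the essence of a count-based window (buffer readings per device, emit an average every n events, then reset) can be sketched in a few lines of plain Python. The function name and the window size of 3 used in the demo are ours, chosen to keep the example small:

```python
def count_windows(events, n):
    """Emit the average of each consecutive batch of n readings per device."""
    state = {}  # device -> list of pending readings
    out = []
    for device, x in events:
        vals = state.setdefault(device, [])
        vals.append(x)
        if len(vals) >= n:
            # Window complete: output its average and reset the state.
            out.append((device, sum(vals) / len(vals)))
            state[device] = []
    return out

events = [("nexus4_1", 1.0), ("nexus4_1", 2.0), ("nexus4_1", 3.0),
          ("nexus4_1", 10.0)]
# The first window of 3 completes with average 2.0; 10.0 stays pending,
# like an "incomplete" window that never reached the count threshold.
print(count_windows(events, 3))
```

The pending reading never produces output, which mirrors how a device that never reaches the 500-event threshold keeps its state as incomplete.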
flatMapGroupsWithState
Our second example of stateful processing will use a feature called flatMapGroupsWithState. This
is quite similar to mapGroupsWithState except that rather than just having a single key with at most
one output, a single key can have many outputs. This provides a bit more flexibility, and the
same fundamental structure as mapGroupsWithState applies. Here's what we'll need to define:
Three class definitions: an input definition, a state definition, and optionally an output
definition.
A function to update the state based on a key, an iterator of events, and a previous state.
A time-out parameter (as described in the time-outs section).
With these objects and definitions, we can control arbitrary state by creating it, updating it over time,
and removing it. Let’s start with an example of sessionization.
EXAMPLE: SESSIONIZATION
Sessions are simply unspecified time windows with a series of events that occur. Typically, you
want to record these different events in an array in order to compare these sessions to other
sessions in the future. In a session, you will likely have arbitrary logic to maintain and update
your state over time as well as certain actions to define when state ends (like a count) or a simple
time-out. Let's build on the previous example and define it a bit more strictly as a session.
At times, you might have an explicit session ID that you can use in your function. This obviously
makes it much easier because you can just perform a simple aggregation and might not even need
your own stateful logic. In this case, you're creating sessions on the fly from a user ID and some
time information, and if you see no new event from that user in five seconds, the session
terminates. You'll also notice that this code uses time-outs differently than we have in other
examples.
You can follow the same process of creating your classes, defining your single-event update
function, and then the multievent update function:
case class InputRow(uid:String, timestamp:java.sql.Timestamp, x:Double,
activity:String)
case class UserSession(val uid:String, var timestamp:java.sql.Timestamp,
var activities: Array[String], var values: Array[Double])
case class UserSessionOutput(val uid:String, var activities: Array[String],
var xAvg:Double)
def updateWithEvent(state:UserSession, input:InputRow):UserSession = {
// handle malformed dates
if (Option(input.timestamp).isEmpty) {
return state
}
state.timestamp = input.timestamp
state.values = state.values ++ Array(input.x)
if (!state.activities.contains(input.activity)) {
state.activities = state.activities ++ Array(input.activity)
}
state
}
import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode,
GroupState}
def updateAcrossEvents(uid:String,
inputs: Iterator[InputRow],
oldState: GroupState[UserSession]):Iterator[UserSessionOutput] = {
inputs.toSeq.sortBy(_.timestamp.getTime).toIterator.flatMap { input =>
val state = if (oldState.exists) oldState.get else UserSession(
uid,
new java.sql.Timestamp(6284160000000L),
Array(),
Array())
val newState = updateWithEvent(state, input)
if (oldState.hasTimedOut) {
val state = oldState.get
oldState.remove()
Iterator(UserSessionOutput(uid,
state.activities,
newState.values.sum / newState.values.length.toDouble))
} else if (state.values.length > 1000) {
val state = oldState.get
oldState.remove()
Iterator(UserSessionOutput(uid,
state.activities,
newState.values.sum / newState.values.length.toDouble))
} else {
oldState.update(newState)
oldState.setTimeoutTimestamp(newState.timestamp.getTime(), "5 seconds")
Iterator()
}
}
}
You'll see in this one that we only expect to see an event arrive at most five seconds late; anything
later than that, we will ignore. We will use an EventTimeTimeout to specify that we want to time out
based on the event time in this stateful operation:
import org.apache.spark.sql.streaming.GroupStateTimeout
withEventTime.where("x is not null")
.selectExpr("user as uid",
"cast(Creation_Time/1000000000 as timestamp) as timestamp",
"x", "gt as activity")
.as[InputRow]
.withWatermark("timestamp", "5 seconds")
.groupByKey(_.uid)
.flatMapGroupsWithState(OutputMode.Append,
GroupStateTimeout.EventTimeTimeout)(updateAcrossEvents)
.writeStream
.queryName("count_based_device")
.format("memory")
.start()
Querying this table will show you the output rows for each user over this time period:
SELECT * FROM count_based_device
+---+--------------------+--------------------+
|uid| activities| xAvg|
+---+--------------------+--------------------+
| a| [stand, null, sit]|-9.10908533566433...|
| a| [sit, null, walk]|-0.00654280428601...|
...
| c|[null, stairsdown...|-0.03286657789999995|
+---+--------------------+--------------------+
As you might expect, sessions that have a number of activities in them have a higher x-axis
gyroscope value than ones that have fewer activities. It should be trivial to extend this example to
problem sets more relevant to your own domain, as well.
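The session-closing rule (no new event within the gap ends the session and emits its average) can be sketched in plain Python, independent of Spark's time-out machinery. The sessionize name and the toy five-second gap are our own:

```python
def sessionize(events, gap):
    """Group (uid, timestamp, x) events into sessions that close after `gap`
    seconds of inactivity; return (uid, avg_x) for each finished session."""
    sessions = {}  # uid -> (last_timestamp, [x values])
    out = []
    for uid, ts, x in sorted(events, key=lambda e: e[1]):
        last, vals = sessions.get(uid, (None, []))
        if last is not None and ts - last > gap:
            # Gap exceeded: close the old session and start a new one,
            # analogous to the EventTimeTimeout firing in the stream.
            out.append((uid, sum(vals) / len(vals)))
            vals = []
        vals.append(x)
        sessions[uid] = (ts, vals)
    # Close whatever sessions remain open at the end of the input.
    for uid, (last, vals) in sessions.items():
        out.append((uid, sum(vals) / len(vals)))
    return out

print(sessionize([("a", 0, 1.0), ("a", 2, 3.0), ("a", 10, 5.0)], gap=5))
# [('a', 2.0), ('a', 5.0)]
```

The eight-second silence between the second and third events splits user "a" into two sessions, each with its own average.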
Conclusion
This chapter covered some of the more advanced topics in Structured Streaming, including event time
and stateful processing. This is effectively the user guide to help you actually build out your
application logic and turn it into something that provides value. Next, we will discuss what we’ll
need to do in order to take this application to production and maintain and update it over time.
Chapter 23. Structured Streaming in
Production
The previous chapters of this part of the book have covered Structured Streaming from a user's
perspective. Naturally, this is the core of your application. This chapter covers some of the
operational tools needed to run Structured Streaming robustly in production after you've developed
an application.
Structured Streaming was marked as production-ready in Apache Spark 2.2.0, meaning that this
release has all the features required for production use and stabilizes the API. Many organizations are
already using the system in production because, frankly, it's not much different from running other
production Spark applications. Indeed, through features such as transactional sources/sinks and
exactly-once processing, the Structured Streaming designers sought to make it as easy to operate as
possible. This chapter will walk you through some of the key operational tasks specific to Structured
Streaming. This should supplement everything we saw and learned about Spark operations in Part II.
Fault Tolerance and Checkpointing
The most important operational concern for a streaming application is failure recovery. Faults are
inevitable: you're going to lose a machine in the cluster, a schema will change by accident without a
proper migration, or you may even intentionally restart the cluster or application. In any of these
cases, Structured Streaming allows you to recover an application by just restarting it. To do this, you
must configure the application to use checkpointing and write-ahead logs, both of which are handled
automatically by the engine. Specifically, you must configure a query to write to a checkpoint
location on a reliable file system (e.g., HDFS, S3, or any compatible filesystem). Structured
Streaming will then periodically save all relevant progress information (for instance, the range of
offsets processed in a given trigger) as well as the current intermediate state values to the checkpoint
location. In a failure scenario, you simply need to restart your application, making sure to point to the
same checkpoint location, and it will automatically recover its state and start processing data where it
left off. You do not have to manually manage this state on behalf of the application—Structured
Streaming does it for you.
To use checkpointing, specify your checkpoint location before starting your application through the
checkpointLocation option on writeStream. You can do this as follows:
// in Scala
val static = spark.read.json("/data/activity-data")
val streaming = spark
.readStream
.schema(static.schema)
.option("maxFilesPerTrigger", 10)
.json("/data/activity-data")
.groupBy("gt")
.count()
val query = streaming
.writeStream
.outputMode("complete")
.option("checkpointLocation", "/some/location/")
.queryName("test_stream")
.format("memory")
.start()
# in Python
static = spark.read.json("/data/activity-data")
streaming = spark\
.readStream\
.schema(static.schema)\
.option("maxFilesPerTrigger", 10)\
.json("/data/activity-data")\
.groupBy("gt")\
.count()
query = streaming\
.writeStream\
.outputMode("complete")\
.option("checkpointLocation", "/some/python/location/")\
.queryName("test_python_stream")\
.format("memory")\
.start()
If you lose your checkpoint directory or the information inside of it, your application will not be able
to recover from failures and you will have to restart your stream from scratch.
Updating Your Application
Checkpointing is probably the most important thing to enable in order to run your applications in
production. This is because the checkpoint will store all of the information about what your stream
has processed thus far and the intermediate state that it may be storing. However, checkpointing
does come with a small catch: you're going to have to reason about your old checkpoint data when
you update your streaming application. When you update your application, you're going to have to
ensure that your update is not a breaking change. Let's cover both kinds of updates in detail: an
update to your application code, and an update to your Spark version.
Updating Your Streaming Application Code
Structured Streaming is designed to allow certain types of changes to the application code between
application restarts. Most importantly, you are allowed to change user-defined functions (UDFs) as
long as they have the same type signature. This feature can be very useful for bug fixes. For example,
imagine that your application starts receiving a new type of data, and one of the data parsing functions
in your current logic crashes. With Structured Streaming, you can recompile the application with a
new version of that function and pick up at the same point in the stream where it crashed earlier.
While small adjustments like adding a new column or changing a UDF are not breaking changes and
do not require a new checkpoint directory, there are larger changes that do require an entirely new
checkpoint directory. For example, if you update your streaming application to add a new aggregation
key or fundamentally change the query itself, Spark cannot construct the required state for the new
query from an old checkpoint directory. In these cases, Structured Streaming will throw an exception
saying it cannot begin from a checkpoint directory, and you must start from scratch with a new
(empty) directory as your checkpoint location.
Updating Your Spark Version
Structured Streaming applications should be able to restart from an old checkpoint directory across
patch version updates to Spark (e.g., moving from Spark 2.2.0 to 2.2.1 to 2.2.2). The checkpoint
format is designed to be forward-compatible, so the only way it may be broken is due to critical bug
fixes. If a Spark release cannot recover from old checkpoints, this will be clearly documented in its
release notes. The Structured Streaming developers also aim to keep the format compatible across
minor version updates (e.g., Spark 2.2.x to 2.3.x), but you should check the release notes to see
whether this is supported for each upgrade. In either case, if you cannot start from a checkpoint, you
will need to start your application again using a new checkpoint directory.
Sizing and Rescaling Your Application
In general, the size of your cluster should be able to comfortably handle bursts above your data rate.
The key metrics you should be monitoring in your application and cluster are discussed shortly. In
general, if you see that your input rate is much higher than your processing rate (elaborated upon
momentarily), it's time to scale up your cluster or application. Depending on your resource manager
and deployment, you may just be able to dynamically add executors to your application. When it
comes time, you can scale down your application in the same way: remove executors (potentially
through your cloud provider) or restart your application with lower resource counts. These changes
will likely incur some processing delay (as data is recomputed or partitions are shuffled around when
executors are removed). In the end, it's a business decision as to whether it's worthwhile to create a
system with more sophisticated resource management capabilities.
While making underlying infrastructure changes to the cluster or application is sometimes necessary,
other times a change may only require a restart of the application or stream with a new configuration.
For instance, changing spark.sql.shuffle.partitions is not supported while a stream is
currently running (it won't actually change the number of shuffle partitions). Making that change
requires restarting the actual stream, though not necessarily the entire application. Heavier-weight
changes, like changing arbitrary Spark application configurations, will likely require an application restart.
Metrics and Monitoring
Metrics and monitoring in streaming applications is largely the same as for general Spark
applications using the tools described in Chapter 18. However, Structured Streaming does add
several more specifics in order to help you better understand the state of your application. There are
two key APIs you can leverage to query the status of a streaming query and see its recent execution
progress. With these two APIs, you can get a sense of whether or not your stream is behaving as
expected.
Query Status
The query status is the most basic monitoring API, so it's a good starting point. It aims to answer the
question, “What processing is my stream performing right now?” This information is reported in the
status field of the query object returned by start(). For example, you might have a simple
counts stream that provides counts of IoT devices defined by the following query (here we're just
using the same query from the previous chapter without the initialization code):
query.status
To get the status of a given query, simply run the command query.status; it returns the current
status of the stream, giving us details about what is happening at that point in time in the stream.
Here's a sample of what you'll get back when querying this status:
{
"message" : "Getting offsets from ...",
"isDataAvailable" : true,
"isTriggerActive" : true
}
The above snippet describes getting the offsets from a Structured Streaming data source (hence the
message describing getting offsets). There are a variety of messages to describe the stream's status.
NOTE
We have shown the status command inline here the way you would call it in a Spark shell. However, for a standalone
application, you may not have a shell attached to run arbitrary code inside your process. In that case, you can expose its
status by implementing a monitoring server, such as a small HTTP server that listens on a port and returns query.status
when it gets a request. Alternatively, you can use the richer StreamingQueryListener API described later to listen to more
events.
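As one sketch of the monitoring-server idea, a tiny HTTP endpoint serving the status as JSON might look like the following. This is plain Python with no Spark dependency; get_status is a hypothetical stand-in for reading query.status from a live StreamingQuery:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from threading import Thread

def get_status():
    # Stand-in for query.status; a real application would read the
    # live StreamingQuery object here instead of a canned dict.
    return {"message": "Getting offsets from ...",
            "isDataAvailable": True,
            "isTriggerActive": True}

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps(get_status()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet; remove to see request logs

server = HTTPServer(("127.0.0.1", 0), StatusHandler)  # port 0 = any free port
Thread(target=server.serve_forever, daemon=True).start()
print("status endpoint on port", server.server_port)
```

A monitoring system can then poll this endpoint instead of needing shell access to the driver process.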
Recent Progress
While the query's current status is useful to see, equally important is the ability to view the query's
progress. The progress API allows us to answer questions like “At what rate am I processing
tuples?” or “How fast are tuples arriving from the source?” By running query.recentProgress,
you'll get access to more time-based information like the processing rate and batch durations. The
streaming query progress also includes information about the input sources and output sinks behind
your stream.
query.recentProgress
Here’s the result of the Scala version after we ran the code from before; the Python one will be
similar:
Array({
"id" : "d9b5eac5-2b27-4655-8dd3-4be626b1b59b",
"runId" : "f8da8bc7-5d0a-4554-880d-d21fe43b983d",
"name" : "test_stream",
"timestamp" : "2017-08-06T21:11:21.141Z",
"numInputRows" : 780119,
"processedRowsPerSecond" : 19779.89350912779,
"durationMs" : {
"addBatch" : 38179,
"getBatch" : 235,
"getOffset" : 518,
"queryPlanning" : 138,
"triggerExecution" : 39440,
"walCommit" : 312
},
"stateOperators" : [ {
"numRowsTotal" : 7,
"numRowsUpdated" : 7
} ],
"sources" : [ {
"description" : "FileStreamSource[/some/stream/source/]",
"startOffset" : null,
"endOffset" : {
"logOffset" : 0
},
"numInputRows" : 780119,
"processedRowsPerSecond" : 19779.89350912779
} ],
"sink" : {
"description" : "MemorySink"
}
})
As you can see from the output just shown, this includes a number of details about the state of the
stream. It is important to note that this is a snapshot in time (according to when we asked for the query
progress). In order to consistently get output about the state of the stream, you'll need to query this
API repeatedly for the updated state. The majority of the fields in the previous output should be
self-explanatory. However, let's review some of the more consequential fields in detail.
Input rate and processing rate
The input rate specifies how much data is flowing into Structured Streaming from our input source.
The processing rate is how quickly the application is able to analyze that data. In the ideal case, the
input and processing rates should vary together. Another case might be when the input rate is much
greater than the processing rate. When this happens, the stream is falling behind and you will need to
scale the cluster up to handle the larger load.
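A minimal sketch of that comparison, assuming the inputRowsPerSecond and processedRowsPerSecond fields from the recentProgress output shown earlier; the falling_behind name and the 1.5 headroom factor are our own choices:

```python
def falling_behind(progress, factor=1.5):
    """Return True if the recent input rate exceeds the processing rate
    by more than `factor`, i.e. the stream is building up a backlog."""
    inp = progress.get("inputRowsPerSecond", 0.0)
    proc = progress.get("processedRowsPerSecond", 0.0)
    return proc > 0 and inp > factor * proc

# Two hypothetical progress snapshots: one keeping up, one lagging.
healthy = {"inputRowsPerSecond": 1000.0, "processedRowsPerSecond": 950.0}
lagging = {"inputRowsPerSecond": 5000.0, "processedRowsPerSecond": 1000.0}
print(falling_behind(healthy), falling_behind(lagging))  # False True
```

In practice you would feed each element of query.recentProgress through a check like this and alert when it stays true across several triggers, rather than reacting to a single noisy sample.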
Batch duration
Nearly all streaming systems utilize batching to operate at any reasonable throughput (some have an
option of lower latency in exchange for lower throughput). Structured Streaming achieves both. As it
operates on the data, you will likely see batch duration oscillate as Structured Streaming processes
varying numbers of events over time. Naturally, this metric will have little to no relevance when the
continuous processing engine is made an execution option.
TIP
Generally it’s a best practice to visualize the changes in batch duration and input and processing rates. It’s much more
helpful than simply reporting changes over time.
Spark UI
The Spark web UI, covered in detail in Chapter 18, also shows tasks, jobs, and data processing
metrics for Structured Streaming applications. On the Spark UI, each streaming application will
appear as a sequence of short jobs, one for each trigger. However, you can use the same UI to see
metrics, query plans, task durations, and logs from your application. One departure of note from the
DStream API is that the Streaming Tab is not used by Structured Streaming.
Alerting
Understanding and looking at the metrics for your Structured Streaming queries is an important first
step. However, this involves constantly watching a dashboard or the metrics in order to discover
potential issues. You're going to need robust automatic alerting to notify you when your jobs are
failing or not keeping up with the input data rate without monitoring them manually. There are several
ways to integrate existing alerting tools with Spark, generally building on the recent progress API we
covered before. For example, you may directly feed the metrics to a monitoring system such as the
open source Coda Hale Metrics library or Prometheus, or you may simply log them and use a log
aggregation system like Splunk. In addition to monitoring and alerting on queries, you're also going to
want to monitor and alert on the state of the cluster and the overall application (if you're running
multiple queries together).
Advanced Monitoring with the Streaming Listener
We already touched on some of the high-level monitoring tools in Structured Streaming. With a bit of
glue logic, you can use the status and queryProgress APIs to output monitoring events into your
organization's monitoring platform of choice (e.g., a log aggregation system or Prometheus
dashboard). Beyond these approaches, there is also a lower-level but more powerful way to observe
an application's execution: the StreamingQueryListener class.
The StreamingQueryListener class will allow you to receive asynchronous updates from the
streaming query in order to automatically output this information to other systems and implement
robust monitoring and alerting mechanisms. You start by developing your own object to extend
StreamingQueryListener, then attach it to a running SparkSession. Once you attach your custom
listener with sparkSession.streams.addListener(), your class will receive notifications when a
query is started or stopped, or when progress is made on an active query. Here's a simple example of a
listener from the Structured Streaming documentation:
val spark: SparkSession = ...
spark.streams.addListener(new StreamingQueryListener() {
override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
println("Query started: " + queryStarted.id)
}
override def onQueryTerminated(
queryTerminated: QueryTerminatedEvent): Unit = {
println("Query terminated: " + queryTerminated.id)
}
override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
println("Query made progress: " + queryProgress.progress)
}
})
Streaming listeners allow you to process each progress update or status change using custom code
and pass it to external systems. For example, the following code defines a StreamingQueryListener
that forwards all query progress information to Kafka. You'll have to parse this JSON string once
you read the data from Kafka in order to access the actual metrics:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.streaming.StreamingQueryListener
class KafkaMetrics(servers: String) extends StreamingQueryListener {
  val kafkaProperties = new Properties()
  kafkaProperties.put("bootstrap.servers", servers)
  kafkaProperties.put(
    "key.serializer",
    "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
  kafkaProperties.put(
    "value.serializer",
    "kafkashaded.org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](kafkaProperties)
  // Forward each progress update to the "streaming-metrics" topic as JSON
  override def onQueryProgress(
      event: StreamingQueryListener.QueryProgressEvent): Unit = {
    producer.send(new ProducerRecord("streaming-metrics", event.progress.json))
  }
  override def onQueryStarted(
      event: StreamingQueryListener.QueryStartedEvent): Unit = {}
  override def onQueryTerminated(
      event: StreamingQueryListener.QueryTerminatedEvent): Unit = {}
}
Using the StreamingQueryListener interface, you can even monitor Structured Streaming
applications on one cluster by running a Structured Streaming application on that same (or another)
cluster. You could also manage multiple streams in this way.
Conclusion
In this chapter, we covered the main tools needed to run Structured Streaming in production:
checkpoints for fault tolerance and various monitoring APIs that let you observe how your application
is running. Lucky for you, if you're running Spark in production already, many of the concepts and
tools are similar, so you should be able to reuse a lot of your existing knowledge. Be sure to check
Part IV to see some other helpful tools for monitoring Spark Applications.
Part VI. Advanced Analytics and
Machine Learning
Chapter 24. Advanced Analytics and
Machine Learning Overview
Thus far, we have covered fairly general data flow APIs. This part of the book will dive deeper into
some of the more specific advanced analytics APIs available in Spark. Beyond large-scale SQL
analysis and streaming, Spark also provides support for statistics, machine learning, and graph
analytics. These encompass a set of workloads that we will refer to as advanced analytics. This part
of the book will cover advanced analytics tools in Spark, including:
Preprocessing your data (cleaning data and feature engineering)
Supervised learning
Recommendation engines
Unsupervised learning
Graph analytics
Deep learning
This chapter offers a basic overview of advanced analytics, some example use cases, and a basic
advanced analytics workflow. Then we’ll cover the analytics tools just listed and teach you how to
apply them.
WARNING
This book is not intended to teach you everything you need to know about machine learning from scratch. We won’t go into
strict mathematical definitions and formulations—not for lack of importance but simply because it’s too much information to
include. This part of the book is not an algorithm guide that will teach you the mathematical underpinnings of every
available algorithm nor the in-depth implementation strategies used. The chapters included here serve as a guide for users,
with the purpose of outlining what you need to know to use Spark's advanced analytics APIs.
A Short Primer on Advanced Analytics
Advanced analytics refers to a variety of techniques aimed at solving the core problem of deriving
insights and making predictions or recommendations based on data. The best ontology for machine
learning is structured based on the task that you'd like to perform. The most common tasks include:
Supervised learning, including classification and regression, where the goal is to predict a
label for each data point based on various features.
Recommendation engines to suggest products to users based on behavior.
Unsupervised learning, including clustering, anomaly detection, and topic modeling, where
the goal is to discover structure in the data.
Graph analytics tasks such as searching for patterns in a social network.
Before discussing Spark's APIs in detail, let's review each of these tasks along with some common
machine learning and advanced analytics use cases. While we have certainly tried to make this
introduction as accessible as possible, at times you may need to consult other resources in order to
fully understand the material. Additionally,
we will cite the following books throughout the next few chapters because they are great resources
for learning more about the individual analytics (and, as a bonus, they are freely available on the
web):
An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie,
and Robert Tibshirani. We refer to this book as “ISL.”
Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome
Friedman. We refer to this book as “ESL.”
Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. We refer to this
book as “DLB.”
Supervised Learning
Supervised learning is probably the most common type of machine learning. The goal is simple:
using historical data that already has labels (often called the dependent variables), train a model to
predict the values of those labels based on various features of the data points. One example would be
to predict a person's income (the dependent variable) based on age (a feature). This training process
usually proceeds through an iterative optimization algorithm such as gradient descent. The training
algorithm starts with a basic model and gradually improves it by adjusting various internal
parameters (coefficients) during each training iteration. The result of this process is a trained model
that you can use to make predictions on new data. There are a number of different tasks we’ll need to
complete as part of the process of training and making predictions, such as measuring the success of
trained models before using them in the field, but the fundamental principle is simple: train on
historical data, ensure that it generalizes to data we didn't train on, and then make predictions on new
data.
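The iterative optimization idea above can be sketched with a toy gradient-descent loop in plain Python. This is only an illustration: the data, learning rate, and iteration count are made up, and real training uses far richer features and models.

```python
# Minimal gradient descent for a one-feature linear model: predict y = w * x + b.
# Toy data generated from y = 2x + 1; the learning rate is illustrative.
data = [(x, 2.0 * x + 1.0) for x in range(10)]

w, b = 0.0, 0.0          # start with a basic model
lr = 0.01                # learning rate (step size)

for _ in range(2000):    # each training iteration nudges the parameters
    grad_w = grad_b = 0.0
    for x, y in data:
        error = (w * x + b) - y
        grad_w += 2 * error * x / len(data)
        grad_b += 2 * error / len(data)
    w -= lr * grad_w     # adjust internal parameters against the gradient
    b -= lr * grad_b

# after convergence, w and b should be close to the true values 2 and 1
print(round(w, 2), round(b, 2))
```

The trained parameters can then be applied to unseen inputs, which is exactly the "predict on new data" step described above.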
We can further organize supervised learning based on the type of variable we’re looking to predict.
We’ll get to that next.
Classification
One common type of supervised learning is classification. Classification is the act of training an
algorithm to predict a dependent variable that is categorical (belonging to a discrete, finite set of
values). The most common case is binary classification, where our resulting model will make a
prediction that a given item belongs to one of two groups. The canonical example is classifying email
spam. Using a set of historical emails that are organized into groups of spam emails and not spam
emails, we train an algorithm to analyze the words in, and any number of properties of, the historical
emails and make predictions about them. Once we are satisfied with the algorithm's performance, we
use that model to make predictions about future emails the model has never seen before.
When we classify items into more than just two categories, we call this multiclass classification. For
example, we may have four different categories of email (as opposed to the two categories in the
previous paragraph): spam, personal, work related, and other. There are many use cases for
classification, including:
Predicting disease
A doctor or hospital might have a historical dataset of behavioral and physiological attributes of a
set of patients. They could use this dataset to train a model on this historical data (and evaluate its
success and ethical implications before applying it) and then leverage it to predict whether or not
a patient has heart disease. This is an example of binary classification (healthy heart,
unhealthy heart) or multiclass classification (healthy heart, or one of several different diseases).
Classifying images
There are a number of applications from companies like Apple, Google, or Facebook that can
predict who is in a given photo by running a classification model that has been trained on
historical images of people in your past photos. Another common use case is to classify images or
label the objects in images.
Predicting customer churn
A more business-oriented use case might be predicting customer churn—that is, which customers
are likely to stop using a service. You can do this by training a binary classifier on past customers
that have churned (and not churned) and using it to try and predict whether or not current
customers will churn.
Buy or won't buy
Companies often want to predict whether visitors of their website will purchase a given product.
They might use information about users' browsing patterns or attributes such as location in order to
drive this prediction.
There are many more use cases for classification beyond these examples. We will introduce more use
cases, as well as Spark's classification APIs, in Chapter 26.
Regression
In classification, our dependent variable is a set of discrete values. In regression, we instead try to
predict a continuous variable (a real number). In simplest terms, rather than predicting a category, we
want to predict a value on a number line. The rest of the process is largely the same, which is why
they're both forms of supervised learning. We will train on historical data to make predictions about
data we have never seen. Here are some typical examples:
Predicting sales
A store may want to predict total product sales on a given date using historical sales data. There are
a number of potential input variables, but a simple example might be using last week's sales data
to predict the next day's sales.
Predicting height
Based on the heights of two individuals, we might want to predict the heights of their potential
children.
Predicting the number of viewers of a show
A media company like Netflix might try to predict how many of their subscribers will watch a
particular show.
We will introduce more use cases, as well as Spark's methods for regression, in Chapter 27.
Recommendation
Recommendation is one of the most intuitive applications of advanced analytics. By studying people’s
explicit preferences (through ratings) or implicit ones (through observed behavior) for various
products or items, an algorithm can make recommendations on what a user may like by drawing
similarities between the users or items. By looking at these similarities, the algorithm makes
recommendations to users based on what similar users liked, or what other products resemble the
ones the user already purchased. Recommendation is a common use case for Spark and well suited to
big data. Here are some example use cases:
Movie recommendations
Netflix uses Spark, although not necessarily its built-in libraries, to make large-scale movie
recommendations to its users. It does this by studying what movies users watch and do not watch
in the Netflix application. In addition, Netflix likely takes into consideration how similar a given
user's ratings are to other users'.
Product recommendations
Amazon uses product recommendations as one of its main tools to increase sales. For instance,
based on the items in our shopping cart, Amazon may recommend other items that were added to
similar shopping carts in the past. Likewise, on every product page, Amazon shows similar
products purchased by other users.
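The similarity idea behind these use cases can be sketched with a cosine similarity between users' rating vectors. The users, items, and ratings below are invented for illustration; real recommenders (including Spark's ALS, covered in Chapter 28) use far more sophisticated techniques.

```python
from math import sqrt

# Hypothetical user -> item ratings over four items; 0.0 means "not rated".
ratings = {
    "alice": [5.0, 4.0, 0.0, 1.0],
    "bob":   [4.0, 5.0, 0.0, 1.0],
    "carol": [1.0, 0.0, 5.0, 4.0],
}

def cosine(u, v):
    # Cosine similarity: near 1.0 for users with similar tastes.
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Alice's tastes resemble Bob's far more than Carol's, so we might
# recommend to Alice items Bob rated highly that Alice has not tried yet.
assert cosine(ratings["alice"], ratings["bob"]) > cosine(ratings["alice"], ratings["carol"])
```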
We will introduce more recommendation use cases, as well as Spark's methods for generating
recommendations, in Chapter 28.
Unsupervised Learning
Unsupervised learning is the act of trying to find patterns or discover the underlying structure in a
given set of data. This differs from supervised learning because there is no dependent variable (label)
to predict.
Some example use cases for unsupervised learning include:
Anomaly detection
Given some standard event type often occurring over time, we might want to report when a
nonstandard type of event occurs. For example, a security officer might want to receive
notifications when a strange object (think vehicle, skater, or bicyclist) is observed on a pathway.
User segmentation
Given a set of user behaviors, we might want to better understand what attributes certain users
share with other users. For instance, a gaming company might cluster users based on properties
like the number of hours played in a given game. The algorithm might reveal that casual players
have very different behavior than hardcore gamers, for example, and allow the company to offer
different recommendations or rewards to each player.
Topic modeling
Given a set of documents, we might analyze the different words contained therein to see if there is
some underlying relation between them. For example, given a number of web pages on data
analytics, a topic modeling algorithm can cluster them into pages about machine learning, SQL,
streaming, and so on based on groups of words that are more common in one topic than in others.
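The user-segmentation example can be sketched with a tiny k-means clustering (k = 2) in plain Python. The hours-played values and starting centers are invented for illustration; Spark's own clustering algorithms are covered in Chapter 29.

```python
# Hypothetical hours-played-per-week values: a casual group and a hardcore group.
hours = [1.0, 2.0, 1.5, 2.5, 30.0, 35.0, 28.0, 32.0]

# A minimal one-dimensional k-means with k = 2; starting centers are arbitrary.
centers = [0.0, 10.0]
for _ in range(10):
    clusters = [[], []]
    for h in hours:
        # Assign each point to its nearest center.
        nearest = min(range(2), key=lambda i: abs(h - centers[i]))
        clusters[nearest].append(h)
    # Move each center to the mean of its assigned points.
    centers = [sum(c) / len(c) for c in clusters]

# The centers converge to the means of the casual and hardcore groups.
print(sorted(round(c, 2) for c in centers))
```

Note that the algorithm discovers the two segments without any labels, which is the defining property of unsupervised learning.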
Intuitively, it is easy to see how segmenting customers could help a platform cater better to each set of
users. However, it may be hard to discover whether or not this set of user segments is “correct”. For
this reason, it can be difficult to determine whether a particular model is good or not. We will discuss
unsupervised learning in detail in Chapter 29.
Graph Analytics
While less common than classification and regression, graph analytics is a powerful tool.
Fundamentally, graph analytics is the study of structures in which we specify vertices (which are
objects) and edges (which represent the relationships between those objects). For example, the
vertices might represent people and products, and edges might represent a purchase. By looking at the
properties of vertices and edges, we can better understand the connections between them and the
overall structure of the graph. Since graphs are all about relationships, anything that specifies a
relationship is a great use case for graph analytics. Some examples include:
Fraud prediction
Capital One uses Spark's graph analytics capabilities to better understand fraud networks. By
using historical fraud information (like phone numbers, addresses, or names), they can discover
fraudulent credit requests or transactions. For instance, any user accounts within two hops of a
fraudulent phone number might be considered suspicious.
Anomaly detection
By looking at how networks of individuals connect with one another, outliers and anomalies can
be flagged for manual analysis. For instance, if typically in our data each vertex has ten edges
associated with it and a given vertex only has one edge, that might be worth investigating as
something strange.
Classification
Given some facts about certain vertices in a network, you can classify other vertices according to
their connection to the original node. For instance, if a certain individual is labeled as an
influencer in a social network, we could classify other individuals with similar network
structures as influencers.
Recommendation
Google’s original web recommendation algorithm, PageRank, is a graph algorithm that analyzes
website relationships in order to rank the importance of web pages. For example, a web page that
has a lot of links to it is ranked as more important than one with no links to it.
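A minimal sketch of PageRank's power iteration on an invented four-page link graph (the 0.85 damping factor is the value commonly cited for the original algorithm; Spark's own graph APIs are covered in Chapter 30):

```python
# A hypothetical four-page web: vertices are pages, edges are links.
links = {
    "a": ["b", "c"],   # page a links to b and c
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],        # d links to c; nothing links to d
}

damping = 0.85
rank = {page: 1.0 / len(links) for page in links}

for _ in range(50):  # power iteration until the ranks settle
    new_rank = {page: (1 - damping) / len(links) for page in links}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank

# c has the most inbound links and d has none, so c ranks highest, d lowest.
assert rank["c"] == max(rank.values()) and rank["d"] == min(rank.values())
```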
We’ll discuss more examples of graph analytics in Chapter 30.
The Advanced Analytics Process
You should have a firm grasp of some fundamental use cases for machine learning and advanced
analytics. However, finding a use case is only a small part of the actual advanced analytics process.
There is a lot of work in preparing your data for analysis, testing different ways of modeling it, and
evaluating these models. This section will provide structure to the overall analytics process and the
steps we have to take to not just perform one of the tasks just outlined, but actually evaluate success
objectively in order to understand whether or not we should apply our model to the real world
(Figure 24-1).
Figure 24-1. The machine learning workflow
The overall process involves the following steps (with some variation):
1. Gathering and collecting the relevant data for your task.
2. Cleaning and inspecting the data to better understand it.
3. Performing feature engineering to allow the algorithm to leverage the data in a suitable form
(e.g., converting the data to numerical vectors).
4. Using a portion of this data as a training set to train one or more algorithms to generate some
candidate models.
5. Evaluating and comparing models against your success criteria by objectively measuring
results on a subset of the same data that was not used for training. This allows you to better
understand how your model may perform in the wild.
6. Leveraging the insights from the above process and/or using the model to make predictions,
detect anomalies, or solve more general business challenges.
These steps won't be the same for every advanced analytics task. However, this workflow does serve
as a general framework for what you're going to need to be successful with advanced analytics. Just
as we did with the various advanced analytics tasks earlier in the chapter, let's break down the
process to better understand the overall objective of each step.
Data collection
Naturally, it's hard to create a training set without first collecting data. Typically this means at least
gathering the datasets you'll want to leverage to train your algorithm. Spark is an excellent tool for
this because of its ability to speak to a variety of data sources and work with data big and small.
Data cleaning
After you've gathered the proper data, you're going to need to clean and inspect it. This is typically
done as part of a process called exploratory data analysis, or EDA. EDA generally means using
interactive queries and visualization methods in order to better understand distributions, correlations,
and other details in your data. During this process you may notice you need to remove some values
that may have been misrecorded upstream or that other values may be missing. Whatever the case, it's
always good to know what is in your data to avoid mistakes down the road. The multitude of Spark
functions in the structured APIs will provide a simple way to clean and report on your data.
Feature engineering
Now that you've collected and cleaned your dataset, it's time to convert it to a form suitable for machine
learning algorithms, which generally means numerical features. Proper feature engineering can often
make or break a machine learning application, so this is one task you'll want to do carefully. The
process of feature engineering includes a variety of tasks, such as normalizing data, adding variables
to represent the interactions of other variables, manipulating categorical variables, and converting
them to the proper format to be input into our machine learning model. In MLlib, Spark's machine
learning library, all variables will usually have to be input as vectors of doubles (regardless of what
they actually represent). We cover the process of feature engineering in great depth in Chapter 25. As
you will see in that chapter, Spark provides the essentials you'll need to manipulate your data using a
variety of machine learning statistical techniques.
NOTE
The following few steps (training models, model tuning, and evaluation) are not relevant to all use cases. This is a general
workflow that may vary significantly based on the end objective you would like to achieve.
Training models
At this point in the process we have a dataset of historical information (e.g., spam or not spam
emails) and a task we would like to complete (e.g., classifying spam emails). Next, we will want to
train a model to predict the correct output, given some input. During the training process, the
parameters inside of the model will change according to how well the model performed on the input
data. For instance, to classify spam emails, our algorithm will likely find that certain words are better
predictors of spam than others and therefore weight the parameters associated with those words
higher. In the end, the trained model will find that certain words should have more influence (because
of their consistent association with spam emails) than others. The output of the training process is
what we call a model. Models can then be used to gain insights or to make future predictions. To
make predictions, you will give the model an input and it will produce an output based on a
mathematical manipulation of these inputs. Using the classification example, given the properties of
an email, it will predict whether that email is spam or not by comparing to the historical spam and not
spam emails that it was trained on.
However, just training a model isn't the objective—we want to leverage our model to produce
insights. Thus, we must answer the question: how do we know our model is any good at what it's
supposed to do? That's where model tuning and evaluation come in.
Model tuning and evaluation
You likely noticed earlier that we mentioned that you should split your data into multiple portions and
use only one for training. This is an essential step in the machine learning process because when you
build an advanced analytics model you want that model to generalize to data it has not seen before.
Splitting our dataset into multiple portions allows us to objectively test the effectiveness of the
trained model against a set of data that it has never seen before. The objective is to see whether your
model understands something fundamental about the underlying data or whether it has merely picked up
on patterns particular to the training set (sometimes called overfitting). That's why it is called a test set. In
the process of training models, we also might take another, separate subset of data and treat that as
another type of test set, called a validation set, in order to try out different hyperparameters
(parameters that affect the training process) and compare different variations of the same model
without overfitting to the test set.
WARNING
Following proper training, validation, and test set best practices is essential to successfully using machine learning. It’s easy
to end up overfitting (training a model that does not generalize well to new data) if we do not properly isolate these sets of
data. We cannot cover this problem in depth in this book, but almost any machine learning book will cover this topic.
To continue with the classification example we referenced previously, we have three sets of data: a
training set for training models, a validation set for testing different variations of the models that
we’re training, and lastly, a test set we will use for the final evaluation of our different model
variations to see which one performed the best.
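The three sets can be sketched in a few lines of plain Python. The 60/20/20 proportions and the seed are illustrative, not prescriptive (in Spark you would typically use DataFrame's randomSplit instead, as shown later in this chapter):

```python
import random

# Sketch of a three-way split: 60% train, 20% validation, 20% test.
random.seed(42)
data = list(range(100))           # stand-in for 100 labeled examples
random.shuffle(data)              # shuffle before splitting to avoid ordering bias

train = data[:60]                 # used to fit model parameters
validation = data[60:80]          # used to compare hyperparameter settings
test = data[80:]                  # touched once, for the final evaluation

# The sets must not overlap, or the evaluation is no longer objective.
assert set(train).isdisjoint(validation) and set(train).isdisjoint(test)
```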
Leveraging the model and/or insights
After running the model through the training process and ending up with a well-performing model, you
are now ready to use it! Taking your model to production can be a significant challenge in and of
itself. We will discuss some tactics later on in this chapter.
Spark's Advanced Analytics Toolkit
The previous overview is just an example workflow and doesn't encompass all use cases or potential
workflows. In addition, you probably noticed that we hardly discussed Spark at all. This section
will discuss Spark's advanced analytics capabilities. Spark includes several core packages and many
external packages for performing advanced analytics. The primary package is MLlib, which provides
an interface for building machine learning pipelines.
What Is MLlib?
MLlib is a package, built on and included in Spark, that provides interfaces for gathering and cleaning
data, feature engineering and feature selection, training and tuning large-scale supervised and
unsupervised machine learning models, and using those models in production.
WARNING
MLlib actually consists of two packages that leverage different core data structures. The package org.apache.spark.ml
includes an interface for use with DataFrames. This package also offers a high-level interface for building machine learning
pipelines that help standardize the way in which you perform the preceding steps. The lower-level package,
org.apache.spark.mllib, includes interfaces for Spark's low-level RDD APIs. This book will focus exclusively on the
DataFrame API. The RDD API is the lower-level interface, which is in maintenance mode (meaning it will only receive
bug fixes, not new features) at this time. It has also been covered fairly extensively in older books on Spark and is therefore
omitted here.
When and why should you use MLlib (versus scikit-learn, TensorFlow, or foo
package)
At a high level, MLlib might sound like a lot of other machine learning packages you've probably
heard of, such as scikit-learn for Python or the variety of R packages for performing similar tasks. So
why should you bother with MLlib at all? There are numerous tools for performing machine learning
on a single machine, and while there are several great options to choose from, these single machine
tools do have their limits either in terms of the size of data you can train on or the processing time.
This means single-machine tools are usually complementary to MLlib. When you hit those scalability
issues, take advantage of Spark's abilities.
There are two key use cases where you want to leverage Spark's ability to scale. First, you want to
leverage Spark for preprocessing and feature generation to reduce the amount of time it might take to
produce training and test sets from a large amount of data. Then you might leverage single-machine
learning libraries to train on those given datasets. Second, when your input data or model size
become too difficult or inconvenient to put on one machine, use Spark to do the heavy lifting. Spark
makes distributed machine learning very simple.
An important caveat to all of this is that while training and data preparation are made simple, there
are still some complexities you will need to keep in mind, especially when it comes to deploying a
trained model. For example, Spark does not provide a built-in way to serve low-latency predictions
from a model, so you may want to export the model to another serving system or a custom application
to do that. MLlib is generally designed to allow inspecting and exporting models to other tools where
possible.
High-Level MLlib Concepts
In MLlib there are several fundamental “structural” types: transformers, estimators, evaluators, and
pipelines. By structural, we mean you will think in terms of these types when you define an end-to-end
machine learning pipeline. They'll provide the common language for defining what belongs in
what part of the pipeline. Figure 24-2 illustrates the overall workflow that you will follow when
developing machine learning models in Spark.
Figure 24-2. The machine learning workflow, in Spark
Transformers are functions that convert raw data in some way. This might be to create a new
interaction variable (from two other variables), normalize a column, or simply change an Integer
into a Double type to be input into a model. An example of a transformer is one that converts string
categorical variables into numerical values that can be used in MLlib. Transformers are primarily
used in preprocessing and feature engineering. Transformers take a DataFrame as input and produce a
new DataFrame as output, as illustrated in Figure 24-3.
Figure 24-3. A standard transformer
Estimators are one of two kinds of things. First, estimators can be a kind of transformer that is
initialized with data. For instance, to normalize numerical data we’ll need to initialize our
transformation with some information about the current values in the column we would like to
normalize. This requires two passes over our data—the initial pass generates the initialization values
and the second actually applies the generated function over the data. In Spark's nomenclature,
algorithms that allow users to train a model from data are also referred to as estimators.
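The two-pass fit/transform pattern can be sketched in plain Python. The class names here are invented for illustration (Spark's real estimators operate on DataFrames and are covered in Chapter 25):

```python
# Sketch of the estimator pattern: fit() makes a first pass over the data to
# compute statistics, then the returned "model" applies them in a second pass.
class ToyScaler:
    def fit(self, column):
        # First pass: learn the initialization values (mean and std dev).
        mean = sum(column) / len(column)
        var = sum((x - mean) ** 2 for x in column) / len(column)
        return FittedToyScaler(mean, var ** 0.5)

class FittedToyScaler:
    def __init__(self, mean, std):
        self.mean, self.std = mean, std

    def transform(self, column):
        # Second pass: apply the learned function to the data.
        return [(x - self.mean) / self.std for x in column]

scaled = ToyScaler().fit([2.0, 4.0, 6.0]).transform([2.0, 4.0, 6.0])
print(scaled)
```

Note that fit returns a separate fitted object, mirroring how fitting an estimator in MLlib yields a distinct model with its own transform method.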
An evaluator allows us to see how a given model performs according to criteria we specify like a
receiver operating characteristic (ROC) curve. After we use an evaluator to select the best model
from the ones we tested, we can then use that model to make predictions.
From a high level we can specify each of the transformations, estimations, and evaluations one by
one, but it is often easier to specify our steps as stages in a pipeline. This pipeline is similar to
scikit-learn's pipeline concept.
Low-level data types
In addition to the structural types for building pipelines, there are also several lower-level data types
you may need to work with in MLlib (Vector being the most common). Whenever we pass a set of
features into a machine learning model, we must do it as a vector that consists of Doubles. This
vector can be either sparse (where most of the elements are zero) or dense (where all of the
values are stored explicitly). Vectors are created in different ways. To create a dense vector, we can specify an
array of all the values. To create a sparse vector, we can specify the total size and the indices and
values of the non-zero elements. Sparse is the best format, as you might have guessed, when the
majority of values are zero as this is a more compressed representation. Here is an example of how to
manually create a Vector:
// in Scala
import org.apache.spark.ml.linalg.Vectors
val denseVec = Vectors.dense(1.0, 2.0, 3.0)
val size = 3
val idx = Array(1,2) // locations of non-zero elements in vector
val values = Array(2.0,3.0)
val sparseVec = Vectors.sparse(size, idx, values)
sparseVec.toDense
denseVec.toSparse
# in Python
from pyspark.ml.linalg import Vectors
denseVec = Vectors.dense(1.0, 2.0, 3.0)
size = 3
idx = [1, 2] # locations of non-zero elements in vector
values = [2.0, 3.0]
sparseVec = Vectors.sparse(size, idx, values)
WARNING
Confusingly, there are similar datatypes that refer to ones that can be used in DataFrames and others that can only be used
in RDDs. The RDD implementations fall under the mllib package while the DataFrame implementations fall under ml.
MLlib in Action
Now that we have described some of the core pieces you can expect to come across, let’s create a
simple pipeline to demonstrate each of the components. We’ll use a small synthetic dataset that will
help illustrate our point. Let's read the data in and see a sample before talking about it further:
// in Scala
var df = spark.read.json("/data/simple-ml")
df.orderBy("value2").show()
# in Python
df = spark.read.json("/data/simple-ml")
df.orderBy("value2").show()
Here’s a sample of the data:
+-----+----+------+------------------+
|color| lab|value1| value2|
+-----+----+------+------------------+
|green|good| 1|14.386294994851129|
...
| red| bad| 16|14.386294994851129|
|green|good| 12|14.386294994851129|
+-----+----+------+------------------+
This dataset consists of a categorical label with two values (good or bad), a categorical variable
(color), and two numerical variables. While the data is synthetic, let's imagine that this dataset
represents a company's customer health. The “color” column represents some categorical health
rating made by a customer service representative. The “lab” column represents the true customer
health. The other two values are some numerical measures of activity within an application (e.g.,
minutes spent on site and purchases). Suppose that we want to train a classification model where we
hope to predict a binary variable—the label—from the other values.
TIP
Apart from JSON, there are some specific data formats commonly used for supervised learning, including LIBSVM. These
formats have real-valued labels and sparse input data. Spark can read and write these formats using its data source
API. Here's an example of how to read in data from a LIBSVM file using that data source API:
spark.read.format("libsvm").load(
"/data/sample_libsvm_data.txt")
For more information on LIBSVM, see the documentation.
Feature Engineering with Transformers
As already mentioned, transformers help us manipulate our current columns in one way or another.
Manipulating these columns is often in pursuit of building features (that we will input into our model).
Transformers exist to either cut down the number of features, add more features, manipulate current
ones, or simply to help us format our data correctly. Transformers add new columns to DataFrames.
When we use MLlib, all inputs to machine learning algorithms (with several exceptions discussed in
later chapters) in Spark must consist of type Double (for labels) and Vector[Double] (for features).
The current dataset does not meet that requirement and therefore we need to transform it to the proper
format.
To achieve this in our example, we are going to specify an RFormula. This is a declarative language
for specifying machine learning transformations and is simple to use once you understand the syntax.
RFormula supports a limited subset of the R operators that in practice work quite well for simple
models and manipulations (we demonstrate the manual approach to this problem in Chapter 25). The
basic RFormula operators are:
~
Separate target and terms
+
Concat terms; “+ 0” means removing the intercept (this means that the y-intercept of the line that
we will fit will be 0)
-
Remove a term; “- 1” means removing the intercept (this means that the y-intercept of the line that
we will fit will be 0—yes, this does the same thing as “+ 0”)
:
Interaction (multiplication for numeric values, or binarized categorical values)
.
All columns except the target/dependent variable
In order to specify transformations with this syntax, we need to import the relevant class. Then we go
through the process of defining our formula. In this case we want to use all available variables (the .)
and also add in the interactions between value1 and color and value2 and color, treating those as
new features:
// in Scala
import org.apache.spark.ml.feature.RFormula
val supervised = new RFormula()
.setFormula("lab ~ . + color:value1 + color:value2")
# in Python
from pyspark.ml.feature import RFormula
supervised = RFormula(formula="lab ~ . + color:value1 + color:value2")
At this point, we have declaratively specified how we would like to change our data into what we
will train our model on. The next step is to fit the RFormula transformer to the data to let it discover
the possible values of each column. Not all transformers have this requirement but because RFormula
will automatically handle categorical variables for us, it needs to determine which columns are
categorical and which are not, as well as what the distinct values of the categorical columns are. For
this reason, we have to call the fit method. Once we call fit, it returns a “trained” version of our
transformer we can then use to actually transform our data.
NOTE
We’re using the RFormula transformer because it makes performing several transformations extremely easy to do. In
Chapter 25, we’ll show other ways to specify a similar set of transformations and outline the component parts of the
RFormula when we cover the specific transformers in MLlib.
Now that we covered those details, let's continue on and prepare our DataFrame:
// in Scala
val fittedRF = supervised.fit(df)
val preparedDF = fittedRF.transform(df)
preparedDF.show()
# in Python
fittedRF = supervised.fit(df)
preparedDF = fittedRF.transform(df)
preparedDF.show()
Here’s the output from the training and transformation process:
+-----+----+------+------------------+--------------------+-----+
|color| lab|value1| value2| features|label|
+-----+----+------+------------------+--------------------+-----+
|green|good| 1|14.386294994851129|(10,[1,2,3,5,8],[...| 1.0|
...
| red| bad| 2|14.386294994851129|(10,[0,2,3,4,7],[...| 0.0|
+-----+----+------+------------------+--------------------+-----+
In the output we can see the result of our transformation—a column called features that has our
previously raw data. What's happening behind the scenes is actually pretty simple. RFormula
inspects our data during the fit call and outputs an object that will transform our data according to
the specified formula, which is called an RFormulaModel. This “trained” transformer always has the
word Model in the type signature. When we use this transformer, Spark automatically converts our
categorical variable to Doubles so that we can input it into a (yet to be specified) machine learning
model. In particular, it assigns a numerical value to each possible color category, creates additional
features for the interaction variables between colors and value1/value2, and puts them all into a
single vector. We then call transform on that object in order to transform our input data into the
expected output data.
Thus far, you have (pre)processed the data and added some features along the way. Now it is time to
actually train a model (or a set of models) on this dataset. In order to do this, you first need to prepare
a test set for evaluation.
TIP
Having a good test set is probably the most important thing you can do to ensure you train a model you can actually use in
the real world (in a dependable way). Not creating a representative test set or using your test set for hyperparameter tuning
are surefire ways to create a model that does not perform well in real-world scenarios. Don’t skip creating a test set—it’s a
requirement to know how well your model actually does!
Let's create a simple test set based on a random split of the data now (we'll be using this test set
throughout the remainder of the chapter):
// in Scala
val Array(train, test) = preparedDF.randomSplit(Array(0.7, 0.3))
# in Python
train, test = preparedDF.randomSplit([0.7, 0.3])
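Because randomSplit assigns each row independently, the 70/30 proportions are approximate rather than exact. A plain-Python sketch (with made-up row data) of the same row-level behavior:

```python
import random

# Sketch of how a row-level random split behaves: each row is independently
# assigned to the training set with probability 0.7, so the resulting sizes
# are only approximately 70/30, just as with DataFrame.randomSplit.
rng = random.Random(42)  # arbitrary seed, for reproducibility
rows = list(range(1000))
train = [r for r in rows if rng.random() < 0.7]
train_set = set(train)
test = [r for r in rows if r not in train_set]
print(len(train), len(test))
```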
Estimators
Now that we have transformed our data into the correct format and created some valuable features,
it's time to actually fit our model. In this case we will use a classification algorithm called logistic
regression. To create our classifier we instantiate an instance of LogisticRegression, using the
default configuration or hyperparameters. We then set the label column and the features column; the
column names we are setting (label and features) are actually the default names for all
estimators in Spark MLlib, and in later chapters we omit them:
// in Scala
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
# in Python
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label",featuresCol="features")
Before we actually go about training this model, let's inspect the parameters. This is also a great way
to remind yourself of the options available for each particular model:
// in Scala
println(lr.explainParams())
# in Python
print(lr.explainParams())
While the output is too large to reproduce here, it shows an explanation of all of the parameters for
Spark's implementation of logistic regression. The explainParams method exists on all algorithms
available in MLlib.
Upon instantiating an untrained algorithm, it becomes time to fit it to data. In this case, this returns a
LogisticRegressionModel:
// in Scala
val fittedLR = lr.fit(train)
# in Python
fittedLR = lr.fit(train)
This code will kick off a Spark job to train the model. As opposed to the transformations that you saw
throughout the book, the fitting of a machine learning model is eager and performed immediately.
Once complete, you can use the model to make predictions. Logically this means transforming features
into labels. We make predictions with the transform method. For example, we can transform our
training dataset to see what labels our model assigned to the training data and how those compare to
the true outputs. This, again, is just another DataFrame we can manipulate. Let's perform that
prediction with the following code snippet:
fittedLR.transform(train).select("label", "prediction").show()
This results in:
+-----+----------+
|label|prediction|
+-----+----------+
| 0.0| 0.0|
...
| 0.0| 0.0|
+-----+----------+
Our next step would be to manually evaluate this model and calculate performance metrics like the
true positive rate, false negative rate, and so on. We might then turn around and try a different set of
parameters to see if those perform better. However, while this is a useful process, it can also be quite
tedious. Spark helps you avoid manually trying different models and evaluation criteria by allowing
you to specify your workload as a declarative pipeline of work that includes all your transformations
as well as tuning your hyperparameters.
A REVIEW OF HYPERPARAMETERS
Although we mentioned them previously, let's more formally define hyperparameters.
Hyperparameters are configuration parameters that affect the training process, such as model
architecture and regularization. They are set prior to starting training. For instance, logistic
regression has a hyperparameter that determines how much regularization should be performed on
our data through the training phase (regularization is a technique that helps prevent models from
overfitting the data). You'll see in the next couple of pages that we can set up our pipeline to try
different hyperparameter values (e.g., different regularization values) in order to compare
different variations of the same model against one another.
Pipelining Our Workflow
As you probably noticed, if you are performing a lot of transformations, writing all the steps and
keeping track of DataFrames ends up being quite tedious. That's why Spark includes the Pipeline
concept. A pipeline allows you to set up a dataflow of the relevant transformations that ends with an
estimator that is automatically tuned according to your specifications, resulting in a tuned model ready
for use. Figure 24-4 illustrates this process.
Figure 24-4. Pipelining the ML workflow
Note that it is essential that instances of transformers or models are not reused across different
pipelines. Always create a new instance of a model before creating another pipeline.
In order to make sure we don't overfit, we are going to create a holdout test set and tune our
hyperparameters based on a validation set (note that we create this validation set based on the
original dataset, not the preparedDF used in the previous pages):
// in Scala
val Array(train, test) = df.randomSplit(Array(0.7, 0.3))
# in Python
train, test = df.randomSplit([0.7, 0.3])
Now that you have a holdout set, let’s create the base stages in our pipeline. A stage simply
represents a transformer or an estimator. In our case, we will have two estimators. The RFormula will
first analyze our data to understand the types of input features and then transform them to create new
features. Subsequently, the LogisticRegression object is the algorithm that we will train to produce a
model:
// in Scala
val rForm = new RFormula()
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
# in Python
rForm = RFormula()
lr = LogisticRegression().setLabelCol("label").setFeaturesCol("features")
We will set the potential values for the RFormula in the next section. Now instead of manually using
our transformations and then tuning our model we just make them stages in the overall pipeline, as in
the following code snippet:
// in Scala
import org.apache.spark.ml.Pipeline
val stages = Array(rForm, lr)
val pipeline = new Pipeline().setStages(stages)
# in Python
from pyspark.ml import Pipeline
stages = [rForm, lr]
pipeline = Pipeline().setStages(stages)
Training and Evaluation
Now that you have arranged the logical pipeline, the next step is training. In our case, we won't train just
one model (like we did previously); we will train several variations of the model by specifying
different combinations of hyperparameters that we would like Spark to test. We will then select the
best model using an Evaluator that compares their predictions on our validation data. We can test
different hyperparameters in the entire pipeline, even in the RFormula that we use to manipulate the
raw data. This code shows how we go about doing that:
// in Scala
import org.apache.spark.ml.tuning.ParamGridBuilder
val params = new ParamGridBuilder()
.addGrid(rForm.formula, Array(
"lab ~ . + color:value1",
"lab ~ . + color:value1 + color:value2"))
.addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
.addGrid(lr.regParam, Array(0.1, 2.0))
.build()
# in Python
from pyspark.ml.tuning import ParamGridBuilder
params = ParamGridBuilder()\
.addGrid(rForm.formula, [
"lab ~ . + color:value1",
"lab ~ . + color:value1 + color:value2"])\
.addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
.addGrid(lr.regParam, [0.1, 2.0])\
.build()
In our current parameter grid, there are three hyperparameters that will diverge from the defaults:
Two different versions of the RFormula
Three different options for the ElasticNet parameter
Two different options for the regularization parameter
This gives us a total of 12 different combinations of these parameters, which means we will be
training 12 different versions of logistic regression. We explain the ElasticNet parameter as well as
the regularization options in Chapter 26.
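The grid expansion above is just a cross product of the candidate values; a quick plain-Python sketch confirms the 2 × 3 × 2 = 12 combinations Spark will train:

```python
from itertools import product

# The parameter grid is the cross product of every candidate value, one
# tuple per model variation to be trained.
formulas = ["lab ~ . + color:value1", "lab ~ . + color:value1 + color:value2"]
elastic_net = [0.0, 0.5, 1.0]
reg_param = [0.1, 2.0]
grid = list(product(formulas, elastic_net, reg_param))
print(len(grid))  # 12
```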
Now that the grid is built, it’s time to specify our evaluation process. The evaluator allows us to
automatically and objectively compare multiple models to the same evaluation metric. There are
evaluators for classification and regression, covered in later chapters, but in this case we will use the
BinaryClassificationEvaluator, which has a number of potential evaluation metrics, as we’ll
discuss in Chapter 26. In this case we will use areaUnderROC, which is the total area under the
receiver operating characteristic, a common measure of classification performance:
// in Scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
val evaluator = new BinaryClassificationEvaluator()
.setMetricName("areaUnderROC")
.setRawPredictionCol("prediction")
.setLabelCol("label")
# in Python
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()\
.setMetricName("areaUnderROC")\
.setRawPredictionCol("prediction")\
.setLabelCol("label")
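To make the metric itself concrete: areaUnderROC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (ties count as half). A from-scratch sketch with made-up labels and scores, not Spark's implementation:

```python
# From-scratch sketch of the areaUnderROC metric via the rank-comparison
# definition: count, over all positive/negative pairs, how often the
# positive example outranks the negative one.
def area_under_roc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1.0]
    neg = [s for l, s in zip(labels, scores) if l == 0.0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1.0, 1.0, 0.0, 0.0]
scores = [0.9, 0.4, 0.6, 0.2]
print(area_under_roc(labels, scores))  # 0.75
```

A perfect ranking yields 1.0 and a random one about 0.5, which is why the 0.91 we see later in the chapter indicates a reasonably good model.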
Now that we have a pipeline that specifies how our data should be transformed, we will perform
model selection to try out different hyperparameters in our logistic regression model and measure
success by comparing their performance using the areaUnderROC metric.
As we discussed, it is a best practice in machine learning to fit hyperparameters on a validation set
(instead of your test set) to prevent overfitting. For this reason, we cannot use our holdout test set
(that we created before) to tune these parameters. Luckily, Spark provides two options for performing
hyperparameter tuning automatically. We can use TrainValidationSplit, which will simply
perform an arbitrary random split of our data into two different groups, or CrossValidator, which
performs K-fold cross-validation by splitting the dataset into k non-overlapping, randomly partitioned
folds:
// in Scala
import org.apache.spark.ml.tuning.TrainValidationSplit
val tvs = new TrainValidationSplit()
.setTrainRatio(0.75) // also the default.
.setEstimatorParamMaps(params)
.setEstimator(pipeline)
.setEvaluator(evaluator)
# in Python
from pyspark.ml.tuning import TrainValidationSplit
tvs = TrainValidationSplit()\
.setTrainRatio(0.75)\
.setEstimatorParamMaps(params)\
.setEstimator(pipeline)\
.setEvaluator(evaluator)
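For contrast, the fold structure that CrossValidator relies on can be sketched in plain Python. The round-robin assignment here is illustrative, not Spark's exact partitioning scheme:

```python
# Sketch of k-fold partitioning: every row lands in exactly one validation
# fold, and each fold's training set is all the remaining rows.
def k_fold_indices(n_rows, k):
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        splits.append((train, validation))
    return splits

splits = k_fold_indices(10, 3)
print([len(validation) for _, validation in splits])  # [4, 3, 3]
```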
Let's run the entire pipeline we constructed. To review, running this pipeline will test out every
version of the model against the validation set. Note the type of tvsFitted is
TrainValidationSplitModel. Any time we fit a given model, it outputs a “model” type:
// in Scala
val tvsFitted = tvs.fit(train)
# in Python
tvsFitted = tvs.fit(train)
And of course evaluate how it performs on the test set!
evaluator.evaluate(tvsFitted.transform(test)) // 0.9166666666666667
We can also see a training summary for some models. To do this we extract it from the pipeline, cast
it to the proper type, and print our results. The metrics available on each model are discussed
throughout the next several chapters. Here’s how we can see the results:
// in Scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegressionModel
val trainedPipeline = tvsFitted.bestModel.asInstanceOf[PipelineModel]
val trainedLR = trainedPipeline.stages(1).asInstanceOf[LogisticRegressionModel]
val summaryLR = trainedLR.summary
summaryLR.objectiveHistory // 0.6751425885789243, 0.5543659647777687, 0.473776...
The objective history shown here provides details related to how our algorithm performed over each
training iteration. This can be helpful because we can note the progress our algorithm is making
toward the best model. Large jumps are typically expected at the beginning, but over time the values
should become smaller and smaller, with only small amounts of variation between the values.
Persisting and Applying Models
Now that we've trained this model, we can persist it to disk to use it for prediction purposes later on:
tvsFitted.write.overwrite().save("/tmp/modelLocation")
After writing out the model, we can load it into another Spark program to make predictions. To do
this, we need to use a “model” version of our particular algorithm to load our persisted model from
disk. If we were to use CrossValidator, we’d have to read in the persisted version as the
CrossValidatorModel, and if we were to use LogisticRegression manually we would have to
use LogisticRegressionModel. In this case, we use TrainValidationSplit, which outputs
TrainValidationSplitModel:
// in Scala
import org.apache.spark.ml.tuning.TrainValidationSplitModel
val model = TrainValidationSplitModel.load("/tmp/modelLocation")
model.transform(test)
Deployment Patterns
In Spark there are several different deployment patterns for putting machine learning models into
production. Figure 24-5 illustrates common workflows.
Figure 24-5. The productionization process
Here are the various options for how you might go about deploying a Spark model. These are the
general options you should be able to link to the process illustrated in Figure 24-5.
Train your machine learning (ML) model offline and then supply it with offline data. In this
context, offline data means data that is stored for analysis, not data from which you need
to get an answer quickly. Spark is well suited to this sort of deployment.
Train your model offline and then put the results into a database (usually a key-value store).
This works well for something like recommendation but poorly for something like
classification or regression where you cannot just look up a value for a given user but must
calculate one based on the input.
Train your ML algorithm offline, persist the model to disk, and then use that for serving. This
is not a low-latency solution if you use Spark for the serving part, as the overhead of starting
up a Spark job can be high, even if you're not running on a cluster. Additionally this does not
parallelize well, so you'll likely have to put a load balancer in front of multiple model
replicas and build out some REST API integration yourself. There are some interesting
potential solutions to this problem, but no standards currently exist for this sort of model
serving.
Manually (or via some other software) convert your distributed model to one that can run
much more quickly on a single machine. This works well when there is not too much
manipulation of the raw data in Spark but can be hard to maintain over time. Again, there are
several solutions in progress. For example, MLlib can export some models to PMML, a
common model interchange format.
Train your ML algorithm online and use it online. This is possible when used in conjunction
with Structured Streaming, but can be complex for some models.
While these are some of the options, there are many other ways of performing model deployment and
management. This is an area under heavy development and many potential innovations are currently
being worked on.
Conclusion
In this chapter we covered the core concepts behind advanced analytics and MLlib. We also showed
you how to use them. The next chapter will discuss preprocessing in depth, including Spark's tools
for feature engineering and data cleaning. Then we’ll move into detailed descriptions of each
algorithm available in MLlib along with some tools for graph analytics and deep learning.
Chapter 25. Preprocessing and Feature
Engineering
Any data scientist worth her salt knows that one of the biggest challenges (and time sinks) in
advanced analytics is preprocessing. It's not that it's particularly complicated programming, but
rather that it requires deep knowledge of the data you are working with and an understanding of what
your model needs in order to successfully leverage this data. This chapter covers the details of how
you can use Spark to perform preprocessing and feature engineering. We’ll walk through the core
requirements you'll need to meet in order to train an MLlib model in terms of how your data is
structured. We will then discuss the different tools Spark makes available for performing this kind of
work.
Formatting Models According to Your Use Case
To preprocess data for Spark's different advanced analytics tools, you must consider your end
objective. The following list walks through the requirements for input data structure for each
advanced analytics task in MLlib:
In the case of most classification and regression algorithms, you want to get your data into a
column of type Double to represent the label and a column of type Vector (either dense or
sparse) to represent the features.
In the case of recommendation, you want to get your data into a column of users, a column of
items (say movies or books), and a column of ratings.
In the case of unsupervised learning, a column of type Vector (either dense or sparse) is
needed to represent the features.
In the case of graph analytics, you will want a DataFrame of vertices and a DataFrame of
edges.
The best way to get your data in these formats is through transformers. Transformers are functions that
accept a DataFrame as an argument and return a new DataFrame as a response. This chapter will
focus on what transformers are relevant for particular use cases rather than attempting to enumerate
every possible transformer.
NOTE
Spark provides a number of transformers as part of the org.apache.spark.ml.feature package. The corresponding
package in Python is pyspark.ml.feature. New transformers are constantly popping up in Spark MLlib and therefore it is
impossible to include a definitive list in this book. The most up-to-date information can be found on the Spark documentation
site.
Before we proceed, we’re going to read in several different sample datasets, each of which has
different properties we will manipulate in this chapter:
// in Scala
val sales = spark.read.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("/data/retail-data/by-day/*.csv")
.coalesce(5)
.where("Description IS NOT NULL")
val fakeIntDF = spark.read.parquet("/data/simple-ml-integers")
var simpleDF = spark.read.json("/data/simple-ml")
val scaleDF = spark.read.parquet("/data/simple-ml-scaling")
# in Python
sales = spark.read.format("csv")\
.option("header", "true")\
.option("inferSchema", "true")\
.load("/data/retail-data/by-day/*.csv")\
.coalesce(5)\
.where("Description IS NOT NULL")
fakeIntDF = spark.read.parquet("/data/simple-ml-integers")
simpleDF = spark.read.json("/data/simple-ml")
scaleDF = spark.read.parquet("/data/simple-ml-scaling")
In addition to this realistic sales data, we’re going to use several simple synthetic datasets as well.
fakeIntDF, simpleDF, and scaleDF all have very few rows. This will give you the ability to focus
on the exact data manipulation we are performing instead of the various inconsistencies of any
particular dataset. Because we’re going to be accessing the sales data a number of times, we’re going
to cache it so we can read it efficiently from memory as opposed to reading it from disk every time
we need it. Let's also check out the first several rows of data in order to better understand what's in
the dataset:
sales.cache()
sales.show()
+---------+---------+--------------------+--------+-------------------+---------
|InvoiceNo|StockCode| Description|Quantity| InvoiceDate|UnitPr...
+---------+---------+--------------------+--------+-------------------+---------
| 580538| 23084| RABBIT NIGHT LIGHT| 48|2011-12-05 08:38:00| 1...
...
| 580539| 22375|AIRLINE BAG VINTA...| 4|2011-12-05 08:39:00| 4...
+---------+---------+--------------------+--------+-------------------+---------
NOTE
It is important to note that we filtered out null values here. MLlib does not always play nicely with null values at this point in
time. They are a frequent cause of problems and errors, so checking for them is a great first step when you are debugging.
Algorithm handling of null values also improves with every Spark release.
Transformers
We discussed transformers in the previous chapter, but it’s worth reviewing them again here.
Transformers are functions that convert raw data in some way. This might be to create a new
interaction variable (from two other variables), to normalize a column, or to simply turn it into a
Double to be input into a model. Transformers are primarily used in preprocessing or feature
generation.
Spark's transformers only include a transform method. This is because they will not change based on the
input data. Figure 25-1 is a simple illustration. On the left is an input DataFrame with the column to
be manipulated. On the right is the input DataFrame with a new column representing the output
transformation.
Figure 25-1. A Spark transformer
The Tokenizer is an example of a transformer. It tokenizes a string, splitting on a given character,
and has nothing to learn from our data; it simply applies a function. We’ll discuss the tokenizer in
more depth later in this chapter, but here's a small code snippet showing how a tokenizer is built to
accept the input column, how it transforms the data, and then the output from that transformation:
// in Scala
import org.apache.spark.ml.feature.Tokenizer
val tkn = new Tokenizer().setInputCol("Description")
tkn.transform(sales.select("Description")).show(false)
+-----------------------------------+------------------------------------------+
|Description |tok_7de4dfc81ab7__output |
+-----------------------------------+------------------------------------------+
|RABBIT NIGHT LIGHT |[rabbit, night, light] |
|DOUGHNUT LIP GLOSS |[doughnut, lip, gloss] |
...
|AIRLINE BAG VINTAGE WORLD CHAMPION |[airline, bag, vintage, world, champion] |
|AIRLINE BAG VINTAGE JET SET BROWN |[airline, bag, vintage, jet, set, brown] |
+-----------------------------------+------------------------------------------+
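Because the Tokenizer learns nothing from the data, its default behavior (lowercase the string, then split on whitespace) can be sketched in a couple of lines of plain Python:

```python
# Plain-Python sketch of the Tokenizer's default behavior: no fitting,
# just lowercase and split on whitespace.
def tokenize(description):
    return description.lower().split()

print(tokenize("RABBIT NIGHT LIGHT"))  # ['rabbit', 'night', 'light']
```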
Estimators for Preprocessing
Another tool for preprocessing is the estimator. An estimator is necessary when a transformation you
would like to perform must be initialized with data or information about the input column (often
derived by doing a pass over the input column itself). For example, if you wanted to scale the values
in our column to have mean zero and unit variance, you would need to perform a pass over the entire
data in order to calculate the values you would use to normalize the data to mean zero and unit
variance. In effect, an estimator can be a transformer configured according to your particular input
data. In simplest terms, you can either blindly apply a transformation (a “regular” transformer type) or
perform a transformation based on your data (an estimator type). Figure 25-2 is a simple illustration
of an estimator fitting to a particular input dataset, generating a transformer that is then applied to the
input dataset to append a new column (of the transformed data).
Figure 25-2. A Spark estimator
An example of this type of estimator is the StandardScaler, which scales your input column
according to the range of values in that column to have a zero mean and a variance of 1 in each
dimension. For that reason it must first perform a pass over the data to create the transformer. Here’s
a sample code snippet showing the entire process, as well as the output:
// in Scala
import org.apache.spark.ml.feature.StandardScaler
val ss = new StandardScaler().setInputCol("features")
ss.fit(scaleDF).transform(scaleDF).show(false)
+---+--------------+------------------------------------------------------------+
|id |features |stdScal_d66fbeac10ea__output |
+---+--------------+------------------------------------------------------------+
|0 |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
...
|1 |[3.0,10.1,3.0]|[3.5856858280031805,2.3609991401715313,1.7928429140015902] |
+---+--------------+------------------------------------------------------------+
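What the fitted scaler computes can be sketched in plain Python. This sketch assumes StandardScaler's default settings (withStd=True, withMean=False) and uses a made-up column, so the numbers differ from the output above:

```python
import statistics

# Sketch of StandardScaler with its defaults: fit() learns each dimension's
# sample standard deviation, and the returned transformer divides by it.
def fit_scaler(column):
    std = statistics.stdev(column)  # corrected (sample) standard deviation
    return lambda x: x / std

column = [1.0, 2.0, 4.0]  # made-up values for one feature dimension
scale = fit_scaler(column)
print([round(scale(x), 3) for x in column])  # [0.655, 1.309, 2.619]
```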
We will use both estimators and transformers throughout and cover more about these particular
estimators (and add examples in Python) later on in this chapter.
Transformer Properties
All transformers require you to specify, at a minimum, the inputCol and the outputCol, which
represent the column name of the input and output, respectively. You set these with setInputCol and
setOutputCol. There are some defaults (you can find these in the documentation), but it is a best
practice to manually specify them yourself for clarity. In addition to input and output columns, all
transformers have different parameters that you can tune (whenever we mention a parameter in this
chapter you must set it with a set() method). In Python, we also have another method to set these
values with keyword arguments to the object's constructor. We exclude these from the examples in the
next chapter for consistency. Estimators require you to fit the transformer to your particular dataset
and then call transform on the resulting object.
NOTE
Spark MLlib stores metadata about the columns it uses in each DataFrame as an attribute on the column itself. This allows
it to properly store (and annotate) that a column of Doubles may actually represent a series of categorical variables instead
of continuous values. However, metadata won’t show up when you print the schema or the DataFrame.
High-Level Transformers
High-level transformers, such as the RFormula we saw in the previous chapter, allow you to
concisely specify a number of transformations in one. These operate at a “high level”, and allow you
to avoid doing data manipulations or transformations one by one. In general, you should try to use the
highest level transformers you can, in order to minimize the risk of error and help you focus on the
business problem instead of the smaller details of implementation. While this is not always possible,
it's a good objective.
RFormula
The RFormula is the easiest transformer to use when you have "conventionally" formatted data. Spark
borrows this transformer from the R language to make it simple to declaratively specify a set of
transformations for your data. With this transformer, values can be either numerical or categorical and
you do not need to extract values from strings or manipulate them in any way. The RFormula will
automatically handle categorical inputs (specified as strings) by performing something called one-hot
encoding. In brief, one-hot encoding converts a set of values into a set of binary columns specifying
whether or not the data point has each particular value (we’ll discuss one-hot encoding in more depth
later in the chapter). With the RFormula, numeric columns will be cast to Double but will not be one-
hot encoded. If the label column is of type String, it will be first transformed to Double with
StringIndexer.
WARNING
Automatic casting of numeric columns to Double without one-hot encoding has some important implications. If you have
numerically valued categorical variables, they will only be cast to Double, implicitly specifying an order. It is important to
ensure the input types correspond to the expected conversion. If you have categorical variables that really have no order
relation, they should be cast to String. You can also manually index columns (see “Working with Categorical Features”).
The RFormula allows you to specify your transformations in declarative syntax. It is simple to use
once you understand the syntax. Currently, RFormula supports a limited subset of the R operators that
in practice work quite well for simple transformations. The basic operators are:
~
Separate target and terms
+
Concatenate terms; “+ 0” means removing the intercept (this means the y-intercept of the line that
we will fit will be 0)
-
Remove a term; “- 1” means removing the intercept (this means the y-intercept of the line that we will
fit will be 0)
:
Interaction (multiplication for numeric values, or binarized categorical values)
.
All columns except the target/dependent variable
RFormula also uses default columns of label and features to label, you guessed it, the label and
the set of features that it outputs (for supervised machine learning). The models covered later on in
this chapter by default require those column names, making it easy to pass the resulting transformed
DataFrame into a model for training. If this doesn't make sense yet, don't worry; it'll become clear
once we actually start using models in later chapters.
Let's use RFormula in an example. In this case, we want to use all available variables (the .) and
then specify an interaction between value1 and color and value2 and color as additional features
to generate:
// in Scala
import org.apache.spark.ml.feature.RFormula
val supervised = new RFormula()
.setFormula("lab ~ . + color:value1 + color:value2")
supervised.fit(simpleDF).transform(simpleDF).show()
# in Python
from pyspark.ml.feature import RFormula
supervised = RFormula(formula="lab ~ . + color:value1 + color:value2")
supervised.fit(simpleDF).transform(simpleDF).show()
+-----+----+------+------------------+--------------------+-----+
|color| lab|value1| value2| features|label|
+-----+----+------+------------------+--------------------+-----+
|green|good| 1|14.386294994851129|(10,[1,2,3,5,8],[...| 1.0|
| blue| bad| 8|14.386294994851129|(10,[2,3,6,9],[8....| 0.0|
...
| red| bad| 1| 38.97187133755819|(10,[0,2,3,4,7],[...| 0.0|
| red| bad| 2|14.386294994851129|(10,[0,2,3,4,7],[...| 0.0|
+-----+----+------+------------------+--------------------+-----+
SQL Transformers
A SQLTransformer allows you to leverage Spark's vast library of SQL-related manipulations just as
you would an MLlib transformation. Any SELECT statement you can use in SQL is a valid
transformation. The only thing you need to change is that instead of using the table name, you should
just use the keyword THIS. You might want to use SQLTransformer if you want to formally codify
some DataFrame manipulation as a preprocessing step, or try different SQL expressions for features
during hyperparameter tuning. Also note that the output of this transformation will be appended as a
column to the output DataFrame.
You might want to use an SQLTransformer in order to represent all of your manipulations on the very
rawest form of your data so you can version different variations of manipulations as transformers.
This gives you the benefit of building and testing varying pipelines, all by simply swapping out
transformers. The following is a basic example of using SQLTransformer:
// in Scala
import org.apache.spark.ml.feature.SQLTransformer
val basicTransformation = new SQLTransformer()
.setStatement("""
SELECT sum(Quantity), count(*), CustomerID
FROM __THIS__
GROUP BY CustomerID
""")
basicTransformation.transform(sales).show()
# in Python
from pyspark.ml.feature import SQLTransformer
basicTransformation = SQLTransformer()\
.setStatement("""
SELECT sum(Quantity), count(*), CustomerID
FROM __THIS__
GROUP BY CustomerID
""")
basicTransformation.transform(sales).show()
Here’s a sample of the output:
+-------------+--------+----------+
|sum(Quantity)|count(1)|CustomerID|
+-------------+--------+----------+
| 119| 62| 14452.0|
...
| 138| 18| 15776.0|
+-------------+--------+----------+
For extensive samples of these transformations, refer back to Part II.
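The __THIS__ placeholder mechanism itself can be sketched outside Spark: the statement is a template, and at transform time the placeholder is swapped for the actual input table. Here sqlite3 stands in for Spark SQL, and the table contents are made up:

```python
import sqlite3

# Sketch of the __THIS__ convention: the SQL statement is written against a
# placeholder table name, which is substituted at transform time.
statement = """
    SELECT CustomerID, SUM(Quantity)
    FROM __THIS__
    GROUP BY CustomerID
    ORDER BY CustomerID
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (CustomerID INTEGER, Quantity INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 5), (1, 7), (2, 3)])
rows = conn.execute(statement.replace("__THIS__", "sales")).fetchall()
print(rows)  # [(1, 12), (2, 3)]
```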
VectorAssembler
The VectorAssembler is a tool you'll use in nearly every single pipeline you generate. It helps
concatenate all your features into one big vector that you can then pass into an estimator. It's used
typically in the last step of a machine learning pipeline and takes as input a number of columns of
Boolean, Double, or Vector type. This is particularly helpful if you're going to perform a number of
manipulations using a variety of transformers and need to gather all of those results together.
The output from the following code snippet will make it clear how this works:
// in Scala
import org.apache.spark.ml.feature.VectorAssembler
val va = new VectorAssembler().setInputCols(Array("int1", "int2", "int3"))
va.transform(fakeIntDF).show()
# in Python
from pyspark.ml.feature import VectorAssembler
va = VectorAssembler().setInputCols(["int1", "int2", "int3"])
va.transform(fakeIntDF).show()
+----+----+----+--------------------------------------------+
|int1|int2|int3|VectorAssembler_403ab93eacd5585ddd2d__output|
+----+----+----+--------------------------------------------+
| 1| 2| 3| [1.0,2.0,3.0]|
| 4| 5| 6| [4.0,5.0,6.0]|
| 7| 8| 9| [7.0,8.0,9.0]|
+----+----+----+--------------------------------------------+
Working with Continuous Features
Continuous features are just values on the number line, from positive infinity to negative infinity.
There are two common transformers for continuous features. First, you can convert continuous
features into categorical features via a process called bucketing, or you can scale and normalize your
features according to several different requirements. These transformers will only work on Double
types, so make sure you've converted any other numerical values to Double:
// in Scala
val contDF = spark.range(20).selectExpr("cast(id as double)")
# in Python
contDF = spark.range(20).selectExpr("cast(id as double)")
Bucketing
The most straightforward approach to bucketing or binning is using the Bucketizer. This will split a
given continuous feature into the buckets of your designation. You specify how buckets should be
created via an array or list of Double values. This is useful because you may want to simplify the
features in your dataset or simplify their representations for interpretation later on. For example,
imagine you have a column that represents a person's weight and you would like to predict some
value based on this information. In some cases, it might be simpler to create three buckets of
“overweight,” “average,” and “underweight.”
To specify the bucket, set its borders. For example, setting splits to 5.0, 10.0, 250.0 on our contDF
will actually fail because we don't cover all possible input ranges. When specifying your bucket
points, the values you pass into splits must satisfy three requirements:
The minimum value in your splits array must be less than or equal to the minimum value in your
DataFrame.
The maximum value in your splits array must be greater than or equal to the maximum value in your
DataFrame.
You need to specify at a minimum three values in the splits array, which creates two buckets.
WARNING
The Bucketizer can be confusing because we specify bucket borders via the splits parameter, even though the
values are really bucket borders rather than split points.
To cover all possible ranges, you can use scala.Double.NegativeInfinity and
scala.Double.PositiveInfinity as the outermost splits. In Python,
we specify these as float("-inf") and float("inf").
In order to handle null or NaN values, we must specify the handleInvalid parameter: we can
either keep those values (keep), raise an error (error), or skip those rows (skip). Here's an
example of using bucketing:
// in Scala
import org.apache.spark.ml.feature.Bucketizer
val bucketBorders = Array(-1.0, 5.0, 10.0, 250.0, 600.0)
val bucketer = new Bucketizer().setSplits(bucketBorders).setInputCol("id")
bucketer.transform(contDF).show()
# in Python
from pyspark.ml.feature import Bucketizer
bucketBorders = [-1.0, 5.0, 10.0, 250.0, 600.0]
bucketer = Bucketizer().setSplits(bucketBorders).setInputCol("id")
bucketer.transform(contDF).show()
+----+---------------------------------------+
| id|Bucketizer_4cb1be19f4179cc2545d__output|
+----+---------------------------------------+
| 0.0| 0.0|
...
|10.0| 2.0|
|11.0| 2.0|
...
+----+---------------------------------------+
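If the splits semantics seem opaque, the assignment rule can be sketched in plain Python (an illustration of the rule, not Spark's implementation): bucket i covers the range [splits[i], splits[i+1]), and the last bucket also includes its upper border.

```python
def assign_bucket(x, splits):
    # Bucket i covers [splits[i], splits[i+1]); the last bucket also
    # includes its upper border. Values outside the splits are invalid,
    # mirroring the default handleInvalid="error" behavior.
    if x < splits[0] or x > splits[-1]:
        raise ValueError("value %r is not covered by the splits" % x)
    for i in range(len(splits) - 1):
        if splits[i] <= x < splits[i + 1]:
            return float(i)
    return float(len(splits) - 2)  # x equals the final border

bucket_borders = [-1.0, 5.0, 10.0, 250.0, 600.0]
print([assign_bucket(v, bucket_borders) for v in [0.0, 10.0, 11.0]])
# → [0.0, 2.0, 2.0], matching the output rows above
```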
In addition to splitting based on hardcoded values, another option is to split based on percentiles in
our data. This is done with QuantileDiscretizer, which will bucket the values into user-specified
buckets, with the splits determined by approximate quantile values. For instance, the 90th
quantile is the point in your data at which 90% of the data is below that value. You can control how
finely the buckets should be split by setting the relative error for the approximate quantiles
calculation using setRelativeError. Spark does this by allowing you to specify the number of
buckets you would like out of the data, and it will split up your data accordingly. The following is an
example:
// in Scala
import org.apache.spark.ml.feature.QuantileDiscretizer
val bucketer = new QuantileDiscretizer().setNumBuckets(5).setInputCol("id")
val fittedBucketer = bucketer.fit(contDF)
fittedBucketer.transform(contDF).show()
# in Python
from pyspark.ml.feature import QuantileDiscretizer
bucketer = QuantileDiscretizer().setNumBuckets(5).setInputCol("id")
fittedBucketer = bucketer.fit(contDF)
fittedBucketer.transform(contDF).show()
+----+----------------------------------------+
| id|quantileDiscretizer_cd87d1a1fb8e__output|
+----+----------------------------------------+
| 0.0| 0.0|
...
| 6.0| 1.0|
| 7.0| 2.0|
...
|14.0| 3.0|
|15.0| 4.0|
...
+----+----------------------------------------+
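Conceptually, the split points are derived from the data itself. Here is a rough plain-Python sketch of that idea using exact quantiles; Spark computes approximate quantiles instead, so its borders (and hence the bucket assignments above) can differ slightly from this sketch.

```python
def quantile_splits(values, num_buckets):
    # Pick bucket borders at evenly spaced quantiles of the data.
    # This uses exact quantiles; Spark's QuantileDiscretizer uses an
    # approximate algorithm tuned by setRelativeError.
    ordered = sorted(values)
    n = len(ordered)
    splits = [float("-inf")]
    for b in range(1, num_buckets):
        splits.append(ordered[int(n * b / num_buckets)])
    splits.append(float("inf"))
    return splits

values = [float(i) for i in range(20)]  # same values as contDF
print(quantile_splits(values, 5))
# → [-inf, 4.0, 8.0, 12.0, 16.0, inf]
```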
Advanced bucketing techniques
The techniques described here are the most common ways of bucketing data, but a number
of other approaches exist in Spark today. All of these processes are the same from a data flow
perspective: start with continuous data and place it in buckets so that it becomes categorical.
Differences arise depending on the algorithm used to compute these buckets. The simple examples we
just looked at are easy to interpret and work with, but more advanced techniques such as
locality-sensitive hashing (LSH) are also available in MLlib.
Scaling and Normalization
We saw how we can use bucketing to create groups out of continuous variables. Another common task
is to scale and normalize continuous data. While not always necessary, doing so is usually a best
practice. You might want to do this when your data contains a number of columns based on different
scales. For instance, say we have a DataFrame with two columns: weight (in ounces) and height (in
feet). If you don't scale or normalize, the algorithm will be less sensitive to variations in height
because height values in feet are much lower than weight values in ounces. That's an example where
you should scale your data.
An example of normalization might involve transforming the data so that each point's value is a
representation of its distance from the mean of that column. Using the same example from before, we
might want to know how far a given individual's height is from the mean height. Many algorithms
assume that their input data is normalized.
As you might imagine, there are a multitude of algorithms we can apply to our data to scale or
normalize it. Enumerating them all is unnecessary here because they are covered in many other texts
and machine learning libraries. If you're unfamiliar with the concept in detail, check out any of the
books referenced in the previous chapter. Just keep in mind the fundamental goal—we want our data
on the same scale so that values can easily be compared to one another in a sensible way. In MLlib,
this is always done on columns of type Vector. MLlib will look across all the rows in a given
column (of type Vector) and then treat every dimension in those vectors as its own particular column.
It will then apply the scaling or normalization function on each dimension separately.
A simple example might be the following vectors in a column:
1,2
3,4
When we apply our scaling (but not normalization) function, the “3” and the “1” will be adjusted
according to those two values while the “2” and the “4” will be adjusted according to one another.
This is commonly referred to as component-wise comparisons.
StandardScaler
The StandardScaler standardizes a set of features to have zero mean and a standard deviation of 1.
The flag withStd will scale the data to unit standard deviation while the flag withMean (false by
default) will center the data prior to scaling it.
WARNING
Centering can be very expensive on sparse vectors because it generally turns them into dense vectors, so be careful before
centering your data.
Here’s an example of using a StandardScaler:
// in Scala
import org.apache.spark.ml.feature.StandardScaler
val sScaler = new StandardScaler().setInputCol("features")
sScaler.fit(scaleDF).transform(scaleDF).show()
# in Python
from pyspark.ml.feature import StandardScaler
sScaler = StandardScaler().setInputCol("features")
sScaler.fit(scaleDF).transform(scaleDF).show()
The output is shown below:
+---+--------------+------------------------------------------------------------+
|id |features |StandardScaler_41aaa6044e7c3467adc3__output |
+---+--------------+------------------------------------------------------------+
|0 |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
...
|1 |[3.0,10.1,3.0]|[3.5856858280031805,2.3609991401715313,1.7928429140015902] |
+---+--------------+------------------------------------------------------------+
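To see where these numbers come from, here is the per-dimension computation sketched in plain Python (not Spark code), using the first components of the vectors in scaleDF as the example column. With withMean false (the default), each value is simply divided by the sample standard deviation of its dimension.

```python
import math

def standard_scale(column, with_mean=False):
    # Per-dimension standardization: divide by the sample (unbiased)
    # standard deviation; with_mean=False (the default) skips centering.
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [((x - mean) if with_mean else x) / std for x in column]

dim0 = [1.0, 2.0, 1.0, 2.0, 3.0]  # first components of scaleDF's vectors
print(round(standard_scale(dim0)[0], 6))  # → 1.195229, as in the first row above
```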
MinMaxScaler
The MinMaxScaler will scale the values in a vector (component-wise) to the proportional values on
a scale from a given min value to a max value. If you specify the minimum value to be 0 and the
maximum value to be 1, then all the values will fall in between 0 and 1:
// in Scala
import org.apache.spark.ml.feature.MinMaxScaler
val minMax = new MinMaxScaler().setMin(5).setMax(10).setInputCol("features")
val fittedminMax = minMax.fit(scaleDF)
fittedminMax.transform(scaleDF).show()
# in Python
from pyspark.ml.feature import MinMaxScaler
minMax = MinMaxScaler().setMin(5).setMax(10).setInputCol("features")
fittedminMax = minMax.fit(scaleDF)
fittedminMax.transform(scaleDF).show()
+---+--------------+-----------------------------------------+
| id| features|MinMaxScaler_460cbafafbe6b9ab7c62__output|
+---+--------------+-----------------------------------------+
| 0|[1.0,0.1,-1.0]| [5.0,5.0,5.0]|
...
| 1|[3.0,10.1,3.0]| [10.0,10.0,10.0]|
+---+--------------+-----------------------------------------+
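The underlying computation is a simple linear rescaling per dimension, sketched here in plain Python (not Spark code) on the first components of scaleDF's vectors:

```python
def min_max_scale(column, new_min, new_max):
    # Rescale one vector component linearly onto [new_min, new_max],
    # mirroring MinMaxScaler's per-dimension behavior.
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min
            for x in column]

dim0 = [1.0, 2.0, 1.0, 2.0, 3.0]  # first components of scaleDF's vectors
print(min_max_scale(dim0, 5, 10))  # → [5.0, 7.5, 5.0, 7.5, 10.0]
```

The minimum of the dimension maps to 5 and the maximum to 10, just as in the output above.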
MaxAbsScaler
The max absolute scaler (MaxAbsScaler) scales the data by dividing each value by the maximum
absolute value in this feature. All values therefore end up between −1 and 1. This transformer does
not shift or center the data at all in the process:
// in Scala
import org.apache.spark.ml.feature.MaxAbsScaler
val maScaler = new MaxAbsScaler().setInputCol("features")
val fittedmaScaler = maScaler.fit(scaleDF)
fittedmaScaler.transform(scaleDF).show()
# in Python
from pyspark.ml.feature import MaxAbsScaler
maScaler = MaxAbsScaler().setInputCol("features")
fittedmaScaler = maScaler.fit(scaleDF)
fittedmaScaler.transform(scaleDF).show()
+---+--------------+----------------------------------------------------------+
|id |features |MaxAbsScaler_402587e1d9b6f268b927__output |
+---+--------------+----------------------------------------------------------+
|0 |[1.0,0.1,-1.0]|[0.3333333333333333,0.009900990099009901,-0.3333333333333]|
...
|1 |[3.0,10.1,3.0]|[1.0,1.0,1.0] |
+---+--------------+----------------------------------------------------------+
ElementwiseProduct
The ElementwiseProduct allows us to scale each value in a vector by an arbitrary value. For
example, given the scaling vector below and the input row “1, 0.1, -1”, the output will be “10, 1.5, -20.”
Naturally, the dimensions of the scaling vector must match the dimensions of the vector inside the relevant
column:
// in Scala
import org.apache.spark.ml.feature.ElementwiseProduct
import org.apache.spark.ml.linalg.Vectors
val scaleUpVec = Vectors.dense(10.0, 15.0, 20.0)
val scalingUp = new ElementwiseProduct()
.setScalingVec(scaleUpVec)
.setInputCol("features")
scalingUp.transform(scaleDF).show()
# in Python
from pyspark.ml.feature import ElementwiseProduct
from pyspark.ml.linalg import Vectors
scaleUpVec = Vectors.dense(10.0, 15.0, 20.0)
scalingUp = ElementwiseProduct()\
.setScalingVec(scaleUpVec)\
.setInputCol("features")
scalingUp.transform(scaleDF).show()
+---+--------------+-----------------------------------------------+
| id| features|ElementwiseProduct_42b29ea5a55903e9fea6__output|
+---+--------------+-----------------------------------------------+
| 0|[1.0,0.1,-1.0]| [10.0,1.5,-20.0]|
...
| 1|[3.0,10.1,3.0]| [30.0,151.5,60.0]|
+---+--------------+-----------------------------------------------+
Normalizer
The normalizer allows us to scale multidimensional vectors using one of several power norms, set
through the parameter “p”. For example, we can use the Manhattan norm (or Manhattan distance) with
p = 1, Euclidean norm with p = 2, and so on. The Manhattan distance is a measure of distance where
you can only travel from point to point along the straight lines of an axis (like the streets in
Manhattan).
Here’s an example of using the Normalizer:
// in Scala
import org.apache.spark.ml.feature.Normalizer
val manhattanDistance = new Normalizer().setP(1).setInputCol("features")
manhattanDistance.transform(scaleDF).show()
# in Python
from pyspark.ml.feature import Normalizer
manhattanDistance = Normalizer().setP(1).setInputCol("features")
manhattanDistance.transform(scaleDF).show()
+---+--------------+-------------------------------+
| id| features|normalizer_1bf2cd17ed33__output|
+---+--------------+-------------------------------+
| 0|[1.0,0.1,-1.0]| [0.47619047619047...|
| 1| [2.0,1.1,1.0]| [0.48780487804878...|
| 0|[1.0,0.1,-1.0]| [0.47619047619047...|
| 1| [2.0,1.1,1.0]| [0.48780487804878...|
| 1|[3.0,10.1,3.0]| [0.18633540372670...|
+---+--------------+-------------------------------+
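The computation itself is just a division of each component by the chosen p-norm. A plain-Python sketch (not Spark code):

```python
def p_normalize(vec, p):
    # Divide each component by the vector's p-norm: p=1 gives the
    # Manhattan norm, p=2 the Euclidean norm.
    norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    return [x / norm for x in vec]

# Manhattan norm of [1.0, 0.1, -1.0] is 2.1, so the first component
# becomes 1.0 / 2.1 ≈ 0.476190, matching the first row above.
print(p_normalize([1.0, 0.1, -1.0], p=1))
```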
Working with Categorical Features
The most common task for categorical features is indexing. Indexing converts a categorical variable
in a column to a numerical one that you can plug into machine learning algorithms. While this is
conceptually simple, there are some catches that are important to keep in mind so that Spark can do
this in a stable and repeatable manner.
In general, we recommend re-indexing every categorical variable when preprocessing, just for
consistency's sake. This can be helpful in maintaining your models over the long run, as your encoding
practices may change over time.
StringIndexer
The simplest way to index is via the StringIndexer, which maps strings to different numerical IDs.
Spark's StringIndexer also attaches metadata to the DataFrame that specifies which inputs
correspond to which outputs. This allows us to later recover the inputs from their respective index values:
// in Scala
import org.apache.spark.ml.feature.StringIndexer
val lblIndxr = new StringIndexer().setInputCol("lab").setOutputCol("labelInd")
val idxRes = lblIndxr.fit(simpleDF).transform(simpleDF)
idxRes.show()
# in Python
from pyspark.ml.feature import StringIndexer
lblIndxr = StringIndexer().setInputCol("lab").setOutputCol("labelInd")
idxRes = lblIndxr.fit(simpleDF).transform(simpleDF)
idxRes.show()
+-----+----+------+------------------+--------+
|color| lab|value1| value2|labelInd|
+-----+----+------+------------------+--------+
|green|good| 1|14.386294994851129| 1.0|
...
| red| bad| 2|14.386294994851129| 0.0|
+-----+----+------+------------------+--------+
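Under the default settings, StringIndexer orders labels by frequency, so the most frequent label receives index 0.0. A plain-Python sketch of that fit step (the alphabetical tie-breaking and the made-up label counts below are assumptions of this sketch, chosen so that “bad” is most frequent, consistent with the indices shown above):

```python
from collections import Counter

def fit_string_indexer(labels):
    # Order labels by descending frequency: the most frequent label
    # receives index 0.0. Ties are broken alphabetically here (an
    # assumption of this sketch, not necessarily Spark's rule).
    freq = Counter(labels)
    ordered = sorted(freq, key=lambda lbl: (-freq[lbl], lbl))
    return {lbl: float(i) for i, lbl in enumerate(ordered)}

print(fit_string_indexer(["bad", "bad", "good", "bad", "good"]))
# → {'bad': 0.0, 'good': 1.0}
```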
We can also apply StringIndexer to columns that are not strings, in which case they will be
converted to strings before being indexed:
// in Scala
val valIndexer = new StringIndexer()
.setInputCol("value1")
.setOutputCol("valueInd")
valIndexer.fit(simpleDF).transform(simpleDF).show()
# in Python
valIndexer = StringIndexer().setInputCol("value1").setOutputCol("valueInd")
valIndexer.fit(simpleDF).transform(simpleDF).show()
+-----+----+------+------------------+--------+
|color| lab|value1| value2|valueInd|
+-----+----+------+------------------+--------+
|green|good| 1|14.386294994851129| 1.0|
...
| red| bad| 2|14.386294994851129| 0.0|
+-----+----+------+------------------+--------+
Keep in mind that the StringIndexer is an estimator that must be fit on the input data. This means it
must see all inputs to select a mapping of inputs to IDs. If you train a StringIndexer on inputs “a,”
“b,” and “c” and then go to use it against input “d,” it will throw an error by default. Another option
is to skip the entire row if the input value was not a value seen during training. Going along with the
previous example, an input value of “d” would cause that row to be skipped entirely. We can set this
option before or after training the indexer or pipeline. More options may be added to this feature in
the future, but as of Spark 2.2, you can only skip invalid inputs or throw an error on them.
valIndexer.setHandleInvalid("skip")
valIndexer.fit(simpleDF).setHandleInvalid("skip")
Converting Indexed Values Back to Text
When inspecting your machine learning results, you're likely going to want to map back to the original
values. Since MLlib classification models make predictions using the indexed values, this conversion
is useful for converting model predictions (indices) back to the original categories. We can do this
with IndexToString. You'll notice that we do not have to supply the index-to-string mapping ourselves; Spark's
MLlib maintains this metadata for you. You can optionally specify the output column.
// in Scala
import org.apache.spark.ml.feature.IndexToString
val labelReverse = new IndexToString().setInputCol("labelInd")
labelReverse.transform(idxRes).show()
# in Python
from pyspark.ml.feature import IndexToString
labelReverse = IndexToString().setInputCol("labelInd")
labelReverse.transform(idxRes).show()
+-----+----+------+------------------+--------+--------------------------------+
|color| lab|value1| value2|labelInd|IndexToString_415...2a0d__output|
+-----+----+------+------------------+--------+--------------------------------+
|green|good| 1|14.386294994851129| 1.0| good|
...
| red| bad| 2|14.386294994851129| 0.0| bad|
+-----+----+------+------------------+--------+--------------------------------+
Indexing in Vectors
VectorIndexer is a helpful tool for working with categorical variables that are already found inside
of vectors in your dataset. This tool will automatically find categorical features inside of your input
vectors and convert them to categorical features with zero-based category indices. For example, in
the following DataFrame, the first column in our Vector is a categorical variable with two different
categories while the rest of the variables are continuous. By setting maxCategories to 2 in our
VectorIndexer, we are instructing Spark to take any column in our vector with two or fewer distinct
values and convert it to a categorical variable. This can be helpful when you know how many unique
values there are in your largest category because you can specify this and it will automatically index
the values accordingly. Conversely, Spark changes the data based on this parameter, so if you have
continuous variables that don't appear particularly continuous (lots of repeated values), these can be
unintentionally converted to categorical variables if there are too few unique values.
// in Scala
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.linalg.Vectors
val idxIn = spark.createDataFrame(Seq(
(Vectors.dense(1, 2, 3),1),
(Vectors.dense(2, 5, 6),2),
(Vectors.dense(1, 8, 9),3)
)).toDF("features", "label")
val indxr = new VectorIndexer()
.setInputCol("features")
.setOutputCol("idxed")
.setMaxCategories(2)
indxr.fit(idxIn).transform(idxIn).show
# in Python
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.linalg import Vectors
idxIn = spark.createDataFrame([
(Vectors.dense(1, 2, 3),1),
(Vectors.dense(2, 5, 6),2),
(Vectors.dense(1, 8, 9),3)
]).toDF("features", "label")
indxr = VectorIndexer()\
.setInputCol("features")\
.setOutputCol("idxed")\
.setMaxCategories(2)
indxr.fit(idxIn).transform(idxIn).show()
+-------------+-----+-------------+
| features|label| idxed|
+-------------+-----+-------------+
|[1.0,2.0,3.0]| 1|[0.0,2.0,3.0]|
|[2.0,5.0,6.0]| 2|[1.0,5.0,6.0]|
|[1.0,8.0,9.0]| 3|[0.0,8.0,9.0]|
+-------------+-----+-------------+
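The detection logic can be sketched in plain Python (not Spark code): any dimension with at most maxCategories distinct values is remapped to zero-based indices, while the others pass through unchanged. Assigning indices in sorted-value order is an assumption of this sketch, though it happens to reproduce the output above.

```python
def index_vectors(rows, max_categories):
    # A dimension with at most max_categories distinct values is treated
    # as categorical and remapped to zero-based indices (assigned here in
    # sorted-value order); other dimensions pass through unchanged.
    dims = list(zip(*rows))
    out_dims = []
    for dim in dims:
        distinct = sorted(set(dim))
        if len(distinct) <= max_categories:
            mapping = {v: float(i) for i, v in enumerate(distinct)}
            out_dims.append([mapping[v] for v in dim])
        else:
            out_dims.append(list(dim))
    return [list(row) for row in zip(*out_dims)]

rows = [[1.0, 2.0, 3.0], [2.0, 5.0, 6.0], [1.0, 8.0, 9.0]]
print(index_vectors(rows, max_categories=2))
# → [[0.0, 2.0, 3.0], [1.0, 5.0, 6.0], [0.0, 8.0, 9.0]]
```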
One-Hot Encoding
Indexing categorical variables is only half of the story. One-hot encoding is an extremely common
data transformation performed after indexing categorical variables. This is because indexing does not
always represent our categorical variables in the correct way for downstream models to process. For
instance, when we index our “color” column, you will notice that some colors have a higher value (or
index number) than others (in our case, green is 1 and blue is 2).
This is problematic because it gives the mathematical appearance that blue > green, which makes no
sense for the current categories. To avoid this, we use OneHotEncoder, which will convert each
distinct value to a Boolean flag (1 or 0) as a component in a vector. When we encode the color value,
we can see these are no longer ordered, making them easier for downstream models (e.g., a linear
model) to process:
// in Scala
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}
val lblIndxr = new StringIndexer().setInputCol("color").setOutputCol("colorInd")
val colorLab = lblIndxr.fit(simpleDF).transform(simpleDF.select("color"))
val ohe = new OneHotEncoder().setInputCol("colorInd")
ohe.transform(colorLab).show()
# in Python
from pyspark.ml.feature import OneHotEncoder, StringIndexer
lblIndxr = StringIndexer().setInputCol("color").setOutputCol("colorInd")
colorLab = lblIndxr.fit(simpleDF).transform(simpleDF.select("color"))
ohe = OneHotEncoder().setInputCol("colorInd")
ohe.transform(colorLab).show()
+-----+--------+------------------------------------------+
|color|colorInd|OneHotEncoder_46b5ad1ef147bb355612__output|
+-----+--------+------------------------------------------+
|green| 1.0| (2,[1],[1.0])|
| blue| 2.0| (2,[],[])|
...
| red| 0.0| (2,[0],[1.0])|
| red| 0.0| (2,[0],[1.0])|
+-----+--------+------------------------------------------+
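Note that the sparse vectors above have size 2 even though there are three colors: by default, Spark's OneHotEncoder drops the last category, which is then represented implicitly by the all-zeros vector. A plain-Python sketch of the encoding (not Spark code):

```python
def one_hot(index, num_categories, drop_last=True):
    # With drop_last=True (Spark's default), the output has
    # num_categories - 1 slots and the last category becomes the
    # all-zeros vector — which is why blue (index 2.0) above is the
    # empty sparse vector (2,[],[]).
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec

for idx in [0.0, 1.0, 2.0]:  # red, green, blue indices from the output above
    print(idx, one_hot(idx, num_categories=3))
```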
Text Data Transformers
Text is always tricky input because it often requires lots of manipulation to map to a format that a
machine learning model will be able to use effectively. There are generally two kinds of text you'll
see: free-form text and string categorical variables. This section primarily focuses on free-form text
because we already discussed categorical variables.
Tokenizing Text
Tokenization is the process of converting free-form text into a list of “tokens” or individual words.
The easiest way to do this is by using the Tokenizer class. This transformer takes a string of
words separated by whitespace and converts it into an array of words. For example, in our dataset
we might want to convert the Description field into a list of tokens.
// in Scala
import org.apache.spark.ml.feature.Tokenizer
val tkn = new Tokenizer().setInputCol("Description").setOutputCol("DescOut")
val tokenized = tkn.transform(sales.select("Description"))
tokenized.show(false)
# in Python
from pyspark.ml.feature import Tokenizer
tkn = Tokenizer().setInputCol("Description").setOutputCol("DescOut")
tokenized = tkn.transform(sales.select("Description"))
tokenized.show(20, False)
+-----------------------------------+------------------------------------------+
|Description                        |DescOut                                   |
+-----------------------------------+------------------------------------------+
|RABBIT NIGHT LIGHT |[rabbit, night, light] |
|DOUGHNUT LIP GLOSS |[doughnut, lip, gloss] |
...
|AIRLINE BAG VINTAGE WORLD CHAMPION |[airline, bag, vintage, world, champion] |
|AIRLINE BAG VINTAGE JET SET BROWN |[airline, bag, vintage, jet, set, brown] |
+-----------------------------------+------------------------------------------+
We can also create a tokenizer that is based not just on whitespace but on a regular expression, using the
RegexTokenizer. The format of the regular expression should conform to the Java Regular
Expression (RegEx) syntax:
// in Scala
import org.apache.spark.ml.feature.RegexTokenizer
val rt = new RegexTokenizer()
.setInputCol("Description")
.setOutputCol("DescOut")
.setPattern(" ") // simplest expression
.setToLowercase(true)
rt.transform(sales.select("Description")).show(false)
# in Python
from pyspark.ml.feature import RegexTokenizer
rt = RegexTokenizer()\
.setInputCol("Description")\
.setOutputCol("DescOut")\
.setPattern(" ")\
.setToLowercase(True)
rt.transform(sales.select("Description")).show(20, False)
+-----------------------------------+------------------------------------------+
|Description                        |DescOut                                   |
+-----------------------------------+------------------------------------------+
|RABBIT NIGHT LIGHT |[rabbit, night, light] |
|DOUGHNUT LIP GLOSS |[doughnut, lip, gloss] |
...
|AIRLINE BAG VINTAGE WORLD CHAMPION |[airline, bag, vintage, world, champion] |
|AIRLINE BAG VINTAGE JET SET BROWN |[airline, bag, vintage, jet, set, brown] |
+-----------------------------------+------------------------------------------+
Another way of using the RegexTokenizer is to use it to output values matching the provided pattern
instead of using it as a gap (delimiter). We do this by setting the gaps parameter to false. Doing this with a space
as the pattern returns all the spaces, which is not too useful, but if we made our pattern capture
individual words, we could return those:
// in Scala
import org.apache.spark.ml.feature.RegexTokenizer
val rt = new RegexTokenizer()
.setInputCol("Description")
.setOutputCol("DescOut")
.setPattern(" ")
.setGaps(false)
.setToLowercase(true)
rt.transform(sales.select("Description")).show(false)
# in Python
from pyspark.ml.feature import RegexTokenizer
rt = RegexTokenizer()\
.setInputCol("Description")\
.setOutputCol("DescOut")\
.setPattern(" ")\
.setGaps(False)\
.setToLowercase(True)
rt.transform(sales.select("Description")).show(20, False)
+-----------------------------------+------------------+
|Description                        |DescOut           |
+-----------------------------------+------------------+
|RABBIT NIGHT LIGHT |[ , ] |
|DOUGHNUT LIP GLOSS |[ , , ] |
...
|AIRLINE BAG VINTAGE WORLD CHAMPION |[ , , , , ] |
|AIRLINE BAG VINTAGE JET SET BROWN |[ , , , , ] |
+-----------------------------------+------------------+
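The two modes have a close analogy in Python's re module, which may make the gaps parameter easier to remember (this is an analogy, not Spark's implementation): gaps=true behaves like re.split, while gaps=false behaves like re.findall.

```python
import re

def regex_tokenize(text, pattern, gaps=True):
    # gaps=True: the pattern is a delimiter (like re.split);
    # gaps=False: the pattern itself is matched and the matches become
    # the tokens (like re.findall).
    text = text.lower()
    if gaps:
        return [t for t in re.split(pattern, text) if t]
    return re.findall(pattern, text)

print(regex_tokenize("RABBIT NIGHT LIGHT", " "))                   # split on spaces
print(regex_tokenize("RABBIT NIGHT LIGHT", "[a-z]+", gaps=False))  # match words
# both → ['rabbit', 'night', 'light']
```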
Removing Common Words
A common task after tokenization is to filter stop words, common words that are not relevant in many
kinds of analysis and should thus be removed. Frequently occurring stop words in English include
“the,” “and,” and “but.” Spark contains a list of default stop words you can see by calling the
following method, which can be made case insensitive if necessary (as of Spark 2.2, supported
languages for stopwords are “danish,” “dutch,” “english,” “finnish,” “french,” “german,” “hungarian,”
“italian,” “norwegian,” “portuguese,” “russian,” “spanish,” “swedish,” and “turkish”):
// in Scala
import org.apache.spark.ml.feature.StopWordsRemover
val englishStopWords = StopWordsRemover.loadDefaultStopWords("english")
val stops = new StopWordsRemover()
.setStopWords(englishStopWords)
.setInputCol("DescOut")
stops.transform(tokenized).show()
# in Python
from pyspark.ml.feature import StopWordsRemover
englishStopWords = StopWordsRemover.loadDefaultStopWords("english")
stops = StopWordsRemover()\
.setStopWords(englishStopWords)\
.setInputCol("DescOut")
stops.transform(tokenized).show()
The following output shows how this works:
+--------------------+--------------------+------------------------------------+
| Description| DescOut|StopWordsRemover_4ab18...6ed__output|
+--------------------+--------------------+------------------------------------+
...
|SET OF 4 KNICK KN...|[set, of, 4, knic...| [set, 4, knick, k...|
...
+--------------------+--------------------+------------------------------------+
Notice how the word “of” is removed in the output column. That's because it's such a common word
that it isn't relevant to any downstream manipulation and simply adds noise to our dataset.
Creating Word Combinations
Tokenizing our strings and filtering stop words leaves us with a clean set of words to use as features.
It is often of interest to look at combinations of words, usually by looking at colocated words. Word
combinations are technically referred to as n-grams—that is, sequences of words of length n. An
n-gram of length 1 is called a unigram; those of length 2 are called bigrams, and those of length 3 are
called trigrams (anything longer is just a four-gram, five-gram, etc.). Order matters with n-gram
creation, so converting a sentence with three words into bigram representation would result in two
bigrams. The goal when creating n-grams is to better capture sentence structure and more information
than can be gleaned by simply looking at all words individually. Let’s create some n-grams to
illustrate this concept.
The bigrams of “Big Data Processing Made Simple” are:
“Big Data”
“Data Processing”
“Processing Made”
“Made Simple”
While the trigrams are:
“Big Data Processing”
“Data Processing Made”
“Processing Made Simple”
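The lists above come from sliding a window of length n across the tokens, which is easy to sketch in plain Python (not Spark code):

```python
def ngrams(tokens, n):
    # Slide a window of length n across the token list, joining each
    # window with a space, as Spark's NGram transformer does.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["big", "data", "processing", "made", "simple"]
print(ngrams(tokens, 2))  # the four bigrams listed above
print(ngrams(tokens, 3))  # the three trigrams listed above
```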
With n-grams, we can look at sequences of words that commonly co-occur and use them as inputs to a
machine learning algorithm. These can create better features than simply looking at all of the words
individually (say, tokenized on a space character):
// in Scala
import org.apache.spark.ml.feature.NGram
val unigram = new NGram().setInputCol("DescOut").setN(1)
val bigram = new NGram().setInputCol("DescOut").setN(2)
unigram.transform(tokenized.select("DescOut")).show(false)
bigram.transform(tokenized.select("DescOut")).show(false)
# in Python
from pyspark.ml.feature import NGram
unigram = NGram().setInputCol("DescOut").setN(1)
bigram = NGram().setInputCol("DescOut").setN(2)
unigram.transform(tokenized.select("DescOut")).show(False)
bigram.transform(tokenized.select("DescOut")).show(False)
+-----------------------------------------+-------------------------------------
|DescOut                                  |ngram_104c4da6a01b__output ...
+-----------------------------------------+-------------------------------------
|[rabbit, night, light] |[rabbit, night, light] ...
|[doughnut, lip, gloss] |[doughnut, lip, gloss] ...
...
|[airline, bag, vintage, world, champion] |[airline, bag, vintage, world, cha...
|[airline, bag, vintage, jet, set, brown] |[airline, bag, vintage, jet, set, ...
+-----------------------------------------+-------------------------------------
And the result for bigrams:
+------------------------------------------+------------------------------------
|DescOut                                   |ngram_6e68fb3a642a__output ...
+------------------------------------------+------------------------------------
|[rabbit, night, light] |[rabbit night, night light] ...
|[doughnut, lip, gloss] |[doughnut lip, lip gloss] ...
...
|[airline, bag, vintage, world, champion] |[airline bag, bag vintage, vintag...
|[airline, bag, vintage, jet, set, brown] |[airline bag, bag vintage, vintag...
+------------------------------------------+------------------------------------
Converting Words into Numerical Representations
Once you have word features, it's time to start counting instances of words and word combinations
for use in our models. The simplest way is just to include binary counts of a word in a given
document (in our case, a row). Essentially, we’re measuring whether or not each row contains a given
word. This is a simple way to normalize for document sizes and occurrence counts and get numerical
features that allow us to classify documents based on content. In addition, we can count words using a
CountVectorizer, or reweigh them according to the prevalence of a given word in all the documents
using a TF–IDF transformation (discussed next).
A CountVectorizer operates on our tokenized data and does two things:
1. During the fit process, it finds the set of words in all the documents and then counts the
occurrences of those words in those documents.
2. It then counts the occurrences of a given word in each row of the DataFrame column during
the transformation process and outputs a vector with the terms that occur in that row.
Conceptually this transformer treats every row as a document and every word as a term and the total
collection of all terms as the vocabulary. These are all tunable parameters, meaning we can set the
minimum term frequency (minTF) for the term to be included in the vocabulary (effectively removing
rare words from the vocabulary); minimum number of documents a term must appear in (minDF)
before being included in the vocabulary (another way to remove rare words from the vocabulary);
and finally, the total maximum vocabulary size (vocabSize). Lastly, by default the CountVectorizer
will output the counts of a term in a document. To just return whether or not a word exists in a
document, we can use setBinary(true). Here’s an example of using CountVectorizer:
// in Scala
import org.apache.spark.ml.feature.CountVectorizer
val cv = new CountVectorizer()
.setInputCol("DescOut")
.setOutputCol("countVec")
.setVocabSize(500)
.setMinTF(1)
.setMinDF(2)
val fittedCV = cv.fit(tokenized)
fittedCV.transform(tokenized).show(false)
# in Python
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer()\
.setInputCol("DescOut")\
.setOutputCol("countVec")\
.setVocabSize(500)\
.setMinTF(1)\
.setMinDF(2)
fittedCV = cv.fit(tokenized)
fittedCV.transform(tokenized).show(False)
While the output looks a little complicated, it’s actually just a sparse vector that contains the total
vocabulary size, the index of the word in the vocabulary, and then the counts of that particular word:
+---------------------------------+--------------------------------------------+
|DescOut |countVec |
+---------------------------------+--------------------------------------------+
|[rabbit, night, light] |(500,[150,185,212],[1.0,1.0,1.0]) |
|[doughnut, lip, gloss] |(500,[462,463,492],[1.0,1.0,1.0]) |
...
|[airline, bag, vintage, world,...|(500,[2,6,328],[1.0,1.0,1.0]) |
|[airline, bag, vintage, jet, s...|(500,[0,2,6,328,405],[1.0,1.0,1.0,1.0,1.0]) |
+---------------------------------+--------------------------------------------+
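To make the sparse notation concrete, here is a small plain-Python sketch (not using Spark) that expands a sparse vector like the one in the first row above into its dense form:

```python
def sparse_to_dense(size, indices, values):
    """Expand Spark's sparse-vector notation (size, [indices], [values])
    into a dense list of counts, with zeros everywhere else."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# The first row above: (500,[150,185,212],[1.0,1.0,1.0])
dense = sparse_to_dense(500, [150, 185, 212], [1.0, 1.0, 1.0])
```

Only the three word indices carry nonzero counts; the remaining 497 positions are implicit zeros, which is why the sparse form is so compact.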
Term frequency–inverse document frequency
Another way to approach the problem of converting text into a numerical representation is to use term
frequency–inverse document frequency (TF–IDF). In simplest terms, TF–IDF measures how often a
word occurs in each document, weighted according to how many documents that word occurs in. The
result is that words that occur in a few documents are given more weight than words that occur in
many documents. In practice, a word like “the” would be weighted very low because of its
prevalence while a more specialized word like “streaming” would occur in fewer documents and
thus would be weighted higher. In a way, TF–IDF helps find documents that share similar topics.
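The weighting just described can be sketched in a few lines of plain Python. This is a toy version for intuition only; the smoothed formula log((N + 1) / (df + 1)) mirrors the smoothing Spark's IDF applies, but Spark's implementation differs in the details:

```python
import math

def tf_idf(docs):
    """Toy TF-IDF: term frequency times smoothed inverse document
    frequency. Words in many documents get a low weight."""
    n = len(docs)
    df = {}                               # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    def weight(doc, term):
        tf = doc.count(term)
        return tf * math.log((n + 1) / (df[term] + 1))
    return weight

docs = [["red", "rose"], ["red", "car"], ["blue", "sky"]]
w = tf_idf(docs)
# "red" appears in 2 of 3 documents, so it is weighted lower than "rose"
```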
Let’s take a look at an example. First, we’ll inspect some of the documents in our data containing the
word “red”:
// in Scala
val tfIdfIn = tokenized
.where("array_contains(DescOut, 'red')")
.select("DescOut")
.limit(10)
tfIdfIn.show(false)
# in Python
tfIdfIn = tokenized\
.where("array_contains(DescOut, 'red')")\
.select("DescOut")\
.limit(10)
tfIdfIn.show(10, False)
+---------------------------------------+
|DescOut |
+---------------------------------------+
|[gingham, heart, , doorstop, red] |
...
|[red, retrospot, oven, glove] |
|[red, retrospot, plate] |
+---------------------------------------+
We can see some overlapping words in these documents, but these words provide at least a rough
topic-like representation. Now let’s input that into TF–IDF. To do this, we’re going to hash each word
and convert it to a numerical representation, and then weigh each word in the vocabulary according to
the inverse document frequency. Hashing is a process similar to what CountVectorizer does, but it is
irreversible: from the output index for a word, we cannot recover the input word (multiple words
might map to the same output index):
// in Scala
import org.apache.spark.ml.feature.{HashingTF, IDF}
val tf = new HashingTF()
.setInputCol("DescOut")
.setOutputCol("TFOut")
.setNumFeatures(10000)
val idf = new IDF()
.setInputCol("TFOut")
.setOutputCol("IDFOut")
.setMinDocFreq(2)
# in Python
from pyspark.ml.feature import HashingTF, IDF
tf = HashingTF()\
.setInputCol("DescOut")\
.setOutputCol("TFOut")\
.setNumFeatures(10000)
idf = IDF()\
.setInputCol("TFOut")\
.setOutputCol("IDFOut")\
.setMinDocFreq(2)
// in Scala
idf.fit(tf.transform(tfIdfIn)).transform(tf.transform(tfIdfIn)).show(false)
# in Python
idf.fit(tf.transform(tfIdfIn)).transform(tf.transform(tfIdfIn)).show(10, False)
While the output is too large to include here, notice that a certain value is assigned to “red” and that
this value appears in every document. Also note that this term is weighted extremely low because it
appears in every document. The output format is a sparse Vector we can subsequently input into a
machine learning model in a form like this:
(10000,[2591,4291,4456],[1.0116009116784799,0.0,0.0])
This vector is represented using three components: the total size of the hashed feature space, the hash
indices of the words appearing in the document, and the weighting of each of those terms. This is
similar to the CountVectorizer output.
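The hashing step itself can be sketched in plain Python. This is a toy illustration: it uses Python's built-in hash rather than the hash function Spark's HashingTF uses, so the actual bucket indices will differ, but the collision behavior it demonstrates is the same:

```python
def hashing_tf(tokens, num_features=16):
    """Toy hashing trick: map each token to a bucket via a hash function
    and count occurrences per bucket. Irreversible -- distinct words can
    collide into the same bucket, and the word cannot be recovered from
    its index."""
    counts = {}
    for tok in tokens:
        idx = hash(tok) % num_features
        counts[idx] = counts.get(idx, 0) + 1
    return counts

counts = hashing_tf(["red", "red", "rose"], num_features=16)
```

With only 16 buckets for an open-ended vocabulary, collisions are expected; raising numFeatures (10,000 in the example above) makes them rarer.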
Word2Vec
Word2Vec is a deep learning–based tool for computing a vector representation of a set of words. The
goal is to have similar words close to one another in this vector space, so we can then make
generalizations about the words themselves. This model is easy to train and use, and has been shown
to be useful in a number of natural language processing applications, including entity recognition,
disambiguation, parsing, tagging, and machine translation.
Word2Vec is notable for capturing relationships between words based on their semantics. For
example, if v~king, v~queen, v~man, and v~woman represent the vectors for those four words, then
we will often get a representation where v~king − v~man + v~woman ~= v~queen. To do this,
Word2Vec uses a technique called “skip-grams” to convert a sentence of words into a vector
representation (optionally of a specific size). It does this by building a vocabulary, and then for every
sentence, it removes a token and trains the model to predict the missing token in the n-gram
representation. Word2Vec works best with continuous, free-form text in the form of tokens.
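The famous analogy above can be illustrated with hand-picked toy vectors. These 2-D values are chosen by construction so the arithmetic works out; real Word2Vec vectors are learned from data and have many more dimensions:

```python
# Hypothetical 2-D word vectors chosen so that the analogy holds exactly.
king, man, woman, queen = [2.0, 3.0], [1.0, 1.0], [1.0, 2.0], [2.0, 4.0]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

# king - man + woman: equals the queen vector by construction
analogy = add(sub(king, man), woman)
```

In a trained model you would instead look for the vocabulary word whose vector is nearest (e.g., by cosine similarity) to the computed point.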
Here’s a simple example from the documentation:
// in Scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
// Input data: Each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
"Hi I heard about Spark".split(" "),
"I wish Java could use case classes".split(" "),
"Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")
// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
.setInputCol("text")
.setOutputCol("result")
.setVectorSize(3)
.setMinCount(0)
val model = word2Vec.fit(documentDF)
val result = model.transform(documentDF)
result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n")
}
# in Python
from pyspark.ml.feature import Word2Vec
# Input data: Each row is a bag of words from a sentence or document.
documentDF = spark.createDataFrame([
("Hi I heard about Spark".split(" "), ),
("I wish Java could use case classes".split(" "), ),
("Logistic regression models are neat".split(" "), )
], ["text"])
# Learn a mapping from words to Vectors.
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text",
outputCol="result")
model = word2Vec.fit(documentDF)
result = model.transform(documentDF)
for row in result.collect():
text, vector = row
print("Text: [%s] => \nVector: %s\n" % (", ".join(text), str(vector)))
Text: [Hi, I, heard, about, Spark] =>
Vector: [-0.008142343163490296,0.02051363289356232,0.03255096450448036]
Text: [I, wish, Java, could, use, case, classes] =>
Vector: [0.043090314205203734,0.035048123182994974,0.023512658663094044]
Text: [Logistic, regression, models, are, neat] =>
Vector: [0.038572299480438235,-0.03250147425569594,-0.01552378609776497]
Spark’s Word2Vec implementation includes a variety of tuning parameters that can be found in the
documentation.
Feature Manipulation
While nearly every transformer in ML manipulates the feature space in some way, the following
algorithms and tools are automated means of either expanding the input feature vectors or reducing
them to a lower number of dimensions.
PCA
Principal Components Analysis (PCA) is a mathematical technique for finding the most important
aspects of our data (the principal components). It changes the feature representation of our data by
creating a new set of features (“aspects”). Each new feature is a combination of the original features.
The power of PCA is that it can create a smaller set of more meaningful features to be input into your
model, at the potential cost of interpretability.
You’d want to use PCA if you have a large input dataset and want to reduce the total number of
features you have. This frequently comes up in text analysis where the entire feature space is massive
and many of the features are largely irrelevant. Using PCA, we can find the most important
combinations of features and only include those in our machine learning model. PCA takes a
parameter k, specifying the number of output features to create. Generally, this should be much
smaller than your input vectors’ dimension.
NOTE
Picking the right k is nontrivial and there’s no prescription we can give. Check out the relevant chapters in ESL and ISL
for more information.
Let’s train PCA with a k of 2:
// in Scala
import org.apache.spark.ml.feature.PCA
val pca = new PCA().setInputCol("features").setK(2)
pca.fit(scaleDF).transform(scaleDF).show(false)
# in Python
from pyspark.ml.feature import PCA
pca = PCA().setInputCol("features").setK(2)
pca.fit(scaleDF).transform(scaleDF).show(20, False)
+---+--------------+------------------------------------------+
|id |features |pca_7c5c4aa7674e__output |
+---+--------------+------------------------------------------+
|0 |[1.0,0.1,-1.0]|[0.0713719499248418,-0.4526654888147822] |
...
|1 |[3.0,10.1,3.0]|[-10.872398139848944,0.030962697060150646]|
+---+--------------+------------------------------------------+
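To see what PCA computes under the hood, here is a toy plain-Python sketch for the simplest case: 2-D points reduced to one dimension. Spark's PCA performs the same linear algebra (covariance, eigenvectors, projection) on n-dimensional feature vectors:

```python
import math

def pca_project_1d(points):
    """Toy PCA: center 2-D points, build the 2x2 covariance matrix, take
    its top eigenvector, and project each point onto that direction."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    xs = [p[0] - mx for p in points]
    ys = [p[1] - my for p in points]
    sxx = sum(x * x for x in xs) / n
    syy = sum(y * y for y in ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) / n
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]], closed form for 2x2
    lam = ((sxx + syy) + math.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2)) / 2
    v = (sxy, lam - sxx)                  # corresponding eigenvector
    norm = math.hypot(*v)
    if norm == 0:                         # covariance already diagonal
        v = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    else:
        v = (v[0] / norm, v[1] / norm)
    return [x * v[0] + y * v[1] for x, y in zip(xs, ys)]

# Points lying exactly on the line y = 2x: one component captures everything
proj = pca_project_1d([(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```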
Interaction
In some cases, you might have domain knowledge about specific variables in your dataset. For
example, you might know that a certain interaction between the two variables is an important variable
to include in a downstream estimator. The feature transformer Interaction allows you to create an
interaction between two variables manually. It just multiplies the two features together—something
that a typical linear model would not do for every possible pair of features in your data. This
transformer is currently only available directly in Scala but can be called from any language using the
RFormula. We recommend users just use RFormula instead of manually creating interactions.
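The computation behind an interaction term is just a per-row product, as in this sketch (the helper name is ours, for illustration):

```python
def add_interaction(row, i, j):
    """Append the product of features i and j to a feature row -- the
    manual interaction term that Spark's Interaction transformer (or an
    RFormula interaction term) generates for each row."""
    return row + [row[i] * row[j]]

features = [2.0, 3.0, 5.0]
expanded = add_interaction(features, 0, 1)   # appends 2.0 * 3.0 = 6.0
```

A linear model trained on the expanded row can now give the pairwise product its own weight, which it could not do from the raw features alone.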
Polynomial Expansion
Polynomial expansion is used to generate interaction variables of all the input columns. With
polynomial expansion, we specify to what degree we would like to see various interactions. For
example, for a degree-2 polynomial, Spark takes every value in our feature vector, multiplies it by
every value in the feature vector (including itself), and stores the results as features. For instance, if
we have two input features, a second-degree polynomial yields five output features (the two originals
plus their three distinct products); three input features yield nine output features, and a third-degree
polynomial on three input features yields 19, and so on. This transformation is useful when you want
to see interactions between particular features but aren’t necessarily sure about which interactions to
consider.
WARNING
Polynomial expansion can greatly increase your feature space, leading to both high computational costs and overfitting. Use
it with caution, especially for higher degrees.
Here’s an example of a second-degree polynomial:
// in Scala
import org.apache.spark.ml.feature.PolynomialExpansion
val pe = new PolynomialExpansion().setInputCol("features").setDegree(2)
pe.transform(scaleDF).show(false)
# in Python
from pyspark.ml.feature import PolynomialExpansion
pe = PolynomialExpansion().setInputCol("features").setDegree(2)
pe.transform(scaleDF).show()
+---+--------------+-----------------------------------------------------------+
|id |features |poly_9b2e603812cb__output |
+---+--------------+-----------------------------------------------------------+
|0 |[1.0,0.1,-1.0]|[1.0,1.0,0.1,0.1,0.010000000000000002,-1.0,-1.0,-0.1,1.0] |
...
|1 |[3.0,10.1,3.0]|[3.0,9.0,10.1,30.299999999999997,102.00999999999999,3.0... |
+---+--------------+-----------------------------------------------------------+
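To see where the output sizes come from, here is a plain-Python sketch that enumerates the monomials a polynomial expansion produces. The ordering of features differs from Spark's PolynomialExpansion, but the counts match, including the nine features shown in the output above:

```python
from itertools import combinations_with_replacement

def poly_expand(features, degree=2):
    """Toy polynomial expansion: every monomial of degree 1..degree over
    the input features (the constant term is excluded)."""
    out = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(features, d):
            prod = 1.0
            for f in combo:
                prod *= f
            out.append(prod)
    return out
```

The feature count grows combinatorially with both the number of inputs and the degree, which is exactly the blowup the warning above cautions against.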
Feature Selection
Often, you will have a large range of possible features and want to select a smaller subset to use for
training. For example, many features might be correlated, or using too many features might lead to
overfitting. This process is called feature selection. There are a number of ways to evaluate feature
importance once you’ve trained a model, but another option is to do some rough filtering beforehand.
Spark has some simple options for doing that, such as ChiSqSelector.
ChiSqSelector
ChiSqSelector leverages a statistical test to identify features that are not independent of the label
we are trying to predict, and drops the uncorrelated features. It’s often used with categorical data in
order to reduce the number of features you will input into your model, as well as to reduce the
dimensionality of text data (in the form of frequencies or counts). Since this method is based on the
Chi-Square test, there are several different ways we can pick the “best” features. The methods are
numTopFeatures, which is ordered by p-value; percentile, which takes a proportion of the input
features (instead of just the top N features); and fpr, which sets a cutoff p-value.
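To build intuition for the test itself, here is a toy plain-Python sketch (not Spark's implementation) of the chi-square statistic for one binary feature against a binary label; a larger statistic means the feature is less likely to be independent of the label, which is the signal ChiSqSelector ranks features by:

```python
def chi_square(observed_a, observed_b):
    """Chi-square statistic for a 2x2 contingency table. observed_a and
    observed_b are (count_label0, count_label1) for feature=0 and
    feature=1 respectively."""
    rows = [observed_a, observed_b]
    total = sum(sum(r) for r in rows)
    col_sums = [sum(r[c] for r in rows) for c in range(2)]
    stat = 0.0
    for r in rows:
        row_sum = sum(r)
        for c in range(2):
            expected = row_sum * col_sums[c] / total
            stat += (r[c] - expected) ** 2 / expected
    return stat

# A feature perfectly balanced across labels carries no signal (stat = 0);
# a strongly skewed one scores much higher.
```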
We will demonstrate this with the output of the CountVectorizer created earlier in this chapter:
// in Scala
import org.apache.spark.ml.feature.{ChiSqSelector, Tokenizer}
val tkn = new Tokenizer().setInputCol("Description").setOutputCol("DescOut")
val tokenized = tkn
.transform(sales.select("Description", "CustomerId"))
.where("CustomerId IS NOT NULL")
val prechi = fittedCV.transform(tokenized)
val chisq = new ChiSqSelector()
.setFeaturesCol("countVec")
.setLabelCol("CustomerId")
.setNumTopFeatures(2)
chisq.fit(prechi).transform(prechi)
.drop("customerId", "Description", "DescOut").show()
# in Python
from pyspark.ml.feature import ChiSqSelector, Tokenizer
tkn = Tokenizer().setInputCol("Description").setOutputCol("DescOut")
tokenized = tkn\
.transform(sales.select("Description", "CustomerId"))\
.where("CustomerId IS NOT NULL")
prechi = fittedCV.transform(tokenized)\
.where("CustomerId IS NOT NULL")
chisq = ChiSqSelector()\
.setFeaturesCol("countVec")\
.setLabelCol("CustomerId")\
.setNumTopFeatures(2)
chisq.fit(prechi).transform(prechi)\
.drop("customerId", "Description", "DescOut").show()
Advanced Topics
There are several advanced topics surrounding transformers and estimators. Here we touch on the
two most common: persisting transformers and writing custom ones.
Persisting Transformers
Once you’ve used an estimator to configure a transformer, it can be helpful to write it to disk and
simply load it when necessary (e.g., for use in another Spark session). We saw this in the previous
chapter when we persisted an entire pipeline. To persist a transformer individually, we use the write
method on the fitted transformer (or the standard transformer) and specify the location:
// in Scala
val fittedPCA = pca.fit(scaleDF)
fittedPCA.write.overwrite().save("/tmp/fittedPCA")
# in Python
fittedPCA = pca.fit(scaleDF)
fittedPCA.write().overwrite().save("/tmp/fittedPCA")
We can then load it back in:
// in Scala
import org.apache.spark.ml.feature.PCAModel
val loadedPCA = PCAModel.load("/tmp/fittedPCA")
loadedPCA.transform(scaleDF).show()
# in Python
from pyspark.ml.feature import PCAModel
loadedPCA = PCAModel.load("/tmp/fittedPCA")
loadedPCA.transform(scaleDF).show()
Writing a Custom Transformer
Writing a custom transformer can be valuable when you want to encode some of your own business
logic in a form that you can fit into an ML Pipeline, pass on to hyperparameter search, and so on. In
general you should try to use the built-in modules (e.g., SQLTransformer) as much as possible
because they are optimized to run efficiently. But sometimes we do not have that luxury. Let’s create a
simple tokenizer to demonstrate:
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable,
Identifiable}
import org.apache.spark.sql.types.{ArrayType, StringType, DataType}
import org.apache.spark.ml.param.{IntParam, ParamValidators}
class MyTokenizer(override val uid: String)
extends UnaryTransformer[String, Seq[String],
MyTokenizer] with DefaultParamsWritable {
def this() = this(Identifiable.randomUID("myTokenizer"))
val maxWords: IntParam = new IntParam(this, "maxWords",
"The max number of words to return.",
ParamValidators.gtEq(0))
def setMaxWords(value: Int): this.type = set(maxWords, value)
def getMaxWords: Integer = $(maxWords)
override protected def createTransformFunc: String => Seq[String] =
(inputString: String) => inputString.split("\\s").take($(maxWords))
override protected def validateInputType(inputType: DataType): Unit = {
require(
inputType == StringType, s"Bad input type: $inputType. Requires String.")
}
override protected def outputDataType: DataType = new ArrayType(StringType,
true)
}
// this will allow you to read it back in by using this object.
object MyTokenizer extends DefaultParamsReadable[MyTokenizer]
val myT = new MyTokenizer().setInputCol("someCol").setMaxWords(2)
myT.transform(Seq("hello world. This text won't show.").toDF("someCol")).show()
It is also possible to write a custom estimator where you must customize the transformation based on
the actual input data. However, this isn’t as common as writing a standalone transformer and is
therefore not included in this book. A good way to do this is to look at one of the simple estimators
we saw before and modify the code to suit your use case. A good place to start might be the
StandardScaler.
Conclusion
This chapter gave a whirlwind tour of many of the most common preprocessing transformations Spark
has available. There are several domain-specific ones we did not have enough room to cover (e.g.,
Discrete Cosine Transform), but you can find more information in the documentation. This area of
Spark is also constantly growing as the community develops new ones.
Another important aspect of this feature engineering toolkit is consistency. In the previous chapter we
covered the pipeline concept, an essential tool to package and train end-to-end ML workflows. In the
next chapter we will start going through the variety of machine learning tasks you may have and what
algorithms are available for each one.
Chapter 26. Classification
Classification is the task of predicting a label, category, class, or discrete variable given some input
features. The key difference from other ML tasks, such as regression, is that the output label has a
finite set of possible values (e.g., three classes).
Use Cases
Classification has many use cases, as we discussed in Chapter 24. Here are a few more to consider
as a reinforcement of the multitude of ways classification can be used in the real world.
Predicting credit risk
A financing company might look at a number of variables before offering a loan to a company or
individual. Whether or not to offer the loan is a binary classification problem.
News classification
An algorithm might be trained to predict the topic of a news article (sports, politics, business,
etc.).
Classifying human activity
By collecting data from sensors such as a phone accelerometer or smart watch, you can predict
the person’s activity. The output will be one of a finite set of classes (e.g., walking, sleeping,
standing, or running).
Types of Classification
Before we continue, let’s review several different types of classification.
Binary Classification
The simplest example of classification is binary classification, where there are only two labels you
can predict. One example is fraud analytics, where a given transaction can be classified as fraudulent
or not; or email spam, where a given email can be classified as spam or not spam.
Multiclass Classification
Beyond binary classification lies multiclass classification, where one label is chosen from more than
two distinct possible labels. A typical example is Facebook predicting the people in a given photo or
a meteorologist predicting the weather (rainy, sunny, cloudy, etc.). Note how there is always a finite set
of classes to predict; it’s never unbounded. This is also called multinomial classification.
Multilabel Classification
Finally, there is multilabel classification, where a given input can produce multiple labels. For
example, you might want to predict a book’s genre based on the text of the book itself. While this
could be multiclass, it’s probably better suited for multilabel because a book may fall into multiple
genres. Another example of multilabel classification is identifying the number of objects that appear
in an image. Note that in this example, the number of output predictions is not necessarily fixed, and
could vary from image to image.
Classification Models in MLlib
Spark has several models available for performing binary and multiclass classification out of the box.
The following models are available for classification in Spark:
Logistic regression
Decision trees
Random forests
Gradient-boosted trees
Spark does not support making multilabel predictions natively. In order to train a multilabel model,
you must train one model per label and combine them manually. Once manually constructed, there are
built-in tools that support measuring these kinds of models (discussed at the end of the chapter).
This chapter will cover the basics of each of these models by providing:
A simple explanation of the model and the intuition behind it
Model hyperparameters (the different ways we can initialize the model)
Training parameters (parameters that affect how the model is trained)
Prediction parameters (parameters that affect how predictions are made)
You can set the hyperparameters and training parameters in a ParamGrid as we saw in Chapter 24.
Model Scalability
Model scalability is an important consideration when choosing your model. In general, Spark has
great support for training large-scale machine learning models (note, these are large scale; on single-
node workloads there are a number of other tools that also perform well). Table 26-1 is a simple
model scalability scorecard to use to find the best model for your particular task (if scalability is your
core consideration). The actual scalability will depend on your configuration, machine size, and other
specifics but should make for a good heuristic.
Table 26-1. Model scalability reference
| Model                  | Features count  | Training examples | Output classes                  |
| Logistic regression    | 1 to 10 million | No limit          | Features x Classes < 10 million |
| Decision trees         | 1,000s          | No limit          | Features x Classes < 10,000s    |
| Random forest          | 10,000s         | No limit          | Features x Classes < 100,000s   |
| Gradient-boosted trees | 1,000s          | No limit          | Features x Classes < 10,000s    |
We can see that nearly all these models scale to large collections of input data, and there is ongoing
work to scale them even further. The reason there is no limit on the number of training examples is
that these models are trained using methods like stochastic gradient descent and L-BFGS, which are
optimized specifically for working with massive datasets and remove any constraints that might
exist on the number of training examples you would hope to learn on.
Let’s start looking at the classification models by loading in some data:
// in Scala
val bInput = spark.read.format("parquet").load("/data/binary-classification")
.selectExpr("features", "cast(label as double) as label")
# in Python
bInput = spark.read.format("parquet").load("/data/binary-classification")\
.selectExpr("features", "cast(label as double) as label")
NOTE
Like our other advanced analytics chapters, this one cannot teach you the mathematical underpinnings of every model. See
Chapter 4 in ISL and ESL for a review of classification.
Logistic Regression
Logistic regression is one of the most popular methods of classification. It is a linear method that
combines each of the individual inputs (or features) with specific weights (generated during the
training process) to produce a probability of belonging to a particular class. These weights are
helpful because they are good representations of feature importance: if you have a large weight, you
can assume that variations in that feature have a significant effect on the outcome (assuming you
performed normalization). A smaller weight means the feature is less likely to be important.
See ISL 4.3 and ESL 4.4 for more information.
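The mechanics of that paragraph fit in a few lines of plain Python. This is an illustrative sketch only (in practice Spark learns the weights and intercept during training):

```python
import math

def predict_probability(features, weights, intercept):
    """The core of logistic regression: a weighted sum of the features
    plus an intercept, pushed through the sigmoid function to yield a
    probability between 0 and 1."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# A larger weighted sum maps to a probability closer to 1
p = predict_probability([1.0, 2.0], [0.5, -0.25], 0.1)
```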
Model Hyperparameters
Model hyperparameters are configurations that determine the basic structure of the model itself. The
following hyperparameters are available for logistic regression:
family
Can be multinomial (two or more distinct labels; multiclass classification) or binary (only two
distinct labels; binary classification).
elasticNetParam
A floating-point value from 0 to 1. This parameter specifies the mix of L1 and L2 regularization
according to elastic net regularization (which is a linear combination of the two). Your choice of
L1 or L2 depends a lot on your particular use case, but the intuition is as follows: L1
regularization (a value of 1) will create sparsity in the model because certain feature weights (those
of little consequence to the output) will become exactly zero. For this reason, it can be used as a
simple feature-selection method. On the other hand, L2 regularization (a value of 0) does not
create sparsity because the corresponding weights for particular features will only be driven
toward zero, never completely reaching it. ElasticNet gives us the best of both worlds: we can
choose a value between 0 and 1 to specify a mix of L1 and L2 regularization. For the most
part, you should tune this by testing different values.
fitIntercept
Can be true or false. This hyperparameter determines whether or not to fit the intercept, the
arbitrary number that is added to the linear combination of inputs and weights of the model.
Typically you will want to fit the intercept if you haven’t normalized your training data.
regParam
A value ≥ 0 that determines how much weight to give to the regularization term in the objective
function. Choosing a value here is again going to be a function of the noise and dimensionality in our
dataset. In a pipeline, try a wide range of values (e.g., 0, 0.01, 0.1, 1).
standardization
Can be true or false; determines whether to standardize the inputs before passing them into the model.
See Chapter 25 for more information.
Training Parameters
Training parameters are used to specify how we perform our training. Here are the training
parameters for logistic regression.
maxIter
Total number of iterations over the data before stopping. Changing this parameter probably won’t
change your results a ton, so it shouldn’t be the first parameter you look to adjust. The default is
100.
tol
This value specifies a threshold by which changes in parameters show that we optimized our
weights enough, and can stop iterating. It lets the algorithm stop before maxIter iterations. The
default value is 1.0E-6. This also shouldn’t be the first parameter you look to tune.
weightCol
The name of a weight column used to weigh certain rows more than others. This can be a useful
tool if you have some other measure of how important a particular training example is and have a
weight associated with it. For example, you might have 10,000 examples where you know that
some labels are more accurate than others. You can weigh the labels you know are correct more
than the ones you don’t.
Prediction Parameters
These parameters help determine how the model should actually be making predictions at prediction
time, but do not affect training. Here are the prediction parameters for logistic regression:
threshold
A Double in the range of 0 to 1. This parameter is the probability threshold for when a given
class should be predicted. You can tune this parameter according to your requirements to balance
between false positives and false negatives. For instance, if a mistaken prediction would be
costly, you might want to make its prediction threshold very high.
thresholds
This parameter lets you specify an array of threshold values for each class when using multiclass
classification. It works similarly to the single threshold parameter described previously.
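Thresholding itself is a one-line decision, sketched here in plain Python for intuition (Spark applies the equivalent rule internally when producing the prediction column):

```python
def classify(probability, threshold=0.5):
    """Turn the model's probability for the positive class into a label.
    Raising the threshold trades false positives for false negatives."""
    return 1.0 if probability >= threshold else 0.0

# The same probability can yield different labels under different thresholds
label_default = classify(0.7)                 # 1.0 with the default 0.5
label_strict = classify(0.7, threshold=0.9)   # 0.0 under a stricter cutoff
```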
Example
Here’s a simple example using the LogisticRegression model. Notice how we didn’t specify any
parameters because we’ll leverage the defaults and our data conforms to the proper column naming.
In practice, you probably won’t need to change many of the parameters:
// in Scala
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression()
println(lr.explainParams()) // see all parameters
val lrModel = lr.fit(bInput)
# in Python
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression()
print(lr.explainParams()) # see all parameters
lrModel = lr.fit(bInput)
Once the model is trained you can get information about the model by taking a look at the coefficients
and the intercept. The coefficients correspond to the individual feature weights (each feature weight
is multiplied by its respective feature to compute the prediction), while the intercept is the constant
term added to that combination (if we chose to fit one when specifying the model). Seeing the
coefficients can be helpful for inspecting the model that you built and comparing how features affect
the prediction:
// in Scala
println(lrModel.coefficients)
println(lrModel.intercept)
# in Python
print(lrModel.coefficients)
print(lrModel.intercept)
For a multinomial model (the current one is binary), lrModel.coefficientMatrix and
lrModel.interceptVector can be used to get the coefficients and intercept. These will return
Matrix and Vector types representing the values or each of the given classes.
Model Summary
Logistic regression provides a model summary that gives you information about the final, trained
model. This is analogous to the same types of summaries we see in many R language machine learning
packages. The model summary is currently only available for binary logistic regression problems, but
multiclass summaries will likely be added in the future. Using the binary summary, we can get all
sorts of information about the model itself including the area under the ROC curve, the f measure by
threshold, the precision, the recall, the recall by thresholds, and the ROC curve. Note that for the area
under the curve, instance weighting is not taken into account, so if you wanted to see how you
performed on the values you weighed more highly, you’d have to do that manually. This will probably
change in future Spark versions. You can see the summary using the following APIs:
// in Scala
import org.apache.spark.ml.classification.BinaryLogisticRegressionSummary
val summary = lrModel.summary
val bSummary = summary.asInstanceOf[BinaryLogisticRegressionSummary]
println(bSummary.areaUnderROC)
bSummary.roc.show()
bSummary.pr.show()
# in Python
summary = lrModel.summary
print(summary.areaUnderROC)
summary.roc.show()
summary.pr.show()
The speed at which the model descends to the final result is shown in the objective history. We can
access this through the objective history on the model summary:
summary.objectiveHistory
This is an array of doubles that specify how, over each training iteration, we are performing with
respect to our objective function. This information is helpful to see if we have sufficient iterations or
need to be tuning other parameters.
Decision Trees
Decision trees are one of the more friendly and interpretable models for performing classification
because they’re similar to simple decision models that humans use quite often. For example, if you
have to predict whether or not someone will eat ice cream when offered, a good feature might be
whether or not that individual likes ice cream. In pseudocode, if person.likes("ice_cream"), they will
eat ice cream; otherwise, they won’t eat ice cream. A decision tree creates this type of structure with
all the inputs and follows a set of branches when it comes time to make a prediction. This makes it a
great starting point model because its easy to reason about, easy to inspect, and makes very few
assumptions about the structure of the data. In short, rather than trying to train coeffiecients in order to
model a function, it simply creates a big tree of decisions to follow at prediction time. This model
also supports multiclass classification and provides outputs as predictions and probabilities in two
different columns.
While this model is usually a great start, it does come at a cost. It can overfit data extremely quickly.
By that we mean that, unrestrained, the decision tree will create a pathway from the start based on
every single training example. That means it encodes all of the information in the training set in the
model. This is bad because then the model won't generalize to new data (you will see poor test set
prediction performance). However, there are a number of ways to try and rein in the model by
limiting its branching structure (e.g., limiting its height) to get good predictive power.
See ISL 8.1 and ESL 9.2 for more information.
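The ice cream pseudocode above, written out as a tiny hand-built tree in plain Python (not Spark code; the second feature is invented for illustration, and a real tree would learn these splits from data):

```python
def predict_eats_ice_cream(person):
    """A hand-built two-level decision 'tree'. Each if/else is one split;
    a trained DecisionTreeClassifier learns such splits automatically."""
    if person["likes_ice_cream"]:
        if person["is_lactose_intolerant"]:
            return "won't eat"
        return "will eat"
    return "won't eat"

print(predict_eats_ice_cream({"likes_ice_cream": True,
                              "is_lactose_intolerant": False}))  # will eat
```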
Model Hyperparameters
There are many different ways to configure and train decision trees. Here are the hyperparameters
that Spark's implementation supports:
maxDepth
Since we’re training a tree, it can be helpful to specify a max depth in order to avoid overfitting to
the dataset (in the extreme, every row ends up as its own leaf node). The default is 5.
maxBins
In decision trees, continuous features are converted into categorical features, and maxBins
determines how many bins should be created from continuous features. More bins gives a higher
level of granularity. The value must be greater than or equal to 2 and greater than or equal to the
number of categories in any categorical feature in your dataset. The default is 32.
impurity
To build up a “tree” you need to configure when the model should branch. Impurity represents the
metric (information gain) to determine whether or not the model should split at a particular leaf
node. This parameter can be set to either "entropy" or "gini" (default), two commonly used
impurity metrics.
minInfoGain
This parameter determines the minimum information gain that can be used for a split. A higher
value can prevent overfitting. This is largely something that needs to be determined from testing
out different variations of the decision tree model. The default is zero.
minInstancesPerNode
This parameter determines the minimum number of training instances that need to end in a
particular node. Think of this as another manner of controlling max depth. We can prevent
overfitting by limiting depth or we can prevent it by specifying that at minimum a certain number
of training values need to end up in a particular leaf node. If it’s not met we would “prune” the
tree until that requirement is met. A higher value can prevent overfitting. The default is 1, but this
can be any value greater than or equal to 1.
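The "entropy" and "gini" impurity metrics described above have simple closed forms. Here is a plain-Python sketch of both (not Spark code), computed from the class proportions at a node:

```python
import math

def gini(class_probs):
    """Gini impurity: sum of p * (1 - p) over the class proportions at a node."""
    return sum(p * (1 - p) for p in class_probs)

def entropy(class_probs):
    """Shannon entropy: -sum of p * log2(p), with 0 * log(0) taken as 0."""
    return -sum(p * math.log2(p) for p in class_probs if p > 0)

# A pure node has zero impurity under both metrics;
# a 50/50 node is maximally impure for two classes.
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0
```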
Training Parameters
These are configurations we specify in order to manipulate how we perform our training. Here is the
training parameter for decision trees:
checkpointInterval
Checkpointing is a way to save the model's work over the course of training so that if nodes in the
cluster crash for some reason, you don't lose your work. A value of 10 means the model will get
checkpointed every 10 iterations. Set this to -1 to turn off checkpointing. This parameter needs to
be set together with a checkpointDir (a directory to checkpoint to) and with
useNodeIdCache=true. Consult the Spark documentation for more information on checkpointing.
Prediction Parameters
There is only one prediction parameter for decision trees: thresholds. Refer to the explanation for
thresholds under "Logistic Regression".
Here’s a minimal but complete example of using a decision tree classifier:
// in Scala
import org.apache.spark.ml.classification.DecisionTreeClassifier
val dt = new DecisionTreeClassifier()
println(dt.explainParams())
val dtModel = dt.fit(bInput)
# in Python
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier()
print(dt.explainParams())
dtModel = dt.fit(bInput)
Random Forest and Gradient-Boosted Trees
These methods are extensions of the decision tree. Rather than training one tree on all of the data, you
train multiple trees on varying subsets of the data. The intuition behind doing this is that various
decision trees will become "experts" in one particular part of the data while others become experts in
other parts. By combining these various experts, you then get a "wisdom of the crowds" effect, where
the group's performance exceeds any individual's. In addition, these methods can help prevent overfitting.
Random forests and gradient-boosted trees are two distinct methods for combining decision trees. In
random forests, we simply train a lot of trees and then average their response to make a prediction.
With gradient-boosted trees, each tree makes a weighted prediction (such that some trees have more
predictive power for some classes than others). They have largely the same parameters, which we
note below. One current limitation is that gradient-boosted trees only support binary labels.
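The two combination strategies can be sketched in a few lines of plain Python (not Spark code; the tree outputs and weights below are invented for illustration):

```python
def forest_predict(tree_preds):
    """Random-forest-style combination: a plain average of the trees' outputs."""
    return sum(tree_preds) / len(tree_preds)

def gbt_predict(tree_preds, weights):
    """Gradient-boosted-style combination: a weighted sum of the trees' outputs."""
    return sum(w * p for w, p in zip(weights, tree_preds))

print(forest_predict([1.0, 0.0, 1.0]))                 # simple average
print(gbt_predict([1.0, 0.0, 1.0], [0.5, 0.3, 0.2]))   # 0.7
```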
NOTE
There are several popular tools for learning tree-based models. For example, the XGBoost library provides an integration
package for Spark that can be used to run it on Spark.
See ISL 8.2 and ESL 10.1 for more information on these tree ensemble models.
Model Hyperparameters
Random forests and gradient-boosted trees provide all of the same model hyperparameters supported
by decision trees. In addition, they add several of their own.
Random forest only
numTrees
The total number of trees to train.
featureSubsetStrategy
This parameter determines how many features should be considered for splits. This can be a
variety of different values including "auto", "all", "sqrt", "log2", or a number "n." When your
input is a fraction n in the range (0, 1.0], the model will use n * number of features during training.
When n is an integer in the range (1, number of features), the model will use n features during
training. There's no one-size-fits-all solution here, so it's worth experimenting with different values
in your pipeline.
Gradient-boosted trees (GBT) only
lossType
This is the loss function for gradient-boosted trees to minimize during training. Currently, only
logistic loss is supported.
maxIter
Total number of iterations over the data before stopping. Changing this probably won't change
your results a ton, so it shouldn't be the first parameter you look to adjust. The default is 100.
stepSize
This is the learning rate for the algorithm. A larger step size means that larger jumps are made
between training iterations. This can help in the optimization process and is something that should
be tested in training. The default is 0.1 and this can be any value from 0 to 1.
Training Parameters
There is only one training parameter for these models, checkpointInterval. Refer back to the
explanation under "Decision Trees" for details on checkpointing.
Prediction Parameters
These models have the same prediction parameters as decision trees. Consult the prediction
parameters under that model for more information.
Here’s a short code example of using each of these classifiers:
// in Scala
import org.apache.spark.ml.classification.RandomForestClassifier
val rfClassifier = new RandomForestClassifier()
println(rfClassifier.explainParams())
val trainedModel = rfClassifier.fit(bInput)
// in Scala
import org.apache.spark.ml.classification.GBTClassifier
val gbtClassifier = new GBTClassifier()
println(gbtClassifier.explainParams())
val trainedModel = gbtClassifier.fit(bInput)
# in Python
from pyspark.ml.classification import RandomForestClassifier
rfClassifier = RandomForestClassifier()
print(rfClassifier.explainParams())
trainedModel = rfClassifier.fit(bInput)
# in Python
from pyspark.ml.classification import GBTClassifier
gbtClassifier = GBTClassifier()
print(gbtClassifier.explainParams())
trainedModel = gbtClassifier.fit(bInput)
Naive Bayes
Naive Bayes classifiers are a collection of classifiers based on Bayes' theorem. The core assumption
behind the models is that all features in your data are independent of one another. Naturally, strict
independence is a bit naive, but even if this is violated, useful models can still be produced. Naive
Bayes classifiers are commonly used in text or document classification, although they can be used as
more general-purpose classifiers as well. There are two different model types: either a multivariate
Bernoulli model, where indicator variables represent the existence of a term in a document; or the
multinomial model, where the total counts of terms are used.
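The difference between the two model types comes down to how a document is turned into a feature vector. A plain-Python sketch (not Spark code; the toy vocabulary and document are invented for illustration):

```python
from collections import Counter

vocab = ["spark", "data", "python"]
doc = "spark data spark spark".split()

counts = Counter(doc)
# Multinomial model: the feature vector holds term counts.
multinomial = [counts[w] for w in vocab]
# Bernoulli model: the feature vector holds 0/1 presence indicators.
bernoulli = [1 if counts[w] > 0 else 0 for w in vocab]

print(multinomial)  # [3, 1, 0]
print(bernoulli)    # [1, 1, 0]
```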
One important note when it comes to Naive Bayes is that all input features must be non-negative.
See ISL 4.4 and ESL 6.6 for more background on these models.
Model Hyperparameters
These are configurations we specify to determine the basic structure of the models:
modelType
Either “bernoulli” or “multinomial.” See the previous section for more information on this choice.
weightCol
Allows weighing different data points differently. Refer back to "Training Parameters" for the
explanation of this hyperparameter.
Training Parameters
These are configurations that specify how we perform our training:
smoothing
This determines the amount of regularization that should take place using additive smoothing. This
helps smooth out categorical data and avoid overfitting on the training data by changing the
expected probability for certain classes. The default value is 1.
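Additive (Laplace) smoothing itself is a one-line formula. Here is a plain-Python sketch of the adjustment (not Spark code), showing how an unseen combination still receives a small nonzero probability:

```python
def smoothed_prob(count, total, num_classes, smoothing=1.0):
    """Additive (Laplace) smoothing: pretend every class was seen
    'smoothing' extra times, so no estimated probability is exactly zero."""
    return (count + smoothing) / (total + smoothing * num_classes)

# An unseen class among 10 observations and 2 classes still gets 1/12,
# rather than the raw estimate of 0/10 = 0.
print(smoothed_prob(0, 10, 2))   # 0.0833...
print(smoothed_prob(8, 10, 2))   # 0.75
```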
Prediction Parameters
Naive Bayes shares the same prediction parameter, thresholds, as all of our other models. Refer
back to the previous explanation of thresholds to see how to use this.
Here’s an example of using a Naive Bayes classifier:
// in Scala
import org.apache.spark.ml.classification.NaiveBayes
val nb = new NaiveBayes()
println(nb.explainParams())
val trainedModel = nb.fit(bInput.where("label != 0"))
# in Python
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
print(nb.explainParams())
trainedModel = nb.fit(bInput.where("label != 0"))
WARNING
Note that in this example dataset, we have features that have negative values. In this case, the rows with negative features
correspond to rows with label "0". Therefore we’re just going to filter them out (via the label) instead of processing them
further, to demonstrate the Naive Bayes API.
Evaluators for Classification and Automating Model Tuning
As we saw in Chapter 24, evaluators allow us to specify the metric of success for our model. An
evaluator doesn’t help too much when it stands alone; however, when we use it in a pipeline, we can
automate a grid search of our various parameters of the models and transformers—trying all
combinations of the parameters to see which ones perform the best. Evaluators are most useful in this
pipeline and parameter grid context. For classification, there are two evaluators, and they expect two
columns: a predicted label from the model and a true label. For binary classification we use the
BinaryClassificationEvaluator. This supports optimizing for two different metrics:
“areaUnderROC” and “areaUnderPR”. For multiclass classification, we need to use the
MulticlassClassificationEvaluator, which supports optimizing for “f1”, “weightedPrecision”,
“weightedRecall”, and “accuracy”.
To use evaluators, we build up our pipeline, specify the parameters we would like to test, and then
run it and see the results. See Chapter 24 for a code example.
Detailed Evaluation Metrics
MLlib also contains tools that let you evaluate multiple classification metrics at once. Unfortunately,
these metrics classes have not been ported over to Spark’s DataFrame-based ML package from the
underlying RDD framework. So, at the time of this writing, you still have to create an RDD to use
these. In the future, this functionality will likely be ported to DataFrames and the following may no
longer be the best way to see metrics (although you will still be able to use these APIs).
There are three different classification metrics we can use:
Binary classification metrics
Multiclass classification metrics
Multilabel classification metrics
All of these measures follow the same approximate style. We compare generated outputs with true
values, and the model calculates all of the relevant metrics for us. Then we can query the object for
the values of each of the metrics:
// in Scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
val out = model.transform(bInput)
.select("prediction", "label")
.rdd.map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
val metrics = new BinaryClassificationMetrics(out)
# in Python
from pyspark.mllib.evaluation import BinaryClassificationMetrics
out = model.transform(bInput)\
.select("prediction", "label")\
.rdd.map(lambda x: (float(x[0]), float(x[1])))
metrics = BinaryClassificationMetrics(out)
Once we’ve done that, we can see typical classification success metrics on this metric’s object using
a similar API to the one we saw with logistic regression:
// in Scala
metrics.areaUnderPR
metrics.areaUnderROC
println("Receiver Operating Characteristic")
metrics.roc.toDF().show()
# in Python
print(metrics.areaUnderPR)
print(metrics.areaUnderROC)
print("Receiver Operating Characteristic")
metrics.roc.toDF().show()
One-vs-Rest Classifier
There are some MLlib models that don’t support multiclass classification. In these cases, users can
leverage a one-vs-rest classifier in order to perform multiclass classification given only a binary
classifier. The intuition behind this is that for every class you hope to predict, the one-vs-rest
classifier will turn the problem into a binary classification problem by isolating one class as the
target class and grouping all of the other classes into one. Thus the prediction of the class becomes
binary (is it this class or not this class?).
One-vs-rest is implemented as an estimator. It takes an instance of a base binary classifier
and creates a binary classification problem for each of the classes. The classifier for class i is
trained to predict whether the label is i or not, distinguishing class i from all other classes.
Predictions are done by evaluating each binary classifier and the index of the most confident
classifier is output as the label.
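The final prediction step reduces to an argmax over the per-class confidence scores. A plain-Python sketch (not Spark code; the scores below stand in for the outputs of k hypothetical trained binary classifiers):

```python
def one_vs_rest_predict(binary_scores):
    """Given one confidence score per 'class i vs. rest' classifier,
    output the index of the most confident classifier as the label."""
    return max(range(len(binary_scores)), key=lambda i: binary_scores[i])

# Scores from three hypothetical binary classifiers:
print(one_vs_rest_predict([0.2, 0.9, 0.4]))  # 1
```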
See the Spark documentation for a nice example of the use of one-vs-rest.
Multilayer Perceptron
The multilayer perceptron is a classifier based on neural networks with a configurable number of
layers (and layer sizes). We will discuss it in Chapter 31.
Conclusion
In this chapter we covered the majority of tools Spark provides for classification: predicting one of a
finite set of labels for each data point based on its features. In the next chapter, we’ll look at
regression, where the required output is continuous instead of categorical.
Chapter 27. Regression
Regression is a logical extension of classification. Rather than predicting a label from a finite set
of values, regression is the act of predicting a real number (or continuous variable) from a set of
features (represented as numbers).
Regression can be harder than classification because, from a mathematical perspective, there are an
infinite number of possible output values. Furthermore, we aim to optimize some metric of error
between the predicted and true value, as opposed to an accuracy rate. Aside from that, regression and
classification are fairly similar. For this reason, we will see a lot of the same underlying concepts
applied to regression as we did with classification.
Use Cases
The following is a small set of regression use cases that can get you thinking about potential
regression problems in your own domain:
Predicting movie viewership
Given information about a movie and the movie-going public, such as how many people have
watched the trailer or shared it on social media, you might want to predict how many people are
likely to watch the movie when it comes out.
Predicting company revenue
Given a current growth trajectory, the market, and seasonality, you might want to predict how
much revenue a company will gain in the future.
Predicting crop yield
Given information about the particular area in which a crop is grown, as well as the current
weather throughout the year, you might want to predict the total crop yield for a particular plot of
land.
Regression Models in MLlib
There are several fundamental regression models in MLlib. Some of these models are carryovers
from Chapter 26. Others are only relevant to the regression problem domain. This list is current as of
Spark 2.2 but will grow:
Linear regression
Generalized linear regression
Isotonic regression
Decision trees
Random forest
Gradient-boosted trees
Survival regression
This chapter will cover the basics of each of these particular models by providing:
A simple explanation of the model and the intuition behind the algorithm
Model hyperparameters (the different ways that we can initialize the model)
Training parameters (parameters that affect how the model is trained)
Prediction parameters (parameters that affect how predictions are made)
You can search over the hyperparameters and training parameters using a ParamGrid, as we saw in
Chapter 24.
Model Scalability
The regression models in MLlib all scale to large datasets. Table 27-1 is a simple model scalability
scorecard that will help you in choosing the best model for your particular task (if scalability is your
core consideration). The actual limits will depend on your configuration, machine size, and other factors.
Table 27-1. Regression scalability reference
Model Number features Training examples
Linear regression 1 to 10 million No limit
Generalized linear regression 4,096 No limit
Isotonic regression N/A Millions
Decision trees 1,000s No limit
Random forest 10,000s No limit
Gradient-boosted trees 1,000s No limit
Survival regression 1 to 10 million No limit
NOTE
Like our other advanced analytics chapters, this one cannot teach you the mathematical underpinnings of every model. See
Chapter 3 in ISL and ESL for a review of regression.
Let’s read in some sample data that we will use throughout the chapter:
// in Scala
val df = spark.read.load("/data/regression")
# in Python
df = spark.read.load("/data/regression")
Linear Regression
Linear regression assumes that a linear combination of your input features (the sum of each feature
multiplied by a weight) results in the output, along with an amount of Gaussian error. This linear
assumption (along with Gaussian error) does not always hold true, but it does make for a simple,
interpretable model that’s hard to overfit. Like logistic regression, Spark implements ElasticNet
regularization for this, allowing you to mix L1 and L2 regularization.
See ISL 3.2 and ESL 3.2 for more information.
Model Hyperparameters
Linear regression has the same model hyperparameters as logistic regression. See Chapter 26 for
more information.
Training Parameters
Linear regression also shares all of the same training parameters from logistic regression. Refer back
to Chapter 26 for more on this topic.
Example
Here’s a short example of using linear regression on our sample dataset:
// in Scala
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression().setMaxIter(10).setRegParam(0.3)
  .setElasticNetParam(0.8)
println(lr.explainParams())
val lrModel = lr.fit(df)
# in Python
from pyspark.ml.regression import LinearRegression
lr = LinearRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
print(lr.explainParams())
lrModel = lr.fit(df)
Training Summary
Just as in logistic regression, we get detailed training information back from our model. The summary
method is a simple shorthand for accessing these metrics. It reports several conventional metrics for
measuring the success of a regression model, allowing you to see how well your model is actually
fitting the line.
The summary method returns a summary object with several fields. Let’s go through these in turn. The
residuals are simply the differences between the labels and the predicted values for each of the points
that we input into the model. The objective history shows how our training is going at every iteration.
The root mean squared error is a measure of how well our line is fitting the data, determined by
looking at the distance between each predicted value and the actual value in the data. The R-squared
variable is a measure of the proportion of the variance of the predicted variable that is captured by
the model.
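As an illustration of these two metrics, here is a plain-Python computation of RMSE and R-squared on a few invented label/prediction pairs (not Spark code):

```python
import math

labels = [3.0, -0.5, 2.0, 7.0]
preds = [2.5, 0.0, 2.0, 8.0]

# Residuals: the difference between each label and its prediction.
residuals = [y - p for y, p in zip(labels, preds)]

# Root mean squared error: sqrt of the average squared residual.
rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))

# R-squared: 1 minus (residual sum of squares / total sum of squares).
mean_y = sum(labels) / len(labels)
ss_res = sum(r * r for r in residuals)
ss_tot = sum((y - mean_y) ** 2 for y in labels)
r2 = 1 - ss_res / ss_tot

print(round(rmse, 4), round(r2, 4))  # 0.6124 0.9486
```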
There are a number of metrics and summary information that may be relevant to your use case. This
section demonstrates the API, but does not comprehensively cover every metric (consult the API
documentation for more information).
Here are some of the attributes of the model summary for linear regression:
// in Scala
val summary = lrModel.summary
summary.residuals.show()
summary.objectiveHistory.toSeq.toDF.show()
println(summary.rootMeanSquaredError)
println(summary.r2)
# in Python
summary = lrModel.summary
summary.residuals.show()
print(summary.totalIterations)
print(summary.objectiveHistory)
print(summary.rootMeanSquaredError)
print(summary.r2)
Generalized Linear Regression
The standard linear regression that we saw in this chapter is actually a part of a family of algorithms
called generalized linear regression. Spark has two implementations of this algorithm. One is
optimized for working with very large sets of features (the simple linear regression covered
previously in this chapter), while the other is more general, includes support for more algorithms, and
doesn’t currently scale to large numbers of features.
The generalized form of linear regression gives you more fine-grained control over what kind of
regression model you use. For instance, these allow you to select the expected noise distribution from
a variety of families, including Gaussian (linear regression), binomial (logistic regression), Poisson
(Poisson regression), and gamma (gamma regression). The generalized models also support setting a
link function that specifies the relationship between the linear predictor and the mean of the
distribution function. Table 27-2 shows the available link functions for each family.
Table 27-2. Regression families, response
types, and link functions
Family Response type Supported links
Gaussian Continuous Identity*, Log, Inverse
Binomial Binary Logit*, Probit, CLogLog
Poisson Count Log*, Identity, Sqrt
Gamma Continuous Inverse*, Identity, Log
Tweedie Zero-inflated continuous Power link function
The asterisk signifies the canonical link function for each family.
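As a plain-Python illustration of what a link function does (not Spark code), here are the canonical links from Table 27-2, each mapping the mean of the response onto the scale of the linear predictor:

```python
import math

# Canonical link functions: mean of the response -> linear predictor scale.
links = {
    "identity": lambda mu: mu,                       # Gaussian
    "logit":    lambda mu: math.log(mu / (1 - mu)),  # Binomial
    "log":      lambda mu: math.log(mu),             # Poisson
    "inverse":  lambda mu: 1.0 / mu,                 # Gamma
}

# A 50% probability maps to a linear predictor of 0 under the logit link.
print(links["logit"](0.5))   # 0.0
print(links["inverse"](4.0)) # 0.25
```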
See ISL 3.2 and ESL 3.2 for more information on generalized linear models.
WARNING
A fundamental limitation as of Spark 2.2 is that generalized linear regression only accepts a maximum of 4,096 features for
inputs. This will likely change for later versions of Spark, so be sure to refer to the documentation.
Model Hyperparameters
These are configurations that we specify to determine the basic structure of the model itself. In
addition to fitIntercept and regParam (mentioned in “Regression”), generalized linear regression
includes several other hyperparameters:
family
A description of the error distribution to be used in the model. Supported options are “gaussian”
(default), “binomial”, “poisson”, “gamma”, and “tweedie”.
link
The name of the link function, which provides the relationship between the linear predictor and the
mean of the distribution function. Supported options are cloglog, probit, logit, inverse, sqrt,
identity, and log (default: identity).
solver
The solver algorithm to be used for optimization. The only currently supported solver is irls
(iteratively reweighted least squares).
variancePower
The power in the variance function of the Tweedie distribution, which characterizes the
relationship between the variance and mean of the distribution. Only applicable to the Tweedie
family. Supported values are 0 and [1, Infinity). The default is 0.
linkPower
The index in the power link function for the Tweedie family.
Training Parameters
The training parameters are the same that you will find for logistic regression. Consult Chapter 26 for
more information.
Prediction Parameters
This model adds one prediction parameter:
linkPredictionCol
A column name that will hold the output of our link function for each prediction.
Example
Here’s an example of using GeneralizedLinearRegression:
// in Scala
import org.apache.spark.ml.regression.GeneralizedLinearRegression
val glr = new GeneralizedLinearRegression()
.setFamily("gaussian")
.setLink("identity")
.setMaxIter(10)
.setRegParam(0.3)
.setLinkPredictionCol("linkOut")
println(glr.explainParams())
val glrModel = glr.fit(df)
# in Python
from pyspark.ml.regression import GeneralizedLinearRegression
glr = GeneralizedLinearRegression()\
.setFamily("gaussian")\
.setLink("identity")\
.setMaxIter(10)\
.setRegParam(0.3)\
.setLinkPredictionCol("linkOut")
print(glr.explainParams())
glrModel = glr.fit(df)
Training Summary
As for the simple linear model in the previous section, the training summary provided by Spark for
the generalized linear model can help you ensure that your model is a good fit for the data that you
used as the training set. It is important to note that this does not replace running your algorithm against
a proper test set, but it can provide more information. This information includes a number of different
potential metrics for analyzing the fit of your algorithm, including some of the most common success
metrics:
R squared
The coefficient of determination; a measure of fit.
The residuals
The difference between the label and the predicted value.
Be sure to inspect the summary object on the model to see all the available methods.
Decision Trees
Decision trees as applied to regression work fairly similarly to decision trees applied to
classification. The main difference is that decision trees for regression output a single number per
leaf node instead of a label (as we saw with classification). The same interpretability properties and
model structure still apply. In short, rather than trying to train coefficients to model a function,
decision tree regression simply creates a tree to predict the numerical outputs. This is of significant
consequence because unlike generalized linear regression, we can predict nonlinear functions in the
input data. This also creates a significant risk of overfitting the data, so we need to be careful when
tuning and evaluating these models.
We also covered decision trees in Chapter 26 (refer to “Decision Trees”). For more information on
this topic, consult ISL 8.1 and ESL 9.2.
Model Hyperparameters
The model hyperparameters that apply to decision trees for regression are the same as those for
classification except for a slight change to the impurity parameter. See Chapter 26 for more
information on the other hyperparameters:
impurity
The impurity parameter represents the metric (information gain) for whether or not the model
should split at a particular leaf node with a particular value or keep it as is. The only metric
currently supported for regression trees is “variance.”
Training Parameters
In addition to hyperparameters, classification and regression trees also share the same training
parameters. See “Training Parameters” for these parameters.
Example
Here’s a short example of using a decision tree regressor:
// in Scala
import org.apache.spark.ml.regression.DecisionTreeRegressor
val dtr = new DecisionTreeRegressor()
println(dtr.explainParams())
val dtrModel = dtr.fit(df)
# in Python
from pyspark.ml.regression import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
print(dtr.explainParams())
dtrModel = dtr.fit(df)
Random Forests and Gradient-Boosted Trees
The random forest and gradient-boosted tree models can be applied to both classification and
regression. As a review, these both follow the same basic concept as the decision tree, except rather
than training one tree, many trees are trained to perform a regression. In the random forest model,
many de-correlated trees are trained and then averaged. With gradient-boosted trees, each tree makes
a weighted prediction (such that some trees have more predictive power for some classes than
others). Random forest and gradient-boosted tree regression have the same model hyperparameters
and training parameters as the corresponding classification models, except for the impurity measure
(as is the case with DecisionTreeRegressor).
See ISL 8.2 and ESL 10.1 for more information on tree ensembles.
Model Hyperparameters
These models share many of the same parameters as we saw in the previous chapter as well as for
regression decision trees. Refer back to “Model Hyperparameters” for a thorough explanation of
these parameters. As for a single regression tree, however, the only impurity metric currently
supported is variance.
Training Parameters
These models support the same checkpointInterval parameter as classification trees, as described
in Chapter 26.
Example
Here’s a small example of how to use these two models to perform a regression:
// in Scala
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.ml.regression.GBTRegressor
val rf = new RandomForestRegressor()
println(rf.explainParams())
val rfModel = rf.fit(df)
val gbt = new GBTRegressor()
println(gbt.explainParams())
val gbtModel = gbt.fit(df)
# in Python
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.regression import GBTRegressor
rf = RandomForestRegressor()
print(rf.explainParams())
rfModel = rf.fit(df)
gbt = GBTRegressor()
print(gbt.explainParams())
gbtModel = gbt.fit(df)
Advanced Methods
The preceding methods are highly general methods for performing a regression. The models are by no
means exhaustive, but do provide the essential regression types that many folks use. This next section
will cover some of the more specialized regression models that Spark includes. We omit code
examples simply because they follow the same patterns as the other algorithms.
Survival Regression (Accelerated Failure Time)
Statisticians use survival analysis to understand the survival rate of individuals, typically in
controlled experiments. Spark implements the accelerated failure time model, which, rather than
describing the actual survival time, models the log of the survival time. This variation of survival
regression is implemented in Spark because the more well-known Cox proportional hazards model
is semi-parametric and does not scale well to large datasets. By contrast, accelerated failure time
does because each instance (row) contributes to the resulting model independently. Accelerated
failure time does have different assumptions than the Cox survival model and therefore one is not
necessarily a drop-in replacement for the other. Covering these differing assumptions is outside of the
scope of this book. See L. J. Wei’s paper on accelerated failure time for more information.
The requirement for input is quite similar to that of other regressions. We will tune coefficients
according to feature values. However, there is one departure, and that is the introduction of a censor
variable column. A test subject is censored during a scientific study when that individual drops out of
the study, since their state at the end of the experiment may be unknown. This is important because we
cannot assume an outcome for an individual who is censored (doesn’t report that state to the
researchers) at some intermediate point in a study.
See more about survival regression with AFT in the documentation.
Isotonic Regression
Isotonic regression is another specialized regression model, with some unique requirements.
Essentially, isotonic regression specifies a piecewise linear function that is always monotonically
increasing; it cannot decrease. This means that if your data is going up and to the right in a given plot,
this is an appropriate model. If it both increases and decreases over the course of the input values, then this
model is not appropriate. The illustration of isotonic regression's behavior in Figure 27-1 makes it much easier to understand.
Figure 27-1. Isotonic regression line
Notice how it achieves a better fit than a simple linear regression would. See more about how to use this
model in the Spark documentation.
Evaluators and Automating Model Tuning
Regression has the same core model tuning functionality that we saw with classification. We can
specify an evaluator, pick a metric to optimize for, and then have Spark perform the
parameter tuning on our behalf. The evaluator for regression, unsurprisingly, is called the
RegressionEvaluator and allows us to optimize for a number of common regression success
metrics. Just like the classification evaluator, RegressionEvaluator expects two columns, a column
representing the prediction and another representing the true label. The supported metrics to optimize
for are the root mean squared error ("rmse"), the mean squared error ("mse"), the R² metric ("r2"),
and the mean absolute error ("mae").
To use RegressionEvaluator, we build up our pipeline, specify the parameters we would like to
test, and then run it. Spark will automatically select the model that performs best and return this to us:
// in Scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
val glr = new GeneralizedLinearRegression()
  .setFamily("gaussian")
  .setLink("identity")
val pipeline = new Pipeline().setStages(Array(glr))
val params = new ParamGridBuilder().addGrid(glr.regParam, Array(0, 0.5, 1))
  .build()
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setPredictionCol("prediction")
  .setLabelCol("label")
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(params)
  .setNumFolds(2) // should always be 3 or more but this dataset is small
val model = cv.fit(df)
# in Python
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
glr = GeneralizedLinearRegression().setFamily("gaussian").setLink("identity")
pipeline = Pipeline().setStages([glr])
params = ParamGridBuilder().addGrid(glr.regParam, [0, 0.5, 1]).build()
evaluator = RegressionEvaluator()\
  .setMetricName("rmse")\
  .setPredictionCol("prediction")\
  .setLabelCol("label")
cv = CrossValidator()\
  .setEstimator(pipeline)\
  .setEvaluator(evaluator)\
  .setEstimatorParamMaps(params)\
  .setNumFolds(2)  # should always be 3 or more but this dataset is small
model = cv.fit(df)
Metrics
Evaluators allow us to evaluate and fit a model according to one specific metric, but we can also
access a number of regression metrics via the RegressionMetrics object. As with the classification
metrics in the previous chapter, RegressionMetrics operates on RDDs of (prediction, label) pairs.
For instance, let's see how we can inspect the results of the previously trained model.
// in Scala
import org.apache.spark.mllib.evaluation.RegressionMetrics
val out = model.transform(df)
  .select("prediction", "label")
  .rdd.map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
val metrics = new RegressionMetrics(out)
println(s"MSE = ${metrics.meanSquaredError}")
println(s"RMSE = ${metrics.rootMeanSquaredError}")
println(s"R-squared = ${metrics.r2}")
println(s"MAE = ${metrics.meanAbsoluteError}")
println(s"Explained variance = ${metrics.explainedVariance}")
# in Python
from pyspark.mllib.evaluation import RegressionMetrics
out = model.transform(df)\
  .select("prediction", "label")\
  .rdd.map(lambda x: (float(x[0]), float(x[1])))
metrics = RegressionMetrics(out)
print("MSE: " + str(metrics.meanSquaredError))
print("RMSE: " + str(metrics.rootMeanSquaredError))
print("R-squared: " + str(metrics.r2))
print("MAE: " + str(metrics.meanAbsoluteError))
print("Explained variance: " + str(metrics.explainedVariance))
Consult the Spark documentation for the latest methods.
Conclusion
In this chapter, we covered the basics of regression in Spark, including how we train models and how
we measure success. In the next chapter, we’ll take a look at recommendation engines, one of the
more popular applications of MLlib.
Chapter 28. Recommendation
The task of recommendation is one of the most intuitive. By studying people’s explicit preferences
(through ratings) or implicit preferences (through observed behavior), you can make
recommendations on what one user may like by drawing similarities between the user and other users,
or between the products they liked and other products. Using the underlying similarities,
recommendation engines can make new recommendations to other users.
Use Cases
Recommendation engines are one of the best use cases for big data. It's fairly easy to collect training
data about users' past preferences at scale, and this data can be used in many domains to connect
users with new content. Spark is an open source tool of choice for
large-scale recommendations across a variety of companies:
Movie recommendations
Amazon, Netflix, and HBO all want to provide relevant film and TV content to their users. Netflix
uses Spark to make large-scale movie recommendations to its users.
Course recommendations
A school might want to recommend courses to students by studying what courses similar students
have liked or taken. Past enrollment data makes for an easy-to-collect training dataset for this
task.
In Spark, there is one workhorse recommendation algorithm: Alternating Least Squares (ALS). This
algorithm leverages a technique called collaborative filtering, which makes recommendations based
only on which items users interacted with in the past. That is, it does not require or use any additional
features about the users or the items. Spark supports several ALS variants (e.g., explicit or implicit
feedback). Apart from ALS, Spark provides frequent pattern mining for finding association rules in
market basket analysis. Finally, Spark's RDD API also includes a lower-level matrix factorization
method that will not be covered in this book.
Collaborative Filtering with Alternating Least Squares
ALS finds a k-dimensional feature vector for each user and item such that the dot product of each
user's feature vector with each item's feature vector approximates the user's rating for that item.
Therefore this only requires an input dataset of existing ratings between user-item pairs, with three
columns: a user ID column, an item ID column (e.g., a movie), and a rating column. The ratings can
either be explicit (a numerical rating that we aim to predict directly) or implicit (each rating
represents the strength of interactions observed between a user and an item, such as the number of
visits to a particular page, and measures our level of confidence in the user's preference for that
item). Given this input DataFrame, the model will produce feature vectors that you can use to predict
users' ratings for items they have not yet rated.
One issue to note in practice is that this algorithm does have a preference for serving things that are
very common or that it has a lot of information on. If you're introducing a new product that no users
have expressed a preference for, the algorithm isn't going to recommend it to many people.
Additionally, if new users are onboarding onto the platform, they may not have any ratings in the
training set, so the algorithm won't know what to recommend to them. These are examples of
what we call the cold start problem, which we discuss later in the chapter.
In terms of scalability, one reason for Spark's popularity for this task is that the algorithm and
implementation in MLlib can scale to millions of users, millions of items, and billions of ratings.
Model Hyperparameters
These are configurations that we can specify to determine the structure of the model as well as the
specific collaborative filtering problem we wish to solve:
rank
The rank determines the dimension of the feature vectors learned for users and items. This
should normally be tuned through experimentation. The core trade-off is that by specifying too
high a rank, the algorithm may overfit the training data; but by specifying too low a rank, it may
not make the best possible predictions. The default value is 10.
alpha
When training on implicit feedback (behavioral observations), alpha sets a baseline
confidence for preference. This has a default of 1.0 and should be tuned through experimentation.
regParam
Controls regularization to prevent overfitting. You should test out different values for the
regularization parameter to find the optimal value for your problem. The default is 0.1.
implicitPrefs
This Boolean value specifies whether you are training on implicit (true) or explicit (false)
preference data (refer back to the preceding discussion for the difference between explicit and
implicit). This value should be set based on the data that you're using as input to the model. If the
data is based on passive endorsement of a product (say, via a click or page visit), then you should
use implicit preferences. In contrast, if the data is an explicit rating (e.g., the user gave this
restaurant 4/5 stars), you should use explicit preferences. Explicit preferences are the default.
nonnegative
If set to true, this parameter configures the model to place non-negative constraints on the least-
squares problem it solves and only return non-negative feature vectors. This can improve
performance in some applications. The default value is false.
Training Parameters
The training parameters for alternating least squares are a bit different from those that we have seen
in other models. That's because we get more low-level control over how the data is
distributed across the cluster. The groups of data that are distributed around the cluster are called
blocks. Determining how much data to place in each block can have a significant impact on the time it
takes to train the algorithm (but not the final result). A good rule of thumb is to aim for approximately
one to five million ratings per block. If you have less data than that in each block, more blocks will
not improve the algorithm's performance.
numUserBlocks
This determines how many blocks to split the users into. The default is 10.
numItemBlocks
This determines how many blocks to split the items into. The default is 10.
maxIter
Total number of iterations over the data before stopping. Changing this probably won't change
your results much, so this shouldn't be the first parameter you adjust. The default is 10. One
example of when you might want to increase it is when, after inspecting your objective history, you
notice that it doesn't flatline after a certain number of training iterations.
checkpointInterval
Checkpointing allows you to save model state during training to more quickly recover from node
failures. You can set a checkpoint directory using SparkContext.setCheckpointDir.
seed
Specifying a random seed can help you replicate your results.
Prediction Parameters
Prediction parameters determine how a trained model should actually make predictions. In our case,
there’s one parameter: the cold start strategy (set through coldStartStrategy). This setting
determines what the model should predict for users or items that did not appear in the training set.
The cold start challenge commonly arises when you're serving a model in production and new users
and/or items have no ratings history, so the model has no recommendation to make. It can
also occur when using simple random splits, as in Spark's CrossValidator or
TrainValidationSplit, where it is very common to encounter users and/or items in the evaluation
set that are not in the training set.
By default, Spark will assign NaN prediction values when it encounters a user and/or item that is not
present in the actual model. This can be useful because you can design your overall system to fall back to
some default recommendation when a new user or item is in the system. However, this is undesirable
during training because it will ruin your evaluator's ability to properly measure the success of
your model, making model selection impossible. Spark allows users to set the
coldStartStrategy parameter to drop in order to drop any rows in the DataFrame of predictions
that contain NaN values. The evaluation metric will then be computed over the non-NaN data and will
be valid. drop and nan (the default) are the only currently supported cold-start strategies.
Example
This example will make use of a dataset that we have not used thus far in the book, the MovieLens
movie rating dataset. This dataset, naturally, has information relevant for making movie
recommendations. We will first use this dataset to train a model:
// in Scala
import org.apache.spark.ml.recommendation.ALS
val ratings = spark.read.textFile("/data/sample_movielens_ratings.txt")
  .selectExpr("split(value , '::') as col")
  .selectExpr(
    "cast(col[0] as int) as userId",
    "cast(col[1] as int) as movieId",
    "cast(col[2] as float) as rating",
    "cast(col[3] as long) as timestamp")
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))
val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
println(als.explainParams())
val alsModel = als.fit(training)
val predictions = alsModel.transform(test)
# in Python
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row
ratings = spark.read.text("/data/sample_movielens_ratings.txt")\
  .rdd.toDF()\
  .selectExpr("split(value , '::') as col")\
  .selectExpr(
    "cast(col[0] as int) as userId",
    "cast(col[1] as int) as movieId",
    "cast(col[2] as float) as rating",
    "cast(col[3] as long) as timestamp")
training, test = ratings.randomSplit([0.8, 0.2])
als = ALS()\
  .setMaxIter(5)\
  .setRegParam(0.01)\
  .setUserCol("userId")\
  .setItemCol("movieId")\
  .setRatingCol("rating")
print(als.explainParams())
alsModel = als.fit(training)
predictions = alsModel.transform(test)
We can now output the top recommendations for each user or movie. The model's
recommendForAllUsers method returns a DataFrame of a userId and an array of recommendations, as
well as a rating for each of those movies. recommendForAllItems returns a DataFrame of a
movieId, as well as the top users for that movie:
// in Scala
alsModel.recommendForAllUsers(10)
  .selectExpr("userId", "explode(recommendations)").show()
alsModel.recommendForAllItems(10)
  .selectExpr("movieId", "explode(recommendations)").show()
# in Python
alsModel.recommendForAllUsers(10)\
  .selectExpr("userId", "explode(recommendations)").show()
alsModel.recommendForAllItems(10)\
  .selectExpr("movieId", "explode(recommendations)").show()
Evaluators for Recommendation
Now that we have covered the cold-start strategy, we can set up an automatic model evaluator when working
with ALS. One thing that may not be immediately obvious is that this recommendation problem is
really just a kind of regression problem. Since we're predicting values (ratings) for given users, we
want to optimize for reducing the total difference between our users' ratings and the true values. We
can do this using the same RegressionEvaluator that we saw in Chapter 27. You can place this in a
pipeline to automate the training process. When doing this, you should also set the cold-start strategy
to drop instead of nan and then switch it back to nan when it comes time to actually make
predictions in your production system:
// in Scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
val rmse = evaluator.evaluate(predictions)
println(s"Root-mean-square error = $rmse")
# in Python
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator()\
  .setMetricName("rmse")\
  .setLabelCol("rating")\
  .setPredictionCol("prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = %f" % rmse)
Metrics
Recommendation results can be measured using both the standard regression metrics and some
recommendation-specific metrics. It should come as no surprise that there are more sophisticated
ways of measuring recommendation success than simply evaluating based on regression. These
metrics are particularly useful for evaluating your final model.
Regression Metrics
We can recycle the regression metrics for recommendation. This is because we can simply see how
close each prediction is to the actual rating for that user and item:
// in Scala
import org.apache.spark.mllib.evaluation.{RankingMetrics, RegressionMetrics}
val regComparison = predictions.select("rating", "prediction")
  .rdd.map(x => (x.getFloat(0).toDouble, x.getFloat(1).toDouble))
val metrics = new RegressionMetrics(regComparison)
# in Python
from pyspark.mllib.evaluation import RegressionMetrics
regComparison = predictions.select("rating", "prediction")\
  .rdd.map(lambda x: (float(x[0]), float(x[1])))
metrics = RegressionMetrics(regComparison)
Ranking Metrics
More interestingly, we also have another tool: ranking metrics. RankingMetrics allows us to
compare our recommendations with an actual set of ratings (or preferences) expressed by a given
user. RankingMetrics does not focus on the value of the rank but rather on whether or not our algorithm
recommends an already ranked item again to a user. This does require some data preparation on our
part. You may want to refer to Part II for a refresher on some of the methods. First, we need to collect
a set of highly ranked movies for a given user. In our case, we're going to use a rather low threshold:
movies ranked above 2.5. Tuning this value will largely be a business decision:
// in Scala
import org.apache.spark.mllib.evaluation.{RankingMetrics, RegressionMetrics}
import org.apache.spark.sql.functions.{col, expr}
val perUserActual = predictions
  .where("rating > 2.5")
  .groupBy("userId")
  .agg(expr("collect_set(movieId) as movies"))
# in Python
from pyspark.mllib.evaluation import RankingMetrics, RegressionMetrics
from pyspark.sql.functions import col, expr
perUserActual = predictions\
  .where("rating > 2.5")\
  .groupBy("userId")\
  .agg(expr("collect_set(movieId) as movies"))
At this point, we have a collection of users, along with a truth set of previously ranked movies for
each user. Now we will get our top 10 recommendations from our algorithm on a per-user basis. We
will then see if the top 10 recommendations show up in our truth set. If we have a well-trained model,
it will correctly recommend the movies a user already liked. If it doesn't, it may not have learned
enough about each particular user to successfully reflect their preferences:
// in Scala
val perUserPredictions = predictions
  .orderBy(col("userId"), col("prediction").desc)
  .groupBy("userId")
  .agg(expr("collect_list(movieId) as movies"))
# in Python
perUserPredictions = predictions\
  .orderBy(col("userId"), col("prediction").desc())\
  .groupBy("userId")\
  .agg(expr("collect_list(movieId) as movies"))
Now we have two DataFrames: one with the truth set of highly rated movies per user, and another with
the model's top-ranked items per user. We can pass them into the RankingMetrics object. This object accepts an RDD of these
combinations, as you can see in the following join and RDD conversion:
// in Scala
val perUserActualvPred = perUserActual.join(perUserPredictions, Seq("userId"))
  .map(row => (
    row(1).asInstanceOf[Seq[Integer]].toArray,
    row(2).asInstanceOf[Seq[Integer]].toArray.take(15)
  ))
val ranks = new RankingMetrics(perUserActualvPred.rdd)
# in Python
perUserActualvPred = perUserActual.join(perUserPredictions, ["userId"]).rdd\
  .map(lambda row: (row[1], row[2][:15]))
ranks = RankingMetrics(perUserActualvPred)
Now we can see the metrics from that ranking. For instance, we can see how precise our algorithm is
with the mean average precision. We can also get the precision at certain ranking points, for instance,
to see where the majority of the positive recommendations fall:
// in Scala
ranks.meanAveragePrecision
ranks.precisionAt(5)
# in Python
ranks.meanAveragePrecision
ranks.precisionAt(5)
Frequent Pattern Mining
In addition to ALS, another tool that MLlib provides for creating recommendations is frequent pattern
mining. Frequent pattern mining, sometimes referred to as market basket analysis, looks at raw data
and finds association rules. For instance, given a large number of transactions, it might identify that
users who buy hot dogs almost always purchase hot dog buns. This technique can be applied in the
recommendation context, especially when people are filling shopping carts (whether online or off).
Spark implements the FP-growth algorithm for frequent pattern mining. See the Spark documentation
and ESL 14.2 for more information about this algorithm.
Conclusion
In this chapter, we discussed one of Spark's most popular machine learning algorithms in practice:
alternating least squares for recommendation. We saw how to train, tune, and evaluate this
model. In the next chapter, we'll move to unsupervised learning and discuss clustering.
Chapter 29. Unsupervised Learning
This chapter will cover the details of Spark's available tools for unsupervised learning, focusing
specifically on clustering. Unsupervised learning is, generally speaking, used less often than
supervised learning because it's usually harder to apply and measure success (from an end-result
perspective). These challenges can become exacerbated at scale. For instance, clustering in high-
dimensional space can create odd clusters simply because of the properties of high-dimensional
spaces, something referred to as the curse of dimensionality. The curse of dimensionality describes
the fact that as a feature space expands in dimensionality, it becomes increasingly sparse. This means
that the data needed to fill this space for statistically meaningful results increases rapidly with any
increase in dimensionality. Additionally, with high dimensions comes more noise in the data. This, in
turn, may cause your model to home in on noise instead of the true factors causing a particular result or
grouping. Therefore, in the model scalability table, we include computational limits as well as a set
of statistical recommendations. These are heuristics and should be helpful guides, not requirements.
At its core, unsupervised learning is trying to discover patterns or derive a concise representation of
the underlying structure of a given dataset.
Use Cases
Here are some potential use cases. At their core, these patterns might reveal topics, anomalies, or
groupings in our data that may not have been obvious beforehand:
Finding anomalies in data
If the majority of values in a dataset cluster into a larger group with several small groups on the
outside, those groups might warrant further investigation.
Topic modeling
By looking at large bodies of text, it is possible to find topics that exist across those different
documents.
Model Scalability
Just like with our other models, it's important to mention the basic model scalability requirements
along with statistical recommendations.
Table 29-1. Clustering model scalability reference
Model | Statistical recommendation | Computation limits | Training examples
k-means | 50 to 100 maximum | Features x clusters < 10 million | No limit
Bisecting k-means | 50 to 100 maximum | Features x clusters < 10 million | No limit
GMM | 50 to 100 maximum | Features x clusters < 10 million | No limit
LDA | An interpretable number | 1,000s of topics | No limit
Let's get started by loading some example numerical data:
// in Scala
import org.apache.spark.ml.feature.VectorAssembler
val va = new VectorAssembler()
  .setInputCols(Array("Quantity", "UnitPrice"))
  .setOutputCol("features")
val sales = va.transform(spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/retail-data/by-day/*.csv")
  .limit(50)
  .coalesce(1)
  .where("Description IS NOT NULL"))
sales.cache()
# in Python
from pyspark.ml.feature import VectorAssembler
va = VectorAssembler()\
  .setInputCols(["Quantity", "UnitPrice"])\
  .setOutputCol("features")
sales = va.transform(spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/retail-data/by-day/*.csv")
  .limit(50)
  .coalesce(1)
  .where("Description IS NOT NULL"))
sales.cache()
k-means
k-means is one of the most popular clustering algorithms. In this algorithm, a user-specified number of
clusters (k) are randomly assigned to different points in the dataset. The unassigned points are then
"assigned" to a cluster based on their proximity (measured in Euclidean distance) to the previously
assigned point. Once this assignment happens, the center of this cluster (called the centroid) is
computed, and the process repeats. All points are assigned to a particular centroid, and a new
centroid is computed. We repeat this process for a finite number of iterations or until convergence
(i.e., when our centroid locations stop changing). This does not, however, mean that our clusters are
always sensible. For instance, a given "logical" cluster of data might be split right down the middle
simply because of the starting points of two distinct clusters. Thus, it is often a good idea to perform
multiple runs of k-means starting with different initializations.
Choosing the right value for k is an extremely important aspect of using this algorithm successfully,
as well as a hard task. There's no real prescription for the number of clusters you need, so you'll
likely have to experiment with different values and consider what you would like the end result to be.
For more information on k-means, see ISL 10.3 and ESL 14.3.
Model Hyperparameters
These are configurations that we specify to determine the basic structure of the model:
k
This is the number of clusters that you would like to end up with.
Training Parameters
initMode
The initialization mode is the algorithm that determines the starting locations of the centroids. The
supported options are random and k-means|| (the default). The latter is a parallelized variant of
the k-means++ method. While the details are not within the scope of this book, the thinking behind
this method is that rather than simply choosing random initialization locations, the algorithm
chooses cluster centers that are already well spread out, to generate a better clustering.
initSteps
The number of steps for k-means|| initialization mode. Must be greater than 0. (The default
value is 2.)
maxIter
Total number of iterations over the data before stopping. Changing this probably won't change
your results much, so don't make this the first parameter you look to adjust. The default is 20.
tol
Specifies a threshold by which changes in centroids show that we optimized our model enough,
and can stop iterating early, before maxIter iterations. The default value is 0.0001.
This algorithm is generally robust to these parameters; the main trade-off is that running more
initialization steps and iterations may lead to a better clustering at the expense of longer training time.
Example
// in Scala
import org.apache.spark.ml.clustering.KMeans
val km = new KMeans().setK(5)
println(km.explainParams())
val kmModel = km.fit(sales)
# in Python
from pyspark.ml.clustering import KMeans
km = KMeans().setK(5)
print(km.explainParams())
kmModel = km.fit(sales)
k-means Metrics Summary
k-means includes a summary class that we can use to evaluate our model. This class provides some
common measures for k-means success (whether these apply to your problem set is another
question). The k-means summary includes information about the clusters created, as well as their
relative sizes (number of examples).
We can also compute the within set sum of squared errors, which can help measure how close our
values are to each cluster centroid, using computeCost. The implicit goal in k-means is that we
want to minimize the within set sum of squared errors, subject to the given number of clusters:
// in Scala
val summary = kmModel.summary
summary.clusterSizes // number of points
kmModel.computeCost(sales)
println("Cluster Centers: ")
kmModel.clusterCenters.foreach(println)
# in Python
summary = kmModel.summary
print(summary.clusterSizes)  # number of points
kmModel.computeCost(sales)
centers = kmModel.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
Bisecting k-means
Bisecting k-means is a variant of k-means. The core difference is that instead of clustering points by
starting "bottom-up" and assigning a bunch of different groups in the data, this is a top-down
clustering method. This means that it will start by creating a single group and then splitting that group
into smaller groups in order to end up with the number of clusters specified by the user. This is
usually a faster method than k-means and will yield different results.
Model Hyperparameters
These are configurations that we specify to determine the basic structure of the model:
k
This is the number of clusters that you would like to end up with.
Training Parameters
minDivisibleClusterSize
The minimum number of points (if greater than or equal to 1.0) or the minimum proportion of
points (if less than 1.0) of a divisible cluster. The default is 1.0, meaning that there must be at
least one point in each cluster.
maxIter
Total number of iterations over the data before stopping. Changing this probably won't change
your results much, so don't make this the first parameter you look to adjust. The default is 20.
Most of the parameters in this model should be tuned in order to find the best result. There's no rule
that applies to all datasets.
Example
// in Scala
import org.apache.spark.ml.clustering.BisectingKMeans
val bkm = new BisectingKMeans().setK(5).setMaxIter(5)
println(bkm.explainParams())
val bkmModel = bkm.fit(sales)
# in Python
from pyspark.ml.clustering import BisectingKMeans
bkm = BisectingKMeans().setK(5).setMaxIter(5)
bkmModel = bkm.fit(sales)
Bisecting k-means Summary
Bisecting k-means includes a summary class that we can use to evaluate our model; it is largely the
same as the k-means summary. It includes information about the clusters created, as well as their
relative sizes (number of examples):
// in Scala
val summary = bkmModel.summary
summary.clusterSizes // number of points
bkmModel.computeCost(sales)
println("Cluster Centers: ")
bkmModel.clusterCenters.foreach(println)
# in Python
summary = bkmModel.summary
print(summary.clusterSizes)  # number of points
bkmModel.computeCost(sales)
centers = bkmModel.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
Gaussian Mixture Models
Gaussian mixture models (GMMs) are another popular clustering algorithm that makes different
assumptions than bisecting k-means or k-means do. Those algorithms try to group data by reducing
the sum of squared distances from the center of the cluster. Gaussian mixture models, on the other
hand, assume that each cluster produces data based upon random draws from a Gaussian distribution.
This means that clusters of data should be less likely to have data at the edge of the cluster (reflected
in the Gaussian distribution) and a much higher probability of having data in the center. Each Gaussian
cluster can be of arbitrary size, with its own mean and standard deviation (and hence a possibly
different, ellipsoid shape). There are still k user-specified clusters that will be created during
training.
A simplified way of thinking about Gaussian mixture models is that they're like a soft version of k-means.
k-means creates very rigid clusters: each point is only within one cluster. GMMs allow for
more nuanced clusters associated with probabilities, instead of rigid boundaries.
For more information, see ESL 14.3.
For more information, see ESL 14.3.
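To make the soft-assignment idea concrete, here is a small pure-Python sketch (illustrative only, not MLlib code) that computes how responsible each of two made-up one-dimensional Gaussian clusters is for a given point; the weights, means, and standard deviations are hypothetical values chosen for the example:

```python
import math

def gaussian_pdf(x, mean, std):
    # Density of a univariate Gaussian at x
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, components):
    # components: list of (weight, mean, std) tuples whose weights sum to 1.
    # Returns the probability that each component generated x.
    likelihoods = [w * gaussian_pdf(x, m, s) for (w, m, s) in components]
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]

# Two hypothetical clusters: one centered at 0, one at 5
components = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]
print(responsibilities(2.5, components))  # point halfway between: [0.5, 0.5]
print(responsibilities(0.0, components))  # nearly all probability on the first cluster
```

A point halfway between the two means is split evenly, while a point at a cluster center is assigned to it almost entirely; this is the "soft" behavior that distinguishes GMMs from k-means' hard assignments.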
Model Hyperparameters
These are configurations that we specify to determine the basic structure of the model:
k
This is the number of clusters that you would like to end up with.
Training Parameters
maxIter
Total number of iterations over the data before stopping. Changing this probably won't change
your results a ton, so don't make this the first parameter you look to adjust. The default is 100.
tol
This value simply helps us specify a threshold by which changes in parameters show that we
optimized our weights enough. A smaller value can lead to higher accuracy at the cost of
performing more iterations (although never more than maxIter). The default value is 0.01.
As with our k-means model, these training parameters are less likely to have an impact than the
number of clusters, k.
Example
// in Scala
import org.apache.spark.ml.clustering.GaussianMixture
val gmm = new GaussianMixture().setK(5)
println(gmm.explainParams())
val model = gmm.fit(sales)
# in Python
from pyspark.ml.clustering import GaussianMixture
gmm = GaussianMixture().setK(5)
print(gmm.explainParams())
model = gmm.fit(sales)
Gaussian Mixture Model Summary
Like our other clustering algorithms, Gaussian mixture models include a summary class to help with
model evaluation. This includes information about the clusters created, like the weights, the means,
and the covariance of the Gaussian mixture, which can help us learn more about the underlying
structure inside of our data:
// in Scala
val summary = model.summary
model.weights
model.gaussiansDF.show()
summary.cluster.show()
summary.clusterSizes
summary.probability.show()
# in Python
summary = model.summary
print(model.weights)
model.gaussiansDF.show()
summary.cluster.show()
summary.clusterSizes
summary.probability.show()
Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a hierarchical clustering model typically used to perform topic
modelling on text documents. LDA tries to extract high-level topics from a series of documents and
keywords associated with those topics. It then interprets each document as having a variable number
of contributions from multiple input topics. There are two implementations that you can use: online
LDA and expectation maximization. In general, online LDA will work better when there are more
examples, and the expectation maximization optimizer will work better when there is a larger input
vocabulary. This method is also capable of scaling to hundreds or thousands of topics.
To input our text data into LDA, we’re going to have to convert it into a numeric format. You can use
the CountVectorizer to achieve this.
Model Hyperparameters
These are configurations that we specify to determine the basic structure of the model:
k
The total number of topics to infer from the data. The default is 10 and must be a positive number.
docConcentration
Concentration parameter (commonly named “alpha”) for the prior placed on documents’
distributions over topics (“theta”). This is the parameter to a Dirichlet distribution, where larger
values mean more smoothing (more regularization).
If not set by the user, then docConcentration is set automatically. If set to a singleton vector
[alpha], then alpha is replicated to a vector of length k in fitting. Otherwise, the
docConcentration vector must be of length k.
topicConcentration
The concentration parameter (commonly named “beta” or “eta”) for the prior placed on a topic's
distributions over terms. This is the parameter to a symmetric Dirichlet distribution. If not set by
the user, then topicConcentration is set automatically.
Training Parameters
These are configurations that specify how we perform training:
maxIter
Total number of iterations over the data before stopping. Changing this probably won't change
your results a ton, so don't make this the first parameter you look to adjust. The default is 20.
optimizer
This determines whether to use EM or online training optimization to determine the LDA model.
The default is online.
learningDecay
Learning rate, set as an exponential decay rate. This should be in the range (0.5, 1.0] to guarantee
asymptotic convergence. The default is 0.51 and only applies to the online optimizer.
learningOffset
A (positive) learning parameter that downweights early iterations. Larger values make early
iterations count less. The default is 1,024.0 and only applies to the online optimizer.
optimizeDocConcentration
Indicates whether the docConcentration (Dirichlet parameter for document-topic distribution)
will be optimized during training. The default is true but only applies to the online optimizer.
subsamplingRate
The fraction of the corpus to be sampled and used in each iteration of mini-batch gradient descent,
in range (0, 1]. The default is 0.5 and only applies to the online optimizer.
seed
This model also supports specifying a random seed for reproducibility.
checkpointInterval
This is the same checkpoint feature that we saw in Chapter 26.
Prediction Parameters
topicDistributionCol
The column that will hold the output of the topic mixture distribution for each document.
Example
// in Scala
import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer}
val tkn = new Tokenizer().setInputCol("Description").setOutputCol("DescOut")
val tokenized = tkn.transform(sales.drop("features"))
val cv = new CountVectorizer()
.setInputCol("DescOut")
.setOutputCol("features")
.setVocabSize(500)
.setMinTF(0)
.setMinDF(0)
.setBinary(true)
val cvFitted = cv.fit(tokenized)
val prepped = cvFitted.transform(tokenized)
# in Python
from pyspark.ml.feature import Tokenizer, CountVectorizer
tkn = Tokenizer().setInputCol("Description").setOutputCol("DescOut")
tokenized = tkn.transform(sales.drop("features"))
cv = CountVectorizer()\
.setInputCol("DescOut")\
.setOutputCol("features")\
.setVocabSize(500)\
.setMinTF(0)\
.setMinDF(0)\
.setBinary(True)
cvFitted = cv.fit(tokenized)
prepped = cvFitted.transform(tokenized)
// in Scala
import org.apache.spark.ml.clustering.LDA
val lda = new LDA().setK(10).setMaxIter(5)
println(lda.explainParams())
val model = lda.fit(prepped)
# in Python
from pyspark.ml.clustering import LDA
lda = LDA().setK(10).setMaxIter(5)
print(lda.explainParams())
model = lda.fit(prepped)
After we train the model, we can see some of the top topics. These are returned as term indices, and
we'll have to look them up using the CountVectorizerModel that we trained in order to find out the
true words. For instance, when we trained on the data, our top three topics were hot, home, and brown
after looking them up in our vocabulary:
// in Scala
model.describeTopics(3).show()
cvFitted.vocabulary
# in Python
model.describeTopics(3).show()
cvFitted.vocabulary
These methods result in detailed information about the vocabulary used as well as the emphasis on
particular terms. These can be helpful for better understanding the underlying topics. Due to space
constraints, we can't show this output. Using similar APIs, we can get some more technical measures
like the log likelihood and perplexity. The goal of these tools is to help you optimize the number of
topics based on your data. When using perplexity as your success criterion, you should apply the
metric to a holdout set and aim to reduce the overall perplexity of the model. Another option is to
increase the log likelihood value on the holdout set. We can calculate each of these by passing a
dataset into the following functions: model.logLikelihood and model.logPerplexity.
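The two metrics are tightly related: perplexity is, roughly, the exponentiated negative log likelihood per token, so lowering perplexity and raising log likelihood pursue the same goal. A small plain-Python sketch of this relationship (illustrative only, with made-up numbers; not the MLlib implementation):

```python
import math

def perplexity(log_likelihood, token_count):
    # Perplexity is the exponential of the negative per-token log likelihood,
    # so a higher (less negative) log likelihood means lower perplexity.
    return math.exp(-log_likelihood / token_count)

# Hypothetical holdout-set scores: the first model fits better than the second
print(perplexity(-6931.0, 1000))  # roughly 1,024
print(perplexity(-9210.0, 1000))  # roughly 10,000, a worse fit
```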
Conclusion
This chapter covered the most popular algorithms that Spark includes for unsupervised learning. The
next chapter will bring us out of MLlib and talk about some of the advanced analytics ecosystem that
has grown outside of Spark.
Chapter 30. Graph Analytics
The previous chapter covered some conventional unsupervised techniques. This chapter is going to
dive into a more specialized toolset: graph processing. Graphs are data structures composed of
nodes, or vertices, which are arbitrary objects, and edges that define relationships between these
nodes. Graph analytics is the process of analyzing these relationships. An example graph might be
your friend group. In the context of graph analytics, each vertex or node would represent a person,
and each edge would represent a relationship. Figure 30-1 shows a sample graph.
Figure 30-1. A sample graph with seven nodes and seven edges
This particular graph is undirected, in that the edges do not have a specified “start” and “end” vertex.
There are also directed graphs that specify a start and end. Figure 30-2 shows a directed graph where
the edges are directional.
Figure 30-2. A directed graph
Edges and vertices in graphs can also have data associated with them. In our friend example, the
weight of the edge might represent the intimacy between different friends; acquaintances would have
low-weight edges between them, while married individuals would have edges with large weights. We
could set this value by looking at communication frequency between nodes and weighting the edges
accordingly. Each vertex (person) might also have data such as a name.
Graphs are a natural way of describing relationships and many different problem sets, and Spark
provides several ways of working in this analytics paradigm. Some business use cases could be
detecting credit card fraud, motif finding, determining importance of papers in bibliographic networks
(i.e., which papers are most referenced), and ranking web pages, as Google famously used the
PageRank algorithm to do.
Spark has long contained an RDD-based library for performing graph processing: GraphX. This
provided a very low-level interface that was extremely powerful, but just like RDDs, wasn't easy to
use or optimize. GraphX remains a core part of Spark. Companies continue to build production
applications on top of it, and it still sees some minor feature development. The GraphX API is well
documented simply because it hasn't changed much since its creation. However, some of the
developers of Spark (including some of the original authors of GraphX) have recently created a next-
generation graph analytics library on Spark: GraphFrames. GraphFrames extends GraphX to provide
a DataFrame API and support for Spark's different language bindings so that users of Python can take
advantage of the scalability of the tool. In this book, we will focus on GraphFrames.
GraphFrames is currently available as a Spark package, an external package that you need to load
when you start up your Spark application, but may be merged into the core of Spark in the future. For
the most part, there should be little difference in performance between the two (except for a huge user
experience improvement in GraphFrames). There is some small overhead when using GraphFrames,
but for the most part it tries to call down to GraphX where appropriate; and for most users, the user
experience gains greatly outweigh this minor overhead.
HOW DOES GRAPHFRAMES COMPARE TO GRAPH DATABASES?
Spark is not a database. Spark is a distributed computation engine, but it does not store data long-
term or perform transactions. You can build a graph computation on top of Spark, but that's
fundamentally different from a database. GraphFrames can scale to much larger workloads than
many graph databases and performs well for analytics but does not support transactional
processing and serving.
The goal of this chapter is to show you how to use GraphFrames to perform graph analytics on Spark.
We are going to be doing this with publicly available bike data from the Bay Area Bike Share portal.
TIP
During the course of writing this book, this map and data have changed dramatically (even the naming!). We include a copy
of the dataset inside the data folder of this book's repository. Be sure to use that dataset to replicate the following results;
and when you’re feeling adventurous, expand to the whole dataset!
To get set up, you're going to need to point to the proper package. To do this from the command line,
you'll run:
./bin/spark-shell --packages graphframes:graphframes:0.5.0-spark2.2-s_2.11
// in Scala
val bikeStations = spark.read.option("header","true")
.csv("/data/bike-data/201508_station_data.csv")
val tripData = spark.read.option("header","true")
.csv("/data/bike-data/201508_trip_data.csv")
# in Python
bikeStations = spark.read.option("header","true")\
.csv("/data/bike-data/201508_station_data.csv")
tripData = spark.read.option("header","true")\
.csv("/data/bike-data/201508_trip_data.csv")
Building a Graph
The first step is to build the graph. To do this we need to define the vertices and edges, which are
DataFrames with some specifically named columns. In our case, we’re creating a directed graph.
This graph will point from the source to the location. In the context of this bike trip data, this will
point from a trip’s starting location to a trip’s ending location. To define the graph, we use the naming
conventions for columns presented in the GraphFrames library. In the vertices table we define our
identifier as id (in our case this is of string type), and in the edges table we label each edge's source
vertex ID as src and the destination ID as dst:
// in Scala
val stationVertices = bikeStations.withColumnRenamed("name", "id").distinct()
val tripEdges = tripData
.withColumnRenamed("Start Station", "src")
.withColumnRenamed("End Station", "dst")
# in Python
stationVertices = bikeStations.withColumnRenamed("name", "id").distinct()
tripEdges = tripData\
.withColumnRenamed("Start Station", "src")\
.withColumnRenamed("End Station", "dst")
We can now build a GraphFrame object, which represents our graph, from the vertex and edge
DataFrames we have so far. We will also leverage caching because we’ll be accessing this data
frequently in later queries:
// in Scala
import org.graphframes.GraphFrame
val stationGraph = GraphFrame(stationVertices, tripEdges)
stationGraph.cache()
# in Python
from graphframes import GraphFrame
stationGraph = GraphFrame(stationVertices, tripEdges)
stationGraph.cache()
Now we can see the basic statistics about the graph (and query our original DataFrame to ensure that we
see the expected results):
// in Scala
println(s"Total Number of Stations: ${stationGraph.vertices.count()}")
println(s"Total Number of Trips in Graph: ${stationGraph.edges.count()}")
println(s"Total Number of Trips in Original Data: ${tripData.count()}")
# in Python
print("Total Number of Stations: " + str(stationGraph.vertices.count()))
print("Total Number of Trips in Graph: " + str(stationGraph.edges.count()))
print("Total Number of Trips in Original Data: " + str(tripData.count()))
This returns the following results:
Total Number of Stations: 70
Total Number of Trips in Graph: 354152
Total Number of Trips in Original Data: 354152
Querying the Graph
The most basic way of interacting with the graph is simply querying it, performing things like counting
trips and filtering by given destinations. GraphFrames provides simple access to both vertices and
edges as DataFrames. Note that our graph retained all the additional columns in the data in addition to
IDs, sources, and destinations, so we can also query those if needed:
// in Scala
import org.apache.spark.sql.functions.desc
stationGraph.edges.groupBy("src", "dst").count().orderBy(desc("count")).show(10)
# in Python
from pyspark.sql.functions import desc
stationGraph.edges.groupBy("src", "dst").count().orderBy(desc("count")).show(10)
+--------------------+--------------------+-----+
| src| dst|count|
+--------------------+--------------------+-----+
|San Francisco Cal...| Townsend at 7th| 3748|
|Harry Bridges Pla...|Embarcadero at Sa...| 3145|
...
| Townsend at 7th|San Francisco Cal...| 2192|
|Temporary Transba...|San Francisco Cal...| 2184|
+--------------------+--------------------+-----+
We can also filter by any valid DataFrame expression. In this instance, I want to look at one specific
station and the count of trips in and out of that station:
// in Scala
stationGraph.edges
.where("src = 'Townsend at 7th' OR dst = 'Townsend at 7th'")
.groupBy("src", "dst").count()
.orderBy(desc("count"))
.show(10)
# in Python
stationGraph.edges\
.where("src = 'Townsend at 7th' OR dst = 'Townsend at 7th'")\
.groupBy("src", "dst").count()\
.orderBy(desc("count"))\
.show(10)
+--------------------+--------------------+-----+
| src| dst|count|
+--------------------+--------------------+-----+
|San Francisco Cal...| Townsend at 7th| 3748|
| Townsend at 7th|San Francisco Cal...| 2734|
...
| Steuart at Market| Townsend at 7th| 746|
| Townsend at 7th|Temporary Transba...| 740|
+--------------------+--------------------+-----+
Subgraphs
Subgraphs are just smaller graphs within the larger one. We saw in the last section how we can query
a given set of edges and vertices. We can use this query ability to create subgraphs:
// in Scala
val townAnd7thEdges = stationGraph.edges
.where("src = 'Townsend at 7th' OR dst = 'Townsend at 7th'")
val subgraph = GraphFrame(stationGraph.vertices, townAnd7thEdges)
# in Python
townAnd7thEdges = stationGraph.edges\
.where("src = 'Townsend at 7th' OR dst = 'Townsend at 7th'")
subgraph = GraphFrame(stationGraph.vertices, townAnd7thEdges)
We can then apply the following algorithms to either the original graph or the subgraph.
Motif Finding
Motifs are a way of expressing structural patterns in a graph. When we specify a motif, we are
querying for patterns in the data instead of actual data. In GraphFrames, we specify our query in a
domain-specific language similar to Neo4J’s Cypher language. This language lets us specify
combinations of vertices and edges and assign them names. For example, if we want to specify that a
given vertex a connects to another vertex b through an edge ab, we would specify (a)-[ab]->(b).
The names inside parentheses or brackets do not signify values but instead what the columns for
matching vertices and edges should be named in the resulting DataFrame. We can omit the names
(e.g., (a)-[]->()) if we do not intend to query the resulting values.
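To build intuition for what a triangle motif like "(a)-[ab]->(b); (b)-[bc]->(c); (c)-[ca]->(a)" matches, here is a small pure-Python sketch (illustrative only, not the GraphFrames implementation) that finds the same pattern over a list of directed edges:

```python
def find_triangles(edges):
    # edges: list of (src, dst) pairs. Returns (a, b, c) triples where the
    # directed edges a->b, b->c, and c->a all exist, mirroring the motif
    # "(a)-[ab]->(b); (b)-[bc]->(c); (c)-[ca]->(a)".
    edge_set = set(edges)
    vertices = {v for e in edges for v in e}
    return [(a, b, c)
            for (a, b) in edges
            for c in vertices
            if (b, c) in edge_set and (c, a) in edge_set]

edges = [("X", "Y"), ("Y", "Z"), ("Z", "X"), ("X", "Z")]
print(find_triangles(edges))
```

Note that each triangle is matched once per rotation (starting from X, Y, or Z), and nothing in the pattern itself forces a, b, and c to be distinct vertices; both behaviors carry over to real motif queries, which is why the example later in this section adds explicit filters.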
Let's perform a query on our bike data. In plain English, let's find all the rides that form a “triangle”
pattern between three stations. We express this with the following motif, using the find method to
query our GraphFrame for that pattern. (a) signifies the starting station, and [ab] represents an edge
from (a) to our next station (b). We repeat this for stations (b) to (c) and then from (c) to (a):
// in Scala
val motifs = stationGraph.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[ca]->(a)")
# in Python
motifs = stationGraph.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[ca]->(a)")
Figure 30-3 presents a visual representation of this query.
Figure 30-3. Triangle motif in our triangle query
The DataFrame we get from running this query contains nested fields for vertices a, b, and c, as well
as the respective edges. We can now query this as we would a DataFrame. For example, given a
certain bike, what is the shortest trip the bike has taken from station a, to station b, to station c, and
back to station a? The following logic will parse our timestamps into Spark timestamps, and then
we'll do comparisons to make sure that it's the same bike, traveling from station to station, and that
// in Scala
import org.apache.spark.sql.functions.expr
motifs.selectExpr("*",
"to_timestamp(ab.`Start Date`, 'MM/dd/yyyy HH:mm') as abStart",
"to_timestamp(bc.`Start Date`, 'MM/dd/yyyy HH:mm') as bcStart",
"to_timestamp(ca.`Start Date`, 'MM/dd/yyyy HH:mm') as caStart")
.where("ca.`Bike #` = bc.`Bike #`").where("ab.`Bike #` = bc.`Bike #`")
.where("a.id != b.id").where("b.id != c.id")
.where("abStart < bcStart").where("bcStart < caStart")
.orderBy(expr("cast(caStart as long) - cast(abStart as long)"))
.selectExpr("a.id", "b.id", "c.id", "ab.`Start Date`", "ca.`End Date`")
.limit(1).show(false)
# in Python
from pyspark.sql.functions import expr
motifs.selectExpr("*",
"to_timestamp(ab.`Start Date`, 'MM/dd/yyyy HH:mm') as abStart",
"to_timestamp(bc.`Start Date`, 'MM/dd/yyyy HH:mm') as bcStart",
"to_timestamp(ca.`Start Date`, 'MM/dd/yyyy HH:mm') as caStart")\
.where("ca.`Bike #` = bc.`Bike #`").where("ab.`Bike #` = bc.`Bike #`")\
.where("a.id != b.id").where("b.id != c.id")\
.where("abStart < bcStart").where("bcStart < caStart")\
.orderBy(expr("cast(caStart as long) - cast(abStart as long)"))\
.selectExpr("a.id", "b.id", "c.id", "ab.`Start Date`", "ca.`End Date`")\
.limit(1).show(1, False)
We see the fastest trip is approximately 20 minutes. Pretty fast for three different people (we assume)
using the same bike!
Note also that we had to filter the triangles returned by our motif query in this example. In general,
different vertex IDs used in the query will not be forced to match distinct vertices, so you should
perform this type of filtering if you want distinct vertices. One of the most powerful features of
GraphFrames is that you can combine motif finding with DataFrame queries over the resulting tables
to further narrow down, sort, or aggregate the patterns found.
Graph Algorithms
A graph is just a logical representation of data. Graph theory provides numerous algorithms for
analyzing data in this format, and GraphFrames allows us to leverage many algorithms out of the box.
Development continues as new algorithms are added to GraphFrames, so this list will most likely
continue to grow.
PageRank
One of the most prolific graph algorithms is PageRank. Larry Page, cofounder of Google, created
PageRank as a research project for how to rank web pages. Unfortunately, a complete explanation of
how PageRank works is outside the scope of this book. However, to quote Wikipedia, the high-level
explanation is as follows:
PageRank works by counting the number and quality of links to a page to determine a rough
estimate of how important the website is. The underlying assumption is that more important
websites are likely to receive more links from other websites.
PageRank generalizes quite well outside of the web domain. We can apply this right to our own data
and get a sense for important bike stations (specifically, those that receive a lot of bike traffic). In this
example, important bike stations will be assigned large PageRank values:
// in Scala
import org.apache.spark.sql.functions.desc
val ranks = stationGraph.pageRank.resetProbability(0.15).maxIter(10).run()
ranks.vertices.orderBy(desc("pagerank")).select("id", "pagerank").show(10)
# in Python
from pyspark.sql.functions import desc
ranks = stationGraph.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.orderBy(desc("pagerank")).select("id", "pagerank").show(10)
+--------------------+------------------+
| id| pagerank|
+--------------------+------------------+
|San Jose Diridon ...| 4.051504835989922|
|San Francisco Cal...|3.3511832964279518|
...
| Townsend at 7th| 1.568456580534273|
|Embarcadero at Sa...|1.5414242087749768|
+--------------------+------------------+
GRAPH ALGORITHM APIS: PARAMETERS AND RETURN VALUES
Most algorithms in GraphFrames are accessed as methods which take parameters (e.g.,
resetProbability in this PageRank example). Most algorithms return either a new GraphFrame
or a single DataFrame. The results of the algorithm are stored as one or more columns in the
GraphFrame’s vertices and/or edges or the DataFrame. For PageRank, the algorithm returns a
GraphFrame, and we can extract the estimated PageRank values for each vertex from the new
pagerank column.
WARNING
Depending on the resources available on your machine, this may take some time. You can always try a smaller set of data
before running this to see the results. On Databricks Community Edition, this takes about 20 seconds to run, although some
reviewers found it to take much longer on their machines.
Interestingly, we see that Caltrain stations rank quite highly. This makes sense because these are
natural connection points where a lot of bike trips might end up, as commuters ride from home to the
Caltrain station on their way to work, or from the Caltrain station back home.
In-Degree and Out-Degree Metrics
Our graph is a directed graph. This is due to the bike trips being directional, starting in one location
and ending in another. One common task is to count the number of trips into or out of a given station.
To measure trips in and out of stations, we will use a metric called in-degree and out-degree,
respectively, as seen in Figure 30-4.
Figure 30-4. In-degree and out-degree
This is particularly applicable in the context of social networking because certain users may have
many more inbound connections (i.e., followers) than outbound connections (i.e., people they follow).
Using the following query, you can find interesting people in the social network who might have more
influence than others. GraphFrames provides a simple way to query our graph for this information:
// in Scala
val inDeg = stationGraph.inDegrees
inDeg.orderBy(desc("inDegree")).show(5, false)
# in Python
inDeg = stationGraph.inDegrees
inDeg.orderBy(desc("inDegree")).show(5, False)
The result of querying for the stations sorted by the highest in-degree:
+----------------------------------------+--------+
|id |inDegree|
+----------------------------------------+--------+
|San Francisco Caltrain (Townsend at 4th)|34810 |
|San Francisco Caltrain 2 (330 Townsend) |22523 |
|Harry Bridges Plaza (Ferry Building) |17810 |
|2nd at Townsend |15463 |
|Townsend at 7th |15422 |
+----------------------------------------+--------+
We can query the out degrees in the same fashion:
// in Scala
val outDeg = stationGraph.outDegrees
outDeg.orderBy(desc("outDegree")).show(5, false)
# in Python
outDeg = stationGraph.outDegrees
outDeg.orderBy(desc("outDegree")).show(5, False)
+---------------------------------------------+---------+
|id |outDegree|
+---------------------------------------------+---------+
|San Francisco Caltrain (Townsend at 4th) |26304 |
|San Francisco Caltrain 2 (330 Townsend) |21758 |
|Harry Bridges Plaza (Ferry Building) |17255 |
|Temporary Transbay Terminal (Howard at Beale)|14436 |
|Embarcadero at Sansome |14158 |
+---------------------------------------------+---------+
The ratio of these two values is an interesting metric to look at. A higher ratio value will tell us
where a large number of trips end (but rarely begin), while a lower value tells us where trips often
begin (but infrequently end):
// in Scala
val degreeRatio = inDeg.join(outDeg, Seq("id"))
.selectExpr("id", "double(inDegree)/double(outDegree) as degreeRatio")
degreeRatio.orderBy(desc("degreeRatio")).show(10, false)
degreeRatio.orderBy("degreeRatio").show(10, false)
# in Python
degreeRatio = inDeg.join(outDeg, "id")\
.selectExpr("id", "double(inDegree)/double(outDegree) as degreeRatio")
degreeRatio.orderBy(desc("degreeRatio")).show(10, False)
degreeRatio.orderBy("degreeRatio").show(10, False)
Those queries result in the following data:
+----------------------------------------+------------------+
|id |degreeRatio |
+----------------------------------------+------------------+
|Redwood City Medical Center |1.5333333333333334|
|San Mateo County Center |1.4724409448818898|
...
|Embarcadero at Vallejo |1.2201707365495336|
|Market at Sansome |1.2173913043478262|
+----------------------------------------+------------------+
+-------------------------------+------------------+
|id |degreeRatio |
+-------------------------------+------------------+
|Grant Avenue at Columbus Avenue|0.5180520570948782|
|2nd at Folsom |0.5909488686085761|
...
|San Francisco City Hall |0.7928849902534113|
|Palo Alto Caltrain Station |0.8064516129032258|
+-------------------------------+------------------+
Breadth-First Search
Breadth-first search will search our graph for how to connect two sets of nodes, based on the edges
in the graph. In our context, we might want to do this to find the shortest paths to different stations, but
the algorithm also works for sets of nodes specified through a SQL expression. We can specify the
maximum number of edges to follow with maxPathLength, and we can also specify an edgeFilter to
filter out edges that do not meet a requirement, like trips during nonbusiness hours.
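Conceptually, breadth-first search explores the graph level by level from the source until it reaches the target, which guarantees the first path found has the fewest edges. A minimal plain-Python sketch of the idea (illustrative only, not the GraphFrames implementation; the station names are just sample data):

```python
from collections import deque

def bfs_path(edges, start, end):
    # edges: list of (src, dst) pairs; returns one shortest path (by edge
    # count) from start to end, or None if no path exists.
    neighbors = {}
    for s, d in edges:
        neighbors.setdefault(s, []).append(d)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()  # shortest unexplored path so far
        if path[-1] == end:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

edges = [("Townsend at 7th", "2nd at Townsend"),
         ("2nd at Townsend", "Spear at Folsom"),
         ("Townsend at 7th", "Spear at Folsom")]
print(bfs_path(edges, "Townsend at 7th", "Spear at Folsom"))
```

Because the direct edge exists, the two-hop route through "2nd at Townsend" is never returned; GraphFrames' bfs behaves analogously, with maxPathLength capping how many levels it explores.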
We’ll choose two fairly close stations so that this does not run too long. However, you can do
interesting graph traversals when you have sparse graphs that have distant connections. Feel free to
play around with the stations (especially those in other cities) to see if you can get distant stations to
connect:
// in Scala
stationGraph.bfs.fromExpr("id = 'Townsend at 7th'")
.toExpr("id = 'Spear at Folsom'").maxPathLength(2).run().show(10)
# in Python
stationGraph.bfs(fromExpr="id = 'Townsend at 7th'",
toExpr="id = 'Spear at Folsom'", maxPathLength=2).show(10)
+--------------------+--------------------+--------------------+
| from| e0| to|
+--------------------+--------------------+--------------------+
|[65,Townsend at 7...|[913371,663,8/31/...|[49,Spear at Fols...|
|[65,Townsend at 7...|[913265,658,8/31/...|[49,Spear at Fols...|
...
|[65,Townsend at 7...|[903375,850,8/24/...|[49,Spear at Fols...|
|[65,Townsend at 7...|[899944,910,8/21/...|[49,Spear at Fols...|
+--------------------+--------------------+--------------------+
Connected Components
A connected component defines an (undirected) subgraph that has connections to itself but does not
connect to the greater graph, as illustrated in Figure 30-5.
Figure 30-5. A connected component
The connected components algorithm does not directly relate to our current problem because it
assumes an undirected graph. However, we can still run the algorithm, which simply assumes that there
is no directionality associated with our edges. In fact, if we look at the bike share map, we might
expect to get two distinct connected components (Figure 30-6).
Figure 30-6. A map of Bay Area bike share locations
WARNING
To run this algorithm, you will need to set a checkpoint directory which will store the state of the job at every iteration. This
allows you to continue where you left off if the job crashes. This is probably one of the most expensive algorithms currently
in GraphFrames, so expect delays.
One thing you will likely have to do to run this algorithm on your local machine is take a sample of
the data, just as we do in the following code example (taking a sample can help you get to a result
without crashing the Spark application with garbage collection issues):
// in Scala
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
# in Python
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
// in Scala
val minGraph = GraphFrame(stationVertices, tripEdges.sample(false, 0.1))
val cc = minGraph.connectedComponents.run()
# in Python
minGraph = GraphFrame(stationVertices, tripEdges.sample(False, 0.1))
cc = minGraph.connectedComponents()
From this query we get two connected components but not necessarily the ones we might expect. Our
sample may not have all of the correct data or information so we’d probably need more compute
resources to investigate further:
// in Scala
cc.where("component != 0").show()
# in Python
cc.where("component != 0").show()
+----------+------------------+---------+-----------+---------+------------+-----
|station_id| id| lat| long|dockcount| landmark|in...
+----------+------------------+---------+-----------+---------+------------+-----
| 47| Post at Kearney|37.788975|-122.403452| 19|San Franc...| ...
| 46|Washington at K...|37.795425|-122.404767| 15|San Franc...| ...
+----------+------------------+---------+-----------+---------+------------+-----
Strongly Connected Components
GraphFrames includes another related algorithm that relates to directed graphs: strongly connected
components, which takes directionality into account. A strongly connected component is a subgraph
that has paths between all pairs of vertices inside it.
// in Scala
val scc = minGraph.stronglyConnectedComponents.maxIter(3).run()
# in Python
scc = minGraph.stronglyConnectedComponents(maxIter=3)
scc.groupBy("component").count().show()
Advanced Tasks
This is just a short selection of some of the features of GraphFrames. The GraphFrames library also
includes features such as writing your own algorithms via a message-passing interface, triangle
counting, and converting to and from GraphX. You can find more information in the GraphFrames
documentation.
Conclusion
In this chapter, we took a tour of GraphFrames, a library for performing graph analysis on Apache
Spark. We took a more tutorial-based approach, since this processing technique is not necessarily the
first tool that people use when performing advanced analytics. It is nonetheless a powerful tool for
analyzing relationships between different objects, and critical in many domains. The next chapter will
talk about more cutting-edge functionality—specifically, deep learning.
Chapter 31. Deep Learning
Deep learning is one of the most exciting areas of development around Spark due to its ability to
solve several previously difficult machine learning problems, especially those involving unstructured
data such as images, audio, and text. This chapter will cover how Spark works in tandem with deep
learning, and some of the different approaches you can use to work with Spark and deep learning
together.
Because deep learning is still a new field, many of the newest tools are implemented in external
libraries. This chapter will not focus on packages that are necessarily core to Spark but rather on the
massive amount of innovation in libraries built on top of Spark. We will start with several high-level
ways to use deep learning on Spark, discuss when to use each one, and then go over the libraries
available for them. As usual, we will include end-to-end examples.
NOTE
To make the most of this chapter you should know at least the basics of deep learning as well as the basics of Spark. With
that being said, we point to an excellent resource at the beginning of this part of the book called the Deep Learning Book,
by some of the top researchers in this area.
What Is Deep Learning?
To define deep learning, we must first define neural networks. A neural network is a graph of nodes
with weights and activation functions. These nodes are organized into layers that are stacked on top
of one another. Each layer is connected, either partially or completely, to the previous layer in the
network. By stacking layers one after the other, these simple functions can learn to recognize more
and more complex signals in the input: simple lines with one layer, circles and squares with the next
layer, complex textures in another, and finally the full object or output you hope to identify. The goal
is to train the network to associate certain inputs with certain outputs by tuning the weights associated
with each connection and the values of each node in the network. Figure 31-1 shows a simple neural network.
Figure 31-1. A neural network
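The layered computation just described can be sketched in a few lines of plain Python. This is a toy illustration only (the weights are chosen arbitrarily and the activation is a sigmoid), not Spark code or any particular framework's API:

```python
import math

def sigmoid(x):
    # A common activation function: squashes any input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    # Each node's output is an activation applied to a weighted sum of inputs.
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# A toy two-layer network: 2 inputs -> 2 hidden nodes -> 1 output node.
hidden = layer([1.0, 0.5],
               [[0.4, -0.6], [0.3, 0.8]],  # one weight list per hidden node
               [0.0, 0.1])                 # one bias per hidden node
output = layer(hidden, [[1.2, -0.7]], [0.0])
```

Training adjusts the weights and biases so that inputs map to the desired outputs; stacking more layers lets the network represent progressively more complex features.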
Deep learning, or deep neural networks, stack many of these layers together into various different
architectures. Neural networks themselves have existed for decades, and have waxed and waned in
terms of popularity for various machine learning problems. Recently, however, a combination of
much larger datasets (e.g., the ImageNet corpus for object recognition), powerful hardware (clusters
and GPUs), and new training algorithms have enabled training much larger neural networks that
outperform previous approaches in many machine learning tasks. Traditional machine learning techniques typically cannot continue to improve as more data is added; their performance hits a ceiling.
Deep learning can benefit from enormous amounts of data and information and it is not uncommon for
deep learning datasets to be orders of magnitude larger than other machine learning datasets. Deep
neural networks have now become the standard in computer vision, speech processing, and some
natural language tasks, where they often “learn” better features than previous hand-tuned models.
They are also actively being applied in other areas of machine learning. Apache Spark's strength as a
big data and parallel computing system makes it a natural framework to use with deep learning.
Researchers and engineers have put a lot of effort into speeding up these neural network-like
calculations. Nowadays, the most popular way to use neural networks or deep learning is to use a
framework, implemented by a research institute or corporation. The most popular as of the time of
this writing are TensorFlow, MXNet, Keras, and PyTorch. This area is rapidly evolving so it’s
always worth searching around for others.
Ways of Using Deep Learning in Spark
For the most part, regardless of which application you are targeting, there are three major ways to use
deep learning in Spark:
Inference
The simplest way to use deep learning is to take a pretrained model and apply it to large datasets
in parallel using Spark. For example, you could use an image classification model, trained using a
standard dataset like ImageNet, and apply it to your own image collection to identify pandas,
flowers, or cars. Many organizations publish large, pretrained models on common datasets (e.g.,
Faster R-CNN and YOLO for object detection), so you can often take a model from your favorite
deep learning framework and apply it in parallel using a Spark function. Using PySpark, you
could simply call a framework such as TensorFlow or PyTorch in a map function to get distributed
inference, though some of the libraries we discuss make further optimizations beyond simply
calling these libraries in a map function.
Featurization and transfer learning
The next level of complexity is to use an existing model as a featurizer instead of taking its final
output. Many deep learning models learn useful feature representations in their lower layers as
they get trained for an end-to-end task. For example, a classifier trained on the ImageNet dataset
will also learn low-level features present in all natural images, such as edges and textures. We
can then use these features to learn models for a new problem not covered by the original dataset.
This method is called transfer learning, and generally involves taking the last few layers of a pretrained model and retraining them with the data of interest. Transfer learning is also especially useful if
you do not have a large amount of training data: training a full-blown network from scratch
requires a dataset of hundreds of thousands of images, like ImageNet, to avoid overfitting, which
will not be available in many business contexts. In contrast, transfer learning can work even with
a few thousand images because it updates fewer parameters.
Model training
Spark can also be used to train a new deep learning model from scratch. There are two common
methods here. First, you can use a Spark cluster to parallelize the training of a single model over
multiple servers, communicating updates between them. Alternatively, some libraries let the user
train multiple instances of similar models in parallel to try various model architectures and
hyperparameters, accelerating the model search and tuning process. In both cases, Spark's deep
learning libraries make it simple to pass data from RDDs and DataFrames to deep learning
algorithms. Finally, even if you do not wish to train your model in parallel, these libraries can be
used to extract data from a cluster and export it to a single-machine training script using the native
data format of frameworks like TensorFlow.
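To make the transfer-learning approach described above concrete, here is a toy, framework-free sketch in plain Python: a fixed "featurizer" stands in for the frozen lower layers of a pretrained network, and only a small logistic-regression last layer is trained with a few SGD steps. All functions and numbers here are illustrative assumptions, not part of any Spark or deep learning API:

```python
import math
import random

def featurize(x):
    # Stand-in for the frozen lower layers of a pretrained network:
    # it is never updated during training.
    return [x, x * x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Train only the "last layer" (a logistic regression) on top of the
# frozen features, using a tiny labeled dataset of (input, label) pairs.
random.seed(0)
w, b = [0.0, 0.0], 0.0
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
for _ in range(200):
    x, y = random.choice(data)
    f = featurize(x)
    p = sigmoid(w[0] * f[0] + w[1] * f[1] + b)
    grad = p - y  # gradient of the logistic loss w.r.t. the logit
    w = [wi - 0.1 * grad * fi for wi, fi in zip(w, f)]
    b -= 0.1 * grad

# Predict on a new positive input; only the last layer was ever trained.
f = featurize(1.5)
pred = sigmoid(w[0] * f[0] + w[1] * f[1] + b)
```

Because only the last layer's handful of parameters are updated, far less data is needed than when training a full network from scratch, which is the key point of the transfer-learning approach.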
In all three cases, the deep learning code typically runs as part of a larger application that includes
Extract, Transform, and Load (ETL) steps to parse the input data, I/O from various sources, and
potentially batch or streaming inference. For these other parts of the application, you can simply use
the DataFrame, RDD, and MLlib APIs described earlier in this book. One of Spark's strengths is the
ease of combining these steps into a single parallel workflow.
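As a sketch of the inference pattern described above, the following plain-Python code mimics what a Spark map function would do on one partition. FakeModel is a hypothetical stand-in for a real pretrained model (e.g., one loaded from TensorFlow or PyTorch); in an actual job you would pass predict_partition to rdd.mapPartitions(...):

```python
class FakeModel:
    """Hypothetical stand-in for a pretrained deep learning model."""
    def predict(self, features):
        return sum(features)  # placeholder for real inference

def predict_partition(rows):
    # Load the model once per partition rather than once per record --
    # the key optimization when applying models inside a Spark map function.
    model = FakeModel()
    for record_id, features in rows:
        yield (record_id, model.predict(features))

# Simulate Spark invoking the function on one partition of records:
results = list(predict_partition([(1, [0.25, 0.25]), (2, [0.5, 0.5])]))
```

The libraries discussed below (such as TensorFrames and Deep Learning Pipelines) optimize beyond this basic pattern, but the idea of distributing a pretrained model over partitions of data is the same.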
Deep Learning Libraries
In this section, we’ll survey a few of the most popular libraries available for deep learning in Spark.
We will describe the main use cases of the library and link them to references or examples when
possible. This list is not meant to be exhaustive, because the field is rapidly evolving. We encourage
you to check each library's website and the Spark documentation for the latest updates.
MLlib Neural Network Support
Spark's MLlib currently has native support for a single deep learning algorithm: the multilayer perceptron classifier, implemented by the ml.classification.MultilayerPerceptronClassifier class.
This class is limited to training relatively shallow networks containing fully connected layers with the
sigmoid activation function and an output layer with a softmax activation function. This class is most
useful for training the last few layers of a classification model when using transfer learning on top of
an existing deep learning–based featurizer. For example, it can be added on top of the Deep Learning
Pipelines library we describe later in this chapter to quickly perform transfer learning over Keras and
TensorFlow models.
TensorFrames
TensorFrames is an inference and transfer learning-oriented library that makes it easy to pass data
between Spark DataFrames and TensorFlow. It supports Python and Scala interfaces and focuses on
providing a simple but optimized interface to pass data from TensorFlow to Spark and back. In
particular, using TensorFrames to apply a model over Spark DataFrames is generally more efficient
than calling a Python map function that directly invokes the TensorFlow model, due to faster data
transfer and amortization of the startup cost. TensorFrames is most useful for inference, in both
streaming and batch settings, and for transfer learning, where you can apply an existing model over
raw data to featurize it, then learn the last layers using a MultilayerPerceptronClassifier or
even a simpler logistic regression or random forest classifier over the data.
BigDL
BigDL is a distributed deep learning framework for Apache Spark primarily developed by Intel. It
aims to support distributed training of large models as well as fast applications of these models using
inference. One key advantage of BigDL over the other libraries described here is that it is primarily
optimized to use CPUs instead of GPUs, making it efficient to run on an existing, CPU-based cluster
(e.g., an Apache Hadoop deployment). BigDL provides high-level APIs to build neural networks
from scratch and automatically distributes all operations by default. It can also train models
described with the Keras DL library.
TensorFlowOnSpark
TensorFlowOnSpark is a widely used library that can train TensorFlow models in a parallel fashion
on Spark clusters. TensorFlow includes some foundations to do distributed training, but it still needs
to rely on a cluster manager for managing the hardware and data communications. It does not come
with a cluster manager or a distributed I/O layer out of the box. TensorFlowOnSpark launches
TensorFlow's existing distributed mode inside a Spark job, and automatically feeds data from Spark RDDs or DataFrames into the TensorFlow job. If you already know how to use TensorFlow's
distributed mode, TensorFlowOnSpark makes it easy to launch your job inside a Spark cluster and
pass it data processed with other Spark libraries (e.g., DataFrame transformations) from any input
source Spark supports. TensorFlowOnSpark was originally developed at Yahoo! and is also used in
production at other large organizations. The project also integrates with Spark's ML Pipelines API.
DeepLearning4J
DeepLearning4j is an open-source, distributed deep learning project in Java and Scala that provides
both single-node and distributed training options. One of its advantages over Python-based deep
learning frameworks is that it was primarily designed for the JVM, making it more convenient for
groups that do not wish to add Python to their development process. It includes a wide variety of
training algorithms and support for CPUs as well as GPUs.
Deep Learning Pipelines
Deep Learning Pipelines is an open source package from Databricks that integrates deep learning
functionality into Spark's ML Pipelines API. The package builds on existing deep learning frameworks (TensorFlow and Keras at the time of writing), but focuses on two goals:
Incorporating these frameworks into standard Spark APIs (such as ML Pipelines and Spark
SQL) to make them very easy to use
Distributing all computation by default
For example, Deep Learning Pipelines provides a DeepImageFeaturizer class that acts as a
transformer in the Spark ML Pipeline API, allowing you to build a transfer learning pipeline in just a
few lines of code (e.g., by adding a perceptron or logistic regression classifier on top). Likewise, the
library supports parallel grid search over multiple model parameters using MLlib’s grid search and
cross-validation API. Finally, users can export an ML model as a Spark SQL user-defined function
and make it available to analysts using SQL or streaming applications. At the time of writing (summer
2017), Deep Learning Pipelines is under heavy development, so we encourage you to check its
website for the latest updates.
Table 31-1 summarizes the various deep learning libraries and the main use cases they support:
Table 31-1. Deep learning libraries

Library                  Underlying DL framework  Use cases
BigDL                    BigDL                    Distributed training, inference, ML Pipeline integration
DeepLearning4J           DeepLearning4J           Inference, transfer learning, distributed training
Deep Learning Pipelines  TensorFlow, Keras        Inference, transfer learning, multi-model training, ML Pipeline and Spark SQL integration
MLlib Perceptron         Spark                    Distributed training, ML Pipeline integration
TensorFlowOnSpark        TensorFlow               Distributed training, ML Pipeline integration
TensorFrames             TensorFlow               Inference, transfer learning, DataFrame integration
While there are several approaches different companies have taken to integrating Spark and deep
learning libraries, the one currently aiming for the closest integration with MLlib and DataFrames is
Deep Learning Pipelines. This library aims to improve Spark's support for image and tensor data
(which will be integrated into the core Spark codebase in Spark 2.3), and to make all deep learning
functionality available in the ML Pipeline API. Its friendly API makes it the simplest way to run deep
learning on Spark today and will be the focus of the remaining sections in this chapter.
A Simple Example with Deep Learning Pipelines
As we described, Deep Learning Pipelines provides high-level APIs for scalable deep learning by
integrating popular deep learning frameworks with ML Pipelines and Spark SQL.
Deep Learning Pipelines builds on Spark's ML Pipelines for training and on Spark DataFrames and
SQL for deploying models. It includes high-level APIs for common aspects of deep learning so they
can be done efficiently in a few lines of code:
Working with images in Spark DataFrames;
Applying deep learning models at scale, whether they are your own or standard popular
models, to image and tensor data;
Transfer learning using common pretrained deep learning models;
Exporting models as Spark SQL functions to make it simple for all kinds of users to take
advantage of deep learning; and
Distributed deep learning hyperparameter tuning via ML Pipelines.
Deep Learning Pipelines currently only offers an API in Python, which is designed to work closely
with existing Python deep learning packages such as TensorFlow and Keras.
Setup
Deep Learning Pipelines is a Spark Package, so we’ll load it just like we loaded GraphFrames. Deep
Learning Pipelines works on Spark 2.x and the package can be found here. You're going to need to
install a few Python dependencies, including TensorFrames, TensorFlow, Keras, and h5py. Make sure
these are installed across both your driver and worker machines.
We’ll use the flowers dataset from the TensorFlow retraining tutorial. Now if youre running this on a
cluster of machines, youre going to need a way to put these files on a distributed file system once you
download them. We include a sample of these images in the books GitHub Repository.
Images and DataFrames
One of the historical challenges when working with images in Spark is that getting them into a
DataFrame was difficult and tedious. Deep Learning Pipelines includes utility functions that make
loading and decoding images in a distributed fashion easy. This is an area that’s changing rapidly.
Currently, this is a part of Deep Learning Pipelines. Basic image loading and representation will be
included in Spark 2.3. While it is not released yet, all of the examples in this chapter should be
compatible with this upcoming version of Spark:
from sparkdl import readImages
img_dir = '/data/deep-learning-images/'
image_df = readImages(img_dir)
The resulting DataFrame contains the path and then the image along with some associated metadata:
image_df.printSchema()
root
|-- filePath: string (nullable = false)
|-- image: struct (nullable = true)
| |-- mode: string (nullable = false)
| |-- height: integer (nullable = false)
| |-- width: integer (nullable = false)
| |-- nChannels: integer (nullable = false)
| |-- data: binary (nullable = false)
Transfer Learning
Now that we have some data, we can get started with some simple transfer learning. Remember, this
means leveraging a model that someone else created and modifying it to better suit our own purposes.
First, we will load the data for each type of flower and create a training and test set:
from sparkdl import readImages
from pyspark.sql.functions import lit
tulips_df = readImages(img_dir + "/tulips").withColumn("label", lit(1))
daisy_df = readImages(img_dir + "/daisy").withColumn("label", lit(0))
tulips_train, tulips_test = tulips_df.randomSplit([0.6, 0.4])
daisy_train, daisy_test = daisy_df.randomSplit([0.6, 0.4])
train_df = tulips_train.unionAll(daisy_train)
test_df = tulips_test.unionAll(daisy_test)
In the next step we will leverage a transformer called the DeepImageFeaturizer. This will allow us
to leverage a pretrained model called Inception, a powerful neural network successfully used to
identify patterns in images. The version we are using is pretrained to work well with images of
various common objects and animals. This is one of the standard pretrained models that ship with the
Keras library. However, this particular neural network is not trained to recognize daisies and tulips.
So we’re going to use transfer learning in order to make it into something useful for our own
purposes: distinguishing different flower types.
Note that we can use the same ML Pipeline concepts we learned about throughout this part of the book
and leverage them with Deep Learning Pipelines: DeepImageFeaturizer is just an ML transformer.
Additionally, all that we’ve done to extend this model is add on a logistic regression model in order
to facilitate the training of our end model. We could use another classifier in its place. The following
code snippet demonstrates adding this model (note that this may take some time to complete, as it's a fairly resource-intensive process):
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from sparkdl import DeepImageFeaturizer
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
modelName="InceptionV3")
lr = LogisticRegression(maxIter=1, regParam=0.05, elasticNetParam=0.3,
labelCol="label")
p = Pipeline(stages=[featurizer, lr])
p_model = p.fit(train_df)
Once we’ve trained the model, we can use the same classification evaluator we used in Chapter 25.
We can specify the metric we’d like to test and then evaluate it:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
tested_df = p_model.transform(test_df)
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Test set accuracy = " + str(evaluator.evaluate(tested_df.select(
"prediction", "label"))))
With our DataFrame of examples, we can inspect the rows and images in which we made mistakes in
the previous training:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import expr, udf
# a simple UDF to convert the value to a double
def _p1(v):
return float(v.array[1])
p1 = udf(_p1, DoubleType())
df = tested_df.withColumn("p_1", p1(tested_df.probability))
wrong_df = df.orderBy(expr("abs(p_1 - label)"), ascending=False)
wrong_df.select("filePath", "p_1", "label").limit(10).show()
Applying deep learning models at scale
Spark DataFrames are a natural construct for applying deep learning models to a large-scale dataset.
Deep Learning Pipelines provides a set of Transformers for applying TensorFlow graphs and
TensorFlow-backed Keras models at scale. In addition, popular image models can be applied out of
the box, without requiring any TensorFlow or Keras code. The transformers, backed by the
TensorFrames library, efficiently handle the distribution of models and data to Spark tasks.
Applying Popular Models
There are many standard deep learning models for images. If the task at hand is very similar to what
the models provide (e.g., object recognition with ImageNet classes), or merely for exploration, you
can use the transformer DeepImagePredictor by simply specifying the model name. Deep Learning
Pipelines supports a variety of standard models included in Keras, which are listed on its website.
The following is an example of using DeepImagePredictor:
from sparkdl import readImages, DeepImagePredictor
image_df = readImages(img_dir)
predictor = DeepImagePredictor(
inputCol="image",
outputCol="predicted_labels",
modelName="InceptionV3",
decodePredictions=True,
topK=10)
predictions_df = predictor.transform(image_df)
Notice that the predicted_labels column shows “daisy” as a high probability class for all sample
flowers using this base model. However, as can be seen from the differences in the probability
values, the neural network has the information to discern the two flower types. As we can see, our
transfer learning example was able to properly learn the differences between daisies and tulips
starting from the base model:
df = p_model.transform(image_df)
Applying custom Keras models
Deep Learning Pipelines also allows us to apply a Keras model in a distributed manner using Spark.
To do this, check the user guide on the KerasImageFileTransformer. This loads a Keras model and
applies it to a DataFrame column.
Applying TensorFlow models
Deep Learning Pipelines, through its integration with TensorFlow, can be used to create custom
transformers that manipulate images using TensorFlow. For instance, you could create a transformer
to change the size of an image or modify the color spectrum. To do this, use the
TFImageTransformer class.
Deploying models as SQL functions
Another option is to deploy a model as a SQL function, allowing any user who knows SQL to use a deep learning model. Once registered, the resulting UDF takes a column and produces the output of the given model. For instance, you could apply Inception v3 to a variety of images by using the registerKerasImageUDF function:
from keras.applications import InceptionV3
from sparkdl.udf.keras_image_model import registerKerasImageUDF
registerKerasImageUDF("my_keras_inception_udf", InceptionV3(weights="imagenet"))
This way, the power of deep learning is available to any Spark user, not just the specialist who built
the model.
Conclusion
This chapter discussed several common approaches to using deep learning in Spark. We covered a
variety of available libraries and then worked through some basic examples of common tasks. This
area of Spark is under very active development and will continue to advance, so it's worth checking in on these libraries to learn more as time goes on. The authors of this book hope to keep this chapter up to date with current developments.
Part VII. Ecosystem
Chapter 32. Language Specifics: Python
(PySpark) and R (SparkR and sparklyr)
This chapter will cover some of the more nuanced language specifics of Apache Spark. We’ve seen a
huge number of PySpark examples throughout the book. In Chapter 1, we discussed at a high level
how Spark runs code from other languages. Let’s talk through some of the more specific integrations:
PySpark
SparkR
sparklyr
As a reminder, Figure 32-1 shows the fundamental architecture for these specific languages.
Figure 32-1. The Spark Driver
Now let's cover each of these in depth.
PySpark
We covered a ton of PySpark throughout this book. In fact, PySpark is included alongside Scala and
SQL in nearly every chapter in this book. Therefore, this section will be short and sweet, covering
only the details that are relevant to Spark itself. As we discussed in Chapter 1, Spark 2.2 included a
way to install PySpark with pip. Simply, pip install pyspark will make it available as a package
on your local machine. This is new, so there may be some bugs to fix, but it is something that you can
leverage in your projects today.
Fundamental PySpark Differences
If you're using the structured APIs, your code should run just about as fast as if you had written it in Scala, except when you use UDFs in Python. If you're using a UDF, you may see a performance impact. Refer back to Chapter 6 for more information on why this is the case.
If you're using the unstructured APIs, specifically RDDs, your performance is going to suffer (in exchange for a bit more flexibility). We touch on this reasoning in Chapter 12, but the fundamental idea
is that Spark is going to have to work a lot harder converting information from something that Spark
and the JVM can understand to Python and back again. This includes both functions as well as data
and is a process known as serialization. We’re not saying it never makes sense to use them; it’s just
something to be aware of when doing so.
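As a rough illustration of the serialization round trip just described (Spark's actual mechanism involves more machinery than plain pickle, but the idea is the same: data and functions must be converted to bytes to cross the JVM/Python boundary):

```python
import pickle

data = [1, 2, 3]

# Crossing a process boundary means converting objects to bytes and back.
payload = pickle.dumps(data)      # serialize: Python objects -> bytes
restored = pickle.loads(payload)  # deserialize: bytes -> Python objects

# Only after the round trip can the Python worker apply its function:
result = [x * 2 for x in restored]
```

Every record processed by a Python RDD function pays some version of this cost in both directions, which is why the structured APIs (which keep most of the work in the JVM) are usually much faster.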
Pandas Integration
One of the powers of PySpark is its ability to work across programming models. For instance, a
common pattern is to perform very large-scale ETL work with Spark and then collect the (single-
machine-sized) result to the driver and then leverage Pandas to manipulate it further. This allows you
to use a best-in-class tool for the best task at hand—Spark for big data and Pandas for small data:
import pandas as pd
df = pd.DataFrame({"first":range(200), "second":range(50,250)})
sparkDF = spark.createDataFrame(df)
newPDF = sparkDF.toPandas()
newPDF.head()
These niceties make working with data both big and small easy with Spark. Spark's community continues
to focus on improving this interoperability with various other projects, so the integration between
Spark and Python will continue to improve. For example, at the time of writing, the community is
actively working on Vectorized UDFs (SPARK-21190), which add a mapBatches API to let you
process a Spark DataFrame as a series of Pandas data frames in Python instead of converting each
individual row to a Python object. This feature is targeted to appear in Spark 2.3.
R on Spark
The rest of this chapter will cover R, Spark's newest officially supported language. R is a language and environment for statistical computing and graphics. It is similar to the S language and environment developed at Bell Laboratories by John Chambers (no relation to one of the authors of this book) and colleagues. The R language has been around for decades and is consistently popular
among statisticians and those doing research in numerical computing. R is steadily becoming a first-
class citizen in Spark and provides the simplest open source interface for distributed computation to
the R language.
The popularity of R for performing single-machine data analysis and advanced analytics makes it an
excellent complement to Spark. There are two core initiatives to making this partnership a reality:
SparkR and sparklyr. These packages take slightly different approaches to provide similar
functionality. SparkR provides a DataFrame API similar to R's data.frame, while sparklyr is
based on the popular dplyr package for accessing structured data. You can use whichever you prefer
in your code, but over time we expect that the community might converge toward a single integrated
package.
We will cover both packages here to let you choose which API you prefer. For the most part, both of
these projects are mature and well supported, albeit by slightly different communities. They both
support Sparks structured APIs and allow for machine learning. We will elaborate on their
differences in the next sections.
SparkR
SparkR is an R package (originating as a collaborative research project between UC Berkeley,
Databricks, and MIT CSAIL) that provides a frontend to Apache Spark based on familiar R APIs.
SparkR is conceptually similar to R’s built-in data.frame API, except for some departures from the
API semantics, such as lazy evaluation. SparkR is a part of the official Spark project and is supported
as such. See the documentation for SparkR for more information.
Pros and cons of using SparkR instead of other languages
The reasons we would recommend that you use SparkR as opposed to PySpark are the following:
You are familiar with R and want to take the smallest step to leverage the capabilities of Spark.
You want to leverage R-specific functionality or libraries (say, the excellent ggplot2 library) and would like to work with big data in the process.
R is a powerful programming language that provides a lot of advantages over other languages when it
comes to certain tasks. However, it has its share of shortcomings, such as the lack of native support for working with distributed data. SparkR aims to fill this gap and does a great job of enabling users to be successful with both small and large data, in a way conceptually similar to PySpark and Pandas.
Setup
Let's take a look at how to use SparkR. Naturally, you will need to have R installed on your system to
follow along in this chapter. To start up the shell, in your Spark home folder, run ./bin/sparkR to
start SparkR. This will automatically create a SparkSession for you. If you were to run SparkR from
RStudio, you would have to do something like the following:
library(SparkR)
spark <- sparkR.session()
Once we’ve started the shell, we can run Spark commands. For instance, we can read in a CSV file
like we saw in Chapter 9:
retail.data <- read.df(
"/data/retail-data/all/",
"csv",
header="true",
inferSchema="true")
print(str(retail.data))
We can take some rows from this SparkDataFrame and convert them to a standard R data.frame
type:
local.retail.data <- take(retail.data, 5)
print(str(local.retail.data))
Key Concepts
Now that we've seen some very basic code, let's reiterate the key concepts. First, SparkR is still Spark.
Basically, all the tools that you have seen across the entire book apply directly to SparkR. It runs
according to the same principles as PySpark and has almost all of the same functionality available as
PySpark.
As shown in Figure 32-1, there is a gateway that connects the R process to the JVM that contains a
SparkSession, and SparkR converts user code into structured Spark manipulations across the cluster.
This makes its efficiency on par with Python and Scala when using the structured APIs. SparkR has
no support for RDDs or other low-level APIs.
While SparkR is used less than PySpark or Scala, it's still popular and continues to grow. For those
that want to know enough Spark to leverage SparkR effectively, we recommend reading the following
section, along with Parts I and II of this book. When working through those other chapters, feel free to
try to use SparkR in place of Python or Scala. You'll see that once you get the hang of it, it's easy to
translate between the various languages.
The rest of this chapter will explain the most important differences between SparkR and “standard” R
to make it easier to be productive with SparkR faster.
The first thing we should cover is the difference between local types and Spark types. A data.frame's core difference from the Spark version is that it lives in memory and is usually directly available in that particular process. A SparkDataFrame is just a logical representation of a series of manipulations. Therefore, when we manipulate a data.frame, we'll see our results right away. On a
SparkDataFrame, we are going to logically manipulate the data using the same transformation and
action concepts that we saw throughout the book.
Once we have a SparkDataFrame, we can collect it into a local data.frame, similar to how we read in
data using Spark. We can do so with the following code (using the
SparkDataFrame we created in “Setup”):
# collect brings it from Spark to your local environment
collect(count(groupBy(retail.data, "country")))
# createDataFrame converts a data.frame
# from your local environment to Spark
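The createDataFrame direction can be sketched as follows (a hedged example reusing the retail.data SparkDataFrame created earlier in this chapter):

```r
# Spark -> local: collect brings the results into a local data.frame
local.counts <- collect(count(groupBy(retail.data, "country")))
# local -> Spark: createDataFrame converts a local data.frame to a SparkDataFrame
spark.counts <- createDataFrame(local.counts)
```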
This difference is of consequence for end users. Certain functions or assumptions that apply to local
data.frames do not apply in Spark. For instance, we cannot index a SparkDataFrame by a
particular row. Additionally, we cannot change individual values in a SparkDataFrame, but we can in
a local data.frame.
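A common workaround is to filter in Spark and collect a small subset locally, where standard data.frame indexing and in-place updates apply. A minimal sketch (assuming the retail.data SparkDataFrame, with its Country and Quantity columns, from earlier):

```r
# Filter in Spark, then bring a small subset into the local R process
local.subset <- collect(filter(retail.data, retail.data$Country == "France"))
# Row indexing and point updates work on the local data.frame
first.row <- local.subset[1, ]
local.subset[1, "Quantity"] <- 0
```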
Function masking
One frequent “gotcha” when users come to SparkR is that certain functions are masked by SparkR.
When I imported SparkR, I received the following message:
The following objects are masked from ‘package:stats’:
cov, filter, lag, na.omit, predict, sd, var, window
The following objects are masked from ‘package:base’:
as.data.frame, colnames, ...
This means that if we wish to call these masked functions, we need to be explicit about the package
that we’re calling them from or at least understand which function masks another. The ? can be helpful
in determining these conflicts:
?na.omit # refers to SparkR due to package loading order
?stats::na.omit # refers explicitly to stats
?SparkR::na.omit # refers explicitly to SparkR's null-value filtering
SparkR functions only apply to SparkDataFrames
One implication of function masking is that functions that worked on objects previously may no longer
work on them after you bring in the SparkR package. This is because SparkR functions only apply on
Spark objects. For instance, we cannot use the sample function on a standard data.frame because
Spark takes that function name:
sample(mtcars) # fails
What you have to do instead is explicitly use the base sample function. Additionally, the function
signatures differ between the two functions, which means that even if you are familiar with the syntax
and argument order for one particular library, it is not necessarily the same for
SparkR:
base::sample(some.r.data.frame) # some.r.data.frame = R data.frame type
Data manipulation
Data manipulation in SparkR is conceptually the same as Spark’s DataFrame API in other languages.
The core difference is in the syntax, largely due to us running R code and not another language.
Aggregations, filtering, and many of the functions that you can find in the other chapters throughout
this book are also available in R. For the most part, you can look at the names of functions or
manipulations that you find throughout this book and find out if they are available in SparkR by
running ?<function-name>. This should work the vast majority of the time, as there is good
coverage of structured SQL functions:
?to_date # the to_date DataFrame column function
SQL is largely the same. We can specify SQL commands that we can then manipulate as DataFrames.
For instance, we can find all tables that contain the word “production” in them:
tbls <- sql("SHOW TABLES")
collect(
  select(
    filter(tbls, like(tbls$tableName, "%production%")),
    "tableName",
    "isTemporary"))
We can also use the popular magrittr package to make this code more readable, leveraging the
piping operator to chain our transformations in a more functional and readable syntax:
library(magrittr)
tbls %>%
  filter(like(tbls$tableName, "%production%")) %>%
  select("tableName", "isTemporary") %>%
  collect()
Data sources
SparkR supports all of the data sources that Spark supports, including third-party packages. We can
see in the following snippet that we simply specify the options using a slightly different syntax:
retail.data <- read.df(
  "/data/retail-data/all/",
  "csv",
  header="true",
  inferSchema="true")
flight.data <- read.df(
  "/data/flight-data/parquet/2010-summary.parquet",
  "parquet")
Refer back to Chapter 9 for more information.
Machine learning
Machine learning is a fundamental part of the R language, as well as of Spark. SparkR offers a
decent selection of Spark MLlib algorithms; typically, they arrive in R one or two versions after
they are introduced in Scala or Python. As of Spark 2.1, the following algorithms are supported in
SparkR:
spark.glm or glm: Generalized linear model
spark.survreg: Accelerated failure time (AFT) survival regression model
spark.naiveBayes: Naive Bayes model
spark.kmeans: k-means model
spark.logit: Logistic regression model
spark.isoreg: Isotonic regression model
spark.gaussianMixture: Gaussian mixture model
spark.lda: Latent Dirichlet allocation (LDA) model
spark.mlp: Multilayer perceptron classification model
spark.gbt: Gradient boosted tree model for regression and classification
spark.randomForest: Random forest model for regression and classification
spark.als: Alternating least squares (ALS) matrix factorization model
spark.kstest: Kolmogorov-Smirnov test
Under the hood, SparkR uses MLlib to train the model, which means that most everything covered in
Part VI is relevant for SparkR users. Users can call summary to print a summary of the fitted model,
predict to make predictions on new data, and write.ml/read.ml to save/load fitted models.
SparkR supports a subset of the available R formula operators for model fitting, including ~, ., :, +,
and -. Here’s an example of running a simple regression on the retail dataset:
model <- spark.glm(retail.data, Quantity ~ UnitPrice + Country,
  family='gaussian')
summary(model)
predict(model, retail.data)
write.ml(model, "/tmp/myModelOutput", overwrite=T)
newModel <- read.ml("/tmp/myModelOutput")
The API is consistent across models, although not all models support detailed summary outputs like
we saw with glm. For more information about specific models or preprocessing techniques, see the
corresponding chapters in Part VI.
While this pales in comparison to R’s extensive collection of statistical algorithms and analysis
libraries, many users do not require the scale that Spark provides for the actual training and usage of
their machine learning algorithms. Users have the opportunity to build training sets on large data using
Spark and then collect that dataset to their local environment for training on a local data.frame.
User-defined functions
In SparkR, there are several ways of running user-defined functions. A user-defined function is one
that is created in the native language and run on the cluster in that same native language. For the most
part, these run in the same way that a Python UDF runs: by serializing data into and out of the
JVM around the function.
The different kinds of UDFs you can define are as follows:
First, spark.lapply lets you run multiple instances of a function in Spark on different parameter
values provided in an R collection. This is a great way of performing grid search and comparing the
results:
families <- c("gaussian", "poisson")
train <- function(family) {
  model <- glm(Sepal.Length ~ Sepal.Width + Species, iris, family = family)
  summary(model)
}
# Return a list of the models' summaries
model.summaries <- spark.lapply(families, train)
# Print the summary of each model
print(model.summaries)
Second, dapply and dapplyCollect let you process SparkDataFrame data using custom code. In
particular, these functions will take each partition of the SparkDataFrame, convert it to an R
data.frame inside of an executor, and then call your R code over that partition (represented as an R
data.frame). They will then return the results: a SparkDataFrame for dapply or a local
data.frame for dapplyCollect.
To use dapply, which returns a SparkDataFrame, you must specify the output schema that will result
from the transformation so that Spark understands what kind of data you will return. For example, the
following code will allow you to train a local R model per partition in your SparkDataFrame,
assuming you partition your data according to the correct keys:
df <- withColumnRenamed(createDataFrame(as.data.frame(1:100)), "1:100", "col")
outputSchema <- structType(
  structField("col", "integer"),
  structField("newColumn", "double"))
udfFunc <- function(remote.data.frame) {
  remote.data.frame['newColumn'] = remote.data.frame$col * 2
  remote.data.frame
}
# outputs a SparkDataFrame, so it requires a schema
take(dapply(df, udfFunc, outputSchema), 5)
# collects all results to a local data.frame, so no schema is required;
# however, this will fail if the result is large
dapplyCollect(df, udfFunc)
Finally, the gapply and gapplyCollect functions apply a UDF to a group of data in a fashion similar
to dapply. In fact, these methods are largely the same, except that dapply operates on a generic
SparkDataFrame, whereas gapply applies to a grouped DataFrame. The gapply function applies your
function on a per-group basis, passing in the key as the first parameter to the function that you
define. In this way, you can be sure to have a function customized for each particular group:
local <- as.data.frame(1:100)
local['groups'] <- c("a", "b")
df <- withColumnRenamed(createDataFrame(local), "1:100", "col")
outputSchema <- structType(
  structField("col", "integer"),
  structField("groups", "string"),
  structField("newColumn", "double"))
udfFunc <- function(key, remote.data.frame) {
  if (key == "a") {
    remote.data.frame['newColumn'] = remote.data.frame$col * 2
  } else if (key == "b") {
    remote.data.frame['newColumn'] = remote.data.frame$col * 3
  } else if (key == "c") {
    remote.data.frame['newColumn'] = remote.data.frame$col * 4
  }
  remote.data.frame
}
# outputs a SparkDataFrame, so it requires a schema
take(gapply(df,
            "groups",
            udfFunc,
            outputSchema), 50)
gapplyCollect(df,
              "groups",
              udfFunc)
SparkR will continue to grow as a part of Spark; and if you’re familiar with R and a little bit of
Spark, this can be a very powerful tool.
sparklyr
sparklyr is a newer package from the RStudio team based on the popular dplyr package for
structured data. This package is fundamentally different from SparkR, and its authors take a more
opinionated stance toward what the integration between Spark and R should do. This means that
sparklyr sheds some of the Spark concepts that are available throughout this book, like the
SparkSession, and uses its own ideas instead. For some, this means that sparklyr takes an R-first
approach instead of SparkR’s approach of closely matching the Python and Scala APIs. That approach
speaks to its origins as a framework; sparklyr was created within the R community by the folks at
RStudio (the popular R IDE), rather than by the Spark community. Whether sparklyr’s
or SparkR’s approach is better or worse depends entirely on the end user’s preference.
In short, sparklyr provides an improved experience for R users familiar with dplyr, with slightly
less overall functionality than SparkR (which may change over time). Specifically, sparklyr
provides a complete dplyr backend to Spark, making it easy to take the dplyr code that you run
today on your local machine and make it distributed. The implication of the dplyr backend
architecture is that the same functions you use on local data.frame objects apply in a distributed
manner to distributed Spark DataFrames. In essence, scaling up requires no code changes. Since
functions apply to both single node and distributed DataFrames, this architecture addresses one of the
core challenges with SparkR today, where function masking can lead to strange debugging scenarios.
In addition, this architectural choice makes sparklyr an easier transition than simply using SparkR.
Like SparkR, sparklyr is an evolving project, and by the time this book is published, the sparklyr
project will have evolved further. For the most up-to-date reference, see the sparklyr
website. The following sections provide a lightweight comparison and won’t go into depth on this
particular project. Let’s get started with some hands-on examples of sparklyr. The first thing we
need to do is install the package:
install.packages("sparklyr")
library(sparklyr)
Key concepts
sparklyr ignores some of the fundamental concepts that Spark has and that we discussed throughout
this book. We posit that this is because these concepts are unfamiliar (and potentially irrelevant) to
the typical R user. For instance, rather than a SparkSession, there’s simply spark_connect, which
allows you to connect to a Spark cluster:
sc <- spark_connect(master = "local")
The returned variable is a remote dplyr data source. This connection, even though it resembles a
SparkContext, is not the same SparkContext we mentioned in this book. It is a purely sparklyr
concept that represents a connection to a Spark cluster. This function is largely the entire interface for
defining the configurations that you would like to use in your Spark environment. Through this
interface, you can specify initialization configurations for the Spark cluster as a whole:
spark_connect(master = "local", config = spark_config())
This works by using the config package in R to specify the configurations you would like to set on
your Spark cluster. These details are covered in the sparklyr deployment documentation.
Using this variable, we can manipulate remote Spark data from a local R process; the result of
spark_connect thus performs roughly the same administrative role for end users as a SparkContext.
No DataFrames
sparklyr ignores the concept of a unique SparkDataFrame type. Instead, it leverages tables (which
are still mapped to DataFrames inside Spark), similar to other dplyr data sources, and allows you to
manipulate those. This aligns more with the typical R workflow, which is to use dplyr and magrittr
to functionally define transformations from a source table. However, it means that some of Spark’s
built-in functions and APIs may not be accessible unless dplyr also supports them.
Data manipulation
Once we connect to our cluster, we can run all the available dplyr functions and manipulations as if
the tables were local dplyr data.frames. This architectural choice gives those familiar with R the
ability to do the same transformations using the same code, at scale. This means there’s no new syntax
or concepts for R users to learn.
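For example, a hedged sketch (assuming the sc connection created earlier; copy_to ships a local data.frame to the cluster as a table, and the table name "mtcars_demo" is illustrative):

```r
library(dplyr)

# Copy R's built-in mtcars data to Spark as a table
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_demo", overwrite = TRUE)

# Ordinary dplyr verbs are translated to Spark SQL and run on the cluster
mtcars_tbl %>%
  filter(cyl == 8) %>%
  group_by(gear) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  collect()
```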
While sparklyr does improve the R end-user experience, it comes at a cost of reducing the overall
power available to sparklyr users, since the concepts are R concepts, not necessarily Spark
concepts. For instance, sparklyr does not support user-defined functions that you can create and
apply in SparkR using dapply, gapply, and spark.lapply. As sparklyr continues to mature, it may add
this sort of functionality, but at the time of this writing this capability does not exist. sparklyr is
under very active development, and more functionality is being added, so refer to the sparklyr
homepage for more information.
Executing SQL
While there is less direct Spark integration, users can execute arbitrary SQL code against the cluster
using the DBI library, through almost the same SQL interface we have seen in previous
chapters:
library(DBI)
allTables <- dbGetQuery(sc, "SHOW TABLES")
This SQL interface provides a convenient lower-level interface to the SparkSession. For instance,
users can use DBI’s interface to set Spark SQL-specific properties on the Spark cluster:
setShufflePartitions <- dbGetQuery(sc, "SET spark.sql.shuffle.partitions=10")
Unfortunately, neither DBI nor spark_connect gives you an interface for setting arbitrary Spark-
specific properties; you will have to specify those when you connect to your cluster.
Data sources
Users can leverage many of the same data sources available in Spark using sparklyr. For example,
you should be able to create table statements using arbitrary data sources. However, only the CSV,
JSON, and Parquet formats are supported as first-class citizens, via the following function definitions:
spark_write_csv(tbl_name, location)
spark_write_json(tbl_name, location)
spark_write_parquet(tbl_name, location)
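The corresponding read functions follow the same naming pattern. As a sketch reusing the datasets from earlier in the chapter (the table names here are illustrative):

```r
retail_tbl <- spark_read_csv(sc, "retail", "/data/retail-data/all/")
flights_tbl <- spark_read_parquet(
  sc, "flights", "/data/flight-data/parquet/2010-summary.parquet")
```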
Machine learning
sparklyr also has support for some of the core machine learning algorithms that we saw in previous
chapters. A list of the supported algorithms (at the time of this writing) includes:
ml_kmeans: k-means clustering
ml_linear_regression: Linear regression
ml_logistic_regression: Logistic regression
ml_survival_regression: Survival regression
ml_generalized_linear_regression: Generalized linear regression
ml_decision_tree: Decision trees
ml_random_forest: Random forests
ml_gradient_boosted_trees: Gradient-boosted trees
ml_pca: Principal components analysis
ml_naive_bayes: Naive-Bayes
ml_multilayer_perceptron: Multilayer perceptron
ml_lda: Latent Dirichlet allocation
ml_one_vs_rest: One versus rest (allowing you to make a binary classifier into a
multiclass classifier)
However, development does continue, so check MLlib for more information.
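As a hedged illustration of the interface (the exact argument forms vary between sparklyr versions; this assumes a formula-style call and a copy of mtcars shipped to Spark, with the table name "mtcars_ml" being illustrative):

```r
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_ml", overwrite = TRUE)

# Fit a linear regression predicting mpg from weight and cylinder count
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)
```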
Conclusion
SparkR and sparklyr are areas of rapid growth in the Spark project, so visit their websites to find
out the latest updates about each one. Moreover, the entire Spark project continues to grow as new
members, tools, integrations, and packages join the community. The next chapter will discuss the
Spark community and some of the other resources available to you.
Chapter 33. Ecosystem and Community
One of Spark’s biggest selling points is the sheer volume of resources, tools, and contributors. At the
time of this writing, there are over 1,000 contributors to the Spark codebase. This is orders of
magnitude more than most other projects dream of achieving and a testament to Spark’s amazing
community—both in terms of contributors and stewards. The Spark project shows no sign of slowing
down, as companies large and small seek to join the community. This environment has stimulated a
large number of projects that complement and extend Spark’s features, including formal Spark
packages and informal extensions that users can use in Spark.
Spark Packages
Spark has a package repository for packages specific to Spark: Spark Packages. These packages
were discussed in Chapters 9 and 24. Spark packages are libraries for Spark applications that can
easily be shared with the community. GraphFrames is a perfect example; it makes graph analysis
available on Spark’s structured APIs in ways much easier to use than the lower-level (GraphX) API
built into Spark. There are numerous other packages, including many machine learning and deep
learning ones, that leverage Spark as the core and extend its functionality.
Beyond these advanced analytics packages, others exist to solve problems in particular verticals.
Healthcare and genomics have seen a surge in opportunity for big data applications. For example, the
ADAM Project leverages unique, internal optimizations to Spark’s Catalyst engine to provide a
scalable API and CLI for genome processing. Another package, Hail, is an open source, scalable
framework for exploring and analyzing genomic data. Starting from sequencing or microarray data in
VCF and other formats, Hail provides scalable algorithms to enable statistical analysis of gigabyte-
scale data on a laptop or terabyte-scale data on a cluster.
At the time of this writing, there are nearly 400 different packages to choose from. As a user, you can
specify Spark packages as dependencies in your build files (as seen in this book’s GitHub
repository). You can also download the pre-built jars and include them in your class path without
explicitly adding them to your build file. Spark packages can also be included at runtime by passing a
parameter to the spark-shell or spark-submit command-line tools.
An Abridged List of Popular Packages
As mentioned, there are nearly 400 Spark packages. Including all of these is not relevant to you as a
user because you can search for specific packages on the Spark package website. However, it is
worth mentioning some of the more popular packages:
Spark Cassandra Connector
This connector helps you get data in and out of the Cassandra database.
Spark Redshift Connector
This connector helps you get data in and out of the Redshift database.
Spark bigquery
This connector helps you get data in and out of Google’s BigQuery.
Spark Avro
This package allows you to read and write Avro files.
Elasticsearch
This package allows you to get data into and out of Elasticsearch.
Magellan
Allows you to perform geo-spatial data analytics on top of Spark.
GraphFrames
Allows you to perform graph analysis with DataFrames.
Spark Deep Learning
Allows you to leverage Deep Learning and Spark together.
Using Spark Packages
There are two core ways you can include Spark Packages in your projects: in Scala or Java, you can
include a package as a build dependency, or you can specify your packages at runtime (for Python or R).
Let’s review the ways in which you can include this information.
In Scala
Including the following resolver in your build.sbt file will allow you to include Spark packages as
dependencies. For example, we can add this resolver:
// allows us to include spark packages
resolvers += "bintray-spark-packages" at
  "https://dl.bintray.com/spark-packages/maven/"
Now that we added this line, we can include a library dependency for our Spark package:
libraryDependencies ++= Seq(
  ...
  // spark packages
  "graphframes" % "graphframes" % "0.4.0-spark2.1-s_2.11",
)
This is to include the GraphFrames library. There are slight versioning differences between packages,
but you can always find this information on the Spark packages website.
In Python
At the time of this writing, there is no explicit way to include a Spark package as a dependency in a
Python package. These sorts of dependencies must be set at runtime.
At runtime
We saw how we can specify Spark packages in Scala builds, but we can also include these
packages at runtime. This is as simple as including a new argument to the spark-shell or spark-
submit command that you would use to run your code.
For example, to include the magellan library:
$SPARK_HOME/bin/spark-shell --packages harsha2010:magellan:1.0.4-s_2.11
External Packages
In addition to the formal Spark Packages, there are a number of informal packages that are built on or
leverage Spark’s capabilities. A prime example is the popular gradient-boosted decision-tree
framework XGBoost, which makes use of Spark for scheduling distributed training on individual
partitions. A number of these are liberally licensed, public projects available on GitHub. Using your
favorite search engine is a great way to discover projects that may already exist, rather than having to
write your own.
Community
Spark has a large, robust community, one much larger than the packages and direct contributions alone.
The ecosystem of end users who build Spark into their products and write tutorials is an ever-
growing group. As of this writing, there are over 1,000 contributors to the repository on GitHub.
The official Spark website maintains the most up-to-date community information, including mailing
lists, improvement proposals, and project committers. This website also includes many resources
about new Spark versions, documentation, and release notes for the community.
Spark Summit
Spark Summits are events that occur across the globe at various times of the year. This is the canonical
event for Spark-related talks; thousands of end users and developers attend these summits to
learn about the cutting edge in Spark and hear about use cases. There are hundreds of tracks and
training courses over the course of several days. In 2016, there were three events: New York (Spark
Summit East), San Francisco (Spark Summit West), and Amsterdam (Spark Summit Europe). In 2017,
there were Spark Summits in Boston, San Francisco, and Dublin. Coming in 2018—and beyond—
there will be even more events. Find out more at the Spark Summit website.
There are hundreds of freely available Spark Summit videos for learning about use cases, Spark’s
development, and strategies and tactics that you can use to get the most out of Spark. You can browse
historical Spark Summit talks and videos on the website.
Local Meetups
There are many Spark-related meetup groups on meetup.com. Figure 33-1 shows a map of Spark-
related meetups on Meetup.com.
Figure 33-1. Spark meetup map
Spark’s “official meetup group” in the Bay Area (founded by one of the authors of this book) can be
found here. However, there are over 600 Spark-related meetups around the world, totaling nearly
350,000 members. These meetups continue to spring up and grow, so be sure to find one in your area.
Conclusion
This whirlwind chapter discussed the nontechnical resources that Spark makes available. One important
fact is that one of Spark’s greatest assets is the Spark community. We are extremely proud of the
community’s involvement in the development of Spark and love to hear about what companies,
academic institutions, and individuals build with Spark.
We sincerely hope that you’ve enjoyed this book, and we look forward to seeing you at a Spark
Summit!
Index
Symbols
--jars command-line argument, Submitting applications
./bin/pyspark, Launching the Python console, Starting Spark
./bin/spark-shell, Launching the Scala console, Starting Spark
=!= operator, Concatenating and Appending Rows (Union), Working with Booleans
== (equal to) expression, Working with Booleans
` (backtick) character, Reserved Characters and Keywords
A
Accelerated Failure Time (AFT), Survival Regression (Accelerated Failure Time)
accumulators
basic example, Basic Example-Basic Example
custom, Custom Accumulators
overview of, Distributed Shared Variables, Accumulators
acknowledgements, Acknowledgments
actions, Actions
ADAM Project, Spark Packages
advanced analytics (see machine learning and advanced analytics)
aggregate function, aggregate
AggregateByKey function, aggregateByKey
AggregationBuffer, User-Defined Aggregation Functions
aggregations
aggregate function, aggregate
AggregateByKey function, aggregateByKey
aggregation functions, Aggregation Functions-Aggregating to Complex Types
CombineByKey function, combineByKey
countByKey, countByKey
debugging, Slow Aggregations
foldByKey function, foldByKey
groupByKey, groupByKey
grouping, Grouping-Grouping with Maps
grouping sets, Grouping Sets-Pivot
on complex types, Aggregating to Complex Types
overview of, Aggregations-Aggregations
performance tuning, Aggregations
on RDDs, Aggregations-foldByKey
reduceByKey, reduceByKey
in Structured Streaming API, Aggregations
User-Defined Aggregation Functions (UDAFs), User-Defined Aggregation Functions
window functions, Window Functions-Grouping Sets
alerting and notifications, Notifications and alerting, Alerting
Alternating Least Squares (ALS), Use Cases-Ranking Metrics
analytics (see machine learning and advanced analytics)
analyzer phase in Spark SQL, Logical Planning
anomaly detection
through graph analysis, Graph Analytics
through unsupervised learning, Unsupervised Learning
anti joins, Left Anti Joins
Apache Hadoop, Apache Spark’s Philosophy, Splittable File Types and Compression, Hadoop Files
Apache Hive, Big Data and SQL: Apache Hive, The SparkContext, Miscellaneous Considerations
Apache Maven, A Simple Scala-Based App
Apache Mesos, The Architecture of a Spark Application, Deploying Spark, Spark on Mesos
Apache Spark (see also Spark applications; Spark SQL)
actions, Actions
API selection, Which Spark API to Use?
architecture of, Sparks Basic Architecture
benefits of, What Is Apache Spark?, Context: The Big Data Problem, Conclusion, Ecosystem and
Community
building Spark from source, Building Spark from source
case sensitivity, Case Sensitivity
cloud deployment of, Spark in the Cloud
(see also deployment)
DataFrames, DataFrames
downloading, Downloading Spark Locally
ecosystem of packages and tools, Spark’s Ecosystem and Packages, Spark Packages-External
Packages
focus on computation, Apache Spark’s Philosophy
functional programming model underlying, History of Spark, An End-to-End Example
fundamental APIs of, Spark’s APIs
history of, Preface, History of Spark
interactive nature of, History of Spark
internal type representations, Spark Types-Spark Types
language APIs, Spark’s Language APIs
launching interactive consoles, Launching Spark’s Interactive Consoles
libraries supported by, Apache Spark’s Philosophy
managing Spark versions, Miscellaneous Considerations, Updating Your Spark Version
philosophy of, Apache Spark’s Philosophy
recent improvements to, The Present and Future of Spark
reserved characters and keywords, Reserved Characters and Keywords
running, Running Spark, Running Spark in the Cloud, How Spark Runs on a Cluster-Conclusion
Spark UI, Spark UI
starting, Starting Spark
toolkit components and libraries, What Is Apache Spark?, A Tour of Spark’s Toolset-Spark’s
Ecosystem and Packages
topics covered, Preface
transformation end-to-end example, An End-to-End Example-DataFrames and SQL
transformations basics, Transformations
unified nature of, Apache Spark’s Philosophy
append output mode, Append mode
application properties, Application Properties, Application properties for YARN
applications (Structured Streaming) (see also production applications; Spark applications; Structured
Streaming API)
alerting, Alerting
monitoring, Metrics and Monitoring-Spark UI
sizing and rescaling, Sizing and Rescaling Your Application
Stream Listener monitoring, Advanced Monitoring with the Streaming Listener
updating, Updating Your Application
approximations, Aggregations
approxQuantile method, Working with Numbers
approx_count_distinct function, approx_count_distinct
arrays, Arrays-explode
array_contains, array_contains
asynchronous job execution, Time-Outs
atomicity, Resilience in output and atomicity
attributions, Using Code Examples
automatic model tuning, Evaluators for Classification and Automating Model Tuning, Evaluators and
Automating Model Tuning
average, calculating, avg
Avro, An Abridged List of Popular Packages
B
backtick (`) character, Reserved Characters and Keywords
batch processing, What Is Stream Processing?, Batch duration
Bay Area Bike Share data, Graph Analytics
BigDL, BigDL
bigquery package, An Abridged List of Popular Packages
binary classification, Classification, Binary Classification
binning, Bucketing
(see also bucketing)
bisecting k-means, Bisecting k-means Summary
Booleans, Working with Booleans
breadth-first search, Breadth-First Search
broadcast joins, Communication Strategies
broadcast variables, Distributed Shared Variables-Broadcast Variables, Broadcast Variables
bucketing, Bucketing, Bucketing, Bucketing-Advanced bucketing techniques
business logic, Business logic resilience and evolution
ByKey, Key-Value Basics (Key-Value RDDs)
C
caching, Temporary Data Storage (Caching)-Temporary Data Storage (Caching)
calendar dates, Working with Dates and Timestamps-Working with Dates and Timestamps
capitalization, Working with Strings
Cartesian products, Cross (Cartesian) Joins
case classes, In Scala: Case Classes, Datasets and RDDs of Case Classes
case sensitivity, Case Sensitivity
case…when…then…end style statements, case…when…then Statements
Cassandra Connector, Conclusion, On-Premises Cluster Deployments, An Abridged List of Popular
Packages
casting, Changing a Column’s Type (cast)
catalog, Logical Planning, Miscellaneous Considerations
Catalog (Spark SQL), Catalog, Miscellaneous Considerations
Catalyst computation engine, Overview of Structured Spark Types, Spark Packages
categorical features, preprocessing, Working with Categorical Features-Text Data Transformers
centroid, Machine Learning and Advanced Analytics, k-means
(see also k-means)
checkpointing, Checkpointing, Fault Tolerance and Checkpointing, Connected Components
Chi-Square Selector, Feature Selection
classification
decision trees, Decision Trees-Prediction Parameters
evaluators and automating model tuning, Evaluators for Classification and Automating Model
Tuning
logistic regression, Logistic Regression-Model Summary
metrics, Detailed Evaluation Metrics
models in MLlib, Classification Models in MLlib-Model Scalability
multilayer perceptron classifier, MLlib Neural Network Support
Naive Bayes, Naive Bayes
One-vs-Rest, One-vs-Rest Classifier
random forests and gradient boosted trees, Random Forest and Gradient-Boosted Trees-Prediction
Parameters
through graph analysis, Graph Analytics
types of, Types of Classification
use cases for, Classification, Use Cases
client mode, Client mode
client requests, Client Request
cloud deployment, Spark in the Cloud
(see also deployment)
cluster managers (see also clusters)
cluster managers, Spark on YARN-Application properties for YARN
Mesos, Spark on Mesos
overview of, The Architecture of a Spark Application
purpose of, Spark's Basic Architecture
selecting, Deploying Spark
standalone, Cluster Managers-Submitting applications
cluster mode, Cluster mode
clusters (see also cluster managers)
cluster networking configurations, Cluster Networking Configurations
creating, Launch
defined, Spark's Basic Architecture
monitoring, The Monitoring Landscape
on-premises clusters, On-Premises Cluster Deployments
performance tuning, Cluster Configurations
sizing and sharing configuration, Cluster/application sizing and sharing
coalesce, Repartition and Coalesce, Coalesce, coalesce, Repartitioning and Coalescing
code examples, obtaining and using, Using Code Examples, Data Used in This Book
CoGroups, CoGroups
col function, Columns
cold start problem, Collaborative Filtering with Alternating Least Squares, Prediction Parameters
collaborative filtering, Use Cases-Ranking Metrics
collect method, Datasets: Type-Safe Structured APIs, Collecting Rows to the Driver, Actions
collect_list function, Lists
collect_set function, Lists
column function, Columns
columns
accessing, Accessing a DataFrame’s columns
adding, Adding Columns
case sensitivity, Case Sensitivity
changing type (cast), Changing a Column's Type (cast)
converting rows to, Pivot
exploding, explode
instantiating, Spark Types
locating methods, Where to Look for APIs
manipulating with Select and SelectExpr, select and selectExpr
MLlib column metadata, Transformer Properties
overview of, Columns
removing, Removing Columns
renaming, Renaming Columns
reserved characters and keywords, Reserved Characters and Keywords
working with in Spark, Columns and Expressions
CombineByKey function, combineByKey
comments and questions, How to Contact Us
common words, removing, Removing Common Words
community, Spark's Ecosystem and Packages
comparison (=!=) operator, Concatenating and Appending Rows (Union), Working with Booleans
complete output mode, Complete mode
complex types, Columns, Writing Complex Types, Complex Types-Lists
compression formats, Splittable File Types and Compression, Splittable file types and compression
computing engines, What Is Apache Spark?, Apache Spark's Philosophy
concatenation, Concatenating and Appending Rows (Union)
conf/slaves file, Cluster launch scripts
configuration options
application properties, Application Properties
environmental variables, Environmental Variables
execution properties, Execution Properties
job scheduling, Job Scheduling Within an Application
memory management, Configuring Memory Management
overview of, Configuring Applications
runtime properties, Runtime Properties
shuffle behavior, Configuring Shuffle Behavior
SparkConf, The SparkConf
connected components algorithm, Connected Components
console sink, Sources and sinks for testing
console, launching interactive, Launching Spark's Interactive Consoles
continuous applications, What Is Stream Processing?, Structured Streaming Basics
continuous features, preprocessing of, Working with Continuous Features-Normalizer
continuous processing-based systems, Continuous Versus Micro-Batch Execution
correlated predicated subqueries, Correlated predicate subqueries
correlated subqueries, Subqueries
correlation, computing, Working with Numbers, Covariance and Correlation
cost-based optimizations, Statistics collection
(see also performance tuning)
count action, Aggregations, Actions, count
count-based windows, mapGroupsWithState-mapGroupsWithState
countApprox, countApprox
countApproxDistinct, countApproxDistinct
countByKey, Aggregations
countByValue, countByValue
countByValueApprox, countByValueApprox
countDistinct function, countDistinct
counting, Working with Numbers
CountVectorizer, Converting Words into Numerical Representations
covariance, calculating, Covariance and Correlation
CREATE EXTERNAL TABLE statement, Creating External Tables
cross (Cartesian) joins, Cross (Cartesian) Joins
CSV (comma-separated values) files
CSV reader options, CSV Files-Reading CSV Files
reading, Reading CSV Files
writing, Writing CSV Files
cube operator, Cube
curse of dimensionality, Unsupervised Learning
D
data locality, Data locality
data sources
community-created, Data Sources
CSV files, CSV Files-Writing CSV Files
data source APIs structure, The Structure of the Data Sources API-Save modes
downloading data used in this book, Data Used in This Book
JSON files, JSON Files-Parquet Files
managing file size, Managing File Size
ORC files, ORC Files
Parquet files, Parquet Files-Writing Parquet Files
reading data in parallel, Reading Data in Parallel
Spark's core, Data Sources
splittable file formats, Splittable File Types and Compression
SQL databases, SQL Databases-Text Files
in Structured Streaming API, Input Sources, Where Data Is Read and Written (Sources and Sinks)-
Reading from the Kafka Source, Sources and sinks for testing
text files, Text Files
writing complex types, Writing Complex Types
writing data in parallel, Writing Data in Parallel
data, cleaning, Data cleaning
data, reading
basics of, Basics of Reading Data
core API structure, Read API Structure
CSV files, Reading CSV Files
debugging, Slow Reads and Writes
in parallel, Reading Data in Parallel
read mode, Read modes
data, storing
bucketing, Bucketing
collecting statistics, Statistics collection
data locality, Data locality
file-based long-term, File-based long-term data storage
number of files, The number of files
splittable file types and compression, Splittable file types and compression
table partitioning, Table partitioning
temporary (caching), Temporary Data Storage (Caching)-Temporary Data Storage (Caching)
data, types of
arrays, Arrays-explode
Booleans, Working with Booleans
converting to Spark types, Converting to Spark Types
dates and timestamps, Working with Dates and Timestamps-Working with Dates and Timestamps
JSON data, Working with JSON
locating transformations, Where to Look for APIs
maps, Maps
null values, Working with Nulls in Data-Ordering, Signs and symptoms, Formatting Models
According to Your Use Case
numbers, Working with Numbers-Working with Numbers
ordering null values, Ordering
strings, Working with Strings-Regular Expressions
structs, Structs
user-defined functions (UDFs), User-Defined Functions-User-Defined Functions, Aggregating to
Complex Types, User-defined functions
data, updating in real time, Update data to serve in real time
data, writing
basics of, Basics of Writing Data
core API structure, Write API Structure
debugging, Slow Reads and Writes
in parallel, Writing Data in Parallel
save mode, Save modes
databases (Spark SQL) (see also SQL databases)
creating, Creating Databases
dropping, Dropping Databases
OLTP vs. OLAP, Big Data and SQL: Spark SQL
overview of, Databases
setting, Setting the Database
Databricks Community Edition, Running Spark, Running Spark in the Cloud, The Development
Process, Spark in the Cloud
DataFrameNaFunctions submodule, Where to Look for APIs
DataFrameReader, An End-to-End Example, Basics of Reading Data
DataFrames
basic structured operations, Basic Structured Operations-Conclusion
basics of, DataFrames, DataFrames and Datasets
components of, Basic Structured Operations
creating, Creating DataFrames
creating from RDDs, Interoperating Between DataFrames, Datasets, and RDDs
vs. Datasets, DataFrames Versus Datasets, DataFrames versus SQL versus Datasets versus RDDs
locating methods and functions, Where to Look for APIs
manipulating, Creating DataFrames-Conclusion
streaming, Structured Streaming in Action
DataFrameStatFunctions submodule, Where to Look for APIs
DataFrameWriter, Basics of Writing Data
Datasets
actions, Actions
benefits of, Datasets: Type-Safe Structured APIs, When to Use Datasets
creating, Creating Datasets
creating from RDDs, Interoperating Between DataFrames, Datasets, and RDDs
vs. DataFrames, DataFrames Versus Datasets
filtering, Filtering
grouping and aggregations, Grouping and Aggregations
joins, Joins
locating methods, Where to Look for APIs
mapping, Mapping
overview of, Datasets: Type-Safe Structured APIs, DataFrames and Datasets, Datasets
streaming, Streaming Dataset API
transformations, Transformations
when to use, When to Use Datasets, Which Spark API to Use?, DataFrames versus SQL versus
Datasets versus RDDs
dates and timestamps, Working with Dates and Timestamps-Working with Dates and Timestamps
de-duplication, Dropping Duplicates in a Stream
debugging (see monitoring and debugging)
decision making, real-time, Real-time decision making
decision trees
applied to classification, Decision Trees-Prediction Parameters
applied to regression, Decision Trees
example, Prediction Parameters
model hyperparameters, Model Hyperparameters
overview of, Decision Trees
prediction parameters, Prediction Parameters
training parameters, Training Parameters
declarative APIs, Record-at-a-Time Versus Declarative APIs
deep learning
BigDL, BigDL
Deep Learning Pipelines, Deep Learning Pipelines-Conclusion
DeepLearning4j, DeepLearning4J
MLlib neural network support, MLlib Neural Network Support
overview of, What Is Deep Learning?
in Spark, Ways of Using Deep Learning in Spark-Ways of Using Deep Learning in Spark
TensorFlowOnSpark, TensorFlowOnSpark
TensorFrames, TensorFrames
Deep Learning (Goodfellow), A Short Primer on Advanced Analytics
Deep Learning Pipelines, Deep Learning Pipelines-Conclusion
deep neural networks, What Is Deep Learning?
DeepLearning4j, DeepLearning4J
default stop words, Removing Common Words
dependencies, Transformations
deployment
application scheduling, Application Scheduling
cluster networking configurations, Cluster Networking Configurations
external shuffle service, Miscellaneous Considerations
logging considerations, Miscellaneous Considerations, The Monitoring Landscape
managing Spark versions, Miscellaneous Considerations
Mesos, Spark on Mesos
metastores, Miscellaneous Considerations
monitoring, Miscellaneous Considerations
number and type of applications, Miscellaneous Considerations
overview of, Deploying Spark
secure deployment configurations, Secure Deployment Configurations
standalone cluster manager, Cluster Managers-Submitting applications
where to deploy, Where to Deploy Your Cluster to Run Spark Applications-Spark in the Cloud
YARN framework, Spark on YARN-Application properties for YARN
describe method, Working with Numbers
development process, The Development Process
development template, Developing Spark Applications, Spark Logs
(see also Spark applications)
dimensionality, curse of, Unsupervised Learning
directed acyclic graph (DAG), DataFrames and SQL, Columns as expressions, The Spark UI
directed graphs, Graph Analytics, Building a Graph
disks, no space left errors, Signs and symptoms
distinct method, distinct
distributed collections, The SparkSession
distributed shared variables
accumulators, Accumulators-Custom Accumulators
broadcast variables, Broadcast Variables-Broadcast Variables
distributed stream processing, Core Concepts
DLB (see Deep Learning (Goodfellow))
dplyr data sources, sparklyr
driver processes, Spark Applications, Collecting Rows to the Driver, The Architecture of a Spark
Application, Client Request, Driver and Executor Processes, Driver OutOfMemoryError or Driver
Unresponsive, Language Specifics: Python (PySpark) and R (SparkR and sparklyr)
drop function, drop
Dropwizard Metrics Library, Driver and Executor Processes
dstat utility, The Monitoring Landscape
DStreams API, Stream Processing Fundamentals, The DStream API, Event Time, Handling Late Data
with Watermarks
duplicates, removing, Dropping Duplicates in a Stream
dynamic allocation, Application Scheduling, Dynamic allocation
E
edge nodes, Client mode
edges, Graph Analytics
ElasticNet model, Linear Regression
Elasticsearch, An Abridged List of Popular Packages
Elements of Statistical Learning (Hastie), A Short Primer on Advanced Analytics
ElementwiseProduct, ElementwiseProduct
empty data, Working with Nulls in Data
end-to-end applications, What Is Stream Processing?, Structured Streaming, Structured Streaming
Basics
environmental variables, Environmental Variables
eponymous functions, skewness and kurtosis
equal to (==) expression, Working with Booleans
errors (see also monitoring and debugging)
before execution, Errors Before Execution
during execution, Errors During Execution
no space left on disk errors, Signs and symptoms
OutOfMemoryError, Driver OutOfMemoryError or Driver Unresponsive-Potential treatments
serialization errors, Signs and symptoms
ESL (see Elements of Statistical Learning (Hastie))
estimators, High-Level MLlib Concepts, Estimators, Estimators for Preprocessing
evaluators, High-Level MLlib Concepts, Training and Evaluation-Training and Evaluation,
Evaluators for Classification and Automating Model Tuning, Evaluators and Automating Model
Tuning, Evaluators for Recommendation
event logs, Spark UI History Server
event time, Event Time Versus Processing Time
event-time processing
basics of, Event-Time Basics
benefits of Spark for, Event-Time and Stateful Processing
defined, Event-Time and Stateful Processing
dropping duplicates, Dropping Duplicates in a Stream
event time defined, Event Time
example, Event Time
key ideas, Event-Time Processing
late data handling, Handling Late Data with Watermarks-Handling Late Data with Watermarks
processing time defined, Event Time
windows, Windows on Event Time-Handling Late Data with Watermarks
execution modes, Execution Modes
execution plan, DataFrames and SQL, Logical Instructions
execution properties, Execution Properties
execution, of Spark applications, Execution
executor processes, Spark Applications, The Architecture of a Spark Application, Driver and
Executor Processes, Executor OutOfMemoryError or Executor Unresponsive
explain plan, An End-to-End Example
explode function, explode
exploratory data analysis (EDA), Data cleaning
expr function, Columns as expressions
expressions
accessing DataFrame columns, Accessing a DataFrame's columns
building, Working with Different Types of Data-Conclusion
(see also data, types of)
columns as expressions, Columns as expressions
defined, Columns and Expressions, Expressions
grouping with, Grouping with Expressions
joins, Join Expressions-Conclusion
external packages, External Packages
external shuffle service, Miscellaneous Considerations, Shuffle Configurations
external tables, Creating External Tables
extract, transform, and load (ETL), Schemas, When to Use Datasets, Scala versus Java versus Python
versus R, Incremental ETL, Ways of Using Deep Learning in Spark
F
failure recovery, Fault Tolerance and Checkpointing
fair scheduler, Application Scheduling, Scheduling
fault tolerance, Fault Tolerance and Checkpointing
feature engineering, Feature engineering
feature generation, Feature Manipulation-Polynomial Expansion
feature selection, Feature Selection
featurization, Ways of Using Deep Learning in Spark
file size, managing, Managing File Size
fill function, fill
filter method, Filtering Rows, Working with Booleans, Improved Filtering, Selections and Filtering
first function, first and last
first method, first
fit method, Feature Engineering with Transformers
flatMap, flatMap, Mapping over Values
flatMapGroupsWithState, When can you use each mode?, flatMapGroupsWithState-
flatMapGroupsWithState
foldByKey function, foldByKey
foreach sink, Foreach sink
foreachPartition, foreachPartition
fraud prediction, Graph Analytics
frequent item pairs, Working with Numbers
frequent pattern mining, Frequent Pattern Mining
functions (Spark SQL), Functions
G
gamma regression, Generalized Linear Regression
Ganglia, The Monitoring Landscape
garbage collection, Memory Pressure and Garbage Collection-Garbage collection tuning
gateway machines, Client mode
Gaussian (linear regression), Generalized Linear Regression
Gaussian mixture models (GMM), Gaussian Mixture Models
generalized linear regression
example, Example
model hyperparameters, Model Hyperparameters
overview of, Generalized Linear Regression
prediction parameters, Prediction Parameters
training parameters, Training Parameters
training summary, Training Summary
geo-spatial data analytics, An Abridged List of Popular Packages
ggplot library, SparkR
glom function, glom
gradient boosted trees (GBT)
applied to classification, Random Forest and Gradient-Boosted Trees-Prediction Parameters
applied to regression, Random Forests and Gradient-Boosted Trees
example, Prediction Parameters
model hyperparameters, Model Hyperparameters
overview of, Random Forest and Gradient-Boosted Trees
prediction parameters, Prediction Parameters
training parameters, Training Parameters
graph analysis
breadth-first search, Breadth-First Search
building graphs, Building a Graph
connected components, Connected Components
GraphFrames algorithms, Graph Algorithms
in-degree and out-degree metrics, In-Degree and Out-Degree Metrics-In-Degree and Out-Degree
Metrics
motif finding, Motif Finding-Motif Finding
overview of, Graph Analytics-Graph Analytics
PageRank algorithm, PageRank
querying graphs, Querying the Graph
strongly connected components, Strongly Connected Components
subgraphs, Subgraphs
use cases for, Graph Analytics
graph databases, Graph Analytics
GraphFrames, Graph Analytics-Conclusion, Graph Algorithms, Spark Packages
GraphX, Graph Analytics
group-by function, DataFrames and SQL, Window Functions
groupByKey, groupByKey
grouping, Grouping-Grouping with Maps
grouping sets, Grouping Sets-Pivot
grouping_id operator, Grouping Metadata
gzip compression, Splittable File Types and Compression
H
Hadoop Distributed File System (HDFS), Apache Spark's Philosophy, Splittable File Types and
Compression, Hadoop Files, On-Premises Cluster Deployments, Hadoop configurations
Hadoop YARN, The Architecture of a Spark Application, Deploying Spark, Spark on YARN-
Application properties for YARN
Hail, Spark Packages
Heterogeneity Human Activity Recognition Dataset, Structured Streaming in Action
History Server, Spark UI History Server
Hive, Big Data and SQL: Apache Hive
Hive metastore, The Hive metastore, Miscellaneous Considerations
HiveContext, The SparkContext
HiveQL, Creating External Tables
hyperparameters, Model tuning and evaluation, Estimators
I
ifnull function, ifnull, nullIf, nvl, and nvl2
image models, Applying Popular Models
immutability, Transformations
incremental data updates, Update data to serve in real time
incremental ETL, Incremental ETL
IndexToString, Converting Indexed Values Back to Text
inference, Ways of Using Deep Learning in Spark
informal packages, External Packages
initcap function, Working with Strings
inner joins, Inner Joins
input data resilience, Input data resilience
input/output (Structured Streaming API)
file source and sink, File source and sink
foreach sink, Foreach sink-Foreach sink
input rate monitoring, Input rate and processing rate
Kafka source and sink, Kafka source and sink
output modes, How Data Is Output (Output Modes)
reading from Kafka source, Reading from the Kafka Source
sources and sinks for testing, Sources and sinks for testing
triggers, When Data Is Output (Triggers)
writing to Kafka sink, Writing to the Kafka Sink
Interaction feature transformer, Interaction
interactive consoles, launching, Launching Spark's Interactive Consoles
iostat utility, The Monitoring Landscape
iotop utility, The Monitoring Landscape
isotonic regression, Isotonic Regression
J
Java
Encoders, In Java: Encoders
SimpleDateFormat, Working with Dates and Timestamps
TimeZone format, Working with Dates and Timestamps
type reference, Spark Types
writing Spark applications in, Writing Java Applications
Java Database Connectivity (JDBC), SQL Databases, SparkSQL Thrift JDBC/ODBC Server
Java Virtual Machine (JVM), The Hive metastore, Datasets, The Monitoring Landscape
jconsole utility, The Monitoring Landscape
jmap utility, The Monitoring Landscape
joins
broadcast join, Communication Strategies
challenges when using, Challenges When Using Joins-Approach 3: Renaming a column before the
join
cross (Cartesian) joins, Cross (Cartesian) Joins
debugging, Slow Joins
how Spark performs joins, How Spark Performs Joins-Little table–to–little table
inner joins, Inner Joins, Inner Join
left anti joins, Left Anti Joins
left outer joins, Left Outer Joins
left semi joins, Left Semi Joins
natural joins, Natural Joins
outer joins, Outer Joins
overview of, Join Expressions
performance tuning, Joins
in RDDs, Joins
right outer joins, Right Outer Joins
shuffle join, Communication Strategies
in Structured Streaming API, Joins
types available, Join Types
zips, zips
joinWith method, Joins
JSON data
line-delimited JSON files, JSON Files
options available, JSON Options
reading JSON files, Reading JSON Files
working with, Working with JSON
writing JSON files, Writing JSON Files
jstack utility, The Monitoring Landscape
jstat utility, The Monitoring Landscape
JUnit, Managing SparkSessions
jvisualvm utility, The Monitoring Landscape
K
k-means algorithm, Machine Learning and Advanced Analytics
example, Training Parameters
model hyperparameters, Model Hyperparameters
overview of, k-means
summary class, k-means Metrics Summary
training parameters, Training Parameters
Kafka
overview of, Kafka source and sink
reading from, Reading from the Kafka Source
writing to, Writing to the Kafka Sink
Keras, Applying Popular Models
Key-Value RDDs
aggregate function, aggregate
AggregateByKey function, aggregateByKey
aggregations, Aggregations-foldByKey
CombineByKey function, combineByKey
creating, keyBy
extracting keys and values, Extracting Keys and Values
foldByKey function, foldByKey
groupByKey, groupByKey
mapping over values, Mapping over Values
reduceByKey, reduceByKey
when to use, Key-Value Basics (Key-Value RDDs), DataFrames versus SQL versus Datasets
versus RDDs
Kryo serialization, Custom Serialization, Object Serialization in RDDs
kurtosis, calculating, skewness and kurtosis
L
L-BFGS (Limited memory Broyden-Fletcher-Goldfarb-Shanno), Model Scalability
language APIs
Java, Spark's Language APIs
overview of, Spark's Language APIs
Python, Spark's Language APIs, PySpark
R, Spark's Language APIs, R on Spark-Machine learning
Scala, Spark's Language APIs
selecting, Which Spark API to Use?, Scala versus Java versus Python versus R
SQL, Spark's Language APIs
last function, first and last
late data, handling, Handling Late Data with Watermarks-Handling Late Data with Watermarks
Latent Dirichlet Allocation (LDA)
example, Training Parameters
model hyperparameters, Model Hyperparameters
overview of, Latent Dirichlet Allocation
prediction parameters, Prediction Parameters
training parameters, Training Parameters
lazy evaluation, Lazy Evaluation
left anti joins, Left Anti Joins
left outer joins, Left Outer Joins
left semi joins, Left Semi Joins
libraries, supported by Spark, Apache Spark's Philosophy
LIBSVM data format, MLlib in Action
limit method, Limit
line-delimited JSON files, JSON Files
linear regression, Linear Regression
lists, Lists
lit function, Converting to Spark Types
literals, Converting to Spark Types (Literals)
local mode, Spark Applications, Local mode
logging, Miscellaneous Considerations, The Monitoring Landscape, Spark Logs
logical plan, An End-to-End Example, Logical Planning, Logical Instructions
logistic regression
example, Example
model hyperparameters, Model Hyperparameters
model summary, Model Summary
overview of, Logistic Regression
prediction parameters, Prediction Parameters
training parameters, Training Parameters
lookup function, lookup
lower-level APIs
defined, What Are the Low-Level APIs?
distributed shared variables, Distributed Shared Variables-Conclusion
how to use, How to Use the Low-Level APIs?
overview of, Lower-Level APIs
RDD advanced applications, Advanced RDDs-Conclusion
RDD basics, About RDDs-Conclusion
when to use, When to Use the Low-Level APIs?, Which Spark API to Use?, DataFrames versus
SQL versus Datasets versus RDDs
lowercase, Working with Strings
M
machine learning and advanced analytics
advanced analytics process, The Advanced Analytics Process-Leveraging the model and/or
insights
classification, Classification-Conclusion
data cleaning, Data cleaning
data collection, Data collection
deep learning, Deep Learning-Conclusion
deployment patterns, Deployment Patterns
feature engineering, Feature engineering, Feature Manipulation-ChiSqSelector
graph analysis, Graph Analytics, Graph Analytics-Conclusion
MLlib, Spark's Advanced Analytics Toolkit-Persisting and Applying Models
model training, Feature engineering
model tuning and evaluation, Model tuning and evaluation
online machine learning, Online machine learning
overview of, Machine Learning and Advanced Analytics-Machine Learning and Advanced
Analytics, Advanced Analytics and Machine Learning Overview-A Short Primer on Advanced
Analytics
persisting and applying models, Persisting and Applying Models
Pipeline concept, Pipelining Our Workflow
preprocessing, Preprocessing and Feature Engineering-Word2Vec
recommendation, Recommendation, Recommendation-Conclusion
regression, Regression-Conclusion
supervised learning, Supervised Learning-Regression
unsupervised learning, Unsupervised Learning, Unsupervised Learning-Conclusion
Magellan, An Abridged List of Popular Packages
magrittr library, SparkR
main() function, Spark Applications
mapGroupsWithState, When can you use each mode?, mapGroupsWithState-mapGroupsWithState
mapPartitions, mapPartitions
MapPartitionsRDD, mapPartitions
maps, Maps, Grouping with Maps
market basket analysis, Frequent Pattern Mining
Maven, A Simple Scala-Based App
max function, min and max, max and min
MaxAbsScaler, MaxAbsScaler
maxFilesPerTrigger, Structured Streaming in Action, File source and sink
meetup groups, Local Meetups
memory management
configuring, Configuring Memory Management
garbage collection, Memory Pressure and Garbage Collection-Garbage collection tuning
OutOfMemoryError, Driver OutOfMemoryError or Driver Unresponsive-Potential treatments
temporary data storage (caching), Temporary Data Storage (Caching)-Temporary Data Storage
(Caching)
memory sinks, Structured Streaming in Action, Sources and sinks for testing
Mesos (see Apache Mesos)
metadata
describing, Describing Table Metadata
grouping, Grouping Metadata
Hive metastore, Miscellaneous Considerations
MLlib column metadata, Transformer Properties
refreshing, Refreshing Table Metadata
tables (Spark SQL), Spark-Managed Tables
metastores, Miscellaneous Considerations
metrics, Driver and Executor Processes, Metrics and Monitoring-Spark UI, Detailed Evaluation
Metrics, Metrics, Metrics, In-Degree and Out-Degree Metrics-In-Degree and Out-Degree Metrics
(see also monitoring and debugging)
micro-batch systems, Continuous Versus Micro-Batch Execution
min function, min and max, max and min
minDF, Converting Words into Numerical Representations
minimum term frequency (minTF), Converting Words into Numerical Representations
MinMaxScaler, MinMaxScaler
missing data, Working with Nulls in Data
MLlib
benefits of, Machine Learning and Advanced Analytics, When and why should you use MLlib
(versus scikit-learn, TensorFlow, or foo package)
classification models in, Classification Models in MLlib-Model Scalability
estimators, High-Level MLlib Concepts
evaluators, High-Level MLlib Concepts
low-level data types, Low-level data types
neural network support, MLlib Neural Network Support
overview of, Machine Learning and Advanced Analytics-Machine Learning and Advanced
Analytics, What Is MLlib?
packages included in, What Is MLlib?
persisting and applying models, Persisting and Applying Models
pipeline example, MLlib in Action-Persisting and Applying Models
regression models in, Regression Models in MLlib-Conclusion
transformers, High-Level MLlib Concepts
models (see also individual models; machine learning and advanced analytics; MLlib)
automatic model tuning, Evaluators for Classification and Automating Model Tuning, Evaluators
and Automating Model Tuning
classification models, Classification Models in MLlib-Conclusion
deep learning models, Deep Learning-Conclusion
formatting according to use case, Formatting Models According to Your Use Case-Formatting
Models According to Your Use Case
image models, Applying Popular Models
regression models, Regression-Conclusion
scalability of, Model Scalability, Model Scalability, Collaborative Filtering with Alternating Least
Squares, Model Scalability
training, Feature engineering
training deep learning models, Ways of Using Deep Learning in Spark
tuning and evaluation, Model tuning and evaluation
monitoring and debugging
column metadata, Transformer Properties
components to monitor, The Monitoring Landscape-The Monitoring Landscape
deployment decisions, Miscellaneous Considerations
driver issues, Driver OutOfMemoryError or Driver Unresponsive
errors before execution, Errors Before Execution
errors during execution, Errors During Execution
executor issues, Executor OutOfMemoryError or Executor Unresponsive
no space left on disk errors, Signs and symptoms
processes to monitor, What to Monitor
role in performance tuning, Performance Tuning
serialization errors, Signs and symptoms
slow aggregation, Slow Aggregations
slow joins, Slow Joins
slow reads and writes, Slow Reads and Writes
slow tasks or stragglers, Slow Tasks or Stragglers
Spark jobs not starting, Spark Jobs Not Starting
Spark logs, Spark Logs
Spark UI, The Spark UI-Spark UI History Server
Structured Streaming API, Metrics and Monitoring-Spark UI
unexpected nulls in results, Signs and symptoms, Formatting Models According to Your Use Case
monotonically_increasing_id function, Working with Numbers
motif finding algorithms, Motif Finding-Motif Finding
multiclass classification, Multiclass Classification
multilabel classification, Multilabel Classification
multilayer perceptron classifier, MLlib Neural Network Support
multinomial models, Naive Bayes
multivariate Bernoulli models, Naive Bayes
N
n-grams, Creating Word Combinations
Naive Bayes classifiers, Naive Bayes
narrow dependencies, Transformations
natural joins, Natural Joins
Netflix, Use Cases
node-to-node communication strategy, How Spark Performs Joins
nodes, Graph Analytics
normalization and scaling, Scaling and Normalization-Normalizer
notifications and alerting, Notifications and alerting
null values, Columns, Working with Nulls in Data-Ordering, Signs and symptoms, Formatting Models
According to Your Use Case
nullIf function, ifnull, nullIf, nvl, and nvl2
numbers, Working with Numbers-Working with Numbers
nvl function, ifnull, nullIf, nvl, and nvl2
nvl2 function, ifnull, nullIf, nvl, and nvl2
O
object serialization, Custom Serialization, Signs and symptoms, Object Serialization in RDDs
official meetup group, Local Meetups
on-premises clusters, On-Premises Cluster Deployments
once trigger, Once trigger
One-vs-Rest, One-vs-Rest Classifier
OneHotEncoder, Machine Learning and Advanced Analytics, One-Hot Encoding
online analytic processing (OLAP), Big Data and SQL: Spark SQL
online machine learning, Online machine learning
online transaction processing (OLTP), Big Data and SQL: Spark SQL
optimization, cost-based, Statistics collection
(see also performance tuning)
ORC files, ORC Files
org.apache.spark.sql.functions package, Creating DataFrames, Where to Look for APIs, Aggregation
Functions
out-degree metric, In-Degree and Out-Degree Metrics-In-Degree and Out-Degree Metrics
outer joins, Outer Joins
OutOfMemoryError
drivers, Driver OutOfMemoryError or Driver Unresponsive
executors, Executor OutOfMemoryError or Executor Unresponsive
output modes, Output Modes, How Data Is Output (Output Modes), Output Modes
output schema resolution, Resilience in output and atomicity
output sinks, Sinks
(see also sinks)
P
packages, Spark's Ecosystem and Packages, Spark Packages-External Packages
PageRank algorithm, Graph Analytics, PageRank
Pandas, Pandas Integration
parallelism, Parallelism
parallelize method, From a Local Collection
ParamGrid, Training and Evaluation
Parquet files
benefits of, Parquet Files
options available, Parquet options
reading, Reading Parquet Files
writing, Writing Parquet Files
Partitioner, Types of RDDs, Custom Partitioning
partitions
based on sliding windows, Partitioning based on a sliding window
controlling with RDDs, Controlling Partitions-Custom Partitioning
custom partitioning, Custom Partitioning-Custom Partitioning
defined, Partitions
partitioning schemes, Basic Structured Operations
performance tuning, Table partitioning, Repartitioning and Coalescing
purpose of, Partitioning
repartitioning, Repartition and Coalesce
role in application lifecycle, Stages
Pearson Correlation Coefficient, Working with Numbers, Covariance and Correlation
per node computation strategy, How Spark Performs Joins
performance tuning
aggregations, Aggregations
automatic model tuning, Evaluators for Classification and Automating Model Tuning, Evaluators
and Automating Model Tuning
broadcast variables, Broadcast Variables
cluster configurations, Cluster Configurations
cluster networking configurations, Cluster Networking Configurations
data at rest, Data at Rest-Statistics collection
design choices, Design Choices
direct vs. indirect approaches, Performance Tuning
improved filtering, Improved Filtering
joins, Joins
memory pressure and garbage collection, Memory Pressure and Garbage Collection-Garbage
collection tuning
object serialization in RDDs, Object Serialization in RDDs
overview of, Performance Tuning
parallelism, Parallelism
repartitioning and coalescing, Repartitioning and Coalescing
role of monitoring in, Performance Tuning
scheduling, Scheduling
shuffle configurations, Shuffle Configurations
temporary data storage (caching), Temporary Data Storage (Caching)-Temporary Data Storage
(Caching)
User-Defined Functions (UDFs), User-Defined Functions (UDFs)
physical plan, Lazy Evaluation, Physical Planning, Logical instructions to physical execution,
Temporary Data Storage (Caching)
pip install pyspark, Downloading Spark for a Hadoop cluster
pipe method, Pipe RDDs to System Commands
Pipeline concept, Pipelining Our Workflow
pipelining, Transformations, Pipelining
pivots, Pivot
Poisson regression, Generalized Linear Regression
polynomial expansion, Polynomial Expansion
predicate pushdown, Lazy Evaluation
predicate subqueries, Uncorrelated predicate subqueries
preprocessing
bucketing, Bucketing-Advanced bucketing techniques
categorical features, Working with Categorical Features-Text Data Transformers
continuous features, Working with Continuous Features-Normalizer
converting indexed values back to text, Converting Indexed Values Back to Text
converting words into numbers, Converting Words into Numerical Representations-Term
frequency–inverse document frequency
creating word combinations, Creating Word Combinations
estimators, Estimators for Preprocessing
feature generation, Feature Manipulation-Polynomial Expansion
formatting models according to use case, Formatting Models According to Your Use Case-
Formatting Models According to Your Use Case
high-level transformers, High-Level Transformers-VectorAssembler
indexing in vectors, Indexing in Vectors
one-hot encoding, One-Hot Encoding
removing common words, Removing Common Words
scaling and normalization, Scaling and Normalization
SQLTransformers, SQL Transformers
StringIndexer, StringIndexer
text data transformers, Text Data Transformers-Term frequency–inverse document frequency
tokenizing text, Tokenizing Text
transformers, Transformers, Persisting Transformers-Writing a Custom Transformer
VectorAssembler, VectorAssembler
Word2Vec, Word2Vec
Principal component analysis (PCA), PCA
processing time, Event Time Versus Processing Time, Event Time, Input rate and processing rate
processing time trigger, Processing time trigger
production applications
benefits of Spark for, Running Production Applications
deploying, Deploying Spark-Conclusion
developing, Developing Spark Applications-Conclusion
how Spark runs on clusters, How Spark Runs on a Cluster-Conclusion
monitoring and debugging, Monitoring and Debugging-Conclusion
performance tuning, Performance Tuning-Conclusion
Structured Streaming API, Structured Streaming in Production-Conclusion
PushedFilters, Query Pushdown
PySpark, Downloading Spark for a Hadoop cluster, PySpark
Python
launching the console, Launching the Python console
PySpark, PySpark
type reference, Spark Types
writing Spark applications in, Writing Python Applications
Q
query execution, monitoring of, What to Monitor, Query Status-Batch duration
query optimizer, Statistics collection
query pushdown, Query Pushdown-Partitioning based on a sliding window
questions and comments, How to Contact Us
R
R
overview of, Spark's Language APIs, R on Spark
sparklyr, sparklyr-Machine learning
SparkR, SparkR-User-defined functions
random forests
applied to classification, Random Forest and Gradient-Boosted Trees-Prediction Parameters
applied to regression, Random Forests and Gradient-Boosted Trees
example, Prediction Parameters
model hyperparameters, Model Hyperparameters
overview of, Random Forest and Gradient-Boosted Trees
prediction parameters, Prediction Parameters
training parameters, Training Parameters
random splits, Random Splits, Random Splits
rdd method, Interoperating Between DataFrames, Datasets, and RDDs
read attribute, Basics of Reading Data
reading data
basics of, Basics of Reading Data
core API structure, Read API Structure
debugging, Slow Reads and Writes
read mode, Read modes
real-time decision making, Real-time decision making
real-time reporting, Real-time reporting
recommendation
collaborative filtering with alternating least squares, Collaborative Filtering with Alternating Least
Squares
evaluators, Evaluators for Recommendation
example, Example
frequent pattern mining, Frequent Pattern Mining
metrics, Metrics-Ranking Metrics
model hyperparameters, Model Hyperparameters
prediction parameters, Prediction Parameters
through graph analysis, Graph Analytics
training parameters, Training Parameters
use cases for, Recommendation, Use Cases
record-at-a-time APIs, Record-at-a-Time Versus Declarative APIs
records (see also columns; rows)
random samples of, Random Samples
random splits of, Random Splits
repartition and coalesce, Repartition and Coalesce
restricting extraction of, Limit
vs. rows, Records and Rows
recovery, Fault Tolerance and Checkpointing
Redshift Connector, An Abridged List of Popular Packages
reduce method, reduce
reduceByKey, reduceByKey
REFRESH TABLE, Refreshing Table Metadata
RegexTokenizer, Tokenizing Text
regression
decision trees, Decision Trees
evaluators and automating model tuning, Evaluators and Automating Model Tuning
generalized linear regression, Generalized Linear Regression-Training Summary
isotonic regression, Isotonic Regression
linear regression, Linear Regression
metrics, Metrics
models in MLlib, Regression Models in MLlib-Model Scalability
random forests and gradient boosted trees, Random Forests and Gradient-Boosted Trees
survival regression (accelerated failure time), Survival Regression (Accelerated Failure Time)
use cases for, Regression, Use Cases
Regular Expressions (RegExes), Regular Expressions-Regular Expressions, Tokenizing Text
RelationalGroupedDataset, DataFrames and SQL, Aggregations
repartition, Repartition and Coalesce, repartition, Stages, Repartitioning and Coalescing
repartitionAndSortWithinPartitions, repartitionAndSortWithinPartitions
replace function, replace
reports, real-time, Real-time reporting
requests, Client Request
rescaling applications, Sizing and Rescaling Your Application
reserved characters, Reserved Characters and Keywords
resilience
of business logic, Business logic resilience and evolution
of output data, Resilience in output and atomicity
resilience, of input data, Input data resilience
Resilient Distributed Datasets (RDDs)
accessing values in, first
actions, Actions-take
aggregations, Aggregations-foldByKey
caching, Caching
checkpointing, Checkpointing
CoGroups, CoGroups
counting, count
creating, Creating RDDs-From Data Sources
filtering, filter
joins, Joins
Key–Value RDDs, Key-Value Basics (Key-Value RDDs)-foldByKey
manipulating, Manipulating RDDs
mapping, map
object serialization in, Object Serialization in RDDs
overview of, Lower-Level APIs, About RDDs
partitions, Controlling Partitions-Custom Partitioning
pipe RDDs to system commands, Pipe RDDs to System Commands-Conclusion
random splits, Random Splits
RDDs of Case Classes vs. Datasets, Datasets and RDDs of Case Classes
reducing, reduce
removing duplicates from, distinct
saving files, Saving Files
serialization, Custom Serialization
sorting, sort
transformations, Transformations-Random Splits
types of, Types of RDDs
when to use, When to Use RDDs?, Which Spark API to Use?, DataFrames versus SQL versus
Datasets versus RDDs
resource utilization problem, On-Premises Cluster Deployments, Cluster/application sizing and
sharing
REST API Endpoints, Spark REST API
RFormula
benefits of, RFormula
column labels, RFormula
example, RFormula
operators, Feature Engineering with Transformers, RFormula
right outer joins, Right Outer Joins
rollup operator, Rollups
rounding numbers, Working with Numbers
Row type, DataFrames Versus Datasets, Records and Rows, Datasets
rows
accessing data in, Creating Rows
collecting to the driver, Collecting Rows to the Driver
concatenating and appending, Concatenating and Appending Rows (Union)
converting to columns, Pivot
creating, Creating Rows
extracting unique, Getting Unique Rows
filtering, Filtering Rows
generating unique IDs for, Working with Numbers
sorting, Sorting Rows
runtime properties, Runtime Properties
S
sample method, Random Samples
sampleByKey function, sampleByKey
save modes, Save modes
saveAsTextFile, saveAsTextFile
sbt tool, A Simple Scala-Based App
Scala
benefits of, A Simple Scala-Based App
case classes, In Scala: Case Classes
column creation in, Columns
comparison operators in, Working with Booleans
launching the console, Launching the Scala console
symbols in, Columns
type reference, Spark Types
scalability, Model Scalability, Model Scalability, Collaborative Filtering with Alternating Least
Squares, Model Scalability
scalar queries, Uncorrelated scalar queries
ScalaTest, Managing SparkSessions
scaling and normalization, Scaling and Normalization-Normalizer
scheduling, Job Scheduling Within an Application, Application Scheduling, Scheduling
schema inference, An End-to-End Example, Structured Streaming in Action
schema-on-read, Schemas, Schemas
schemas
components of, Schemas
defined, DataFrames
defining, Schemas
enforcing, Schemas
secure deployment configurations, Secure Deployment Configurations
Select method, select and selectExpr, Window Functions, Selections and Filtering
select statements, Select Statements
SelectExpr method, select and selectExpr
semi joins, Left Semi Joins
sequenceFiles, SequenceFiles
serialization, Custom Serialization, Signs and symptoms, Object Serialization in RDDs
sessionization, flatMapGroupsWithState-flatMapGroupsWithState
sets, grouping, Grouping Sets-Pivot
SHOW FUNCTIONS statement, Functions
shuffle
configuring behavior, Configuring Shuffle Behavior
defined, Transformations, Stages
efficiency of, Conclusion
external service, Miscellaneous Considerations
performance tuning, Shuffle Configurations
shuffle joins, Communication Strategies
shuffle persistence, Shuffle Persistence
simple types, Columns
sinks
console sink, Sources and sinks for testing
defined, Sinks
files, File source and sink
for testing, Sources and sinks for testing
memory sinks, Structured Streaming in Action, Sources and sinks for testing
sizing applications, Sizing and Rescaling Your Application
skewness, calculating, skewness and kurtosis
sliding windows, Sliding windows-Sliding windows
social networking, In-Degree and Out-Degree Metrics
socket source, Sources and sinks for testing
sort action, An End-to-End Example
sortBy method, sort
spaces, removing, Working with Strings
Spark (see Apache Spark)
Spark applications
application properties, Application Properties
architecture and components of, The Architecture of a Spark Application-Completion
basics of, Spark Applications
configuring, Configuring Applications-Job Scheduling Within an Application
development process, The Development Process
development template, Developing Spark Applications, Spark Logs
execution details, Execution Details
execution modes, Execution Modes
launching, Launching Applications-Application Launch Examples
lifecycle of inside of Spark, The Life Cycle of a Spark Application (Inside Spark)-Tasks
lifecycle of outside of Spark, The Life Cycle of a Spark Application (Outside Spark)-Completion
monitoring and debugging, Monitoring and Debugging-Conclusion
performance tuning, Performance Tuning-Conclusion
scheduling, Application Scheduling, Scheduling
(see also deployment)
testing, Testing Spark Applications-Connecting to Data Sources
writing, Writing Spark Applications-Running the application
Spark community, Spark's Ecosystem and Packages, Data Sources, Community-Local Meetups
Spark Deep Learning, An Abridged List of Popular Packages
Spark jobs
asynchronous job execution, Time-Outs
debugging, Debugging and Spark First Aid-Potential treatments
defined, Spark UI
execution order, The Life Cycle of a Spark Application (Inside Spark)
improving speed and execution of, Aggregations
monitoring, Queries, Jobs, Stages, and Tasks
scheduling, Job Scheduling Within an Application, Application Scheduling, Scheduling
stages and tasks in, Logical instructions to physical execution-Tasks
Spark logs, Spark Logs
Spark Packages, Spark's Ecosystem and Packages, Spark Packages-External Packages
Spark plan, Physical Planning
(see also physical plan)
Spark REST API, Spark REST API
Spark SQL
application configurations, Miscellaneous Features
benefits of, Big Data and SQL: Spark SQL
Catalog, Catalog, Miscellaneous Considerations
complex types, Complex Types-Lists
databases, Databases
vs. DataFrames and Datasets, DataFrames versus SQL versus Datasets versus RDDs
functions, Functions
history of, Big Data and SQL: Apache Hive
launching the console, Launching the SQL console
lists, Lists
overview of, Spark SQL
relationship to Hive, Spark's Relationship to Hive
running queries, How to Run Spark SQL Queries-SparkSQL Thrift JDBC/ODBC Server
select statement, Select Statements
structs, Structs
subqueries, Subqueries-Uncorrelated scalar queries
tables, Tables-Caching Tables
views, Views-Dropping Views
Spark Streaming, Stream Processing Fundamentals
Spark Summits, Spark Summit
Spark UI
benefits of, Spark UI
configuring, Configuring the Spark user interface
History Server, Spark UI History Server
job execution investigation, The Spark UI-The Spark UI
overview of, The Spark UI
query investigation, The Spark UI-The Spark UI
Spark REST API, Spark REST API
Structured Streaming API and, Spark UI
tabs in, The Spark UI, Other Spark UI tabs
Spark unit tests, Business logic resilience and evolution
spark variable, The SparkSession
spark-packages.org, Apache Spark's Philosophy
spark-shell, The Development Process
spark-submit, Running Production Applications, The Life Cycle of a Spark Application (Outside
Spark), A Simple Scala-Based App, Launching Applications-Launching Applications
spark.sql function, DataFrames and SQL
spark.sql.hive.metastore.jars, The Hive metastore
spark.sql.hive.metastore.version, The Hive metastore
spark.sql.shuffle.partitions, Stages
SparkConf, The SparkConf
SparkContext, How to Use the Low-Level APIs?, The SparkSession-The SparkContext
sparklyr
data manipulation, Data manipulation
data sources, Data sources
executing SQL, Executing SQL
key concepts, Key concepts
lack of DataFrames in, No DataFrames
machine learning, Machine learning
overview of, sparklyr
SparkR
benefits and drawbacks of, Pros and cons of using SparkR instead of other languages
data manipulation, Data manipulation
data sources, Data sources
function masking, Function masking
key concepts, Key Concepts-User-defined functions
machine learning, Machine learning
overview of, SparkR, SparkR
setup, Setup
SparkR functions, SparkR functions only apply to SparkDataFrames
user-defined functions (UDFs), User-defined functions-User-defined functions
SparkSession instances
creating, Starting Spark, The SparkSession
in Python, The SparkSession
in Scala, The SparkSession
managing, Managing SparkSessions
purpose of, The SparkSession
role in launch process, Launch
SparkContext and, The SparkContext-The SparkContext
split function, split
splittable file formats, Splittable File Types and Compression, Splittable file types and compression
SQL (Structured Query Language), What Is SQL?
(see also Spark SQL; SQL databases)
SQL databases
Java Database Connectivity (JDBC) driver for, SQL Databases
JDBC data source options, SQL Databases
query pushdown, Query Pushdown-Partitioning based on a sliding window
reading from, Reading from SQL Databases-Reading from SQL Databases
reading in parallel, Reading from databases in parallel, Reading Data in Parallel
SQLite, SQL Databases
systems available, SQL Databases
writing to, Writing to SQL Databases
SQLContext, The SparkSession
SQLite, SQL Databases
SQLTransformers, SQL Transformers
stages, Logical instructions to physical execution-Tasks, Queries, Jobs, Stages, and Tasks
standalone cluster manager, Cluster Managers-Submitting applications
standard deviation, calculating, Variance and Standard Deviation
StandardScaler, Estimators for Preprocessing, StandardScaler
stateful processing
arbitrary, Arbitrary Stateful Processing
benefits of Spark for, Event-Time and Stateful Processing
considerations for, Arbitrary Stateful Processing
flatMapGroupsWithState, flatMapGroupsWithState-flatMapGroupsWithState
mapGroupsWithState, mapGroupsWithState-mapGroupsWithState
output modes, Output Modes
overview of, Stateful Processing
time-outs, Time-Outs
StatFunctions Package, Working with Numbers
statistical functions, Working with Numbers
statistics, collecting and maintaining, Statistics collection
stddev function, Variance and Standard Deviation
stochastic gradient descent, Model Scalability
stop words, removing, Removing Common Words
STORED AS, Creating Tables
stragglers, Slow Tasks or Stragglers
stream processing
advantages of, Advantages of Stream Processing
API selection, Spark's Streaming APIs
basics of, What Is Stream Processing?
challenges of, Challenges of Stream Processing
continuous vs. micro-batch execution, Continuous Versus Micro-Batch Execution
design points, Stream Processing Design Points-Continuous Versus Micro-Batch Execution
event time vs. processing time, Event Time Versus Processing Time
history of, Stream Processing Fundamentals
record-at-a-time vs. declarative APIs, Record-at-a-Time Versus Declarative APIs
use cases for, Stream Processing Use Cases-Online machine learning
Streaming Listener, Advanced Monitoring with the Streaming Listener
strings, Working with Strings-Regular Expressions
StringIndexer, Machine Learning and Advanced Analytics, StringIndexer
strongly connected components algorithm, Strongly Connected Components
StructFields, Schemas
structs, Structs, Structs
StructType, Schemas
Structured APIs
basic structured operations, Basic Structured Operations-Conclusion
code execution, Overview of Structured API Execution-Execution
DataFrames, DataFrames and Datasets
Datasets, Datasets: Type-Safe Structured APIs, DataFrames and Datasets
overview of, Structured API Overview
schemas and, Schemas
selecting over RDDs, Lower-Level APIs
Spark fundamental concepts, Structured API Overview
structured Spark types and, Overview of Structured Spark Types-Spark Types, Converting to Spark
Types (Literals)
Structured Streaming API
alerting, Alerting
application updates, Updating Your Application-Sizing and Rescaling Your Application
applied example, Structured Streaming in Action-Structured Streaming in Action
benefits of, What Is Stream Processing?, Structured Streaming, Structured Streaming Basics
core concepts, Core Concepts-Watermarks
fault tolerance and checkpointing, Fault Tolerance and Checkpointing
history of, Stream Processing Fundamentals
input and output, Input and Output-Once trigger
lack of asynchronous job execution, Time-Outs
metrics and monitoring, Metrics and Monitoring-Spark UI
overview of, Structured Streaming-Structured Streaming, Structured Streaming Basics
as production-ready, Structured Streaming in Production
sizing and rescaling applications, Sizing and Rescaling Your Application
streaming Dataset API, Streaming Dataset API
Streaming Listener monitoring, Advanced Monitoring with the Streaming Listener
transformations on streams, Transformations on Streams-Joins
subgraphs, Subgraphs
subqueries, Subqueries
subqueries (Spark SQL), Subqueries-Uncorrelated scalar queries
sum aggregation method, DataFrames and SQL
sum function, sum
sumDistinct function, sumDistinct
summary statistics, computing, Working with Numbers
supervised learning
classification, Classification, Classification-Conclusion
goal of, Supervised Learning
regression, Regression, Regression-Conclusion
survival regression (Accelerated Failure Time), Survival Regression (Accelerated Failure Time)
T
tables (Spark SQL)
caching, Caching Tables
creating, Creating Tables
creating external, Creating External Tables
dropping, Dropping Tables
inserting into, Inserting into Tables
managed vs. unmanaged tables, Spark-Managed Tables
metadata, Describing Table Metadata
overview of, Tables
take action, An End-to-End Example, Datasets: Type-Safe Structured APIs, Actions, take
tasks, Logical instructions to physical execution-Tasks, Queries, Jobs, Stages, and Tasks, Slow Tasks
or Stragglers
template, Developing Spark Applications, Spark Logs
temporary data storage (caching), Temporary Data Storage (Caching)-Temporary Data Storage
(Caching)
TensorFlowOnSpark, TensorFlowOnSpark
TensorFrames, TensorFrames
testing
connecting to data sources, Connecting to Data Sources
connecting to unit testing frameworks, Connecting to Unit Testing Frameworks
key principles and tactics, Strategic Principles
managing SparkSessions, Managing SparkSessions
Spark API selection, Which Spark API to Use?
tactical considerations, Tactical Takeaways
text data transformers, Text Data Transformers-Term frequency–inverse document frequency
text files, Text Files
TF-IDF (term frequency-inverse document frequency), Term frequency–inverse document frequency-
Term frequency–inverse document frequency
the curse of dimensionality, Unsupervised Learning
Thrift JDBC/Open Database Connectivity (ODBC) server, SparkSQL Thrift JDBC/ODBC Server
time-outs, Arbitrary Stateful Processing, Time-Outs
time-related information, Working with Dates and Timestamps-Working with Dates and Timestamps
timestamps, Event-Time Processing
TimestampType class, Working with Dates and Timestamps
toDF method, Interoperating Between DataFrames, Datasets, and RDDs
Tokenizer transformer, Transformers, Tokenizing Text
toLocalIterator method, Collecting Rows to the Driver
toolkit components
Datasets API, Datasets: Type-Safe Structured APIs
ecosystem of packages and tools, Spark's Ecosystem and Packages
lower-level APIs, Lower-Level APIs
machine learning and advanced analytics, Machine Learning and Advanced Analytics-Machine
Learning and Advanced Analytics
overview of, What Is Apache Spark?, A Tour of Spark's Toolset
running production applications, Running Production Applications
SparkR, SparkR
Structured Streaming, Structured Streaming
topic modeling, Unsupervised Learning
to_date function, Working with Dates and Timestamps
to_timestamp function, Working with Dates and Timestamps
transfer learning, Ways of Using Deep Learning in Spark, Transfer Learning
transformations
basics of, Transformations
core operations, DataFrame Transformations
custom, User-Defined Functions-User-Defined Functions
DataFrame creation, Creating DataFrames
on Datasets, Transformations
end-to-end example, An End-to-End Example-DataFrames and SQL
locating APIs, Where to Look for APIs
in Structured Streaming API, Transformations and Actions, Transformations on Streams-Joins
working with different data types, Working with Booleans-User-Defined Functions
transformers
example, Feature Engineering with Transformers-Feature Engineering with Transformers
formatting models according to use case, Formatting Models According to Your Use Case
Interaction feature transformer, Interaction
locating, Formatting Models According to Your Use Case
persisting, Persisting Transformers
properties, Transformer Properties
purpose of, High-Level MLlib Concepts, Transformers
RFormula, RFormula
text data transformers, Text Data Transformers-Word2Vec
Tokenizer transformer, Transformers, Tokenizing Text
writing custom, Writing a Custom Transformer
treeAggregate method, aggregate
triggers, Structured Streaming, Triggers, When Data Is Output (Triggers)
tumbling windows, Tumbling Windows-Tumbling Windows
tuning (see performance tuning)
type safety, Datasets: Type-Safe Structured APIs, Which Spark API to Use?
typographical conventions, Conventions Used in This Book
U
uncorrelated predicate subqueries, Uncorrelated predicate subqueries
uncorrelated scalar queries, Uncorrelated scalar queries
uncorrelated subqueries, Subqueries
undirected graphs, Graph Analytics
unified software platforms, Apache Sparks Philosophy
unions, Concatenating and Appending Rows (Union)
unit tests, Business logic resilience and evolution-Connecting to Unit Testing Frameworks
unresolved logical plan, Logical Planning
unsupervised learning
bisecting k-means, Bisecting k-means Summary
Gaussian mixture models, Gaussian Mixture Models
Latent Dirichlet Allocation, Latent Dirichlet Allocation-Example
model scalability, Model Scalability
use cases for, Unsupervised Learning, Use Cases
k-means, k-means-k-means Metrics Summary
update output mode, Update mode
updates, real-time, Update data to serve in real time
updating applications, Updating Your Application-Sizing and Rescaling Your Application
uppercase, Working with Strings
user segmentation, Unsupervised Learning
User-Defined Aggregation Functions (UDAFs), User-Defined Aggregation Functions
user-defined functions (UDFs), User-Defined Functions-User-Defined Functions, Aggregating to
Complex Types, User-defined functions, Types of RDDs, User-Defined Functions (UDFs), Updating
Your Streaming Application Code, User-defined functions-User-defined functions
USING, Creating Tables
V
variance, calculating, Variance and Standard Deviation
Vector data type (MLlib), Low-level data types, Scaling and Normalization
VectorAssembler, VectorAssembler
VectorIndexer, Indexing in Vectors
Vectorized UDF, User-Defined Functions (UDFs), Pandas Integration
versions, updating, Updating Your Spark Version
vertices, Graph Analytics
views (Spark SQL)
creating, Creating Views
dropping, Dropping Views
purpose of, Views
vocabulary size (vocabSize), Converting Words into Numerical Representations
W
watermarks, Watermarks, Handling Late Data with Watermarks-Handling Late Data with Watermarks
where method, Filtering Rows
whitespace, removing, Working with Strings
wide dependencies, Transformations
windows
count-based, mapGroupsWithState-mapGroupsWithState
over time-series columns, Structured Streaming
partitioning based on sliding, Partitioning based on a sliding window
sliding windows, Sliding windows-Sliding windows
timestamp conversion, Windows on Event Time
tumbling windows, Tumbling Windows-Tumbling Windows
unique aggregations using, Window Functions-Grouping Sets
withColumnRenamed method, DataFrames and SQL
word combinations, creating, Creating Word Combinations
Word2Vec, Word2Vec
words, converting into numbers, Converting Words into Numerical Representations-Term
frequency–inverse document frequency
write method, Persisting Transformers
writing data
basics of, Basics of Writing Data
core API structure, Write API Structure
debugging, Slow Reads and Writes
save mode, Save modes
X
XGBoost, Random Forest and Gradient-Boosted Trees, External Packages
Y
YARN (see Hadoop YARN)
Z
zips, zips
About the Authors
Bill Chambers is a product manager at Databricks focused on helping customers succeed with their
large-scale data science and analytics initiatives using Spark and Databricks.
Bill also regularly blogs about data science and big data and presents at conferences and meetups. He
has a Master's degree in Information Systems from the UC Berkeley School of Information, where he
focused on data science.
Matei Zaharia is an assistant professor of computer science at Stanford University and chief
technologist at Databricks. He started the Spark project at UC Berkeley in 2009, where he was a PhD
student, and he continues to serve as its vice president at Apache. Matei also co-started the Apache
Mesos project and is a committer on Apache Hadoop. Matei's research work was recognized through
the 2014 ACM Doctoral Dissertation Award and the VMware Systems Research Award.
Colophon
The animal on the cover of Spark: The Definitive Guide is the swallow-tailed kite (Elanoides
forficatus). Found in woodland and wetland locations ranging from southern Brazil to the
southeastern United States, these raptors subsist on small reptiles, amphibians, and mammals, as well
as large insects. They build their nests near water.
Swallow-tailed kites tend to be 20–27 inches in length, and coast through the air on wings spanning
around 4 feet, using their sharply forked tails to steer. Their plumage grows in a strikingly contrasting
black and white, and they spend most of their time in the air, even grazing the surface of bodies of
water to drink rather than staying put on the ground.
Among the raptor species, Elanoides forficatus are social animals, and often nest in close proximity
or roost for the night in large communal groups. During migration, they may travel in groups
numbering in the hundreds or thousands.
Many of the animals on O’Reilly covers are endangered; all of them are important to the world. To
learn more about how you can help, go to animals.oreilly.com.
The cover image is from Lydekker's The Royal Natural History. The cover fonts are URW
Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad
Condensed; and the code font is Dalton Maag's Ubuntu Mono.
