Natural Language Processing and Computational Linguistics

A practical guide to text analysis with Python, Gensim, spaCy, and Keras

Bhargav Srinivasa-Desikan


Natural Language Processing and
Computational Linguistics
Copyright © 2018 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or
by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented.
However, the information contained in this book is sold without warranty, either express or implied. Neither the
author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to
have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned
in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this
information.

Acquisition Editors: Frank Pohlmann, Suresh Jain
Project Editor: Suzanne Coutinho
Content Development Editor: Alex Sorentinho
Technical Editor: Gaurav Gavas
Proofreader: Tom Jacob
Indexer: Tejal Daruwale Soni
Graphics: Tom Scaria
Production Coordinator: Sandip Tadge
First published: June 2018
Production reference: 1280618
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78883-853-5



About the author
Bhargav Srinivasa-Desikan is a research engineer working for INRIA in
Lille, France. He is part of the MODAL (Models of Data Analysis and
Learning) team, and he works on metric learning, predictor aggregation, and
data visualization. He is a regular contributor to the Python open source
community, and he completed Google Summer of Code in 2016 with Gensim
where he implemented Dynamic Topic Models. Bhargav is a regular speaker
at PyCons and PyDatas across Europe and Asia, and conducts tutorials on
text analysis using Python. He is the maintainer of the Python machine
learning package pycobra, and has published in the Journal of Machine
Learning Research.
I would like to thank the Python community for all their help, and for building such incredible packages for text analysis. I
would also like to thank Lev Konstantinovskiy for introducing me to the world of open source scientific computing and Dr.
Benjamin Guedj for always helping me with writing technical articles and material. I would also like to thank my parents,
brother and friends for their constant support throughout the process of writing the book.

About the reviewers
Brian Sacash is a data scientist and Python developer in the Washington,
DC area. He helps various organizations discover the best ways to extract
value from data. His interests are in the areas of Natural Language
Processing, Machine Learning, Big Data, and Statistical Methods. Brian holds
a Master of Science in Quantitative Analysis from the University of
Cincinnati and a Bachelor of Science in Physics from Ohio Northern
University.
Reddy Anil Kumar is a data scientist working at Imaginea Technologies
Inc. He has over 4 years of experience in the field of data science, which
includes 2 years of freelance experience. He is experienced in implementing
Artificial Intelligence solutions in various domains using Machine Learning /
Deep Learning, Natural Language Processing, and Big Data Analytics. In his
free time, he loves to participate in data science competitions and he is also a
Kaggle expert.

What this book covers
Chapter 1, What is Text Analysis? There is no time like now to do text analysis - we
have an abundance of easily available data, powerful and free open source
tools to conduct our analysis, and research on Machine Learning,
Computational Linguistics, and computing with text is progressing at a pace
we have not seen before. In this chapter, we will go into details about what
exactly text analysis is, and the motivations for studying and understanding
text analysis.

Chapter 2, Python Tips for Text Analysis. We mentioned in Chapter 1, What is
Text Analysis?, that we will be using Python throughout the book because it
is an easy-to-use and powerful language. In this chapter, we will substantiate
these claims, while also providing a revision course in basic Python for text
analysis. Why is this important? While we expect readers of the book to have
a background in Python and high-school math, it is still possible that it’s
been a while since you’ve written Python code - and even if you have written
some recently, the Python code you write during text analysis and string
manipulation is quite different from, say, building a website using the web
framework Django.

Chapter 3, spaCy’s Language Models. While we introduced text analysis in the
previous chapter, we did not discuss any of the technical details behind
building a text analysis pipeline. In this chapter, we will introduce you to
spaCy’s language models - these will serve as the first step in text analysis,
and are the first building block in our pipelines. Also, we will introduce the
reader to spaCy and how we can use spaCy to help us in our text analysis
tasks, as well as talk about some of its more powerful functionalities, such
as POS-tagging and NER-tagging. We will finish up with an example of how
we can preprocess data quickly and efficiently using spaCy.

Chapter 4, Gensim – Vectorizing Text and Transformations and n-grams.
While we have worked with raw textual data so far, any Machine Learning or
information retrieval related algorithm will not accept data like this - which
is why we use mathematical constructs called vectors to help the
algorithms make sense of the text. We will introduce gensim as the tool to
conduct this transformation, as well as scikit-learn, which will be used before
we plug in the text to any sort of further analysis. A huge part of
preprocessing is carried over when we start our vectorization - bi-grams,
tri-grams, and n-grams, as well as using term frequencies to get rid of some
words which we deem to not be useful.

Chapter 5, POS-Tagging and Its Applications. Chapters 1 and 2 introduced text
analysis and Python, and chapters 3 and 4 helped us set up our code for more
advanced text analysis. This chapter discusses the first of such advanced
techniques - part of speech tagging, popularly called POS-tagging. We will
study what parts of speech exist, how to identify them in our documents, and
what possible uses these POS-tags have.

Chapter 6, NER-Tagging and Its Applications. In the previous chapter, we saw
how we can use spaCy’s language pipeline - POS-tagging is a very powerful
tool, and we will now explore another interesting usage, NER-tagging. We
will discuss what exactly this is from both a linguistic and a text analysis point
of view, as well as detailing examples of its usage, and how to train our own
NER-tagger with spaCy.

Chapter 7, Dependency Parsing. We saw in Chapters 5 and 6 how spaCy’s
language pipeline performs a variety of complex Computational Linguistics
algorithms, such as POS-tagging and NER-tagging. This isn’t all spaCy packs
though, and in this chapter we will explore the power of dependency parsing
and how it can be used in a variety of contexts and applications. We will have
a look at the theory of dependency parsing before moving on to using it with
spaCy, as well as training our own dependency parsers.

Chapter 8, Topic Models. Until now, we have dealt with Computational Linguistics
algorithms and spaCy, and understood how to use these computational
linguistic algorithms to annotate our data, as well as understand sentence
structure. While these algorithms helped us understand the finer details of
our text, we still didn’t get a big picture of our data - what kind of words
appear more often than others in our corpus? Can we group our data or find
underlying themes? We will be attempting to answer these questions and
more in this chapter.

Chapter 9, Advanced Topic Modeling. We saw in the previous chapter the
power of topic modeling, and how intuitive a way it can be to understand our
data, as well as explore it. In this chapter, we will further explore the utility
of these topic models, and also how to create more useful topic models
which better encapsulate the topics that may be present in a corpus. Since
topic modeling is a way to understand the documents of a corpus, it also
means that we can analyze documents in ways we have not done before.

Chapter 10, Clustering and Classifying Text. In the previous chapter we studied
topic models and how they can help us in organizing and better
understanding our documents and their sub-structure. We will now move on to
our next set of Machine Learning algorithms, and to two particular tasks -
clustering and classification. We will learn the intuition behind
these two tasks, as well as how to perform these tasks using the popular
Python Machine Learning library, scikit-learn.

Chapter 11, Similarity Queries and Summarization. Once we have begun to
represent text documents in the form of vector representations, it is possible
to start finding the similarity or distance between documents - and that is
exactly what we will learn about in this chapter. We are now aware of a
variety of different vector representations, from standard bag-of-words or
TF-IDF to topic model representations of text documents. We will also learn
about a very useful feature implemented in gensim and how to use it -
summarization and keyword extraction.

Chapter 12, Word2Vec, Doc2Vec and Gensim. We have talked about
vectors a lot throughout the book - they are used to understand and
represent our textual data in a mathematical form, and all the
Machine Learning methods we use rely on these representations. We will be
taking this one step further, and use Machine Learning techniques to
generate vector representations of words which better encapsulate the
meaning of a word. This technique is generally referred to as word
embeddings, and Word2Vec and Doc2Vec are two popular variations of
this technique.

Chapter 13, Deep Learning for Text. Until now, we have explored the usage of
Machine Learning for text in a variety of contexts - topic modeling,
clustering, classification, text summarization, and even our POS-taggers and
NER-taggers were trained using Machine Learning. In this chapter, we will
begin to explore one of the most cutting-edge forms of Machine Learning -
Deep Learning. Deep Learning is a form of ML where we use biologically
inspired structures to generate algorithms and architectures to perform
various tasks on text. Some of these tasks are text generation, classification,
and word embeddings. In this chapter, we will discuss some of the
underpinnings of Deep Learning as well as how to implement our own Deep
Learning models for text.

Chapter 14, Keras and spaCy for Deep Learning. In the previous chapter, we
introduced Deep Learning techniques for text, and to get a taste of using
Neural Networks, we attempted to generate text using an RNN. In this
chapter, we will take a closer look at Deep Learning for text, and in
particular, how to set up a Keras model which can perform classification, as
well as how to incorporate Deep Learning into spaCy pipelines.

Chapter 15, Sentiment Analysis and ChatBots. By now, we are equipped with
the skills needed to get started on text analysis projects, and to also take a
shot at more complicated, meatier projects. Two common text analysis
projects which encapsulate a lot of the concepts we have explored
throughout the book are sentiment analysis and chatbots. In fact, we’ve
already touched upon all the methods we will be using for these projects, and
this chapter will serve as a guide to how one can put together such an application
on their own. In this chapter, we will not be providing the code to build a
chatbot or sentiment analysis pipeline from the first step to the last, but will
rather introduce the reader to a variety of techniques that will help when
setting up such a project.

Packt is searching for authors like you
If you're interested in becoming an author for Packt, please visit
authors.packtpub.com and apply today. We have worked with thousands of developers and tech
professionals, just like you, to help them share their insight with the global
tech community. You can make a general application, apply for a specific hot
topic that we are recruiting an author for, or submit your own idea.

Table of Contents
Title Page
Copyright and Credits

Natural Language Processing and Computational Linguistics

Packt Upsell

Why subscribe?


About the author
About the reviewers
Packt is searching for authors like you


Who this book is for
What this book covers
To get the most out of this book

Download the example code files
Download the color images
Conventions used

Get in touch



What is Text Analysis?

What is text analysis?
Where's the data at?
Garbage in, garbage out
Why should you do text analysis?


Python Tips for Text Analysis

Why Python?
Text manipulation in Python


spaCy's Language Models


Language models
Installing language models
Installation – how and why?
Basic preprocessing with language models

Tokenizing text

Part-of-speech (POS) – tagging
Named entity recognition
Rule-based matching



Gensim – Vectorizing Text and Transformations and n-grams

Introducing Gensim
Vectors and why we need them

Other representations

Vector transformations in Gensim
n-grams and some more preprocessing


POS-Tagging and Its Applications

What is POS-tagging?
POS-tagging in Python

POS-tagging with spaCy

Training our own POS-taggers
POS-tagging code examples


NER-Tagging and Its Applications

What is NER-tagging?
NER-tagging in Python

NER-tagging with spaCy

Training our own NER-taggers
NER-tagging examples and visualization


Dependency Parsing

Dependency parsing
Dependency parsing in Python
Dependency parsing with spaCy
Training our dependency parsers


Topic Models

What are topic models?
Topic models in Gensim

Latent Dirichlet allocation
Latent semantic indexing

Hierarchical Dirichlet process

Dynamic topic models

Topic models in scikit-learn


Advanced Topic Modeling

Advanced training tips
Exploring documents
Topic coherence and evaluating topic models
Visualizing topic models


Clustering and Classifying Text

Clustering text
Starting clustering
Hierarchical clustering
Classifying text


Similarity Queries and Summarization

Similarity metrics
Similarity queries
Summarizing text


Word2Vec, Doc2Vec, and Gensim


Using Word2Vec with Gensim

Other word embeddings




Deep Learning for Text

Deep learning
Deep learning for text (and more)
Generating text


Keras and spaCy for Deep Learning

Keras and spaCy
Classification with Keras
Classification with spaCy


Sentiment Analysis and ChatBots

Sentiment analysis

Reddit for mining data
Twitter for mining data


Other Books You May Enjoy

Leave a review - let other readers know what you think

Modern text analysis is now very accessible using Python and open source
tools, so discover how you can now perform modern text analysis in this era
of textual data.
This book shows you how to use natural language processing and
computational linguistics algorithms to make inferences and gain insights
about data you have. These algorithms are based on statistical machine
learning and artificial intelligence techniques. The tools to work with these
algorithms are available to you right now - with Python, and tools like
Gensim and spaCy.
You'll start by learning about data cleaning, and then how to perform
computational linguistics from first concepts. You're then ready to explore
the more sophisticated areas of statistical NLP and deep learning using
Python, using realistic language and text samples. You'll learn to tag, parse,
and model text using the best tools. You'll gain hands-on knowledge of the
best frameworks to use, and you'll know when to choose a tool like Gensim
for topic models, and when to work with Keras for deep learning.
This book balances theory and practical hands-on examples, so you can learn
about and conduct your own natural language processing projects and
computational linguistics. You'll discover the rich ecosystem of Python tools
you have available to conduct NLP - and enter the interesting world of
modern text analysis.

Who this book is for
Fluency in Python is assumed, but the book attempts to be accessible to even
Python beginners. Basic statistics is helpful. Given that this book introduces
Natural Language Processing from first principles, it helps, although it is not
a requirement, to be familiar with basic linguistics.

To get the most out of this book
Follow the listed steps and commands to prepare the system environment:

1. Python:

1. Most operating systems come with Python installed: it is already
available on Ubuntu 14.04 onwards and macOS, while on Windows it
usually has to be installed separately.

2. If not, please follow the official wiki documentation: https://wiki.python

This is a good time to start migrating all of the code to Python 3.6. By 2020, a lot of
scientific computing packages (such as NumPy) will be dropping support for Python 2.

2. spaCy:

pip install spacy

3. Gensim:

pip install gensim

4. Keras:

pip install keras

5. scikit-learn:

pip install scikit-learn
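
Once the commands above have run, it can be useful to confirm that the packages are visible to the interpreter you intend to use. The following is a minimal sketch, not part of the book's code bundle: the helper name check_environment is an assumption made for this illustration, and it only checks that each package can be located, not that its version matches the one used in the book.

```python
import importlib.util
import sys

# Import names for the packages listed above. Note that scikit-learn
# is imported as "sklearn", not "scikit-learn".
REQUIRED = ["spacy", "gensim", "keras", "sklearn"]


def check_environment(modules=REQUIRED):
    """Return a dict mapping each module name to True if it can be located."""
    # find_spec looks a module up without importing it, so the check is cheap.
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    for name, found in check_environment().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

If a package is reported as missing, rerunning the corresponding pip install command inside the same environment is usually the first thing to try.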

Download the example code files
You can download the example code files for this book from your account at w If you purchased this book elsewhere, you can visit
/support and register to have the files emailed directly to you.
You can download the code files by following these steps:

1. Log in or register at

2. Select the SUPPORT tab.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box and follow the onscreen
instructions.
Once the file is downloaded, please make sure that you unzip or extract the
folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at
ublishing/Natural-Language-Processing-and-Computational-Linguistics. The code and the PDF
version of all Jupyter notebooks is hosted at
l-Language-Processing-and-Computational-Linguistics/tree/master/notebooks. In case there's an
update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos
available at Check them out!

Download the color images
We also provide a PDF file that has color images of the screenshots/diagrams
used in this book. You can download it here:

Conventions used
There are a number of text conventions used throughout this book.
: Indicates code words in text, database table names, folder names,
filenames, file extensions, pathnames, dummy URLs, user input, and Twitter
handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk
image file as another disk in your system."

A block of code is set as follows:
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

When we wish to draw your attention to a particular part of a code block, the
relevant lines or items are set in bold:
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

Any command-line input or output is written as follows:
virtualenv env
source env/bin/activate
pip install spacy

Bold: Indicates a new term, an important word, or words that you see
onscreen. For example, words in menus or dialog boxes appear in the text
like this. Here is an example: "Select System info from the Administration
Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch
Feedback from our readers is always welcome.
General feedback: Email and mention the book title in
the subject of your message. If you have questions about any aspect of this
book, please email us at
Errata: Although we have taken every care to ensure the accuracy of our
content, mistakes do happen. If you have found a mistake in this book, we
would be grateful if you would report this to us. Please visit
t-errata, selecting your book, clicking on the Errata Submission Form link, and
entering the details.
Piracy: If you come across any illegal copies of our works in any form on the
Internet, we would be grateful if you would provide us with the location
address or website name. Please contact us at with a link to
the material.
If you are interested in becoming an author: If there is a topic that you
have expertise in and you are interested in either writing or contributing to a
book, please visit

Please leave a review. Once you have read and used this book, why not leave
a review on the site that you purchased it from? Potential readers can then
see and use your unbiased opinion to make purchase decisions, we at Packt
can understand what you think about our products, and our authors can see
your feedback on their book. Thank you!
For more information about Packt, please visit

What is Text Analysis?
There is no time like now to do text analysis – we have an abundance of
easily available data, powerful and free open source tools to conduct our
analysis, and research on machine learning, computational linguistics and
computing with text is progressing at a pace we have not seen before.
In this chapter, we will go into details about what exactly text analysis is and
look at the motivations for studying and understanding text analysis.
Following are the topics we will cover in this chapter:
What is text analysis?
Where's the data at?
Garbage in, garbage out
Why should YOU be interested?
A note about the references: they will appear throughout the PDF version of
the book as links, and if it is an academic reference it will link to the PDF of
the reference or the journal page. All of these links and references are then
displayed as the final section of the chapter, so offline readers can also visit
the websites or research papers.

What is text analysis?
If there's one medium that we are exposed to every single day, it's
text. Whether it's our morning paper or the messages we receive, it's likely
that we receive our information in the form of text.
Let's put things into a little more perspective – consider the amount of text
data handled by companies such as Google (1+ trillion queries per year),
Twitter (1.6 billion queries per day), and WhatsApp (30+ billion messages
per day). That's an incredible resource, and the sheer ubiquity of
text is reason enough for us to take it seriously. Textual data also has
huge business value, and companies can use this data to help profile
customers and understand customer trends. This can be used either to offer
a more personalized experience for users or as information for targeted
marketing. Facebook, for example, uses textual data heavily, and one of the
algorithms we will learn later in this book was developed by Facebook's AI
research team.

Fig 1.1 Rate of data growth from 2006 – 2018 with predicted rates of data in 2019 and 2020. Source: Patrick
Cheeseman, https://www.eetim /author.asp?section_id=36&doc_id=1330462

Text analysis can be understood as the technique of gleaning useful
information from text. This can be done through various techniques, and we
use Natural Language Processing (NLP), Computational Linguistics
(CL), and numerical tools to get this information. These numerical tools are
machine learning algorithms or information retrieval algorithms. We'll
briefly and informally explain these terms, as they will come up throughout
the book.
Natural language processing (NLP) refers to the use of a computer to process
natural language. Removing all occurrences of the word thereby
from a body of text is one such example, albeit a basic one.
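
The basic example just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name remove_word and the sample sentence are made up here, and matching whole words case-insensitively is one reasonable reading of "all occurrences".

```python
import re


def remove_word(text, word):
    """Remove every whole-word occurrence of `word` from `text`, ignoring case."""
    # \b anchors the match at word boundaries, so only the standalone word
    # is removed and longer words that contain it are left alone.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", flags=re.IGNORECASE)
    without_word = pattern.sub("", text)
    # Collapse the doubled whitespace left behind by the removal.
    return re.sub(r"\s{2,}", " ", without_word).strip()


sample = "He missed the train, thereby arriving late and thereby missing the talk."
cleaned = remove_word(sample, "thereby")
print(cleaned)  # He missed the train, arriving late and missing the talk.
```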
Computational linguistics (CL), as the name suggests, is the study of
linguistics from a computational perspective. This means using computers
and algorithms to perform linguistics tasks such as marking each word in your
text with its part of speech (such as noun or verb), instead of performing this task
manually.
Machine Learning (ML) is the field of study where we use statistical
algorithms to teach machines to perform a particular task. This learning
occurs with data, and our task is often to predict a new value based on
previously observed data.
Information Retrieval (IR) is the task of looking up or retrieving
information based on a query by the user. The algorithms that aid in
performing this task are called information retrieval algorithms, and we will
be encountering them throughout the book.
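
To make the idea of retrieving information based on a query concrete, here is a deliberately simple sketch that ranks documents by how often they contain the query terms. It is a toy written for this illustration, not one of the algorithms covered later in the book; the function names and the three sample documents are assumptions.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text, split it on whitespace, and strip basic punctuation."""
    return [token.strip(".,!?;:\"'()") for token in text.lower().split()]


def score(query, document):
    """Count how many times the distinct query terms occur in the document."""
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in set(tokenize(query)))


def retrieve(query, documents):
    """Return the documents sharing at least one term with the query, best match first."""
    scored = [(score(query, doc), doc) for doc in documents]
    matches = [(s, doc) for s, doc in scored if s > 0]
    # sorted() is stable, so documents with equal scores keep their original order.
    return [doc for s, doc in sorted(matches, key=lambda pair: pair[0], reverse=True)]


documents = [
    "Topic models group documents by theme.",
    "Search engines retrieve documents that match a query.",
    "Part-of-speech tagging labels each word in a sentence.",
]
results = retrieve("retrieve documents", documents)
for doc in results:
    print(doc)
```

Real retrieval systems weight terms rather than just counting them, but the overall shape (score every document against the query, then sort) stays the same.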
Text analysis itself has been around for a long time – one of the first
definitions of Business Intelligence (BI) itself, in an October 1958 IBM
Journal article by H. P. Luhn, A Business Intelligence System [1], describes a
system that will do the following:
"...utilize data-processing machines for auto-abstracting and auto-encoding of documents and for creating interest profiles
for each of the 'action points' in an organization. Both incoming and internally generated documents are automatically
abstracted, characterized by a word pattern, and sent automatically to appropriate action points."

It's interesting to see talk about documents, instead of numbers – to think
that the first ideas of business intelligence were understanding text and
documents is again a testament to text analysis throughout the ages. But
even outside the realm of text analysis for business, using computers to
better understand text and language has been around since the beginning of
ideas of artificial intelligence. The 1999 review on text analysis by John
Hutchins, Retrospect and prospect in computer-based translation [2], talks
about efforts to do machine translation as early as the 1950s by the United
States military, in order to translate Russian scientific journals into English.
Efforts to make an intelligent machine started with text as well – the ELIZA
program developed in 1966 at MIT by Joseph Weizenbaum is one example.
Even though the program had no real understanding of language, by basic
pattern matching it could attempt to hold a conversation. These are just
some of the earliest attempts to analyze text – computers (and human
beings!) have come a long way since, and we now have incredible tools at our
disposal.
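
To give a feel for how far basic pattern matching can go, the sketch below answers a few kinds of sentences with canned templates. The rules are hypothetical and written only for this illustration; they are not taken from the original ELIZA script, which was considerably more elaborate.

```python
import re

# Hypothetical rules for this sketch: each pairs a pattern with a reply template.
RULES = [
    (re.compile(r"\bI need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bbecause\b", re.IGNORECASE), "Is that the real reason?"),
]
DEFAULT_REPLY = "Please tell me more."


def respond(utterance):
    """Reply using the first rule whose pattern matches the utterance."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            # Reuse the captured text in the reply, dropping trailing punctuation.
            captured = [group.rstrip(".!?") for group in match.groups()]
            return template.format(*captured)
    return DEFAULT_REPLY


print(respond("I am tired of reading."))
print(respond("Hello there"))
```

Nothing here models meaning: the program only echoes fragments of its input, which is exactly why such a system can appear conversational without understanding language.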
Machine translation itself has come a long way, and we can now use our
smartphones to effectively translate between languages, and with cutting-edge
techniques such as Google's Neural Machine Translation, the gap
between academia and industry is reducing – allowing us to actually
experience the magic of natural language processing first hand.

Fig 1.2 An example of a Neural Translation model, working on French to English

Advances in this subject have helped advance the way we approach speech as
well – closed captioning in videos, and personal assistants such as Apple's
Siri or Amazon's Alexa benefit greatly from superior text processing.
Understanding structure in conversations and extracting information were
key problems in early NLP, and the fruits of that research are very
apparent in the 21st century.
Search engines such as Google or Bing also stand on the shoulders of the
research done in NLP and CL and affect our lives in an unprecedented way.
Information retrieval (IR) builds on statistical approaches in text processing
and allows us to classify, cluster, and retrieve documents. Methods such as
topic modeling can help us identify key topics in large, unstructured bodies
of text. Identifying these topics goes beyond searching for keywords, and we
use statistical models to further understand the underlying nature of bodies
of text. Without the power of computers, we could not perform this kind of
large-scale statistical analysis on the text. We will be exploring topic
modeling in detail later on in the book.

Fig 1.2 Techniques such as topic modeling use probabilistic modeling methods to identify key topics from the text. We
will be studying this in detail later in the book

Going one step beyond just being able to experience the wonders of modern
computing on our mobile phones, recent developments in both Python and
NLP mean that we can now develop such systems on our own!
Not only have the techniques used in NLP and text analysis evolved, they
have also become very accessible to us – open source packages are
becoming state-of-the-art, performing as well as commercial tools. An
example of a commercial tool would be Microsoft's Text Analysis API.
MATLAB is another example of a popular commercial tool used for scientific
computing. While historically such commercial tools performed better than
free, open source software, an increase in people contributing to open source
libraries, as well as funding from industry, has helped the open source
community immensely. Now, the tables appear to have turned, and many
software giants use open source packages for their internal systems – such
as Google using TensorFlow and Apple using scikit-learn! TensorFlow and
scikit-learn are two open source Python machine learning packages.
It can be argued that the sheer number of packages offered by the Python
ecosystem means it leads the pack when it comes to doing text analysis, and
we will focus our efforts here. A very strong and active open source
community adds to the appeal.
Throughout the course of the book, we will discuss modern natural language
processing and computational linguistics techniques and the best open
source tools available to us which we can use to apply these techniques.

Where's the data at?
While it is important to be aware of the techniques and the tools involved in
NLP and CL, it is, of course, pointless without any data. Luckily for us, we
have access to an abundance of data if we look in the right places. The easiest
way to find textual data to work on is to look for a corpus.
A text corpus is a large and structured set of texts and is a great way to start
off with text analysis. Examples of such corpora that are free are the Open
American National Corpus [5] or the British National Corpus [6]. Wikipedia
has a useful list of the largest corpora available in its article on text
corpora [7]. These are not limited to the English language: there also
exist various corpora in European and Asian languages, and there are
constant efforts worldwide to create corpora for the majority of languages.
University research labs are another valuable source for obtaining corpora
– indeed, one of the most iconic English language corpora, the Brown
Corpus, was put together at Brown University.
Different corpuses tend to have varying levels of information present,
usually dependent on the primary purpose for that corpora – for example,
corpora whose primary function is to aid during translation would have the
same sentence present in multiple languages. Another way corpora have
extra information is through annotation. Examples of annotation in text
usually include Part-Of-Speech (POS) tagging or Named-EntityRecognition (NER). POS-tagging refers to marking each word in a
sentence with its part of speech (Noun, verb, adverb, and so on), and a corpus
annotated for NER would have all named entities recognized, such as places,
people, and times. We'll be further going into details of both POS-tagging and
NER later on in the book, in Chapter 5, POS-Tagging and its Applications and Ch
apter 6, NER-Tagging and its Applications.
The purpose a corpus can serve depends on its structure and the level of
information present in it. Some corpora are also built to
evaluate clustering or classification tasks, where rather than annotation
being important, the label or class would be. This means that some corpora
are designed to aid with machine learning tasks such as clustering or
classification by providing text with labels tagged by humans. Clustering
refers to the task of grouping similar objects together, and classification is
the process of deciding which pre-defined class an object belongs to.
Identifying what exactly
your dataset is going to be used for is a crucial part of text analysis and an
important first step.
Apart from downloading datasets or scraping data off the internet, there are
still some rich sources for gathering our textual data – in particular,
literature. One example of this is the research done at the University of
Pennsylvania, where Alejandro Ribeiro, Santiago Segarra, Mark Eisen, and
Gabriel Egan discovered possible collaborators of Shakespeare, a literary
history problem that stumped many researchers [14]. They approached the
problem by identifying literary styles – an upcoming field of study in
computational linguistics called style analysis.
The increased use of computational tools to perform research in the
humanities has also led to the growth of Digital Humanities labs in
universities, where traditional research approaches are either aided or
overtaken by computer science, and in particular machine learning (and by
extension, natural language processing). Speeches of politicians, or
proceedings in parliament, for example, are another data source
used often in this community. TheyWorkForYou [17] is a UK parliament
tracking system which collects speeches and uploads them, and is an example of
the many sites available doing this kind of work.
Project Gutenberg [19] is likely the best resource to download books and contains
over 50,000 free eBooks and many literary classics. Personal PDFs and
eBooks also remain a resource, but again, it is important to know the legal
nature of your text before analyzing it. Downloading a pirated copy of, say,
Harry Potter off the internet and publishing text analysis results might not
be the best idea if you cannot explain where you got the text from! Similarly,
text analysis on private text messages might not only annoy your friends but
also could be infringing on privacy laws.

Fig 1.3 An example of a text dataset list – here, it is of reviews datasets found on

So where else, apart from downloading a structured dataset straight off the
internet, do we get our textual data? Well, the internet, of course. Even if it
isn't labelled, the sheer amount of text on the internet means that we can
access large parts of it – the [7] is one such example, and the media dump of
all the content on Wikipedia, after unzipping, is about 58 GB (as of April
2018) – more than enough text to play around with. The popular news
aggregation website Reddit [9] allows for easy web-scraping and is another
great resource for text analysis.
Python again remains a great choice to use for any such web-scraping, and
libraries such as BeautifulSoup [10], urllib [11] and scrapy [12] are designed
particularly for this. It is important to remain careful about the legal side of
things here, and make sure to check the terms and conditions of the website
where you are scraping the data from – a number of websites will not allow
you to use the information on the website for commercial purposes.
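To make the scraping step a little more concrete, here is a minimal sketch of pulling the visible text out of an HTML page. It deliberately uses only the standard library's html.parser module as a stand-in for BeautifulSoup, and it works on a hard-coded HTML string rather than a downloaded page; the TextExtractor class, the extract_text helper, and the sample markup are all hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page content; a real script would first download the page.
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
)
page_text = extract_text(sample_html)
print(page_text)
```

In practice we would fetch the page first (for example with urllib) and would probably prefer BeautifulSoup or scrapy for anything beyond a toy page, after checking the website's terms and conditions.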
Twitter is another website that is fast becoming a very important part of text
analysis – even academia takes this resource very seriously (What
is Twitter, a social network or a news media? [13] has over 5000 citations!),
with multiple papers being written on text analysis of tweets, and even
full-fledged tools [15] to do sentiment analysis have been built! The Twitter
streaming API allows us to easily mine for textual data from Twitter as well,
and the Python interface [16] is straightforward. Most world leaders are users
of Twitter, as well as celebrities and major news corporations – there are a lot
of interesting insights Twitter can offer us.

Fig 1.4 An example of the rich text resource Twitter has become, with multiple structured datasets available [7]. These
datasets, all mined from Twitter, have particular tasks which they can be used for, and fall under the category of labeled
datasets which we discussed before.

Other examples of textual information you can get off the internet include
research articles, medical reports, restaurant reviews (the Yelp! dataset
comes to mind), and other social media websites. Sentiment analysis is
usually the prime objective in these cases. As the name suggests, sentiment
analysis refers to the task of identifying sentiment in text. These sentiments
can be basic, such as positive or negative sentiment, but we could have more
complex sentiment analysis tasks where we analyze whether a sentence
contains happy, sad, or angry sentiments.
It's clear that if we look hard enough, it's more than easy to find data to play
around with. But let's take a small step back from downloading data off the
internet – where else can we try and find information?
Right in our hands, as it may seem – we send and receive text messages and
emails every day, and we can use this text for text analysis. Most text
messaging applications have interfaces to download chats. WhatsApp, for
example, will mail the data to you [18], with both media and text. Most mail
clients have the same option, and the advantage in both these cases is that
this kind of data is often well organized, allowing for easy cleaning and preprocessing before we dive into the data.
One aspect we've ignored so far whilst talking about data is the noise which
is often in the text – in tweets, for example, short forms and emoticons
are often used, and in some cases, we have multi-lingual data where a
simple analysis might fail. This brings us to arguably the most important
aspect of text analysis – pre-processing.

Garbage in, garbage out
Garbage in, garbage out (or GIGO) is an adage of computer science which is
even more important when dealing with machine learning and possibly even
more so when dealing with textual data. Garbage in, garbage out means that
if we have poorly formatted data, it is likely we will have poor results.

Fig 1.5 XKCD hits the hammer on the nail once again ( /1838/)

While more data usually leads to a better prediction, it isn't always the same
case with text analysis, where more data can result in nonsense results or
results which we don't always want. An intuitive example: articles, the part
of speech that includes words such as a or the, tend to appear a lot in text
without adding any information to it; their role is usually limited to grammar.
Words such as these which don't provide useful information are called stop
words, and these words are often removed from the text before applying text
analysis techniques to it. Similarly, sometimes we remove words with
very high frequency in the body of text, and words which only appear once or
twice – it is highly likely these words will not be useful to our analysis. That
being said, this depends heavily on the kind of task being performed – if, for
example, we would want to replicate human writing styles, stop words are
important because humans use many such words when writing. An example of
how stop words can also include useful information is the article Pastiche
detection based on stopword rankings. Exposing impersonators of a
Romanian writer [20], a study which identified a certain author using the
frequency of stop words.
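As a rough illustration of stop word removal, the following sketch filters a sentence against a very small, hand-written stop word list and then counts what is left. The STOP_WORDS set and the remove_stop_words helper are assumptions made for this example only; real stop word lists (such as the ones shipped with NLP libraries) are much longer, and splitting on whitespace is a simplification of proper tokenization.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word list for illustration.
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "in", "on", "to"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in stop_words]


sentence = "The cat sat on the mat and the dog sat in the garden"
filtered = remove_stop_words(sentence)
print(filtered)

# Word frequencies after filtering; very rare or very frequent words
# could be dropped in the same way by thresholding these counts.
counts = Counter(filtered)
print(counts.most_common(2))
```

Whether such filtering helps depends on the task: for the style-based studies mentioned above we would want to keep exactly the words this sketch throws away.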
Let's consider another example where we might be dealing with useless data
– if searching for influential words or topics in the text, would it make sense
to have both the words reading and read in the results? Here, shortening the
word reading to read would not lead to any loss of information. But on a
similar note, it would make sense to have the words information and inform
exist separately in the same body of text, because they could mean different
things based on the context. We would then need techniques to shorten
words appropriately. Lemmatizing and stemming are two methods we use to
tackle this problem and remain two of the core concepts in natural language
processing. We will be exploring these two techniques in more detail in
Chapter 3, spaCy's Language Models.
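To give a feel for what shortening words means, here is a deliberately crude suffix-stripping sketch. It is not the Porter stemmer, nor the lemmatizer spaCy provides; the SUFFIXES tuple, the crude_stem function, and the min_stem threshold are hypothetical choices made only to show that reading can be reduced to read while information and inform stay separate.

```python
# Hypothetical, tiny suffix list; real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["reading", "read", "information", "inform"]:
    print(w, "->", crude_stem(w))
```

A rule set this small makes many mistakes on real text (it would turn sing into sing but running into runn), which is one reason we rely on library implementations later in the book.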
Even after basic text-processing, our data is still a collection of words. Since
machines do not inherently understand the concepts tied to words, we can
instead use numbers that represent individual words. The next important
step in text analysis is converting words into numbers, whether it is
bag-of-words (BOW) or term frequency-inverse document frequency
(TF-IDF), which are different ways to count the number of words in each
document or sentence. There are also more advanced techniques to represent
words such as Word2Vec and GloVe.
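The two counting schemes mentioned above can be sketched with nothing more than the standard library. The three toy documents and the tf_idf helper below are assumptions made for illustration, and the formula used (term frequency times the natural log of the inverse document frequency) is only one common variant; libraries such as Gensim or scikit-learn apply their own smoothing and normalization.

```python
import math
from collections import Counter

# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [doc.split() for doc in docs]

# Bag-of-words: one word -> count mapping per document.
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents does each word occur?
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))


def tf_idf(term, doc_index):
    """TF-IDF of a term that occurs somewhere in the corpus."""
    tf = bow[doc_index][term] / len(tokenized[doc_index])
    idf = math.log(len(docs) / doc_freq[term])
    return tf * idf


print(bow[0])
print(tf_idf("the", 0), tf_idf("cat", 0))
```

Even in this tiny corpus, the word cat scores higher than the in the first document, although the occurs twice as often: TF-IDF discounts words that appear in many documents.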
We will cover these techniques in more detail in the chapter on
pre-processing techniques – it is especially important to understand the
motivation behind these techniques, and that a computer's output is only as
good as the input you feed it.

Why should you do text analysis?
We've talked about what text analysis is, where we can find the data, and
some of the things to keep in mind before diving into text analysis. But after
all, what motivation do you, the reader, have to actually go about doing text
analysis?
For starters, it's the sheer abundance of easily available data that we can use.
In the big data age, there really is no excuse to not have a look at what all our
data really means. In fact, apart from the massive datasets we can download
off the internet, we also have access to small data – text messages, emails, and
collections of poems are such examples. You could even do a meta-analysis
and run an analysis on this very book! Textual data is not only easy to get
hold of but, far more importantly, it's easy to interpret and understand the
results of the analysis. Numbers might not always make sense and are not
always appealing to look at – but words are easier for us human beings to
understand.
Text analysis remains exciting also because we can use data which directly
involves the user – our own text conversations, our favorite childhood book,
or tweets by our favorite celebrity. The personal nature of text data always
adds an extra bit of motivation, and it also likely means we are aware of the
nature of the data, and what kind of results to expect.
NLP techniques can also help us construct tools that can assist personal
businesses or enterprises – chatbots, for example, are becoming increasingly
common in major websites, and with the right approach, it is possible to have
a personal chat-bot. This is largely due to a sub-field of machine learning,
called Deep Learning, where we use algorithms and structures that are
inspired by the structure of the human brain. These algorithms and
structures are also referred to as neural networks. Advances in deep learning
have introduced us to powerful neural networks such as Recurrent Neural
Networks (RNNs) and Convolutional Neural Networks (CNNs). Now,
even with minimal knowledge of the mathematical functioning of these
algorithms, high-level APIs are allowing us to use these tools. Integrating
this into our daily life is no longer reserved for computer science researchers
or full-time engineers – with the right collection of data and open source
packages, this is well within our capabilities.

Open source packages have become industry standard – Google has released
and maintains TensorFlow [21], packages such as scikit-learn [22] are
used by Apple and Spotify, and spaCy [23], which we will extensively discuss
throughout this book, is used by Quora, a popular question-answer website.
We are no longer limited by either data or the tools – the only two things we
would need to do text analysis.
The programming language Python will be our friend throughout the book,
and all the tools we will use are free, open source software. While we
move towards open science, we also move towards open source code, and
this will remain a key philosophy throughout the book. In the world of
research, open source code means academic results are reproducible and
available to all those interested. Python remains an easy-to-use and powerful
language and serves as a great way to enter the world of natural language
processing.
One could argue that the last thing needed is the knowledge of how to
apply these tools and to wrangle with the data – but that is precisely the
purpose of the book, which hopes to let the reader build their own natural
language processing pipelines and models by the end of the journey.

We've had a look at the incredible power of text analysis, and the kind of
things we can do with it – as well as the kind of tools we would be using to
take advantage of this. Data has become increasingly easy for us to access,
and with the growth of social media, we have continuous access to both new
data, as well as standardized annotated datasets.
This book will aim at walking the reader through the tools and knowledge
required to conduct textual analysis on their own personal data or own
standardized datasets. We will discuss methods to access and clean data to
make it ready for pre-processing, as well as how to explore and organize our
textual data. Classification and clustering are two other commonly conducted
text processing tasks, and we will figure out how to perform these as well,
before finishing up with how to use deep learning for text.
In the next chapter, we will introduce how and why Python is the right choice
for our purposes, as well as discuss some Python tricks and tips to help us
with text analysis.

[1] A business intelligence system – H. P. Luhn, October 1958
[2] Retrospect and prospect in computer-based translation – John Hutchins,
September 1999
[3] Introduction to Neural Machine Translation with GPUs:

[4] Text Mining:

[5] Open American National Corpus:

[6] British National Corpus:

[7] List of Text Corpora:

[8] Wikipedia Dataset:

[9] Reddit, news aggregation website:

[10] Beautiful Soup:

[11] UrlLib:

[12] Scrapy:

[13] What is Twitter, a social network or a news media?:

[14] Shakespeare and his co-authors:

[15] Tweet Sentiment Visualization:

[16] Tweepy, twitter API:

[17] TheyWorkForYou:

[18] Mailing WhatsApp chat history:

[19] Project Gutenberg:

[20] Pastiche detection based on stopword rankings. Exposing impersonators
of a Romanian writer:

[21] TensorFlow:

[22] Scikit-learn:

[23] spaCy:

Python Tips for Text Analysis
We mentioned in Chapter 1, What is Text Analysis, that we will be using
Python throughout the book because it is an easy-to-use and powerful
language. In this chapter, we will substantiate these claims, while also
providing a revision course in basic Python for text analysis.
Why is this important? While we expect readers of the book to have a
background in Python and high-school level math, it is still possible that it's
been a while since you've written Python code – and even if you have, the
Python code you write during text analysis and string manipulation is quite
different from, say, building a website using the web framework Django.
Following are the topics we will cover in this chapter:
Why Python?
Text manipulation in Python

Why Python?
In Python, we represent text in the form of strings [1], which are objects of the
str [2] class. They are immutable sequences of Unicode code points or
characters. It is important to make a careful distinction here, though; in
Python 3, all strings are Unicode by default, but in Python 2, the str class is
limited to ASCII code, and there is a separate unicode class to deal with
Unicode text.
Unicode is a standard that assigns a number, called a code point, to every
character; it is a way we handle text. For
example, the Unicode code point for the letter Z is U+005A. There are many
encoding types, and historically in Python, developers were expected to deal
with different encodings on their own, with all the low-level action
happening in bytes. In fact, the shift in the way Python handles Unicode has
led to a lot of discussions [3], criticism [4], and praise [5] within the
community. It also remains an important point of contention when we are
porting code from Python 2 to Python 3.
We said earlier on that the low-level action was going on in bytes - what does
this mean? Bytes are numbers, and these numbers are used to represent
different characters or symbols. This is what Unicode and ASCII are - different
ways to represent characters. In Python 2, strings are stored as bytes, and in
Python 3, by default, they are stored as sequences of Unicode code points.
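A short Python 3 sketch may help make the difference between code points and bytes concrete. The sample word and the choice of UTF-8 as the encoding are assumptions made for this example; other encodings would give different byte counts.

```python
# "caf\u00e9" is the word café, written with an escape so the example
# does not depend on how this file itself is encoded.
word = "caf\u00e9"

# A str is a sequence of Unicode code points.
print(len(word))            # number of code points
print(hex(ord("Z")))        # the code point of Z, U+005A

# Encoding turns the code points into bytes; decoding reverses it.
encoded = word.encode("utf-8")
print(len(encoded))         # number of bytes in UTF-8
decoded = encoded.decode("utf-8")
print(decoded == word)
```

The string has four code points but five bytes in UTF-8, because the accented letter needs two bytes. This mismatch is the source of many of the encoding problems alluded to above.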
We will not be going deep into the technicalities of how text is encoded and
the problems we encounter when dealing with these encodings, but we can give
the following advice in general when dealing with text and Python - use
Python 3 and use Unicode! The reason is mainly that we want to stop using
Python 2; it is going to be phased out [6] by the scientific computing
community, and it makes no sense to keep using Python 2 applications
and code. Since Python 3 supports Unicode as well, we will be supporting the
use of Unicode for all text. In Python 2, this would mean remembering to include
u before our string starts, which ensures that it is a Unicode string; Python 3.3
and later accept the prefix but do not need it, since every string is already Unicode.
While most of the text analysis that we will be doing throughout this book
will not feature extensive string manipulation, it is still something we should
be comfortable doing, and often we will have troublesome words in our
dataset, where we would need to clean things up before starting any kind of
text analysis. It may also be important to make our final output pretty, and for these
kinds of tasks, it is worth knowing how to manipulate strings.
The other useful Python knowledge that will help us in text analysis is basic
data structures and how to use them - lists remain one of the most used data
structures during text analysis and knowing how a dictionary works is also
important to us.
The purpose of this chapter is to illustrate some of the functions we can
perform with strings, and how we use strings in lists and dictionaries.
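As a small preview of how strings, lists, and dictionaries work together, the sketch below splits a sentence into a list of words and counts them in a dictionary. The sample sentence and variable names are made up for this illustration, and splitting on whitespace is a simplification of real tokenization.

```python
sentence = "the quick brown fox jumps over the lazy dog"

# A list keeps the words in order.
words = sentence.split()
print(words[:3])

# A dictionary maps each word to how often it appears.
word_counts = {}
for w in words:
    word_counts[w] = word_counts.get(w, 0) + 1
print(word_counts["the"])

# Lists work with built-in functions; max returns the first longest word.
longest = max(words, key=len)
print(longest)
```

The same pattern (a list of tokens plus a dictionary of counts) underlies many of the representations we build later, so it is worth being comfortable with it.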
But we still haven't explained why we decided on Python as the language of our
choice - there are a number of text analysis packages in Java as well, and Perl
is another programming language with a reputation for being good with text.
But what sets Python apart is the community and open source libraries we
have access to.
You would have had a taste of this in the previous chapter as well - we talked
about Google using TensorFlow and Apple using scikit-learn, for example.
The open source code is reaching the same standards and efficiency as
industry code - one of the libraries we will focus on throughout this book,
spaCy, is an example of this. Collecting data is also largely done with Python,
using libraries such as tweepy (Twitter), urllib (accessing web pages), and
Beautiful Soup (extracting data from HTML pages). More people using a
certain ecosystem means it will grow (the Stack Overflow blog post does a
good write up regarding this [6]), and this means that both researchers and
industry are increasingly using it, which means it is a good time to jump on
the bandwagon!
Apart from the external support Python receives from the wide variety of
libraries (and in particular, NLP libraries), there are other reasons why
Python is an attractive language to use. One of these is Python's predominant
use as a scripting language. A scripting language is one that supports
running scripts: programs written for a run-time
environment that usually automate tasks. For example, if you write up a few
lines of code to quickly reply to Facebook birthday wishes, and this is done
every year - it is an example of a script. There is no hard and fast rule to what
you call a scripting language; it's rather a way we colloquially discuss
programming languages.
Python is a very useful scripting language because of how quickly we can
code up a script to manipulate text files - it is easily readable, fast enough for
file sizes that are not massive, and is an interpreted language [7], meaning we
don't need to compile our code before running it. It is dynamically typed [8],
meaning we don't need to declare data types while writing code.
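To illustrate what dynamic typing looks like in practice, here is a minimal sketch in which the same variable name first refers to a string and then to an integer, with no type declarations anywhere. The variable names are arbitrary choices for this example.

```python
value = "42"                        # value refers to a str
first_type = type(value).__name__

value = int(value) + 1              # the same name now refers to an int
second_type = type(value).__name__

print(first_type, second_type, value)
```

The flexibility is convenient for quick scripts, but it also means type mistakes (such as adding a number to a string) only show up when the offending line actually runs.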
But more than the technical reasons of why Python is superior, we are more
interested in Python because of its ease-of-use. It is flexible, readable, and
with a high level of abstraction, allowing us to be more productive. We can
focus more on the problem, rather than on programming technicalities and
code errors. This is not to suggest we won't have code errors when coding in
Python; just that they tend to be more solvable and provide more
information than just, for example - SEGMENTATION FAULT.
We will now illustrate basic Python commands for string manipulation and
text analysis. For users who are already familiar with the basics of Python
and text, you are welcome to not run all the code in this chapter, but a quick
glance through the code will serve as a useful reminder!

Text manipulation in Python
We mentioned earlier in the chapter that the way we represent text in Python
is through strings. So how do we specify that an object is a string?
word = "Bonjour World!"

Now the word variable contains the text, Bonjour World!. Note how we used double
quotes around the text that we intend to use - single quotes also work, but
if we wish to use a single quote inside our string, we would need to enclose it in
double quotes. Printing our word is straightforward, where all we need to do
is use the print function. Remember to use parentheses if we are coding in
Python 3!
print(word)
Bonjour World!

We don't have to use variables to be able to print a string though - we can also
just do:
print("Bonjour World!")
Bonjour World!

Be careful not to enclose your variable in quotations though! Consider this
example:
print("word")
word

This will just print the literal text word, not the contents of the variable.
We mentioned before in the chapter that a string is a sequence of characters;
how do we then access the first character of a string?
word[0]
'B'

We can similarly access subsequent indices. What about finding the length of
a string?
len(word)
14

Let us now quickly run through more string functions, such as finding
characters, counting characters, and changing the case of letters in a word.
word.count("o")
3

There are three o letters, so the output is 3.
word.find("j")
3

The index of the j character in the string is 3.
word.index("World")
8

This helps us find where the World string starts.
word.upper()
'BONJOUR WORLD!'

The upper method converts all characters to uppercase.
word.lower()
'bonjour world!'

The lower method converts all characters to lowercase.
word.title()
'Bonjour World!'

The title method capitalizes the first letter of every word.
word.capitalize()
'Bonjour world!'

The capitalize method only capitalizes the first letter.
word.swapcase()
'bONJOUR wORLD!'

The swapcase method, as the name suggests, changes the case of each letter.
Python strings can also be manipulated by arithmetic operators.
Adding the word Fromage (which means cheese in French) to our word string can
be done by simply adding Fromage to the end of it!
print(word + " Fromage!")
Bonjour World! Fromage!

Similarly, we can also multiply words the same way:
print("hello " * 5)
hello hello hello hello hello

String functions can also help us easily reverse a string or add an extra
whitespace between every character.
print("".join(reversed(word)))
!dlroW ruojnoB

reversed returns an iterator over the characters in reverse order, which we
then join using the join function. We
can similarly use join to add an extra whitespace.
print(" ".join(word))
B o n j o u r   W o r l d !

We may also often find the need to check different properties of our strings.
This can be done with the following methods:
word.isalnum()
Checks if all characters are alphanumeric.

word.isalpha()
Checks if all characters in the string are alphabetic.

word.isdigit()
Checks if the string consists only of digits.

word.istitle()
Checks if the string is title-cased.

word.isupper()
Checks if the string is in upper case.

word.islower()
Checks if the string is in lower case.

word.isspace()
Checks if the string consists only of whitespace.

word.endswith('f')
Checks if the string ends with an f character.

word.startswith('H')
Checks if the string starts with H.
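One place where these string checks come in handy is when cleaning up a list of tokens. The sketch below is only an illustration under assumed inputs: the tokens list is made up, and the grouping into alphabetic, numeric, upper-case, and blank tokens is just one possible way to use the checks.

```python
# Hypothetical tokens, as we might get after a naive split of raw text.
tokens = ["Bonjour", "World!", "2018", "NLP", "  ", "spaCy", "text"]

# Keep tokens made up only of letters.
alphabetic = [t for t in tokens if t.isalpha()]

# Tokens made up only of digits, such as years.
numeric = [t for t in tokens if t.isdigit()]

# Tokens written entirely in upper case, such as acronyms.
upper_case = [t for t in tokens if t.isupper()]

# Tokens that are nothing but whitespace and can usually be dropped.
blank = [t for t in tokens if t.isspace()]

print(alphabetic)
print(numeric, upper_case, blank)
```

Note that World! is rejected by isalpha because of the exclamation mark; in a real pipeline we would usually strip punctuation before applying such a filter.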
It is also possible to replace characters in strings, as well as slice them up; in
fact, slicing strings for sub-strings is a very useful and fundamental part of
playing with strings.
word.replace("World", "Pizza")
'Bonjour Pizza!'

The straightforward replace function makes our world into pizza!
Slicing is the process of getting a portion of your string. The syntax is as
follows:
new_string = old_string[startloc:endloc]

If, for example, we only want the second part of our sentence and we are
aware it is from the 8th to the 16th character, we can try this:
word[8:16]
'World!'

And if we only wish for the first word, we can use this:
word[:7]
'Bonjour'

Leaving the part before the colon blank means the index starts from
the beginning of the string.

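A few more slicing behaviours are worth seeing side by side. The sketch below reuses the Bonjour World! string from this section; the particular indices are just examples.

```python
word = "Bonjour World!"

# An end index past the end of the string is simply clamped,
# so slicing up to 16 works even though the string has 14 characters.
second_part = word[8:16]

# Negative indices count from the end of the string.
last_six = word[-6:]

# A third value is the step; a step of -1 walks the string backwards.
backwards = word[::-1]

print(second_part, last_six, backwards)
```

Unlike indexing a single character, slicing never raises an IndexError for out-of-range bounds, which makes it forgiving but also easy to get silently wrong.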
With the knowledge of the functions and strategies we have discussed, our
text analysis can be aided; it is often when we are doing large scale text
analysis that a small error can lead to complete nonsense results
(remember garbage in, garbage out from Chapter 1, What is Text Analysis?).
We finish this mini-chapter with a few useful links on basic text
manipulation:

1. Printing and Manipulating Text [9]: Basic manipulation and
printing of text, recommended if interested in how to display text in
different ways.

2. Manipulating Strings [10]: Basic String functions as well as exercises,
useful for the further practice of string manipulation.

3. Manipulating Strings in Python [11]: Similar to the two preceding
links; includes a section on escape sequences as well.

4. Text Processing in Python (book) [12]: Unlike the other links, this is
a whole book. It covers the very fundamentals of text and string
manipulation in Python and includes useful material on some
uncovered topics such as regular expressions.

5. An Introduction to Text Analysis in Python [13]: This provides
great further reading if you want to get a more general view of the
relationship between Python and text analysis. Recommended if you're a
beginner and you need more foundation to this chapter.
Understanding how strings behave in Python and being able to quickly
perform basic operations on them will come in handy multiple times
throughout the book - after all, no house can be built to last long without a
strong foundation!

[1] Strings:

[2] str:

[3] Strings, Bytes, and Unicode in Python 2 and 3:

[4] More About Unicode in Python 2 and 3:

[5] Python 3 and ASCII Compatible Binary Protocols:

[6] The Incredible Growth of Python:

[7] Interpreted Language:

[8] Dynamically Typed:

[9] Printing and Manipulating Text:

[10] Manipulating Strings:

[11] Manipulating Strings in Python:

[12] Text Processing in Python:

[13] An Introduction to Text Analysis in Python:

spaCy's Language Models
While we introduced text analysis in Chapter 1, What is Text Analysis?, we did
not discuss any of the technical details behind building a text analysis
pipeline. In this chapter, we will introduce you to spaCy's language models –
these will serve as the first step in text analysis and are the first building
block in our pipelines. We will also look at
how we can use spaCy to help us in our text analysis tasks, as well as talk
about some of its more powerful functionalities, such as Part-of-Speech
tagging and Named Entity Recognition tagging. We will finish up with an
example of how we can preprocess data quickly and efficiently using the
natural language processing Python library, spaCy.
We will cover the following topics in this chapter:
Tokenizing Text

Having discussed some of the basics of text analysis, let's dive head first into
our first Python package we'll be learning to use - spaCy [1].
spaCy describes itself as Industrial Strength Natural Language
Processing – and it most certainly does its best to live up to this promise.
Focused on getting things done rather than a more academic approach, spaCy
ships with only one part-of-speech tagging algorithm and only one
named-entity recognizer (per language). What this also means is that the package is
not bloated with unnecessary features.
We previously mentioned an academic approach – what does this mean? A large
number of the open-source packages in natural language processing and
machine learning are created or maintained by researchers and those
working in academia. While they do end up working, the aim of these
projects is not to provide state-of-the-art implementations of algorithms.
NLTK [2] is one such example, where the primary focus of the library is to
give students and researchers a toolkit to play around with. spaCy, on the
other hand, can be used pretty satisfactorily in production code – this means
that you can expect it to perform on real-world data, and with the right
amount of foresight, it can also be scalable.
The blog post [3] by Matt Honnibal, creator and maintainer of spaCy, goes
into some more detail about the problems being faced in open source NLP
libraries, and the philosophy of spaCy. The crux of the problem is still the lack
of curation and maintenance in some libraries (such as Pattern [4], which
only very recently attempted to move to Python 3), and in the case of NLTK,
outdated techniques or serving simply as a wrapper [5] tool, providing
bindings which let you use other POS-taggers or parsers.
However, with all this being said it is still worthwhile to explore what NLTK
has to offer – it still serves as a fairly handy tool for studying traditional NLP
techniques, as well as providing a variety of corpora (such as the Brown Corpus
[6]). This link [7] is part of the NLTK Book and serves as a way to get started
with exploring a few of these corpora. We will not be going into the inner
workings of NLTK, and prior knowledge of NLTK is not required to
make the most of your NLP projects.

Throughout this book, we will be using spaCy (v2.0) for our text
preprocessing and computational linguistics purposes. Following are the
features of spaCy:

1. Non-destructive tokenization

2. Support for 21+ natural languages

3. 6 statistical models for 5 languages

4. Pre-trained word vectors

5. Easy deep learning integration

6. Part-of-speech tagging

7. Named entity recognition

8. Labeled dependency parsing

9. Syntax-driven sentence segmentation

10. Built-in visualizers for syntax and NER

11. Convenient string-to-hash mapping

12. Export to numpy data arrays

13. Efficient binary serialization

14. Easy model packaging and deployment

15. State-of-the-art speed

16. Robust, rigorously evaluated accuracy
The following is a table that has features of spaCy as mentioned on their website:

Fig 3.1 Feature comparison from the Facts & Figures page

Let's get started with setting up and installing spaCy. spaCy is compatible
with 64-bit CPython [8] 2.6+/3.3+ and runs on Unix/Linux, macOS/OS X, and
Windows. CPython is the reference implementation of Python, written in C –
we don't need to know the details behind it, and if you have a stable
installation of Python running, it is likely your CPython modules are just fine
as well. The latest spaCy releases are available over Pip [9] (source packages
only) and Conda [10]. Pip and conda are two Python package managers.
Installation requires a working build environment. We will be using Python
3, though the examples are all valid for Python 2 as well.
Pip remains the most straightforward choice, but users with Anaconda
installed will be using conda instead.
pip install -U spacy
When using pip, it is generally recommended that you install packages in a virtual environment
(created with a tool such as virtualenv) to avoid modifying system state.

Since we will be downloading a number of Python packages throughout the
book, it makes sense to understand exactly how virtual environments in
Python work – this post [11] serves as a good resource to learn the same.
virtualenv env
source env/bin/activate
pip install spacy

Hopefully, by now you should have spaCy up and running. Typing the following
into your Python terminal should let you verify the spaCy installation:
import spacy

Now, it may be the case that some issues popped up during the
installation process; this might be because of CPython installation
complications. If you are running a Mac system, you may need to run this:
xcode-select --install

This installs the Mac command-line developer tools.
Most of the common installation problems that may occur are well
documented in both Stack Overflow and the spaCy GitHub page.
The following two links are useful in troubleshooting:
Can't Pip Install (Mac) [12]
Failed building wheel for spacy (Windows) [13]
In general, if you are using a virtual environment with the correct Xcode (for
Mac users) and Python dependencies, there should not be unsolvable
installation problems.
While we are discussing spaCy, it is important to know what other tools are
available for similar tasks and how spaCy compares to these – the Facts &
Figures [14] page on the spaCy website goes into the numbers behind spaCy's
performance.
Let's move on to our first usage of spaCy – language models.

Language models
One of spaCy's most interesting features is its language models [15]. A
language model is a statistical model that lets us perform the NLP tasks we
want to, such as POS-tagging and NER-tagging. These language models do
not come packaged with spaCy, but need to be downloaded – we'll get into
details of exactly how we can download these models later on in the chapter.
Different languages have different models to perform these tasks, and there
are also different models for the same language – the difference between
these models is mostly statistical, and you can use different models based on
your use case. A different model would just be trained on a different dataset.
It is still the same underlying algorithm. The spaCy documentation on their
models gives us some more insight into how they work.
As of now, there are models available for English, German, French, Spanish,
Portuguese, Italian, and Dutch, and this number is expected to grow. For
more information about the models, such as naming conventions or
versioning, you can visit the model overview page [16]. We will be focusing
more on using these models, before briefly going over how to create our own
pipelines and models.

Installing language models
As of v1.7.0, models for spaCy can be installed as Python packages. This
means that they're a component of your application, just like any other
module. Models can be installed from a download URL or a local directory,
manually or via pip.
The easiest way to download and use these models is using spaCy's download
command:
# out-of-the-box: download best-matching default model
spacy download en # english model
spacy download de # german model
spacy download es # spanish model
spacy download fr # french model
spacy download xx # multi-language model

# download best-matching version of specific model for your spaCy installation
spacy download en_core_web_sm
# download exact model version (doesn't create shortcut link)
spacy download en_core_web_sm-2.0.0 --direct

What the download command does is to use pip to install the model, place it in
your site-packages folder, and create a shortcut link that allows you to easily
load it later.
For example, if we want to use the English language model, we start by
running these commands in Terminal:
pip install spacy
spacy download en

Next we run the following commands in our Python shell:
import spacy
nlp = spacy.load('en')

We now have the English language model loaded, and we can use it to
process our text through the pipeline, like this:
doc = nlp(u'This is a sentence.')

Strings are Unicode by default in Python 3; in Python 2, however, we need to
prefix string literals with u, as in u'This is a sentence.'. We will discuss
more of the nature of the doc object, as well as exactly what goes on in the
pipeline, in the upcoming section.
It's also possible to download the models via pip – to download a model
directly using pip, simply point pip install to the URL or local path of the
archive file. To find the direct link to a model, head over to the model
releases [17] and find the archive links.
Some of these models can be quite large, and the full English model is over 1 GB.

# with external URL
pip install
# with local file
pip install /Users/you/en_core_web_md-1.2.0.tar.gz

By default, this will install the model into your site-packages directory. You can
then use spacy.load() to load it via its package name, create a shortcut link to
assign it a custom name or import it explicitly as a module.
Once we've downloaded a model through pip or via spaCy's downloader, we
can call the load() method, as follows:
import en_core_web_md
nlp = en_core_web_md.load()
doc = nlp(u'This is a sentence.')

The model usage page [15] of spaCy has details on how to manually download
models, use custom shortcut links, and other information which might be
useful – we will be covering some of these topics further in the book (in
Chapter 5, POS-Tagging and its Applications, Chapter 6, NER-Tagging and its
Applications, and Chapter 7, Dependency Parsing), but it is worth having a
quick glance at how we can organize these models.

Installation – how and why?
How you choose to load your models is a matter of personal preference and
the type of project you are working on. For example, with larger code bases,
native imports are usually recommended, as this will make it easier to
integrate models with your existing build process, continuous integration
workflow, and testing framework. You can also add the model in your
requirements.txt file like any other library or module being used in your project.
Note that a requirements file is a standard feature in most Python projects;
the documentation page [18] explains it in more detail. Native imports will
also prevent you from ever trying to load a model that is not installed, as
your code will raise an ImportError immediately, instead of failing later
when spacy.load() is called.
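To make the "fail early" behaviour of native imports concrete, here is a minimal sketch using only the standard library. The helper name import_model_package and the wording of its error message are our own and not part of spaCy; the sketch only assumes that an installed model is an importable Python package, as described above.

```python
import importlib


def import_model_package(package_name):
    """Import an installed model package by name, failing early if it is missing."""
    try:
        return importlib.import_module(package_name)
    except ImportError as err:
        # Re-raise with a hint; the hint text is our own wording.
        raise ImportError(
            "Model package %r is not installed; try 'spacy download %s'"
            % (package_name, package_name)
        ) from err


# With the model installed, this mirrors `import en_core_web_md` followed by
# `en_core_web_md.load()`:
#     nlp = import_model_package("en_core_web_md").load()
```

Because the import happens at startup, a missing model surfaces as soon as the program runs rather than deep inside a processing job.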
Apart from the languages we previously spoke about, spaCy has started
tokenization work on Italian, Portuguese, Dutch, Swedish, Finnish,
Norwegian, Danish, Hungarian, Polish, Hebrew, Bengali, Hindi, Indonesian,
Thai, Chinese (Mandarin), and Japanese. Again, since spaCy is open source,
you can contribute to the ongoing efforts.
Now that we know exactly how to get the models on our systems, let's start
asking more questions about these models – how do they perform POS-tagging
or NER-tagging? What kind of object is returned when we pass Unicode text
(Unicode is an industry standard for consistent encoding) through
the pipeline? How do we use that object to do our preprocessing? We'll
attempt to answer these in the coming section, while also discussing the
other possibilities spaCy has to offer with regard to its models, such as
training our own models or adding new languages to spaCy.

Basic preprocessing with language models
In Chapter 1, What is Text Analysis?, we mentioned the importance of
preprocessing – after all, garbage in, garbage out, right? But we didn't go into
many details about how we clean up our dirty data. Luckily for us, this is a
well-researched problem in natural language processing, and there are many
different preprocessing techniques, pipelines, and ideas for us to use when
we want to clean up.
Technically, we don't need a package to specifically help us with
preprocessing – simple string manipulation with Python can do the trick,
albeit with a lot more effort. We'll use spaCy to help us with preprocessing,
though in theory, even NLTK, for example, can be used. So why even bother
with using spaCy? It's because along with basic preprocessing, it achieves a
lot more in just one processing step – something we will see very soon in
this chapter.
In particular, we will be using spaCy's language model to help us with the
preprocessing. Before we get into the exact preprocessing steps, let's first
understand what happens when running this:
doc = nlp(u'This is a sentence.')

When you call nlp on Unicode text, spaCy first tokenizes the text to produce
a Doc object. Doc is then processed in several different steps, what we also
refer to as our pipeline.

Fig 3.2 The default pipeline

Tokenizing text
You can see that the first step in this pipeline is tokenizing – what exactly is
tokenization?
Tokenization is the task of splitting a text into meaningful segments, called
tokens. These segments could be words, punctuation, numbers, or other
special characters that are the building blocks of a sentence. In spaCy, the
input to the tokenizer is a Unicode text, and the output is a Doc object [19].
Different languages will have different tokenization rules. Let's look at an
example of how tokenization might work in English. For the sentence – Let
us go to the park., it's quite straightforward, and would be broken up as
follows, with the appropriate numerical indices:

0    1    2    3    4    5     6
Let  us   go   to   the  park  .

This looks awfully like the result when we just run text.split(' ') – when does
tokenizing involve more effort?

If the previous sentence was Let's go to the park. instead, the tokenizer
would have to be smart enough to split Let's into Let and 's. This means that
there are some special rules to follow. spaCy's English language tokenizer
checks the following after splitting a sentence up:
Does the substring match a tokenizer exception rule? For example, don't
does not contain whitespace, but should be split into two tokens, do and n't,
while U.K. should always remain one token.

Can a prefix, suffix or infix be split off? For example, punctuation such as
commas, periods, hyphens, or quotes.
Unlike the other parts of the pipeline, we don't need a statistical model to
perform tokenization. Global and language-specific tokenizer data is supplied
via the language data in the spacy/lang [20] folder, which is simply a directory
that contains model-specific data. The tokenizer exceptions define special
cases such as "don't" in English, which needs to be split into two tokens:
{ORTH: "do"} and {ORTH: "n't", LEMMA: "not"}. The prefixes, suffixes, and infixes mostly
define punctuation rules – for example, when to split off periods (at the end
of a sentence), and when to leave tokens containing periods untouched
(abbreviations such as N.Y.). Here, ORTH refers to the textual content, and
LEMMA to the word with no inflectional suffix.
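The two checks described above (exception rules first, then prefixes and suffixes) can be sketched in plain Python. This is a toy illustration and not spaCy's implementation: the EXCEPTIONS table, the PUNCT string, and the function name toy_tokenize are all made up for this example, and infix handling is left out.

```python
# Toy exception table modelled on the examples in the text; these entries are
# illustrative and are not spaCy's actual language data.
EXCEPTIONS = {
    "don't": ["do", "n't"],
    "Let's": ["Let", "'s"],
    "U.K.": ["U.K."],
    "N.Y.": ["N.Y."],
}
PUNCT = ".,!?;:\"'()"


def toy_tokenize(text):
    """Split on whitespace, then apply exception and prefix/suffix rules."""
    tokens = []
    for chunk in text.split():
        head = []  # prefixes and the core token(s), in order
        tail = []  # suffixes, kept in their original order
        while chunk:
            if chunk in EXCEPTIONS:
                # Rule 1: the substring matches a tokenizer exception.
                head.extend(EXCEPTIONS[chunk])
                chunk = ""
            elif chunk[0] in PUNCT:
                # Rule 2a: split off a punctuation prefix.
                head.append(chunk[0])
                chunk = chunk[1:]
            elif chunk[-1] in PUNCT:
                # Rule 2b: split off a punctuation suffix.
                tail.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:
                head.append(chunk)
                chunk = ""
        tokens.extend(head + tail)
    return tokens


print(toy_tokenize("Let's go to the park."))
print(toy_tokenize("don't leave the U.K."))
```

Checking the exception table before stripping punctuation is what keeps U.K. intact while the final period of park. is still split off.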

Fig 3.3 An example of spaCy's tokenizing for the sentence "Let's go to N.Y!"

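We can add our own special cases to the tokenizer, as well as customize
spaCy's tokenizer class. If we construct our own tokenizer, adding it to the
pipeline is as simple as this:
nlp = spacy.load('en')
# my_tokenizer is a placeholder name for the custom tokenizer we constructed
nlp.tokenizer = my_tokenizer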

The details for creating our own tokenizer are in the Linguistic Features [21]
section of spaCy's documentation, though we will go into the details of
training and creating our own models in Chapter 5, POS-Tagging and its
Applications, Chapter 6, NER-Tagging and its Applications, and Chapter 7,
Dependency Parsing.
So, once we pass our sentence to the nlp pipeline, the first step was
tokenization – once this is done, we are now dealing with Doc objects, which
are comprised of tokens – which we described before as the basic parts of our
sentence. Once we have our tokens in the doc, each token is then worked on
by the other components of the pipeline.

Part-of-speech (POS) – tagging
The second component of the default pipeline we described before was
the tensorizer.
A tensorizer encodes the internal representation of the doc as an array of
floats. This is a necessary step because spaCy's models are neural network
models, and only speak tensors – every Doc object is expected to be
tensorized. We as users do not need to concern ourselves with this. After this
step, we start with our first annotation – part-of-speech tagging.
In the first chapter, we briefly mentioned POS-tagging as marking each token
of the sentence with its appropriate part of speech, such as noun, verb, and
so on. spaCy uses a statistical model to perform its POS-tagging. To get the
annotation from a token, we simply look up the pos_ attribute on the token.
Consider this example:
doc = nlp(u'John and I went to the park.')
for token in doc:
    print((token.text, token.pos_))

This will give us the following output in Python 2:
(u'John', u'PROPN')
(u'and', u'CCONJ')
(u'I', u'PRON')
(u'went', u'VERB')
(u'to', u'ADP')
(u'the', u'DET')
(u'park', u'NOUN')
(u'.', u'PUNCT')

In Python 3, the same output is printed without the u prefix:
('John', 'PROPN')
('and', 'CCONJ')
('I', 'PRON')
('went', 'VERB')
('to', 'ADP')
('the', 'DET')
('park', 'NOUN')
('.', 'PUNCT')

We'll go into more details about POS-tagging and train our own POS-tagger
in Chapter 5, POS-Tagging and its Applications. As of now, it is enough to
know what exactly POS-tagging is, and that we can use it to clean our text if
we wish to remove a particular part of speech, for example.
The next part of our pipeline is the parser, which performs dependency
parsing. While parsing refers to any kind of analysis of a string of symbols to
understand relationships between the symbols, dependency parsing refers to
the understanding of dependencies between these symbols. For example, in
the English language, this could be for describing the relations between
individual tokens, such as subject or object. spaCy has a rich API for
navigating parse trees. Since parsing isn't really used in preprocessing, we
will skip going into details, and save it for the coming chapters.

Fig 3.4 An example of dependency parsing

Named entity recognition
We now have the last part of our pipeline, where we perform named entity
recognition. A named entity is a real-world object that is assigned a name –
for example, a person, a country, a product, or organization. spaCy can
recognize various types of named entities in a document, by asking the
model for a prediction. We have to remember that since models are
statistical and depend on the examples they were trained on, they don't
always work perfectly and might need some tuning later, depending on your
use case – we have a chapter saved up just to better understand named entity
recognition and how to train our own models.
Named entities are available as the ents property of a Doc:
doc = nlp(u'Microsoft has offices all over Europe.')
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

This will give us the following output (shown here as printed by Python 2):
(u'Microsoft', 0, 9, u'ORG')
(u'Europe', 31, 37, u'LOC')
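The start_char and end_char values in the output above are character offsets into the original string, so they can be used to slice or rewrite the text directly. The following is a small sketch in plain Python; the helper name mask_entities is our own, and the spans are copied by hand from the output above rather than produced by a model.

```python
def mask_entities(text, spans):
    """Replace each (start_char, end_char, label) span in text with [label]."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append("[" + label + "]")
        cursor = end
    pieces.append(text[cursor:])           # whatever follows the last entity
    return "".join(pieces)


text = "Microsoft has offices all over Europe."
spans = [(0, 9, "ORG"), (31, 37, "LOC")]  # offsets taken from the output above

print(text[0:9], "|", text[31:37])
print(mask_entities(text, spans))
```

With spaCy loaded, the same spans could be built from doc.ents using ent.start_char, ent.end_char, and ent.label_.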

spaCy has the following built-in entity types:
PERSON: People, including fictional ones
NORP: Nationalities or religious or political groups
FAC: Buildings, airports, highways, bridges, and so on
ORG: Companies, agencies, institutions, and so on
GPE: Countries, cities, and states
LOC: Non GPE locations, mountain ranges, and bodies of water
PRODUCT: Objects, vehicles, foods, and so on (not services)
EVENT: Named hurricanes, battles, wars, sports events, and so on
WORK_OF_ART: Titles of books, songs, and so on
LAW: Named documents made into laws
LANGUAGE: Any named language


Rule-based matching
SpaCy's default pipeline also performs rule-based matching. This further
annotates tokens with more information and is valuable during
preprocessing. The following token attributes are available:
ORTH: The exact verbatim text of a token
LOWER, UPPER: The lowercase and uppercase form of the token
IS_ALPHA: Token text consists of alphanumeric chars
IS_ASCII: Token text consists of ASCII characters
IS_DIGIT: Token text consists of digits
IS_LOWER, IS_UPPER, IS_TITLE: Token text is in lowercase, uppercase, and title
IS_PUNCT, IS_SPACE, IS_STOP: Token is punctuation, whitespace, and a stop word
LIKE_NUM, LIKE_URL, LIKE_EMAIL: Token text resembles a number, URL, and email
POS, TAG: The token's simple and extended POS tag
DEP, LEMMA, SHAPE: The token's dependency label, lemma, and shape

As with the previous components of the pipeline, we can add our own rules.
For now, though, this much information is enough for us to use in
preprocessing.
Now that we know how spaCy processes our text when we pass it through
the pipeline, we can discuss common preprocessing techniques.

The wonderful thing about preprocessing text is that it almost feels intuitive
– we get rid of any information which we think won't be used in our final
output and keep what we feel is important. Here, our information is words –
and some words do not always provide useful insights. In the text mining
and natural language processing community, these words are called stop
words [22].
Stop words are words that are filtered out of our text before we run any text
mining or NLP algorithms on it. Again, we would like to draw attention to the
fact that this is not done in every case – if we intend to find stylistic
similarities or understand how writers use stop words, we would obviously
need to keep the stop words.
There is no universal stop words list for each language, and it largely depends
on the use case and what kind of results we expect to be seeing. Usually, it is
a list of the most common words in the language, such as of, the, want, to,
and have.
With spaCy, stop words are very easy to identify – each token has an is_stop
attribute, which lets us know if the word is a stop word or not. The list of all
the stop words for each language can be found in the spacy/lang [20] folder.
We can also add our own stop words to the list of stop words. For example:
my_stop_words = [u'say', u'be', u'said', u'says', u'saying', u'field']
for stopword in my_stop_words:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True

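We can also add words using this:
from spacy.lang.en.stop_words import STOP_WORDS
print(STOP_WORDS) # <- spaCy's default stop words
# STOP_WORDS is a Python set, so add() works; the word below is a placeholder
STOP_WORDS.add(u'your_additional_stop_word_here')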

When cleaning our text, we can simply choose not to add stop words to our
cleaned text.
You might have noticed in the preceding example how the words say, saying,
and says all pretty much provide the same information to us – grammatical
differences aside, it won't hurt our results to only see one representation of
these words.
There are two popular techniques to achieve this, stemming and
lemmatization. Stemming usually involves chopping off the end of the word,
following some basic rules. For example, the words say, saying, and says would
all become say. Stemming is contextless and does not rely on part of speech,
for example, to make its decisions. Lemmatization, on the other hand,
conducts morphological analysis to find the root word.
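To make the contrast concrete, here is a deliberately crude sketch of both ideas in plain Python. The suffix rules in crude_stem and the entries in LEMMA_TABLE are invented for illustration; real stemmers use far richer rule sets, and real lemmatizers (including spaCy's) rely on morphological analysis and part-of-speech information rather than a four-entry lookup table.

```python
def crude_stem(word):
    """Chop off a known suffix, without looking at context or part of speech."""
    for suffix in ("ing", "s"):
        # Only strip when a reasonably long stem would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for real morphological analysis.
LEMMA_TABLE = {"said": "say", "says": "say", "saying": "say", "went": "go"}


def lookup_lemma(word):
    """Return the root word if we know it, otherwise the word itself."""
    return LEMMA_TABLE.get(word, word)


for word in ["say", "says", "saying", "said"]:
    print(word, crude_stem(word), lookup_lemma(word))
```

Note how the stemmer leaves said untouched, since no suffix rule applies, while the lemma lookup maps it to say. Irregular forms like this are where lemmatization tends to help.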
The Stanford NLP book [23] page on this provides a nice explanation
describing both and their differences. As far as we are concerned, we need
not worry where we get our root words, just that we are getting them. In
spaCy, the lemmatized form of a word is accessed with the .lemma_ attribute.
Now, with what we know, we can do some basic preprocessing. Let's clean up
this sentence: the horse galloped down the field and past the 2 rivers. We would like to
get rid of stop words and numbers, and convert our string into a list so that
we can use it later.
doc = nlp(u'the horse galloped down the field and past the 2 rivers.')
sentence = []
for w in doc:
    # if it's not a stop word, punctuation mark, or number, add it to our sentence!
    if w.text != '\n' and not w.is_stop and not w.is_punct and not w.like_num:
        # we add the lemmatized version of the word
        sentence.append(w.lemma_)
print(sentence)

By using the .is_stop, .is_punct, and .like_num attributes, we could remove the
parts of the sentence we did not need. Make sure to note that we appended to
the sentence the lemmatized form of the word, which we accessed through
.lemma_.
This is what our output will look like after the preprocessing:
[u'horse', u'gallop', u'past', u'river']

We can further remove or keep words based on our use case. In our
example, numbers are deemed unimportant, but in some cases they might
matter. Similarly, we might want to remove all verbs from a sentence – which
we can do by simply checking the POS tag of each token.
Note that we had added field to our stop words earlier – it's because of this
that field does not appear in our final sentence.
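The filtering logic discussed in this section can be wrapped in a small reusable helper. The sketch below relies only on the token attributes used in this chapter (is_stop, is_punct, like_num, lemma_, and pos_). So that it runs without a language model, it is exercised on hand-built stand-in tokens; the fake_token helper and the attribute values given to each stand-in are our own assumptions, not model output.

```python
from types import SimpleNamespace


def clean_tokens(tokens, drop_pos=()):
    """Keep lemmas of tokens that are not stop words, punctuation, or numbers.

    Tokens whose pos_ is listed in drop_pos are removed as well.
    """
    kept = []
    for tok in tokens:
        if tok.is_stop or tok.is_punct or tok.like_num:
            continue
        if tok.pos_ in drop_pos:
            continue
        kept.append(tok.lemma_)
    return kept


def fake_token(lemma, pos, is_stop=False, is_punct=False, like_num=False):
    """Hypothetical stand-in exposing the same attribute names as a spaCy token."""
    return SimpleNamespace(lemma_=lemma, pos_=pos, is_stop=is_stop,
                           is_punct=is_punct, like_num=like_num)


# Hand-annotated stand-ins for: "the horse galloped past the 2 rivers."
tokens = [
    fake_token("the", "DET", is_stop=True),
    fake_token("horse", "NOUN"),
    fake_token("gallop", "VERB"),
    fake_token("past", "ADP"),
    fake_token("the", "DET", is_stop=True),
    fake_token("2", "NUM", like_num=True),
    fake_token("river", "NOUN"),
    fake_token(".", "PUNCT", is_punct=True),
]

print(clean_tokens(tokens))
print(clean_tokens(tokens, drop_pos={"VERB"}))
```

With spaCy loaded, the same function should accept a Doc directly, for example clean_tokens(nlp(text), drop_pos={'VERB'}), since iterating over a Doc yields tokens with these attributes.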
spaCy's pipeline annotates text in such a way that we can very easily use that
information to process our text. The handy thing is that we can further use
that information later on in our text processing, and not just in
preprocessing. It makes sense to start any of our NLP tasks by running our
text through a spaCy pipeline, custom or otherwise, just for the large amount
of information and annotation we will get, in about five lines of code.

spaCy offers us an easy way to annotate our text data, and with the language
model, we annotate our text with a lot of information – not just tokens and
whether each one is a stop word or not, but also the part of speech, named
entity tag, and so on. We can also train these annotating models on our own,
giving a lot of power to the language model and processing pipeline!
Downloading the models and using virtual environments are also an important
part of this process. We will now move on to representing our cleaned data in
a way that machines can understand – as vectors – and to the Python libraries
we need for this.

[1] spaCy:

[2] NLTK:

[3] Dead Code should be buried:

[4] Pattern:

[5] Wrapper Function:

[6] Brown Corpus:

[7] NLTK Corpus:

[8] CPython:

[9] PyPi:

[10] Conda:

[11] Virtual Environments:

[12] Can't Pip Install:

[13] Failed building wheel for spacy:

[14] spaCy Facts & Figures:

[15] spaCy Language Models:

[16] spaCy Models overview:

[17] spaCy Models releases:

[18] Requirements File:

[19] spaCy Doc object:

[20] spaCy Language Directory:

[21] spaCy linguistic features:

[22] Stop Words:

[23] Lemmatization and Stemming:

Gensim – Vectorizing Text and
Transformations and n-grams
Introducing Gensim
Vectors and why we need them
Vector transformations in Gensim
n-grams and some more preprocessing

Introducing Gensim
So far, we haven't spoken much about finding hidden information - more
about how to get our textual data in shape. We will be taking a brief
departure from spaCy to discuss vector spaces and the open source Python
package Gensim - this is because some of these concepts will be useful in the
upcoming chapters and we would like to lay the foundation before moving
on. However, we'll only be touching the surface of Gensim's capabilities. This
chapter will introduce you to the data structures largely used in text analysis
involving machine learning techniques - vectors [1].
This means that we are still in the domain of preprocessing and getting our
data ready for further machine learning analysis. It may seem like overkill,
focusing so much on just setting up our text/data, but like we've said before -
garbage in, garbage out. While the previous chapter mostly involved text
cleaning, we will be discussing converting our textual representations to
numerical representations in this chapter, in particular, moving from strings
to vectors.
When we talk about representations and transformations in this chapter, we
will be exploring different ways of representing our strings as
vectors, such as bag-of-words, TF-IDF (term frequency-inverse
document frequency), LSI (latent semantic indexing), and the more
recently popular word2vec. We will explain some of these methods shortly, in
the Vectors and why we need them section, and the rest in Chapter 8, Topic
Models (Topic Modelling with Gensim) and Chapter 12, Word2Vec, Doc2Vec and
Gensim; Gensim includes methods to do all of the above. The
transformed vectors can be plugged into scikit-learn machine learning
methods just as easily. Gensim started off as a modest project by Radim
Rehurek and was largely the subject of his Ph.D. thesis [17], Scalability of
Semantic Analysis in Natural Language Processing [2]. It included novel
implementations of Latent Dirichlet allocation [3] (LDA) and Latent
Semantic Analysis [4] among its primary algorithms, as well as TF-IDF and
Random projection [5] implementations. It has since grown to be one of
the largest NLP/Information Retrieval Python libraries, and is both
memory-efficient and scalable, as opposed to the previous largely academic
code available for semantic modelling (for example, the Stanford Topic
Modelling Toolkit [6]).

Gensim manages to be scalable because it uses Python's in-built generators
and iterators for streamed data-processing, so the data-set is never actually
completely loaded in the RAM. Most IR algorithms involve matrix
decompositions - which involve matrix multiplications. This is performed by
numpy, which is further built on FORTRAN/C, which is highly optimized for
mathematical operations. Since all the heavy lifting is passed on to these
low-level BLAS libraries, Gensim offers the ease-of-use of Python with the
power of C.
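The streaming idea can be illustrated with a few lines of standard-library Python. This is a sketch of the general pattern rather than Gensim's own code: the class name StreamedCorpus and the one-document-per-line file layout are assumptions made for the example.

```python
import os
import tempfile


class StreamedCorpus(object):
    """Yield one tokenized document at a time instead of loading the whole file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is re-opened on every pass, so the corpus can be iterated
        # several times while only one line is held in memory at a time.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()


# Build a tiny demo file (assumed layout: one document per line).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("The dog sat by the mat\nThe cat loves the dog\n")

corpus = StreamedCorpus(tmp.name)
first_pass = list(corpus)
second_pass = list(corpus)
os.remove(tmp.name)

print(first_pass)
```

Defining __iter__ as a generator, rather than returning a prebuilt list, is what keeps memory use independent of corpus size.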
The primary features of Gensim are its memory-independent nature,
multicore implementations of latent semantic analysis, latent Dirichlet
allocation, random projections, hierarchical Dirichlet process (HDP),
and word2vec deep learning, as well as the ability to use LSA and LDA on
a cluster of computers. It also seamlessly plugs into the Python scientific
computing ecosystem and can be extended with other vector space
algorithms. Gensim's directory of Jupyter notebooks [7] serves as an
important documentation source, with its tutorials covering most of what
Gensim has to offer. Jupyter notebooks are a useful way to run code on a live
server - the documentation page [8] is worth having a look at!
The tutorials page can help you with getting started with using Gensim, but
the coming sections will also describe how to get started with using Gensim,
and about how important a role vectors will play in the rest of our time
exploring machine learning and text processing.

Vectors and why we need them
We're now moving toward the machine learning part of text analysis - this
means that we will now start playing a little less with words and a little more
with numbers. Even when we used spaCy, the POS-tagging and NER-tagging,
for example, were done through statistical models - but the inner workings
were largely hidden from us - we passed over Unicode text and after some
magic, we had annotated text.
For Gensim however, we're expected to pass vectors as inputs to the IR
algorithms (such as LDA or LSI), largely because what's going on under the
hood is mathematical operations involving matrices. This means that we
have to represent what was previously a string as a vector - and these kinds
of representations or models are called Vector Space Models [9].
From a mathematical perspective, a vector is a geometric object that has
magnitude and direction. We don't need to pay as much attention to this, and
rather think of vectors as a way of projecting words onto a mathematical
space while preserving the information provided by these words.
Machine learning algorithms use these vectors to make predictions. We can
understand machine learning as a suite of statistical algorithms and the
study of these algorithms. The purpose of these algorithms is to learn from
the provided data by decreasing the error of their predictions. As such, this is
a wide field - we will be explaining particular machine learning algorithms as
and when they come up.
Let's meanwhile discuss a couple of forms of these representations.

The bag-of-words model is arguably the most straightforward form of
representing a sentence as a vector. Let's start with an example:
S1:"The dog sat by the mat."
S2:"The cat loves the dog."

If we follow the same preprocessing steps we did in the Basic Preprocessing
with language models section, from Chapter 3, spaCy's Language Models, we
will end up with the following sentences:
S1:"dog sat mat."
S2:"cat love dog."

As Python lists, these will now look like this:
S1:['dog', 'sat', 'mat']
S2:['cat', 'love', 'dog']

If we want to represent this as a vector, we would need to first construct our
vocabulary, which would be the unique words found in the sentences. Our
vocabulary vector is now as follows:
Vocab = ['dog', 'sat', 'mat', 'love', 'cat']

This means that our representation of our sentences will also be vectors with
a length of 5 - we can also say that our vectors will have 5 dimensions. We
can also think of mapping of each word in our vocabulary to a number (or
index), in which case we can also refer to our vocabulary as a dictionary.
The bag-of-words model involves using word frequencies to construct our
vectors. What will our sentences now look like?
S1:[1, 1, 1, 0, 0]
S2:[1, 0, 0, 1, 1]

It's easy enough to understand - there is 1 occurrence of dog, the first word in
the vocabulary, and 0 occurrences of love in the first sentence, so the
appropriate indexes are given the value based on the word frequency. If the
first sentence has 2 occurrences of the word dog, it would be represented as:
S1: [2, 1, 1, 0, 0]
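The construction above takes only a few lines of plain Python. The sketch below reuses the vocabulary order from the example; the function name bow_vector is our own, and this is not how Gensim stores its bag-of-words vectors.

```python
from collections import Counter

# Vocabulary in the same order as the example above.
vocab = ['dog', 'sat', 'mat', 'love', 'cat']


def bow_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


s1 = ['dog', 'sat', 'mat']
s2 = ['cat', 'love', 'dog']

print(bow_vector(s1, vocab))
print(bow_vector(s2, vocab))
print(bow_vector(['dog', 'dog', 'sat', 'mat'], vocab))
```

Each position in the resulting list corresponds to one vocabulary word, which is why the vector length always equals the vocabulary size.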

This is just an example of the idea behind a bag-of-words representation - the
way Gensim approaches bag of words is slightly different, and we will see
this in the coming section. One important feature of the bag-of-words model
which we must remember is that it is an orderless document representation
- only the counts of the words matter. We can see that in our example above
as well, where by looking at the resulting sentence vectors we do not know
which words came first. This leads to a loss in spatial information, and by
extension, semantic information. However, in a lot of information retrieval
algorithms, the order of the words is not important, and just the occurrences
of the words are enough for us to start with.
An example where the bag-of-words model can be used is in spam filtering -
emails that are marked as spam are likely to contain spam-related words,
such as buy, money, and stock. By converting the text in emails into a
bag-of-words model, we can use Bayesian probability [10] to determine if it is
more likely for a mail to be in the spam folder or not. This works because, like
we discussed before, in this case the order of the words is not important -
just whether they exist in the mail or not.
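As a rough sketch of how word counts and Bayesian probability combine, the following implements a tiny multinomial naive Bayes classifier with add-one smoothing. The text does not name a specific method, so this choice is our assumption; the four training mails are made up for illustration, and the function names are our own.

```python
import math
from collections import Counter


def train(labelled_docs):
    """Collect per-label word counts and per-label document counts."""
    word_counts = {}
    doc_counts = Counter()
    for tokens, label in labelled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
    return word_counts, doc_counts


def predict(tokens, word_counts, doc_counts):
    """Return the label with the highest log posterior under naive Bayes."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        # log prior: fraction of training mails carrying this label
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        for tok in tokens:
            # add-one smoothing keeps unseen words from zeroing the product
            score += math.log((counts[tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training mails, already tokenized and cleaned.
training = [
    (['buy', 'stock', 'money'], 'spam'),
    (['money', 'buy', 'now'], 'spam'),
    (['meeting', 'tomorrow', 'morning'], 'ham'),
    (['lunch', 'tomorrow'], 'ham'),
]
word_counts, doc_counts = train(training)

print(predict(['buy', 'money'], word_counts, doc_counts))
print(predict(['meeting', 'tomorrow'], word_counts, doc_counts))
```

Only the word counts enter the score, so shuffling the words of a mail would not change the prediction, which is exactly the orderless property of bag of words.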

TF-IDF is short for term frequency-inverse document frequency. Largely
used in search engines to find relevant documents based on a query, it is a
rather intuitive approach to converting our sentences into vectors.
As the name suggests, TF-IDF tries to encode two different kinds of
information - term frequency and inverse document frequency. Term
frequency (TF) is the number of times a word appears in a document.
IDF helps us understand the importance of a word in a document. It is the
logarithmically scaled inverse fraction of the documents that contain the
word: we divide the total number of documents by the number of documents
containing the term, and then take the logarithm of that quotient. This gives
us a measure of how common or rare the word is among all documents.
In case the preceding explanation wasn't very clear, expressing them as
formulas will help!
TF(t) = (number of times term t appears in a document) / (total number of
terms in the document)
IDF(t) = log_e (total number of documents / number of documents with
term t in it)
TF-IDF is simply the product of these two factors - TF and IDF. Together it
encapsulates more information into the vector representation, instead of just
using the count of the words like in the bag-of-words vector representation.
TF-IDF makes rare words more prominent and down-weights common words such
as is, of, and that, which may appear a lot of times, but have little
importance.
For more information on how TF-IDF works, especially with the
mathematical nature of TF-IDF and solved examples, the Wikipedia page [11]
on TF-IDF is a good resource.
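The two formulas above can be sketched directly in plain Python. The toy corpus and the function names are assumptions made for illustration, and the idf function assumes the term occurs in at least one document.

```python
import math

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["the", "bank", "lost", "money"],
    ["the", "club", "won", "the", "match"],
    ["the", "bank", "opened"],
]

def tf(term, doc):
    """Occurrences of the term divided by the number of terms in the document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Natural log of (total documents / documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", docs[1], docs))   # "the" is in every document
print(tf_idf("club", docs[1], docs))  # "club" is in only one document
```

A word that appears in every document has an IDF of log(1) = 0, so its TF-IDF weight vanishes no matter how often it occurs.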

Other representations
It's possible to extend these representations - indeed, topic models, which we
will explore later, are one such example. Word vectors are also an interesting
representation of words, where we train a shallow neural network (a neural
network with 1 or 2 layers) to describe words as vectors, where each feature
is a semantic decoding of the word. We will be spending an entire chapter
discussing word vectors, in particular, Word2Vec. To get a taste of what word
vectors do, this blog post, The amazing power of word vectors [12], is a good
resource.

Vector transformations in Gensim
Now that we know what vector transformations are, let's get used to creating
them, and using them. We will be performing these transformations with
Gensim, but even scikit-learn can be used. We'll also have a look at
scikit-learn's approach later on.
Let's create our corpus now. We discussed earlier that a corpus is a collection
of documents. In our examples, each document would just be one sentence,
but this is obviously not the case in most real-world examples we will be
dealing with. We should also note that once we are done with preprocessing,
we get rid of all punctuation marks - as far as our vector representation is
concerned, each document is just one sentence.
Of course, before we start, be sure to install Gensim. Like spaCy, pip or conda
is the best way to do this based on your working environment.
from gensim import corpora
documents = [u"Football club Arsenal defeat local rivals this weekend.",
             u"Weekend football frenzy takes over London.",
             u"Bank open for takeover bids after losing millions.",
             u"London football clubs bid to move to Wembley stadium.",
             u"Arsenal bid 50 million pounds for striker Kane.",
             u"Financial troubles result in loss of millions for bank.",
             u"Western bank files for bankruptcy after financial losses.",
             u"London football club is taken over by oil millionaire from Russia.",
             u"Banking on finances not working for Russia."]

Just a note - we make sure that all the strings are Unicode strings so that we
can use spaCy for preprocessing.
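import spacy
# Newer spaCy releases need the full model name, for example "en_core_web_sm".
nlp = spacy.load("en")
texts = []
for document in documents:
    text = []
    doc = nlp(document)
    for w in doc:
        # keep the lemmas of tokens that are not stop words, punctuation, or numbers
        if not w.is_stop and not w.is_punct and not w.like_num:
            text.append(w.lemma_)
    texts.append(text)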

We performed very similar preprocessing when we introduced spaCy. What do
our documents look like now?
[[u'football', u'club', u'arsenal', u'defeat', u'local', u'rival', u'weekend'],
[u'weekend', u'football', u'frenzy', u'take', u'london'],
[u'bank', u'open', u'bid', u'lose', u'million'],
[u'london', u'football', u'club', u'bid', u'wembley', u'stadium'],
[u'arsenal', u'bid', u'pound', u'striker', u'kane'],
[u'financial', u'trouble', u'result', u'loss', u'million', u'bank'],
[u'western', u'bank', u'file', u'bankruptcy', u'financial', u'loss'],
[u'london', u'football', u'club', u'take', u'oil', u'millionaire', u'russia'],
[u'bank', u'finance', u'work', u'russia']]

Let's start by whipping up a bag-of-words representation for our mini-corpus.
Gensim allows us to do this very conveniently through its dictionary class.
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)
{u'pound': 17, u'financial': 22, u'kane': 18, u'arsenal': 3, u'oil': 27, u'london': 7,
u'result': 23, u'file': 25, u'open': 12, u'bankruptcy': 26, u'take': 9, u'stadium': 16,
u'wembley': 15, u'local': 4, u'defeat': 5, u'football': 2, u'finance': 31, u'club': 0,
u'bid': 10, u'million': 11, u'striker': 19, u'frenzy': 8, u'western': 24, u'trouble': 21,
u'weekend': 6, u'bank': 13, u'loss': 20, u'rival': 1, u'work': 30, u'millionaire': 29,
u'lose': 14, u'russia': 28}

There are 32 unique words in our corpus, all of which are represented in our
dictionary with each word being assigned an index value. When we refer to a
word's word_id henceforth, it means we are talking about the word's integer-id
mapping made by the dictionary.
We will be using the doc2bow method, which, as the name suggests, helps
convert our document to bag-of-words.
corpus = [dictionary.doc2bow(text) for text in texts]

If we print our corpus, we'll have our bag of words representation of the
documents we used.
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
[(2, 1), (6, 1), (7, 1), (8, 1), (9, 1)],
[(10, 1), (11, 1), (12, 1), (13, 1), (14, 1)],
[(0, 1), (2, 1), (7, 1), (10, 1), (15, 1), (16, 1)],
[(3, 1), (10, 1), (17, 1), (18, 1), (19, 1)],
[(11, 1), (13, 1), (20, 1), (21, 1), (22, 1), (23, 1)],
[(13, 1), (20, 1), (22, 1), (24, 1), (25, 1), (26, 1)],
[(0, 1), (2, 1), (7, 1), (9, 1), (27, 1), (28, 1), (29, 1)],
[(13, 1), (28, 1), (30, 1), (31, 1)]]

This is a list of lists, where each individual list represents a document's
bag-of-words representation. A reminder: you might see different numbers in
your list; this is because each time you create a dictionary, different
mappings may occur. Unlike the example we demonstrated earlier, where the
absence of a word was a 0, we use tuples that represent (word_id, word_count)
and leave out words that do not occur. We can easily verify this by checking
the original sentence, mapping each word to its integer ID, and reconstructing
our list. We can also notice that in this case each document has no more than
one occurrence of each word - in smaller corpora, this tends to happen.
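To see what this sparse representation amounts to, here is a small sketch in plain Python. The helper names build_token2id and to_bow are hypothetical, and the way ids are assigned here (order of first appearance) is an assumption that need not match Gensim's own numbering.

```python
from collections import Counter

def build_token2id(texts):
    """Give each new token the next free integer id, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(text, token2id):
    """Return sorted (word_id, word_count) pairs; absent words are simply left out."""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())

# Hypothetical two-document corpus.
texts = [["bank", "open", "bid"], ["bank", "loss", "bank"]]
token2id = build_token2id(texts)
print(token2id)
print([to_bow(text, token2id) for text in texts])
```

The second document contains bank twice, so its vector carries a count of 2 for that id, while words it does not contain take up no space at all.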
And voila! Our corpus is assembled, and we are ready to work machine
learning/information retrieval magic on it whenever we would like. But
before we sink our teeth into it... let's spend some more time with some
details regarding corpora.
We previously mentioned how Gensim is powerful because it uses streaming
corpora. But in this case, the entire list is loaded into RAM. This is not a
bother for us because it is a toy example, but in real-world cases, this
might cause problems. How do we get past this?
We can start by storing the corpus, once it is created, to disk. One way to do
this is as follows:
corpora.MmCorpus.serialize('/tmp/', corpus)

By storing the corpus to disk and then later loading from disk, we are being
far more memory efficient, because at most one vector resides in the RAM at
a time. The Gensim tutorial [13] on corpora and vector spaces covers a little
more than what we discussed so far and may be useful for some readers.
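The idea of streaming can be sketched without Gensim at all: an object whose iterator reads one document from disk, converts it, and yields it, so that only one vector is held in memory at a time. The class name, the file layout (one document per line), and the tiny vocabulary below are all assumptions made for illustration.

```python
import os
import tempfile

class StreamedCorpus:
    """Yield one bag-of-words vector at a time instead of keeping them all in a list."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:  # one document per line
                counts = {}
                for token in line.lower().split():
                    if token in self.token2id:
                        word_id = self.token2id[token]
                        counts[word_id] = counts.get(word_id, 0) + 1
                yield sorted(counts.items())

# Hypothetical two-document file and vocabulary, purely for illustration.
path = os.path.join(tempfile.mkdtemp(), "documents.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write("bank open bid\nbank loss bank\n")

corpus = StreamedCorpus(path, {"bank": 0, "open": 1, "bid": 2, "loss": 3})
for vector in corpus:
    print(vector)
```

Since the file is reopened on every pass, the corpus can be iterated as many times as a model needs without ever being loaded in full.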
Converting a bag of words representation into TF-IDF, for example, is also
made very easy with Gensim. We first choose the model/representation we
want from the Gensim models' directory.
from gensim import models
tfidf = models.TfidfModel(corpus)

This means that tfidf now represents a TF-IDF table trained on our corpus.
Note that in the case of TF-IDF, the training consists simply of going through
the supplied corpus once and computing document frequencies of all its
features. Training other models, such as Latent Semantic Analysis or Latent
Dirichlet Allocation, is much more involved and, consequently, takes much
more time. We will explore those transformations in the chapters on topic
modelling. It is also important to note that all such vector transformations
require the same input feature space - which means the same dictionary (and
of course, vocabulary).
So, what does a TF-IDF representation of our corpus look like? All we have to
do is this:
for document in tfidf[corpus]:
    print(document)

This gives us the following:
[(0, 0.24046829370585293), (1, 0.48093658741170586), (2, 0.17749938483254057), (3,
0.3292179861221232), (4, 0.48093658741170586), (5, 0.48093658741170586), (6,
[(2, 0.24212967666975266), (6, 0.4490913847888623), (7, 0.32802654645398593), (8,
0.6560530929079719), (9, 0.4490913847888623)]
[(10, 0.29592528218102643), (11, 0.4051424990000138), (12, 0.5918505643620529), (13,
0.2184344336379748), (14, 0.5918505643620529)]
[(0, 0.29431054749542984), (2, 0.21724253258131512), (7, 0.29431054749542984), (10,
0.29431054749542984), (15, 0.5886210949908597), (16, 0.5886210949908597)]
[(3, 0.354982288765831), (10, 0.25928712547209604), (17, 0.5185742509441921), (18,
0.5185742509441921), (19, 0.5185742509441921)]
[(11, 0.3637247180792822), (13, 0.19610384738673725), (20, 0.3637247180792822), (21,
0.5313455887718271), (22, 0.3637247180792822), (23, 0.5313455887718271)]
[(13, 0.18286519950508276), (20, 0.3391702611796705), (22, 0.3391702611796705), (24,
0.4954753228542582), (25, 0.4954753228542582), (26, 0.4954753228542582)]
[(0, 0.2645025265769199), (2, 0.1952400253294319), (7, 0.2645025265769199), (9,
0.3621225392416359), (27, 0.5290050531538398), (28, 0.3621225392416359), (29,
[(13, 0.22867660961662029), (28, 0.4241392327204109), (30, 0.6196018558242014), (31,

If you remember what we said about TF-IDF, you will be able to identify the
float next to each word_id - it is the product of the TF and IDF scores for that
particular word, instead of just the word count which was present before. The
higher the score, the more important the word in the document.
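The printed weights do not follow the natural-log formula given earlier exactly. They appear to be consistent with a base-2 logarithm for IDF followed by scaling each document vector to unit length, which we believe are the defaults of Gensim's TfidfModel; treat that as an assumption. The sketch below recomputes the second document's weights in plain Python under that assumption; the function names are hypothetical.

```python
import math

# The preprocessed toy corpus shown earlier in this section.
texts = [
    ["football", "club", "arsenal", "defeat", "local", "rival", "weekend"],
    ["weekend", "football", "frenzy", "take", "london"],
    ["bank", "open", "bid", "lose", "million"],
    ["london", "football", "club", "bid", "wembley", "stadium"],
    ["arsenal", "bid", "pound", "striker", "kane"],
    ["financial", "trouble", "result", "loss", "million", "bank"],
    ["western", "bank", "file", "bankruptcy", "financial", "loss"],
    ["london", "football", "club", "take", "oil", "millionaire", "russia"],
    ["bank", "finance", "work", "russia"],
]

def document_frequency(term):
    """Number of documents that contain the term."""
    return sum(1 for text in texts if term in text)

def weights(text):
    """Raw count times log2(N / df), then scaled so the vector has unit length."""
    raw = {
        term: text.count(term) * math.log2(len(texts) / document_frequency(term))
        for term in set(text)
    }
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

second = weights(texts[1])
for term in texts[1]:
    print(term, round(second[term], 4))
```

The word frenzy, which occurs in only one document, receives the largest weight in that document, while football, which occurs in four, receives the smallest.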
We can use this representation as input for our ML algorithms as well, and
we can also further chain or link these vector representations by performing
another transformation on them.
Let's move on to a small, but interesting (and useful!) part of text analysis -
bi-grams and n-grams.

n-grams and some more
When working with textual data, context can be very important. As we
discussed before, we sometimes lose this context in vector representations,
knowing only the count of each word. N-grams, and in particular, bi-grams
are going to help us solve this problem, at least to some extent.
An n-gram is a contiguous sequence of n items in the text. In our case, we
will be dealing with words being the item, but depending on the use case, it
could be even letters, syllables, or sometimes in the case of speech,
phonemes. A bi-gram is when n = 2.
One way to find bi-grams in a text is to calculate the conditional
probability of a token given the preceding token. Bi-grams can also be formed
by simply pairing every two adjacent words, but it is more useful for us to
keep only the pairs that are likely to appear together. Such a bi-gram is
called a collocation. For example, New York or Machine Learning could be two
such pairs. In other words, based on the training data (usually the corpus),
we identify that the word York follows the word New with high probability, and
that it is worth considering New York as one entity. We must be careful to get
rid of stop words before running a bi-gram model on our corpus, as meaningless
bi-grams could otherwise be formed. The Gensim bi-gram model is basically an
implementation of collocation identification.
We can clearly see how this is useful - we can now pick up phrases from our
corpus, and New York certainly provides us with more information than the
words New and York separately. This means it can be added to our
preprocessing pipeline.
Gensim approaches bigrams by simply combining the two high probability
tokens with an underscore. The tokens new and york will now become
new_york instead. Similar to the TF-IDF model, bigrams can be created
using another Gensim model - Phrases.

import gensim
bigram = gensim.models.Phrases(texts)

We now have a trained bi-gram model for our corpus. We can perform our
transformation on the text the same way we used TF-IDF. We recreate our
corpus like this:
texts = [bigram[line] for line in texts]

Each line will now have all possible bi-grams created. It should be noted that
in our toy example, we will have no bi-grams or meaningless bi-grams being
created. To see an example where bi-grams provide useful information,
the Jupyter notebook [14] written by me on topic modeling serves well.
Since by creating new phrases we add words to our dictionary, this step must
be done before we create our dictionary. We would have to run this:
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
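To make the idea of a collocation more tangible, here is a simplified sketch that accepts a pair when the second word follows the first with high conditional probability, and then joins accepted pairs with an underscore. This is not Gensim's scoring function; the thresholds, function names, and toy sentences are assumptions made for illustration.

```python
from collections import Counter

def train_bigrams(texts, min_probability=0.6, min_count=2):
    """Keep pairs (a, b) seen at least min_count times with P(b | a) >= min_probability."""
    unigrams, pairs = Counter(), Counter()
    for text in texts:
        unigrams.update(text)
        pairs.update(zip(text, text[1:]))
    return {
        (a, b)
        for (a, b), count in pairs.items()
        if count >= min_count and count / unigrams[a] >= min_probability
    }

def merge_bigrams(text, bigrams):
    """Join each accepted pair with an underscore, scanning left to right."""
    merged, i = [], 0
    while i < len(text):
        if i + 1 < len(text) and (text[i], text[i + 1]) in bigrams:
            merged.append(text[i] + "_" + text[i + 1])
            i += 2
        else:
            merged.append(text[i])
            i += 1
    return merged

# Hypothetical tokenized sentences, chosen so that one collocation stands out.
texts = [
    ["new", "york", "bank", "open"],
    ["bank", "new", "york", "close"],
    ["new", "stadium", "open"],
]
bigrams = train_bigrams(texts)
print(bigrams)
print([merge_bigrams(text, bigrams) for text in texts])
```

Only new york is frequent and predictable enough to be merged here; the third sentence keeps new as a separate token because stadium follows it only once.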

After we are done creating our bi-grams, we can create tri-grams, and other
n-grams by simply running the phrases model multiple times on our corpus.
Bi-grams still remain the most used n-gram model, though it is worth one's
time to glance over the other uses and kinds of n-gram implementations.
Again, the Wikipedia page [15] serves as a good introductory resource.
This brings us to the end of the preprocessing techniques covered in this
book. It must be noted however that there is no one perfect preprocessing
pipeline or set of rules - it depends largely on our use-cases, the kind of data
we are working with, and what sort of information we wish to preserve (or
discard).
For example, one popular preprocessing technique involves removing both
high-frequency and low-frequency words. We can do this in Gensim with
the dictionary module. Let's say we would like to get rid of words that occur
in fewer than 20 documents, or in more than 50% of the documents. We would
add the following:
dictionary.filter_extremes(no_below=20, no_above=0.5)

We can also remove most frequent tokens or prune out certain token ids.
You can refer to the docs [16] to see the full extent of the preprocessing tools
the dictionary class can provide us.
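The filtering rule itself is easy to sketch in plain Python: count in how many documents each token occurs, then keep only tokens inside the chosen band. The function name and the mini-corpus are hypothetical, and the thresholds are scaled down to fit four documents.

```python
from collections import Counter

def filter_by_document_frequency(texts, no_below, no_above):
    """Keep tokens found in at least no_below documents and in at most
    no_above (a fraction between 0 and 1) of all documents."""
    document_frequency = Counter()
    for text in texts:
        document_frequency.update(set(text))  # count each token once per document
    keep = {
        token
        for token, df in document_frequency.items()
        if df >= no_below and df / len(texts) <= no_above
    }
    return [[token for token in text if token in keep] for text in texts]

# Hypothetical mini-corpus.
texts = [
    ["bank", "loss", "million"],
    ["bank", "bid", "million"],
    ["bank", "club", "bid"],
    ["bank", "club", "stadium"],
]
print(filter_by_document_frequency(texts, no_below=2, no_above=0.5))
```

Here bank is dropped for being in every document, while loss and stadium are dropped for appearing in only one.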
More often than not, it's after multiple iterations of preprocessing and
running our algorithms that we figure out the correct preprocessing
techniques we wish to use. What is important for us is to know what kind of
tools are available to do this, and what the reason behind doing all of this is.
We are now equipped with everything we need for Gensim and scikit-learn's
algorithms to get working.

We've seen in this chapter why it makes sense to change our representation
of text from words to numbers, and why this is the only language a computer
understands. There are different ways computers can interpret words, and
TF-IDF and bag of words are two such vector representations. Gensim is a
Python package that offers us ways to generate such vector representations,
which are later used as inputs into various machine learning and information
retrieval algorithms.
There are further preprocessing techniques such as creating n-grams,
collocations and removing low-frequency words, which can help us arrive at
better results. The concepts of vectors form a basis in natural language
processing and we can now get back to using spaCy's pipelines; indeed, Chapter
5, POS-Tagging and Its Applications, Chapter 6, NER-Tagging and Its
Applications, and Chapter 7, Dependency Parsing, all showcase the power of
spaCy, and we will start with POS-tagging algorithms using spaCy.

[1] Vectors:

[2] Scalability of Semantic Analysis in Natural Language Processing:

[3] Latent Dirichlet allocation:

[4] Latent semantic indexing:

[5] Random Projection:

[6] Stanford TMT:

[7] Gensim notebooks:

[8] Jupyter Notebooks:

[9] Vector Space Models:

[10] Bayesian Probability:

[11] TF-IDF:

[12] The Amazing power of word vectors:

[13] Corpora and Vector Spaces:

[14] Bi-Gram example notebook:

[15] N-grams:

[16] Gensim dictionary:

[17] Scalability of Semantic Analysis in Natural Language Processing:

POS-Tagging and Its Applications
Chapter 1, What is Text Analysis, and Chapter 2, Python Tips for Text Analysis,
introduced text analysis and Python, and Chapter 3, SpaCy's Language Models,
and Chapter 4, Gensim - Vectorizing Text and Transformations and n-grams,
helped us set up our code for more advanced text analysis. This chapter will
discuss the first of such advanced techniques – part of speech tagging,
popularly called POS-tagging. We will study what parts of speech exist, how
to identify them in our documents, and what possible uses these POS-tags
have.

What is POS-tagging?
spaCy for POS-tagging
Training your POS-tagger
POS-tagging examples

What is POS-tagging?
The obvious first step in understanding POS-tagging is to expand the
acronym – Part-Of-Speech tagging. That makes things a lot easier,
doesn't it? As the name suggests, it is the process of tagging words in a
textual input with their appropriate part of speech. We've already discussed
this before briefly, particularly when dealing with spaCy and its language
models. So, while we know that POS-tagging refers to the action of tagging
words with their POS, we haven't talked very much about what exactly a part
of speech in natural language (and in particular, English) is, and why it might
be relevant to us in the realm of text analysis.
Traditionally, a part of speech is a category of words which have similar
grammatical properties or usage. We will be focusing our efforts on the
English language (as we have been and will continue to do throughout this
book), but generally, these categories can be extended to most, if not all,
languages. The commonly listed categories in English are these:
Noun - The name of a person, place, thing, or idea
Verb - The action or being
Adjective - This modifies or describes a noun or a pronoun
Adverb - This modifies or describes a verb, adjective, or another adverb
Pronoun - The word to be used in place of a noun
Preposition - The word placed before a noun or pronoun to form a
phrase modifying another word in the sentence
Conjunction - This joins words, phrases, or clauses
Interjection - A word used to express emotion
There are also various subcategories into which a word can be put, and there
is no official list of all the parts of speech that exist. In fact, for the purpose
of text analysis or computational linguistics, we will be concerned with all
the possible divisions that a particular tagger can tag a word as! These can
range from the common categories to much more detailed ones, as we will see
in spaCy.
Since the purpose of this book is not to explain the concepts of linguistics, we
will not be describing the various parts of speech in detail, and the reader is
encouraged to have a look through what each POS category means. We
expect the reader to be comfortable with basic POS categories – this will
come in handy later!
The following links will be handy to get more used to parts of speech:

1. The Eight Parts of Speech [1]

2. [2]
We mentioned earlier that we will be focusing on English and English POS,
but most available POS-taggers offer tagging functions for non-English
languages as well. It should also be noted that the principles we use to train
POS-taggers, as well as the different ways we can use this information, tend
to remain the same, and one can carry over the lessons we learn here.
What usually remains common between all natural languages are nouns and
verbs, but as we move beyond this it becomes more and more difficult to pin
down different word categories. For example, some languages don't
differentiate between adjectives and adverbs, while Japanese has three
different classes just for adjectives.
Even within the English language, POS-tagging isn't always a straightforward
task and words have different POS-tags depending on the context. A simple
example is the word refuse: used as a verb, it means to decline an
offer, and used as a noun, it refers to something you throw
away, or rubbish. It is important for us to be able to identify which meaning
of the word is being referred to, and the POS-tag can help us here. As for
identifying the POS-tag in the first place, the context is crucial – it is not
possible for us to tag a word with its part of speech unless it is in a sentence
or phrase.
And how does one go about identifying the POS-tag for a word? Traditionally,
of course, this was done by hand, but from a computational perspective, we
have more than one way to do this. We mentioned before that we have to
concern ourselves with as many POS-tags as the tagger identifies – in some
cases, there are up to 100 different tags, but this isn't always very useful - the
spaCy POS-tagger that we will be largely using uses 19 different categories
for classifying tokens. In all realistic text analysis scenarios, we will not be
dealing with pure textual data – there are likely to be numbers, symbols,
and words that are not recognized, which is why a tagger needs categories
beyond the traditional parts of speech.
In spaCy, for a more detailed analysis, we also have the .tag_ attribute, which
adds more information to the previously given .pos_ attribute. The following
table gives the breakup of the categories spaCy has to annotate its words.

Fig 5.1 spaCy's list of POS as described in their annotation specifications [3]

Now that we have established what POS-tagging is, let's talk about the how.
Since all the original POS-tagging was done by hand after observation, this
leaves us with a lot of classified data to work with when building statistical
models. The Brown corpus is one example of a corpus that is very well
annotated with POS-tag data. The first few probabilistic models used to train
a POS-tagger would use Hidden Markov Models [4] to predict the tag.
Hidden Markov Models tend to be used whenever there are sequences
present – this turns out to be useful because we can use information about
the context of a word to predict what the POS-tag might be. For example,
once you've seen an article such as the, perhaps the next word is a noun 40%
of the time, an adjective 35%, and a number 25%. Knowing this, a program
can decide that refuse in the refuse is far more likely to be a noun than a
verb, solving the problem we discussed before.
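The intuition above can be sketched with nothing more than counts. The snippet below estimates how likely each tag is to follow the previous tag, and how likely each tag is to produce a given word, from a handful of hypothetical hand-tagged sentences. It only scores one word at a time; a full Hidden Markov Model tagger would search over whole tag sequences, which this sketch does not attempt.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sentences; a real tagger would use a corpus such as Brown.
tagged_sentences = [
    [("the", "DET"), ("refuse", "NOUN"), ("smells", "VERB")],
    [("they", "PRON"), ("refuse", "VERB"), ("the", "DET"), ("offer", "NOUN")],
    [("the", "DET"), ("old", "ADJ"), ("park", "NOUN")],
    [("we", "PRON"), ("refuse", "VERB")],
]

transitions = defaultdict(Counter)  # previous tag -> counts of the tag that follows
emissions = defaultdict(Counter)    # tag -> counts of the words it produced
for sentence in tagged_sentences:
    previous = "<START>"
    for word, tag in sentence:
        transitions[previous][tag] += 1
        emissions[tag][word] += 1
        previous = tag

def score(word, tag, previous_tag):
    """P(tag | previous tag) * P(word | tag), both estimated from raw counts."""
    transition = transitions[previous_tag][tag] / sum(transitions[previous_tag].values())
    emission = emissions[tag][word] / sum(emissions[tag].values())
    return transition * emission

def best_tag(word, previous_tag):
    """Pick the tag with the highest score; previous_tag must have been seen in training."""
    return max(emissions, key=lambda tag: score(word, tag, previous_tag))

print(best_tag("refuse", "DET"))
print(best_tag("refuse", "PRON"))
```

With this data, refuse after a determiner comes out as a noun and refuse after a pronoun comes out as a verb, purely because of what was counted in the training sentences.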
Apart from statistical models, there are also rule-based POS-taggers, which
use predefined rules to perform the tagging or learn these rules from the
corpus. Of course, these methods do not throw away statistical methods, but
just rely on them less. One of the most popular of such methods is
described by Eric Brill in his 1992 paper titled, A Simple Rule-Based Part of
Speech Tagger [5].
There are other more naive methods that you can try out, just to get a feel
for the task we are attempting, such as using regular expressions
to guess the part of speech, or simply storing the most likely tag for a word
and tagging all future occurrences with the same tag. Part of speech tagging
has since moved on quite a bit though, and like most computational tasks
which are being completed with high levels of accuracy, it is statistical
learning or deep learning that is the way to go.
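The second naive method, storing the most likely tag for each word, takes only a few lines. The function names, the default tag for unknown words, and the training sentences below are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    """Remember, for every word, the tag it carried most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, lookup, default="NOUN"):
    """Tag known words from the lookup table; guess the default tag otherwise."""
    return [(word, lookup.get(word.lower(), default)) for word in words]

# Hypothetical hand-tagged training data.
training = [
    [("I", "PRON"), ("refuse", "VERB")],
    [("they", "PRON"), ("refuse", "VERB"), ("the", "DET"), ("offer", "NOUN")],
    [("the", "DET"), ("refuse", "NOUN"), ("center", "NOUN")],
]
lookup = train_most_frequent_tag(training)
print(tag(["the", "refuse", "center"], lookup))
```

Because refuse was seen more often as a verb, this baseline tags it as a verb even in the refuse center, which shows why context-aware models are needed.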
State-of-the-art results have been reached with neural networks on multiple
datasets – ACL web maintains a list of these on their website.
It is possible to get close to these kinds of results even with a simpler machine
learning model such as a perceptron classifier. Indeed, one of spaCy's very
first POS-taggers was an averaged perceptron, and their blog features an
article detailing the inner workings of their tagger, which also serves as a
tutorial on how to build it. A perceptron used for POS-tagging works by
learning how strongly various features, or pieces of information, point to
each tag – these can include the tag of the previous word or the last few
letters of the word. By positively rewarding correct classification and
punishing incorrect classification, this model learns weights which it uses to
predict the tag of a new word. Indeed, most supervised machine learning
algorithms function on similar principles, and these are the algorithms that
perform well in POS-tagging tasks.
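The reward-and-punish idea can be sketched as a tiny multi-class perceptron. This is a plain perceptron without weight averaging, not spaCy's tagger; the two features, the class and method names, and the training sentences are all assumptions made for illustration.

```python
from collections import defaultdict

def features(word, previous_tag):
    """Two illustrative features: the previous tag and the word's last two letters."""
    return ["prev_tag=" + previous_tag, "suffix=" + word[-2:].lower()]

class PerceptronTagger:
    def __init__(self, tags):
        self.tags = list(tags)
        self.weights = defaultdict(float)  # (feature, tag) -> weight

    def predict(self, feats):
        """Score every tag by summing its feature weights; ties go to the earlier tag."""
        scores = {t: sum(self.weights[(f, t)] for f in feats) for t in self.tags}
        return max(self.tags, key=lambda t: scores[t])

    def update(self, feats, truth, guess):
        """Reward the correct tag and penalise the wrong guess; do nothing if right."""
        if truth == guess:
            return
        for f in feats:
            self.weights[(f, truth)] += 1.0
            self.weights[(f, guess)] -= 1.0

    def train(self, tagged_sentences, epochs=5):
        for _ in range(epochs):
            for sentence in tagged_sentences:
                previous = "<START>"
                for word, truth in sentence:
                    feats = features(word, previous)
                    self.update(feats, truth, self.predict(feats))
                    previous = truth  # use the gold tag as context while training

    def tag(self, words):
        previous, output = "<START>", []
        for word in words:
            previous = self.predict(features(word, previous))
            output.append((word, previous))
        return output

# Hypothetical hand-tagged training sentences.
training = [
    [("the", "DET"), ("refuse", "NOUN"), ("stinks", "VERB")],
    [("they", "PRON"), ("refuse", "VERB"), ("the", "DET"), ("offer", "NOUN")],
    [("we", "PRON"), ("refuse", "VERB")],
]
tagger = PerceptronTagger(["DET", "NOUN", "VERB", "PRON"])
tagger.train(training)
print(tagger.tag(["the", "refuse"]))
print(tagger.tag(["they", "refuse"]))
```

On such a small training set the learned weights are fragile, and some decisions rest on how ties are broken. Averaging the weights over all updates, as in the averaged perceptron mentioned above, is one common way to make the model more stable.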
Now that we have a better idea of the how, let's talk about the why. While
intuitively it may seem that knowing the part of speech of a word may be
useful, exactly what can we do with this information? POS tags have been
used historically in natural language processing for a variety of reasons and
purposes. Two interesting examples are speech-to-text conversion and
language translation, where a powerful POS-tagger can be used to
disambiguate homonyms. Consider an example where a human says I am
going to fish a fish, and wishes this sentence to be translated to another
language such as French or Spanish; it is vital to know whether fish here is a
noun or a verb – unlike English, it is highly likely that in the target language,
the word to describe the act of fishing is quite different from that of the
animal.
Similarly, POS-tagging is used for Dependency Parsing. As the name
suggests, dependency parsing is the process of identifying dependencies, or
relationships between words in a sentence or phrase. We will be spending an
entire chapter discussing these dependencies and how they work, but it is
enough to understand for now that identifying the part of speech of each
word is an important part of generating such a dependency tree. If we use the
nifty spaCy displacy module [6] on our example sentence – I am going to
fish a fish. – this is what we get.

Fig 5.2 The sentence "I am going to fish a fish" after being dependency parsed by spaCy

We can see that POS-tagging has a wealth of applications outside of finding
out a part of speech just for the sake of doing this. But even just this
information can give us some very fun results, some of which we will see in
the very next section.

POS-tagging in Python
It isn't a coincidence that every time we mentioned actually performing
POS-tagging, we linked to or mentioned spaCy – it is arguably one of the fastest
tokenizers, taggers, and parsers out there, and we will be using it for all our
POS-tagging needs.
But before we dive into spaCy, we will be briefly discussing its main rival
when it comes to POS-tagging in Python, which is NLTK. We have already
gone through the spaCy versus NLTK debate before, and we will stick to our
previous stance of using spaCy for all our real-world application purposes,
but it is still worth looking at what NLTK has to offer.
NLTK's fairly straightforward API for playing around or sandboxing is what
usually tends to make it an attractive choice for beginners. To get the
appropriate tags for a sentence, all we have to run is this:
import nltk
text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'),
('different', 'JJ')]

If we wish to use a particular tagger (NLTK offers many options), we simply
use that particular tagger's class. The train_sents object holds the tagged
training sentences you wish to use to train the bigram tagger.
bigram_tagger = nltk.BigramTagger(train_sents)

The following links provide more information about POS-tagging with NLTK
if the reader is interested in viewing the same:

1. Official Documentation of tag module [7]

2. Chapter 5 of NLTK book [8]

3. Training NLTK POS-tagger [9]
NLTK isn't the only Python option for POS-tagging – AI in Practice:
Identifying Parts of Speech in Python [10] takes us through all the different
options we can explore in Python. Of these, TextBlob is probably the only
other tagger worth having a look at. This tagger performs very similarly to the
one in spaCy, which makes sense as the algorithm is written by the spaCy
maintainer. This blog post goes into some more detail about using TextBlob [11]
to perform your POS-tagging.
This is all we'll discuss when it comes to NLTK and other Python options
though – because of its more academic, bloated approach when it comes to
POS-tagging, we will stick to spaCy.

POS-tagging with spaCy
POS-tagging with spaCy is like any other basic linguistic function with spaCy
– it is one of its core features loaded into its pipeline. If you load up your
spaCy module of choice and run the text through the pipeline, you will have
that text POS-tagged – as well as tokenized, NER-tagged, and ready to be
dependency parsed. We've already seen spaCy's power in this regard in our
chapter on our introduction to spaCy's language models.
Setting up our model involves the same steps we saw before.
import spacy
nlp = spacy.load('en')

Let's now decide on some sentences we would like to POS-tag.
sent_0 = nlp(u'Mathieu and I went to the park.')
sent_1 = nlp(u'If Clement was asked to take out the garbage, he would refuse.')
sent_2 = nlp(u'Baptiste was in charge of the refuse treatment center.')
sent_3 = nlp(u'Marie took out her rather suspicious and fishy cat to go fish for fish.')

Sentence 0 is straightforward and will illustrate how a basic sentence will be
tagged.
for token in sent_0:
    print(token.text, token.pos_, token.tag_)
(u'Mathieu', u'PROPN', u'NNP')
(u'and', u'CCONJ', u'CC')
(u'I', u'PRON', u'PRP')
(u'went', u'VERB', u'VBD')
(u'to', u'ADP', u'IN')
(u'the', u'DET', u'DT')
(u'park', u'NOUN', u'NN')
(u'.', u'PUNCT', u'.')

Let's look at a few of the tags here – Mathieu is a name, and it is correctly
marked as a proper noun, went is a verb, and park is a noun – all that we
would expect them to be. We previously talked about the word refuse, and how it
can be both a noun and a verb.
for token in sent_1:
    print(token.text, token.pos_, token.tag_)
(u'If', u'ADP', u'IN')
(u'Clement', u'PROPN', u'NNP')
(u'was', u'VERB', u'VBD')
(u'asked', u'VERB', u'VBN')
(u'to', u'PART', u'TO')
(u'take', u'VERB', u'VB')
(u'out', u'PART', u'RP')
(u'the', u'DET', u'DT')
(u'garbage', u'NOUN', u'NN')
(u',', u'PUNCT', u',')
(u'he', u'PRON', u'PRP')
(u'would', u'VERB', u'MD')
(u'refuse', u'VERB', u'VB')
(u'.', u'PUNCT', u'.')

Here, the word refuse is a verb, as we expect it to be. The word garbage is a noun
and is the object which our friend Clement is refusing to take out. Our next
sentence is also an example involving garbage, but here the word refuse is the
substance being treated in the plant.
for token in sent_2:
    print(token.text, token.pos_, token.tag_)
(u'Baptiste', u'PROPN', u'NNP')
(u'was', u'VERB', u'VBD')
(u'in', u'ADP', u'IN')
(u'charge', u'NOUN', u'NN')
(u'of', u'ADP', u'IN')
(u'the', u'DET', u'DT')
(u'refuse', u'NOUN', u'NN')
(u'treatment', u'NOUN', u'NN')
(u'center', u'NOUN', u'NN')
(u'.', u'PUNCT', u'.')

And voila! As we wanted to see, the word refuse is now correctly tagged as a
noun. With the context of it appearing as something Baptiste is in charge of, it
is appropriately tagged as a noun. In fact, the last three words are all nouns,
forming something which we call a noun phrase. We will deal with this term in
more detail in the chapter on dependency parsing.
Let's now have a look at our last sentence:
for token in sent_3:
    print(token.text, token.pos_, token.tag_)
(u'Marie', u'PROPN', u'NNP')
(u'took', u'VERB', u'VBD')
(u'out', u'PART', u'RP')
(u'her', u'ADJ', u'PRP$')
(u'rather', u'ADV', u'RB')
(u'suspicious', u'ADJ', u'JJ')
(u'and', u'CCONJ', u'CC')
(u'fishy', u'ADJ', u'JJ')
(u'cat', u'NOUN', u'NN')
(u'to', u'PART', u'TO')
(u'fish', u'VERB', u'VB')
(u'for', u'ADP', u'IN')
(u'fish', u'NOUN', u'NN')
(u'.', u'PUNCT', u'.')

The purpose of this sentence was to attempt to fool our tagger with different
variations of the word fish, but our tagger could easily tell the difference in
the appropriate context. Our model is a machine learning model which,
among other training features, uses the tags of the previous words and
upcoming words to decide the new tag – the word fishy was tagged as an
adjective partly because a noun comes right after, partly because a
conjunction came before, and also possibly because it ends with the letter y.
Most machine learning models take multiple features into account when
deciding a new label.
The other occurrences of the word fish were easily predicted, and we saw this
earlier on in the chapter as well. spaCy does a very smooth job with this –
and we should also remember we have a trove of other data about the tokens
in the sentences as well, and not just POS tags. We are killing many
metaphorical birds with the same stone!
As impressive as spaCy's pretrained models are, we need not limit ourselves
to them. spaCy offers us the functionality to train our models using their
machine learning model, and we will have a look at how this is done.

Training our own POS-taggers
The POS-tags that spaCy's models assign are statistical predictions; this is
unlike, say, whether or not a token is a stop word, which is
just a check against a list of words. Because it is a statistical prediction,
we can train a model to make better predictions, or predictions
that are more relevant to the dataset we intend to use it on. Here,
better isn't meant to be taken too literally: the current spaCy model already
reaches about 97% tagging accuracy.
Before we dive in deep into our training process, let's clarify a few commonly
used terms when it comes to machine learning, and machine learning for
Training - the process of teaching your machine learning model how to make
the right prediction. In text analysis, we do this by providing classified data to
the model. What does this mean? In the setting of POS-tagging, it would be a
list of words and their tagged POS. This labeled information is then used to
learn certain weights, which are further used to make the prediction. We
have used some of this terminology before when describing the perceptron-based tagger.
So how are these weights learned? We mentioned that we provide classified
data to the model – this data is referred to as our training data. Once we start
making predictions and making mistakes, the weights are adjusted
accordingly to minimize the mistakes. We calculate this feedback through
what is called the error gradient of the loss function. The error gradient is
larger if the performance is worse and gets smaller as the performance
improves - we can also understand it as the direction the weights need to
change in so that our predictions are better.
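The feedback loop described above can be sketched in a few lines. This is a toy illustration only: a perceptron-style update stands in for the gradient step, whereas spaCy's own model is a neural network, and the feature strings and function names here are made up for the example.

```python
from collections import defaultdict


def predict(weights, features, tags):
    """Score each tag as the sum of its feature weights and return the best one."""
    scores = {tag: sum(weights[(f, tag)] for f in features) for tag in tags}
    return max(tags, key=lambda tag: scores[tag])


def update(weights, features, gold, guess):
    """On a mistake, reward the weights of the gold tag and penalise the guess."""
    if gold == guess:
        return
    for f in features:
        weights[(f, gold)] += 1.0
        weights[(f, guess)] -= 1.0


tags = ["NOUN", "VERB"]
weights = defaultdict(float)
# Two tiny training examples: 'fish' after 'to' is a verb, after 'for' a noun.
train = [({"word=fish", "prev=to"}, "VERB"), ({"word=fish", "prev=for"}, "NOUN")]

for epoch in range(5):
    for features, gold in train:
        guess = predict(weights, features, tags)
        update(weights, features, gold, guess)

verb_guess = predict(weights, {"word=fish", "prev=to"}, tags)
noun_guess = predict(weights, {"word=fish", "prev=for"}, tags)
print(verb_guess, noun_guess)
```

The weights only change when the model makes a mistake, which mirrors the idea that the correction shrinks as performance improves.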

Fig 5.3 An explanation of the training process of spaCy's models as described on their training page [12]

The final jargon that we will come across is testing data. This is simply the
data that we will finally use after training to see how well our model is
performing. This is also a set of labeled or classified data, and by checking
the predicted tag by the model versus the actual tag of the word, we can
verify how well our model performs. While all of this is in the context of
POS-tags, it can be extended to other forms of prediction both inside and
outside of text analysis. The spaCy training page [12] is worth a look, and
goes further into how training models in spaCy works.
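Checking predicted tags against the actual tags on test data boils down to a simple accuracy calculation. The following is a minimal sketch with a hypothetical helper name and hand-written tag lists, not output from a real model:

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Return the fraction of positions where the predicted tag matches the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


gold = ["PROPN", "VERB", "PART", "ADJ", "NOUN"]
pred = ["PROPN", "VERB", "ADP", "ADJ", "NOUN"]   # one mistake, on the third token
accuracy = tagging_accuracy(gold, pred)
print(accuracy)
```

In practice the test sentences must be kept separate from the training sentences, otherwise the score says little about how the model handles unseen text.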
Now that we have theory out of the way – how do we actually train our
Getting data can be a pain at times, and for very large-scale projects, this can
be a bottleneck. In the training documentation page, there are examples
pertaining to large-scale training problems, and the prodigy tool [13] is
suggested for collecting the said data. Before v2.0, the GoldParse [14] object was
used for training purposes, but we will instead explore the simpler
method of using raw text and dictionaries of annotations. Again, we will not
go into the details of either prodigy or GoldParse, as they are
not the recommended methods, but they are still worth knowing about.
A simple training loop would look like this:
import spacy
TRAIN_DATA = [
    ("Facebook has been accused for leaking personal data of users.",
     {'entities': [(0, 8, "ORG")]}),
    ("Tinder uses sophisticated algorithms to find the perfect match.",
     {'entities': [(0, 6, "ORG")]})
]
nlp = spacy.blank('en')
optimizer = nlp.begin_training()
for i in range(20):
    # one pass over the training examples per iteration
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)

We can see how simple it is in theory: each training example pairs a sentence
with a dictionary naming the kind of annotation we intend to train (this can
be entities, heads, deps, tags, or cats). For entities, each annotation is a
tuple: the first and second values are the character indices that mark out the
span in the sentence, and the third value is the label we wish to give to the
text between those indices. In the example given, Facebook and Tinder are the
two entities that are to be marked as ORG, or organization.
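Counting character offsets by hand is error-prone, so a small helper can compute them. This is a sketch with a hypothetical function name; it only handles the first occurrence of the phrase in the sentence.

```python
def entity_annotation(text, phrase, label):
    """Return a (start_char, end_char, label) tuple for the first occurrence of phrase."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError("phrase not found in text: %r" % phrase)
    return (start, start + len(phrase), label)


text = "Tinder uses sophisticated algorithms to find the perfect match."
annotation = entity_annotation(text, "Tinder", "ORG")
# Build a training example in the (text, annotations) shape used above.
example = (text, {"entities": [annotation]})
print(example)
```

The end index is exclusive, so text[start:end] gives back exactly the annotated phrase.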

Training a POS-tagger isn't any different in theory, and we will be using the
example code [18] from spaCy's GitHub repository, which guides us through
how to do this.
import plac
import random
from pathlib import Path
import spacy
TAG_MAP = {
    'N': {'pos': 'NOUN'},
    'V': {'pos': 'VERB'},
    'J': {'pos': 'ADJ'}
}

We've set up our basic imports and have initialized the TAG_MAP dictionary. We
need to define a mapping from the data's part-of-speech tag names to
the Universal Part-of-Speech tag set [15], as spaCy includes an enum of
these tags. In this example, we only intend to train nouns, verbs, and
adjectives, so we include these in our tag map.
TRAIN_DATA = [
    ("I like green eggs", {'tags': ['N', 'V', 'J', 'N']}),
    ("Eat blue ham", {'tags': ['V', 'J', 'N']})
]

Of course, this amount of training data isn't going to give us a very
good model; as in most machine learning problems, more data results in a
better model, and the data here is only meant to give an idea of what the
training data should look like.
@plac.annotations(
    lang=("ISO Code of language to use", "option", "l", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int))

We use plac annotations to expose the language, the output directory, and the
number of training iterations as command-line options.
def main(lang='en', output_dir=None, n_iter=25):
    """Main function to create a new model, set up the pipeline and train
    the tagger. In order to train the tagger with a custom tag map,
    we're creating a new Language instance with a custom vocab.
    """
    nlp = spacy.blank(lang)
    tagger = nlp.create_pipe('tagger')

We have now created a new blank language model and created a tagger
component using the create_pipe method. Note that this works for built-ins
that are registered with spaCy.
    for tag, values in TAG_MAP.items():
        tagger.add_label(tag, values)
    # add the tagger to the pipeline once its labels are in place
    nlp.add_pipe(tagger)

We have now added the tags and placed the tagger in the pipeline with
add_pipe. The tags need to be added before you start training.
    optimizer = nlp.begin_training()
    for i in range(n_iter):
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, losses=losses)

We've seen this part of the training process before, in the simple training loop example.
    test_text = "I like blue eggs"
    doc = nlp(test_text)
    print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])

Let's do a quick sanity check where we test our model, before saving it to the
output directory.
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc = nlp2(test_text)
        print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])

if __name__ == '__main__':
    plac.call(main)

    # Expected output:
    # [
    #   ('I', 'N', 'NOUN'),
    #   ('like', 'V', 'VERB'),
    #   ('blue', 'J', 'ADJ'),
    #   ('eggs', 'N', 'NOUN')
    # ]

And there we have it; our own custom-trained POS-tagger! Of course, this
will not be the best POS-tagger, unless our corpus is a tiny corpus of our
opinions on different breakfast foods, which is not usually the case. For
all real-world scenarios, the training data will be far larger, and
assembling this data will be a huge part of our training task.
In training our spaCy model, the machine learning model used to train the
POS-tagger was abstracted away from us. We only used the
update() method to train our model, and we don't know much about the nature
of the model, apart from the fact that it works well and is a neural network.
For all practical cases this works more than well, but if we do wish to train our
own classifier, it isn't terribly hard to do so.
For more advanced users who are aware of how scikit-learn works, the blog
post [16] illustrates an example using NLTK to generate data to train a
classifier yourself using scikit-learn. We will be coming across scikit-learn
and how to train such models later on in the book, but curious readers can
check the link to have an idea of how to build it.
But for the definitive how to build your own POS-tagger tutorial, the spaCy
blog has an article describing the very same – A Good Part-of-Speech Tagger
in about 200 Lines of Python [17]. We've previously linked to the article
when describing the perceptron-based tagger, and this is also the tagger
that TextBlob uses.
And there we go! We are now sufficiently armed with the knowledge to train
our own spaCy POS-tagger, use it in our pipelines, and more importantly, are
aware of why it is a crucial part of text analysis. Our final, short section
illustrates some code snippets describing what we can do with knowledge of POS-tags.

POS-tagging code examples
The following code snippets illustrate some of the simple tasks we can do
with knowledge of POS-tags. These examples don't achieve too much in
terms of in-depth text analysis, but offer a quick glance at text manipulation
once we have our text processed.
def make_verb_upper(text, pos):
    return text.upper() if pos == "VERB" else text
doc = nlp(u'Tom ran swiftly and walked slowly')
text = ''.join(make_verb_upper(w.text_with_ws, w.pos_) for w in doc)

As the function name suggests, the preceding code changes all the verbs
of the sentence into uppercase. By doing a quick check of the POS-tag and
using the basic string function upper, we can achieve this in just a few lines!
Another popular task often done during analysis of text is to count the
occurrences of each kind of POS. This can be done quite quickly with the
following code snippet, where we find out the number of occurrences of
these words in the 1st Harry Potter book (which you would buy/download
and save as a text file):
import pandas as pd
harry_potter = open("HP1.txt").read()
hp = nlp(harry_potter)
hpSents = list(hp.sents)
hpSentenceLengths = [len(sent) for sent in hpSents]
# the longest sentence(s) in the book
[sent for sent in hpSents if len(sent) == max(hpSentenceLengths)]
# map POS ids to their names, then compute each tag's share of all tokens
tagDict = {w.pos: w.pos_ for w in hp}
hpPOS = pd.Series(hp.count_by(spacy.attrs.POS))/len(hp)
df = pd.DataFrame([hpPOS], index=['Harry Potter'])
df.columns = [tagDict[column] for column in df.columns]

When these values are plotted, the y-axis shows the proportion of tokens in the text carrying each POS-tag.
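The same proportions can be computed without pandas, which makes the arithmetic explicit. This is a minimal sketch on hand-tagged (word, POS) pairs; the tags below are assumed for illustration rather than taken from a model run.

```python
from collections import Counter


def pos_proportions(tagged_tokens):
    """Return the share of tokens carrying each POS-tag."""
    counts = Counter(pos for _word, pos in tagged_tokens)
    total = sum(counts.values())
    return {pos: count / total for pos, count in counts.items()}


# Hand-assigned tags for 'Tom ran swiftly and walked slowly' (an assumption).
tagged = [("Tom", "PROPN"), ("ran", "VERB"), ("swiftly", "ADV"),
          ("and", "CCONJ"), ("walked", "VERB"), ("slowly", "ADV")]
proportions = pos_proportions(tagged)
print(proportions)
```

With a spaCy Doc, the list of pairs could be built as [(w.text, w.pos_) for w in doc].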
What if we want to find the most commonly used pronouns? We only need
a couple of lines for this task:
from collections import Counter
hpPronouns = [w for w in hp if w.pos_ == 'PRON']
Counter([w.text for w in hpPronouns]).most_common(10)
[(u'he', 1208),
(u'I', 923),
(u'it', 898),
(u'you', 846),
(u'He', 549),
(u'they', 507),
(u'him', 493),
(u'them', 325),
(u'It', 287),
(u'me', 215)]

Knowledge of POS-tags can help us with a more in-depth text analysis. It is a
pillar of natural language processing and, after tokenizing the text, is usually
the first piece of analysis we carry out. spaCy offers us the best way to
perform POS-tagging, but we have had a look at the other options that
Python gives us as well. We will be sticking with spaCy for the rest of the
computational linguistics tasks we will be exploring, such as NER-tagging
and dependency parsing in the next chapters.

We've explored in this chapter how to use spaCy as part of our pipelines, and
in particular how to extract POS-tags. We discussed what POS-tags are, and
how they can be useful in different kinds of analysis. We then moved on to
training our own POS-tagger in spaCy and looked at different examples
where we use POS-tags. We will now explore other spaCy functionalities
such as NER-tagging and dependency parsing.

[1] 8 Parts of Speech:

[2] Parts of Speech overview:

[3] spaCy Annotation Specifications:

[4] Hidden Markov Model:

[5] A simple rule-based part of speech tagger:

[6] displaCy:

[7] nltk tag module:

[8] nltk chapter 5:

[9] Training NLTK tagger:

[10] AI in Practice: Identifying Parts of Speech in Python:

[11] Speech Tagging in TextBlob:

[12] spaCy training:

[13] prodigy:

[14] Gold Standard:

[15] Universal POS tags:

[16] Training your POS-tagger:

[17] A Good Part-of-Speech Tagger in about 200 Lines of Python:


NER-Tagging and Its Applications
We saw in the previous chapter how we can use one part of spaCy's language
pipeline, POS-tagging, which is a very powerful tool; we will now explore another
interesting usage, NER-tagging. We will discuss what exactly this is from
both a linguistic and a text analysis point of view, look at detailed examples
of its usage, and see how to train our own NER-tagger with spaCy. The following are
the topics we will cover in this chapter:
What is NER-tagging?
NER-tagging in Python
Training your NER-tagger
NER-tagging examples and visualization

What is NER-tagging?
We started our chapter on POS-tagging by expanding an acronym, and we'll
be doing the same here as well. NER stands for Named Entity
Recognition, and along with part of speech tagging, it is one of the pillars of
natural language processing.
Let's start by understanding what a named entity is. A named entity is a
real-world object with a proper name; examples are France, Donald Trump,
and Twitter. In these examples, France is a country and would be identified
as a GPE (Geopolitical Entity), Donald Trump as PER (a person), and
Twitter is a company, so identified as an ORG (Organization). In a study
titled A survey of named entity recognition and classification by David
Nadeau and Satoshi Sekine (New York University) [1], we get a more rigorous definition:
In the expression "Named Entity", the word "Named" aims to restrict the
possible set of entities to only those for which one or many rigid designators
stands for the referent. A designator is rigid when it designates the same
thing in every possible world.
What we have to understand from this is that it refers to a particular object
or person. For example, if we NER-tagged the sentence, Emmanuel Macron
is the current president of France., we would recognize Emmanuel Macron
as a person, and France as a country – but not president because it can refer
to many objects, such as presidents in different countries or even of an
How many different categories of named entities exist? Again, much like
parts of speech, this really depends on us. We can choose to be vague with
our entities, only recognizing a few, or have a really fine-grained set of
categories. We have to remember that most modern NER-taggers, similar to
POS-taggers, are statistically trained models where the number of classes is
equal to the number we want them to be, and depending on the problem,
this is likely to keep changing.
That said, there are a few categories we can expect to see more
often than not, like the ones we have discussed before: person (PER),
location (LOC), organization (ORG), and other miscellaneous entities (MISC).

Fig 6.1 spaCy's lightweight Wikipedia-trained tagger, only featuring basic entity types

You may be wondering why in the previous section we have used the
particular acronyms (PER, LOC, ORG, and MISC). This is because, like we explored in
the previous chapter on POS-tagging, we will largely be discussing spaCy
when we perform our NER-tagging. These acronyms also tend to pop up in
other taggers and schemes.
Apart from these rather obvious entity types, what other categories are we
expected to identify when carrying out this task? Temporal expressions and
numerical expressions often come up in this context. But if we adhere to our
definition of named entities or rigid designators very closely, this might
confuse things for us. For example, think of the year 2016. It designates a
particular year, and we can consider it a named entity. But what if we have
this sentence –
I enjoy going to the beach in the month of July.
Here, any month can be in the place of July, and without context, it would be
difficult to call it a rigid designator, or something that refers to a
particular month. But, on that note, consider this sentence:

I enjoyed going to the beach last July.
Suddenly, the word July refers to a particular month, is a rigid
designator, and should be considered a named entity. However, it might
be difficult to always recognize the context in such a case, and we might end
up with an incorrect label. In such situations we have to accept that it
isn't always wise to stick to strict definitions when performing textual
analysis tasks, and that being a little flexible can lead to a
better performing, more practical machine.

Fig 6.2 A full list of entity types that spaCy offers

BBN Technologies has released a list of entities and sub-entities used for
question answering, which can be found at Annotation sub-types [2].
spaCy features 18 different categories for its named entity classification, and
we will be using these for the rest of the chapter.
So once again, we've got the what out of the way. Why should we now be
interested in NER-tagging? As usual, simply identifying named entities in
text is not often the end result of our task, but it ends up being an important
building block for further tasks. Entity linking is a task where we first
recognize entities and then attempt to derive relationships between them. Consider
this sentence:
Rome is the capital of Italy.
Any NER-tagger would recognize Rome as a place (GPE), as well as Italy.
Being able to conclude that this Rome is a city linked to the
country Italy, and not Rome, an American R&B artist, is the kind of task
that we call Named Entity Disambiguation (NED).
This is also of great value in biomedical research, where scientists attempt to
identify genes and gene products. It can be used by businesses to help
identify which organizations are most important by analyzing and identifying
links between other organizations and revenue. Both of these examples are
domain-specific though; do not expect a tagger trained on medical journal
data to perform well on financial documents! This is one difference between
NER-tagging and POS-tagging: while POS-tags tend to be more or less consistent
across different kinds of literature, named entities can differ entirely based on the
context. As a result, even really well-trained models can be brittle, meaning
that they break easily when used in a different domain.
In analyzing literature and styles of writing, NER-tagging can again come in
handy, which we see in the study, Named Entity Recognition and Resolution
for Literary Studies, by Van Dale and co. [3]. The most popular usage of
NER-tagging in science still remains in the field of medicine and biology,
which is also evident from the existence of competitions [18] devoted solely to
extracting entities from medical documents.
There's enough motivation for us to build an NER-tagger; so, to the next
question, exactly how are we going to be doing this?
Much like most of the problems we have been attempting to solve so far (and
like the ones we will continue to solve throughout this book!), the answer is
statistical modeling. Similar to our POS-taggers, where we used annotated
datasets and extracted relevant features, we will be doing the same, but with
a dataset annotated with entities. It is worth noting that in this context,
when we talk about relevant features, we are discussing possible information
which we can use to predict the class of an unknown object that we wish to
identify. And in our context of NER-tagging, the POS-tag of the word, as well
as the POS-tags of the surrounding words, can be used as a predictor!
This is also why in our pipelines we perform POS-tagging before NER-tagging;
though, in the case of spaCy, where it is a pretrained statistical
model, this doesn't really matter. Other possible features that can be used to
predict whether a word is a named entity or not are the prefix or suffix of the
word (for example, -ion), whether it contains a special symbol, and whether it
is in uppercase or not.
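The features listed above translate almost directly into code. The following is a sketch of what such a feature function could look like; the function name and the particular choice of features are illustrative assumptions, not the features of any specific tagger.

```python
def ner_features(word, pos, prev_pos, next_pos):
    """Build an illustrative feature dictionary for deciding if a word is an entity."""
    return {
        "pos": pos,                      # POS-tag of the word itself
        "prev_pos": prev_pos,            # POS-tags of the surrounding words
        "next_pos": next_pos,
        "prefix_2": word[:2].lower(),
        "suffix_3": word[-3:].lower(),   # e.g. 'ion' for 'administration'
        "is_title": word.istitle(),      # capitalised like a proper name
        "is_upper": word.isupper(),      # all caps, as in acronyms
        "has_symbol": any(not ch.isalnum() for ch in word),
    }


name_features = ner_features("Rothschild", "PROPN", "ADP", "CCONJ")
acronym_features = ner_features("ENA", "PROPN", "PUNCT", "PUNCT")
hyphen_features = ner_features("Jean-Michel", "PROPN", "PROPN", "PROPN")
print(name_features)
```

Dictionaries like these can be passed to a classifier such as a CRF, which learns which combinations of features signal an entity.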
Once we have our features ready, there is a multitude of machine learning
algorithms which we can use to train our model – CRFs (Conditional
Random Fields, described in Conditional Random Fields: Probabilistic
Models for Segmenting and Labeling Sequence Data by John Lafferty and co.
[4]) are often a popular choice for NER-tagging, as are deep learning
methods similar to the one we discussed for POS-tagging.
Of course, much like most tasks that are carried out in natural language
processing, we can also attempt more rule-based approaches. Section 13.2.1
of the book Natural Language Processing: Semantic Aspects by
Epaminondas Kapetanios and co. [5] lists and references multiple such
approaches. One such example rule is this one:
Identify a match of a dictionary of salutations followed by a match of a
dictionary of last names and mark the entire region as a candidate person.
This technique requires us to have dictionaries where we store salutations,
as well as a dictionary of last names. This makes this method quite
undesirable; dictionaries can grow to be quite large and take space, can
become redundant if not updated regularly, and can make our approach
rather domain-specific, or more importantly, data-specific.
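The salutation rule quoted above can be sketched in a few lines, which also makes its weakness visible: it only finds names that are already in the dictionary. The two dictionaries here are tiny, made-up examples, and the function name is hypothetical.

```python
# Hypothetical, hand-made dictionaries; real ones would be far larger.
SALUTATIONS = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof."}
LAST_NAMES = {"Macron", "Trump", "Capdeville"}


def candidate_persons(tokens):
    """Mark every salutation followed by a known last name as a candidate person."""
    spans = []
    for i in range(len(tokens) - 1):
        if tokens[i] in SALUTATIONS and tokens[i + 1] in LAST_NAMES:
            # (start_token, end_token_exclusive, matched text)
            spans.append((i, i + 2, " ".join(tokens[i:i + 2])))
    return spans


tokens = "Yesterday Mr. Macron met Dr. Capdeville in Paris".split()
persons = candidate_persons(tokens)
print(persons)
```

A sentence mentioning "Dr. Desikan" would yield nothing until that name is added to LAST_NAMES, which is exactly the maintenance burden described above.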
It can be seen why we prefer to stick to statistical models throughout this
book, and it definitely helps that statistical methods tend to far outperform
rule-based methods.
We have a feel for the what, why, and how of NER-tagging, and it is time to
get cracking and learn to use these models in Python, as well as how to
train these NER-taggers ourselves.

NER-tagging in Python
Our approach with NER-tagging is going to mirror our approach to POS-tagging; after all, they are very similar tasks, and both of them can be
compared to the machine learning task of classification, where we assign an
unknown object to the class it has the highest probability of belonging to.
Another similarity in our approaches to this task is the fact that we will be
using spaCy to conduct our NER-tagging. Again, this does not mean that
spaCy is the only way to perform NER-tagging; there are two popular
alternatives, one is NLTK, and the other is the Stanford NER-tagger.
Before we start with our explanations, it is worth our while to briefly
understand the term, chunking. It is the process of breaking up your
sentence into constituent parts after the POS-tagging of the sentence is
completed. Examples of these constituent parts are noun phrases or verb
phrases. For example, consider the following sentence:
The little brown dog barked at the black cat.
In this case, we can identify the two noun phrases quite easily: the little
brown dog and the black cat. These chunks can come in handy when we're
doing NER-tagging, and we will explore these topics in a lot more detail in
Chapter 7, Dependency Parsing. In fact, chunking is also referred to as shallow parsing.
So why exactly is it relevant to us during NER-tagging? If you remember,
when we were citing examples of NER tags, we said that Donald Trump
would be tagged as a person; not just Donald, or Trump, but the entire
phrase. This knowledge of a group of words as a noun phrase can help make
decisions when we are tagging.
In most taggers we find online, we come across tagging schemes such as the IOB
tagging system. This is just a way to further identify or represent tokens
when we are NER-tagging. This is what IOB stands for:
B-{CHUNK_TYPE} - for the word at the Beginning of a chunk
I-{CHUNK_TYPE} - for words Inside the chunk
O - Outside any chunk
spaCy also uses such a system; it adds L (the Last token of an entity) and
U (a single-token, or Unit, entity), and since we deal with tokens
in spaCy, it is considered a BILOU system.
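To see how the scheme works, here is a small sketch that turns token-level entity spans into BILOU tags. The function is a hypothetical helper written for illustration, not part of spaCy's API, and it assumes the spans do not overlap.

```python
def bilou_tags(n_tokens, entities):
    """Convert (start_token, end_token_exclusive, label) spans into BILOU tags."""
    tags = ["O"] * n_tokens
    for start, end, label in entities:
        if end - start == 1:
            tags[start] = "U-" + label           # single-token entity
        else:
            tags[start] = "B-" + label           # first token
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + label           # tokens in the middle
            tags[end - 1] = "L-" + label         # last token
    return tags


tokens = ["Emmanuel", "Jean", "-", "Michel", "visited", "France", "."]
tags = bilou_tags(len(tokens), [(0, 4, "PERSON"), (5, 6, "GPE")])
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Marking the last token explicitly gives the model a clearer signal about where a multi-token entity ends than plain IOB does.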

Fig 6.3 spaCy's own BILOU system for its NER tags

Even though we will largely use spaCy, let's briefly discuss NLTK: NLTK uses
these chunks as part of a tree-like system to do its tagging, though it also has
a tagger which follows an IOB system. Here are some code snippets
explaining how to use both, and how to convert between them:

from nltk.chunk import conlltags2tree, tree2conlltags
from nltk import pos_tag
from nltk import word_tokenize
from nltk.chunk import ne_chunk

These are our imports; the conversion helpers work with the tag format of the
CoNLL (from the CoNLL conference) corpus in NLTK. Once a sentence has been
tokenized, POS-tagged, and chunked, the tree2conlltags and conlltags2tree
methods let us move between the tree-based representation and IOB tags.
sentence = "Clement and Mathieu are working at Apple."
ne_tree = ne_chunk(pos_tag(word_tokenize(sentence)))
iob_tagged = tree2conlltags(ne_tree)
print(iob_tagged)
[('Clement', 'NNP', u'B-PERSON'), ('and', 'CC', u'O'), ('Mathieu', 'NNP', u'B-PERSON'),
('are', 'VBP', u'O'), ('working', 'VBG', u'O'), ('at', 'IN', u'O'),
('Apple', 'NNP', u'B-ORGANIZATION'), ('.', '.', u'O')]

Notice here how we first tokenized our sentence, then POS-tagged it, and
chunked it before passing it to the tree-based tagger. Our output is each word
tagged appropriately with both the part of speech and named entity class.
ne_tree = conlltags2tree(iob_tagged)

(PERSON Clement/NNP)
(PERSON Mathieu/NNP)
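Going from IOB tags back to whole entities is a matter of grouping each B- tag with the I- tags that follow it. The sketch below does this in plain Python on triples shaped like the tree2conlltags output; the function name is hypothetical and it does not depend on NLTK.

```python
def iob_to_entities(iob_tagged):
    """Group (word, pos, iob_tag) triples into (entity_text, label) pairs."""
    entities = []
    words, label = [], None
    for word, _pos, tag in iob_tagged:
        if tag.startswith("I-") and tag[2:] == label:
            words.append(word)               # continue the current entity
            continue
        if words:                            # any other tag closes the open entity
            entities.append((" ".join(words), label))
        if tag == "O":
            words, label = [], None
        else:                                # 'B-...' (or a stray 'I-...') opens a new one
            words, label = [word], tag[2:]
    if words:
        entities.append((" ".join(words), label))
    return entities


iob_tagged = [("Clement", "NNP", "B-PERSON"), ("and", "CC", "O"),
              ("Mathieu", "NNP", "B-PERSON"), ("are", "VBP", "O"),
              ("working", "VBG", "O"), ("at", "IN", "O"),
              ("Apple", "NNP", "B-ORGANIZATION"), (".", ".", "O")]
entities = iob_to_entities(iob_tagged)
print(entities)
```

Multi-word entities such as Columbia University come out as a single item, because the I- tag attaches the second word to the first.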

The other popular tagger is the Stanford Named Entity Recognizer
(NER) tagger [6]. We previously mentioned CRFs (Conditional Random
Fields) and how they are machine learning constructs that are often used
to train classifiers for text-based problems; Stanford's tagger uses the
same algorithm. It is written in Java, so to use it you must download
the JAR files (you can find these files on the website), but NLTK offers
us a Python interface to access the tagger.
After downloading the JAR files, we must link to them from NLTK. JAR files
are archives that package compiled Java code; in our case, we can think of
them as libraries which we load through Python.
from nltk.tag import StanfordNERTagger
st = StanfordNERTagger(
    '/usr/share/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
    '/usr/share/stanford-ner/stanford-ner.jar',
    encoding='utf-8')

Be careful to note the file paths you must refer to. Since our examples will only
deal with English, we only load the English classifier.
Using it is as straightforward as using the other NLTK taggers:
st.tag('Baptiste Capdeville is studying at Columbia University in NY'.split())
[('Baptiste', 'PERSON'), ('Capdeville', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Columbia', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'),

We see that once again, similar to our POS-tagging example, NLTK can be
tempting because of the simple API it offers, but this is still not what we
would want to use in any production level software. Before we jump into
using spaCy to do our NER-tagging, we encourage the reader to browse the
following links regarding NLTK and what it offers:

1. Testing NLTK and Stanford NER Taggers for Accuracy [7]

2. How to Use Stanford Named Entity Recognizer (NER) in Python NLTK
and Other Programming Languages [8]

3. Chapter 7 (NLTK book) – Extracting Information from Text [9]

4. Named Entity Recognition On Large Collections From Python (PyNER) [

NER-tagging with spaCy
We've discussed the incredible power and simplicity of spaCy before when
we talked about POS-tagging – and we will cite the same reasons when using
spaCy for NER-tagging. In practice, if you've followed the previous chapter
on POS-tagging, the work is already done: since NER-tagging is part of
spaCy's default pipeline, simply processing a document
with the pipeline means that along with being tokenized and POS-tagged, it
has already been NER-tagged (not to mention dependency parsed too!).
Setting up our model involves the same steps we saw before.
import spacy
nlp = spacy.load('en')

Let's now decide some sentences we would like to NER-tag.
sent_0 = nlp(u'Donald Trump visited at the government headquarters in France today.')
sent_1 = nlp(u'Emmanuel Jean-Michel Frédéric Macron is a French politician serving as '
             u'President of France and ex officio Co-Prince of Andorra since 14 May 2017.')
sent_2 = nlp(u"He studied philosophy at Paris Nanterre University, completed a Master's of "
             u"Public Affairs at Sciences Po, and graduated from the École nationale "
             u"d'administration (ÉNA) in 2004.")
sent_3 = nlp(u'He worked at the Inspectorate General of Finances, and later became an '
             u'investment banker at Rothschild & Cie Banque.')

The sent_0 sentence is straightforward and will illustrate how we expect a
basic sentence to be NER-tagged by spaCy.
When spaCy processes a document, the named entities are stored in the ents
property of the Doc object. We can also access entity information through each
token, where it is stored in the ent_type_ attribute. Here is an example, starting with the token attribute:
for token in sent_0:
    print(token.text, token.ent_type_)
(u'Donald', u'PERSON')
(u'Trump', u'PERSON')
(u'visited', u'')
(u'at', u'')
(u'the', u'')
(u'government', u'')
(u'headquarters', u'')
(u'in', u'')
(u'France', u'GPE')
(u'today', u'DATE')
(u'.', u'')

For those words that were not identified as named entities, an empty string
is returned. For those identified as named entities, the appropriate tag is
returned. In our example, we have just three entities, Donald Trump, France, and
today, which are correctly identified as PERSON, GPE, and DATE, respectively. Because
government headquarters doesn't refer to a particular one, it isn't identified
as a named entity. We could argue that because France is mentioned, one could
assume that the government headquarters should also be tagged, but this is a
grey area and we could give our tagger a pass here.
Remember that spaCy intends us to access the entities through doc.ents,
a sequence of Span objects; a Span is a slice of the Doc [11].
for ent in sent_0.ents:
    print(ent.text, ent.label_)
(u'Donald Trump', u'PERSON')
(u'France', u'GPE')
(u'today', u'DATE')

You can see that only the entities are picked up by the span, and the three
entities are printed. Note that Donald Trump is one entity and how this was not
inherently captured when just printing out tokens.
Let's try the next sentence, which is longer and contains a French name,
which might potentially throw off our English tagger.
for token in sent_1:
    print(token.text, token.ent_type_)
(u'Emmanuel', u'PERSON')
(u'Jean', u'PERSON')
(u'-', u'PERSON')
(u'Michel', u'PERSON')
(u'Fr\xe9d\xe9ric', u'')
(u'Macron', u'')
(u'is', u'')
(u'a', u'')
(u'French', u'NORP')
(u'politician', u'')
(u'serving', u'')
(u'as', u'')
(u'President', u'')
(u'of', u'')
(u'France', u'GPE')
(u'and', u'')
(u'ex', u'')
(u'officio', u'')
(u'Co', u'PERSON')
(u'-', u'PERSON')
(u'Prince', u'PERSON')
(u'of', u'')
(u'Andorra', u'')
(u'since', u'')
(u'14', u'DATE')
(u'May', u'DATE')
(u'2017', u'DATE')
(u'.', u'')

Here, we notice a few peculiarities. The accented é characters in Frédéric
appear to throw off the tagger, so Macron isn't captured as part of the entity. We will see
later on in this chapter that, if not for the accents or the way Unicode is
read in this example, Macron would have been included as part of the entity. We
can see that Co-Prince of Andorra isn't identified in the best possible way either.
Let's look at the same example, but only printing the entities:
for ent in sent_1.ents:
    print(ent.text, ent.label_)
(u'Emmanuel Jean-Michel', u'PERSON')
(u'French', u'NORP')
(u'France', u'GPE')
(u'Co-Prince', u'PERSON')
(u'14 May 2017', u'DATE')

We see the mistakes here quite clearly. That being said, let's remove
occurrences of accents before we run the next example:
for token in sent_2:
    print(token.text, token.ent_type_)
(u'He', u'')
(u'studied', u'')
(u'philosophy', u'')
(u'at', u'')
(u'Paris', u'ORG')
(u'Nanterre', u'ORG')

(u'University', u'ORG')
(u',', u'')
(u'completed', u'')
(u'a', u'')
(u'Masters', u'ORG')
(u'of', u'ORG')
(u'Public', u'ORG')
(u'Affairs', u'ORG')
(u'at', u'')
(u'Sciences', u'')
(u'Po', u'')
(u',', u'')
(u'and', u'')
(u'graduated', u'')
(u'from', u'')
(u'the', u'ORG')
(u'Ecole', u'ORG')
(u'Nationale', u'ORG')
(u'Administration', u'ORG')
(u'(', u'')
(u'ENA', u'ORG')
(u')', u'')
(u'in', u'')
(u'2004', u'DATE')
(u'.', u'')

This example looks much cleaner; let's examine only the spans to see if it has
picked up all the phrases.
for ent in sent_2.ents:
    print(ent.text, ent.label_)
(u'Paris Nanterre University', u'ORG')
(u'Masters of Public Affairs', u'ORG')
(u'the Ecole Nationale Administration', u'ORG')
(u'ENA', u'ORG')
(u'2004', u'DATE')

And voila! We see once we've removed the pesky accents it's smooth.
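The accent removal mentioned above is not shown in the text, so here is a minimal sketch of one common way to do it with the standard library. The helper name strip_accents is our own, and the approach (Unicode decomposition followed by dropping combining marks) is an assumption about how the cleaned sentences could have been produced, not necessarily what was done for the examples in this chapter.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining accent marks removed (for example, é -> e)."""
    # NFKD splits an accented character into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


sentence = "graduated from the École nationale d'administration (ÉNA) in 2004."
cleaned = strip_accents(sentence)
print(cleaned)
```

The cleaned string can then be passed to nlp() as before. Note that stripping accents changes the text itself, so it is best treated as a workaround rather than a general preprocessing step.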
for token in sent_3:
    print(token.text, token.ent_type_)

(u'He', u'')
(u'worked', u'')
(u'at', u'')
(u'the', u'ORG')
(u'Inspectorate', u'ORG')
(u'General', u'ORG')
(u'of', u'ORG')
(u'Finances', u'ORG')
(u',', u'')
(u'and', u'')
(u'later', u'')
(u'became', u'')

(u'an', u'')
(u'investment', u'')
(u'banker', u'')
(u'at', u'')
(u'Rothschild', u'ORG')
(u'&', u'ORG')
(u'Cie', u'ORG')
(u'Banque', u'ORG')
(u'.', u'')
for ent in sent_3.ents:
    print(ent.text, ent.label_)
(u'the Inspectorate General of Finances', u'ORG')
(u'Rothschild & Cie Banque', u'ORG')

And there we go - we've seen how spaCy works in a variety of settings, and
the possible things that can confuse it. Overall, it works quite well, and we
encourage the reader to try out a few examples of their own.
Much like with spaCy's POS-tagger, we are also encouraged to train our own
NER model.

Training our own NER-taggers
In the previous chapter on POS-tagging, we discussed in detail the training
process of a statistical model used for tagging. The idea for NER-tagging
remains the same – we select features we believe are indicative of a named
entity tag, plug these features into a machine learning model, feed it
annotated data, and let the machine learn from the examples provided.
If you need a refresher on how the training process happens in a spaCy model, we recommend that
you re-read the Training our own POS-taggers section from Chapter 5, POS-Tagging and Its Applications.

We will now examine two code files present in the spaCy examples folder: one
which trains a blank model to perform NER-tagging, and another which adds
a new entity to an existing model.
The following code appears in the file [12]:
import plac
import random
from pathlib import Path
import spacy

# training data
TRAIN_DATA = [
    ('Who is Shaka Khan?', {
        'entities': [(7, 17, 'PERSON')]
    }),
    ('I like London and Berlin.', {
        'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
    })
]

We've set up our basic imports and our training examples. A friendly
reminder that these are far too few examples for any serious training to
happen, and that this is merely a representative example.
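Each annotation in the training data is a (start, end, label) triple of character offsets into the text. As a quick sanity check, the following sketch slices the text with those offsets to confirm that they line up with the intended entities. The helper entity_texts is our own name and is not part of spaCy.

```python
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]


def entity_texts(text, annotations):
    """Slice each (start, end, label) annotation out of the text."""
    return [(text[start:end], label)
            for start, end, label in annotations["entities"]]


for text, annotations in TRAIN_DATA:
    print(entity_texts(text, annotations))
```

Running a check like this over your own annotated data is a cheap way to catch off-by-one offsets before training.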
@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=100):
    """Load the model, set up the pipeline and train the
    entity recognizer."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank('en')  # create blank Language class
        print("Created blank 'en' model")

We've set up annotations for the model name, where our model will be saved,
and the number of iterations. If a model name is passed, that model is loaded;
otherwise, we create a blank English model.
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe('ner')
    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])  # ent is (start, end, label)
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)

We notice here that it follows the exact same training principles as the
POS-tagger. We start by adding the ner component to the pipeline and
registering our entity labels with it, and then disable all the other
components of the pipeline so that we only train/update the NER-tagger. The
training itself is straightforward, and the nlp.update() method abstracts
everything for us, letting spaCy deal with the actual machine learning and
heavy lifting.
    # test the trained model
    for text, _ in TRAIN_DATA:
        doc = nlp(text)
        print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
        print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in TRAIN_DATA:
            doc = nlp2(text)
            print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
            print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])

if __name__ == '__main__':
    plac.call(main)

Soon after our training is done, we test our model and then save it to the
directory specified. If we run the file without any errors, we should expect
the following output:
Entities [('Shaka Khan', 'PERSON')]
Tokens [('Who', '', 2), ('is', '', 2), ('Shaka', 'PERSON', 3),
('Khan', 'PERSON', 1), ('?', '', 2)]
Entities [('London', 'LOC'), ('Berlin', 'LOC')]
Tokens [('I', '', 2), ('like', '', 2), ('London', 'LOC', 3),
('and', '', 2), ('Berlin', 'LOC', 3), ('.', '', 2)]
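The third value in each token tuple is spaCy's integer IOB code: 3 marks the beginning of an entity, 1 marks a token inside an entity, 2 marks a token outside any entity, and 0 means no tag is set. The sketch below decodes the output shown above and regroups the tokens into entity spans. The helper group_entities is our own, and joining words with a single space is a simplifying assumption.

```python
IOB_CODES = {0: "", 1: "I", 2: "O", 3: "B"}

tokens = [("Who", "", 2), ("is", "", 2), ("Shaka", "PERSON", 3),
          ("Khan", "PERSON", 1), ("?", "", 2)]


def group_entities(tokens):
    """Rebuild (entity text, label) spans from (text, ent_type, ent_iob) triples."""
    spans, current = [], None
    for text, label, iob in tokens:
        if iob == 3:                    # B: a new entity starts here
            if current:
                spans.append(current)
            current = ([text], label)
        elif iob == 1 and current:      # I: continue the open entity
            current[0].append(text)
        else:                           # O or unset: close any open entity
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(" ".join(words), label) for words, label in spans]


print([(text, IOB_CODES[iob]) for text, _, iob in tokens])
print(group_entities(tokens))
```

This is essentially what doc.ents does for us, which is why the Entities line and the Tokens line in the output always agree.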

Let's now look at adding a new class to a model. The principle remains the
same here; we load the model, disable the pipes we won't be updating, add
the new label, and then loop over the examples and update the model with
them. Again, exactly like the previous example, don't expect the trained
model to do any wonders - we don't have enough training examples.
The actual training is performed by looping over the examples and calling
nlp.entity.update(). The update() method steps through the words of the input. At
each word, it makes a prediction. It then consults the annotations provided
on the GoldParse instance, to see whether it was right. If it was wrong, it adjusts
its weights so that the correct action will score higher next time.
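To make the predict, compare, and adjust loop concrete, here is a toy sketch using a simple perceptron-style tagger. This is not spaCy's actual model (spaCy uses a neural network), and the feature set, label names, and helper functions are all illustrative assumptions. It only shows the idea that a wrong prediction nudges the weights so the correct label scores higher next time.

```python
from collections import defaultdict


def features(words, i):
    """Hypothetical hand-picked features for the word at position i."""
    word = words[i]
    prev = words[i - 1].lower() if i > 0 else "<start>"
    return ["bias", "lower=" + word.lower(),
            "is_title=%s" % word.istitle(), "prev=" + prev]


def predict(weights, feats, labels):
    """Score every label and return the highest-scoring one."""
    return max(labels, key=lambda label: sum(weights[(f, label)] for f in feats))


def update(weights, words, gold, labels):
    """One pass over a sentence: predict, compare with gold, adjust weights."""
    mistakes = 0
    for i, truth in enumerate(gold):
        feats = features(words, i)
        guess = predict(weights, feats, labels)
        if guess != truth:
            mistakes += 1
            for f in feats:
                weights[(f, truth)] += 1.0  # correct label scores higher next time
                weights[(f, guess)] -= 1.0  # wrong guess scores lower
    return mistakes


LABELS = ["O", "ANIMAL"]
TRAIN = [
    ("Do you like horses ?".split(), ["O", "O", "O", "ANIMAL", "O"]),
    ("horses pretend to care".split(), ["ANIMAL", "O", "O", "O"]),
]

weights = defaultdict(float)
for epoch in range(10):
    errors = sum(update(weights, words, gold, LABELS) for words, gold in TRAIN)
    if errors == 0:  # stop once every training token is predicted correctly
        break

test_words = "they like horses".split()
predicted = [predict(weights, features(test_words, i), LABELS)
             for i in range(len(test_words))]
print(predicted)
```

With this little data the tagger has effectively memorized the word horses, which is exactly the caveat raised above about tiny training sets.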
import plac
import random
from pathlib import Path
import spacy

# new entity label
LABEL = 'ANIMAL'

# training data
TRAIN_DATA = [
    ("Horses are too tall and they pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),
    ("Do they bite?", {
        'entities': []
    }),
    ("horses are too tall and they pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),
    ("horses pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),
    ("they pretend to care about your feelings, those horses", {
        'entities': [(48, 54, 'ANIMAL')]
    }),
    ("horses?", {
        'entities': [(0, 6, 'ANIMAL')]
    })
]

We've set up our imports and our training examples.
If you're using an existing model, make sure to mix in examples of other entity types that spaCy correctly
recognized before. Otherwise, your model might learn the new type, but forget what it previously knew.

This blog post link explains this mistake of forgetting old
@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    new_model_name=("New model name for model meta.", "option", "nm", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, new_model_name='animal', output_dir=None, n_iter=20):
    """Set up the pipeline and entity recognizer, and train
    the new entity."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank('en')  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe('ner')

The preceding steps are similar to the previous example. Pay attention to the
next line carefully – it's where we add our label.
    # add new entity label to entity recognizer
    ner.add_label(LABEL)  # LABEL is the new entity label, 'ANIMAL'
    if model is None:
        optimizer = nlp.begin_training()
    else:
        # Note that 'begin_training' initializes the models, so it'll
        # zero out existing entity types.
        optimizer = nlp.entity.create_optimizer()
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update([text], [annotations], sgd=optimizer,
                           drop=0.35, losses=losses)
            print(losses)

We trained our model the same way we trained the previous model; after all,
the training process remains the same.
    # test the trained model
    test_text = 'Do you like horses?'
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)
    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta['name'] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)

if __name__ == '__main__':
    plac.call(main)

The rest of the code again remains the same; the crucial differences are in
the training data, in adding the new class, and in remembering to mix in
examples of the older entity types too.
It is worth looking at spaCy's NER linguistic features page [16] – they also
offer useful advice on how to set entity annotations.
spaCy offers us an easy way to train our models, though its existing models
do a good job too. We should not forget what lies underneath the hood - a
statistical model that accepts features and makes predictions. Even NLTK
offers us the ability to train its models. There are a number of tutorials
that explain how to build your own classifier, or how to update your NLTK
classifier. While these are interesting for understanding the concepts behind
training an NER classifier, they are not relevant to our immediate purpose.
We offer a list of these tutorials in case the reader is interested:

1. A complete guide to building your own Named Entity Recognizer with
Python [13]

2. Introduction To Named Entity Recognition In Python [14]

3. Performing Sequence Labelling using CRF in Python [15]

NER-tagging examples and visualization
One of spaCy's most impressive offerings is its visualization suites and API,
and in particular displaCy [17]. We discussed this in the previous chapter when
visualizing part of speech tags. While it is most impressive in visualizing
dependency parsing (which we will see next chapter), it doesn't do a half bad
job with entities either.

Fig 6.4 An example from a news excerpt from an Elon Musk article on

We can see in the above example that spaCy has caught the entities quite
well. Even the Elon Musk page is marked as an organization, which is a
defensible reading. It could be the context of Tesla before it or official
pages after it - we cannot be sure. We do have an interesting mistake here,
where Twitter is tagged as a geopolitical entity. We could let this slide if
we consider that Facebook and Twitter are becoming big enough to be
countries! But jokes aside, it is not always easy to deal with such words
unless the model was trained on a corpus from a similar domain.
Let's look at the sentences we NER-tagged before:

Emmanuel Jean-Michel Frédéric Macron (French pronunciation: [ɛmanɥɛl
makʁɔ]̃ ; born 21 December 1977) is a French politician serving as President
of France and ex officio Co-Prince of Andorra since 14 May 2017.
Before entering politics, he was a senior civil servant and investment
banker. Macron studied philosophy at Paris Nanterre University, completed
a Master's of Public Affairs at Sciences Po, and graduated from the École
nationale d'administration (ÉNA) in 2004. He worked at the Inspectorate
General of Finances, and later became an investment banker at Rothschild &
Cie Banque.

Fig 6.5 An example visualizing entities from the wiki page of French President, Emmanuel Macron

We can see here that Macron's whole name is caught – the accent didn't
throw off the web-app!
Apart from neat visualizations, we can also use NER for simple, possibly
pointless tasks, such as swapping two named entities in a sentence.
words, indices = [], []
for i, w in enumerate(nlp(u'Tom went to London before going to Paris.')):
    words.append(w.text_with_ws), indices.append(i) if w.ent_type_ == "GPE" else None
words[indices[0]], words[indices[1]] = words[indices[1]], words[indices[0]]
print(''.join(words))
Tom went to Paris before going to London.

In just a few lines of code, we've swapped London for Paris - it might not be
the most useful sentence manipulation we've seen so far, but it illustrates
how easy spaCy is to use.

We've seen once again how well spaCy deals with computational linguistic
tasks and also how useful NER-tagging can be. While NER-tagging is a task
used in text analysis, the model behind it is a statistical one -
understanding this helps in setting the context for building our own models
if we would like, or in updating the existing model that spaCy uses.
In the next chapter, we will see how spaCy deals with our final section on
computational linguistics – dependency parsing.

[1] A survey of named entity recognition and classification:

[2] Annotation Sub-Types:

[3] Named Entity Recognition and Resolution for Literary Studies:

[4] Conditional Random Fields: Probabilistic Models for Segmenting and
Labeling Sequence Data:

[5] Natural Language Processing: Semantic Aspects:

[6] Stanford NER:

[7] Testing NLTK and Stanford NER Taggers for Accuracy:

[8] How to Use Stanford Named Entity Recognizer (NER) in Python NLTK
and Other Programming Languages:

[9] Chapter 7 (NLTK book) - Extracting Information from Text:

[10] Named Entity Recognition On Large Collections From Python (PyNER):

[11] span:


[13] A complete guide to building your own Named Entity Recognizer with Python:


[14] Introduction To Named Entity Recognition In Python:

[15] Performing Sequence Labelling using CRF in Python:

[16] spaCy NER:

[17] displaCy:

[18] Biocreative:

Dependency Parsing
We saw in Chapter 5, POS-Tagging and Its Applications and Chapter 6,
NER-Tagging and Its Applications, how spaCy's language pipeline performs a
variety of complex computational linguistics tasks, such as POS-tagging
and NER-tagging. This isn't all spaCy packs though, and in this chapter, we
will explore the power of dependency parsing and how it can be used in a
variety of contexts and applications. We will have a look at the theory of
dependency parsing before moving on to using it with spaCy, as well as
training our own dependency parsers. Following are the topics we will cover
in this chapter:
Dependency parsing
Dependency parsing with Python
Training our dependency parsers

Dependency parsing
Parsing remains one of the most important processes we can carry out on
text. It isn't limited to natural languages though: it has a long history in
computer languages as well, and the ideas can be extended to any kind of
data structure that conforms to certain formal grammatical rules.
This means that to be able to do any kind of parsing, we would need two
things - a parser and a grammar. But wait, what exactly is parsing?
We can understand it as a way of analyzing a sentence by breaking it up to
understand its structure. The way we break up the sentence to understand its
underlying structure is the crux of parsing, and there are many different
ways we can attempt to interpret a sentence's structure.
We say sentence here because our interest is in natural languages, but
parsing is an activity that can be performed on any statement with a formal
grammar. For example, let's look at this simple arithmetic
statement: ((7 + 3) * (5 - 2))

Fig 7.1 An example of parsing a simple mathematical statement

How would we break this up? The four numbers would be our four main
constituents, and the other symbols represent the actions between these
numbers. Following standard BODMAS rules of arithmetic, we would first
finish the actions in between brackets. We then describe how the
mathematical symbols (+, -, *) relate the leaves of the tree - here the leaves

are the nodes at the very bottom of the tree, and are the numbers 7, 3, 5, and
2. The figure explains how we would parse such a statement.
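Python's own parser can build this tree for us. The following sketch uses the standard library ast module to parse the same arithmetic statement, print the tree with operators as internal nodes and numbers as leaves, and evaluate it bottom-up. It assumes Python 3.8 or later, where numeric leaves are ast.Constant nodes; the show and evaluate helpers are our own.

```python
import ast

expression = "((7 + 3) * (5 - 2))"
tree = ast.parse(expression, mode="eval").body  # root node of the expression

SYMBOLS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}


def show(node, depth=0):
    """Return the tree as indented lines: operators are nodes, numbers are leaves."""
    indent = "  " * depth
    if isinstance(node, ast.BinOp):
        lines = [indent + SYMBOLS[type(node.op)]]
        lines += show(node.left, depth + 1)
        lines += show(node.right, depth + 1)
        return lines
    return [indent + str(node.value)]  # ast.Constant leaf


def evaluate(node):
    """Evaluate bottom-up, so bracketed sub-expressions are computed first."""
    if isinstance(node, ast.BinOp):
        left, right = evaluate(node.left), evaluate(node.right)
        if isinstance(node.op, ast.Add):
            return left + right
        if isinstance(node.op, ast.Sub):
            return left - right
        return left * right  # only +, - and * appear in this expression
    return node.value


tree_lines = show(tree)
print("\n".join(tree_lines))
print(evaluate(tree))
```

The grammar here is Python's expression grammar and the parser is the interpreter's own, which mirrors the point that parsing always needs both.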
Now that we know what the idea of parsing is, we can focus on how it is
relevant to us. Even in the world of natural language processing, the term
parsing could mean two different things - while traditional sentence parsing
refers to understanding the meaning of a sentence or a word, in the context
of computational linguistics it can also refer to the formal analysis by an
algorithm that results in a parse tree (not dissimilar from the tree we saw
earlier).
In our discussions throughout this chapter, whenever we refer to parsing, we
will refer to the traditional sentence parsing. In the realm of traditional
sentence parsing, there are many schools of thought, out of which two are
the most popular - dependency parsing and phrase structure parsing. We will
be largely using dependency parsing in our textual analysis, but it is worth
our while to understand both kinds of parsing.
A little bit of history - dependency parsing is a rather new approach to
parsing, and the French linguist Lucien Tesnière [1] is credited with
introducing this school of thought. Constituency parsing, on the other
hand, has been around for much longer, with Aristotle's ideas on term logic
[2] said to resemble the way we understand constituencies. It is formally
credited to Noam Chomsky [3], who is often called the father of modern
linguistics.
As the name suggests, dependency parsing refers to understanding the
structure of a sentence via the dependencies between words in a sentence.
Dependency is the idea that words in a sentence are connected to each other
with directed links. Phrase structure parsing, on the other hand, breaks up
sentences into phrases, or separate constituents, and can also be referred to
as constituency parsing. So, while a sentence that is dependency parsed
would give us information about the relationships between words in a
sentence, a sentence that is parsed using constituency relationships will help
us understand how we can group our sentences.

Fig 7.2 Illustrating the differences between constituency and dependency parsing. Credits: Wikipedia editor Tjo3ya [4]

What kind of information can we extract from a sentence parsed using
phrases or constituencies? This kind of parsing depends on splitting up a
sentence into phrases, particularly into a subject (usually the noun phrase
(NP)) and a predicate (the verb phrase (VP)). Note from the diagram that
the relationships between words involve multiple links. Indeed, in the
example, we see an almost recursive structure. The words of the sentence are
also called the leaves of the tree, and each of the phrases is a node. This
kind of parse is useful in finding out what kind of phrases exist in a
sentence, and the sub-phrases as well. Since this results in us identifying
the subject and the object as well, we have some semantic information about
the context of words which might be previously unknown. For example,
consider the sentence: The lion ate the zebra.
We have previously discussed how to represent words as vectors (Chapter 3,
spaCy's Language Models), and one of these representations was the bag of
words representation. In such a case, we would only be aware of the presence
of the words (assuming that the stop-words are removed) lion, ate, and
zebra. While it is likely that the lion did indeed eat the zebra, we cannot
really be sure unless we know the order and structure of the sentence - a
phrasal parsing of the sentence will give us the subject (lion) and the
object (zebra), which would allow us to confirm our intuition that the lion
did indeed eat the zebra.
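A short sketch makes the loss of information concrete. Below, two sentences with opposite meanings produce exactly the same bag of words. The one-word stop list and the bag_of_words helper are illustrative assumptions, not the preprocessing used elsewhere in this book.

```python
from collections import Counter

STOP_WORDS = {"the"}  # tiny illustrative stop list


def bag_of_words(sentence):
    """Lowercase, drop the final period, split on spaces, and count non-stop words."""
    tokens = sentence.lower().rstrip(".").split()
    return Counter(t for t in tokens if t not in STOP_WORDS)


a = bag_of_words("The lion ate the zebra.")
b = bag_of_words("The zebra ate the lion.")
print(a)
print(a == b)
```

Since the two bags are equal, no model built on them alone can tell who ate whom; the parse is what recovers that.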
Again, since this is not a linguistics book, we will not focus further on
the kinds of grammars that exist, or indeed even the parsing techniques
being used (and there are many! The Wikipedia article on phrase structure
grammar [5] has a helpful summary), but rather on how to actually perform
the parsing, and on how to interpret and use the results.
Dependency parsing focuses on the relationships or dependencies between

the words in a sentence. That being said, there are many kinds of
dependencies which can be represented during such a parsing; the popular
ones being semantic dependencies, morphological dependencies, prosodic
dependencies, and syntactic dependencies (the article [6] by Joakim Nivre
sums up the theory behind some of these, as does the Wikipedia page on
dependency grammar [7]).
Throughout this chapter, we will focus on one particular kind of dependency
parsing - syntactic dependency parsing. This is in part because most work in
dependency parsing refers to the syntactic dependency parsing, and also in
part because spaCy's parsing algorithm is a syntactic dependency parser. This
kind of parsing, as the name suggests, assigns a syntactic structure to a
sentence, and in our case, this will be a tree.
Let's sum up some of the differences between the two kinds of parsing
methods. Constituency parsing breaks up a sentence into sub-phrases, where
the non-terminal nodes are the types of phrases, and the terminal nodes
(leaves) are the words in the sentence, and the edges are unlabeled. We
would use them to understand the phrases a sentence has, as well as the
subject and objects.
Dependency parsing connects words according to relationships, and each
vertex in the tree represents a word. There are child nodes and parent nodes,
and each edge is labeled to explain the relationship between the words.
Constituency parsers and dependency parsers also differ in the first break-up
or split of the sentence; constituency parsers break up a sentence into a
subject and a predicate, which are usually a noun phrase and a verb phrase.
Dependency parsers, on the other hand, consider the verb as the head of the
sentence, and all dependencies are built around it.
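The two representations can be contrasted as plain data structures. In the sketch below, both analyses of The lion ate the zebra are written by hand for illustration (they are not the output of any parser), and the helper functions are our own. The constituency tree has phrase labels as internal nodes and words as leaves, while the dependency analysis is a flat list of labeled edges between words.

```python
# Constituency: nested (label, child, ...) tuples; the words are the leaves.
constituency = (
    "S",
    ("NP", ("DT", "The"), ("NN", "lion")),
    ("VP", ("VBD", "ate"), ("NP", ("DT", "the"), ("NN", "zebra"))),
)

# Dependency: one labeled (head, relation, dependent) edge per non-root word.
dependency = [
    ("ate", "nsubj", "lion"),
    ("lion", "det", "The"),
    ("ate", "dobj", "zebra"),
    ("zebra", "det", "the"),
]


def leaves(node):
    """Return the words of a constituency tree from left to right."""
    children = node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]  # pre-terminal such as ("NN", "lion")
    return [word for child in children for word in leaves(child)]


def phrase_labels(node):
    """Return the labels of all phrase nodes, for example S, NP, VP."""
    children = node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return []
    return [node[0]] + [lab for child in children for lab in phrase_labels(child)]


def dependents(edges, head, relation):
    """Return the words attached to head by the given relation."""
    return [dep for h, rel, dep in edges if h == head and rel == relation]


print(leaves(constituency))
print(phrase_labels(constituency))
print(dependents(dependency, "ate", "nsubj"), dependents(dependency, "ate", "dobj"))
```

Notice that in the dependency version the verb ate is the only word that never appears as a dependent, which is what makes it the root.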
We have spoken a lot about these dependencies - but what exactly are they?
spaCy uses the CLEAR style [8] for marking its dependencies. We would
again like to stress that understanding the linguistic dependencies and what
they imply is beyond the scope of the book, and we encourage the reader to
use the aforementioned links and research articles to refresh their
knowledge on dependencies. That being said, let's look at this simple
sentence:
The dog is faster than the cat.
If we dependency parse this sentence and visualize it (again, using the
always useful displaCy [9]), this is what we see.

Fig 7.3 Visualizing dependency parsing

In this example, the root word is is, the main verb of the sentence.
The dog is the noun phrase, and its head, dog, is marked as nsubj, which
refers to the nominal subject of the sentence. The label acomp means
adjectival complement: an adjective phrase that complements the verb, which
here is faster. The word than is our preposition, and pobj stands for the
object of a preposition, which here is the cat.
For a faster round-up of what labels spaCy assigns, the annotation page [10]
is very useful.
We now have an idea of what exactly dependency parsing is, and have had a
peek into why it can be very useful for us in our text analysis tasks. But
where exactly is this information about phrases or dependencies going to
come in handy?
Like most NLP tasks, finishing one task can greatly help in other tasks. In
this case, having a sentence parsed with phrasal rules can help us in
NER-tagging. We remember from the previous chapter that often noun chunks are
tagged as an entire entity, and these chunks are often identified post-parsing.
The other major use of parsers is in machine translation, where semantic
and syntactic information is very important. Since we are constructing trees
when we perform parsing, we can convert this tree and represent it as a
knowledge graph where we have information regarding the words and how
they relate to one another. Using such a knowledge graph as an intermediate
step, we can attempt to perform language agnostic translation.
This kind of knowledge graph representation of a sentence can also be
helpful when constructing chatbots or a system where we have to understand
tasks that need to be performed - in this case, identifying actions is very
important. Parsing can also help verify the grammatical correctness of a
sentence.
But let's move beyond grammatical correctness and attempt to solve another
problem: ambiguity. Like most languages, the English language isn't always
straightforward, and a single comma can change the meaning of a sentence.
Consider the following two sentences:
I saw a girl with a telescope.
I saw a girl, with a telescope.
While both sentences appear to mean the same thing, the comma in the second
sentence changes this entirely. The first sentence implies that the subject,
I, saw a girl who has a telescope. The second sentence, on the other hand,
suggests that the subject saw a girl using a telescope. How does
spaCy's dependency parser deal with this?

Fig 7.4 Visualizing without a comma

As expected, without a comma, the dependency links a girl, with and
telescope, suggesting that the girl is in possession of the telescope.
When we visualize it again, but this time with a comma in the sentence:

Fig 7.5 Visualizing with a comma

We immediately see how the word with, and by extension a telescope are
linked to the root verb, saw. This is because the act of seeing was done using
a telescope.
Here, dependencies helped resolve ambiguities between two very similar
sentences.
It is clear that dependency parsing has a wealth of applications. Constructing
such parsers has long been a problem in natural language processing, and
also one that we will not attempt to explain or solve, as the methods merit
their own chapters and theoretical underpinnings. Older methods relied
heavily on rule-based parsing techniques which were dependent on the grammar
used. We have now, much like POS-tagging and NER-tagging, shifted to
statistical methods to parse, where we tag our phrases and dependencies using
a probabilistic measure which tells us the most likely way a sentence should
be parsed based on historical training data and a few basic rules. We have
already seen two instances of training such models in Chapter 5, POS-Tagging
and Its Applications and Chapter 6, NER-Tagging and Its Applications, so you
should now be comfortable with the process.
As usual, the Python programming language provides us with rich tools and
libraries to perform dependency parsing - let's move to the next section

where we will discuss this.

Dependency parsing in Python
It's easy to spot the trend in Chapter 4, Gensim - Vectorizing Text and
Transformations and n-grams, Chapter 5, POS-Tagging and Its Applications,
and Chapter 6, NER-Tagging and Its Applications: all of them choose spaCy as
the preferred implementation, not just for its accuracy and speed, but for
the way it naturally fits into our text analysis pipelines. We still
discussed the other Python libraries available to perform each task, and we
will do the same for dependency parsing.
As usual, we will start with NLTK, which provides the most options regarding
parsing methods but, unlike in the previous cases, has a not so intuitive
API, and one where we are forced to pass our own grammar for effective
results. It is not our purpose to learn grammars before we run computational
linguistic algorithms, and this is another reason we will always prefer
spaCy for industry strength code.
What we will do, however, is demonstrate how to use the Stanford
Dependency Parser wrapped using NLTK.
The first step would be to download the necessary JAR files from the
Stanford Dependency Parser page [11] (it would also be worth your while to
check out Stanford's other Statistical Parser [12], simply for the historical
context):
from nltk.parse.stanford import StanfordDependencyParser
path_to_jar = 'path_to/stanford-parser-full-2014-08-27/stanford-parser.jar'
path_to_models_jar = 'path_to/stanford-parser-full-2014-08-27/stanford-parser-3.4.1-models.jar'
dependency_parser = StanfordDependencyParser(path_to_jar=path_to_jar,
                                             path_to_models_jar=path_to_models_jar)

The preceding lines of code demonstrate how to load the Stanford JAR files
into our Python NLTK interface, which, like the previous POS-tagging and
NER-tagging examples, links to the JAR files on your machine. Be sure to
point both paths to the directory where your files are stored.
result = dependency_parser.raw_parse('I shot an elephant in my sleep')
dep = next(result)  # raw_parse returns an iterator of dependency graphs
list(dep.triples())


If you print the list, this is the output we expect:
[((u'shot', u'VBD'), u'nsubj', (u'I', u'PRP')),
((u'shot', u'VBD'), u'dobj', (u'elephant', u'NN')),
((u'elephant', u'NN'), u'det', (u'an', u'DT')),
((u'shot', u'VBD'), u'prep', (u'in', u'IN')),
((u'in', u'IN'), u'pobj', (u'sleep', u'NN')),
((u'sleep', u'NN'), u'poss', (u'my', u'PRP$'))]

We can see that shot, the verb, takes the root of the tree.
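We can confirm this directly from the triples, without any parser installed. Each triple is (governor, relation, dependent), so the root is the one governor that never appears as a dependent. The sketch below copies the output shown above; the helpers find_root and children_of are our own, and find_root assumes the sentence has exactly one root.

```python
triples = [
    (("shot", "VBD"), "nsubj", ("I", "PRP")),
    (("shot", "VBD"), "dobj", ("elephant", "NN")),
    (("elephant", "NN"), "det", ("an", "DT")),
    (("shot", "VBD"), "prep", ("in", "IN")),
    (("in", "IN"), "pobj", ("sleep", "NN")),
    (("sleep", "NN"), "poss", ("my", "PRP$")),
]


def find_root(triples):
    """Return the (word, tag) pair that governs others but depends on nothing."""
    governors = {gov for gov, _, _ in triples}
    dependents = {dep for _, _, dep in triples}
    return (governors - dependents).pop()  # assumes a single root


def children_of(triples, head_word):
    """Return (relation, word) pairs for the direct dependents of head_word."""
    return [(rel, dep[0]) for gov, rel, dep in triples if gov[0] == head_word]


root = find_root(triples)
print(root)
print(children_of(triples, root[0]))
```

The same two helpers work on the triples for any single-rooted sentence, which makes them handy for quick checks on parser output.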
This is the extent of what we will demonstrate with NLTK, but if the reader
wishes to define a grammar and use more academic statistical or rule-based
parsing techniques, the following links illustrate this:

1. NLTK Dependency Grammars [13]

2. NLTK Book Chapter 8: Analyzing Sentence Structure [14]

3. Configuring Stanford Parser and Stanford NER Tagger with NLTK in
Python on Windows and Linux [15]
We will now jump straight to spaCy's dependency parsing API.

Dependency parsing with spaCy
If you've followed every chapter of this book until this one, you will
already have dependency parsed your data multiple times; each run of your
text through the pipeline has already annotated the words in the sentences
in your document with their dependencies to the other words in the sentence.
Let's set up our models again, similar to how we did in the previous
chapters.
import spacy
nlp = spacy.load('en')

Now that our pipeline is ready, we can begin analyzing our sentences.
spaCy's parsing portion of the pipeline does both phrasal parsing and
dependency parsing - this means that we can get information about what the
noun chunks in a sentence are, as well as information about the
dependencies between words.
Phrasal parsing can also be referred to as chunking, as we get chunks that are
part of sentences, which are phrases. These chunks are available through the
noun_chunks attribute of each parsed sentence.
Let's illustrate this with three simple sentences:
sent_0 = nlp(u'Myriam saw Clement with a telescope.')
sent_1 = nlp(u'Self-driving cars shift insurance liability toward manufacturers.')
sent_2 = nlp(u'I shot the elephant in my pyjamas.')
for chunk in sent_0.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

(u'Myriam', u'Myriam', u'nsubj', u'saw')
(u'Clement', u'Clement', u'dobj', u'saw')
(u'a telescope', u'telescope', u'pobj', u'with')

We can see here that we now have the chunks, the root text (we can see this

in the a telescope chunk, whose root is telescope), the dependency type, and the
head. As expected, as the verb is saw, it is the head for both Myriam and Clement,
where Myriam is the subject and Clement is the object.
The next sentence encapsulates the idea of chunking far better.
for chunk in sent_1.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

(u'Self-driving cars', u'cars', u'nsubj', u'shift')
(u'insurance liability', u'liability', u'dobj', u'shift')
(u'manufacturers', u'manufacturers', u'pobj', u'toward')

We have three noun phrases, out of which Self-driving cars and insurance liability
give us a clearer idea of what noun phrases are - here Self-driving and insurance
qualify the root nouns cars and liability. The word manufacturers is the final
noun of the sentence, being the object of the preposition toward.
Our last example is a lot more straightforward:
for chunk in sent_2.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

(u'I', u'I', u'nsubj', u'shot')
(u'the elephant', u'elephant', u'dobj', u'shot')
(u'my pyjamas', u'pyjamas', u'pobj', u'in')

The determiner the and the possessive my identify the elephant and the
pyjamas, and are included as part of our noun phrases.
Let's now look at our sentences again, but with individual words instead of
phrases. Make a note of how we accessed chunks in the previous examples,
and how we will be instead accessing tokens in the coming examples.
for token in sent_0:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])

(u'Myriam', u'nsubj', u'saw', u'VERB', [])
(u'saw', u'ROOT', u'saw', u'VERB', [Myriam, Clement, with, .])
(u'Clement', u'dobj', u'saw', u'VERB', [])
(u'with', u'prep', u'saw', u'VERB', [telescope])

(u'a', u'det', u'telescope', u'NOUN', [])
(u'telescope', u'pobj', u'with', u'ADP', [a])
(u'.', u'punct', u'saw', u'VERB', [])

The output is similar to the noun chunk examples, with the addition of a list
that contains the children (if any) of the nodes. We can see immediately with
the preceding example that the word saw, the root verb, is the head node, with
four children nodes that are dependent on it, which is visible in the list.
The dependencies are the same as we previously observed in the example
with the noun chunks.
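To make the head and children relationship concrete without needing a trained model, the following minimal sketch rebuilds the children lists from nothing but each token's head. The tokens table is transcribed by hand from the printed output above, and the helper name children_of is hypothetical; it is not part of spaCy.

```python
# Tokens of sent_0 as (text, dependency label, index of the head token),
# transcribed by hand from the printed output above.
tokens = [
    ("Myriam", "nsubj", 1),
    ("saw", "ROOT", 1),  # the root is its own head
    ("Clement", "dobj", 1),
    ("with", "prep", 1),
    ("a", "det", 5),
    ("telescope", "pobj", 3),
    (".", "punct", 1),
]


def children_of(tokens):
    """Map each token index to the texts of the tokens that depend on it."""
    children = {i: [] for i in range(len(tokens))}
    for i, (text, dep, head) in enumerate(tokens):
        if head != i:  # skip the root's self-loop
            children[head].append(text)
    return children


children = children_of(tokens)
for i, (text, dep, head) in enumerate(tokens):
    print(text, dep, tokens[head][0], children[i])
```

The children list is simply the inverse of the head pointers, which is why the root verb collects every token that names it as its head.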
for token in sent_1:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])

(u'Autonomous', u'amod', u'cars', u'NOUN', [])
(u'cars', u'nsubj', u'shift', u'VERB', [Autonomous])
(u'shift', u'ROOT', u'shift', u'VERB', [cars, liability, .])
(u'insurance', u'compound', u'liability', u'NOUN', [])
(u'liability', u'dobj', u'shift', u'VERB', [insurance, toward])
(u'toward', u'prep', u'liability', u'NOUN', [manufacturers])
(u'manufacturers', u'pobj', u'toward', u'ADP', [])
(u'.', u'punct', u'shift', u'VERB', [])

This parse looks more interesting - we can see how the verb shift is linked,
directly or indirectly, to every other word in the sentence. We now have an
exercise for the reader - using the information presented above, draw your own
dependency graph for the sentence, and verify it using displaCy.
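As a starting point for the exercise, here is a rough, library-free sketch that prints the same information as an indented text tree. The tokens table is copied by hand from the output above, and render_tree is a hypothetical helper written for this illustration, not a spaCy or displaCy function.

```python
# (text, dependency label, index of the head token), copied from the output above.
tokens = [
    ("Autonomous", "amod", 1),
    ("cars", "nsubj", 2),
    ("shift", "ROOT", 2),
    ("insurance", "compound", 4),
    ("liability", "dobj", 2),
    ("toward", "prep", 4),
    ("manufacturers", "pobj", 5),
    (".", "punct", 2),
]


def render_tree(tokens):
    """Return the dependency tree as indented lines, root first."""
    root = next(i for i, (_, _, head) in enumerate(tokens) if head == i)
    lines = []

    def visit(index, depth):
        text, dep, _ = tokens[index]
        lines.append("{}{} ({})".format("  " * depth, text, dep))
        # Visit dependents in sentence order.
        for child, (_, _, head) in enumerate(tokens):
            if head == index and child != index:
                visit(child, depth + 1)

    visit(root, 0)
    return lines


tree_lines = render_tree(tokens)
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one arc in the dependency graph, so the drawing you make by hand should have the same shape.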
for token in sent_2:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])

(u'I', u'nsubj', u'shot', u'VERB', [])
(u'shot', u'ROOT', u'shot', u'VERB', [I, elephant, .])
(u'the', u'det', u'elephant', u'NOUN', [])
(u'elephant', u'dobj', u'shot', u'VERB', [the, in])
(u'in', u'prep', u'elephant', u'NOUN', [pyjamas])
(u'my', u'poss', u'pyjamas', u'NOUN', [])
(u'pyjamas', u'pobj', u'in', u'ADP', [my])
(u'.', u'punct', u'shot', u'VERB', [])

Our last example is simple and does not present anything out of the ordinary.
Let's now look at other ways we can navigate this tree. We've spoken about
how each sentence has exactly one root (the head of the whole sentence), and
sometimes we would want to identify this. One way to do this is to iterate
from below, that is, to iterate over possible subjects instead of possible verbs.
For example, iterating through subjects would look like this:
from spacy.symbols import nsubj, VERB
verbs = set()
for possible_subject in sent_1:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        # the head of a nominal subject is the verb we are looking for
        verbs.add(possible_subject.head)

We've iterated through all the words and checked cases where we have a
nominal subject (nsubj), and where the head of that word is a verb. Running
this for sentence 1 and printing verbs gives us a set containing a single
token, shift, which is what we expect to see!
It is also possible to search for verbs directly instead, but this needs a
nested loop over the children of every verb.
The doc variable is a placeholder variable and you will need to pass your own document.

verbs = []
for possible_verb in doc:
    if possible_verb.pos == VERB:
        for possible_subject in possible_verb.children:
            if possible_subject.dep == nsubj:
                verbs.append(possible_verb)
                break  # one subject is enough to keep this verb

While this gives us the same result, note that there are two for loops.
spaCy also provides us with some useful attributes such as lefts, rights, n_rights,
and n_lefts. These give us information about what is on the left of a particular
token in a tree, what is on the right, and the number of tokens on either side.
Let's take a look at this example to find phrases using the syntactic head.
root = [token for token in sent_1 if token.head == token][0]
subject = list(root.lefts)[0]
for descendant in subject.subtree:
    assert subject is descendant or subject.is_ancestor(descendant)
    print(descendant.text, descendant.dep_, descendant.n_lefts,
          descendant.n_rights,
          [ancestor.text for ancestor in descendant.ancestors])

We find the root by checking where the head is the token itself. The subject
is to the left of the root, so we take the first of the root's left children.
We then iterate through the subject's subtree and print each descendant, the
number of its left and right children, and its ancestors. Let's look at an
output from one of our sentences after running the preceding code:
(u'Autonomous', u'amod', 0, 0, [u'cars', u'shift'])
(u'cars', u'nsubj', 1, 0, [u'shift'])
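The same two rows can be reproduced from head indices alone, which may help clarify what ancestors, n_lefts, and n_rights measure. This is only a sketch: the tokens table is transcribed from the earlier token-level output, and the three helper functions are hypothetical stand-ins, not spaCy's own implementation.

```python
# (text, dependency label, index of the head token) for the second sentence.
tokens = [
    ("Autonomous", "amod", 1),
    ("cars", "nsubj", 2),
    ("shift", "ROOT", 2),
    ("insurance", "compound", 4),
    ("liability", "dobj", 2),
    ("toward", "prep", 4),
    ("manufacturers", "pobj", 5),
    (".", "punct", 2),
]


def ancestors(tokens, index):
    """Follow head pointers upwards until the root (its own head) is reached."""
    chain = []
    while tokens[index][2] != index:
        index = tokens[index][2]
        chain.append(tokens[index][0])
    return chain


def n_lefts(tokens, index):
    """Count direct dependents that appear before the token."""
    return sum(1 for i, t in enumerate(tokens) if t[2] == index and i < index)


def n_rights(tokens, index):
    """Count direct dependents that appear after the token."""
    return sum(1 for i, t in enumerate(tokens) if t[2] == index and i > index)


# The subject's subtree consists of tokens 0 and 1.
rows = [
    (tokens[i][0], tokens[i][1], n_lefts(tokens, i), n_rights(tokens, i),
     ancestors(tokens, i))
    for i in (0, 1)
]
for row in rows:
    print(row)
```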

spaCy's section on dependency parsing [16] has these examples and more
(albeit with fewer explanations), and we highly recommend that you visit the
page. The annotation page [10] is further recommended for reading.
A more realistic use of this is, for example, identifying the adjectives
commonly used to describe a character in a book.
The book variable is a placeholder variable and you will need to pass your own document.

adjectives = []
for sent in book.sents:
    for word in sent:
        if 'Character' in word.text:
            for child in word.children:
                if child.pos_ == 'ADJ':
                    # collect every adjective attached to the character's name
                    adjectives.append(child.text)

The code itself remains very simple but does the job effectively. We iterate
over our book's sentences, look for our character in the sentence, look for the
children of that character, and check if the child is an adjective. Being a
child means the word has been marked as a dependent of the character's name
(here, Character stands for whichever character we are interested in), so an
adjective child is likely to be describing that character. By checking the
most common adjectives, we can do a mini-analysis of the characters in our books.
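Counting the most common adjectives is a small step once the list exists. The sketch below assumes a hand-made adjectives list purely for illustration (the words are invented, not taken from any book) and uses the standard library's Counter.

```python
from collections import Counter

# Hypothetical adjectives collected for one character; in practice this list
# would be filled by a loop over the parsed book.
adjectives = ["brave", "young", "brave", "stubborn", "young", "brave"]

adjective_counts = Counter(adjectives)
top_adjectives = adjective_counts.most_common(2)
print(top_adjectives)
```

Lower-casing or lemmatizing the adjectives before counting is a design choice worth considering, since it merges variants of the same word.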
Let's now move on to training our own parsers!

Training our dependency parsers
Again, if you have read Chapter 4, Gensim - Vectorizing Text and
Transformations and n-grams, Chapter 5, POS-Tagging and Its applications,
and Chapter 6, NER-Tagging and Its applications, then you will be
comfortable with the theory behind training our own models in spaCy. We
recommend that you go back and read the Vector transformations in
Gensim section from chapter 4 and the Training our own POS-taggers section
from chapter 5 to refresh your ideas on what exactly training means in
the context of machine learning and, in particular, spaCy.
Again, the advantage with spaCy is that we don't need to care about the
algorithm being used under the hood, or which features are the best to select
for dependency parsing - this is usually the hardest part of machine learning
research. A suitable learning algorithm has already been chosen for us, and
all we have to care about is passing appropriate training examples and
setting up the API calls so that our models are updated correctly. This is
what we will be doing in the next two code examples.
The first of the two shows us how to update a dependency parser starting
off with a blank model, and the source code can be found in the file [17].
from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy

As usual, we start with imports, before moving on to our training data.
# training data
TRAIN_DATA = [
    ("They trade mortgage-backed securities.", {
        'heads': [1, 1, 4, 4, 5, 1, 1],
        'deps': ['nsubj', 'ROOT', 'compound', 'punct', 'nmod', 'dobj',
                 'punct']
    }),
    ("I like London and Berlin.", {
        'heads': [1, 1, 1, 2, 2, 1],
        'deps': ['nsubj', 'ROOT', 'dobj', 'cc', 'conj', 'punct']
    })
]
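Because the heads and deps lists are written by hand, it can be worth sanity-checking them before training. The following is a minimal sketch under the assumption that each example follows the conventions visible in the training data above: one head index and one label per token, and a single ROOT token that is its own head. The function check_annotation is hypothetical and not part of spaCy.

```python
def check_annotation(heads, deps):
    """Return a list of problems found in one heads/deps annotation."""
    problems = []
    if len(heads) != len(deps):
        problems.append("heads and deps differ in length")
        return problems
    if any(h < 0 or h >= len(heads) for h in heads):
        problems.append("head index out of range")
    roots = [i for i, dep in enumerate(deps) if dep == "ROOT"]
    if len(roots) != 1:
        problems.append("expected exactly one ROOT")
    elif heads[roots[0]] != roots[0]:
        problems.append("ROOT token should be its own head")
    return problems


# Annotation for "I like London and Berlin." from the training data.
example = {
    "heads": [1, 1, 1, 2, 2, 1],
    "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
}
problems = check_annotation(example["heads"], example["deps"])
print(problems or "annotation looks consistent")
```

A check like this catches off-by-one mistakes in the head indices early, before they silently degrade the trained parser.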

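We need to give examples of heads and dependency labels in our training
data. A quick glance at our training data can confirm this; in both examples,
the root verb is the word at index 1 (indices start at 0), every head entry is
the index of that token's head, and the dependency labels are ones we have
already seen in this chapter.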
@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=10):
    """Load the model, set up the pipeline and train the parser."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank('en')  # create blank Language class
        print("Created blank 'en' model")

This step is again similar to our other training examples: we load an
existing model if one is named, and otherwise create a blank model.
    # add the parser to the pipeline if it doesn't exist
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'parser' not in nlp.pipe_names:
        parser = nlp.create_pipe('parser')
        nlp.add_pipe(parser, first=True)
    # otherwise, get it, so we can add labels to it
    else:
        parser = nlp.get_pipe('parser')

The comments are fairly self-explanatory here; we add a parser to the
pipeline if it doesn't exist, and if it does, we fetch it so that we can add
labels to it.
    # add labels to the parser
    for _, annotations in TRAIN_DATA:
        for dep in annotations.get('deps', []):
            parser.add_label(dep)
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'parser']
    with nlp.disable_pipes(*other_pipes):  # only train parser
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update([text], [annotations], sgd=optimizer,
                           losses=losses)
            print(losses)

We follow the same process as the training examples of the previous chapter:
we add the labels, and disable the other parts of the pipeline so that we're
only training the parser.
    # test the trained model
    test_text = "I like securities."
    doc = nlp(test_text)
    print('Dependencies', [(t.text, t.dep_, t.head.text) for t in doc])
    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc = nlp2(test_text)
        print('Dependencies', [(t.text, t.dep_, t.head.text) for t in doc])

The final steps involve testing our trained model and saving it to the
appropriate directory.
if __name__ == '__main__':
    plac.call(main)

On running the main file, we should expect to see the following output:
('I', 'nsubj', 'like'),
('like', 'ROOT', 'like'),
('securities', 'dobj', 'like'),
('.', 'punct', 'like')

While the preceding training example was rather vanilla, following
the exact same style as the POS and NER-taggers, we can do a lot more
interesting things with parsing; for example, adding our own custom
dependencies.
What does this mean? We can now train our parsers to understand new
semantic relationships or dependencies between words. The spaCy
documentation page gives us the following example to illustrate this:

Fig 7.6 Dependency parsing with an additional dependency, "quality"

This is particularly interesting because we can model our own dependencies
that are useful for our particular use-cases. We must keep in mind that the
result may no longer be a linguistically correct dependency parse, but it is
still useful for encapsulating relationships between words.
The code to do this training can be found in the file [18].
From the comments in the file, in this example, we'll build a message parser
for a common chat intent: finding local businesses. Our message semantics
will have the following types of relations: ROOT, PLACE, QUALITY, ATTRIBUTE, TIME,
and LOCATION. For example:
"show me the best hotel in berlin"
('show', 'ROOT', 'show')
('best', 'QUALITY', 'hotel') --> hotel with QUALITY best
('hotel', 'PLACE', 'show') --> show PLACE hotel
('berlin', 'LOCATION', 'hotel') --> hotel with LOCATION berlin

Let's now start with the code.
from __future__ import unicode_literals, print_function
import plac
import random
import spacy
from pathlib import Path
# training data: texts, heads and dependency labels
# for no relation, we simply chose an arbitrary dependency label, e.g. '-'
TRAIN_DATA = [
    ("find a cafe with great wifi", {
        'heads': [0, 2, 0, 5, 5, 2],  # index of token head
        'deps': ['ROOT', '-', 'PLACE', '-', 'QUALITY', 'ATTRIBUTE']
    }),
    ("find a hotel near the beach", {
        'heads': [0, 2, 0, 5, 5, 2],
        'deps': ['ROOT', '-', 'PLACE', 'QUALITY', '-', 'ATTRIBUTE']
    }),
    ("find me the closest gym that's open late", {
        'heads': [0, 0, 4, 4, 0, 6, 4, 6, 6],
        'deps': ['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', '-',
                 'ATTRIBUTE', 'TIME']
    }),
    ("show me the cheapest store that sells flowers", {
        'heads': [0, 0, 4, 4, 0, 4, 4, 4],  # attach "flowers" to store!
        'deps': ['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', '-', 'PRODUCT']
    }),
    ("find a nice restaurant in london", {
        'heads': [0, 3, 3, 0, 3, 3],
        'deps': ['ROOT', '-', 'QUALITY', 'PLACE', '-', 'LOCATION']
    }),
    ("show me the coolest hostel in berlin", {
        'heads': [0, 0, 4, 4, 0, 4, 4],
        'deps': ['ROOT', '-', '-', 'QUALITY', 'PLACE', '-', 'LOCATION']
    }),
    ("find a good italian restaurant near work", {
        'heads': [0, 4, 4, 4, 0, 4, 5],
        'deps': ['ROOT', '-', 'QUALITY', 'ATTRIBUTE', 'PLACE', 'ATTRIBUTE',
                 'LOCATION']
    })
]

It is worth our while to take a close look at the training examples. As
mentioned in the comments, ROOT, PLACE, QUALITY, ATTRIBUTE, TIME, and LOCATION are our
new dependencies. Our examples illustrate this, with some of the qualities in
our examples being coolest, good, great, and closest. Words like near and open are
marked as attributes to distinguish them from qualities. Place, time, and
location are dependencies that are quite clear as well. This kind of
information can be very useful when building a semantic information graph.
@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=5):
    """Load the model, set up the pipeline and train the parser."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank('en')  # create blank Language class
        print("Created blank 'en' model")
    # We'll use the built-in dependency parser class, but we want to create
    # a fresh instance - just in case.
    if 'parser' in nlp.pipe_names:
        nlp.remove_pipe('parser')
    parser = nlp.create_pipe('parser')
    nlp.add_pipe(parser, first=True)
    for text, annotations in TRAIN_DATA:
        for dep in annotations.get('deps', []):
            parser.add_label(dep)

The training examples remain the only real change; we can see that this step
mirrors the previous training example.
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'parser']
    with nlp.disable_pipes(*other_pipes):  # only train parser
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update([text], [annotations], sgd=optimizer,
                           losses=losses)
            print(losses)
    # test the trained model
    test_model(nlp)
    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        test_model(nlp2)

def test_model(nlp):
    texts = ["find a hotel with good wifi",
             "find me the cheapest gym near work",
             "show me the best hotel in berlin"]
    docs = nlp.pipe(texts)
    for doc in docs:
        print(doc.text)
        # skip tokens that carry the placeholder '-' label
        print([(t.text, t.dep_, t.head.text) for t in doc
               if t.dep_ != '-'])

if __name__ == '__main__':
    plac.call(main)

The rest of the steps follow suit; let's have a look at the result when we run
the main module.
find a hotel with good wifi
('find', 'ROOT', 'find'),
('hotel', 'PLACE', 'find'),
('good', 'QUALITY', 'wifi'),
('wifi', 'ATTRIBUTE', 'hotel')
find me the cheapest gym near work
('find', 'ROOT', 'find'),
('cheapest', 'QUALITY', 'gym'),
('gym', 'PLACE', 'find')
('work', 'LOCATION', 'near')
show me the best hotel in berlin
('show', 'ROOT', 'show'),
('best', 'QUALITY', 'hotel'),
('hotel', 'PLACE', 'show'),
('berlin', 'LOCATION', 'hotel')

And voila, it's what we expect to see!
This example illustrates the real power spaCy has when creating our custom
models; we can not only retrain our models with domain-specific data to
work better for our specifications, but also train completely new
dependencies. Couple that with a very easy-to-use training API, and it
becomes clear why spaCy is such a strong choice for practical NLP applications.
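To hint at how such custom relations might feed a downstream application, the following sketch folds the triples from the last result into a small dictionary. The triples are copied from the output above; the summarise function and the shape of the resulting dictionary are assumptions made for this illustration, not something spaCy provides.

```python
# (token, relation, head) triples copied from the last result above.
triples = [
    ("show", "ROOT", "show"),
    ("best", "QUALITY", "hotel"),
    ("hotel", "PLACE", "show"),
    ("berlin", "LOCATION", "hotel"),
]


def summarise(triples):
    """Group the custom relations around the PLACE the user asked for."""
    summary = {"action": None, "place": None, "details": {}}
    # First pass: find the requested action and place.
    for token, relation, head in triples:
        if relation == "ROOT":
            summary["action"] = token
        elif relation == "PLACE":
            summary["place"] = token
    # Second pass: attach everything that depends on the place.
    for token, relation, head in triples:
        if head == summary["place"] and relation not in ("ROOT", "PLACE"):
            summary["details"].setdefault(relation, []).append(token)
    return summary


intent = summarise(triples)
print(intent)
```

A structure like this is one possible starting point for the semantic information graph mentioned earlier, since every relation is already attached to the word it describes.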

Some useful links which might also be relevant when considering spaCy and
dependency parsing:

1. Dependency Tree with spaCy [19]

2. Parsing English in 500 Lines of Python [20]

This brings us to the end of our chapter on spaCy and dependency parsing.
The previous four chapters have illustrated the many powers of spaCy, and
how we can harness these powers. Dependency parsing, in particular,
remains very important to us as finding semantic or syntactic relationships
between words within sentences can have many uses, whether it is simply
identifying the most used adjectives or adverbs for a particular word or
mapping custom relationships.
In the next chapters, we will move on from computational linguistics-based
algorithms to information retrieval-based algorithms to conduct our text
analysis. In particular, we will look at topic models as well as clustering
and classification algorithms.

[1] Introduction to Structural Syntax:

[2] Term Logic:

[3] Noam Chomsky:

[4] Image link:

[5] Phrase Structure Grammar:

[6] Dependency Grammar and Dependency Parsing:

[7] Dependency Grammar:

[8] CLEAR style:

[9] displaCy:

[10] spaCy annotation page:

[11] Stanford Dependency Parser:

[12] Stanford Statistical Parser:

[13] NLTK Dependency Grammars:

[14] Analysing Sentence Structure:

[15] Configuring Stanford Parser and Stanford NER Tagger with NLTK in
Python on Windows and Linux:

[16] spaCy dependency parsing:

[17] spaCy train parser:

[18] spaCy train intent parser:

[19] How to get the dependency tree with spaCy?:

[20] Parsing English in 500 Lines of Python:

Topic Models
Until now, we dealt with computational linguistics algorithms and spaCy,
and we understood how to use these computational linguistic algorithms to
annotate our data, as well as understand sentence structure. While these
algorithms helped us understand the finer details of our text, we still didn't
get a big picture of our data - what kind of words appear more often than
others in our corpus? Can we group our data or find underlying themes? We
will be attempting to answer these questions and more in this
chapter. Following are the topics we will cover in this chapter:
What are topic models?
Topic models in Gensim
Topic models in scikit-learn

What are topic models?
We will now make our first foray into probabilistic models and machine
learning with text. We did, of course, come across such models earlier on (in
Chapter 5, POS-Tagging and Its Applications, Chapter 6, NER-Tagging and Its
Applications, and Chapter 7, Dependency Parsing), especially in the way we
trained our NER and POS taggers, but our goal in the previous chapters was
not to come up with a statistical model involving our text data.
What is a topic model? As the name might suggest, it is a probabilistic model
which contains information about topics in the text. We now must ask what
exactly a topic is - we can understand a topic as a theme, or underlying idea,
represented in text. For example, if we are working with a corpus of
newspaper articles, possible topics would be weather, politics, sport, and so
on.
Why would such topic models be important in the world of text processing?
Traditionally, information retrieval and searching techniques involved using
words to identify similarity or relevance - now, we can instead search and
arrange our files more broadly, with topics instead of words. But what exactly
are topics? They are distributions of words - in particular, probability
distributions over the words of our vocabulary. We can further use this model
to describe our documents as probability distributions over topics. Since we
know the words and the counts of the words in documents, we can use this
knowledge to generate these topic models. Once we have our topic model, we can
start representing all our documents as topic distributions!
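The two levels of distribution can be illustrated with a few made-up numbers. The sketch below assumes two topics over a five-word vocabulary and one document described as a mixture of those topics; every probability here is invented for illustration. It uses the standard mixture assumption that the probability of a word in a document is the topic-weighted sum of its probability in each topic.

```python
# Hypothetical numbers: two topics, each a probability distribution over words.
topics = {
    "topic 0": {"sun": 0.4, "wind": 0.3, "storm": 0.2, "vote": 0.05, "party": 0.05},
    "topic 1": {"sun": 0.05, "wind": 0.05, "storm": 0.1, "vote": 0.4, "party": 0.4},
}

# One document described as a probability distribution over topics.
document_topics = {"topic 0": 0.75, "topic 1": 0.25}


def word_probability(word):
    """Probability of a word in the document: sum over topics of
    P(topic | document) * P(word | topic)."""
    return sum(weight * topics[name][word]
               for name, weight in document_topics.items())


probabilities = {word: word_probability(word) for word in topics["topic 0"]}
for word, probability in probabilities.items():
    print(word, round(probability, 4))
```

Note that the document is now described by two numbers (its topic weights) rather than by one count per vocabulary word.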
So, this means in our newspaper corpus we discussed before, instead of
clustering based on TF-IDF or bag-of-words, we can now cluster according
to the topics. We can also explore the documents in each topic, and further
zoom in on these documents to better understand the topics, or themes.
Creating topic models for your text corpus is also useful when we want to
explore our dataset, to see what kind of documents our corpus contains, by
just observing the topics.
By arranging our documents in chronological order, we can further see how
documents in a topic evolved over time. Why is this interesting, or useful?
When time-arranged documents from the research journal Science were
topic modeled keeping time-stamps in mind (a technique called Dynamic
Topic Modeling), the results were particularly fascinating.
The topic that we associated with atomic physics started in 1881 with a
high chance of finding the words matter, motion, and light. By the year 1999,
the top words under the same topic had become state, energy, and electron!
You can see how we used a topic model that takes time-stamps into account
to see how a topic's words evolved over time - topic models allow us to look at
and understand our data in ways we could not have done before.
We must keep in mind, however, that a topic is merely a probabilistic
distribution of words, and doesn't create its own label, or title. For example,
the topic that we would call the weather topic in the newspaper corpus would
just be a collection of words (such as sun, temperature, wind, storm,
and forecast), with the associated probability of those words appearing in the
topic. A topic such as the weather topic would contain the words we
previously mentioned with a high probability of appearing on that topic. By
arranging the words according to probability, we can get an idea of what the
topic represents. Of course, in our code, these topics would simply be called
topic 0, topic 1, topic 2... topic n-1, where n is the total number of topics we
wish to identify in our corpus. At this point, a human will simply have to
assign whatever topic label they would like to the probability distribution.
Now that we have documents as a representation of topics instead of words,
we are effectively reducing the dimensions of our data (documents or
articles), from the total vocabulary size to the number of topics. In fact, one
of the earliest IR algorithms, latent semantic analysis [1] does pretty much
this, inadvertently, and by reducing the dimensions we get a representation
of topics in that corpus.
We've discussed a fair amount about what topic models are - but how do you
generate them? There is more than one way to do this, and we will be using
Gensim [2] to create our models, which has implementations of Latent
Dirichlet Allocation (LDA), Latent semantic analysis (LSA),
Hierarchical Dirichlet Process (HDP), and Dynamic Topic
Modelling (DTM) to help us with this. All of these algorithms have a few
things in common - they assume words in documents have underlying
probabilistic distributions and attempt to find out these distributions. These
distributions end up being our topics. The way we attempt to identify these
distributions (which is with mathematical and statistical techniques) is what
makes these algorithms different.
As for the mathematical foundation of these topic models, that is beyond the
scope of the book, but the paper by Blei et al which describes LDA [3] is a
great read. A more casual approach to understanding how this works is
Edwin Chen's blog post [4]. This Quora article [5] also has a nice repository of
explanations of LDA, which might require a little mathematical background.
This paper by Blei, titled Probabilistic Topic Models [6] is also a nice resource
which sums up all the kinds of topic models that have been developed so far.

Topic models in Gensim
Gensim [2] is arguably the most popular topic modeling toolkit freely
available, and it being in Python means that it fits right into our ecosystem.
Gensim's popularity is because of its wide variety of topic modeling
algorithms, straightforward API, and active community. Of course, we have
already introduced Gensim before, in Chapter 4, Gensim - Vectorizing Text and
Transformations and n-grams, on vector spaces. We need to
know how to set up our corpus for the topic modeling algorithms we will be
using, so now is a good time to brush up on the contents of the Vector
transformation in Gensim section, in Chapter 4, Gensim - Vectorizing Text and
Transformations and n-grams.
All done? Now we can start using the powerful tools that Gensim has to
offer. The Jupyter notebook [7] runs us through the same corpus generating
techniques we previously discussed, as well as loading the Lee Newspaper
corpus, which is found in the Gensim code base. The notebook will be
attached at the end of this chapter. The code in the notebook is in Python 2.7
to accommodate all users, and it also works in Python 3. The corpus contains
headline texts of about 300 documents from 2000-2001.
More information on this corpus can be found in this research paper, An Empirical Evaluation of Models of Text
Document Similarity [8].

This corpus will be useful in illustrating how topic models work because it is
large enough for us to have coherent topics, but not so large that training
takes a long time.
We won't be focusing as much on the how but more on the what - though it
is highly encouraged that you have a look at what's happening under the
hood, and we will be linking to the relevant reading material as we describe
how to use these algorithms. The reason we choose to ignore the how is
because of how well Gensim abstracts it for us, and because interpreting the
results - which is the what of topic modeling - is challenging enough in
itself.
Just to remind ourselves of the kind of data we are dealing with, let's look at
what texts and corpus look like. This would be after the 8th and 9th cells in the
Jupyter notebook.


[(51, 1),
(53, 1),
(95, 1),
(108, 1),


Texts contain the tokenized and cleaned version of the original text data and
the corpus is our bag of words representation, which we will feed into our
machine learning algorithms.
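If the bag-of-words format is unfamiliar, the following library-free sketch mimics it on three tiny, invented documents. It only imitates the shape of the data, a list of (token id, count) pairs per document; the ids that Gensim's own dictionary assigns will generally differ, and the names token2id and to_bow are used here purely for illustration.

```python
from collections import Counter

# Hypothetical tokenized and cleaned documents standing in for texts.
texts = [
    ["fire", "area", "sydney", "fire"],
    ["company", "union", "worker"],
    ["fire", "union"],
]

# Assign an integer id to every distinct token, in order of first appearance.
token2id = {}
for text in texts:
    for token in text:
        token2id.setdefault(token, len(token2id))


def to_bow(text):
    """Convert a list of tokens into sorted (token id, count) pairs."""
    counts = Counter(token2id[token] for token in text)
    return sorted(counts.items())


corpus = [to_bow(text) for text in texts]
print(corpus[0])
```

Word order is discarded in this representation; only which words occur, and how often, is kept.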

Latent Dirichlet allocation
Let's start with the most popular topic modeling algorithm - latent Dirichlet
allocation, or LDA as we called it before. The LDA model was created in 2003
by Blei and others and is described in the paper, Latent Dirichlet Allocation
[3].
As we discussed before, LDA helps us model a corpus based on topic
distributions, which are in turn made of word distributions. What exactly is a
distribution of words? Gensim lets us understand and use this very easily.
Cells 15 and 16 of the Jupyter notebook let you see this.
ldamodel = LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)

That's how easy it is to create a model - just specify the corpus, the dictionary
mapping, and the number of topics we want to use in our model.
Keep in mind that we imported LdaModel from gensim.models back in the first cell.
Now that we have a trained model, let's look at what topics are hidden in our
corpus.
ldamodel.show_topics()

This gives us the following:
u'0.006*"force" + 0.006*"year" + 0.005*"australian" + 0.004*"new" + 0.004*"afghanistan" +
0.004*"people" + 0.004*"official" + 0.004*"area" + 0.004*"fire" + 0.004*"day"'),
u'0.005*"attack" + 0.005*"people" + 0.004*"man" + 0.004*"group" + 0.004*"report" +
0.004*"company" + 0.003*"australia" + 0.003*"force" + 0.003*"kill" + 0.003*"come"'),
u'0.009*"australia" + 0.005*"australian" + 0.005*"government" + 0.004*"day" + 0.003*"new"
+ 0.003*"united_states" + 0.003*"child" + 0.003*"come" + 0.003*"report" + 0.003*"good"'),
u'0.005*"day" + 0.005*"people" + 0.004*"police" + 0.004*"australian" + 0.004*"australia" +
0.003*"today" + 0.003*"test" + 0.003*"palestinian" + 0.003*"attack" + 0.003*"centre"'),

u'0.008*"australian" + 0.005*"fire" + 0.005*"year" + 0.005*"government" + 0.005*"people" +
0.004*"union" + 0.004*"south" + 0.004*"centre" + 0.003*"company" + 0.003*"day"'),
u'0.008*"israeli" + 0.006*"palestinian" + 0.005*"force" + 0.004*"fire" + 0.004*"people" +
0.004*"kill" + 0.004*"government" + 0.004*"police" + 0.004*"day" + 0.004*"australia"'),
u'0.008*"australian" + 0.007*"year" + 0.006*"world" + 0.005*"australia" + 0.005*"force" +
0.004*"government" + 0.004*"people" + 0.003*"economy" + 0.003*"metre" + 0.003*"win"'),
u'0.005*"government" + 0.004*"australia" + 0.004*"pakistan" + 0.004*"people" +
0.003*"tell" + 0.003*"force" + 0.003*"israeli" + 0.003*"time" + 0.003*"claim" +
u'0.005*"day" + 0.004*"good" + 0.004*"year" + 0.003*"new" + 0.003*"australian" +
0.003*"australia" + 0.003*"wicket" + 0.003*"take" + 0.003*"hour" + 0.003*"area"'),
u'0.005*"people" + 0.005*"australia" + 0.005*"man" + 0.004*"arrest" + 0.004*"union" +
0.004*"tell" + 0.004*"india" + 0.004*"pakistan" + 0.003*"claim" + 0.003*"united_states"')]
Topic models are probabilistic, and you might see different results, with different words, probabilities, and topics.

Let's spend some time understanding the nature of this output.
The first value in the tuple is the topic id, which is how we will identify the
topic. Let's pick up topic 5 and see what we can understand from this.
u'0.008*"israeli" + 0.006*"palestinian" + 0.005*"force" + 0.004*"fire" + 0.004*"people" +
0.004*"kill" + 0.004*"government" + 0.004*"police" + 0.004*"day" + 0.004*"australia"')

What does this mean? This means topic ID 5 is made up of the words israeli,
palestinian, force, fire, and so on, and these are the ones with the highest
probability in the topic. The number that the word is multiplied with (such as
0.008 with israeli) is the probability of that word appearing in that topic
distribution. We can look at the words with the highest probability to
understand the theme of our topic.
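When a topic is only available as a formatted string like the one above, it can be turned back into word and probability pairs with a regular expression. This is a small sketch: parse_topic is a hypothetical helper, and the string is a shortened copy of topic 5. When the model object is at hand, Gensim's show_topic method returns such pairs directly, so parsing is only needed for saved or printed output.

```python
import re

# First five terms of topic 5, copied from the output above.
topic = ('0.008*"israeli" + 0.006*"palestinian" + 0.005*"force" + '
         '0.004*"fire" + 0.004*"people"')


def parse_topic(topic):
    """Turn a formatted topic string into (word, weight) pairs."""
    pairs = re.findall(r'(-?\d+\.\d+)\*"([^"]+)"', topic)
    return [(word, float(weight)) for weight, word in pairs]


pairs = parse_topic(topic)
# Keep the three highest-weighted words as a rough description of the theme.
top_words = [word for word, weight in sorted(pairs, key=lambda p: -p[1])[:3]]
print(top_words)
```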
It's clear that this topic is about the Israel-Palestine conflict, something
which would have definitely been in newspaper headlines in the early 21st
century. Briefly going through the other topics tells you that most topics
have the word Australia in them, which again makes sense because it is an
Australian news dataset.
There is a lot we can do with topic models, such as clustering, coloring the
words in documents, and topic model visualization. We will talk about all of
these further functionalities of topic models in the next chapter (Chapter 9, Advanced
Topic Models); let's first check out the other topic models that Gensim has to
offer.

Latent semantic indexing
Along with LDA, the other algorithm that was first implemented in Gensim
is Latent Semantic Indexing (LSI). Setting up our LSI model simply
requires us to import the model from gensim.models and set it up the same way
we set up our LDA model.
lsimodel = LsiModel(corpus=corpus, num_topics=10, id2word=dictionary)

To see what our topics are, use this:

lsimodel.show_topics(num_topics=5)  # Showing only the top 5 topics

This gives us the following:
u'-0.216*"israeli" + -0.211*"palestinian" + -0.196*"arafat" + -0.181*"force" +
-0.149*"official" + -0.148*"kill" + -0.142*"people" + -0.142*"attack" + -0.129*"government"
+ -0.127*"australian"'),
u'-0.321*"palestinian" + -0.306*"israeli" + -0.299*"arafat" + 0.171*"australia" +
0.166*"australian" + -0.158*"israel" + 0.149*"afghanistan" + -0.137*"sharon" +
-0.134*"hamas" + -0.124*"west_bank"'),
u'-0.266*"afghanistan" + -0.242*"force" + -0.191*"al_qaeda" + 0.180*"fire" +
-0.176*"bin_laden" + -0.153*"pakistan" + 0.138*"good" + 0.138*"sydney" + -0.131*"tora_bora"
+ -0.129*"afghan"'),
u'0.373*"fire" + 0.270*"area" + 0.199*"sydney" + -0.191*"australia" + 0.176*"firefighter"
+ 0.160*"south" + 0.157*"north" + 0.148*"wind" + -0.146*"good" + 0.132*"wales"'),
u'-0.238*"company" + -0.221*"union" + 0.199*"test" + -0.187*"qantas" + -0.152*"australian"
+ 0.145*"good" + 0.141*"match" + 0.137*"win" + -0.136*"government" + -0.136*"worker"')]

As we might expect, the topics are more or less similar to our LDA output. The
Israeli-Palestinian topic emerges once again! For our use-cases, it is okay to
ignore the negative signs before the numbers. The signs are a by-product of the
Singular-Value Decomposition (SVD) [9] performed when LSI is run, and their
significance is not easy to interpret. SVD is a matrix factorization method
that decomposes a matrix into a product of simpler matrices. For more
mathematical information on how LSI actually works, the original paper,
Indexing by Latent Semantic Analysis [10] by Deerwester and others, and the
publication, Probabilistic latent semantic indexing [11], by Hofmann would
serve as useful resources.
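
To make the remark about negative signs concrete, here is a minimal NumPy sketch of the SVD step that underlies LSI. It is not Gensim's implementation, and the small term-document matrix X and its values are made up purely for illustration.

```python
import numpy as np

# Hypothetical term-document count matrix: rows are terms, columns are documents.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

# SVD: X = U @ diag(s) @ Vt, with singular values in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the k strongest directions; each column of U[:, :k] acts as an LSI "topic".
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Flip the sign of the first topic vector together with its document weights.
U_flipped = U[:, :k].copy()
Vt_flipped = Vt[:k, :].copy()
U_flipped[:, 0] *= -1
Vt_flipped[0, :] *= -1
X_k_flipped = U_flipped @ np.diag(s[:k]) @ Vt_flipped

# The reconstruction is identical, so the overall sign of a topic carries no meaning.
print(np.allclose(X_k, X_k_flipped))
```

What we can read from an LSI topic is therefore the relative magnitude of the weights, and whether two words share the same sign within a topic.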

Hierarchical Dirichlet process
The other standard topic modeling algorithm popular in Gensim is the
Hierarchical Dirichlet process (HDP) - it is also a brainchild of Michael
I. Jordan and David Blei. It is different from LDA and LSI because it is
non-parametric - we don't need to specify the number of topics in advance.
Again, to use it in Gensim we need to import the model from gensim.models.
from gensim.models import HdpModel
hdpmodel = HdpModel(corpus=corpus, id2word=dictionary)

Note that we don't need to specify the number of topics.

This will allow us to view the topics:
hdpmodel.show_topics()
u'0.005*israeli + 0.003*arafat + 0.003*palestinian + 0.003*hit + 0.003*west_bank +
0.003*official + 0.002*sharon + 0.002*force + 0.002*afp + 0.002*arrest + 0.002*militant +
0.002*storm + 0.002*hamas + 0.002*strike + 0.002*come + 0.002*military + 0.002*source +
0.002*group + 0.002*soldier + 0.002*kill'),
u'0.004*company + 0.003*administrator + 0.002*yallourn + 0.002*entitlement +
0.002*traveland + 0.002*staff + 0.002*austar + 0.002*union + 0.002*travel + 0.002*employee +
0.002*end + 0.002*cent + 0.002*government + 0.002*remain + 0.002*go + 0.002*seek +
0.002*leave + 0.002*people + 0.002*agreement + 0.002*$'),
u'0.003*airport + 0.003*taliban + 0.002*kill + 0.002*opposition + 0.002*kandahar +
0.002*force + 0.002*night + 0.002*leave + 0.002*man + 0.002*lali + 0.002*near + 0.002*city +
0.001*wound + 0.001*end + 0.001*agha + 0.001*civilian + 0.001*gul + 0.001*people +
0.001*military + 0.001*injure'),
u'0.002*job + 0.002*australian + 0.002*cent + 0.002*read + 0.002*mysticism + 0.002*drop +
0.002*band + 0.001*survey + 0.001*wales + 0.001*olivier + 0.001*beatle + 0.001*week +
0.001*intensive + 0.001*result + 0.001*add + 0.001*alarming + 0.001*harrison + 0.001*cite +
0.001*big + 0.001*song'),
u'0.003*group + 0.003*palestinian + 0.002*government + 0.002*sharon + 0.002*kill +
0.002*choose + 0.002*israeli + 0.002*attack + 0.002*bright + 0.002*call + 0.002*security +
0.002*arafat + 0.002*defend + 0.002*suicide_attack + 0.002*terrorism + 0.002*hamas +
0.001*militant + 0.001*human_right + 0.001*gaza_strip + 0.001*civilian'),

u'0.003*match + 0.003*israeli + 0.002*ask + 0.002*team + 0.002*rafter + 0.002*tennis +
0.002*play + 0.002*not + 0.002*australia + 0.002*guarantee + 0.001*france + 0.001*be +
0.001*role + 0.001*hobart_yacht + 0.001*government + 0.001*kill + 0.001*late + 0.001*attack
+ 0.001*world + 0.001*topple'),
u'0.003*australian + 0.002*afghanistan + 0.002*state + 0.002*reach + 0.002*day +
0.002*head + 0.001*give + 0.001*go + 0.001*couple + 0.001*view + 0.001*plan +
0.001*government + 0.001*crash + 0.001*aware + 0.001*report + 0.001*future + 0.001*editor +
0.001*prevent + 0.001*blake + 0.001*party'),
u'0.004*storm + 0.003*tree + 0.002*ses + 0.002*work + 0.002*sydney + 0.002*damage +
0.002*hornsby + 0.002*service + 0.002*area + 0.002*home + 0.002*call + 0.002*bad + 0.001*hit
+ 0.001*bring + 0.001*australia + 0.001*afternoon + 0.001*power + 0.001*large +
0.001*electricity + 0.001*sutherland'),
u'0.004*arrest + 0.003*indonesia + 0.002*year + 0.002*smuggle + 0.002*howard +
0.002*agreement + 0.002*summit + 0.002*police + 0.002*president + 0.002*australia +
0.002*people + 0.002*megawati + 0.001*meeting + 0.001*palestinian + 0.001*meet +
0.001*council + 0.001*leader + 0.001*loya + 0.001*structure + 0.001*host'),
u'0.004*director + 0.003*friedli + 0.003*india + 0.002*union + 0.002*reply + 0.002*day +
0.002*unwell + 0.002*mistake + 0.002*report + 0.002*ask + 0.002*river + 0.002*sector +
0.001*unforeseeable + 0.001*australia + 0.001*people + 0.001*court + 0.001*trip +
0.001*australians + 0.001*swiss + 0.001*people_die'),
u'0.003*guide + 0.003*adventure_world + 0.002*people + 0.002*canyon + 0.002*interlaken +
0.002*charge + 0.002*year + 0.002*tourist + 0.002*republic + 0.001*swiss + 0.001*tragedy +
0.001*atrocity + 0.001*tomorrow + 0.001*include + 0.001*inexperienced + 0.001*kill +
0.001*change + 0.001*sweep + 0.001*allow + 0.001*court'),
u'0.002*australian + 0.002*commission + 0.002*company + 0.002*call + 0.002*people +
0.002*collapse + 0.001* + 0.001*power + 0.001*theatre + 0.001*martin + 0.001*begin +
0.001*dickie + 0.001*wisdom + 0.001*refund + 0.001*national + 0.001*include +
0.001*determine + 0.001*arafat + 0.001*procedural + 0.001*today'),
u'0.002*high + 0.002*lee + 0.001*year + 0.001*inject + 0.001*match + 0.001*lockett +
0.001*passage + 0.001*casa + 0.001*day + 0.001*test + 0.001*compare + 0.001*bond +
0.001*presence + 0.001*outlook + 0.001*osaka + 0.001*canada + 0.001*maintenance_worker +
0.001*china + 0.001*game + 0.001*$'),
u'0.003*krishna + 0.003*ash + 0.002*hare + 0.002*ganges + 0.002*harrison + 0.002*ceremony
+ 0.002*hindu + 0.002*devotee + 0.002*sect + 0.002*hundred + 0.002*holy + 0.002*river +
0.002*closely + 0.002*benares + 0.001*task + 0.001*scatter + 0.001*place + 0.001*devout +
0.001*official + 0.001*rescue'),
u'0.003*harrison + 0.002*george + 0.002*beatle + 0.002*die + 0.002*tonight +
0.002*liverpool + 0.002* + 0.002*memory + 0.002*music + 0.002*seventh + 0.001*decisive +
0.001*percent + 0.001*hold + 0.001*silence + 0.001*people + 0.001*tree + 0.001*minute +
0.001*pole + 0.001*stabbing + 0.001*plant'),
u'0.003*strong + 0.003*economy + 0.002*forward + 0.002*australia + 0.002*olympic +
0.002*hoon + 0.002*follow + 0.002*proposal + 0.002*extensive + 0.002*australian + 0.002*year
+ 0.001*goner + 0.001*mystery + 0.001*haggle + 0.001*constitutional + 0.001*fazalur +
0.001*weekend + 0.001*limit + 0.001*term + 0.001*set'),
u'0.002*tell + 0.002*launceston + 0.002*virgin + 0.002*airline + 0.002*terminal +
0.002*flight + 0.001*daily + 0.001*melbourne + 0.001*morning + 0.001*new + 0.001*second +

0.001*check + 0.001*sherrard + 0.001*administrator + 0.001*shot + 0.001*sabotage +
0.001*unacceptable + 0.001*coroner + 0.001*ansett + 0.001*hayden'),
u'0.002*choose + 0.002*aids + 0.002*hiv + 0.001*official + 0.001*state_emergency +
0.001*reporter + 0.001*europe + 0.001*soviet + 0.001*find + 0.001*late + 0.001*rush +
0.001*double + 0.001*today + 0.001*union + 0.001*number_people + 0.001*service +
0.001*report + 0.001*arabian + 0.001*footing + 0.001*state'),
u'0.003*know + 0.002*accident + 0.002*company + 0.002*carry + 0.002*organise +
0.002*region + 0.002*charge + 0.001*appear + 0.001*loot + 0.001*defunct + 0.001*market +
0.001*question + 0.001*live + 0.001*accuse + 0.001*initially + 0.001*rhino + 0.001*stephan +
0.001*canyoning + 0.001*possibility + 0.001*bayu'),
u'0.003*afghanistan + 0.003*powell + 0.002*taliban + 0.002*southern + 0.002*want +
0.002*developer + 0.001*face + 0.001*marines + 0.001*officer + 0.001*bin_laden +
0.001*pakistan + 0.001*kilometre + 0.001*united_states + 0.001*kandahar + 0.001*vacate +
0.001*force + 0.001*ground + 0.001*troop + 0.001*time + 0.001*secretary')]

Again, we can see similar topics cropping up in our results. HDP
is particularly interesting to us because it differs from the two
previous methods in being non-parametric, and in offering us the capacity to
cluster our topics according to hierarchies. The paper describing HDP was
among the proceedings at NIPS [12] and is titled Sharing Clusters Among
Related Groups: Hierarchical Dirichlet Processes [13].

Dynamic topic models
While our previous topic models focused on identifying topics across the
whole corpus, the next topic model that we introduce also takes into account
the time frame in which the document exists. Using this additional
information, we can then model our topics in each time frame and try to
understand how these topics evolve over time.
The nature of the topics is fixed in the first time frame - we are not likely to
see the introduction of a new topic as time passes, but can rather see how
these topics change over time - in particular, we can see which words
replace which. We gave an example of this in our introductory section,
What are topic models?, where we talked about the Atomic Physics topic.
The Jupyter notebook I've written for Gensim covers large parts of the
theory, as well as all possible uses of the dynamic topic model. The notebook
is included in the appendix, and you can also find it on GitHub in the Gensim
repository [14].

Topic models in scikit-learn
Gensim isn't the only package offering us the ability to topic model:
scikit-learn, while not dedicated to text, still offers fast implementations of
LDA and Non-negative Matrix Factorization (NMF), which can help us
identify topics.
We already discussed how LDA works, and the only differences between the
Gensim and scikit-learn implementations are as follows:

1. The perplexity bounds are not expected to agree exactly here because the
bound is calculated differently in Gensim versus sklearn. These bounds
are a way to measure how the topics converge in topic modeling algorithms.

2. Sklearn uses Cython, which creates numerical differences at around the
6th decimal point.
Non-negative matrix factorization (NMF) [15], unlike LDA, is not a method
mostly limited to text mining (though interestingly, LDA's variants have also
been used in genetics and image processing). NMF [16] is a linear algebra
method that involves factorizing a single matrix V into two matrices, W
and H. These matrices, when multiplied with each other, approximately
reconstruct V. W and H are then used to identify our topics as they best
represent the original matrix, V. Here the matrix V is the document-term
matrix that contains information about which words are in which
documents.
Another key aspect of NMF is that the matrices must have no negative
elements. This non-negativity makes the resulting matrices easier to inspect.
Also, in applications such as processing of audio spectrograms or text
processing, non-negativity is inherent to the data being considered. Since the
problem is not exactly solvable in general, it is commonly approximated
numerically, using various distance measures to do this. Euclidean distance,
which we commonly use in 2-dimensional space, is one such measure, and the
Kullback-Leibler divergence [17] is another, more complex one. This
factorization can be used, for example, for dimensionality reduction, source
separation, or topic extraction. In our example, we use the generalized
Kullback-Leibler divergence, which is equivalent to Probabilistic Latent
Semantic Indexing (PLSI) [1] [11].
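
To see the factorization idea in action, here is a minimal NumPy sketch of NMF using the classic multiplicative update rules. Note the assumptions: it minimizes the simpler Euclidean objective rather than the Kullback-Leibler one, it is not scikit-learn's implementation, and the matrix V and the function name nmf_multiplicative are made up for illustration.

```python
import numpy as np

def nmf_multiplicative(V, n_topics, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (documents x terms) as W @ H."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    # Start from small random non-negative factors.
    W = rng.random((n_docs, n_topics)) + eps
    H = rng.random((n_topics, n_terms)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a ratio of non-negative quantities,
        # so W and H can never become negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical document-term matrix with two loosely separated word groups.
V = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 2.0],
    [0.0, 1.0, 2.0, 3.0],
])
W, H = nmf_multiplicative(V, n_topics=2)

# Euclidean distance between V and its reconstruction W @ H.
error = np.linalg.norm(V - W @ H)
print(error)
```

Here each row of H can be read as a topic (weights over terms), and each row of W tells us how strongly a document uses each topic.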
Scikit-learn has a very straightforward API which makes it appealing to use,
also because of the high level of consistency it achieves across all of its
models - most of which have fit, transform, and predict methods based on
the model's purpose. In our case, since they are decomposition models, we
will only be using the fit method and using the model's components to print
our topics. Let's look at some code that trains two models and prints the
topics:
from sklearn.decomposition import NMF, LatentDirichletAllocation
no_topics = 10
nmf = NMF(n_components=no_topics).fit(tfidf_corpus)
# n_components replaces the older n_topics argument in current scikit-learn
lda = LatentDirichletAllocation(n_components=no_topics).fit(tf_corpus)

Here, tfidf_corpus and tf_corpus are the TF-IDF and TF transformed corpora; you
can build them either with Gensim or with scikit-learn. Further on,
tf_feature_names and tfidf_feature_names are the lists that contain the entire
vocabulary arranged alphabetically; you could equally use Gensim's dictionary
here.
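
As a rough sketch of how these inputs could be built with scikit-learn, consider the following. The documents list is a made-up stand-in for the real corpus, and the feature_names helper is our own illustrative function, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical stand-in documents; in the chapter these would be the Lee corpus.
documents = [
    "bushfire threatens homes near sydney",
    "firefighters battle bushfire in new south wales",
    "israeli forces and palestinian officials hold talks",
    "palestinian leader meets israeli officials",
]

# Raw term counts, the usual input for LDA.
tf_vectorizer = CountVectorizer(stop_words="english")
tf_corpus = tf_vectorizer.fit_transform(documents)

# TF-IDF weights, the usual input for NMF.
tfidf_vectorizer = TfidfVectorizer(stop_words="english")
tfidf_corpus = tfidf_vectorizer.fit_transform(documents)

def feature_names(vectorizer):
    # vocabulary_ maps each term to its column index; sorting the terms by
    # that index gives the term for every column of the matrix.
    return sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

tf_feature_names = feature_names(tf_vectorizer)
tfidf_feature_names = feature_names(tfidf_vectorizer)
print(tf_corpus.shape, len(tf_feature_names))
```

Reading the names from vocabulary_ is a deliberate choice here, since it avoids depending on a helper method whose name has changed between scikit-learn releases.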
Let's now write a small function, which will help us print the topics:
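def display_topics(model, feature_names, no_top_words):
    # Print the highest-weighted words for every topic of a fitted model
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % topic_idx)
        # argsort is ascending, so we slice from the end to get the top words
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))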

For the LDA model, model.components_ holds the variational parameters for the
topic-word distribution. Since the complete conditional for the topic-word
distribution is a Dirichlet, components_[i, j] can be viewed as a pseudo count
that represents the number of times the word j was assigned to the topic i.
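
The two ideas above, the argsort slice and the pseudo counts, can be illustrated with a small NumPy sketch. The components array and the five-word vocabulary are hypothetical values chosen for illustration, not output from a fitted model.

```python
import numpy as np

# Hypothetical pseudo counts for 2 topics over a 5-word vocabulary.
feature_names = ["arafat", "fire", "israeli", "sydney", "wind"]
components = np.array([
    [8.0, 0.5, 9.0, 0.5, 2.0],   # topic 0
    [0.5, 7.0, 0.5, 6.0, 1.0],   # topic 1
])

# Dividing each row by its sum turns pseudo counts into a distribution over words.
topic_word = components / components.sum(axis=1, keepdims=True)

no_top_words = 2
for topic_idx, topic in enumerate(components):
    # argsort sorts ascending, so the slice [:-no_top_words - 1:-1] walks
    # backwards from the end and yields the indices of the largest values.
    top = topic.argsort()[:-no_top_words - 1:-1]
    print(topic_idx, [(feature_names[i], round(float(topic_word[topic_idx, i]), 2))
                      for i in top])
```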
Let's run this:
no_top_words = 10
display_topics(nmf, tfidf_feature_names, no_top_words)

We get the following:
Topic 0:
afghanistan bin laden qaeda al force taliban tora bora afghan
Topic 1:
palestinian arafat israeli israel hamas gaza attack suicide sharon militant
Topic 2:
qantas union worker industrial maintenance dispute wage freeze action relations
Topic 3:
test africa south match day waugh bowler wicket cricket lee
Topic 4:
river guide adventure canyon court trip interlaken australians swiss accident
Topic 5:
detainee centre woomera detention facility department damage overnight visa night
Topic 6:
hollingworth dr governor abuse general anglican child school allegation statement

Topic 7:
new year australia south government people sydney australian wales state
Topic 8:
harrison beatle cancer george krishna lord lung know ceremony life
Topic 9:
commission hih royal collapse hearing company report union martin evidence

Now let's run this:
display_topics(lda, tf_feature_names, no_top_words)

We get the following:
Topic 0:
space station shuttle endeavour russian crew ice vaughan centre launch
Topic 1:
test south day australia match lee africa wicket waugh cricket
Topic 2:
afghanistan force taliban government laden bin president australian united al
Topic 3:
russian people christmas authority security cause economy drop america kilometre
Topic 4:
union qantas worker industrial action company maintenance dispute pay relations
Topic 5:
palestinian israeli arafat attack hamas suicide Gaza sharon israel kill
Topic 6:
win metre good year race event world new australia australian
Topic 7:
year company commission people australian report world director royal child
Topic 8:
new australia south people government sydney state australian storm year
Topic 9:
flight virgin disease airline melbourne blue tell second ansett japan

Let's briefly inspect the topics - maybe to find our Israel and Palestine topics
again? And yes! Topic id 1 from NMF and topic id 5 from LDA describe the
same topic we saw in all our previous topic modeling experiments with
Gensim.

By running the Jupyter notebook based on Chapter 8, Topic Models, you should
be able to reproduce the same results.
And there we have it - we can now use topics to describe our textual data, and
in two different Python machine learning frameworks. So far, we've only
really seen how to identify and print topics in the text; but there is a lot more
we can do with topic models, particularly with the way we can explore
documents. We will explore additional topic modeling techniques and also
ways to better train our topic models in the next chapter.

Summary
In this chapter, we saw our first usage of Gensim's machine learning
algorithms, and in particular, topic models. Topic models are a great way for
us to work with unlabeled data, and they help us find underlying structures
in text. There are multiple ways for us to identify topics in the text, with LDA,
LSI, HDP, and NMF being the most popular methods, and we have
discussed ways to use these methods in scikit-learn and Gensim.
In the next chapter, we will move into advanced operations using topic
models.

References
[1] Latent Semantic Analysis
[2] Gensim
[3] Latent Dirichlet Allocation
[4] Introduction to LDA
[5] Explanation of LDA
[6] Probabilistic Topic Models
[7] Jupyter Notebook
[8] An Empirical Evaluation of Models of Text Document Similarity
[9] Singular-Value Decomposition
[10] Indexing by Latent Semantic Analysis
[11] Probabilistic Latent Semantic Indexing
[12] NIPS
[13] Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes
[14] Dynamic Topic Models
[15] NNMF
[16] Algorithms for NNMF
[17] On information and sufficiency

Advanced Topic Modeling
We saw in the previous chapter the power of topic modeling, and how
intuitive a way it can be to understand our data, as well as explore it. In this
chapter, we will further explore the utility of these topic models, and also
how to create more useful topic models which better encapsulate the topics
that may be present in a corpus. Since topic modeling is a way to
understand the documents of a corpus, it also means we can analyze
documents in ways we have not done before.
In this chapter, we will cover the following topics:
Advanced training tips
Exploring documents
Topic coherence and evaluating topic models
Visualizing topic models

Advanced training tips
In Chapter 8, Topic Models, we explored what topic models are, and how to set
them up with both Gensim and scikit-learn. But just setting up a topic model
isn't sufficient - a poorly trained topic model would not be of much use to us.
We've already talked about the most important pre-training tip -
preprocessing. It should be quite clear by now that garbage in is garbage out,
but sometimes even after ensuring it isn't garbage you're putting in, we still
get nonsense outputs. In this section, we will briefly discuss what else you
can do to polish your results.
It would be wise to revisit Chapter 3, SpaCy's Language Model, and Chapter 4,
Gensim - Vectorizing Text and Transformations and n-grams, now - they
introduce the methods used in preprocessing, which is usually the first
advanced training tip given. It is worth noting that some of these
preprocessing tips are geared more towards generating topic models than
other forms of text analysis algorithms. For example, using lemmatizing
instead of stemming is a practice which especially pays off in topic modeling
because lemmatized words tend to be more human-readable than stems.
Similarly, using bi-grams or tri-grams as part of your corpus before applying
the topic modeling algorithm means our results would be even more
human-readable.
Since our purpose with topic models is to explore the corpus, it makes sense
that we work towards getting results which are more understandable by
humans. This would be slightly different than clustering documents for
example, where we would be more focused on having a higher accuracy than
on having anything human interpretable. Keeping this in mind is important
while preprocessing our documents, and also means we are free to add our
own preprocessing steps along the way to help us with our results.
It is highly unlikely that we get very useful results the first time we attempt
to topic model our data - successful topic modeling requires multiple runs of
cleaning the data, reading the results, adjusting the preprocessing
accordingly and trying again. For example, we might want to add new stop
words to our stop-word list after viewing our first topic model. More
often than not, the appropriate stop words depend heavily on the domain in
which you are conducting your text analysis.
In the Jupyter notebook we first looked at in Chapter 8, Topic Models, we were
working on the Lee Newspaper corpus. On the first few topic modeling runs,
the results weren't the most useful - the word say would come up a
disproportionately high number of times in the topics. This made sense, of
course - in a corpus containing newspaper articles, the words said or saying
would come up often, and these would get lemmatized to say. But even
though it made sense, it still meant that our topic model was not the most
useful. In this case, the solution was clear - remove variations of the word say
from the corpus so that it does not show up in our topic models.
With spaCy, this would be done like this:
# nlp is the spaCy language model loaded in the earlier chapters
my_stop_words = ['say', "'s", 'Mr', 'be', 'said', 'says', 'saying']
for stopword in my_stop_words:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True

So, what exactly is going on here? For every word that we wish to add as a
stop word, we set the is_stop attribute of the corresponding lexeme to True. A
lexeme is looked up by its exact text, so if a word can occur with different
capitalization, it is safest to add each variant to the list. To add more stop
words, we simply add the words to the my_stop_words list.
This is just the way spaCy handles stop words - a more common way to
remove stop words is to put all our stop words in one list and simply remove
all occurrences of those words from the corpus. If you're using NLTK, it
would be something like this:
from nltk.corpus import stopwords
stopword_list = stopwords.words("english")

Here, stopword_list is a list, so adding new words to our list is as simple as
appending words to the list.
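
The list-based approach can be sketched in plain Python as follows. The short stopword_list here is a hypothetical stand-in for the list NLTK would return, and the two tokenized documents are made up for illustration.

```python
# Hypothetical base list; with NLTK this would come from stopwords.words("english").
stopword_list = ["the", "a", "of", "to", "in"]

# Domain-specific additions found by inspecting an earlier topic model.
stopword_list.extend(["say", "said", "says", "saying", "mr"])

# A set makes each membership test fast.
stopword_set = set(stopword_list)

texts = [
    ["the", "minister", "said", "the", "talks", "failed"],
    ["mr", "howard", "says", "troops", "remain", "in", "afghanistan"],
]

# Drop every token that appears in the stop-word set.
cleaned = [[token for token in doc if token.lower() not in stopword_set]
           for doc in texts]
print(cleaned)
```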
We will be using spaCy throughout for any kind of preprocessing, so that is the stop-word removal method you
should really concern yourself with; that being said, you can technically use any method to remove stop words.

Another way to prune out unwanted words is to use the Gensim Dictionary
class. Consider this method:
filter_n_most_frequent(remove_n)

This filters out the remove_n most frequent tokens that appear in the
documents. This quick example from the Gensim Dictionary documentation [1]
illustrates it:
from gensim.corpora import Dictionary
corpus = [["máma", "mele", "maso"], ["ema", "má", "máma"]]
dct = Dictionary(corpus)
print(len(dct))  # 5 unique tokens
dct.filter_n_most_frequent(2)  # drop the 2 most frequent tokens
print(len(dct))  # 3 tokens remain
This process of generating a topic model, manually inspecting it and
appropriately changing our preprocessing steps is a common exercise in
almost all machine learning or data science projects - in text analysis, the
difference is the human interpretable nature of the results.
When do we stop this process of cycles of preprocessing and generating topic
models? Pretty much when we are satisfied with the results we see - since
we're not attempting to get higher accuracy values when we topic model, we
can stop when we think our topic model is finally useful. Of course, there are
also more objective ways to measure how useful a topic model is, and we will
discuss these techniques in our Topic Coherence and evaluating topic
models section.
Now, all of these tips involved what we do before we start our topic modeling.
There's also a fair amount of tuning we can do even when we are creating our
topic models. While these training options are different for Gensim and
scikit-learn, one thing is common - how many topics do we choose for the
most optimal topic model?
There's no single answer to this question; the best
number of topics really depends on the kind of corpus you are using, the size
of the corpus, and the number of topics you might expect to see - maybe
100 topics for a large corpus and 10 for a smaller one. If we have no prior
knowledge about the dataset, running a model with 5 topics, and then 10, and
so on and so forth in steps of 10 is actually a reasonable enough approach,
though there are also more quantitative methods to measure this, which we
will soon discuss in the section on topic coherence.

In all machine learning algorithms, we have various parameters that affect
the results of the algorithm. The process of changing these parameters to
achieve different results is called parameter tuning, and these parameters are
also colloquially referred to as tuning parameters.
Some important tuning parameters, at least for Gensim, include the
following:

chunksize: This controls how many documents are processed at a time in
the training algorithm. Increasing chunksize will speed up training, at
least as long as the chunk of documents easily fits into memory (RAM).

passes: This controls how often we train the model on the entire corpus.
Another word for passes might be epochs.

iterations: This controls how often we repeat a particular loop over each
document. It is important to set the number of passes and iterations
high enough.

You can see the other parameters for LdaModel here [2] - you can recall us using
LdaModel in Chapter 8, Topic Models. For scikit-learn, these [3] are the parameters
for their LDA implementation, for a quick comparison. These are useful to
understand what kind of parameters we can play around with.
Hyperparameter is a term for a parameter of a machine learning
algorithm that is set before training begins.
In machine learning, we often refer to the result of our algorithm as a model
- in the context of topic modeling, an LDA model, HDP model, or LSI model
is simply a probabilistic model that describes the documents in the corpus.
When we talk about topic models or an LDA model, for example, we refer to
this trained model.
In general, an LDA algorithm would have three hyperparameters:

1. Alpha: This represents the document-topic density. The higher the value of
alpha, the more topics each document is composed of; the lower the value of
alpha, the fewer topics each document contains.

2. Beta: This represents the topic-word density. If the value of beta is high,
topics are composed of a large number of words in the corpus, and with
a lower value of beta, they are composed of few words.

3. Number of topics: The number of topics we wish to model.
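
The effect of alpha can be seen without training any model, simply by drawing document-topic proportions from a Dirichlet distribution. The following NumPy sketch does this for a low and a high alpha; the values 0.1 and 10.0, the number of topics, and the helper sample_doc_topic are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_topics = 5

def sample_doc_topic(alpha, n_docs=1000):
    # Each row is one document's topic proportions drawn from a symmetric Dirichlet.
    return rng.dirichlet([alpha] * num_topics, size=n_docs)

low = sample_doc_topic(alpha=0.1)
high = sample_doc_topic(alpha=10.0)

# Average share taken by the single largest topic in each document.
low_peak = low.max(axis=1).mean()
high_peak = high.max(axis=1).mean()
print(low_peak, high_peak)
```

With a low alpha, one topic tends to dominate each document, while a high alpha spreads the proportions more evenly. Beta plays the same role for the words within a topic. In Gensim's LdaModel these two priors are set through the alpha and eta arguments.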
During the training process, it makes sense to turn on logging for more
information, as Gensim doesn't print training information by default.
This can be done with:
import logging
logging.basicConfig(filename='logfile.log',
                    format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

This blog post by Chris Tufts also serves as a useful resource for training
your LDA model [4]. The Gensim FAQ and recipes page is also worth going
through [5].
Once we are sufficiently happy with the model we have trained, we can play
around more - and you will see we can do more than just looking at what
kind of topics exist in a corpus.

Exploring documents
Once we have our topic model of choice set up, we can use it to analyze our
corpus, and also get some more insight into the nature of our topic models.
While it is certainly useful to know what kind of topics are present in our
dataset, to go one step further we should be able to, for example, cluster or
classify our documents based on what topics they are made out of.
In our Jupyter notebook example from Chapter 8, Topic Models, let's start
looking at document-topic proportions. What exactly are these? When we
were looking at topics in the previous chapter, we were observing topic-word
proportions - what are the odds of certain words appearing in certain topics.
We previously mentioned that we assumed that documents are generated
from topics - by identifying document-topic proportions, we can see exactly
how the topics generated the documents.
So, how do we do this with Gensim? It's particularly straightforward:
ldamodel[document]

This is all you need to get the document-topic proportions. Here, document is
the vector representation of the document which we wish to analyze.
This does not need to be a document that was used to train the LDA model - it can be an unseen document, so
long as the words in the document are within the same vocabulary of the LDA model.

Let's try this in the context of the Lee Newspaper corpus, using its first
document:
ldamodel[corpus[0]]

This gives us the following:
[(1, 0.99395897621183538)]

What does this mean? The list contains tuples with the topic number and the
corresponding probability of that topic appearing in that document - above a
certain cutoff probability. Since we only have one tuple in our list, this
means that the contributions of the other topics to this document are
negligible. Let's verify this.
What is topic 1?
(1, u'0.008*"area" + 0.007*"fire" + 0.006*"people" + 0.005*"sydney" + 0.005*"force" +
0.004*"pakistan" + 0.004*"new" + 0.004*"afghan" + 0.004*"new_south" + 0.004*"wales"')

It seems to represent two themes - of the Afghan-Pakistan conflict and of a
possible fire or accident in New South Wales or Sydney. Let's see if our first
document has any of these themes coming up.
Let's now have a look at a few words of our first document and see if the
topic assignment makes any sense:

We see that it certainly does match one of the themes of the topic and that
our topic model is indeed useful. We can further use this information to
cluster documents into each topic based on the document-topic proportions.
One very important point to note: you may be seeing different topics,
different proportions, and different words - topic models are probabilistic,
and we do not get the same results every time.
It is important to note now that the document-topic proportions are also a
vector representation of a document, just like TF-IDF; instead of the vector
length spanning the vocabulary, it equals the number of topics.
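
To make this concrete, here is a small sketch that turns the sparse list of (topic, probability) tuples into a fixed-length vector and compares two documents. The helper to_dense and the two example documents are hypothetical; Gensim's matutils.sparse2full offers a similar conversion.

```python
import numpy as np

def to_dense(doc_topics, num_topics):
    """Turn [(topic_id, probability), ...] into a vector of length num_topics."""
    vector = np.zeros(num_topics)
    for topic_id, probability in doc_topics:
        vector[topic_id] = probability
    return vector

# Hypothetical document-topic proportions from a 10-topic model.
doc_a = [(1, 0.994)]
doc_b = [(1, 0.60), (4, 0.39)]

vec_a = to_dense(doc_a, num_topics=10)
vec_b = to_dense(doc_b, num_topics=10)

# Once documents are vectors, any vector similarity applies, for example cosine.
cosine = vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(vec_a.shape, float(cosine))
```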

Gensim doesn't just stop here and has further methods to help us analyze
the topic proportions of both documents and words.
We will be using this Jupyter notebook [6] I have written for Gensim to
illustrate the methods that Gensim has to offer.
Let's quickly have a look at the corpus we will be using to illustrate these
methods:
texts = [['bank','river','shore','water'],

Some quick notes about this corpus - it contains sentences that have two
distinct topics - one to do with finance, and the other to do with rivers. You
should also notice that the word bank repeats itself in both the contexts - this
allows us to do some more experiments with words.
Let's look at the topics generated from this corpus:
[(0, u'0.164*"bank" + 0.142*"water" + 0.108*"river" + 0.076*"flow" + 0.067*"borrow" +
0.063*"sell" + 0.060*"tree" + 0.048*"money" + 0.046*"fast" + 0.044*"rain"'),
(1, u'0.196*"bank" + 0.120*"finance" + 0.100*"money" + 0.082*"sell" + 0.067*"river" +
0.065*"water" + 0.056*"transaction" + 0.049*"loan" + 0.046*"tree" + 0.040*"mud"')]

We can see that as expected, one topic is to do with river banks, while the
other topic is to do with financial banks.
It is also possible to find the odds of a particular word
belonging to a particular topic. This is done with the get_term_topics() method.
Let's look at a few examples:
[(0, 0.12821234071249418), (1, 0.047247458568794511)]

This makes sense; the value for it belonging to topic_0 is a lot more.
[(0, 0.017179349495865623), (1, 0.10331511184214655)]

As expected, the finance word has a much higher probability of being within
the second topic. We leave it to the reader to figure out what the result might
look like if we ran the same method for the word bank.
This method concerned itself with particular words in a corpus - let's now see
how to find topic proportions for entire documents. The get_document_topics
method is the Gensim functionality that uses the inference function to get
the sufficient statistics and figure out the topic distribution of the document.
Let's test this with two different documents which have the word bank in them,
one in the finance context and one in the river context.
When per_word_topics is set to True, the get_document_topics method returns
(along with the standard document-topic proportions) each word_type followed
by a list of topic ids, sorted from most to least likely.
Have a look at this excerpt from the notebook:
bow_water = ['bank','water','bank']
bow_finance = ['bank','finance','bank']
bow = model.id2word.doc2bow(bow_water) # convert to bag of words format first
doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)

[(0, [0, 1]), (3, [0, 1])]

Now, what does that output mean? It means that, like word_type 0, our
word_type 3, which is the word bank, is more likely to be in
topic_0 than topic_1. A reminder here that the first number in each tuple
(0 or 3) is the id or index of that word in the dictionary, while the list that
follows holds the topic ids, sorted from most to least likely; topic 0 is the
first topic.
You must have noticed that while we unpacked into doc_topics and word_topics,
there is another variable - phi_values. Phi is essentially the probability of that
word in that document belonging to a particular topic. Like the name
suggests, phi_values contains the phi values for each topic for that particular
word, scaled by feature length. The next few lines should illustrate this:
[(0, [(0, 0.92486455564294345), (1, 0.075135444357056574)]),
(3, [(0, 1.5817120973072454), (1, 0.41828790269275457)])]

This output lists, for each word_type, its phi_values for each of the
topics. What is interesting to note is word_type 3: because it has 2
occurrences (that is, the word bank appears twice in the bow), the scaling
by feature length is very evident. The sum of its phi_values is 2,
and not 1.
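The scaling can be checked with a few lines of arithmetic. The sketch below copies the phi values from the output above and assumes, as the text states, that word id 3 (bank) occurs twice and word id 0 once; the helper name per_occurrence_probs is made up for this illustration and is not part of Gensim.

```python
# phi values copied from the output above: {word_id: [(topic_id, phi), ...]}
phi_values = {
    0: [(0, 0.92486455564294345), (1, 0.075135444357056574)],
    3: [(0, 1.5817120973072454), (1, 0.41828790269275457)],
}

# Assumed word counts in the bag of words: id 0 once, id 3 ('bank') twice
counts = {0: 1, 3: 2}


def per_occurrence_probs(phis, count):
    """Divide the scaled phi values by the word count to get values that sum to 1."""
    return [(topic_id, phi / count) for topic_id, phi in phis]


for word_id, phis in phi_values.items():
    total = sum(phi for _, phi in phis)
    probs = per_occurrence_probs(phis, counts[word_id])
    print(word_id, round(total, 6), [(t, round(p, 4)) for t, p in probs])
```

Dividing by the count gives a per-occurrence topic probability, which makes words with different frequencies directly comparable.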
Now that we know exactly what get_document_topics does, let's now do the same
with our second document, bow_finance.
bow = model.id2word.doc2bow(bow_finance) # convert to bag of words format first
doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)
[(3, [1, 0]), (12, [1, 0])]

And lo and behold, because the word bank is now used in the financial
context, it immediately swaps to be more likely associated with topic_1.
We've seen quite clearly that based on the context, the most likely topic
associated with a word can change. This differs from our previous method,
get_term_topics, where it is a static topic distribution.
It must also be noted that because the Gensim implementation of LDA uses
Variational Bayes inference, a word_type in a document is only given one
topic distribution. For example, the sentence the bank by the river bank is
likely to be assigned to topic_0, and each of the bank word instances has the
same distribution.
Using these two methods, we see how we can infer further information from
our topic models. Having the document-topic distributions means we
can also use this information to do some cool stuff - for example, coloring all
the words in a document based on which topic they belong to, or using distance
metrics to infer how close or far away two topics or documents are.
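As a rough illustration of the distance idea, the following sketch computes the Hellinger distance between document-topic distributions in plain Python. The three distributions are hypothetical values, not output from the models in this chapter; Gensim ships its own distance functions (in gensim.matutils, for example), which you would normally use instead.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same number of topics")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)


# Hypothetical document-topic proportions over three topics
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

print("a vs b:", round(hellinger(doc_a, doc_b), 4))
print("a vs c:", round(hellinger(doc_a, doc_c), 4))
```

The distance is 0 for identical distributions and 1 for distributions that share no topic, so doc_a should come out closer to doc_b than to doc_c.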

The following Jupyter notebooks that I've contributed to Gensim exactly tell
us how to do the following tasks - it is highly recommended that you have a
look and run the notebooks before moving to the next section.
Coloring Words in a Document - notebook 1 [6]
Distance Metrics - notebook 2 [7]
Scikit-learn similarly has further caveats you can explore - the blog post [8]
would be a good place to start!
We are now going over to a crucial part of topic models - understanding
quantitatively how well a topic model performs.

Topic coherence and evaluating topic models
In the previous sections, we spoke extensively about how topic models, in
general, are rather qualitative in nature - it's difficult to put a number on how
useful a topic model is. Despite this, there is a need to evaluate topic models,
and the most popular method out there is topic coherence - and lucky for us,
Gensim has quite an extensive suite of topic coherence methods for us to try.
What exactly is topic coherence? Briefly put, it is a measure of how
interpretable topics are for human beings. There are multiple coherence
measures in topic modeling literature, and we won't be going through the
theory for these, but the following links should walk you through the theory
and intuition, if interested:

1. What is topic coherence? [9]

2. Exploring the Space of Topic Coherence Measures [10]
The first link is a Gensim blog post, and the second is a research paper and
goes into further theoretical details.
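To give a feel for what such a measure computes, here is a minimal sketch of a UMass-style coherence score written in plain Python. It is only one of the measures discussed in the references, and it is not Gensim's implementation; the toy documents and the function name umass_coherence are assumptions made for this illustration.

```python
import math


def umass_coherence(topic_words, documents):
    """Sum of log((D(w_i, w_j) + 1) / D(w_j)) over ordered word pairs.

    D counts the documents containing the given word(s). Every topic word
    is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            w_i, w_j = topic_words[i], topic_words[j]
            score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score


# Hypothetical tokenized documents
documents = [
    ["bank", "money", "loan"],
    ["bank", "money", "finance"],
    ["river", "water", "bank"],
    ["river", "water", "fish"],
]

coherent_topic = ["money", "bank", "loan"]
mixed_topic = ["money", "fish", "loan"]
print(umass_coherence(coherent_topic, documents))
print(umass_coherence(mixed_topic, documents))
```

Words that tend to appear in the same documents push the score up, which is why the topic mixing finance and river vocabulary scores lower here.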
All we need to know to proceed is that we now have a quantitative measure
of how good our topic model is. This opens up a lot of possibilities for us - we
can now compare between two differently trained (with a different number
of iterations or passes, for example) LDA models, or between an HDP model
and an LSI model, or even between similarly trained models with a different
number of topics. This means that we now also have a quantitative way to
even measure the optimal number of topics for a corpus, as well as a way to
compare between completely different classes of models.
Of course, we can still use qualitative methods to get an understanding of how
well our topic model performs. Visualizing topic models is one way to do
this - we've already explored one such method in the previous section, where
the Jupyter notebook walks us through coloring words in a document. By
quickly glancing at the colored words in a document, we can get an idea of
how well the topic model understands which words belong to which topic.
Using more advanced topic visualization tools, we can further analyze how
efficient our topic model is. We'll talk about these tools extensively in the
next section - for now, let's look at Gensim's topic coherence pipeline!
Before topic coherence, perplexity was used to measure how well a topic
model was fit - indeed, even now Gensim allows us to hold out a testing set
and measure perplexity while training our model. You can read more about
perplexity and topic models over here [11].
Gensim has a very straightforward API for computing topic coherence.
If, for example, we wish to check the coherence value of the three models on
our Lee Newspaper corpus dataset, we simply run the following. Note that these
examples are from the Jupyter notebook attached at the end of Chapter 8, Topic Models.
lsi_coherence = CoherenceModel(topics=lsitopics[:10], texts=texts, dictionary=dictionary,
hdp_coherence = CoherenceModel(topics=hdptopics[:10], texts=texts, dictionary=dictionary,
lda_coherence = CoherenceModel(topics=ldatopics, texts=texts, dictionary=dictionary,

Here, topics are just the lists of the top n words for each topic. Since the topics
are all different, we pass the lists of the top words instead of passing the model
itself. We can then print the values of coherence for each of the models to get
the comparative coherence values - this exercise has been completed in the
Jupyter notebook [12] and we urge the reader to explore this comparison.
In cases where we are, for example, just comparing two different
LdaModel objects, we can also pass the models. Here, goodLdaModel and
badLdaModel are just placeholder variable names for a good and a bad model; you
are expected to pass whichever models you wish to.
goodcm = CoherenceModel(model=goodLdaModel, texts=texts, dictionary=dictionary,
badcm = CoherenceModel(model=badLdaModel, texts=texts, dictionary=dictionary,

We notice here that in both the examples, we pass texts - here texts is the
original corpus before we convert it into its vector form. You can have a look
at the texts list in the notebook to confirm the contents of it.
Once we have our coherence models set up, we simply have to run
get_coherence() to get the value of coherence. Note that coherence values on
their own have no meaning - a value is only meaningful when compared to another
coherence value computed on the same corpus - and the higher the value of
coherence, the better the model.
In the bad and good LdaModel example, the bad LdaModel object was trained with
only 1 iteration, and the good LdaModel object was trained with 50 iterations.
When we attempt to print the values of coherence:

We can see that the good LdaModel object has a higher value of coherence,
which confirms our hypothesis that a model trained with more iterations
would perform better. A note: these are sample coherence values based on
any sets of models where one is more trained than the other. The user
should attempt to train his or her own good and bad models and experiment
with the results.

Like we mentioned before, we can also use coherence measures to see what
the optimal number of topics for a corpus is. The following is an example of
a simple for loop which would do this:
c_v = []
limit = 10
for num_topics in range(1, limit):
    lm = LdaModel(corpus=corpus, num_topics=num_topics,
    cm = CoherenceModel(model=lm, texts=texts, dictionary=dictionary,

Printing c_v would give us a list of the corresponding coherence values for
each topic number - the highest coherence value would be a way to identify
the number of topics which are most human-comprehensible.
We can also print the top topics from an LDA model, depending on which
coherence measure we intend to use. The top_topics method helps perform this
and uses the coherence model to generate the top topics. While we have
covered most of the functionalities of coherence models in this section, there
are multiple Gensim Jupyter notebooks which go into more detail into the
different functionalities which coherence models can offer:

1. Coherence Model pipeline [13]

2. News Classification with Gensim [14]

3. Topic Coherence on Movies Dataset [15]

4. Topic Coherence Introduction [16]

5. Topic Coherence Use Cases [17]

6. Topic Coherence Model Selection [18]
Now that we have our models set up, as well as analyzed, we can go ahead and
visualize them.

Visualizing topic models
Like we have said before, the purpose of topic models is to better understand
our textual data - and visualizations are one of the best ways to understand
and look at our data. There are multiple ways and techniques to visualize
topic models - we will be focusing on the methods implemented and
compatible with Gensim, but like we have done throughout the book, we will
be providing links and documentation to the other popular topic modeling
visualization tools.
One of the most popular topic modeling visualization libraries is LDAvis - an R
library built largely on D3. It has been ported to Python as pyLDAvis, is just
as nifty in Python, and is very well integrated with Gensim as well. It is based
on the original paper (LDAvis: A method for visualizing and interpreting
topics [19]) by Carson Sievert and Kenneth E. Shirley.
The pyLDAvis library is agnostic to how your model is trained - this means we
are not restricted to Gensim or even LDA for that matter. All we need are the
topic-term distributions and the document-topic distributions - and basic
information about the corpus the model was trained on.
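The basic corpus information mentioned here is easy to compute yourself. The sketch below derives a vocabulary, document lengths, and term frequencies from a hypothetical tokenized corpus using only the standard library. To the best of our knowledge, these correspond to the vocab, doc_lengths, and term_frequency arguments of pyLDAvis.prepare, but treat that mapping as an assumption and check it against the pyLDAvis documentation.

```python
from collections import Counter

# Hypothetical tokenized corpus
texts = [
    ["bank", "money", "loan", "bank"],
    ["river", "water", "bank"],
    ["money", "finance", "loan"],
]

# One entry per unique term, in a fixed order shared with the topic-term matrix
vocab = sorted({token for doc in texts for token in doc})

# Number of tokens in each document
doc_lengths = [len(doc) for doc in texts]

# How often each vocabulary term occurs in the whole corpus
counts = Counter(token for doc in texts for token in doc)
term_frequency = [counts[term] for term in vocab]

for term, freq in zip(vocab, term_frequency):
    print(term, freq)
```

The topic-term and document-topic distributions would come from whichever model you trained, with columns ordered the same way as vocab.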

Fig 9.1 pyLDAvis

It's even easier if we are using a Gensim-based model. All we need to do is:
import pyLDAvis.gensim  # this module is named pyLDAvis.gensim_models in pyLDAvis 3.x
pyLDAvis.gensim.prepare(model, corpus, dictionary)

Here, model is a placeholder variable, and we can pass any of our trained
LDA models.
We are then able to visualize a lot of information about our topics all at once
- something which is a lot easier than manually inspecting topics printed out
on your console. In the preceding figure, we can see each topic represented
as a circle in a 2-dimensional space - this space is generated by finding the
distance between the topics. The words on the right are the words in a
topic and are a quick and useful way to see how words are spread out among
topics. The original paper [19] further details the
visual elements.
The pyLDAvis library itself has a few more options you can tinker with,
and it is highly recommended to check out the Jupyter notebook [20]
tutorial which walks you through the details.
Now, this visualization is after we are done training - what if we want to
visualize the progress during training? Gensim has newly added features to
help with just this.
We previously discussed coherence and perplexity as measures to check how
well a model is fit - we are able to see the progression of these measures as
the model is being trained.

Fig 9.2 Visualizing coherence, perplexity, topic difference, and convergence

We can also measure the topic difference - it calculates the distance between
two topic models, using one of the many distance metrics implemented in
Gensim. Another metric we can watch is convergence - this is the sum of the
difference between all the identical topics from two consecutive epochs.
It is fairly easy to set this up using Gensim, though we also need the visdom
[21] server to do this. A visdom server is a Python-based server which is
specifically built to aid in visualizing data. Since we are visualizing a live
training process, we will be needing a server. The instructions to set up both
the server and the visualizations are explained clearly in the Jupyter
notebook [22].
Topic models can be further viewed as clusters - for example, by using the
machine learning algorithm T-distributed Stochastic Neighbor
Embedding (T-SNE) [23] we can use the document-topic proportions to
cluster our corpus.

Fig 9.3 LDA document clusters based on topics and T-SNE

It is also possible to cluster using Word2Vec - the details for this kind of
clustering are given in the Jupyter notebook [24].
Using Gensim and scipy we can also do cool things such as creating
dendrograms of how our topics are related - the Jupyter notebook [25] goes
over the details.

Fig 9.4 Topic dendrograms

Fig 9.5 Topic dendrogram with heat-maps

A dendrogram is a tree-structured graph that can be used to visualize the
result of any kind of hierarchical clustering. Hierarchical clustering puts
individual data points into similarity groups, with some groups on top of
each other based on the content of the groups. For example, if we are
modeling a corpus with various kinds of industries, the Mercedes topic might
come below the cars topic. We can use it to explore the topic models and see
how the topics are connected to each other in a sequence of successive
fusions or divisions that occur in the clustering process.
All of these visualizations are based on Gensim, and the Jupyter notebooks
linked to are all from the Gensim documentation - it is worth taking the time
to run the Jupyter notebooks and have a look at the visualizations yourself.
There are also some neat visualizations out there which are not officially
Gensim but allow us to look at our data in interesting ways. We will link to
the pages so that the reader can have a look:
Visualizing Trends [26]
Topic Modeling and t-SNE Visualization [27]
Visualizing Topic Shares [28]

David Blei - Visualizing Topic Models [29]

With Chapter 8, Topic Models and Chapter 9, Advanced Topic Modelling, we are
now equipped with the tools and knowledge for applying topic models to our
textual data. Topic modelling is largely a data exploration tool, but we can
also carry out some more targeted analysis, like seeing the topics which
make up a document, or which words in a document belong to which topic.
Gensim gives us the functionality to carry out these tasks quite easily, with
its API constructed so that we can access the mathematical information
behind topic models without a hassle.
In the next chapter, we will carry out more targeted text analysis tasks, such
as clustering or classification. Clustering and classification are machine
learning techniques that are widely used in text analysis to group similar
documents together. We will explain the intuition behind these
methods as well as illustrate code examples.

[1] Gensim Dictionary class:

[2] Gensim LdaModel Class:

[3] Scikit-Learn LDA Class:

[4] Gensim LDA: Tips and Tricks:

[5] Recipes and FAQ:

[6] Term Topics Jupyter notebook:

[7] Distance Metrics:

[8] Interpretation of Topic Models:

[9] What is Topic Coherence:

[10] Exploring the space of Topic Coherence Measures:

[11] Perplexity in Topic Models:

[12] Topic Modelling Notebook:

[13] Coherence Model pipeline:

[14] News Classification with Gensim:

[15] Topic Coherence on Movies Dataset:

[16] Topic Coherence Introduction:

[17] Topic Coherence Use Cases:

[18] Topic Coherence Model Selection:

[19] LDAvis: A method for visualizing and interpreting topics:

[20] pyLDAvis:

[21] visdom:

[22] LDA training visualization:

[23] t-SNE:

[24] TensorBoard Visualizations:

[25] Topic Dendrograms:

[26] Visualizing Trends:

[27] Topic Modeling and t-SNE Visualization:

[28] Visualizing Topic Shares:

[29] David Blei - Visualizing Topic Models:

Clustering and Classifying Text
In the last chapter we studied topic models and how they can help us in
organizing and better understanding our documents and their sub-structure.
We will now move on to our next set of machine learning algorithms, for
two particular tasks: clustering and classification. We will learn the
intuitive reasoning behind these two tasks, as well as how to perform them
using the popular Python machine learning library, scikit-learn:
Clustering text
Classifying text

Clustering text
So far we looked at analyzing text to understand better what the text or
corpus consists of. When we tried to POS-tag or NER-tag, we were interested
in knowing what kind of words were presented in our documents, and when
we topic-modeled, we wanted to know the underlying topics which could be
hidden in our texts. Sure, we could use our topic models to attempt to cluster
articles, but that isn't its purpose; we would be silly to expect great results if
we tried this, too. Remember that since the purpose of topic modeling is to
find hidden themes in a corpus and not to group documents together, our
methods are not optimized for the task. For example, after we perform topic
modeling, a document can be made of 30% topic 1, 30% topic 2, and 40%
topic 3. In such a case, we cannot use this information to cluster.
Let us now start exploring how to use machine learning methods to move on
to tasks which are more quantitative in nature: clustering and
classification. Clustering is a popular machine learning task, and the
techniques used in classical clustering tasks can be used for text as well. As
the name suggests, clustering is the task of grouping together or clustering
data points in the same group, where points in the same group are more
similar to each other than points in other groups. In our context, data points
can be thought of as documents, or in some cases, words. Clustering is an
unsupervised learning problem. We are not aware of the clusters or groups
before we start assigning our data points to them (though we might have an
idea of what we might find).
Classification is a related task: it is the problem of identifying which of a
set of categories (sub-populations) a new observation belongs to, on the basis
of a training set of data containing observations (or instances) whose category
membership is known. An example would be assigning a given email to the
spam or non-spam class, or assigning newspaper articles to
predetermined classes or groups.
A famous example of a clustering or classification task is the Iris flower
dataset [1], where we attempt to find out which class a flower belongs to
based on measurements such as its petal length. Another popular dataset used
for these purposes is the MNIST dataset [2], which contains images of
handwritten digits, each of which is meant to be classified as the number it
represents.

Clustering text follows most of the principles which standard clustering
problems follow, but we have to keep one thing in mind: the high number of
dimensions in text analysis. In the Iris dataset, for example, there are only
four features which we use to identify our classes or clusters. However, in
the case of text, we have to deal with the entire vocabulary size when setting
up our problem. Of course, we will do our best to reduce our dimensions
using some of the techniques like SVD, LDA, and LSI, which we discussed earlier.
While we largely used Gensim before to carry out our quantitative tasks, and
spaCy for computational linguistics, we will move on to a more traditional
machine learning library, scikit-learn. Indeed, we have already introduced
scikit-learn earlier on in the book, but from this chapter onward we can
expect an increased use of the same.
While we perform our clustering and classifying tasks, you might often come
across both Word2Vec and Doc2Vec, two ways of representing words and
documents as vectors. We have to remember that they are just another vector
representation of words and documents, albeit a more sophisticated one
than those we have explored so far. We will explore Word2Vec and
Doc2Vec in detail in Chapter 12, Word2Vec, Doc2Vec, and Gensim, and revisit
clustering and classifying using them, but for now, it is enough to
understand them as a way we can provide more curated information to our
clustering or classifying algorithms.

Starting clustering
Like every other text analysis algorithm we applied before, the most
important step remains the pre-processing step — getting rid of our stop
words and lemmatizing words.
Once we're done with this, the next step is to convert our document into a
vector representation we are most comfortable with.
Since we're dealing with scikit-learn's implementations for clustering and
classification, let us use scikit-learn for our pre-processing. We should also
use this opportunity to decide which dataset we intend to use for our
experiments. While there are lots of solid options, we will stick with the
popular 20 Newsgroups [3] dataset. Since the dataset comes bundled with
scikit-learn, loading it and using it becomes an easy task as well.
You can follow the Jupyter notebook [4] on clustering and classification for
the full details; we will be using code snippets from there in our explanation.
To start accessing our dataset, we run:
from sklearn.datasets import fetch_20newsgroups

categories = [
dataset = fetch_20newsgroups(subset='all', categories=categories, shuffle=True,
labels =
true_k = np.unique(labels).shape[0]
data =

Let's have a brief look at what we have done so far. The import statement
allowed us to easily access the 20NG dataset, and for the sake of the
example, we have decided to pick only 4 categories. Following [3] will give
you the full list of categories. We create our dataset by choosing all the
subsets, while also shuffling the dataset, but with a random state set. As
always, we now have to convert our textual data to a form which machine
learning algorithms can understand: vectors.
We will be using scikit-learn's in-built TfidfVectorizer to make our job easy:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english', use_idf=True)
X = vectorizer.fit_transform(data)

The X object is now our input vector which contains the TF-IDF
representation of our dataset. We have to remember that we are still dealing
with rather high-dimensional data when we do a TF-IDF transformation. To
better understand the nature of the data, it is useful to visualize it. We can do
this by doing a Principal Component Analysis [5] on our data-set to reduce
the number of dimensions to 2. PCA is an algorithm which looks for
unrelated (mathematically, these are referred to as linearly uncorrelated)
components from a dataset. By identifying these unrelated components from
a high-dimensional dataset, we are effectively performing dimension
reduction. Note that we are only doing this for the purpose of visualization;
for the clustering problem we will attempt other dimensionality reduction
techniques:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

newsgroups_train = fetch_20newsgroups(subset='train', categories=['alt.atheism',
pipeline = Pipeline([
('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
X_visualise = pipeline.fit_transform(
pca = PCA(n_components=2).fit(X_visualise)
data2D = pca.transform(X_visualise)
plt.scatter(data2D[:,0], data2D[:,1],

Let's briefly discuss this code. We loaded the data again, but with only two
categories (the ones we want to visualize). We ran a count-vectorizer and a
TF-IDF transformation on this and fit a PCA model where we want only two
key components. On plotting this, we have an idea of how the clusters in the
dataset might be separated:

Fig 10.1: Visualizing our dataset

It is important to note that the axes here simply represent the two components
which were discovered by PCA.
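For readers who want to try the projection without downloading the dataset, here is a minimal sketch on a tiny hypothetical corpus. It uses scikit-learn's TfidfVectorizer and PCA; the six documents and the variable names X_toy and data2d are assumptions made for this illustration, and no plotting is done.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: three documents about space, three about baking
docs = [
    "rocket launch orbit satellite",
    "satellite orbit space station",
    "space rocket engine launch",
    "recipe oven bake bread",
    "bread flour recipe dough",
    "oven bake dough flour",
]

# Sparse TF-IDF matrix with one row per document
X_toy = TfidfVectorizer().fit_transform(docs)

# Convert to a dense array before PCA, then keep two components
data2d = PCA(n_components=2).fit_transform(X_toy.toarray())

for doc, (x, y) in zip(docs, data2d):
    print(f"{x:+.3f} {y:+.3f}  {doc}")
```

Each document ends up as a point with two coordinates, which is what a scatter plot like the one above would draw.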
Let's now get back to our original vector, X, and set it up to be ready for
clustering. When discussing topic models, we discussed how they could also
work as a dimensionality reduction technique. Let us use Singular Value
Decomposition (SVD) and Latent Semantic Analysis (LSA/LSI) (we
came across these methods in Chapter 8, Topic Models, on topic modelling) to
do our dimensionality reduction for this example.
Note: We have to re-normalize after we run our SVD on the dataset.

from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
n_components = 5
svd = TruncatedSVD(n_components)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)
X = lsa.fit_transform(X)

The final X is the input which we will be using. It has been cleaned, TF-IDF
transformed, and further had its dimensions reduced. It is now ready to have
clustering techniques run on it!

K-means [6] is a classical machine learning algorithm for clustering. It is
intuitively easy to understand. Based on a predetermined number of clusters
the user decides, it attempts to create clusters. This is done by reducing the
distance of points from the respective centroid the point is assigned to. It is
an iterative algorithm and keeps doing the process until the centroids and
points assigned don't change. It is worth one's time to go through the theory
behind the algorithm, though it isn't necessary for us to proceed.
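The iteration described above can be sketched in a few lines of plain Python. This is a simplified illustration with random initialization, not scikit-learn's implementation; the function names and the toy points are made up for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Alternate between assigning points and recomputing centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as starting centroids
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Two well-separated hypothetical groups of 2D points
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
print(centroids)
```

The loop stops as soon as an assignment step leaves every point where it was, which is the convergence condition mentioned in the text.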
Using K-means with scikit-learn is very easy, and scikit-learn offers two
implementations [7] which we can use – either in mini-batches or without.
In our code, we allow the user to toggle between which option to use:
minibatch = True
if minibatch:
    km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                         init_size=1000, batch_size=1000)
else:
    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100,

And voila! We have a fit model now which has four different clusters. Instead
of visualizing this, let's try and find what the top words per cluster are:
original_space_centroids = svd.inverse_transform(km.cluster_centers_)
order_centroids = original_space_centroids.argsort()[:, ::-1]
The preceding bit of code is necessary because of our LSI transformation.

# get_feature_names() was removed in scikit-learn 1.2; older versions use it instead
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
Cluster 0:

Cluster 1:
Cluster 2:
Cluster 3:
Note: You might see different results, as machine learning algorithms do not produce the exact same results each time.

We can see how each of the four clusters represents the four categories we
initially chose – our clustering has turned out just fine! We can further use
our fit model to predict which cluster a new document belongs to; just
remember to run the same pre-processing steps for the new document as
well. This is as simple as:
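A minimal, self-contained sketch of this step on a tiny hypothetical corpus is given below. The documents, the choice of two clusters, and the variable names are assumptions made for illustration; with the real dataset you would reuse the vectorizer, lsa, and km objects that were already fit.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical training documents: three about space, three about baking
train_docs = [
    "rocket launch orbit satellite",
    "satellite orbit space station",
    "space rocket engine launch",
    "recipe oven bake bread",
    "bread flour recipe dough",
    "oven bake dough flour",
]

# Fit the same kind of pipeline used above: TF-IDF, then SVD, then normalization
vectorizer = TfidfVectorizer()
lsa = make_pipeline(TruncatedSVD(n_components=2, random_state=0), Normalizer(copy=False))
X_train = lsa.fit_transform(vectorizer.fit_transform(train_docs))
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

# New documents must go through the already fit transformers (transform, not fit)
new_docs = ["orbit rocket space", "flour bread oven"]
X_new = lsa.transform(vectorizer.transform(new_docs))
predicted = km.predict(X_new)
print(predicted)
```

The important detail is that the new documents are only transformed, never used to refit the vectorizer or the SVD, so they land in the same vector space as the training data.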

So what did we do over here? We loaded our dataset, chose four categories,
ran pre-processing steps, visualized our data, trained a K-means model, and
printed the top words per cluster to see if they made sense, and they did just
fine. Since we knew there were four categories, we chose our K-means
model to have four clusters, that is, K=4.
We are free to play around more with the pre-processing and can expect
different kinds of results with different steps. Let's now explore another form
of clustering.

Hierarchical clustering
Before we dive into hierarchical clustering, it would be a very handy exercise
to go through the scikit-learn documentation on clustering [8]. We have to
remember that using a different model in scikit-learn is very easy, and that
almost all the other steps in the process of clustering remain the same.
We will use Ward's algorithm/method [9] to attempt hierarchical clustering.
The algorithm is based on the idea of reducing the variance within each
cluster and uses distance measures to do this. Ward's method is one of the
earliest methods used in various hierarchical clustering algorithms, which
are based on building clusters and arranging them in a hierarchy. In our
examples, we will use dendrograms [10] to represent our hierarchical clusters.
To set up our dataset for this method we must first create a matrix with
pairwise distances. We can do this very easily with scikit-learn like this:
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(X)
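To make the distance computation concrete, the sketch below builds the same kind of matrix (one minus the cosine similarity) in plain Python. The three vectors and the function names are hypothetical; in practice the scikit-learn call above does this for the whole document matrix at once.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Square matrix of pairwise cosine distances."""
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]


# Hypothetical document vectors
vectors = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],  # same direction as the first vector
    [0.0, 0.0, 3.0],  # orthogonal to the first two
]

for row in distance_matrix(vectors):
    print([round(value, 3) for value in row])
```

Vectors pointing in the same direction get a distance close to 0 regardless of their length, while orthogonal vectors get a distance of 1.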

Now that we have our distance matrix ready, we will use SciPy's ward and
dendrogram functions:
from scipy.cluster.hierarchy import ward, dendrogram
linkage_matrix = ward(dist)
fig, ax = plt.subplots(figsize=(10, 15)) # set size
ax = dendrogram(linkage_matrix, orientation="right")

And that's it! SciPy does all the hard work for us and presents us with this
pretty diagram. The dendrogram gives us an idea of the clusters the
documents can be arranged in. The x-axis mentions the name or indices of
the documents, but those cannot be seen now because there happen to be too
many documents. The y-axis refers to the distance between each hierarchy of
clusters.

Fig 10.2 An example of a dendrogram generated as a result of text clustering using Ward's algorithm in SciPy

We can see, in this particular case, this might not be the best method to
cluster, mainly because of the number of documents. It is tougher to look at
the relationship between the documents and what the clusters represent.
Trying this for a smaller corpus might prove very handy, though!
The following tutorials (links in the References section) also illustrate the
methods we have tried but on different datasets:
Brandon Rose - Clustering [11]
Working with Text [12]
We would like to stress again the value of trying different dimensionality
reduction and vector representations before feeding our corpus to the
clustering algorithm. Both Word2Vec and Doc2Vec offer very interesting ways
to do this, and Gensim has ready implementations for this very purpose! The
blog post at [13] on clustering with Word2Vec attempts to explain this as well.
We will now move on to classifying text documents, another popular usage
of machine learning algorithms on text.

Classifying text
In our previous section, we discussed clustering, which is an unsupervised
learning task. Classification, on the other hand, is a supervised learning
task. What do supervised and unsupervised mean? In our previous
example, we had the labels or the truth values. This is information about
which class or label a document actually belongs to. But you would have also
noticed we never used this information. When we trained our model, we
never used the labels. This kind of learning is called unsupervised learning,
and clustering is a popular example of an unsupervised learning task.
In classification problems, we are aware of the classes which we want to
assign documents or data points to, and we use this information to train our
model. In fact, as we are going to see very soon - there is hardly any change
in our approach to clustering and classification, apart from the fact that we
will be paying attention to our labels, and that we will be using a different
machine or model to train on.
Like we've been stressing throughout the book, it is important to make sure
our text is cleaned and vectorized before we start feeding it into any machine
learning pipeline. Our steps will remain the same as before, though we do
have the liberty to change things around a bit until we get the accuracy or
performance we are looking for.
We will use the Naive Bayes classifier [14] and a Support Vector Machine [15]
classifier to help us with our classification tasks. While the mathematical
nature of these machines is beyond our scope, the scikit-learn
documentation for them (NB [16], SVM [17]) provides some intuitive explanations.

Fig 10.3: Using kernels to transform the input space - SVM. Source: Quora, What does support vector machine (SVM)
mean in layman's terms? [19]

Support Vector Machines work by using kernels [18] to transform the input
space so that we can best draw a line (or in higher dimensions, as is the case
with text - a plane) to separate the classes. Kernels are mathematical
functions which help us in transforming our dimension space.
Briefly, the Naive Bayes classifier works by applying Bayes' theorem with
the naive assumption of independence between every pair of features, which
lets us predict which category a document may belong to. The assumption is
called naive because it rarely holds exactly in practice: the words in a
document are usually not independent of each other. The prior probability of
whether a document belongs to a certain class is calculated using the labels.
Essentially, we attempt to figure out which words predict which class. The code
itself is very straightforward, the only difference being that we use the
labels to train our machine as well.
This is what the code snippet would look like, but you might want to refer to
the notebook once again in case you stumble upon any errors. Do not forget
to transform your data before training the model, and to run X = X.toarray() if it
is a sparse array:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB().fit(X, labels)
from sklearn.svm import SVC
svm = SVC().fit(X, labels)

And there we have it!
The models gnb and svm can use their predict() method to classify unknown
documents into the classes.

For example, with Naive Bayes, calling predict() on the vectorized documents
gives us an array with all the predicted classes. There are four classes
in our dataset, and this is the result we see:
array([0, 3, 3, ..., 3, 3, 3])

Similarly, calling predict() on the SVM model gives us this result:
array([0, 3, 3, ..., 3, 3, 3])
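The whole train-and-predict cycle can be sketched end to end on a tiny corpus. This is only an illustration under assumptions: the four documents, their labels, and the choice of TfidfVectorizer and a linear kernel are made up here and are not the notebook's 20NG setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Hypothetical toy corpus: label 0 = rivers, label 1 = finance.
docs = [
    "the river bank was covered in mud",
    "water flows fast in the river",
    "the bank approved the loan",
    "money and finance at the bank",
]
labels = [0, 0, 1, 1]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix
X = X.toarray()                     # GaussianNB needs a dense array

gnb = GaussianNB().fit(X, labels)
svm = SVC(kernel="linear").fit(X, labels)

# Unknown documents must pass through the same fitted vectorizer.
unknown = vectorizer.transform(
    ["a loan from the bank", "mud in the water"]
).toarray()
gnb_predictions = gnb.predict(unknown)
svm_predictions = svm.predict(unknown)
print(gnb_predictions, svm_predictions)
```

Note that the unknown documents are transformed, not fitted: fitting a new vectorizer on them would produce a different vocabulary and feature order than the one the models were trained on.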

While clustering tends to be a more exploratory process, during
classification we usually want to increase our accuracy, that is, our success
rate at predicting the right class. GridSearchCV [21] is a scikit-learn class
that lets us search for the best parameters for a classifier object, and we
can check the performance of the classifier with the classification_report
function.
For an idea of how to do this, the scikit-learn documentation page for
GridSearchCV [21] gives us a brief example:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)

In the preceding example, we are doing the grid search for SVM and choose
between linear and rbf kernels, and two different values of C.
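A slightly fuller sketch of the same idea is shown here, with a held-out test set so that the report reflects data the search never saw. The split proportions, random_state, and cv=5 are arbitrary choices made for this illustration.

```python
from sklearn import datasets, svm
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

parameters = {"kernel": ("linear", "rbf"), "C": [1, 10]}
clf = GridSearchCV(svm.SVC(), parameters, cv=5)
clf.fit(X_train, y_train)  # tries all four kernel/C combinations

# The best combination found, and per-class precision/recall on held-out data.
print(clf.best_params_)
report = classification_report(y_test, clf.predict(X_test))
print(report)
```

After fitting, the GridSearchCV object behaves like the best estimator it found, so predict() and score() can be called on it directly.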
The code at [20] runs us through the process of comparing multiple
scikit-learn classifiers, and you will notice the approach is quite similar
to the code you have seen so far. The link is worth visiting to see what other
classifiers scikit-learn has to offer, and how to compare the results of these
classifiers. The following image illustrates this, comparing the classifiers
against each other in terms of relative performance and time.

Fig 10.4 Performance of different classifiers on the 20NG dataset. Note that these are for classifiers we have not
explored ourselves; link 21 describes the code and the classifiers to generate this image.

For those curious to use even more powerful machine learning tools, the blog
post at [22]
tells us how to classify documents using Word2Vec. We will be going over
this process in detail ourselves in the chapter on Word2Vec and Doc2Vec.

And that sums it up! You can now build basic classifiers yourself; the classic
problem of classifying emails as spam and not-spam is now something you
can replicate yourself. We have seen various clustering algorithms, such as
k-means and hierarchical clustering. We discussed what supervised
and unsupervised learning algorithms are, and saw examples of how to run
both using scikit-learn.
You can also explore your text data in all sorts of ways with the clustering
and topic modeling tools we have. Let's attempt to go one step further in the
next chapter - and build a basic information retrieval machine which can
search for similar documents.

[1] Iris Flower Dataset
[2] MNIST Digit Data Set
[3] 20 NG Dataset
[4] Clustering and Classification Notebook
[5] Principal Component Analysis
[6] K-Means Clustering
[7] scikit-learn k-means
[8] scikit-learn clustering
[9] Ward's Method
[10] Dendrogram
[11] Document Clustering
[12] Working with Text
[13] Automatic Topic Clustering using Doc2Vec
[14] Naive Bayes
[15] Support Vector Machines
[16] Naive Bayes Scikit-Learn
[17] Support Vector Machine Scikit-Learn
[18] Kernel Method
[19] What does SVM mean in Layman's terms?
[20] Classification of text documents scikit-learn
[21] GridSearchCV
[22] Text Classification with Word2Vec

Similarity Queries and Summarization
Once we have begun to represent text documents in the form of vector
representations, it is possible to start finding the similarity or distance
between documents, and that is exactly what we will learn about in this
chapter. We are now aware of a variety of different vector representations,
from standard bag-of-words or TF-IDF to topic model representations of text
documents. We will also learn about a very useful feature implemented in
Gensim and how to use it—summarization and keyword extraction. Here's a
summary of what we'll learn from this chapter:
Similarity metrics
Similarity queries
Text summarization

Similarity metrics
Similarity metrics [1] are a mathematical construct which is particularly
useful in natural language processing—especially in information retrieval.
Let's first try to understand what a metric is. We can understand a metric as
a function that defines a distance between each pair of elements of a set, or
vector. It's clear how this would be useful to us: we can judge how similar
two documents are based on the distance between them. A low value
returned by the distance function would mean that the two documents are
similar, and a high value would mean they are quite different.
While we mention documents in the example, we can technically compare
any two elements in a set – this also means we can compare between two
sets of topics created by a topic model, for example. We can check between
the TF-IDF representations of documents and between LSI or LDA
representations of documents.
Most of us would be aware of one distance or similarity metric already – the
Euclidean metric. It is one of the first distance metrics we come across in
high school mathematics, and we would have likely seen it being used to
calculate the distance between two points in a 2-dimensional space (XY).
While we won't get into the mathematical details of metrics, it is worthwhile
knowing the four characteristics of a distance metric.
d(x,y) >= 0
This must be non-negative.
d(x,y) = 0 <=> x = y
Here, if x and y are the same, the distance must be zero.
d(x,y) = d(y,x)
This must be symmetric.
d(x,z) <= d(x,y) + d(y,z)
This must obey the triangle inequality law.

Fig 11.1 The four mathematical prerequisites for a function to be a metric
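The four conditions can also be checked numerically on a handful of points. The sketch below is only a spot check, not a proof, and the helper name satisfies_metric_axioms is made up for this illustration. It confirms that the Euclidean distance passes, and that the squared Euclidean distance fails the triangle inequality and so is not a metric.

```python
import itertools
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def satisfies_metric_axioms(dist, points, tol=1e-12):
    """Spot-check the four metric conditions on every pair and triple of points."""
    for x, y in itertools.product(points, repeat=2):
        d = dist(x, y)
        if d < 0:                       # non-negativity
            return False
        if (d == 0) != (x == y):        # zero distance only for identical points
            return False
        if abs(d - dist(y, x)) > tol:   # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:  # triangle inequality
            return False
    return True


points = [(0, 0), (1, 1), (2, 2), (3, 4)]
print(satisfies_metric_axioms(euclidean, points))
print(satisfies_metric_axioms(squared_euclidean, points))
```

For the squared distance, the points (0, 0), (1, 1), and (2, 2) give 8 for the direct route but only 2 + 2 = 4 via the middle point, which violates the fourth condition.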

Gensim (and scikit-learn, and most other machine learning or scientific
computing packages) recognize the importance of distance metrics and have
them implemented as part of the package, which means it's easy to use them
in the context of documents or topics.
Let's now discuss how we can actually use these. We will be following a
tutorial I wrote for Gensim, which you can find at [2]. Note that the Gensim
notebook does not have TF-IDF models; those were added in this chapter to
further help illustrate distances.
Let's first remind ourselves of exactly what distances we will be calculating:
distances between two vector representations of documents. Let's set up our
corpus and the documents we will be comparing. We've used this version of the
corpus before in Chapter 9, Advanced Topic Modeling, to illustrate the examples
on document-word topics.
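from gensim.corpora import Dictionary
texts = [['bank','river','shore','water'], ['river','water','flow','fast','tree'],
         ['bank','water','fall','flow'], ['bank','bank','water','rain','river'],
         ['river','water','mud','tree'], ['money','transaction','bank','finance'],
         ['bank','borrow','money'], ['bank','finance'], ['finance','money','sell','bank'],
         ['borrow','sell'], ['bank','loan','sell']]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]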

Creating TF-IDF and LDA models for the following corpus will help us
illustrate our distance metrics.

from gensim.models import ldamodel
from gensim.models import TfidfModel
tfidf = TfidfModel(corpus)
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2)

Note that a TF-IDF representation has as many features as the size of the
vocabulary, and an LDA representation has as many features as the number of
topics. We will be using both these models later to compare distances.
Now, what do our topics look like?
[(0, u'0.164*"bank" + 0.142*"water" + 0.108*"river" + 0.076*"flow" + 0.067*"borrow" +
0.063*"sell" + 0.060*"tree" + 0.048*"money" + 0.046*"fast" + 0.044*"rain"'),
(1, u'0.196*"bank" + 0.120*"finance" + 0.100*"money" + 0.082*"sell" + 0.067*"river" +
0.065*"water" + 0.056*"transaction" + 0.049*"loan" + 0.046*"tree" + 0.040*"mud"')]

Let's use three documents to compare – a document to do with river banks,
one to do with financial banks, and one that has the context of both (maybe a
financial bank on the bank of a river?).
doc_water = ['river', 'water', 'shore']
doc_finance = ['finance', 'money', 'sell']
doc_bank = ['finance', 'bank', 'tree', 'water']

Once we have our documents, we quickly convert these into a bag of words,
TF-IDF, and LdaModel representations.
bow_water = model.id2word.doc2bow(doc_water)
bow_finance = model.id2word.doc2bow(doc_finance)
bow_bank = model.id2word.doc2bow(doc_bank)
lda_bow_water = model[bow_water]
lda_bow_finance = model[bow_finance]
lda_bow_bank = model[bow_bank]
tfidf_bow_water = tfidf[bow_water]
tfidf_bow_finance = tfidf[bow_finance]
tfidf_bow_bank = tfidf[bow_bank]

Let's have a look at lda_bow_water and see what it looks like:

[(0, 0.8225102558524345), (1, 0.17748974414756546)]

This makes sense – the document contained words to do with river banks,
and its proportion of topic_0 is 82%. The lda_bow_finance variable should be
roughly the opposite – let's test this:
[(0, 0.14753674420005805), (1, 0.852463255799942)]

And voila, as we expected – the LDA representations of the two documents
are quite different, which we could see even when we constructed the
documents. This means that their distance would also be quite high, as they
are not similar documents.
Let's have a quick peek at lda_bow_bank as well:
[(0, 0.44153395450870797), (1, 0.558466045491292)]

This is a well-balanced document with respect to the topics (as expected).
Let's import the distance functions which we will be using: the Hellinger
metric [3], the Kullback-Leibler divergence function [4], and the Jaccard
index [5]. The Hellinger distance and the KL divergence both help us identify
how similar or different two probability distributions are. Links 3, 4, and 5
explain the mathematical basis behind these functions; what we should keep in
mind is that there is no one perfect choice when comparing two documents, so
it is worth giving both a shot, which is why we include examples of both.
The Jaccard index is a more traditional metric primarily used to compare
two sets.
from gensim.matutils import kullback_leibler, jaccard, hellinger

Let's find the distances between our documents:
hellinger(lda_bow_water, lda_bow_finance)
hellinger(lda_bow_finance, lda_bow_bank)
hellinger(lda_bow_bank, lda_bow_water)

Interpreting these results is very straightforward. We find the largest
Hellinger distance returned for the documents to do with finance and water;
they don't have much in common, so this is a good result. The bank
document, which contains both the finance and water contexts, is roughly
equidistant from the water and the finance documents, but is slightly
further from the water document (0.287 as opposed to 0.234). These are
relative values that range from 0 to 1, where 0 means no distance (the
distributions are identical), 0.5 could be intuitively understood as being
in-between, and 1 means the distributions have nothing in common.
This again makes sense: when we looked at lda_bow_bank, it leaned more
towards finance than water.
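For readers who want to see where these numbers come from, here is a minimal re-implementation from the standard definition of the Hellinger distance. It is a sketch that works on plain lists of probabilities, not Gensim's own code, and the function name hellinger_distance is chosen here for illustration. The three distributions are the LDA outputs printed earlier.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)


# Topic proportions of the three documents, as printed above.
water = [0.8225102558524345, 0.17748974414756546]
finance = [0.14753674420005805, 0.852463255799942]
bank = [0.44153395450870797, 0.558466045491292]

print(round(hellinger_distance(water, finance), 3))
print(round(hellinger_distance(finance, bank), 3))  # 0.234
print(round(hellinger_distance(bank, water), 3))    # 0.287
```

The 1/sqrt(2) factor is what keeps the value between 0 and 1: two distributions with no overlap at all, such as [1, 0] and [0, 1], come out at exactly 1.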
Try this: A small exercise for the reader would be to identify why the bank document tends toward the finance
topic: is the word bank in the document associated with finance or water? Doing a document word coloring
would be a way to identify this!

So as we can see, using these distance metrics we can identify how far or
close certain documents are. Its usefulness might not be as apparent in a
small corpus and small documents, but this becomes invaluable as we
continue. We can similarly do the same experiments with the KL function
and Jaccard function. A subtle point to remember; in the strictest sense, the
Kullback-Leibler function is not a metric. This is because it isn't symmetric.
This means that kullback_leibler(lda_bow_finance, lda_bow_bank) is not equal to
kullback_leibler(lda_bow_bank, lda_bow_finance), for example.
Let's illustrate this; we already calculated the Hellinger distance between the
water and finance documents. Calculating the distance after swapping the
finance and water documents should return the same value as before,
because the Hellinger distance is a true mathematical metric.
hellinger(lda_bow_finance, lda_bow_water)

As we expected, we got the same value as before, and this confirms what we
already know about Hellinger – it is a symmetrical distance function. Let's
try the same exercise with the KL function.
kullback_leibler(lda_bow_water, lda_bow_bank)
kullback_leibler(lda_bow_bank, lda_bow_water)

The values aren't terribly far apart, but they aren't the same; this means that
while the KL function can give us an intuition of how far or close two
probability distributions are, it isn't a strict mathematical distance metric.
This doesn't discount its usefulness, however: values close to 0 still
indicate similar distributions, and larger values indicate dissimilar ones
(unlike the Hellinger distance, the KL divergence has no upper bound of 1).
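The asymmetry is easy to see from the definition itself. The following sketch computes the KL divergence directly with the natural logarithm on plain lists of probabilities; it is an illustration written for this passage, not Gensim's implementation, and it assumes the second distribution contains no zeros.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q has no zero entries."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


# Topic proportions of the water and bank documents, as printed earlier.
water = [0.8225102558524345, 0.17748974414756546]
bank = [0.44153395450870797, 0.558466045491292]

forward = kl_divergence(water, bank)
backward = kl_divergence(bank, water)
print(forward, backward)

# A very lopsided pair shows that the value is not capped at 1.
lopsided = kl_divergence([0.99, 0.01], [0.01, 0.99])
print(lopsided)
```

Because each term is weighted by the first distribution only, swapping the arguments changes which outcomes count most, and that is why the two directions disagree.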
Our last distance function is the popular Jaccard metric. Unlike the other
distance functions, the Jaccard method also works on a bag of words.
jaccard(bow_water, bow_bank)
jaccard(doc_water, doc_bank)
jaccard(['word'], ['word'])

The preceding three examples feature two different input methods.
In the first case, we present to jaccard document vectors already in the bag of
words format. The distance can be defined as 1 minus the size of the
intersection divided by the size of the union of the vectors. We can see (on
manual inspection as well) that the distance is likely to be high, and it is.
The last two examples illustrate the ability of jaccard to accept even lists (that
is, documents) as inputs. In the last case, because the two inputs are the same,
the value returned is 0; this means the distance is 0 and they are very
similar.
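For plain lists of words, the definition above can be written out in a few lines. This sketch covers only the set-based case (lists of tokens); it does not reproduce how weighted bag-of-words vectors are handled, and the function name jaccard_distance is chosen here for illustration.

```python
def jaccard_distance(doc_a, doc_b):
    """1 - |intersection| / |union| of the two token sets."""
    set_a, set_b = set(doc_a), set(doc_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty documents are treated as identical
    return 1 - len(set_a & set_b) / len(union)


doc_water = ['river', 'water', 'shore']
doc_bank = ['finance', 'bank', 'tree', 'water']

# Only 'water' is shared, out of six distinct words: 1 - 1/6.
print(jaccard_distance(doc_water, doc_bank))
print(jaccard_distance(['word'], ['word']))
```

The two documents share a single word, so the distance is high, while identical inputs give a distance of 0.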
We can also use these distance functions to find how close or far apart topics
themselves are. While it is more useful to attempt doing this with larger
corpuses and larger vocabularies, we will still give this a shot. First, we must
make appropriate changes to the way we display topics so that we can pass it
into our distance functions.
def make_topics_bow(topic):
    # takes the string returned by model.show_topics()
    # split on '+' to get each word and its probability
    topic = topic.split('+')
    # list to store topic bows
    topic_bow = []
    for word in topic:
        # split probability and word
        prob, word = word.split('*')
        # get rid of spaces and of the quotes around the word
        word = word.replace(" ", "").replace('"', "")
        # convert the word to its integer id
        word = model.id2word.doc2bow([word])[0][0]
        topic_bow.append((word, float(prob)))
    return topic_bow

On passing the results of model.show_topics() to this function, we can create appropriate
inputs for our distance functions:
topic_water, topic_finance = model.show_topics()
finance_distribution = make_topics_bow(topic_finance[1])
water_distribution = make_topics_bow(topic_water[1])

Let's look at what finance_distribution would look like, for example.
[(3, 0.196),
(12, 0.12),
(10, 0.1),
(14, 0.082),
(2, 0.067),
(0, 0.065),
(11, 0.056),
(15, 0.049),
(5, 0.046),
(9, 0.04)]

This basically maps the ID of the word and its proportion in the topic.
Let's now run the following:
hellinger(water_distribution, finance_distribution)

A small corpus and overlap of the word bank in the topics mean that the
distance doesn't seem as large as we expect – an interesting experiment is to
generate more topics with a larger corpus and rank topic pairs on how similar
they are – this would better illustrate the distance metrics we have been
working with.
That pretty much sums up using distance functions in the context of
documents and topics – but we should remember that we can compare
between any two vector representations of topic distributions, so this is a
useful function to have in our arsenal.
For a few more details on using the distance metrics, it is recommended that
you run the Jupyter notebook [2] which illustrates these examples.

We can now move on to making queries and using these distance metrics for
more sophisticated purposes!

Similarity queries
Now that we have the capability to compare between two documents, it is
possible for us to set up our algorithms to extract out the most similar
documents for an input query – simply index each of the documents, then
search for the lowest distance value returned between the corpus and the
query, and return the documents with the lowest distance values – these
would be most similar. Luckily for us, however, Gensim has in-built
structures to do this document similarity task!
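The brute-force approach described above can be sketched in a few lines before turning to Gensim's index. The example below is an illustration under assumptions: it uses a set-based Jaccard distance on raw tokens as a stand-in for any distance function, four small documents as the corpus, and the made-up helper name most_similar.

```python
def jaccard_distance(doc_a, doc_b):
    """Set-based Jaccard distance between two lists of tokens."""
    set_a, set_b = set(doc_a), set(doc_b)
    return 1 - len(set_a & set_b) / len(set_a | set_b)


def most_similar(query, documents, distance, top_n=3):
    """Compare the query with every document and return the closest ones."""
    scored = [(distance(query, doc), doc_id) for doc_id, doc in enumerate(documents)]
    scored.sort()  # lowest distance first
    return [(doc_id, dist) for dist, doc_id in scored[:top_n]]


documents = [
    ['bank', 'river', 'shore', 'water'],
    ['finance', 'money', 'sell', 'bank'],
    ['borrow', 'sell'],
    ['bank', 'loan', 'sell'],
]
query = ['finance', 'money', 'sell']

ranked = most_similar(query, documents, jaccard_distance)
for doc_id, dist in ranked:
    print(documents[doc_id], dist)
```

Every query scans the whole corpus, so the cost grows linearly with the number of documents; this is exactly the work a prebuilt index is meant to speed up.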
We will be using the similarities module to construct this structure.
from gensim import similarities

We previously mentioned creating an index; we can do this far faster with
the similarities module. As mentioned in the Gensim documentation for the
Similarity class, the Similarity class splits the index into several smaller
sub-indexes (shards), which are disk-based. If your entire index fits in memory
(hundreds of thousands of documents for 1 GB of RAM), you can also use
the MatrixSimilarity or SparseMatrixSimilarity classes directly. These are simpler
but do not scale as well (they keep the entire index in RAM, no sharding).
Since we have a small corpus, we can use the MatrixSimilarity class to create our
index.
index = similarities.MatrixSimilarity(model[corpus])

We created our index based on the similarities created by the LDA
transformation of our corpus. We can create the same index using TF-IDF, or
even bag of words, but we can expect better performance when using topics.
We should also keep in mind that our queries should be in the same input
space as the representation in which we created our index.
Now that we've created our index, we can query to find the most similar
documents in the corpus. Let's use the same lda_bow_finance document and find
which articles are most similar.

sims = index[lda_bow_finance]

sims now contains the similarity of each document to our query; let's have a
better look at what's inside:


(8, 0.99956596),
(9, 0.5114244),
(10, 0.9995375)]

And there we go! We now have a list with each document and the
corresponding similarity values. Keep in mind that these values were
generated using cosine similarities – Gensim doesn't have the functionality
to plug in our own similarity metrics, so until then we must stick to using the
cosine similarity – or create our own indexing method.
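Cosine similarity itself is simple to write down. The sketch below computes it for sparse vectors in the same (id, value) format used throughout this chapter; it is a plain-Python illustration rather than Gensim's optimized implementation, and it reuses the LDA outputs printed earlier for the three example documents.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors given as lists of (id, value)."""
    u, v = dict(u), dict(v)
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


lda_water = [(0, 0.8225102558524345), (1, 0.17748974414756546)]
lda_finance = [(0, 0.14753674420005805), (1, 0.852463255799942)]
lda_bank = [(0, 0.44153395450870797), (1, 0.558466045491292)]

print(cosine_similarity(lda_finance, lda_bank))
print(cosine_similarity(lda_finance, lda_water))
```

Unlike a distance, a higher value here means more similar: the finance document scores higher against the bank document than against the water document.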
Let's look at which documents were actually picked up, and sort them
according to how similar they are.
sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_id, similarity in sims:
    print(texts[doc_id], similarity)
['finance', 'money', 'sell', 'bank'] 0.99956596
['bank', 'loan', 'sell'] 0.9995375
['money', 'transaction', 'bank', 'finance'] 0.99913883
['bank', 'finance'] 0.9970802
['bank', 'borrow', 'money'] 0.8764254
['borrow', 'sell'] 0.5114244
['bank', 'river', 'shore', 'water'] 0.36124918
['river', 'water', 'mud', 'tree'] 0.33108047
['bank', 'water', 'fall', 'flow'] 0.30807066
['bank', 'bank', 'water', 'rain', 'river'] 0.30388257
['river', 'water', 'flow', 'fast', 'tree'] 0.27387184

Quite nifty, eh? By simply sorting sims we have an ordered list of each
document's similarity; we then print the original documents. Our query was
the LDA representation of a finance-related document, and the similarity
query returned all finance-related documents as most similar while the
documents to do with trees and rivers were least similar, just as we would
expect.
The tutorial [6] on the Gensim website performs a similar experiment, but on
the Wikipedia corpus – it is a useful demonstration on how to conduct
similarity queries on much larger corpuses and is worth checking if you are
dealing with a very large corpus.
A Gensim-related project, simserver [7] contains more dedicated similarity
querying functionalities, but this project is no longer maintained as open
source – with that being said, the tutorial [8] might still be relevant, and the
GitHub source code [9] could inspire you in your similarity querying efforts.
What have we seen in the last two sections? We can now effectively compare
between two probability distributions, which means we can compare both
topics and documents. This means that we are one step closer to creating our
own search engine – and with the similarities module, the heavy lifting is
done for us, and we have a ready-made API to conduct basic queries!
In the examples, we used LDA models for both distance calculation and to
generate the index for the similarities. We can, however, use any vector
representation of documents to generate this – it's up to us to decide which
one would be most effective for our use case.

Summarizing text
Often in text analysis, it is useful to summarize large bodies of text, either
to get a brief overview of the text before analyzing it deeply or to identify
the keywords in a text. It is also often the end game, a text analysis task of
its own. We will not be working on building our own text summarization
pipeline, but rather focus on using the built-in summarization API which
Gensim offers us.
It is important to remember that the algorithm included in Gensim does not
create its own sentences, but rather extracts the key sentences from the text
which we run the algorithm on. This summarizer is based on the TextRank
algorithm, from an article by Mihalcea and others, called TextRank [10]. This
algorithm was later improved upon by Barrios and others in another
article, Variations of the Similarity Function of TextRank for Automated
Summarization [11], by introducing a BM25 ranking function [12].
It must be noted that unlike all the other algorithms discussed so far, as of
Gensim version 3.4.0 [13], the text summarization module in Gensim only
works in English – it doesn't allow you to preprocess the text as well or add
your own stop words.
To illustrate the summarization module, we will use a story from the Harry
Potter and the Philosopher's Stone movie.
from gensim.summarization import summarize

We can now simply use the summarization module to create the summarized text.
Remember to copy the text you wish to summarize and store it in the text variable. In our example, we will be
using the following text.

Eleven-year-old Harry Potter has been living an ordinary life, constantly
abused by his surly and cold uncle and aunt, Vernon and Petunia Dursley,
and bullied by their spoiled son, Dudley.

Hagrid explains Harry's hidden past as the wizard son of James and Lily
Potter, who are a wizard and witch, respectively, and how they were
murdered by the most evil and powerful dark wizard in history, Lord
Voldemort, which resulted in the one-year-old Harry being sent to live with
his aunt and uncle.
There, Harry also makes an enemy of yet another first-year, Draco Malfoy,
who prejudices against Hermione due to her being the daughter of Muggles,
a term used by wizards and witches, which describes ordinary humans with
no magical ability.
He winds up in Gryffindor instead with Ron and Hermione while Draco is
sorted into Slytherin, like his whole family before him. As classes begin at
Hogwarts, Harry discovers his innate talent for flying on broomsticks despite
no prior experience and is recruited into his House's Quidditch (a
competitive wizards' sport, played in the air) team as a Seeker, which is said
to be the most difficult role.
When the school's headmaster Albus Dumbledore is lured from Hogwarts
under false pretenses, Harry, Hermione, and Ron fear that the theft is
imminent and descend through the trapdoor themselves.
The eventful school year ends at the final feast, during which Gryffindor
wins the House Cup. Harry returns to Privet Drive for the summer,
neglecting to tell them that the use of spells is forbidden by under-aged
wizards and witches and thus anticipating some fun and peace over the
A quick glance tells us this pretty much covers the important parts of the
book (further examination of the original Wikipedia text can help with this).
Of course, it isn't a perfect summarization of the story; a little more fine-tuning is required.
If we would just like the top sentences picked up from the paragraph and
returned as a list, we can use the split option, which returns a list of strings
instead of a single string.
We can also adjust how much text the summarizer outputs via the ratio
parameter or the word_count parameter. Using the ratio parameter, you specify
what fraction of sentences in the original text should be returned as output.
The default is 20%.

Now, let's run the following:
print(summarize(text, word_count=50))

We get this:
He winds up in Gryffindor instead with Ron and Hermione while Draco is sorted into
Slytherin, like his whole family before him. As classes begin at Hogwarts, Harry discovers
his innate talent for flying on broomsticks despite no prior experience and is recruited
into his House's Quidditch (a competitive wizards' sport played in the air) team as a
Seeker, which is said to be the most difficult role.

Here the summarizer has kept only the top-ranked sentences; with a word
limit of 50, very little of the text can make the cut. We see here that it is
not always the best algorithm for very short summaries if the sentences are
long.
An interesting experiment for the reader: try using the same summarization technique on the IMDB synopsis of
the plot of the Harry Potter and the Philosopher's Stone movie and compare the results!

As mentioned earlier, this module also supports keyword extraction.
Keyword extraction works in the same way as a summary generation (that is,
sentence extraction), in that the algorithm tries to find words that are
important or seem representative of the entire text. The keywords are not
always single words; in the case of multi-word keywords, they are typically
all nouns.
from gensim.summarization import keywords
powerful dark
life constantly
magic like


desires come
dog standing
requires unique skills possessed

A quick run-through of the words tells us that they are indeed the keywords
from the synopsis.
For your information, the parameters of the keywords function are as
follows:
text (str): Input text.
ratio (float, optional): If no words option is selected, the number of sentences is reduced by the provided ratio; otherwise, the ratio is ignored.
words (int, optional): Number of returned words.
split (bool, optional): If True, split the keywords into a list.
scores (bool, optional): Whether to return the score of each keyword.
pos_filter (tuple, optional): Part of speech filters.
lemmatize (bool, optional): If True, lemmatize words.
deacc (bool, optional): If True, remove accentuation.

An excerpt from the Gensim tutorial tells us a little bit about the complexity
and time taken for the algorithm:

Fig 11.2 A figure describing the running time versus size of the corpus

In the preceding plot, we see the running times together with the sizes of the
datasets. To create datasets of different sizes, we have simply taken prefixes
of text; in other words, we take the first n characters of the book. The
algorithm seems to be quadratic in time, so one needs to be careful before
plugging a large dataset into the summarizer. One reason for this difference
in running times is the data structure that is used. The algorithm represents
the data using a graph, where vertices (nodes) are sentences and then
constructs weighted edges between the vertices that represent how the
sentences relate to each other. This means that every piece of text will have a
different graph, thus making the running times different. The size of this
data structure is quadratic in the worst case (the worst case is when each
vertex has an edge to every other vertex).
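The graph described above can be sketched in plain Python to show where the quadratic cost comes from. This is a simplified illustration, not Gensim's implementation: it uses the word-overlap similarity from the original TextRank paper rather than BM25, a naive sentence splitter, and made-up function names. Every pair of sentences is compared to build the weighted edges, and a PageRank-style iteration then scores the sentences.

```python
import math
import re


def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    return set(re.findall(r"[a-z']+", sentence.lower()))


def overlap_similarity(a, b):
    """Shared words, normalized by the log lengths of both sentences."""
    if len(a) < 2 or len(b) < 2:
        return 0.0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank_sentences(text, top_n=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Weighted edges between every pair of sentences: n * n comparisons.
    weights = [[overlap_similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [(1 - damping) + damping * sum(
            weights[j][i] / out_sum[j] * scores[j]
            for j in range(n) if out_sum[j] > 0) for i in range(n)]
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]  # keep original order


sample = ("Cats chase mice in the barn. Mice hide from cats in the barn. "
          "The stock market fell sharply today. Cats and mice live in the barn.")
summary = textrank_sentences(sample, top_n=2)
print(summary)
```

The sentence about the stock market shares almost no words with the others, so it receives a low score and is left out of the two-sentence summary.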

Gensim has another way to extract keywords: Montemurro and Zanette's
entropy-based keyword extraction algorithm. The paper, Towards the
quantification of the semantic information encoded in written language [14],
describes the algorithm, which uses the entropy of each word's distribution
among the blocks to pick keywords.
from gensim.summarization import mz_keywords
mz_keywords(text, scores=True, weighted=False, threshold=1.0)
[(u'had', 0.002358350743193241),
(u'from', 0.002039753203785301),
(u'hagrid', 0.002039753203785301),
(u'hermione', 0.002039753203785301),
(u'into', 0.002039753203785301),
(u'hogwarts', 0.0017206396372542237),
(u'an', 0.001400618744466898),
(u'first', 0.001400618744466898),
(u'ron', 0.001400618744466898),
(u'slytherin', 0.001400618744466898),
(u'trapdoor', 0.001400618744466898),
(u'is', 0.00111564319627375),
(u'dark', 0.0010787207994767374),
(u'instead', 0.0010787207994767374),
(u'snape', 0.0010787207994767374),
(u'wizard', 0.0010787207994767374)]

The MZ keyword extraction algorithm tends to perform better for larger
corpuses. The complexity of the algorithm is O(Nw), where N is the number
of words in the document and w is the number of unique words. The
parameters for this algorithm are as follows:
text (str): Document for summarization.
blocksize (int, optional): Size of blocks to use in analysis.
scores (bool, optional): Whether to return score with keywords.
split (bool, optional): Whether to return results as list.
weighted (bool, optional): Whether to weight scores by word frequency. False can be useful for shorter texts and allows automatic thresholding.
threshold (float or 'auto', optional): Minimum score for returned keywords; 'auto' calculates the threshold as n_blocks / (n_blocks + 1.0) + 1e-8. Use 'auto' with weighted=False.

With text similarity and text summarization, we can now build more
sophisticated text analysis pipelines. In the next few chapters, we will be
working on more advanced machine learning techniques for text, such as
deep learning.

Throughout this chapter, we saw how basic mathematical and information
retrieval methods can be used to help identify how similar or dissimilar two
text documents are. We also saw how we can extend these methods to any
probability distribution, such as the topics of a topic model; this can
be particularly handy when we are working with more topics than
we can analyze with the human eye. Summarization is another useful
tool we are now exposed to; since it works on the principle of which
keywords provide the most information in a passage, we can use this
knowledge of keywords to further aid us in building natural language
processing pipelines.
We will now move on to more advanced topics involving neural networks
and deep learning for textual data. These include methods such as Word2Vec
and Doc2Vec, as well as shallow and deep neural networks. We will explore
the Python packages, the theory, and the application of these deep learning
methods to text.

[1] Similarity Metrics
[2] Distance Metrics ipynb
[3] Hellinger Distance
[4] KL-divergence
[5] Jaccard
[6] Gensim Wikipedia example
[7] Simserver
[8] Doc Similarity Server
[9] Gensim SimServer GitHub
[10] TextRank
[11] Variations of the Similarity Function of TextRank for Automated Summarization
[12] BM25
[13] Gensim
[14] Towards the quantification of the semantic information encoded in written language

Word2Vec, Doc2Vec, and Gensim
We have previously talked about vectors a lot throughout the book – they are
used to understand and represent our textual data in a mathematical form,
and all the machine learning methods we use rely on these
representations. We will be taking this one step further, and use machine
learning techniques to generate vector representations of words that better
encapsulate the meaning of a word. This technique is generally referred to as
word embeddings, and Word2Vec and Doc2Vec are two popular variations
of these.
Other word embeddings

Arguably the most important application of machine learning in text
analysis, the Word2Vec algorithm is both a fascinating and very useful tool.
As the name suggests, it creates a vector representation of words based on
the corpus we are using. But the magic of Word2Vec is in how it manages to
capture the semantic representation of words in a vector. The
papers, Efficient Estimation of Word Representations in Vector Space [1]
[Mikolov and others, 2013], Distributed Representations of Words and
Phrases and their Compositionality [2] [Mikolov and others, 2013], and
Linguistic Regularities in Continuous Space Word Representations [3]
[Mikolov and others, 2013] lay the foundations for Word2Vec and describe
their uses.
We've mentioned that these word vectors help represent the semantics of
words – what exactly does this mean? Well, for starters, it means we can
use vector arithmetic on these words. One of the most famous examples is
from Mikolov's paper: if we use V(word) to denote the vector representation
of a word and compute V(King) - V(Man) + V(Woman), the resulting vector is
closest to V(Queen). It is easy to see why this is remarkable – our intuitive
understanding of these words is reflected in the learned vector
representations of the words!
This gives us the ability to add more of a punch in our text analysis pipelines
– having an intuitive semantic representation of vectors (and by extension,
documents – but we'll get to that later) will come in handy more than once.
Finding word-pair relationships is one such interesting use – if we define a
relationship between two words such as France : Paris, using the appropriate
vector difference we can identify other similar relationships – Italy : Rome,
Japan : Tokyo are two such examples which are found using Word2Vec. We
can continue to play with these vectors like any other vectors – by adding
two vectors, we can attempt to get what we would consider the addition of
two words. For example, V(Vietnam) + V(Capital) is closest to the vector
representation of V(Hanoi).
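To make the vector arithmetic concrete, here is a minimal sketch that uses hand-made three-dimensional toy vectors rather than learned embeddings. The vectors, the `analogy` helper, and the meaning attached to each dimension are assumptions made purely for illustration; a trained Word2Vec model has hundreds of dimensions with no such neat interpretation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors (NOT learned embeddings).
# The dimensions loosely stand for: royalty, maleness, femaleness.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # The query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy(["king", "woman"], ["man"], toy_vectors)
print(result)
```

With these toy numbers the target vector points towards "royal and female", so queen is ranked above apple.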
How exactly does this technique result in such an understanding of words?

Word2Vec works by understanding context – in particular, which words
tend to appear around certain other words? We choose a sliding window size,
and based on this window size, attempt to estimate the conditional probability
of observing the output word given the surrounding words. For example, if
the sentence is "The personal nature of text data always adds an extra bit of
motivation, and it also likely means we are aware of the nature of the data,
and what kind of results to expect.", and our target word is motivation,
we try to figure out the odds of finding the word motivation when the
context is "always adds an extra bit of" on the left-hand side of the window
and "and it also likely means" on the right. Of course, this is
just an illustrative example – the exact training procedure requires us to
choose a window size and the number of dimensions, among other details.
There are two main methods to perform Word2Vec training, which are the
Continuous Bag of Words model (CBOW) and the Skip Gram model.
The underlying architecture of these models is described in the original
research paper, but both of these methods involve understanding the
context which we talked about before. The papers written by Mikolov and
others provide further details of the training process, and since the code is
public, we can see exactly what is going on under the hood!
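The difference between the two methods is easiest to see in the training examples each one builds from a window. The sketch below is an illustration only: the helper names are hypothetical, and real implementations add further details (such as subsampling frequent words) that are left out here. CBOW pairs the whole context with the target word, while skip-gram pairs the target with each context word separately.

```python
def context_windows(tokens, window):
    """Yield (target, context_words) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_examples(tokens, window):
    """CBOW: predict the target word from all of its context words at once."""
    return [(context, target)
            for target, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window):
    """Skip-gram: predict each context word from the target word."""
    return [(target, c)
            for target, context in context_windows(tokens, window)
            for c in context]

tokens = "always adds an extra bit of motivation and it also".split()
cbow = cbow_examples(tokens, window=2)
skipgram = skipgram_examples(tokens, window=2)

print(cbow[6])       # the example whose target is "motivation"
print(skipgram[:3])  # the first few (input, output) pairs
```

Note that skip-gram produces several small examples per position, whereas CBOW produces a single example per position.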
The blog post [4], Word2Vec Tutorial - The Skip-Gram Model, by Chris
McCormick explains some of the mathematical intuition behind the
skip-gram word2vec model, and the post [5], The amazing power of word vectors,
by Adrian Colyer talks about some of the things we can do with
word2vec. The links are useful if you wish to dig a little deeper into the
mathematical details of Word2Vec, a topic we will not be covering in this
chapter. The resources page [6] contains theory and code resources for
Word2Vec and is also useful in case you wish to look up the original material
or other implementation details.
While Word2Vec remains the most popular word vector implementation,
this is not the first time it has been attempted, and certainly not the last
either – we will discuss some of the other word embeddings techniques in
the last section of this chapter. Right now, let's jump into using these word
vectors ourselves.
Gensim comes to our assistance again and is arguably the most reliable open
source implementation of the algorithm, and we will explore how to use it.

Using Word2Vec with Gensim
While the original C code [7] released by Google does an impressive job,
Gensim's implementation is a case where an open source implementation is
more efficient than the original.
The Gensim implementation was coded up back in 2013 around the time the
original algorithm was released – the blog post by Radim Řehůřek [8]
chronicles some of the thoughts and problems encountered in implementing
the same for Gensim, and is worth reading if you would like to know the
process of coding word2vec in Python. The interactive web tutorial [9]
involving Word2Vec is quite fun and illustrates some of the examples of
Word2Vec we previously talked about. It is worth looking at if you're
interested in running Gensim Word2Vec code online, and can also serve as a
quick tutorial of using Word2Vec in Gensim.
We will now get into actually training our own Word2Vec model. The first
step, as with all the other Gensim models we used, involves importing the
appropriate model.
from gensim.models import word2vec

At this point, it is important to go through the documentation for the word2vec
class, as well as the KeyedVectors class, both of which we will use a lot. From the
documentation page, we list the parameters for the word2vec.Word2Vec class.


sg: This defines the training algorithm. By default (sg=0), CBOW is used.
Otherwise (sg=1), skip-gram is employed.

size: This is the dimensionality of the feature vectors.

window: This is the maximum distance between the current and predicted
word within a sentence.

alpha: This is the initial learning rate (will linearly drop to min_alpha as
training progresses).

seed: This is used for the random number generator. Initial vectors for
each word are seeded with a hash of the concatenation of word + str(seed).
Note that for a fully deterministically reproducible run, you must also
limit the model to a single worker thread, to eliminate ordering jitter
from OS thread scheduling. (In Python 3, reproducibility between
interpreter launches also requires the use of the PYTHONHASHSEED
environment variable to control hash randomization.)

min_count: Ignore all words with a total frequency lower than this.

max_vocab_size: Limit RAM during vocabulary building; if there are more
unique words than this, then prune the infrequent ones. Every 10
million word types need about 1 GB of RAM. Set to None for no limit.

sample: This is the threshold for configuring which higher-frequency
words are randomly downsampled; default is 1e-3, the useful range is (0,
1e-5).

workers: Use this many worker threads to train the model (faster training
with multicore machines).

hs: If 1, hierarchical softmax will be used for model training. If set to 0
(default), and negative is non-zero, negative sampling will be used.

negative: If > 0, negative sampling will be used, the int for negative
specifies how many noise words should be drawn (usually between 5-20).
The default is 5. If set to 0, no negative sampling is used.

cbow_mean: If 0, use the sum of the context word vectors. If 1 (default), use
the mean. Only applies when CBOW is used.

hashfxn: This is the hash function to use to randomly initialize weights, for
increased training reproducibility. The default is Python's rudimentary
built-in hash function.

iter: This is the number of iterations (epochs) over the corpus. The
default is 5.

trim_rule: The vocabulary trimming rule specifies whether certain words
should remain in the vocabulary, be trimmed away, or handled using the
default (discard if word count < min_count). This can be None (min_count will
be used), or a callable that accepts parameters (word, count, and min_count)
and returns either utils.RULE_DISCARD, utils.RULE_KEEP, or utils.RULE_DEFAULT. Note
that the rule, if given, is only used to prune vocabulary during build_vocab()
and is not stored as part of the model.

sorted_vocab: If 1 (default), sort the vocabulary by descending frequency
before assigning word indexes.

batch_words: This is the target size (in words) for batches of examples
passed to worker threads (and thus cython routines). The default is
10000. (Larger batches will be passed if individual texts are longer than
10000 words, but the standard cython code truncates to that maximum).
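
Two of these parameters are easy to picture with a few lines of plain Python. The sketch below is not Gensim's code: the helper names are hypothetical, and the functions only mimic the documented behaviour of min_count (dropping words whose total frequency is too low) and of alpha decaying linearly towards min_alpha. The default values 0.025 and 0.0001 are the ones Gensim documents for alpha and min_alpha.

```python
from collections import Counter

def prune_vocab(sentences, min_count):
    """Keep only words whose total frequency is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: c for word, c in counts.items() if c >= min_count}

def linear_alpha(progress, alpha=0.025, min_alpha=0.0001):
    """Learning rate after a fraction (0.0 to 1.0) of training is complete."""
    return alpha - (alpha - min_alpha) * progress

toy_sentences = [["a", "b", "a"], ["a", "c", "b"]]
vocab = prune_vocab(toy_sentences, min_count=2)
schedule = [linear_alpha(step / 4) for step in range(5)]

print(vocab)
print(schedule)
```

Here the word c appears only once and is dropped, and the learning rate starts at alpha and ends at min_alpha.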

We won't be using or exploring all of these parameters in our examples, but
they're still important to have an idea of - fine-tuning your model would
heavily rely on this. When training our model, we can use our own corpus or
more generic ones – since we wish to not train on a particular topic or
domain, we will use the Text8 corpus [10] which contains textual data
extracted from Wikipedia. Be sure to download the data first - we do this by
finding the link under the Experimental Procedure section.
We will be more or less following the Jupyter notebook attached at the end
of this chapter, which can also be found here [13].
sentences = word2vec.Text8Corpus('text8')
model = word2vec.Word2Vec(sentences, size=200, hs=1)

Our model will use hierarchical softmax for training and will have 200
features. This means that it has a hierarchical output and uses the softmax
function in its final layers. The softmax function is a generalization of the
logistic function that squashes a K-dimensional vector z of arbitrary real
values to a K-dimensional vector of real values, where each entry is in the
range (0, 1), and all the entries add up to 1. We don't need to understand the
mathematical foundation at this point, but if interested, links [1] to [3] go
into more details about this.
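As a quick illustration of the softmax function described above, here is a minimal sketch in plain Python. It shows the ordinary ("flat") softmax only; the hierarchical variant selected by hs=1 arranges the vocabulary in a binary tree so that the full sum over all words does not have to be computed, and that part is not shown here.

```python
import math

def softmax(scores):
    """Map arbitrary real scores to values in (0, 1) that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, -1.0])
print(probs)
print(sum(probs))
```

Larger scores receive larger probabilities, and the ordering of the inputs is preserved in the outputs.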
Printing our model tells us this:
Word2Vec(vocab=71290, size=200, alpha=0.025)

Now that we have our trained model, let's give the famous King - Man +
Woman example a try:
model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)[0]

Here, we are adding king and woman (they are positive parameters),
subtracting man (it is a negative parameter), and keeping only the first
(word, similarity) tuple from the returned list.

And voila! As we expected, queen is the closest word vector when we search for
the word most similar to woman and king, but far away from man. Note that since
this is a probabilistic training process, there is a slight chance you might get a
different word - but still relevant to the context of the words. For example,
words like throne or empire might come up.
We can also use the most_similar_cosmul method – the Gensim documentation
[11] describes this as being slightly different to the traditional similarity
function by instead using an implementation described by Omer Levy and
Yoav Goldberg in their paper [12] Linguistic Regularities in Sparse and
Explicit Word Representations. Positive words still contribute positively
toward the similarity, negative words negatively, but with less susceptibility
to one large distance dominating the calculation. Consider this example:
model.wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
[(u'queen', 0.8473771810531616),
(u'matilda', 0.8126628994941711),
(u'throne', 0.8048466444015503),
(u'prince', 0.8044915795326233),
(u'empress', 0.803791880607605),
(u'consort', 0.8026778697967529),
(u'dowager', 0.7984940409660339),
(u'princess', 0.7976254224777222),
(u'heir', 0.7949869632720947),
(u'monarch', 0.7940317392349243)]
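
To see how the two scoring schemes differ, here is a minimal sketch of the additive and multiplicative objectives from the Levy and Goldberg paper, written in plain Python over hand-made toy vectors. The vectors and function names are assumptions for illustration; in the multiplicative version, cosines are shifted into the range [0, 1] and a small epsilon avoids division by zero.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def shifted(u, v):
    """Cosine similarity rescaled from [-1, 1] to [0, 1]."""
    return (1.0 + cosine(u, v)) / 2.0

def cos_add_score(cand, positive, negative):
    """Additive objective: sum of similarities to positives minus negatives."""
    return (sum(cosine(cand, p) for p in positive)
            - sum(cosine(cand, n) for n in negative))

def cos_mul_score(cand, positive, negative, eps=1e-6):
    """Multiplicative objective: product of positives over product of negatives."""
    numerator = 1.0
    for p in positive:
        numerator *= shifted(cand, p)
    denominator = 1.0
    for n in negative:
        denominator *= shifted(cand, n)
    return numerator / (denominator + eps)

# Hand-made toy vectors (NOT learned embeddings).
vec = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}
positive = [vec["king"], vec["woman"]]
negative = [vec["man"]]
candidates = ["queen", "apple"]

best_add = max(candidates, key=lambda w: cos_add_score(vec[w], positive, negative))
best_mul = max(candidates, key=lambda w: cos_mul_score(vec[w], positive, negative))
print(best_add, best_mul)
```

On this tiny example both objectives agree; the difference only shows up on real vocabularies, where a single large term can dominate the additive sum.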

If we wish to look up the vector representation of a word, all we need to do is
index model.wv with that word, just as we would look up a key in a dictionary.

We won't display the output here, but we can expect to see a 200-dimension
array, which is what we specified as our size.
If we wish to save our model to disk and reuse it again, we can do this using
the save and load functionalities. This is particularly useful – we can save
and retrain models, or further train on models adapted to a certain domain.
model.save("text8_model")
model = word2vec.Word2Vec.load("text8_model")

The magic of Gensim remains in the fact that it doesn't just give us the
ability to train a model – as we have been seeing so far, its API
means we don't have to worry much about the mathematical workings, but
can focus on using the full potential of these word vectors. Let's check out
some other nifty functionalities the Word2Vec model offers:
Using word vectors we can identify which word in a list is the farthest away
from the other words. Gensim implements this functionality with the
doesnt_match method, which we illustrate here:
model.wv.doesnt_match("breakfast cereal dinner lunch".split())

As expected, the one word which didn't match the others on the list is picked
out – here, it is cereal. We can also use the model to understand how similar
or different words are in a corpus:
model.wv.similarity('woman', 'man')
model.wv.similarity('woman', 'cereal')
model.wv.distance('man', 'woman')

The results are quite self-explanatory in this case, and as expected, the words
woman and cereal are not similar. Here, distance is merely 1 - similarity.
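The three operations above can be sketched in plain Python to show what is being computed. This is an illustration over hand-made two-dimensional toy vectors, not Gensim's implementation: similarity is the cosine between two vectors, distance is 1 - similarity, and the odd-one-out is taken to be the word whose unit vector agrees least with the mean of all the unit vectors.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def similarity(u, v):
    """Cosine similarity."""
    return dot(unit(u), unit(v))

def distance(u, v):
    return 1.0 - similarity(u, v)

def doesnt_match(words, vectors):
    """Return the word least similar to the mean of all the words."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(units[words[0]])
    mean = unit([sum(units[w][i] for w in words) / len(words) for i in range(dims)])
    return min(words, key=lambda w: dot(units[w], mean))

# Hand-made toy vectors (NOT learned embeddings).
toy = {
    "breakfast": [0.90, 0.30],
    "lunch":     [0.95, 0.20],
    "dinner":    [0.90, 0.25],
    "cereal":    [0.20, 0.90],
}
odd_one = doesnt_match(["breakfast", "cereal", "dinner", "lunch"], toy)
sim = similarity(toy["lunch"], toy["dinner"])
dist = distance(toy["lunch"], toy["dinner"])
print(odd_one, sim, dist)
```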
We can continue training our Word2Vec model using the train method – just
remember to explicitly pass an epochs argument, as this is a suggested way
to avoid common mistakes around the model's ability to do multiple training
passes itself. The Gensim notebook tutorial [14] walks one through how to
perform online training with Word2Vec. Briefly, it requires performing the
following tasks - building a new vocabulary and then running the train
function again.

Once we're done training our model, it is recommended that you start only
using the model's keyed vectors. You might have noticed so far that we've
been using the keyed vectors (which is simply a Gensim class to store vectors)
to perform most of our tasks – model.wv represents this. To free up some RAM
space, we can run the following:
word_vectors = model.wv
del model

We can now perform all the tasks we did before using the word vectors. Keep
in mind that this is not just for Word2Vec but even for all word embeddings.
To evaluate how well our model has done, we can test it on datasets that are
included when we install Gensim.
model.wv.evaluate_word_pairs(os.path.join(module_path, 'test_data','wordsim353.tsv'))
((0.6230957719715976, 3.90029813472169e-39), SpearmanrResult(correlation=0.645315618985209,
pvalue=1.0038208415351643e-42), 0.56657223796034)

Here, to make sure that we find our file, we have to specify the module path
– this is the path for the gensim/test folder, which is where the files exist. We
can also test our model on finding word pairs and relationships by running
the following code.
model.wv.accuracy(os.path.join(module_path, 'test_data', 'questions-words.txt'))

In our examples so far, we used a model which we trained ourselves – this
can be quite a time-consuming exercise sometimes, and it is handy to know
how to load pretrained vector models. Gensim allows for an easy interface to
load the original Google news trained Word2Vec model (you can download
this file from link [9]), for example.
from gensim.models import KeyedVectors
# load the google word2vec model
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)

Our model now uses a 300-dimension word vector model, and we can run all
the previous code examples we ran before, again – the results won't be too
different, but we can expect a more sophisticated model.

Gensim also allows similar interfaces to download models using other word
embeddings – we'll go over this in the last section. We're now equipped to
train models, load models, and use these word embeddings to conduct

We know how important vector representations of documents are – for
example, in all kinds of clustering or classification tasks, we have to
represent our document as a vector. In fact, in most of this book, we have
looked at techniques that either produce vector representations or work on
these vector representations – topic modeling, TF-IDF, and bag-of-words
were some of the representations we previously looked at.
Building on Word2Vec, the kind researchers have also implemented a vector
representation of documents or paragraphs, popularly called Doc2Vec. This
means that we can now use the power of the semantic understanding of
Word2Vec to describe documents as well, and in whatever dimension we
would like to train it in!
Previous methods of using word2vec information for documents involved
simply averaging the word vectors of that document, but that did not provide
a nuanced enough understanding. To implement document vectors, Mikolov
and Le simply added another vector as part of the training process – one
which they called the paragraph id. Similar to word2vec, there are two
primary training methods - Distributed Memory version of paragraph
vector (PV-DM) and Distributed Bag of Words version of paragraph vector (PV-DBOW).
They are variations of the CBOW and Skip Gram models, which were used
to train Word2Vec, and we can understand it as extending the idea of context
to paragraphs by adding a label or an ID. The paper [15] by Mikolov and Le,
Distributed Representations of Sentences and Documents, describes the
algorithm in detail, and if you take the effort to read the Word2Vec papers –
this is definitely worth a shot!
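For comparison, the older averaging approach mentioned above can be sketched in a few lines. The toy vectors and the helper name are assumptions for illustration. One limitation it makes visible (my illustration, not a claim taken from the paper) is that an average ignores word order, so two sentences with opposite meanings can end up with the same document vector.

```python
def average_vector(tokens, vectors):
    """Average the vectors of all known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]

# Hand-made toy vectors (NOT learned embeddings).
toy = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [0.5, 0.5]}

doc_a = average_vector("dog bites man".split(), toy)
doc_b = average_vector("man bites dog".split(), toy)
print(doc_a, doc_b)
```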
For an easier reading into the inner workings of Doc2Vec, the blog post, A
gentle introduction to Doc2Vec [16] also helps. The blog post walks us
through the training methods we discussed before, namely SkipGram and
As usual, we are less interested in the theory and more interested in the
practical applications of these algorithms – so let's jump right into using
Gensim for Doc2Vec!

The one major difference about Gensim's Doc2Vec implementation is that it
doesn't expect a simple corpus as input – the algorithm expects tags or
labels, and we are also expected to provide this as part of our input. Gensim
helps us do this with.

Alternately, we can use this:
sentence = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])

In case of any errors, also try the following:
sentence = TaggedDocument(words=[u'some', u'words', u'here'], tags=[u'SENT_1'])

Here, sentence is an example of what our input is going to be like. For our
illustrative example, we will be using the Lee news corpus, and more or less
follow the Lee tutorial [17] – we are familiar with this corpus, having
previously worked with it during our topic modeling exercises. It should be
noted that similar to Word2Vec, the more varied and larger the corpus, the
better we can expect our training results to be. We load the corpus the same
way we did before:
import os
import gensim

# The Lee corpus ships with Gensim's test data
test_data_dir = '{}'.format(os.sep).join([gensim.__path__[0],
                                          'test', 'test_data'])
lee_train_file = test_data_dir + os.sep + 'lee_background.cor'
lee_test_file = test_data_dir + os.sep + 'lee.cor'

To construct our corpus, we will use the TaggedDocument class.
import smart_open

def read_corpus(file_name, tokens_only=False):
    with smart_open.smart_open(file_name, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                # For test data, yield only the tokens
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(
                    gensim.utils.simple_preprocess(line), [i])

Here, we are simply adding the document number as our label – if we have
further, more useful information about our data, we are welcome to add this
information. In the function we defined to read the Lee corpus, we added a
parameter to only read the tokens – this is for testing purposes.
train_corpus = list(read_corpus(lee_train_file))
test_corpus = list(read_corpus(lee_test_file, tokens_only=True))
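To see the shape of the two corpora without needing the Lee files, here is a small stand-in sketch. ToyTaggedDocument is a hypothetical namedtuple with the same two fields (words and tags) as Gensim's TaggedDocument, and lower().split() is a crude substitute for simple_preprocess; both are assumptions made for illustration.

```python
from collections import namedtuple

# Stand-in with the same two fields as Gensim's TaggedDocument (illustration only).
ToyTaggedDocument = namedtuple("ToyTaggedDocument", ["words", "tags"])

def read_lines(lines, tokens_only=False):
    """Mirror read_corpus, but over an in-memory list of strings."""
    for i, line in enumerate(lines):
        tokens = line.lower().split()  # crude stand-in for simple_preprocess
        if tokens_only:
            yield tokens
        else:
            yield ToyTaggedDocument(tokens, [i])

lines = ["Fire crews battle blaze", "Rain eases fire danger"]
toy_train = list(read_lines(lines))
toy_test = list(read_lines(lines, tokens_only=True))

print(toy_train[1])
print(toy_test[0])
```

The training corpus holds one tagged document per line, with the line number as its only tag, while the test corpus holds plain token lists.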

From here onwards, Gensim's simple API remains the same, and to define
and train our model we run this:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=100)

Again, we will list out all the parameters from the Doc2Vec class. The original
documentation can be found here [18].


dm: This defines the training algorithm. By default (dm=1), distributed
memory (PV-DM) is used. Otherwise, a distributed bag of words (PV-DBOW)
is employed.

vector_size: This is the dimensionality of the feature vectors.

window: This is the maximum distance between the predicted word and
context words used for prediction within a document.

alpha: This is the initial learning rate (will linearly drop to min_alpha as
training progresses).

seed: This is used for the random number generator. Note that for a fully
deterministically reproducible run, you must also limit the model to a
single worker thread, to eliminate ordering jitter from OS thread
scheduling. (In Python 3, reproducibility between interpreter launches
also requires the use of the PYTHONHASHSEED environment variable
to control hash randomization.)

min_count: Ignore all words with a total frequency lower than this.

max_vocab_size: Limit RAM during vocabulary building; if there are more
unique words than this, then prune the infrequent ones. Every 10
million word types need about 1 GB of RAM. Set to None for no limit.

sample: Threshold for configuring which higher-frequency words are
randomly downsampled. The default is 1e-3, values of 1e-5 (or lower) may
also be useful, set to 0.0 to disable downsampling.

workers: Use this many worker threads to train the model (faster training
with multicore machines).

epochs: Number of iterations (epochs) over the corpus. The default
inherited from Word2Vec is 5, but values of 10 or 20 are common in
published paragraph vector experiments.

hs: If 1, hierarchical softmax will be used for model training. If set to 0
(default), and negative is non-zero, negative sampling will be used.

negative: If > 0, negative sampling will be used, the int for negative
specifies how many noise words should be drawn (usually between 5-20).
The default is 5. If set to 0, no negative sampling is used.

dm_mean: If 0 (default), use the sum of the context word vectors. If 1, use
the mean. Only applies when dm is used in non-concatenative mode.

dm_concat: If 1, use concatenation of context vectors rather than
sum/average; default is 0 (off). Note concatenation results in a much
larger model, as the input is no longer the size of one (sampled or
arithmetically combined) word vector, but the size of the tag(s) and all
words in the context strung together.

dm_tag_count: This is the expected constant number of document tags per
document, when using the dm_concat mode; default is 1.

dbow_words: If set to 1 trains word-vectors (in skip-gram fashion)
simultaneous with DBOW doc-vector training; default is 0 (faster
training of doc-vectors only).

trim_rule: The vocabulary trimming rule specifies whether certain words
should remain in the vocabulary, be trimmed away, or handled using the
default (discard if word count < min_count). This can be None (min_count will be
used), or a callable that accepts parameters (word, count, and min_count) and
returns either utils.RULE_DISCARD, utils.RULE_KEEP, or utils.RULE_DEFAULT. Note that the
rule, if given, is only used to prune vocabulary during build_vocab() and is not
stored as part of the model.

We have a fairly small corpus in our case, so we decide on 50 dimensions, a
minimum count of 2 to ignore low information words, and 100 iterations of
our training algorithm.
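# Build the vocabulary first, then train for the configured number of epochs
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)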

And there we go! With this, we have our Doc2Vec model trained. This is just
to illustrate how to go about setting up the corpus and training a model –
evaluating, assessing, and fine-tuning our model is a more nuanced process
and depends on our use case as well – we could see how this was done with
Word2Vec, by attempting to assess question-answer pairs or semantic pairs.
For a more detailed example of evaluating Doc2Vec, as well as code
examples, the Gensim notebook [19] on using IMDB to train the vectors is
worth a look.
In the paper on Doc2Vec, the authors recommend using both the PV-DBOW
training method, as well as the PV-DM method to train the model. We can do
this using the following:
from gensim.models import Doc2Vec
models = [
    # PV-DBOW
    Doc2Vec(dm=0, dbow_words=1, vector_size=200, window=8, min_count=10, epochs=50),
    # PV-DM w/average
    Doc2Vec(dm=1, dm_mean=1, vector_size=200, window=8, min_count=10, epochs=50),
]

We then build the vocabularies before we start with our training. Just a note
here: documents is a placeholder variable for any collection of tagged
documents; we can use train_corpus or provide different documents of our choice.

for model in models:
    model.build_vocab(documents)
    model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

This leaves us with two trained models, and we can evaluate these how we
would like. The ConcatenatedDoc2Vec class helps us here.
You may need to run pip install testfixtures before running this code.

from gensim.test.test_doc2vec import ConcatenatedDoc2Vec
new_model = ConcatenatedDoc2Vec((models[0], models[1]))

As for what we can do with a Doc2Vec model, inferring a vector and
searching for similar vectors are the more obvious applications. We can see
this for ourselves with the Lee dataset from link [17] or the Jupyter notebook:
inferred_vector = model.infer_vector(train_corpus[0].words)
sims = model.docvecs.most_similar([inferred_vector])
[(0, 0.9216967225074768),
(48, 0.822547435760498),
(255, 0.7833435535430908),
(40, 0.7805585861206055),
(8, 0.7584196925163269),
(33, 0.7528027892112732),
(272, 0.7409536838531494),
(9, 0.7000102400779724),
(264, 0.6848353743553162),
(10, 0.6837587356567383)]

Note that in practical cases, we will not be testing for most similar vectors on
the training set – this is to simply illustrate how to use the methods.
We can see that in the list of documents most similar to document 0, ID 0
shows up first – this is obvious (and redundant information). However,
things get interesting when we check the 48th or 255th document. Let's see
what document 0 contains:
"hundreds of people have been forced to vacate their homes in the southern
highlands of new south wales as strong winds today pushed huge bushfire
towards the town of hill top new blaze near goulburn south west of sydney
has forced the closure of the hume highway at about pm aedt marked
deterioration in the weather as storm cell moved east across the blue
mountains forced authorities to make decision to evacuate people from
homes in outlying streets at hill top in the new south wales southern
highlands an estimated residents have left their homes for nearby mittagong
the new south wales rural fire service says the weather conditions which
caused the fire to burn in finger formation have now eased and about fire
units in and around hill top are optimistic of defending all properties as more
than blazes burn on new year eve in new south wales fire crews have been
called to new fire at gunning south of goulburn while few details are
available at this stage fire authorities says it has closed the hume highway in
both directions meanwhile new fire in sydney west is no longer threatening
properties in the cranebrook area rain has fallen in some parts of the
illawarra sydney the hunter valley and the north coast but the bureau of
meteorology claire richards says the rain has done little to ease any of the
hundred fires still burning across the state the falls have been quite isolated
in those areas and generally the falls have been less than about five
millimetres she said in some places really not significant at all less than
millimetre so there hasn't been much relief as far as rain is concerned in fact
they ve probably hampered the efforts of the firefighters more because of the
wind gusts that are associated with those thunderstorms"
A brief run through can tell us it contains information about a fire and the
response of the fire-fighters. As for document 48:
"thousands of firefighters remain on the ground across new south wales this
morning as they assess the extent of fires burning around sydney and on the
state south coast firefighters are battling fire band stretching from around
campbelltown south west of sydney to the royal national park hundreds of
people have been evacuated from small villages to the south and south west
of sydney authorities estimate more than properties have been destroyed in
the greater sydney area fourteen homes have been destroyed in the
hawkesbury area north of sydney and properties have been ruined at jervis
bay john winter from the new south wales rural fire service says firefighters
main concern is the fire band from campbelltown through to the coast that is
going to be very difficult area today we do expect that the royal national park
is likely to be impacted by fire later in the morning he said certainly in terms
of population risk and threat to property that band is going to be our area of
greatest concern in the act it appears the worst of the fire danger may have
passed though strong winds are expected to keep firefighters busy today the
fires have burned more than hectares over the past two days yesterday winds
of up to kilometres an hour fanned blazes in dozen areas including
queanbeyan connor mount wanniassa red hill and black mountain strong
winds are again predicted for today but fire authorities are confident they
have the resources to contain any further blazes total fire ban is in force in
the act today and tomorrow emergency services minister ted quinlan has
paid tribute to the efforts of firefighters there has just been the whole body
of people that have been magnificent in sacrificing their christmas for the
benefit of the community he said."
We can see that Doc2Vec has captured the context well: both documents
describe bushfires in New South Wales and the response of the firefighters.
We just looked up the most similar document – imagine the
power Doc2Vec can bring if used in tandem with clustering and classifying
documents. We urge the reader to retry some of the problems from Chapter 10,
Clustering and Classifying Text, using Doc2Vec instead of TF-IDF or topic
models as representation.
We now have the ability to vectorize (with semantic understanding!) both
our words and documents. While word2vec and doc2vec remain the most
popular vectorizing algorithms, these are not the only ones – let's explore
how to use some of the alternate algorithms in the next section.

Other word embeddings
There is a wealth of word embeddings which we can choose from for our
vectorization tasks – the original implementations of these methods are
scattered around in different languages, hosting websites, binaries, and
repositories – but luckily for us, Gensim comes to the rescue again, with
implementations or well-documented wrappers for most (if not all) of the other
word embeddings.
Gensim has wrappers for WordRank, VarEmbed, and FastText, as well as native
implementations for Poincare Embeddings and FastText. Gensim also has a
neat script to use GloVe embeddings as well, which comes in handy when
comparing between different kinds of embeddings.
Gensim's KeyedVectors class means that we have a base class to use all our word
embeddings. The documentation page [21] covers most of the information
you need to know (though we have already used these vectors in our
examples for Word2Vec).
All we need to know is that after we are finished training our model, it's more
prudent to run this:
word_vectors = model.wv

We then continue using word_vectors for all our other tasks – for most similar
words, most dissimilar, as well as running tests for word embeddings. It's
worth having a look at the code of the [22] file to see what's going
on under the hood!
Once we're aware of how to use word vectors, we can have a look at how to
get the other word embeddings up and running using Python.

GloVe is a word vector representation method where the training is
performed on aggregated global word-word co-occurrence statistics from a
corpus. This means that like Word2Vec, it uses context to understand and
create the word representations. The GloVe method was developed by the
Natural language processing lab at Stanford University – you can find more
information about their project on their website. The research paper
describing the method is called GloVe: Global Vectors for Word
Representation [23] and is well worth a read as it describes some of the
drawbacks of LSA and Word2Vec before describing their own method.
There are multiple implementations of GloVe out there, and even multiple
implementations in Python – but we will only stick to using these vectors,
and not training them. Of course, if one wishes to train their own GloVe
vectors this can be done with either glove_python [24] or just glove [25]. You can
also have a look at the original Stanford code over here [26].
As usual, we will be using Gensim to load these vectors. Our first step is to
download (or train) our GloVe vectors. Once we have them saved, we convert
the GloVe vector format to the Word2Vec format so that we can continue
using them with the Gensim API. Remember to download the GloVe input
file from link [22].
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

Here, we have loaded the GloVe vectors, converted them into the Word2Vec
format, and saved the result to disk. We load this the same way we would
load any saved vector file.
from gensim.models import KeyedVectors
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)

Our model should now work the same way as our Word2Vec models – albeit
slightly better, if we go by the results which the GloVe paper describes. Let's
give our go-to example a shot:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[(u'queen', 0.7698540687561035)]

And like clockwork, we have our expected result!
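
To make the arithmetic behind this query concrete, here is a minimal sketch using made-up two-dimensional vectors. The numbers and the helper names cosine and analogy are purely illustrative assumptions and are not Gensim code. The sketch adds the positive vectors, subtracts the negative one, and ranks the remaining words by cosine similarity; Gensim's most_similar follows broadly the same idea on real, normalized vectors.

```python
import math

# Toy vectors invented for illustration: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    'king':  [0.9, 0.8],
    'queen': [0.9, -0.8],
    'man':   [0.1, 0.8],
    'woman': [0.1, -0.8],
    'apple': [-0.7, 0.05],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Leave the query words themselves out of the ranking.
    candidates = [w for w in vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy(['woman', 'king'], ['man'], toy_vectors)
print(result)
```

With real embeddings the target vector never lands exactly on a word, which is why the result comes with a similarity score rather than a perfect match.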

FastText is a vector representation technique developed at Facebook AI
research. As its name suggests, it is a fast and efficient method to perform
the same task – and because of the nature of its training method, it ends up
learning morphological details as well. FastText is unique because it can
derive word vectors for unknown words or out of vocabulary words – this is
because by taking morphological characteristics of words into account, it can
create the word vector for an unknown word.
This becomes particularly interesting in languages where the morphological
structure is important – Turkish and Finnish are two such examples. It also
means that with a limited vocabulary it is still possible to make sufficiently
intelligent word embeddings. In the case of English, for example, it means it
is able to understand what the ly represents in words like charmingly or
strangely. We can further extend this to say that according to FastText,
embedding(strange) - embedding(strangely) ~= embedding(charming) - embedding(charmingly).
We see that FastText more or less captures this in practice, and does so by
analysing words at the character level rather than treating whole words as
atomic units, as Word2Vec and GloVe do. We test the performance of word
embeddings by measuring how well the vectors perform in semantic tasks and
syntactic tasks. Since morphology refers to the structure or syntax of words,
FastText tends to perform better for syntactic tasks, while Word2Vec performs
better for semantic tasks.
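
The character-level idea can be illustrated with a short sketch. The FastText paper describes wrapping each word in boundary markers and collecting its character n-grams (of length 3 to 6 by default); the function name char_ngrams below is our own, and the real implementation additionally hashes the n-grams into buckets and learns a vector for each one, which this sketch does not attempt.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = '<' + word + '>'
    ngrams = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.add(wrapped[start:start + n])
    return ngrams

dog = char_ngrams('dog')
dogs = char_ngrams('dogs')

# Sub-word units the two words have in common
shared = dog & dogs
print(sorted(shared))
```

Because a word vector is built from the vectors of its n-grams, an unseen word such as dogs can still borrow what the model learned from the n-grams it shares with dog.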
The original paper describing the method is titled Enriching Word Vectors
with Subword Information, and can be found on arxiv [27]. The
implementation by Facebook can be found in their GitHub repo [28]. We will
be using Gensim to use FastText, which contains both a native
implementation as well as a wrapper. The blog post [29] covers some of the
comparisons between FastText and Word2Vec, which we discussed before,
while the notebook [30] has code examples for the same. The blog post is part
of the official Gensim blog and compares the two using Gensim throughout.
Training is similar to the other Gensim models we have dealt with. To use
and train the native Gensim implementation [31], we can run the following
code, where data is a placeholder variable for the textual data you wish to train
the model on.
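from gensim.models.fasttext import FastText
ft_model = FastText(size=100)
# build the vocabulary first so that corpus_count is populated
ft_model.build_vocab(data)
ft_model.train(data, total_examples=ft_model.corpus_count,
               epochs=ft_model.epochs)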

We can also use the original C++ code through a wrapper [32], though this
requires us to first download the code.
from gensim.models.wrappers.fasttext import FastText
# Set FastText home to the path to the FastText executable
ft_home = '/home/bhargav/Gensim/fastText/fasttext'
# train the model
model_wrapper = FastText.train(ft_home, train_file)

Using the vectors generated by FastText is similar to all the word vector
operations we covered before, so we will not explain how to use them – for a
simple illustration, the following Jupyter notebooks help: notebook 1 [33] and
notebook 2 [30].
One interesting exercise to try out with FastText is to see how it evaluates
words not present in the vocabulary. Consider this example:
print('dog' in ft_model.wv.vocab)
print('dogs' in ft_model.wv.vocab)

But, we can still generate word vectors for both dog and dogs despite dogs not
being in the training vocabulary! A quick observation of the vectors also tells
us that they are quite similar as we would expect. We can further verify this:
# membership on the vectors also considers character n-grams
print('dog' in ft_model.wv)
print('dogs' in ft_model.wv)

We leave it to the user to test the other method provided by Gensim, using

WordRank, as the name suggests, attempts to solve embeddings as a ranking
problem. The idea behind it remains similar to GloVe, where we used global
co-occurrences of words to generate the word embeddings. The code can be
downloaded from Bitbucket [34], or from the GitHub repository [35]. The
original paper describing the method is titled WordRank: Learning Word
Embeddings via Robust Ranking, and can also be found on arxiv [36].
Again, we will be using Gensim's wrapper to access and use WordRank. Here,
data is a variable which holds the path to your personal Gensim installation
followed by the data. Remember how we did this for the Lee corpus – we
used gensim.__path__[0].
from gensim.models.wrappers import Wordrank
wordrank_path = 'wordrank' # path to Wordrank directory
out_dir = 'model' # name of output directory to save data to
data = '../../gensim/test/test_data/lee.cor' # sample corpus
model = Wordrank.train(wordrank_path, data, out_dir, iter=21,
                       dump_period=10)  # dump_period must divide 20 evenly

In this particular case, we are using the same Lee corpus to run our training
and testing.
We need to be aware of two parameters, dump_period and iter, which need to be
synchronized, as WordRank dumps the embedding file at the start of the next
iteration. For example, if we want results after 20 iterations, we set iter=21,
and dump_period can be any number that divides 20 without a remainder – for 20,
this could be 2, 4, 5, or 10.
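
The rule above is easy to get wrong by one, so a small helper can do the bookkeeping. This is only a sketch of the arithmetic described in the text; the function name wordrank_settings is our own and is not part of Gensim or WordRank, and we assume (following the example in the text) that 1 and the iteration count itself are left out of the candidates.

```python
def wordrank_settings(wanted_iterations):
    """Return the iter value and the candidate dump_period values.

    iter is one more than the number of iterations we want results for,
    and dump_period must divide that number of iterations evenly.
    """
    iter_value = wanted_iterations + 1
    dump_periods = [d for d in range(2, wanted_iterations)
                    if wanted_iterations % d == 0]
    return iter_value, dump_periods

iter_value, dump_periods = wordrank_settings(20)
print(iter_value, dump_periods)
```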
The Gensim documentation for WordRank can be found at [37], and a basic
tutorial at [38].
Some caveats – a window size of 15 gave the best results, and 100 epochs is a
better idea than 500 epochs, as training time can be quite long. Again, as with
the other embeddings, we use the KeyedVectors class, which provides the same
methods across all word vectors. For a comparison between FastText, Word2Vec,
and WordRank, the blog post [39] and Jupyter notebook [40] will walk you
through.

Varembed is the fourth word embedding method we will discuss, and like
FastText, it takes advantage of morphological information to generate word
vectors. The original paper describing the method is titled Morphological
Priors for Probabilistic Neural Word Embeddings, and can be found on arxiv [41].
Similar to our GloVe vectors, we cannot update our model with new words
and would need to train a new model. Information on training our own
models can be found on the original GitHub page [42] containing the code.
Gensim comes with Varembed word embeddings trained on the Lee dataset,
so we will take advantage of this to illustrate setting up a model. You can find
the documentation for Varembed at [43]. Here, varembed_vectors is a variable
that holds the path to your personal Gensim installation and the test data.
Remember how we did this for the Lee corpus – we used gensim.__path__[0].
from gensim.models.wrappers import varembed
varembed_vectors = '../../gensim/test/test_data/varembed_leecorpus_vectors.pkl'
model = varembed.VarEmbed.load_varembed_format(vectors=varembed_vectors)

We previously mentioned how Varembed uses morphological information –
we can adjust our vectors accordingly by also adding this information. Again,
Gensim comes with this morphological information.
morfessors = '../../gensim/test/test_data/varembed_leecorpus_morfessor.bin'
model = varembed.VarEmbed.load_varembed_format(vectors=varembed_vectors,
                                               morfessor_model=morfessors)

Once our model is loaded, we use its methods just as we did with our other word
embeddings.

The last word embedding technique we will look at is Poincare embeddings,
also developed by the good folks over at Facebook AI research. The general
idea is to use a graph representation of words to better understand the
relationships between words and to generate the word embeddings. Poincare
embeddings can also capture hierarchical information using this graph
representation – in the original paper [44], titled Poincaré Embeddings for
Learning Hierarchical Representations, this hierarchical information is
learned by using the WordNet noun hierarchy. The embeddings are computed in
hyperbolic space rather than the traditional Euclidean space, which allows
us to better capture the notion of hierarchy.
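
To get a feel for why hyperbolic space suits hierarchies, the following sketch computes the distance used in the Poincaré ball model, d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))). The function name and the sample points are our own illustrative choices; this is not the Gensim implementation, only the formula applied to plain Python lists whose norm is below 1.

```python
import math

def poincare_distance(u, v):
    """Distance between two points inside the unit ball (Poincaré model)."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1 + 2 * sq_diff / ((1 - sq_norm_u) * (1 - sq_norm_v)))

# Two pairs of points that are equally far apart in Euclidean terms (0.1)
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.85, 0.0], [0.95, 0.0])
print(near_origin, near_boundary)
```

The same Euclidean gap counts for much more near the boundary of the ball than near the origin, so there is a lot of room near the boundary for the many leaves of a tree, while general concepts can sit close to the origin.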
Gensim's notebook directory contains data required to train these
embeddings. We can access this using the following:
import os
poincare_directory = os.path.join(os.getcwd(), 'docs', 'notebooks',
data_directory = os.path.join(poincare_directory, 'data')
wordnet_mammal_file = os.path.join(data_directory,

To use this data to train our model, we run this:
from gensim.models.poincare import PoincareModel, PoincareKeyedVectors, PoincareRelations
relations = PoincareRelations(file_path=wordnet_mammal_file, delimiter='\t')
model = PoincareModel(train_data=relations, size=2, burn_in=0)
model.train(epochs=1, print_every=500)

We can also use our own iterable of relations to train our model. In such a
case, each relation is just a pair of nodes. Gensim also has pre-trained
models, which we can load as follows:
models_directory = os.path.join(poincare_directory, 'models')
test_model_path = os.path.join(models_directory,
model = PoincareModel.load(test_model_path)

We can use the standard word embeddings methods with our Poincare
model, as well as use graph related information, such as closest_child,
closest_parent, and norm.
For more information about the model, refer to the following:
Documentation [45]
Evaluation [46]
Training [47]
Blog Post [48]
In this chapter, we explored one of the major innovations in text analysis:
word embeddings, or word vectors. Word vectors are unique in being not only
a way for us to represent our documents and our words, but also a new way of
looking at our words. The success of Word2Vec led to an
explosion in various word embedding methods, each with its own quirks,
advantages, and disadvantages. We not only learned about the popular
Word2Vec and Doc2Vec implementations but also five other word
embedding methods – all of them are well supported in the Gensim ecosystem,
making them easy to use.

[1] Efficient Estimation of Word Representations in Vector Space [Mikolov
et al. 2013]:

[2] Distributed Representations of Words and Phrases and their
Compositionality [Mikolov et al. 2013]:

[3] Linguistic Regularities in Continuous Space Word Representations
[Mikolov et al. 2013]:

[4] Word2Vec Tutorial - The Skip-Gram Model:

[5] Amazing power of word vectors:

[6] Word2Vec resources:

[7] Original C Word2Vec code:

[8] Deep Learning with Word2Vec and Gensim:

[9] Interactive Word2Vec tutorial:

[10] text8 data file:

[11] Word2Vec model:

[12] Linguistic Regularities in Sparse and Explicit Word Representations:

[13] Word2Vec/Doc2Vec notebook:

[14] Online word2vec:

[15] Distributed Representations of Sentences and Documents:

[16] A gentle introduction to Doc2Vec:

[17] Doc2Vec Lee tutorial:

[18] Doc2Vec Gensim:

[19] Doc2Vec IMDB:

[20] KeyedVectors:

[21] KeyedVectors file:

[22] GloVe:

[23] GloVe: Global Vectors for Word Representation:

[24] GloVe Python:

[25] GloVe:

[26] Stanford GloVe:

[27] Enriching Word Vectors with Subword Information:

[28] fastText:

[29] FastText and gensim word embeddings:

[30] FastText comparison notebook:

[31] Gensim fastText:

[32] fastText wrapper:

[33] FastText Gensim notebook:

[34] WordRank:

[35] WordRank GitHub:

[36] WordRank: Learning Word Embeddings via Robust Ranking:

[37] WordRank Gensim:

[38] WordRank tutorial:

[39] WordRank blog-post:

[40] WordRank Jupyter notebook:

[41] Morphological Priors for Probabilistic Neural Word Embedding:

[42] GitHub page Varembed:

[43] Varembed:

[44] Poincare Embeddings:

[45] Documentation:

[46] Evaluation:

[47] Training:

[48] Blog Post:

Deep Learning for Text
Until now, we have explored the use of machine learning for text in a variety
of contexts – topic modeling, clustering, classification, text summarization,
and even our POS-taggers and NER-taggers were trained using machine
learning. In this chapter, we will begin to explore one of the most
cutting-edge forms of machine learning – Deep Learning. Deep Learning is a form
of ML where we use biologically inspired structures to generate algorithms
and architectures to perform various tasks on text. Some of these tasks
are text generation, classification, and word embeddings. In this chapter, we
will discuss some of the underpinnings of deep learning as well as how to
implement our own deep learning models for text. The following are the topics
we will cover in this chapter:
Deep learning
Deep learning for text
Text generation

Deep learning
Throughout this book, we have made use of machine learning techniques,
with topic modeling, clustering and classifying algorithms, as well as what we
call shallow learning – word embeddings. Word embeddings were our first
glimpse into neural networks and the kind of semantic information they can
Neural networks can be understood as a computing system or machine
learning algorithm whose architecture is vaguely inspired by biological
neurons in the brain. We say vaguely here because of the lack of thorough
understanding we have of the human brain – though the neural connections
and structure of the brain were certainly influential in some of the basic
building blocks of neural networks, such as the perceptron [1] and
single-layer neural network [2].
A neural network generally consists of a number of nodes that perform
mathematical operations and interact with each other via connections. This
model resembles a brain in the sense that the nodes tend to represent
neurons and the connections the wiring between these neurons. Different
layers can perform a different kind of operations, and there is generally an
input layer, multiple hidden layers, and an output layer.

Fig 13.1 An example of the common structure of neural networks [4]
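
The layered structure described above can be sketched in a few lines of plain Python. The weights and biases below are arbitrary numbers chosen for illustration (a trained network would learn them), and the helper names dense, relu, and sigmoid are our own; the point is only to show an input passing through one hidden layer and one output layer.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each node applies activation(w . x + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Input layer: three values
x = [1.0, 2.0, 3.0]

# Hidden layer: two nodes, each connected to all three inputs
hidden_w = [[0.2, -0.1, 0.4], [-0.5, 0.3, 0.1]]
hidden_b = [0.0, 0.1]

# Output layer: one node connected to both hidden nodes
out_w = [[0.7, -0.3]]
out_b = [0.05]

hidden = dense(x, hidden_w, hidden_b, relu)
output = dense(hidden, out_w, out_b, sigmoid)
print(hidden, output)
```

Adding more hidden layers is just a matter of chaining further dense calls, which is essentially what higher-level libraries do for us.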

In turn, neural network research has inspired cognitive research, and there
was considerable interest in using neural networks to also understand
human brains [3]. Neural networks can be used for most of the machine
learning tasks we have previously conducted, such as classification,
clustering, and as we saw in just the last chapter, in creating vector
representations of words and documents.
Outside of the domain of text analysis, neural networks have found
considerable success. State-of-the-art results in image classification,
computer vision, speech recognition, and medical diagnosis are often
achieved by neural networks. We mentioned earlier that we used neural
networks to generate word embeddings – after the learning or training was
completed, we used the values stored in the hidden layers as our
word embeddings.
We've been talking extensively about neural networks even though this
chapter is titled deep learning – but deep learning is just another way to refer
to neural networks with multiple layers. Since most current neural networks
tend to use multiple layers in their architecture, we can refer to these
techniques as deep learning techniques. There are exceptions to this of
course, such as in Word2Vec, where we pick up the weights from only one
layer.
Neural networks and deep learning architectures are versatile in their use,
and even though we may not have a complete mathematical understanding
of neural networks, they are still a very good choice for practical natural
language processing, and that is what we will attempt to walk the reader
through in this chapter.

Deep learning for text (and more)
We have already seen the power of neural networks first hand when we
used word embeddings. This is one aspect of neural networks – using parts
of the architecture itself to get useful information, but neural networks are
far from limited to this. When we start using deeper networks, it is not
prudent to use the weights to extract useful information – in these cases, we
are more interested in the natural output of the neural network. We can train
neural networks to perform multiple tasks to do with text analysis – indeed,
for some of these tasks, the introduction of neural networks has completely
changed how we approach the task.
A popular example here is Language Translation, and in particular,
Google's Neural Translation model. Until September 2016, Google used
statistical and rule-based methods and models to perform its language
translation, but with the advent of the Google Brain research team, they soon
switched over to using neural networks, and a technique now referred to as
zero-shot translation [5]. Previously, if the Google Translate algorithm
intended to translate from Malaysian to Arabic, for example, it would first
translate to English as an intermediate step. With its current neural network
set-up, the model takes an input sentence and outputs a translated sentence –
of course, it doesn't just return the first output; there is usually a scoring
mechanism, where grammatical correctness is verified.
Instead of breaking up a sentence into multiple parts, performing rule-based
translation and rearranging the sentence, we now have a cleaner method of
attempting a translation. The neural models also tend to be smaller than the
statistical models, even if they might need more data or time to perform the
initial training. More languages are being released by Google as and when
these models outperform the existing models – only recently Google
released new models for a number of Indian languages.
Despite the advances made in machine translation, it still remains a difficult
task – while we can expect more or less grammatically accurate sentences, it
is more meant to provide a general idea to the user about what the input
sentence would mean in another language. Like other deep learning fields,
one can expect the results of machine translation to only get better.

Word embeddings are another very popular use of neural networks for text –
considering how word vectors and document vectors are used in many NLP
tasks, it means word embeddings have a home in many machine learning
algorithms involving text. In fact, replacing all previous vectors with word
embeddings means that we now have a few neural networks in all our
algorithms or applications! With their ability to capture context so well, they
can help considerably in tasks such as clustering or classification.
Also, speaking of techniques such as clustering or classification, we can also
train neural networks to perform these tasks. In fact, more complex text
analysis tasks, such as building a chatbot, require one to perform text
classification along the way. The task that we refer to as sentiment
analysis in the text is essentially a classification task where we classify
documents as having positive or negative sentiment (or multiple sentiments
if that is our job, of course). We can use more complex kinds of neural
networks to perform this, such as Convolutional Neural
Networks (CNN) [6] or Recurrent Neural Networks (RNN) [7], but
even your vanilla single layer NN tends to do an okay job.
We saw a glimpse of this when we were training our own POS-taggers or
NER-taggers – what was going on under the hood was a neural network being
trained to identify different classes for words – these classes were different
parts of speech or named entities. So, technically, we've been using elements
of deep learning in all of our applications, just by using the spaCy trained
POS-taggers!
We will not be going into the mathematical details of neural networks as that
is beyond the scope of the book, but when discussing different kinds of neural
networks and how we will be using them, we will attempt to discuss the
architecture and, more importantly, the hyperparameters and best practices
of that particular method. Just a reminder: hyperparameters are parameters
to a machine learning algorithm which are set before starting the algorithm.
When dealing with vanilla neural networks or even convolutional neural
networks, our input space and output space are fixed – we decide what the
input is. It could be an image, or it could be a sentence, but it is basically a
vector input which produces a vector output. In natural language processing,
this vector output can be, for example, the probabilities of a document
belonging to a certain class. Recurrent neural networks are different in this
regard because of their architecture – by allowing sequences as inputs, we can
do a lot more than just predict classes. Recurrent neural networks are
particularly useful for text because they understand the input data as
sequences, and allow us to capture the context of the words in a sentence.
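
A minimal sketch can show what treating the input as a sequence means in practice. Below, a single recurrent unit keeps a hidden state h and updates it at every step as h = tanh(w_x * x + w_h * h + b). The weights are arbitrary illustrative values and the function name run_rnn is our own; a real RNN uses vectors and matrices and learns its weights, but the recurrence is the same.

```python
import math

def run_rnn(sequence, w_x=0.8, w_h=0.5, b=0.0):
    """Feed a sequence through one recurrent unit and return the final state."""
    h = 0.0  # the hidden state starts out empty
    for x_t in sequence:
        # the new state depends on the current input AND the previous state
        h = math.tanh(w_x * x_t + w_h * h + b)
    return h

# The same values in a different order give a different final state
early = run_rnn([1.0, 0.0, 0.0])
late = run_rnn([0.0, 0.0, 1.0])
print(early, late)
```

A network with a fixed-size vector input has no built-in notion of order between separate inputs, whereas here the position of the 1.0 in the sequence changes the result, which is what lets recurrent networks pick up on word order and context.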
One of the ideas behind how neural networks work with text is that they
generate a probabilistic language model for the body of text. This can be
understood as a technique where we calculate the probability of the next word
(or character!) in a sequence based on the previous inputs. In other words, they
attempt to calculate the probability of a word based on its context. Indeed,
even before neural networks were regularly used in natural language
processing this was a popular method – we have previously used n-grams,
which more or less work on the same principle. Based on a corpus or group
of texts, an n-gram model attempts to learn the odds of two words appearing
next to each other based on a particular context – that is, the words around
them. This is how we start to consider new_york as a new addition to our
vocabulary: there is a high probability that these two words will appear next
to each other, and this probability was calculated through basic conditional
probability and the chain rule of probability.
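
The conditional probabilities mentioned here are simple to compute by counting. The following sketch uses a three-sentence toy corpus of our own invention and two hypothetical helper functions; it estimates P(second | first) as count(first, second) / count(first), and multiplies such terms together for a longer sequence, as the chain rule suggests.

```python
from collections import Counter

# A tiny made-up corpus, already lowercased
corpus = [
    'i love new york',
    'new york is big',
    'a new idea',
]
tokens = [sentence.split() for sentence in corpus]

unigram_counts = Counter(word for sent in tokens for word in sent)
bigram_counts = Counter(pair for sent in tokens
                        for pair in zip(sent, sent[1:]))

def bigram_probability(first, second):
    """Estimate P(second | first) from the counts."""
    return bigram_counts[(first, second)] / unigram_counts[first]

def chain_probability(words):
    """Probability of the words following words[0], one bigram at a time."""
    prob = 1.0
    for first, second in zip(words, words[1:]):
        prob *= bigram_probability(first, second)
    return prob

print(bigram_probability('new', 'york'))
print(bigram_probability('new', 'idea'))
print(chain_probability(['new', 'york', 'is']))
```

In this toy corpus york follows new twice as often as idea does, which is the kind of evidence that leads us to treat new_york as a single token.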
When using a neural network, we can argue that by learning the odds or
probabilities of words or characters appearing, we are using a sequence
generator, or that a neural network is now a generative model. Generative
models in the context of natural language processing can be particularly
interesting – if we can teach a neural network what kind of sentences occur
with high probability, we can also attempt to make this neural network
output sequences which mimic the text it was trained on.
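
The step from a probability model to a generator can be shown without any neural network at all. In this sketch (the training text, the function name generate, and the fixed random seed are all our own assumptions) we count which character follows which, and then repeatedly sample the next character in proportion to those counts. A neural network replaces the count table with a learned function and looks much further back than one character, but the sampling loop is the same idea.

```python
import random
from collections import Counter, defaultdict

text = 'the wall is painted blue. the wall is painted red. '

# For every character, count the characters that follow it
next_char_counts = defaultdict(Counter)
for current, following in zip(text, text[1:]):
    next_char_counts[current][following] += 1

def generate(seed_char, length, rng):
    """Extend seed_char by sampling each next character from the counts."""
    out = seed_char
    for _ in range(length):
        counts = next_char_counts[out[-1]]
        if not counts:  # no known continuation for this character
            break
        chars, weights = zip(*counts.items())
        out += rng.choices(chars, weights=weights, k=1)[0]
    return out

sample = generate('t', 40, random.Random(0))
print(sample)
```

The output is locally plausible but quickly drifts into nonsense, since each choice only looks at one previous character; this limitation is what motivates models that remember more context.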
It's this same thinking which lets us create word embeddings – the odds of
the word blue appearing after the sentence the wall is painted would be
similar to the word red appearing, and our embeddings learn to encode blue
and red with similar semantics. This kind of semantic understanding is
further explored with experiments on shared representations. Shared
representations are the idea that different kinds of input sharing the same
semantics can map to the same vector space – for example, the English word
for dog and the Chinese word for dog would map to very similar vectors in a
shared Chinese-English vector space. But the power of neural networks gets
even more impressive – it is possible to train a network to also map images
to the same space! Image captioning is also a task performed well by such
neural networks.
Using reinforcement learning [8] (where our model learns from its own
mistakes through a system of rewards and punishments) neural networks
have also been able to beat humans at the game of Go, which was once
considered a very tough game for artificial intelligence systems to beat.
One of the first natural language processing tasks was text summarization –
the traditional approach to such a problem is to rank the sentences based on
which ones provide the most information, and choose a subset of these. We
used such an algorithm in our own attempts at text summarization. With
deep learning, however, we now have the capacity to generate text, and much
like more human text summarization attempts, we will not be just choosing
important sentences, but rather creating the summary from a probabilistic
model. This process is also often referred to as Natural Language
Generation (NLG).
Indeed, when we previously discussed the power of neural networks in
language translation, it is through such generative models that the network
recreates the sentence in another language. As our first example of using
neural networks for text, we will attempt to generate text – in a variety of

Generating text
In our discussions involving deep learning and natural language processing,
we extensively spoke about how it is used in text generation to very
convincing results – we are now going to get our hands dirty with a little bit
of text generation ourselves.
The neural network architecture we will be using is a recurrent neural
network, and in particular, an LSTM [9]. LSTM stands for Long Short-Term
Memory and is unique because its architecture allows it to capture
both the short-term and long-term context of words in a sentence. The very
popular blog post Understanding LSTM Networks [11] by deep learning
researcher Christopher Olah (colah) is a great way to further understand LSTMs.
This is the same architecture used in the popular blog post [10] by Andrej
Karpathy, The Unreasonable Effectiveness of Recurrent Neural Networks, though
Karpathy wrote the code for his NN in Lua – we will be using Keras, which
with its high level of abstraction serves as a perfect choice.
The Python ecosystem for deep learning is certainly thriving now –
depending on your use case there are multiple ways we can build a deep
learning system. For us, we would like a high level of abstraction, and the
ability to easily use text to train our machine. As of now, in 2018, choosing a
deep learning framework is no easy task; we will stick with Keras for our
deep learning tasks, but not before briefly discussing what other tools are
out there.

1. TensorFlow: TensorFlow is a neural network
library released by Google, and also happens to be the same framework
that their artificial intelligence team, Google Brain, uses. It is, of course,
different from the exact framework used for production, but TensorFlow
is nevertheless very well maintained, has an active community, and
has strong GPU support. GPU support is important because it allows us
to perform mathematical operations faster than a normal CPU can.
Because of its graph-based computation model, it ends up being a
natural fit for constructing neural networks. It offers a high level of
control and options in terms of how low-level you want the operations to
be and is generally a popular choice now in both research and industry.

2. Theano: Arguably one of the first
thorough deep learning frameworks, it was built at MILA by Yoshua
Bengio, one of the pioneers of deep learning. Focused on using symbolic
graphs as the building blocks of neural networks, its API is quite low
level, and if used effectively can result in some very powerful deep
learning systems. It is not being maintained anymore, but is still worth
checking out, even if just for the history! The libraries Lasagne [12] and
Blocks [13] allow you to use Theano from a higher layer of abstraction.

3. Caffe & Caffe2: Caffe is
one of the first dedicated deep learning frameworks, developed at UC
Berkeley. It is both fast and modular, if a bit clunky to use because it is
not written in native Python and requires you to manage .prototxt files
to use the networks in your applications. These .prototxt files describe
neural networks using a predefined format you can find here [14]. This
only adds an extra layer of complexity to the time we spend coding neural
networks, and there are more abstracted libraries which we would rather
use.

4. PyTorch: The new kid on the block but also a library
which is growing rapidly, PyTorch is loosely based on Lua's Torch
library. The Facebook Artificial Intelligence Research team
(FAIR) has endorsed PyTorch, and with a healthy mix of low-level and
high-level APIs also based on dynamic computational graphs, it is
definitely worth checking out.

5. Keras: Keras will be our library of choice – and we are
not alone here. With its high level of abstraction and clean API, it
remains the best deep learning framework for prototyping and can use
either Theano or TensorFlow as the backend for constructing the
networks. It is very easy to go from idea to execution, as we will see
in our text generation example. It has a large and active community, and
with TensorFlow announcing they will be shipping with Keras, it means
that it will continue to be used for the foreseeable future.
We invite the reader to have a look at the other deep learning frameworks
out there – depending on the use case, a different framework might be better
for you! Of course, the techniques we will be trying out will remain the same,
so apart from syntactic changes we can expect the same logic and process for
text generation.
We mentioned before that we would be using a recurrent neural network for
our example. A recurrent neural network does one step better than other
neural networks because of its ability to remember the context, as each layer
in the network is built with information from the previous layer – this
additional context allows it to perform better, and also gives it the name
We will be using a particular variant of an RNN called LSTM, or Long
Short-Term Memory – as the name suggests, it has the ability to have a
short-term memory which can last for a long period of time. Whenever there is a
significant time lag between inputs, LSTMs tend to perform well – considering
the nature of language, where a word which appears later on in a
sentence is influenced by the context of the sentence, this property
becomes more important. We mentioned before that it is unique because it can
understand the context of words immediately around it while remembering
words from before.
For a more detailed explanation of the mathematics or intuition behind an
LSTM and RNN, the following blog posts can be very useful (we've come
across these blog posts earlier on in the chapter).
Understanding LSTM Networks [11]
Unreasonable Effectiveness of Recurrent Neural Networks [10]

As usual, we start with our imports – be sure to install Keras and tensorflow
using pip or conda before we start!
The code we will be using as a reference is from the Jupyter notebook [15],
though there will be some differences.
import keras
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
import numpy as np

Here, we use Keras' sequential model where we can add an LSTM structure.
The next step is to deal with data organization. We can use any text source as
our input, based on what kind of data we would like to generate. This is
where we can get creative - do we want our RNN to write like J.K. Rowling,
Shakespeare, or even like yourself – if you have enough examples of your
writing stored somewhere!
When using Keras to generate text, we need to generate a mapping of all the
distinct characters in the text (our LSTM is a character-level model). A
note: source_data.txt here is your personal dataset of choice. In the example
code which follows, the values of the variables depend on the dataset you
choose, but the code will run fine with any text file.
filename = 'data/source_data.txt'
data = open(filename).read()
data = data.lower()
# Find all the unique characters
chars = sorted(list(set(data)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
ix_to_char = dict((i, c) for i, c in enumerate(chars))
vocab_size = len(chars)
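As a quick illustration of the mapping step, here is the same logic applied to a short string instead of a file. The name sample_text and its contents are made up for this sketch.

```python
# Toy illustration of the character <-> integer mappings (sample_text is made up).
sample_text = "hello world"

chars = sorted(set(sample_text))
char_to_int = {c: i for i, c in enumerate(chars)}
ix_to_char = {i: c for i, c in enumerate(chars)}

# Encode the text as integers, then decode it back to check the round trip.
encoded = [char_to_int[c] for c in sample_text]
decoded = "".join(ix_to_char[i] for i in encoded)

print(chars)
print(encoded)
print(decoded)
```

The round trip returns the original string, which is what lets us feed integers to the model and turn its integer predictions back into text.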

Our two dictionaries will help us both in passing characters to our model and
in generating text. A standard data source will give us results resembling
the following if we print chars, vocab_size, and char_to_int.
This is the list of unique characters:
['\n', ' ', '!', '&', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7',
'8', '9', ':', ';', '?', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k',
'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

This is the number of unique characters:
51

The character to integer mapping is as follows:
{'\n': 0, ' ': 1, '!': 2, '&': 3, "'": 4, '(': 5, ')': 6, ',': 7, '-': 8, '.': 9, '0': 10,
'1': 11, '2': 12, '3': 13, '4': 14, '5': 15, '6': 16, '7': 17, '8': 18, '9': 19, ':': 20,
';': 21, '?': 22, '[': 23, ']': 24, 'a': 25, 'b': 26, 'c': 27, 'd': 28, 'e': 29, 'f': 30,
'g': 31, 'h': 32, 'i': 33, 'j': 34, 'k': 35, 'l': 36, 'm': 37, 'n': 38, 'o': 39, 'p': 40,
'q': 41, 'r': 42, 's': 43, 't': 44, 'u': 45, 'v': 46, 'w': 47, 'x': 48, 'y': 49, 'z': 50}

Our RNN accepts sequences of characters as input and outputs similar
sequences. Let's now break up our data source into such sequences.
seq_length = 100
list_X = []
list_Y = []
for i in range(0, len(data) - seq_length, 1):
    seq_in = data[i:i + seq_length]
    seq_out = data[i + seq_length]
    list_X.append([char_to_int[char] for char in seq_in])
    list_Y.append(char_to_int[seq_out])
n_patterns = len(list_X)
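The windowing is easier to see on a tiny input. The sketch below uses a made-up six-character string and a window of 3 instead of 100; toy_text, toy_seq_length, and pairs are names chosen for this illustration only.

```python
# Toy sliding-window example; the window is tiny so the output stays readable.
toy_text = "abcdef"
toy_seq_length = 3

pairs = []
for i in range(0, len(toy_text) - toy_seq_length):
    seq_in = toy_text[i:i + toy_seq_length]   # input window
    seq_out = toy_text[i + toy_seq_length]    # the character that follows it
    pairs.append((seq_in, seq_out))

for seq_in, seq_out in pairs:
    print(repr(seq_in), "->", repr(seq_out))
```

Each window of characters becomes one training input, and the single character that follows it becomes the target the network learns to predict.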

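We have to do a little bit more to get our input perfectly ready for our model:
X = np.reshape(list_X, (n_patterns, seq_length, 1))
# Scale the integer codes to [0, 1), matching the scaling used at generation time
X = X / float(vocab_size)
# Encode output as one-hot vector
y = np_utils.to_categorical(list_Y)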

We do this because we want to predict one character at a time, which means
we want a one-hot encoding of the output; this is what the
np_utils.to_categorical function produces. For example, when we want to encode
the letter m with the index 37, it would look like this:
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
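What one-hot encoding does can be sketched in a few lines of plain Python. This is an illustration of the idea, not the Keras implementation; the helper name one_hot is made up, while the index 37 and the 51 classes come from the mapping shown earlier.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector

# 'm' has index 37, and there are 51 distinct characters in the example mapping.
encoded_m = one_hot(37, 51)
print(encoded_m.index(1.0))
print(len(encoded_m))
```

Positions are counted from zero, so the single 1 sits in the 38th slot of the vector.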

Let's now define our neural network model.
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

In this case, we have defined a network with a single LSTM layer, followed by
a Dropout of 0.2 and a Dense output layer with softmax activation, and we
train it with the Adam optimizer. Dropout is a technique used to control
overfitting, which is when a neural network performs well only on the dataset
it was trained on. Activation functions decide how the output of a neuron is
computed from its inputs, and optimizers are used to reduce the error each
time we pass back and forth over the neural network.
Choosing these hyperparameters ends up being a matter of practice and
fine-tuning, though we will briefly mention how to choose appropriate
parameters for your particular text processing task in the next chapter. For
the moment, it is sufficient to treat this as a black box, while understanding
the intuition behind it. Note that the hyperparameters used here are commonly
used values for text generation with Keras.
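To give some intuition for the softmax activation and the categorical cross-entropy loss named above, here is a minimal plain-Python sketch. The function names and the example scores are chosen for illustration; Keras computes the same quantities on whole batches.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probabilities):
    """Loss for one example: minus the log-probability given to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))

probs = softmax([2.0, 1.0, 0.1])
loss = categorical_crossentropy([1.0, 0.0, 0.0], probs)
print(probs)
print(loss)
```

The loss shrinks toward zero as the probability assigned to the correct character approaches 1, which is what the optimizer works toward during training.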
Training our model is easy: like scikit-learn, we run the fit function to do
so.
filepath = "weights.hdf5"  # assumed file name; any writable path works
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True)
callbacks_list = [checkpoint]
# fit the model
model.fit(X, y, epochs=20, batch_size=128, callbacks=callbacks_list)

The fit function will run through the input in batches for the given number of
epochs, and the callback saves the weights to a file whenever the loss
improves.
You should be done with training after running fit; keep in mind that,
depending on the size of the dataset used, this could take hours or even days.
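The rule of saving only when the loss improves can be sketched without Keras. The loss values below are made up, and the loop only records which epochs would trigger a save; it illustrates the idea and is not the ModelCheckpoint implementation.

```python
# Made-up per-epoch losses, used only to illustrate the "save on improvement" rule.
epoch_losses = [2.9, 2.5, 2.6, 2.1, 2.2]

best = float("inf")
saved_epochs = []
for epoch, loss in enumerate(epoch_losses, start=1):
    if loss < best:               # improvement: the weights would be written here
        best = loss
        saved_epochs.append(epoch)

print(saved_epochs)
```

Epochs where the loss got worse are skipped, so the file on disk always holds the best weights seen so far.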
Another option is to simply load the weights of an already pretrained model:
filename = "weights.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

So now, with either our loaded weights or with a trained model, we are ready
to generate text character by character!
start = np.random.randint(0, len(list_X) - 1)
pattern = list(list_X[start])  # integer-encoded seed sequence

We wish to start our text generation randomly, so we use numpy to pick a
random starting sequence within our range.
output = []
for i in range(250):
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(vocab_size)
    prediction = model.predict(x, verbose=0)
    index = int(np.argmax(prediction))
    output.append(index)
    pattern.append(index)
    pattern = pattern[1:len(pattern)]
print("\"", ''.join([ix_to_char[value] for value in output]), "\"")

What happened here? Based on our input, x, we choose the character with the
highest predicted probability (using argmax, which returns the index of the
maximum value) and append that index to our output list, converting the
indices back to characters at the end. We then slide the input window forward
by one character and repeat, running as many iterations as the number of
characters we want to see in our output.
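The generation loop can be tried out without a trained network. In the sketch below, fake_predict is a hypothetical stand-in for model.predict that always favours the next letter of a four-letter alphabet; all names here are made up for the illustration.

```python
def fake_predict(window):
    """Stand-in for model.predict: puts all probability on the 'next' character."""
    next_index = (window[-1] + 1) % 4
    return [1.0 if i == next_index else 0.0 for i in range(4)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

toy_ix_to_char = {0: "a", 1: "b", 2: "c", 3: "d"}
window = [0, 1, 2]                    # seed pattern "abc"
generated = []
for _ in range(5):
    index = argmax(fake_predict(window))
    generated.append(index)
    window = window[1:] + [index]     # slide the window forward by one character

text = "".join(toy_ix_to_char[i] for i in generated)
print(text)
```

Always taking the argmax makes the output deterministic for a given seed; the only randomness in the chapter's example comes from the random choice of the starting sequence.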
In the LSTM example we have just seen, we have not trained a massive
network; by stacking further layers on top, we can start seeing even better
results. Even in our small example, the model starts performing a lot better
after a few epochs. Indeed, Andrej Karpathy's blog demonstrates this
particularly well, and with a wide variety of inputs, from Shakespeare to the
Linux code base!
Further pruning of the input data would give us even better results, as would
increasing the number of epochs. Of course, adding more layers or increasing
the number of epochs would increase our training time. If our mission is just
to experiment with RNNs, and not to build a scalable, production-ready model,
Keras does a very good job.

We saw the incredible power of deep learning first hand: we could
successfully train a neural network to generate text that very much
resembles human-produced text, at least in its syntax and, to some extent,
its grammar and spelling. With more fine-tuning and maybe a little bit of
human supervision, we can see how we can create very realistic chatbots
with this kind of technology.
While this kind of text analysis may not seem particularly useful for us,
neural networks find a lot of use in more practical text analysis tasks, such as
in text classification or text clustering. We will be exploring these kinds of
tasks in our next chapter – in particular, text classification using Keras and
using spaCy.
We present the following links to the reader before moving on to the next
chapter; they are blog posts discussing effective strategies when dealing with
text generation using deep learning.

1. NLP Best Practices [16]

2. Deep Learning and Representations [17]

3. Unreasonable Effectiveness of Neural Networks [10]

4. Best of 2017 for NLP and DL [18]

[1] Perceptron:

[2] Feedforward Neural Network:

[3] Biologically Inspired Computing:

[4] By - Own work, Derivative of File:Artificial neural network.svg,
CC BY-SA 3.0:

[5] Zero-Shot Translation with Google's Multilingual Neural Machine
Translation System:

[6] Convolutional Neural Network:

[7] Recurrent Neural Network:

[8] Reinforcement Learning:

[9] LSTM:

[10] The Unreasonable Effectiveness of RNNs:

[11] Understanding LSTM networks:

[12] Lasagne:

[13] Blocks:

[14] Caffe:

[15] Text Generation:

[16] Deep Learning for NLP best practices:

[17] Deep Learning, NLP, and Representations:

[18] Deep Learning 2017:

Keras and spaCy for Deep Learning
In the previous chapter, we introduced you to deep learning techniques for
text, and to get a taste of using neural networks, we attempted to generate
text using an RNN. In this chapter, we will take a closer look at deep learning
for text, and in particular, how to set up a Keras model that can perform
classification, as well as how to incorporate deep learning into spaCy.
Here are a few useful links:

1. Keras Sequential model [1]

2. Keras CNN LSTM [2]

3. Pre-trained word embeddings [3]

Keras and spaCy
In the previous chapter, we discussed various deep learning frameworks. In
this chapter, we will discuss one of them, Keras, in a little more detail,
while also exploring how we can use deep learning with spaCy.
During our attempts at text generation, we already used Keras, but did not
explain the motivation behind using the library, or indeed even how or why
we constructed our model the way we did. We will attempt to demystify this,
as well as set up a neural network model that will aid us in text classification.
In our brief review of the various deep learning frameworks available in
Python, we described Keras as a high-level library which allows us to easily
construct neural networks.

Fig 14.1 The arXiv mentions of Keras. arXiv is a website where researchers upload research papers before they are
accepted by a journal. Here, the x-axis shows the different Python deep learning libraries, and the y-axis is the number
of references to each library in the papers on arXiv

Keras happens to be all this and much more, and offers one of the cleanest
APIs for building very complex learning systems. Only Tensorflow is cited
more on arXiv, and Keras is widely used in industry as well. It comes packaged
with Tensorflow as tf.keras, which means it is backed by Google, and its CNTK
[4] backend is supported by Microsoft. CNTK is another backend that can be
used to build neural networks, but we will not be using it or going into detail
about it, as it is not as well supported or widely used as Tensorflow or Theano.
Being able to use multiple backends (Theano, Tensorflow, and CNTK) makes
it a very flexible framework to adopt. There is a wide user base and an active
community, which means that you are highly likely to get your problems
solved on StackOverflow or GitHub, and it is also easy to put your
model into production. For example, iOS development is supported by
Apple's CoreML [5], where Apple provides support for Keras.
But the reason we, as text analysis practitioners, are interested in Keras is
how easy it is to perform text analysis tasks with Keras. We have mentioned
throughout the book how important preprocessing is when it comes to text
analysis - Keras has a class on preprocessing, and even a sub-module [6]
more focused on text preprocessing. When cleaning text for deep learning
purposes, the context can be slightly different. For example, for text
generation, we don't remove stop words or stem words, because we want the
model to predict what will look like real text. In this chapter, we will focus on
classification, where we will follow more or less the same preprocessing we
did before.
When we previously mentioned neural networks, we used the term model, and
described how these models are made of an input, one or several layers, and an
output. These layers consist of neurons (or nodes) that are interconnected in
different ways. Different kinds of neural networks have different ways of
being connected. For example, in a dense (fully connected) network every node
is connected to every node in the next layer, whereas a Convolutional Neural
Network connects each node only to a small local region of the previous layer.
A Recurrent Neural Network, which we used to generate text in the previous
chapter, feeds the results from previous steps back into the network to build
context. The performance of a neural network heavily
depends on its architecture. Luckily for us, the architecture we intend to use
for text and document classification is very well researched; we won't have
to worry too much about our hyperparameters and how we intend to set up
our neural network, as most of this has already been thoroughly researched
(though with that being said, there is still a lot more to understand about
neural networks!).
The Keras documentation is thorough, and worth having a look at. We will be
describing some of the more important parts of Keras as and when we use them,
but before we start with our examples, it is highly recommended that the reader
go through the following links involving Keras:

1. About Keras Models [7]: This explains the various kinds of neural
network models in Keras

2. About Keras Layers [8]: This discusses the various kinds of layers you
can add to a Keras neural network

3. Core Layers (Dense) [9]: This is the documentation of the dense layer in
Keras

4. Keras Datasets [10]: This is the list explaining and documenting
various Keras datasets

5. LSTMs [11]: This has more detail about the Keras LSTM module

6. Convolutional Layers [12]: This has more detail about the Keras
convolutional layer

We will be learning how to classify sequences and documents in this chapter,
so some reading about Dense layers, LSTMs, and RNNs from the preceding
links will give you the background to breeze through the coming examples.
We will be using Sequential models for our classifier - this simply means
that it is a simpler neural net with the layers stacked in order. It is worth
having a look at the Keras documentation for sequential models [1] as well.
But before we get into the gritty details and code, let's briefly discuss spaCy
and how it is linked to deep learning. While we didn't get into details when
we trained custom spaCy models before, the training was entirely based on deep
learning techniques. We have mentioned before spaCy's POS-tagger,
NER-tagger, and parser. We must credit spaCy's smooth training APIs here: we
are allowed to treat the model doing the training as a black box and focus
on just the training data or what kind of new information we intend to train.
But with that being said, we can still play around with the model, with the
ability to change various hyperparameters, such as the regularization or the
dropout rate. A quick reminder: the dropout rate is the hyperparameter that
controls overfitting and makes sure that our neural network doesn't perform
well only on the training dataset.
From a purely technical point of view, using Doc2Vec to vectorize your documents and then classifying them
using a standard statistical classifier (such as the Naive Bayes classifier) can also be considered as a machine
learning system employing neural networks/deep learning; however, in this chapter we will attempt to build a
classifier system where the final classification task is performed by a neural network.

spaCy allows us to use its built-in TextCategorizer component, where we train it
in a manner similar to its other components such as POS or NER tagging. It
also seamlessly integrates with other word embeddings such as Gensim's
Word2Vec or GloVe, as well as allows us to plug in a Keras model if we would
like. Using spaCy and Keras in tandem can allow us to leverage a very
powerful classification machine - and now that we have the theory and ideas
out of the way, let's jump into the code!

Classification with Keras
For our experiments, we will be using the IMDB sentiment classification
task. This is quite the small dataset - we are using it for the convenience of
loading it and using it, as it is easily available via Keras. It is very important
to understand here that for datasets of the size we are using, it is not the best
idea to use a Deep Neural Network for classification - indeed, we might
even get better results with a simple bag of words followed by a Support
Vector Machine (SVM) doing the classification. The purpose of the
following examples is to rather allow the user to understand how to
construct a neural network using Keras, and how to make predictions using
it. The fine tuning of the neural network and studying its hyperparameters is
a different ball game altogether and is not the focus of this chapter. Another
thing to remember when working with text data and neural networks is that
in almost all cases, more data is better and that neural networks are far
better suited to tackle problems with a lot more data to work with.
We will be following the code and examples from the Keras/examples folder [13]
to help us - this gives the opportunity for the readers to verify their code or
run the examples as a whole. We start with the sequential model you can
find here [17]:
Let's first set up our imports:

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb

The following are some notes:

1. We are not using the text preprocessing modules in this example
because we will be using a dataset which is included in Keras (IMDB).

2. We will be using an LSTM for the classification task, which is a variant
of a recurrent neural network. We previously came across this in our
text generation tasks.

3. We imported Sequential from the models module. Our LSTM here is
merely a layer, and the type of model is a simple sequential model. Dense
is a layer of regularly connected neurons.

max_features = 20000
maxlen = 80  # cut texts after this number of words (among top max_features most common words)
batch_size = 32
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

The max_features variable here refers to the number of most frequent words we
wish to keep from the dataset; we limit this to 20000 words here. This is
similar to getting rid of the least used words, a technique we discussed
previously during text preprocessing.
The maxlen variable is used when we are creating our sequences from the
dataset: we need to fix the length of the sequences, as the neural network
accepts input of a fixed length. The batch_size variable is later used to
specify the number of samples in each batch during training; its value is
usually chosen empirically. The print statements in the code are for the user
to manually inspect the size of the data (we can see it isn't so large!).
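What fixing the length means in practice can be shown with a small plain-Python sketch. By default, pad_sequences pads and truncates at the front of each sequence; the helper pad_pre below is a made-up name that mimics this default behaviour on ordinary lists.

```python
def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad shorter sequences with `value`."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

short_review = pad_pre([5, 8, 2], maxlen=5)
long_review = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_review)
print(long_review)
```

Short reviews gain leading zeros and long reviews lose their earliest words, so every row handed to the network has exactly maxlen entries.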
The load_data function returns our data already split into training and
testing sets, and we print the sizes of both.
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

And this is it: in 4 lines of code we've built up a neural network! The first
line sets up our sequential model, after which we just keep stacking on what
we need. The first layer we stack is a word embedding layer, which maps each
of our 20000 possible words to a dense vector of 128 dimensions. We'll find
out soon that we can also use our own word embeddings, such as Word2Vec
or GloVe vectors. Our next layer is the LSTM; the number 128 there is the
number of units in the layer, that is, the dimensionality of its output.
Here, the dropout argument is to prevent overfitting; we set it to 0.2. Since
an LSTM is a recurrent neural network, we have a value for the recurrent
dropout too. Our final layer is a standard Dense layer with just one output
unit, which takes the output of the LSTM as its input. We use the sigmoid
activation for this layer. An activation here refers to the activation
function [14] used for that particular layer; you can read more about the
activation functions that Keras provides on their documentation page [15]. We
can understand them as the way a layer transforms its input into the output
it passes on.
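As a small illustration of why a sigmoid suits a single-unit output for a two-class problem, the sketch below squashes a few made-up scores into the range 0 to 1 and reads them as sentiment. The 0.5 threshold and the label names are assumptions for this example.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-4.0, 0.0, 4.0):
    p = sigmoid(z)
    label = "positive" if p >= 0.5 else "negative"
    print(z, round(p, 3), label)
```

Large positive scores map close to 1 and large negative scores close to 0, so the output can be read as the probability of the positive class.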
That's it then - we've constructed our first neural network!
Of course, it isn't ready for us to start any kind of prediction or classification
- we still must compile and fit it before any prediction.
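# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print('Train...')
# epochs=15 follows the Keras example [17]; lower it for a quicker run
model.fit(x_train, y_train, batch_size=batch_size, epochs=15,
          validation_data=(x_test, y_test))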

A heads up: this is some intense training you will be doing, and it could take
between 30 minutes and an hour if you are running this on a CPU.
And voila! We're done training and fitting our model. This means that we are
now ready to predict. You would have noticed that when we ran the compile
method, we used binary_crossentropy as our loss and adam as our optimizer. All
neural networks need a loss function and an optimizer to learn. We can
understand the loss here as the measure of how far away the predictions of the
neural network are from the truth, and the optimizer as the way the network
adjusts its weights to get better results.
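To make the notion of a loss concrete, here is a plain-Python sketch of binary cross-entropy on made-up labels and predicted probabilities. It illustrates the formula and is not how Keras computes it internally.

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Average binary cross-entropy over pairs of (label, predicted probability)."""
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
              for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right)
print(confident_wrong)
```

Confident wrong predictions are punished far more heavily than confident right ones are rewarded, which is the signal the optimizer uses to adjust the weights.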
Let's first test how well our model works – Keras allows us to very easily
verify this with the evaluate function. Let's have a quick look at how well our
model works.
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

For a neural network we just cooked up in 4 lines, not so bad, eh? The neural
network we built was trained on the IMDB dataset for text classification, so
it is now equipped with the power to classify documents based on sentiment.
It was built from a simple stack of an embedding, an LSTM, and a dense layer.
We are soon going to see what else we can do with Keras; let's now build a
convolutional neural network, which has a little more complexity in it.
For the convolutional neural network, we will need a few more parameters to
fine tune our network. We are following the code from this example [18] on
the Keras GitHub page.

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import Conv1D, MaxPooling1D
from keras.datasets import imdb

We have some new imports over here which we should keep an eye out for;
in particular, separate layers for Dropout and Activation, and for our
convolutional net, Conv1D and MaxPooling1D.
# Convolution
kernel_size = 5
filters = 64
pool_size = 4

# Embedding
max_features = 20000
maxlen = 100
embedding_size = 128

# LSTM
lstm_output_size = 70

# Training
batch_size = 30
epochs = 2

We notice a bunch of new variables right in the start – these are all for the
convolutional layer, and at this point, we will have to ask the reader to trust
our choice in setting these constants – these variables usually end up
affecting the training quite heavily and are empirically derived after
experiments. We've come across the other variables/arguments in the
previous example.
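print('Build model...')
model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen))
# remaining layers follow the Keras example [18]
model.add(Dropout(0.25))
model.add(Conv1D(filters, kernel_size, padding='valid', activation='relu', strides=1))
model.add(MaxPooling1D(pool_size=pool_size))
model.add(LSTM(lstm_output_size))
model.add(Dense(1))
model.add(Activation('sigmoid'))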

We see immediately that this model has a lot more complexity than the
previous one. We can understand complexity here as the number of layers, and
we have 7 layers stacked up here. We see there is a separate dropout layer;
again, as in previous cases, this is to prevent overfitting. We then add our
first convolutional layer, which is where the variables/arguments we
mentioned before kick in.
The pooling layer we add next is also part of our convolutional architecture.
The Stanford course on convolutional neural networks describes the function
of a pooling layer as follows: its function is to progressively reduce the
spatial size of the representation to reduce the number of parameters and
computation in the network, and hence to also control overfitting [16]. The
remaining layers we've already seen before, though we also explicitly add an
Activation layer for our network, which like last time, uses the sigmoid
function. This completes the setup of our network; we can now proceed to
compiling and training. We use the same loss and optimizer as for our last
neural network.
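model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)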
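Looking back at the convolution and pooling layers, the following plain-Python sketch shows what they do to a one-dimensional sequence. The helper names, the toy signal, and the kernel are made up, and the length arithmetic at the end assumes a convolution without padding and with a stride of 1.

```python
def conv1d_valid(signal, kernel):
    """1D convolution (cross-correlation, as used in deep learning) without padding."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool1d(signal, pool_size):
    """Non-overlapping max pooling; leftover items at the end are dropped."""
    return [max(signal[i:i + pool_size])
            for i in range(0, len(signal) - pool_size + 1, pool_size)]

signal = [1, 3, 2, 5, 4, 6, 1, 0]
features = conv1d_valid(signal, kernel=[1, 0, -1])
pooled = max_pool1d(features, pool_size=2)
print(features)
print(pooled)

# The same arithmetic with the constants used in this chapter:
conv_steps = 100 - 5 + 1         # maxlen=100, kernel_size=5
pooled_steps = conv_steps // 4   # pool_size=4
print(conv_steps, pooled_steps)
```

Under these assumptions, the LSTM that follows the pooling layer only has to read 24 time steps instead of the original 100, which is how pooling cuts down the computation.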

The extra layers added some punch to our network, didn't they? We
can see our improved accuracies. Along with the improved accuracies, we can
also see that our training time has increased to more than 30 minutes on a CPU.
We previously mentioned how using pretrained word embeddings in a
classifier can improve the results. Keras allows us to use these embeddings,
and quite easily too. In our chapter on word embeddings (Chapter 12, Word2Vec,
Doc2Vec, and Gensim), we discussed GloVe word embeddings; if you still
have them downloaded, we can get going right away. We will be following the
code found in this example [19].
import os
BASE_DIR = ''  # paste the directory where your GloVe files are stored here
GLOVE_DIR = os.path.join(BASE_DIR, 'glove.6B')

We will be using the preceding variables/arguments to help load our word
embeddings. Our first step is to access these embeddings from the file and
index them.
print('Indexing word vectors.')
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt')) as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Found %s word vectors.' % len(embeddings_index))

A simple loop through the embeddings file was all that was needed to set this
up. We now set up a matrix that will help us actually use the embeddings.
print('Preparing embedding matrix.')

# prepare embedding matrix
num_words = min(MAX_NUM_WORDS, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
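The two steps above, indexing the vectors and filling the matrix, can be tried on a toy vocabulary without downloading anything. All names and numbers below are made up for the illustration; the sketch assumes, as is usual for Keras word indexes, that index 0 is reserved for padding.

```python
# Two made-up lines in the GloVe text format: a word followed by its vector.
toy_glove_lines = [
    "good 0.1 0.2 0.3",
    "bad -0.1 -0.2 -0.3",
]
toy_embeddings_index = {}
for line in toy_glove_lines:
    values = line.split()
    toy_embeddings_index[values[0]] = [float(v) for v in values[1:]]

# Hypothetical word index; "meh" has no pretrained vector.
toy_word_index = {"good": 1, "bad": 2, "meh": 3}
toy_dim = 3
toy_matrix = [[0.0] * toy_dim for _ in range(len(toy_word_index) + 1)]
for word, i in toy_word_index.items():
    vector = toy_embeddings_index.get(word)
    if vector is not None:
        toy_matrix[i] = vector   # rows for unknown words stay all zeros

for row in toy_matrix:
    print(row)
```

Row i of the matrix holds the vector for the word with index i, so the embedding layer can look up a word's vector directly from its integer code.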

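We are now ready to use our embeddings in our model; it is important to
make sure that we set the trainable argument to False, so that we use the
word vectors as they are.
# arguments after num_words are reconstructed following the Keras example [19]
embedding_layer = Embedding(num_words, EMBEDDING_DIM, weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH, trainable=False)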

With the embedding layer set up, our model is almost done – we now follow
a pattern very similar to what we used before.
print('Training model.')

# train a 1D convnet with global maxpooling
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)

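In this example, we stack up our layers slightly differently: each layer is
called on the output of the previous one, with the x variable holding the
intermediate result. The preds variable is the output of our final layer,
which depends on all the previous layers. We set up our model with the Model
class, and now we are good to go!
model = Model(sequence_input, preds)
# loss, optimizer, batch size, and epochs follow the Keras example [19]
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])

model.fit(x_train, y_train, batch_size=128, epochs=10,
          validation_data=(x_val, y_val))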

Notice here that we used a different loss function; we encourage the reader
to try different losses and optimizers themselves to get a
feel for it. We've seen examples of a basic LSTM, a convolutional neural
network, and a convolutional neural network which uses pretrained word
embeddings. We can also see the progressive increase in the performance of
each of these networks. Embeddings are particularly useful to us when we do
not have much training data; we have previously read how word
embeddings learn context, and we use this same context to inject some extra
prediction power into our network.
Generally, the convolutional model will perform better than the simple LSTM
model, and the model using the word embeddings performs even better. This
makes sense; word embeddings add more context to the model and describe each
word better from a computational point of view, as we discussed
in Chapter 12, Word2Vec, Doc2Vec, and Gensim. As for when to use which
model: if we have access to word embeddings trained on a dataset with a
context similar to the one we are going to classify, and if we have computers
powerful enough to train a neural network, we might as well use a
convolutional network for our classification tasks. As with any machine
learning task, it is also worth training a simpler model, such as a
support vector machine or a Naive Bayes classifier. After we examine the
performance and accuracies, we can make our final choice of model.
Keras gives us the ease of use, flexibility, and power to build neural networks
at will. Deep learning papers published on arXiv often link to GitHub
repositories with Keras code examples, and having gone through this chapter
and the previous one should give you the ability to easily understand how
these neural networks are built.

Classification with spaCy
While Keras works especially well in standalone text classification tasks,
sometimes it might be useful to use Keras in tandem with spaCy, which
works exceedingly well in text analysis. In Chapter 3, spaCy's Language Models,
Chapter 5, POS-Tagging and Its Applications, Chapter 6, NER-Tagging and Its
Applications, and Chapter 7, Dependency Parsing, we already saw how well
spaCy works with textual data, and it is no exception when it comes to deep
learning: its text-oriented approach makes it easy to build a classifier that
works well with text. There are two ways to perform text classification with
spaCy: one uses its own neural network library, thinc, while the other
uses Keras. Both the examples we will explain are from spaCy's
documentation, and it is highly recommended that you check out the originals.
The first example we will be exploring can be found on the spaCy examples
page [20]. In the example, we use an LSTM for
the sentiment classification task. This will be a Keras-trained model. This
model is trained to classify sentences, and the scores are then aggregated so
that we can classify documents. It is more difficult to perform this kind
of hierarchical aggregation using pure Keras or Tensorflow, so this is a good
example to observe the power of spaCy.
import plac
import random
import pathlib
import cytoolz
import numpy
from keras.models import Sequential, model_from_json
from keras.layers import LSTM, Dense, Embedding, Bidirectional
from keras.layers import TimeDistributed
from keras.optimizers import Adam
import thinc.extra.datasets
from spacy.compat import pickle
import spacy

We should be able to recognize most of these imports, having used them
with Keras or spaCy before.
class SentimentAnalyser(object):
    @classmethod
    def load(cls, path, nlp, max_length=100):
        with (path / 'config.json').open() as file_:
            model = model_from_json(file_.read())
        with (path / 'model').open('rb') as file_:
            lstm_weights = pickle.load(file_)
        embeddings = get_embeddings(nlp.vocab)
        model.set_weights([embeddings] + lstm_weights)
        return cls(model, max_length=max_length)

    def __init__(self, model, max_length=100):
        self._model = model
        self.max_length = max_length

    def __call__(self, doc):
        X = get_features([doc], self.max_length)
        y = self._model.predict(X)
        self.set_sentiment(doc, y)

The first few lines just set up our class and define how to load our model
and our embedding weights. The load method returns an instance wrapping the
loaded model, which we use in the evaluate method to set up our pipeline. We
initialize the class with the model and the maximum length. The __call__
method gets the features and the prediction for a single document. We
continue our explanation after the next block of code, which is the pipe
method. Note that it is not a new code file, but a method of
the SentimentAnalyser class; do make it a point to look at this link [21] for the
entire code!
    def pipe(self, docs, batch_size=1000, n_threads=2):
        for minibatch in cytoolz.partition_all(batch_size, docs):
            minibatch = list(minibatch)
            sentences = []
            for doc in minibatch:
                sentences.extend(doc.sents)
            Xs = get_features(sentences, self.max_length)
            ys = self._model.predict(Xs)
            for sent, label in zip(sentences, ys):
                sent.doc.sentiment += label - 0.5
            for doc in minibatch:
                yield doc

    def set_sentiment(self, doc, y):
        doc.sentiment = float(y[0])

The pipe method actually performs the prediction after splitting up our
dataset into batches. We can see the ys = self._model.predict(Xs) line, which
calculates the sentiment value. It also assigns a sentiment value to a
document. Now that we have finished writing the SentimentAnalyser class, we will
start writing methods that will help with our training.
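The batching that the pipe method relies on can be sketched in plain Python. The helper below is only a stand-in for cytoolz.partition_all, written with itertools.islice; the name partition_all_sketch and the toy input are assumptions made for illustration and are not part of the spaCy example.

```python
from itertools import islice


def partition_all_sketch(n, iterable):
    """Yield successive tuples of at most n items (hypothetical stand-in
    for cytoolz.partition_all)."""
    iterator = iter(iterable)
    while True:
        batch = tuple(islice(iterator, n))
        if not batch:
            return
        yield batch


# Toy "documents"; in the real pipeline these would be spaCy Doc objects.
docs = ["doc%d" % i for i in range(7)]
batches = list(partition_all_sketch(3, docs))
print(batches)
```

The last batch is simply shorter than the others, which is why pipe can process a stream of documents of any length.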

def get_labelled_sentences(docs, doc_labels):
    labels = []
    sentences = []
    for doc, y in zip(docs, doc_labels):
        # Every sentence inherits the label of its document
        for sent in doc.sents:
            sentences.append(sent)
            labels.append(y)
    return sentences, numpy.asarray(labels, dtype='int32')

def get_features(docs, max_length):
    docs = list(docs)
    Xs = numpy.zeros((len(docs), max_length), dtype='int32')
    for i, doc in enumerate(docs):
        j = 0
        for token in doc:
            # Row of the token's vector in the embedding table, or -1 if absent
            vector_id = token.vocab.vectors.find(key=token.orth)
            if vector_id >= 0:
                Xs[i, j] = vector_id
            else:
                Xs[i, j] = 0
            j += 1
            if j >= max_length:
                break
    return Xs

The method for getting labeled sentences is fairly straightforward: it
returns every sentence together with the label of the document it came from.
The get_features method needs a little more attention: it is where we
construct the feature vector for each document, mapping every token to the
row index of its word vector and leaving the remaining positions at zero up
to max_length.
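The idea behind get_features can be illustrated without spaCy or NumPy. The following is a minimal sketch under stated assumptions: the vocabulary dictionary, the sentences, and the helper name pad_ids are all made up, and plain lists stand in for the integer matrix.

```python
def pad_ids(sequences, max_length, pad_value=0):
    """Return a fixed-width matrix (list of lists): each row is truncated
    to max_length and right-padded with pad_value."""
    rows = []
    for seq in sequences:
        row = list(seq[:max_length])
        row.extend([pad_value] * (max_length - len(row)))
        rows.append(row)
    return rows


# Hypothetical vocabulary: word -> row index in the embedding table.
vocab = {"the": 1, "movie": 2, "was": 3, "terrific": 4}
sentences = [["the", "movie", "was", "terrific"], ["terrific", "unknownword"]]

# Words without a vector are mapped to 0, as in get_features.
ids = [[vocab.get(word, 0) for word in sent] for sent in sentences]
features = pad_ids(ids, max_length=3)
print(features)
```

The first sentence is cut off after three tokens, and the second is padded with zeros, so every row has the same width regardless of sentence length.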
def train(train_texts, train_labels, dev_texts, dev_labels,
          lstm_shape, lstm_settings, lstm_optimizer,
          batch_size=100, nb_epoch=5, by_sentence=True):
    nlp = spacy.load('en_vectors_web_lg')
    # The sentencizer gives us doc.sents without running a full parser
    nlp.add_pipe(nlp.create_pipe('sentencizer'))
    embeddings = get_embeddings(nlp.vocab)
    model = compile_lstm(embeddings, lstm_shape, lstm_settings)
    train_docs = list(nlp.pipe(train_texts))
    dev_docs = list(nlp.pipe(dev_texts))
    if by_sentence:
        train_docs, train_labels = get_labelled_sentences(train_docs, train_labels)
        dev_docs, dev_labels = get_labelled_sentences(dev_docs, dev_labels)
    train_X = get_features(train_docs, lstm_shape['max_length'])
    dev_X = get_features(dev_docs, lstm_shape['max_length'])
    model.fit(train_X, train_labels, validation_data=(dev_X, dev_labels),
              epochs=nb_epoch, batch_size=batch_size)
    return model

It is easy to guess that the training method is where all our heavy lifting is
happening. Some important lines to notice here are the ones involving spaCy's
pipeline, where we add a sentencizer to it. The remaining lines load our word
embeddings, compile the LSTM (we will have a look at our model just below),
and extract the features from our documents so we can proceed with training.
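def compile_lstm(embeddings, shape, settings):
    model = Sequential()
    model.add(Embedding(embeddings.shape[0], embeddings.shape[1], input_length=shape['max_length'], trainable=False, weights=[embeddings], mask_zero=True))
    model.add(TimeDistributed(Dense(shape['nr_hidden'], use_bias=False)))
    model.add(Bidirectional(LSTM(shape['nr_hidden'], recurrent_dropout=settings['dropout'], dropout=settings['dropout'])))
    model.add(Dense(shape['nr_class'], activation='sigmoid'))
    model.compile(optimizer=Adam(lr=settings['lr']), loss='binary_crossentropy', metrics=['accuracy'])
    return model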

This part of the code should look more familiar to us – as we have done in
the previous section, we set up each of our layers and stack them up. We can
use any Keras model we would like to do this, and in this case, a bidirectional
LSTM is used.
def get_embeddings(vocab):
    return vocab.vectors.data

def evaluate(model_dir, texts, labels, max_length=100):
    def create_pipeline(nlp):
        # This could be a lambda, but named functions are easier
        # to read in Python.
        return [nlp.tagger, nlp.parser,
                SentimentAnalyser.load(model_dir, nlp,
                                       max_length=max_length)]
    nlp = spacy.load('en')
    nlp.pipeline = create_pipeline(nlp)
    correct = 0
    i = 0
    for doc in nlp.pipe(texts, batch_size=1000, n_threads=4):
        correct += bool(doc.sentiment >= 0.5) == bool(labels[i])
        i += 1
    return float(correct) / i

The evaluate method returns a score of how well our model performed; the
code is fairly straightforward and merely checks the assigned sentiment
score with the label of the document.
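The comparison performed by evaluate boils down to thresholding each score at 0.5 and counting matches. Here is a minimal sketch of that step on its own; the scores and labels are made-up values standing in for doc.sentiment and the gold labels, and the function name accuracy is an assumption.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of documents whose thresholded score matches the label."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


scores = [0.9, 0.2, 0.6, 0.4]  # made-up sentiment scores
labels = [1, 0, 0, 1]          # made-up gold labels (1 = positive)
acc = accuracy(scores, labels)
print(acc)
```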
def read_data(data_dir, limit=0):
    examples = []
    for subdir, label in (('pos', 1), ('neg', 0)):
        for filename in (data_dir / subdir).iterdir():
            with filename.open() as file_:
                text = file_.read()
            examples.append((text, label))
    random.shuffle(examples)
    if limit >= 1:
        examples = examples[:limit]
    return zip(*examples) # Unzips into two lists

We use the IMDB sentiment analysis dataset; this method is an interface to
access this data.
@plac.annotations(
    train_dir=("Location of training file or directory"),
    dev_dir=("Location of development file or directory"),
    model_dir=("Location of output model directory",),
    is_runtime=("Demonstrate run-time usage", "flag", "r", bool),
    nr_hidden=("Number of hidden units", "option", "H", int),
    max_length=("Maximum sentence length", "option", "L", int),
    dropout=("Dropout", "option", "d", float),
    learn_rate=("Learn rate", "option", "e", float),
    nb_epoch=("Number of training epochs", "option", "i", int),
    batch_size=("Size of minibatches for training LSTM", "option", "b", int),
    nr_examples=("Limit to N examples", "option", "n", int)
)

The preceding annotations set up our command-line options, which cover the
various model and data directories, the run-time flag, and the parameters for
the model. Let's move on to the main function now.
def main(model_dir=None, train_dir=None, dev_dir=None,
         is_runtime=False,
         nr_hidden=64, max_length=100,  # Shape
         dropout=0.5, learn_rate=0.001,  # General NN config
         nb_epoch=5, batch_size=100, nr_examples=-1):  # Training params
    if model_dir is not None:
        model_dir = pathlib.Path(model_dir)
    if train_dir is None or dev_dir is None:
        imdb_data = thinc.extra.datasets.imdb()
    if is_runtime:
        # Run-time usage: evaluate a previously saved model
        if dev_dir is None:
            dev_texts, dev_labels = zip(*imdb_data[1])
        else:
            dev_texts, dev_labels = read_data(dev_dir)
        acc = evaluate(model_dir, dev_texts, dev_labels,
                       max_length=max_length)
        print(acc)
    else:
        if train_dir is None:
            train_texts, train_labels = zip(*imdb_data[0])
        else:
            print("Read data")
            train_texts, train_labels = read_data(train_dir,
                                                  limit=nr_examples)
        if dev_dir is None:
            dev_texts, dev_labels = zip(*imdb_data[1])
        else:
            dev_texts, dev_labels = read_data(dev_dir, limit=nr_examples)
        train_labels = numpy.asarray(train_labels, dtype='int32')
        dev_labels = numpy.asarray(dev_labels, dtype='int32')
        lstm = train(train_texts, train_labels, dev_texts, dev_labels,
                     {'nr_hidden': nr_hidden, 'max_length': max_length,
                      'nr_class': 1},
                     {'dropout': dropout, 'lr': learn_rate},
                     {},  # lstm_optimizer, unused by train
                     nb_epoch=nb_epoch, batch_size=batch_size)
        weights = lstm.get_weights()
        if model_dir is not None:
            # The embeddings (weights[0]) are restored from the vocab at load time
            with (model_dir / 'model').open('wb') as file_:
                pickle.dump(weights[1:], file_)
            with (model_dir / 'config.json').open('w') as file_:
                file_.write(lstm.to_json())

if __name__ == '__main__':
    plac.call(main)

Don't let the size of the main function scare you. The first few lines set up
the model folder and load the dataset. We then check the is_runtime flag: if
it is set, we run the evaluate method on a previously saved model. If not, we
proceed to train our model. The train() function trains the model and returns
it, and if a model folder was given, we save our model there.
Running, saving, and using the model in your own production pipelines is a
huge motivation behind using Keras and spaCy in such a way. The key
takeaway here is that we are updating the sentiment attribute for each doc. How
we decide to use this is up to us. One of the main selling points of the spaCy
implementation is that it does not remove or truncate the inputs: the writers
argue that doing so negatively affects the results, because users tend to sum
up their review in the last sentence of the document, and a lot of the
sentiment can be inferred from this sentence.
So now that we have our trained model, how do we use it? Our model now
adds one more attribute to our document, which is the doc.sentiment attribute.
This value captures the sentiment of the document. The user can verify this
by later loading the saved model and running any document through the
pipeline the same way we did in Chapter 5, POS-Tagging and Its Applications,
Chapter 6, NER-Tagging and Its Applications, and Chapter 7, Dependency Parsing:
doc = nlp(document)

Here, nlp is the pipeline object of the loaded model which we just trained, and
the document is any unicode text we wish to analyze. The doc object now
contains information about the sentiment.
We can also train a more traditional classifier based on the probability of a
document belonging to a particular class. The training is extremely simple to
perform: the update method, which is part of the pipeline, is what does the
actual training. The example code in the documentation can be found here
[21], and the code on GitHub can be found here [22]. We will be walking the
reader through the code, and highly encourage the user to run the code and
to have a look at what it adds to the pipeline. Note that this file is meant to
be run all at once, and we have only split up the code so that we can explain
it. When testing the code, run the file which can be found here [22].
import plac
import random
from pathlib import Path
import thinc.extra.datasets
import spacy
from spacy.util import minibatch, compounding

These imports are what we are used to seeing, but we do not have Keras here
as we will be using the in-built thinc library.
@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_texts=("Number of texts to train from", "option", "t", int),
    n_iter=("Number of training iterations", "option", "n", int))
def main(model=None, output_dir=None, n_iter=20, n_texts=2000):
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank('en')  # create blank Language class
        print("Created blank 'en' model")

We have set up the command-line annotations and loaded the model, printing a
message either way. If we don't pass a model, we create a blank English model.
    # add the text classifier to the pipeline if it doesn't exist
    if 'textcat' not in nlp.pipe_names:
        textcat = nlp.create_pipe('textcat')
        nlp.add_pipe(textcat, last=True)
    # otherwise, get it, so we can add labels to it
    else:
        textcat = nlp.get_pipe('textcat')

    # add label to text classifier
    textcat.add_label('POSITIVE')

We now add a text categorizer to our pipeline if it doesn't already exist; if
it does exist, we fetch it. In either case, we then add a label to it.
    print("Loading IMDB data...")
    (train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=n_texts)
    print("Using {} examples ({} training, {} evaluation)"
          .format(n_texts, len(train_texts), len(dev_texts)))
    train_data = list(zip(train_texts,
                          [{'cats': cats} for cats in train_cats]))

We're now playing with our dataset – we've loaded our dataset, and then
stored the training data.
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']

Before we start any training, we first disable all the other parts of the
pipeline, so that only the text categorizer is trained.
    with nlp.disable_pipes(*other_pipes):  # only train textcat
        optimizer = nlp.begin_training()
        print("Training the model...")
        print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))
        for i in range(n_iter):
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(train_data, size=compounding(4., 32., 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.2,
                           losses=losses)

We will be using batches to train our data, similar to previous examples. The
nlp.update method is the heart of all the code and performs the training using
the training information and annotations.
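To see what batching with a compounding size means, here is a minimal pure-Python sketch. It is not spaCy's implementation: compounding_sketch and minibatch_sketch are hypothetical stand-ins for spacy.util.compounding and spacy.util.minibatch, and the growth factor of 2.0 is chosen only so that the effect is visible on a toy dataset.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


def minibatch_sketch(items, sizes):
    """Cut items into consecutive batches whose lengths follow sizes."""
    iterator = iter(items)
    for size in sizes:
        batch = list(islice(iterator, int(size)))
        if not batch:
            return
        yield batch


examples = list(range(40))  # toy stand-in for train_data
batch_lengths = [len(batch) for batch in
                 minibatch_sketch(examples, compounding_sketch(4.0, 16.0, 2.0))]
print(batch_lengths)
```

Small batches early on give the model many quick updates, and the batches then grow toward the upper limit as training proceeds.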
            with textcat.model.use_params(optimizer.averages):
                # evaluate on the dev data split off in load_data()
                scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
            # print a simple table
            print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}'
                  .format(losses['textcat'], scores['textcat_p'],
                          scores['textcat_r'], scores['textcat_f']))
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc2 = nlp2(test_text)
        print(test_text, doc2.cats)

We then test our model with the evaluate method, which calculates
precision, recall, and f-score values. The last part of the main function is
saving the trained model in an output directory if specified, and in testing the
saved model.
def load_data(limit=0, split=0.8):
    """Load data from the IMDB dataset."""
    # Partition off part of the train data for evaluation
    train_data, _ = thinc.extra.datasets.imdb()
    random.shuffle(train_data)
    train_data = train_data[-limit:]
    texts, labels = zip(*train_data)
    cats = [{'POSITIVE': bool(y)} for y in labels]
    split = int(len(train_data) * split)
    return (texts[:split], cats[:split]), (texts[split:], cats[split:])

def evaluate(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)
    # The tiny initial values avoid division by zero below
    tp = 1e-8  # True positives
    fp = 1e-8  # False positives
    fn = 1e-8  # False negatives
    tn = 1e-8  # True negatives
    for i, doc in enumerate(textcat.pipe(docs)):
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:
                continue
            if score >= 0.5 and gold[label] >= 0.5:
                tp += 1.
            elif score >= 0.5 and gold[label] < 0.5:
                fp += 1.
            elif score < 0.5 and gold[label] < 0.5:
                tn += 1
            elif score < 0.5 and gold[label] >= 0.5:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * (precision * recall) / (precision + recall)
    return {'textcat_p': precision, 'textcat_r': recall, 'textcat_f': f_score}

if __name__ == '__main__':
    plac.call(main)

We've come across these methods earlier in the main function; one is to load
the dataset, and the other to evaluate the performance of our trained model.
We use the dataset which comes bundled with thinc, and return the data
appropriately shuffled and split. The evaluate function simply calculates the
true negatives, true positives, false negatives, and false positives to create the
measures for recall, precision, and f-measure.
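As a worked example of these measures, the sketch below computes precision, recall, and F-score from confusion counts. The counts are made up for illustration, and the function name precision_recall_f is an assumption.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Made-up counts: 40 true positives, 10 false positives, 20 false negatives.
p, r, f = precision_recall_f(tp=40, fp=10, fn=20)
print("P=%.3f R=%.3f F=%.3f" % (p, r, f))
```

Note that true negatives do not enter any of the three formulas, which is why the measures remain informative even when one class dominates the data.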
test_text = "This movie disappointed me severely"
doc = nlp(test_text)
print(test_text, doc.cats)

This would be our final step: testing our model on a sample sentence. The
doc.cats attribute gives us the result of the classification; here, the
sentence carries negative sentiment and is correctly classified as such. It is
here we can also see one of the main advantages of using spaCy for deep
learning: it fits seamlessly in our pipeline, and the classification or
sentiment score ends up being another attribute of the document. This is
quite different to how we approach deep learning with Keras, where our
purpose was to either generate text or to output probability vectors; it is
simply a vector in, vector out method. It is, of course, possible to leverage
this information as part of our text analysis pipeline, but the way spaCy does
the training under the hood and attaches the learned attributes to the
documents makes for a very easy way to include the information as part of any
text analysis pipeline.

In the previous chapter we introduced our readers to deep learning for text,
and in this chapter, we saw how we can leverage the power of deep learning
in our own applications, whether we use Keras or spaCy. Knowing how to
assign sentiment scores or classify our documents gives us a huge boost
when designing intelligent text systems, and with pretrained models, we
don't have to perform heavy computations every time we wish to make such
a classification. It is now within our capacity to build a strong and varied text
analysis pipeline!
In the next chapter, we will discuss two popular text analysis problems—
sentiment analysis and building our own chatbot—and what possible
approaches we can take to solve these problems.

[1] Keras Sequential Model:

[2] Keras CNN LSTM:

[3] Pre-trained Word Embeddings:

[4] CNTK:

[5] Apple CoreML:

[6] Keras Text Processing:

[7] Keras Models:

[8] Keras Layers:

[9] Keras Core Layers:

[10] Keras Datasets:

[11] Keras LSTM:

[12] Keras Convolutional Layers:

[13] Keras examples directory:

[14] Activation function:

[15] Keras Activation functions:

[16] Pooling layer:

[17] Sequential Example:

[18] Convolutional Example:

[19] Convolutional with embeddings:

[20] Deep Learning Keras:

[21] Text Classification Model spaCy:

[22] Text Classification code:

Sentiment Analysis and ChatBots
By now, we are equipped with the skills needed to get started on text analysis
projects and to also take a shot at more complicated, meatier projects. Two
common text analysis projects that encapsulate a lot of the concepts we have
explored throughout the book are sentiment analysis and chatbots. In fact,
we've already touched upon all the methods we will be using for these
projects, and this chapter will serve as a guide to how one can put up such an
application on their own.
In this chapter, we will not be providing the code to build a chatbot or
sentiment analysis pipeline from the first step to the last, but will rather
introduce the reader to a variety of techniques which will help when setting
up such a project. Following are the topics we will cover in this chapter:
Sentiment analysis
Mining data

Sentiment analysis
Sentiment analysis is merely another term given to text classification or
document classification – where the classifying feature happens to be the
sentiment of the text. We can understand sentiment as a feeling or opinion
about something – if we said The movie was terrific!, it means it expresses a
positive sentiment or feeling, and if we say The movie is terrible!, it would be
expressing negative sentiment or feeling. Here, sentiment usually refers to
positive or negative sentiment, but this can, of course, be extended to include
multiple sentiments, such as angry, sad, happy, and maybe even a thoughtful
sentiment if we so wish. In other words, sentiment analysis tasks are simply
classification tasks where each class is a kind of sentiment which we wish to
detect.
In fact, we have seen an example of sentiment analysis in the previous
chapter, when we used Keras and spaCy together to build a deep learning
pipeline. Sentiment analysis was performed by assigning probability
distributions of positive and negative sentiment. In fact, even the examples
using only Keras were classifying based on sentiment, but we approached the
problem as a simple classification task and not as a sentiment analysis task.
The example with spaCy was more explicit, where we assigned scores of
sentiments to each document and then did the classification.
Based on what we intend to do with the information of the sentiment, we can
approach our problem in different ways – though the core idea that we are
simply using probabilities of which class a document will belong in remains
the same. For any sentiment analysis task, it is highly recommended to train
your model on data from the same domain: identifying sentiment in tweets with
an algorithm trained on movie reviews will not work as well as one trained on
tweets.
Sometimes it is helpful to prototype your text analysis pipeline or to quickly
demo your ideas. In such cases, before working with Keras or spaCy, it might
be useful to get a quick gauge of sentiment before actually setting up the
heavy machinery. Setting up a quick Naive Bayes classifier can be handy in
such a case. We've already come across this classifier in our chapter on
clustering and classifying text documents (Chapter 10, Clustering and
Classifying Text), so we are aware of how to set up our code to do this. Just a
note that the following code is a template – we have not defined X or labels.
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X, labels)

We can then use our Naive Bayes classifier to predict a class. Here, the
class would be positive or negative sentiment. The Python package TextBlob [1]
offers an even simpler interface for assigning sentiment. Its default analyzer
is lexicon-based, and it also ships with an optional Naive Bayes analyzer
trained on movie reviews. Again, here, the text variable is a placeholder
variable and you are expected to define the text yourself if you wish to see
the result of the example.
from textblob import TextBlob
analysis = TextBlob(text)
Pos_or_neg = analysis.sentiment.polarity

The Pos_or_neg variable now contains the polarity of the text as a float
between -1.0 (negative) and 1.0 (positive). This kind of API allows us to very
easily work with the sentiment information, unlike in the case of Keras or
scikit-learn where we must predict the class of a document and then assign
this to the document. We might have noticed the same API ideology in spaCy
as well – after running our document through the pipeline, it tags the
documents with different attributes. In the previous chapter (Chapter 14, Keras
and spaCy for Deep Learning), we saw the particular example where we
added this attribute to the pipeline. Note that in this example, nlp is the
trained model we saw in the Deep Learning with spaCy section, and we have
to finish running that code example to see it work.
doc = nlp(text)
sentiment_value = doc.sentiment

We can see how TextBlob and spaCy have almost the same way of
approaching sentiment analysis from an API point of view. While we can
prototype with TextBlob, it is not recommended for production code, or even
for any serious text analysis project: its Naive Bayes analyzer is trained on
movie reviews, and this context might not always yield the best values. When
we use spaCy to assign sentiment scores, we are training our model ourselves,
and on the data we want to train on. We can actually build the neural network
ourselves, which gives us the opportunity to tune our model to the context
even more finely.

A quick Google search for sentiment analysis python gives us a plethora of results,
and most of these involve analysis of tweets for sentiment and tend to use
NLTK's built-in sentiment analyzer to perform the analysis. We will be
avoiding the use of NLTK's classifier because it also uses a naive Bayes
classifier to perform the classification, and unlike TextBlob, does not offer an
API that gives documents attributes – like Keras or scikit-learn, it accepts a
vector as input and assigns values based on this.
With that being said, it does not hurt to familiarize ourselves with the
Sentiment Analysis API [2] that NLTK offers, even if only to better follow
online tutorials on the matter. If anything, their SentimentAnalyzer [3] class offers
some use, even if only as a way for us to design our analysis after
constructing our own sentiment analyzer.
One useful method which is provided by NLTK is the
show_most_informative_features() method, which shows us which features are
informative (in this case, features are words). For example, if we are
classifying spam mail, words such as winner or casino would be very telling
features. The ratios in the rightmost column compare how likely each feature
value is under one class relative to the other, for example ok : spam.
winner = None ok : spam = 4.5 : 1.0
hello = True ok : spam = 4.5 : 1.0
hello = None spam : ok = 3.3 : 1.0
winner = True spam : ok = 3.3 : 1.0
casino = True spam : ok = 2.0 : 1.0
casino = None ok : spam = 1.5 : 1.0
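Ratios of this kind can be reproduced by hand. The following is a minimal sketch, not NLTK's implementation: it estimates how much more likely the presence of a word is under one label than under another, using simple add-one-half smoothing. The toy corpus, the labels, and the function name feature_ratio are all made up for illustration.

```python
from collections import Counter


def feature_ratio(docs, labels, word, label_a, label_b, smoothing=0.5):
    """Ratio P(word present | label_a) / P(word present | label_b), with
    additive smoothing over the two outcomes present/absent."""
    present = Counter()
    totals = Counter(labels)
    for tokens, label in zip(docs, labels):
        if word in tokens:
            present[label] += 1

    def prob(label):
        return (present[label] + smoothing) / (totals[label] + 2 * smoothing)

    return prob(label_a) / prob(label_b)


# Made-up toy corpus: each document is a set of words.
docs = [
    {"winner", "casino", "prize"},
    {"winner", "claim", "now"},
    {"casino", "bonus"},
    {"hello", "meeting", "tomorrow"},
    {"hello", "lunch"},
    {"hello", "winner", "congrats"},
]
labels = ["spam", "spam", "spam", "ok", "ok", "ok"]

ratio = feature_ratio(docs, labels, "winner", "spam", "ok")
print("winner = True   spam : ok = %.1f : 1.0" % ratio)
```

A ratio well above 1.0 marks the word as evidence for the first label; the smoothing keeps the ratio finite for words that never occur under one of the labels.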

The presence of the words winner and casino increases the odds of the mail
being marked as spam. It is possible to extract the same kind of information
from a scikit-learn model, for example. Let's write a small method to do this
and examine it:
import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    # get feature names returns the features used in the classifier,
    # and here the words in the vocabulary are the features
    # (in scikit-learn 1.0 and later, use get_feature_names_out() instead)
    feature_names = vectorizer.get_feature_names()
    # We now loop over every class label
    for i, class_label in enumerate(class_labels):
        # clf.coef_ contains the coefficients of each class; we extract the
        # 10 highest coefficient values, which are a way to measure which
        # features (words) are most influencing the probability of a document
        # belonging to that class
        top10 = np.argsort(clf.coef_[i])[-10:]
        # we finally print the particular class and the top 10 features (words)
        # of that class
        print("%s: %s" % (class_label,
                          " ".join(feature_names[j] for j in top10)))

Here, we simply extract the coefficient values for each feature and sort them
before printing. This particular example is for multi-class classifiers; if we
are using a binary classifier, then clf.coef_[0] would also do the trick. With
a little bit of effort, it is possible to replicate such NLTK functions with
spaCy, scikit-learn, and Gensim. As for what kind of machine learning technique
best works for sentiment analysis, the state-of-the-art at the moment
remains deep learning techniques – in particular, a bidirectional LSTM is
particularly good at understanding sentiment in text. We have already seen
examples of how to construct such neural networks in the previous chapters.
As for why these work the best, it is because this is a form of a recurrent
neural network - this means that context is carried in further layers or nodes
of the network. LSTM stands for long short-term memory – this is
precisely the idea that is encapsulated, and memory is essential in
understanding context. Bidirectional means that we have a context in either
direction. Of course, with a field as rapidly evolving as deep learning, a new
architecture might be able to outperform LSTMs soon.
Adding more information or depth to your neural network (such as using
word embeddings or stacking up more layers) might further increase our
performance, and so can increasing the number of training epochs. Of course,
like any other problem we intend to solve using deep learning, a lot of
fine-tuning is required for high accuracies. For a little more information
about why LSTMs work so well for sentiment analysis, see the following posts:
1. LSTMs for sentiment analysis [4]
2. Understanding LSTMs [5]
With more advanced tools at our disposal, we can skip using NLTK to analyze
sentiment. It still remains important to be able to mine the internet for
useful textual data which we can use to analyze sentiment – we will be
discussing two such sources, in particular, Reddit and Twitter.

Reddit for mining data
In the very first chapter we talked about mining the internet for data sources
– now, we're going to actually explore how to do this. We've mentioned
before how Reddit [6] can be an interesting data source because it includes
real conversations with mostly grammatically correct sentences – it also has
subreddits where we can focus on interest groups. Reddit also happens to
have a well-organized API which we can use to mine for data, meaning that a
lot of cleaning effort is saved!
To be able to gather data without a problem, one must first sign up for an
account on Reddit. Some browsing of the website yourself will help build
context.
After getting comfortable with the nature of the website and the data we will
be gathering, we should take a look at the API rules, which can be found on
the Reddit wiki. Two rules that particularly stand out here are that we can
only send 60 requests per minute and that we must not lie about the user
agent. A user agent is software acting on behalf of a user; in the case of
accessing the internet, it is information about which browser or application
is making the request. These rules are not too difficult to adhere to, and
once we have read them, we can start looking at some code.
import requests
import json

# Add your username below
hdr = {'User-Agent': ':r/news.single.result:v1.0' +
'(by /u/username)'}
url = ''
req = requests.get(url, headers=hdr)
data = json.loads(req.text)

In the preceding lines of code, we are mining results from the subreddit r/news
[7], which is a subreddit largely discussing American and international news
and politics. We could mine from any subreddit whose content or material we
are comfortable with. Some things we should be careful about in the code are
the user agent and the Reddit username we have created our account with.
The best part about Reddit data is the format we are receiving our data in –
JSON! There are many standard ways to load JSON in Python, and the JSON
encoder and decoder [8] helps us do this easily.
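As a small illustration of working with such a response, the sketch below parses a trimmed, made-up JSON string and pulls out the post titles. The nesting used here (data, then children, then data, then title) is assumed to follow the layout of a Reddit listing, and the titles and scores are invented.

```python
import json

# A trimmed, made-up response in the shape of a Reddit listing.
sample = '''
{"kind": "Listing",
 "data": {"children": [
   {"kind": "t3", "data": {"title": "First headline", "score": 120}},
   {"kind": "t3", "data": {"title": "Second headline", "score": 45}}
 ]}}
'''

data = json.loads(sample)
# Each child wraps one post; the text we care about sits in its "data" dict.
titles = [child["data"]["title"] for child in data["data"]["children"]]
print(titles)
```

Once the titles (or comment bodies) are in a plain Python list, they can be fed to any of the preprocessing pipelines we have built in earlier chapters.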
The textual data stored in the JSON can be topic modeled, used to train
Word2Vec for a particular context, or as we have been discussing – classified
for sentiment. The r/news and r/politics subreddits are particularly interesting
places to attempt this, as they tend to receive the most polarizing posts. I
could personally recommend that you have a look at the following

And if interested in the more eclectic material, or internet memes, you can
look at these:

One thing to be careful about is not to anger the API by making more than 60
requests per minute; this would mean using the time [9] library to space out
our requests. The fact that Reddit is already organized into many subreddits
based on interest groups, hobbies, or subjects means that it is a rich source
of textual information with appropriate context which we can adjust,
something we cannot as easily do with other online data sources.
With all these limitations, we might be tempted to download historical data;
in that case, there is a Reddit thread (meta!) with links to about 1.7 billion
comments, at about 250 GB after being compressed.
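The limit of 60 requests per minute can be respected with a small helper built on the time library. This is a minimal sketch under stated assumptions: the Throttle class is hypothetical, it simply spaces calls evenly, and the commented loop at the end assumes a list of URLs and the hdr header from the earlier snippet.

```python
import time


class Throttle:
    """Allow at most max_calls calls per period (in seconds) by spacing
    consecutive calls at least period / max_calls seconds apart."""

    def __init__(self, max_calls=60, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(max_calls=60, period=60.0)
# Hypothetical usage; urls would be a list of API endpoints:
# for url in urls:
#     throttle.wait()
#     req = requests.get(url, headers=hdr)
```

Calling wait() before every request keeps us at or below one request per second, which stays within the limit without any bookkeeping of past requests.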
One associated project written in Python which used Reddit for some
interesting results is sense2vec [10], where the creators of spaCy used Reddit
data to attempt a semantic analysis of Reddit. The best part is that the entire
code base of the project is publicly available, meaning we are free to play around with it.
This means that we can use the sense2vec on other sources of data, or even
modify what is considered as semantics. Since it is a web app, it is a neat way
to look at how to display results online.

Twitter for mining data
While Reddit is a great way to analyze a more structured form of data, we
often turn to social media to analyze text – there seem to be more real-world
implications, and for social scientists, it can serve as a treasure trove of
textual data. Indeed, sentiment analysis for tweets is a very popular project
for budding data scientists because it allows one to try their hand at both
data collection and data analysis.
In all our examples throughout the book, we have dealt with datasets that are
usually loaded from the packages we use, such as the 20 Newsgroup dataset
(from scikit-learn), the Lee news corpus (from Gensim), or the IMDB dataset
(from Keras). While it is important to know how to work with well-documented
datasets which are also used as benchmarks in research, things are not so easy
when working in real-world scenarios, and data has to be collected.
We've already dealt with the importance of thoroughly cleaning our textual
data, and with Twitter, we need to be doubly careful. We are now suddenly
presented with smileys, emoticons, hashtags, abbreviations, slang, and so
much more. Handling this means we have to make sure what kind of
analysis we wish to perform – in some cases, we might want to use
information in hashtags, and in some cases, we might not. It is the same case
with smiley faces – for example, including them as a word might heavily
influence the result of our sentiment analysis classifier – a ":-)" would likely
correlate highly with positive sentiment and a ":-(" with negative sentiment.
If we would just want to classify tweets, a model with smileys would be more
efficient – but if we also wish to understand what kind of semantic
information might be present in tweets, it might be more prudent to remove
anything which is not a word. In the end, how we wish to clean and process
our tweet depends on our use case.
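The choices discussed above can be made explicit in a small cleaning function. The following is a minimal sketch using only the re module; the regular expressions are deliberately simple, the emoticon pattern covers only a handful of common smileys, and the function name, its parameters, and the sample tweet are assumptions made for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
EMOTICON_RE = re.compile(r"[:;=][-']?[()DPp]")


def clean_tweet(text, keep_hashtags=True, keep_emoticons=True):
    """Strip URLs and mentions; optionally keep hashtag words and emoticons."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Either keep the hashtag word without '#', or drop the tag entirely
    text = HASHTAG_RE.sub(r"\1" if keep_hashtags else " ", text)
    if not keep_emoticons:
        text = EMOTICON_RE.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions
    return " ".join(text.split())


tweet = "Loved the new film :-) #mustwatch @friend https://example.com/x"
print(clean_tweet(tweet))
print(clean_tweet(tweet, keep_hashtags=False, keep_emoticons=False))
```

Keeping the two switches as parameters makes it easy to produce one version of the corpus for classification (with smileys) and another for semantic analysis (words only).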
With that being said, there are already datasets available which allow us to
leverage Twitter data for sentiment analysis. This link [11] gives us access to a
dataset that has labeled data for sentiment analysis, which is, as we can
guess, very important when we are training our model. The majority of this
data comes from the University of Michigan Kaggle challenge. Another popular
Twitter dataset is the Sentiment140 dataset.

We can use these datasets to train our classifier as they are already labeled.
As for using this classifier to actively label new tweets, we would need to use
a Twitter API to get the data. A popular Python library for the Twitter API is
tweepy [12], and it works well. Much like Reddit, we are required to make an
account before we can use it. Once we have created this account, we will have
received information about consumer tokens and access tokens. The
authentication tutorial [13] on the tweepy documentation website provides even
more information about how to deal with this.
Our first few lines of code to set up our API are as follows:
import tweepy
# Authentication and access using keys:
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
# Return API with authentication:
api = tweepy.API(auth)

We then use the API object to do all of our extraction. Considering the
current political climate, a topical Twitter query is Donald Trump, and a
popular user to analyze is @realDonaldTrump.
tweets = api.user_timeline(screen_name="realDonaldTrump", count=20)
for tweet in tweets:
    print(tweet.text)

And bam, with just about 7 lines of code, we have access to the 20 most
recent Donald Trump tweets (the count parameter can be raised, up to 200 per
request). Of course, this is just raw text, so we would still
have to clean the text, and more importantly, store it in an alternate data
structure that is more amenable to text analysis.
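One possible way to do this, sketched under assumptions: we copy the fields we care about into plain dictionaries, which are easy to save as JSON or feed to a tokenizer. The attribute names id, created_at, and text mirror those on tweepy's status objects, but the stand-in objects and sample texts below are made up so that the snippet runs without credentials.

```python
from collections import Counter
from types import SimpleNamespace


def tweets_to_records(tweets):
    """Turn status-like objects into plain dictionaries."""
    return [
        {"id": t.id, "created_at": t.created_at, "text": t.text}
        for t in tweets
    ]


# Stand-ins for the objects a Twitter client would return (hypothetical data).
tweets = [
    SimpleNamespace(id=1, created_at="2018-06-01", text="Great day for NLP"),
    SimpleNamespace(id=2, created_at="2018-06-02", text="NLP is great fun"),
]

records = tweets_to_records(tweets)
# A quick word count over all tweets, as a first taste of analysis.
word_counts = Counter(
    word for record in records for word in record["text"].lower().split()
)
print(records[0]["text"])
print(word_counts.most_common(2))
```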
If we wish to, for example, search for Donald Trump on Twitter instead of
looking at his personal handle, we would need to run this:
# tweepy 3.x; the method was renamed to api.search_tweets in tweepy 4
tweets = api.search(q='Donald Trump', count=100)

It is worth having a look at tweepy's documentation [14] to see what else it is
capable of, especially if we intend to use it extensively.
We've seen the availability of Twitter datasets that are already cleaned
and/or labeled – as well as how to mine real-time Twitter data off the
internet. Reddit remains another important source of data to mine, and
as social scientists increasingly look toward the internet to understand social
behavior better, they must also know how to interact with and mine this data. We
have just seen how easy it is to do this!

Getting a machine to learn to speak like a human being has been a holy grail
for computer scientists as well as linguists – of all the human behaviors
which machines can mimic, holding a conversation has been among the most
challenging. The quest to make a machine which can chat with
us humans (or a chatbot, if you will) has had many different approaches,
and while none of them work perfectly, it is important to be aware of them
and pick the kind that will be best for our purpose!
As for why we would want to build them – chatbots are increasingly used by
businesses, both to help customers by answering basic questions, as well as
for building more complex personal assistants. It is also becoming
increasingly easy to build such chatbots using open source tools.
There are many motivations to study the art of conversation for machines,
from both a research perspective and a more practical approach. An ideal
artificial intelligence bot should be able to remember context from earlier in
the conversation, build on this information when coming up with responses,
and possibly have a personality of its own. Of course, it is tough to properly
measure how well a conversation is held up, or how much of a personality a
bot has. What we can measure is how good a response a bot gives to
a question or query, and this offers one way to judge how well a bot performs.
The famous Turing Test [15] argues that if we cannot differentiate between
a chatbot and a human, it is a truly intelligent bot. Our purpose, however,
is not to fool human beings (or debate whether this is truly a measure of
intelligence!), but rather to build a bot which can answer human questions
with some level of intelligence.
We will discuss possible methods to do this, and provide documentation,
reading material, and code snippets. There is no one perfect chatbot, as this
is a field which is still evolving, and it will be quite a while before the
state of the art matures. It is highly likely that we ourselves have
interacted with such chatbots – Siri [16] is arguably the most popular
example, and Amazon's Alexa [17] is also a well-known personal digital
assistant. Despite the large amount of money pumped into these applications,
they still have many flaws, and customers have complained about Siri's
poor understanding of some English accents, as well as the lack of flexibility
in coming up with responses.
This post [18] compares the major personal assistants and ranks them on a
variety of measures – from smartest to funniest. It is also now possible for
businesses to build bots on Facebook Messenger using its chatbot API [19].
As we mentioned before, building a chatbot has become
increasingly easy.
Traditional chatbots worked largely with logical constructs which helped
create responses, where the program would attempt to break the user input
into different parts and match it with possible inputs while choosing an
appropriate output. One of the first such constructs was AIML (Artificial
Intelligence Markup Language) [20]. It served as an XML dialect for
creating natural language agents. Based on certain templates, it would help
choose an appropriate response to a query. One can argue that this is not true
intelligence, as we are directly limited by the number of examples we have,
and we do not naturally generate language but merely choose the closest
possible response to the question. For a long time though, this was the only
way to construct chatbots – while it was helped by advancements in the
field of natural language processing, such as better accuracy in POS
tagging or NER, it was still largely just choosing the best response.
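To make the template idea concrete, here is a minimal sketch of pattern and template matching in Python. It is only an analogy: AIML itself is written in XML, and the rules, the fallback text, and the function name respond are hypothetical examples of ours rather than part of any AIML interpreter.

```python
import re

# Hypothetical pattern/template pairs, loosely in the spirit of AIML
# categories; AIML itself is XML, this is only a Python analogy.
RULES = [
    (re.compile(r"\bmy name is (\w+)", re.I), "Nice to meet you, {0}."),
    (re.compile(r"\bi feel (\w+)", re.I), "Why do you feel {0}?"),
    (re.compile(r"\b(hello|hi)\b", re.I), "Hello! How can I help?"),
]
FALLBACK = "Tell me more."


def respond(user_input):
    """Return the filled-in template of the first rule that matches."""
    for pattern, template in RULES:
        match = pattern.search(user_input)
        if match:
            # Captured words are substituted into the template's slots.
            return template.format(*match.groups())
    return FALLBACK


print(respond("Hi there"))
print(respond("My name is Ada"))
print(respond("What is the weather like?"))
```

The last input matches no rule and falls through to the fallback, which shows the limitation discussed above: the bot can only ever say what one of its templates allows.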
In fact, in 2010, if we were to build a chatbot using Python, we would not be
using machine learning techniques as much as just using a rule-based
system with a little bit of fine-tuning. The very first chatbot, ELIZA [21], was
built in 1966! While we might still use this for a believable system, recent
advances in deep learning have led us a lot closer to replicating human
conversation.
We have already discussed the exact technologies which are doing this – text
generation using deep learning. In fact, this was our first exposure to deep
learning with text! Sequence to sequence modeling has made our
chatbots smarter, and they are now better able to understand the context
rather than simply choosing the most appropriate response.
The first real taste of this is in the landmark paper by Oriol Vinyals and Quoc
V. Le from Google, A Neural Conversational Model [22]. In the paper, they
describe a sequence to sequence model which predicts the next
sentence, given the previous one. Trained on an IT Helpdesk dataset, this bot
provided very, very human responses; admittedly, they didn't always make
sense, but it is certainly still very impressive. The following is an excerpt
from the research paper, where a machine and human discuss VPN
connectivity issues.
Conversation 1: VPN issues:
Describe your problem: I am having issues accessing VPN
Machine: hi
Human: hello
Machine: Could you please let me know what are the operating systems
you are using?
Human: I am using Linux
Machine: ok it was in the SSL VPN on which you stated it was broken
Machine: please check if you are logged in to corp network
Machine: what is the error that you are running, please
Human: I am seeing an error related to VPN
Machine: what is the error message that you are getting when connecting
to VPN using network connect?
Human: connection refused or something like that
Machine: may I know the version of network connect you connect?
Human: I am not sure I know that
Machine: is the network connect prompting for any user certificate
Human: yes
Machine: are you at home or at work at the moment?
Machine: office?
Human: I am at home
Machine: try this
Machine: goto <URL> page and enter your name and password
Human: done
Machine: try logging in with <NAME> and check if its working fine now
Human: yes, now it works!
Machine: great
Machine: anything else that I can help
Human: no, that was very useful
Machine: you are other than <NAME> : )
Machine: have a nice night
It is indeed quite remarkable to see what looks like a very plausible
conversation between two humans! What is especially impressive here is
that the questions asked do not have to be seen before in the training dataset
– the context here is learned during the training process. There are no rules
provided in the preceding example, and all the responses are learned purely
with data. It is really worth one's time to explore all the conversations listed
out in the paper. We will provide one more example to illustrate the power
of a generative model, again from the research paper:
Conversation 4: Philosophical Q&A:
Human: what is the purpose of life?
Machine: to serve the greater good.
Human: what is the purpose of living?
Machine: to live forever.
Human: what is the purpose of existence?
Machine: to find out what happens when we get to the planet earth.
Human: where are you now?
Machine: I'm in the middle of nowhere.
Human: what is the purpose of dying?
Machine: to have a life.
Human: what is the purpose of being intelligent?
Machine: to find out what it is.
Human: what is the purpose of emotions?
Machine: I don't know.
We can truly see the power of our generative model here, where it discusses
the meaning of life, among other things. There are some drawbacks, though
– it is difficult to train such a neural network to have a personality as well as
be helpful, which means it is difficult to pass the Turing test, for example. As
it is a generative model, it also means it might not give a coherent or relevant
solution every time, and rather just responds with what could possibly be the
best response. These neural nets are trained with question-answer pairs to
understand what a response should look like.
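As a rough sketch of what preparing such pairs can look like, the snippet below turns a conversation into (input, response) pairs by pairing each utterance with the reply that follows it. The function name make_pairs and the rule of skipping consecutive lines by the same speaker are our own assumptions, not the preprocessing used in the paper.

```python
def make_pairs(turns):
    """Pair every utterance with the reply that immediately follows it."""
    pairs = []
    for (speaker_a, text_a), (speaker_b, text_b) in zip(turns, turns[1:]):
        # Only keep pairs where the speaker changes; consecutive lines from
        # the same speaker are not a prompt followed by a response.
        if speaker_a != speaker_b:
            pairs.append((text_a, text_b))
    return pairs


# A few turns from the VPN excerpt quoted earlier in this chapter.
conversation = [
    ("Machine", "hi"),
    ("Human", "hello"),
    ("Machine", "Could you please let me know what are the operating "
                "systems you are using?"),
    ("Human", "I am using Linux"),
    ("Machine", "ok it was in the SSL VPN on which you stated it was broken"),
    ("Machine", "please check if you are logged in to corp network"),
]

pairs = make_pairs(conversation)
for source, target in pairs:
    print(repr(source), "->", repr(target))
```

Each pair would then be tokenized and fed to a sequence to sequence model, with the first element as the input sequence and the second as the target.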
Again, we would like to point out that we have already seen this in action
before when we trained a text generating neural network – the only
difference being in the kind of text we were generating. If we train our neural
network on all the works of J.K. Rowling, we will likely have a text generator
which will regale us with stories of magic, or at least attempt to. There have
been neural networks that attempt to write code as well, so it is not
surprising that a neural network trained on
conversations would perform fairly well as a chatbot.
Of course, as promising as this may look, it is not without its drawbacks –
such a generative model on its own is likely not going to be a working chatbot
and will still require a lot of supervision. We are also limited by our data,
and if we are building this chatbot for a purpose where precision in the tasks
to be performed is important, it might not be the best idea. In such a case,
choosing templates might be our best bet!
It is also possible to use these generative models in tandem with a
logic-based system. What if we would like to aimlessly chat with a bot without a
particular task to be performed, or maybe model a friend's personality in a
bot? In such a case, there is hardly a better alternative than a well-trained
RNN. One example of a project to try out: it is possible to mail
WhatsApp conversation logs to yourself. We can very easily extract our own
texts or a friend's texts, and train an RNN on this data. Imagine building a
rule-based bot to mimic a person's typing style – it isn't the easiest thing to
do.
We've seen then that there are two possible approaches to creating a chatbot
– either have an information retrieval system where we choose the most
appropriate response based on a set of rules, or create a model that generates
text in response to the input. Both models have their pros and cons, which we
have discussed before.
If we are going to use our intended chatbot in production, using an
information retrieval-based system or using a standard chatbot API may be
more practical. Two examples of such frameworks are RASA-NLU [23] and
ChatterBot [24].
When using such frameworks, we aren't really building an intelligent system
ourselves, but using one built by the API we chose. This isn't necessarily a
bad thing, especially if it gets the job done. RASA-NLU, for example, uses
JSON files to train its models. You can have a look at the sample data here
[25].
By adding more entities and intents, the model learns more context, and
can better understand the questions we ask the bot. Interestingly, the
back-end options which power the bot include spaCy and scikit-learn, two
libraries we should be comfortable using with text by now!
Under the hood, they use Word2Vec to better understand the intent, spaCy to
clean up text, and scikit-learn to build models. For more details into the
inside workings of how RASA functions, their blog post [26] on Medium runs
us through some of the concepts used, most of which we would be
comfortable looking at by this point. One of the ideas of RASA involves being
able to write your own parts of the bot instead of it being like a traditional
third-party bot API. The code is all in Python, so we can really get our hands
dirty and play around with it. It also gives us ideas about how to possibly construct
our own clever bot, if we would like to be a little more ambitious!
{
  "text": "show me a mexican place in the centre",
  "intent": "restaurant_search",
  "entities": [
    {
      "start": 31,
      "end": 37,
      "value": "centre",
      "entity": "location"
    },
    {
      "start": 10,
      "end": 17,
      "value": "mexican",
      "entity": "cuisine"
    }
  ]
}

This is an example of a JSON entry to train a RASA model. Here, we give the
example text and the intent which we are attempting to learn, while the
entities field lists each entity with its type, its value, and its character
offsets in the text.
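Since the start and end fields are character offsets into the text, a small check can catch mislabeled spans before training. The following is a sketch using only the standard library; the helper name misaligned_entities is ours, and treating end as an exclusive offset is an assumption based on the numbers in the example above.

```python
import json

# The training example from the text, as a JSON string.
RAW = """
{
  "text": "show me a mexican place in the centre",
  "intent": "restaurant_search",
  "entities": [
    {"start": 31, "end": 37, "value": "centre", "entity": "location"},
    {"start": 10, "end": 17, "value": "mexican", "entity": "cuisine"}
  ]
}
"""


def misaligned_entities(example):
    """Return entities whose offsets do not slice out their stated value."""
    text = example["text"]
    return [
        entity
        for entity in example["entities"]
        if text[entity["start"]:entity["end"]] != entity["value"]
    ]


example = json.loads(RAW)
print(misaligned_entities(example))  # an empty list means every span lines up
```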
Of course, building a chatbot doesn't just require us to understand how
natural language works – we should also be able to construct a functional
front end that can actually talk to the user. This means being aware of how
we pass information to an online app, and how to set up the pipeline. This is
beyond the scope of the book, but luckily for us, using RASA Core to do this
is quite straightforward and their documentation [27] does a good job of
explaining how to set up a conversational model. We have to be aware of both
RASA NLU and RASA Core to be able to get the most out of things. With RASA
Core we are able to set up our domain and stories, and we use RASA NLU as our
mind, whose job is to extract entities. Stories describe the way we expect our
bot to communicate with users, and we have to train our bot on them just as we
would train it on our domain. The tutorial [28] runs us through how to build
a basic bot using RASA Core.
Another possible Python-based option to help build our bot would be
ChatterBot [29]. The logic behind ChatterBot is quite similar to how most
information-retrieval based chatbots would work – based on the input
sentence of the user, it selects a known statement which is similar to the
input statement. There are multiple possible responses which can be
selected, and we call each of the components which creates a response a logic
adapter. Once we have our collection of logic adapters, we can return the
most likely response to the question. We can create and train our own
adapters, both with respect to what kind of information to expect (the input),
as well as what kind of responses should be created.
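The idea of several adapters competing to answer can be sketched in a few lines of plain Python. This is not ChatterBot's actual API: the adapter functions, the confidence scores, and the small response store are hypothetical, and string similarity from difflib stands in for whatever comparison a real adapter would use.

```python
from difflib import SequenceMatcher


def greeting_adapter(statement):
    """Answer greetings with full confidence, everything else with none."""
    if statement.lower().strip("!. ") in {"hi", "hello", "hey"}:
        return 1.0, "Hello!"
    return 0.0, ""


def closest_match_adapter(statement):
    """Reply with the stored response whose prompt is most similar."""
    known = {  # hypothetical prompt -> response store
        "how are you?": "I am good.",
        "thank you": "You are welcome.",
    }
    scored = [
        (SequenceMatcher(None, statement.lower(), prompt).ratio(), reply)
        for prompt, reply in known.items()
    ]
    return max(scored)


def get_response(statement, adapters):
    """Ask every adapter and return the most confident response."""
    confidence, response = max(adapter(statement) for adapter in adapters)
    return response


adapters = [greeting_adapter, closest_match_adapter]
print(get_response("Hello", adapters))
print(get_response("How are you", adapters))
```

Each adapter returns a (confidence, response) pair, so adding a new kind of behavior only means writing one more function and appending it to the list.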
Training such a bot is also very straightforward for a quick test:
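from chatterbot import ChatBot
from chatterbot.trainers import ListTrainer
bot = ChatBot('Stephen')
# ChatterBot 1.0+ attaches a trainer to the bot; older releases used
# bot.set_trainer(ListTrainer) followed by bot.train([...]).
trainer = ListTrainer(bot)
trainer.train([
    'How are you?',
    'I am good.',
    'That is good to hear.',
    'Thank you',
    'You are welcome.',
])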

Now this will obviously not make for a very mature or powerful bot but is
rather an example of how easily we can train a chatbot with such an API.
We've seen examples of how we can use specific libraries aimed at building
chatbots - but how would we start building our very own, at least slightly
functional chatbot?
We've already discussed the two separate philosophies of doing this, one
being simply generating text, and the other a more pipelined approach.
A pipelined approach would start by parsing and cleaning the user's input,
and first identifying the kind of sentence the user entered. Is it a question,
or a statement? Does it relate to the bot's domain, and if so, how? One way to
attempt to find this is to build a classifier. We are well aware of how to build
a classifier, and how well a neural network can perform in choosing between
different classes of documents.

Fig 15.1 The process flow diagram as described on the ChatterBot documentation website

Now that we have the type of user input, let's further analyze the sentence;
break it up into different parts of speech, identify named entities, and
appropriately construct a sentence as a response. We saw in the RASA
example how we added information about Mexican being a kind of cuisine.
Using Word2Vec, we can assemble a series of different cuisine options, and
suggest alternatives to our user if no Mexican food happens to be available.
Now how do we choose among the best possible responses to give to the
user? Again, a neural network can come in quite handy here, where we can
attempt to predict the odds of a particular response based on the input – and
choose our most appropriate response. Of course, this means that we still
have to construct appropriate question-response pairs. Once we choose a
question that is similar (for example, find a place for lunch, and find a place
for dinner would be similar questions), we can change the appropriate
proper noun in the response with the entity in the question and suggest it as
a possible output.
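A minimal sketch of this retrieve-and-substitute step follows. Word overlap (Jaccard similarity) stands in for a trained model that scores responses, and the stored pairs, the {meal} slot, and the MEALS set are hypothetical placeholders for real question-response data and a real entity tagger.

```python
import re


def tokens(text):
    """Lowercase word set of a sentence."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two sentences, from 0.0 to 1.0."""
    a, b = tokens(a), tokens(b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Hypothetical question-response pairs; a response may contain a {meal} slot.
PAIRS = [
    ("find a place for lunch", "Here is a good spot for {meal}."),
    ("what is the weather today", "The forecast is not available yet."),
]
MEALS = {"breakfast", "lunch", "dinner"}  # stand-in for a real entity tagger


def answer(question):
    """Pick the most similar stored question and fill in the user's entity."""
    _, template = max(PAIRS, key=lambda pair: jaccard(question, pair[0]))
    found = tokens(question) & MEALS
    meal = sorted(found)[0] if found else "a meal"
    return template.format(meal=meal)


print(answer("find a place for dinner"))
```

The stored question mentions lunch, but because the user's entity is swapped into the response, the bot answers about dinner.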
If our chatbot is not built to perform a targeted task in a domain, but
just to attempt conversation, we can generate a response. This means that we
are no longer interested in analyzing the user's input sentence for parts of
speech or entities, and we have no use for choosing between sample
responses – we want to organically generate a unique (or mostly unique)
response to the question. In this case, we will just plug in an RNN and let it
work its magic. We then simply spit this back at the user and continue the
conversation in the hope that it is interesting. We've seen multiple examples
of this when discussing Google's Neural Conversational model.
This GitHub repository [30] and blog post [31] discuss an approach not using
any machine learning, but just a basic analysis of sentences, and replying
with a bot that has access to only a basic set of responses. This way of
building a chatbot may not be as powerful, but getting used to the ideas
behind the responses is important, and can help us in designing our own
pipeline for building a chatbot, if we wish to construct one without the help
of an external framework or API.
Of course, this is not a zero-sum game where we have to choose between
choosing appropriate responses or generating text. Consider the following
pipeline:

1. Accept user input

2. Classify input as a statement, question, or greeting – basically, identify
the type of input

3. If it is a greeting, respond in kind – Hello!

4. If it is a question, look up similar questions in our dataset, perform
rudimentary sentence analysis, and choose an appropriate response with
the nouns or adjectives replaced

5. If it is a statement or attempt at conversation, let our generative model
talk back to the user – at least until the user has a question

6. If the user is saying goodbye, politely wish them goodbye
This is a rough approach – we haven't mentioned how exactly we are going to
find a similar document (though going through Chapter 11, Similarity Queries
and Summarization is a good start!), or how we are going to do our
multi-class classification. Based on the context, we would want to plug and
play accordingly. At this point, we have all the tools needed to follow this pipeline
for our chatbot!
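The six steps above can be sketched as a small dispatch function. Everything here is a stand-in: keyword rules take the place of a trained classifier, word overlap takes the place of a similarity query, and generate simply returns a fixed sentence where a trained generative model would go.

```python
import re

GREETINGS = {"hi", "hello", "hey"}
GOODBYES = {"bye", "goodbye"}
# Hypothetical stored questions and their responses.
ANSWERS = {"where can i eat": "There is a Mexican place in the centre."}


def classify(text):
    """Rule-based stand-in for a trained multi-class classifier."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    if words & GOODBYES:
        return "goodbye"
    if words & GREETINGS:
        return "greeting"
    if text.strip().endswith("?"):
        return "question"
    return "statement"


def retrieve(text):
    """Stand-in for a similarity query: choose the best word overlap."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    best = max(ANSWERS, key=lambda q: len(words & set(q.split())))
    return ANSWERS[best]


def generate(text):
    """Placeholder for a generative model such as a trained RNN."""
    return "Interesting, tell me more."


def reply(text):
    """Route the input to the right component based on its type."""
    kind = classify(text)
    if kind == "greeting":
        return "Hello!"
    if kind == "goodbye":
        return "Goodbye, have a nice day!"
    if kind == "question":
        return retrieve(text)
    return generate(text)


for line in ["Hello", "Where can I eat tonight?", "I like tacos", "bye"]:
    print(line, "->", reply(line))
```

Swapping any one of the stand-ins for a trained model leaves the rest of the pipeline untouched, which is the main attraction of structuring the bot this way.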
This employs ideas of both generating texts as well as information retrieval.
There are multiple machine learning models that can be employed in such a
model – a classifier to decide the input type, a topic model for finding similar
documents, Word2Vec for identifying intent or certain entities, and a neural
network to generate text. All of these models need to be trained appropriately
for the kind of task they are expected to perform, with what could be
extremely different training data. For example, if we are creating a chatbot
with the purpose of helping the user to find the perfect restaurant, we would
train the final output to be a restaurant suggestion, and train the
conversational bot with data from Reddit/r/food [32]! We can throw in some
tweets as well, for a bot which attempts to replicate normal food-related
conversation in the best possible way.
Of course, building an entire chatbot that can provide both interesting
conversations and help find the closest Mexican restaurant is no trivial task
– it is why we have entire teams at Google or Apple working on this. But in
an attempt to build such a machine, we can learn a lot about the kind of
things we can do with text along the way. There is no one best way to go
about building such a bot and it depends highly on the context and problem
we are trying to solve.
A chatbot employs some of the most advanced text analysis techniques –
machine learning, computational linguistics, and also basic software
engineering sense when deciding our pipeline – and serves as an excellent way
to exercise the skills picked up throughout this book. Since there is no one
best chatbot creation method, we have instead in this chapter discussed
popular methods currently available in the Python natural language
processing world and laid them out in front of you, the reader – it is now up
to you to pick up the tools and start building!

In this chapter, we discussed two important text analysis problems –
sentiment analysis and building a chatbot. Sentiment analysis refers to the
task of understanding sentiment in the text, and we have seen the various
libraries, algorithms, and approaches to perform this task. A crucial part of
performing such tasks is gathering data – we then saw how to download data
from internet sources such as Twitter or Reddit. The final section of the
chapter focused on how to build chatbots. We explored it from both a
historical and theoretical point of view and explored Python libraries that
help us easily build chatbots. This brings us to the end of the book – you
should now be confident in analyzing text the way you see fit, with a variety
of techniques, approaches, and settings. We focused on using the most
efficient Python open source libraries, with a focus on Gensim, spaCy, Keras,
and scikit-learn throughout the book, while still discussing the other Python
text analysis libraries available. It is important to know which tool works
best in which context, as well as which kind of pipelines and architectures we
should explore for a problem. If you have followed the book carefully, with
the code examples, Jupyter notebooks, and external links – you should now
be able to confidently analyze your text.

[1] TextBlob:

[2] NLTK Sentiment Analysis:

[3] NLTK Sentiment Analyser class:

[4] LSTMs for Sentiment Analysis:

[5] Understanding LSTM Networks:

[6] Reddit:

[7] News subreddit:

[8] JSON:

[9] Time library:

[10] sense2vec:

[11] Twitter datasets:

[12] Tweepy:

[13] Tweepy authentication tutorial:

[14] Tweepy documentation:

[15] Turing Test:

[16] Siri:

[17] Alexa:

[18] Ranking Digital Assistants:

[19] FB messenger bots:

[20] AIML:

[21] ELIZA:

[22] A Neural Conversational Model:

[23] RASA-NLU:

[24] Chatterbot:

[25] RASA sample data:

[26] Do it yourself NLP:

[27] RASA Core:

[28] Basic bot building:

[29] Chatterbot:

[30] Brobot:

[31] Chatbot fundamentals:

[32] reddit food:


