Ultimate Guide to Understand & Implement Natural Language Processing (with codes in Python)
According to industry estimates, only 21% of the available data is present in structured
form. Data is being generated as we speak, as we tweet, as we send messages on
WhatsApp and through various other activities. The majority of this data exists in textual
form, which is highly unstructured in nature.
A few notable examples include tweets / posts on social media, user-to-user chat
conversations, news, blogs and articles, product or service reviews, and patient records in
the healthcare sector. More recent ones include chatbots and other voice-driven bots.
Despite being high dimensional, the information present in this data is not directly accessible
unless it is processed (read and understood) manually or analyzed by an automated
system.
In order to produce significant and actionable insights from text data, it is important to get
acquainted with the techniques and principles of Natural Language Processing (NLP).
So, if you plan to create chatbots this year, or you want to use the power of unstructured
text, this guide is the right starting point. It unearths the concepts of natural
language processing, its techniques and their implementation. The aim of the article is to teach
the concepts of natural language processing and apply them to real data sets.

Table of Contents
1. Introduction to NLP
2. Text Preprocessing
   - Noise Removal
   - Lexicon Normalization
     - Lemmatization
     - Stemming
   - Object Standardization
3. Text to Features (Feature Engineering on text data)
   - Syntactic Parsing
     - Dependency Grammar
     - Part of Speech Tagging
   - Entity Extraction
     - Phrase Detection
     - Named Entity Recognition
     - Topic Modelling
     - N-Grams
   - Statistical Features
     - TF-IDF
     - Frequency / Density Features
     - Readability Features
   - Word Embeddings
4. Important tasks of NLP
   - Text Classification
   - Text Matching
     - Levenshtein Distance
     - Phonetic Matching
     - Flexible String Matching
   - Coreference Resolution
   - Other Problems
5. Important NLP libraries

1. Introduction to Natural Language Processing
NLP is a branch of data science that consists of systematic processes for analyzing,
understanding, and deriving information from text data in a smart and efficient manner.
By utilizing NLP and its components, one can organize massive chunks of text data,
perform numerous automated tasks and solve a wide range of problems such as
automatic summarization, machine translation, named entity recognition, relationship
extraction, sentiment analysis, speech recognition, and topic segmentation.
Before moving further, I would like to explain some terms that are used in the article:

- Tokenization – the process of converting a text into tokens
- Tokens – words or entities present in the text
- Text object – a sentence, a phrase, a word or an article
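For example, a text object can be tokenized with NLTK as below (a small illustrative addition):

```

from nltk import word_tokenize

text_object = "Analytics Vidhya is a thriving community"   # a text object (here, a sentence)
tokens = word_tokenize(text_object)                        # tokenization
print(tokens)
# ['Analytics', 'Vidhya', 'is', 'a', 'thriving', 'community']

```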

2. Text Preprocessing
Since text is the most unstructured form of all the available data, various types of noise are
present in it and the data is not readily analyzable without any pre-processing. The entire
process of cleaning and standardization of text, making it noise-free and ready for analysis,
is known as text preprocessing.

It is predominantly comprised of three steps:

- Noise Removal
- Lexicon Normalization
- Object Standardization

The following image shows the architecture of the text preprocessing pipeline.

2.1 Noise Removal
Any piece of text which is not relevant to the context of the data and the end-output can be
specified as noise.
For example – language stopwords (commonly used words of a language – is, am, the, of,
in etc), URLs or links, social media entities (mentions, hashtags), punctuations and industry
specific words. This step deals with removal of all types of noisy entities present in the text.
A general approach for noise removal is to prepare a dictionary of noisy entities, and iterate
the text object by tokens (or by words), eliminating those tokens which are present in the
noise dictionary.
Following is the python code for the same purpose.

```

# Sample code to remove noisy words from a text
noise_list = ["is", "a", "this", "..."]

def _remove_noise(input_text):
    words = input_text.split()
    noise_free_words = [word for word in words if word not in noise_list]
    noise_free_text = " ".join(noise_free_words)
    return noise_free_text

_remove_noise("this is a sample text")
>>> "sample text"

```

Another approach is to use regular expressions while dealing with special patterns of
noise. The following python code removes a regex pattern from the input text:

```

# Sample code to remove a regex pattern
import re

def _remove_regex(input_text, regex_pattern):
    urls = re.finditer(regex_pattern, input_text)
    for i in urls:
        input_text = re.sub(i.group().strip(), '', input_text)
    return input_text

regex_pattern = "#[\w]*"

_remove_regex("remove this #hashtag from here ", regex_pattern)
>>> "remove this  from here "

```

2.2 Lexicon Normalization
Another type of textual noise is the multiple representations exhibited by a single word.
For example – “play”, “player”, “played”, “plays” and “playing” are different variations of
the word “play”. Though they mean different things, contextually they are all similar. This step
converts all the disparities of a word into their normalized form (also known as the lemma).
Normalization is a pivotal step for feature engineering with text as it converts the high
dimensional features (N different features) into a low dimensional space (1 feature), which
is an ideal ask for any ML model.
The most common lexicon normalization practices are:

- Stemming: Stemming is a rudimentary rule-based process of stripping the suffixes
  (“ing”, “ly”, “es”, “s” etc) from a word.
- Lemmatization: Lemmatization, on the other hand, is an organized & step-by-step
  procedure of obtaining the root form of the word. It makes use of vocabulary
  (dictionary importance of words) and morphological analysis (word structure and
  grammar relations).

Below is sample code that performs lemmatization and stemming using python’s
popular library – NLTK.

```

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

lem = WordNetLemmatizer()
stem = PorterStemmer()

word = "multiplying"

lem.lemmatize(word, "v")
>> "multiply"

stem.stem(word)
>> "multipli"

```

2.3 Object Standardization
Text data often contains words or phrases which are not present in any standard lexical
dictionary. These pieces are not recognized by search engines and models.
Some examples are – acronyms, hashtags with attached words, and colloquial
slang. With the help of regular expressions and manually prepared data dictionaries, this
type of noise can be fixed. The code below uses a dictionary lookup method to replace
social media slang in a text.

```

lookup_dict = {'rt': 'Retweet', 'dm': 'direct message', 'awsm': 'awesome', 'luv': 'love'}  # extend as needed

def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    return new_text

_lookup_words("RT this is a retweeted tweet by Gautam ")
>> "Retweet this is a retweeted tweet by Gautam"

```

Apart from the three steps discussed so far, other types of text preprocessing include
handling encoding-decoding noise, grammar checking, and spelling correction.
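Tying the three steps together, here is a minimal sketch (an illustrative addition, not from the original article) of a combined preprocessing pipeline that reuses the _remove_noise and _lookup_words helpers defined above along with NLTK's lemmatizer:

```

# A minimal preprocessing pipeline combining the three steps described above.
from nltk.stem.wordnet import WordNetLemmatizer

lem = WordNetLemmatizer()

def preprocess(text):
    text = _lookup_words(text)                               # object standardization
    text = _remove_noise(text)                               # noise removal
    words = [lem.lemmatize(w, "v") for w in text.split()]    # lexicon normalization
    return " ".join(words)

print(preprocess("RT this is a sample tweet luv it"))
# "Retweet sample tweet love it"

```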

3. Text to Features (Feature Engineering on text data)
To analyse preprocessed data, it needs to be converted into features. Depending upon
the usage, text features can be constructed using assorted techniques – Syntactic
Parsing, Entities / N-grams / word-based features, Statistical features, and Word
Embeddings. Read on to understand these techniques in detail.

3.1 Syntactic Parsing
Syntactic parsing involves the analysis of the words in a sentence for grammar and their
arrangement in a manner that shows the relationships among the words. Dependency
Grammar and Part of Speech tags are the important attributes of text syntactics.
Dependency Trees – Sentences are composed of words sewed together. The
relationships among the words in a sentence are determined by the basic dependency
grammar. Dependency grammar is a class of syntactic text analysis that deals with
(labeled) asymmetric binary relations between two lexical items (words). Every relation
can be represented in the form of a triplet (relation, governor, dependent). For example:
consider the sentence – “Bills on ports and immigration were submitted by Senator
Brownback, Republican of Kansas.” The relationships among the words can be observed in
the form of a tree representation as shown:

The tree shows that “submitted” is the root
word of this sentence, and is linked by two sub-trees (subject and object subtrees). Each
subtree is itself a dependency tree, with relations such as – (“Bills” <-> “ports” by the
“preposition” relation), (“ports” <-> “immigration” by the “conjugation” relation).
This type of tree, when parsed recursively in a top-down manner, gives grammar relation
triplets as output which can be used as features for many NLP problems like entity-wise
sentiment analysis, actor & entity identification, and text classification. The python
wrapper StanfordCoreNLP (by Stanford NLP Group, only commercial license) and NLTK
dependency grammars can be used to generate dependency trees.
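For illustration, here is a minimal sketch of extracting (relation, governor, dependent) triplets with spaCy (covered in the libraries section below); it assumes the small English model en_core_web_sm is installed and is not the StanfordCoreNLP pipeline mentioned above:

```

import spacy

# load spaCy's small English pipeline (assumes: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Bills on ports and immigration were submitted by Senator Brownback.")

# each token carries its dependency relation (relation) and its head (governor)
for token in doc:
    print((token.dep_, token.head.text, token.text))
# e.g. ('nsubjpass', 'submitted', 'Bills'), ('prep', 'Bills', 'on'), ...
# (exact labels depend on the model version)

```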

Part of speech tagging – Apart from the grammar relations, every word in a sentence is
also associated with a part of speech (pos) tag (nouns, verbs, adjectives, adverbs etc). The
pos tags define the usage and function of a word in the sentence. Here is a list of all
possible pos-tags defined by the Penn Treebank (University of Pennsylvania). The following code uses NLTK to perform
pos tagging annotation on input text. (NLTK provides several implementations; the default one is
the perceptron tagger.)

```

from nltk import word_tokenize, pos_tag

text = "I am learning Natural Language Processing on Analytics Vidhya"
tokens = word_tokenize(text)
print(pos_tag(tokens))

>>> [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'),
     ('Processing', 'NNP'), ('on', 'IN'), ('Analytics', 'NNP'), ('Vidhya', 'NNP')]

```

Part of Speech tagging is used for many important purposes in NLP:
A. Word sense disambiguation: Some words have multiple meanings according
to their usage. For example, in the two sentences below:
I. “Please book my flight for Delhi”
II. “I am going to read this book in the flight”
“Book” is used in different contexts, and the part of speech tag differs between the two cases.
In sentence I, the word “book” is used as a verb, while in II it is used as a noun.
(The Lesk Algorithm is also used for similar purposes.)
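As a quick illustration of the Lesk algorithm mentioned above, NLTK ships an implementation in nltk.wsd (this snippet is an illustrative addition, not from the original article):

```

from nltk import word_tokenize
from nltk.wsd import lesk

# Lesk picks the WordNet sense of "book" that best overlaps with each context
print(lesk(word_tokenize("Please book my flight for Delhi"), "book", "v"))
print(lesk(word_tokenize("I am going to read this book in the flight"), "book", "n"))
# each call prints the WordNet Synset chosen for "book" in that context

```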

B. Improving word-based features: A learning model could learn different contexts of a
word when words are used as features, however if the part of speech tag is linked with them,
the context is preserved, thus making stronger features. For example:

Sentence – “book my flight, I will read this book”
Tokens – (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1)
Tokens with POS – (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1),
(“will_MD”, 1), (“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)

C. Normalization and Lemmatization: POS tags are the basis of the lemmatization process
for converting a word to its base form (lemma).

D. Efficient stopword removal: POS tags are also useful in efficient removal of
stopwords.
For example, there are some tags which always define the low frequency / less important
words of a language. For example: (IN – “within”, “upon”, “except”), (CD – “one”, “two”,
“hundred”), (MD – “may”, “must” etc)
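As a rough sketch of how such POS-augmented token features can be built with NLTK (an illustrative addition, not from the original article):

```

from collections import Counter
from nltk import word_tokenize, pos_tag

sentence = "book my flight, I will read this book"

# plain token counts collapse the two usages of "book" into one feature
token_features = Counter(word_tokenize(sentence))

# appending the POS tag keeps verb and noun usages apart
pos_features = Counter("{}_{}".format(word, tag)
                       for word, tag in pos_tag(word_tokenize(sentence)))

print(token_features)
print(pos_features)   # POS-tagged tokens, e.g. 'flight_NN', 'will_MD', ...

```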

3.2 Entity Extraction (Entities as features)
Entities are defined as the most important chunks of a sentence – noun phrases, verb
phrases or both. Entity detection algorithms are generally ensemble models of rule based
parsing, dictionary lookups, pos tagging and dependency parsing. The applicability of entity
detection can be seen in automated chat bots, content analyzers and consumer
insights.

Topic Modelling and Named Entity Recognition are the two key entity detection methods in
NLP.

A. Named Entity Recognition (NER)
The process of detecting named entities such as person names, location names,
company names etc from the text is called NER. For example:
Sentence – Sergey Brin, the manager of Google Inc. is walking in the streets of New York.

Named Entities – ( “person” : “Sergey Brin” ), (“org” : “Google Inc.”), (“location” : “New
York”)
A typical NER model consists of three blocks:
Noun phrase identification: This step deals with extracting all the noun phrases from a
text using dependency parsing and part of speech tagging.
Phrase classification: This is the classification step in which all the extracted noun
phrases are classified into their respective categories (locations, names etc). Google Maps API
provides a good way to disambiguate locations. The open databases from DBpedia and
Wikipedia can then be used to identify person names or company names. Apart from this, one
can curate lookup tables and dictionaries by combining information from different
sources.
Entity disambiguation: Sometimes it is possible that entities are misclassified, hence
creating a validation layer on top of the results is useful. Knowledge graphs can be
exploited for this purpose. Popular knowledge graphs include Google Knowledge Graph,
IBM Watson and Wikipedia.
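For a quick hands-on illustration (an addition to the original text), NLTK's built-in chunker can produce a rough NER result for the example sentence; it assumes the maxent_ne_chunker and words resources have been downloaded:

```

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Sergey Brin, the manager of Google Inc. is walking in the streets of New York."

# ne_chunk labels noun phrase chunks with entity types such as PERSON, ORGANIZATION, GPE
tree = ne_chunk(pos_tag(word_tokenize(sentence)))
print(tree)
# e.g. (PERSON Sergey/NNP Brin/NNP), (ORGANIZATION Google/NNP Inc./NNP), (GPE New/NNP York/NNP)

```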

B. Topic Modeling
Topic modeling is a process of automatically identifying the topics present in a text corpus; it
derives the hidden patterns among the words in the corpus in an unsupervised manner.
Topics are defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic
model results in – “health”, “doctor”, “patient”, “hospital” for a topic – Healthcare, and “farm”,
“crops”, “wheat” for a topic – Farming.
Latent Dirichlet Allocation (LDA) is the most popular topic modelling technique. Following is
the code to implement topic modeling using LDA in python.

```

doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."

doc_complete = [doc1, doc2, doc3]
doc_clean = [doc.split() for doc in doc_complete]

import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)

# Results
print(ldamodel.print_topics())

```

C. N-Grams as Features
A combination of N words together is called an N-gram. N-grams (N > 1) are generally more
informative than single words (unigrams) as features. Also, bigrams (N = 2) are often
considered the most important features of all. The following code generates
bigrams from a text.

```

def generate_ngrams(text, n):
    words = text.split()
    output = []
    for i in range(len(words) - n + 1):
        output.append(words[i:i+n])
    return output

>>> generate_ngrams('this is a sample text', 2)
# [['this', 'is'], ['is', 'a'], ['a', 'sample'], ['sample', 'text']]

```

3.3 Statistical Features
Text data can also be quantified directly into numbers using several techniques described in
this section:

A. Term Frequency – Inverse Document Frequency (TF – IDF)
TF-IDF is a weighted model commonly used for information retrieval problems. It aims to
convert text documents into vector models on the basis of the occurrence of words in the
documents, without considering the exact ordering. For example – say there is a
dataset of N text documents. For any document “D”, TF and IDF are defined as –
Term Frequency (TF) – TF for a term “t” is defined as the count of the term “t” in document
“D”.
Inverse Document Frequency (IDF) – IDF for a term is defined as the logarithm of the ratio of the
total number of documents available in the corpus to the number of documents containing the term t.
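As a small worked example of these two definitions (an illustrative sketch added here, separate from the scikit-learn code further below):

```

# A minimal worked example of the TF-IDF definitions above:
# tf(t, D) = count of t in D, idf(t) = log(N / df(t)), tf-idf = tf * idf
import math

docs = [["this", "is", "sample", "text"],
        ["another", "sample", "document"],
        ["text", "of", "a", "third", "document"]]

N = len(docs)

def tf_idf(term, doc):
    tf = doc.count(term)                         # raw term frequency in D
    df = sum(1 for d in docs if term in d)       # number of documents containing the term
    return tf * math.log(N / df)                 # tf * idf

print(tf_idf("sample", docs[0]))   # appears in 2 of 3 docs -> 1 * log(3/2) ≈ 0.405
print(tf_idf("this", docs[0]))     # appears in 1 of 3 docs -> 1 * log(3/1) ≈ 1.099

```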

TF-IDF – The TF-IDF formula gives the relative importance of a term in a corpus (list of
documents): TF-IDF(t, D) = TF(t, D) * IDF(t). Following is the code using python’s
scikit-learn package to convert a text into tf-idf vectors:

```

from sklearn.feature_extraction.text import TfidfVectorizer

obj = TfidfVectorizer()

corpus = ['This is sample document.', 'another random document.', 'third sample document text']

X = obj.fit_transform(corpus)
print(X)

>>>
(0, 1)    0.345205016865
(0, 4)    0.444514311537
...
(2, 1)    0.345205016865
(2, 4)    0.444514311537

```

The model creates a vocabulary dictionary and assigns an index to each word. Each row in
the output contains a tuple (i, j) and the tf-idf value of the word at index j in document i.
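That vocabulary can be inspected directly on the fitted vectorizer object from the snippet above (an illustrative addition):

```

# word -> column index mapping learned by TfidfVectorizer (indices assigned alphabetically)
print(obj.vocabulary_)
# e.g. {'another': 0, 'document': 1, 'is': 2, 'random': 3, 'sample': 4, 'text': 5, 'third': 6, 'this': 7}

```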

B. Count / Density / Readability Features
Count or density based features can also be used in models and analysis. These features
might seem trivial but show a great impact in learning models. Some of these features are:
Word Count, Sentence Count, Punctuation Count and Industry-specific word counts. Other
types of measures include readability measures such as syllable counts, the SMOG index and
Flesch reading ease. Refer to the Textstat library to create such features.
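A minimal sketch of such readability features, assuming the textstat package is installed (an addition to the original text):

```

import textstat

text = "Natural Language Processing helps machines understand human language."

# a few count / readability style features for the text
print(textstat.syllable_count(text))       # syllable count
print(textstat.flesch_reading_ease(text))  # Flesch reading ease score
print(textstat.smog_index(text))           # SMOG index

```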

3.4 Word Embedding (text vectors)
Word embedding is the modern way of representing words as vectors. The aim of word
embedding is to redefine high dimensional word features as low dimensional feature
vectors while preserving the contextual similarity in the corpus. They are widely used in deep
learning models such as Convolutional Neural Networks and Recurrent Neural Networks.
Word2Vec and GloVe are the two popular models used to create word embeddings of a text.
These models take a text corpus as input and produce the word vectors as output.
The Word2Vec model is composed of a preprocessing module, a shallow neural network model
called Continuous Bag of Words and another shallow neural network model called Skip-gram. These
models are widely used for many other NLP problems. Word2Vec first constructs a
vocabulary from the training corpus and then learns word embedding representations.
The following code uses the gensim package to prepare word embeddings as vectors.

```

from gensim.models import Word2Vec

sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'],
             ['machine', 'learning'], ['deep', 'learning']]

# train the model on your corpus
model = Word2Vec(sentences, min_count=1)

# word vectors live on model.wv in recent gensim versions
print(model.wv.similarity('data', 'science'))
>>> 0.11222489293

print(model.wv['learning'])
>>> array([ 0.00459356,  0.00303564, -0.00467622,  0.00209638, ...])

```

These vectors can be used as feature vectors for an ML model, to measure text similarity using
cosine similarity techniques, and for word clustering and text classification.

4. Important tasks of NLP

This section talks about different use cases and problems in the field of natural language
processing.

4.1 Text Classification
Text classification is one of the classical problems of NLP. Notable examples include –
Email Spam Identification, topic classification of news, sentiment classification and
organization of web pages by search engines.
Text classification, in common words, is defined as a technique to systematically classify a
text object (document or sentence) into one of a fixed set of categories. It is really helpful when the
amount of data is too large, especially for organizing, information filtering, and storage
purposes.
A typical natural language classifier consists of two parts: (a) Training (b) Prediction, as
shown in the image below. First the text input is processed and features are created. The
machine learning model then learns these features and is used for predicting against the
new text.

Here is code for a naive bayes classifier using the TextBlob library (built on top of NLTK).

```

from textblob.classifiers import NaiveBayesClassifier as NBC
from textblob import TextBlob

training_corpus = [
    ('I am exhausted of this work.', 'Class_B'),
    ("I can't cooperate with this", 'Class_B'),
    ('He is my badest enemy!', 'Class_B'),
    ('My management is poor.', 'Class_B'),
    ('I love this burger.', 'Class_A'),
    ('This is an brilliant place!', 'Class_A'),
    ('I feel very good about these dates.', 'Class_A'),
    ('This is my best work.', 'Class_A'),
    ("What an awesome view", 'Class_A'),
    ('I do not like this dish', 'Class_B')]

test_corpus = [
    ("I am not feeling well today.", 'Class_B'),
    ("I feel brilliant!", 'Class_A'),
    ('Gary is a friend of mine.', 'Class_A'),
    ("I can't believe I'm doing this.", 'Class_B'),
    ('The date was good.', 'Class_A'),
    ('I do not enjoy my job', 'Class_B')]

model = NBC(training_corpus)

print(model.classify("Their codes are amazing."))
>>> "Class_A"

print(model.classify("I don't like their computer."))
>>> "Class_B"

print(model.accuracy(test_corpus))
>>> 0.83

```

Scikit-learn also provides a pipeline framework for text classification:

```

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn import svm

# preparing data for SVM model (using the same training_corpus, test_corpus from the naive bayes example)
train_data = []
train_labels = []
for row in training_corpus:
    train_data.append(row[0])
    train_labels.append(row[1])

test_data = []
test_labels = []
for row in test_corpus:
    test_data.append(row[0])
    test_labels.append(row[1])

# Create feature vectors
vectorizer = TfidfVectorizer(min_df=4, max_df=0.9)

# Train the feature vectors
train_vectors = vectorizer.fit_transform(train_data)

# Apply model on test data
test_vectors = vectorizer.transform(test_data)

# Perform classification with SVM, kernel=linear
model = svm.SVC(kernel='linear')
model.fit(train_vectors, train_labels)
prediction = model.predict(test_vectors)
>>> ['Class_A' 'Class_A' 'Class_B' 'Class_B' 'Class_A' 'Class_A']

print(classification_report(test_labels, prediction))

```

Text classification models are heavily dependent upon the quality and quantity of
features. While applying any machine learning model, it is always good practice to include
more and more training data. Here are some tips that I wrote about improving text
classification accuracy in one of my previous articles.

4.2 Text Matching / Similarity
One of the important areas of NLP is the matching of text objects to find similarities.
Important applications of text matching include automatic spelling correction, data
de-duplication and genome analysis.
A number of text matching techniques are available depending upon the requirement. This
section describes the important techniques in detail.
A. Levenshtein Distance – The Levenshtein distance between two strings is defined as
the minimum number of edits needed to transform one string into the other, with the
allowable edit operations being insertion, deletion, or substitution of a single character.
Following is a memory-efficient implementation.

```

def levenshtein(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for index2, char2 in enumerate(s2):
        newDistances = [index2 + 1]
        for index1, char1 in enumerate(s1):
            if char1 == char2:
                newDistances.append(distances[index1])
            else:
                newDistances.append(1 + min((distances[index1], distances[index1+1], newDistances[-1])))
        distances = newDistances
    return distances[-1]

print(levenshtein("analyze", "analyse"))
>>> 1

```

B. Phonetic Matching – A phonetic matching algorithm takes a keyword as input (person’s
name, location name etc) and produces a character string that identifies a set of words that
are (roughly) phonetically similar. It is very useful for searching large text corpora,
correcting spelling errors and matching relevant names. Soundex and Metaphone are the two
main phonetic algorithms used for this purpose. Python’s Fuzzy module can be used to compute
soundex strings for different words, for example –

```

import fuzzy

soundex = fuzzy.Soundex(4)

print(soundex('ankit'))
>>> "A523"

print(soundex('aunkit'))
>>> "A523"

```

C. Flexible String Matching – A complete text matching system includes different
algorithms pipelined together to compute a variety of text variations. Regular expressions are
really helpful for this purpose as well. Other common techniques include exact string
matching, lemmatized matching, and compact matching (which takes care of spaces,
punctuation, slang etc), as sketched below.
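Here is a minimal sketch of such a compact-matching step using regular expressions (the compact helper is an illustrative name, not a library function):

```

import re

def compact(text):
    # lowercase, turn punctuation into spaces, and collapse whitespace before comparing
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

# two surface forms that differ only in spacing/punctuation match after compaction
print(compact("Analytics-Vidhya, NLP!") == compact("analytics vidhya nlp"))
# True

```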
D. Cosine Similarity – When the text is represented in vector notation, a general cosine
similarity can be applied in order to measure vectorized similarity. The following code
converts texts to vectors (using term frequency) and applies cosine similarity to provide the
closeness between two texts.

```

import math
from collections import Counter

def get_cosine(vec1, vec2):
    common = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in common])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    words = text.split()
    return Counter(words)

text1 = 'This is an article on analytics vidhya'
text2 = 'article on analytics vidhya is about natural language processing'

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
cosine = get_cosine(vector1, vector2)
>>> 0.62

```

4.3 Coreference Resolution
Coreference Resolution is a process of finding relational links among the words (or phrases)
within the sentences. Consider the example sentence: “Donald went to John’s office to see
the new table. He looked at it for an hour.”
Humans can quickly figure out that “he” denotes Donald (and not John), and that “it”
denotes the table (and not John’s office). Coreference Resolution is the component of
NLP that does this job automatically. It is used in document summarization, question
answering, and information extraction. Stanford CoreNLP provides a python wrapper for
commercial purposes.

4.4 Other NLP problems / tasks

- Text Summarization – Given a text article or paragraph, summarize it automatically
  to produce the most important and relevant sentences in order.
- Machine Translation – Automatically translate text from one human language to
  another, taking care of grammar, semantics, information about the real world, etc.
- Natural Language Generation and Understanding – Converting information from
  computer databases or semantic intents into readable human language is called
  language generation. Converting chunks of text into more logical structures that are
  easier for computer programs to manipulate is called language understanding.
- Optical Character Recognition – Given an image representing printed text,
  determine the corresponding text.
- Document to Information – This involves parsing the textual data present in
  documents (websites, files, pdfs and images) into an analyzable and clean format.

5. Important Libraries for NLP (python)

- Scikit-learn: Machine learning in Python.
- Natural Language Toolkit (NLTK): The complete toolkit for all NLP techniques.
- Pattern – A web mining module for Python with tools for NLP and machine learning.
- TextBlob – Easy to use NLP tools and API, built on top of NLTK and Pattern.
- spaCy – Industrial strength NLP with Python and Cython.
- Gensim – Topic Modelling for Humans.
- Stanford CoreNLP – NLP services and packages by the Stanford NLP Group.


