pyspellchecker Documentation
Release 0.4.0

Tyler Barrus

Jun 10, 2019

Contents

1 Installation
2 Quickstart
3 Additional Methods
3.1 The following are less likely to be needed by the user but are available:
4 Credits
5 Table of contents
5.1 Quickstart
5.2 pyspellchecker API
6 Additional Information
Index

pyspellchecker Documentation, Release 0.4.0

Pure Python Spell Checking based on Peter Norvig’s blog post on setting up a simple spell checking algorithm.
It uses a Levenshtein Distance algorithm to find permutations within an edit distance of 2 from the original word. It
then compares all permutations (insertions, deletions, replacements, and transpositions) to known words in a word
frequency list. Those words that are found more often in the frequency list are more likely to be the correct results.
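The approach described above can be sketched in a few lines of plain Python. This is a minimal illustration in the spirit of Peter Norvig's post, not pyspellchecker's actual implementation; the function names (`edits1`, `best_guess`), the lowercase English alphabet, and the tiny word-frequency table are all assumptions made for the example, and only an edit distance of 1 is covered.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed alphabet for this sketch


def edits1(word):
    """Return every string one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:]
                for left, right in splits if right for ch in ALPHABET]
    inserts = [left + ch + right for left, right in splits for ch in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def best_guess(word, frequencies):
    """Pick the most frequent known word within one edit of `word`."""
    if word in frequencies:
        return word
    known = [w for w in edits1(word) if w in frequencies]
    # Fall back to the input when nothing in the table is close enough.
    return max(known, key=frequencies.get) if known else word


# Toy frequency table (made-up counts, for illustration only).
freq = Counter({"happening": 50, "morning": 120, "here": 300})
print(best_guess("mornng", freq))
```

Ranking by raw count is the simplest possible scoring rule; it is what makes a frequent word win over a rare one when both are equally close to the input.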
pyspellchecker supports multiple languages including English, Spanish, German, French, and Portuguese. Dictionaries were generated using the WordFrequency project on GitHub.
pyspellchecker supports Python 3 and Python 2.7 but, as always, Python 3 is the preferred version!
pyspellchecker lets you set the Levenshtein Distance used when checking. For longer words, it is highly
recommended to use a distance of 1 rather than the default of 2. See the quickstart to find out how to change the distance
parameter.
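To see why the distance matters for long words, it helps to count how many strings a single edit step can generate. The sketch below assumes a 26-letter alphabet and counts raw edits before de-duplication; the helper name `raw_edit_count` is hypothetical, and the numbers are upper bounds rather than measurements of pyspellchecker itself.

```python
def raw_edit_count(length, alphabet_size=26):
    """Upper bound on strings one edit away from a word of `length` letters."""
    deletes = length
    transposes = max(length - 1, 0)
    replaces = alphabet_size * length
    inserts = alphabet_size * (length + 1)
    return deletes + transposes + replaces + inserts


for n in (5, 10, 15):
    one = raw_edit_count(n)
    # Every distance-1 string has at most n + 1 letters, so editing each of
    # them once more gives this loose upper bound for distance 2.
    two = one * raw_edit_count(n + 1)
    print(n, one, two)
```

The distance-1 count grows linearly with word length, while the distance-2 bound grows roughly with its square, which is where the cost for long words comes from.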


1 Installation

The easiest method to install is using pip:
pip install pyspellchecker

To install from source:
git clone https://github.com/barrust/pyspellchecker.git
cd pyspellchecker
python setup.py install

As always, I highly recommend using the Pipenv package to help manage dependencies!


2 Quickstart

After installation, using pyspellchecker should be fairly straightforward:
from spellchecker import SpellChecker
spell = SpellChecker()
# find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])
for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))
    # Get a list of `likely` options
    print(spell.candidates(word))

If the Word Frequency list is not to your liking, you can add additional text to generate a more appropriate list for your
use case.
from spellchecker import SpellChecker
spell = SpellChecker() # loads default word frequency list
spell.word_frequency.load_text_file('./my_free_text_doc.txt')
# if I just want to make sure some words are not flagged as misspelled
spell.word_frequency.load_words(['microsoft', 'apple', 'google'])
spell.known(['microsoft', 'google']) # will return both now!

If the words that you wish to check are long, it is recommended to reduce the distance to 1. This can be accomplished
either when initializing the spell check class or after the fact.
from spellchecker import SpellChecker
spell = SpellChecker(distance=1)  # set at initialization

# do some work on longer words

spell.distance = 2  # set the distance parameter back to the default

3 Additional Methods

Online documentation is available; below is a condensed summary of some of the available functions:
correction(word): Returns the most probable result for the misspelled word
candidates(word): Returns a set of possible candidates for the misspelled word
known([words]): Returns those words that are in the word frequency list
unknown([words]): Returns those words that are not in the frequency list
word_probability(word): The frequency of the given word out of all words in the frequency list
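As a rough mental model of `known` and `unknown`, both can be thought of as filters against the word frequency list. The sketch below uses a plain `collections.Counter` as a stand-in for that list; the helper names and the counts are invented for illustration and are not the library's code.

```python
from collections import Counter

# Stand-in for the word frequency list (made-up counts).
frequencies = Counter({"something": 40, "is": 900, "here": 300})


def known_words(words, freq):
    """Words that appear in the frequency table."""
    return {w.lower() for w in words if w.lower() in freq}


def unknown_words(words, freq):
    """Words that do not appear in the frequency table."""
    return {w.lower() for w in words if w.lower() not in freq}


sample = ["something", "is", "hapenning", "here"]
print(known_words(sample, frequencies))
print(unknown_words(sample, frequencies))
```

Both helpers return sets, so duplicates in the input collapse and the order of the results is not guaranteed.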

3.1 The following are less likely to be needed by the user but are available:
edit_distance_1(word): Returns a set of all strings at a Levenshtein Distance of one based on the alphabet of the selected language
edit_distance_2(word): Returns a set of all strings at a Levenshtein Distance of two based on the alphabet of the selected language
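One way to picture the relationship between the two methods is that the distance-two set comes from applying the distance-one step twice. The sketch below follows that idea with hypothetical helpers (`one_edit`, `two_edits`) and a deliberately small, made-up alphabet; it is an illustration, not the library's implementation.

```python
def one_edit(word, letters):
    """All strings one edit from `word`, using only characters in `letters`."""
    result = set()
    for i in range(len(word) + 1):
        left, right = word[:i], word[i:]
        if right:
            result.add(left + right[1:])                           # deletion
            result.update(left + c + right[1:] for c in letters)   # replacement
        if len(right) > 1:
            result.add(left + right[1] + right[0] + right[2:])     # transposition
        result.update(left + c + right for c in letters)           # insertion
    return result


def two_edits(word, letters):
    """All strings reachable by applying `one_edit` twice."""
    return {e2 for e1 in one_edit(word, letters) for e2 in one_edit(e1, letters)}


letters = "aeghinp"  # small, made-up alphabet to keep the sets manageable
near = one_edit("hapenning", letters)
far = two_edits("hapenning", letters)
print("happening" in near, "happening" in far)
print(len(near), len(far))
```

Even with seven letters the second set is far larger than the first, which is the practical reason to prefer a distance of 1 when it is sufficient.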


4 Credits

• Peter Norvig's blog post on setting up a simple spell checking algorithm
• hermitdave's WordFrequency project for providing the basis for non-English dictionaries


5 Table of contents

5.1 Quickstart
pyspellchecker is designed to be easy to use to get basic spell checking.

5.1.1 Installation
The best experience is likely to use pip:
pip install pyspellchecker

If you are using virtual environments, it is recommended to use pipenv to combine pip and virtual environments:
pipenv install pyspellchecker

Read more about Pipenv

5.1.2 Basic Usage
Setting up the spell checker requires importing and initializing the instance.
from spellchecker import SpellChecker
spell = SpellChecker()

There are several methods to determine if a word is in the word frequency list:
from spellchecker import SpellChecker
spell = SpellChecker()
spell['morning'] # True
'morning' in spell  # True

# find those words from a list of words that are found in the dictionary
spell.known(['morning', 'hapenning']) # {'morning'}
# find those words from a list of words that are not found in the dictionary
spell.unknown(['morning', 'hapenning']) # {'hapenning'}

Once a word is identified as misspelled, you can find the likeliest replacement:
from spellchecker import SpellChecker
spell = SpellChecker()
misspelled = spell.unknown(['morning', 'hapenning'])  # {'hapenning'}
for word in misspelled:
    spell.correction(word)  # 'happening'

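For longer words, the Levenshtein Distance can be reduced to 1 at initialization and changed back afterwards:
from spellchecker import SpellChecker
spell = SpellChecker(distance=1)  # set the Levenshtein Distance parameter

# do additional work

# now for shorter words, we can revert to Levenshtein Distance of 2!
spell.distance = 2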

Or if the word identified as the likeliest is not correct, a list of candidates can also be pulled:
from spellchecker import SpellChecker
spell = SpellChecker()
misspelled = spell.unknown(['morning', 'hapenning']) # {'hapenning'}
for word in misspelled:
    spell.candidates(word)  # {'penning', 'happening', 'henning'}

5.1.3 Changing Language
To set the language of the dictionary to load, one must set the language parameter on initialization.
from spellchecker import SpellChecker
spell = SpellChecker(language='es')  # Spanish dictionary
print(spell['mañana'])

5.1.4 Adding and Removing Terms from a Dictionary
There are several ways to add additional terms to your word frequency dictionary including by filepath, string of text,
or by a list of words.
To load a pre-defined dictionary file (either as a json file or a gzipped json file):

from spellchecker import SpellChecker
spell = SpellChecker()
spell.word_frequency.load_dictionary('./path-to-my-word-frequency.json')

To load a text document that will be parsed into individual words and each word added to the frequency list:
from spellchecker import SpellChecker
spell = SpellChecker()
spell.word_frequency.load_text_file('./path-to-my-text-doc.txt')

To load plain text from input or another source:
from spellchecker import SpellChecker
spell = SpellChecker()
spell.word_frequency.load_text('Text to be parsed and added to the system')

Or update using a list of words:
from spellchecker import SpellChecker
spell = SpellChecker()
spell.word_frequency.load_words(['Text', 'to', 'be','added', 'to', 'the', 'system'])

Or add a single word:
from spellchecker import SpellChecker
spell = SpellChecker()
spell.word_frequency.add('Text')
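
Whichever loading method is used, the effect can be thought of as the same kind of update: words are counted and merged into the existing frequency list. A minimal sketch of that idea with the standard library follows; the `\w+` tokenizer and the lowercasing are assumptions for the example and may differ from what pyspellchecker does internally.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (assumed rule for this sketch)."""
    return re.findall(r"\w+", text.lower())


counts = Counter()
counts.update(tokenize("Text to be parsed and added to the system"))  # free text
counts.update(w.lower() for w in ["Text", "to", "be", "added"])       # word list
counts["text"] += 1                                                   # single word

print(counts.most_common(3))
```

Because the counts accumulate, loading the same text twice makes its words look twice as common, which in turn shifts which corrections are preferred.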

Removing words is as simple as adding words:
from spellchecker import SpellChecker
spell = SpellChecker()
spell.word_frequency.remove_words(['Text', 'to', 'be', 'removed', 'from', 'the', 'system'])
# or remove a single word
spell.word_frequency.remove('meh')

5.1.5 How to Build a New Dictionary
Building a custom or new language dictionary is relatively straightforward. To begin, you will need either a
word frequency list or text files that represent the usage of the terms. Since pyspellchecker ranks candidates by word
frequency, the corpus should give the most common words the highest frequencies!
Once you have the corpus, code similar to the following should build out the dictionary:
from spellchecker import SpellChecker
spell = SpellChecker(language=None)  # turn off loading a built language dictionary

# if you have a dictionary...
spell.word_frequency.load_dictionary('./path-to-my-json-dictionary.json')
# or... if you have text
spell.word_frequency.load_text_file('./path-to-my-text-doc.txt')
# export it out for later use!
spell.export('my_custom_dictionary.gz', gzipped=True)
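
Since a dictionary file is described as a JSON word frequency file, optionally gzipped, one can also be produced or inspected with the standard library alone. The following is a sketch under that assumption; the helper functions are hypothetical, and the exact layout pyspellchecker expects should be checked against a file produced by `export`.

```python
import gzip
import json
import os
import tempfile
from collections import Counter


def write_frequencies(path, frequencies, encoding="utf-8"):
    """Write a word -> count mapping as gzipped JSON."""
    with gzip.open(path, "wt", encoding=encoding) as handle:
        json.dump(dict(frequencies), handle)


def read_frequencies(path, encoding="utf-8"):
    """Read a gzipped JSON word -> count mapping back into a Counter."""
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return Counter(json.load(handle))


corpus = "the quick brown fox jumps over the lazy dog the end"
original = Counter(corpus.split())

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "my_custom_dictionary.gz")
    write_frequencies(target, original)
    restored = read_frequencies(target)

print(restored["the"], restored == original)
```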

5.1.6 A quick, command line spell checking program
Setting up a quick and easy command line program using pyspellchecker is straightforward:
from spellchecker import SpellChecker
# could add command line arguments to set the parameters of the spell
# check class; setup what type of information to present back, etc.
spell = SpellChecker()
print("To exit, hit return without input!")
while True:
    word = input('Input a word to spell check: ')
    if word == '':  # not sure, but need a way to kill the program...
        break
    word = word.lower()
    if word in spell:
        print("'{}' is spelled correctly!".format(word))
    else:
        cor = spell.correction(word)
        print("The best spelling for '{}' is '{}'".format(word, cor))
        print("If that is not enough; here are all possible candidate words:")
        print(spell.candidates(word))
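
The comment in the program above suggests taking the spell checker's parameters from the command line. One way to sketch that with `argparse` is shown below; the flag names are hypothetical, while the `language` and `distance` values mirror the `SpellChecker` parameters documented in the API section.

```python
import argparse


def build_parser():
    """Command line options mapped onto SpellChecker's documented parameters."""
    parser = argparse.ArgumentParser(description="Interactive spell checker")
    parser.add_argument("--language", default="en",
                        choices=["en", "es", "de", "fr", "pt"],
                        help="dictionary language to load")
    parser.add_argument("--distance", type=int, default=2, choices=[1, 2],
                        help="maximum edit distance to search")
    return parser


args = build_parser().parse_args(["--language", "es", "--distance", "1"])
# These values could then be passed on as
# SpellChecker(language=args.language, distance=args.distance).
print(args.language, args.distance)
```

Restricting the choices in the parser means an unsupported language or distance is rejected before the dictionary is loaded.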

5.2 pyspellchecker API
Here you can find the full developer API for the pyspellchecker project. pyspellchecker provides a library for determining if a word is misspelled and what the likely correct spelling would be based on word frequency.

5.2.1 SpellChecker
class spellchecker.SpellChecker(language=u'en', local_dictionary=None, distance=2, tokenizer=None)
The SpellChecker class encapsulates the basics needed to accomplish a simple spell checking algorithm. It is
based on the work by Peter Norvig (https://norvig.com/spell-correct.html)
Parameters
• language (str) – The language of the dictionary to load or None for no dictionary.
Supported languages are en, es, de, fr and pt. Defaults to en
• local_dictionary (str) – The path to a locally stored word frequency dictionary; if
provided, no language will be loaded
• distance (int) – The edit distance to use. Defaults to 2
candidates(word)
Generate possible spelling corrections for the provided word up to an edit distance of two, but only
when needed
Parameters word (str) – The word for which to calculate candidate spellings
Returns The set of words that are possible candidates
Return type set
correction(word)
The most probable correct spelling for the word
Parameters word (str) – The word to correct
Returns The most likely candidate
Return type str
distance
The maximum edit distance to calculate
Note: Valid values are 1 or 2; if an invalid value is passed, defaults to 2

Type int
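The fallback described in the note can be pictured as a property setter that validates its input. The class below is a hypothetical stand-in written for illustration, not the library's source.

```python
class DistanceHolder:
    """Hypothetical stand-in showing a validated `distance` attribute."""

    def __init__(self, distance=2):
        self._distance = 2
        self.distance = distance  # route through the setter below

    @property
    def distance(self):
        return self._distance

    @distance.setter
    def distance(self, value):
        # Only 1 and 2 are accepted; anything else falls back to 2.
        self._distance = value if value in (1, 2) else 2


holder = DistanceHolder(distance=1)
print(holder.distance)
holder.distance = 7
print(holder.distance)
```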
edit_distance_1(word)
Compute all strings that are one edit away from word using only the letters in the corpus
Parameters word (str) – The word for which to calculate the edit distance
Returns The set of strings that are edit distance one from the provided word
Return type set
edit_distance_2(word)
Compute all strings that are two edits away from word using only the letters in the corpus
Parameters word (str) – The word for which to calculate the edit distance
Returns The set of strings that are edit distance two from the provided word
Return type set
export(filepath, encoding=u'utf-8', gzipped=True)
Export the word frequency list for import in the future
Parameters
• filepath (str) – The filepath to the exported dictionary
• encoding (str) – The encoding of the resulting output
• gzipped (bool) – Whether to gzip the dictionary or not
known(words)
The subset of words that appear in the dictionary of words
Parameters words (list) – List of words to determine which are in the corpus
Returns The set of those words from the input that are in the corpus
Return type set

split_words(text)
Split text into individual words using either a simple whitespace regex or the passed in tokenizer
Parameters text (str) – The text to split into individual words
Returns A listing of all words in the provided text
Return type list(str)
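A tokenizer in this sense is just a callable that takes a string and returns its words. As a sketch, the function below keeps apostrophes inside words, which a plain `\w+` pattern would split apart; the function name and the regular expression are assumptions chosen for the example, and passing it as `SpellChecker(tokenizer=...)` relies on the `tokenizer` parameter shown in the class signature.

```python
import re

WORD_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)*")


def apostrophe_tokenizer(text):
    """Return lowercase words, keeping internal apostrophes (e.g. don't)."""
    return WORD_PATTERN.findall(text.lower())


print(apostrophe_tokenizer("Don't panic: it's only a typo!"))
```

This pattern only recognizes the letters a to z, so it would need extending before being used with dictionaries for languages that have accented characters.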
unknown(words)
The subset of words that do not appear in the dictionary
Parameters words (list) – List of words to determine which are not in the corpus
Returns The set of those words from the input that are not in the corpus
Return type set
word_frequency
An encapsulation of the word frequency dictionary
Note: Not settable

Type WordFrequency
word_probability(word, total_words=None)
Calculate the probability of the word being the desired, correct word
Parameters
• word (str) – The word for which the word probability is calculated
• total_words (int) – The total number of words to use in the calculation; use the
default for using the whole word frequency
Returns The probability that the word is the correct word
Return type float
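Read as a formula, this is the word's count divided by a total. A sketch of that calculation follows, using a made-up frequency table and a hypothetical function name; it illustrates the idea rather than reproducing the library's code.

```python
from collections import Counter

frequencies = Counter({"the": 6, "morning": 3, "happening": 1})


def probability(word, freq, total_words=None):
    """Relative frequency of `word`; `total_words` overrides the denominator."""
    if total_words is None:
        total_words = sum(freq.values())
    return freq[word] / total_words


print(probability("morning", frequencies))
print(probability("morning", frequencies, total_words=30))
```

A word that is missing from the table gets a probability of zero here, because a `Counter` reports a count of 0 for unseen keys.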

5.2.2 WordFrequency
class spellchecker.WordFrequency(tokenizer=None)
Store the dictionary as a word frequency list while allowing for different methods to load the data and update
over time
add(word)
Add a word to the word frequency list
Parameters word (str) – The word to add
dictionary
A counting dictionary of all words in the corpus and the number of times each has been seen
Note: Not settable

Type Counter

items()
Iterator over the words in the dictionary
Yields str – The next word in the dictionary; int – The number of instances in the dictionary
Note: This is the same as dict.items()
keys()
Iterator over the keys of the dictionary
Yields str – The next key in the dictionary
Note: This is the same as spellchecker.words()
letters
The listing of all letters found within the corpus
Note: Not settable

Type str
load_dictionary(filename, encoding=u'utf-8')
Load in a pre-built word frequency list
Parameters
• filename (str) – The filepath to the json (optionally gzipped) file to be loaded
• encoding (str) – The encoding of the dictionary
load_text(text, tokenizer=None)
Load text from which to generate a word frequency list
Parameters
• text (str) – The text to be loaded
• tokenizer (function) – The function to use to tokenize a string
load_text_file(filename, encoding=u'utf-8', tokenizer=None)
Load in a text file from which to generate a word frequency list
Parameters
• filename (str) – The filepath to the text file to be loaded
• encoding (str) – The encoding of the text file
• tokenizer (function) – The function to use to tokenize a string
load_words(words)
Load a list of words from which to generate a word frequency list
Parameters words (list) – The list of words to be loaded
pop(key, default=None)
Remove the key and return the associated value or default if not found
Parameters
• key (str) – The key to remove
• default (obj) – The value to return if key is not present
remove(word)
Remove a word from the word frequency list
Parameters word (str) – The word to remove
remove_by_threshold(threshold=5)
Remove all words at, or below, the provided threshold
Parameters threshold (int) – The threshold at which a word is to be removed
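In terms of the underlying counts, this amounts to dropping every entry whose count is at or below the threshold. A sketch using a plain `Counter` with invented counts follows; the helper name is hypothetical.

```python
from collections import Counter


def drop_rare(freq, threshold=5):
    """Remove, in place, every word whose count is <= threshold."""
    for word in [w for w, count in freq.items() if count <= threshold]:
        del freq[word]


frequencies = Counter({"the": 900, "morning": 12, "hapenning": 1, "teh": 5})
drop_rare(frequencies, threshold=5)
print(sorted(frequencies))
```

Pruning rare entries like this is one way to keep misspellings that crept into a corpus from being treated as known words.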
remove_words(words)
Remove a list of words from the word frequency list
Parameters words (list) – The list of words to remove
tokenize(text)
Tokenize the provided string object into individual words
Parameters text (str) – The string object to tokenize
Yields str – The next word in the tokenized string
Note: This is the same as the spellchecker.split_words()
total_words
The sum of all word occurrences in the word frequency dictionary
Note: Not settable

Type int
unique_words
The total number of unique words in the word frequency list
Note: Not settable

Type int
words()
Iterator over the words in the dictionary
Yields str – The next word in the dictionary
Note: This is the same as spellchecker.keys()


6 Additional Information

• genindex
• modindex
• search


Index

A
add() (spellchecker.WordFrequency method)

C
candidates() (spellchecker.SpellChecker method)
correction() (spellchecker.SpellChecker method)

D
dictionary (spellchecker.WordFrequency attribute)
distance (spellchecker.SpellChecker attribute)

E
edit_distance_1() (spellchecker.SpellChecker method)
edit_distance_2() (spellchecker.SpellChecker method)
export() (spellchecker.SpellChecker method)

I
items() (spellchecker.WordFrequency method)

K
keys() (spellchecker.WordFrequency method)
known() (spellchecker.SpellChecker method)

L
letters (spellchecker.WordFrequency attribute)
load_dictionary() (spellchecker.WordFrequency method)
load_text() (spellchecker.WordFrequency method)
load_text_file() (spellchecker.WordFrequency method)
load_words() (spellchecker.WordFrequency method)

P
pop() (spellchecker.WordFrequency method)

R
remove() (spellchecker.WordFrequency method)
remove_by_threshold() (spellchecker.WordFrequency method)
remove_words() (spellchecker.WordFrequency method)

S
SpellChecker (class in spellchecker)
split_words() (spellchecker.SpellChecker method)

T
tokenize() (spellchecker.WordFrequency method)
total_words (spellchecker.WordFrequency attribute)

U
unique_words (spellchecker.WordFrequency attribute)
unknown() (spellchecker.SpellChecker method)

W
word_frequency (spellchecker.SpellChecker attribute)
word_probability() (spellchecker.SpellChecker method)
WordFrequency (class in spellchecker)
words() (spellchecker.WordFrequency method)

