
派業歷
PyElly User's Manual
For Release v1.4.25
22 February 2018
Clinton P. Mah
Walnut Creek, CA 94595
Table of Contents
1. Introduction
2. The Syntax of a Language
3. The Semantics of a Language
4. Defining Tables of PyElly Rules
4.1 Grammar (A.g.elly)
4.1.1 Syntactic Rules
4.1.2 Grammar-Defined Words
4.1.3 Generative Semantic Subprocedures
4.1.4 Global Variable Initializations
4.2 Special Patterns For Text Elements (A.p.elly)
4.3 Macro Substitutions (A.m.elly)
5. Operations for PyElly Generative Semantics
5.1 Insertion of Strings
5.2 Subroutine Linkage
5.3 Buffer Management
5.4 Local Variable Operations
5.5 Set-Theoretic Operations with Local Variables
5.6 Global Variable Operations
5.7 Control Structures
5.8 Character Manipulation
5.9 Insert String From a Table Lookup
5.10 Buffer Searching
5.11 Execution Monitoring
5.12 Capitalization
5.13 Semantic Subprocedure Invocation
6. Simple PyElly Rewriting Examples
6.1 Default Semantic Procedures
6.2 A Simple Grammar with Semantics
7. Running PyElly From a Command Line
8. Advanced Capabilities: Grammar
8.1 Syntactic Features
8.2 The ... Syntactic Type
9. Advanced Capabilities: Vocabulary
9.1 More on the UNKN Syntactic Type
9.2 Breaking Down Unknown Words
9.2.1 Inflectional Stemming
9.2.2 Morphology
9.3 Entity Extraction
9.3.1 Numbers
9.3.2 Dates and Times
9.3.3 Names of Persons (A.n.elly)
9.3.4 Defining Your Own Entity Extractors
9.4 PyElly Vocabulary Tables (A.v.elly)
10. Logic for PyElly Cognitive Semantics
10.1 The Form of Cognitive Semantic Clauses
10.2 Kinds of Cognitive Semantic Conditions
10.2.1 Fixed Scoring
10.2.2 Starting Position, Token Count, and Character Count
10.2.3 Semantic Features
10.2.4 Semantic Concepts
10.2.4.2 Language Definition Files for Semantic Concepts
10.3 Adding Cognitive Semantics to Other PyElly Tables
10.3.1 Cognitive Semantics for Vocabulary Tables
10.3.2 Cognitive Semantics for Pattern Rules
10.3.3 Cognitive Semantics for Entity Extraction
11. Sentences and Punctuation
11.1 Basic PyElly Punctuation in Grammars
11.2 Extending Stop Punctuation Recognition
11.2.1 Stop Punctuation Exceptions (A.sx.elly)
11.2.2 Bracketing Punctuation
11.2.3 Exotic Punctuation
11.3 How Punctuation Affects Parsing
12. PyElly Tokenization and Parsing
12.1 A Bottom-Up Framework
12.2 Token Extraction and Lookup
12.3 Building a Parse Tree
12.3.1 Context-Free Analysis Main Loop
12.3.2 Special PyElly Modifications
12.3.3 Type 0 Grammar Extensions
12.4 Success and Failure in Parsing
12.5 Parse Data Dumps and Tree Diagrams
12.6 Parsing Resource Limits
13. Developing Language Rules and Troubleshooting
13.1 Pre-Checks on Language Rule Files
13.2 A General Application Development Approach
13.3 Miscellaneous Tips
14. PyElly Applications
14.1 Current Example Applications
14.2 Building Your Own Applications
15. Going Forward
15.1 What PyElly Tries To Do
15.2 Practical Experience in PyElly Usage
15.3 Where We Stand For Now
Appendix A. Python Implementation
Appendix B. Historical Background
Appendix C. Berkeley Database and SQLite
Appendix D. PyElly System Testing
Appendix E. PyElly as a Framework for Learning
Appendix F. A Shallow XML Markup Application
Appendix G. Unicode Issues
1. Introduction
PyElly is an open-source software toolkit for creating computer scripts to analyze and
rewrite English and other natural language text. This processing will of course fall far
short of the talking robot fantasies of Hollywood, but with only modest effort, you can
still quickly put together many nontrivial linguistic applications short of full
understanding. In practical usage, PyElly as a preprocessor can help to clean up the
pesky low-level details of language that often burden text data mining.
PyElly is designed to let you get some firsthand experience with the nuts and bolts of
computational linguistics. You can quickly write scripts to do tasks like conjugating
French verbs, rephrasing information requests into a formal query language,
compressing messages for texting, extracting names and other entities from a text
stream, or even re-creating the storied Doctor simulation of Rogerian psychoanalysis.
We have been building such natural language applications since computers were only a
millionth as powerful as they are today. The overall problem of processing natural
language remains quite challenging, however, and most toolkits to support NLP often
require heavy lifting to develop the logic and interpretive framework to accomplish even
something rather basic. PyElly aims to expedite such busy work through ready-made
tools and resources, all integrated within a single free open-source package.
Why do we need yet another natural language processing toolkit? To begin with, a
complete natural language solution is still far off, and so we can gain from a diversity of
reliable methods to address the problems of text data. Also, though PyElly is all new
code, it is really a legacy system, with some major components dating back more than
40 years. This sounds quite ancient, but language changes slowly, and mature linguistic
software resources can be of service even with today’s text data.
The impetus for PyElly and its predecessors came from observing that many different
natural language systems face similar issues. For example, information retrieval and
machine learning with text data can both work better when we can reduce the words in
the text to their roots. So, instead of contending with variants like RELATION,
RELATIONAL, RELATIVELY, and RELATING, a system could keep track only of
RELATE. This is of course the familiar stemming problem, but available free resources
to match up such word variants are often disappointing.
A stemmer is of course not hard to build, but it takes time and commitment to do a good
job, and no one really wants to start from scratch for every new project. That is true of
many other low-level language processing capabilities as well. So, it seems prudent to
pull together at least some kind of reusable software library here, but even better would
be to integrate all those tools and resources more closely. PyElly does that.
The current implementation of PyElly is intended primarily for educational use and so
was written entirely in Python, currently a favored first programming language in high
schools. This should allow students to adapt and incorporate PyElly code into class
projects that have to be completed fairly quickly. PyElly can be of broader interest,
though, because of its range of natural language support, including stemming,
tokenizing, entity extraction, vocabulary management, sentence recognition, idiomatic
transformation, rule-driven syntactic analysis, and ambiguity handling.
The operation of PyElly revolves around classic rule-based computational linguistics. It
will require some language expertise because you need to be able to define all the details
of the language processing that you want, but many of the basics here have been prebuilt
in PyElly if you are working with English input. The standard PyElly distribution also
includes the language definition rules for eight different example applications to get you
started in constructing your own.
The current PyElly package consists of a set of Python modules in sixty-four source files.
The code should run on any computer platform with a Python 2.7 interpreter, including
Windows 8 and 10, Linux, Mac OS X and other flavors of Unix, iOS for iPhone and iPad,
and Android. The PyElly source is downloadable from GitHub under a standard BSD
license; you may freely modify and extend it as needed. Though intended mainly for
education, commercial usage of PyElly is unrestricted.
For recognizing just a few dozen sentences, PyElly is probably overkill; you could handle
them directly by writing custom code in any popular programming language. More
often, however, the possible input sentences will be too many to list out fully, and so you
must characterize them more generally through rules describing how the words you
expect to see are formed, how they combine in text, and how they are to be interpreted.
PyElly lets you manage such systematic language definition cleanly.
PyElly is a kind of translator: it reads in plain text, analyzes it, and writes out a
transformation according to the rules that you supply. So an English sentence like “She
goes slowly” might be rewritten in French as “Elle va lentement” or in Chinese as 她⾛慢
慢地. Or you might reduce the original sentence to just “slow” by stripping out suffixes
and words of low content. Or you may want to rephrase the sentence as a question like
“Does she go slowly?” All such rewriting is readily doable in PyElly.
PyElly rules encode linguistic knowledge of many types. The main ones will define a
grammar and vocabulary for the sentences of an input language plus associated
semantic procedures for rewriting those sentences to produce target output. Creating
such rules requires some trial and error, but usually should be no more difficult than
setting up macros in a word processor. PyElly will get you started properly here and
then will provide debugging aids to help track down problems that may arise.
Many natural language systems, especially those in academic research, aim at the most
thorny problems in language interpretation. These are opportunities for impressive
processing gymnastics and often lead to dense theoretical papers without necessarily
producing anything for everyday use. PyElly tries instead to be simple and pragmatic. In
response to well-known tough sentences like “Time flies like an arrow,” it is quite all
right for a system to respond just with "Huh?”
PyElly is compact enough to run on mobile devices with no cloud connection. Excluding
the Python environment, compiled PyElly code along with encoded rules and other data
for an application should typically require less than 500 Kbytes of storage, depending on
the number of rules actually defined. A major project may involve hundreds of grammar
rules and tens of thousands of vocabulary elements, but some useful text analyses
require just a few dozen rules and little explicit domain vocabulary.
What is a grammar, and what is a vocabulary? A vocabulary establishes the range of
words you want to recognize; a grammar defines how those words can be arranged into
sentences of interest. You may also specify idiomatic rephrasing of particular input word
sequences prior to analysis as well as define patterns for making sense of various entities
not listed in your vocabulary. For example, you can recognize 800 telephone numbers or
Russian surnames ending in -OV or -OVA without having to specify them one by one.
This manual will explain how to do all of this and will also introduce some basics of
language and language processing that every PyElly user should know. To take
advantage of PyElly, you should be able to create and edit text files for the platform of
your choice and set up file directories there so that PyElly can find your rules. PyElly
currently has to run from a command line. For advanced work, you should also be
comfortable with Python coding. In an ideal world, an interactive development
environment (IDE) could make everything easier here, but that is yet to be.
Currently, PyElly is biased toward English input, although it can read and process
Unicode characters from the Latin-1 Supplement, Latin Extender A, and parts of Latin
Extender B plus some extra punctuation and spaces. These include most of the letters
with diacritical marks used in European languages written with Latin alphabets. For
example, PyElly knows that é is a vowel, that ß is a letter, that Œ is the uppercase form
of œ, and that ² is a digit. This can be helpful even for nominally English data, since we
often encounter terms with foreign spellings like NÉE or RÉSUMÉ or ÆGIS.
As any beginning student of a new language soon learns, its rules are inevitably messy.
Irregularities will trip up someone or something trying to speak or write mechanically
from a simple grammar. PyElly users will face the same challenge, but by working
generally at first and then adding special rules to deal with exceptions as they turn
up, we can evolve our language processing to reach some useful level of parlance. There
is no royal road to natural language processing, but persistence can lead to progress over
time. PyElly will help you keep on track in making a sustained effort in rule definition.
You need not be an experienced linguist or a computer programmer to develop PyElly
applications; and I have tried to write this manual to be understandable by non-experts.
The only requirements for users are basic computer literacy as expected of 21st-Century
high school graduates, linguistic knowledge as might be gained from a first course in a
foreign language, and willingness to learn. You can start out with simple kinds of PyElly
processing and move on to more complex translations as you gain experience.
In addition to this introduction, the PyElly User’s Manual consists of fourteen other
major sections plus seven appendices. Sections 2 through 7 should be read in sequence
as they provide a tutorial on how to get started in using PyElly. Sections 8 through 11 deal
with advanced features helpful for developing complex applications. Section 12 explains
PyElly parsing, Section 13 lays out some practical strategies and tips for PyElly
application development, and Section 14 describes a variety of current and possible
future PyElly applications. Section 15 wraps everything up.
PyElly (“Python Elly”) was inspired by the Eliza system created by Joseph Weizenbaum
over 50 years ago for modeling natural language conversation, but PyElly code has a
different genesis. Its implementation in Python is the latest in a series of related natural
language toolkits going back four generations: Jelly (Java, 1999), nlf (C, 1984 and 1994),
the Adaptive Query Facility (FORTRAN, 1981), and PARLEZ (PDP-11 assembly
language, 1977). The core PyElly parsing algorithm and the ideas of cognitive and
generative semantics come from Vaughan Pratt’s LINGOL (LISP, 1973). Frederick
Thompson’s REL system (1972) also influenced the design of PyElly.
The main PyElly website is at https://sites.google.com/site/pyellynaturallanguage/ .
This introduces PyElly and shows some actual translations with language definition
rules from different PyElly example applications. The latest PyElly source code,
language definition rule files for example applications, and documentation are available
from https://github.com/prohippo/pyelly.git . This is also where you can find the latest
version of this user’s manual in PDF for download.
2. The Syntax of a Language
A language is a way of stringing together words or other symbols to form sentences
that other people can make sense of in some context. In general, not all combinations of
symbols will make a meaningful sentence; for example, “Cat the when” is nonsense in
English. To define a language that you want PyElly to process, you must first identify
those combinations of symbols that do make sense and then assign suitable
interpretations to them.
If a language is small enough, such as the repertory of obscene gestures, we can simply
list all its possible “sentences” and say in expletives what each of them means. Most
nontrivial natural languages, though, have so many possible sentences that this
approach is impractical. Instead one must note that a language tends to have regular
structures; and by identifying those structures, a computational linguist can formally
characterize the language much more concisely than by listing all possible sentences.
The structural description of a language is called a grammar. It establishes what the
building blocks of a language are and how they form simple structures, which in turn
combine into more complex structures. Almost everyone has studied grammar in school,
but formal grammars go into much greater detail. They commonly are organized around
sets of syntactic rules describing all the particular kinds of language structure to be
expected in sentences. Such syntactic rules will provide a basis for both generating and
recognizing the sentences of a language with a computer.
In linguistics, a grammar expresses how one or more structures can come together to
produce a new composite structure. This is described through syntactic rules, commonly
written with an arrow notation as follows:
W->X Y Z
This rule states that a W-structure can be composed of an X-structure followed by a Y-
structure followed by a Z-structure; for example, a noun phrase in English can consist of
a number, followed by an adjective, and followed by a noun:
NOUNPHRASE->NUMBER ADJECTIVE NOUN
There is nothing mysterious here; it is like the kind of sentence diagramming once
taught in junior high school and now coming back into vogue. We could draw the
following equivalent diagram on a blackboard for the syntactic rule above.
W
/|\
/ | \
X Y Z
where a W can in turn be part of a higher-level structure and X, Y, and Z can also split out
further into various substructures. Using a tree to describe syntactic structure is fine,
but the arrow notation will be more compact and easier to enter from a keyboard,
especially as sentence syntax grows more complex.
The structure-types W, X, Y, Z, and so forth roughly correspond to the parts of speech
from our school days, but will be interpreted more generally in PyElly. In this user
manual, we shall refer to them as syntactic types or syntactic categories. The ones that
you need for a PyElly grammar will depend on your particular target application. Some
applications may require up to almost a hundred different syntactic categories, going far
beyond NOUN and VERB.
Syntactic rules for linguistic analysis can be much more complicated than W->X Y Z,
but PyElly will work with just three forms of rules:
X->token
X->Y
X->Y Z
where X, Y, and Z are syntactic types and token is a word or some other kind of
vocabulary element like a number or some arbitrary alphanumeric identifier.
These three types are enough to re-express a more complex rule like
R->A B C
This is done by dividing the more complex rule into multiple simpler rules
R->A T
T->B C
where T is a unique intermediate syntactic structure introduced solely to stand for a B
followed by a C. It is not really a part of speech.
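The same factoring extends to longer rules as well. For example, with T1 and T2 as made-up intermediate types, a four-element rule
R->A B C D
could be re-expressed with three binary rules:
R->A T1
T1->B T2
T2->C D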
Here is a set of PyElly grammar rules that might be employed to describe the structure
of the sentence “It is red.”
SENTENCE->SUBJECT PREDICATE
SUBJECT->PRONOUN
PRONOUN->it
PREDICATE->COPULA ADJECTIVE
COPULA->is
ADJECTIVE->red
This shows all three forms of PyElly restricted syntax rules. The ordering of the rules
does not matter. The names of syntactic types like SUBJECT or COPULA are arbitrary
except for a few reserved names to be explained below.
The structure of a sentence as implied by these rules can be expressed graphically as a
labeled tree diagram, where the root type is SENTENCE and where branching
corresponds to splitting into constituent substructures. By convention, the tree is always
shown upside down, and so the lowest part of a tree will show the individual words of
the sentence. For example, the sentence “It is red” would have the following diagram:
SENTENCE
|
+------------+------------+
| |
SUBJECT PREDICATE
| |
| +-------+--------+
| | |
PRONOUN COPULA ADJECTIVE
| | |
| | |
it is red
The derivation of such a diagram for a sentence from a given set of syntactic rules is
called “parsing.” The diagram itself is called a “parse tree,” and its labeled parts are
called the “phrase nodes” of the parse tree; for example, the PREDICATE phrase node in
the tree above encompasses the actual sentence fragment “is red.” Given our syntax
rules above, PyElly can automatically parse our sentence; and the resulting tree diagram
will then be the starting point for interpreting the sentence.
Our simple grammar above so far can describe only a single sentence, but we can extend
its coverage by adding rules for other kinds of structures and vocabulary. For example,
the added syntax rules
SUBJECT->DETERMINER NOUN
DETERMINER->an
NOUN->apple
will let PyElly analyze the sentence “An apple is red.” We can continue to build up our
grammar here by adding other rules; for example,
PREDICATE->VERB
VERB->falls
will also put “It falls” and “An apple falls” into the language PyElly can recognize. We
can continue in this way to encompass still more types of grammatical structure and still
more vocabulary. The new rules can be added in any order you want.
The key idea here is that a few rules can be combined in various ways to describe many
different sentences. There is still the problem of choosing the proper mix of rules to
describe a language in the most natural and efficient way, but we do fairly well by simply
adding one or two rules at a time as done above. In some complex applications, we may
eventually need hundreds of such rules, but these can still be worked out in small steps.
Technically speaking, PyElly grammar rules shown above define a “context-free
language.” Such a grammar mainly describes the relationships between adjacent
subtrees for a sentence; and it is harder to correlate the possibilities for the structures
far apart in a parse tree even though they are close together as words in an actual
sentence. For example, consider the simple example of a context-free rule with the
parallel structures SUBJECT and PREDICATE.
SENTENCE->SUBJECT PREDICATE
In languages like English, subjects and predicates have to agree with each other
according to the attributes of person and number: “We fall” versus “He falls”. When
grammatically acceptable SUBJECT and PREDICATE structures have to be formed in
correlated ways, context-free rules here force us to work harder to restrict a SENTENCE
to have only certain combinations of subjects and predicates in agreement. We might
have to write multiple explicitly correlated rules like
SENTENCE->SUBJECT1 PREDICATE1
SENTENCE->SUBJECT2 PREDICATE2
SENTENCE->SUBJECT3 PREDICATE3
where SUBJECTi would always agree properly with PREDICATEi, but this has the
disadvantage of greatly multiplying the number of rules we have to define. A good
natural language toolkit should make our job easier than this.
Even though English and other natural languages are not context-free in a theoretical
sense, we still want to treat them that way for practical reasons. By doing so, we can take
advantage of the many sophisticated techniques developed for parsing artificial
computer languages like C++ or Swift, which do tend to be context-free. This is the
approach taken in PyElly and its predecessors.
For convenience, PyElly also incorporates semantic checking of intermediate results in
parsing and allows some shortcuts to make grammars more concise (see Section 8).
These extensions can be put on top of a plain context-free parser to give it some context-
sensitive capabilities, although some kinds of sentences still cannot be handled by
PyElly. (The classic problematic context-sensitive examples are the parallel subjects and
predicates, in sentences like “He and she got cologne and perfume, respectively.”)
The syntax of natural language can get quite complicated in general; but we usually can
break this down in terms of simpler structures. The challenge of defining a PyElly
grammar is to capture enough of such simpler structures in grammar rules to support a
proper analysis of any input sentence that we are likely to see.
A PyElly sentence analysis can actually get more complicated than the examples above.
For example, the leaves of a parse tree may be more than single words. PyElly
vocabulary rules can combine multiple input words into a single parse tree leaf token or
split off parts of an input word into separate tokens. PyElly can also control when
particular PyElly syntax rules apply by subdividing syntactic categories. Finally, PyElly
has to handle the unavoidable ambiguity in analyses when working with a natural
language. This user manual will go into detail on these matters in later subsections, but
for simple applications, the basic framework described in this section should be enough.
You must be able to understand most of the discussion in this section in order to
proceed further with PyElly. A good text for those interested in learning more about
language and formal grammars is John Lyons's book Introduction to Theoretical
Linguistics (Cambridge University Press, 1968). This is written for college-level readers,
but covers the basics that you will need to know.
3. The Semantics of a Language
The notion of meaning has always been difficult to talk about. It can be complicated
even for individual sentences in a language, because meaning involves not only their
grammatical structure, but also where and when they are used and who is using them. A simple
expression like “Thank you” can take on different significance, depending on whether
the speaker is a thug collecting extortion money, the senior correspondent at a White
House press conference, or a disaster victim after an arduous rescue.
Practical computer natural language applications cannot deal with all the potential
meanings of sentences, since this would require modeling almost everything in a
person’s view of world and self. A more realistic approach is to ask what meanings will
actually be appropriate for a computer to understand in a particular application. If the
role of a system in a user organization is to provide, say, only information about
employee benefits from a policy manual, then it probably has no reason to figure out
references to subjects like sex, golf, or the current weather.
Within PyElly, the scope of semantics will be limited even more drastically: we will deal
with the meaning of sentences only to the extent of being able to translate them into
some other language and to evaluate our options when we have more than one possible
translation. This has the advantage of making semantics less mysterious while allowing
us still to implement useful kinds of language processing.
For example, the meaning of the English sentence “I love you” could be expressed in
French as “Je t’aime.” Or we might translate the English “How much does John earn?”
into a data base query language “SELECT SALARY FROM PAYROLL WHERE
EMPLOYEE=JOHN.” Or we could convert the statement “I feel hot” into an IoT
command line like
set thermostat /relative /fahrenheit:-5
In a sense, we have cheated here, avoiding the problem of meaning in one language by
passing it off to another language. Such a translation, however, can be quite useful if
there is a processor that understands the second language, but not the first. This is
definitely a modest approach to semantics; but it beats talking endlessly about the
philosophical meaning of meaning without ever accomplishing anything in actual code.
As noted before, the large number of possible sentences in a natural language prevents
us from compiling a table to map every input into its corresponding output. Instead, we
must break the problem down by first taking the semantics of the various constituent
structures defined by a grammar and then combining their individual interpretations to
derive the overall meaning of a given sentence.
With PyElly, we define the semantics of a sentence structure as procedures associated
with each of the grammatical rules describing the structure. There will actually be two
different kinds of semantic procedures here: those for writing out translations will be
called “generative,” while those for evaluating alternative translations will be called
“cognitive.” At this stage, however, we shall focus on the generative, and leave the
cognitive until Section 10, since these two aspects of semantics operate quite differently.
A successful PyElly sentence analysis will produce a parse tree describing its syntactic
structure. Each phrase node of that tree will be due to a particular grammatical rule, and
associated with that rule will be a generative semantic procedure defining its meaning.
You will have to supply such a procedure for each of your grammar rules, though often
you can just take the defaults defined by PyElly for its three types of grammar rules.
The top phrase node of a complete PyElly parse tree will always be that of the type SENT
(=“sentence”). The generative procedure for such a phrase node will always be called by
PyElly to begin the translation of the original input sentence. This should then set off a
cascade of other procedure calls through the various lower-level constituent structures
of the sentence to produce a final output. The actual ordering of calls to subconstituent
procedures will be determined by the logic of the procedures at each level of the tree.
A PyElly generative semantic procedure basically will operate on text characters in a
series of output buffers. This will involve standard text editing operations commonly
supported in word processing programs: inserting and deleting, buffer management,
searching, substitution, and transfers between buffers. Consistent with PyElly semantics
being procedures, there will also be local and global variables, structured programming
control structures, subprocedures, simple lookup, and set manipulation.
Communication between different semantic procedures will be through local and global
variables as well as by inspecting the current content of buffers. The value of any local or
global variable will always be a string of arbitrary Unicode characters, possibly the null
string. Global variables will be accessible to all procedures and will remain defined even
across the processing of successive sentences, serving as a longer-term memory for
PyElly translations.
Local variables will have a limited scope such as seen in programming languages like C
or PASCAL. They are defined in the procedure where they are declared and also in those
procedures called as subroutines either directly or indirectly. When there are multiple
active declarations of a variable with a given name visible to a semantic procedure, the
most recent one applies. Upon exit from a procedure, all of its local variables
immediately become undefined.
Let us look at some actual semantic procedures to see how a PyElly translation would
actually work. Suppose that we have the five grammar rules
SENT->SUBJECT PREDICATE
SUBJECT->PRONOUN
PRONOUN->we
PREDICATE->VERB
VERB->know
With these rules, we can implement a simple translator from English into French with
the five semantic procedures below, defined respectively for each rule above. For the
time being, the commands in the procedures will be expressed in ordinary English.
These commands will control the entry of text into some output area, such as a text field
in a window of a computer display. Here is what our procedures will do:
For a SENT consisting of a SUBJECT and PREDICATE: first run the procedure for
the SUBJECT, insert a space into the output being generated, and then run the
procedure for the PREDICATE.
For a SUBJECT consisting of a PRONOUN: just run the procedure for the
PRONOUN.
For the PRONOUN we, insert nous into the output being generated.
For a PREDICATE consisting of a VERB, just run the procedure for the VERB.
For the VERB know, insert connaisser.
With this particular set of semantic procedures, the English sentence “we know” will be
translated to nous followed by a space followed by connaisser. You can easily verify
this by starting with the semantic procedure for SENTENCE and tracing through the
cascade of procedure executions.
Each syntactic rule in a grammar must have a semantic procedure, even though the
procedure might be quite trivial such as above when a SUBJECT is just a PRONOUN or a
PREDICATE is just a VERB. This is because we need to make a connection at each level
from SENTENCE all the way down to individual input words like we and know. These
connections will give us a framework to extend our translation capabilities just by
adding more syntactic rules plus their semantic procedures; for example,
PRONOUN->they
For the PRONOUN they: insert ils.
You may have noticed, however, that our example translation here is incorrect. Even more so
than in English, French verbs must agree in person and number with their subject, and so
the translation of know with the SUBJECT we should be connaissons (first person
plural) instead of connaisser (the infinitive). Yet we cannot simply change the VERB
semantic procedure above to “insert connaissons” because this would be wrong if the
PRONOUN for SUBJECT becomes they (third person plural).
We really need more elaborate semantic procedures here to get correct agreement. This
is where various other PyElly generative semantic commands come in. In particular, we
can use local variables to pass information about number and person between the
semantic procedures for our syntactic structures to govern their translations (see
Section 6). Nevertheless, the overall PyElly framework of semantic procedures attached
to each syntactic rule and called recursively will remain the same.
(If you want to look ahead to see how a simple English-to-French translation might be
done properly in PyElly, you can peek ahead to Subsection 6.2. Understanding the
details of the syntactic rules and their generative semantics will require you to go
through Sections 4 and 5 first, however.)
Semantic procedures must always be coded carefully for proper interaction and
handling of details in all contexts. We have to anticipate all the ways that constituent
structures can come together in a sentence and provide for all the necessary
communication between them at the right time. We can make the problem easier here
by taking care to have lower-level structures be parts of only a few higher-level
structures, but this will require some advance planning.
Writing syntactic rules and their semantic procedures is actually a kind of programming
and will require some programming skills. It will be harder than you first might think
when you try to deal with the complexity of natural languages like English or French.
PyElly, however, is designed to help you to do this kind of programming in a highly
structured way, and it should be easier than trying to accomplish the same kind of
translation in a language like Python or even LISP.
The idea of procedural semantics was introduced by Vaughan Pratt in his LINGOL
system. Procedures are not the only way of dealing with meaning, but they fit in well
with what we do in computational linguistics and especially within a framework where
we want to rewrite input text from one language to another. The availability of various
programming language features like variables in PyElly semantic procedures also
provides a way of modeling the contexts of sentence parts beyond what we can describe
with syntactic rules alone.
4. Defining Tables of PyElly Rules
By now, you should understand the idea of grammar rules and semantic procedures.
This section will go into the mechanics of how to define them in text files to be read in
by PyElly at startup. To implement different applications such as translating English to
French or rewriting natural language questions as structured data base queries, you just
need to provide the appropriate files of rules for PyElly to load.
PyElly rules fall into seven main types: (1) grammar, (2) vocabulary, (3) macro
substitutions, (4) patterns for determining syntactic types, (5) personal names and
name components, (6) morphology, and (7) punctuation and sentence delineation. The
grammar of a language for an application tends to reflect the capabilities supported by a
target system, while a vocabulary tends to be geared toward a particular context of use;
macros support particular users of a system, and special patterns and name components
tend to be specific to given applications. Having separate tables of rules makes it easier to
tailor PyElly operation for different environments while allowing parts of language
definitions to be reused.
This section will focus on the grammar, special pattern, and macro rule tables, required
by most PyElly applications. Some of the finer details of syntactic rules will be
postponed to Section 8, “Advanced Capabilities: Grammar.” The creation and use of
tables for vocabulary, names, and morphology will be described in Section 9, “Advanced
Capabilities: Vocabulary.”
To make PyElly do something, you have to set up an application defined by a specific set
of language definition rules organized into tables. The current PyElly package
implements each type of rule table as a Python class with an initialization procedure that
reads in its rules from an external text source file. The text input files for rules governing
a particular application A should be named as follows:
A.g.elly for syntactic rules and their semantic procedures in a grammar.
A.m.elly for macro substitutions.
A.p.elly for special patterns.
You may replace the A here with whatever name you choose for your application, subject
to the file-naming conventions for your computer platform. Only the A.g.elly file is
mandatory for any PyElly application; the other two may be omitted if you have no use
for either macro substitutions or patterns. Section 7 will explain how PyElly will look for
various language definition files for an application and read them in.
The rest of this section will describe the required formats of the definitions in the input
files A.g.elly, A.m.elly, and A.p.elly. Normally you would create these files with
a text editor or a word processor. The NotePad accessory on a Windows PC or TextEdit
on a Mac will be quite adequate, although you may have to rename your files afterward
because they insist on writing out files only with extensions like .txt.
An important element of most language rules will be syntactic structure names, seen in
Section 2. We also call them “syntactic types” or “parts of speech,” but they will be more
general than what we learned in grade school. The current implementation of PyElly in
Python can handle up to 96 different syntactic types in its input files. Six of these types,
however, will be predefined by PyElly with special meanings.
SENT Short for SENTence. Every grammar must have at least one rule of
the form SENT->X or SENT->X Y. PyElly translation will always
start by executing the semantic procedure for a SENT structure.
END For internal purposes only. Avoid using it.
UNKN Short for UNKNown. This structure type is automatically assigned
to strings not known to PyElly through its various lookup options.
(See Subsection 9.1 for more on this.)
... For an arbitrary sequence of words in a sentence. This is for
applications where much of the text input to process is
unimportant. (See Section 8 for more details.)
PUNC For punctuation. (See Section 11.)
SEPR For inferred sentence separators. (See Subsection 12.6.)
You will of course have to make up your own names for any other syntactic types needed
for a PyElly application. Names may be arbitrarily long in their number of characters
but may include only ASCII letters, digits, and periods (.); upper and lower case will be
the same. You do not have to use traditional grammatical names like NOUN, but why be
unnecessarily obscure here?
You may want to keep your syntactic type names unique in their first four characters,
however. This is because PyElly will truncate names to that many characters in its
formatted diagnostic output like parse trees (see Section 12, “PyElly Parsing”). The
resulting tree will be confusing if you have syntactic types like NOUN and NOUNPHRASE.
When reading in grammar rules, PyElly will warn you about any such situation.
Here are some trivial, but functional, examples of grammar, macro, and pattern
definition files:
# PyElly Definition File
# example.g.elly
g:sent->ss # top-level rule
__
g:ss->unkn
__
g:ss->ss unkn
__
# PyElly Definition File
# example.m.elly
i’m->i am # contraction
# PyElly Definition File
# example.p.elly
0 &# number -1 # simple integer
These rules will be explained in separate subsections below. Lines beginning with ‘# ’,
disregarding any leading space characters, will be taken as comments and ignored. A
comment can also be at the end of an input line, but this has to start with ‘ # ’ . The
extra spaces are necessary because the # character is special in PyElly pattern matching.
4.1 Grammar (A.g.elly)
An A.g.elly text file may have four different types of definitions: (1) syntactic rules
with their associated semantic procedures, (2) individual words with their associated
semantic procedures, (3) general semantic subprocedures callable from elsewhere, and
(4) initializations of global variables at startup. These definitions will be respectively
identified in the A.g.elly file by special markers at the start of a line: G:, D:, P:, or
I:. The definitions may appear in any order; markers can be upper or lower case.
4.1.1 Syntactic Rules
These must be entered as text in a strict format of lines. Syntactic rule definitions will
follow the general outline as shown here in a monospaced font for easier reading:
G:X->Y # a marker G: + a syntax rule
_ # a single <UNDERSCORE>,
# omitted if no semantic
# procedure follows
. #
. # the body of a generative
. # semantic procedure
#
__ # a double <UNDERSCORE>,
# mandatory definition terminator
a. In a G: line, PyElly will allow spaces anywhere except after the G, within a syntactic
structure name, or between the ‘-’ and ‘>’ of a rule.
b. The same formatting applies for a PyElly syntactic rule of the form X->Y Z.
c. A generative semantic procedure for a syntactic rule will always appear between the line
starting with a single underscore (_) and the line starting with a double underscore (__).
d. A cognitive semantic procedure may appear before the single underscore, but this will be
described later in Section 10 (“Logic for PyElly Cognitive Semantics”).
e. The actual basic actions for generative semantics will be covered in Section 5
(“Operations for PyElly Generative Semantics”).
f. If a semantic procedure is omitted, various defaults apply; see Section 6 (“Simple
PyElly Rewriting Examples”). In this case, the first single underscore delimiter is optional.
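As a concrete illustration, here is a minimal sketch of two complete rule entries that rely entirely on the default semantics of Section 6; the type names np, det, art, and noun are made up for this example:
g:np->det noun # a two-element rule with default semantics
__
g:det->art # a one-element rule with default semantics
__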
4.1.2 Grammar-Defined Words
It is usually a good idea to keep the vocabulary for a PyElly application separate from a
grammar as much as possible. For scalability, PyElly will keep its vocabulary mainly in
an external data store; and Section 9 will describe how to set this up. Some single-word
definitions, however, may also appear alongside the syntactic rules in a grammar
definition file. These will be called internal dictionary rules.
In particular, some words like THE, AND, and NOTWITHSTANDING are associated
with a language in general instead of any specific content. These are probably best
defined as part of a grammar file anyway. In other cases, there may also be so few words
in a defined vocabulary for an application that we may as well include them all internally
in a grammar rather than externally.
The form of an internal dictionary rule is similar to that for a grammatical rule:
D:w<-X # a marker D: + a word "w"
# + a structure type X
_ # a single <UNDERSCORE>
. #
. # a generative semantic procedure
. #
__ # a double <UNDERSCORE>,
# mandatory definition terminator
a. The D: is mandatory in order to distinguish a word rule from a grammatical rule of the
form X->Y.
b. To suggest the familiar form of printed dictionaries, the word w being defined appears
first, followed by its structure type X (i.e., part of speech). Note that the direction of the
arrow <- is reversed from that of syntax rules.
c. The w must be a single word, number, or other text token, possibly hyphenated, or a
single punctuation mark, or a short bracketed text segment like (s). Multi-word terms
in an application must be defined in PyElly’s external vocabulary or stitched together by
grammar rules, macro substitution rules, or other mechanisms discussed in Section 9.
d. The single and double underscore separators for semantics are the same as for syntactic
rules. A word definition may also have both cognitive and generative semantics to
establish its meaning for PyElly.
e. A given word may have multiple internal dictionary rules and so be ambiguous.
f. A given internal dictionary word may also be defined elsewhere in PyElly (see Section 9).
These other definitions may differ. PyElly will figure out how to make sense of them.
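For instance, the function words mentioned above could be given minimal internal dictionary entries like the following sketch, which assumes that the default semantics of Section 6 also apply to word entries and again uses made-up type names:
d:the<-det
__
d:and<-conj
__
d:notwithstanding<-prep
__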
4.1.3 Generative Semantic Subprocedures
Every PyElly generative semantic procedure for a rule will be written in a special PyElly
programming language for text manipulation. This language also allows for named
subprocedures, which need not be attached to a specific syntax rule or internal
dictionary rule. Such subprocedures may be called in a generative semantic procedure
for a PyElly rule or by another subprocedure. Their definitions may appear anywhere in
a *.g.elly grammar file.
A subprocedure will take no arguments and return no values. All communication
between semantic procedures must be through global or local variables or from the text
written into PyElly output buffers (see Section 5 for details). Calls to subprocedures may
be recursive, but you will be responsible for avoiding infinite recursion here.
A subprocedure definition will have the following form:
P:nm # a marker + procedure name "nm"
_ # a single <UNDERSCORE>,
# mandatory
. #
. # generative semantic procedure body
. #
__ # double <UNDERSCORE> delimiter,
# mandatory
a. Note the absence of any arrow, either -> or <-, in the first definition line.
b. A procedure name n should be a unique string of alphanumeric characters without any
spaces. It can be of any non-zero length. The case of letters is unimportant. Duplicate
definitions for the same procedure name will be reported as an error.
c. The underscore separators are the same as for syntactic rules and word definitions and
will both be mandatory for a subprocedure definition.
d. A subprocedure definition may have only generative semantics. Cognitive semantics will
never apply to a subprocedure and will always be ignored if specified.
4.1.4 Global Variable Initializations
PyElly global variables in a generative semantic procedure can be set in various ways.
When such variables store important parameters referenced in a particular application
grammar, it may be helpful to be able to define them within the definition file for that
grammar. In that way, the definition will be more readily referenced and more easily
maintained. The startup initialization of global variable x to the string s is accomplished
by an I: line in a grammar definition file:
I:x=s
One must have one I: line for each global variable being initialized. Note that an I: line
always stands by itself; there is no associated generative semantic procedure as in the
case of G:, D:, and P: lines. An I: line may appear anywhere in a grammar definition
file, but for clarity, it should come before any reference to its variable in a semantic procedure. For
readability, you may freely put spaces around the variable name x and after the = sign
here. For example,
I:iterate = abcdefghijklm
I:joiner = svnm
In the first initialization above, the iterate global variable is set to abcdefghijklm.
A string value may have embedded space characters, but all leading and trailing spaces
will be ignored and multiple consecutive embedded spaces will be collapsed into one.
4.2 Special Patterns For Text Elements (A.p.elly)
Many elements of text are too numerous to list out in a dictionary, but are recognizable
by their form; for example, Social Security numbers, web addresses, or Icelandic
surnames. PyElly allows you to identify such elements in input text by specifying the
patterns that they conform to. That is how PyElly now deals with ordinary decimal
numbers in input text to be translated.
PyElly special patterns serve to assign a syntactic structure type (part of speech) to a
single word or other token in its input text. This will supplement any explicit definition
in a grammar’s internal dictionary (see Subsection 4.1.2) or in its external vocabulary
table (see Subsection 9.4). For example, you can make ‘123’ a NOUN by a D: rule in a
grammar table, but PyElly can still infer that it is a structural type NUM from a pattern.
In general, we may have to compare multiple patterns in various orders to identify a
particular kind of text element. PyElly coordinates this kind of processing through a
finite-state automaton (FSA), which should be familiar to every aspiring computational
linguist. This is not a physical machine, but a software algorithm working from a
predefined set of rules telling it how to proceed step by step in matching up patterns
from left to right against input text.
The key concept in an FSA is that of a state, which sums up how far along a text string
the FSA has so far matched and what patterns it should look for next. An FSA will
typically have multiple states, but one will always be the starting state when the FSA is
at the front of an input text string with nothing yet matched. The total number of
different states must be limited, hence the finiteness of an automaton.
In any given state, an FSA will have a specific list of patterns to check against the input
text at its current position. These may have wildcards, pattern elements able to match
various text characters; for example, any digit 0-9. PyElly wildcards are similar to the
wildcards used in regular expressions in utilities like grep, but are defined specifically
for natural language processing; these will be listed below. This departs from the usual
kind of FSA, which typically allows no wildcards in the patterns governing its transitions.
Each pattern at a state will have a certain action to take upon any match. Usually, this
means moving forward in its input string and going on to a next state according to its
predefined rules. There may be more than one such next state because a string at a given
FSA state can match more than one pattern with wildcards. This is a complication, but
everything is still equivalent to a regular FSA. It just lets our rule sets be more compact.
Some matches will have no next state in a PyElly FSA, but instead will specify a syntactic
structure type. If an FSA has also reached a suitable endpoint for an input token, then a
match is complete, and PyElly can create a token of that given syntactic type from the
text being matched. At this point, a normal FSA would be done, but PyElly will also have
to examine all the matching possibilities arising from multiple next states for an FSA.
PyElly continues until it has checked all reachable states and patterns or until the FSA
runs out of input. PyElly then will return a positive match length if any matching token
has been found; 0, otherwise. See Section 12 for more details.
PyElly requires every FSA state to be identified by a unique non-negative integer, where
the initial state is always 0. The absence of a next state after a match is indicated by -1.
At each state, what to look for next is defined as a pattern of literal characters and
wildcards. A *.p.elly definition file will consist of separate lines each specifying a
possible pattern for a given state, an optional PyElly syntactic structure type associated
with any match, and a next state upon a match. These specifications comprise the table
of pattern rules that a PyElly FSA will work with.
Here is a simple PyElly file of an FSA with pattern rules:
# simple FSA to recognize syntactic structure types
# example.p.elly
#
# each input record is a 4-tuple
# STATE PATTERN SYNTAX NEXT
0 #, - 1
0 ##, - 1
0 ###, - 1
1 ###, - 1
1 ###$ NUM -1
0 &# - 2
2 . - 3
2 $ NUM -1
3 &#$ NUM -1
3 $ NUM -1
This recognizes numbers like 1024, 3.1416, and 1,001,053 as tokens of syntactic
type NUM. A pattern rule in its most basic form will be a single line with four parts:
STATE PATTERN SYNTACTIC TYPE NEXT
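As an illustration of how such rules work together, here is a sketch (not part of the definition file) of one path by which the example rules above would assign a type to the token 3.1416, following the matching process described earlier:
state 0   pattern &#    matches "3"                        go to state 2
state 2   pattern .     matches "."                        go to state 3
state 3   pattern &#$   matches "1416" at the token end    final: syntactic type NUM (next state -1)
Each of the four parts of a rule line is described in the notes below.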
a. The first part is an integer ≥ 0 representing a current PyElly automaton state.
b. The second part is a PATTERN, which may be an arbitrary sequence of letters, numbers,
and certain punctuation, including hyphen (-), comma (,), period (.), slash (/). If these
are present, they must be matched exactly within a word being analyzed.
c. A PATTERN may have explicit characters and also wildcards, which can match various
substrings in a token string. The PyElly wildcards allowed in a pattern are listed after these notes.
d. If a character to be matched is also a PyElly wildcard, then you must escape it in a
*.p.elly pattern with a double backslash; for example, \\* to match * literally.
e. A special PATTERN consisting only of \0 (ASCII NUL) will cause an automaton to move
to a next state without any matching. This pattern does not require a double backslash.
f. Brackets [ and ] in a pattern will enclose an optional subsequence to match; only one
level of bracketing is allowed and only alphanumeric characters are allowed inside the
brackets. The pattern [ab]c will match the string abc or the string c.
g. A pattern with a wildcard other than \0 or $ by itself must always match at least one
character; for example, the pattern [a]* will be rejected.
h. The third part of a pattern rule is a syntactic type (part of speech) like NOUN, indicating a
possible final state. A ‘-’ here means that no type is specified.
i. The fourth part is the next state to go to upon matching a specified pattern. This will be
an integer ≥ -2. A -1 here means a final state, which should follow a syntactic type; a -2
will also be final, but allows for a matched token to end in the middle of a normal token.
j. All -1 final state patterns not ending with the * or $ wildcards will have a wildcard $
appended automatically. Patterns for all non-final and -2 states will be left alone.
#    will match a single digit 0 - 9, including some exponents
@    will match a single letter a - z or A - Z, possibly with diacritics
!    will match a single uppercase letter, possibly with diacritics
¡    will match a single lowercase letter, possibly with diacritics
?    will match a single digit or letter
*    will match an arbitrary sequence of non-blank characters, including a null sequence
&?   will match one or more letters or digits in a sequence
&#   will match one or more digits in a sequence
&@   will match one or more letters in a sequence
^    will match a single vowel
%    will match a single consonant
'    will match an apostrophe appearing either as ' (ASCII) or ’ (Unicode right single quotation mark) or ′ (Unicode prime)
$    will match the end of a token, but not add to the extent of any matching; a token will end if followed by a space or certain punctuation
For example, the pattern ###-##-####$ matches Social Security numbers, while the
pattern (###)###-####$ matches a telephone number with an area code, but without
separating spaces. See Subsection 9.2.1 for more on possible number patterns. To match
a character like & or $ in text, it may be specified explicitly for matching in a pattern by
escaping it with backslashes: \\&. This then will not be interpreted as a wildcard.
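Put into a *.p.elly file, the two examples just mentioned might look like the following sketch, where SSN and PHONE are made-up syntactic type names:
0 ###-##-####$ SSN -1
0 (###)###-####$ PHONE -1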
PyElly also treats lowercase alphabetic characters as semi wild in effect. A lowercase
letter in a pattern will match the same letter in text input irrespective of case. An
uppercase letter in a pattern, however, will match only the uppercase version of the
letter in text input. So, put an uppercase letter into a pattern only if you mean it.
The PyElly FSA will always recognize only a single token with no spaces. You cannot
match multiple words separated by spaces like XXX YYY ZZZ. You can, however,
recognize special tokens with punctuation characters not normally in a PyElly token; for
example, (R-TEXAS). To do this in the PyElly FSA, you should have separate states to
match a single punctuation character:
0 ( - 1
1 !-!&@ - 2
2 ) XID -1
It is possible to define rules that reference only the start state 0. When your patterns
have many wildcards, however, having multiple states can make your matching much
clearer. This will always work and will facilitate the sharing of states by different rules.
4.3 Macro Substitutions (A.m.elly)
Macro substitution is a way of automatically replacing specific substrings in an input
stream by other substrings. This is a useful capability in a language translator, and so
PyElly adds it as yet another integrated tool, adapting code from Kernighan and Plauger,
Software Tools, Addison-Wesley, 1976.
The main difference between PyElly macro substitution and Software Tools is that the
substrings to be replaced can be described with wildcards as well as explicit characters to
match, and that the replacement substrings can refer back to the parts of the original
string that matched consecutive sequences of pattern wildcards.
Macro substitution is convenient for handling idioms, synonyms, abbreviations, and
alternative spellings and for doing syntactic transformations awkwardly described by
context-free grammar rules. A PyElly macro substitution rule is generally defined as
follows in language definition file A.m.elly:
P_Q_R->A B C D
a. Each macro substitution rule must be a single line. Since macros will be defined in their
own *.m.elly file, we need no marker at the beginning of each line.
b. The left side of a substitution will be a pattern containing literal Unicode characters and
PyElly wildcards. It may not be empty. The wildcards will include all the ones listed in
the preceding subsection (4.2) for special patterns, plus a _ space wildcard and a ~
nonalphanumeric wildcard. The left side may have arbitrarily many parts separated by
the _ wildcard. The space wildcard is the only one allowed within pattern brackets [ ]
for optional components in a match; there will also be semi wild matching of lower
alphabetic characters in patterns as in the preceding subsection. A left side not ending in
a _ or * wildcard is assumed to end with $. Note that a ~ wildcard will not match an
ampersand, which is a stylized writing of the Latin et.
c. The right side of a substitution may contain space characters as word separators; or it
may be empty. Upper and lower case will be significant on the right side. It will be
significant on the left side only for semi wild matching. The right side can be arbitrarily
long, although you usually want a substitution to be shorter than an original string.
d. Input text matching the pattern on the left side will be replaced by the right side. This is
the actual macro substitution defined by a rule.
e. A \\1, \\2, \\3, and so forth, on the right stands respectively for the first, second, third,
and so forth, parts of text matched by wildcard patterns on the left. PyElly allows up to
nine bindings for a pattern. Each binding applies to a sequence of contiguous wildcards,
except for _ and , which will always be bound to a single character. For example,
matching the pattern #@abc@# will associate the first digit and letter of a match with
\\1 and the last letter and digit with \\2. Matching the pattern #a’_* with a string will
associate the character before a with \\1, a space with \\2, and any string after the
space with \\3.
f. Characters used as wildcards can be matched literally by escaping them with \\; for
example \\? matches a question mark. This applies only to the left side of macro rules.
g. When any macro is matched and its substitution done, then all macros will be checked
again against the modified result for further possible substitutions. When a macro
eliminates its match entirely, though, substitutions will be ended at that position.
h. The order of macro substitution rules is significant. Those first in a definition file will
always be applied first, possibly affecting the applicability of those defined afterward.
Macro patterns starting with a wildcard will always be checked after all others, however.
i. Macros have no associated semantic procedures because they run outside of PyElly
syntax-driven parsing and rewriting.
Macro substitutions will be trickier to manage than grammatical rules because you can
accidentally define them to work at cross-purposes. An infinite loop of substitutions can
result if you are careless. Nevertheless, macros can greatly simplify a language definition
when you use them properly and keep their patterns fairly short.
Substitution rules will be applied to the current PyElly input text buffer at the current
position each time before the extraction of the next token to be processed. This can
override any tokenization rules in effect and can modify any previous stemming of
words. All of that can add up to substantial overhead if you define many macros because
all possible substitutions will always be tried out at each possible token position.
Here are some actual PyElly macro substitution rules from an application trying to
compress input as much as possible while keeping it readable:
*'ll -> \\1
percent* -> %
will_not -> wont
data_base -> DBs
greater_than_or_equal_to -> >=
carry_-ing -> with
receiv[e]* -> rcv\\1
#*_@[.]m$ -> \\1\\3m
The last macro rule above will replace “10 p.m” with “10pm” to save space.
The substitution part of a macro rule may include a \\s special character, indicating the
u’\u001E’ Unicode character, which is ASCII RS (record separator). For example,
,_@@&@ing_that -> ,\\s \\1ing that
will insert RS into the PyElly text input buffer after finding a comma (,) when followed
by a participial -ING THAT expression. PyElly automatically recognizes RS as the
syntactic category SEPR, but your grammar rules then must do something with this.
You often can use macro substitutions to handle idioms or other irregular forms that are
exceptions to general language rules or patterns of words with dependencies often hard
to capture in PyElly context-free syntax rules. Just rewrite them into a less ambiguous
form that your grammar rules will recognize. This could be an arbitrary string of letters
or digits not normally readable by a person; for example, didn’t -> DDNT.
Macros will always be applied after any inflectional stemming, but before any
morphological stemming (see Subsection 9.2.2) of an unknown text element. Use them
with care; PyElly will warn you when something might be dangerous, but will not stop
you from getting into an infinite loop of substitutions.
5. Operations for PyElly Generative Semantics
In Section 3, we saw examples of generative semantic procedures expressed in English.
PyElly requires, however, that they be written in a special structured programming
language for editing text in a series of output buffers. This language has conditional and
iterative control structures, but generally operates at the nitty-gritty level of
manipulating a few characters at a time.
Basically, PyElly generative semantics manages buffers and moves around text in them.
The semantic procedures for various parts of a PyElly sentence all have to put their
contributions for a translation into the right place at the right time. Proper coordination
is critical; you have to plan everything out and control the interactions of all procedures.
Every generative semantic procedure will be a sequence of simple commands, each
consisting of an operation name possibly followed by arguments separated by blanks.
These various operations are described below in separate subsections. For clarity, the
operation names are always shown in uppercase here, but lower or mixed case will be
fine. Comments below begin with ‘ # ’ and are not part of a command.
5.1 Insertion of Strings
These operations put a literal string at the end of the current PyElly output buffer:
APPEND any string # put "any string" into current buffer
BLANK # put a space character into buffer
SPACE # same as BLANK
LINEFEED # start new line in buffer, add space
OBTAIN # copy in the text for the first token
# at the sentence position of the
# phrase constituent being translated
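For instance, a short procedure might label a token and echo it. This is only a sketch, with an
arbitrary label string:
APPEND term=     # put the literal label "term=" into the buffer
OBTAIN           # follow it with the token text itself
SPACE            # and end with a single space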
5.2 Subroutine Linkage
For calling procedures of subconstituents for a phrase and returning from such calls:
LEFT # calls the semantic procedure
# for subconstituent structure
# Y when a rule is of the form
# X->Y or X->Y Z.
RIGHT # calls the semantic procedure
# for subconstituent structure
# Z for a rule of the form
# X->Y Z, but Y for rule X->Y
RETURN # returns to caller, not needed at
# the end of a procedure
FAIL # rejects the current parsing of an
# input sentence and returns to the
# first place where there is a choice
# of different parsings for a part of
# an input structure
5.3 Buffer Management
Processing starts with a single output text buffer. Spawning other buffers can help to
keep the output of different semantic procedures separate for adjustments before
joining everything together. You can set aside the current buffer, start working in a new
buffer and then return to the old buffer to shift text back and forth between them.
SPLIT # creates a new buffer and
# directs processing to it
BACK # redirects processing to end
# of previous buffer while
# preserving the new buffer
MERGE # appends content of a new
# buffer to the previous one,
# deallocating the new one
These allow a semantic procedure to be executed for its side effects without immediately
putting anything into the current output buffer. Splitting and merging will work when
nested recursively, but for clarity, put a corresponding SPLIT and MERGE in the same
procedure. The MERGE operation can also be combined with string substitution:
MERGE /string1/string2/ # as above, except that all occurrences
# of "string1" in the new buffer will
# be changed to "string2"
# (the divider / here may be replaced by
# any char not in string1 or string2)
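A common arrangement, which also appears in the example of Subsection 6.2, runs a
subconstituent's procedure in a separate buffer and then folds its output back in. This is only
a sketch of the sequencing for a two-branch rule:
LEFT     # the Y constituent writes into the current buffer
SPLIT    # open a new buffer
RIGHT    # the Z constituent writes into the new buffer
BACK     # return to the current buffer to make any adjustments
MERGE    # finally append the new buffer to the current one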
5.4 Local Variable Operations
Local variables can store a Unicode string value. Variable names may have one or more
letters or digits. They will be declared within the scope of a semantic procedure and will
automatically disappear upon a return from the procedure.
VARIABLE xv=string # declare and set variable xv to
# "string"; if the "=string" part
# is missing, then initialization
# is to a null string ""
SET xv=string # assigns "string" to the most recent
# declaration of local variable xv,
# defining xv if necessary; if the
# "=string" part is missing, then
# assignment is to a null string ""
A string may contain any printing characters, but trailing spaces will be dropped. To
handle single space characters specified by their ASCII names, you may use the
following special forms:
VARIABLE xv SP # define variable xv as single
# space char
SET xv SP # set variable xv as single
# space char, defining xv
# if necessary
Note the absence of the equal sign (=) here. PyElly will recognize SP, HT, LF, and CR as
space characters here and nothing else. This form can also be used with the IF, ELIF,
WHILE, and BREAKIF semantic operations described below. You may write VAR as
shorthand for VARIABLE; they are equivalent.
Some operations have a local variable as their second argument. These support
assignment, concatenation of strings, and queuing.
ASSIGN xv=zv # assigns the string value of local
# variable zv to the local
# variable xv in their most
# recent declarations
QUEUE qv=xv # appends the string stored for local
# variable xv to any string stored for
# variable qv
UNQUEUE xv=qv n # removes the first n chars of the
# string stored in local variable
# qv and assigns them to local
# variable xv; if n is unspecified,
# the character count defaults to 1;
# if qv has fewer than n chars, then
# xv is just set to the value of qv
# and qv is set to the null string
The equal sign (=) and righthand local variable name are required for UNQUEUE and
QUEUE. If a lefthand local variable is undefined here, it will automatically become
defined in the scope of the current generative semantic procedure.
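Here is a short sketch of these operations working together; the variable names and values
are arbitrary:
VARIABLE qv=abcdef   # qv starts as "abcdef"
VARIABLE xv          # xv starts as the null string
UNQUEUE xv=qv 2      # xv is now "ab" and qv is "cdef"
QUEUE qv=xv          # qv is now "cdefab"
ASSIGN xv=qv         # xv is now also "cdefab"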
5.5 Set-Theoretic Operations with Local Variables
PyElly allows for manipulation of sets of strings, represented as their concatenation into
a single string with commas between individual strings. For example, the set
{"1", "237", "ab", "u000"} would be represented as the single string "1,237,ab,u000".
When local variables have been set to such list values, you can apply PyElly set-theoretic
operations to them.
UNITE x<<z # takes the union of the list values
# of local variables x and z
# and saves the result in x
INTERSECT x<<z # intersects the list values
# of local variables x and z
# and saves the result in x
COMPLEMENT x<<z # restricts the list values of
# local variable x to those
# not in the list value for
# local variable z and saves
# the result in x
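For example, with arbitrary variable names and values, a sketch of these operations might
run as follows:
VARIABLE x=1,237,ab
VARIABLE z=ab,u000
UNITE x<<z        # x is now "1,237,ab,u000"
COMPLEMENT x<<z   # x is now "1,237"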
5.6 Global Variable Operations
Global variables are permanently allocated in a language definition and are accessible to
all semantic procedures through two restricted operations:
PUT x y # store the value of local
# variable x in global
# variable y
GET x y # the inverse of PUT
There is no limit on the total number of global variables. The global variables gp0, gp1,
… can be defined and set from a PyElly command line (see Section 7); you can define
other global variables yourself in semantic procedures by doing a PUT or a GET with a
new global variable name. You can also set global variables with the I: option in a
grammar rule file.
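For example, a value can be passed from one procedure to another through a global variable.
This is only a sketch; the variable names here, including the global gsave, are arbitrary:
VARIABLE x=abc
VARIABLE y
PUT x gsave    # global variable gsave now holds "abc"
GET y gsave    # local variable y is now "abc"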
5.7 Control Structures
Only two structures are supported in generative semantics: the IF-ELIF-ELSE
conditional and the WHILE loop; they are as follows:
IF x=string # if local variable x has
# value string, execute the
# following block of code
ELIF x=string # follows an IF; the test is
# made if all preceding
# tests failed and will
# control execution of
# following block of code
# (more than one ELIF can
# follow an IF)
ELSE # the alternative to take
# unconditionally after all
# preceding tests have failed
WHILE x=string # the following block of code
# is repeatedly executed
# while the local variable
# x is equal to string
END # delimits a block of code and
# terminates an IF-ELIF-ELSE
# sequence or a WHILE loop
An END must terminate every IF-ELIF-ELSE sequence and every WHILE loop. PyElly
will report a table definition error if any required END is missing for a control structure.
As in Subsection 5.4, we can check for single space characters here. For example,
IF x SP # check if local variable x is
# a space character
ELIF x SP #
WHILE x SP #
Instead of SP, you may also have HT, LF, NL, or CR; NL is the same as LF. New lines in
a PyElly output buffer are always represented as a LF followed by an SP.
A tilde (~) preceding the variable name x reverses the logical sense of comparison in all
the checks above.
IF ~x=string # test if x is not equal to string
The IF and ELIF commands also have a form that allows for testing a variable
against a list of strings. PyElly allows for
IF x=s, t, u # test if x == s or x == t or x == u
ELIF x=s, t, u # test if x == s or x == t or x == u
The multiple strings to be compared against here must be separated by a comma (,)
followed by a space. The space is essential for PyElly to recognize the listing here. The
tests here can be negated with a tilde (~) also. The special checks for the space characters
SP, HT, NL, and CR described above are not supported in this list form.
Within a WHILE loop, you may also have
BREAK # unconditionally break out
# of current WHILE loop
BREAKIF x=string # if local variable x has
# value string, break out of
# current WHILE loop
The condition for BREAKIF can be negated with a preceding tilde (~) as above. You can
check for a single space character here also.
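For example, a procedure that has saved a single character in a local variable might branch on
whether it is a vowel. This is only a sketch with arbitrary output strings:
IF x=a, e, i, o, u
APPEND V # the character was a vowel
ELSE
APPEND C # it was something else
END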
5.8 Character Manipulation
These work with the current and next output buffers as indicated by < or > in a
command; x specifies a source or target local variable to work with.
EXTRACT > x n # drops the last n chars of
# the current output buffer and
# sets local variable x to the
# string of dropped characters
EXTRACT x < n # drops the first n chars of
# the next output buffer and
# sets local variable x to the
# string of dropped characters
INSERT < x # insert the chars of local
# variable x to the end of the
# current output buffer
INSERT x > # insert the chars of local
# variable x to the start of the
# next output buffer
PEEK x < # get a single char from
# start of next output buffer
# without removing it
PEEK > x # gets a single char from
# end of current output buffer
# without removing it
DELETE n < # deletes up to n chars from the
# start of the next output buffer
DELETE n > # deletes up to n chars from the
# end of the current output
# buffer
STORE x k # save last deletion in a current
# procedure in local variable x,
# except for last k chars when
# k > 0 or the first k chars
# when k < 0; if unspecified,
# k defaults to 0
SHIFT n < # shifts up to n chars from
# the start of the next output
# buffer to the end of the
# current output buffer
SHIFT n > # shifts up to n chars from
# the end of the current output
# buffer to the start of the
# next output buffer
If n is omitted for an EXTRACT operation, it is assumed to be 1; if the < or > is omitted
from a DELETE or a SHIFT, then < is assumed. These commands must always have at
least one argument. All deleted text can be recovered by a STORE command.
The DELETE operation also has two variants
DELETE FROM s # this deletes an indefinite
# number of chars starting from
# the string s in the current
# buffer up to the end
DELETE TO s # this deletes an indefinite
# number of chars up to and
# including the string s
# in the next buffer
If the argument s is omitted for DELETE FROM or DELETE TO, it is taken to be the
string consisting of a single space character. If s is not found in the current or the next
buffer for DELETE FROM or DELETE TO, all of that buffer will be deleted. As with the
regular DELETE operation, any characters removed by this command can be recovered
by the STORE command.
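For instance, the last character of the current output buffer can be removed but kept
available for later reuse. This is only a sketch; the variable name is arbitrary:
DELETE 1 >    # remove the last char of the current buffer
STORE last    # recover the deleted char into local variable last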
5.9 Insert String From a Table Lookup
This operation uses the value of a local variable to select a string for appending at
the end of the current output buffer. It has the form
PICK x table # select from table according
# to the value of x
The table argument is a literal string of the form
(v1=s1#v2=s2#v3=s3#...vn=sn#)
where the # character must be used to terminate each table option. If the value of the given
local variable x is equal to substring vi for a table option, then substring si will be
inserted. If there is no match, nothing will be inserted, but when vn is null, then sn will
be inserted if the variable x matches no other vi.
For example, the particular PICK operation
PICK x (uu=aaaa#vv=bbbb#ww=cccc#=dddd#)
in a generative semantic procedure or in a vocabulary table entry is equivalent to the
PyElly code
IF x=uu
APPEND aaaa
ELIF x=vv
APPEND bbbb
ELIF x=ww
APPEND cccc
ELSE
APPEND dddd
END
but the IF-ELSE form will take up multiple lines. PyElly will compile a PICK operation
to use a Python hash object for faster lookup.
The operation
PICK x (=dddd#)
will append dddd for any x, including x being set to the null string.
5.10 Buffer Searching
There is a general search operation in forward and reverse forms. These assume the
existence of a current and a new buffer as the result of executing SPLIT and BACK.
FIND s < # the contents of the new
# buffer will be shifted to the
# current buffer up to and including
# the first occurrence of string s
FIND s > # as above, but the transferring
# will be in the other direction
# past first occurrence of s
The substring s may not contain any spaces. If s is missing, but either < or > is
specified, then s will be set to a single space character. If s is not found in a buffer scan,
the entire contents of the buffer will be shifted. If s is specified and a final < or > is
omitted, then > will be assumed. Note that s can include the characters < or >, which
can look confusing, but will have the correct interpretation in what to search for. Note,
however, that FIND < and FIND > will always search for a single space.
A more specialized search allows you to align your current and new buffers at the start
of an output line as marked by a previously executed LINEFEED command.
ALIGN < # shift chars in new buffer to current
# until a \n inserted by LINEFEED is
# found; the current buffer should
# end with \n followed by a space
ALIGN > # as above, but transfer will be in the
# other direction; any matching \n and
# following space will be left in the
# current buffer, however
ALIGN will be like FIND in that the entire source buffer will be moved if no \n is found; but
if found, the current buffer will always end up with <nl><sp> as its last two chars.
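Here is a sketch of FIND in use, assuming a subconstituent has just written text containing a
comma into a new buffer:
SPLIT
RIGHT       # subconstituent output goes into the new buffer
BACK
FIND , <    # shift that output up to and including the first
            # comma back into the current buffer
MERGE       # then append whatever is left over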
5.11 Execution Monitoring
To track the execution of semantic procedures when debugging them, you can use the
command:
TRACE # show processing status in tree
In the semantic procedure for a phrase, this will print to the standard error stream the
starting token position of the phrase in a sentence, its syntax type, the index number of
the syntactic rule, the degree of branching of the rule, the generative semantic stack
depth, the output buffer count, and the number of characters in the current buffer:
TRACE @0 type=field rule=127 (1-br) stk=9 buf=1 (2 chars)
We see here that PyElly is running the semantics for the 1-branch rule 127 associated
with a phrase of type FIELD at token position 0; it is executing at the 9th level of calls
with a single buffer containing only two characters. If a subprocedure named pn (see
Subsection 5.13) has called the current generative semantic procedure either directly or
indirectly, then this will be identified also. The output above then becomes
TRACE @0 type=field rule=127 (1-br) stk=9 in (pn) buf=1 (2 chars)
If there are multiple named subprocedures in the chain of calls for the current
generative semantic procedure, then only the most recent will be reported.
To show the current string value of a local variable x, you can insert this command into
a semantic procedure:
SHOW x message .... # show value of local variable x
This writes the ID number of the phrase being interpreted, the name of the variable
being shown, its current string value, and an optional identifying message to the
standard error stream. For example,
SHOW @phr 108 : [message ....] VAR x= [012345]
The message string is optional here; it may contain spaces.
To see up to the last n chars of the current and up to the first n of the next output buffer
at the current point of running generative semantics, you can use a third command
VIEW n # show n chars of current + next buffers
When executed in a generative semantic procedure, VIEW 3 will write the following
kind of message to the standard error stream:
VIEW = 0 @phr 6 : [u'u', u'n', u'>'] | [u'<', u's', u's']
This first gives the index number of the current output buffer and the ID number of the
phrase being interpreted. The vertical bar (|) separates the list of Unicode characters
ending the current buffer from the list starting the next buffer. Set n to get as wide a
window here as you need; if unspecified, the default for n is 5. Run a sequence of VIEWs
to monitor progress in accumulating rewritten text for PyElly output.
5.12 Capitalization
PyElly has only two commands to handle upper and lower case in output.
CAPITALIZE # capitalize the first char
# in the next buffer after a
# split and back operation
This operates only on the next output buffer. If you fail to do a SPLIT and BACK
operation to create a next output buffer before running this command, you will get a
null pointer exception, which will halt PyElly.
UNCAPITALIZE # uncapitalize the first char
# in the next buffer after a
# split and back operation
The restrictions for CAPITALIZE apply here also.
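For example, to make sure that a subconstituent's output begins with a capital letter, a
procedure might run the following sketch:
SPLIT
RIGHT         # the subconstituent writes into the new buffer
BACK
CAPITALIZE    # capitalize the first char of that new buffer
MERGE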
5.13 Semantic Subprocedure Invocation
If DO is the name of a semantic subprocedure defined with P: in a PyElly grammar
table, then it can be called from a generative semantic procedure for a rule or another
subprocedure by giving the name in parentheses:
(DO) # call the procedure called DO
The subprocedure name must be defined somewhere in a PyElly A.g.elly file. This
definition does not have to come before the call. When a subprocedure finishes running,
execution will return to just after where it was called. Any local variables in the
subprocedure will then become undefined.
A subprocedure call will always take no arguments. If you want to pass parameters, you
must do so through local or global variables or in an output buffer. Similarly, results
from a subprocedure can be returned only by putting them into an output buffer or
passing them in a local or global variable.
The null subprocedure call () with no name is always defined; it is equivalent to a
generative semantic procedure consisting of just a RETURN. This is normally used only
for PyElly vocabulary definitions with no associated generative semantics (see
Subsection 9.4).
6. Simple PyElly Rewriting Examples
We are now ready to look at some simple examples of semantic procedures for PyElly
syntax rules, employing the mechanisms and operations defined in the preceding
sections. Sections 8 and 9 will discuss more advanced capabilities.
6.1 Default Semantic Procedures
The notes in Section 4.1.1 of this manual mentioned that omitting the semantic
procedure for a syntax rule would result in a default procedure being assigned to it. Now
we can finally define those default procedures. A rule of the form G:X->Y Z will have
the default
_
LEFT
RIGHT
__
Note that a RETURN command is unnecessary here as it is implicit upon reaching the
end of the procedure. You can always put one in yourself, however.
A rule of the form G:X->Y has the default semantic procedure
_
LEFT
__
A rule of the form D:w<-X has the default
_
OBTAIN
__
These are automatically defined by PyElly as subprocedures without names. They do
nothing except to implement the calls and returns needed minimally to maintain
communication between the semantic procedures for the syntactic rules associated with
the structure of a sentence derived by a PyElly analysis.
In the first example of a default semantic procedure above, a call to the procedure for
the left constituent structure X comes first, followed immediately by a call to the
procedure for the right constituent Y. If you wanted instead to call the right constituent
first, then you would have to supply your own explicit semantic procedure, writing
_
RIGHT
LEFT
__
In the second example above of a default semantic procedure, there is only one
constituent in the syntactic rule, and this can be called as a left constituent or a right
constituent; that is, a RIGHT call here will be interpreted as the same as LEFT.
In the third example of a default semantic procedure above, which defines a
grammatical word, there is neither a left nor a right constituent; and so we can execute
only an OBTAIN. Either a LEFT or a RIGHT command here would result in an error.
6.2 A Simple Grammar with Semantics
We now give an example of a nontrivial PyElly grammar. The problem of making
subjects and predicates agree in French came up previously in Section 3. Here we make
a start at a solution by handling elision and the present tense of first conjugation verbs in
French and of the irregular verb AVOIR “to have.” For the relationship between a
subject and a predicate in the simplest possible sentence, we have the following
syntactic rule plus semantic procedure.
G:SENT->SUBJ PRED
_
VAR PERSON=3 # can be 1, 2, or 3
VAR NUMBER=s # singular or plural
LEFT # for subject
SPLIT
RIGHT # for predicate
BACK
IF PERSON=1
IF NUMBER=s
EXTRACT X < # letter at start of predicate
IF X=a, e, è, é, i, o, u
DELETE 1 # elision j’
APPEND ’ #
ELSE
BLANK # otherwise, predicate is separate
END
INSERT < X # put predicate letter back
END
ELSE
BLANK # predicate is separate
END
ELSE
BLANK # predicate is separate
END
MERGE # combine subject and predicate
APPEND !
__
The two local variables NUMBER and PERSON are for communication between the
semantic procedures for SUBJ and PRED; they are set by default to “singular” and “third
person”. The semantic procedure for SUBJ is called first with LEFT; then the semantic
procedure for PRED is called with RIGHT, but with its output in a separate buffer. This
lets us adjust the results of the two procedures before we actually merge them; here the
commands in the conditional IF-ELSE clauses are to handle a special case of elision in
French when the subject is first person singular and the verb begins with a vowel.
G:SUBJ->PRON
__
The above rule allows a subject to be a pronoun. The default semantic procedure for a
syntactic rule of the form X->Y as described above applies here, since none is supplied
explicitly.
D:i<-PRON
_
APPEND je
SET PERSON=1
__
D:you<-PRON
_
APPEND vous
SET PERSON=2
SET NUMBER=p
__
D:it<-PRON
_
APPEND il
__
D:we<-PRON
_
APPEND nous
SET PERSON=1
SET NUMBER=p
__
D:they<-PRON
_
APPEND ils
SET NUMBER=p
__
These internal dictionary grammar rules define a few of the personal pronouns in
English for translation. The semantic procedure for each rule appends the French
equivalent of a pronoun and sets the PERSON and NUMBER local variables appropriately.
Note that, if the default values for these variables apply, we can omit an explicit SET.
Continuing, we fill out the syntactic rules for our grammar.
G:PRED->VERB
__
This defines a single VERB as a possible PRED; the default semantic procedure applies
again, since no procedure is supplied explicitly here.
Now we are going to define two subprocedures needed for the semantic procedures of
our selection of French verbs.
P:plural
_
PICK PERSON (1=ons#2=ez#3=ent#)
__
P:1cnjg
_
IF NUMBER=s
PICK PERSON (1=e#2=es#3=e#)
ELSE
(plural)
END
__
Semantic subprocedures plural and 1cnjg choose an inflectional ending for the
present tense of French verbs. The first applies to most verbs; the second, to first
conjugation verbs only. We need to call them in several places below and so define the
subprocedures just once for economy and clarity.
D:sing<-VERB
_
APPEND chant # root of verb
(1cnjg) # for first conjugation inflection
__
D:have<-VERB
_
IF NUMBER=s
PICK PERSON (1=ai#2=ais#3=a#)
ELSE
IF PERSON=3
APPEND ont # 3rd person plural is irregular
ELSE
APPEND av # 1st and 2nd person are regular
(plural)
END
END
__
We are defining only two verbs to translate here. Other regular French verbs of the first
conjugation can be added by following the example above for “sing”. Their semantic
procedures will all append their respective French roots to the current output buffer and
call the subprocedure 1cnjg.
The verb AVOIR is more difficult to handle because it is irregular in most of its present
tense forms, and so its semantic procedure must check for many special cases. Every
irregular verb must have its own special semantic procedure, but there are usually only a
few dozen such verbs in any natural language.
Here is how PyElly will actually process input text with this simple grammar. The
English text typed in for translation is shown in uppercase on one line, and the PyElly
translation in French is shown in lowercase on the next line.
YOU SING
vous chantez!
THEY SING
ils chantent!
I HAVE
j’ai!
WE HAVE
nous avons!
THEY HAVE
ils ont!
The example of course is extremely limited as translations go, but the results are correct
French, unlike those in Section 3. For more comprehensive processing, we would also
take English inflectional stemming into account, use macro substitutions to take care of
irregularities on the English side like has, and handle other subtleties. We also have to
deal with various tenses other than present as well as aspect, mood, and so forth. You
should, however, be able to envision now what a full PyElly grammar should look like
for rewriting English as French; it will take much more work just to make the rules fairly
complete, but the steps would be similar to what we already have seen above.
7. Running PyElly From a Command Line
We have so far described how to set up definition text files to create the various tables to
guide PyElly operation. This section will show you how to run PyElly for actual language
analysis, but first we will have to take care of some preliminary setup. That should be
fairly straightforward, but computer novices may want to get some technical help here.
To begin with, PyElly was written entirely in Python 2.7, which seems to be the version
most widely preinstalled on computer operating systems. The latest version of Python is
3.*, but unfortunately, this is incompatible with 2.7. So make sure you have the right
version here. Python is free software, and you can download a 2.7 release from the Web,
if needed. The details for doing so will depend on your computing platform.
There is a problem with Unicode output when running version 2.7 of Python. If you try
to redirect such output to a file, you may encounter a UnicodeEncodeError because
of the defaults of your Python system configuration. To avoid this error, put the
following line into the initialization shell file that will run each time you log in.
export PYTHONIOENCODING=utf8
The shell files for PyElly integration testing will do this setup, but you will have to take
care of it yourself whenever running a PyElly application directly from a command line.
Once you have the latest Python 2.7.* version ready to go, you can download the full
PyElly package from GitHub. This is open-source software under a BSD license, which
means that you can do anything you want with PyElly as long as you identify in your
own documentation where you got it. All PyElly Python source code is free, but still
under copyright.
The Python code making up PyElly currently consists of 64 modules comprising about
16,000 source lines altogether. A beginning PyElly user really needs to be familiar with
only three of the modules.
ellyConfiguration.py - defines the default environment for PyElly processing.
Edit this file to customize PyElly to your own needs. Most of the time, you can leave
this module alone.
ellyBase.py - sets up and manages the processing of individual sentences from
standard input. You can run this for testing or make it a part of a programming
interface if you want to embed PyElly in a larger application.
ellyMain.py - runs PyElly from a command line. This is built on top of ellyBase
and is set up to extract individual sentences from continuous text in standard input.
The ellyBase module reads in *.*.elly language definition files to generate the various
tables to guide PyElly analysis of input data. Section 4 introduced three of them. For a
given application A, these will be A.g.elly, A.p.elly, and A.m.elly, with only the
A.g.elly file mandatory. Subsequent sections of this user manual will describe the
other *.*.elly definition files.
The PyElly tables created for an application A will be automatically saved in two files:
A.rules.elly.bin and A.vocabulary.elly.bin. The first is a Python pickled
file, which is not really binary since you can look at it with a text editor, but this will be
hard for people to read. The second is an actual binary database file produced by SQLite
from definitions in a given A.v.elly (see Subsection 9.4 for an explanation).
If the *.elly.bin files exist, ellyBase will compare their creation dates with the
modification dates of corresponding A.*.elly definition files and create new tables
only if one or more definition files have changed. If a *.rules.elly.bin is more
recent than a *.vocabulary.elly.bin file, then the latter will be recompiled
regardless of whether it is more recent than its corresponding *.v.elly file.
Otherwise, the existing PyElly language rule tables will be reloaded from the
*.elly.bin files.
The files *.rules.elly.bin record which version of PyElly they were created under.
If this does not agree with the current version of PyElly, then PyElly will immediately
exit with an error message that the rule file is inconsistent. To proceed, you must then
delete all of your *.elly.bin files so that they will be regenerated automatically from
your latest language definition files.
In most cases, ellyBase will try to substitute a file default.x.elly if an A.x.elly file
is missing. This may not be what you want. You can override this behavior just by
creating an empty A.x.elly file. The standard PyElly download package includes eight
examples of definition files for simple applications to show you how to set everything up
(see Section 14).
You can see what ellyBase does by running it directly with the command line:
python ellyBase.py [name [depth]]
This will first generate the PyElly tables for the specified application and provide a
detailed dump of grammar rules allowing you to see any problems in a language
definition. The default application here will be test if none is specified. Resulting
tables will be saved as *.elly.bin files that PyElly can subsequently load directly to
start up faster. If you do run the test application here, ellyBase will respond:
release= PyElly v1.4.25
system = test
. . .
>
(Initialization output has been omitted here, replaced by the ellipsis.)
After initializing, ellyBase will prompt for one sentence per input line, which it will then
try to translate. Its output will be a rewritten sentence in brackets if translation is
successful; or just ???? on failure. It will also show all parse trees built for the syntactic
analysis plus a detailed summary of internal details of parsing. The optional depth
argument above will limit how far down the reporting of parse trees will go (see -d for
ellyMain below).
For an application with batch processing of input sentences not necessarily on separate
lines, you normally will invoke ellyMain from a command line. The ellyMain.py file is
a straight Python script that reads in general text and allows you to specify various
options for PyElly language processing. Its full command line is as follows in usual Unix
or Linux documentation format:
python ellyMain.py [ -b][ -d n][ -g v0,v1,v2,…][ -p][ -noLang] [name] < text
where name is an application identifier like A above and text is an input source for
PyElly to translate. If the identifier is omitted, the application defaults to test.
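For example, to run the marking application described in Section 14 in batch mode with parse
trees turned off, redirecting output and diagnostics, a command line might look like the
following; the file names here are only placeholders:
python ellyMain.py -b -d 0 marking < input.txt > output.txt 2> diagnostics.txt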
The commandline flags here are all optional. They will have the following
interpretations in PyElly ellyMain:
-b
operate in batch mode with no prompting; PyElly will otherwise run in
interactive mode with prompting when its text input comes from a user
terminal.
-d n
set the maximum depth for showing a PyElly parse tree to an integer n.
This can be helpful when input sentences are quite long, and you do not
want to see a full PyElly parse tree. Set n = 0 to disable parse trees
completely. See Section 12 for more details.
-g v0,v1,v2,...
define the PyElly global variables gp0, gp1, gp2, ... for PyElly semantic
procedures with the respective specified string values v0, v1, v2, ...
-p
show cognitive semantic plausibility scores along with translated output.
If semantic concepts are defined, PyElly will also give the contextual
concept of the last disambiguation according to the order of
interpretation by generative semantics. This is intended mainly for
debugging, but may be of use in some applications (see disambig,
described in Section 14).
-noLang
do not assume that input text will be in English; the main effect is to turn
off English inflectional stemming (see Section 11).
When ellyMain starts up in interactive mode, you will see a banner in the following form,
with some diagnostic output replaced by . . . :
PyElly v1.4.25, Natural Language Filtering
Copyright 2014-2017 under BSD open-source license by C.P. Mah
All rights reserved
reading test definitions
recompiling grammar rules
. . .
recompiling vocabulary rules
. . .
Enter text with one or more sentences per line.
End input with E-O-F character on its own line.
>>
You may now enter multiline text at the >> prompt. PyElly will process this exactly as it
would handle text from a file or a pipe. Sentences can extend over several lines, or a
single line can contain several sentences. PyElly will automatically find the sentence
boundaries according to its current rules and divide up the text for analysis.
As soon as PyElly reads in a full sentence, it will try to write a translation to its output.
In interactive mode, this will be after the first linefeed after the sentence because PyElly
has to read in full lines. A linefeed will NOT break out of the ellyMain input processing
loop, but consecutive linefeeds will terminate a sentence even without punctuation. End
all input with an EOF (control-D on Unix and Linux, control-Z in Windows). A keyboard
interrupt (control-C) will break out of ellyMain with no further processing. PyElly
*.*.elly language definition files should be UTF-8 with arbitrary Unicode except in
grammar symbol names. As text input to translate, however, PyElly will accept only
ASCII, Latin-1 Supplement, Latin Extender A, some Extender B characters plus some
other Unicode punctuation, and Greek lowercase; all other input characters will be
converted to spaces or to underscores. The chinese application described in Section 14
uses definition files with both traditional and simplified Chinese characters in UTF-8.
All PyElly translation output will be UTF-8 characters written to the standard output
stream, which you may redirect to save to a file or pipe to modules outside of PyElly.
PyElly parse trees and error messages will also be in UTF-8 and will go to the standard
error stream, which you can also redirect. Historically, the predecessors of PyElly have
been filters, which in Unix terminology means a program that reads from standard
input and writes a translation to standard output.
Here is an example of interactive PyElly translation with a minimal set of language rules
(echo.*.elly) defining a simple echoing application with inflectional stemming:
>> Who gets the gnocchi?
=[who get -s the gnocchi?]
where the second line is actual output from ellyMain. PyElly by default converts upper
case to lower, and will strip off English inflectional endings as well as -ER and -EST. You
can get stricter echoing by turning off inflectional stemming and morphological analysis.
By default, PyElly will look for the definition files for an application in your current
working directory. You can change this by editing the value for the symbol baseSource in
ellyConfiguration.py. The various PyElly applications described in Section 14 are
distributed in the applcn subdirectory under the main directory of Python source files,
resources, and documentation. Configure here to your particular working environment.
PyElly *.py modules by default should be in your working directory, too. You can
change where to look for them, but that involves resetting environment variables for
Python itself. PyElly is written as separate Python modules to be found and linked up
whenever you start up PyElly. This is in contrast to other programming languages where
modules can be prelinked in a few executable files or packaged libraries.
There is a stripped-down implementation of top-level PyElly processing called
ellySurvey.py, which ignores sentence boundaries and omits the parsing and
rewriting of input text. Instead, this produces a listing of all the tokens found by PyElly
along with source tags indicating how each was derived. This is invoked with the
command line:
python ellySurvey.py [name] < text
where name is an application identifier and text is an input source to translate. If the
identifier is omitted, the application defaults to test.
The ellySurvey listing of tokens will have the following source tags:
Ee  by entity extraction
Fa  by finite automaton for application
Id  in internal dictionary for application
Pu  by punctuation recognizer
Un  unknown
Vt  in external vocabulary table for application
A token can have more than one source if your language rules have multiple definitions
for it; for example, a term may be in both your internal grammar dictionary and your
external vocabulary table, and it might also be recognized by entity extraction as well as
by the finite state automaton built into PyElly. Here is an example of a token listing with
the marking application:
Id on/On
Ee 09/16/____
Pu ,
Id his
Vt country
Vt take
FaId -ed
Id in
Vt at least
Fa 1500
Vt refugee/refugees
FaId -s
Vt flee/fleeing
FaId -ing
Vt war
Pu .
A token is shown in its analyzed form as it would appear in a PyElly parse tree; if this
differs from its original input form after possible macro and other transformations, then
that form is also given on the same line, separated by a slash (/). The listing makes it
easier to find problems in tokenization or vocabulary lookup. Unknown tokens are
identified in the listing, since you often want to define them explicitly.
If you are processing some large text corpus for the first time, you should always run
ellySurvey first. This should help identify the problems in that data solvable just by
vocabulary definitions, which should then make subsequent grammar definitions easier.
On the whole, PyElly gives you many options for processing natural language input. You
must, however, be comfortable with computing at the level of command lines in order to
run PyElly in ellyMain.py or ellyBase.py or ellySurvey.py. There is as yet no
graphical user interface for PyElly. The current PyElly implementation may be a
challenge to computer novices unfamiliar with Python or with commandline invocation.
8. Advanced Capabilities: Grammar
As noted above, PyElly language analysis is built around a parser for context-free
languages to take advantage of extensive technology developed for parsing computer
programming languages. So far, we have stayed strictly context-free except for macro
substitution prior to parsing and use of local variables shared by generative semantic
procedures to control translation.
You can actually accomplish a great deal with such basics alone, but for more
challenging language analysis, PyElly supports other grammatical capabilities. These
include extensions to grammar rules like syntactic and semantic features and the special
... syntactic type mentioned earlier. Other extensions related to vocabularies are
covered in the next section.
The handling of sentences and punctuation in continuous text is also normally a topic of
grammar, but PyElly breaks this out as a separate level of processing for modularity. The
details on this will be discussed in Section 11.
8.1 Syntactic Features
PyElly currently allows for only 96 distinctive syntactic types, including predefined
types like SENT and UNKN. If needed, you can get more types by redefining the variable NMAX
in the PyElly file symbolTable.py, but there is a more convenient option here. PyElly
also lets you qualify syntactic types through syntactic features, which in effect greatly
multiplies the total number of syntactic types available.
Syntactic features are binary tags for syntactic types, allowing them to be divided into
subcategories; they are best known from Noam Chomsky’s seminal work Syntactic
Structures (1957). Currently, PyElly allows up to sixteen syntactic features associated
with a subset of syntactic types. You can define those subsets and name the features
however you want. You can get more than sixteen syntactic features by redefining the
variable FMAX in symbolTable.py, but this is not recommended.
The advantage of syntactic features is that grammar rules can disregard them. For
example, a DEFINITE syntactic feature would allow definite noun phrases to be
identified in a grammar rule without having to introduce a new structural type like DNP.
Instead, we would have something like NP[:DEFINITE]. A grammar syntax rule like
PRED->VP NP will apply to NP[:DEFINITE] as well as to plain NP. With a new syntax
type like DNP, we would also have to add the rule PRED->VP DNP.
PyElly syntactic features are expressed by an optional bracketed qualifier appended to a
syntactic structural type specified in a rule. The qualifier takes the form
[oF1,F2,F3,...,Fn]
where “o” is a single-character identifier for a set of feature names for a specific subset
of syntactic types and F1, …, Fn are the actual names composed of alphanumeric
characters, possibly preceded by a prefix ‘-’ or ‘+’. For clarity, set identifiers are usually
punctuation characters, but should never be from the set { ‘+’, ‘-’, ‘*’, ‘[’, ‘]’, ‘,’ }.
Allowing multiple sets of feature names is for convenience only. Each set will have to
refer to the same FMAX feature bits defined for each phrase node in a PyElly parse tree.
When defining multiple name sets, make sure that their usage is consistent. PyElly will
reject a syntactic type occurring with syntactic feature names from more than one set
because features with the same name in different rules may refer to different bits.
Bracketed syntactic features in language rules must follow a syntactic type name with no
intervening space. Spaces may follow a comma in a list of syntactic feature names for
easier reading, but any before or after a starting left bracket ([) will be seen as an error.
A syntax rule with feature names might appear as follows:
G:NP[:DEFINITE,*RIGHT]->THE NP[:-DEFINITE,*RIGHT]
This specifies a rule of the form NP->THE NP, but with additional restrictions on
applicability. The NP as specified on the right side of the rule must have the feature
*RIGHT, but not DEFINITE. If the condition is met, then the resulting NP structure as
specified on the left of the rule is defined with the features DEFINITE and *RIGHT. The
‘:’ is the feature class identifier here for the DEFINITE and *RIGHT feature names.
PyElly puts no upper limit on the number of different feature name sets, but it is probably a good
idea to have only five or six. Just remember that syntactic features are supposed to
simplify grammars, and too many will make inheritance difficult. A syntactic feature
name should include only ASCII letters and digits, with the case of letters not mattering.
The special feature name *RIGHT (or equivalently *R) will be defined automatically for
all syntactic feature sets. Setting this feature on the left side of a syntactic rule will have
the side effect of making that constituent structure inherit any and all syntactic features
of its rightmost immediate subconstituent as specified in the rule. This provides a
convenient mechanism for passing syntactic features up a parse tree without having to
say what exactly they are.
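For instance, a rule like the following sketch, with purely illustrative type names, would let a
noun phrase built from an adjective and another noun phrase inherit whatever features that
inner NP already carries:
G:NP[:*RIGHT]->ADJ NP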
The special feature name *LEFT (or equivalently *L) will also be defined automatically.
This will work like *RIGHT, except that inheritance will be from the leftmost immediate
subconstituent. It is permissible to inherit both ways. The *LEFT and *RIGHT syntactic
features for a phrase node will be mutually exclusive, however; setting one will turn off
the other. With a one-branch rule, *LEFT and *RIGHT will be the same for inheritance,
though remaining distinct as syntactic features.
A third special feature name *UNIQUE (or equivalently *U or *X) will also be in every
PyElly syntactic feature set. Its main purpose is to prevent a phrase from matching any
other phrase in PyElly ambiguity checking while parsing; it cannot be inherited. It can,
however, be a regular syntactic feature at a leaf node where there is no inheritance. No
starred (*) special syntactic features should ever be redefined by a PyElly user. They will
always be counted in the total number of syntactic features available for any grammar.
A feature F can be marked with a ‘-’ on the left side of a grammatical rule; for example,
X[:*L,-F]->Y[:F,-G]. This has a different interpretation than that for a feature on
the right side of a rule, such as G for the syntactic category Y in the example rule. It
serves to turn off a particular feature that might have been inherited, in this case F.
You may also turn off a feature that was just set as in Z[:*R,-*R]->W[:H]; this makes
sense only with the special *L and *R syntactic features, which have side effects when
turned on that will persist even after they are turned back off. The result here is that the
resulting Z phrase node will inherit the feature H, but will not have *R also turned on in
the end. The action is inherent to the PyElly implementation of feature inheritance, but
can be helpful when you use the *L and *R features to distinguish leaf phrase nodes.
8.2 The ... Syntactic Type
When the ... type shows up in a grammar, PyElly automatically defines a syntax rule
that allows phrases to be empty. If you could write it out, the rule would take the form
...->
This is sometimes called a zero rule, which PyElly will not allow you to specify explicitly
in a *.g.elly file for any syntactic type on the left. In strict context-free grammars, any
rule having a syntactic structural type going to an empty phrase is forbidden. Such rules
are allowed only in so-called type 0 grammars, the most unrestricted of all; but the
languages described by such grammars tend to be avoided in text processing because of
the difficulty in parsing them.
With ... as a special syntactic type, however, PyElly achieves much of the power of type
0 grammars without giving up the parsing advantages of context-free grammars. The
advantage with using ... in PyElly is that it allows a grammar to be more compact
when this syntactic type is applicable. For example, suppose that we have the rules
z->x a
z->x b
z->x c
z->x d
z->a
z->b
z->c
z->d
x->unkn
x->x unkn
where unkn is the predefined PyElly syntactic category introduced in Section 4 (this will
be explained more fully in Section 9.1). Now if x is not of interest to us in an eventual
translation, then we can replace all the above with just the rules
z->... a
z->... b
z->... c
z->... d
...->unkn
...->... unkn
The ... type was intended specifically to support keyword parsing, which tries to
recognize a limited number of words in input text and more or less ignores anything
else. A PyElly grammar to support such parsing can always be written without ..., but
may be unwieldy. The doctor application for PyElly illustrates how this kind of
keyword grammar would be set up; it includes syntax rules like the following:
g:ss->x ...
__
g:x[@*right]-> ... key
__
g:...->unkn ...
The syntactic type key here represents all the various kinds of key phrases to recognize
in a psychiatric dialog; for example, “mother” and “dream”. We can get away with only
one syntactic type here because, with about a dozen syntactic features available for it, we
can distinguish between 4095 different kinds of key phrases.
The actual responses of our script will be produced by semantic procedures for the rules
defining x[@*right] phrases. Note that different responses to the same keyword must
be listed as separate rules with the same syntactic category and features. A simplified
listing of grammar rules here might be
g:sent[@*right]->ss
__
g:x->... key
__
g:key[@ 0,1]->fmly
__
g:ss[.*right]->x[@ 0, 1,-2,-3,-4,-5,-6] ...
_
append TELL ME MORE ABOUT YOUR FAMILY
__
g:ss[.*right]->x[@ 0, 1,-2,-3,-4,-5,-6] ...
_
append WHO ELSE IN YOUR FAMILY
__
d:mother <- fmly
__
g:...->unkn
__
g:...->unkn ...
__
This defines two different possible responses for key[@ 0,1] in our input. PyElly
ambiguity handling will then automatically alternate between them (see Section 10).
The grammar here is incomplete, recognizing only sentences with a single keyword and
nothing else. To allow for sentences without a keyword, we also need a rule like
g:ss->...
__
The ... syntactic type also has the restriction that you cannot specify syntactic features
for it. If you do put something like ...[.F1,F2,F3] in a PyElly rule, it will be treated as
just .... This is mainly to help out the PyElly parser, which is already working quite
hard here; but it also is because PyElly needs to use such features for its own internal
purposes (see Subsection 12.3.3).
PyElly will also block you from defining a rule like
g:...->...
__
or like
g:X->... ...
__
where X is any PyElly syntactic type, including ....
The ... syntactic type can be quite tricky to use effectively in a language description,
but it is also tricky for PyElly to handle as an extension to its basic context-free parsing.
The various restrictions here are a reasonable compromise to let us do most of what we
really need to do. See Subsection 12.3.3 for details on how PyElly parsing actually
handles grammar rules containing the syntactic type ....
9. Advanced Capabilities: Vocabulary
PyElly operates by reading in, analyzing, and rewriting out sentences. To do this, it
requires syntactic and semantic information for every text element that it encounters:
words, names, numbers, identifiers, punctuation, and so forth. Certain text elements
like punctuation will be fairly limited, but defining all the rest can be a big undertaking
even for fairly simple applications.
In all our PyElly examples here so far, we have already seen several ways of defining text
elements in a language.
An explicit D: rule in a grammar.
Assignment of syntactic information through matching of specified patterns.
Using the predefined UNKN syntactic type when a definition is lacking.
These are fine with small vocabularies, but useful natural language applications must
deal with hundreds or even tens of thousands of distinct terms. These may not fall into
obvious patterns; and stuffing them all into a *.g.elly grammar file will demand more
keyboard entry than most people care to do. Treating most text elements as UNKN is
always a fallback option, but this works well only with simple grammars.
There is no perfect solution here. PyElly can only try to provide a user enough
vocabulary definition options to make the overall task a little less painful. So, in addition
to the methods above, PyElly also incorporates builtin morphological analysis of
unknown words to infer a syntactic type, plug-in code for recognizing complex entities
like numbers, time, and dates, and vocabulary tables loaded from external databases.
These will be described in separate subsections below, but we first should explain better
how the UNKN syntactic type works.
9.1 More on the UNKN Syntactic Type
We have run across the UNKN syntactic type several times already in this manual.
Whenever text element xxxx in its input cannot be otherwise identified by PyElly, it will
be assigned the type UNKN. In effect, PyElly generates a temporary rule of the form:
D:xxxx <- UNKN
_
OBTAIN
__
Such a rule is in effect only while PyElly is processing the current input sentence.
By itself, UNKN solves nothing. It just gives PyElly a way of working with unknown
elements, and you still are responsible for supplying the grammar rules and associated
semantics to tell PyElly how to interpret a sentence having UNKN as one of its
subconstituents. The simplest possibility here is to make some guesses; for example,
G:NOUN->UNKN
__
G:VERB->UNKN
__
These two rules allow an unknown word to be treated as either a noun or a verb. So,
when given a sentence containing unknown xxxx, PyElly can try to analyze it as either a
noun or a verb. If only one results in a successful parse, then we have managed to get
past the problem of having no definition for xxxx. If neither works out, we have lost
nothing; if both work out, then PyElly can try to figure out which is the more plausible
using the cognitive semantic facilities described in Section 10.
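For example (a sketch only, with arbitrary scoring values), the fixed-scoring cognitive
semantic clauses described in Section 10 could be added to these rules to bias PyElly
toward the noun reading of an unknown word whenever both analyses succeed:
G:NOUN->UNKN
>>+
__
G:VERB->UNKN
>>-
__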
9.2 Breaking Down Unknown Words
An unknown word can also be resolved by looking at how it is put together. For
example, the word UNREALIZABLE may be missing from a vocabulary, but it could be
broken down into UN+ +REAL -IZE -ABLE, allowing us to identify it as an adjective
based on the root word REAL, which is more likely to be defined already in a vocabulary.
PyElly develops this idea further, as described in the immediately following subsections.
Text document search engines fifty years ago were already using word analysis to reduce
the size of their keyword indexes. This was to manage the many variations a search term
might take: MYSTERY versus MYSTERIES as well as MYSTIFY, MYSTICISM, and
MYSTERIOUS. Since these all revolve around a common concept, many system builders
opt to reduce them all to the single term MYSTERY in a search index. This is also helpful
for maximizing the number of relevant documents retrieved for a query.
Consequently, many kinds of rule- and table-driven word stemming exist. These can
often be rather crude, like the Porter algorithm, but stemming can be done much more
reliably if we work long enough at refining the rules. For English at least, PyElly now
has two quite separate tools for analyzing the structure of words as a way of
dealing with unknown terms.
9.2.1 Inflectional Stemming
An inflection is a change in the form of a word reflecting its grammatical use in a
sentence. Indo-European languages, which include English, tend to be highly inflected;
and in instances like Russian, the form of most words can vary greatly to indicate
person, number, tense, aspect, mood, and case. Modern English, however, has kept only
a few of the inflections of Old English, and so it has been easier to formulate rules to
characterize how a particular word can vary.
PyElly inflectional stemming currently recognizes only five endings for English: -S, -ED,
-ING, -N, and -T. These each have their own associated stemming logic and also share
additional logic related to recovering the uninflected form of a word. All that logic is
based on American English spelling rules and recognizing special cases. PyElly
coordinates its execution through the module inflectionStemmerEN.py.
If an unknown word ends in -S, -ED, -ING, -N, or -T, PyElly will apply the logic for the
ending to see whether it is an inflection and, if so, what the uninflected word should be.
Though such logic is necessarily incomplete, it has been refined by forty years of use in
various systems and is generally accurate for American spellings of most English words.
For example,
winnings ==> win -ing -s
placed ==> place -ed
judging ==> judge -ing
cities ==> city -s
bring ==> bring
sworn ==> swear -n
meant ==> mean -t
PyElly stemming will automatically prepend a hyphen (-) on any split off word ending
so that it can be recognized. The original word in the PyElly input stream is then
replaced by the uninflected word followed by the removed endings as shown. Each
ending will be taken as a separate token in PyElly parsing.
In some applications, you may just want to ignore the removed word endings, but these
can be quite valuable for figuring out unknown words. The -ED, -ING, -N, and -T
endings indicate a verb, and you can provide grammar rules to exploit that syntactic
information. For example,
D:-ED <- ED
__
D:-T <- ED
__
D:-N <- ED
__
G:VERB[|ED]->UNKN ED
__
To use English inflectional stemming in PyElly, set the language variable in the
ellyConfiguration.py file to EN. To override such stemming just for a particular
word, define that word in a vocabulary table entry so that it will also be known in its
inflected form. This does not work for D: internal dictionary entries.
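For example, a minimal sketch of that setting in ellyConfiguration.py, assuming the
value is written as an ordinary Python string:
language = 'EN'   # select English inflectional stemming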
The logic for an ending X is defined by a text file X.sl loaded by PyElly at runtime. You
can also define your own inflectional stemming logic by editing the current *.sl files or
by writing new ones. The current files for English are Stbl.sl, EDtbl.sl,
INGtbl.sl, Ntbl.sl, Ttbl.sl, rest-tbl.sl, spec-tbl.sl, and undb-tbl.sl.
To do inflectional stemming for a new language ZZ, you will have to write the *.sl files
and an inflectionStemmerZZ.py. Use inflectionStemmerEN.py as a model here.
Here is a segment of actual logic from Stbl.sl, which tells PyElly what to check in
identifying a -S inflectional ending when it is preceded by an IE. The literal strings for
comparison in the logic below have their characters in reverse order because PyElly will
be matching from the end of a word towards its start.
IF ei
   IF tros {SU}
   IF koo {SU}
   IF vo {SU}
   IF rola {SU}
   IF ppuy {SU}
   IF re
      IF s
         IF im {SU 2 y}
         END {FA}
      IF to {SU}
      END {SU 2 y}
   IF t
      IS iu {SU 2 y}
      LEN = 6 {SU}
      END
   END {SU 2 y}
This approximately translates to
if you see an IE at the current character position, back up and
   if you then see SORT, succeed.
   if you then see OOK, succeed.
   if you then see OV, succeed.
   if you then see ALOR, succeed.
   if you then see YUPP, succeed.
   if you then see ER, back up and
      if you then see S, back up and
         if you then see MI, succeed, but drop the word’s last two letters and add Y.
         otherwise fail.
      if you then see OT, then succeed.
      otherwise succeed, but drop the word’s last two letters and add Y.
   if you then see T, then back up and
      if you then see an I or a U, then succeed, but drop the word’s last two letters and add Y.
      if the word’s length is 6 characters, then succeed.
   otherwise succeed, but drop the word’s last two letters and add Y.
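For instance, consider the word CITIES from the examples given earlier. After the -S
ending is split off, the logic above works roughly as follows: it finds IE at the end of
CITIE and backs up; SORT, OOK, OV, ALOR, YUPP, and ER all fail to match, but T does,
and the IS iu test then sees the I before it. The clause succeeds with {SU 2 y},
confirming -S as an inflection, dropping the final IE, and adding Y, which yields the
analysis cities ==> city -s shown above.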
This stemming logic is equivalent to a finite state automaton (FSA). Its operation should
be fairly transparent, although the total number of different rules for English inflections
has grown to be quite extensive. You may nevertheless eventually run into a case that is
handled incorrectly and that you will want to add to the rules. Make sure, however, to
test out every change so that you can avoid making other things worse.
9.2.2 Morphology
Morphology in general is about how words are put together, including processes like
BLACK + BIRD ==> BLACKBIRD, EMBODY + -MENT ==> EMBODIMENT, and
KOREAN + POP ==> K-POP. PyElly morphological analysis is currently limited to that
involving the addition of prefixes or suffixes to a root, which is not necessarily a word.
The morphology component of PyElly started out as a simple FSA stemmer that served
just to remove common endings from English words, including -S, -ED, and -ING. It has
now evolved to focus on non-inflectional endings and to output the actual affixes
removed as well as the final root form.
Earlier above, we saw “unrealizable” broken down into UN+, +REAL, -IZE, and -ABLE.
True morphological analysis here would also tell us that the -IZE suffix changes the
word REAL into a verb, the -ABLE suffix changes the verb REALIZE into an adjective
again, and the prefix UN+ negates the sense of the adjective REALIZABLE. This is what
PyElly can now do, which is useful in figuring out the syntactic type of unknown words.
For an application A, PyElly will work with prefixes and suffixes through two language
rule tables defined by files A.ptl.elly and A.stl.elly, respectively. These are akin
to the grammar, macro substitution, and word pattern tables already described. We
have two separate files here because suffixes tend to be more significant for analyses
than prefixes, and it is common to do nothing at all with prefixes.
PyElly morphological analysis will be applied only to words that otherwise would be
assigned the UNKN syntactic type after all other lookup and pattern matching has been
done. The results will be similar to what we see with inflectional stemming; and to take
advantage of them, you will also have to add grammar rules that recognize prefixes
and suffixes and incorporate them into an overall analysis of an input sentence.
9.2.2.1 Word Endings (A.stl.elly)
PyElly suffix analysis will be done after any removal of inflectional endings. For
application A, the A.stl.elly file guiding this will contain a series of patterns and
actions like the following:
abular 2 2 le.
dacy 1 2 te. 1
entry 1 4 .
gual 2 3 . 0a
ilitation 2 6 &,
ion 2 0 .
lenger 2 5 . 0e
oarsen 1 5 .
piracy 1 4 te. 1
santry 1 4
tention 1 3 d.
uriate 2 2 y.
worship 0 0 .
|carriage 0 0 .
|safer 1 5 . 0e
Each line of a *.stl.elly file defines a single pattern and actions upon matching. Its
format is as follows from left to right:
A word ending to look for. This does not have to correspond exactly to an actual
morphological suffix; the actions associated with an ending will define that suffix. The
vertical bar (|) at the start of a pattern string matches the start of a word.
A single digit specifying a contextual condition for an ending to match: 0= always do
no stemming for this match, 1= no conditions, 2= the ending must be preceded by a
consonant, and 3= the ending must be preceded by a consonant or U.
A number specifying how many of the matched characters to keep as
part of a word after removal of a morphological suffix. A starting vertical bar (|) in a
listed ending will count as one character here.
A string specifying what letters to add to a word after removal of a morphological
suffix. An & in this string is conditional addition of e in English words, applying a
method defined in English inflectional stemming.
A period (.) indicates that no further morphological analysis is to be applied to the result
of matching a suffix rule and carrying out the associated actions; a comma (,) here
means to continue morphological analysis recursively.
A number indicating how many of the starting characters of the unkept part of a
matching ending to drop to get a morphological suffix to be reported in an analysis.
A string specifying what letters to add to the front of the reduced unkept part of a
matching ending in order to make a complete morphological suffix.
In applying such pattern rules to analyze a word, PyElly will always take the longest
match. For example, if the end of a word matches the LENGER pattern above, then
PyElly will ignore the shorter matches of a ENGER pattern or a GER pattern.
In the LENGER rule above, PyElly will accept a match at the end of a word only if it is
preceded by a consonant in the word. On a match, the rule specifies to keep 5 of the
matched characters in the resulting root word. From the rest of a matched ending,
PyElly will drop no characters, but add an E in front to get the actual suffix removed.
So the word CHALLENGER will be analyzed as follows according to the suffix patterns
above:
CHAL LENGER (split off matched ending and check preceding letter)
CHALLENGE R (move five characters of matched ending to resulting word)
CHALLENGE ER (add E to remaining matched ending to get actual suffix -ER)
The period (.) in the action for LENGER specifies no further morphological analysis.
With a comma (,), PyElly would continue, possibly producing a sequence of different
suffixes by reapplying its rules to the word resulting from preceding analyses. This can
continue indefinitely, with the only restriction being that PyElly will stop trying to
remove endings when a word is shorter than three letters.
To recognize the stripped off morphological suffixes in a grammar, you should define
rules like
D:-ion <- SUFFIX[:NOUN]
__
and then add G: grammar rules for dealing with these syntactic types as in the case of
inflections. For example,
G:NOUN->UNKN SUFFIX[:NOUN]
__
A full grammar would of course have to be ready to deal with many different
morphological suffixes.
The PyElly file default.stl.elly is a comprehensive compilation of English word
endings evolving over the past fifty years and covering most of the non-foreign irregular
forms listed in WordNet exception files. If there is more than one possible analysis,
PyElly can make no rule, so that RENT is not reduced to REND. If you actually want to
force a decision here, then you must supply your own grammar rule to do it.
The default.stl.elly file also includes transformations of English irregular
inflectional forms, which actually involve no suffix removal. For example, DUG becomes
DIG -ED. This cannot be handled by PyElly inflectional stemming logic.
9.2.2.2 Word Beginnings (A.ptl.elly)
For prefixes, PyElly works with patterns exactly as with suffixes, except that they are
matched from the beginning of a word. For example
contra 1 0 .
hydro 1 0 .
non 2 0 .
noness 1 3 .
pseudo 1 0 .
quasi 1 0 .
retro 1 0 .
tele 1 0 .
trans 1 0 .
under 1 0 .
The format for patterns and actions here is the same as for word endings. As with
endings, PyElly will take the action for the longest pattern matched at the beginning of a
word being analyzed.
Prefixes will be matched after suffixes and inflections have been removed. Removing a
prefix must leave at least three characters in the remaining word. Actions associated
with the match of a prefix will typically be much simpler than those for suffixes, and
rules for prefixes will tend to be as simple as those in the example above.
PyElly removal of prefixes will be slightly different from that for suffixes. With suffixes, the
word STANDING is analyzed as STAND -ING, but with the prefix rules above,
UNDERSTAND would become UNDER+ +STAND. Note that a trailing + is used to
mark a removed prefix instead of a leading - for suffixes.
In the overall scheme of PyElly processing of an unknown word, inflections are checked
first, then suffixes, and finally prefixes. If there is any overlap between the suffixes and
the prefixes here, then inflections and suffixes take priority.
For example, NONFUNCTIONING becomes NON+ +FUNCT -ION -ING with the
morphology rules above. A grammar would then have to stitch these parts back together
in an analysis.
For prefixes here, you will need a dictionary rule like
D:non+ <- PREFIX[+NEG]
__
and you should by now know how to supply the required grammar rules yourself.
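As a sketch only (the type names X and ING below are illustrative and not predefined by
PyElly), grammar rules stitching NON+ +FUNCT -ION -ING back together might look like
D:-ing <- ING
__
G:X->UNKN SUFFIX[:NOUN]
__
G:X->X ING
__
G:X[+NEG]->PREFIX[+NEG] X
__
building on the D: rules for non+ and -ion already shown.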
9.3 Entity Extraction
In computational linguistics, an entity is some phrase in text that stands for something
specific that we can talk about. This is often a name like George R. R. Martin or North
Carolina or a title like POTUS or the Bambino; but it also can be insubstantial like Flight
VX 84, 888-CAR-TALK, 2.718281828, NASDAQ APPL, or orotidine 5'-phosphate.
The main problem with entities is that almost none of them will normally be found in a
predefined vocabulary. People seem to handle them in stride while reading text,
however, even when they are unsure what a given entity means exactly. This is in fact
the purpose of much text that we read: to inform us about something we might be
unfamiliar with. A fully competent natural language system must be able to function in
this kind of situation.
At the beginning of the 21st Century, systems for automatic entity extraction from text
were all the rage for a short while. Various commercial products with impressive
capabilities came on the market, but unfortunately, just identifying entities is
insufficient to build a compelling application, and so entity extraction systems mostly
fell by the wayside in the commercial marketplace. In a tool like PyElly, however, some
builtin entity extraction support can be quite valuable.
9.3.1 Numbers
PyElly no longer has a predefined NUM syntactic type. The PyElly predecessor written in
C did have compiled code for number recognition, but this covered only a few possible
formats and was dropped later in Jelly and PyElly for a more flexible solution. If you
want PyElly to recognize literal numbers in text input, you must make use of special
patterns in files *.p.elly as described in Section 4.2.
PyElly, however, also has gone further here. It also has some builtin capabilities for
automatic normalizations of number references so that you need fewer patterns to
recognize them. In particular,
Automatic stripping out of commas in numbers as an alternative to doing this with
special pattern matching:
1,000,000 ==> 1000000.
Automatic mapping of spelled out numbers to a numerical form:
one hundred forty-third ==> 143rd
fifteen hundred and eight ==> 1508
Here you still need patterns to recognize the rewritten numbers so that PyElly can
process them. You can disable all such number rewriting by setting the variable
ellyConfiguration.rewriteNumbers to False.
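For example, this one-line setting in ellyConfiguration.py would turn the rewriting off:
rewriteNumbers = False   # disable automatic rewriting of numbers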
9.3.2 Dates and Times
Dates and Times could be handled as PyElly patterns, but their forms can vary so much
that this would take an extremely complicated finite-state automaton. For example, here
are just two of many possible kinds of dates:
the Fourth of July, 1776
2001/9/11
To recognize such entities, the PyElly module extractionProcedure.py defines
some date and time extraction methods written in Python that can be called
automatically when processing input text.
To make such methods available to PyElly, they just have to be listed in the
ellyConfiguration.py module. Here is some actual Python code to do so:
import extractionProcedure
extractors = [ # list out extraction procedures
[ extractionProcedure.date , 'date'] ,
[ extractionProcedure.time , 'time']
]
You can disable date or time extraction by just removing its method name from the
extractors list. The second element in each listed entry is a syntax specification string,
generally indicating a syntactic category plus syntactic features to be given to a
successfully extracted entity; these should be coordinated with other PyElly grammar
rules. An optional third element is a semantic feature specification string, which can be
‘-’ for no features. An optional fourth element is an integer plausibility score.
The date and time methods above are part of the standard PyElly distribution. These
will do some normalization of text before trying to recognize dates and times. Dates will
be rewritten in the form
mm/dd/yyyyXX
For example, 09/11/2001ad. Times will be converted to a 24-hour notation
hh:mm:ssZZZ
For example, 15:22:17est. If date or time extraction is turned on, then your grammar
rules should expect to see these forms when a generative semantic procedure
executes an OBTAIN command. The XX epoch indicator in a date and the ZZZ zone
indicator in a time may be omitted in PyElly input.
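For example (a sketch only; NP here is an assumed syntactic type in your own grammar),
a grammar coordinating with the date extractor listed above might include a rule like
G:NP->DATE
__
so that an extracted and normalized date can take its place in a larger sentence analysis.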
9.3.3 Names of Persons (A.n.elly)
Natural language text often contains the names of persons and of things. These can be
handled in various ways within PyElly, but names generally will present unique
problems for both syntactic and semantic analysis. For example, entirely new names or
old names with unusual spellings often show up in text, but it is hard to anticipate them
in a vocabulary table or even a pattern table. Also, a name can appear in multiple forms
in the same text: Joanne Rowling, J.K. Rowling, Rowling, Ms. Rowling.
To help out here, PyElly incorporates a capability for heuristically recognizing personal
names and their variations in natural language text. This will automatically be
configured into PyElly whenever a user runs an application A that includes a A.n.elly
rule file. Name recognition will run independently of other PyElly language analysis, but
will create parse tree leaf nodes with the syntactic type NAME for any names that it is
able to identify.
Each rule in an A.n.elly file will be in one of two forms:
X : T
=PPPP
The first form associates a type with a specified name component; it consists of a pattern
X for a component followed by a colon (:) with optional spaces around it and followed
by a name component type T. The second form lists a phonetic pattern PPPP (see below)
that will be used to validate inferred component types that are otherwise unknown; the
pattern must be preceded by an equal sign (=) with no space after it.
9.3.3.1 Explicit Name Component Patterns and Types
PyElly predefines 12 component types for name recognition; these are not syntactic
categories. Currently these are indicated by three-letter identifiers as follows:
REJ    reject name with this component
STP    stop any scan for a name
TTL    a title like “Captain”
HON    an honorific like “The Honorable”
PNM    a personal name
SNM    a surname
XNM    a personal name or surname
SNG    possible single name
INI    an initial like “C.”
REL    a relation like “von”
CNJ    a conjunction like ‘y’
GEN    a generation tag like ‘Jr.’
One of these types must be the T part of an X : T rule in an A.n.elly file. Anything else
will cause an error exception during table generation. The case of an identifier here will
be unimportant.
The X pattern part of an X : T rule must be a string of ASCII letters possibly including
spaces; a string without spaces can also optionally start or end with a + or -. The possible
forms of a pattern are listed at the end of this subsection, after the example rules below.
The X part of a type rule will always delineate a single name component, although this
might have multiple parts like de la, as in George de la Tour. The basic idea here is to
provide the various possible parts of a name, which will then be combined by
hard-coded PyElly logic to report actual names and name fragments in input text within
the existing framework of PyElly entity extraction.
Here are some name rules in a PyElly *.n.elly file:
# simple name table definition
# example.n.elly
John : PNM
Smith : SNM
Kelly : XNM
Mr. : TTL
Sir : TTL
III : GEN
de la : REL
Fitz- : SNM
+son : SNM
-aux : SNM
y : CNJ
prince: SNG
university : REJ
With these rules, “Fitzgerald” and “FitzABBA” will be recognized as surname
components, while “Peterson” will be recognized as a surname only if “Peter” is also
recognized as a name component. Upper and lower case will not matter in the rules
here, nor will the ordering of the rules. Comments for documentation take the same
form as in other PyElly rule files, a line starting with ‘#’ or the rest of a line after ‘ # ‘.
The possible forms for the X part of a name rule are as follows:
abc de    matches the exact string “abc de”
abc-      matches a string of letters starting with “abc”
abc+      matches a string of letters starting with “abc”, but the rest of
          the string must also match another table entry
-abc      matches a string of letters ending with “abc”
+abc      matches a string of letters ending with “abc”, but the rest of
          the string must also match another table entry
9.3.3.2 Implicit Name Components
In any PyElly application that has to recognize personal names, the most reliable
approach is to maintain lists of the most commonly expected name components. These
are fairly easy to compile with the resources available on the Worldwide Web, but no
listing here will ever be complete. Various rules of thumb can help us to find unknown
names, but this is guessing, and we can afford to make only a few mistakes here.
For example, if every capitalized word is a possible name component, then we can get
text items like ABC, The, University, Gminor, and Ltd. Such results will diminish the
value of the true names that we do find. So, we have to be quite strict about the criteria
for judging a string to be a possible name component:
1. The string is alphabetic, with at least four letters. Anything shorter can be listed
explicitly in a name table to eliminate guessing.
2. Its first letter is capitalized. An explicitly known name component can leave off
capitals, but any inference of a name must have as much support as possible.
3. Its adjacent digraphs (e.g. ab, bc, and cd in the string abcd) are all among the common
digraphs for first names in the 2010 U.S. census when a candidate string has six or
fewer characters. It may have all but one of its digraphs be common when a string
has seven or more characters.
4. It occurs along with at least one explicitly known name component. That is, a name
cannot consist completely of inferred name components.
5. If the first three conditions above are met, and its phonetic signature matches the
signature for common name components explicitly known, then an inferred name
component can also be used to corroborate another inferred component with respect
to condition 4.
A PyElly phonetic signature is based on a kind of Soundex encoding of a name
component. This is a method of approximating the pronunciation of names in English
by mapping its consonants to phonological equivalence classes. That is a big mouthful,
but in classical Soundex, its six equivalence classes are actually understandable:
{ B , F , P , V }
{ C , G , J , K , Q , S , X , Z }
{ D , T }
{ L }
{ M , N }
{ R }
In Soundex, all consonants of an equivalence class map into its representative letter,
shown boldface above. All other letters are ignored, and two consecutive letters going
to the same class will have only a single representative: “Brandt” becomes BRNT.
Soundex also prepends the first letter of a name to get the complete code bBRNT, but
PyElly simplifies that scheme by prepending an ‘a’ only if the first letter is a true vowel.
Otherwise, no extra letter is added.
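To make the equivalence-class idea concrete, here is a minimal Python sketch of a
classical Soundex-style signature. This is for illustration only and is not PyElly's actual
implementation; the representative letters are chosen so that the Brandt example above
comes out as BRNT, and the class splitting and other refinements described next are
omitted.
# map consonants to a representative letter for each Soundex-style class
CLASSES = { 'B': 'BFPV', 'K': 'CGJKQSXZ', 'T': 'DT',
            'L': 'L', 'N': 'MN', 'R': 'R' }
REP = { c: rep for rep, letters in CLASSES.items() for c in letters }

def signature ( name ):
    out = []
    for c in name.upper():
        r = REP.get(c)                       # vowels and other letters map to nothing
        if r is not None and (len(out) == 0 or out[-1] != r):
            out.append(r)                    # collapse consecutive letters of one class
    if name[:1].upper() in 'AEIOU':
        out.insert(0,'a')                    # prepend 'a' only for an initial true vowel
    return ''.join(out)

print(signature('Brandt'))                   # prints BRNT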
To be more phonetic, PyElly will split the biggest Soundex equivalence class so that
hard-C and hard-G are together in a new class with representative K and soft-C and
soft-G are together in another new class with representative S. This complicates the
mapping of consonants to equivalence classes, but is still fairly easy to implement. So
“Eugene” becomes aSN and “Garibaldi” becomes KRPLT.
PyElly will also encode the letters ‘h’, ‘w’, or ‘y’ when used as semi-consonants as H, W,
or Y. So both “Rowen” and “Rowan” become RWM, while “Foyer” becomes FYR, and
“Ayer” becomes aYR. The three extra semi-consonant equivalence classes here allow for
finer phonetic distinctions than with plain Soundex.
Finally, PyElly will transform certain letter combinations in the spelling of names in
English to get phonetic signatures better representing their pronunciation. For example,
“Alex” becomes aLKS, “Eustacia” becomes YSTS, and “Wright” becomes RT.
The necessary phonetic signatures for supporting inferred name components will have to
be listed in the definition file for a PyElly name table. They should appear one per line
starting with an equal sign (=) to distinguish them from the listing of explicit name
components and their types. For example,
=aSN
=KRPLT
After getting known or inferred name components, PyElly will string together as many
as possible to make a complete name. This will be done under the following constraints:
a. Any particular name component type may occur only once, unless the occurrences
are consecutive.
b. A TTL name component will always start the accumulation of the next name; a GEN
will always end any name being accumulated.
c. A CNJ or REL cannot be at the end of a name.
d. There must be at least one instance of PNM, SNM, XNM, or SNG, or a TTL and an INI.
e. A name with a single component must be a SNG or locally known (see below).
When any complete name is accepted, all of its individual components will be
remembered in a non-persistent local PyElly table. This will be kept only until the end of
the current PyElly session.
Name recognition is implemented as part of PyElly entity extraction. When the ellyBase
module sees a *.n.elly definition file to load, it will automatically put
nameRecognition.scan on its list of extractors with the NAME syntactic type. Any
recognized name will then enter PyElly sentence analysis just like any other kind of
entity. In particular, any longer text element found at the sentence position for a
recognized name will supersede the name.
9.3.4 Defining Your Own Entity Extractors
You can write your own entity extraction methods in Python and add them to the
extractors list for PyElly in ellyConfiguration.py. This should be done as follows:
1. The name of a method can be anything legal in Python for such names.
2. The method should be defined at the level of a module, outside of any Python class.
This should be in a separate Python source file, which can then be imported into
ellyConfiguration.py.
3. The method takes a single argument, a list of individual Unicode characters taken
from the current text being analyzed. PyElly will prepare that list. The method may
alter the list as a side effect, but you will have to be careful in how you do this if you
want the changes to persist after returning from the method. That is because Python
always passes arguments to a method by value.
4. The method returns the count of characters found for an entity or 0 if nothing is
found. The count will always be from the start of an input list after any rewriting. If
no entity is at the current position in the input text, return 0.
5. If a non-zero character count is returned, these characters are used to generate a
parse tree leaf node of a syntactic type specified in the ellyConfiguration.py
extractors list.
6. PyElly will always apply entity extraction methods in the order that they appear in
the extractors list. Note that any rewriting of input by a method will affect what a
subsequent method will see. All extractor methods will be tried.
7. An extraction method will usually do additional checks beyond simple pattern
matching. Otherwise, you may as well just use PyElly finite-state automatons
described in Section 4.2.
8. Install a new method by editing the extractors list in the PyElly module
ellyConfiguration.py to append a method and a syntax specification to the list.
You will have to import the actual module containing your method. For example, to
add a method called yourModule.yourExtractor to the default list of PyElly
entity extractors
import yourModule
extractors = [ # entity extraction procedures to use
[ extractionProcedure.date , 'date' ] ,
[ extractionProcedure.time , 'time' ] ,
[ yourModule.yourExtractor , 'type' ]
]
The module extractionProcedure.py defines the method stateZIP, which looks
for a U.S. state name followed by a five- or nine-digit postal ZipCode. This will give you a
model for writing your own extraction methods; it is currently not installed.
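As a further illustration only (the module name myExtractors and the method hashtag
below are hypothetical and not part of PyElly), a minimal extractor following the
conventions above might look like this:
# hypothetical module myExtractors.py
def hashtag ( listing ):
    """
    look for a Twitter-style hashtag at the front of the input
    arguments:
        listing  - list of Unicode chars remaining in current text
    returns:
        count of chars in a recognized hashtag, 0 if none
    """
    if len(listing) < 2 or listing[0] != '#':
        return 0                             # no hashtag at current position
    n = 1
    while n < len(listing) and (listing[n].isalnum() or listing[n] == '_'):
        n += 1                               # accept letters, digits, and underscore
    return n if n > 1 else 0                 # need at least one char after '#'
It would then be registered in ellyConfiguration.py with something like
import myExtractors
extractors.append( [ myExtractors.hashtag , 'tag' ] )
where tag is an assumed syntactic type that your grammar rules would have to handle.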
9.4 PyElly Vocabulary Tables (A.v.elly)
PyElly can maintain large vocabulary tables in external files created and managed with
the SQLite package, a standard part of Python libraries. PyElly formerly used Berkeley
Database for vocabulary, but changes in its open-source licensing made it awkward for
unencumbered educational use. For more details here, please refer to Appendix C.
You can run PyElly without vocabulary tables, but these can make life easier for you
even when working with only a few hundred different terms. They provide the most
convenient way to handle multi-word terms and terms including punctuation. They also
can be more easily reused with different grammar tables and generally will be more
compact and easier to set up than D: rules of a grammar. Without them, PyElly will be
limited to fairly simple applications.
PyElly vocabulary table entries will be defined in a *.v.elly definition file where each
entry can be only a single text line and can have only extremely limited semantics. This
is mainly so that one may generate large numbers of entries automatically through
scripts. For example, the PyElly distribution file default.v.elly has 155,229 entries
generated with bash shell scripts from WordNet 3.0 data files.
Each vocabulary entry in a PyElly *.v.elly definition file should be a single text line
that usually takes one of the following formats:
TERM : SYNTAX
TERM : SYNTAX =TRANSLATION
TERM : SYNTAX x=Tx, y=Ty, z=Tz
TERM : SYNTAX (procedure)
Other formats will be described later in Subsection 10.3.2.
The TERM : SYNTAX part is mandatory for a vocabulary entry. A TERM can be
Lady Gaga
Lili St. Cyr
Larry O’Doule
“The Robe”
ribulose bisphosphate carboxylase oxygenase
The ‘ : ’ lets PyElly know when a term ends; no wildcards are allowed in a TERM, and it
may start with a letter, digit, or a punctuation mark. Upper and lower case in a TERM will
not matter; PyElly will always ignore case when matching up entries with input text.
An external vocabulary entry must start with a letter or digit or one of the following
punctuation marks: period (.), hyphen-minus (-), left double quotation mark (“),
comma (‘,’), percent sign (%), em dash (—), and hyphen (‐). The last is Unicode and is
treated differently by PyElly than the ASCII hyphen-minus character. You can change
this list by editing the assignment to initChar in the source code for vocabularyTable.
In general, a vocabulary table entry may contain spaces and arbitrary punctuation.
This is not allowed with internal dictionary rules or with FSA-defined patterns.
SYNTAX is just the usual PyElly specification of syntactic type plus optional syntactic
features; for example, VERB[^PTCL].
The final translation part of a vocabulary entry is optional and can take one of the forms
shown above. If the translation is omitted, then the generative semantic procedure for
the entry will be just the operation OBTAIN. This is equivalent to no translation at all.
An explicit TRANSLATION is a literal string to be used in rewriting a vocabulary entry;
the ‘=’ is mandatory here. The x=Tx, y=Ty, z=Tz alternate form is a generalization of
the simpler TRANSLATION; it maps to the generative semantic operation
PICK lang (x=Tx#y=Ty#z=Tz#)
One of the x or y or z in the translation options of a vocabulary entry can be the null
string. In this case, the PICK operation will treat the corresponding translation as the
default to be taken when the value of the lang PyElly local variable is undefined or
matches none of the other specified options.
If there is no default and lang is undefined or not set to one of the pick options, then a
null translation will be selected. This may sometimes be helpful, but is probably not
what you really want.
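For example (a sketch only; the exact layout simply follows the x=Tx pattern above with
an empty option name), an entry with a default might be written along these lines:
horse : noun FR=cheval, ES=caballo, =horse
Here the final option with an empty left side would supply the translation used whenever
the local variable lang is undefined or set to something other than FR or ES.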
A (procedure) in parentheses is a call to a generative semantic subprocedure defined
elsewhere in a *.g.elly grammar rule file.
Here are some full examples of possible vocabulary table entries in a *.v.elly file:
Lady Gaga : noun =Stefani Joanne Angelina Germanotta
Lili St. Cyr : noun[:name] 0
horse : noun FR=cheval, ES=caballo, CN=马, RU=лошадь
twerk : verb[|intrans]
All references to syntactic types, syntactic and semantic features, and procedures will be
stored in a vocabulary table as an encoded numerical form according to a symbol table
associated with a PyElly grammar.
Syntactic features must be immediately after a syntactic category name with no space in
between. Otherwise, PyElly will be unable to differentiate between syntactic and
semantic features in a *.v.elly file. Individual syntactic or semantic features inside of
brackets may be preceded by a space, however.
Unlike the dictionary definitions of words in a grammar, there are no permanent rules
associated with the terms in a vocabulary table. When a term is found in a vocabulary
table, PyElly automatically generates a temporary internal dictionary rule to define that
term. This rule will persist only for the duration of the current sentence analysis.
Often, a vocabulary table may have overlapping entries like
manchester : noun [^city]
manchester united : noun [^pro,soccer,team]
PyElly will always take the longest matching entry consistent with the analysis of an
input sentence and ignore any shorter matches. Long matches will also supersede
shorter matches from other PyElly lookup methods.
When comparing a vocabulary entry with text, the entry must match in the text up to a
nonalphanumeric delimiter. That is, the entry “jigsaw puzzle” will match the text “jigsaw
puzzle?”, but not “jigsaw puzzlement”. An exception is made here, though, for inflectional
endings and the English possessive endings -’S and -S’. This will let “jigsaw puzzle”
match “jigsaw puzzles”, “jigsaw puzzled”, “jigsaw puzzling”, “jigsaw puzzle’s”, and
“jigsaw puzzles’”. PyElly will take these endings into account automatically.
You can turn off this English inflectional analysis by using a subclass of the PyElly
VocabularyTable that has its doMatchUp() method overridden appropriately. The
default inflectional analysis complicates vocabulary lookup, but can be helpful in
providing limited world knowledge to facilitate PyElly parsing.
An external vocabulary entry can contain arbitrary characters that PyElly normally
would take to mean that a word or other token has ended and that another has started.
For example, the vocabulary entry “R&B” can be matched as a whole in text even though
PyElly would analyze “R&B” in text as separate “R”, “&”, and “B” without the vocabulary
entry. This capability can be quite useful in many PyElly applications. A vocabulary
entry may not match across sentences, however.
For a given application A, PyElly first will look for vocabulary definitions in A.v.elly.
If this is missing, default.v.elly is taken, which includes most of the nouns, verbs,
adjectives, and adverbs in WordNet 3.0. Always define A.v.elly if you do not want a
huge vocabulary, which can take a long time to load. PyElly will save a compiled
vocabulary data base for an application A in the file A.vocabulary.elly.bin. If you
change A.v.elly, PyElly will automatically recompile any A.vocabulary.elly.bin
at startup. Recompilation will also happen if A.g.elly has changed. Otherwise, PyElly
will just read in the last saved A.vocabulary.elly.bin.
Note that the A.vocabulary.elly.bin file created by PyElly must always be paired
only with the A.rules.elly.bin file it was created with. This is because syntactic
types and features are encoded as numbers in *.elly.bin files, which may be
inconsistent when they are created at different times. If you want to reuse language
rules, always start from the *.*.elly files. If PyElly has to recompile
A.rules.elly.bin at startup, then it will also automatically recompile
A.vocabulary.elly.bin. It will also recompile if the grammar rule file is more
recent than the vocabulary rule file.
10. Logic for PyElly Cognitive Semantics
The generative semantic part of a grammar rule tells PyElly how to translate its input
into output, while the cognitive semantic part evaluates the plausibility for any
particular translation. Generative semantics is always the final step in PyElly processing;
cognitive semantics will run each time a new phrase node is created in PyElly sentence
analysis. This prepares PyElly to resolve any subsequent ambiguities.
With a large grammar, we cannot expect every input sentence to break down in only one
way into subconstituents according to the rules of that grammar. In most languages, for
example, a particular word might be assigned multiple parts of speech, and each
possibility here can result in a different syntactic analysis for an input sentence. All such
alternate analyses must be evaluated to find the best interpretation of a sentence.
PyElly takes a wait-and-see approach in ambiguous situations. Multiple interpretations
might exist at lower levels of a parse tree, but some could end up not fitting into any
final analysis for an entire sentence. In this case, an ambiguity will resolve itself within a
bigger context, and so we really want to hold off any decision on interpretation until as
late as possible.
PyElly looks at differing interpretations only when it comes across two or more phrase
nodes of the same syntactic type with the same syntactic features over the same tokens
of an input sentence, and without the *unique feature set. At that point, PyElly will
compare the cognitive semantic plausibility scores already computed for each alternate
phrase node and then go forward only with the highest ranking of them to reach a full
sentence analysis.
There is only one exception to this requirement for exact matching of syntactic category
and features. At the end of processing a sentence, PyElly may have two or more phrase
nodes of type SENT over the entire sentence, but with different syntactic features. PyElly
will then ignore any differing features here and just choose one interpretation according
to its semantic plausibility.
PyElly will see ambiguity only when its language rules actually allow for it. For instance,
“I love rock” could be about music or landscaping in normal human understanding, but
if the grammar rules of a language definition fail to produce two different syntactic
analyses with co-extensive phrase nodes of the same type and same features, PyElly will
see only a single interpretation. If you want to find ambiguity here, you must provide
PyElly with two separate vocabulary entries for “rock.”
Let us consider the following simple set of grammar rules:
g:sent[:*r]->x
__
g:sent[:*r]->y
__
d:wwww<-x[:f]
_
(xgen)
__
d:wwww<-y[:g]
_
(ygen)
__
where wwww is an internal dictionary word associated with two different syntactic types
x and y. A sentence consisting only of the word wwww will therefore be ambiguous at the
lowest level of analysis because the generative semantics for the overall sentence must
call either (xgen) or (ygen) as a subprocedure, but not both.
Two PyElly sentence analyses are possible here, given the rules for inheriting syntactic
features through the predefined *R syntactic feature described in Subsection 8.1:
SENT[:*r,f]        SENT[:*r,g]
     |                  |
   X[:f]              Y[:g]
     |                  |
   wwww                wwww
There are no PyElly ambiguities here, however, because none of the constituents in the
two alternate analyses of the sentence “wwww” have the same syntactic type and the same
syntactic features. We do, however, end up with two possible parse trees for the type
SENT at the end of processing and will then choose one of them from which to produce a
translation here regardless of syntactic features.
The choice between alternate sentence analyses or between different interpretations of
individual phrases will be through a numerical plausibility score assigned to each phrase
node in a parse tree. A score of 0 here will be neutral, increasingly positive will be more
plausible, and increasingly negative will be more implausible. Various characteristics of
a phrase node can be examined to decide whether to increase or decrease its
plausibility. The highest plausibility score wins.
This section of the PyElly User’s Manual tells how to assign a score to each possible
subtree in a PyElly sentence analysis based on the rules from the PyElly language definition
used to generate the analysis. You can choose to write language descriptions without
such scoring, but plausibility is built into the PyElly parsing algorithm and provides
another way to control how PyElly runs.
The PyElly plausibility score for an analyzed constituent of a sentence will always be an
integer value. The score for a particular phrase node generally will add up the scores of
its immediate phrase subconstituents plus an adjustment from the cognitive semantics
of the grammatical rule combining those subconstituents into one resulting phrase.
For example, suppose that we have a constituent described by a grammar rule A->X Y.
We will expect that plausibility scores were computed already for subconstituent X and
for subconstituent Y in earlier PyElly parsing. So we can then run the cognitive semantic
logic for the grammar rule A->X Y, producing an adjustment to the summed
plausibility scores for X and Y to get an overall plausibility score for our phrase of type A.
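For instance (with made-up numbers), if X previously got a score of +1 and Y a score of
0, and the applicable cognitive semantic clause for A->X Y specifies an adjustment of -2,
then the new phrase of type A receives a plausibility score of 1 + 0 - 2 = -1.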
With competing analyses, if only one phrase has the top score, PyElly chooses it and is
done. If more than one phrase has the highest score, then PyElly will arbitrarily pick one
of them. When there are multiple choices and they differ by at most one between the
highest score and the next highest, PyElly will also tag the rule for the phrase that was
picked. This will then favor a different choice the next time a similar ambiguity arises.
10.1 The Form of Cognitive Semantic Clauses
PyElly allows us to define special logic for determining the contribution of a grammar
rule to an overall plausibility score for a phrase that it describes. The logic consists of a
series of clauses, each one specifying the conditions under which the clause will apply.
PyElly cognitive semantics will check the clauses in sequence; the first to have all its
conditions satisfied will be applied. All other clauses will be disregarded.
In a *.g.elly file providing grammar rules for a PyElly language description, the
clauses in the cognitive semantic logic for a rule will come just after the G: or a D: line
introducing the rule and will end at the first _ or __ line (see Section 4). Each clause will
be a single line separated into two parts by the character sequence ‘>>’, with its left part
containing zero or more conditions for applicability and its right part specifying an
action to take and an adjustment to apply. For example, here is a rule with three
cognitive semantic clauses, but no explicit generative semantics:
G:NP->ADJ NOUN
L[^EXTENS] R[^ABSTRACT]>>*R-
R[^GENERIC] >>*L+
>>*R
__
The ‘>>’ is mandatory in any clause even if there are no conditions. If one or more
conditions are specified, they must all be satisfied for the clause to take effect. Spaces
between the conditions of the left part of a clause are optional.
The conditions here can be of three types: (1) testing the semantic features associated
with the constituents to be combined by a grammar rule into a single phrase; (2)
checking the starting position of the constituents, the number of characters in resulting
phrase, and the total number of tokens covered by the constituents combined; and (3)
measuring the semantic distance between the concepts associated with each constituent
in the case of a 2-branch rule.
Having nothing on the left side of a clause is an always-satisfied condition, and this will
always make any following ones irrelevant. It is possible also that no listed clauses for a
rule apply. In that case, a zero plausibility adjustment is assumed for a phrase. That is,
the sum of subconstituent plausibility scores will be reported for the phrase.
Each of the three types of conditions for a clause will have a distinct form (see below).
You may freely mix all the different types in the left side of a single clause. The
conditions will be independent and may appear in any order. Be careful, however, to
avoid contradictory conditions, which can never be satisfied.
A special cognitive semantic clause allows for tracing the execution of logic in which it
appears. It will have the fixed form
?>>?
The ? condition on the left side is always False, so that the right side of the clause will never
be run. This clause will have the side-effect, however, of getting PyElly to identify the
phrase and generating rule being processed. Here is a sample trace message:
tracing phrase 0 : rule= 29 with current bias=0
cog sem at clause 4 of 4
l: phrase 1 @0: type=0 syn[00 00] sem[00 00] : bia=0 use=0
r: phrase 2 @1: type=0 syn[00 00] sem[00 00] : bia=0 use=0
raw plausibility= 0
adjustment= 1 sem[00 00]
3 token(s) spanned @0
This specifies the phrase node where the cognitive semantic logic is being executed, the
grammar rule generating the node, and the subconstituents involved. If an action is then
taken subsequently for the kth clause out of n in the logic being traced, then this will be
reported as in the second line. In general, this will be as follows:
cog sem at clause k of n
Whether or not the action for any clause is taken, tracing output will then show the total
computed plausibility increment plus the resulting semantic features (see Subsection
10.2.3 below) for a phrase.
incremental scoring= -2 sem[24 00]
Cognitive semantic tracing will show how a PyElly parse tree is being assembled node by
node. Generative semantic tracing will show the order of execution for individual
semantic procedures and subprocedures after a parse tree is completely built.
A grammar rule may have cognitive semantic clauses even if it has no explicit generative
semantic procedure. In this case, the listing of clauses will be terminated by a double
underscore (__) line without a preceding single underscore (_).!
10.2 Kinds of Cognitive Semantic Conditions
PyElly cognitive semantics mainly shows up in the grammar rule logic for evaluating the
plausibility of phrase nodes in an analysis, but can appear elsewhere in PyElly as well.
Four different kinds of conditions are currently supported: semantic features, starting
token position, token and character counts, and semantic concepts. Cognitive semantic logic can
also have no conditions, which allows a grammar rule to have fixed scoring.
10.2.1 Fixed Scoring
The simplest and most common cognitive semantic clause will have no condition. This is
always satisfied and will specify fixed positive or negative adjustment for a grammar
rule when computing plausibility scores in a phrase analysis. Such clauses may take one
of the following forms:
>>-
>>+
>>+++
>>----
>>+5
>>-20
The initial + or - signs are mandatory in the scoring. A string of n +’s or -’s is equivalent
to +n or -n. Here is an example of use in a grammar rule:
G:NP->ADJ UNKNOWN
>>-- # cognitive semantics disfavoring this rule by -2
_
RIGHT # generative semantics
LEFT #
__
If no cognitive semantic clauses are specified for a grammar rule, this is equivalent to
>>+0
a special case of fixed scoring. Note that the “+” is necessary if you actually want to
be explicit about a zero score. This can be expressed more simply, however, by
specifying no scoring increment or decrement, which defaults to zero.
10.2.2 Starting Position, Token Count, and Character Count
Each PyElly phrase node in a parse tree records its starting token position and the
number of input tokens and the number of text characters that it encompasses. A
cognitive semantic scoring for a phrase can be made conditional by comparing its starting
token position p, its token count n, or its character count c to some reference value. This
comparison goes on the left side of the >> in a clause. For example,
p<1 >> -1
n>1>>+1
n<8>>+1
c>4>>-1
n>1n<8 >> +2
n>2 n<7 >> +1
A test of starting token position is with p< or p>, of token count with n< or n>, and of
character count with c< or c>. In the last clause above, a phrase will be scored as +1
only if its token count is 7 > n > 2. You may insert spaces in any clause for readability,
but no spaces are allowed before or after a < or a > in any comparison.
Normally a leaf node in a parse tree will have a token count of 1, but a leaf node with
the ... syntactic type may have a count of 0. Note also that a multiword vocabulary
table entry like flash flood is always counted as a single token. Character count will
include embedded spaces in multiword vocabulary, but will count no text spaces
between separate tokens.
10.2.3 Semantic Features
Semantic features are similar to syntactic features as defined above in Section 8, but
play no role in distinguishing between different grammar rules. They are assigned to
particular phrase nodes and are specified in the same bracketed notation as syntactic
features; for example:
[&ANIMATE,ARMORED]
The & is the feature set identifier, and ANIMATE and ARMORED are two specific features.
Semantic features will have completely separate lookup tables from syntactic features.
In particular, a syntactic feature set and a semantic feature set can have the same set
identifier without any conflict, but it is best to make the identifiers different just for clarity.
As with syntactic features, you may have up to 16 semantic feature names, with the rules
for legal names being the same. Semantic feature names must include only ASCII letters
or digits, with the case of letters not mattering, just as with syntactic feature names.
Unlike syntactic features, however, they will have only one predefined feature name:
*CAPITAL or equivalently *C, which indicates that a phrase is capitalized and so is
probably a name or a proper noun. This name is reserved in every semantic feature set.
10.2.3.1 Semantic Features in Cognitive Semantic Clauses
A cognitive semantic clause for a 2-branch splitting G: grammar rule will have the
following general form when semantic features appear:
L[oLF1,...,LFn] R[oRF1,...,RFn]>>x[oF1,...,Fn]#
Semantic features can appear on both the left and the right side of a clause. The symbol
“o” is a feature set identifier; “x” may be either *L or *R, and the “#” is a fixed scoring
action as in Subsection 10.1 above; for example, +++ or -3.
Here is an example:
l[!coord]r[$*c]>>*r-2
This tells PyElly that a phrase with a left part marked as coord and a capitalized right
part will inherit the semantic features from the right part and lose two points from its
semantic plausibility score.
As with syntactic features, a semantic feature F can be preceded by a ‘-’ in a clause. On
the left side, this means that feature F must not be associated with a matching phrase
structure. On the right side, this means that any inherited F must be turned off in the
new phrase being created for a grammar rule.
The prefixes L and R on the left side of a clause specify the constituent substructures to
be tested, respectively left and right descendant. You may test none, one, or both
descendants in a 2-branch rule; this is the most common kind of condition when you
want something different from fixed scoring.
The “x” prefix on the right is optional for specifying inheritance of features. An *L
means to copy the semantic features of the left subconstituent into the current phrase;
*R means to copy the right. You cannot have both; a missing “x” means no inheritance
at all. Any explicit semantic feature appearing in the right part of the clause will indicate
any additional features to turn on or off for a phrase node.
A full cognitive semantic clause for a 1-branch extending G: grammar rule will have the
following general form in its semantic features:
L[oLF1,...,LFn]>>x[oF1,...,Fn]#
Here is an example:
L[^animate]>>*L[^actor]+1
A D: grammar rule defines a phrase without any constituent substructures. The
semantic features in a clause must take the form
>>[oF1,...,Fn]#
That is, you can set semantic features for a D: rule, but may not test or inherit any. Here
is an example
>>[^animate]+1
For both splitting and extending grammar rules, any part of the left side of a cognitive
semantic clause can be omitted. If everything is omitted here, then the clause always applies;
it becomes a case of fixed scoring.
10.2.3.2 Semantic Features in Generative Semantics
PyElly also allows a generative semantic procedure to look at the semantic features for
the phrase node to which it is attached. This is done in a special form of the IF command
where the testing of a local variable is replaced by the checking of semantic features as
done in cognitive semantics. For example,
IF [&F1,-F2,F4]
(do-SOMETHING)
END
The testing here is like that on syntactic features to determine the applicability of a 1- or
2-branch grammar rule in PyElly parsing. The IF here cannot be negated with ~. If you
want negation, you have to specify it for the individual features.
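As a hedged illustration of how such a test might be embedded (the rule below and the
names animate and mark-animate are hypothetical, and mark-animate would have to be
defined as a generative semantic subprocedure elsewhere in the grammar):
g:vp->verb np
_
LEFT
SPACE
RIGHT
IF [&animate]
(mark-animate)
END
__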
10.2.4 Semantic Concepts
PyElly is currently being used experimentally to apply conceptual information from
WordNet to infer the intended sense of ambiguous words in English text. Semantic
concepts are now an option for cognitive semantics in PyElly and may appear on both the
left and right sides of a clause attached to a grammar rule, as well as in other places.
PyElly allows you to establish a set of concepts, each identified by a unique alphanumeric
string and related to one another by a conceptual hierarchy defined in the language
description for an application. This can be done any way that you want, but WordNet
provides a good starting point here since it contains over two hundred thousand
different related synonym sets (or synsets) as potential concepts to work with.
(WordNet is produced manually by professional lexicographers affiliated with the
Cognitive Science Laboratory at Princeton University and is an evolving linguistic
resource now at version 3.1. [George A. Miller (1995). WordNet: A Lexical Database for
English. Communications of the ACM Vol. 38, No. 11: 39-41.] This notice is required by
the WordNet license.)
In WordNet, each possible dictionary sense of a term will be represented as a set of
synonyms (synset) in a given language. Each synset can be uniquely identified as an offset
into one of the four WordNet data files associated with the main parts of speech: data.noun,
data.verb, data.adj, and data.adv.
For disambiguation experiments in PyElly, trying to work with all the synsets of
WordNet is too cumbersome. So we instead focused on concepts from a small subset of
WordNet synsets related to interesting kinds of ambiguity in English. We can identify
each such concept as an 8-digit decimal string combining the unique WordNet offset for
its corresponding synset plus a single appended letter to indicate its part of speech. For
example,
13903468n : (=STAR) a plane figure with 5 or more points; often used as an emblem
01218092a : (=LOW) used of sounds and voices; low in pitch or frequency
The standard WordNet letter abbreviation for a part of speech is n = noun, v = verb, a =
adjective, r = adverb.
For any set of such concepts, we can then map selected semantic relations for them from
WordNet into a simple PyElly conceptual hierarchy structure, which will be laid out in a
PyElly language definition file A.h.elly. The current disambig example application
in the PyElly package has a hierarchy with over 800 such related concepts, all taken
from WordNet 3.1.
You can of course also define your own hierarchy of concepts with its own set of semantic
relations. The only restriction here is that each concept name must be an alphanumeric
string like aAA0123bcdef00; upper and lower case of letters will be ignored. Such semantic
concepts can be employed explicitly on the left side of cognitive semantic clauses; on the
right side, they can be employed explicitly in one way and implicitly in two ways.
10.2.4.1 Concepts in Cognitive Semantic Logic
Semantic concepts serve to provide another aspect to consider when computing
plausibility scores to choose between alternate interpretations in case of ambiguity. This
will happen in the cognitive semantic logic associated with each syntax rule, and the
concepts will have different roles when they are on the left and on the right sides of a
cognitive semantic clause.
10.2.4.1.1 Concepts on the Left Side of Cognitive Semantic Clauses
The left half of a clause is for testing its applicability to a particular phrase, and PyElly
allows the semantic concepts associated with its subconstituents to be checked out. The
syntax here is similar to how you test semantic features of subconstituents, except you
will use parentheses ( ) to enclose a concept name instead of the [ ] around semantic
features. Here is an example of a concept check:
L(01218092a) R(13903468n) >>+
This checks whether the left subconstituent of a phrase has a concept on a path down
from concept 01218092a in a conceptual hierarchy and whether the right
subconstituent has a concept on a path down from concept 13903468n. The ordering of
testing here does not matter, and you may omit either the L or the R test or both.
You can mix concept testing with semantic feature testing in the conditional part of a
cognitive semantic clause. For example,
L(01218092a) L[^PERSON] >>++
You may also specify more than one concept per test. For example,
L(00033319n,08586507n) >>+
Here, PyElly will check for either L(00033319n) or L(08586507n).
You can of course define more self-descriptive concept names for your own application;
you are not limited to WordNet 3.1 synset IDs.
10.2.4.1.2 Concepts on the Right Side of Cognitive Semantic Clauses
A single concept can be explicitly appended on the right side of a clause with a
separating space. For example,
>>++ CONCEPT
This must always come after a plausibility scoring expression. If you want neutral
scoring here, you must specify it explicitly as
>>+0 CONCEPT
Normally, this kind of concept reference will be useful only for the cognitive semantics
of D: dictionary rules of a grammar, but nothing keeps you from trying it out in G: rules
as well.
Concepts can also be referenced implicitly on the right side of a clause. When a
subconstituent of a phrase has an associated concept, the *L or *R inheritance actions
specified by a clause will apply to concepts as well. So, a clause like
>> *L++
will cause not only the inheritance of semantic features from a left subconstituent, but
also the inheritance of any semantic concept from that left subconstituent. That is also
true for *R with a right subconstituent.
To use semantic concepts on the right side of a clause, you generally must use the *L or
*R mechanism even if you have no semantic features defined. This must be done to pass
concepts in a parse tree for later checking. Note that you cannot have both *L and *R in
a cognitive semantic clause.
Semantic concepts also implicitly come into play in two ways when PyElly is computing
a plausibility score for a phrase:
1. When a subconstituent has a semantic concept specified, PyElly will check whether
it is on a downward path from a concept previously seen in the current or an earlier
sentence. PyElly will maintain a record of such previous concepts to check against. If
such a path is found, the plausibility score of a phrase will be incremented by one. If
a phrase has one subconstituent, the total increment possible here is 0 or 1; if the
phrase has two, the total increment could be 0, 1, or 2.
2. If a phrase has two subconstituents with semantic concepts, PyElly will compute a
semantic distance between their two concepts in our inverted tree by following the
upward paths for each concept until they intersect. The distance here will be the
number of levels from the top of the tree to the point of intersection. If the
intersection is at the very top, then the distance will be zero. The lower the
intersection in the tree, the higher the semantic relatedness. This distance will be
added to the plausibility score of a phrase containing the two subconstituents.
If no semantic concepts are specified in the subconstituents of a phrase, then a semantic
plausibility score will be computed exactly as before.
10.2.4.2 Language Definition Files for Semantic Concepts
To use semantic concepts, you must define them in a PyElly language definition. For an
application A, this must happen in the files A.h.elly, A.g.elly, or A.v.elly. They
can be omitted entirely if you have no interest in them.
10.2.4.2.1 Conceptual Hierarchy Definition (A.h.elly)
This specifies all the concepts in a language definition and their semantic relationships.
You can define everything arbitrarily, but to ensure consistency, start from some
existing language database like WordNet. Here are some entries from
disambig.h.elly, a conceptual hierarchy definition file based on WordNet 3.1
concepts for a PyElly example application:
14831008n > 14842408n
14610438n > 14610949n
00033914n > 13597304n
05274844n > 05274710n
07311046n > 07426451n
07665463n > 07666058n
04345456n > 02818735n
03319968n > 03182015n
08639173n > 08642231n
00431125n > 00507565n
The “>” separates two concept names to be interpreted as a link in a conceptual
hierarchy, where the left concept is the parent and the right concept is a child. In this
particular definition file, each concept name is an offset in a WordNet 3.1 part of speech
data file plus a single letter indicating which part of speech (n, v, a, r). Both offset and
part of speech are necessary to identify any WordNet concept uniquely.
For convenience, a *.h.elly file may also have entries of the form
=xxxx yyyy
=zzzz wwww
These let you define equivalences of concept names, where the right concept becomes
the same as the left. For example, the entries above make yyyy the same as xxxx and
wwww the same as zzzz. The left concept must occur elsewhere in the hierarchy
definition, though. An equivalence can be specified anywhere in a *.h.elly file. It
specifies only a convenient alias for a concept name without defining a new concept,
which can be helpful in documentation of semantic relationships.
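For example, assuming the concept 13377127n has already been defined somewhere in the
hierarchy, you could give it a more readable alias (the name funds here is purely
illustrative):
=13377127n funds
After this entry, funds can be written wherever the concept 13377127n is wanted.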
10.2.4.2.2 Semantic Concepts in Grammar Rules (A.g.elly)
This was already discussed briefly in Subsection 10.2.4.1 above. Here is an example of a
grammar dictionary rule with cognitive semantics referencing semantic concepts:
D:xxxx <- NOUN
>>+0 CONCEPT
_
APPEND xxxx-C
__
Similarly with a regular syntax rule:
G:X -> Y Z
>>*L+0 CONCEPT
_
RIGHT
SPACE
LEFT
__
Note here that the *L action will also cause any concept associated with Y to be
inherited by X, but the explicit assignment of CONCEPT here will always override any
such inheritance.
10.3 Adding Cognitive Semantics to Other PyElly Tables
Earlier sections described various PyElly language definition tables, but without
mention of cognitive semantics, since they were as yet undefined. Because cognitive
semantics helps in resolving ambiguity, however, we now need to take care of that
omission in how we define PyElly language elements.
Currently, vocabulary tables, pattern tables, and entity extractors all optionally allow
fixed plausibility scoring and semantic features in definitions. You can also specify
semantic concepts in vocabulary tables; but elsewhere, you can bring them in only
through special grammar rules for the syntactic types associated with them.
10.3.1 Cognitive Semantics for Vocabulary Tables
Vocabulary table rules can include all basic cognitive semantics, but no logic. This is
because vocabulary can appear only at the bottom of a PyElly parse tree.
10.3.1.1 Semantic Features in Vocabulary Table Entries (A.v.elly)
In addition to the rule forms listed in Subsection 9.4, PyElly also recognizes
TERM : SYNTAX SEMANTIC-FEATURES PLAUSIBILITY
TERM : SYNTAX SEMANTIC-FEATURES PLAUSIBILITY =TRANSLATION
TERM : SYNTAX SEMANTIC-FEATURES PLAUSIBILITY x=Tx, y=Ty, z=Tz
TERM : SYNTAX SEMANTIC-FEATURES PLAUSIBILITY (procedure)
SEMANTIC-FEATURES is the bracketed list of semantic features for cognitive semantics (see
Subsection 10.2); this field can be “0” or “-” if no features are set. PLAUSIBILITY is an integer
value for scoring a phrase formed from a vocabulary entry; this value may include an
attached semantic concept name separated by a “/” (see Subsection 10.3). Both
SEMANTIC-FEATURES and PLAUSIBILITY may be omitted, but if either is present,
then the other must be also.
Here are some expansions of vocabulary definitions from Subsection 9.4:
Lady Gaga : noun [^celeb] =Stefani Joanne Angelina Germanotta
Lili St. Cyr : noun[:name] [^celeb] 0
horse : noun [$animate] FR=cheval, ES=caballo, CN=󵏸, RU=лошадь
twerk : verb[|intrans] [^sexy] (xxxx)
Semantic features are helpful in distinguishing words with multiple senses as multiple
vocabulary table entries; for example,
bank : noun [^institution]
bank : noun [^geology]
bank : verb [|intrans]
If the word BANK shows up in input text, then all of these entries will be tried out in
possible PyElly analyses, with the most plausible taken for final PyElly output.
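A hedged illustration of how cognitive semantics could then exploit those features: in a
hypothetical grammar rule such as G:NP->DET NOUN (the rule and the DET type are
illustrative only), a clause like
r[^institution]>>*r+1
would add a point of plausibility whenever the right constituent carries the institution
feature, so the financial sense of BANK would tend to win out over the geological one in
that context.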
10.3.1.2 Semantic Concepts in Vocabulary Table Entries (A.v.elly)
For a vocabulary table entry, we extend the plausibility field in an A.v.elly input file
to allow appending a concept name separated by a “/” (See Subsection 9.4 above).
Omitting a concept name here will be equivalent to a null concept.
Here are some entries from disambig.v.elly, a vocabulary table definition file
making use of the concepts above.
finances : noun[:*unique] - 0/13377127n =funds0n
monetary resource : noun[:*unique] - 0/13377127n =funds0n
cash in hand : noun[:*unique] - 0/13377127n =funds0n
pecuniary resource : noun[:*unique] - 0/13377127n =funds0n
assets : noun[:*unique] - 0/13350663n =assets0n
reaction : noun[:*unique] - 0/00860679n =reaction0n
response : noun[:*unique] - 0/00860679n =reaction0n
covering : noun[:*unique] - 0/09280855n =covering0n
natural covering : noun[:*unique] - 0/09280855n =covering0n
cover : noun[:*unique] - 0/09280855n =covering0n
The *UNIQUE syntactic feature in each entry is to disable PyElly ambiguity resolution at
lower levels of sentence analysis, a requirement for the disambig example application.
The translation provided for each entry above is the WordNet offset designation for a
particular word sense plus a single letter specifying its part of speech.
You may name the concepts in your own A.h.elly hierarchy definitions however you
wish, but with two exceptions: the name “-” will be reserved to denote a null concept
explicitly in grammar rules; and the name “^” will be reserved for the top of a hierarchy
to which every other concept is eventually linked. You must have “^” somewhere in an
A.h.elly hierarchy definition file for it to be accepted by vocabularyTable.py as a
language definition file.
10.3.2 Cognitive Semantics for Pattern Rules
PyElly pattern rules for determining the syntactic type of an arbitrary string were first
presented in Section 4.2. With the definition of semantic features and plausibility scores
here, we can now also talk about how pattern rules can specify them. This can be done
by inserting one or two extra fields into a pattern rule specification. The four-part
format defined in Section 4.2 is still valid.
STATE PATTERN SYNTACTIC TYPE NEXT
PyElly will allow two other formats as well.
STATE PATTERN SYNTACTIC TYPE SEMANTIC FEATURES NEXT
STATE PATTERN SYNTACTIC TYPE SEMANTIC FEATURES SCORE NEXT
a. The STATE, PATTERN, SYNTACTIC TYPE, and NEXT are the same as before.
b. SEMANTIC FEATURES is a bracketed list of feature names as seen in a vocabulary rule or
in the cognitive semantics of a grammar rule. These features are optional and may be set
or left unset.
c. SCORE is a possibly signed integer value to be assigned as an initial plausibility for a
token matching a pattern in a final state.
d. If a SCORE is specified in a pattern rule, then the SEMANTIC FEATURES must also be
present. As in the case of a vocabulary entry, a simple - placeholder can be specified
here to indicate no setting of features.
Here are some examples of pattern rules with cognitive semantics:
0 &#@$ XID [!nom] -1
3 \#&#$ ORD - 2 -1
3 \#&#@#$ ORD - -1 -1
11 &@mab$ PHRM[:sgl] [$cancer] 1 -1
Note that PyElly allows basic cognitive semantics only at a final state of a pattern-
matching automaton. A final state is always indicated with a -1 as the next state of a
pattern rule, which will always be the last part of a rule. An error will be flagged if
cognitive semantics appears in a rule with any other next state.
Assigning a base cognitive semantic score allows you to favor or disfavor a syntactic type
assigned to a token by pattern matching versus by another possibility like lookup in an
internal dictionary or in an external vocabulary table. Semantic features allow pattern-
matched interpretations of a token to be tested by the cognitive semantic logic in rules
when ambiguity resolution is required.
10.3.3 Cognitive Semantics for Entity Extraction
You can define the cognitive semantics for an entity extractor listed in the PyElly
module ellyConfiguration.py. This can be done by adding one or two optional
elements in a list entry for an extractor to specify semantic features or a plausibility
scoring. For example, we can rewrite the listing of Subsection 9.3.2 as
import extractionProcedure
extractors = [ # list out extraction procedures
    [ extractionProcedure.date , 'date' , '-' , 0 ] ,
    [ extractionProcedure.time , 'time' , '[^x]' ]
]
Note that semantic features must be specified as a bracketed string in the third list
element and a plausibility scoring must be specified as an integer in the fourth list
element. Semantic features may not be omitted if a plausibility scoring is present; but
these can be specified as a non-committal ‘-’ as done in vocabulary table entries.
Name recognition operates the same way as entity extraction. When turned on by a
commandline flag, PyElly will insert the method nameRecognition.scan into the
extractors array in the PyElly module ellyConfiguration.py, but without
specifying any cognitive semantics.
You could insert the method yourself and specify cognitive semantics at the same time,
but PyElly does the insertion itself because it must also initialize a name table from the
A.n.elly definition file for a given application A, and this has to be properly coordinated
with the initialization of other language definition tables.
If you absolutely must have cognitive semantics for name recognition, you should leave
ellyConfiguration.py alone. Instead, edit ellyBase.py to add semantic features
or plausibility scoring where it appends to the ellyConfiguration.extractors
array during initialization. You can also just define a grammar rule with the cognitive
semantics you want as an extension of any NAME leaf node of a parse tree.
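A hedged sketch of that last approach, where the NAMEPH type, the ^ feature set, and the
PERSON feature are all hypothetical:
g:nameph->name
>>[^person]+2
_
LEFT
__
This sets semantic features and a plausibility increment on a new phrase built over any
NAME leaf node, without touching any Python code.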
11. Sentences and Punctuation
Formal grammars typically describe the structure only of single sentences in a language.
PyElly accordingly is set up to analyze one sentence at a time through its ellyBase
module. In real-world text, however, sentences are all jumbled together, and we have to
divide them up properly before doing anything with them. That task is harder than one
might think; for example,
I met Mr. J. Smith at 10 p.m. in St. Louis.
This sentence contains six periods (.), but only the final one should stop the sentence. It
is not hard to recognize non-stopping exceptions, but this is yet one more detail to take
care of on top of the many other basic tasks of natural language processing.
Furthermore, we have to deal with special instances of punctuation like
He said “No.”
(Turn to page 6.)
where punctuation after the stop (.) really should be tacked onto the sentence that it
ends. PyElly is prebuilt to handle all of this.
PyElly divides text input into sentences with its ellySentenceReader module, which
looks for various patterns of punctuation to detect sentence boundaries in text. While
doing this, PyElly also normalizes each sentence to make it easier to process. The
algorithm is simple and tends to find too many sentences, but we can always help PyElly
out with various special-case logic to give it more smarts.
Currently, the PyElly stopException module lets a user provide a list of patterns to
determine whether a particular instance of punctuation like a period (.) should actually
stop a sentence. The PyElly exoticPunctuation module tries to normalize various kinds
of unorthodox punctuation found in informal text. This solution is imperfect, but we can
extend or modify it as needed. See Subsection 11.2 below for details.
The approach of PyElly here is to provide sentence recognition a notch or two better
than what one can cobble together just using Python regular expressions or the standard
sentence recognition methods provided by libraries in a language like Java. If you really
need more than this, then there are other resources available; for example, NLTK can be
trained on sample data to discover its own stop exceptions. Builtin PyElly sentence
recognition should be adequate for most applications, however.
PyElly sentence reading currently operates as a pipeline configured as follows:
raw text => ellyCharInputStream => ellySentenceReader => ellyBase
where raw text is an input stream of Unicode encoded as UTF-8 and read in line by line
with the Python readline() method. The ellyCharInputStream module is a filter that
removes extra white space, substitutes for Unicode characters not recognized by PyElly,
converts hyphens used as dashes when appropriate, and replaces single new line
characters with spaces. The ellyCharInputStream and ellySentenceReader modules both
operate at the character level and together will divide input text into individual
sentences for subsequent PyElly processing.
A single input line could contain multiple sentences, or a single sentence may extend
across multiple input lines. There is also no limit on how long an input line may be; it
could be an entire paragraph terminated by a final linefeed as found in many word
processing files. PyElly can also read text divided into short lines terminated by
linefeeds, carriage returns, or carriage returns plus linefeeds. It currently will not splice
back a hyphenated word split across two lines, however.
The ellySentenceReader module currently recognizes five kinds of sentence stopping
punctuation: period (.), exclamation point (!), question mark (?), colon (:), and
semicolon (;). By default, any of these followed by whitespace except for a Unicode thin
space will indicate the end of a sentence. A blank line consisting of two new line
characters together will imply the end of a sentence without any final punctuation.
The ellyMain module, the standard top-level module for PyElly, employs
ellySentenceReader. You can run ellyMain interactively from a keyboard, but since it
expects general text input, you may have to add an extra <RETURN> to get the module to
recognize the end of a sentence and start processing.
11.1 Basic PyElly Punctuation in Grammars
The PyElly punctuationRecognizer module automatically defines a small set of single
Unicode characters as punctuation for text. These include the stop punctuation already
recognized by ellySentenceReader, plus comma (,), brackets and parentheses ([ ] and ( )),
apostrophe ('), and double quote (") in ASCII, and a few non-ASCII Unicode characters like
the curly quotation marks (“ and ”) seen in formatted text. These are defined in the Python source file
punctuationRecognizer.py.
The punctuationRecognizer module is a builtin extension of the grammar rules in a
X.g.elly definition file for a PyElly application X. It has the effect of automatically
creating default internal dictionary entries for single-character punctuation in every
PyElly application. This is currently biased toward English, but can be adapted for other
languages by changing the basic punctuationRecognizer table and recompiling or by
adding explicit internal dictionary rules to override the table.
The punctuationRecognizer module can be replaced in PyElly by a stub with an empty
table and a match() method that always returns False. In that case, you will have to
supply all your own punctuation rules, but most of the time, you can just take the PyElly
defaults. This was the approach in all eleven of the functioning example applications in
the current PyElly distribution package.
All predefined PyElly single- and multi-character punctuation will be associated with
the syntactic type PUNC. If you want to make your own system of punctuation, define
your own syntactic types here and then use them in your grammar and vocabulary rules.
You can even reuse PUNC, but remember that this will come with prior definitions.
PyElly also will qualify the syntactic type PUNC with syntactic features under the ID
[| ] and semantic features under the ID [! ]. These are pre-defined in the Python
code of PyElly and can be changed only there. Their names currently are as follows:
syntactic feature   Indication
start               can start a sentence
stop                can stop a sentence
*l                  is a left bracket or quotation mark (special use of predefined feature name)
*r                  is a right bracket or quotation mark (special use of predefined feature name)
quo                 is a quotation mark
com                 is a comma
hyph                is a hyphen
emb                 can be included within a PyElly token
*x                  indicates a square bracket when occurring with *L or *R, a period when
                    occurring with STOP, or an em dash otherwise (special use of predefined
                    feature name)

semantic feature    Indication
brk                 can divide a sentence without ending it

You can add your own features under the [| ] or [! ] IDs, but you will have to give them
new unique names and assign them only to feature bits that are still free. It is easy to get
into trouble when you do not know exactly what you are doing.
Remember that punctuation, like all other input text elements, may be translated by
PyElly into something else in its output, or it may be kept unchanged. You will have to
decide on the proper action and lay out the necessary rules for PyElly.
Including the full punctuationRecognizer module in PyElly is equivalent to putting the
following internal dictionary rules into every A.g.elly language definition file:
d:[ <- PUNC[|*l,*x,start]
__
d:] <- PUNC[|*r,*x]
__
d:( <- PUNC[|*l,start]
__
d:) <- PUNC[|*r]
>>[!spcs]
__
d:“ <- PUNC[|*l,quo,start]
__
d:” <- PUNC[|*r,quo]
__
d:" <- PUNC[|*l,*r,quo,start]
__
d:‘ <- PUNC[|*l,quo,start]
__
d:’ <- PUNC[|*r,quo]
__
d:` <- PUNC[|*l,quo,start]
__
d:' <- PUNC[|*l,*r,quo,start]
__
d:, <- PUNC[|com]
>>[!brk]
__
d:. <- PUNC[|stop,emb,*x]
__
d:! <- PUNC[|stop,emb]
__
d:? <- PUNC[|stop,emb]
__
d:: <- PUNC[|stop,emb]
__
d:; <- PUNC[|stop]
__
d:… <- PUNC # horizontal ellipsis
__
d:™ <- PUNC # TM
__
d:– <- PUNC # en dash
__
d:— <- PUNC[|*x] # em dash
__
d:- <- PUNC[|hyph] # hyphen or minus
__
All PyElly predefined punctuation will translate into itself with neutral cognitive
semantic plausibility. You can override this action by defining a vocabulary rule with a
different rewriting for a specific punctuation, but if this has the same syntactic features
as a default rule, make sure that the new rule has a positive semantic plausibility
increment so that PyElly will always choose it instead of the default.
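For instance, a hedged sketch of such an override in a vocabulary table, rewriting a
semicolon as a comma (this entry is hypothetical, not from any distributed application):
; : punc[|stop] - 1 =,
Because this entry has the same syntactic type and features as the default rule for the
semicolon but a plausibility of 1, PyElly would prefer it and output a comma in place of
the semicolon.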
You can also use the ellyCharInputStream module to change the form of punctuation in
text before it is looked up. This is how PyElly now handles ellipsis written as three dots.
11.2 Extending Stop Punctuation Recognition
The division of text into sentences by ellySentenceReader can currently be modified in
two ways: by the stopException module that recognizes special cases when certain
punctuation should not terminate a sentence and by the exoticPunctuation module that
checks for cases where sentence punctuation can be more than a single character.
11.2.1 Stop Punctuation Exceptions (A.sx.elly)
When PyElly starts up an application A, its stopException module will try to read in a
file called A.sx.elly, or failing that, default.sx.elly. This file specifies various
patterns for when a regular stop punctuation character ( . ! ? : ; ) should not
terminate a sentence. These patterns will only be checked when the punctuation is
followed by a space character in input text.
Each pattern in a *.sx.elly file must be expressed in the following form:
l...lp|r
where p is the punctuation character for the exception, l...l is a sequence of literal
characters or wildcards for the immediate left context of p, and r is a single literal
character or wildcard for the immediate right context of p plus the space after it. The
vertical bar (|) marks the start of a right context and must be present.
The l and r parts of a pattern may have only certain Elly wildcards:
@ matches a single letter
# matches a single digit
~ matches a single nonalphanumeric character
! matches an uppercase letter (exclamation point)
¡ matches a lowercase letter (inverted exclamation point)
You may also have a * wildcard at the right end of a left context, as long as there is
something else to match. The right context of an exception pattern may be omitted.
Here are some examples of actual exception patterns from default.sx.elly:
@.|
!*:|!
dr.|
mr.|
mrs.|
u.s.s.|
The first pattern above picks up initials, which consist of a single letter followed by a
period and a space character. The second lets a sentence continue past a colon (:) and a
space when this is preceded by a capitalized word, indicated by !*, and followed by a
capital letter, indicated by the ! wildcard alone. The other patterns match formal titles
for names and work as you expect them to. The file default.sx.elly has an
extensive list of stop exceptions that might be helpful for handling typical text. You may
enlarge this or make up your own list for an application.
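For instance, to keep a sentence from being split after an abbreviation like NO. when a
digit follows (as in NO. 5), you could add a hypothetical pattern such as
no.|#
The left context matches the letters NO, the period is the candidate stop punctuation,
and the # wildcard requires a digit as the first character after the following space.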
As with PyElly patterns elsewhere, lowercase letters in a pattern will match the same
letter in text irrespective of case. So the third pattern above will match DR., Dr., dR.,
and dr.. If the exception pattern were Dr.|, then it would match only DR. and Dr..
Uppercase in a pattern must match uppercase in text.
Note also that ordering makes a difference in the listing of patterns here. PyElly will
always take the first match from a listing, which should do the right thing. The problem
is when a longer and a shorter pattern can both match input text. You probably would
want to list the longer pattern first, or otherwise it may never be matched.
PyElly will apply some additional conditions before accepting a match of the left context
pattern of a stop exception rule. It will limit the range of a left context to the longest
sequence of characters preceding a candidate stop punctuation that can appear in a
simple token. Rule patterns cannot refer to anything outside that range. In particular, a
space wildcard can never be matched.
If a left context pattern ends with a * wildcard, then it will be compared without the *
wildcard against the starting characters of the actual left context in input text. If this
matches, then any extra characters in the left context will have to be alphanumeric in
order to make a full pattern match.
If a left context pattern has no * wildcard, then it will be compared against the ending
characters in the actual input left context. If this matches and the first character of the
matching input text is alphanumeric, then any preceding input character may not be
alphanumeric or an ampersand (&). For example, with the rule dr.|, the input text XDR.
will match the left context pattern by itself, but PyElly will reject this because of the X.
The handling of right context patterns will examine only one text character at most, but
this will be the first character after the candidate stop punctuation and its following
space. The * wildcard is forbidden in a right context pattern; and there must be an
actual input text character here if a right context pattern is specified.
As of PyElly release v1.3.23, the stopException module includes hardwired logic to
bypass the stop exception table in a few special cases. This currently is only to handle
the abbreviations A.M. and P.M. in time expressions, which requires going beyond the
simple pattern matching in the stop exception table. The hardwired logic cannot be
overridden except by changing the Python code in the class method nomatch().
11.2.2 Bracketing Punctuation
Some punctuation can show up in complementary pairs to bracket segments of text. The
most common of these are parentheses and brackets like ( and ) or [ and ] and
quotation marks like “ and ”. The PyElly ellySentenceReader module recognizes such
paired punctuation and will adjust its sentence boundaries accordingly. For example,
the text segment
(He walks along.)
should be a sentence even though the period (.) here is not followed by a space. The
closing right parenthesis will also be included in the identified sentence.
Paired quotation marks are handled in the same way, but there is an added complication
here because some text may use the same character for left and right quotation marks.
For example,
"He walks along."
To handle this situation, PyElly automatically interprets " at the beginning of a sentence
or preceded by a space as “ and at the end of a sentence or followed by a space as ”. It
does not replace the original character, however. You can do the replacement yourself
with macro substitutions if you really want to.
Within paired bracketing of any type, colons (:) and semicolons (;) will not be stopping
punctuation and are treated more like a comma (,). The other usual stopping
punctuation (.?!) will also be ignored if there are fewer than 3 space characters seen
so far within the brackets, not counting the spaces after previously ignored (.?!)
punctuation marks. This is to avoid highly fragmented sentence analysis.
PyElly currently puts an 80-character limit on any bracketing with respect to sentence
boundaries. If a matching character pair like (“”) is farther apart than that, then no
bracketing is recognized. You can change this limit by editing the parameter NLM in
ellyCharInputStream.py.
The bracketing logic in PyElly is heuristic and may have to be tuned for a particular
application. No information should be lost here, however. The result will just be to
divide text input into a different set of sentences, which may or may not matter. For
example, a long quotation in input text may be broken up instead of being taken as a
single segment of text for analysis.
11.2.3 Exotic Punctuation
This is for dealing with punctuation like !!! or !?. The capability is coded into the
PyElly exoticPunctuation module, and its behavior cannot be modified except by
changing the Python logic of the module. This change will be easy, though.
The basic procedure here is to look for contiguous sequences of certain punctuation
characters in an input stream. These are then automatically collapsed into a single
character to be passed on to the ellySentenceReader module. The main ellyBase part of
PyElly should therefore always see only standard punctuation.
11.3 How Punctuation Affects Parsing
Typical input sentences processed by PyElly may currently include all kinds of
punctuation, including those recognized by stopException as not breaking a sentence.
When PyElly breaks a sentence into parts for analysis, a single punctuation character by
default will be taken as a token. PyElly will assign common English punctuation to the
predefined syntactic type PUNC unless you provide vocabulary table rules or D:
grammar rules or FSA pattern rules specifying otherwise.
For example, you might put DR. into your vocabulary table, perhaps as the syntactic
type TITLE. Since this will take three characters from an input stream, including the
period, PyElly will no longer see the punctuation here. PyElly tokenization will always
take longest possible match when multiple PyElly rules can apply; a token including
punctuation like quotation marks will probably be longer than anything else.
Identifying punctuation in an input sentence is just the start of PyElly analysis, however.
The grammar rules for a PyElly application will then have to describe how to fit the
punctuation into the overall analysis of a sentence and how eventually to translate it.
This will be entirely your responsibility; and it can get complicated.
In simple text processing applications taking only a sentence at a time and only a small
amount of text, you might choose just to ignore all punctuation by making it disappear
in macro substitutions, but more usually, punctuation occurrences in a sentence
will provide important clues about the boundaries of phrases in text input that
can greatly help in keeping your syntactic analyses manageable.
When you choose to work with sentence punctuation, you will need at least one
grammar rule like
g:SENT->SENT PUNC[|STOP]
__
for handling stop punctuation terminating a sentence, although the syntactic feature
reference is often unnecessary. The setting of syntactic features by the PyElly
punctuationRecognizer module will have no effect on sentence analysis when these are
unreferenced by grammar rules.
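If your grammar also lets bracketing or quotation punctuation into the analysis, you will
need further rules for them; a hedged, illustrative sketch for attaching a closing
quotation mark to an already complete sentence might be
g:sent->sent punc[|*r,quo]
__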
PyElly parsing will fail if any part of a sentence cannot be put into a single coherent
syntactic and semantic analysis; and punctuation handling will be a highly probable
point of failure here. Watch out for sentences broken in two by incorrectly interpreted
punctuation; this cannot be corrected with macro substitutions since these rules must
always operate within the sentence boundaries already found.
12. PyElly Tokenization and Parsing
Parsing and tokenization are usually invisible in PyElly operation, which should help to
simplify the development of natural language applications. Still, we do sometimes need
to look under the hood, either when something goes wrong or when efficiency becomes
an issue. So this section will take a deep dive into how PyElly analyzes a sentence, a
procedure that has evolved over many decades to its present elaboration.
PyElly follows the approach of compiler-compilers like YACC. Compilers are the
indispensable programs that translate code written in a high-level programming
language like Java or C++ into the low-level machine instructions that a computer can
execute directly. In the early days of computing, all compilers were written from scratch;
and the crafting of individual compilers was complicated and slow. The results were
often unreliable.
To streamline and rationalize compiler development for a proliferation of new languages
and new target machines, compiler-compilers were invented. These provided
prefabricated and pretested components that could be quickly customized and bolted
together to make new compilers. Such standard components typically included a lexical
analyzer based on a finite-state automaton and a parser of languages describable by a
context-free grammar.
Using a compiler-compiler of course limits the options of programming language
designers. They have to be willing to work with the constraint of context-free languages;
and the individual tokens in that language (variables, constants, and so forth) have to be
recognizable by a simple finite-state automaton. Such restrictions are significant, but
being able to have a reliable compiler running in weeks instead of months is so
advantageous that almost everyone can accept the tradeoffs.
The LINGOL system of Vaughan Pratt adapted compiler-compiler technology to help
build natural language processors. Natural languages are not context-free, but life is
simpler if we can parse them as if they were and then take care of context sensitivities
through other means like local variables in semantic procedures attached to syntax
rules. PyElly follows the LINGOL plan and takes it even further.
12.1 A Bottom-Up Framework
A parser analyzes an input sentence and builds a description of its structure. As noted
earlier, this structure can be represented as a kind of tree, where the root of the tree is a
phrase node of the syntactic type SENT and the branching of the tree shows how
complex structures break down into simpler structures. A tool like PyElly must be able
to build such trees incrementally for a sentence, starting either at the bottom with the
basic tokens from the sentence or at the top by putting together different possible
structures with SENT as root and then matching them up with the parts of the sentence.
One can debate whether bottom up or top down is better, but both should produce the
same parse tree in the end. We can have it both ways by adopting a basic bottom-up
framework with additional checks to prevent a parse tree phrase node from being
generated if it would not show up in a top-down analysis. PyElly does this through a
true/false matrix m(X,Y) telling whether a syntactic type Y could eventually satisfy a goal
of X at a given parse position; it is automatically compiled when PyElly loads its grammar rules.
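Here is a minimal, self-contained Python sketch of how such a derivability matrix could
be computed from grammar rules by transitive closure. The function and variable names
are illustrative only and do not correspond to PyElly's actual code.
# m[x][y] will be True if a phrase of type y could start a derivation satisfying a goal of type x
# rules: (result, first constituent) pairs taken from 1-branch and 2-branch grammar rules
def derivability(rules, types):
    m = { x : { y : (x == y) for y in types } for x in types }
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for z, x in rules:
            for y in types:
                if m[x][y] and not m[z][y]:
                    m[z][y] = True      # whatever can start an X can also start a Z
                    changed = True
    return m

# example: with SENT->NP VP, NP->NOUN, and VP->VERB, a NOUN can begin a SENT
rules = [ ('sent', 'np') , ('np', 'noun') , ('vp', 'verb') ]
print(derivability(rules, ['sent', 'np', 'vp', 'noun', 'verb'])['sent']['noun'])   # True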
LINGOL and subsequently PyElly both take this restricted bottom-up approach. Doing
so is quite efficient, and the various resulting subtrees can provide helpful information
when parsing fails or when a translation goes wrong. Bottom-up is also more convenient
for computing plausibility scores with PyElly cognitive semantics.
The PyElly bottom-up algorithm operates on a queue that lists the newly created phrase
nodes of a parse tree. These still need to be processed to create the phrase nodes at the
next higher levels of our tree. Initially, the queue is empty, but we then read the next
token in an input sentence and look it up to get some new bottom-level parse tree nodes
to prime our queue. A token can be a single word or phrase, a number, punctuation, an
arbitrary alphanumeric identifier, or a complex entity like a calendar date.
PyElly parsing then runs in a loop, taking the node at the front of its queue and applying
grammar rules to create new nodes to be appended to the back of the queue for further
action. This process keeps going until the queue finally empties out. At that point, PyElly
will then try to read the next token from a sentence to refill the queue and proceed as
before. Parsing will stop after every token in a sentence has been seen or when current
grammar rules cannot generate any more phrase nodes past a given token position.
There is one special circumstance when a new node will not be added to the end of a
queue. If there is already a phrase node of the same syntactic type with the same
syntactic features built up from the same sentence tokens and if the new node does not
have the *UNIQUE syntactic feature, PyElly will note an ambiguity here and will attach
the new node as an alternative to the already processed node instead of queueing it
separately for further tree building.
This consolidation of new ambiguous nodes serves to reduce the total number of nodes
generated for the parsing of a single sentence. Otherwise, PyElly would have to build
parallel tree structures for both the old node and the new node without necessarily any
benefit. The *UNIQUE syntactic feature will allow you to override the handling of
ambiguities here if you really want to do so.
In any event, PyElly immediately computes the cognitive semantic plausibility score of
each new phrase as it is generated in bottom-up parse tree building. Whenever an
ambiguity is found, PyElly will find the alternative with the highest plausibility and use
it in all later processing of a sentence. All the other alternatives will, however, be
retained for reporting, for possible backup on a semantic failure, or for automatically
adjusting biases to ensure that the same rule will not always be taken when there are
multiple rules with the same semantic plausibility.
12.2 Token Extraction and Lookup
PyElly token lookup is complicated because it can happen in many different ways:
external vocabulary tables, FSA pattern rules, entity extraction, the internal dictionary
rules for a grammar, and builtin rules like those for punctuation recognition. These
possibilities must also interact with macro substitution, inflectional stemming, and
morphological analysis; and so it can be hard to follow what is going on here.
PyElly breaks up a sentence into a sequence of tokens, each a single- or multi-word
term, a common expression, a word fragment, a name or other complex entity, a string
of defined format like a number, or punctuation. PyElly parsing goes from left to right in
a sentence, applying its language rules to get the longest possible token at the next
sentence position. For an application A, the full lookup procedure is currently as follows:
1. If number rewriting is enabled, rewrite any spelled-out number like SIXTEEN
HUNDRED in the current sentence position as digits plus any ordinal suffix like -ST,
-RD, or -TH. A spelled-out fraction like THREE-EIGHTHS becomes 3/8THS.
2. Apply macro substitution rules at the current position.
3. Try also to match up the next input up to the next space or other separator with the
FSA pattern table for A; queue up matches as leaf phrase nodes at the current
position if they are consistent with top-down parsing expectations (derivability).
4. Try entity extraction at the current position; put matches as phrase nodes into the
PyElly parsing queue if consistent with top-down expectations at the current
position. Entities must be within a single sentence; otherwise, PyElly puts no
restrictions like above on what extraction code (written in Python) can do for
matching. An entity string can include punctuation and spaces.
5. Look up the next input text in the external vocabulary table for A; put matches into
the PyElly parsing queue as parse tree leaf phrase nodes if consistent with top-down
parsing expectations at the current position according to the current grammar rules
for A. PyElly has special punctuation rules on how far ahead to scan for a match.
6. If steps 3, 4, or 5 have queued up phrase nodes, keep only those with the longest
extent. These should be over the same extent of input text.
7. If any queued phrases are longer than the next simple input token, we are done with
the generation of leaf phrase nodes and ready to start the main parsing loop.
8. Otherwise, extract the next token of alphanumeric plus embeddable punctuation
characters from PyElly input with inflectional stemming and macro substitution.
9. If inflectional stemming is enabled, then apply it to the next token. Put any
inflectional endings back into PyElly input.
10. Apply macro substitution rules again. These may override inflectional stemming.
11. Look up the input token as a single word in both external vocabulary table and the
internal dictionary for A. Queue up a phrase node for any matches here if consistent
with top-down expectations and the match is as long as what has been seen so far.
12. If we have queued phrases from any of the preceding steps, we are done with
tokenization and ready to go into the PyElly main parsing loop.
13. Otherwise, morphologically analyze our current single-word token with the rules in
effect for A. Put any suffixes found here back into PyElly input. If analysis resulted in
a new token, look it up in the external vocabulary table and in the internal dictionary
for A. If found and consistent with top-down expectations and long enough, queue
up phrase nodes for each match.
14. If there are any queued up phrase nodes, we are done.
15. Otherwise, if nothing has been queued up yet, check if the next token is standard
punctuation. If so and the punctuation is consistent with top-down expectations,
enqueue a phrase node of syntactic type PUNC and quit tokenization.
16. If nothing has yet been queued up, then create a phrase node of UNKN type for the
next single token up to a break character. This will be without any top-down
consistency check.
The lookup process should produce a queue of at least one phrase node for the next
token. We then will start in on parse tree building with queued up nodes and continue
until the queue is exhausted. When that happens, it is time to look for another token
from which to refill our queue until all of an input sentence has been processed.
12.3 Building a Parse Tree
Given bottom-level phrase nodes in the PyElly parsing queue, we can start to build up a
parse tree from them. The basic algorithm here is from LINGOL, but it is similar to
other bottom-up parsing procedures in systems driven by context-free grammars. The
next subsection will cover the details of the basic algorithm’s main loop, and the two
following subsections will describe PyElly extensions to that algorithm.
12.3.1 Context-Free Analysis Main Loop
At each step in parsing, we first enqueue the lowest-level phrase nodes for the next piece
of an input sentence, with any ambiguities already identified and resolved. Then for
each queued phrase node, we find all the ways that the node will fit into a parse tree
currently being built. This is called “ramification” in PyElly source code commentary.
For a newly enqueued phrase node, PyElly ramification will go through three steps when
the syntactic type of the node is X:
1. Look for grammar rules of the form Z->Y X that have earlier found a Y and have set a
goal of an X at the current position. For each such goal found, create a new phrase
node of type Z, which will be at the same starting position as phrase Y.
2. Look for rules of the form Z->X. For each such rule, create a new node of type Z at
the same starting position and with the same extent in a sentence as X.
3. Look for rules of the form Z->X Y. For each such rule, set a goal at the next position
to look for a Y to make a Z at the same starting position as X.
A new phrase node will be vetoed in steps 1 and 2 if it is inconsistent with a top-down
analysis; this check uses the same derivability matrix employed in token lookup. Each newly
created node will be queued up for further processing with the three steps above. When
all the phrase nodes ending at the current sentence position have been ramified, PyElly
parsing advances to the next position.
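To make the flow concrete, here is a self-contained Python sketch of the queue-driven
loop together with the three ramification steps described above. It is illustrative only:
the class and variable names are not PyElly's, and syntactic features, the top-down
derivability check, ambiguity handling, and cognitive semantics are all left out.
from collections import defaultdict, deque
from dataclasses import dataclass

@dataclass
class Phrase:                      # a parse tree node covering tokens [start, end)
    typ: str
    start: int
    end: int
    children: tuple = ()

@dataclass
class Goal:                        # posted by a rule Z->X Y after an X has been found
    result: str                    # Z
    wanted: str                    # Y, still to be found
    left: Phrase                   # the X constituent already built

def parse(tokens, rules1, rules2, leaf_type):
    # rules1: (Z, X) pairs for Z->X ; rules2: (Z, X, Y) triples for Z->X Y
    goals = defaultdict(list)      # pending goals indexed by sentence position
    phrases = []                   # every phrase node generated, in order
    queue = deque()
    for pos, tok in enumerate(tokens):
        queue.append(Phrase(leaf_type(tok), pos, pos + 1))   # leaf node for the next token
        while queue:                                         # ramify until the queue empties
            node = queue.popleft()
            phrases.append(node)
            # step 1: complete goals Z->Y X that were waiting for an X starting here
            for g in goals[node.start]:
                if g.wanted == node.typ:
                    queue.append(Phrase(g.result, g.left.start, node.end, (g.left, node)))
            # step 2: apply 1-branch rules Z->X
            for z, x in rules1:
                if x == node.typ:
                    queue.append(Phrase(z, node.start, node.end, (node,)))
            # step 3: for 2-branch rules Z->X Y, post a goal for a Y at the next position
            for z, x, y in rules2:
                if x == node.typ:
                    goals[node.end].append(Goal(z, y, node))
    return phrases

# tiny usage: NP->NOUN, VP->VERB, SENT->NP VP over a two-word sentence
rules1 = [ ('np', 'noun') , ('vp', 'verb') ]
rules2 = [ ('sent', 'np', 'vp') ]
out = parse(['dogs', 'bark'], rules1, rules2,
            lambda t: 'noun' if t == 'dogs' else 'verb')
print([ (p.typ, p.start, p.end) for p in out ])
Real PyElly parsing additionally vetoes nodes in steps 1 and 2 with its derivability matrix
and consolidates ambiguous nodes outside these steps, as described in the surrounding text.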
The main difference between PyElly basic parsing here and similar bottom-up context-
free parsing elsewhere is in the handling of ambiguities. Artificial languages generally
forbid any ambiguities in their grammar, but natural languages are full of them and so
we have to be ready to handle them. In PyElly, the solution is to resolve ambiguities
outside of its ramification steps.
PyElly sees an ambiguity only when two phrase nodes of the same syntactic type and
features cover the same tokens in a sentence. For example, the single word THOUGHT
could be either a noun or the past tense of a verb. This will have to be resolved at some
point, but if they are marked with different syntactic categories or have different
syntactic features, PyElly will put off making any resolution.
It is possible that a parsing ambiguity may be found for a phrase node after it has already
been ramified. This is no problem if that previous node has a higher plausibility than the
new phrase node producing the ambiguity; but if the new phrase is more semantically
plausible, then it must replace the old phrase, and the plausibility scores of all of the old
phrase’s ramifications must also be adjusted upward to reflect the replacement. Such
changes of the plausibility of other phrase nodes may in turn require adjustment of their
previous ramifications as well. PyElly handles all of this automatically.
12.3.2 Special PyElly Modifications
Except for ambiguity handling, basic PyElly parsing is fairly generic. We can be more
efficient here by anticipating how grammar rules for natural language might benefit
from special handling as compared to those for context-free artificial languages. The
first extension of the core algorithm is the introduction of syntactic features as an extra
condition on whether or not a rule is applicable for some aspect of ramification.
On the right side of a rule like Z->X or Z->X Y, you can specify what syntactic features
must and must not be turned on for a queued phrase node of syntactic type X to be
matched in steps 2 and 3 above and for a queued phrase node of type Y to satisfy a goal
based on a rule Z->X Y in step 1. This checking involves extra code, but it is
straightforward to implement with bit-checking operations.
There is also a special constraint applying to words split into a root and an inflectional
ending or suffix (for example, HIT -ING). The parser will set flags in the first of the
resulting phrase nodes so that only step 3 of ramification will be taken for the root part
and only step 1 will be taken for each inflection part. A parse tree will therefore grow
more slowly than otherwise expected, making for faster parsing and less overflow risk.
12.3.3 Type 0 Grammar Extensions
The introduction of the PyElly ... syntactic type complicates parsing, but handling the
type 0 grammar rules currently allowed by PyElly turns out to require only two localized
changes to its core context-free algorithm.
1. Just before processing a new token at the next position of an input sentence,
generate a new phrase node for the grammar rule …[.1]->nothing. This is not a
normally legal rule! Enqueue the node and get its ramifications immediately.
2. Just after processing the last token of an input sentence, generate a new phrase node
for the grammar rule ...[.2]->nothing. Enqueue it and get its ramifications
immediately.
Those reading this manual closely will note that the two rules here have syntactic
features associated with ..., which Section 8 said was not allowed. That restriction is
still true, and that is because PyElly reserves the syntactic features of ... to make the
type 0 logic handling work properly as done above.
The difficulty here is that the ... syntactic type is prone to producing ambiguities. This
will be especially bad if the PyElly parser cannot distinguish between a ... phrase that
is empty and one that includes actual pieces of a sentence. So PyElly itself keeps track by
using syntactic features here, but keeps that information invisible to users.
The solution will propagate up the syntactic feature [.1] to indicate an empty phrase
due to case 1 and the feature [.2] to indicate an empty phrase due to case 2. Though
invisible, a grammar will still need to guide this explicitly through setting *LEFT or
*RIGHT in rules for syntactic feature inheritance when a rule involves ....
12.4 Success and Failure in Parsing
For any application, PyElly automatically defines the special grammar rule:
g:SENT->SENT END
__
This rule will never be realized in an actual phrase node, but the basic PyElly parsing
algorithm uses this rule to set up a goal for the syntactic type END in step 3 of
ramification. After a sentence has been fully parsed, PyElly will look for an END goal at
the position after the last token extracted from the sentence. If no such goal is found,
then we know that parsing has failed; otherwise, we can then run the generative
semantics for the SENT phrase node that generated the END goal just found.
There may be more than one END goal in the final position, indicating that their
respective generating SENT phrase nodes were not collapsed as an ambiguity because of
syntactic feature mismatch. As a special case, PyElly will compare their cognitive
semantic plausibility scores, select the most plausible, and run its generative semantic
procedure to get the interpretation for a sentence. This is equivalent to making actual
phrase nodes based on the implicit PyElly SENT->SENT END rule, which would trigger
PyElly ambiguity handling as just described.
Failure in parsing gives us no generative semantic procedure to run, and our only
recourse then is to dump out intermediate results and hope that someone can spot some
helpful clue in the fragments of analysis. If the failure is due to something happening in
semantic interpretation, though, PyElly can automatically try to recover by backing up
in a parse tree to look for an ambiguity and selecting a different alternative at that point.
12.5 Parse Data Dumps and Tree Diagrams
PyElly can produce dumps of parsing data, including all the complete or partial parse
trees built up for a sentence. In a successful analysis, this helps in verifying that PyElly is
running as expected. In a failed analysis, the partial parse trees will provide clues about
what went wrong. For example, you can see where the building of a parse tree had to
stop and whether this was due to a missing rule or unexpected input text.
All this information is written to the standard error stream. Such output originally was
an informal debugging aid, but has proven so useful that it is now built into PyElly
operation. The most important part of parse data dumps are the trees. These will be
presented horizontally, with their highest nodes on the left and with branching laid out
vertically. For example, here is a simple 3-level subtree with 4 phrase nodes:
sent:0000───ss:8001─┬─noun:8000 @0 [nnnn]
  6 = 3      4 = 2  │   1 = 1
                    └─verb:0000 @1 [vvvv]
                         2 = -1
Each phrase node in a tree display will have the form
type:hhhh
n = p
where type is the name of a syntactic type truncated to 4 characters, hhhh is
hexadecimal for the associated feature bits (16 are assumed), n is a phrase sequence
number indicating the order in which it was generated in a parse, and p is the numerical
semantic plausibility score computed for the node. The nodes are connected by Unicode
drawing characters to show the kind of branching in grammar structures.
To interpret the feature bits hhhh here, you should look at the encoding of feature
names produced by grammarTable.py. The feature encoded as 0 will be the leftmost bit
and will show up as the hexadecimal 8000; the feature encoded as 1 will be 4000. In the
above example, the top-level node here for type sent is
sent:0000
6 = 3
This node above has no syntactic features turned on; its node sequence ID number is 6,
and its plausibility score is +3. Similarly, the node for type ss at the next level is
ss:8001
4 = 2
The actual sentence tokens in a PyElly tree display will be in brackets on the far right,
each preceded by its sentence position, which starts from 0. In the example above, the tokens are the
“words” nnnn and vvvv in sentence positions 0 and 1, respectively. Every parse tree
branch will end on the far right with a position and token, plus a semantic concept if
your grammar includes them.
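If it helps to see the bit arithmetic spelled out, here is a minimal Python sketch, not
part of PyElly itself, of how a set of encoded feature bit positions maps to the four
hexadecimal digits shown for a node (16 bits assumed, with the feature encoded as 0
being the leftmost):

# minimal sketch: map feature bit positions to the 4-digit hex in a tree display
def feature_hex(bit_positions):
    value = 0
    for b in bit_positions:
        value |= 1 << (15 - b)     # bit 0 -> 0x8000, bit 1 -> 0x4000, ...
    return format(value, '04X')

print(feature_hex([0]))            # 8000
print(feature_hex([0, 15]))        # 8001, as for the ss node above
print(feature_hex([]))             # 0000, no features turned on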
When the analysis of a word breaks it into separate component tokens representing
prefixes or suffixes, we can get trees like
sent:0000───ss:0000─┬──ss:0000───unit:0000───unkn:0000 @0 [it]
 11 = 0     10 = 0  │   2 = 0      1 = 0       0 = 0
                    └──unit:0000─┬─unkn:0000 @1 [live]
                         9 = 0   │   4 = 0
                                 └─sufx:0000 @2 [-s]
                                      8 = 0
When a grammar includes ... rules, the display will be slightly more complicated, but
still follows the same basic format.
sent:0000─┬─sent:C000───ss:C000─┬───x:A400─┬─...:4000 @0 []
 11 = 4   │    8 = 4     7 = 4  │   3 = 4  │   0 = -2
          │                     │          └─key:2400 @0 [hello]
          │                     │               2 = 2
          │                     └───...:4000 @1 []
          │                          6 = -2
          └─punc:2000 @1 [.]
               9 = 0
The empty phrases numbered 0 and 6 have sentence positions 0 and 1, but these
positions are shared with the two actual sentence elements HELLO and the period (.), as you
would expect. Note that the hidden syntactic features of the ... type will show up in a
displayed tree; just ignore them.
All the examples here show a single analysis of a complete sentence, which is what PyElly
starts from for semantic interpretation. This minimal dump will not show any rejected
alternative analyses, which might have been generated for ambiguities or for dead ends
that led to no full analysis of a sentence recognized as a syntactic type SENT.
When a PyElly analysis fails or when you are unsure whether an analysis is correct,
however, you may want to see all the intermediate results of parsing, including
ambiguities and dead ends. PyElly can do a full tree dump here showing all phrase
nodes in an analysis in the order of their generation.
In a full dump, PyElly first looks for the node with the highest sequence ID number not
yet shown in any parse tree for the current dump. PyElly then shows the subtree under
that node as done above for the subtree under sent and loops back in this way until all
phrase nodes are accounted for. For a long sentence, we may have tens of thousands of
phrase nodes, but each node will show up in only one subtree.
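The traversal just described can be summarized in a short Python sketch; the names used
are hypothetical and serve only to make the loop explicit, so do not look for them in the
PyElly source:

# hypothetical sketch of the full-dump loop: repeatedly display the subtree under
# the highest-numbered phrase node not yet shown in any tree of the current dump
def full_dump(phrases, show_subtree):
    shown = set()                        # sequence IDs of nodes already displayed
    while len(shown) < len(phrases):
        top = max((p for p in phrases if p.seq not in shown), key=lambda p: p.seq)
        shown.update(show_subtree(top))  # show_subtree returns the IDs it printed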
In addition to all the trees and subtrees, a PyElly full dump will also show the goals at
the final position in a sentence analysis, all grammar and internal dictionary rules
applied plus the phrase nodes generated, and all ambiguities found in the process. This
information should allow you to reconstruct how PyElly parsed a sentence.
Full dumps are the default when you are running ellyBase.py from a command line,
regardless of whether PyElly succeeded or failed on an input sentence. With
ellyMain.py, you will see only a minimal dump if PyElly succeeded, but a full dump if
PyElly has failed.
For example, suppose that we have the following trivial grammar, which allows for
sentences consisting of either a NOUN plus a VERB or a VERB plus a NOUN:
# trivial grammar
p:do
_
left
space
right
__
g:sent->noun verb
_
(do)
__
g:sent->verb noun
>>-
_
(do)
__
g:noun->noun sufx
_
(do)
__
g:verb->verb sufx
_
(do)
__
d:dog <- noun
__
d:dog <- verb
__
d:-s <- sufx
__
d:bark <- verb
__
d:bark <- noun
__
Here is an example of an ellyBase full dump for an analysis of a short sentence:
> dogs bark.
parse FAILed!
dump all
dumping from phrase 8 @0: typ=0 syn[00 00] sem[00 00] : bia=-1 use=0
sent:0000─┬─verb:0000─┬─verb:0000 @0 [dog]
  8 = -1  │   4 = 0   │   1 = 0
          │           └─sufx:0000 @1 [-s]
          │               2 = 0
          └─noun:0000 @2 [bark]
              6 = 0
dumping from phrase 7 @0: typ=0 syn[00 00] sem[00 00] : bia=0 use=0
sent:0000─┬─noun:0000─┬─noun:0000 @0 [dog]
  7 = 0   │   3 = 0   │   0 = 0
          │           └─sufx:0000 @1 [-s]
          │               2 = 0
          └─verb:0000 @2 [bark]
              5 = 0
rules invoked and associated phrases
rule 2: [7]
rule 3: [8]
rule 4: [3]
rule 5: [4]
rule 6: [0]
rule 7: [1]
rule 8: [2]
rule 9: [5]
rule 10: [6]
3 final goals at position= 3 / 3
goal 8: sufx typ=6 for [phrase 5 @2: typ=5 syn[00 00] sem[00 00] : bia=0 use=0] rul=5
goal 9: sufx typ=6 for [phrase 6 @2: typ=4 syn[00 00] sem[00 00] : bia=0 use=0] rul=4
goal 10: end typ=1 for [phrase 7 @0: typ=0 syn[00 00] sem[00 00] : bia=0 use=0] rul=1
9 phrases altogether
ambiguities
sent 0000: 7 (+0/0) 8 (-1/0)
4 raw tokens= [[dog]] [[-s]] [[bark]] [[.]]
9 phrases, 11 goals
Parsing fails for the input “dogs bark.” because our simple grammar expects no
punctuation at the end of a sentence. In the full dump, we first see the two subtrees for
the first three tokens of the input sentence. The final position where goals are defined is 3,
and the goals there are listed; the position of the actual last token seen is also 3, but this
may not always be so. In the list of ambiguities we have phrases 7 and 8, both of type
SENT with no syntactic features set; phrase 7 starts the listing because it has a higher
semantic plausibility score as a result of the cognitive semantics for rule 3.
If a semantic concept is defined for a phrase at a leaf node in a parse tree, a PyElly tree
dump will show the concept immediately after the bracketed token at the end of an
output line. For example, the augmented printout for the first tree from the
example above might become
sent:0000─┬─noun:0000─┬─noun:0000 @0 [dog] 02086723N
  4 = 0   │   2 = 0   │   0 = 0
          │           └─sufx:0000 @1 [-s]
          │               1 = 0
          └─verb:0000 @2 [bark] 01049617V
              3 = 0
where 02086723N and 01049617V are WordNet-derived concept names as described
in Subsection 10.4.1 above. If no concept is defined for a leaf node, then the tree output
will remain the same as before. This is the case for the suffix -S here.
12.6 Parsing Resource Limits
PyElly is written in Python, a scripting language that can be interpreted on the fly. In
this respect, it is closer to the original LINGOL system written in LISP than to its
predecessors written in Java, C, or FORTRAN. PyElly takes advantage of Python object-
oriented programming and list processing with automatic garbage collection.
Unless you are running on a platform with tight main memory, PyElly should readily
handle sentences of up to a hundred tokens with little difficulty. Writing complex
grammar rules to describe long sentences efficiently, however, will require that you
control the combinations of the possible interpretations from those rules. Too many
combinations here will result in phrase node overflow, producing no translation.
When grammar rules allow for high degrees of ambiguity, the total number of phrase
nodes allocated in generating a complete parse for a sentence can grow exponentially
with the number of distinct tokens in the input. This can slow processing noticeably or
even crash your computer. Be careful especially when using the predefined *unique
syntactic feature or when many words have to be treated syntactically as UNKN.
In general, PyElly will generate every possible parse tree for a sentence. Analyses will
be reduced, though, by immediate resolution of ambiguities of the same syntactic type
with the same syntactic features over the same text, but sentences with many competing
interpretations higher up in a parse tree can still clog up PyElly processing. Inflectional
stemming or morphological analysis will of course produce extra tokens beyond the
basic count of words and punctuation for a sentence.
The PyElly ellyConfiguration module as of v1.2.2 imposes a 50,000 node limit on
the total number of phrases available for a sentence parse tree. This is generous, since
parsing can seem glacially slow when over 10,000 phrase nodes are generated. If
sentence analysis hits the node cutoff, PyElly will throw an overflow exception that will
be caught at the top level of ellyBase; only a token list dump will then be shown.
You can raise the phrase node limit by editing the phraseLimit parameter definition
in the file ellyConfiguration.py. This probably will not help much, however. Node
overflow means that your grammar and vocabulary really need to be rethought and
tightened up. A remedy here might be to use syntactic features strategically to restrict
the applicability of alternate syntax rules and to define more terms explicitly, especially
ones with multiple words. Just recognizing a phrase like in the course of as a
single token can often halve the total phrase node count for a long sentence.
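For reference, such an edit is just a change to one assignment; the exact form of the
definition may vary between PyElly versions, so treat this only as a sketch:

# in ellyConfiguration.py (sketch; check the actual file for the current form)
phraseLimit = 80000   # raise the phrase node cutoff from the default of 50,000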
As of v1.3.5, you can also insert a control character into your input to force PyElly to
split the analysis of a long sentence into two parts when you provide an appropriate
syntax rule here. This insertion can be done through a macro substitution rule with a
pattern that identifies where you want to put a break into the input text. The control character
here could be anything not normally found in input text, but in particular, PyElly
predefines ASCII RS (record separator, Unicode U+001E) for this purpose.
To make its use easier, the PyElly separator character will automatically be put into the
internal dictionary of every PyElly grammar as a 1-character token of syntactic type
SEPR. The separator character can be specified on the right side of a macro substitution
as \\s. You then just have to define grammar rules of the form X->Y SEPR or X->SEPR Y to
tell PyElly where to expect a SEPR token in the input for an application. PyElly parsing
will take care of everything else.
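As an illustration only, a minimal sketch of such a rule in the grammar file notation
shown earlier in this manual might be the following; its generative procedure is just
LEFT, so the SEPR branch contributes nothing to the output, which is one reasonable
choice given the advice below that a SEPR phrase should translate to a null string:

g:sent->sent sepr
_
left
__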
There may be more than one separator in a given sentence. Each has the effect of
dividing a sentence analysis so that no phrase completely to the left of the separator will
generate a goal to the right of the separator. Fewer goal nodes generated will mean
fewer phrase nodes. To get maximum benefit here, you should have only one 2-branch
syntax rule with SEPR and define it at the level of the SENT syntactic type.
With PyElly defaults, the analysis of a sentence with a separator token in input will still
produce a single parse tree. Your generative semantics will work exactly the same as
before to stitch together a proper output translation for the original input sentence.
Unless you have some special requirement, a SEPR phrase node should always translate
to a null string, since it was not in the original input text.
Outside of phrase and goal nodes, the main defined resource restrictions in PyElly are
on the total number of syntactic types (96) and the total number of different syntactic
features or of different semantic features for a phrase node (both 16). These are fixed to
allow for preallocation of various arrays used by PyElly to simplify lookup of
grammatical symbols when setting up a language definition. The limits also make the
formatting of parse trees easier.
You can change these limits on syntactic types and syntactic and semantic feature
counts by editing the definitions in the PyElly symbolTable.py module. The current
numbers should be quite enough for ordinary applications, however. If you increase the
feature count past 16, you also must change the parseTreeWithDisplay.py module
so that its formatting of parse trees will allow more than four hexadecimal digits to
represent syntactic feature bits. This will be non-trivial.
You will be better off leaving the limit on the number of syntactic features alone. Instead,
you might try increasing the number of syntactic feature sets or the number of syntactic
types; but this may make your grammar rules harder to understand. We are taught in
grammar school that there are only eight parts of speech in English, but it is often
advantageous to define many more syntactic types in order to make individual syntactic
rules more specific. Experiment here to see what works best for you in PyElly.
In simple PyElly applications, the limits here will be unimportant. In the marking
example application with tens of thousands of grammar and vocabulary rules, however,
the language definition has reached almost 90 syntactic types, 9 syntactic feature sets of
16, and 3 semantic feature sets of 16. Almost all of the syntactic and semantic feature
sets are fully defined. It should be possible, however, to reorganize that language
definition to be more efficient in its use of symbols.
13. Developing Language Rules and Troubleshooting
PyElly rewrites input sentences according to the rules that you provide. A natural
language application can involve up to eight different *.*.elly text files for rule
definition, with some containing hundreds or even thousands of rules. There are plenty
of ways to go wrong here; and so we all have to be systematic in developing PyElly
applications, taking advantage of all the tools available when problems arise.
Even when trying to do something simple, you must always be alert and cultivate good
habits. In general, the best way to use PyElly is to approach a solution bit by bit, trying
to take care of just one sentence at a time. Let PyElly check everything out for you at
each step and go no further until everything is clean and satisfactory. Remember that a
change in your rules can break previous analyses. Always expect the worst.
Application building will never be a slam dunk, but remember that you are already a
natural language expert! Despite enormous advances in hardware and software, an
intelligent young child nowadays still knows more about natural language than Siri or
Watson. If you can harness some basic analytical skills and add some programming
chops to your innate language expertise, then you should do well with PyElly. Just keep
your goals clear and proceed slowly and carefully.
Start with the simplest sentences requiring only a few rules. Once these can be handled
successfully, move on to more complex sentences, adding more rules as needed to
describe them. With a modular PyElly rule framework, you should be able to build on
your previous rules without having to re-create them. This is one big advantage of
processing sentences recursively around the syntactic structures of sentences.
When testing out a new sentence, you should not only verify that PyElly is producing the
right output, but also inspect its parse tree dump to see that it is doing what you expect
from your current grammar rules. If everything is all right, then add the sentence and its
translation to a list to be run in regression testing with ellyMain.py later. Such testing
will take only seconds or minutes, but must be done.
13.1 Pre-Checks on Language Rule Files
As your language definition files get longer, PyElly can help to verify that each of them is
set up correctly and makes sense by itself before you try to bring everything together.
This can be done by running the unit tests of the modules that read in definition files and
check the acceptability of tables of rules. Running ellyBase.py or ellyMain.py will
also do some of this, but you can sort out issues more easily when looking at only one
definition file at a time. For example, with application X, you can run any or all of the
following unit tests from your command line:
python grammarTable.py X
python vocabularyTable.py X
python patternTable.py X
python macroTable.py X
python nameTable.py X
python conceptualHierarchy.py X
Each command will read in the corresponding X.*.elly file, check for errors, and
point out other possible problems. If a table or a hierarchy can be successfully
generated, PyElly will also dump this out entirely for inspection, except for external
vocabularies, which can be too big for a full listing.
The vocabulary, pattern, macro, and name table unit tests will also prompt you for
additional examples to run against their rules for further verification of correct lookup
or matching. This can be helpful when debugging a PyElly language definition with a
specific problematic bit of text.
PyElly error messages from language definition modules will always be written to
sys.stderr and will have a first line starting with “** ”. They may be followed by a
description line starting with “* ” showing the input text triggering the problem. For
example, here is an error message for a bad FSA rule for assigning a part of speech:
** pattern error: bad link
* at [ 0 *bbbb* ZED start ]
PyElly will continue to process an input rule definition file after finding an error so as to
catch as many definition problems as possible in one pass. No line numbers are given in
error messages because PyElly normalizes all its input to simplify processing for
definition modules. This will strip out comments and will eliminate any blank lines, thus
changing the original numbering of lines.
Once each separate table has been validated in isolation, you can run ellyBase.py to
load everything together. Its unit test will run a cross-table check on the consistency of
your syntactic categories and of your syntactic and semantic features. For application X,
do this with the shell command
python ellyBase.py X
This is also how you would normally test the rewriting of individual test sentences, but
the information provided by PyElly from the loading of language rules is a good way to
look for omissions or typos in your language rules, which can be hard to track down
otherwise. Running ellyMain.py will skip this detailed kind of checking.
13.2 A General Application Development Approach
For those wanting more specifics on setting up PyElly language definition files, here is a
reasonable way to build a completely new application X step by step.
1. Set up initially empty X.g.elly, X.m.elly, X.stl.elly, X.n.elly, and
X.v.elly files. For the other PyElly language definition files, taking the defaults
should be all right at least to start with.
2. Select some representative target sentences for your PyElly application to rewrite.
Five or six should be enough to start with. You can add more as you go.
3. Write G: grammar rules in X.g.elly to handle one of your target
sentences; leave out the cognitive and generative semantics for now. Just check for
correctness of the rules by running the PyElly module grammarTable.py with X as
an argument.
4. Add the words of a target sentence as D: internal dictionary rules in X.g.elly or as
vocabulary entries in X.v.elly. Run grammarTable.py or
vocabularyTable.py with X as an argument to verify correctness of language
definition files as your working vocabulary changes.
5. Run PyElly module ellyBase.py with X as an argument to verify that your
language definition files can be loaded together. Enter single target sentences as
input and inspect the parse data dumps to check that analyses are correct. Ignore
any generated output for now.
6. Write the generative semantic procedures for your grammar rules and check for
correctness by running grammarTable.py. If you have problems with a particular
semantic procedure, copy its code to a text file and run the PyElly module
generativeProcedure.py with the name of that text file as an argument.
7. When everything checks out, run ellyBase.py with X as an argument and verify
that PyElly translates a target sentence as you want.
8. When everything is working for a target sentence, repeat the above from step 2 with
other sentences. Always test your new system against all previous target sentences to
ensure that everything is still all right after any change of language rules. You can
create X.main.txt and X.main.key files to automate such testing with doTest.
13.3 Miscellaneous Tips
This subsection is a grab bag of advice about developing nontrivial PyElly grammars and
vocabularies. It distills various lessons learned about rule-based language definition
since PARLEZ, the original PyElly ancestor. In writing language rules for a new
application, expect to make some mistakes; but try to avoid repeating the same old ones.
1. PyElly has only about ten thousand lines of Python code, exclusive of commentary. It
is set up to translate natural language strings into other kinds of strings and nothing
more. In other words, it will not by itself let you replicate Watson or Siri. Go ahead
and push the limits, but be aware that PyElly is not magic and that you are limited in
how many rules you can develop effectively in a short period of time.
2. The PyElly ellyMain.py module is the better choice to rewrite batches of raw text
data because of its sentence recognition. If you are working with only one input
sentence at a time, run ellyBase.py. This is friendlier for interactive processing
and also provides more diagnostic information about its translations.
3. PyElly analysis always revolves around sentences, but remember that you can define
them however you want and need not follow what you learned in 8th grade. It is all
right to break sentences at colons and semicolons or even at certain subordinate
conjunctions. Shorter sentences will simplify your grammar rules and parsing.
4. In developing a grammar, less is better. You are more likely to get into trouble
as you add syntactic categories and rules. In many applications, for example, you can
ignore language details like gender, number, tense, and subject-verb agreement.
Avoid defining rules for what does not matter to you.
5. Get the syntax of a target input language right before worrying about the semantics.
PyElly automatically supplies you with stubs for both cognitive and generative
semantics in grammar rules. You can replace the stubs later with full-fledged logic
and procedures after you can successfully parse sentences.
6. Natural language always has regular and irregular forms. Tackle the regular first in
your grammar rules and make sure you have a good handle on them before taking
on the irregular. The latter can often be handled by macro substitutions: for
example, just change DIDN’T to DID NOT or, if tense is unimportant, to DO NOT.
7. Writing semantic procedures for a grammar is a kind of programming. Therefore,
follow good software engineering practices. Divide the rewriting task into smaller
subtasks that can be finished quickly and individually tested. Test thoroughly as you
go; never put it off until all your language rules have all been written out.
8. Make your generative semantic procedures short, fewer than twenty lines if possible;
this will make it easier to verify the correctness of your code visually. If a long
procedure is unavoidable, run it first in a separate file with the unit test for the
PyElly module generativeProcedure.py. Throw in some TRACE, SHOW, and
VIEW commands temporarily to monitor what that code is doing.
9. Be liberal with named generative semantic subprocedures to shorten large blocks of
code. A subprocedure may have as few as only two or three commands. Calling such
a subprocedure could actually take more memory than equivalent inline commands,
but clarity and ease of maintenance will trump efficiency here. Common code used
in multiple procedures should always be broken out as a named subprocedure.
10. Group the rules for syntactic types into different levels where the semantic
procedures at each level will do similar things. A possible succession of levels here
might be (1) sentence types, (2) subject and predicate types, (3) noun and verb
phrase types, and (4) noun and verb types with inflections. This is also a good way to
organize the definition of local variables for communication between different
generative semantic procedures.
11. Macro substitutions will usually be easier to use than syntax rules plus semantic
procedures, but they have to be fairly specific about the words that they apply to.
Syntax rules are better for patterns that apply to broad categories of words.
12. Macro substitution can be dangerous. Watch out for infinite loops of substitutions,
which can easily arise with * wildcards. Macros can also interact unexpectedly; in
particular, make sure that no macro is reversing what another is doing, which can
also lead to infinite loops.
13. Macros can match multiple tokens in input when a pattern for a substitution
includes a ‘_’ wildcard, which will match any space character, including ASCII HT,
NL, CR, US, and SP or Unicode NBSP. Any punctuation in a multi-token pattern
must be one that may be embedded in a single token; for example, the commas in
1,000,000. Literal spaces are prohibited in macro patterns; you must use the ‘_’
wildcard here.
14. The text matched by pattern wildcards can be put into a macro substitution through
the elements \\1, \\2, \\3, and so forth; remember that each ‘_’ wildcard is always
bound separately, and you must count the elements that way; for example, the two
alphabetic segments of a string matching the pattern ‘&@_&@’ will be \\1 and \\3.
15. Try to avoid macros in which the result of substitution is longer than the original
substring being replaced. These are sometimes necessary, but can get you into
trouble; PyElly will warn you if it comes across such a macro, but will allow it.
16. The ordering of rules in a macro substitution table is important. Rules at the top of a
definition file can change the input text seen by a macro further down in the file.
17. PyElly macro substitution comes after transformations of spelled out numbers, but
before external vocabulary lookup and entity extraction. It comes again just before
the next token is taken from its input buffer after any inflectional stemming and
again just before the next token is taken with morphological prefixes and suffixes
split off. This will allow macros to undo any word analyses in special cases. Be
careful here; macro actions can undo, but cannot themselves be easily undone.
18. Macro substitutions and transformations can change the spelling of words in
sentences being processed. Make sure that your internal dictionary and the external
vocabulary rules for an application take such changes into account. Sometimes you
can change the spelling of a word in a specific context to tell PyElly how to interpret
it; for example, you can indicate that an instance of WHAT is a conjunction by
rewriting it as cnjWHAT or WHĀT, which can then be added to your vocabulary.
19. Avoid macros for matching literal phrases like “International Monetary Fund.”
Unless you need the wild card matching in macros or want to use only part of a
match, use vocabulary tables instead. This will be faster.
20. Macros can be slow because all rules will be checked again after any successful
substitution except when an entire match is deleted by a rule. Each substitution will
involve much copying and recopying of text in a buffer.
21. A PyElly non-deterministic finite-state automaton is more complicated than the
deterministic FSA that most people are familiar with. Having wildcards in patterns
controlling state transitions, however, often lets you avoid having to specify
anything other than 0 and -1 as a next state. PyElly will automatically look for all
possible matches consistent with your various patterns; you need not specify any
order in which to do this, as would be required in a deterministic FSA.
22. The -2 final state will let you override PyElly default tokenization. This will let you
split a token at a character normally in the middle of a single token or join what
would be separate tokens at a character normally seen as a separator. Such options
might be rarely needed, but PyElly gives you the flexibility to take them.
23. Vocabulary building should be the last thing you do. You should define at least a few
terms to support early testing, but hold off on the bulk of your vocabulary. Adding
vocabulary is easy; adding grammar is complicated and will have many side effects.
24. In English and other languages including Chinese, certain combinations of words
are so frequent that they really should be taken idiomatically as a single vocabulary
table term; for example, FAKE NEWS. This will increase the number of different
rules, but will actually make parsing easier. Do this liberally when working with
longer sentences.
25. You can also absorb punctuation into vocabulary table entries for lookup. This is
most commonly done with hyphens, but quotation marks, commas, apostrophes,
and others are also fair game here. It can help greatly when working with awkward
kinds of syntax involving embedded punctuation.
26. For large ambiguous grammars, predefine as much vocabulary as possible in order
to keep parsing reasonably fast and parse trees smaller. Try to avoid the UNKN
syntactic type in rules. Use ellySurvey.py to find out what tokens appear in the
text data you want to process and what status they have with current language
definition rules. The file default.v.elly (WordNet 3.0) is a good source of
possible vocabulary rules specifying parts of speech for words and phrases. It may be
missing important terms for a particular content area, however. Add in your own
terms as needed, especially multiword terms.
27. Syntactic and semantic features can be quite helpful. The former lets you be more
selective about which rules to apply to an analysis; the latter lets you better choose
between different interpretations in case of ambiguity. These can reduce the total
number of syntax rules, but will add a different complexity to your grammar.
28. PyElly will allow a syntactic category to be associated with only one set of syntactic
feature names. Different syntactic categories may share the same set of feature
names, however. This is necessary when features are inherited across categories.
29. Features are encoded in a grammar table rule and saved in a parse tree node only as
anonymous bits. The same feature name may be in different feature sets, but will
not necessarily refer to the same feature bit. This can make for bad surprises when
trying to inherit syntactic and semantic features through the *L or *R mechanisms.
Make sure that inheritance is only with phrases having the same set of feature
names. PyElly will check here, but only when feature names are referenced in rules.
30. You must always say whether feature inheritance in a parse tree is from a left or a
right descendant. There is nothing by default for either syntactic or semantic
features. All the features from a designated descendant will be inherited, but you can
always explicitly turn off particular inherited features afterward.
31. You should avoid defining new feature names beginning with ‘*’. PyElly will allow
this, but it can lead to confusion because such names will normally identify syntactic
and semantic features predefined by PyElly. Also, do not forget the ‘*’ when you do
use one of these PyElly-defined feature names.
32. The *l, *r, and *x syntactic features can be used in ways that disregard their side
effects in building parse trees for sentence analysis. This option should be tried only
when you have run short of syntactic features in a grammar; and it is most useful at
the leaf level of parse tree nodes, where no inheritance is possible anyway.
33. To dump out the entire saved grammar rule file for application A, run
grammarTable.py with A as an argument. This will also show generative and
cognitive semantics, which ellyBase.py omits in its diagnostic output.
34. Keep PyElly parse data dumps enabled in PyElly language analysis and learn to read
them. This will be the easiest and most helpful way to obtain diagnostic information
when an application is not working as expected (usually the case). Full dumps will
show all subtrees generated for any ambiguous analyses, even those not
incorporated into actual PyElly output; it provides detailed parsing information.
35. PyElly will cut a tree display off at 25 levels of nodes by default. You can adjust this
limit in the ellyMain and the ellyBase command lines, but deeper trees will be hard
to read when output must be broken up to fit into the maximum width of a display.
36. If you run into a parsing problem with a long input sentence, try to reproduce the
same problem with a shorter version of that sentence. This will help to isolate the
issue and also make your parse tree dumps easier to read.
37. When a parse fails, the last token in the listing shown in a parse tree dump will show
approximately where the failure occurred. It may be a few tokens before, however.
Check for a token for which no phrase node was generated. Look also at the last set
of goals generated in a bottom-up parsing and see what token position they occur in.
38. If you are working with English input and have not defined syntax rules for handling
the inflectional endings -S, -ED, and -ING, a parse will fail on them. The file
default.p.elly can define certain endings as the syntactic category SUFX, but
you still need grammar rules that do something with them.
39. To verify the execution of a generative semantic procedure, put in a TRACE
command. This will write to the standard error stream whenever it is run. In a
procedure attached to a phrase node, TRACE will show the syntactic type and
starting position of that phrase node and the grammar rule governing the node. In a
named subprocedure, PyElly will also show the first attached generative procedure
calling a chain of subprocedures.
40. To see the value of local variables during execution for debugging, use the SHOW
generative semantic command, which writes to the standard error stream.
Remember that both local and global variables will always have string values, with
an empty string possible. Local variables may be from further up in a parse tree.
41. To see the contents of your current and your next output buffer at some point in a
generative semantic procedure, use the VIEW command. This is also a good way to
learn what the various PyElly semantic commands do.
42. Minimize the number of TRACE, SHOW, or VIEW commands at one time. Being
overwhelmed with too much instrumentation can be as problematic as having too
little information. Clean up such commands when no longer needed.
43. Punctuation is tricky to handle. Remember that a hyphen will normally be treated as
a token break; for example, GOOD-BYE currently becomes GOOD, - , and BYE
unless specified as a single term by entity extraction or external vocabulary lookup.
An underscore or an apostrophe is not normally a word break.
44. Quotation marks can be quite troublesome, since this may involve special Unicode
forms as well as ambiguous uses of the ASCII apostrophe character. PyElly
predefines quotation marks, including Unicode variations for formatted text, but
these all must still be properly handled by your grammar rules.
45. Default vocabulary rules for punctuation can be overridden by D: rules defined by a
PyElly application, but only when their cognitive semantic scoring makes these
particular interpretations preferred by PyElly.
46. Macro substitution will not apply across sentence boundaries. To override
punctuation otherwise seen as a sentence stop, define PyElly stop exception rules
(described in Subsection 11.2.1). Note that macros can already deal with embedded
periods and commas, since these will be seen as non-stopping punctuation.
47. Ambiguity is often seen as a problem in language processing, but PyElly embraces it.
Deliberate ambiguity can simplify a grammar. For example, the English word IN can
be either a preposition or a verb particle. Define rules for both usages with
appropriate cognitive semantic plausibility scores and let PyElly figure out which
one to apply. To avoid parse node overflow, though, avoid unnecessary ambiguity.
48. Ambiguous grammar rules and alternative external vocabulary definitions should be
assigned different plausibility scores according to their probability in actual text.
Otherwise, PyElly may switch between them arbitrarily when analyzing different
sentences in the same session, which is probably not what you want.
49. When assigning plausibility scores to rules, try to keep adjustments either mostly
positive or mostly negative. Otherwise, they can cancel each other out in unexpected
and often unfortunate ways. Plausibility for a phrase is computed by adding up all
the plausibility scores for a phrase and all its immediate subconstituents. Note that a
plausibility may change in the middle of an analysis as ambiguities are resolved.
50. To see what is going on in cognitive semantic logic for a particular phrase node,
put ?>>? as the first clause. This will turn on tracing for subsequent clauses in a rule
and write to the standard error stream to verify that the logic is being executed. It
will identify which clause is actually used to compute plausibility. This can help to
track the generation of new phrase nodes and see what grammar rules are used.
51. When a problem arises in a particular input sentence, run the sentence by itself with
ellyBase.py to get more diagnostic information. This will show what grammar
rule is tied to each phrase node and what ambiguities it is associated with.
52. A common failing is when the wrong interpretation gets the highest plausibility
score. To remedy this, identify the phrase nodes contributing most to that score and
check their ambiguous counterparts for a better interpretation. This should tell you
where plausibility scoring needs adjustment in grammar or vocabulary rules. You
may have to go up or down in a parse tree to find which ambiguities to focus on.
53. The predefined *unique syntactic feature allows you to control where PyElly
ambiguity resolution happens. A phrase marked with that feature will remain
unresolved in PyElly parsing even when there is another phrase of the same
syntactic type and with the same syntactic features covering the same set of input
sentence tokens. This feature cannot be inherited.
54. Ambiguity will produce many possible parsings of a sentence, often leading to
exponential growth in total processing. A sentence with four points of ambiguity,
each with two possible interpretations, has sixteen possible analyses. The problem
worsens in longer sentences with more room for ambiguity. To minimize this
problem, avoid input tokens taken to be UNKN, and be sparing with the syntactic
feature *unique in grammar rules. When reasonable, recognize a multi-word phrase
as a single token. Use RS control characters to partition analyses if possible.
55. The PyElly name recognition capability is based on the idea that we can list out the
most common names in a particular domain of text input and that other names will
be lexically and phonetically similar and can be contextually inferred. This will
reduce the number of UNKN tokens seen by PyElly and will help in recognizing
isolated surnames after the appearance of a full name.
56. The name.n.elly definition file is actually quite diverse, being based on U.S.
Census data, but if you are working primarily with foreign names in a particular
language like Arabic, Chinese, or Russian, you probably want to extend your tables
with sample names from other sources. Do not expect to get away with only a little
work here. Current PyElly phonetic signatures reflect American pronunciation and
may have to be adjusted to make inferences about foreign names more reliable.
57. Experiment. PyElly offers an abundance of language processing capabilities, and
there is often more than one way to do something. Find out what works for you and
share your ideas with others.
58. Fix any language definition problems due to typos first. It is easy to make mistakes
in keying in names of syntactic categories or names of semantic or syntactic features.
Always check the complete listing of grammar symbols in ellyBase diagnostic output
to verify that there are no unintended ones due to typos.
59. PyElly uses ‘#’ to mark comments in rule files and also as a wildcard matching
numeric characters ‘0’ through ‘9’. This is a hazard. To be safe, use a backslash (\) to
escape a single ‘#’ wildcard in a rule file to make it unambiguous. A space must
appear before and after a ‘#’ marking a comment in the middle of an input line.
60. Whenever a grammar or a vocabulary is complex, something will almost always be
wrong with it. Check underlying analyses to make sure that PyElly is really doing
what you expect; it is not enough for just the final output to look correct. Watch out
especially for ambiguous phrases with the same plausibility score.
61. You need to test thoroughly any language definition being developed. Always collect
sentences plus their expected translations to cover each grammar rule and each
dictionary entry in a language definition. Repeat some sentences in such a set to
check for possible variations in the resolution of ambiguities depending on context.
62. In natural language processing, you have to be systematic and committed for the
long haul; always rerun your test sets after any significant change in language
definition rules. A slight change in your language definition rules can result in
PyElly being unable to translate a test sentence successfully handled previously.
Always check as soon as possible after adjustment of rules; waiting here can multiply
your problems, which then can be quite difficult to sort out. Be especially careful if
you change a default.*.elly file because it may affect more than one application
and may cause noticeable problems in only some of them.
63. Being able to rewrite n sentences properly will not guarantee that you can handle the
next new sentence. To increase the probability of success here, we have to keep
challenging an application with larger and more diverse sets of target sentences.
The marking example application can give us an idea of how hard we have to work
here. It currently has been successful with over 875 real-world sentences, but it is
still easy to find new sentences where its rules will fail. Natural language is complex
and hard to model regardless of the approach taken to process it.
64. You may have to scrap some earlier language definitions to make a clean start on
handling especially problematic language features. Learn from failures and always
be ready to change your processing strategy if you run into a roadblock. PyElly rules
can be complicated and interact in unexpected ways.
An adult learning a second language soon discovers how hard it is to achieve even a
minimum level of fluency. The same is true for building a non-trivial natural language
system, even in the age of Siri, Watson, and Google Translate. Tools like PyElly can take
you a long way to credible functionality, but you may still need to take care of details
along the way. It is amazing how we humans can pick up most of this with little
apparent effort, but 12 years of parental teaching and formal schooling may have helped.
14. PyElly Applications
PyElly is a collection of basic natural language processing capabilities developed
independently over many decades. Its applications deal mainly with various low-level
details of language, but together, they cover a wide range of linguistic competence
that can be quite helpful in support of NLP in general. Within its limitations, you can
produce useful results quite quickly from raw text data.
The best way to show what can be done here is to present a representative sample of
actual PyElly applications, both simple and complex. Most of these can be implemented
by other means, but it is significant that PyElly can readily handle all of them within its
framework of rewriting one language into another according to user-provided rules.
PyElly is no one-trick pony.
14.1 Current Example Applications
The example applications included in the PyElly distribution package fall into two
classes: those used for debugging and validation only and those demonstrating
potentially useful NLP functionality. PyElly integration testing consists of running
debugging and validation applications plus all of the functional applications on preset
test data input files and checking that the results are as expected (See Appendix D).
Below, we shall describe each current example application provided with PyElly and list
the language definition files provided with them. These files are in the subdirectory
applcn of the unpacked distribution package. You can look at the rules in these files to
see how to develop rules for your own PyElly applications.
Six example applications are just for support, debugging, or validation:
default (.g,.m,.p,.ptl,.stl,.sx,.v) - not really an application, but a set of
language definition files that will be substituted by PyElly if your application does
not specify one explicitly. These include rules for sophisticated morphological
stemming, basic stop punctuation exceptions, and vocabulary rules covering most of
the terms in WordNet 3.0.
input: - -
output: - -
echo (.g,.m,.p,.stl,.v) - a minimal application that echoes its input after being
analyzed by PyElly into separate tokens. It will by default show inflectional
stemming of English words and do basic number transformations. You can disable
that stemming in your ellyMain.py command line (see Section 7) and by replacing
the .stl file containing rules for the English comparative endings -er and -est.
input: Her faster reaction startled him two times.
output: her fast -er reaction startle -ed him 2 time -s.
test (.g,.m,.p,.ptl,.stl,.v) - for testing with a vocabulary of mostly short fake
words for faster keyboard entry; its grammar defines only simple phrase structures
with a minuscule vocabulary. This essentially replicates the basic testing done to
validate PyElly through its early alpha versions and after introduction of various new
modules extending PyElly capabilities.
input: nn ve on september 11, 2001.
output: nn ve+on 09/11/2001.
stem (.g,.m,.p,.v) - to check that PyElly inflectional stemming is properly
integrated with both internal and external vocabulary lookup. Such stemming
happens across multiple PyElly modules, which requires integration testing to detect
the various ways in which problems may arise here.
input: Dog's xx xx xx.
output: dog-'s xx xx xx.
fail (.g,.m,.p,.v) - to check that PyElly generative semantic FAIL command is
properly implemented. This will involve some complicated context switching, which
has been problematic.
input: xyz.
output: generative semantic FAIL
generative semantic FAIL
/Test 1/== aaxyz
The sample output here shows two warning messages from PyElly that the
generative semantic FAIL command has been executed, causing a translation to back
up to a point of ambiguity in a sentence analysis. This eventually leads to the output
as shown above.
bad (.g,.h,.m,.n,.p,.sx,.v) - deliberately malformed language rules to test PyElly
error detection, reporting, and recovery. No grammar or vocabulary table will be
generated here because of the malformed rules, and so PyElly will do no translation
here. The files bad.main.txt and bad.main.key are supplied for use with
doTest, but are both empty.
input: - -
output: - -
The integration test here is just to verify that PyElly can process its ill-formed input
without crashing. This is helpful in verifying the robustness of PyElly processing.
A second, more substantial, class of applications is mostly derived from various
demonstrations developed for PyElly or its predecessors. These have nontrivial
examples of language definitions, illustrate the diversity of PyElly capabilities, and
provide a basis for broad integration testing of PyElly. Most of the applications are only
skeletal prototypes, but you can flesh them out for more operational usage by adding
your own vocabulary and grammar rules.
indexing (.g,.p,.ptl,.v) - to check removal of purely grammatical words
(stopwords), stemming, morphological analysis, and dictionary lookup. By obtaining
roots of content words from arbitrary input text, it could predigest input English text
for information searching, statistical data mining, or machine learning systems. This
application was written for PyElly. It uses default.stl.elly.
input: We never had the satisfaction.
output: - - - - satisfy -
Note that PyElly will replace each non-content word and each broken-out word
prefix or suffix from its input with a hyphen (-) in its output to give an idea of the
extent of original text translated by PyElly.
texting (.g,.m,.p,.ptl,.stl,.v) - with a big grammar and nontrivial generative
semantic procedures. This implements a more or less readable text compression
similar to that seen in mobile messaging. This was a demonstration written
originally for the Jelly predecessor of PyElly and shows how a full-fledged texting
compression application might be approached.
input: Government is the problem.
output: govt d'prblm.
doctor (.g,.m,.ptl,.stl,.v) - This has a big grammar taking advantage of PyElly
ambiguity handling to choose one of several scripted responses for user input
containing specific keywords. It uses the PyElly ... syntactic type to set up
grammar rules to emulate Weizenbaum’s Doctor program for Rogerian
psychoanalysis, incorporating the full keyword-based script published by him in
1966. The language definition rules were first written for the nlf predecessor of
PyElly and then adapted for Jelly and now PyElly.
input: My mother is always after me.
output: CAN YOU THINK OF A SPECIFIC EXAMPLE.
chinese (.g,.m,.p,.ptl,.v) - a test of PyElly Unicode handling, translating simple
English into either traditional [tra] or simplified [sim] Chinese characters. Both
the grammar and the vocabulary of this application are still in progress.
input: they sold those three big cars.
output: [sim]他们卖那三辆汽车.
output: [tra]他們賣那三輛汽車.
In actual operation, only one form of Chinese output will be shown at a time. You get
traditional character output when ellyMain.py is run with the flag -g tra . The
default is simplified output as with the option -g sim. The current integration test
is with traditional characters. Work on this application started in Jelly, but this was
greatly expanded in PyElly where most of the rule development was done.
querying (.g,.m,.ptl,.stl,.v) - heuristically rewrites English queries into SQL
commands directed at a structured database of Soviet Cold War aircraft organized
into multiple tables. This is a reworking of language definition files for the very first
nontrivial application written for the PARLEZ and AQF predecessors of PyElly; it
was updated in PyElly to produce SQL output.
input: how high can the foxbat fly?
output: from Ai a,AiPe b
select ALTD
where NTNM=foxbat,a.NTNM=b.NTNM
;
Table and field names here are abbreviations: Ai is “aircraft,” AiPe is “aircraft
performance,” ALTD is “altitude,” NTNM is “NATO name,” and so forth. The original
AQF system aimed to make such cryptic names transparent to database users as well
as to hide the mechanics of query formation.
marking (.g,.m,.p,.v) - rewrite raw text with shallow XML tagging, a canonic
PyElly application. This shows how a data scientist might preprocess unrestricted
raw text for easier mining. It has evolved to have the most complex grammar and
vocabulary rules of all PyElly example applications by orders of magnitude and is
especially notable in its extensive cognitive semantics.
input: The rocket booster will carry two satellites into orbit.
output: <sent>
<nclu><det>the</det><noun>rocket booster</noun></nclu>
<vclu><aux>will</aux><verb>carry</verb></vclu>
<nclu><num>2</num><noun>satellite -s</noun></nclu>
<nclu><prep>into</prep><noun>orbit</noun></nclu>
<punc>.</punc></sent>
This application is still evolving in its language definition rules. It is the biggest part
of the PyElly integration test suite with a large input set of sample text from the
Worldwide Web. Such test data has been quite challenging and should give you a
better appreciation of the complexity of processing any natural language. See
Appendix F for more discussion.
name (.g, .m,.n,.p,.ptl,.stl,.v) - identify personal names in part or in whole
within raw text both by lookup and by inference. This capability could eventually be
merged with other example applications like marking, but it has been kept separate
as an integration test validating the special name recognition modules. It became
available in PyElly entity extraction as of release v1.1.
input: John Adams married Abigail Smith of Weymouth in 1764.
output: "John Adams"
"Abigail Smith"
The application uses a rule table based on 2010 U.S. Census data. It recognizes the
1,000 most common surnames, the 900 most common male names, and the 1,000
most common female names; and it can infer other name components from context
and from their similarity to known names. This heuristic logic is still experimental; it
is a reworking of a name searching demonstration built in the 1990’s using C.
disambig (.g,.h,.stl,.v) - disambiguation with a PyElly conceptual hierarchy by
checking the semantic context of an ambiguous term. This is only a first try and still
crude compared to the other applications listed here. It was written mainly as an
integration test focusing on cognitive semantics. Its output is a numerical scoring of
semantic relatedness between pairs of possibly ambiguous terms in its input and
showing the generality of their intersection in a conceptual hierarchy. Output will
show with the actual WordNet 3.1 concepts assigned to input words by PyElly.
input: bass fish.
output: 11 00015568N=animal0n: bass0n/[bass] fish0n/[fish]
This uses the PyElly output option to show the plausibility of a translation along with
the translation itself, which is right of the second colon (:) in the output line. The
plausibility will be left of that colon. The output score seen above is 11, which is high,
and the output also includes the concepts associated with a sentence analysis. The
example above shows that the intersection of bass and fish is under the WordNet
concept 00015568n, which bears the label animal0n.
These eight example applications show just some of the simple translations of natural
language you can do. All the processing is still rather basic, with no deep learning or detailed
world models, but it calls for extensive linguistic competence and should give you
helpful examples of how to produce useful results over a broad range of text data.
14.2 Building Your Own Applications
The PyElly example applications are just a few possibilities for NLP short of full
understanding. The purpose of PyElly from an educational standpoint is to encourage
you to create your own unique projects. This will require thinking and tinkering, but
then you will be the one learning, not some neural net.
In general, a good candidate application for PyElly implementation should meet the
following conditions:
1. Your input data is UTF-8 Unicode text divisible into sentences consisting of ASCII,
Latin 1 Supplement, Latin Extended A and B, and special punctuation characters.
This need not be English, but that is where PyElly offers the most builtin support.
2. Your intended output will be translations of fairly short input sentences into
arbitrary Unicode text in UTF-8 encoding, not necessarily in sentences.
3. No world knowledge is required in the translation of input to output except for what
might be expected in a basic dictionary.
4. You understand what your translation would involve and accomplish at least on a
broad scale and mainly need some support in automating it.
5. Your defined vocabulary is limited enough for you to specify manually with the help
of a text editor like vi or emacs, and you can tolerate everything else being treated as
the UNKN syntactic type.
6. Your computing platform has Python 2.7.* installed. This will be needed both to
develop your language rules and to run your intended application.
7. You are comfortable with trial and error development of language definition files
and are willing to put up with idiosyncratic non-commercial software.
Familiarity with the Python language will help here, but is not critical. You will,
however, definitely have to write the code for PyElly cognitive and generative semantics.
All PyElly rule definition will generally be nontrivial, but it will be in a highly restricted
framework, which should be straightforward for someone with basic coding experience.
To build a PyElly application A, you first need to define its rules through various
language definition files. In the extreme case, these would include A.g.elly
(grammar), A.v.elly (vocabulary), A.m.elly (macro substitutions), A.p.elly
(syntactic type patterns), A.n.elly (name components), A.ptl.elly (prefix removal
rules), A.stl.elly (suffix removal rules), A.h.elly (semantic concept hierarchy),
and A.sx.elly (stop punctuation exceptions). Only A.g.elly is mandatory; the rest
can either be empty files or omitted. If you do omit a definition file, the corresponding
default definition will be loaded instead as a substitute.
Here are three fairly simple application projects you can quickly build as a way of getting
to know PyElly better:
A translator from English to pig Latin (a rough sketch of the intended transformation appears after this list).
A bowdlerizer to replace objectionable terms in text with sanitized ones.
A part of speech tagger for English words using only morphological analysis.
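As an illustration of the first project above, here is a minimal sketch in plain Python of the kind of rewriting a pig Latin application would aim at. It is not PyElly rule syntax; an actual application would express the same transformation through grammar, vocabulary, and generative semantic rules.

VOWELS = set('aeiouAEIOU')

def pig_latin_word(word):
    # convert one alphabetic token to pig Latin; leave anything else alone
    if not word.isalpha():
        return word
    if word[0] in VOWELS:
        return word + 'way'            # vowel-initial words just take a suffix
    for i, ch in enumerate(word):      # move leading consonants to the end
        if ch in VOWELS:
            return word[i:] + word[:i].lower() + 'ay'
    return word + 'ay'                 # no vowels at all

def pig_latin(sentence):
    return ' '.join(pig_latin_word(w) for w in sentence.split())

print(pig_latin('hello world'))        # prints: ellohay orldway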
If you are more ambitious, you might try to tackle some applications comparable to the
ones included in the PyElly distribution package. These probably cannot be completed
as short-term projects, but the goal here would be mainly to show the possibilities for a
particular translation approach:
translit - transliterate English words into a non-Latin alphabet or into syllabic or
ideographic representation. Some research will be needed to make the output
plausible as a demonstration to be seen by native speakers of a target language.
editing - detect and rewrite verbose English into concise English, correcting
common misspellings and removing clichés.
anonymizing - remove identifying elements from text: names, telephone numbers,
addresses, and so forth.
nomenclature - identify standard chemical names in text. This is to allow for more
accurate processing of technical text, as such names are often mangled otherwise.
This will probably require writing some special PyElly entity extraction modules.
preprocessing - a reformatter of raw text data into more structured input
friendlier to the input layer of a deep neural network. This is probably more
appropriate for long-term research, but it should be instructive to see whether
computational linguistics can somehow reduce the dimensionality of text input data
for a particular learning problem.
You might also try combining the rules from several current example applications, like
the grammar rules of marking with the name recognition of name with the big
WordNet vocabulary of default. In the example applications already implemented,
PyElly grammar rules have only been in the hundreds and vocabulary rules only in the
thousands, but these can be extended further with the investment of a few months or
even a few weeks of work.
If your application requires a change to a default.*.elly definition file, make sure
that other applications referring to it will still work. To be fully safe, always delete every
saved *.elly.bin file after the change so that PyElly rebuilds them all. Then rerun all
your integration tests.
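If you want to script that cleanup, a minimal Python sketch (assuming the saved tables sit in the current working directory) could be:

import glob
import os

# remove all saved rule tables so that PyElly rebuilds them from the
# *.elly definition files on its next run
for path in glob.glob('*.elly.bin'):
    os.remove(path)
    print('deleted ' + path)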
Eventually, someone could write an application for non-English input. PyElly currently
can recognize input characters from the first four Unicode blocks plus a few extras as
text, allowing you to process French, Spanish, German, Czech, Hungarian, and other
Western European languages as well as Pinyin for Chinese. You will need new rules for
inflectional and morphological analysis for other inflected languages, though, as PyElly
has resources so far only for American English.
15. Going Forward
PyElly offers many different tools and resources for practical computational linguistics.
Most of them are still evolving in both code and documentation; but the system has
been extensively tested since 2013 and should be fairly reliable for non-critical
applications. You need to understand its limitations, but within them, you can achieve
useful results and learn firsthand about the structure of language.
Natural language is complex and hard to process if our goal is to approximate how a
human understands it. Using simplified models of text can be helpful for expediting
some kinds of processing, but this inevitably throws away information. Text as yet
cannot readily be reduced to a clean and logical encoding of information; to mine it
effectively, we still need to dig into it with tools like PyElly.
15.1 What PyElly Tries To Do
PyElly is old-fashioned in its use of manually crafted rules to describe text data for
various translations. This contrasts with current practices favoring unsupervised
automatic machine learning of language structure from large samples of text data.
Machine learning reduces human labor of course; but a rule-based approach can still be
advantageous if someone else has already laid out most of the rules in a usable form.
The challenge here is that natural language operates at many different levels:
orthography (spelling), morphology, lexicography (parts of speech), syntax, semantics,
and pragmatics (intended meaning). Competence in processing text at higher levels is
the Holy Grail, but it depends on achieving it first at lower levels, where a system like
PyElly is already strong. So if we want to build higher-level NLP systems quickly, we
should not condemn ourselves to reinvent the wheel from the bottom up each time.
Creating a PyElly language definition is a non-trivial problem, given the sheer amount of
knowledge required to deal effectively with a language like English. Earlier efforts in
building rule-based systems for artificial intelligence foundered because the number of
rules for real-world applications soon became unmanageable. PyElly tries to avoid this
fate by focusing on low-level computational linguistic knowledge, which is still quite
difficult, but can actually be approached in an incremental way.
In particular, PyElly provides various integrated prebuilt capabilities like English
inflectional stemming and morphological analysis, American name recognition, and
sentence delineation. None of these are “game-changing” technologies, but they do
provide critical basic support. Even when your primary analysis will rely mainly on
machine learning, some basic linguistic sophistication about your data can reduce the
dimensionality of your problem space and speed the finding of practical solutions.
For example, take the issues of punctuation, capitalization, and diacritical marks in raw
text data. A data scientist might be tempted to ignore all of this, but doing so throws away
information that people deliberately put into text for some reason. Such information
can help us in breaking up text into sentences and in determining the roles of words in
text. You can tell if a word is being used denotatively in a name like “Tenderweet
Clams™” or connotatively in a description. This kind of competence in many small
aspects can add up to make a big difference.
In a sense, we can be seduced by what we can accomplish with simplistic processing
strategies. Treating text as a bag of words is often good enough for a basement level of
understanding; but to do better, we must really be smarter about language than an
eight-year-old. A neural net can hide many linguistic details for us, but if English,
Japanese, Arabic, Mayan, and Turkish all look the same to that neural net, then our
approach is probably missing something obvious to any human here.
In processing English, knowledge about its -S, -ED, and -ING inflectional endings can
help in parsing it more effectively. Add in all the other kinds of knowledge built into
PyElly or captured by PyElly rules supplied by users, and we can quickly get ahead of the
game in NLP. The extent of such rules will be much larger than most people expect.
Otherwise, why do most non-native speakers make so many mistakes in English even
after years of being immersed in it?
PyElly will let users control what aspects of their text data to focus on. For example, to
process text from a large corpus of recipes, we might want to identify beforehand basic
elements such as equipment, measurements, cooking techniques, common ingredients,
and descriptions of texture and taste. This can greatly reduce noise in deep learning so
that we can avoid getting terms like “one tsp” as a major feature for distinguishing
recipes. Machine learning should never be made harder than it needs to be.
Any PyElly application will of course translate its text input imperfectly, but even this
will often be a significant improvement over completely ignoring the structure of its text
data. If a PyElly translation needs to improve, you can always add rules to address
specific deficiencies. You can also change PyElly code itself. In line with modern
software engineering, no code will ever be completely finished.
Remember that PyElly applications are never meant to pass any kind of Turing test.
PyElly works with minimal world knowledge and lacks full reasoning and common
sense, limiting its degree of language understanding. This sounds like a handicap, but it
means that PyElly can focus on what it is good at. Think of it as being like a high-end
food processor in the kitchen, which can make a skilled chef more productive, but will
be by itself no guarantee of Michelin three-star cuisine at every meal.
PyElly has already gone far beyond its original PARLEZ roots. If its development is
carried on, one would hope to improve various non-intrinsic capabilities like entity
extraction, handling of semantic concepts, and name recognition. Basic PyElly
algorithms could also be better tuned for efficiency and resilience, especially in
ambiguity resolution. Eventually this should all lead to a successor to PyElly. We are still
learning about natural languages in general and how to process them in particular.
A major breakthrough would be welcome of course, but PyElly and its predecessors have
been more about continually accumulating and integrating various kinds of basic
linguistic competence over time. As we gain experience in building more applications,
this should guide PyElly development in new directions. Progress may seem slow, but
we are engaged with NLP for the long term; and whatever rules we learn stay learned.
15.2 Practical Experience in PyElly Usage
The PyElly marking example application can give you an idea of what to expect in a big
NLP project. Its original goal was to test the limits of PyElly by carrying out shallow
XML markup of actual English text from the Worldwide Web. At the outset, it was
unclear how best to attack this problem and whether PyElly could really handle complex
“wild” input. Shallow markup seemed likely to require at least an order of magnitude
more rules than previous PyElly example applications.
Appendix F describes the details of how the marking application evolved. Our
emphasis here will be more generally on PyElly itself and how it works in a non-trivial
real-world task. With marking, the basic strategy was to start with a simple grammar
and vocabulary and then try to process a large set of target sentences from the Web.
PyElly rules will often fail on unfamiliar input, but by adjusting and expanding them, we
can get an acceptable markup even for hard sentences. Such changes in rules of course
have to preserve any previous successful markups.
The first marking language definition consisted of only 132 grammar rules, 161
vocabulary rules, 28 pattern rules, and 0 macro substitution rules. This had been
checked out beforehand with some simple input before moving on to markup of real
Web text. The first sample from the Web consisted of 22 segments of text taken from
different sources with 129 sentences altogether. Every segment was processed as a single
continuous input in which PyElly had to figure out the sentence boundaries.
The hope was that PyElly would be flexible enough to let us produce acceptable
markups eventually for every Web test segment through enough rule changes, possibly
highly specific to a problem sentence. At some point, we would hope that fewer new
rules should be needed for each new segment, but this failed to pan out. The exercise did
uncover many previously undetected problems in PyElly Python code, however, and
these bugs did drop off over time.
Eventually, PyElly did process all 129 sentences of the first Web sample acceptably. This
showed that PyElly was capable of heavy-duty work, although the viability of a general
markup application remained in question. More testing was called for, and so another
five samples with 759 sentences were eventually collected from eighty more Web
sources. These took about two years to mark up successfully. In that time, the marking
language definition to grow to 612 grammar rules, 6,157 vocabulary rules, 97 pattern
rules, and 50 macro substitution rules. It has not yet stopped growing.
The marking application itself is still unfinished. The failures seen when processing the
last segment of the sixth sample for the first time were similar to those seen for the first
segment of the first sample. Nevertheless, the PyElly shallow markup rules are actually
getting better, and so it is reasonable to continue collecting more data from the Web to
see how far we can go. Natural language now seems to be too complicated for full
success in shallow markup anytime soon, but being able to handle thousands of real-
world sentences is no minor accomplishment.
PyElly also showed that it was able to handle a large number of grammar and
vocabulary rules. The loading of those rules from definition files has so far been
manageable for the marking application, and processing with them has yet to push
PyElly anywhere near a breaking point. Sentences in our expanded test set of 878 now
take an average of about 2 seconds for markup, although most sentences can be
processed in less than a second. This performance seems reasonable for Python code
running interpretively without concerted optimization.
On the whole, the results of the markup project have been positive. It has been a good
way of discovering the pitfalls of English text on the Web and has helped to test PyElly
code more thoroughly. It has revealed that a combinatoric approach to ambiguity runs
into difficulty when sentences are long and have many unknown tokens; that needs to be
addressed in future PyElly releases. We have, however, validated PyElly as an integrated
tool able to go beyond operating in a sandbox and shown that shallow markup is an
achievable goal when we allow for a small sentence failure rate, say, less than 10 percent.
So, there is plenty of work left to do here. The obvious direction going forward is to feed
still more Web text to the marking application until we hit some kind of hard wall. At
the present stage, we do not want to do all the work to check out and correct all the
markups for every sentence in new text segments, but would focus on adjusting rules to
fix any PyElly parsing failures. In this way, the PyElly marking application should
continue to improve and PyElly itself should gain credibility as an NLP toolkit.
15.3 Where We Stand For Now
The entire PyElly open-source package is free and compact enough for home use. It will
operate at a more detailed level than other natural language toolkits, but this should
help students and experimenters to get a better feel for the nuts and bolts of natural
language processing and to understand how basic capabilities work. Unrestricted text
data will always be hard to handle, and NLP practitioners need to learn where all the
potholes lie on the road and how to maneuver around them.
PyElly can be intimidating in its complexity with so many different kinds of language
rules somehow coordinating to produce a result. This complexity is appropriate for
natural language, however. No one sat down to design a language like English, Chinese,
Amharic, or Quechua from logical first principles. They evolved haphazardly over
thousands of years according to changing local requirements. We should not expect that
any set of descriptive rules for them will be simple.
The good news is that the rule-based translation framework of PyElly seems to hold up
at least to a level of many hundreds of grammar rules and tens of thousands of
vocabulary rules, not to mention all the other PyElly rule types in any real-world
application. We do really need all such knowledge to get beyond the most basic kinds of
sentence rewriting. Our rules require an effort for a user to compile for the first time,
but we may be able to carry them across different applications to save us work later.
One could probably spend another forty years just tinkering with various computational
linguistic tools in PyElly. In the 1980s, artificial intelligence researchers thought that
they had everything they needed to understand natural language within a decade or so,
but no one should be that confident today, even with deep learning. A more realistic
goal is for us just to continue making progress through sustained effort as well as
occasional sudden inspiration. Everyone can help to blaze some trails here; and the
more people working on it, the better.
A general takeaway here is that, for any nontrivial natural language processing, we really
need to spend some serious time with linguistic details. Our own human skills for
language understanding are complex and need to be reflected in our processing
algorithms. The marking samples show clearly that even apparently simple text
analysis of real-world data will require having detailed knowledge of a target language
and its world of discourse. This holds even with deep neural nets or other machine
learning. Real linguistics is far from being obsolete yet for data scientists.
PyElly provides a framework to exploit human expertise in the processing of natural
language. Yes, an unsupervised neural net in theory can learn everything automatically
with enough training data, but one would have to create the equivalent of the twelve
years of exposure to language for a human child in the home, on the streets, in schools, and
in various media. Except in cases of child abuse, much of our own learning while
growing up involved extensive supervision; a “child” neural net deserves no less.
We still do not have a magic bullet for NLP or even really know how to lay out a plan for
finding one. A lack of knowledge about the inner and outer workings of language is not
the problem, however. As an educational system, PyElly is about collecting all kinds of
such linguistic knowledge and then turning people loose to experiment and see what is
ultimately possible. Let us not be hyperbolic here. PyElly is just a collection of code, but
it has been around the Horn a few times already.
Try PyElly on your favorite text corpora and see whether its translations can improve
how your current or future data analysis and mining technology performs. Share with
others any new PyElly rules you develop or old rules you have refined. Any criticisms or
suggestions about PyElly itself will be welcome, and of course, you may freely improve
on the PyElly open-source Python code and share this with everyone else. You might
even decide to build your own comprehensive open-source natural language processing
tool. Happy explorations in a wide open world!
Appendix A. Python Implementation
This appendix is for Python programmers. You can run PyElly without knowing its
underlying implementation, but at some point, you may want to modify PyElly or embed
it within some larger information system. The Python source code for PyElly is released
under a BSD license, which allows you to change it freely as needed. You can download
it from
https://github.com/prohippo/pyelly.git
PyElly was written in Python 2.7.5 under Mac OS X 10.9 and 10.10; it will not run under
earlier versions of Python because of changes in the language. To implement its external
vocabulary tables, PyElly v1.2+ no longer requires the Berkeley Database (BDb)
database manager or the bsddb3 third-party BDb Python API wrapper. The PyElly code
with BDb in release v1.1 or earlier should be discarded because a GPL copyleft license
would be required for it, greatly limiting freedom of use. Vocabulary table lookup code
has also evolved greatly in PyElly v1.2 and v1.3.
Currently, the PyElly v1.3+ source code consists of 64 Python modules, each a text file
named with the suffix .py. All modules were written to be self-documenting through
the standard Python pydoc utility. When executed in the directory of PyElly modules,
the command
pydoc -w x
will create an x.html file describing the Python module x.py.
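To document every module at once, a small convenience loop (not part of the PyElly distribution) can call the same utility through the pydoc library. This assumes it is run from the PyElly source directory and that each module can be imported standalone:

import glob
import pydoc

# write an HTML page for each Python module found in the current directory
for src in glob.glob('*.py'):
    pydoc.writedoc(src[:-3])    # produces <module>.html next to the source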
The code was written neither for speed of execution nor for space efficiency. This is
normal in Python development practice, however, because it is an interpreted language,
and it is consistent with PyElly’s emphasis on quickly putting together a broad range of
functionality for doing useful language processing right now. Although the code has
become fairly stable through extensive testing, it remains experimental and still keeps
many debugging print statements that can be reactivated by uncommenting them.
The algorithms underlying PyElly have become somewhat intricate after decades of
tinkering. Section 12 of this manual does describe the bottom-up parsing approach of
PyElly, but other important aspects of the system have to be gleaned from source code.
The most notable of these are wildcard string matching, macro substitution, the non-
deterministic finite automaton for token pattern matching, compiling PyElly code for
generative and cognitive semantics, and multi-element vocabulary table lookup.
Here is a listing of all 64 current PyElly Python modules grouped by functionality. Some
non-code definition and data files are also included below when integral to the
operation or builtin unit testing of the modules in a group; these can be recognized
by their non-.py file suffixes.
Inflectional Stemmer (English)
  ellyStemmer.py - base class for inflection stemming
  inflectionStemmerEN.py - English inflection stemming
  stemLogic.py - class for stemming logic
  Stbl.sl - remove -S ending
  EDtbl.sl - remove -ED ending
  Ttbl.sl - remove -T ending, equivalent to -ED
  Ntbl.sl - remove -N ending, a marker of a past participle
  INGtbl.sl - remove -ING ending
  rest-tbl.sl - restore root as word
  spec-tbl.sl - restore special cases
  undb-tbl.sl - undouble final consonant of stemming result
Tokenization
  ellyToken.py - class for linguistic tokens in PyElly analysis
  ellyBuffer.py - for manipulating text input
  ellyBufferEN.py - manipulating text input with English inflection stemming
  substitutionBuffer.py - manipulating text input with macro substitutions
  macroTable.py - for storing macro substitution rules
  patternTable.py - extraction and syntactic typing by FSA with pattern matching
Parsing
  symbolTable.py - for names of syntactic types, syntactic features, generative semantic subprocedures, global variables
  syntaxSpecification.py - syntax specification for PyElly grammar rules
  featureSpecification.py - syntactic and semantic features for PyElly grammar rules
  grammarTable.py - for grammar rules and internal dictionary entries
  grammarRule.py - for representing syntax rules
  derivabilityMatrix.py - for establishing derivability of one syntax type from another so that one can make bottom-up parsing do nothing that top-down parsing would not
  ellyBits.py - bit-handling for parsing and semantics
  parseTreeBase.py - low-level parsing structures and methods
  parseTreeBottomUp.py - bottom-up parsing structures and methods
  parseTree.py - the core PyElly parsing algorithm
  parseTreeWithDisplay.py - parse tree with methods to dump data for diagnostics
Semantics
  generativeDefiner.py - define generative semantic procedure
  generativeProcedure.py - generative semantic procedure
  cognitiveDefiner.py - define cognitive semantic logic
  cognitiveProcedure.py - cognitive semantic logic
  semanticCommand.py - cognitive and generative semantic operations
  conceptualHierarchy.py - concepts for cognitive semantics
Sentences and Punctuation
  ellyCharInputStream.py - single char input stream reading with unread() and reformatting
  ellySentenceReader.py - divide text input into sentences
  stopExceptions.py - recognize stop punctuation exceptions in text
  exoticPunctuation.py - recognize nonstandard punctuation
  punctuationRecognizer.py - define single-character punctuation defaults for English
Morphology
  treeLogic.py - binary decision logic base class for affix matching
  suffixTreeLogic.py - for handling suffixes
  prefixTreeLogic.py - for handling prefixes
  morphologyAnalyzer.py - do morphological analysis of tokens
Entity Extraction
  entityExtractor.py - runs Python entity extraction procedures
  extractionProcedure.py - some predefined Python entity extraction procedures
  simpleTransform.py - basic support for text transformations and handling of spelled out numbers
  dateTransform.py - extraction procedure to recognize and normalize dates
  timeTransform.py - extraction procedure to recognize and normalize times of day
  nameRecognition.py - identify personal names
  digraphEN.py - letter digraphs to establish plausibility of possible new name component
  phondexEN.py - get phonetic encoding of possible name component
  nameTable.py - defines specific name components
Top Level
  ellyConfiguration.py - define PyElly parameters for input translation
  ellySession.py - save parameters of interactive session
  ellyDefinition.py - language rules and vocabulary saving and loading
  ellyPickle.py - basic loading and saving of Elly language definition objects
  interpretiveContext.py - handles integration of sentence parsing and interpretation
  ellyBase.py - principal module for processing single sentences
  ellyMain.py - top-level main module with sentence recognition
  ellySurvey.py - top-level vocabulary analysis and development tool
  dumpEllyGrammar.py - methods to dump out an entire grammar table
External Database
  vocabularyTable.py - interface to external vocabulary database
  vocabularyElement.py - internal binary form of external vocabulary record

Test Support
  parseTest.py - support unit testing of parse tree modules
  stemTest.py - test stemming with examples from standard input
  procedureTestFrame.py - support unit test of semantic procedures
  generativeDefinerTest.txt - to support unit test for building of generative semantic procedures
  cognitiveDefinerTest.txt - to support unit test for building of cognitive semantic procedures
  suffixTest.txt - to support comprehensive unit test with list of cases to handle
  morphologyTest.txt - to support unit test with prefix and suffix tree logic plus inflectional stemming
  sentenceTestData.txt - to support unit test of sentence extraction
  testProcedure.*.txt - to run with the generativeProcedure.py unit test to verify correct implementation of generative semantic operations
  *.main.txt - input text for integration testing
  *.main.key - expected output text for integration testing with provided input

All *.py and *.sl files listed above are distributed together in a single directory along
with *.main.* integration test files. The *.txt files for unit testing will be in a
subdirectory forTesting.

The first v0.1beta version of the Python code in PyElly was written in 2013, with some
preparatory work done in November and December of 2012. This was an extensive
reworking and expansion of the Java code for its Jelly predecessor, making it no longer
compatible with Jelly language definition files. PyElly v1.0 moved beyond beta status as
of December 14, 2014, but active development continues. The latest release is v1.4.25.

The emphasis in PyElly is now moving away from adding new Python modules and
towards better reliability and usability. Existing modules will continue to evolve,
but mainly to provide better support for building real-world applications to process
unrestricted natural language text. This will involve the eventual construction of big
grammars and big vocabularies, which should be the ultimate test of any natural
language toolkit.
Appendix B. Historical Background
PyElly natural language tools have evolved greatly in the course of being completely
written or rewritten five times in five different languages over the past forty years.
Nevertheless, the system retains much of the flavor of the original PDP-11 assembly language
implementation of PARLEZ. Writing such low-level code forced PyElly software
architecture to be simple, a help in porting the system to later computing platforms.
The PARLEZ system, for example, had its own stripped-down custom language for
generative semantics because nothing else was available at the time. That solution is
platform-independent, however, and so has been carried along with only a few changes
and additions in systems up to and including PyElly for generating translated text
output. And, sorry, arithmetic is still unsupported.
PyElly departs in major ways even from its immediate predecessor Jelly.
The inflectional stemmer has simplified its basic operations and has eliminated
internal recursive calling. Stemming logic is now in editable text files for loading at
run-time. The number of special cases recognized in English has been greatly
expanded. The -N and -T English inflectional past tense endings were added.
Morphological analysis from Jelly was enhanced to include proper identification of
removed prefixes and suffixes as well as returning stems (lemmas). This results in a
true analytic stemmer, appropriate to a general natural language tool. English suffix
recognition was greatly expanded and now covers most WordNet “exceptions.”
The syntactic type recognizer was upgraded to employ an explicit non-deterministic
finite-state automaton with transitions made when an initial part of an input string
matches a specified pattern with both literal characters and wildcards in a state. A special
null pattern was added for more flexibility in defining automata state changes.
New execution control options were added to generative semantics. Local and global
variables were changed to store string values, and list and queue operations were
defined for local variables. Deleted buffer text can now be recovered in a local variable.
Support for debugging and tracing semantic procedures was expanded.
Semantic concepts were added to cognitive semantics for ambiguity handling. This
makes use of a new semantic hierarchy with information derived from WordNet.
Vocabulary tables have become more easily managed by employing the SQLite
package to manage persistent external data. An external vocabulary rule was limited to
a single input line to facilitate automatic compilation of definitions.
Support for recognizing and remembering personal names or name fragments is now
available in PyElly entity extraction. This is based on 2010 U.S. Census data.
A new interpretive context class was introduced to coordinate execution of generative
semantic procedures and consolidate data structures for parsing and rewriting.
Handling of Unicode was improved. UTF-8 is now allowed in all language definition
files and in interactive PyElly input and output. Non-data parts of rules are still ASCII.
Ambiguity resolution was completely overhauled with expanded cognitive semantics.
Control character recognition in input allows for closer management of the explosion
of phrase nodes due to ambiguity.
Sentence and punctuation processing is cleaner and more comprehensive. Unicode
formatted punctuation and small Greek letters can now be recognized.
The PyElly command line interface was reworked to support new initialization and
rewriting options.
Error handling and reporting has been broadened for the definition of language rules.
Warnings have also been added for common problems in definitions.
New unit tests have been attached to major modules. Integration tests are with
existing example applications developed to exercise a broad range of PyElly features.
New example applications have also been written to show the broad range of PyElly
processing. These also serve as integration tests to validate PyElly natural language
capabilities; older example applications from Jelly and earlier systems were converted
to run in PyElly.
Many bugs from Jelly or earlier were uncovered by extensive testing and fixed.
Jelly is now superseded and retired, reflecting the current greater importance of
scripting languages like Python in software development and education.
PyElly is by no means perfect or complete and might be rewritten in yet another
programming language as computing practices change. The goal here, however, is less a
long-term utopian system than an integrated set of reliable natural language processing
tools and resources immediately helpful to people building practical natural language
applications. Many PyElly tools and resources will seem dated to some technologists;
but these have had time to mature and prove their usefulness and to continue evolving.
There is no point in continually having to reinvent or relearn such capabilities.
Going forward, we want to implement and demonstrate a robust capability to process
arbitrary English text in a nontrivial way. Currently, this goal has been explored in the
context of the marking example application begun just after the v1.0 release of PyElly.
The language definition problem in marking is quite complex and has been helpful in
exercising almost all of PyElly except for conceptual hierarchies.
This PyElly User’s Manual revises, reorganizes, and greatly extends the earlier one for
Jelly, but still retains major parts from the original PARLEZ Non-User’s Guide, as it was
printed out on an early dot-matrix printer. Editing for clarity, accuracy, and
completeness continues. Check https://github.com/prohippo/pyelly.git for the latest
PDF for the manual.
Appendix C. Berkeley Database and SQLite
Berkeley Database is an open-source database package available by license from Oracle
Corporation, which in 2006 bought SleepyCat, the company holding the BDb copyright.
BDb was the original basis for PyElly external vocabulary tables, but has been replaced
by SQLite because of changes in Oracle licensing policy for BDb.
The current PyElly v1.2+ vocabulary tables with SQLite should incur no noticeable
performance penalty despite having to access all persistent data through an SQL
interface instead of function calls. SQLite will allow PyElly to remain under a BSD
license, since SQLite is included in the Python 2.7.* and 3.* libraries. It does not have to
be downloaded separately.
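As a purely illustrative sketch (not PyElly's actual vocabulary table schema), the following shows how the builtin sqlite3 module keeps persistent key/value data without any separate database installation; the file name and table layout here are hypothetical:

import sqlite3

conn = sqlite3.connect('example.db')   # hypothetical database file
conn.execute('CREATE TABLE IF NOT EXISTS vocab (term TEXT PRIMARY KEY, info TEXT)')
conn.execute('INSERT OR REPLACE INTO vocab VALUES (?, ?)', ('bass', 'noun'))
conn.commit()
row = conn.execute('SELECT info FROM vocab WHERE term = ?', ('bass',)).fetchone()
print(row[0])                          # prints: noun
conn.close()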
If running with Berkeley Database is really important, you can download it yourself and
return to the former PyElly vocabulary table code in v1.1. The latest versions of BDb do
come with a full GNU copyleft license, though, possibly entailing unattractive legal
implications. The apparent intent of Oracle here is to compel many users of BDb to buy
a commercial license instead of using free open-source code.
The Python source of vocabularyTable.py in the previous PyElly v1.1 release has
NOT been updated, however, to display a GNU copyleft license as required for use of
BDb. That will not happen unless there is a reason to fork off a specific BDb version of
PyElly in the future.
Downloading Berkeley Database and making it available in Python is a complex process
depending on your target operating system. You will typically need Unix utilities to
unpack, compile, and link source code. For background on Berkeley Database, see
http://en.wikipedia.org/wiki/Berkeley_DB
For software downloads, you must go to the Oracle website
http://www.oracle.com/technetwork/database/database-technologies/berkeleydb/downloads/index.html
to get the latest Berkeley Database distribution file. The instructions for doing so on a
Unix system can be viewed in a browser by opening the Berkeley Database
documentation file:
db-*/docs/installation/build_unix.html
The installation procedure should be fairly straightforward for anyone familiar with
Unix. An actual MacOS X walkthrough of such compilation and installation can be
found on the Web at
https://code.google.com/p/tonatiuh/wiki/InstallingBerkeleyDBForMac
To access Berkeley Database from Python, you must next download and install the
bsddb3 package from the web. This is available from
https://pypi.python.org/pypi/bsddb3
The entire installation procedure turns out to be quite complicated, however, and
difficult to carry out directly from a command line. The problem is with dependencies
where a module A cannot be installed unless module B is first installed. Unfortunately,
such dependencies can cascade unpredictably in different environments, so that one
fixed set of instructions cannot always guarantee success.
To avoid missteps and all the ensuing frustrations, the best approach is to use a software
package manager that will trace out all module dependencies and formulate a workable
installation path automatically. On MacOS X, several package managers are available,
but the current favorite is homebrew. See this link for general details:
http://en.wikipedia.org/wiki/Homebrew_(package_management_software)
As it turns out, homebrew will also handle the installation of Berkeley Database and the
upgrading of Python on MacOS X to version 2.7.5 (recommended for bsddb3). If you
have a MacOS system with Xcode already installed, you can follow these steps to
download the homebrew package and use its brew and pip commands to get Berkeley
Database:
# get homebrew
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
# get latest Python
brew install python --framework
# get BDb
brew install berkeley-db
sudo BERKELEYDB_DIR=/usr/local/Cellar/berkeley-db/5.3.21/ pip install bsddb3
(This procedure is subject to change. See the latest pertinent webpages for the most
current information.)
This web page explains what is going on here:
http://stackoverflow.com/questions/16003224/installing-bsddb-package-python
The homebrew package manager is helpful because it maintains a shared community
library of tested installation “formulas” to work with. These resources are specific to
MacOS X, however, making homebrew inapplicable to Windows or even Linux or other
Unix operating systems. If you are running on a non-MacOS X platform, you have to
turn to other software package managers; see
http://en.wikipedia.org/wiki/List_of_software_package_management_systems
Some of these managers implement parallels to homebrew commands, but you will
have to check what parts are actually equivalent.
Appendix D. PyElly System Testing
After any major change to PyElly source code, you should thoroughly validate the
resulting system. Check that every module passes muster with the Python interpreter and
run the current suite of unit and integration tests in the PyElly package.
The following PyElly Python modules have builtin unit tests, executed with the
command python M.py, where M is the module name (a batch-running sketch appears
after this list):
cognitiveDefiner cognitiveProcedure conceptualHierarchy
conceptualWeighting dateTransform derivabilityMatrix
dumpEllyGrammar ellyBits ellyBuffer
ellyBufferEN ellyChar ellyCharInputStream
ellyDefinition ellyDefinitionReader ellySentenceReader
ellySurvey ellyWildcard entityExtractor
extractionProcedure featureSpecification generativeDefiner
generativeProcedure grammarRule grammarTable
inflectionStemmerEN macroTable morphologyAnalyzer
nameRecognition nameTable parseTree parseTreeBase
parseTreeBottomUp parseTreeWithDisplay patternTable
phondexEN prefixTreeLogic punctuationRecognizer
simpleTransform stemLogic stopExceptions
substitutionBuffer suffixTreeLogic syntaxSpecification
timeTransform treeLogic vocabularyTable
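Here is a hedged convenience sketch (not part of the PyElly distribution) for running a batch of these builtin tests from the PyElly source directory; standard input is redirected from the null device so that tests which optionally read extra input see an immediate end of file:

import os
import subprocess

MODULES = ['ellyBits', 'ellyBuffer', 'stemLogic', 'grammarTable']  # extend as needed

failures = []
with open(os.devnull) as devnull:
    for m in MODULES:
        # a nonzero exit status is taken to mean the builtin unit test failed
        if subprocess.call(['python', m + '.py'], stdin=devnull) != 0:
            failures.append(m)

print('failed: ' + (', '.join(failures) if failures else 'none'))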
Most of these unit tests are self-contained with predefined input test data, but some also
will read sys.stdin to get additional input for testing:
ellyBase ellyBufferEN ellyCharInputStream
entityExtractor inflectionStemmerEN macroTable
morphologyAnalyzer nameRecognizer nameTable
patternTable phondexEN stemLogic
substitutionBuffer suffixTreeLogic vocabularyTable
Running your own examples with the unit test of one of these modules can help you
track down a specific problem in a language description. Enter as many inputs as you
want; just type a <RETURN> by itself to terminate the input loop here. Manually entered
test examples will be optional for pre-release PyElly unit validation, however.
The following PyElly modules have specific test input files included in the standard
distribution from GitHub. Their associations are as follows:
cognitiveDefiner: cognitiveDefinerTest.txt
ellySentenceReader: sentenceTestData.txt
generativeDefiner: generativeDefinerTest.txt
generativeProcedure: testProcedure.*.txt
morphologyAnalyzer: suffixTest.txt
In unit testing, these files are either specified in a commandline argument or read from
redirected standard input. See the Python code for each tested PyElly module to see how
to do this. Unit test output can usually be quickly evaluated by inspection.
For integration testing, the doTest shell script with argument A will run ellyMain.py
with preselected parameters for application A while reading input from A.main.txt. In
full integration testing, run doTest with each of the PyElly applications having
language definition files in the applcn subdirectory of the PyElly download package:
./doTest echo
./doTest test
./doTest stem
./doTest fail
./doTest indexing
./doTest texting
./doTest doctor
./doTest chinese
./doTest querying
./doTest marking
./doTest name
./doTest disambig
The doTest script will automatically compare PyElly output for A.main.txt input
with the corresponding A.main.key file with the builtin diff shell command. For
example, doTest querying might produce
test application= querying, input= querying.main.txt
real 0m0.404s
user 0m0.345s
sys 0m0.040s
< ACTUAL
> EXPECTED
. . .
This reports the running time of the test along with any significant output differences
found by diff listed out below. A successful test should produce no differences.
The comparison here is always with the PyElly translation as the first argument to diff
(ACTUAL) and with the .key file (EXPECTED) as the second. The doTest script will try to
ignore any PyElly output that is not part of a translation. If your application produces any
extra output, however, such as from PyElly execution monitoring commands in your
rule semantics, then it may show up as a difference. Please make the appropriate
adjustments in your interpretation of the test results.
The bad application will always fail to generate a language definition because its rules
are all deliberately malformed. Nevertheless, you should run it with doTest as part of
integration testing to verify that PyElly can recover from and report on errors in rules.
There should be no uncaught exceptions arising from rule input. Error messages from
table building also should be appropriate.
Integration testing will be the main form of regression testing to verify that a new
version of PyElly or application rules can still handle what it used to do correctly. This is
important to do frequently, even with only small changes in PyElly code or in
application rules. Otherwise, translation problems can quickly accumulate and
compound, making debugging quite difficult.
If you do change PyElly code or either the language definitions for a test application or
its integration test input, you should also update the *.main.key files to reflect any
expected changes these will make in PyElly translations.
With doTest, you can also process input from a particular file x.main.txt while
running with the language definition files for an application A. Do this with the
command line for doTest:
./doTest A x
The marking integration test in the PyElly package now includes five extra test input
files marking.more*.main.txt, to run in the above way for extended testing. There
are also corresponding marking.more*.main.key files to compare with the
translations done by PyElly with the marking language definition rules. This splitting of
test data into multiple files is mainly historic; but it allows for faster checking when
trying to fix an isolated translation problem found in integration testing.
The broad suite of integration test applications serves to validate the range of actual
processing demonstrated historically in PyElly and its predecessors. The test instances
are still limited, however, so that success at one point is no guarantee that a given
language definition will still work with all future uncontrolled input. We can only
achieve a reasonable level of competence at some point with a given set of PyElly
application rules and then continually strive to do better.
Although all the other tests are necessary, the marking integration test currently is the
most important. It has by far the most grammar and vocabulary rules and includes more
test sentences than all the other integration tests combined. Its input has actually been
collected from across the Web; this is not a bunch of sentences cooked up with nice
linguistic characteristics. Such “wild” data has always been a challenge for any natural
language system to process and so provides a serious shakedown of both PyElly code
and language definition rules for an application.
Upcoming PyElly work will push testing with even more “wild” data in marking and
other applications. Fortunately or unfortunately, finding sentence examples to break
PyElly processing is still quite easy with any current application. So validation always
has to be tentative when building up any particular kind of language competence; but
then that is how human beings naturally learn language anyway. The marking test set
currently includes over 875 sentences and takes about ten minutes to run.
Work on the marking, chinese, and name example applications will also expand the
number of examples for integration tests with those applications. The main goal here is
to see how far we can go with PyElly before language definitions become too
complicated or too bloated for most users. So far, experience with marking has been
encouraging; despite many unexpected issues and many bugs uncovered, the basic
PyElly approach to natural language has held up. Phrase node overflow with long
sentences has been the most serious problem so far.
The unpredictability of the marking test data has been quite helpful in putting stress on
PyElly. This has helped to uncover many bugs in PyElly itself as well as forcing the
evolution of language rules to handle the range of expression found in actual English
text. More data from the Web is being obtained as time permits to continue such testing
going forward into 2018.
A new release of PyElly will be uploaded to the Web only after it passes all current unit
and integration tests and the pylint tool has checked every modified Python file for
common problems in source code. The ellyMain.py module should also be run with the
bad language rules to verify that this will not crash PyElly. PyElly will receive a new
version number only for significant code changes, which will usually mean that
previously saved language rule tables have to be regenerated. Any change in how PyElly
works will also have to be described in an update of this PyElly User's Manual; that
should be done even for releases with no change in version number.
Appendix E. PyElly as a Framework for Learning
PyElly is about doing computational linguistics with rules. It calls for some serious
digging into the structure of text, which many data scientists nowadays would rather
avoid. With many kinds of text processing, minimizing prior language expertise can be a
good strategy for system building; but natural language in general is hard and will
usually defeat simple solutions. We need to be both smart and knowledgeable to do NLP
most effectively. PyElly brings many resources together to let students experiment freely
and to give practitioners proven capabilities right out of the box.
Lately, machine learning and artificial intelligence have been hot areas in computing
applications. Notable successes have come out of image recognition and playing of
games like chess, go, and poker; and deep neural nets have become the go-to technology
for automatic machine translation of natural language. Neural nets in particular are able
to learn without supervision when given sufficient training data, and so one must ask
whether the rule-based techniques of 20th Century computational linguistics still
matter. The idea here is simply to turn AI loose on big data, sit back, and clip coupons.
If you look at some actual machine translation on the Web like Google Translate,
however, it is easy to find atrocious results even with deep learning. This is especially
bad when the languages come from quite different family groups like Chinese and
English. In theory, one might think that a neural net able to learn how to translate well
from Spanish to English would be immediately applicable to learning to go from Chinese
to English. In reality, translation is problematic even with closely related languages.
Much has already been written about the complexities and idiosyncrasies of natural
language. Its structure is messy, usually with many different dialects well on their way to
evolving into different languages. Any data scientist should at least be aware of such
problems, but to be really serious about processing text data effectively, one needs a
deeper understanding possible only through extensive practical experience. PyElly will let
students get it by tackling and digging into real-world text.
The PyElly toolkit currently includes a non-deterministic finite-state automaton, macro
substitutions with wildcard matching, bottom-up parsing driven by a grammar,
ambiguity handling through semantics, support for large external dictionaries with
multiword terms, basic text entity extraction for dates and times, punctuation handling
and sentence demarcation, and special logic for recognizing the names of people. You
can combine these resources in various ways to build sophisticated text processing
systems like the example applications in the PyElly distribution package.
The PyElly tools should all be familiar to computer science students and remain useful
despite the new technologies like automatic programming or artificial intelligence with
deep learning. Not every data scientist has to be an expert on low-level text processing,
but someone on a major project should be conversant about its complexity, if only to
temper any magical thinking about breezing through it given enough training data.
Students and practitioners both need to learn some humility here.
With PyElly, a student will see how computational linguistics actually works, whether
this is in analyzing the structure of sentences or just extracting stemmed keywords. All
the rules for PyElly language processing will be fully visible, never buried in the entrails
of some deep neural network. Appropriate rules for a new application will take time to
compile and develop, and analyses with them may be messy, but such difficulty is
inherent to all natural languages and is what makes it a challenge to process them.
PyElly encourages students to play with language and makes it easy to build toy
applications that nevertheless carry out real, nontrivial NLP. Such simple exercises may
be of little interest for academic research or for commercial exploitation, but they are
excellent for learning and open the way to more challenging projects down the road. They
could conceivably even be fun for students uninterested in eventually becoming
computational linguists.
A good way to employ PyElly in secondary schools is to let students take on individual or
group projects focused on a single narrow NLP task like those described in Section 14,
not necessarily covering all the supported features of PyElly. You probably want to avoid
something as complex as the PyElly marking example application, but something like
chinese or texting is quite doable. Stay away from projects involving extensive
extra-linguistic background research, since that alone could easily eat up all available
time; let students focus instead on learning specifically about natural language and NLP.
A reasonable PyElly project should take from two to six weeks and may involve defining
about a hundred rules altogether. Experimentation is important; no one should be
satisfied with initial success in processing a given set of text data input. Instead,
students should try to expand the coverage of their applications as much as possible and
to see what happens when small changes are made to language definition rules. It is all
too easy to get rules tangled up, and future practitioners need to know the hazards.
Unsuccessful projects can be as valuable for learning as successful ones. Natural
language processing of unrestricted text is hard. The varieties of grammatical structure
and the exponential possibilities for alternate interpretations of a sentence with
ambiguities are often overwhelming. This problem will not go away with approaches like
unsupervised deep learning. So, students need to confront and get a sense of the nature
of difficulties one will face here.
As always, it is best to start with something simple and gradually build it up step by step
into something more elaborate. This can take quite a while. Students should expect to
make many mistakes and to spend plenty of time in debugging, reworking, and
clarifying their language descriptions. They should nevertheless try to process as much
real-world input text as they can. That experience will undoubtedly be where the most
important NLP learning will happen with PyElly or any other system.
Appendix F. A Shallow XML Markup Application
The marking example application originally was to be a limited demonstration of
natural language translation—in this case, rewriting English text into XML with shallow
markup. The idea was to define just enough language rules to analyze a finite, but
diverse, target set of raw text from the Worldwide Web. The initial data here consisted
of 22 text segments in English, containing a total of 2,766 words in 129 sentences. This
would show how well PyElly could perform on real-world sentences.
In most student natural language processing, target data is carefully selected and
cleaned up. This is fine for student projects to be completed in a short time, but an NLP
toolkit intended for real-world use needs to be tested with the unvarnished stuff that
people actually write. Anyone who spends much time on the Web will be aware of the
challenges, and we need to be aware of what specific difficulties will show up and
whether the various capabilities of PyElly are adequate for getting past them.
By design, PyElly is all about encapsulating basic linguistic competence in a compact
software package. This choice means that we have to limit the amount of world
knowledge available to guide processing; but it makes PyElly much easier to implement
than NLP by an artificial intelligence. PyElly as such still lets us accomplish many
nontrivial processing tasks, but we now must ask how it would fare in the real world,
like getting off an enclosed test track onto actual streets, highways, and dirt roads.
Shallow markup of Web text generally should be simpler than a full grammatical
analysis. It would mainly identify simple noun phrases like THE SIX RED APPLES and
simple verb phrases like HAS NOT YET BEEN FOUND plus assign traditional parts of
speech like noun, verb, adjective, and so forth to the various tokens in a sentence. This
would label low-level text entities like numbers, dates, and times, but would ignore more
complex recursive grammatical structures like relative clauses, participial phrase
modifiers, or direct or indirect quotations.
Here is an example of the shallow XML markup we would do for a simple sentence:
Several dedicated men promised they would reunite on September 11th, 2021.
<sent>
<nclu><quan>several</quan><adj>dedicate -ed</adj><noun>men</noun></nclu>
<vclu><verb>promise -ed</verb></vclu>
<pro>they</pro>
<vclu><aux>would</aux><verb>reunite</verb></vclu>
<nclu><date><prep>on</prep>09/11/2021</date></nclu>
<punc>.</punc></sent>
where each basic noun phrase is tagged as nclu and each basic verb phrase is tagged as
vclu. The word SEVERAL is a quan (quantifier) in a noun phrase and THEY is a pro
(pronoun). The preposition ON is tagged as a prep (preposition) attached to a date
element, recognized as a PyElly entity and interpreted syntactically as a noun phrase.
The word WOULD is an aux (auxiliary) in a basic verb phrase. The final period (.) is
marked punc, but the comma (,) in the date was dropped in its rewriting. Finally, all the
elements are grouped together under the sent tag marking a full sentence.
This is far from understanding what the sentence means. Reliably tagging each word
with its part of speech and then grouping those words into simple phrases is a good
start, however. In particular, it could provide nontrivial information to support and
guide deeper sentence analyses. So, it should be useful to have a compact set of PyElly
rules able to mark up Web text in English as a kind of preprocessing.
There is of course no one way to mark up a given sentence, but our particular scheme
here should be adequate for a proof of principle. It is consistent with how most of us
understand the grammar of English and is fairly easy for PyElly language rules to
produce. As it turns out, the main difficulty of markup will not be in the form of output,
but in making sense of highly unpredictable input. The challenge quickly becomes
evident when we try to define PyElly rules for some target set of “wild” text to rewrite.
Text from the Web often lacks the editorial control of published books and periodicals.
Sentences often run on and on, capitalization and punctuation are irregular, typos are
missed, and obscure jargon is rampant. The writing may often be word salads thrown
together with little regard for readability by human or machine. Words like
ADVANCE, BEST, DIRECT, INITIAL, LEAN, OKAY, READY, SMART, WARM, and
ZERO can each be multiple parts of speech, and when crammed into a sentence, the
various possible combinations of meaning must be sorted out somehow.
Shallow markup is harder than one might think, but focusing on a finite number of
target sentences does limit the problem. We still want to keep all our language rules as
general as possible here, but when they inevitably fail, we can add other rules
specifically to deal with those failures. This is a low-level kind of world knowledge to
help PyElly out. Discovering where general rules fail is part of how we all learn a
language as a child or adult; and PyElly is set up to be used in such an iterative way.
The marking example application started out with only a few grammatical rules just for
the syntactic type NCLU and a few others for VCLU, corresponding to our tagging. An
NCLU was assumed to have the form QUAN + DET + NUM + ADJ + NOUN; for
example, ALL THE SIX RED BALLS. A VCLU would have the form AUX + NEG + HAVE
+ VERB; for example, WILL NOT HAVE KNOWN. Some of the components in an NCLU
or a VCLU may of course be omitted in an actual instance of a phrase.
Compiling such rules here will be fairly easy if we postpone worrying about cognitive
semantics. The actual rules here will have the form NCLU->xN NCLU or the form
VCLU->xV VCLU; for example, NCLU->DET NCLU. We can use syntactic features with
the NCLU and VCLU syntactic types to control the order in which their rules will be
applied. Complications will arise later, but we can take care of them as they crop up. At
some point, we may run into a showstopper, but we want to see how far we can get.
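As an illustration only, ignoring syntactic features and cognitive semantics, the core of
such a rule set might be written out roughly as follows, with unary rules at the bottom to
terminate each cascade (the actual marking grammar is more elaborate than this sketch):

NCLU->QUAN NCLU
NCLU->DET NCLU
NCLU->NUM NCLU
NCLU->ADJ NCLU
NCLU->NOUN
VCLU->AUX VCLU
VCLU->NEG VCLU
VCLU->HAVE VCLU
VCLU->VERB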
Overall sentence tagging for shallow markup will also have simple rules. They will take
the form SS->ELEM SS, where SENT->SS and where ELEM->NCLU, ELEM->VCLU,
and ELEM->x for conjunctions, interjections, and miscellaneous syntactic types as well
as punctuation. It should be easy to visualize how this will all work with Dick-and-Jane
text for the youngest of schoolchildren and how this might scale up to handle Web text.
There is still the matter of vocabulary. In a minimal language definition, we might first
try to define only functional words like THE, OF, AND, and NOT along with inflectional
word endings like -S, -ED, and -ING and morphological word endings like -LY, -MENT,
and -ION. The idea is that most words in a sentence can be unknown (UNKN). We
would expect to deal with them through general grammar rules like VERB->UNKN -ED,
ADJ->UNKN -ICAL, and ADV->ADJ -LY. For example, an unknown word
XXXXICALLY would be recognized as an adverb through such rules.
When we actually try to process our target sample of 129 sentences, however, the
minimal approach almost immediately fails, which should be no surprise. A natural
language like English is complicated and tends to defy simple description. It requires
sustained effort plus a bit of luck to get workable processing rules at any level of
analysis. This assumes of course that PyElly rules are adequate for taking care of all the
kinds of issues arising in processing. In the worst case, adding a rule to fix one problem
might break a solution for a previous problem, leaving us stymied.
Fortunately, PyElly passed this initial test, even though our wild data kept uncovering
processing problems in almost every new segment of text. These included bugs in PyElly
Python code, various errors in the definition of rules, unexpected Unicode characters,
new kinds of text entities not covered by existing rules, idiomatic expressions requiring
special rules for handling, and tricky cognitive semantic scoring. With persistent
reworking of both PyElly modules and marking rules, however, we eventually managed
to process all our initial 129 target sentences with a single set of rules.
That result would be enough for a limited victory, but all the extra work required to get
there inspired no confidence about being able to mark up any new set of text. Yet, our
intuition is that the number of general rules has to be finite, or else no one would ever
be able to learn a natural language. If it is finite, then we should see some kind of
convergence where fewer new rules are needed to process new sentences. It might then
be possible to build a markup application to handle, say, ninety percent of sentences.
Being optimistic, we were obliged to look at more data. Testing continued with another
13 text segments collected from the Web, containing 1,773 words in 91 sentences. If all
went well, PyElly would have sailed through this sample, but the bad news was that
the former pattern of failures repeated with each new text segment. We again had to
keep changing both PyElly modules and marking rules before we could mark up all 220
sentences of our expanded collection—not a win, but not a loss either.
Living to fight another day was encouraging, though. PyElly itself was getting better
through better code and broader rules. So we got four more text samples from the Web
for more testing. These included 7 segments with 1,289 words in 62 sentences, 10
segments with 1,149 words in 61 sentences, 13 segments with 2,752 words in 133
sentences, and 37 segments with 8,033 words in 403 sentences. We still saw the same
pattern of failure with every segment in that new data, but got better at making the
changes to get acceptable markups for all sentences in the end and saw fewer bugs.
We cannot yet claim to have a viable shallow markup application, although the rate of
failure may have fallen slightly in processing the last text sample. The PyElly rule-based
approach has held up, however; and there seem to be no obvious obstacles to keep us
from improving our shallow markup further. It is mainly a matter of patience: getting
more “wild” Web text and evolving our language rules. If we can keep at this, a
respectable level of markup competence might be achievable on a time scale of years.
The overall problem in PyElly for shallow markup is now fairly clear, boiling down to
basic combinatorics. Having more grammar and vocabulary rules is unavoidable, but
this is all right only if we can hold down the overall number of ambiguities in sentence
analyses. Otherwise, the number of possible interpretations of a sentence will grow
exponentially. That will make PyElly parsing glacially slow and decrease the odds of
PyElly being able to find a good markup through cognitive semantic scoring.
As it turns out, the easiest way to reduce ambiguity is for PyElly to have a big external
vocabulary table, especially with many multi-word terms. Having fewer unknown terms
alone will reduce ambiguity. Also, a term like DEPARTMENT OF JUSTICE can be a
single token in a text analysis instead of three separate tokens. That gives us fewer
degrees of freedom in parsing a sentence, and each multi-word token will probably be
less ambiguous than its individual parts. Finding only four or five such long tokens in a
text segment can often reduce PyElly parsing time by an order of magnitude.
To identify possible single- and multi-word vocabulary for PyElly, we only have to look
at a large digital dictionary of English. WordNet in particular works out quite well here
because of its comprehensiveness, its organization into ASCII text files, and its liberal
open-source licensing. The file default.v.elly distributed with PyElly is essentially
WordNet 3.0 except for some numerical terms already handled elsewhere in PyElly and
some terms including punctuation other than period (.), apostrophe (’), or hyphen (-).
The marking example application uses only a subset of default.v.elly because the
full file takes a long time to load. The current marking vocabulary table was selected
using various online lists of the most frequent English words and the WordNet words
found in target text. Then certain words in target text, but not in WordNet, were also
added, including irregular forms of nouns and verbs, some frequent inflected forms,
common names, specialized jargon, and neologisms not yet in WordNet 3.0. The non-
WordNet additions have been kept separate in the marking.v.elly definition file.
To obtain multi-word vocabulary, we can just look for WordNet terms that have their first
two words already in the marking vocabulary table. This yields a list that needs manual
pruning to get rid of rare or obscure terms that will clutter up a vocabulary table with
little benefit, but this has to be done only once. Currently, the marking vocabulary has
grown to over 26,000 entries, with more than half composed of two or more words. In
comparison, the PyElly marking application has just over 600 grammar rules, about
100 pattern rules, and about 50 macro substitution rules.
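The harvesting of multi-word terms can be automated. Here is a minimal Python sketch of
the idea; the file names are hypothetical, and we assume only that a WordNet index file
lists one lemma per line with underscores joining the words of a multi-word term (the
actual selection for marking.v.elly involved more curation than this):

# sketch only: collect multi-word WordNet lemmas whose first two words
# are already known single-word vocabulary

def load_known_words(path):
    # read one known single-word term per line
    with open(path) as f:
        return set(line.strip().lower() for line in f if line.strip())

def candidate_multiword_terms(index_path, known):
    # yield lemmas of two or more words whose first two words are known
    with open(index_path) as f:
        for line in f:
            if line.startswith(' '):       # skip WordNet license header lines
                continue
            lemma = line.split(' ', 1)[0]  # first field of an index line is the lemma
            words = lemma.split('_')       # multi-word lemmas use underscores
            if len(words) >= 2 and words[0] in known and words[1] in known:
                yield ' '.join(words)

known = load_known_words('marking_single_words.txt')   # hypothetical word list
for term in candidate_multiword_terms('index.noun', known):
    print(term)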
Here are some actual markups for sentences taken at random from the various original
target samples. The first is for text from Robert Louis Stevenson’s “Treasure Island”:
He had taken me aside one day and promised me a silver fourpenny on the first of every
month if I would only keep my ‘weather-eye open for a seafaring man with one leg’ and let
him know the moment he appeared.
<sent>
<pro>he</pro>
<vclu><aux>had</aux><verb>taken</verb></vclu>
<pro>me</pro>
<adv>aside</adv>
<nclu><num>1</num><noun>day</noun></nclu>
<conj>and</conj>
<vclu><verb>promise -ed</verb></vclu>
<pro>me</pro>
<nclu><det>a</det><adj>silver</adj><noun>fourpenny</noun></nclu>
<nclu><prep>on</prep><noun>the 1st</noun></nclu>
<nclu><prep>of</prep><quan>every</quan><noun>month</noun></nclu>
<conj>if</conj>
<pro>i</pro>
<vclu><aux>would</aux><adv>only</adv><verb>keep</verb></vclu>
<nclu><dem>my</dem>
<punc>‘</punc>
<noun>weather-eye</noun><adj>open</adj>
<prep>for</prep><det>a</det><noun>seafaring man</noun>
<prep>with</prep><num>1</num><noun>leg</noun>
<punc>’</punc></nclu>
<conj>and</conj>
<vclu><verb>let</verb></vclu>
<pro>him</pro>
<vclu><verb>know</verb></vclu>
<nclu><det>the</det><noun>moment</noun></nclu>
<pro>he</pro>
<vclu><verb>appear -ed</verb></vclu>
<punc>.</punc></sent>
Our second example comes from a website for the Mennonite Church:
Mennonites value the sense of family and community that comes with a shared vision of
following Jesus Christ, accountability to one another and the ability to agree and disagree
in love.
<sent>
<nclu><noun>mennonite -s</noun></nclu>
<vclu><verb>value</verb></vclu>
<nclu><det>the</det><noun>sense</noun></nclu>
<nclu><prep>of</prep><noun>family</noun></nclu>
<conj>and</conj>
<nclu><noun>community</noun></nclu>
<>that</>
<vclu><verb>come -s</verb></vclu>
<nclu><prep>with</prep><det>a</det><adj>share -ed</adj><noun>vision</noun></nclu>
<vclu><prep>of</prep><verb>follow -ing</verb></vclu>
<nclu><noun>jesus christ</noun></nclu>
<punc>,</punc>
<nclu><noun>accountability</noun></nclu>
<pro><prep>to</prep>one another</pro>
<conj>and</conj>
<nclu><det>the</det><noun>ability</noun></nclu>
<vclu>to<verb>agree</verb></vclu>
<conj>and</conj>
<vclu><verb>disagree</verb></vclu>
<nclu><prep>in</prep><noun>love</noun></nclu>
<punc>.</punc></sent>
A third example comes from a conspiracy-theory blog that loves quotation marks:
There are a lot of people hearing alarm bells going off in their heads with the recent
announcement of multiple Walmart Supercenters closing, all at once, in multiple states
(some of which are states being prepared for the Jade Helm military exercises), all
claiming a "plumbing" issue, with recent news of some type of underground tunnel
projects involving Walmart and DHS.
<sent>
<vclu><adv>there</adv><verb>are</verb></vclu>
<nclu><quan>a lot of</quan><noun>people</noun></nclu>
<vclu><verb>hear -ing</verb></vclu>
<nclu><noun>alarm bell -s</noun></nclu>
<vclu><verb>go -ing off</verb></vclu>
<nclu><prep>in</prep><dem>their</dem><noun>head -s</noun></nclu>
<nclu><prep>with</prep><det>the</det><adj>recent</adj><noun>announcement</noun></nclu>
<nclu><prep>of</prep><quan>multiple</quan><noun>walmart supercenter -s</noun></nclu>
<vclu><verb>close -ing</verb></vclu>
<punc>,</punc>
<adv>all at once</adv>
<punc>,</punc>
<nclu><prep>in</prep><quan>multiple</quan><noun>state -s</noun></nclu>
<punc>(</punc>
<conj>some of which</conj>
<vclu><verb>are</verb></vclu>
<nclu><noun>state -s</noun></nclu>
<vclu><aux>being</aux><verb>prepare -ed</verb></vclu>
<nclu><prep>for</prep><det>the</det><noun>jade helm military exercise -s</noun></nclu>
<punc>)</punc>
<punc>,</punc>
<nclu><quan>all</quan></nclu>
<vclu><verb>claim -ing</verb></vclu>
<nclu><det>a</det>
<punc>"</punc>
<noun>plumbing</noun>
<punc>"</punc></nclu>
<vclu><verb>issue</verb></vclu>
<punc>,</punc>
<nclu><prep>with</prep><adj>recent</adj><noun>news</noun></nclu>
<nclu><prep>of</prep><quan>some</quan><noun>type</noun></nclu>
<nclu><prep>of</prep><adj>underground</adj><noun>tunnel project -s</noun></nclu>
<vclu><verb>involve -ing</verb></vclu>
<nclu><noun>walmart</noun></nclu>
<conj>and</conj>
<nclu><noun>dhs</noun></nclu>
<punc>.</punc></sent>
A fourth and final example comes from a Wikipedia article about a highway in Missouri:
The route reaches a roundabout where it intersects Route VV and Prathersville Road, the
latter providing access to the southbound direction of US 63.
<sent>
<nclu><det>the</det><noun>route</noun></nclu>
<vclu><verb>reach -s</verb></vclu>
<nclu><det>a</det><noun>roundabout</noun></nclu>
<conj>where</conj>
<pro>it</pro>
<vclu><verb>intersect -s</verb></vclu>
<nclu><noun>route vv</noun></nclu>
<conj>and</conj>
<nclu><noun>prathersville road</noun></nclu>
<punc>,</punc>
<nclu><det>the</det><noun>latter</noun></nclu>
<vclu><verb>providing</verb></vclu>
<nclu><noun>access</noun></nclu>
<nclu><prep>to</prep><det>the</det><adj>southbound</adj><noun>direction</noun></nclu>
<nclu><prep>of</prep><noun>us 63</noun></nclu>
<punc>.</punc></sent>
As can be seen, actual text from the Web is much more complex than the initial example
of a sentence made up to illustrate shallow markup. Although analyzing 878 real
sentences like the above four examples turned out to be a major challenge, they could all
still be handled through fairly general rules, though these often had to be adjusted.
The particular reasons for markup difficulty were quite varied:
Function words like A, AND, OF, and THE are part of the structure of English and not indicative of the
content of any text. These are not in WordNet, which lists only nouns, verbs, adjectives, and adverbs. In
PyElly, function words will usually be defined in the internal dictionary of a grammar. Together they
tend to be quite frequent in text, but many are individually infrequent and can be overlooked in a
language definition. Any sentence with undefined function words will almost always be parsed wrong
by PyElly because it is impossible to guess whether they are a preposition, a conjunction, or something else.
We must supply PyElly with rules for every function word. Sometimes, a word sequence like AS WELL
AS or AS A RESULT OF should be taken together as a compound function word in a proper markup.
Words like ACCESS, FACE, NEED, and WAGE can assume more than one part of speech, and it may be
hard to tag them correctly all the time. PyElly cognitive semantics is supposed to take care of this
problem, but sometimes the logic can get so complex that it may not always produce a desired result.
The solution here is to reorganize syntax rules so as to get simpler cognitive semantics, but that will
usually lead to more grammar rules, often involving syntactic features and nonstandard syntactic
categories that make them harder to read.
Punctuation can often be difficult, especially enclosure punctuation like parentheses and quotation
marks. For example, in the noun phrase A (RED) KNIT CAP, (RED) is obviously meant as an
adjective; but we need to introduce some special rules here so that the PyElly parser can recognize this.
Otherwise, KNIT CAP will become its own noun phrase and A (RED) will just be a bad construction. The extra
rules are not hard to define, but if you did not anticipate this situation, then markup will fail. Quotation
marks are especially troublesome in Web text because many writers use them willy-nilly.
Our original concepts of NCLU and VCLU phrases as described above turned out to be extremely
limited. For example, the simple NCLU above does not cover common kinds of English expressions like
ACT 2, Sagittarius A*, US 63, or VERSION 10.1.2B. These and many other similar unanticipated
forms need their own grammar rules.
Some function words like THAT, FOR, TO, FIRST, CAN, MAY, and ONE are extra difficult to tag
because they have unusual patterns of usage unlike other function words in English. For example, TO
can also identify the English infinitive form of a verb, FOR can be a conjunction in some contexts, and
MAY can be a noun. In the marking example application, we see only a few difficult function words,
but they can play havoc with parsing and usually must be handled through a coordinated combination
of grammar, vocabulary, and macro substitution rules.
Many idiomatic expressions need to be recognized explicitly in a proper markup. For example: FROM
BEGINNING TO END, ROCKET SCIENCE, PURCHASING POWER, FOR EXAMPLE, BETTER OFF,
and BEAT AROUND THE BUSH are respectively an adverb, a noun, another noun, another adverb, an
adjective, and a verb. WordNet lists many of these expressions, but not all. PURCHASING POWER
had to be added as a noun in the PyElly vocabulary table for marking, even though the alternative
interpretation of buying electricity is not implausible. (Sample vocabulary entries for such idioms
appear after this list.)
Names like JOHN JONES need special treatment because they generally do not follow the same rules
as ordinary words; for example, we probably do not want to see JONES split into JONE -S. PyElly can
extract personal names from text as entities, but to keep everything simple, the marking rules just list
names and name components as straight vocabulary. When new names occur in text for markup, they
often also need to be identified in this way. Profligate name dropping is a major cause of parsing
overflow when the name components are unknown to PyElly.
Unicode is often a problem. Hyphens are often replaced inappropriately by the Unicode en dash. Other
troublesome Unicode characters are ellipsis, non-breaking space, narrow space, trademark ™, Greek
letters, diacritical marks, musical accidentals like the flat sign (♭), and numerical exponents as in CM². These all
have to be recognized and handled properly within PyElly language definition rules.
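To make the idiom case above concrete, entries of roughly the following form (shown here
only as an illustration; the actual marking entries may differ in detail) would let PyElly
treat each expression as a single token with the intended part of speech:

from beginning to end : ADV
rocket science : NOUN
purchasing power : NOUN
for example : ADV
better off : ADJ
beat around the bush : VERB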
These various issues with Web text data are not peculiar to PyElly. Working within the
PyElly framework merely helped to reveal their presence. A black-box approach to
natural language with a machine-learning framework would run into the same issues, but
the problems would probably be harder to diagnose. Leaving the dirty work to a robot
does mean less labor, but this more or less assumes that the messiness of actual text
data can be safely swept under a rug. Such confidence is premature.
With PyElly, we are obliged to take time to explore the structure of natural language. If
nothing else, our experience with the marking example application shows us that we
should not expect smooth sailing in real-world text processing. There is still much to
learn here even after looking at on the order of a thousand Web sentences so far.
Analyzing much more text is not difficult, however, and should go a long way toward
validating (or invalidating) the whole PyElly rule-based approach to NLP.
Testing in the marking example application continues with the processing of large
amounts of new text without necessarily producing flawless markups. The test data here
comes from Google News articles, with ~12 text segments collected for each of 30 days.
This amounts to several thousand new sentences of diverse content, although all in a
Web journalistic style. These are challenging in that they tend to be longer and quirkier
than those for the six Web samples processed previously by the marking application
and now part of PyElly integration testing.
We want to be able to go through all the new data without some error in PyElly Python
code or a parsing failure or an overflow. From results so far, each daily sample requires
some jiggering of rules for about one out of ten sentences, but the hope is that this will
all go smoothly, requiring mostly the addition or the modification of vocabulary rules
plus a few changes to grammar, pattern, and macro substitution rules.
The biggest problem so far has been with parsing overflows. In the previous six mixed
Web data samples, this occurred in only three or four sentences, but in the Google News
data, four or five sentences in each daily sample will cause a phrase node overflow on a
first try with PyElly. On a typical overflow for a sentence analysis, PyElly will run for
many minutes before finally exiting with an error message. After rule adjustments, the
sentence will often take only about ten seconds or less to process.
Here is an actual example of a sentence where overflow had to be remedied:
Special Counsel Robert Mueller’s sprawling investigation has spawned a new guessing
game in Washington centered on retired Lt. Gen. Michael Flynn, President Trump's
former national security adviser.
This comes from an article in The Hill, a Washington political publication. The sentence
is only about average length for Google News, but with the rules used in integration
testing for the acceptance of PyElly release v1.4.17, PyElly parsing will overflow on it. At
this point, we can then successively add new multi-word vocabulary to reduce the
number of degrees of freedom in the analysis. Here is what happens with parsing times
for the sentence with each new vocabulary entry:

     Vocabulary Term Added         PyElly Parsing Time
  1  national security adviser     (overflow)
  2  guessing game                 (overflow)
  3  Robert Mueller                2m57.9s
  4  special counsel               30.4s
  5  Lt. Gen.                      9.3s
  6  President Trump               9.1s
  7  Michael Flynn                 2.9s
PyElly continued to overflow even after the addition of the first two multi-word terms.
With the third term, parsing did finish, but took about 3 minutes, which is slow. With
the 4th and 5th terms, parsing time drops dramatically, but adding the 6th term makes
almost no difference. Adding the 7th term once again speeds up parsing significantly.
The parsing times are for a MacBook Pro running a 2.7 GHz Intel Core i5 under macOS
10.13.1 (High Sierra) with Python 2.7.9.
Combinatorics are hard at work here. Reducing the number of degrees of freedom in a
PyElly analysis does greatly speed up parsing. Multi-word vocabulary as a kind of
outside world knowledge helps to reduce ambiguity and makes parsing easier. Except
for GUESSING GAME, none of the added multi-word terms are in WordNet 3.0, but
most would agree that we have good reason to put them all into our vocabulary table
beyond just enabling the processing of one particular example sentence.
An educated English-speaking human adult should read our example sentence with
little difficulty. PyElly, however, needs that background knowledge to recognize
ROBERT MUELLER as a single unit instead of having to figure it out from capitalization
and other clues. In our example, this made the critical difference between an overflow in
processing and a completed markup. There are dozens of other Web sample sentences requiring PyElly to
have extensive cultural knowledge of this sort before they can be parsed successfully.
Here are some terms that eventually had to be included as non-linguistic knowledge in the
vocabulary table for the “marking” example application:
announced the discovery of : VERB
another front in the culture wars : NOUN
armed with fresh ammunition : ADJ
complex new modeling : NOUN
counts of murder : NOUN[:plur]
disappeared from the fever swamps : VERB[^ed]
effectively acknowledging : VERB[^ing]
making a false statement to the FBI : VERB[^ing]
repeatedly used the flag : VERB[^ed]
It is likely that more data issues like those listed as bullets above will be discovered as
we process further samples of actual Web text. That is what we want. Our hope here is
still that sentence analysis will become easier as we accumulate changes in our PyElly rules to
get some kind of markup for new text data. If the new issues remain at the same level or
even increase, then we have to surmise that no one is really close to declaring victory in
natural language processing.
Appendix G. Unicode Issues
You might think that processing natural language text in Unicode should be no problem
if you can already handle ASCII text. After all, any decent programming language like
Python or Java will have libraries that let us handle Unicode in exactly the same way we
handle ASCII. There are some complications, however, that we have to get around.
The first issue is choosing what subset of Unicode we want to recognize. It is impractical to
try to do anything with 136,755 distinct glyphs in 139 different alphabets plus all the
special symbols, including emoji. Even if our processing strategy had been to map
everything into plain ASCII, we still have to decide what actually to convert and what
just to ignore. This can make a big difference in the kind of toolkit we end up with.
PyElly opts to recognize characters from the first four pages of Unicode at 0000 through
034F; that includes the ASCII character set in page 0. This range of characters is mostly
unneeded for English, but handling it is a useful exercise for gaining some experience with more
general text processing. It also lets us deal with the spelling of European names and with
occasional foreign phrases showing up in mainly English text.
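As a rough sketch of what this restriction amounts to (not the actual ellyChar logic), a
test for whether text stays within the recognized range could look like this in Python:

# sketch only: check whether characters lie in the Unicode range that
# PyElly chooses to recognize (U+0000 through U+034F)

def in_supported_range(ch):
    # true if a single character is within U+0000..U+034F
    return ord(ch) <= 0x034F

def filter_supported(text):
    # keep only characters in the supported range, dropping the rest
    return u''.join(c for c in text if in_supported_range(c))

print(filter_supported(u'naïve café'))   # all kept: every character is below U+0350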
PyElly v1.0 worked with only the first two pages of Unicode, but that coverage has
steadily grown after working with more and more “wild” text from the Web. A serious
general natural language toolkit really needs to address diacritical marks, extended
punctuation like “ and ”, Greek letters, special symbols like © and º, some exponents
like cm², currency symbols like € and £, different kinds of Unicode spaces, pinyin tone
markers, and even musical notation like B♭.
Most of PyElly's handling of Unicode is done by the ellyChar module, which currently
runs to 328 lines of Python code. This is hardly trivial, but it will have to be expanded
greatly to deal with Russian, Chinese, Arabic, or any of the Indic languages at any
reasonable level of competence. It is only one of the reasons why natural language
processing is hard even with deep neural nets.
An unexpected issue arose with UTF-8, which is a standard byte-encoding of Unicode.
Pure Unicode is written out to files in 16- or 32-bit codes. Current practice on the Web is
to use UTF-8 instead, which has the advantage of allowing ASCII-only text to look the
same as it did before. This requires, however, that up-to-date libraries for programming
languages be able to convert UTF-8 into 32-bit Unicode for internal processing in main
memory and reconvert Unicode for output to external devices.
Python 2.7 turns out to be awkward at writing UTF-8 to files, especially with redirection
of the standard output and standard error streams. One probably has to experiment to
see what solutions actually work on a given computing platform. PyElly as of v1.4.2+
now works for redirected Unicode output for MacOS 10 Unix, but this called for a bit of
hackery that may not transfer cleanly to other environments.
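One commonly used workaround in Python 2.7, shown here only as a sketch of the kind of
hackery involved and not as the actual PyElly code, is to wrap the standard streams with
UTF-8 writers before printing any Unicode:

# -*- coding: utf-8 -*-
# sketch only, Python 2.7: make redirected stdout and stderr accept Unicode
import sys
import codecs

if sys.stdout.encoding is None:     # typically the case when output is redirected
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
if sys.stderr.encoding is None:
    sys.stderr = codecs.getwriter('utf-8')(sys.stderr)

print(u'naïve café now survives redirection to a file')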
Unicode also allows for some tricks in processing that may surprise someone unfamiliar
with all its various corners. In particular, the Unicode Consortium has defined three
“private usage areas” (PUA) that are to remain permanently unassigned. That is, we
should not expect any conforming Unicode text to use any of these codes. PyElly takes
advantage of the PUA starting at E000 to encode its character and string wildcards.
That allows a pattern of literal characters and wildcards to be saved conveniently as a
straight Unicode string.
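A minimal sketch of the idea follows; the particular code points and wildcard meanings
below are hypothetical and not PyElly's own assignments:

# -*- coding: utf-8 -*-
# sketch only: reserve Private Use Area code points (from U+E000 up) as wildcard
# markers so that a pattern of literals plus wildcards is just an ordinary string

ANY_CHAR  = u'\ue000'    # hypothetical wildcard: any single character
ANY_DIGIT = u'\ue001'    # hypothetical wildcard: any single digit

pattern = u'cm' + ANY_DIGIT    # "cm" followed by one digit, stored as a plain string

def match_char(p, c):
    # match one pattern character against one text character
    if p == ANY_CHAR:
        return True
    if p == ANY_DIGIT:
        return c.isdigit()
    return p == c

def match(pattern, text):
    # match a whole pattern of the same length against a text string
    return len(pattern) == len(text) and all(
        match_char(p, c) for p, c in zip(pattern, text))

print(match(pattern, u'cm2'))   # True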
Another Unicode issue arose with the requirement that a search key for SQLite had to be
ASCII-only for any vocabulary table lookup. Python does provide a low-level function
for ASCII conversion, but PyElly implements its own to maintain flexibility here. This is
natural to include in its ellyChar module with the toLowerCaseASCII() method.
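This is not the actual ellyChar code, but a conversion of that general kind is often written
along the following lines with the standard unicodedata module:

# -*- coding: utf-8 -*-
# sketch only: fold a Unicode string into lower-case ASCII, for example to
# build an ASCII-only SQLite search key
import unicodedata

def to_lower_case_ascii(s):
    # decompose accented letters, drop combining marks, then drop any non-ASCII
    decomposed = unicodedata.normalize('NFKD', s)
    stripped = u''.join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.encode('ascii', 'ignore').lower()

print(to_lower_case_ascii(u'Crème Brûlée'))   # creme brulee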
PyElly release v1.4.24 also experiments with using Unicode to pass information between
various stages of processing. This involves the ASCII hyphen-minus (U+002D), which is
currently treated as non-embedding in default token formation and also will fail to
match the $ wildcard for the end of a token. It is possible, however, to use a macro
substitution to change a hyphen-minus (U+002D) in certain contexts into a Unicode
hyphen (U+2010).
The idea is to make the Unicode hyphen different from the ASCII hyphen-minus by
having it match the $ wildcard. This may seem like a strange kind of natural language
processing, but it can solve some non-trivial problems in PyElly rule definition. We have
no such option when working with ASCII-only text data.
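As a rough illustration of the trick in plain Python (standing in for an actual PyElly macro
substitution rule, whose exact form we do not reproduce here), a hyphen-minus between
two letters might be retagged like this:

# -*- coding: utf-8 -*-
# sketch only: rewrite an ASCII hyphen-minus (U+002D) between letters as a
# Unicode hyphen (U+2010) so that later rules can treat the two differently
import re

def retag_embedded_hyphen(text):
    # the "between two lowercase letters" context here is just an example
    return re.sub(u'(?<=[a-z])-(?=[a-z])', u'\u2010', text)

print(repr(retag_embedded_hyphen(u'weather-eye')))   # u'weather\u2010eye'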
The bottom line is we have to get our hands dirty with Unicode text as it exists in the
wild before we can understand how to process it properly. That is only a narrow detail of
computational linguistics, but it is still something unwise to gloss over. Natural language
remains a hard problem, and we need to apply all available knowledge to give us all the
traction possible in developing practical solutions.