SRC-RR-92A

Hector: Connecting Words with Definitions
Lucille Glassman, Dennis Grinberg, Cynthia Hibbard,
James Meehan, Loretta Guarino Reid, and
Mary-Claire van Leunen

October 20, 1992

Systems Research Center
DEC’s business and technology objectives require a strong research program. The Systems Research Center
(SRC) and three other research laboratories are committed to filling that need.
SRC began recruiting its first research scientists in 1984—their charter, to advance the state of knowledge in
all aspects of computer systems research. Our current work includes exploring high-performance personal
computing, distributed computing, programming environments, system modelling techniques, specification
technology, and tightly-coupled multiprocessors.
Our approach to both hardware and software research is to create and use real systems so that we can
investigate their properties fully. Complex systems cannot be evaluated solely in the abstract. Based on
this belief, our strategy is to demonstrate the technical and practical feasibility of our ideas by building
prototypes and using them as daily tools. The experience we gain is useful in the short term in enabling
us to refine our designs, and invaluable in the long term in helping us to advance the state of knowledge
about those systems. Most of the major advances in information systems have come through this strategy,
including time-sharing, the ArpaNet, and distributed personal computing.
SRC also performs work of a more mathematical flavor which complements our systems research. Some
of this work is in established fields of theoretical computer science, such as the analysis of algorithms,
computational geometry, and logics of programming. The rest of this work explores new ground motivated
by problems that arise in our systems research.
DEC has a strong commitment to communicating the results and experience gained through pursuing these
activities. The Company values the improved understanding that comes with exposing and testing our ideas
within the research community. SRC will therefore report results in conferences, in professional journals,
and in our research report series. We will seek users for our prototype systems among those with whom we
have common research interests, and we will encourage collaboration with university researchers.

Robert W. Taylor, Director

Hector: Connecting Words with Definitions
Lucille Glassman, Dennis Grinberg, Cynthia Hibbard, James Meehan, Loretta Guarino Reid, and
Mary-Claire van Leunen
October 20, 1992

Publication History
This paper was presented at the 8th Annual Conference of the UW Centre for the New Oxford English
Dictionary and Text Research, Waterloo, Canada. October, 1992.
Affiliations
Dennis Grinberg is at the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
15213-3890. Dennis worked at the Digital Equipment Corporation Systems Research Center during the
summer of 1991.

© Digital Equipment Corporation 1992
This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to
copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes
provided that all such whole or partial copies include the following: a notice that such copying is by
permission of the Systems Research Center of Digital Equipment Corporation in Palo Alto, California; an
acknowledgment of the authors and individual contributors to the work; and all applicable portions of the
copyright notice. Copying, reproducing, or republishing for any other purpose shall require a license with
payment of fee to the Systems Research Center. All rights reserved.


Abstract
Hector is a feasibility study on high-tech corpus lexicography. Oxford University Press provided the
lexicographers and a corpus of 20 million words of running English text; Digital Equipment Corporation
Systems Research Center provided the high tech tools to enable the lexicographers to do all of their work
on-line.
The tools provide the ability to query the corpus in various ways and see the resulting matches, to write
and edit dictionary entries, and to link each occurrence of a word in the corpus with its sense as displayed in
the entry editor. Additional support tools give statistical information about words in the corpus, derivatives
and related words, syntactic structures, collocates, and case-variants.
This report describes the tools and the status of the project as of July, 1992.
An accompanying videotape demonstrates the Hector tools. If you would like a copy, please send mail
including your full postal address to src-report@src.dec.com.


Contents

1 Introduction
2 Preparing the Hector Corpus
  2.1 Contents of the Corpus
  2.2 Cleaning Up the Corpus
  2.3 Why Cleanse the Corpus?
  2.4 Adam: Our Wordclass Tagger
  2.5 The Houghton Mifflin Parser
  2.6 Which Word Is This?
  2.7 Which Sentence Is This?
  2.8 The “hi” Server
  2.9 Target Words
3 The Lexicographer’s Workbench
  3.1 Hooking Words and Definitions Together
  3.2 Argus: The Corpus Viewer
    3.2.1 Searching for a Word or a List of Words
    3.2.2 Searching for a Wordclass or Wordclasses
    3.2.3 Searching for Syntax, Position, Genre, Authorship, Etc.
    3.2.4 Searching for a Sense or Senses
    3.2.5 Searching for Words with Collocates
    3.2.6 Looking at Concordances
    3.2.7 Sense-Tagging Concordance Lines
  3.3 Ajax: The Dictionary-Entry Editor
    3.3.1 Writing a Dictionary Entry
    3.3.2 Numbering the Components of an Entry
  3.4 Atlas: Supporting Information
  3.5 Gritty Details
4 What Have We Learned So Far?
  4.1 Natural Language Is Hard
  4.2 User-Interface Design Is Hard
  4.3 Speed Is Functionality
  4.4 Data Integrity
  4.5 And in a Cheerier Vein ...
Acknowledgements
References
A Appendix: SGML Constructions
B Appendix: The Target Words
C Appendix: The Wave-2 Words

1 Introduction
Starting in October of 1990, the Systems Research Center (SRC) of the Digital Equipment Corporation
undertook with Oxford University Press a joint project called Hector. Hector was a feasibility study on
high-tech corpus lexicography. The experience of the COBUILD Project at Collins and the University
of Birmingham demonstrated that lexicographers need a corpus, a very large body of running text, to
work on ordinary words[6, 2]. But at COBUILD the lexicographers looked at their corpus on paper and
impressionistically; they spread paper concordances out on their kitchen tables, marked a few lines with
colored pens, and threw them away. We wanted to offer lexicographers the opportunity to use a corpus in
more rigorous and creative ways. Oxford provided the corpus and the lexicographers. We at SRC provided
the high tech (and some low tech too, as you will see).
So we built an on-line corpus viewer and an on-line editor for dictionary entries. We produced indexes
that would enable us to build concordances of:
words
case-variants
inflections
derivatives and related words
wordclasses
syntactic structures (including clause and sentence boundaries)
collocates
and we provided searching, sorting, and statistical information on all of them. For instance, we wanted to
make it possible for the lexicographer to search for all occurrences of “cake” or “cakes” used as a finite verb
in a predicate with “up.”
But we also wanted the lexicography to feed back into the corpus. We thought that a sense-tagged corpus
was an interesting object in and of itself, and we wanted to let the lexicographers use one sense division in
determining another. So we integrated the corpus viewer and the dictionary-entry editor; the lexicographers
could link each occurrence of a word in the corpus with its sense as displayed in the entry editor, and that
link could then be used in subsequent searches and sorts. Once lexicographer A had identified an occurrence
of “cake” as meaning “a crusty mass,” that sense became available to lexicographer B working on “flat” in
the same sentence.
We started building tools in October of 1990; the lexicographers arrived in Palo Alto (where SRC is
located) and began writing definitions and linking them to the corpus in January of 1992. Now, as we write,
it’s July of 1992. The pilot project will end in March of 1993. What have we learned so far?

• Natural language is hard.
• User interface design is hard.
• Speed is functionality.
• Data integrity can’t be retrofitted into a system.
• A corpus (with a set of good tools) is useful for other things besides professional lexicography.
What do we hope for by March?

• A test of the predictive value of the links between words and senses.
• An evaluation of the automatic wordclass assignments.
• 500 wonderful dictionary entries.
• A 20-million-word corpus with 300,000 words linked to those 500 dictionary entries.
• Insight into what tools the lexicographers found useful in doing their job.
• Statistical information on the distribution of words, wordclasses, and dictionary senses in the corpus.
• A contrast of the raw and the cleaned-up versions of the corpus.
• Published code for inflecting English nouns, verbs, adjectives, and adverbs.
What do we hope for in the future, beyond the end of the project?

• A solution to the copyright problems that restrict the availability of the Hector corpus.
• Another project somewhere to link all of the lexical words (but not function words, like prepositions and conjunctions and linking verbs) in a corpus to their dictionary senses.

2 Preparing the Hector Corpus

2.1 Contents of the Corpus

The Hector corpus was compiled by Oxford University Press for the Hector project and sent to us over the
course of 1991.
The corpus consisted originally of 20 million words of running English text from both written and
spoken sources no earlier than 1930. Its object was to sample language used in natural discourse, in a
variety of social and professional contexts. It included examples of both formal and informal usage but
was meant to exclude poetical or artificially self-conscious language. The written text samples included
quality and popular journalism, scholarly periodicals and the journals of various professions, pop culture and
hobby magazines, works of fiction, biography, autobiography, and travel. The samples of spoken language
were from transcribed interviews, lectures and training sessions, radio sports commentaries, and informal
meetings.
Written
  newspaper journalism                                          59.5%
  serious non-fiction                                           12.8%
  fiction                                                       10.9%
  recreational non-fiction                                       5.1%
  recreational magazines and periodicals                         4.9%
  advertising, newsletters, memos, promotional material          1.9%
  serious periodicals, professional journals, news magazines     1.7%
Spoken
  all categories of speech                                        3.1%

The Hector Corpus as of July 1992

The table above shows the proportions of the main groups of texts in the corpus as it is today. Newspapers account for 60 percent of the written texts, and the composition of the corpus is heavily biased towards written text, not for any theoretical reason but because speech samples are very difficult to acquire and process. The ratio of written text to transcribed speech is approximately 32 to 1.

2.2 Cleaning Up the Corpus

There were several kinds of problems with the corpus as we received it from Oxford:
missing structure
duplication
inappropriate material
gaps
typos and misspellings
missing punctuation
inconsistent notation

Throughout our work on the corpus we suffered from the lack of paper originals of the documents on which
we were working, and we would recommend that any future corpus-workers be sure to obtain paper originals
before they set out.
Oxford sent us the corpus essentially as one big 140-million-byte file. Our first step was to identify the
document boundaries and divide the corpus into files of a convenient size, no file containing more than one
document. We divided large samples into smaller chunks: for example, we divided the 7 million words of
the Independent, an English daily newspaper, into 83 files.
Smaller structural elements, such as headings, salutations, addresses, datelines, bylines, and story or
article boundaries, remained a problem throughout the processing; we found them as well as we could with
various ad hoc techniques and consistently marked all the ones we found.
As soon as we had divided the corpus into chunks, we noticed that some of the chunks were the same as
others. Most of the duplication was in the samples of journalism, where we found stories repeated—as many
as sixteen times—because the samples were taken either from different editions of the same paper or from
an open wire or both. Furthermore, the stories were repetitions but not identical repetitions. Sometimes
the text would be altered for emphasis (the local team goes first in the local edition), sometimes for sheer
journalistic exuberance—or perhaps because the paper had space to fill up. Exact duplications we could
identify and delete automatically. Stories that were repeated but not exactly duplicated required ingenuity
to find and judgment to eliminate; our strategy was to remove all but the longest version of closely similar
stories.
We discovered inappropriate material at two different levels, whole documents and parts of documents.
We discarded some whole documents. For instance, we discarded Possession, the Booker prize winner
by A. S. Byatt, because it is largely a pastiche of 19th-century prose. (“A man might die, though nothing
else ailed him, only upon an extreme weariness of doing the same thing, over and over.” “I can never tire of
you—of this—. ” “It is in the nature of the human frame to tire. Fortunately. Let us collude with necessity.”)
At another extreme, we discarded the Challenger Inquiry, which differed along three axes from everything
else in the corpus: It was the only formal inquiry, the only sample of American speech, and the only example
of technical vocabulary in its field. (“Okay, this third one, again, is—there is your plume already. This time
it is lying correctly. This again is the lefthand rocket this time. The righthand rocket on the other side. The plume is coming toward you. The orbiter is here, and the external tank.”)
But at a lower level, many documents that were useful overall nonetheless contained some inappropriate
material: words written or spoken before 1930, passages in foreign languages, and tables. Poetry showed
up in all kinds of places—it is amazing how often prose writers break into verse. Again a variety of ad hoc
techniques helped us find this material. The most useful was to chop the documents up into small pieces
and subject the pieces to spell-checking. Luckily, we have a good spell-checker. For instance, one piece
contained this passage:
Right worshipfulls, This may be to acquaynt you that their is a
pore yong women in oure Towne of Aston-under-lyne infected with
a filthy deceassed called the French poxe and shee saith shee was
defiled by one Henry Heyworth a maryed man.
in which the spell-checker finds the following “errors”:
Aston-under-lyne    acquaynt     oure    worshipfulls
Heyworth            deceassed    poxe    yong
Towne               maryed       shee
The human being following along after the spell-checker could then mark the passage to be ignored. During
our first several months of work on the corpus, we simply replaced the whole of any such passage with a
single notation:
{deadGuys}
but that turned out to be a mistake. Although the passage was not itself appropriate for lexical analysis, the
lexicographers explained that it could help in the analysis of surrounding words. So later we would mark
such passages to be ignored by the indexing programs and other tools but leave them in place for human
readers looking at words before or after:



Right worshipfulls, This may be to acquaynt you that their is a
pore yong women in oure Towne of Aston-under-lyne infected with
a filthy deceassed called the French poxe and shee saith shee was
defiled by one Henry Heyworth a maryed man.

(At this writing, we haven’t yet gone back and restored the passages we deleted.)
The spell-checker was also helpful in finding passages in foreign languages. We were careful to mark
off only whole sentences and passages in foreign languages; we did not mark individual words, which might
be new assimilations.
Finding tables and verse was harder; some we were able to find automatically by looking (for instance)
for short line lengths or runs of numbers, but some we just stumbled upon in looking for other things.
We found some gaps in the material, and one document simply left off mid-sentence. In an ideal world
we would have restored the missing material, but we didn’t.
Some of the text had been typed in, some had been scanned with an optical character recognizer, and
most had been produced using electronic typesetting. All of these methods produce mistakes and so we
were not surprised to find typographical errors, OCR errors, and transcription errors in the text. Because
we used the spell-checker heavily to identify inappropriate material, we also found the sorts of errors that a
spell-checker can find, and those we corrected with a construction called the typo sundry:
{typo bad="theoretcial",good="theoretical"}
We used the typo sundry for typos and OCR errors and misspellings and whatever without distinction; we
couldn’t see any point in trying to guess the origin of each error. The typo sundry keeps both the erroneous
version and the corrected version, but only the corrected version gets indexed. (It may seem comical that
we carefully preserved this information while discarding long sections of tables and verse and so on, but
we were afraid that the fatigue of correcting errors by hand might lead us to mis-label perfectly reasonable
forms as typos.)
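A minimal Python sketch of how an indexer might honor the typo sundry, keeping only the corrected form for indexing while preserving the erroneous original; the pattern and function names are illustrative, not the project's actual code:

    import re

    # Matches the typo sundry shown above, e.g.
    #   {typo bad="theoretcial",good="theoretical"}
    TYPO_SUNDRY = re.compile(r'\{typo bad="([^"]*)",good="([^"]*)"\}')

    def text_for_indexing(raw):
        """Return the text the indexer should see: only the corrected form."""
        return TYPO_SUNDRY.sub(lambda m: m.group(2), raw)

    def typo_pairs(raw):
        """Yield (bad, good) pairs so neither version of the evidence is lost."""
        for m in TYPO_SUNDRY.finditer(raw):
            yield m.group(1), m.group(2)

    sample = 'a {typo bad="theoretcial",good="theoretical"} result'
    assert text_for_indexing(sample) == 'a theoretical result'
    assert list(typo_pairs(sample)) == [('theoretcial', 'theoretical')]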
Of course, spell-checkers can find only certain classes of spelling and typographical errors. We made no
systematic attempt to find the ones that the spell-checker couldn’t catch. If we chanced upon a mistake, we
corrected it with the typo sundry, but that’s all.
The OCRed material had much of the sentence punctuation missing or misread—commas for full stops
were the commonest error, and the one that seriously reduced our chance of finding sentence boundaries.
We used regular expressions to find such places and human labor to correct them.[1] Unlike our careful
preservation of lexical evidence in dealing with typos and spelling errors, we silently corrected punctuation
errors and left no trace behind. (The corpus would therefore be useless for a study of punctuation; too bad.)
The representation of essentially everything in the corpus except the roman alphabet varied from one
document to another (and sometimes within a document as well). For instance, we found fourteen ways of
representing an em-dash and rampant inconsistency in the conventions used for embedded quotes and final
quotes. Again, we used a combination of regular expressions and sweated labor to make notation fairly
consistent in the corpus. We started out over-enthusiastically trying to make everything perfectly consistent
and eventually settled on a more manageable set of notational conventions, laid out in Appendix A.
The reduction in bulk resulting from all this cleaning-up was substantial. In round numbers, we refer to
the corpus as having 20 million words. But in reality, Oxford provided us with 21.7 million at the beginning
of the processing, just for safety’s sake. By now, we have reduced the original 21.7 million words to 18
million, and still shrinking. We nonetheless call it 20 million words because we’re pretty confident that any
other 20-million-word corpus would yield about the same amount if processed the same way.

[1] Regular expressions let a computer user search for patterns in text; Unix systems are typically rich in regular-expression tools; other systems often lack them.

2.3 Why Cleanse the Corpus?

We had different motives for the different kinds of cleaning-up we did. For instance, we felt that we had
no choice but to supply missing structure to make the documents tractable. On the other hand, filling gaps
seemed nice but not worth the effort.
Regularizing notation seemed to us to be an engineering necessity. The problem about inconsistent
notation is that unless you fix it in the source, all the downstream tools become more complicated. If you
leave in fourteen kinds of dashes, the tagger, the parser, and all the indexing tools need to understand fourteen
kinds of dashes.
The duplications in the corpus struck us as irritating (and sure enough, the lexicographers complained
about the ones we didn’t catch). More seriously, though, we were also afraid that they would skew the
lexical evidence. For instance, one of the newspaper stories in the corpus used the word “perch” and the
word “pope” in adjacent sentences. If that story appears only once, the most significant collocate for “perch”
in the corpus is “skimmer” (a skimmer bream is another kind of fish) and the most significant collocate for
“pope” is “St” (abbreviation for Saint). But if the story is repeated often enough, “perch” and “pope” will
appear to be significant collocates of one another.
We agonized over removing inappropriate documents, but in the end we did decide to eliminate a few
troublesome ones. Again, we were worried about skewing the lexical evidence. In the Challenger Inquiry,
for instance, the word “booster” appears 464 times; in the whole rest of the corpus it appears 19 times. And
we were convinced that the lexicographers would refuse any new information offered by so idiosyncratic a
document. An expression that occurred only in the Challenger Inquiry might represent American speech,
or technical jargon, or formal discomfort, or some combination of the three; without parallel documents
along each axis, the lexicographers would have no way of guessing what was going on. The expression
“O-ring,” for instance, occurs 1,733 times in the Challenger Inquiry and nowhere else in the corpus. What
lexicographer would feel confident, on the basis of that evidence, to make any statement whatsoever about
the term?
The other kinds of cleaning-up that we did—removing inappropriate material within documents, correcting typos and misspellings, and supplying missing punctuation—took place on much shakier ground,
because we were aware that we couldn’t fix everything. 20 million words is just too much even for rabid
enthusiasts with good machine resources and a year to spend. We guess that we may have read as much as
one line in every seven or eight of the corpus; but a lot of lines remain unread.
There are two points of view from which to object to what we did. The first is that we were marking the
cards; the second is that we were wasting our time.
Luckily, we’re going to be able to shed some light here. By the end of the project we hope to be able to
report statistical differences between the raw and cleaned-up versions of the corpus and let the world decide
whether the cleansing was worth while. Being sensible people, we hope that the second point of view is
correct – that extensive hand work on a corpus doesn’t change it much, so there’s no point in doing it. But
we believe we may be the first ones in a position to test this hypothesis.

2.4 Adam: Our Wordclass Tagger

To help the lexicographers divide the words into senses, we first identified the wordclass (“part of speech”) of
every word in the corpus. Fortunately, wordclasses can be identified automatically with a reasonable degree
of accuracy. We adapted the algorithm described by Ken Church, and used the data from the Lancaster-Oslo-Bergen corpus (LOB), graciously provided to us in machine-readable form by Knut Hofland, to produce a
wordclass tagger called Adam[1, 3].
Briefly, the LOB data tells us how often a word was assigned a particular wordclass and how often any
combination of three wordclasses occurred in a row. From that, we can compute the lexical probability
(the odds that word X is of wordclass A) and the contextual probability (the odds that wordclass A will be
followed by wordclasses B and C). We multiply the lexical probability by the contextual probability and
take for each word the combination with the greatest product.
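The scoring rule can be written down compactly. The following Python sketch assumes the LOB counts have already been loaded into dictionaries; it shows only the choice for a single word given the two wordclasses that follow it, whereas Adam itself scores whole sentences:

    from collections import defaultdict

    word_tag_count = defaultdict(int)   # times word w carried tag t in LOB
    tag_count = defaultdict(int)        # times tag t occurred in LOB
    triple_count = defaultdict(int)     # times the tag triple (a, b, c) occurred

    def lexical_prob(word, tag):
        # the odds that this word is of this wordclass
        return word_tag_count[(word, tag)] / max(tag_count[tag], 1)

    def contextual_prob(tag, next_tag, next_next_tag):
        # the odds that this wordclass is followed by the two given wordclasses
        return triple_count[(tag, next_tag, next_next_tag)] / max(tag_count[tag], 1)

    def best_tag(word, candidate_tags, next_tag, next_next_tag):
        # take the wordclass whose lexical x contextual product is greatest
        return max(candidate_tags,
                   key=lambda t: lexical_prob(word, t) *
                                 contextual_prob(t, next_tag, next_next_tag))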
Church tested the algorithm on a small set of texts. As in the LOB corpus, the sentence boundaries in
Church’s texts were already marked. He reported a good rate of accuracy.
Since the sentence boundaries in the Hector corpus were not marked in advance, we adapted the tagger
to find them. In fact, although there was an LOB wordclass tag for the beginning of a sentence, Hofland
didn’t initially send us the data on triples where the beginning-of-sentence tag was the middle tag; it hadn’t
occurred to him that we were going to use this data to determine sentence boundaries automatically.
Even with the complete data, sentence boundaries remained a problem. The beginning of a sentence is
invisible: there’s no text to tag, not even a punctuation mark, so there’s nothing that has a lexical probability
for this funny kind of wordclass. We modified the algorithm to handle this situation.
Adam is fast enough to tag the entire corpus overnight by using a group of machines linked over a
network.

2.5 The Houghton Mifflin Parser

Looking back on it, we wonder how we were bold enough to undertake the Hector project before Houghton
Mifflin generously provided us with the use of their parser. It turned out to be the linchpin of the Hector
system.
The Houghton Mifflin parser is used in their CorrecText Grammar Correction System.[2] It breaks
the input text into sentences, assigns a wordclass tag to each word in the sentence, and identifies clause
boundaries for each sentence, subject and predicate boundaries for each clause, and prepositional phrase
boundaries.
We changed the program as little as possible, so that we could take improved versions as they became
available. In addition to increasing the parser’s size limits as much as possible, we made two modifications:
We filter the input file before it reaches the parser, passing through only those sections containing information
to be indexed. And we report the results in our preferred indexing format—every word is identified by its
character position and length in the original source file.
These two modifications interact in ways that make each task harder. For the text that we pass to the
parser, we must preserve information about its location in the original source file. Some of the parser’s
algorithms make it difficult to track the locations of the text that it is parsing. However, we managed to
produce a version that generally succeeded in tracking the words; occasionally, we would change the spacing
or punctuation in the corpus when we could not succeed in parsing the original version.
Why did we make these modifications? We found that unless we filtered out material that couldn’t be
parsed (like SGML markings and headers and addresses), the parser became very, very slow as it beat its
head against insoluble problems. With the modifications, the parser was fast enough so that we could yoke
a group of machines together and parse the whole corpus overnight.
[2] CorrecText is the registered trademark of the grammar-checker, which we understand is marketed with several different front ends.

2.6 Which Word Is This?

There are a number of different tools that operate on the corpus, and it was important that they be able to
identify and refer to the words in a consistent manner. For instance, both Adam and the Houghton Mifflin
parser were generating wordclass tags, and the indexing program needed to know when different tags applied
to the same word. If one program thought of a word as a pair of offsets and another program thought of
it as an offset and length, the conversion would be trivial but we’d waste time reconciling the two. If one
program thought of a word as including trailing whitespace and another thought of a word as excluding it,
we’d have chaos on our hands.
We decided to use the character position and length of a word in its corpus file as the standard representation of a corpus word for data produced by programs. Individual analysis tools worked with a single
source file containing the information on the position, length, and corpus file of every word in the corpus;
all the tools produced values in terms of this standard representation.
The indexing program combined all the data files from the different corpus files, and assigned each
corpus word a unique word index. It also provided a mechanism for translating between these word indexes
and the source file/character position representation.
We knew that the corpus was undergoing constant modification as we cleaned it up. Furthermore, we
suspected that it would be necessary to continue to modify the corpus after the lexicographers had started
sense-tagging it. So we wrote a tool that would analyze additions and deletions to the corpus and compute
the difference in the word indexes that these changes implied. For instance, if we had to remove a sentence
from the corpus, the word indexes of all the following words would get decremented by the number of words
in the sentence.
By storing all the data, including the lexicographers’ sense-tags, in terms of these word indexes, and
applying this analysis tool, we can continue certain classes of improvements to the corpus without losing the
sense-tag work that was done on an earlier version.
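A sketch of the index-adjustment idea in Python, assuming each edit to the corpus is reported as (start_index, words_deleted, words_inserted) against the old numbering; the names are invented, and the real tool is not reproduced here:

    def remap_word_index(old_index, edits):
        """Map a word index in the old corpus to its index after the edits,
        or None if that word was itself deleted."""
        shift = 0
        for start, deleted, inserted in sorted(edits):
            if old_index < start:
                break                        # this edit lies after the word
            if old_index < start + deleted:
                return None                  # the word was deleted
            shift += inserted - deleted      # later words slide by the difference
        return old_index + shift

    # Removing a 9-word sentence that began at word 1000 decrements
    # every later word index by 9 and invalidates the 9 removed words:
    assert remap_word_index(1500, [(1000, 9, 0)]) == 1491
    assert remap_word_index(1003, [(1000, 9, 0)]) is None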
Now in reality, we have been so busy since the lexicographers arrived that the corpus might just as
well have been frozen. But it is comforting to know that we can accommodate any changes we need to
make—such as removing some more of those blasted duplicates when we get the time.

2.7 Which Sentence Is This?

With sentences as with words, we couldn’t afford to let the different tools have different ideas about what
constituted a sentence and where each sentence began and ended. Since we were getting so much more
information out of the Houghton Mifflin parser than out of Adam (clause, subject, predicate, prepositional
phrase) we decided to let Houghton Mifflin decide where the sentence boundaries were in the corpus. In the
early months of the project, before we had the Houghton Mifflin parser, we had labored mightily to get Adam
to find sentence boundaries and even wrote a stand-alone heuristic sentence chunker. We were therefore in
an unusually good position to appreciate the accuracy of the Houghton Mifflin parser in recognizing sentence
boundaries in the face of such unwieldy material as lists, contract language, and speech transcripts.

2.8 The “hi” Server

Mike Burrows, one of our colleagues at SRC, helped us out by writing an indexing program called the “hi”
server, “hi” being short for Hector Information. One of the advantages to doing this project at SRC was the
chance to tap the expertise of folks like Mike.
We had originally thought of using the Pat program from Open Text for indexing the corpus, but we
stumbled over some bugs and Pat was unable to rebuild the index overnight, which was a requirement for us.[3] Because Mike was doing research in indexing, he consented to take us on as his guinea pigs, and the
relationship worked well. This paper is not the right place to describe the hi server (nor could we if we
wanted to). Suffice it to say that the server keeps compressed inverted indexes in memory and uses them
to answer queries with blinding speed; building new indexes on the entire corpus and all the wordclass and
syntax information takes only the few hours that remain in the night after Adam and the Houghton Mifflin
parser have completed. It does what we need, and that’s all we need to know.

[3] Open Text is at 180 King Street South, Suite 550, Waterloo, Ontario N2J 1P8. We used Pat extensively in other parts of the project and found it quite satisfactory.

2.9 Target Words

Since Hector was a pilot project, we wanted to ensure we learned as much as possible about the effectiveness
of our tools. We wanted the lexicographers to use the corpus tools heavily for the duration of the project.
Since the project would cover only part of the lexicon, we wanted that work done on appropriate words. If
the project were covering the entire lexicon, it wouldn’t have mattered what order the words were taken.
We decided to assemble a list of target words for the pilot project. Our hope is to have entries for at least
those words by the end of the pilot project.
We wanted to avoid words that occurred infrequently in the corpus, since the corpus tools would provide
little insight in writing dictionary entries for such words. We also wanted to avoid words that occurred very
frequently, since we suspected that such words presented difficult problems in lexicography, independent of
the corpus analysis, that would require too much of the lexicographers’ time.
At Oxford, the headword list of the 8th edition of the Concise Oxford Dictionary was mapped onto a
corpus frequency list in which words were grouped with their inflections. The headwords were then grouped
into bands that reflected their frequency in the corpus. Band 7 contained words with 200-500 occurrences in
the 12.8-million-word corpus that had been collected at that point, meaty but not overwhelming. We chose a
target list of 570 words from that band; their actual frequencies in the 20-million-word corpus ranged from
a low of 260 (“sweat”) to a high of 3099 (“practise”). The target words appear in Appendix B.
Later we added another set of target words, the wave-2 words, which occur in the corpus more than 100
and fewer than 1500 times and occur in the vicinity of the initial targets. So for instance the wave-2 word
“pint” and its inflections occur 284 times in the corpus, and they occur in the vicinity of the band-7 words
“bitter,” “boil,” “cream,” “equivalent,” and “sugar” 28 times. The wave-2 words appear in Appendix C.
The idea of the wave-2 words is to see whether the work already done on the target words aids future
work; does this kind of corpus lexicography get easier as you go along? Many of the wave-2 words appeared
to be profoundly uninteresting, but of course one of the amazing things for us is to watch the lexicographers
tease out threads of meaning and usage in words we hadn’t realized were at all complicated. To date the
lexicographers have done only a very few wave-2 words, so we have no results to report now.

3 The Lexicographer’s Workbench
This is what Hector looked like from the lexicographers’ point of view when they arrived in January:
The workstation is a DECstation 5000 with three color screens and a single mouse and keyboard. The
center screen displays the corpus viewer, which we call Argus. The right-hand screen displays the dictionary-entry editor, which we call Ajax. The left-hand screen we call Atlas, to maintain the theme of classical A’s,
and it displays mail, document editing, various small utilities, and of course a solitaire game to keep the
lexicographers occupied when the corpus viewer and the dictionary-entry editor are malfunctioning.
In a highly oversimplified scenario, suppose that the lexicographer decides to write the dictionary entry
for the word “tap.” First she[4] uses the corpus viewer (on the center screen) to look at “tap” as a noun:
chen windows let out the promising tap of fork on plate and the s
tat last summer, when suddenly the tap was turned off. Now, just
un away to Manchester and become a tap dancer?" Leonard Ford, who
fe like royalty. If you are royal, tap dancing is out. Another re
mother turning on the back-kitchen tap to fill the kettle for tea
e division in Washington. From the tap on Bloch’s phone, it knew
e voice. Gregory Hines, the superb tap dancer turned movie star,
 controlling the church’s main gas tap from under his personal pe
ae up to an inch long polluted the tap water of eight London boro
 Miller when, changing feet like a tap dancer, he somehow kicked
f select female genes have been on tap in recent years through th

... and so on. She decides to divide “tap” the noun into three senses: tap a little sharp sound, tap a spigot,
and tap a surreptitious listening device. She sketches out these senses in the dictionary-entry editor (the
right-hand screen) and assigns each a mnemonic label: CLICK, VALVE, and SPY. Back in the corpus viewer,
she hypothesizes that “tap” as a noun (an attributive noun) followed by “water” will always be the VALVE
sense of “tap,” so she tries a search with those constraints:

ae up to an inch long polluted the tap water of eight London boro
 ministers have convinced her that tap water really is safe and w
,000 people were warned not to use tap water after diesel oil lea
G has gone out to MPs not to drink tap water in some buildings at
e country about the quality of our tap water; fears of the conseq
 change in the fish’s health. Pure tap water might be all right f

Sure enough, that was right, so she tags all those occurrences with the VALVE tag. Now because the
sense-tags are immediately available for further searches and sorts, she can ask for all uses of “tap” as a
noun followed by any other noun but not tagged as VALVE:
fe like royalty. If you are royal, tap dancing is out. Another re
e voice. Gregory Hines, the superb tap dancer turned movie star,
 keeping the game moving by taking tap penalties instead of letti
 bbies. ‘It’s got door handles and tap fittings which can be chan
t, a true cat-lover and a splendid tap dancer. He even taught him
tland B prop, trundled over from a tap penalty and Hull converted
Happy". (Field had once mastered a tap routine to the Youmans son
als. The answer is to re-grind the tap seating with a simple but
eating with a simple but effective tap re-seating tool which will
> HAVE you ever wanted to tango or tap dance? If so, why not pop

She decides that any occurrence of “tap” followed by any inflection or derivative of “dance” is the CLICK
sense of “tap”, so she tries a search with those added constraints:
un away to Manchester and become a tap dancer?" Leonard Ford, who
fe like royalty. If you are royal, tap dancing is out. Another re
e voice. Gregory Hines, the superb tap dancer turned movie star,
 Miller when, changing feet like a tap dancer, he somehow kicked
t, a true cat-lover and a splendid tap dancer. He even taught him
> HAVE you ever wanted to tango or tap dance? If so, why not pop
l> IF YOU fancied a quick tango or tap dance then Gosford Hill Co
in Cornwall is to observe seagulls tap dancing on the lawn after
 and went to Egypt. I did my usual tap dancing on the table, but
in pork-pie hats and bell bottoms, tap dancing on the bollards li

Again the search pays off and she marks all those occurrences as CLICK.
52 instances of “tap” down, 513 to go.

[4] Half the lexicographers were men, the other half were women; we’re using “she” as a generic pronoun.

3.1 Hooking Words and Definitions Together

The corpus viewer and the dictionary-entry editor have to work together to get the corpus sense-tagged.
The corpus viewer knows about the corpus, the dictionary-entry editor knows about the word senses. The
dictionary-entry editor has to tell the corpus viewer about the senses. The lexicographer tells the dictionary-entry editor which entries are of interest to the corpus viewer by “activating” them. Any number of entries
can be active at once; when an entry is active, all its senses are active.
For each active sense, the dictionary-entry editor tells the corpus viewer its mnemonic label (e.g. CLICK,
VALVE, SPY), its sense uid (the true identifier of the sense, a number unique over the whole dictionary),
and other information the corpus viewer uses for sorting purposes (headword, homograph number, and sense
number). The active mnemonics must be unique. Hence there must be no duplicate mnemonics within an
entry, or in two entries that will be active at the same time.
While an entry is active, the dictionary-entry editor tells the corpus viewer about every change to its set
of senses: additions and deletions of senses, and changes to the mnemonics, homograph numbers, sense
numbers, and headword fields.
Why not just have the whole dictionary active at once? It would make the whole system too slow, and
the mnemonics (which the lexicographers far prefer to the six-digit uids) would have to be unique across the
whole dictionary, not just across the active entries.
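The bookkeeping the two tools agree on can be summarized in a few lines. A Python sketch, assuming the editor reports each active sense as (mnemonic, uid, headword, homograph number, sense number); the field and method names are ours, not Hector's:

    class ActiveSenses:
        """What the corpus viewer knows about the currently active senses."""

        def __init__(self):
            self.by_mnemonic = {}   # mnemonic -> sense record
            self.by_uid = {}        # sense uid -> sense record

        def activate(self, mnemonic, uid, headword, homograph, sense_number):
            existing = self.by_mnemonic.get(mnemonic)
            if existing is not None and existing["uid"] != uid:
                raise ValueError("mnemonic %r is already active for another sense" % mnemonic)
            record = {"mnemonic": mnemonic, "uid": uid, "headword": headword,
                      "homograph": homograph, "sense_number": sense_number}
            self.by_mnemonic[mnemonic] = record
            self.by_uid[uid] = record

        def deactivate(self, uid):
            record = self.by_uid.pop(uid, None)
            if record is not None:
                del self.by_mnemonic[record["mnemonic"]]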

3.2 Argus: The Corpus Viewer

Named for the mythological creature with a hundred eyes, Argus provides a dynamic, interactive concordance
for the lexicographer. It occupies the middle screen because we think of it as being at the center of the work
the lexicographers are doing during the pilot project. There are two main windows in the corpus viewer,
the query window and the concordance window. In the query window, the lexicographer specifies what she
wants to search for. Then when she clicks the Search button, the corpus viewer displays the matches in the
concordance window, where she can rearrange them at will, and where she links occurrences in the corpus
with dictionary senses in the dictionary-entry editor.

3.2.1 Searching for a Word or a List of Words

The first step in using the corpus viewer is to specify a query. A single word is the simplest query. The
lexicographer can search for a whole list of search words at once:
hand | hands | handing | handed
The search words need not be related:
hand | any | hogwash | yellow | Patricia
(although it’s hard to imagine why anyone would make such a query).
Occasionally the lexicographer may type in a list of words by hand, but usually she types in only one
word and then has the corpus viewer generate a list from that word.
The simplest generated sets are the case-variants. All the words in the corpus have been indexed, and the
index is case-sensitive. For “hand,” the generated case-variants are “Hand” and “HAND.” The initial-capital
variant is useful for matching words at the beginning of sentences; the fully capitalized variant is useful for
matching words in newspaper headlines. Other possible case-variants, such as “hAnD,” are not generally
useful; when they occur at all in the corpus, they usually indicate an acronym or an initialism.
The corpus viewer can also generate inflections for nouns, verbs, adjectives, and adverbs. If the
lexicographer types the word “hand” and asks the corpus viewer to inflect it as a noun, the corpus viewer
will provide the list:
hand    | Hand    | HAND
| hands   | Hands   | HANDS
| hand’s  | Hand’s  | HAND’S
| hands’  | Hands’  | HANDS’
The inflection code can handle regular and irregular inflections and has been steadily improving over the
course of the lexicographers’ stay with us, as they point out errors. We plan to publish it at the end of the
project.
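For a regular noun, the generated list above amounts to crossing the inflected forms with the three useful case-variants. A Python sketch of that step only; the project's real inflection code also handles irregular forms and other wordclasses, and is not reproduced here:

    def case_variants(form):
        # lower-case, Initial-Capital (sentence starts), ALL-CAPS (headlines)
        return [form, form.capitalize(), form.upper()]

    def regular_noun_forms(word):
        return [word, word + "s", word + "'s", word + "s'"]

    def noun_query(word):
        return [v for form in regular_noun_forms(word) for v in case_variants(form)]

    print(" | ".join(noun_query("hand")))
    # hand | Hand | HAND | hands | Hands | HANDS | hand's | Hand's | HAND'S | hands' | Hands' | HANDS'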
Told to expand “hand” as an adjective, the corpus viewer will blithely do so:
hand | Hand | HAND
| hander | Hander | HANDER
| handest | Handest | HANDEST
Non-words in the list such as “handest” don’t present a problem because they won’t occur in the index. Of
course, some words look like non-words but turn out to be real. “Hand, hander, handest” looks quite bogus,
but in another setting “hander” is a perfectly sensible word:
A strong left hander, she is a pupil at Banbury School.
By editing the type-in area, the lexicographer can delete unwanted words from the list.
The facility that naive users find the most mysterious is the one for generating all the “related” words,
that is, all the words found in the Hector corpus that may be derived from the search word, or compounds of
the word, or otherwise related to it. The words related to “cake,” for example, include:
anti-caking | cakehole    | fruitcake    | salt-caked
beefcake    | cakewalk    | layer-cake   | sheepcake
cake-eaters | cheesecake  | mud-caked    | shortcake
cake-holes  | cupcakes    | oatcakes     | wedding-cake
cake-icing  | filth-caked | pancake-like | worm-cake
cake-mixes  | fishcake    | rock-cakes   | yellowcake

This information has already been stored for all the Hector target words. If the lexicographer asks for words
related to a word that’s not a target, the corpus viewer sends electronic mail to us, and we then find all the
related words and store them so that the results are available for subsequent requests.[5]

[5] The lexicographers had requested such a facility long before they arrived but were unable to suggest any principled way to provide it, so we settled on the low-tech solution. We do “grep -i” using the simplest of simple Unix tools for the word in the index (with a small bit of cleverness about trailing vowels and adjectival forms) and then read through the results and reject ones that aren’t related to the word. If anybody knows how to write a program that can find “cakehole” as a derivative of “cake” or tell that “inclemencies” is related to “clement” while “encirclements” is not, our hat’s off to them.

3.2.2 Searching for a Wordclass or Wordclasses

The lexicographer can specify that the search be restricted to certain wordclasses. For example, if the query
word is “butter,” the lexicographer can search for it used only as a verb.
To provide wordclass information for the search, the corpus viewer uses the union of the information
from Adam and the Houghton Mifflin parser. That is, if the search is restricted to, say, adverbs, then the
concordance will include words that either Adam or the Houghton Mifflin parser marked as an adverb. The
corpus viewer doesn’t let the lexicographer specify wordclass by tagger (“adverb according to Adam” or
“reflexive pronoun according to HM”), but it should—and will, shortly after we’ve finished writing this
paper. (One of the reasons to write up results is to be shamed into correcting small errors and infelicities.)
The lexicographers can choose a wordclass or wordclasses from this list:
noun                 determiner
proper noun          number
verb                 ordinal
adjective            modal
adverb               auxiliary
degree adverb        possessive
preposition          infinitive marker
personal pronoun     negative
reflexive pronoun    conjunction
pronoun              other

We are still struggling with how to give the lexicographers finer control over wordclass constraints.
If the lexicographer constrains the search to a certain wordclass or wordclasses, she needn’t provide
a search word; the corpus viewer will search for whole wordclasses such as pronouns or determiners. In
practice, the lexicographers don’t have occasion to use this ability for the main search because dictionaries
are arranged by word, not wordclass.

3.2.3 Searching for Syntax, Position, Genre, Authorship, Etc.

The Houghton Mifflin parser identifies the sentence and clause boundaries and the subject, predicate, and
prepositional-phrase boundaries in the corpus, but the corpus viewer doesn’t permit the lexicographers to use
those boundaries as constraints in searches. We also know the genre and authorship of every document in the
corpus, but again, the corpus viewer doesn’t permit the lexicographers to use this knowledge as constraints
in searches.
The corpus viewer used to permit all these kinds of constraints when the lexicographers first arrived, but
Hector had a lot of starting-up problems at that point, and we found ourselves simply jettisoning some kinds
of searching constraints. We plan before the end of the project to experiment with adding back in constraints
on syntax, position, genre, authorship, etc., to see whether the lexicographers find them useful.

3.2.4 Searching for a Sense or Senses

In addition to wordclass constraints, the lexicographer can place sense-tag constraints on the search. The
tags themselves are simply the lexicographers’ mnemonics for dictionary senses, and the entries need to be
active to be used in a search. (“Active,” you may recall, means that the dictionary-entry editor is telling the
corpus viewer about the senses and monitoring changes to them.)
There are three predefined tags, which are always active: P for proper name, T for typographical error,
and U for unassignable. The word “Twist” in the name “Oliver Twist,” for example, would be tagged P.
The lexicographer can also ask for all sense-tagged words or no sense-tagged words even when only
some or none of those senses are active. By excluding all sense-tagged words, for example, the lexicographer
can easily see how much work is left to be done on a particular word.

3.2.5 Searching for Words with Collocates

The corpus viewer also lets the lexicographers specify search words that occur in the context of other words,
their collocates. The lexicographer specifies a distance and a direction between the search word and the
collocate. For example, -5,+3 means that the collocate must occur within 5 words to the left of the search
word or within 3 words to the right. +1 means that the collocate must be the word immediately following
the search word. -5,-2 means that the collocate must occur no more than 5 but at least 2 words to the left of
the search word.
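The positional test itself is simple. A Python sketch, treating the constraint as a signed offset range and identifying words by their corpus word indexes; in practice the search runs against the hi server's indexes rather than over Python data:

    def in_window(search_pos, collocate_pos, lo, hi):
        """True if the collocate's offset from the search word lies in the
        signed range lo..hi, excluding 0 (the search word itself)."""
        offset = collocate_pos - search_pos
        return offset != 0 and lo <= offset <= hi

    assert in_window(100, 97, -5, +3)        # 3 words to the left satisfies -5,+3
    assert not in_window(100, 104, -5, +3)   # 4 words to the right is outside +3
    assert in_window(100, 101, +1, +1)       # +1: the immediately following word
    assert in_window(100, 98, -5, -2)        # -5,-2: between 2 and 5 words left
    assert not in_window(100, 99, -5, -2)    # only 1 word to the left fails -5,-2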
Like search words, collocates can be lists rather than single words; the corpus viewer can generate the
lists automatically; and collocates can be constrained with wordclass and sense-tag restrictions. Like search
words, collocates need not be any particular word as long as they’re constrained by at least one wordclass or
sense-tag. The lexicographers don’t have any reason to search for any noun, but they often search for some
specific word with any noun as a collocate.
The lexicographer can specify any number of collocates, including nested collocates: “drop” with
“potato” as a -5,+5 collocate and “hot” as a -1 collocate of “potato”—i.e. drop something like a hot potato.

Although the corpus viewer imposes no limit on the number of collocates or the depth of nesting, the
lexicographers rarely construct complex queries.
We were very surprised that the lexicographers preferred in specifying collocates to use pure position
rather than some kind of syntactic constraint like “within the same sentence” or “within the same clause.”
We might try to convince them to experiment with syntactic constraints before the project is over.

3.2.6 Looking at Concordances

The result of doing a search is a concordance. As each example is found, it is written in the concordance
window, which scrolls up to accommodate successive lines.
Each line of the concordance contains three fields: the sense-tag, the source name, and the text.
The sense-tag field shows what sense-tag, if any, has been assigned to the target word of this line. If the
tag is active, then the mnemonic is shown; otherwise the uid is shown.
The source name is a 6-letter abbreviation for the corpus document in which the citation appears.
The text of the citation shows 80 characters from the corpus, and the search words are vertically aligned.
(The name for this kind of format is KWIC, keyword in context.) We experimented with other display
formats, particularly with whole sentences in a variable-width font. The lexicographers hated it. The search
words didn’t get lined up in the display, so they needed to be highlighted; we chose to highlight them in red.
The lexicographers referred to the highlights as “the river of blood.”
If the query includes collocates, they are highlighted on the concordance line. (The highlight is green,
and since the collocates are dotted around, one lexicographer has suggested they might call it “the meadow
of shamrocks.”)
The lexicographer can get a quick preview by asking for a count of how many concordance lines a query
would produce. The concordance window also contains facilities for expanding a concordance line in a
pop-up window containing either a few paragraphs or the entire document. The wordclass and syntactic
information for the citation is available in another pop-up window.
Each time the lexicographer clicks Search, a new concordance appears in the window, replacing the
previous contents. There is no facility for seeing more than one concordance at a time (although of course
the lexicographer can write a complicated query that produces what she thinks of as two concordances
combined); we could easily provide one, but the lexicographers haven’t asked for it. There is a facility for
saving a concordance in a file, either in KWIC format or as whole sentences.
The initial display of the concordance lines is roughly in genre order (actually in corpus order, with the
corpus arranged roughly in genre order). Once the concordance lines are displayed, the lexicographer can
sort them. There are five primary sorts:

• Sort by the first word following the target word, and if there’s a tie, the second, and so on.
• Sort by the first word preceding the target word, and if there’s a tie, the second, and so on.
• Sort on the search words. For instance, when the lexicographer has searched for “hand | hands,” this sort puts all “hand” concordance lines together followed by all “hands” lines.
• Sort by the order of the documents in the corpus (if the concordance lines have been sorted into some other order and the lexicographer wants them back in corpus order).
• Sort by dictionary sense.
The lexicographers can break ties in the primary sort by specifying a secondary sort, which can be another
of the five orderings above or “Don’t care.” “Don’t care” is the default secondary sort.
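The first two sorts are just lexicographic comparisons of the right or left context. A Python sketch, assuming each concordance line carries its citation tokens and the position of the target word within them; this is an illustration, not Argus's code:

    def right_context_key(line):
        tokens, target = line["tokens"], line["target_index"]
        # first following word, then the second to break ties, and so on
        return [w.lower() for w in tokens[target + 1:]]

    def left_context_key(line):
        tokens, target = line["tokens"], line["target_index"]
        # first preceding word, then the second, reading leftwards
        return [w.lower() for w in reversed(tokens[:target])]

    def sort_by_following(lines):
        return sorted(lines, key=right_context_key)

    def sort_by_preceding(lines):
        return sorted(lines, key=left_context_key)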


The corpus viewer also contains a few other utilities. One, for instance, pops up an edit-window that
the lexicographers use to compose electronic mail; the most recent error message from the corpus viewer is
automatically copied into the body of the e-mail text. This makes it simple for the lexicographers to send
more meaningful questions—and bug-reports—to the developers. There is also a button that pops up the
complete on-line manual for the corpus viewer.

3.2.7 Sense-Tagging Concordance Lines

The lexicographer can sense-tag the search word on a concordance line by typing into the sense-tag field. A
number of different things can go there:

ž An active sense-tag mnemonic. The tags P, T, and U are always active.
ž An active sense-tag mnemonic followed by a question mark, showing that the lexicographer feels a
degree of uncertainty about the tag. The question mark is only a note to the lexicographer; it does not
influence searches in any way.
ž An active sense-tag mnemonic followed by one of the following suffixes:
– P Even though the word appears in a proper name, its original sense is still relevant. For example,
the word “Diet” in the proper name “Diet Pepsi” is related to the food-related sense of the word
“diet.” The -P suffix contrasts with the P tag. Dickens may have chosen Oliver Twist’s name for
some reason connected with a sense of the word “twist,” but that sense is no longer relevant in
the name.
– M A metaphorical use of the sense.
– X An exploitation of the sense—some kind of odd syntax or setting. For instance, the lexicographer might note that it’s generally one person who twists another’s arm; but “cruel fate twisted
her arm” is still that same sense of twist even though it has an atypical subject.
These suffixes can be followed by the question mark.

• Any number of active sense-tag mnemonics (with suffixes) connected by the word OR. This indicates
that more than one sense is involved or that it is difficult to distinguish between them. A word marked
with more than one tag will be found during a search for any of its tags.
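
Taken together, these rules give the sense-tag field a small grammar. The following sketch, in Python with a made-up set of active mnemonics (the real set comes from the entry being written), shows one way to check a typed field against it:

import re

ACTIVE_TAGS = {"P", "T", "U", "1.1a", "2"}   # illustrative; P, T, and U are always active

# mnemonic, optional -P/-M/-X suffix, optional trailing "?"
TAG = re.compile(r"^(?P<mnemonic>[^\s?]+?)(?:-(?P<suffix>[PMX]))?(?P<uncertain>\?)?$")

def parse_sense_tag_field(text):
    """Return (mnemonic, suffix, uncertain) for each OR-connected tag, or raise."""
    tags = []
    for part in (p.strip() for p in text.split(" OR ")):
        m = TAG.match(part)
        if not m or m.group("mnemonic") not in ACTIVE_TAGS:
            raise ValueError(f"not an active sense-tag: {part!r}")
        tags.append((m.group("mnemonic"), m.group("suffix"), bool(m.group("uncertain"))))
    return tags

print(parse_sense_tag_field("1.1a-M? OR 2"))
# -> [('1.1a', 'M', True), ('2', None, False)]
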
The corpus viewer gives the lexicographer several ways of batch-tagging whole groups of concordance
lines in one fell swoop. As a precaution against accidents, the corpus viewer requires the lexicographer to
click special buttons before overwriting or removing a sense-tag.
Finally, there is the button labelled Commit. No changes in the sense-tag assignments take effect until
the lexicographer clicks this button. When she does, the corpus viewer conveys the new assignments to the
index server, which makes the new information available to all the lexicographers.

3.3 Ajax: The Dictionary-Entry Editor

The lexicographers use the dictionary-entry editor (the right-hand screen) to create new dictionary entries
and to look at existing entries. The dictionary-entry editor permits the lexicographer to work on as many
entries at a time as she wishes, each in a separate window.
Here are the goals that the dictionary-entry editor set out to achieve:
The entries that the dictionary-entry editor produced had to be suitable both for typesetting programs and
for computer analysis. It was a goal for the entries to contain clear marks on each type of information, such
as register and grammar, so that no one in the future would have to extract such information by attempting to
analyze more general text. Another goal was to capture accurately the hierarchical relationship of the entry,
to avoid duplicating information within the entry.
We wanted to shield the lexicographers from the details of the representation as much as possible, so
that they could focus on the content of an entry rather than its form.
The dictionary-entry editor had to permit the structure of entries to evolve, to accommodate new
lexicographic insights as the project progressed.
Because an overall goal of the project was to mark the corpus with sense-tags, the dictionary-entry editor
was responsible for assigning and maintaining the “identity” of senses in such a way that the sense-tags
would continue to be valid as the lexicographer revised an entry. In particular, the lexicographer needed to
be able to add a sense, merge two senses into a single sense, make one sense a subsense of another, or change
the order of senses within the entry without invalidating the sense-tagging that had already been done.
It had to be easy to copy examples from the corpus into dictionary entries.
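
The sense-identity goal above amounts to keeping the number a reader sees separate from the identifier a corpus sense-tag records. A minimal sketch of the idea follows, with invented class and field names; the uid value is the one quoted as an example in Section 3.5, and the definition text is a placeholder:

class Sense:
    """A sense keeps a stable uid; its display number is derived from position."""
    def __init__(self, uid, definition):
        self.uid = uid            # never changes: this is what a corpus sense-tag records
        self.number = None        # e.g. "1.1a"; recomputed whenever the entry is reorganized
        self.definition = definition
        self.subsenses = []

senses_by_uid = {774662: Sense(774662, "placeholder definition for 'bear 1.1a'")}

def resolve_sense_tag(uid):
    # Moving, merging, or renumbering senses only updates this mapping,
    # so tags already written into the corpus stay valid.
    return senses_by_uid[uid]
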

3.3.1 Writing a Dictionary Entry

The dictionary-entry editor permits three different views on an entry, a simulated print view, a full view, and
a skeletal view. The simulated print view is displayed in a separate window and can be seen at the same
time as the full view or the skeletal view. The lexicographer must, however, choose between the full view
and the skeletal view in the main window; only one can be seen at a time.
The simulated print view permits the lexicographer to see how an entry will look; this view is provided
by a program called Sid, written at Oxford for viewing dictionaries. It permits the lexicographer to proof the
entry for the dictionary and is familiar and easy to read when a lexicographer wants to check an entry quickly,
for instance, to compare it with a similar entry that she’s actually working on. This view is read-only; it is
not possible to edit the contents of the simulated print view directly.
The full view presents an explicit representation of the entry hierarchy. The goal of the full view is to
make the complete entry structure visible and accessible. In the full view, the lexicographer can produce any
legal entry, no matter how complex. The hierarchy itself can be modified in a fairly straightforward manner.
For instance, it is easy to take a sense and all its subsenses and move it anywhere in the hierarchy—even
make it a subsense of another sense.
The full view, however, can deal only with a complete entry. The hierarchical framework must always
be in place. So the lexicographer has to consider the structure of the entire entry – sometimes before she
is ready. Worse yet, early versions of the full view wasted so much screen space making the details and
relationships of the hierarchy clear that the lexicographer was forced to think about the whole entry, and she
couldn’t even see it.
The skeletal view in the dictionary-entry editor was motivated by the goal of making it easy for the
lexicographer to assemble an entry bottom up—to start identifying senses of a word and worry about their
relation to one another after they have been identified. The skeletal view tries to make as many senses as
possible visible on the screen and makes it easy to add and delete senses.
But the lexicographer can create only a limited set of fields in the skeletal view. In particular, the fields
must all be fields of a particular sense; they can’t, for instance, be fields of a homograph. This restriction has
led lexicographers to encode information in inappropriate fields. For instance, if the lexicographer wants to
note variant forms of the headword, she can’t record this information directly in the skeletal view. Instead of
switching to the full view, she may record variant forms in a note field, or as part of the definition, or leave
them out altogether. At some level the lexicographers understand that if they do that we will no longer be
able to analyze variant forms in the entries. But in the heat of frenzied composition they forget, or maybe
they don’t yet believe in the utility of having us analyze variant forms.

Both the full view and the skeletal view in the dictionary-entry editor support an operation called folding.
Folding a sense removes from the screen all its fields except the tag, sense number, grammar, and definition.
Folding a sense reduces the amount of space it needs on the screen, so folding an entire entry lets the
lexicographer get an overview of the structure and content of the entry.
Lexicographers do most of their work in the skeletal view. Although the range of entries it can produce is
restricted, it is adequate for most of the work in the pilot project. Its model of how a lexicographer develops
an entry corresponds to the way the lexicographers actually work; the full view corresponds to what we as
users and computer scientists want from an entry, but not to how the lexicographer wants to build it up. It
takes anywhere from 5 to 60 seconds to switch between the full view and the skeletal view, so lexicographers
are reluctant to change views.
Final editing of an entry is also awkward because of sluggish performance in maintaining the form
hierarchy as text fields grow larger. Until we tune the performance, we won’t be able to judge the dictionary-entry editor as a tool for the entire process of composing an entry.

3.3.2 Numbering the Components of an Entry

In its original design, the dictionary-entry editor managed all sense and homograph numbering automatically,
based on the position of the sense or homograph in the entry hierarchy. This ensured that the numbering
was always correct and consistent. When the lexicographer wanted to change a sense number, she did so
by moving the sense into the proper position in the entry. (In the full view, lexicographers can move fields
before, after, or into another field.)
The lexicographers found this design awkward because it required a number of mouse and menu
operations to move a sense; most of the time they have their hands on the keyboard, so using the mouse is
slow, and the awkwardness was compounded by slow response times. Also, it was frequently the case that
they couldn’t see both the original location of the sense to be moved and the location where they wanted to
move it. They had to spend time and attention navigating the hierarchy when they knew in principle where
they wanted the sense to go.
We changed the way numbering worked so that it was always the responsibility of the lexicographer to
manage the sense numbers. At that point in the evolution of the dictionary-entry editor, the lexicographer
could move a sense only by assigning it the desired sense number. The dictionary-entry editor would then
sort the entry to reflect the assigned sense numbers. The entry editor also checked that the lexicographer
had assigned values that were internally consistent, that is, that she hadn’t assigned the same sense number
to two senses and that the values of the sense-number fields were indeed valid sense numbers.
This was an improvement for the lexicographers, but managing all the numbers proved tedious, particularly for entries with many senses. Adding a new sense in the middle meant renumbering all the senses that
followed.
To simplify such renumbering, we added a new command which automatically renumbers entries. It
assumes that the current order and nesting level is correct and assigns new numbers in increasing order. The
lexicographers can thus type in numbers themselves to get a rough cut at the numbering and then ask the
dictionary-entry editor to renumber when things start to get messy.
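
A sketch of that renumbering pass, assuming the entry is held as a tree of senses in their current order; the numbering style shown (1, 2, 2.1, ...) is illustrative rather than Hector's exact scheme:

def renumber(senses, prefix=""):
    """Trust the current order and nesting; assign fresh consecutive numbers."""
    for i, sense in enumerate(senses, start=1):
        sense["number"] = f"{prefix}{i}"
        renumber(sense.get("subsenses", []), prefix=sense["number"] + ".")

entry = [
    {"definition": "a twisting movement"},
    {"definition": "an unexpected development",
     "subsenses": [{"definition": "in a plot"}, {"definition": "in negotiations"}]},
]
renumber(entry)
# entry[1]["number"] == "2", entry[1]["subsenses"][0]["number"] == "2.1"
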

3.4 Atlas: Supporting Information

Atlas is a name that no one uses for a number of small programs that provide the lexicographers with
additional information. We have learned that each lexicographer has her own way of working and her own
idiosyncratic set of tools. No lexicographer uses all the Atlas tools; probably no lexicographer uses none of
them. Most of the Atlas tools are low-tech programs without fancy graphical interfaces.

“tally” reports lexicographic progress on the Hector project—how many words have been tagged in the
corpus, how many entries have been written, which lexicographers have been working on which entries,
how many senses each entry has, and how many target words and wave-2 words have been completed. For
instance:
> tally
Number of tokens tagged: 81124
  Which is 00.4% of the 17301331 tokens in the total corpus
  and roughly 22.1% of the 366670 target tokens in the corpus
TARGET TOTAL: 133  (23.3% of the 570 target entries)
WAVE2 TOTAL:    7  ( 2.0% of the 355 wave2 entries)
OTHER TOTAL:   24
“stats” tells about occurrences of words in the corpus—occurrences, distribution peaks (documents in
which the word occurs more often than usual), wordclass information, and related words. For instance:
> stats -w grace
grace    noun   Ad 463   HM 445   (Grace GRACE)
grace    verb   Ad  14   HM  24   (Grace)
===
graced   verb   Ad  33   HM  33
===
graces   noun   Ad  25   HM  22   (Graces)
graces   verb   Ad   0   HM   3
===
gracing  verb   Ad   4   HM   4

The first line is the invocation of the program (give me stats about wordclasses on the word “grace”). The
second line says that “grace” was identified as a noun 463 times by Adam (Ad) and 445 times by the
Houghton Mifflin parser (HM) and that the case-variants “Grace” and “GRACE” occurred and are being
lumped in with “grace.” And so on. The lexicographers use “stats” to get an initial fix on a word.
“coll” tells about the significant collocates of a word in the Hector corpus, case-free or case-significant,
using Mutual Information and t-score, both of them standard statistical tools, as measures of significance.
For instance:
> coll -cs shorter
shorter
CASE-SENSITIVE
 a+b      a      b      MI      t
   3    271    241    7.66   1.72   shorter tours
   3    271    248    7.62   1.71   shorter varieties
  32    271   4431    6.88   5.56   shorter working
   3    271    463    6.72   1.70   shorter periods
  19    271   3579    6.43   4.26   shorter hours ...
CASE-SENSITIVE
 a+b      a      b      MI      t
   7    364    271    8.29   2.63   inches shorter
   5   2316    271    5.13   2.11   claim shorter
   6   3232    271    4.92   2.29   longer shorter
  25  15713    271    4.69   4.63   much shorter
   3   2437    271    4.32   1.57   union shorter ...

The first line invokes the program (tell me about significant collocates of “shorter” case-sensitive, that is,
without the case-variants “Shorter” and “SHORTER”). Collocates with “shorter” on the left are given first,
then collocates with “shorter” on the right. There are three instances where “tours” appears to the right of
“shorter”; “shorter” appears altogether 271 times, “tours” altogether 241 times. The Mutual Information
score for the significance of the collocation is 7.66, the t-score is 1.72. And so on.
Unlike the corpus viewer, “coll” gives the lexicographer no control over the position of the collocate; it
must occur within -5,+5 of the search word. The lexicographers have been asking that we revise “coll” to
use lemmas rather than wordforms, and it’s on our list.
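
For reference, the usual formulations of the two measures, following Church and Hanks, are sketched below. How "coll" normalizes for the -5,+5 window and which corpus size it uses are not spelled out here, so the window constant is an assumption and the numbers produced will not necessarily reproduce the table above:

import math

def mi_and_t(joint, f_node, f_collocate, N, window=10):
    """joint: co-occurrences within the window; f_*: corpus frequencies; N: corpus size."""
    expected = f_node * f_collocate * window / N        # co-occurrences expected by chance
    mi = math.log2(joint / expected)                    # Mutual Information
    t = (joint - expected) / math.sqrt(joint)           # t-score
    return mi, t

# e.g. "shorter" (271 occurrences) and "tours" (241) co-occurring 3 times
# in a corpus of roughly 17.3 million tokens:
mi, t = mi_and_t(3, 271, 241, 17_301_331)
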
“beth” tells about Beth Levin’s verb patterns, using information she kindly sent to us in advance of the
publication of her forthcoming book by the University of Chicago Press. For instance:
> beth barter
barter  5.7
5.7  EXCHANGE    vtr          to VERB A for B
     EXCHANGE    exchange     to exchange the dress for the skirt
     SUBSTITUTE  substitute   to substitute the cup for the glass

Here the program invocation asks for information on the verb “barter.” The response says that “barter” fits
pattern 5.7, transitive verbs of exchange. Their pattern is “to VERB A for B,” and some examples are the
verb “exchange,” as in “to exchange the dress for the skirt,” and “substitute,” as in “to substitute the cup for
the glass.”
“corpusdoc” gives information about documents in the corpus: their source, authorship, length, and so
on. For instance:
> corpusdoc Militr
code: Militr
title: Military Illustrated: Past and Present
comment: monthly magazine on military events, uniforms and
artefacts, March
date: 1991
author: unknown
age: unknown
authmode: corporate
sex: unknown
nationality: unknown
domicile: unknown
compos: composite
publisher: Military Illustrated Ltd
place: London, UK
genre: written; published; periodicals; magazines
samplen: 25265
“checkentry” checks some of the more technical fields of an entry (subject, register, and grammar) for
conformity to a policy document about what those fields should contain. We’ve just discovered that through
an error on our part the program hasn’t been running since early in May, and none of the lexicographers
seems to have complained.
“printentry” prints out a paper copy of an entry using typography that suggests the appearance of the
final printed copy of the dictionary, not unlike the simulated-print view in the dictionary-entry editor. The
lexicographers complain piteously when it malfunctions, so we can tell that it gets lots of use.
And then there are shell commands to put up windows containing simulated-print views of entries from
various Oxford reference works such as the Oxford Dictionary of Quotations and the new edition, still in
preparation, of the Oxford Shorter English Dictionary. We use Pat and modest home-made front ends in
these commands.

3.5 Gritty Details

The corpus viewer and the dictionary-entry editor are written in Modula-3 [4]. Their user-interface code
uses the X Window System [5]. It is built on top of SRC’s FormsVBT library, which is built in turn on top
of SRC’s Trestle and VBTkit. (VBTkit, Trestle, and Modula-3 are available via anonymous FTP on
gatekeeper.dec.com.) The corpus viewer consists of approximately 15,000 lines of source code,
the dictionary-entry editor of approximately 25,000. Because they are built from standard user-interface
libraries, copying text between one windowed application and another is straightforward. The lexicographers
can, for instance, copy examples from the corpus viewer into the dictionary-entry editor.
The dictionary-entry editor produces entry files that are valid SGML-marked text files. The entry editor
ensures that lexicographers produce only valid entry files by managing the structure of the entry itself, and
by checking the text that has been entered to ensure it contains nothing that might be mistaken for SGML
marking.
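
A minimal sketch of that text check: characters that SGML reserves are escaped before they reach the entry file. The entity names used are the standard SGML ones, shown only for illustration:

SGML_ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}

def escape_field_text(text):
    # Anything the lexicographer types is passed through this before it is
    # written into the SGML entry file, so free text can never be parsed as markup.
    return "".join(SGML_ESCAPES.get(ch, ch) for ch in text)

print(escape_field_text('twist & turn, as in "<arm>"'))   # -> twist &amp; turn, as in "&lt;arm&gt;"
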
We use a specification file to describe the structure and elements of an entry. This gives us a central
location for modifying this structure and lets us separate the details of the structure from the rest of
the dictionary-entry editor, which can manipulate any such structure. The lexicographer can add new
information only in ways that are consistent with the spec file. For instance, a field can be moved only to a
location that is consistent with the structure described in the specification.
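
One way to picture the specification file's role is as a table of which element may contain which, consulted before a field is created or moved. The element names and the shape of the table below are invented for illustration:

SPEC = {
    "entry":     ["homograph"],
    "homograph": ["variant", "sense"],
    "sense":     ["definition", "example", "grammar", "register", "note", "sense"],
}

def may_contain(parent_kind, child_kind):
    return child_kind in SPEC.get(parent_kind, [])

def move_field(field, new_parent):
    # The editor refuses any move that the specification does not allow.
    if not may_contain(new_parent["kind"], field["kind"]):
        raise ValueError(f"a {new_parent['kind']} may not contain a {field['kind']}")
    field["parent"] = new_parent

# A variant-forms field belongs under a homograph, not under a sense:
assert may_contain("homograph", "variant") and not may_contain("sense", "variant")
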
When the lexicographers want to change the entry structure for the dictionary, it is relatively straightforward to produce a version of the dictionary-entry editor that understands the new structure. The hard
problem is modifying existing entries so they conform to the new structure. For some changes, such as
adding a new field, no changes to the existing entries are needed. However, when the entry structure is
changed so that previously legal entries become illegal, it is necessary to convert illegal entries into legal
ones. Since this can be a difficult task, Hector has grown increasingly conservative about making changes
to the entry structure that are not upwardly compatible.
The dictionary-entry editor manages the storage and retrieval of the entries, so the lexicographer calls up
an entry based on the value of the headword. All senses and homographs of the same headword go into the
same entry file. Hence, an entry may contain information that is transformed into several dictionary entries,
depending on the dictionary style guidelines.
The dictionary-entry editor maintains a history of past versions of entries with RCS, a Unix version control
program. It is possible to retrieve any version of an entry, although there is not currently a convenient user
interface to this facility. The existence of the previous versions has proved invaluable to the project a number
of times, both for studying the evolution of an entry and for recovering from both human and program errors.
We hoped to use the sense uids to enable entries to contain reliable cross-references to one another. We
entered the uids and the cross-reference representation of a sense (e.g. the uid 774662 and the representation
“bear 1.1a”) into a database and encoded the cross-reference as the uid of the sense. However, this mechanism
proved clumsy to use, since it required that the entry being referred to not only exist but be loaded into an
entry-editor window in order to establish a cross-reference. So uids have not been used for cross-reference
in the pilot project.

4 What Have We Learned So Far?

4.1 Natural Language Is Hard

It’s probably not an exaggeration to say that every seemingly reasonable assumption we made about natural
language turned out to be inadequate. Even Houghton Mifflin, which had a lot more experience than we did
with real-world language, didn’t foresee some of the situations we encountered. What’s a reasonable limit
for sentence size? 256 words seemed bounteous—until we started processing contracts.
Party of the first part undertakes: not to incur any
liability on behalf of party of the second part or in any
way pledge or purport to pledge party of the second part’s
credit or purport to make any contract binding upon party
of the second part; to involve party of the second part in
any important contract negotiations including but not
restricted to international sales contracts reaching beyond
the Agreed Territories and sales contracts ...
Etc. 256 words come and go and the sentence continues, unstinted. How many procedures should it take to
calculate noun plurals? A dozen? 25? So far we have 72. What’s a plausible specification for a word? It’s
hard to come up with a specification that’s going to stand up to words like “Aah!s” or “county(ies).”
In Adam, the core of the algorithm for identifying wordclasses—the part that deals with the search-space
of probabilities—consists of about 50 lines of code. But it is surrounded by over 4,000 lines of code to
handle real-text problems.
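
For readers who want a picture of what such a probabilistic core looks like, here is a textbook-style sketch in the spirit of Church's tagger [1]: a bigram Viterbi search over lexical and transition probabilities. The tables are toy placeholders; Adam's real tables, its smoothing, and the surrounding 4,000 lines of real-text handling are not represented:

import math

LEX   = {("the", "det"): 0.9, ("twist", "noun"): 0.4, ("twist", "verb"): 0.6}
TRANS = {("START", "det"): 0.5, ("START", "noun"): 0.3, ("START", "verb"): 0.2,
         ("det", "noun"): 0.7, ("det", "verb"): 0.1}
TAGS  = ["det", "noun", "verb"]

def p(table, key, floor=1e-6):
    return table.get(key, floor)          # crude stand-in for real smoothing

def tag(words):
    """Viterbi search for the wordclass sequence with the highest probability."""
    best = {"START": (0.0, [])}           # tag -> (log prob of best path, the path)
    for word in words:
        step = {}
        for t in TAGS:
            step[t] = max(
                (score + math.log(p(TRANS, (prev, t))) + math.log(p(LEX, (word, t))),
                 path + [t])
                for prev, (score, path) in best.items())
        best = step
    return max(best.values())[1]

print(tag(["the", "twist"]))              # -> ['det', 'noun'] with these toy tables
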
We learned that we had to be prepared to deal with sentences like this:
They come in purple, ref JA8698, age 3-4, #6.99; turquoise, age
7-8, ref JA8701, #7.99.
That’s prose—not an entry in a table or chart—but it’s in an advertisement, which has its own set of
conventions for language use.
Even ordinary prose can temporarily switch gears. Street addresses in the middle of ordinary sentences
use proper names and punctuation in a way that other contexts do not. Every subject area has its own
vocabulary, of course, but it may have its own syntactic rules as well:
Tony Simmons set a world age 16 best at Brache in 1965, with
30:16 for six miles.
Sometimes the complexity of natural language fuddled us. Take for instance the problem of contractions.
The usual way of assigning wordclasses to contractions is to assign them to the separate components. “I’m,”
for instance, is pronoun + aux or pronoun + be.
Early in our work with Adam, we noticed that while the LOB corpus contains all the components for
contractions, it does not contain examples of all the possible combinations, even for the smallest closed
sets such as personal pronouns, undoubtedly because the LOB texts were more formal than the bulk of ours
and contained fewer examples of dialogue. The Hector corpus contains many examples of contractions
from the larger closed sets (“there’ll,” “what’ll,” “who’ll”), and a small number of completely open-ended
contractions:
My granddaughter’ll be here in a few minutes ...
If it gets any colder, this stuff’ll turn to ice.
If we send the boy, Nick’ll feel responsible for him.
The corpus also contains an example of a double contraction:
You shouldn’t’ve done that.
For some reason we went from this observation to the conclusion that we should fly in the face of common
wisdom and expand the wordclass list to cover contractions and other special cases: “I’m” for us is not
pronoun + aux or pronoun + be but rather the special wordclass pronounAux or pronounBe. In retrospect,
that was not a wise decision, since the total number of wordclasses in Adam now reaches 313. We added
complexity without reaping any reward for it.
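
The two designs can be contrasted in a few lines of Python; the token handling below is deliberately naive and the class names on the split side are illustrative:

# The common approach: split a contraction and tag its components separately,
# so the tagset stays small.
def split_wordclasses(token):
    if token.endswith("n't've"):
        return [(token[:-6], "aux"), ("n't", "neg"), ("'ve", "aux")]
    if token.endswith("'ll"):
        return [(token[:-3], "?"), ("'ll", "aux")]     # "granddaughter'll", "Nick'll", ...
    if token.endswith("'m"):
        return [(token[:-2], "pronoun"), ("'m", "be")]
    return [(token, "?")]

# The choice described above: one combined wordclass per contraction pattern,
# which is what pushed Adam's tagset to 313 classes.
COMBINED = {"I'm": "pronounBe", "I'll": "pronounAux", "who'll": "pronounAux"}

print(split_wordclasses("shouldn't've"))   # [('should', 'aux'), ("n't", 'neg'), ("'ve", 'aux')]
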
We don’t have any new insights into dealing with the difficulty of natural language, but like others who
have gone before us we treasure certain gems we encountered in our work:
Late gonzoid Detroit/NYC journo Lester Bangs has his memory
enshrined with ‘‘One Horse Down’’ and its dub partner --
3D colour tracks, crisp production.

4.2 User-Interface Design Is Hard

One of our problems in user-interface design was that the lexicographers wanted to keep everything as
fluid and malleable as possible, while we wanted to make everything crisp. This tension occurred at every
level, from the composition of the corpus to the contents of the register and subject fields in an entry. We
had endless discussions of the kind where we asked, “Is this A or B?” and the lexicographers answered,
“Sometimes it’s A, sometimes it’s B, usually it’s a combination, I prefer to call it C –” at which point another
lexicographer would break in to say, “No, it’s not C, it’s a combination of B and D.” Some of the difficulty
was just a matter of learning to work together. Some of it was that the lexicographers didn’t understand the
power that computers offer those who are willing to make simplifications. Some of it was that they didn’t
think the power was worth the simplifications. We would imagine that any project that tries to build tools
for sophisticated workers in the intellectual trades will encounter such difficulties.
A more serious problem was that the building blocks for our user interfaces and our own facility in putting
them together simply aren’t up to the sort of complexity the lexicographers handle daily. The wordclasses
are a good case in point. Internally, Adam and the Houghton Mifflin parser use very specific wordclasses,
e.g. “capitalized plural common noun with word-initial capital.” The Houghton Mifflin parser has 171 such
wordclasses; Adam has 313. We made a few unsuccessful attempts to design a user interface that would
make all of these tags available to the lexicographers. There is some structure to the two tagsets, and they
are related—they’re both derivatives of the Brown corpus set. There’s no question that the lexicographers
can handle 171 wordclasses, 313 wordclasses, any number of wordclasses we want to throw at them. But we
couldn’t figure out how to present the choices on-line in a way that fit on the screen, didn’t degrade the rest
of the system, and matched the fluid and changing models of wordclass hierarchy that the lexicographers
have in their heads. We probably should have recruited an expert in user-interface design to help us with the
project.

4.3 Speed Is Functionality

Speed is functionality; when tools are slow, the lexicographers don’t use them. Performance always makes
a difference, but we couldn’t tell where it was really important until we understood the tasks that the
lexicographers do.
We knew that a major challenge in Hector would be manipulating large amounts of data quickly. Our
initial efforts were focussed on searching the corpus quickly. We had no performance problems in searching
itself, but ran into problems in several other areas, notably in user-interface functionality.
For instance, switching between the full view and the skeletal view of an entry in the dictionary-entry
editor is slow, so the lexicographers limit themselves to one view, even if it is inappropriate for the task at
hand.
Similarly, displaying a concordance was quite slow in early versions of the corpus viewer. As a result,
the lexicographers would make one broad search for a word and then work with the resulting concordance
as if it were static. They wouldn’t test hypotheses about collocates and senses, because it would take too
much time and they would lose the results of their first search.
Poor performance is especially damaging when it interrupts what feels like a single action. When a
lexicographer reaches a decision after hard thought, she wants to be able to act on it without losing the
thought. For example, when a lexicographer determines that she needs a new sense to tag a corpus line, she
wants to be able to create the sense and tag the line without losing context or having to relocate the line.
Delays in the communication between the dictionary-entry editor and the corpus viewer about active senses
make it irritatingly slow to tag that first line after creating the sense. Or again, when the dictionary-entry
editor needs to expand the size of a field because the contents have overflowed, it takes several seconds
before the lexicographer can safely resume typing.
We have improved the functionality of sense-tagging in the corpus viewer to make it faster to assign
sense-tags once the sense divisions have been determined. Shortcuts and batch sense-tagging render the
actual tagging of corpus words much faster once the hard work of sense division has been done.

4.4 Data Integrity

The lexicographers in the Hector project create two important sets of data: dictionary entries and corpus
sense-tag assignments. Because the human effort and expertise involved is hard to come by, this data is one
of the most valuable results of the project. Hence, it is particularly demoralizing when it is lost or has to be
regenerated.
We made some effort in designing our tools to protect ourselves against data loss. We keep all the
versions of an entry file, and we log all changes to the sense-tag database. But total data integrity was not
one of our original design goals.
Data was lost primarily because the dictionary-entry editor or the corpus viewer crashed, losing sense-tags or entry edits that had not been committed. Naively, we had believed we could build programs that
would not crash. Modula-3 provides good type checking and exception handling, and we tried hard to make
the programs robust.
We also lost data due to program errors. For instance, errors in the code for storing sense-tags caused
assignments to be lost.
Since we relied on our design efforts to build robust programs, we didn’t build mechanisms to recover
work in progress when the programs inevitably crashed. In retrospect, this was a bad decision.

4.5 And in a Cheerier Vein ...

It wasn’t our aim in the Hector project to see what a group of amateurs would do with a set of tools for corpus
lexicography, but our lab includes a good many tinkerers and casual philologists, so we weren’t surprised
to find them playing with the corpus viewer. Several people use it regularly to check their intuitions about
words and idioms and to supplement information they find in a dictionary. For example, one lab member
recently questioned the apparently inconsistent use of the words “gantlet” and “gauntlet” in the New York
Times. He searched for information in a dictionary but also consulted the corpus viewer.
While the corpus has some usefulness for decoding information, people go to it more often for encoding
information. One colleague invoked the corpus for evidence on whether “noir” is now an ordinary English
word or whether it should still be italicized. Another recently requested help in deciding which of three
possible phrases he should use in a particular mathematical context, and again, corpus evidence was cited in
the ensuing discussion.
Publishers take note. A matching dictionary and corpus set, bound in the electronic equivalent of
morocco, might be the Christmas gift for 1996.

Acknowledgements
Thanks to our friends at Houghton Mifflin: Win Carus, Kathy Good, and Jeff Hopkins.
To Ken Church, for his algorithm and his encouragement.
To the lexicographers from Oxford: Sue Atkins, Katherine Barber, Peter Gilliver, Patrick Hanks, Helen
Liebeck, Rosamund Moon, and Bill Trumble.
To our colleagues on the computing side at Oxford: Jeremy Clear, James Howes, Chris Rust.
To our colleagues at SRC: Ken Beckman, Judith Blount, Marc H. Brown, Mike Burrows, John DeTreville,
Steve Glassman, Jim Horning, Catherine Kaercher, Bill Kalsow, Eric Muller, Lyle Ramshaw, Richard
Schedler, Julie Swanson, and Ted Wobber.
To our extraordinary managers: Tim Benbow and Bob Taylor.
To our children: Margaret Brown, Naomi Glassman, Elizabeth Reid, and Vanessa Reid for setting aside
their selfish personal needs for the greater good. We hope that Ethan Glassman will follow in their footsteps.

References
[1] Ken Church. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings
of the Second Conference on Applied Natural Language Processing, Austin, Texas, 1988.
[2] Patrick Hanks. Evidence and intuition in lexicography. In Jerzy Tomaszczyk and Barbara
Lewandowska-Tomaszczyk, editors, Meaning and Lexicography. John Benjamins Publishing Company, Amsterdam/Philadelphia, 1990.
[3] Stig Johansson and Knut Hofland. Frequency Analysis of English Vocabulary and Grammar. Oxford
University Press, 1989.
[4] Greg Nelson, editor. Systems Programming with Modula-3. Prentice Hall, 1991.
[5] Robert W. Scheifler, James Gettys, and Ron Newman. X Window System: C Library and Protocol
Reference. Digital Press, 1988.
[6] J. M. Sinclair, editor. Looking Up: An account of the COBUILD Project in lexical computing. Collins,
1987.

A Appendix: SGML Constructions
We call SGML bracketing constructions “tags” and write them with angle brackets:
<..>
We call SGML typographical constructions “sorts” and write them with an opening ampersand and closing
dot:
&alpha.
We call more complicated constructions “sundries” and write them with curlies:
{typo bad="asgood",good="as good"}
Here’s the list:
[Table of the SGML tags, sorts, and sundries with their descriptions: author lines, captions, datelines,
footnotes, headlines, salutations, signatures, bylines, transcribed speech, typographical sorts such as
&alpha., &circ., &dash., and &ellip., and sundries such as {inaudible} and {typo bad="string",good="string"}.]

B Appendix: The Target Words

[Alphabetical list of the target words, from “absolute” to “zone.”]

C Appendix: The Wave-2 Words

[Alphabetical list of the wave-2 words, from “abandon” to “workshop.”]
