James Pustejovsky and Amber Stubbs
Natural Language Annotation for Machine Learning
ISBN: 978-1-449-30666-3
Natural Language Annotation for Machine Learning
by James Pustejovsky and Amber Stubbs
Copyright © 2013 James Pustejovsky and Amber Stubbs. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/
institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editors: Julie Steele and Meghan Blanchette
Production Editor: Kristen Borg
Copyeditor: Audrey Doyle
Proofreader: Linley Dolby
Indexer: WordCo Indexing Services
Cover Designer: Randy Comer
Interior Designer: David Futato
Illustrator: Rebecca Demarest
October 2012: First Edition
Revision History for the First Edition:
2012-10-10 First release
See http://oreilly.com/catalog/errata.csp?isbn=9781449306663 for release details.
Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly
Media, Inc. Natural Language Annotation for Machine Learning, the image of a cockatiel, and related trade
dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.
While every precaution has been taken in the preparation of this book, the publisher and authors assume
no responsibility for errors or omissions, or for damages resulting from the use of the information contained
herein.
Table of Contents
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1. The Basics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
The Importance of Language Annotation 1
The Layers of Linguistic Description 3
What Is Natural Language Processing? 4
A Brief History of Corpus Linguistics 5
What Is a Corpus? 8
Early Use of Corpora 10
Corpora Today 13
Kinds of Annotation 14
Language Data and Machine Learning 21
Classification 22
Clustering 22
Structured Pattern Induction 22
The Annotation Development Cycle 23
Model the Phenomenon 24
Annotate with the Specification 27
Train and Test the Algorithms over the Corpus 29
Evaluate the Results 30
Revise the Model and Algorithms 31
Summary 31
2. Defining Your Goal and Dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Defining Your Goal 33
The Statement of Purpose 34
Refining Your Goal: Informativity Versus Correctness 35
Background Research 41
Language Resources 41
Organizations and Conferences 42
NLP Challenges 43
Assembling Your Dataset 43
The Ideal Corpus: Representative and Balanced 45
Collecting Data from the Internet 46
Eliciting Data from People 46
The Size of Your Corpus 48
Existing Corpora 48
Distributions Within Corpora 49
Summary 51
3. Corpus Analytics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Basic Probability for Corpus Analytics 54
Joint Probability Distributions 55
Bayes Rule 57
Counting Occurrences 58
Zipf’s Law 61
N-grams 61
Language Models 63
Summary 65
4. Building Your Model and Specification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Some Example Models and Specs 68
Film Genre Classification 70
Adding Named Entities 71
Semantic Roles 72
Adopting (or Not Adopting) Existing Models 75
Creating Your Own Model and Specification: Generality Versus Specificity 76
Using Existing Models and Specifications 78
Using Models Without Specifications 79
Different Kinds of Standards 80
ISO Standards 80
Community-Driven Standards 83
Other Standards Affecting Annotation 83
Summary 84
5. Applying and Adopting Annotation Standards. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Metadata Annotation: Document Classification 88
Unique Labels: Movie Reviews 88
Multiple Labels: Film Genres 90
Text Extent Annotation: Named Entities 94
Inline Annotation 94
Stand-off Annotation by Tokens 96
Stand-off Annotation by Character Location 99
Linked Extent Annotation: Semantic Roles 101
ISO Standards and You 102
Summary 103
6. Annotation and Adjudication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
The Infrastructure of an Annotation Project 105
Specification Versus Guidelines 108
Be Prepared to Revise 109
Preparing Your Data for Annotation 110
Metadata 110
Preprocessed Data 110
Splitting Up the Files for Annotation 111
Writing the Annotation Guidelines 112
Example 1: Single Labels—Movie Reviews 113
Example 2: Multiple Labels—Film Genres 115
Example 3: Extent Annotations—Named Entities 119
Example 4: Link Tags—Semantic Roles 120
Annotators 122
Choosing an Annotation Environment 124
Evaluating the Annotations 126
Cohen’s Kappa (κ) 127
Fleiss’s Kappa (κ) 128
Interpreting Kappa Coefficients 131
Calculating κ in Other Contexts 132
Creating the Gold Standard (Adjudication) 134
Summary 135
7. Training: Machine Learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
What Is Learning? 140
Defining Our Learning Task 142
Classifier Algorithms 144
Decision Tree Learning 145
Gender Identification 147
Naïve Bayes Learning 151
Maximum Entropy Classifiers 157
Other Classifiers to Know About 158
Sequence Induction Algorithms 160
Clustering and Unsupervised Learning 162
Semi-Supervised Learning 163
Matching Annotation to Algorithms 165
Summary 166
8. Testing and Evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Testing Your Algorithm 170
Evaluating Your Algorithm 170
Confusion Matrices 171
Calculating Evaluation Scores 172
Interpreting Evaluation Scores 177
Problems That Can Affect Evaluation 178
Dataset Is Too Small 178
Algorithm Fits the Development Data Too Well 180
Too Much Information in the Annotation 181
Final Testing Scores 181
Summary 182
9. Revising and Reporting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Revising Your Project 186
Corpus Distributions and Content 186
Model and Specification 187
Annotation 188
Training and Testing 189
Reporting About Your Work 189
About Your Corpus 191
About Your Model and Specifications 192
About Your Annotation Task and Annotators 192
About Your ML Algorithm 193
About Your Revisions 194
Summary 194
10. Annotation: TimeML. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
The Goal of TimeML 198
Related Research 199
Building the Corpus 201
Model: Preliminary Specifications 201
Times 202
Signals 202
Events 203
Links 203
Annotation: First Attempts 204
Model: The TimeML Specification Used in TimeBank 204
Time Expressions 204
Events 205
Signals 206
Links 207
Confidence 208
Annotation: The Creation of TimeBank 209
TimeML Becomes ISO-TimeML 211
Modeling the Future: Directions for TimeML 213
Narrative Containers 213
Expanding TimeML to Other Domains 215
Event Structures 216
Summary 217
11. Automatic Annotation: Generating TimeML. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
The TARSQI Components 220
GUTime: Temporal Marker Identification 221
EVITA: Event Recognition and Classification 222
GUTenLINK 223
Slinket 224
SputLink 225
Machine Learning in the TARSQI Components 226
Improvements to the TTK 226
Structural Changes 227
Improvements to Temporal Entity Recognition: BTime 227
Temporal Relation Identification 228
Temporal Relation Validation 229
Temporal Relation Visualization 229
TimeML Challenges: TempEval-2 230
TempEval-2: System Summaries 231
Overview of Results 234
Future of the TTK 234
New Input Formats 234
Narrative Containers/Narrative Times 235
Medical Documents 236
Cross-Document Analysis 237
Summary 238
12. Afterword: The Future of Annotation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
Crowdsourcing Annotation 239
Amazon’s Mechanical Turk 240
Games with a Purpose (GWAP) 241
User-Generated Content 242
Handling Big Data 243
Boosting 243
Active Learning 244
Semi-Supervised Learning 245
NLP Online and in the Cloud 246
Distributed Computing 246
Shared Language Resources 247
Shared Language Applications 247
And Finally... 248
A. List of Available Corpora and Specifications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
B. List of Software Resources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
C. MAE User Guide. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
D. MAI User Guide. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
E. Bibliography. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
Preface
This book is intended as a resource for people who are interested in using computers to help process natural language. A natural language refers to any language spoken by humans, either currently (e.g., English, Chinese, Spanish) or in the past (e.g., Latin, ancient Greek, Sanskrit). Annotation refers to the process of adding metadata information to the text in order to augment a computer’s capability to perform Natural Language Processing (NLP). In particular, we examine how information can be added to natural language text through annotation in order to increase the performance of machine learning algorithms—computer programs designed to extrapolate rules from the information provided over texts in order to apply those rules to unannotated texts later on.
Natural Language Annotation for Machine Learning
This book details the multistage process for building your own annotated natural language dataset (known as a corpus) in order to train machine learning (ML) algorithms for language-based data and knowledge discovery. The overall goal of this book is to show readers how to create their own corpus, starting with selecting an annotation task, creating the annotation specification, designing the guidelines, creating a “gold standard” corpus, and then beginning the actual data creation with the annotation process.
Because the annotation process is not linear, multiple iterations can be required for
defining the tasks, annotations, and evaluations, in order to achieve the best results for
a particular goal. The process can be summed up in terms of the MATTER Annotation
Development Process: Model, Annotate, Train, Test, Evaluate, Revise. This book guides
the reader through the cycle, and provides detailed examples and discussion for different
types of annotation tasks throughout. These tasks are examined in depth to provide
context for readers and to help provide a foundation for their own ML goals.
Additionally, this book provides access to and usage guidelines for lightweight, user-friendly software that can be used for annotating texts and adjudicating the annotations. While a variety of annotation tools are available to the community, the Multipurpose Annotation Environment (MAE) adopted in this book (and available to readers as a free download) was specifically designed to be easy to set up and get running, so that confusing documentation would not distract readers from their goals. MAE is paired with the Multidocument Adjudication Interface (MAI), a tool that allows for quick comparison of annotated documents.
Audience
This book is written for anyone interested in using computers to explore aspects of the information content conveyed by natural language. It is not necessary to have a programming or linguistics background to use this book, although a basic understanding of a scripting language such as Python can make the MATTER cycle easier to follow, and some sample Python code is provided in the book. If you don’t have any Python experience, we highly recommend Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper (O’Reilly), which provides an excellent introduction both to Python and to aspects of NLP that are not addressed in this book.
It is helpful to have a basic understanding of markup languages such as XML (or even HTML) in order to get the most out of this book. While one doesn’t need to be an expert in the theory behind an XML schema, most annotation projects use some form of XML to encode the tags, and therefore we use that standard in this book when providing annotation examples. Although you don’t need to be a web designer to understand the book, it does help to have a working knowledge of tags and attributes in order to understand how an idea for an annotation gets implemented.
Organization of This Book
Chapter 1 of this book provides a brief overview of the history of annotation and machine learning, as well as short discussions of some of the different ways that annotation tasks have been used to investigate different layers of linguistic research. The rest of the book guides the reader through the MATTER cycle, from tips on creating a reasonable annotation goal in Chapter 2, all the way through evaluating the results of the annotation and ML stages, as well as a discussion of revising your project and reporting on your work in Chapter 9. The last two chapters give a complete walkthrough of a single annotation project and how it was recreated with machine learning and rule-based algorithms. Appendixes at the back of the book provide lists of resources that readers will find useful for their own annotation tasks.
Software Requirements
While it’s possible to work through this book without running any of the code examples provided, we do recommend having at least the Natural Language Toolkit (NLTK) installed for easy reference to some of the ML techniques discussed. The NLTK currently runs on Python versions from 2.4 to 2.7. (Python 3.0 is not supported at the time of this writing.) For more information, see http://www.nltk.org.
The code examples in this book are written as though they are in the interactive Python
shell programming environment. For information on how to use this environment,
please see: http://docs.python.org/tutorial/interpreter.html. If not specifically stated in
the examples, it should be assumed that the command import nltk was used prior to
all sample code.
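For example, a quick check in the interactive shell that the toolkit is installed and importable might look like the following (a minimal sketch; the exact output will depend on your installation):

>>> import nltk
>>> nltk.__version__          # confirm the toolkit is importable; the version string will vary
>>> nltk.download()           # optional: opens the NLTK data downloader for corpora used in later examples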
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements
such as variable or function names, databases, data types, environment variables,
statements, and keywords.
This icon signifies a tip, suggestion, or general note.
This icon indicates a warning or caution.
Using Code Examples
This book is here to help you get your job done. In general, you may use the code in this
book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a
program that uses several chunks of code from this book does not require permission.
Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Natural Language Annotation for Machine Learning by James Pustejovsky and Amber Stubbs (O’Reilly). Copyright 2013 James Pustejovsky and Amber Stubbs, 978-1-449-30666-3.”
If you feel your use of code examples falls outside fair use or the permission given above,
feel free to contact us at permissions@oreilly.com.
Safari® Books Online
Safari Books Online (www.safaribooksonline.com) is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.
Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit us online.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://oreil.ly/nat-lang-annotation-ML.
To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website
at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
We would like to thank everyone at O’Reilly who helped us create this book, in particular Meghan Blanchette, Julie Steele, Sarah Schneider, Kristen Borg, Audrey Doyle, and everyone else who helped to guide us through the process of producing it. We would also
like to thank the students who participated in the Brandeis COSI 216 class during the
spring 2011 semester for bearing with us as we worked through the MATTER cycle with
them: Karina Baeza Grossmann-Siegert, Elizabeth Baran, Bensiin Borukhov, Nicholas
Botchan, Richard Brutti, Olga Cherenina, Russell Entrikin, Livnat Herzig, Sophie Kushkuley, Theodore Margolis, Alexandra Nunes, Lin Pan, Batia Snir, John Vogel, and Yaqin
Yang.
We would also like to thank our technical reviewers, who provided us with such excellent
feedback: Arvind S. Gautam, Catherine Havasi, Anna Rumshisky, and Ben Wellner, as
well as everyone who read the Early Release version of the book and let us know that
we were going in the right direction.
We would like to thank members of the ISO community with whom we have discussed
portions of the material in this book: Kiyong Lee, Harry Bunt, Nancy Ide, Nicoletta
Calzolari, Bran Boguraev, Annie Zaenen, and Laurent Romary.
Additional thanks to the members of the Brandeis Computer Science and Linguistics
departments, who listened to us brainstorm, kept us encouraged, and made sure everything kept running while we were writing, especially Marc Verhagen, Lotus Goldberg, Jessica Moszkowicz, and Alex Plotnick.
This book could not exist without everyone in the linguistics and computational linguistics communities who have created corpora and annotations, and, more importantly, shared their experiences with the rest of the research community.
James Adds:
I would like to thank my wife, Cathie, for her patience and support during this project.
I would also like to thank my children, Zac and Sophie, for putting up with me while
the book was being finished. And thanks, Amber, for taking on this crazy effort with
me.
Amber Adds:
I would like to thank my husband, BJ, for encouraging me to undertake this project and
for his patience while I worked through it. Thanks also to my family, especially my
parents, for their enthusiasm toward this book. And, of course, thanks to my advisor
and coauthor, James, for having this crazy idea in the first place.
CHAPTER 1
The Basics
It seems as though every day there are new and exciting problems that people have
taught computers to solve, from how to win at chess or Jeopardy to determining shortest-
path driving directions. But there are still many tasks that computers cannot perform,
particularly in the realm of understanding human language. Statistical methods have
proven to be an effective way to approach these problems, but machine learning (ML)
techniques often work better when the algorithms are provided with pointers to what
is relevant about a dataset, rather than just massive amounts of data. When discussing
natural language, these pointers often come in the form of annotations—metadata that
provides additional information about the text. However, in order to teach a computer
effectively, it’s important to give it the right data, and for it to have enough data to learn
from. The purpose of this book is to provide you with the tools to create good data for
your own ML task. In this chapter we will cover:
• Why annotation is an important tool for linguists and computer scientists alike
• How corpus linguistics became the field that it is today
• The different areas of linguistics and how they relate to annotation and ML tasks
• What a corpus is, and what makes a corpus balanced
• How some classic ML problems are represented with annotations
• The basics of the annotation development cycle
The Importance of Language Annotation
Everyone knows that the Internet is an amazing resource for all sorts of information
that can teach you just about anything: juggling, programming, playing an instrument,
and so on. However, there is another layer of information that the Internet contains,
and that is how all those lessons (and blogs, forums, tweets, etc.) are being communicated. The Web contains information in all forms of media—including texts, images, movies, and sounds—and language is the communication medium that allows people to understand the content, and to link the content to other media. However, while computers are excellent at delivering this information to interested users, they are much less
adept at understanding language itself.
Theoretical and computational linguistics are focused on unraveling the deeper nature
of language and capturing the computational properties of linguistic structures. Human
language technologies (HLTs) attempt to adopt these insights and algorithms and turn
them into functioning, high-performance programs that can impact the ways we interact with computers using language. With more and more people using the Internet every day, the amount of linguistic data available to researchers has increased significantly, allowing linguistic modeling problems to be viewed as ML tasks, rather than
limited to the relatively small amounts of data that humans are able to process on their
own.
However, it is not enough to simply provide a computer with a large amount of data and
expect it to learn to speak—the data has to be prepared in such a way that the computer
can more easily find patterns and inferences. This is usually done by adding relevant
metadata to a dataset. Any metadata tag used to mark up elements of the dataset is called
an annotation over the input. However, in order for the algorithms to learn efficiently
and effectively, the annotation done on the data must be accurate, and relevant to the
task the machine is being asked to perform. For this reason, the discipline of language
annotation is a critical link in developing intelligent human language technologies.
Giving an ML algorithm too much information can slow it down and lead to inaccurate results, or result in the algorithm being so molded to the training data that it becomes “overfit” and provides less accurate results than it might otherwise on new data. It’s important to think carefully about what you are trying to accomplish, and what information is most relevant to that goal. Later in the book we will give examples of how to find that information, and how to determine how well your algorithm is performing at the task you’ve set for it.
Datasets of natural language are referred to as corpora, and a single set of data annotated
with the same specification is called an annotated corpus. Annotated corpora can be
used to train ML algorithms. In this chapter we will define what a corpus is, explain
what is meant by an annotation, and describe the methodology used for enriching a
linguistic data collection with annotations for machine learning.
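To make the idea of an annotation more concrete, here is a small invented example of inline XML-style markup over a single sentence; the tag and attribute names are purely illustrative (Chapter 5 discusses how real inline and stand-off annotations are actually encoded):

<sentence id="s1">
  <ne type="person">Henry Kucera</ne> and <ne type="person">W. Nelson Francis</ne>
  compiled the Brown Corpus at <ne type="organization">Brown University</ne>
  in <ne type="date">1961</ne>.
</sentence>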
The Layers of Linguistic Description
While it is not necessary to have formal linguistic training in order to create an annotated
corpus, we will be drawing on examples of many different types of annotation tasks, and
you will find this book more helpful if you have a basic understanding of the different
aspects of language that are studied and used for annotations. Grammar is the name
typically given to the mechanisms responsible for creating well-formed structures in
language. Most linguists view grammar as itself consisting of distinct modules or systems, either by cognitive design or for descriptive convenience. These areas usually
include syntax, semantics, morphology, phonology (and phonetics), and the lexicon.
Areas beyond grammar that relate to how language is embedded in human activity
include discourse, pragmatics, and text theory. The following list provides more detailed
descriptions of these areas:
Syntax
The study of how words are combined to form sentences. This includes examining
parts of speech and how they combine to make larger constructions.
Semantics
The study of meaning in language. Semantics examines the relations between words
and what they are being used to represent.
Morphology
The study of units of meaning in a language. A morpheme is the smallest unit of
language that has meaning or function, a definition that includes words, prefixes,
affixes, and other word structures that impart meaning.
Phonology
The study of the sound patterns of a particular language. Aspects of study include
determining which phones are significant and have meaning (i.e., the phonemes);
how syllables are structured and combined; and what features are needed to describe
the discrete units (segments) in the language, and how they are interpreted.
Phonetics
The study of the sounds of human speech, and how they are made and perceived.
A phone is the term for an individual speech sound, and is essentially the smallest unit of human speech.
Lexicon
The study of the words and phrases used in a language, that is, a language’s vocabulary.
Discourse analysis
The study of exchanges of information, usually in the form of conversations, and
particularly the flow of information across sentence boundaries.
Pragmatics
The study of how the context of text affects the meaning of an expression, and what
information is necessary to infer a hidden or presupposed meaning.
Text structure analysis
The study of how narratives and other textual styles are constructed to make larger
textual compositions.
Throughout this book we will present examples of annotation projects that make use of
various combinations of the different concepts outlined in the preceding list.
What Is Natural Language Processing?
Natural Language Processing (NLP) is a field of computer science and engineering that
has developed from the study of language and computational linguistics within the field
of Artificial Intelligence. The goals of NLP are to design and build applications that
facilitate human interaction with machines and other devices through the use of natural
language. Some of the major areas of NLP include:
Question Answering Systems (QAS)
Imagine being able to actually ask your computer or your phone what time your
favorite restaurant in New York stops serving dinner on Friday nights. Rather than
typing in the (still) clumsy set of keywords into a search browser window, you could
simply ask in plain, natural language—your own, whether it’s English, Mandarin, or Spanish. (While systems such as Siri for the iPhone are a good start to this process, it’s clear that Siri doesn’t fully understand all of natural language, just a subset of key phrases.)
Summarization
This area includes applications that can take a collection of documents or emails
and produce a coherent summary of their content. Such programs also aim to provide snap “elevator summaries” of longer documents, and possibly even turn them
into slide presentations.
Machine Translation
The holy grail of NLP applications, this was the first major area of research and
engineering in the field. Programs such as Google Translate are getting better and
better, but the real killer app will be the BabelFish that translates in real time when
you’re looking for the right train to catch in Beijing.
Speech Recognition
This is one of the most difficult problems in NLP. There has been great progress in
building models that can be used on your phone or computer to recognize spoken
language utterances that are questions and commands. Unfortunately, while these
Automatic Speech Recognition (ASR) systems are ubiquitous, they work best in
narrowly defined domains and don’t allow the speaker to stray from the expected scripted input (“Please say or type your card number now”).
Document classification
This is one of the most successful areas of NLP, wherein the task is to identify in
which category (or bin) a document should be placed. This has proved to be enormously useful for applications such as spam filtering, news article classification,
and movie reviews, among others. One reason this has had such a big impact is the
relative simplicity of the learning models needed for training the algorithms that
do the classification.
As we mentioned in the Preface, the Natural Language Toolkit (NLTK), described in
the O’Reilly book Natural Language Processing with Python, is a wonderful introduction
to the techniques necessary to build many of the applications described in the preceding
list. One of the goals of this book is to give you the knowledge to build specialized
language corpora (i.e., training and test datasets) that are necessary for developing such
applications.
A Brief History of Corpus Linguistics
In the mid-20th century, linguistics was practiced primarily as a descriptive field, used
to study structural properties within a language and typological variations between
languages. This work resulted in fairly sophisticated models of the different informational components comprising linguistic utterances. As in the other social sciences, the collection and analysis of data was also being subjected to quantitative techniques from statistics. In the 1940s, linguists such as Bloomfield were starting to think that language could be explained in probabilistic and behaviorist terms. Empirical and statistical methods became popular in the 1950s, and Shannon’s information-theoretic view of language analysis appeared to provide a solid quantitative approach for modeling qualitative descriptions of linguistic structure.
Unfortunately, the development of statistical and quantitative methods for linguistic
analysis hit a brick wall in the 1950s. This was due primarily to two factors. First, there
was the problem of data availability. One of the problems with applying statistical methods to the language data at the time was that the datasets were generally so small that it was not possible to make interesting statistical generalizations over large numbers of linguistic phenomena. Second, and perhaps more important, there was a general shift in the social sciences from data-oriented descriptions of human behavior to introspective modeling of cognitive functions.
As part of this new attitude toward human activity, the linguist Noam Chomsky focused
on both a formal methodology and a theory of linguistics that not only ignored quantitative language data, but also claimed that it was misleading for formulating models
of language behavior (Chomsky 1957).
This view was very influential throughout the 1960s and 1970s, largely because the
formal approach was able to develop extremely sophisticated rule-based language models using mostly introspective (or self-generated) data. This was a very attractive alternative to trying to create statistical language models on the basis of still relatively small datasets of linguistic utterances from the existing corpora in the field. Formal modeling and rule-based generalizations, in fact, have always been an integral step in theory formation, and in this respect, Chomsky’s approach to doing linguistics has yielded
rich and elaborate models of language.
Timeline of Corpus Linguistics
Here’s a quick overview of some of the milestones in the field, leading up to where we are now.
1950s: Descriptive linguists compile collections of spoken and written utterances of various languages from field research. Literary researchers begin compiling systematic collections of the complete works of different authors. Key Word in Context
(KWIC) is invented as a means of indexing documents and creating concordances.
1960s: Kucera and Francis publish A Standard Corpus of Present-Day American
English (the Brown Corpus), the first broadly available large corpus of language texts.
Work in Information Retrieval (IR) develops techniques for statistical similarity of
document content.
1970s: Stochastic models developed from speech corpora make Speech Recognition
systems possible. The vector space model is developed for document indexing. The
London-Lund Corpus (LLC) is developed through the work of the Survey of English
Usage.
1980s: The Lancaster-Oslo-Bergen (LOB) Corpus, designed to match the Brown
Corpus in terms of size and genres, is compiled. The COBUILD (Collins Birmingham
University International Language Database) dictionary is published, the first based
on examining usage from a large English corpus, the Bank of English. The Survey of
English Usage Corpus inspires the creation of a comprehensive corpus-based grammar, Grammar of English. The Child Language Data Exchange System (CHILDES)
Corpus is released as a repository for first language acquisition data.
1990s: The Penn TreeBank is released. This is a corpus of tagged and parsed sentences
of naturally occurring English (4.5 million words). The British National Corpus
(BNC) is compiled and released as the largest corpus of English to date (100 million
words). The Text Encoding Initiative (TEI) is established to develop and maintain a
standard for the representation of texts in digital form.
2000s: As the World Wide Web grows, more data is available for statistical models
for Machine Translation and other applications. The American National Corpus
(ANC) project releases a 22-million-word subcorpus, and the Corpus of Contemporary American English (COCA) is released (400 million words). Google releases its Google N-gram Corpus of 1 trillion word tokens from public web pages. The corpus holds n-grams of up to five words, along with their frequencies.
2010s: International standards organizations, such as ISO, begin to recognize and co-develop text encoding formats that are being used for corpus annotation efforts. The Web continues to make enough data available to build models for a whole new range of linguistic phenomena. Entirely new forms of text corpora, such as Twitter, Facebook, and blogs, become available as a resource.
Theory construction, however, also involves testing and evaluating your hypotheses
against observed phenomena. As more linguistic data has gradually become available,
something significant has changed in the way linguists look at data. The phenomena
are now observable in millions of texts and billions of sentences over the Web, and this
has left little doubt that quantitative techniques can be meaningfully applied to both test
and create the language models correlated with the datasets. This has given rise to the
modern age of corpus linguistics. As a result, the corpus is the entry point from which
all linguistic analysis will be done in the future.
You gotta have data! As philosopher of science Thomas Kuhn said: “When measurement departs from theory, it is likely to yield mere numbers, and their very neutrality makes them particularly sterile as a source of remedial suggestions. But numbers register the departure from theory with an authority and finesse that no qualitative technique can duplicate, and that departure is often enough to start a search” (Kuhn 1961).
The assembly and collection of texts into more coherent datasets that we can call corpora
started in the 1960s.
Some of the most important corpora are listed in Table 1-1.
Table 1-1. A sampling of important corpora

Name of corpus                                    Year published   Size                Collection contents
British National Corpus (BNC)                     1991–1994        100 million words   Cross section of British English, spoken and written
American National Corpus (ANC)                    2003             22 million words    Spoken and written texts
Corpus of Contemporary American English (COCA)    2008             425 million words   Spoken, fiction, popular magazine, and academic texts
What Is a Corpus?
A corpus is a collection of machine-readable texts that have been produced in a natural
communicative setting. They have been sampled to be representative and balanced with
respect to particular factors; for example, by genre—newspaper articles, literary fiction,
spoken speech, blogs and diaries, and legal documents. A corpus is said to be “representative of a language variety” if the content of the corpus can be generalized to that variety (Leech 1991).
This is not as circular as it may sound. Basically, if the content of the corpus, defined by specifications of linguistic phenomena examined or studied, reflects that of the larger population from which it is taken, then we can say that it “represents that language variety.”
The notion of a corpus being balanced is an idea that has been around since the 1980s,
but it is still a rather fuzzy notion and difficult to define strictly. Atkins and Ostler
(1992) propose a formulation of attributes that can be used to define the types of text,
and thereby contribute to creating a balanced corpus.
Two well-known corpora can be compared for their effort to balance the content of the
texts. The Penn TreeBank (Marcus et al. 1993) is a 4.5-million-word corpus that contains
texts from four sources: the Wall Street Journal, the Brown Corpus, ATIS, and the
Switchboard Corpus. By contrast, the BNC is a 100-million-word corpus that contains
texts from a broad range of genres, domains, and media.
The most diverse subcorpus within the Penn TreeBank is the Brown Corpus, which is
a 1-million-word corpus consisting of 500 English text samples, each one approximately
2,000 words. It was collected and compiled by Henry Kucera and W. Nelson Francis of
Brown University (hence its name) from a broad range of contemporary American
English in 1961. In 1967, they released a fairly extensive statistical analysis of the word
frequencies and behavior within the corpus, the first of its kind in print, as well as the
Brown Corpus Manual (Francis and Kucera 1964).
There has never been any doubt that all linguistic analysis must be
grounded on specific datasets. What has recently emerged is the realization that all linguistics will be bound to corpus-oriented techniques, one way or the other. Corpora are becoming the standard data exchange format for discussing linguistic observations and theoretical generalizations, and certainly for evaluation of systems, both statistical and rule-based.
Table 1-2 shows how the Brown Corpus compares to other corpora that are also still
in use.
Table 1-2. Comparing the Brown Corpus to other corpora

Corpus                                          Size                                                       Use
Brown Corpus                                    500 English text samples; 1 million words                  Part-of-speech tagged data; 80 different tags used
Child Language Data Exchange System (CHILDES)   20 languages represented; thousands of texts               Phonetic transcriptions of conversations with children from around the world
Lancaster-Oslo-Bergen Corpus                    500 British English text samples, around 2,000 words each  Part-of-speech tagged data; a British version of the Brown Corpus
Looking at the way the files of the Brown Corpus can be categorized gives us an idea of
what sorts of data were used to represent the English language. The top two general data
categories are informative, with 374 samples, and imaginative, with 126 samples.
These two domains are further distinguished into the following topic areas:
Informative
Press: reportage (44), Press: editorial (27), Press: reviews (17), Religion (17), Skills and Hobbies (36), Popular Lore (48), Belles Lettres, Biography, Memoirs (75), Miscellaneous (30), Natural Sciences (12), Medicine (5), Mathematics (4), Social and Behavioral Sciences (14), Political Science, Law, Education (15), Humanities (18), Technology and Engineering (12)
Imaginative
General Fiction (29), Mystery and Detective Fiction (24), Science Fiction (6), Adventure and Western Fiction (29), Romance and Love Story (29), Humor (9)
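If you have the NLTK and its data packages installed, you can browse a version of the Brown Corpus yourself and see how the 500 files are grouped; note that the NLTK uses its own short, lowercase genre names rather than the category labels listed above:

>>> import nltk
>>> from nltk.corpus import brown
>>> len(brown.fileids())                # the 500 text samples
500
>>> brown.categories()                  # the NLTK's genre labels for those samples
['adventure', 'belles_lettres', 'editorial', 'fiction', ...]
>>> brown.words(categories='news')[:7]  # a peek at the press reportage samples
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday']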
Similarly, the BNC can be categorized into informative and imaginative prose, and
further into subdomains such as educational, public, business, and so on. A further
discussion of how the BNC can be categorized can be found in “Distributions Within Corpora” (page 49).
As you can see from the numbers given for the Brown Corpus, not every category is
equally represented, which seems to be a violation of the rule of “representative and
balanced” that we discussed before. However, these corpora were not assembled with a
specific task in mind; rather, they were meant to represent written and spoken language
as a whole. Because of this, they attempt to embody a large cross section of existing texts,
though whether they succeed in representing percentages of texts in the world is debatable (but also not terribly important).
For your own corpus, you may find yourself wanting to cover a wide variety of text, but
it is likely that you will have a more specific task domain, and so your potential corpus
will not need to include the full range of human expression. The Switchboard Corpus
is an example of a corpus that was collected for a very specific purpose—Speech Recognition for phone operation—and so was balanced and representative of the different
sexes and all different dialects in the United States.
Early Use of Corpora
One of the most common uses of corpora from the early days was the construction of
concordances. These are alphabetical listings of the words in an article or text collection
with references given to the passages in which they occur. Concordances position a word
within its context, and thereby make it much easier to study how it is used in a language,
both syntactically and semantically. In the 1950s and 1960s, programs were written to
automatically create concordances for the contents of a collection, and the results of
these automatically created indexes were called “Key Word in Context” indexes, or
KWIC indexes. A KWIC index is an index created by sorting the words in an article or
a larger collection such as a corpus, and aligning them in a format so that they can be
searched alphabetically in the index. This was a relatively efficient means for searching
a collection before full-text document search became available.
The way a KWIC index works is as follows. The input to a KWIC system is a file or
collection structured as a sequence of lines. The output is a sequence of lines, circularly
shifted and presented in alphabetical order of the first word. For an example, consider
a short article of two sentences, shown in Figure 1-1 with the KWIC index output that
is generated.
Figure 1-1. Example of a KWIC index
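The procedure is simple enough to sketch in a few lines of Python. The toy function below (not part of any toolkit discussed in this book) produces a KWIC-style listing by generating every circular shift of each input line and sorting the shifts alphabetically:

>>> def kwic_index(lines):
...     shifts = []
...     for line in lines:
...         words = line.split()
...         for i in range(len(words)):
...             # rotate the line so that word i comes first, keeping the rest in order
...             shifts.append(' '.join(words[i:] + words[:i]))
...     # present the circular shifts in alphabetical order of their first word
...     return sorted(shifts, key=str.lower)
...
>>> for entry in kwic_index(['The quick brown fox', 'jumps over the lazy dog']):
...     print entry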
Another benefit of concordancing is that, by displaying the keyword in its context, you
can visually inspect how the word is being used in a given sentence. To take a specific
example, consider the different meanings of the English verb treat. Specifically, let’s look
at the first two senses within sense (1) from the dictionary entry shown in Figure 1-2.
Figure 1-2. Senses of the word “treat”
Now let’s look at the concordances compiled for this verb from the BNC, as differentiated
by these two senses.
These concordances were compiled using the Word Sketch Engine, by
the lexicographer Patrick Hanks, and are part of a large resource of
sentence patterns using a technique called Corpus Pattern Analysis
(Pustejovsky et al. 2004; Hanks and Pustejovsky 2005).
What is striking when one examines the concordance entries for each of these senses is
the fact that the contexts are so distinct. These are presented in Figures 1-3 and 1-4.
Figure 1-3. Sense (1a) for the verb “treat”
Figure 1-4. Sense (1b) for the verb “treat”
The NLTK provides functionality for creating concordances. The easiest way
to make a concordance is to simply load the preprocessed texts into the NLTK
and then use the concordance function, like this:
>>> import nltk
>>> from nltk.book import *
>>> text6.concordance("Ni")
If you have your own set of data for which you would like to create a concordance, then the process is a little more involved: you will need to read in
your files and use the NLTK functions to process them before you can create
your own concordance. Here is some sample code for a corpus of text files
(replace the directory location with your own folder of text files):
>>> corpus_loc = '/home/me/corpus/'
>>> docs = nltk.corpus.PlaintextCorpusReader(corpus_loc,'.*\.txt')
You can see if the files were read by checking what file IDs are present:
>>> print docs.fileids()
Next, process the words in the files and then use the concordance function
to examine the data:
>>> docs_processed = nltk.Text(docs.words())
>>> docs_processed.concordance("treat")
Corpora Today
When did researchers start to actually use corpora for modeling language phenomena
and training algorithms? Beginning in the 1980s, researchers in Speech Recognition
began to compile enough spoken language data to create language models (from transcriptions using n-grams and Hidden Markov Models [HMMs]) that worked well
enough to recognize a limited vocabulary of words in a very narrow domain. In the
1990s, work in Machine Translation began to see the influence of larger and larger
datasets, and with this, the rise of statistical language modeling for translation.
Eventually, both memory and computer hardware became sophisticated enough to collect and analyze increasingly larger datasets of language fragments. This entailed being
able to create statistical language models that actually performed with some reasonable
accuracy for different natural language tasks.
As one example of the increasing availability of data, Google has recently released the
Google Ngram Corpus. The Google Ngram dataset allows users to search for single words
(unigrams) or collocations of up to five words (5-grams). The dataset is available for
download from the Linguistic Data Consortium, and directly from Google. It is also
viewable online through the Google Ngram Viewer. The Ngram dataset consists of more
than one trillion tokens (words, numbers, etc.) taken from publicly available websites
and sorted by year, making it easy to view trends in language use. In addition to English,
Google provides n-grams for Chinese, French, German, Hebrew, Russian, and Spanish,
as well as subsets of the English corpus such as American English and English Fiction.
N-grams are sets of items (often words, but they can be letters, phonemes, etc.) that are part of a sequence. By examining how often the
items occur together we can learn about their usage in a language, and
predict what would likely follow a given sequence (using n-grams for
this purpose is called n-gram modeling).
N-grams are applied in a variety of ways every day, such as in websites
that provide search suggestions once a few letters are typed in, and for
determining likely substitutions for spelling errors. They are also used
in speech disambiguation—if a person speaks unclearly but utters a
sequence that does not commonly (or ever) occur in the language being
spoken, an n-gram model can help recognize that problem and find the
words that the speaker probably intended to say.
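As a rough illustration of the idea (a toy sketch rather than a full n-gram language model), the NLTK can extract the bigrams from a text and count how often each pair of adjacent words occurs; the most frequent pairs are the ones an n-gram model would treat as the most likely continuations:

>>> import nltk
>>> words = 'the cat sat on the mat because the cat was tired'.split()
>>> pairs = list(nltk.bigrams(words))    # adjacent word pairs (2-grams)
>>> fd = nltk.FreqDist(pairs)            # count how often each bigram occurs
>>> fd[('the', 'cat')]
2
>>> fd.max()                             # the most frequent bigram in this tiny sample
('the', 'cat')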
Another modern corpus is ClueWeb09 (http://lemurproject.org/clueweb09.php/), a
dataset “created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were
collected in January and February 2009.” This corpus is too large to use for an annotation
project (it’s about 25 terabytes uncompressed), but some projects have taken parts of the
dataset (such as a subset of the English websites) and used them for research (Pomikálek
et al. 2012). Data collection from the Internet is an increasingly common way to create
corpora, as new and varied content is always being created.
Kinds of Annotation
Consider the different parts of a language’s syntax that can be annotated. These include part of speech (POS), phrase structure, and dependency structure. Table 1-3 shows examples of each of these. There are many different tagsets for the parts of speech of a
language that you can choose from.
Table 1-3. Number of POS tags in different corpora

Tagset                 Size   Date
Brown                  77     1964
LOB                    132    1980s
London-Lund Corpus     197    1982
Penn                   36     1992
The tagset in Figure 1-5 is taken from the Penn TreeBank, and is the basis for all subsequent annotation over that corpus.
Figure 1-5. The Penn TreeBank tagset
The POS tagging process involves assigning the right lexical class marker(s) to all the
words in a sentence (or corpus). This is illustrated in a simple example, “The waiter
cleared the plates from the table.” (See Figure 1-6.)
Figure 1-6. POS tagging sample
POS tagging is a critical step in many NLP applications, since it is important to know
what category a word is assigned to in order to perform subsequent analysis on it, such
as the following:
Speech Synthesis
Is the word a noun or a verb? Examples include object, overflow, insult, and suspect. Without context, each of these words could be either a noun or a verb.
Parsing
You need POS tags in order to make larger syntactic units. For example, in the
following sentences, is “clean dishes” a noun phrase or an imperative verb phrase?
Clean dishes are in the cabinet.
Clean dishes before going to work!
Machine Translation
Getting the POS tags and the subsequent parse right makes all the difference when
translating the expressions in the preceding list item into another language, such as
French: “Des assiettes propres” (Clean dishes) versus “Fais la vaisselle!” (Clean the
dishes!).
Consider how these tags are used in the following sentence, from the Penn TreeBank
(Marcus et al. 1993):
“From the beginning, it took a man with extraordinary qualities to succeed in Mexico,” says Kimihide Takimura, president of Mitsui group’s Kensetsu Engineering Inc. unit.
“/” From/IN the/DT beginning/NN ,/, it/PRP took/VBD a/DT man/NN with/IN extraordinary/JJ qualities/NNS to/TO succeed/VB in/IN Mexico/NNP ,/, “/” says/VBZ Kimihide/NNP Takimura/NNP ,/, president/NN of/IN Mitsui/NNS group/NN ’s/POS Kensetsu/NNP Engineering/NNP Inc./NNP unit/NN ./.
Identifying the correct parts of speech in a sentence is a necessary step in building many
natural language applications, such as parsers, Named Entity Recognizers, QAS, and
Machine Translation systems. It is also an important step toward identifying larger
structural units such as phrase structure.
Use the NLTK tagger to assign POS tags to the example sentence shown
here, and then with other sentences that might be more ambiguous:
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("This is a test."))
Look for places where the tagger doesn’t work, and think about what
rules might be causing these errors. For example, what happens when
you try “Clean dishes are in the cabinet.” and “Clean dishes before going
to work!”?
While words have labels associated with them (the POS tags mentioned earlier), specific
sequences of words also have labels that can be associated with them. This is called
syntactic bracketing (or labeling) and is the structure that organizes all the words we hear
into coherent phrases. As mentioned earlier, syntax is the name given to the structure
associated with a sentence. The Penn TreeBank is an annotated corpus with syntactic
bracketing explicitly marked over the text. An example annotation is shown in
Figure 1-7.
Figure 1-7. Syntactic bracketing
This is a bracketed representation of the syntactic tree structure, which is shown in
Figure 1-8.
Figure 1-8. Syntactic tree structure
Notice that syntactic bracketing introduces two relations between the words in a sentence: order (precedence) and hierarchy (dominance). For example, the tree structure
in Figure 1-8 encodes these relations by the very nature of a tree as a directed acyclic
graph (DAG). In a very compact form, the tree captures the precedence and dominance
relations given in the following list:
{Dom(NNP1,John), Dom(VPZ,loves), Dom(NNP2,Mary), Dom(NP1,NNP1),
Dom(NP2,NNP2), Dom(S,NP1), Dom(VP,VPZ), Dom(VP,NP2), Dom(S,VP),
Prec(NP1,VP), Prec(VPZ,NP2)}
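The NLTK can read this bracketed notation directly, which makes it easy to experiment with the precedence and dominance relations just listed. Here is a minimal sketch using the sentence from Figure 1-8 (recent NLTK releases provide Tree.fromstring; older 2.x releases call the same method Tree.parse):

>>> from nltk import Tree
>>> parse = Tree.fromstring('(S (NP (NNP John)) (VP (VPZ loves) (NP (NNP Mary))))')
>>> parse.leaves()          # the words, in precedence (left-to-right) order
['John', 'loves', 'Mary']
>>> parse[1]                # the VP subtree dominates the verb and the object NP
Tree('VP', [Tree('VPZ', ['loves']), Tree('NP', [Tree('NNP', ['Mary'])])])
>>> parse.draw()            # pops up a rendering of the tree shown in Figure 1-8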
Any sophisticated natural language application requires some level of syntactic analysis,
including Machine Translation. If the resources for full parsing (such as that shown
earlier) are not available, then some sort of shallow parsing can be used. This is when
partial syntactic bracketing is applied to sequences of words, without worrying about
the details of the structure inside a phrase. We will return to this idea in later chapters.
In addition to POS tagging and syntactic bracketing, it is useful to annotate texts in a
corpus for their semantic value, that is, what the words mean in the sentence. We can
distinguish two kinds of annotation for semantic content within a sentence: what something is, and what role something plays. Here is a more detailed explanation of each:
Semantic typing
A word or phrase in the sentence is labeled with a type identifier (from a reserved
vocabulary or ontology), indicating what it denotes.
Semantic role labeling
A word or phrase in the sentence is identified as playing a specific semantic role
relative to a role assigner, such as a verb.
Let’s consider what annotation using these two strategies would look like, starting with
semantic types. Types are commonly defined using an ontology, such as that shown in
Figure 1-9.
The word ontology has its roots in philosophy, but ontologies also have
a place in computational linguistics, where they are used to create categorized hierarchies that group similar concepts and objects. By assigning words semantic types in an ontology, we can create relationships
between different branches of the ontology, and determine whether
linguistic rules hold true when applied to all the words in a category.
Figure 1-9. A simple ontology
The ontology in Figure 1-9 is rather simple, with a small set of categories. However, even
this small ontology can be used to illustrate some interesting features of language.
Consider the following example, with semantic types marked:
[Ms. Ramirez]Person of [QBC Productions]Organization visited [Boston]Place on
[Saturday]Time, where she had lunch with [Mr. Harris]Person of [STU Enterprises]Organization at
[1:15 pm]Time.
From this small example, we can start to make observations about how these objects
interact with one another. People can visit places, people have “of” relationships with
organizations, and lunch can happen on Saturday at 1:15 p.m. Given a large enough
corpus of similarly labeled sentences, we can start to detect patterns in usage that will
tell us more about how these labels do and do not interact.
A corpus of these examples can also tell us where our categories might need to be
expanded. There are two “times” in this sentence: Saturday and 1:15 p.m. We can see that
events can occur “on” Saturday, but “at” 1:15 p.m. A larger corpus would show that this
pattern remains true with other days of the week and hour designations—there is a
difference in usage here that cannot be inferred from the semantic types. However, not
all ontologies will capture all information—the applications of the ontology will
determine whether it is important to capture the difference between Saturday and 1:15 p.m.
The annotation strategy we just described marks up what a linguistic expression refers
to. But let’s say we want to know the basics for Question Answering, namely, the who,
what, where, and when of a sentence. This involves identifying what are called the
semantic role labels associated with a verb. What are semantic roles? Although there is
no complete agreement on what roles exist in language (there rarely is with linguists),
the following list is a fair representation of the kinds of semantic labels associated with
different verbs:
Agent
The event participant that is doing or causing the event to occur
Theme/figure
The event participant who undergoes a change in position or state
Experiencer
The event participant who experiences or perceives something
Source
The location or place from which the motion begins; the person from whom the
theme is given
Goal
The location or place to which the motion is directed or terminates
Recipient
The person who comes into possession of the theme
Patient
The event participant who is affected by the event
Instrument
The event participant used by the agent to do or cause the event
Location/ground
The location or place associated with the event itself
The annotated data that results explicitly identifies entity extents and the target relations
between the entities:
• [The man]agent painted [the wall]patient with [a paint brush]instrument.
• [Mary]figure walked to [the cafe]goal from [her house]source.
• [John]agent gave [his mother]recipient [a necklace]theme.
• [My brother]theme lives in [Milwaukee]location.
Language Data and Machine Learning
Now that we have reviewed the methodology of language annotation along with some
examples of annotation formats over linguistic data, we will describe the computational
framework within which such annotated corpora are used, namely, that of machine
learning. Machine learning is the name given to the area of Artificial Intelligence
concerned with the development of algorithms that learn or improve their performance
from experience or previous encounters with data. They are said to learn (or generate)
a function that maps particular input data to the desired output. For our purposes, the
“data” that an ML algorithm encounters is natural language, most often in the form of
text, and typically annotated with tags that highlight the specific features that are
relevant to the learning task. As we will see, the annotation schemas discussed earlier, for
example, provide rich starting points as the input data source for the ML process (the
training phase).
When working with annotated datasets in NLP, three major types of ML algorithms are
typically used:
Supervised learning
Any technique that generates a function mapping from inputs to a fixed set of labels
(the desired output). The labels are typically metadata tags provided by humans
who annotate the corpus for training purposes.
Unsupervised learning
Any technique that tries to find structure from an input set of unlabeled data.
Semi-supervised learning
Any technique that generates a function mapping from inputs of both labeled data
and unlabeled data; a combination of both supervised and unsupervised learning.
Table 1-4 shows a general overview of ML algorithms and some of the annotation tasks
they are frequently used to emulate. We’ll talk more about why these algorithms are used
for these different tasks in Chapter 7.
Table 1-4. Annotation tasks and their accompanying ML algorithms
Algorithms                                         Tasks
Clustering                                         Genre classification, spam labeling
Decision trees                                     Semantic type or ontological class assignment, coreference resolution
Naïve Bayes                                        Sentiment classification, semantic type or ontological class assignment
Maximum Entropy (MaxEnt)                           Sentiment classification, semantic type or ontological class assignment
Structured pattern induction (HMMs, CRFs, etc.)    POS tagging, sentiment classification, word sense disambiguation
You’ll notice that some of the tasks appear with more than one algorithm. That’s because
different approaches have been tried successfully for different types of annotation tasks,
and depending on the most relevant features of your own corpus, different algorithms
may prove to be more or less effective. Just to give you an idea of what the algorithms
listed in that table mean, the rest of this section gives an overview of the main types of
ML algorithms.
Classification
Classification is the task of identifying the labeling for a single entity from a set of data.
For example, in order to distinguish spam from not-spam in your email inbox, an
algorithm called a classifier is trained on a set of labeled data, where individual emails have
been assigned the label [+spam] or [-spam]. It is the presence of certain (known) words
or phrases in an email that helps to identify an email as spam. These words are essentially
treated as features that the classifier will use to model the positive instances of spam as
compared to not-spam. Another example of a classification problem is patient diagnosis,
from the presence of known symptoms and other attributes. Here we would identify a
patient as having a particular disease, A, and label the patient record as [+disease-A] or
[-disease-A], based on specific features from the record or text. This might include blood
pressure, weight, gender, age, existence of symptoms, and so forth. The most common
algorithms used in classification tasks are Maximum Entropy (MaxEnt), Naïve Bayes,
decision trees, and Support Vector Machines (SVMs).
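As a minimal sketch of the spam example, NLTK’s NaiveBayesClassifier can be trained directly on labeled feature dictionaries; the tiny training set and the feature function here are invented purely for illustration:
>>> from nltk import NaiveBayesClassifier
>>> def features(text):
...     # each lowercased word in the email becomes a boolean feature
...     return {word: True for word in text.lower().split()}
>>> train = [(features("win a free prize now"), "+spam"),
...          (features("claim your free money today"), "+spam"),
...          (features("meeting agenda for monday"), "-spam"),
...          (features("notes from the staff meeting"), "-spam")]
>>> classifier = NaiveBayesClassifier.train(train)
>>> classifier.classify(features("free prize inside"))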
Clustering
Clustering is the name given to ML algorithms that find natural groupings and patterns
from the input data, without any labeling or training at all. The problem is generally
viewed as an unsupervised learning task, where either the dataset is unlabeled or the
labels are ignored in the process of making clusters. The objects within a cluster are
“similar in some respect” to one another, and “dissimilar” to the objects in other clusters.
Some of the more common algorithms used for this task include k-means, hierarchical
clustering, Kernel Principal Component Analysis, and Fuzzy C-Means (FCM).
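A minimal clustering sketch, assuming scikit-learn is installed (the four toy “documents” are invented; a real task would use a much larger collection and richer features):
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.cluster import KMeans
>>> docs = ["the team won the game in overtime",
...         "stocks fell sharply in early trading",
...         "the pitcher threw a no-hitter last night",
...         "markets rallied after the earnings report"]
>>> X = CountVectorizer().fit_transform(docs)        # bag-of-words feature vectors
>>> KMeans(n_clusters=2, n_init=10).fit_predict(X)   # unsupervised: no labels are provided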
Structured Pattern Induction
Structured pattern induction involves learning not only the label or category of a single
entity, but rather learning a sequence of labels, or other structural dependencies between
the labeled items. For example, a sequence of labels might be a stream of phonemes in
a speech signal (in Speech Recognition); a sequence of POS tags in a sentence
corresponding to a syntactic unit (phrase); a sequence of dialog moves in a phone
conversation; or steps in a task such as parsing, coreference resolution, or grammar
induction. Algorithms used for such problems include Hidden Markov Models
(HMMs), Conditional Random Fields (CRFs), and Maximum Entropy Markov Models
(MEMMs).
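As a sketch of structured pattern induction, NLTK includes a supervised HMM trainer that learns a tag-sequence model from POS-tagged sentences; this assumes the Penn Treebank sample corpus has been downloaded via nltk.download('treebank'):
>>> from nltk.corpus import treebank
>>> from nltk.tag import hmm
>>> train_sents = treebank.tagged_sents()[:3000]               # sequences of (word, POS tag) pairs
>>> tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train_sents)
>>> tagger.tag(treebank.sents()[3500])                         # tag a held-out sentence
For real use you would want a smoothed probability estimator, since the default maximum-likelihood estimates assign zero probability to words not seen in training.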
We will return to these approaches in more detail when we discuss machine learning in
greater depth in Chapter 7.
The Annotation Development Cycle
The features we use for encoding a specific linguistic phenomenon must be rich enough
to capture the desired behavior in the algorithm that we are training. These linguistic
descriptions are typically distilled from extensive theoretical modeling of the
phenomenon. The descriptions in turn form the basis for the annotation values of the
specification language, which are themselves the features used in a development cycle for
training and testing an identification or labeling algorithm over text. Finally, based on
an analysis and evaluation of the performance of a system, the model of the phenomenon
may be revised for retraining and testing.
We call this particular cycle of development the MATTER methodology, as detailed here
and shown in Figure 1-10 (Pustejovsky 2006):
Model
Structural descriptions provide theoretically informed attributes derived from
empirical observations over the data.
Annotate
An annotation scheme assumes a feature set that encodes specific structural
descriptions and properties of the input data.
Train
The algorithm is trained over a corpus annotated with the target feature set.
Test
The algorithm is tested against held-out data.
Evaluate
A standardized evaluation of results is conducted.
Revise
The model and the annotation specification are revisited in order to make the
annotation more robust and reliable with use in the algorithm.
Figure 1-10. The MATTER cycle
We assume some particular problem or phenomenon has sparked your interest, for
which you will need to label natural language data for training for machine learning.
Consider two kinds of problems. First imagine a direct text classification task. It might
be that you are interested in classifying your email according to its content or with a
particular interest in filtering out spam. Or perhaps you are interested in rating your
incoming mail on a scale of what emotional content is being expressed in the message.
Now let’s consider a more involved task, performed over this same email corpus:
identifying what are known as Named Entities (NEs). These are references to everyday things
in our world that have proper names associated with them; for example, people,
countries, products, holidays, companies, sports, religions, and so on.
Finally, imagine an even more complicated task, that of identifying all the different
events that have been mentioned in your mail (birthdays, parties, concerts, classes,
airline reservations, upcoming meetings, etc.). Once this has been done, you will need
to “timestamp” them and order them, that is, identify when they happened, if in fact
they did happen. This is called the temporal awareness problem, and is one of the most
difficult in the field.
We will use these different tasks throughout this section to help us clarify what is
involved with the different steps in the annotation development cycle.
Model the Phenomenon
The first step in the MATTER development cycle is “Model the Phenomenon.” The steps
involved in modeling, however, vary greatly, depending on the nature of the task you
have defined for yourself. In this section, we will look at what modeling entails and how
you know when you have an adequate first approximation of a model for your task.
The parameters associated with creating a model are quite diverse, and it is difficult to
get different communities to agree on just what a model is. In this section we will be
pragmatic and discuss a number of approaches to modeling and show how they provide
the basis from which to create annotated datasets. Briefly, a model is a characterization
of a certain phenomenon in terms that are more abstract than the elements in the domain
being modeled. For the following discussion, we will define a model as consisting of a
vocabulary of terms, T, the relations between these terms, R, and their interpretation,
I. So, a model, M, can be seen as a triple, M = <T,R,I>. To better understand this notion
of a model, let us consider the scenarios introduced earlier. For spam detection, we can
treat it as a binary text classification task, requiring the simplest model with the
categories (terms) spam and not-spam associated with the entire email document. Hence,
our model is simply:
T = {Document_type, Spam, Not-Spam}
R = {Document_type ::= Spam | Not-Spam}
I = {Spam = “something we don’t want!”, Not-Spam = “something we do want!”}
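Purely as an illustration (not part of any annotation tool), the triple M = &lt;T,R,I&gt; for this task could be sketched in Python as:
>>> spam_model = {
...     "T": {"Document_type", "Spam", "Not-Spam"},       # vocabulary of terms
...     "R": {"Document_type": ["Spam", "Not-Spam"]},     # relations between the terms
...     "I": {"Spam": "something we don't want!",         # interpretation of the terms
...           "Not-Spam": "something we do want!"},
... }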
The document itself is labeled as being a member of one of these categories. This is
called document annotation and is the simplest (and most coarse-grained) annotation
possible. Now, when we say that the model contains only the label names for the cate
gories (e.g., sports, finance, news, editorials, fashion, etc.), this means there is no other
annotation involved. This does not mean the content of the files is not subject to further
scrutiny, however. A document that is labeled as a category, A, for example, is actually
analyzed as a large feature vector containing at least the words in the document. A more
fine-grained annotation for the same task would be to identify specific words or phrases
in the document and label them as also being associated with the category directly. We’ll
return to this strategy in Chapter 4. Essentially, a good model of the phenomenon
(task) is the starting point for designing the features that go
into your learning algorithm. The better the features, the better the performance of the
ML algorithm!
Preparing a corpus with annotations of NEs, as mentioned earlier, involves a richer
model than the spam-filter application just discussed. We introduced a four-category
ontology for NEs in the previous section, and this will be the basis for our model to
identify NEs in text. The model is illustrated as follows:
T = {Named_Entity, Organization, Person, Place, Time}
R = {Named_Entity ::= Organization | Person | Place | Time}
I = {Organization = “list of organizations in a database”, Person = “list of people in
a database”, Place = “list of countries, geographic locations, etc.”, Time = “all possible
dates on the calendar”}
This model is necessarily more detailed, because we are actually annotating spans of
natural language text, rather than simply labeling documents (e.g., emails) as spam or
not-spam. That is, within the document, we are recognizing mentions of companies,
actors, countries, and dates.
Finally, what about an even more involved task, that of recognizing all temporal infor
mation in a document? That is, questions such as the following:
When did that meeting take place?
How long was John on vacation?
Did Jill get promoted before or after she went on maternity leave?
We won’t go into the full model for this domain, but let’s see what is minimally necessary
in order to create annotation features to understand such questions. First we need to
distinguish between Time expressions (“yesterday,” “January 27,” “Monday”), Events
(“promoted,” “meeting,” “vacation”), and Temporal relations (“before,” “after,” “during”).
Because our model is so much more detailed, let’s divide the descriptive content by
domain:
Time_Expression ::= TIME | DATE | DURATION | SET
— TIME: 10:15 a.m., 3 o’clock, etc.
— DATE: Monday, April 2011
— DURATION: 30 minutes, two years, four days
— SET: every hour, every other month
Event: Meeting, vacation, promotion, maternity leave, etc.
Temporal_Relations ::= BEFORE | AFTER | DURING | EQUAL | OVERLAP | ...
We will come back to this problem in a later chapter, when we discuss the impact of the
initial model on the subsequent performance of the algorithms you are trying to train
over your labeled data.
In later chapters, we will see that there are actually several models that
might be appropriate for describing a phenomenon, each providing a
different view of the data. We will call this multimodel annotation of the
phenomenon. A common scenario for multimodel annotation involves
annotators who have domain expertise in an area (such as biomedical
knowledge). They are told to identify specific entities, events, attributes,
or facts from documents, given their knowledge and interpretation of
a specific area. From this annotation, nonexperts can be used to mark
up the structural (syntactic) aspects of these same phenomena, thereby
making it possible to gain domain expert understanding without
forcing the domain experts to learn linguistic theory as well.
Once you have an initial model for the phenomena associated with the problem task
you are trying to solve, you effectively have the first tag specification, or spec, for the
annotation. This is the document from which you will create the blueprint for how to
annotate the corpus with the features in the model. This is called the annotation
guideline, and we talk about this in the next section.
Annotate with the Specification
Now that you have a model of the phenomenon encoded as a specification document,
you will need to train human annotators to mark up the dataset according to the tags
that are important to you. This is easier said than done, and in fact often requires multiple
iterations of modeling and annotating, as shown in Figure 1-11. This process is called
the MAMA (Model-Annotate-Model-Annotate) cycle, or the “babeling” phase of
MATTER. The annotation guideline helps direct the annotators in the task of identifying the
elements and then associating the appropriate features with them, when they are
identified.
Two kinds of tags will concern us when annotating natural language data: consuming
tags and nonconsuming tags. A consuming tag refers to a metadata tag that has real
content from the dataset associated with it (e.g., it “consumes” some text); a
nonconsuming tag, on the other hand, is a metadata tag that is inserted into the file but is not
associated with any actual part of the text. An example will help make this distinction
clear. Say that we want to annotate text for temporal information, as discussed earlier.
Namely, we want to annotate for three kinds of tags: times (called Timex tags), temporal
relations (TempRels), and Events. In the first sentence in the following example, each
tag is expressed directly as real text. That is, they are all consuming tags (“promoted” is
marked as an Event, “before” is marked as a TempRel, and “the summer” is marked as
a Timex). Notice, however, that in the second sentence, there is no explicit temporal
relation in the text, even though we know that it’s something like “on.” So, we actually
insert a TempRel with the value of “on” in our corpus, but the tag is flagged as a
“nonconsuming” tag.
John was [promoted]Event [before]TempRel [the summer]Timex.
John was [promoted]Event [Monday]Timex.
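One way to picture the difference is with a stand-off representation of the second sentence; this layout is invented for illustration and is not the actual TimeML format. The consuming tags point at character offsets in the text, while the nonconsuming TempRel has no extent at all:
>>> sentence = "John was promoted Monday."
>>> annotations = [
...     {"tag": "Event",   "start": 9,  "end": 17},                     # consumes "promoted"
...     {"tag": "Timex",   "start": 18, "end": 24},                     # consumes "Monday"
...     {"tag": "TempRel", "start": None, "end": None, "value": "ON"},  # nonconsuming tag
... ]
>>> sentence[9:17], sentence[18:24]
('promoted', 'Monday')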
An important factor when creating an annotated corpus of your text is, of course,
consistency in the way the annotators mark up the text with the different tags. One of the
most seemingly trivial problems is the most problematic when comparing annotations:
namely, the extent or the span of the tag. Compare the three annotations that follow. In
the first, the Organization tag spans “QBC Productions,” leaving out the company
identifier “Inc.” and the location “of East Anglia,” while these are included in varying spans
in the next two annotations.
[QBC Productions]Organization Inc. of East Anglia
[QBC Productions Inc.]Organization of East Anglia
[QBC Productions Inc. of East Anglia]Organization
Each of these might look correct to an annotator, but only one actually corresponds to
the correct markup in the annotation guideline. How are these compared and resolved?
Figure 1-11. The inner workings of the MAMA portion of the MATTER cycle
In order to assess how well an annotation task is defined, we use Inter-
Annotator Agreement (IAA) scores to show how individual annotators
compare to one another. If an IAA score is high, that is an indication
that the task is well defined and other annotators will be able to continue
the work. This is typically defined using a statistical measure called a
Kappa Statistic. For comparing two annotations against each other, the
Cohen Kappa is usually used, while when comparing more than two
annotations, a Fleiss Kappa measure is used. These will be defined in
Chapter 8.
Note that having a high IAA score doesn’t necessarily mean the
annotations are correct; it simply means the annotators are all interpreting
your instructions consistently in the same way. Your task may still need
to be revised even if your IAA scores are high. This will be discussed
further in Chapter 9.
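NLTK provides implementations of several agreement measures; a toy sketch with two annotators labeling four items (the data here is invented) might look like this:
>>> from nltk.metrics.agreement import AnnotationTask
>>> # each triple is (annotator, item, label)
>>> data = [("ann1", "item1", "Org"),    ("ann2", "item1", "Org"),
...         ("ann1", "item2", "Person"), ("ann2", "item2", "Org"),
...         ("ann1", "item3", "Place"),  ("ann2", "item3", "Place"),
...         ("ann1", "item4", "Time"),   ("ann2", "item4", "Time")]
>>> task = AnnotationTask(data)
>>> task.kappa()        # Kappa agreement for the two annotators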
Once you have your corpus annotated by at least two people (more is preferable, but not
always practical), it’s time to create the gold standard corpus. The gold standard is the
final version of your annotated data. It uses the most up-to-date specification that you
created during the annotation process, and it has everything tagged correctly according
to the most recent guidelines. This is the corpus that you will use for machine learning,
and it is created through the process of adjudication. At this point in the process, you
(or someone equally familiar with all the tasks) will compare the annotations and
determine which tags in the annotations are correct and should be included in the gold
standard.
Train and Test the Algorithms over the Corpus
Now that you have adjudicated your corpus, you can use your newly created gold
standard for machine learning. The most common way to do this is to divide your corpus
into two parts: the development corpus and the test corpus. The development corpus is
then further divided into two parts: the training set and the development-test set.
Figure 1-12 shows a standard breakdown of a corpus, though different distributions
might be used for different tasks. The files are normally distributed randomly into the
different sets.
Figure 1-12. Corpus divisions for machine learning
The training set is used to train the algorithm that you will use for your task. The
development-test (dev-test) set is used for error analysis. Once the algorithm is trained,
it is run on the dev-test set and a list of errors can be generated to find where the
algorithm is failing to correctly label the corpus. Once sources of error are found, the
algorithm can be adjusted and retrained, then tested against the dev-test set again. This
procedure can be repeated until satisfactory results are obtained.
Once the training portion is completed, the algorithm is run against the held-out test
corpus, which until this point has not been involved in training or dev-testing. By
holding out the data, we can show how well the algorithm will perform on new data, which
gives an expectation of how it would perform on data that someone else creates as well.
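A minimal sketch of such a split (the 80/10/10 proportions and the hypothetical filenames are assumptions for illustration, not a fixed standard):
>>> import random
>>> corpus_files = ["doc%03d.xml" % i for i in range(200)]    # hypothetical annotated files
>>> random.shuffle(corpus_files)                              # distribute files randomly
>>> n = len(corpus_files)
>>> train_set = corpus_files[: int(0.8 * n)]                  # training set
>>> dev_test_set = corpus_files[int(0.8 * n) : int(0.9 * n)]  # development-test set
>>> test_set = corpus_files[int(0.9 * n) :]                   # held-out test corpus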
Figure 1-13 shows the “TTER” portion of the MATTER cycle, with the different corpus
divisions and steps.
Figure 1-13. The Training–Evaluation cycle
Evaluate the Results
The most common method for evaluating the performance of your algorithm is to
calculate how accurately it labels your dataset. This can be done by measuring the fraction
of the results from the dataset that are labeled correctly using a standard technique of
“relevance judgment” called the Precision and Recall metric.
Here’s how it works. For each label you are using to identify elements in the data, the
dataset is divided into two subsets: one that is labeled “relevant” to the label, and one
that is not relevant. Precision is a metric that is computed as the fraction of the correct
instances from those that the algorithm labeled as being in the relevant subset. Recall is
computed as the fraction of correct items among those that actually belong to the
relevant subset. The following confusion matrix helps illustrate how this works:
                           Predicted Labeling
                           positive               negative
Gold        positive       true positive (tp)     false negative (fn)
Labeling    negative       false positive (fp)    true negative (tn)
Given this matrix, we can define both precision and recall as shown in Figure 1-14, along
with a conventional definition of accuracy.
Figure 1-14. Precision and recall equations
The values of P and R are typically combined into a single metric called the F-measure,
which is the harmonic mean of the two.
F = (2 × P × R) / (P + R)
This creates an overall score used for evaluation where precision and recall are measured
equally, though depending on the purpose of your corpus and algorithm, a variation of
this measure, such as one that rates precision higher than recall, may be more useful to
you. We will give more detail about how these equations are used for evaluation in
Chapter 8.
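In code, the standard definitions work out as follows (a sketch using made-up counts for tp, fp, fn, and tn):
>>> def precision(tp, fp):
...     return tp / float(tp + fp)
>>> def recall(tp, fn):
...     return tp / float(tp + fn)
>>> def f_measure(p, r):
...     return (2 * p * r) / (p + r)
>>> def accuracy(tp, tn, fp, fn):
...     return (tp + tn) / float(tp + tn + fp + fn)
>>> p, r = precision(tp=45, fp=5), recall(tp=45, fn=15)
>>> f_measure(p, r)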
Revise the Model and Algorithms
Once you have evaluated the results of training and testing your algorithm on the data,
you will want to do an error analysis to see where it performed well and where it made
mistakes. This can be done with various packages and formulas, which we will discuss
in Chapter 8, including the creation of what are called confusion matrices. These will
help you go back to the design of the model, in order to create better tags and features
that will subsequently improve your gold standard, and consequently result in better
performance of your learning algorithm.
A brief example of model revision will help make this point. Recall the model for NE
extraction from the previous section, where we distinguished between four types of
entities: Organization, Place, Time, and Person. Depending on the corpus you have
assembled, it might be the case that you are missing a major category, or that you would
be better off making some subclassifications within one of the existing tags. For example,
you may find that the annotators are having a hard time knowing what to do with named
occurrences or events, such as Easter, 9-11, or Thanksgiving. These denote more than
simply Times, and suggest that perhaps a new category should be added to the model:
Event. Additionally, it might be the case that there is reason to distinguish geopolitical
Places from nongeopolitical Places. As with the “Model-Annotate” and “Train-Test”
cycles, once such additions and modifications are made to the model, the MATTER
cycle begins all over again, and revisions will typically bring improved performance.
Summary
In this chapter, we have provided an overview of the history of corpus and computational
linguistics, and the general methodology for creating an annotated corpus. Specifically,
we have covered the following points:
Natural language annotation is an important step in the process of training
computers to understand human speech for tasks such as Question Answering, Machine
Translation, and summarization.
All of the layers of linguistic research, from phonetics to semantics to discourse
analysis, are used in different combinations for different ML tasks.
In order for annotation to provide statistically useful results, it must be done on a
sufficiently large dataset, called a corpus. The study of language using corpora is
corpus linguistics.
Corpus linguistics began in the 1940s, but did not become a feasible way to study
language until decades later, when the technology caught up to the demands of the
theory.
A corpus is a collection of machine-readable texts that are representative of natural
human language. Good corpora are representative and balanced with respect to the
genre or language that they seek to represent.
The uses of computers with corpora have developed over the years from simple
keyword-in-context (KWIC) indexes and concordances that allowed full-text
documents to be searched easily, to modern, statistically based ML techniques.
Annotation is the process of augmenting a corpus with higher-level information,
such as part-of-speech tagging, syntactic bracketing, anaphora resolution, and word
senses. Adding this information to a corpus allows the computer to find features
that can make a defined task easier and more accurate.
Once a corpus is annotated, the data can be used in conjunction with ML algorithms
that perform classification, clustering, and pattern induction tasks.
Having a good annotation scheme and accurate annotations is critical for machine
learning that relies on data outside of the text itself. The process of developing the
annotated corpus is often cyclical, with changes made to the tagsets and tasks as the
data is studied further.
Here we refer to the annotation development cycle as the MATTER cycle—Model,
Annotate, Train, Test, Evaluate, Revise.
Often before reaching the Test step of the process, the annotation scheme has already
gone through several revisions of the Model and Annotate stages.
This book will show you how to create an accurate and effective annotation scheme
for a task of your choosing, apply the scheme to your corpus, and then use ML
techniques to train a computer to perform the task you designed.
CHAPTER 2
Defining Your Goal and Dataset
Creating a clear definition of your annotation goal is vital for any project aiming to
incorporate machine learning. When you are designing your annotation tagsets, writing
guidelines, working with annotators, and training algorithms, it can be easy to become
sidetracked by details and lose sight of what you want to achieve. Having a clear goal to
refer back to can help, and in this chapter we will go over what you need to create a good
definition of your goal, and discuss how your goal can influence your dataset. In
particular, we will look at:
What makes a good annotation goal
Where to find related research
How your dataset reflects your annotation goals
Preparing the data for annotators to use
How much data you will need for your task
What you should be able to take away from this chapter is a clear answer to the questions
“What am I trying to do?”, “How am I trying to do it?”, and “Which resources best fit
my needs?”. As you progress through the MATTER cycle, the answers to these questions
will probably change—corpus creation is an iterative process—but having a stated goal
will help keep you from getting off track.
Defining Your Goal
In terms of the MATTER cycle, at this point we’re right at the start of “M”—being able
to clearly explain what you hope to accomplish with your corpus is the first step in
creating your model. While you probably already have a good idea about what you want
to do, in this section we’ll give you some pointers on how to create a goal definition that
is useful and will help keep you focused in the later stages of the MATTER cycle.
We have found it useful to split the goal definition into two steps: first, write a statement
of purpose that covers the very basics of your task, and second, use that sentence to
expand on the “how”s of your goal. In the rest of this section, we’ll give some pointers
on how to make sure each of these parts will help you with your corpus task.
The Statement of Purpose
At this point we’re assuming that you already have some question pertaining to natural
language that you want to explore. (If you don’t really have a project in mind yet, check
the appendixes for lists of existing corpora, and read the proceedings from related
conferences to see if there’s anything that catches your eye, or consider participating in
Natural Language Processing [NLP] challenges, which are discussed later in this chapter.)
But how clearly can you explain what you intend to do? If you can’t come up with a one-
or two-sentence summary describing your intended line of research, then you’re going
to have a very hard time with the rest of this task. Keep in mind that we are not talking
about a sentence like “Genres are interesting”—that’s an opinion, not a starting point
for an annotation task. Instead, try to have a statement more like this:
I want to use keywords to detect the genre of a newspaper article in order to create
databases of categorized texts.
This statement is still going to need a lot of refinement before it can be turned into an
annotation model, but it answers the basic questions. Specifically, it says:
What the annotation will be used for (databases)
What the overall outcome of the annotation will be (genre classification)
Where the corpus will come from (news articles)
How the outcome will be achieved (keywords)
Be aware that you may have a task in mind that will require multiple annotation efforts.
For example, say you’re interested in exploring humor. Even if you have the time, money,
and people to do a comprehensive study of all aspects of humor, you will still need to
break down the task into manageable segments in order to create annotation tasks for
different types of humor. If you want to look at the effects of sarcasm and puns, you will
likely need different vocabularies for each task, or be willing to spend the time to create
one overarching annotation spec that will encompass all annotations, but we suggest
that you start small and then merge annotations later, if possible.
If you do have a broad task that you will need to break down into large subtasks, then
make that clear in your summary: “I want to create a program that can generate jokes
in text and audio formats, including puns, sarcasm, and exaggeration.” Each of the items
in that list will require a separate annotation and machine learning (ML) task. Grouping
them together, at least at first, would create such a massively complicated task that it
would be difficult to complete, let alone learn from.
To provide some more context, Table 2-1 shows a few examples of one-sentence
summaries, representative of the diverse range of annotation projects and corpora that exist
in the field.
Table 2-1. Some corpora and their uses
Corpus                                     Summary sentence
PropBank                                   For annotating verbal propositions and their arguments for examining semantic roles
Manually Annotated Sub-Corpus (MASC)       For annotating sentence boundaries, tokens, lemma, and part of speech (POS), noun and verb chunks, and Named Entities (NEs); a subset of the Open American National Corpus (OANC)
Penn Discourse TreeBank                    For annotating discourse relations between eventualities and propositions in newswires for learning about discourse in natural language
MPQA Opinion Corpus                        For annotating opinions for use in evaluating emotional language
TimeBank                                   For labeling times, events, and their relationships in news texts for use in temporal reasoning
i2b2 2008 Challenge, Task 1C               For identifying patient smoking status from medical records for use in medical studies
2012 SemEval Task 7—COPA: Choice of        Provides alternative answers to questions; annotation focuses on finding the most likely answer based on reasoning
Plausible Alternatives
Naturally, this isn’t a complete list of corpora, but it does cover a wide range of different
focuses for annotation tasks. These are not high-level descriptions of what these corpora
are about, but they answer the basic questions necessary for moving forward with an
annotation task. In the next section, we’ll look at how to turn this one sentence into an
annotation model.
Refining Your Goal: Informativity Versus Correctness
Now that you have a statement of purpose for your corpus, you need to turn it into a
task description that can be used to create your model—that is, your corpus, annotation
scheme, and guidelines.
When annotating corpora, there is a fine line between having an annotation that will
be the most useful for your task (having high informativity) and having an annotation
that is not too difficult for annotators to complete accurately (which results in high levels
of correctness).
A clear example of where this trade-off comes into play is temporal annotation. Imagine
that you want to capture all the relationships between times and events in this simple
narrative:
On Tuesday, Pat jogged after work, then went home and made dinner.
Figure 2-1 shows what it would look like to actually create all of those relationships, and
you can see that a large number of connections are necessary in order to capture all the
relationships between events and times. In such tasks, the number of relationships is
almost quadratic in the number of times and events [here we have 10, because the links
only go in one direction; if we captured both directions, there would be 20 links, or
x × (x – 1), where x is the number of events/times in the sentence]. It wouldn’t be very
practical to ask an annotator to do all that work by hand; such a task would take an
incredibly long time to complete, and an annotation that complex will lead to a lot of
errors and omissions—in other words, low correctness. However, asking for a limited
set of relations may lead to lower levels of informativity, especially if your annotation
guidelines are not very carefully written. We’ll discuss annotation guidelines further in
Chapter 6.
Figure 2-1. All temporal relations over events and times
You have probably realized that with this particular example it’s not
necessary to have a human create all those links—if A occurs before B
and B occurs before C, a computer could use closure rules to determine
that A occurs before C, and annotators would not have to capture that
information themselves. It’s always a good idea to consider what parts
of your task can be done automatically, especially when it can make your
annotator’s job easier without necessarily sacrificing accuracy.
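A toy sketch of such a closure rule for “before” relations (the annotated pairs here are invented):
>>> before = {("A", "B"), ("B", "C")}       # annotated BEFORE links
>>> changed = True
>>> while changed:                          # transitive closure: A before B and B before C implies A before C
...     new = {(x, z) for (x, y) in before for (w, z) in before if y == w}
...     changed = not new <= before
...     before |= new
>>> sorted(before)
[('A', 'B'), ('A', 'C'), ('B', 'C')]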
The considerations surrounding informativity and correctness are very much
intertwined with one of the biggest factors affecting your annotation task: the scope of your
project. There are two main aspects of project scope that you need to consider: (1) how
far-reaching the goal is (the scope of the annotation), and (2) how much of your chosen
field you plan to cover (the scope of the corpus). We already touched on (2) a little in
the preceding section, and we will have more to say about it later, so for now let’s look
at (1).
The scope of the annotation task
At this point you have already begun to address the question of your task’s scope by
answering the four questions from the preceding section—at the very least you’ve
narrowed down what category of features you’ll be using (by answering the “means by
which the goal will be achieved” question), and what the overall goal of the annotation
will be. However, having a general class that your task can be slotted into may still leave
you with a lot of variables that you will need to consider.
As always, remember that the MATTER cycle is, in fact, a cycle, and as
you go through the steps, you may find new information that causes
your scope to widen or shrink.
It’s a bit difficult to discuss scope in general terms, so let’s look at some specific examples,
and then see how the principles can be applied to other projects. In the temporal relation
annotation task discussed previously, the scope of the project has to do with exactly
what relations are important to the annotation. Are only the main events in each
sentence and their relationships important? Do you want to be able to capture only relations
inside a sentence, or do you want to capture relations between sentences as well? Maybe
you only care about the events that have clear temporal anchors, such as “Jay ran on
Sunday.” Do you think it’s important to differentiate between different types of links?
Similar questions can be asked about the newspaper genre classification example. In
that task, the relevant question is “How specific do you want your categories to be?” Is
it enough to divide articles into broad categories, such as “news” and “sports,” or do you
want a more detailed system that specifies “news:global” and “sports:baseball” (or even
“sports:baseball:Yankees”)?
If this is your first excursion into annotation and corpus building, start
with broader categories or simpler tasks—as you learn the ins and outs
of your dataset and annotation tags, you’ll be able to refine your project
in more meaningful and useful ways.
As you can see, the questions being asked about the two examples so far essentially
become questions about classification—in the newspaper example this is a much more
obvious correlation, but even the temporal relation example touches on this subject. By
defining different possible categories of relationships (inter- and intra-sentential, main
verbs versus other verbs), it’s much easier to identify what parts you feel will be most
relevant to your task.
This book is about annotation, and it’s necessarily true that if you have an annotation
project, classification is some part of your task. This could involve document-level tags,
as in the newspaper example; labels that are associated with each word (or a subset of
words), as in a POS tagging task; or labeling relationships between existing tags. If you
can think of your annotation task in terms of a classification task, then that will provide
a stable framework for you to start considering the scope of your task. You most likely
already have intuitions about what the relevant features of your data and task are, so use
those (at least at first) to determine the scope of your task. This intuition can also help
you determine what level of informativity you will need for good classification results,
and how accurate you can expect an annotator to be.
Linguistic intuition is an extremely useful way to get started in thinking
about a topic, but it can be misleading and even wrong. Once you’ve
started gathering texts and doing some annotation, if you find that the
data does not match your expectations, don’t hesitate to reevaluate your
approach.
Let’s go back to our four questions from the previous section and see how informativity
and correctness come into effect when elaborating on these aspects of your annotation
task. Now that you have a better idea of the scope of your project, it should be fairly easy
to see what sorts of things you will want to take into account when answering these
questions. (Notice that we didn’t say the questions would be easy to answer—the trade-
off between informativity and correctness is a consideration at all levels, and can make
it difficult to decide where to draw the line on a project.)
What will the annotation be used for?
It’s likely that the answer to this question hasn’t really changed based on the discussion
questions we’ve provided so far—the end product of the annotation, training, and testing
is why you’re taking on this project in the first place, and is what ultimately informs the
answers to the rest of the questions and the project as a whole. However, it’s helpful to
remind yourself of what you’re trying to do before you start expanding your answers to
the other questions.
If you have some idea of what ML techniques you’ll be using, then that can be a
consideration as well at this point, but it’s not required, especially if this is your first turn
around the MATTER cycle.
What will the overall outcome be?
Thinking about the scope of your project in terms of a classification task, now is the
time to start describing the outcome in terms of specific categories. Instead of saying
“Classify newspaper articles into genres,” try to decide on the number of genres that you
think you’ll need to encompass all the types of articles you’re interested in.
The more categories that your task uses, the harder it will probably be
to train an ML algorithm to accurately label your corpus. This doesn’t
necessarily mean you should limit your project right from the start, but
you may find that in later iterations of the MATTER cycle you need to
merge some categories.
For the temporal relation annotation, accuracy versus informativity is a huge
consideration, for the reasons described earlier (refer back to Figure 2-1 for a refresher on how
complicated this task can be). In this case, and in the case of tasks of similar complexity,
the specifics of how detailed your task will be will almost certainly have to be worked
out through multiple iterations of annotation and evaluation.
For both of these tasks, considering the desired outcome will help you to determine the
answer to this question. In the genre case, the use of the database is the main
consideration point—who will be using it, and for what? Temporal annotation can be used for
a number of things, such as summarization, timeline creation, Question Answering,
and so on. The granularity of the task will also inform what needs to be captured in the
annotation: if, for example, you are only interested in summarizing the major events in
a text, then it might be sufficient to only annotate the relationships of the main event in
each sentence.
Where will the corpus come from?
Now that you’ve thought about the scope of your task, it should be much easier to answer
more specific questions about the scope of your corpus. Specifically, it’s time to start
thinking about the distribution of sources; that is, exactly where all your data will come
from and how different aspects of it will be balanced.
Returning to our newspaper classification example, consider whether different news
venues have sufficiently different styles that you would need to train an algorithm over
all of them, or if a single source is similar enough to the others for the algorithm to be
easily generalizable. Are the writing and topics in the New York Times similar enough
to the Wall Street Journal that you don’t need examples from each? What about
newspapers and magazines that publish exclusively online? Do you consider blogs to be news
sources? Will you include only written articles, or will you also include transcripts of
broadcasts?
For temporal annotation, our experience has been that different publication styles and
genres can have a huge impact on how times are used in text. (Consider a children’s
story compared to a newspaper article, and how linear [or not] the narration in each
tends to be.) If you want your end product to cover all types of sources, this might be
another splitting point for your tasks—you may need to have different annotation
guidelines for different narrative genres, so consider carefully how far-reaching you want
your task to be.
Clearly, these considerations are tied into the scope of your task—the bigger the scope,
the more sources you will need to include in your annotation in order to maximize
informativity. However, the more disparate sources you include, the more likely it is that
you will need to consider having slightly different annotation tasks for each source,
which could lower correctness if the tasks are not fully thought out for each genre.
It’s also a good idea to check out existing corpora for texts that might be suitable for the
task you are working on. Using a preassembled corpus (or a subset of one) has the
obvious benefit of lessening the work that you will have to do, but it also means you will
have access to any other annotations that have been done on those files. See “Background
Research” (page 41) for more information on linguistic resources.
Don’t get too caught up in the beginning with creating the perfect
corpus—remember that this process is cyclical, and if you find you’re
missing something, you can always go back and add it later.
How will the result be achieved?
In Chapter 1 we discussed the levels of linguistics—phonology, syntax, semantics, and
so on—and gave examples of annotation tasks for each of those levels. Consider at this
point, if you haven’t already, which of these levels your task fits into. However, don’t try
to force your task to only deal with a single linguistic level! Annotations and corpora do
not always fit neatly into one category or another, and the same is probably true of your
own task.
For instance, while the temporal relation task that we have been using as an example so
far fits fairly solidly into the discourse and text structure level, it relies on having events
and times already annotated. But what is an event? Often events are verbs (“He ran down
the street.”) but they can also be nouns (“The election was fiercely contested.”) or even
adjectives, depending on whether they represent a state that has changed (“The volcano
was dormant for centuries before the eruption.”). But labeling events is not a purely
syntactic task, because (1) not all nouns, verbs, and adjectives are events, and (2) the
context in which a word is used will determine whether a word is an event or not.
Consider “The party lasted until 10” versus “The political party solicited funds for the
campaign.” These examples add a semantic component to the event annotation.
It’s very likely that your own task will benefit from bringing in information from different
levels of linguistics. POS tagging is the most obvious example of additional information
that can have a huge impact on how well an algorithm performs an NLP task: knowing
the part of speech of a word can help with word sense disambiguation (“call the police”
versus “police the neighborhood”), determining how the syllables of a word are
pronounced (consider the verb present versus the noun present—this is a common pattern
in American English), and so on.
Of course, there is always a trade-off: the more levels (or partial levels—it might not be
necessary to have POS labels for all your data; they might only be used on words that
are determined to be interesting in some other way) that your annotation includes, the
more informative it’s likely to be. But the other side of that is that the more complex
your task is, the more likely it is that your annotators will become confused, thereby
lowering your accuracy. Again, the important thing to remember is that MATTER is a
cycle, so you will need to experiment to determine what works best for your task.
Background Research
Now that you’ve considered what linguistic levels are appropriate for your task, it’s time
to do some research into related work. Creating an annotated corpus can take a lot of
effort, and while it’s possible to create a good annotation task completely on your own,
checking the state of the industry can save you a lot of time and effort. Chances are
there’s some research that’s relevant to what you’ve been doing, and it helps to not have
to reinvent the wheel.
For example, if you are interested in temporal annotation, you know by now that
ISO-TimeML is the ISO standard for time and event annotation, including temporal
relationships. But this fact doesn’t require that all temporal annotations use the ISO-
TimeML schema as-is. Different domains, such as medical and biomedical text analysis,
have found that TimeML is a useful starting place, but in some cases provides too many
options for annotators, or in other cases does not cover a particular case relevant to the
area being explored. Looking at what other people have done with existing annotation
schemes, particularly in fields related to those you are planning to annotate, can make
your own annotation task much easier to plan.
While the library and, of course, Google usually provide good starting places, those
sources might not have the latest information on annotation projects, particularly
because the primary publishing grounds in computational linguistics are conferences and
their related workshops. In the following sections we’ll give you some pointers to
organizations and workshops that may prove useful.
Language Resources
Currently there are a few different resources for locating preassembled corpora. The
Linguistic Data Consortium (LDC), for example, has a collection of hundreds of corpora
in both text and speech, from a variety of languages. While most of the corpora
are available to nonmembers (sometimes for a fee), some of them do require LDC
membership. The LDC is run by the University of Pennsylvania, and details about
membership and the cost of corpora are available on the LDC website.
The European Language Resources Association (ELRA) is another repository of both
spoken and written corpora from many different languages. As with the LDC, it is
possible to become a member of the ELRA in order to gain access to the entire database, or
access to individual corpora can be sought. More information is available at the ELRA
website.
Another useful resource is the LRE (Linguistic Resources and Evaluation) Map, which
provides a listing of all the resources used by researchers who submitted papers to the
LRE Conference (LREC) for the past few years. However, the list is not curated and so
not all the entries are guaranteed to be correct. A shortened version of the corpus and
annotation resources in the Map can be found in the appendixes of this book.
With both the LDC and the ELRA, it’s possible that while you would need to pay to gain
access to the most up-to-date version of a corpus, an older version may be available for
download from the group that created the corpus, so it’s worth checking around for
availability options if you are short of funds. And, of course, check the license on any
corpus you plan to use to ensure that it’s available for your purposes, no matter where
you obtain it from.
Organizations and Conferences
Much of the work on annotation that is available to the public is being done at
universities, making conference proceedings the best place to start looking for information
about tasks that might be related to your own. Here is a list of some of the bigger
conferences that examine annotation and corpora, as well as some organizations that are
interested in the same topic:
Association for Computational Linguistics (ACL)
Institute of Electrical and Electronics Engineers (IEEE)
Language Resources and Evaluation Conference (LREC)
European Language Resources Association (ELRA)
Conference on Computational Linguistics (COLING)
American Medical Informatics Association (AMIA)
The LINGUIST List is not an organization that sponsors conferences and workshops
itself, but it does keep an excellent up-to-date list of calls for papers and dates of
upcoming conferences. It also provides a list of linguistic organizations that can be sorted
by linguistic level.
NLP Challenges
In the past few years, NLP challenges hosted through conference workshops have
become increasingly common. These challenges usually present a linguistic problem, a
training and testing dataset, and a limited amount of time during which teams or
individuals can attempt to create an algorithm or ruleset that can achieve good results on
the test data.
The topics of these challenges vary widely, from POS tagging to word sense
disambiguation to text analysis over biomedical data, and they are not limited to English. Some
workshops that you may want to look into are:
SemEval
This is a workshop held every three years as part of the Association for
Computational Linguistics. It involves a variety of challenges including word sense
disambiguation, temporal and spatial reasoning, and Machine Translation.
Conference on Natural Language Learning (CoNLL) Shared Task
This is a yearly NLP challenge held as part of the Special Interest Group on Natural
Language Learning of the Association for Computational Linguistics. Each year
a new NLP task is chosen for the challenge. Past challenges include uncertainty
detection, extracting syntactic and semantic dependencies, and multilingual
processing.
i2b2 NLP Shared Tasks
The i2b2 group is focused on using NLP in the medical domain, and each year it
holds a challenge involving reasoning over patient documents. Past challenges have
focused on comorbidity, smoking status, and identification of medication
information.
A large number of other shared tasks and challenges are available for participation: the
NIST TREC Tracks are held every year, the BioNLP workshop frequently hosts a shared
task, and each year there are more. If you would like to be involved in an ML task but
don’t want to necessarily create a dataset and annotation yourself, signing up for one of
these challenges is an excellent way to get involved in the NLP community. NLP
challenges are also useful in that they are a good reference for tasks that might not have a
lot of time or funding. However, it should be noted that the time constraints on NLP
challenges often mean the obtained results are not the best possible overall, simply the
best possible given the time and data.
Assembling Your Dataset
We’ve already discussed some aspects that you will need to consider when assembling
your dataset: the scope of your task, whether existing corpora contain documents and
annotations that would be useful to you, and how varied your sources will be.
If you are planning to make your dataset public, make very sure that you have permission
to redistribute the information you are annotating. In some cases it is possible to release
only the stand-off annotations and a piece of code that will collect the data from websites,
but it's best and easiest to simply ask permission of the content provider, particularly if
your corpus and annotation are for business rather than purely educational purposes.
Guidelines for Corpus Creation
Corpus linguist John Sinclair developed guidelines for the creation of linguistic corpora
(Sinclair 2005). While these guidelines are primarily directed at corpora designed to study
linguistic phenomena, they will be useful for anyone interested in building a corpus. The
full paper can be read at http://www.ahds.ac.uk/creating/guides/linguistic-corpora/chapter1.htm,
but the guidelines are presented here for convenience:
1. The contents of a corpus should be selected without regard for the language they
contain, but according to their communicative function in the community in which
they arise.
2. Corpus builders should strive to make their corpus as representative as possible of
the language from which it is chosen.
3. Only those components of corpora that have been designed to be independently
contrastive should be contrasted.
4. Criteria for determining the structure of a corpus should be small in number, clearly
separate from each other, and efficient as a group in delineating a corpus that is
representative of the language or variety under examination.
5. Any information about a text other than the alphanumeric string of its words and
punctuation should be stored separately from the plain text and merged when required
in applications.
6. Samples of language for a corpus should, wherever possible, consist of entire documents
or transcriptions of complete speech events, or should get as close to this target
as possible. This means that samples will differ substantially in size.
7. The design and composition of a corpus should be documented fully with information
about the contents and arguments in justification of the decisions taken.
8. The corpus builder should retain, as target notions, representativeness and balance.
While these are not precisely definable and attainable goals, they must be used to
guide the design of a corpus and the selection of its components.
9. Any control of subject matter in a corpus should be imposed by the use of external,
and not internal, criteria.
10. A corpus should aim for homogeneity in its components while maintaining adequate
coverage, and rogue texts should be avoided.
The Ideal Corpus: Representative and Balanced
In corpus linguistics, the phrase “representative and balanced” is often used to describe
the traits that should be targeted when building a corpus. Because a corpus must always
be a selected subset of any chosen language, it cannot contain all examples of the
language's possible uses. Therefore, a corpus must be created by sampling the existing texts
of a language. Since any sampling procedure inherently contains the possibility of skewing
the dataset, care should be taken to ensure that the corpus is representative of “the
full range of variability in a population” (Biber 1993). The “population” being sampled
will be determined by the goal and scope of your annotation task. For example, if you
want to study movie reviews, you don't need to worry about including other types of
reviews or writing in your corpus. You do, however, want to make sure you have examples
of different types of reviews in your dataset. McEnery et al. (2006:19–22) provide an
excellent discussion of considerations for sampling a language.
The other important concept in corpus creation is that of balance. Sinclair (2005) describes
a corpus as balanced this way: “the proportions of different kinds of text it contains
should correspond with informed and intuitive judgments.” This applies predominantly
to corpora that are taking samples from different types of text: for example, a
corpus that wants to represent “American English” would have to include all types of
written and spoken texts, from newspaper articles to chat room discussions to television
transcripts, book samples, and so on. A corpus that has been predefined to require a
smaller sample will be easier to balance, simply because there will be fewer directions
in which the scope of the corpus can be expanded, but the utility of the corpus for general
research purposes will be correspondingly decreased.
Admittedly, the concepts of “representativeness and balance” are not easy to define, and
whether or not any corpus can be considered truly representative is an issue that corpus
linguists have been debating for years. However, considering what aspects of your corpus
and the world may impact whether your dataset can be considered “representative and
balanced” is an excellent way to gauge how useful it will be for other annotation and
ML tasks, and can help ensure that your results are maximally applicable to other
datasets as well.
The important thing to look out for is whether or not your corpus
matches the goal of your task. If your goal is to be able to process any
movie review, then you'll want your corpus to be an accurate representation
of how reviews are distributed in the real world. This will help to
train your algorithms to more accurately label the reviews that you give
them later on.
Collecting Data from the Internet
If you are doing textual annotation, you will probably be collecting your corpus from
the Internet. There are a number of excellent books that will provide specifics for how
to gather URLs and strip HTML tags from websites, as well as Twitter streams, forums,
and other Internet resources. We will discuss a few of them here.
Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper
(O’Reilly) provides some basic instructions for importing text and web data straight
from the Internet. For example, if you are interested in collecting the text from a book
in the Project Gutenberg library, the process is quite simple (as the book describes):
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
However, you should be aware that some websites block such programs from downloading
their content, and so you may need to find other ways to download your corpus.
If you are interested in taking the raw text from an HTML page, the NLTK includes a
package that will clean that input for you:
>>> url = "http://www.bbc.co.uk/news/world-us-canada-18963939"
>>> html = urlopen(url).read()
>>> raw = nltk.clean_html(html)
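If you are working with a newer Python or NLTK release, the same two steps can be written with urllib.request and BeautifulSoup instead. This is only a sketch, assuming the bs4 package is installed; it uses the same URLs as the snippets above, and the text encoding of the Gutenberg file may vary:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the plain text of a Project Gutenberg book.
url = "http://www.gutenberg.org/files/2554/2554.txt"
raw = urlopen(url).read().decode("utf-8", errors="replace")

# Fetch an HTML page and keep only its visible text.
html = urlopen("http://www.bbc.co.uk/news/world-us-canada-18963939").read()
text = BeautifulSoup(html, "html.parser").get_text()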
Chapter 11 of Natural Language Processing with Python provides information and resources
for compiling data from other sources, such as from word processor files, databases,
and spreadsheets.
In terms of mining information from other web sources, such as Twitter and blogs,
Mining the Social Web by Matthew A. Russell (O’Reilly) provides detailed information
for using the Twitter API, as well as resources for mining information from email,
LinkedIn, and blogs.
Eliciting Data from People
So far we have assumed you will be annotating texts or recordings that already exist.
But for some tasks, the data just isn't there, or at least it doesn't exist in a form that's
going to be of any use.
This applies, we think, more to tasks requiring annotation of spoken or visual phenomena
than written work—unless you are looking for something very particular when it
comes to text, it's rarely necessary to have people generate writing samples for you.
However, it’s very common to need spoken samples, or recordings of people performing
particular actions for speech or motion recognition projects.
If you do need to elicit data from humans and are affiliated with a university
or business, you will probably have to seek permission from
lawyers, or even an Institutional Review Board (IRB). Even if you are doing
your own research project, be very clear with your volunteers about
what you're asking them to do and why you're asking them to do it.
When it comes to eliciting data (as opposed to just collecting it), there are a few things
you need to consider: in particular, do you want your data to be spontaneous, or read?
Do you want each person to say the same thing, or not? Let's take a look at what some
of the differences are and how they might affect your data.
Read speech
Read speech means that, while collecting data, you have each person read the same set
of sentences or words. If, for example, you wanted to compare different dialects or accents,
or train a Speech Recognition program to detect when people are saying the same
thing, then this is probably the paradigm you will want to use.
The VoxForge corpus uses this method—it provides a series of prompts that speakers
can record on their own and submit with a user profile describing their language background.
If you do decide to have people read text from a prompt, be aware that
how the text is presented (font, bold, italics) can greatly affect how the
text is read. You may need to do some testing to make sure your readers
are giving you useful sound bites.
Recordings of news broadcasts can also be considered “read speech,” but be careful—
the cadence of news anchors is often very different from “standard” speech, so these
recordings might not be useful, depending on your goal.
A detailed description of the data collection process for the WSJCAM0 Corpus can be
found at http://www.ldc.upenn.edu/Catalog/readme_files/wsjcam0/wsjcam0.html.
Spontaneous speech
Naturally, spontaneous speech is collected without telling people what to say. This can
be done by asking people open-ended questions and recording their responses, or simply
recording conversations (with permission, of course!).
The Size of Your Corpus
Now that you know what kind of data you're looking for and how you're going to present
it, you have to decide how much data you're actually going to collect and annotate. If
you're planning to use a corpus that already exists, then the overall size of the corpus is
decided for you, but you still might have to determine how much of that corpus you
want to annotate.
Generally speaking, no matter what your annotation goal is, the more data you collect
and annotate, the closer you’ll get to achieving that goal. However, most of the time
“bigger is better” isn’t a very practical mantra when discussing annotation tasks—time,
money, limited resources, and attention span are all factors that can limit how much
annotation you and your annotators can complete.
If this is your first pass at data collection, the most important thing is
to have a sample corpus that has examples of all the phenomena that
you are expecting to be relevant to your task.
That being said, we recommend starting small when it comes to your first attempts at
annotating documents—select a handful of documents for your annotators first, and
see how well your annotation task and guidelines work (annotation guidelines will be
discussed further in Chapter 6). Once you have some of the problems worked out, then
you can go back and add to your corpus as needed.
Unfortunately, there's no magic number that we can give you for deciding how big your
corpus will need to be in order to get good results. How big your corpus needs to be will
depend largely on how complex your annotation task is, but even having a way to quantify
“complexity” in an annotation scheme won't solve all the problems. However, corpora
that are in use can provide a rule of thumb for how big you can expect your own
corpus to be.
Existing Corpora
A rule of thumb for gauging how big your corpus may need to be is to examine existing
corpora that are being used for tasks similar to yours. Table 2-2 shows sizes of some of
the different corpora that we have been discussing so far. As you can see, they do not
all use the same metric for evaluating size. This is largely a function of the purpose of
the corpus—corpora designed for evaluation at a document level, such as the movie
review corpus included in the Natural Language Toolkit (NLTK), will provide the number
of documents as a reference, while annotation tasks that are done at a word or phrase
level will report on the number of words or tokens for their metric.
Table 2-2. Existing corpora ranked in terms of estimated size
Corpus: Estimated size
ClueWeb09: 1,040,809,705 web pages
British National Corpus: 100 million words
American National Corpus: 22 million words (as of the time of this writing)
TempEval 2 (part of SemEval 2010): 10,000 to 60,000 tokens per language dataset
Penn Discourse TreeBank: 1 million words
i2b2 2008 Challenge—smoking status: 502 hospital discharge summaries
TimeBank 1.2: 183 documents; 61,000 tokens
Disambiguating Sentiment Ambiguous Adjectives (Chinese language data, part of SemEval 2010): 4,000 sentences
You will notice that the last three corpora are generally smaller in size than the other
corpora listed—this is because those three were used in NLP challenges as part of existing
workshops, and part of the challenge is to perform an NLP ML task in a limited
amount of time. This limit includes the time spent creating the training and testing
datasets, and so the corpora have to be much smaller in order to be feasible to annotate,
and in some cases the annotation schemes are simplified as well. However, results from
these challenges are often not as good as they would be if more time could have been
put into creating larger and more thoroughly annotated datasets.
Distributions Within Corpora
Previously we discussed including different types of sources in your corpus in order to
increase informativity. Here we will show examples of some of the source distributions
in existing corpora.
For example, TimeBank is a selection of 183 news articles that have been annotated with
time and event information. However, not all the articles in TimeBank were produced
the same way: some are broadcast transcripts, some are articles from a daily newspaper,
and some were written for broadcast over the newswire. The breakdown of this distribution
is shown in Figure 2-2.
As you can see, while the corpus trends heavily toward daily published newspapers,
other sources are also represented. Having those different sources has provided insight
into how time and events are reported in similar, but not identical, media.
Figure 2-2. Production circumstances in TimeBank
The British National Corpus (BNC) is another example of a corpus that draws from
many sources—sources even more disparate than those in TimeBank. Figure 2-3 shows
the breakdown of text types in the BNC, as described in the Reference Guide for the BNC.
Figure 2-3. Distribution of text types in the BNC
Naturally, other distribution aspects can be considered when evaluating how balanced
a corpus is. The BNC also provides analysis of its corpus based on publication dates,
domain, medium, and analysis of subgroups, including information about authors and
intended audiences (see Figure 2-4).
Figure 2-4. Publication dates in the BNC
For your corpus it's unlikely that you will need to be concerned with having representative
samples of all of these possible segmentations. That being said, minimizing the
number of factors that could potentially make a difference is a good strategy, particularly
when you're in the first few rounds of annotations. So, for example, making sure that all
of your texts come from the same time period, or checking that all of your speakers are
native to the language you are asking them to speak in, is something you may want to
take into account even if you ultimately decide not to include that type of diversity in
your corpus.
Summary
In this chapter we discussed what you need to create a good definition of your goal, and
how your goal can influence your dataset. In particular, we looked at the following
points:
Having a clear definition of your annotation task will help you stay on track when
you start creating task definitions and writing annotation guidelines.
There is often a trade-off in annotation tasks between informativity and accuracy—be
careful that you aren't sacrificing too much of one in favor of the other.
Clearly defining the scope of your task will make it much easier to decide on the
sources of your corpus—and later, the annotation tags and guidelines.
Doing some background research can help keep you from reinventing the wheel
when it comes to your own annotation task.
Using an existing corpus for your dataset can make it easier to do other analyses if
necessary.
If an existing corpus doesn’t suit your needs, you can build your own, but consider
carefully what data you need and what might become a confounding factor.
There are a variety of existing tools and programming languages that can help you
to collect data from the Internet.
What information you intend to show to your annotators is an important factor
that can influence annotation, particularly in tasks that rely on opinion or annotators'
interpretations of texts, rather than objective facts.
CHAPTER 3
Corpus Analytics
Now that you have successfully created a corpus for your defined goal, it is important
to know what it contains. The goal of this chapter is to equip you with tools for analyzing
the linguistic content of this corpus. Hence, we will introduce you to the kinds of techniques
and tools you will need in order to perform a variety of statistical analytics over
your corpus.
To this end, we will cover the aspects of statistics and probability that you need in order
to understand, from a linguistic perspective, just what is in the corpus we are building.
This is an area called corpus analytics. Topics will include the following:
How to measure basic frequencies of word occurrence, by lemma and by token
How to normalize the data you want to analyze
How to measure the relatedness between words and phrases in a corpus (i.e., distributions)
Knowing what is in your corpus will help you build your model for automatically identifying
the tags you will be creating in the next chapter. We will introduce these concepts
using linguistic examples whenever possible. Throughout the chapter, we will reference
a corpus of movie reviews, assembled from IMDb.com (IMDb). This will prove to be a
useful platform from which we can introduce these concepts.
Statistics is important for several reasons, but mostly it gives us two important abilities:
Data analysis
Discovering latent properties in the dataset
Significance for inferential statistics
Allowing us to make judgments and derive information regarding the content of
our corpus
This chapter does provide an overview of the statistics used to analyze corpora, but it
doesn't provide a full course in statistics or probability. If you're interested in reading
more about those topics, especially as they relate to corpus linguistics, we recommend
the following books/papers:
Probability for Linguists. John Goldsmith. Math. & Sci. hum. / Mathematics and
Social Sciences (45e année, no. 180, 2007(4)). http://hum.uchicago.edu/~jagoldsm/Papers/probabilityProofs.pdf.
Analyzing Linguistic Data: A Practical Introduction to Statistics using R. R.H. Baayen.
Cambridge University Press; 1st edition, 2008.
Statistics for Linguistics with R: A Practical Introduction. Stefan Th. Gries. De Gruyter
Mouton; 1st edition, 2010.
Basic Probability for Corpus Analytics
First let’s review some of the basic principles of probability.
State/sample space
When we perform an experiment or consider the possible values that can be assigned
to an attribute, such as an email being identified as spam or not-spam, we
are referring to the state space for that attribute: that is, {spam, not_spam}. Similarly,
the possible outcome for a coin toss is heads or tails, giving rise to the state space of
{heads, tails}.
Random variable
This is a variable that refers not to a fixed value, but to any possible outcome within
the defined domain; for example, the state space mentioned in the preceding list
item.
Probability
The probability of a specific outcome from the state space, x, is expressed as a function,
P(x). We say that P(x) refers to the “probability of x,” where x is some value of
the random variable X, x ∈ X.
Probabilities have two important properties:
They must have values between 0 and 1, expressed as:
0 ≤ P(x) ≤ 1 for all x ∈ X
The sum of the probabilities of all possible events must add up to 1:
∑x∈X P(x) = 1
Let’s say you are interested in looking at movie reviews. Perhaps your goal is to collect
a corpus in order to train an algorithm to identify the genre of a movie, based on the
text from the plot summary, as found on IMDb.com, for example. In order to train an
algorithm to classify and label elements of the text, you need to know the nature of the
corpus.
Assume that we have 500 movie descriptions involving five genres, evenly balanced over
each genre, as follows:
Action: 100 documents
Comedy: 100 documents
Drama: 100 documents
Sci-fi: 100 documents
Family: 100 documents
Given this corpus, we can define the random variable G (Genre), where the genre values
in the preceding list constitute the state space (sample space) for G. Because this is a
balanced corpus, any g ∈ G will have the same probability: for example, P(Drama) = 0.20,
P(Action) = 0.20, and so on.
If you want to find the probability of a particular variable in a corpus
using Python, you can do it quite easily using lists. If you have a list of
all the reviews, and a list of the comedies, you can use the length of the
respective lists to get the probability of randomly selecting a review with
a particular attribute. Let’s say that all is a list of the filenames for all
the reviews, and comedy is a list of the filenames that are comedies:
>>> p_com = float(len(comedy))/float(len(all))
Because the len() function returns an int, in order to get a probability,
you need to convert the int to a float; otherwise, you get a probability
of 0.
Joint Probability Distributions
Usually the kinds of questions we will want to ask about a corpus involve not just one
random variable, but multiple random variables at the same time. Returning to our
IMDb corpus, we notice that there are two kinds of plot summaries: short and long. We
can define the random variable S (Summary) with the sample space of {short, long}. If
each genre has 50 short summaries and 50 long summaries, then P(short) = .5 and
P(long) = .5.
Now we have two random variables, G (Genre) and S (Summary). We denote the joint
probability distribution of these variables as P(G,S). If G and S are independent variables,
then we can define P(G,S) as follows:
P(G ∩ S) = P(G) × P(S)
So, assuming they are independent, the probability of randomly picking a “short comedy” article is:
P(Comedy ∩ short) = P(Comedy) × P(short) = 0.20 × 0.50 = 0.10
But are these two random variables independent of each other? Intuitively, two events
are independent if one event occurring does not alter the likelihood of the other event
occurring.
So, we can imagine separating out the short articles, and then picking a comedy from
those, or vice versa. There are 250 short articles, composed of five genres, each containing
50 articles. Alternatively, the comedy genre is composed of 100 articles, containing
50 short and 50 long articles. Looking at it this way is equivalent to determining
the conditional probability. This is the probability of outcome A, given that B has occurred
(read “the probability of A given B”), and is defined as follows:
P(A|B) = P(A ∩ B) / P(B)
This is the fraction of B results that are also A results. For our example, the probability
of picking a comedy, given that we have already determined that the article is short, is
as follows:
P(Comedy | short) = P(Comedy ∩ short) / P(short)
We can quickly see that this returns 0.20, which is the same as P(Comedy). The property
of independence in fact tells us that, if G and S are independent, then:
P(G | S) = P(G)
Similarly, since P(S | G) = P(S), P(short | Comedy) = P(short), which is 0.50.
Notice that from the preceding formula we have the following equivalent formula, which
is called the multiplication rule in probability:
P(A ∩ B) = P(B)P(A|B) = P(A)P(B|A)
When we need to compute the joint probability distribution for more than two random
variables, this equation generalizes to something known as the chain rule:
P(A1 ∩ A2 ∩ … ∩ An) = P(A1) P(A2 | A1) P(A3 | A1 ∩ A2) … P(An | A1 ∩ … ∩ An−1)
This rule will become important when we build some of our first language models over
our annotated corpora.
The calculation of the conditional probability is important for the machine learning
(ML) algorithms we will encounter in Chapter 7. In particular, the Naïve Bayes Classifier
relies on the computation of conditional probabilities from the corpus.
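To make this concrete, a conditional probability like the one above can be estimated directly from file lists, in the same spirit as the earlier genre snippet. This is only a sketch: the list names (all_reviews, comedy, short_reviews) are ours, standing in for lists of filenames grouped by genre and by summary length.

# Hypothetical filename lists: every review, the comedies, and the short summaries.
comedy_and_short = set(comedy) & set(short_reviews)

p_short = float(len(short_reviews)) / float(len(all_reviews))
p_joint = float(len(comedy_and_short)) / float(len(all_reviews))

# P(Comedy | short) = P(Comedy ∩ short) / P(short)
p_comedy_given_short = p_joint / p_short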
Understanding how different attributes are connected in your corpus
can help you to discover aspects of your dataset that should be included
in your annotation task. If it turned out that the probability of a review
being short was correlated with one or more movie genres, then in
cluding that information in your annotation task (or later on when you
are building a feature set for your ML algorithm) could be very helpful.
At the same time, it may turn out that the connection between the length
of the review and the genre is purely coincidental, or is a result of your
corpus being unbalanced in some way. So checking your corpus for
significant joint probability distributions can also ensure that your cor
pus accurately represents the data you are working with.
Bayes Rule
Once we have the definition for computing a conditional probability, we can recast the
rule in terms of an equivalent formula called Bayes Theorem, stated as follows:
P(A|B) = P(B|A)P(A) / P(B)
This rule allows us to switch the order of dependence between events. This formulates
the conditional probability, P(A|B), into three other probabilities that are hopefully
easier to find estimates for. This is important because when we want to design an ML
algorithm for automatically classifying the different entities in our corpus, we need
training data, and this involves being able to easily access statistics for the probabilities
associated with the different judgments being used in the algorithm. We return to this
in Chapter 7.
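As a quick sanity check, here is the theorem applied to the balanced movie corpus described earlier, using plain Python arithmetic (the numbers are the ones worked out above):

# P(Comedy | short) = P(short | Comedy) * P(Comedy) / P(short)
p_short_given_comedy = 0.50
p_comedy = 0.20
p_short = 0.50

p_comedy_given_short = p_short_given_comedy * p_comedy / p_short
print(p_comedy_given_short)  # 0.2 -- equal to P(Comedy), as expected when G and S are independent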
Counting Occurrences
When putting together a corpus of linguistic texts, you most likely will not know the
probability distribution of a particular phenomenon before you examine the corpus. We
could not have known, for example, what the probability of encountering an action film
would be in the IMDb corpus, without counting the members associated with each value
of the Genre random variable. In reality, no corpus will be so conveniently balanced.
Such information constitutes the statistics over the corpus by counting occurrences of
the relevant objects in the dataset—in this case, movies that are labeled as action films,
comedy films, and so on. Similarly, when examining the linguistic content of a corpus,
we cannot know what the frequency distribution of the different words in the corpus
will be beforehand.
One of the most important things to know about your corpus before you apply any sort
of ML algorithm to it is the basic statistical profile of the words contained in it. The
corpus is essentially like a textbook that your learning algorithm is going to use as a
supply of examples (positive instances) for training. If you don’t know the distribution
of words (or whatever constitutes the objects of interest), then you don't know what the
textbook is supplying for learning. A language corpus typically has an uneven distribution
of word types, as illustrated in Figure 3-1. Instructions for how to create this graph
for your own corpus can be found in Madnani 2012.
First, a note of clarification. When we say we are counting the “words” in a corpus, we
need to be clear about what that means. Word frequencies refer to the number of word
tokens that are instances of a word type (or lemma). So we are correct in saying that the
following sentence has 8 words in it, or that it has 11 words in it. It depends on what we
mean by “word”!
The cat chased the mouse down the street in the dark.
You can perform word counts over corpora directly with the NLTK. Figure 3-2 shows
some examples over the IMDb corpus. Determining what words or phrases are most
common in your dataset can help you get a grasp of what kind of text you're looking at.
Here, we show some sample code that could be used to find the 50 most frequently used
words in a corpus of plain text.
Figure 3-1. Frequency distribution in the NLTK Gutenburg corpus
Figure 3-2. Frequency distribution in the IMDb corpus
>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> imdbcorpus = PlaintextCorpusReader('./training','.*')
>>> from nltk import FreqDist
>>> fd1 = FreqDist(imdbcorpus.words())
>>> fd1.N() #total number of sample outcomes. Same as len(imdbcorpus.words())
160035
>>> fd1.B() #total number of sample values that have counts greater than zero
16715
>>> len(fd1.hapaxes()) #total number of all samples that occur once
7933
>>> frequentwords = fd1.keys() #automatically sorts based on frequency
>>> frequentwords[:50]
[',', 'the', '.', 'and', 'to', 'a', 'of', 'is', 'in', "'", 'his',
's', 'he', 'that', 'with', '-', 'her', '(', 'for', 'by', 'him',
'who', 'on', 'as', 'The', 'has', ')', '"', 'from', 'are', 'they',
'but', 'an', 'she', 'their', 'at', 'it', 'be', 'out', 'up', 'will',
'He', 'when', 'was', 'one', 'this', 'not', 'into', 'them', 'have']
Instructions for using the NLTK's collocation functions are available at http://nltk.googlecode.com/svn/trunk/doc/howto/collocations.html.
Here are two of the basic concepts you need for performing lexical statistics over a
corpus:
Corpus size (N)
The number of tokens in the corpus
Vocabulary size (V)
The number of types in the corpus
For any tokenized corpus, you can map each token to a type; for example, how many
times the appears (the number of tokens of the type the), and so on. Once we have the
word frequency distributions over a corpus, we can calculate two metrics: the rank/
frequency profile and the frequency spectrum of the word frequencies.
To get the rank/frequency profile, you take the type from the frequency list and replace
it with its rank, where the most frequent type is given rank 1, and so forth. To build a
frequency spectrum, you simply calculate the number of types that have a specific frequency.
The first thing one notices with these metrics is that the top few frequency ranks
are taken up by function words (i.e., words such as the, a, and and; prepositions; etc.).
In the Brown Corpus, the 10 top-ranked words make up 23% of the total corpus size
(Baroni 2009). Another observation is that the bottom-ranked words display lots of ties
in frequency. For example, in the frequency table for the IMDb corpus, the number of
hapax legomena (words appearing only once in the corpus) is nearly 8,000. In the Brown
Corpus, about half of the vocabulary size is made up of words that occur only once. The
mean or average frequency hides huge deviations. In Brown, the average frequency of
a type is 19 tokens, but this average is inflated by a few very frequent types.
We also notice that most of the words in the corpus have a frequency well below the
mean. The mean will therefore be higher than the median, while the mode is usually 1.
So, the mean is not a very meaningful indicator of “central tendency,” and this is typical
of most large corpora.
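Both metrics are easy to derive from the frequency distribution built earlier (fd1). The following sketch assumes a recent NLTK release, where FreqDist behaves like a Counter and provides most_common():

from collections import Counter

# Rank/frequency profile: rank 1 is the most frequent type.
ranked = fd1.most_common()          # [(type, frequency), ...] in descending frequency
rank_freq = [(rank, freq) for rank, (word, freq) in enumerate(ranked, start=1)]

# Frequency spectrum: how many types occur with each frequency.
spectrum = Counter(freq for word, freq in ranked)

print(rank_freq[:5])    # the top-ranked types
print(spectrum[1])      # the number of hapax legomena (types occurring once)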
Recall the distinctions between the following notions in statistics:
Mean (or average): The sum of the values divided by the number
of values: x̄ = (x1 + x2 + … + xn) / n
Mode: The most frequent value in the population (or dataset)
Median: The numerical value that separates the higher half of a
population (or sample) from the lower half
Zipf's Law
The uneven distribution of word types shown in the preceding section was first pointed
out over a variety of datasets by George Zipf in 1949. He noticed that the frequency of a
word, f(w), appears as a nonlinearly decreasing function of the rank of the word, r(w),
in a corpus, and formulated the following relationship between these two variables:
f(w) = C / r(w)^a
C is a constant that is determined by the particulars of the corpus, but for now, let's say
that it's the frequency of the most frequent word in the corpus. Let's assume that a is 1;
then we can quickly see how frequency decreases with rank. Notice that the law is a
power law: frequency is a function of the negative power of rank, –a. So the first word
in the ranking occurs about twice as often as the second word in the ranking, and three
times as often as the third word in the ranking, and so on.
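To see how closely your own corpus follows this pattern, you can compare the observed frequencies against the Zipf prediction with a = 1, taking C to be the frequency of the top-ranked word. This is a sketch that reuses the fd1 distribution built earlier and assumes a recent NLTK:

ranked = fd1.most_common(10)
C = ranked[0][1]   # frequency of the most frequent word

for rank, (word, freq) in enumerate(ranked, start=1):
    predicted = C / float(rank)    # Zipf's prediction: f(w) = C / r(w)
    print(rank, word, freq, int(predicted))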
N-grams
In this section we introduce the notion of an n-gram. N-grams are important for a wide
range of applications in Natural Language Processing (NLP), because fairly straightforward
language models can be built using them, for speech, Machine Translation,
indexing, Information Retrieval (IR), and, as we will see, classification.
Imagine that we have a string of tokens, W, consisting of the elements w1, w2, … , wn.
Now consider a sliding window over W. If the sliding window consists of one cell (wi),
then the collection of one-cell substrings is called the unigram profile of the string; there
will be as many unigram profiles as there are elements in the string. Consider now all
two-cell substrings, where we look at w1 w2, w2 w3, and so forth, to the end of the string,
wn–1 wn. These are called bigram profiles, and we have n–1 bigrams for a string of
length n.
Using the definition of conditional probability mentioned earlier, we can define a probability
for a token, having seen the previous token, as a bigram probability. Thus the
conditional probability of an element, wi, given the previous element, wi−1, is:
P(wi | wi−1)
Extending this to bigger sliding windows, we can define an n-gram probability as simply
the conditional probability of an element given the previous n–1 elements. That is:
P(wi | wi−N+1, …, wi−1)
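The NLTK makes it easy to build these sliding-window profiles. Here is a minimal sketch over the IMDb corpus (imdbcorpus is the PlaintextCorpusReader created earlier; a recent NLTK is assumed):

from nltk import FreqDist
from nltk.util import bigrams

tokens = list(imdbcorpus.words())
bigram_counts = FreqDist(bigrams(tokens))   # counts of (wi-1, wi) pairs

print(bigram_counts.most_common(5))         # the most frequent word pairs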
The most common bigrams in any corpus will most likely not be very interesting to you,
however. They involve the most frequent words in word pairs, which usually turn out
to be boring function word pairs, such as the following:
of the
in the
on the
in a
If you want to get a more meaningful set of bigrams (and trigrams), you can run the
corpus through a part-of-speech (POS) tagger, such as one of those provided by the
NLTK. This would filter the bigrams to more content-related pairs involving, for example,
adjectives and nouns:
Star Trek
Bull Run
Sidekick Brainiac
This can be a useful way to filter the meaningless n-grams from your results. A better
solution, however, is to take advantage of the “natural affinity” that the words in an
n-gram have for one another. This includes what are called collocations. A collocation is
the string created when two or more words co-occur in a language more frequently than
by chance. A convenient way to do this over a corpus is through a concept known as
pointwise mutual information (PMI). Basically, the intuition behind PMI is as follows.
For two words, X and Y, we would like to know how much one word tells us about the
other. For example, given an occurrence of X, x, and an occurrence of Y, y, how much
does their joint probability differ from the expected value of assuming that they are
independent? This can be expressed as follows:
pmi(x; y) = ln ( P(x, y) / (P(x)P(y)) )
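Before turning to the NLTK's built-in collocation tools, it can be instructive to compute this quantity by hand. The sketch below reuses fd1 and the bigram_counts distribution from the earlier snippets, approximates each probability by its relative frequency, and assumes the word pair actually occurs in the corpus (otherwise the logarithm is undefined):

import math

N = float(fd1.N())   # corpus size in tokens

def pmi(x, y):
    p_xy = bigram_counts[(x, y)] / N
    p_x = fd1[x] / N
    p_y = fd1[y] / N
    return math.log(p_xy / (p_x * p_y))

print(pmi("New", "York"))   # a pair we know occurs in the IMDb corpus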
In fact, the collocation function provided in the NLTK uses this relation to build bigram
collocations. Applying this function to the bigrams from the IMDb corpus, we can see
the following results:
>>> from nltk.collocations import BigramCollocationFinder
>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>> finder1 = BigramCollocationFinder.from_words(imdbcorpus.words())
>>> finder1.nbest(bigram_measures.pmi, 10)
[('".(', 'Check'), ('10th', 'Sarajevo'), ('16100', 'Patranis'),
('1st', 'Avenue'), ('317', 'Riverside'), ('5000', 'Reward'),
('6310', 'Willoughby'), ('750hp', 'tire'), ('ALEX', 'MILLER'),
('Aasoo', 'Bane')]
>>> finder1.apply_freq_filter(10) # look only at collocations that occur 10 times or more
>>> finder1.nbest(bigram_measures.pmi, 10)
[('United', 'States'), ('Los', 'Angeles'), ('Bhagwan', 'Shri'),
('martial', 'arts'), ('Lan', 'Yu'), ('Devi', 'Maa'),
('New', 'York'), ('qv', ')),'), ('qv', '))'), ('I', ")'")]
>>> finder1.apply_freq_filter(15)
>>> finder1.nbest(bigram_measures.pmi, 10)
[('Bhagwan', 'Shri'), ('Devi', 'Maa'), ('New', 'York'),
('qv', ')),'), ('qv', '))'), ('I', ")'"), ('no', 'longer'),
('years', 'ago'), ('none', 'other'), ('each', 'other')]
One issue with using this simple formula, however, involves the problem of sparse data.
That is, the probabilities of observed rare events are overestimated, and the probabilities
of unobserved rare events are underestimated. Researchers in computational linguistics
have found ways to get around this problem to a certain extent, and we will return to
this issue when we discuss ML algorithms in more detail in Chapter 7.
Language Models
So, what good are n-grams? NLP has used n-grams for many years to develop statistical
language models that predict sequence behaviors. Sequence behavior is involved in recognizing
the next X in a sequence of Xs; for example, Speech Recognition, Machine
Translation, and so forth. Language modeling predicts the next element in a sequence,
given the previously encountered elements.
Let's see more precisely just how this works, and how it makes use of the tools we discussed
in the previous sections. Imagine a sequence of words, w1, w2, … , wn. Predicting
any “next word” wi in this sequence is essentially expressed by the following probability
function:
P(wi | w1, …, wi−1)
which is equivalent to:
P(w1, ..., wi) / P(w1, ..., wi−1)
Notice what’s involved with computing these two joint probability distributions. We will
assume that the frequency of a word sequence in the corpus will estimate its probability.
That is:
P(w1, …, wi−1) = Count(w1, …, wi–1)
P(w1, …, wi) = Count(w1, …, wi)
giving us the ratio known as the relative frequency, shown here:
P(wi | w1, ..., wi−1) = Count(w1, ..., wi) / Count(w1, ..., wi−1)
As we just saw, the joint probabilities in the n-gram example can be
expressed as conditional probabilities, using the chain rule for sequence
behavior, illustrated as follows:
P(w1, w2, ..., wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wn | w1, ..., wn−1)
which can be expressed as:
∏k=1..n P(wk | w1, ..., wk−1)
Even though we can, in principle, estimate the probabilities that we need for making
our predictive model, there is little chance that we are going to have a lot of data to work
with, if we take the joint probability of the entire sequence of words. That is, there are
sequences of words that may never have occurred in our corpus, but we still want to be
able to predict the behavior associated with the elements contained in them. To solve
this problem, we can make some simplifying assumptions regarding the contribution
of the elements in the sequence. That is, if we approximate the behavior of a word in a
sequence as being dependent on only the word before it, then we have reduced the
n-gram probability of:
P(wi | w1, ..., wi−1)
to this bigram probability:
P(wi | wi−1)
This is known as the Markov assumption, and using it, we can actually get some reasonable
statistics for the bigrams in a corpus. These can be used to estimate the bigram
probabilities by using the concept of relative frequency mentioned earlier. That is, as
before, we take the ratio of the occurrences of the bigram in the corpus to the number
of occurrences of the prefix (the single word, in this case) in the corpus, as shown here:
P(wi | wi−1) = Count(wi−1, wi) / Count(wi−1)
This procedure is known as a maximum likelihood estimation (MLE), and it provides a
fairly direct way to collect statistics that can be used for creating a language model. We
will return to these themes in Chapter 7.
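As a sketch of how these counts turn into a usable estimate, here is the MLE bigram probability computed over the IMDb corpus, reusing the unigram distribution fd1 and the bigram_counts distribution from the earlier snippets (the word pair chosen is one we know occurs in the corpus):

def p_bigram(prev_word, word):
    # MLE estimate: Count(prev_word, word) / Count(prev_word)
    count_prev = fd1[prev_word]
    if count_prev == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / float(count_prev)

print(p_bigram("New", "York"))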
Summary
In this chapter we introduced you to the tools you need to analyze the linguistic content
of a corpus as well as the kinds of techniques and tools you will need to perform a variety
of statistical analytics. In particular, we discussed the following:
Corpus analytics comprises statistical and probabilistic tools that provide data
analysis over your corpus and information for performing inferential statistics. This
will be necessary information when you take your annotated corpus and train an
ML algorithm on it.
It is necessary to distinguish between the occurrence of a word in a corpus (the
token) and the word itself (the type).
The total number of tokens in a corpus gives us the corpus size.
The total number of types in a corpus gives us the vocabulary size.
The rank/frequency profile of the words in a corpus assigns a ranking to the words,
according to how many tokens there are of that word.
The frequency spectrum of the word gives the number of word types that have a
given frequency.
Zipf's law is a power law stating that the frequency of any word is inversely proportional
to its rank.
Constructing n-grams over the tokens in a corpus is the first step in building language
models for many NLP applications.
Pointwise mutual information is a measure of how dependent one word is on another
in a text. This can be used to identify bigrams that are true collocations in a
corpus.
Language models for predicting sequence behavior can be simplified by making the
Markov assumption, namely, when predicting a word, only pay attention to the
word before it.
CHAPTER 4
Building Your Model and Specification
Now that you've defined your goal and collected a relevant dataset, you need to create
the model for your task. But what do we mean by “model”? Basically, the model is the
practical representation of your goal: a description of your task that defines the classifications
and terms that are relevant to your project. You can also think of it as the aspects
of your task that you want to capture within your dataset. These classifications can be
represented by metadata, labels that are applied to the text of your corpus, and/or relationships
between labels or metadata. In this chapter, we will address the following
questions:
The model is captured by a specification, or spec. But what does a spec look like?
You have the goals for your annotation project. Where do you start? How do you
turn a goal into a model?
What form should your model take? Are there standardized ways to structure the
phenomena?
How do you take someone else's standard and use it to create a specification?
What do you do if there are no existing specifications, definitions, or standards for
the kinds of phenomena you are trying to identify and model?
How do you determine when a feature in your description is an element in the spec
versus an attribute on an element?
The spec is the concrete representation of your model. So, whereas the model is an
abstract idea of what information you want your annotation to capture, and the interpretation
of that information, the spec turns those abstract ideas into tags and attributes
that will be applied to your corpus.
Some Example Models and Specs
Recall from Chapter 1 that the first part in the MATTER cycle involves creating a model
for the task at hand. We introduced a model as a triple, M = <T,R,I>, consisting of a
vocabulary of terms, T, the relations between these terms, R, and their interpretation,
I. However, this is a pretty high-level description of what a model is. So, before we discuss
more theoretical aspects of models, let's look at some examples of annotation tasks and
see what the models for those look like.
For the most part, we'll be using XML DTD (Document Type Definition) representations.
XML is becoming the standard for representing annotation data, and DTDs are
the simplest way to show an overview of the type of information that will be marked up
in a document. The next few sections will go through what the DTDs for different models
will look like, so you can see how the different elements of an annotation task can be
translated into XML-compliant forms.
What Is a DTD?
A DTD is a set of declarations containing the basic building blocks that allow an XML
document to be validated. DTDs have been covered in depth in other books (O’Reilly’s
Learning XML and XML in a Nutshell) and websites (W3schools.com), so we’ll give a short
overview here.
Essentially, the DTD defines what the structure of an XML document will be by defining
what tags will be used inside the document and what attributes those tags will have. By
having a DTD, the XML in a file can be validated to ensure that the formatting is correct.
So what do we mean by tags and attributes? Let’s take a really basic example: web pages
and HTML. If you've ever made a website and edited some code by hand, you're familiar
with elements such as <b> and <br />. These are tags that tell a program reading the
HTML that the text in between <b> and </b> should be bold, and that <br /> indicates
a newline should be included when the text is displayed. Annotation tasks use similar
formatting, but they define their own tags based on what information is considered important
for the goal being pursued. So an annotation task that is based on marking the
parts of speech in a text might have tags such as <noun>, <verb>, <adj>, and so on. In a
DTD, these tags would be defined like this:
<!ELEMENT noun ( #PCDATA ) >
<!ELEMENT verb ( #PCDATA ) >
<!ELEMENT adj ( #PCDATA ) >
The string !ELEMENT indicates that the information contained between the < and > is about
an element (also known as a “tag”), and the word following it is the name of that tag (noun,
verb, adj). The ( #PCDATA ) indicates that the information between the <noun> and
</noun> tags will be parsable character data (other flags instead of #PCDATA can be used
to provide other information about a tag, but for this book, we're not going to worry about
them).
By declaring the three tags in a DTD, we can have a valid XML document that has nouns,
verbs, and adjectives all marked up. However, annotation tasks often require more information
about a piece of text than just its type. This is where attributes come in. For example,
knowing that a word is a verb is useful, but it's even more useful to know the tense of the
verb—past, present, or future. This can be done by adding an attribute to a tag, which
looks like this:
<!ELEMENT verb ( #PCDATA ) >
<!ATTLIST verb tense ( past | present | future | none ) #IMPLIED >
The !ATTLIST line declares that an attribute called tense is being added to the verb
element, and that it has four possible values: past, present, future, and none. The
#IMPLIED shows that the information in the attribute isn’t required for the XML to be valid
(again, don’t worry too much about this for now). Now you can have a verb tag that looks
like this:
<verb tense="present">
You can also create attributes that allow annotators to put in their own information, by
declaring the attributes type to be CDATA instead of a list of options, like this:
<!ELEMENT verb ( #PCDATA ) >
<!ATTLIST verb tense CDATA #IMPLIED >
One last type of element that is commonly used in annotation is a linking element, or a
link tag. These tags are used to show relationships between other parts of the data that
have been marked up with tags. For instance, if the part-of-speech (POS) task also wanted
to show the relationship between a verb and the noun that performed the action described
by the verb, the annotation model might include a link tag called performs, like so:
<!ELEMENT performs EMPTY >
<!ATTLIST performs fromID IDREF >
<!ATTLIST performs toID IDREF >
The EMPTY in this element tag indicates that the tag will not be applied to any of the text
itself, but rather is being used to provide other information about the text. Normally in
HTML an empty tag would be something like the <br /> tag, or another tag that stands
on its own. In annotation tasks, an empty tag is used to create paths between other, contentful
tags.
In a model, it is almost always important to keep track of the order (or arity) of the elements
involved in the linking relationship. We do this here by using two attributes of
type IDREF, meaning they will refer to other annotated extents or elements in the text by
their identifiers.
We'll talk more about the IDs and the relationship between DTDs and annotated data in
Chapter 5, but for now, this should give you enough information to understand the examples
provided in this chapter.
There are other formats that can be used to specify specs for a model.
XML schemas are sometimes used to create a more complex representation
of the tags being used, as is the Backus–Naur Form. However,
these formats are more complex than DTDs, and aren't generally necessary
to use unless you are using a particular piece of annotation software,
or want to have a more restrictive spec. For the sake of simplicity,
we will use only DTD examples in this book.
Film Genre Classification
A common task in Natural Language Processing (NLP) and machine learning is classifying
documents into categories; for example, using film reviews or summaries to
determine the genre of the film being described. If you have a goal of being able to use
machine learning to identify the genre of a movie from the movie summary or review,
then a corresponding model could be that you want to label the summary with all the
genres that apply to the movie, in order to feed those labels into a classifier and train
it to identify relevant parts of the document. To turn that model into a spec, you need
to think about what that sort of label would look like, presumably in a DTD format.
The easiest way to create a spec for a classification task is to simply create a tag that
captures the information you need for your goal and model. In this case, you could create
a tag called genre that has an attribute called label, where label holds the values that
can be assigned to the movie summary. The simplest incarnation of this spec would be
this:
<!ELEMENT genre ( #PCDATA ) >
<!ATTLIST genre label CDATA #IMPLIED >
This DTD has the required tag and attribute, and allows for any information to be added
to the label attribute. Functionally for annotation purposes, this means the annotator
would be responsible for filling in the genres that she thinks apply to the text. Of course,
a large number of genre terms have been used, and not everyone will agree on what a
“standard” list of genres should be—for example, are “fantasy” and “sci-fi” different
genres, or should they be grouped into the same category? Are “mystery” films different
from “noir”? Because the list of genres will vary from person to person, it might be better
if your DTD specified a list of genres that annotators could choose from, like this:
<!ELEMENT genre ( #PCDATA ) >
<!ATTLIST genre label ( Action | Adventure | Animation | Biography | Comedy |
Crime | Documentary | Drama | Family | Fantasy | Film-Noir | Game-Show |
History | Horror | Music | Musical | Mystery | News | Reality-TV | Romance |
Sci-Fi | Sport | Talk-Show | Thriller | War | Western ) >
The list in the label attribute is taken from IMDb's list of genres. Naturally, since other
genre lists exist (e.g., Netflix also has a list of genres), you would want to choose the one
that best matches your task, or create your own list. As you go through the process of
annotation and the rest of the MATTER cycle, you’ll find places where your model/spec
needs to be revised in order to get the results you want. This is perfectly normal, even
for tasks that seem as straightforward as putting genre labels on movie summaries—
annotator opinions can vary, even when the task is as clearly defined as you can make
it. And computer algorithms don't really think and interpret the way that people do, so
even when you get past the annotation phase, you may still find places where, in order
to maximize the correctness of the algorithm, you would have to change your model.
For example, looking at the genre list from IMDb we see that “romance” and “comedy”
are two separate genres, and so the summary of a romantic comedy would have to have
two labels: romance and comedy. But if, in a significant portion of reviews, those two
tags appear together, an algorithm may learn to always associate the two, even when the
summary being classified is really a romantic drama or musical comedy. So, you might
find it necessary to create a rom-com label to keep your classifier from creating false
associations.
In the other direction, there are many historical action movies that take place over very
different periods in history, and a machine learning (ML) algorithm may have trouble
finding enough common ground between a summary of 300, Braveheart, and Pearl
Harbor to create an accurate association with the history genre. In that case, you might
find it necessary to add different levels of historical genres, ones that reflect different
periods in history, to train a classifier in the most accurate way possible.
If you’re unclear on how the different components of the ML algorithm
can be affected by the spec, or why you might need to adapt a model to
get better results, don’t worry! For now, just focus on turning your goal
into a set of tags, and the rest will come later. But if you really want to
know how this works, Chapter 7 has an overview of all the different
ways that ML algorithms “learn,” and what it means to train each one.
Adding Named Entities
Of course, reworking the list of genres isn’t the only way to change a model to better fit
a task. Another way is to add tags and attributes that will more closely reflect the information
that's relevant to your goal. In the case of the movie summaries, it might be
useful to keep track of some of the Named Entities (NEs) that appear in the summaries
that may give insight into the genre of the film. An NE is an entity (an object in the
world) that is uniquely identified by its name, nickname, abbreviation, and so on.
“O'Reilly,” “Brandeis University,” “Mount Hood,” “IBM,” and “Vice President”
are all examples of NEs. In the movie genre task, it might be helpful to keep track of NEs
such as film titles, directors, writers, actors, and characters that are mentioned in the
summaries.
You can see from the list in the preceding paragraph that there are many different NEs
in the model that we would like to capture. Because the model is abstract, the practical
application of these NEs to a spec or DTD has to be decided upon. There are often many
ways in which a model can be represented in a DTD, due to the categorical nature of
annotation tasks and of XML itself. In this case there are two primary ways in which
the spec could be created. We could have a single tag called named_entity with an
attribute that would have each of the items from the previous list, like this:
<!ELEMENT named_entity ( #PCDATA ) >
<!ATTLIST named_entity role (film_title | director |
writer | actor | character ) >
Or each role could be given its own tag, like this:
<!ELEMENT film_title ( #PCDATA ) >
<!ELEMENT director ( #PCDATA ) >
<!ELEMENT writer ( #PCDATA ) >
<!ELEMENT actor ( #PCDATA ) >
<!ELEMENT character ( #PCDATA ) >
While these two specs seem to be very different, in many ways they are interchangeable.
It would not be difficult to take an XML file with the first DTD and change it to one that
is compliant with the second. Often the choices that you’ll make about how your spec
will represent your model will be influenced by other factors, such as what format is
easier for your annotators, or what works better with the annotation software you are
using. We’ll talk more about the considerations that go into which formats to use in
Chapter 5 and Chapter 6.
By giving ML algorithms more information about the words in the document that are
being classified, such as by annotating the NEs, it's possible to create more accurate
representations of what’s going on in the text, and to help the classifier pick out markers
that might make the classifications better.
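For instance, if you were using a feature-based classifier such as the ones provided by NLTK, the NE annotations could simply become extra features alongside the usual bag of words. The sketch below is only illustrative: the feature names and the train_data variable are invented, and the real feature design would depend on your corpus and spec.

import nltk

# Minimal sketch (assumed data structures, not from an actual task): fold NE
# annotations into the features handed to a genre classifier, alongside
# ordinary bag-of-words features.
def summary_features(tokens, named_entities):
    features = {"word(%s)" % w.lower(): True for w in tokens}    # bag of words
    for role, text in named_entities:        # e.g., ("director", "Richard Curtis")
        features["ne_role(%s)" % role] = True       # which kinds of NEs appear
        features["ne(%s)" % text.lower()] = True    # which specific NEs appear
    return features

# train_data is a hypothetical list of (tokens, named_entities, genre) triples
# built from an annotated corpus:
# classifier = nltk.NaiveBayesClassifier.train(
#     [(summary_features(toks, nes), genre) for toks, nes, genre in train_data])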
Semantic Roles
Another layer of information that might be useful in examining movie summaries is to
annotate the relationships between the NEs that are marked up in the text. These relationships are called semantic roles, and they are used to explicitly show the connections
between the elements in a sentence. In this case, it could be helpful to annotate the
relationships between actors and characters, and between the movie's crew and the film
they worked on. Consider the following example summary/review:
In Love, Actually, writer/director Richard Curtis weaves a convoluted tale about characters and their relationships. Of particular note is Liam Neeson (Schindler’s List, Star
Wars) as Daniel, a man struggling to deal with the death of his wife and the relationship
with his young stepson, Sam (Thomas Sangster). Emma Thompson (Sense and Sensibility, Henry V) shines as a middle-aged housewife whose marriage with her husband (played
by Alan Rickman) is under siege by a beautiful secretary. While this movie does have its
purely comedic moments (primarily presented by Bill Nighy as out-of-date rock star Billy
Mack), this movie avoids the more in-your-face comedy that Curtis has presented before
as a writer for Blackadder and Mr. Bean, presenting instead a remarkable, gently humorous
insight into what love, actually, is.
Using one of the NE DTDs from the preceding section would lead to a number of
annotated extents, but due to the density, an algorithm may have difficulty determining
who goes with what. By adding semantic role labels such as acts_in, acts_as, directs, writes, and character_in, the relationships between all the NEs will become
much clearer.
As with the DTD for the NEs, we are faced with a choice between using a single tag with
multiple attribute options:
<!ELEMENT sem_role EMPTY >
<!ATTLIST sem_role from IDREF #REQUIRED >
<!ATTLIST sem_role to IDREF #REQUIRED >
<!ATTLIST sem_role label ( acts_in |
acts_as | directs | writes | character_in ) #REQUIRED >
or a tag for each semantic role we wish to capture:
<!ELEMENT acts_in EMPTY >
<!ATTLIST acts_in from IDREF #REQUIRED >
<!ATTLIST acts_in to IDREF #REQUIRED >
<!ELEMENT acts_as EMPTY >
<!ATTLIST acts_as from IDREF #REQUIRED >
<!ATTLIST acts_as to IDREF #REQUIRED >
<!ELEMENT directs EMPTY >
<!ATTLIST directs from IDREF #REQUIRED >
<!ATTLIST directs to IDREF #REQUIRED >
<!ELEMENT writes EMPTY >
<!ATTLIST writes from IDREF #REQUIRED >
<!ATTLIST writes to IDREF #REQUIRED >
<!ELEMENT character_in EMPTY >
<!ATTLIST character_in from IDREF #REQUIRED >
<!ATTLIST character_in to IDREF #REQUIRED >
You’ll notice that this time, the DTD specifies that each of these elements is EMPTY,
meaning that no character data is associated directly with the tag. Remember that linking tags in annotation are usually defined as EMPTY tags precisely because links between elements do not generally have text of their own, but rather clarify a relationship between two or more other extents. We'll discuss the application
of linking and other types of tags in Chapter 5.
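To make the linking concrete, here is a minimal sketch that builds a small fragment of such an annotation with Python's ElementTree; the IDs and character offsets are invented for illustration, and a real project would generate them from the annotated text.

import xml.etree.ElementTree as ET

# Minimal sketch (invented IDs and offsets): two NE extents plus one EMPTY
# linking tag that connects them through its "from" and "to" attributes.
root = ET.Element("SemRoleExample")
ET.SubElement(root, "NE", id="ne1", start="3", end="17",
              type="film_title", text="Love, Actually")
ET.SubElement(root, "NE", id="ne2", start="35", end="49",
              type="director", text="Richard Curtis")
# "from" is a Python keyword, so the link attributes are passed in a dict.
ET.SubElement(root, "sem_role", label="directs",
              **{"from": "ne2", "to": "ne1"})
print(ET.tostring(root, encoding="unicode"))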
Multimodel Annotations
It may be the case that your annotation task requires more than one model to fully capture
the data you need. This happens most frequently when a task requires information from
two or more very different levels of linguistics, or if information from two different domains needs to be captured. For example, an annotation over a corpus that's made up of
documents that require training to understand, such as clinical notes, scientific papers,
or legal documents, may require that annotators have training in those fields, and that the
annotation task be tailored to the domain.
In general, employing different annotation models in the same task simply means that
more than one MATTER cycle is being worked through at the same time, and that the
different models will likely be focused on different aspects of the corpus or language being
explored. In these cases, however, it is important that all the models be coordinated, and that changes made to one model during the MATTER cycle don't cause conflict with the
others.
If your corpus is made up of domain-specific documents (such as the clinical notes that
we mentioned before), and your annotation task requires that your annotators be able to
interpret these documents (e.g., if you are trying to determine which patients have a particular disease), then one of your models may need to be a light annotation task (Stubbs
2012).
A light annotation task is essentially a way to formulate an annotation model that allows
a domain expert (such as a doctor) to provide her insight into a text without being required
to link her knowledge to one of the layers of linguistic understanding. Such an annotation
task might be as simple as having the domain expert indicate whether a file has a particular
property (such as whether or not a patient is at risk for diabetes), or it may involve annotating the parts of the text associated with a disease state. However, the domain expert won't be asked to mark POS tags or map every noun in the text to a semantic interpretation:
those aspects of the text would be handled in a different model altogether, and merged at
the end.
There is a slightly different philosophy behind the creation of light annotation tasks than
that of more “traditional” annotations: light annotations focus on encoding an answer to
a particular question about a text, rather than creating a complete record of a particular
linguistic phenomenon, with the purpose of later merging all the different models into a
single annotation. However, aside from the difference in goal, light annotation tasks still
follow the MATTER and MAMA cycles. Because of this, we aren't going to use them as
examples in this book, and instead will stick to more traditional linguistic annotations.
If you are interested in performing an annotation task that requires domain-specific
knowledge, and therefore would benefit from using a light annotation task, a methodology
for creating a light annotation and incorporating it into the MATTER cycle is developed
and presented in Stubbs 2012.
Adopting (or Not Adopting) Existing Models
Now that you have an idea of how specs can represent a model, let's look a little more
closely at some of the details we just presented. You might recall from Chapter 1 that
when we discussed semantic roles we presented a very different list from acts_in,
acts_as, directs, writes, and character_in. Here's what the list looked like:
Agent
The event participant that is doing or causing the event to occur
Theme/figure
The event participant who undergoes a change in position or state
Experiencer
The event participant who experiences or perceives something
Source
The location or place from which the motion begins; the person from whom the
theme is given
Goal
The location or place to which the motion is directed or terminates
Recipient
The person who comes into possession of the theme
Patient
The event participant who is affected by the event
Instrument
The event participant used by the agent to do or cause the event
Location/ground
The location or place associated with the event itself
Similarly, we also presented an ontology that defined the categories Organization, Person, Place, and Time. This set of labels can be viewed as a simple model of NE types that
are commonly used in other annotation tasks.
So, if these models exist, why didn’t we just use them for our film genre annotation task?
Why did we create our own sets of labels for our spec? Just as when defining the goal of
your annotation you need to think about the trade-off between informativity and correctness, when creating the model and spec for your annotation task, you need to consider the trade-off between generality and specificity.
Creating Your Own Model and Specification: Generality Versus
Specificity
The ontology consisting of Organization, Person, Place, and Time is clearly a very general model for entities in a text, but for the film genre annotation task, it is much too
general to be useful for the kinds of distinctions we want to be able to make. Of the NE
labels that we identified earlier, four of them (“director,” “writer,” “actor,” and “character”)
would fall under the label “Person,” and “film title” doesn’t clearly fit under any of them.
Using these labels would lead to unhelpful annotations in two respects: first, the labels
used would be so generic as to be useless for the task (labeling everyone as “Person” won't help distinguish one movie review from another); and second, it would be difficult to explain to the annotators that, while you've given them a set of labels, you don't want
every instance of those types of entities labeled, but rather only those that are relevant
to the film (so, for example, a mention of another reviewer would not be labeled as a
“Person”). Clearly, overly general tags in a spec can lead to confusion during annotation.
On the other hand, we could have made the tags in the spec even more specific, such as
actor_star, actor_minor_character, character_main, character_minor, writer_film, writer_book, writer_book_and_film, and so on. But what would be gained from such a complicated spec? While it's possible to think of an annotation task where it might be necessary to label all that information (perhaps one that was looking at how these different people are described in movie reviews), remember that the task we defined was, first, simply labeling the genres of films as they are described in summaries
and reviews, and then expanding it to include some other information that might be
relevant to making that determination. Using overly specific tags in this case would
decrease how useful the annotations would be, and also increase the amount of work
done by the annotators for no obvious benefit. Figure 4-1 shows the different levels of
the hierarchy we are discussing. The top two levels are too vague, while the bottom is
too specific to be useful. The third level is just right for this task.
We face the same dichotomy when examining the list of semantic roles. The list given
in linguistic textbooks is a very general list of roles that can be applied to the nouns in
a sentence, but any annotation task trying to use them for film-related roles would have
to have a way to limit which nouns were assigned roles by the annotator, and most of
the roles related to the NEs we're interested in would simply be “agent”—a label that is
neither helpful nor interesting for this task. So, in order to create a task that was in the
right place regarding generality and specificity, we developed our own list of roles that
were particular to this task.
Figure 4-1. A hierarchy of named entities
We haven't really gotten into the details of NE and semantic role annotation using existing models, but these are not trivial annotation tasks.
If you’re interested in learning more about annotation efforts that use
these models, check out FrameNet for semantic roles, and the Message
Understanding Conferences (MUCs) for examples of NE and coreference annotation.
Overall, there are a few things that you want to make sure your model and specification
have in order to proceed with your task. They should:
• Contain a representation of all the tags and links relevant to completing your goal.
• Be relevant to the implementation stated in your goal (if your purpose is to classify documents by genre, spending a lot of time annotating temporal information is probably not going to be of immediate help).
• Be grounded in existing research as much as possible. Even if there's no existing annotation spec that meets your goal completely, you can still take advantage of research that's been done on related topics, which will make your own research much easier.
Regarding the last point on the list, even though the specs we've described for the film genre annotation task use sets of tags that we created for this purpose, it's difficult to say that they weren't based on an existing model to some extent. Obviously, some knowledge about NEs and semantic roles helped to inform how we described the annotation task, and helped us to decide whether annotating those parts of the document would be useful. But you don't need to be a linguist to know that nouns can be assigned to different groups, and that the relationships between different nouns and verbs can
be important to keep track of. Ultimately, while it's entirely possible that your annotation task is completely innovative and new, it's still worth taking a look at some related research and resources and seeing if any of them are helpful for getting your model and
spec put together.
The best way to find out if a spec exists for your task is to do a search for existing
annotated datasets. If you aren't sure where to start, or Google results seem overwhelming, check Appendix A for the list of corpora and their annotations.
Using Existing Models and Specifications
While the examples we discussed thus far had fairly clear-cut reasons for us to create
our own tags for the spec, there are some advantages to basing your annotation task on
existing models. Interoperability is a big concern in the computer world, and it's actually
a pretty big concern in linguistics as well—if you have an annotation that you want to
share with other people, there are a few things that make it easier to share, such as using
existing annotation standards (e.g., standardized formats for your annotation files),
using software to create the annotation that other people can also use, making your
annotation guidelines available to other people, and using models or specifications that
have already been vetted in similar tasks. We'll talk more about standards and formats later in this chapter and in the next one; for now, we'll focus just on models and specs.
Using models or specs that other people have used can benefit your project in a few
ways. First of all, if you use the specification from an annotation project that's already
been done, you have the advantage of using a system that’s already been vetted, and one
that may also come with an annotated corpus, which you can use to train your own
algorithms or use to augment your own dataset (assuming that the usage restrictions
on the corpus allow for that, of course).
In “Background Research” (page 41), we mentioned some places to start looking for information that would be useful in defining your goal, so presumably you've already done some research into the topics you're interested in (if you haven't, now is a good time to go back and do so). Even if there's no existing spec for your topic, you might
find a descriptive model similar to the one we provided for semantic roles.
Not all annotation and linguistic models live in semantic textbooks! The
list of film genres that we used was taken from IMDb.com, and there
are many other places where you can get insight into how to frame your
model and specification. A recent paper on annotating bias used the
Wikipedia standards for editing pages as the standard for developing a
spec and annotation guidelines for an annotation project (Herzig et al.
2011). Having a solid linguistic basis for your task can certainly help,
but don't limit yourself to only linguistic resources!
If you are lucky enough to find both a model and a specification that are suitable for
your task, you still might need to make some changes for them to fit your goal. For
example, if you are doing temporal annotation, you can start with the TimeML specification, but you may find that the TIMEX3 tag is simply too much information for your
purposes, or too overwhelming for your annotators. The TIMEX3 DTD description is
as follows:
<!ELEMENT TIMEX3 ( #PCDATA ) >
<!ATTLIST TIMEX3 start CDATA #IMPLIED >
<!ATTLIST TIMEX3 tid ID #REQUIRED >
<!ATTLIST TIMEX3 type ( DATE | DURATION | SET | TIME ) #REQUIRED >
<!ATTLIST TIMEX3 value NMTOKEN #REQUIRED >
<!ATTLIST TIMEX3 anchorTimeID IDREF #IMPLIED >
<!ATTLIST TIMEX3 beginPoint IDREF #IMPLIED >
<!ATTLIST TIMEX3 endPoint IDREF #IMPLIED >
<!ATTLIST TIMEX3 freq NMTOKEN #IMPLIED >
<!ATTLIST TIMEX3 functionInDocument ( CREATION_TIME | EXPIRATION_TIME |
MODIFICATION_TIME | PUBLICATION_TIME | RELEASE_TIME | RECEPTION_TIME |
NONE ) #IMPLIED >
<!ATTLIST TIMEX3 mod ( BEFORE | AFTER | ON_OR_BEFORE | ON_OR_AFTER | LESS_THAN |
MORE_THAN | EQUAL_OR_LESS | EQUAL_OR_MORE | START | MID | END |
APPROX ) #IMPLIED >
<!ATTLIST TIMEX3 quant CDATA #IMPLIED >
<!ATTLIST TIMEX3 temporalFunction ( false | true ) #IMPLIED >
<!ATTLIST TIMEX3 valueFromFunction IDREF #IMPLIED >
<!ATTLIST TIMEX3 comment CDATA #IMPLIED >
A lot of information is encoded in a TIMEX3 tag. While the information is there for a
reason—years of debate and modification took place to create this description of a temporal reference—there are certainly annotation tasks where this level of detail will be
unhelpful, or even detrimental. If this is the case, other temporal annotation tasks have
been done over the years that have specs that you may find more suitable for your goal
and model.
Using Models Without Specifications
It's entirely possible—even likely—that your annotation task may be based on a linguistic
(or psychological or sociological) phenomenon that has been clearly explained in the
literature, but has not yet been turned into a specification. In that case, you will have to
decide the form the specification will take, in much the same way that we discussed in
the first section of this chapter. Depending on how fleshed out the model is, you may
have to make decisions about what parts of the model become tags, what become attributes, and what become links. In some ways this can be harder than simply creating your own model and spec, because you will be somewhat constrained by someone else's
description of the phenomenon. However, having a specification that is grounded in an
established theory will make your own work easier to explain and distribute, so there
are advantages to this approach as well.
Many (if not all) of the annotation specifications that are currently in wide use are based
on theories of language that were created prior to the annotation task being created. For
example, the TLINK tag in ISO-TimeML is based largely on James Allen's work in temporal reasoning (Allen 1984; Pustejovsky et al. 2003), and ISO-Space has been influenced
by the qualitative spatial reasoning work of Randell et al. (1992) and others. Similarly,
syntactic bracketing and POS labeling work, as well as existing semantic role labeling,
are all based on models developed over years of linguistic research and then applied
through the creation of syntactic specifications.
Different Kinds of Standards
Previously we mentioned that one of the aspects of having an interoperable annotation
project is using a standardized format for your annotation files, as well as using existing
models and specs. However, file specifications are not the only kind of standards that
exist in annotation: there are also annotation specifications that have been accepted by
the community as go-to (or de facto) standards for certain tasks. While there are no
mandated (a.k.a. de jure) standards in the annotation community, there are varying
levels and types of de facto standards that we will discuss here.
ISO Standards
The International Organization for Standardization (ISO) is the body responsible for
creating standards that are used around the world for ensuring compatibility of systems
between businesses and government, and across borders. ISO is the organization that
helps determine what the consensus will be for many different aspects of daily life, such
as the size of DVDs, representation of dates and times, and so on. There are even ISO
standards for representing linguistic annotations in general and for certain types of
specifications, in particular ISO-TimeML and ISO-Space. Of course, you aren't required to use ISO standards (there's no Annotation Committee that enforces use of these standards), but they do represent a good starting point for most annotation tasks, particularly those standards related to representation.
ISO standards are created with the intent of interoperability, which sets
them apart from other de facto standards, as those often become the
go-to representation simply because they were there first, or were used
by a large community at the outset and gradually became ingrained in
the literature. While this doesn't mean that non-ISO standards are inherently problematic, it does mean that they may not have been created
with interoperability in mind.
Annotation format standards
Linguistic annotation projects are being done all over the world for many different, but
often complementary, reasons. Because of this, in the past few years ISO has been developing the Linguistic Annotation Framework (LAF), a model for annotation projects
that is abstract enough to apply to any level of linguistic annotation.
How can a model be flexible enough to encompass all of the different types of annotation
tasks? LAF takes a two-pronged approach to standardization. First, it focuses on the
structure of the data, rather than the content. Specifically, the LAF standard allows for
annotations to be represented in any format that the task organizers like, so long as it
can be transmuted into LAF’s XML-based “dump format,” which acts as an interface for
all manner of annotations. The dump format has the following qualities (Ide and Romary
2006):
• The annotation is kept separate from the text it is based on, and annotations are associated with character or element offsets derived from the text.
• Each level of annotation is stored in a separate document.
• Annotations that represent hierarchical information (e.g., syntax trees) must either be represented with embedding in the XML dump format, or use a flat structure that symbolically represents relationships.
• When different annotations are merged, the dump format must be able to integrate overlapping annotations in a way that is compatible with XML.
The first bullet point—keeping annotation separate from the text—now usually takes
the form of stand-off annotation (as opposed to inline annotation, where the tags and
text are intermingled). We'll go through all the forms that annotation can take and the
pros and cons in Chapter 5.
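As a tiny illustration of the stand-off idea, the following sketch (the text and tag values are invented) records a single NE as a pair of character offsets into an untouched source string:

# Minimal sketch (invented text and tag values): stand-off annotation leaves
# the source text untouched and stores each tag with character offsets that
# point back into it.
text = "The Massachusetts State House in Boston, MA houses the offices..."

extent = "The Massachusetts State House"
start = text.find(extent)       # offset of the first character of the extent
end = start + len(extent)       # offset just past the last character

ne_tag = {"id": "ne1", "type": "building", "start": start, "end": end}
assert text[ne_tag["start"]:ne_tag["end"]] == extent
print(ne_tag)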
The other side of the approach that LAF takes toward standardization is encouraging
researchers to use established labels for linguistic annotation. This means that instead
of just creating your own set of POS or NE tags, you can go to the Data Category Registry
(DCR) for definitions of existing tags, and use those to model your own annotation task.
Alternatively, you can name your tag whatever you want, but when transmuting to the
dump format, you would provide information about what tags in the DCR your own
tags are equivalent to. This will help other people merge existing annotations, because
it will be known whether two annotations are equivalent despite naming differences.
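A lightweight way to keep track of such equivalences in your own project is simply to record them alongside your spec. The sketch below is only a rough illustration; the category names on the right-hand side are invented stand-ins rather than actual DCR entries.

# Minimal sketch (invented mappings): record which project-specific labels
# correspond to established categories, so the mapping can travel with the
# annotation when it is converted to an interchange format.
TAG_EQUIVALENCES = {
    "film_title": "namedEntity",    # hypothetical established category names
    "director": "personName",
    "actor": "personName",
}

def equivalent_category(local_tag):
    # Fall back to the local name if no established equivalent is recorded.
    return TAG_EQUIVALENCES.get(local_tag, local_tag)

print(equivalent_category("director"))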
The DCR is currently under development (it’s not an easy task to create a repository of
all annotation tags and levels, and so progress has been made very carefully). You can
see the information as it currently exists at www.isocat.org.
Timeline of Standardization
LAF didn't emerge as an ISO standard from out of nowhere. Here's a quick rundown of
where the standards composing the LAF model originated:
1987: The Text Encoding Initiative (TEI) is founded “to develop guidelines for encoding machine-readable texts in the humanities and social sciences.” The TEI is still
an active organization today. See http://www.tei-c.org.
1990: The TEI releases its first set of Guidelines for the Encoding and Interchange of
Machine Readable Texts. It recommends that encoding be done using SGML (Standard Generalized Markup Language), the precursor to XML and HTML.
1993: The Expert Advisory Group on Language Engineering Standards (EAGLES) is
formed to provide standards for large-scale language resources (e.g., corpora), as well
as standards for manipulating and evaluating those resources. See http://www.ilc.cnr.it/EAGLES/home.html.
1998: The Corpus Encoding Standard (CES), also based on SGML, is released. The
CES is a corpus-specific application of the standards laid out in the TEI’s Guidelines and was developed by the EAGLES group. See http://www.cs.vassar.edu/CES/.
2000: The Corpus Encoding Standard for XML (XCES) is released, again under the
EAGLES group. See http://www.xces.org/.
2002: The TEI releases version P4 of its Guidelines, the first version to implement
XML. See http://www.tei-c.org/Guidelines/P4/.
2004: The first document describing the Linguistic Annotation Framework is released
(Ide and Romary 2004).
2007: The most recent version (P5) of the TEI Guidelines is released. See http://www.tei-c.org/Guidelines/P5/.
2012: LAF and the TEI Guidelines are still being updated and improved to reflect
progress made in corpus and computational linguistics.
Annotation specification standards
In addition to helping create standards for annotation formats, ISO is working on developing standards for specific annotation tasks. We mentioned ISO-TimeML already,
which is the standard for representing temporal information in a document. There is
also ISO-Space, the standard for representing locations, spatial configurations, and
movement in natural language. The area of ISO that is charged with looking at annotation standards for all areas of natural language is called TC 37/SC 4. Other projects involve the development of standards for how to encode syntactic categories and morphological information in different languages, semantic role labeling, dialogue act labeling, discourse relation annotation, and many others. For more information, you can
visit the ISO web page or check out Appendix A of this book.
Community-Driven Standards
In addition to the committee-based standards provided by ISO, a number of de facto
standards have been developed in the annotation community simply through wide use.
These standards are created when an annotated resource is formed and made available
for general use. Because corpora and related resources can be very time-consuming to create, once a corpus is made available it will usually quickly become part of the literature. By extension, whatever annotation scheme was used for that corpus will also tend
to become a standard.
If there is a spec that is relevant to your project, taking advantage of community-driven standards can provide some very useful benefits. Any existing corpora that are related to your effort will be relevant, since they were developed using the spec you want to adopt. Additionally, because resources such as these are often in wide use, searching the literature for mentions of the corpus will often lead you to papers that are relevant to your
own research goals, and will help you identify any problems that might be associated
with the dataset or specification. Finally, datasets that have been around long enough
often have tools and interfaces built around them that will make the datasets easier for
you to use.
Community-driven standards don't necessarily follow LAF guidelines,
or make use of other ISO standards. This doesn’t mean they should be
disregarded, but if interoperability is important to you, you may have
to do a little extra work to make your corpus fit the LAF guidelines.
We have a list of existing corpora in Appendix C to help you get started in finding
resources that are related to your own annotation task. While the list is as complete as
we could make it, it is not exhaustive, and you should still check online for resources
that would be useful to you. The list that we have was compiled from the LRE Map, a database of NLP resources maintained by the European Language Resources Association (ELRA).
Other Standards Affecting Annotation
While the ISO and community-driven standards are generally the only standards directly related to annotation and NLP, there are many standards in day-to-day life that can affect your annotation project. For example, the character encoding that you choose to store your data in (UTF-8, UTF-16, ASCII, etc.) will affect how easily other people
will be able to use your texts on their own computers. This becomes especially tricky if
you are annotating in a language other than English, where the alphabet uses different
sets of characters. Even languages with characters that overlap with English (French,
Spanish, Italian, etc.) can be problematic when accented vowels are used. We recommend using UTF-8 for encoding most languages, as it is an encoding that captures most
characters that you will encounter, and it is available for nearly all computing platforms.
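In practice this just means being explicit about the encoding whenever you read or write corpus files; a minimal sketch, with hypothetical filenames:

# Minimal sketch (hypothetical filenames): read and write corpus files with an
# explicit encoding so that accented characters survive intact when the files
# move between platforms.
with open("review0056.txt", encoding="utf-8") as source:
    text = source.read()

with open("review0056-copy.txt", "w", encoding="utf-8") as target:
    target.write(text)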
Other standards that can affect a project are those that vary by region, such as the
representation of dates and times. If you have a project in which it is relevant to know
when the document was created, or how to interpret the dates in the text, it's often necessary to know where the document originated. In the United States, dates are often represented as MM-DD-YYYY, whereas in other countries dates are written in the format DD-MM-YYYY. So if you see the date 01-03-1999 in a text, knowing where it's from
might help you determine whether the date is January 3 or March 1. Adding to the
confusion, most computers will store dates as YYYY-MM-DD so that the dates can be
easily sorted.
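The ambiguity is easy to see in code; this minimal sketch parses the same string under both conventions:

from datetime import datetime

# The same string yields different dates depending on which regional
# convention you assume.
date_string = "01-03-1999"

us_reading = datetime.strptime(date_string, "%m-%d-%Y")      # January 3, 1999
other_reading = datetime.strptime(date_string, "%d-%m-%Y")   # March 1, 1999
sortable = us_reading.strftime("%Y-%m-%d")                   # computer-friendly form

print(us_reading.date(), other_reading.date(), sortable)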
Similarly, naming conventions can also cause confusion. When annotating NEs, if you're making a distinction between given names and family names, again the origin of the text can be a factor in how the names should be annotated. This can be especially confusing, because while it might be a convention in a country for people to be referred to
by their family name first (as in Hungary, South Korea, or Japan), if the text you are
annotating has been translated, the names may (or may not) have been swapped by the translator to follow the conventions of the target language.
None of the issues we’ve mentioned should be deal breakers for your project, but they
are definitely things to be aware of. Depending on your task, you may also run into
regional variations in language or pronunciation, which can be factors that you should
take into account when creating your corpus. Additionally, you may need to modify
your model or specification to allow for annotating different formats of things such as
dates and names if you find that your corpus has more diversity in it than you initially
thought.
Summary
In this chapter we defined what models and specifications are, and looked at some of
the factors that should be taken into account when creating a model and spec for your
own annotation task. Specifically, we discussed the following:
• The model of your annotation project is the abstract representation of your goal, and the specification is the concrete representation of it.
• XML DTDs are a handy way to represent a specification; they can be applied directly to an annotation task.
• Most models and specifications can be represented using three types of tags: document-level labels, extent tags, and link tags.
• When creating your specification, you will need to consider the trade-off between generality and specificity. Going too far in either direction can make your task confusing and unmanageable.
• Searching existing datasets, annotation guidelines, and related publications and conferences is a good way to find existing models and specifications for your task. Even if no existing task is a perfect fit for your goal, modifying an existing specification can be a good way to keep your project grounded in linguistic theories.
• Interoperability and standardization are concerns if you want to be able to share your projects with other people. In particular, text encoding and annotation format can have a big impact on how easily other people can use your corpus and annotations.
• Both ISO standards and community-driven standards are useful bases for creating your model and specification.
• Regional differences in standards of writing, text representation, and other natural language conventions can have an effect on your task, and may need to be represented in your specification.
CHAPTER 5
Applying and Adopting Annotation Standards
Now that you've created the spec for your annotation goal, you're almost ready to actually
start annotating your corpus. However, before you get to annotating, you need to consider what form your annotated data will take—that is to say, you know what you want
your annotators to do, but you have to decide how you want them to do it. In this chapter
we'll examine the different formats annotation can take, and discuss the pros and cons
of each one by answering the following questions:
• What does annotation look like?
• Are different types of tasks represented differently? If so, how?
• How can you ensure that your annotation can be used by other people and in conjunction with other tasks?
• What considerations go into deciding on an annotation environment and data format, both for the annotators and for machine learning?
Before getting into the details of how to apply your spec to your corpus, you need to
understand what annotation actually looks like when it has been applied to a document
or text. So now let’s look at the spec examples from Chapter 4 and see how they can be
applied to an actual corpus.
There are many different ways to represent information about a corpus. The examples
we show you won’t be exhaustive, but they will give you an overview of some of the
different formats that annotated data can take.
Keep your data accessible. Your annotation project will be much easier
to manage if you choose a format for your data that's easy for you to modify and access. Using intricate database systems or complicated XML schemas to define your data is fine if you're used to them, but if you aren't, you'll be better off keeping things simple.
Annotation tasks range from simple document labeling to text extent tagging and tag
linking. As specs and tasks grow in complexity, more information needs to be contained
within the annotation. In the following sections we'll discuss the most common ways
that these tasks are represented in data, and the pros and cons of each style.
Metadata Annotation: Document Classification
In Chapter 4 we discussed one example of a document classification task, that of labeling
the genres of a movie based on a summary or review by using nonexclusive category
labels (a movie can be both a comedy and a Western, for example). However, before we
get to multiple category labels for a document, let’s look at a slightly simpler example:
labeling movie reviews as positive, negative, or neutral toward the movie they are reviewing. This is a simpler categorization exercise because the labels will not overlap;
each document will have only a single classification.
Unique Labels: Movie Reviews
Let’s say you have a corpus of 100 movie reviews, with roughly equal amounts of positive,
negative, and neutral documents. By reading each document, you (or your annotators)
can determine which category each document should be labeled as, but how are you
going to represent that information? Here are a few suggestions for what you can do:
• Have a text file or other simple file format (e.g., comma-separated) containing a list of filenames and their associated labels.
• Create a database file and have your annotators enter SQL commands to add files to the appropriate table.
• Create a folder for each label on your computer, and as each review is classified, move the file into the appropriate folder.
• Have the annotators change the filename to add the word positive, negative, or neutral, as in review0056-positive.txt.
• Add the classification inside the file containing the review.
Notice that these options run the gamut from completely external representations of the
annotation data, where the information is stored in completely different files, to entirely
internal, where the information is kept inside the same document. They also cover the
middle ground, where the information is kept in the filesystem—near the corpus but
not completely part of it.
So which of these systems is best? In Chapter 4, we discussed the importance of the LAF
standard, and explained why stand-off annotation is preferable to making changes to
the actual text of the corpus. So the last option on the list isn’t one that’s preferable.
But how do you choose between the other four options? They are all recording the
annotation information while still preserving the format of the data; is one really better
than the other? In terms of applying the model to the data, we would argue that no,
there's no real difference between any of the remaining options. Each representation
could be turned into any of the others without loss of data or too much effort (assuming
that you or someone you know can do some basic programming, or is willing to do some
reformatting by hand).
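For example, here is a minimal sketch (the directory layout and filenames are hypothetical) that converts the folder-per-label representation into a single comma-separated listing:

import csv
from pathlib import Path

# Minimal sketch (hypothetical layout): reviews/positive/, reviews/negative/,
# and reviews/neutral/ each hold the files given that label; write them all
# out as filename,label pairs.
corpus_root = Path("reviews")

with open("review_labels.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["filename", "label"])
    for label_dir in corpus_root.iterdir():
        if label_dir.is_dir():
            for review in label_dir.glob("*.txt"):
                writer.writerow([review.name, label_dir.name])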
So the actual decision here is going to be based on other factors, such as what will be
easiest for your annotators and what will result in the most accurate annotations. Asking
your annotators to learn SQL commands in order to create tables might be the best
option from your perspective, but unless your annotators are already familiar with that
language and an accompanying interface, chances are that using such a system will
greatly slow the annotation process, and possibly result in inaccurate annotations or
even loss of data if someone manages to delete your database.
Be aware of sources of error! Annotation tasks are often labor-intensive
and require attention to detail, so any source of confusion or mistakes
will probably crop up at least a few times.
Having your annotators type information can also be problematic, even with a simple
labeling task such as this one. Consider giving your annotators a folder of text files and
a spreadsheet containing a list of all the filenames. If you ask your annotators to fill in
the spreadsheet slots next to each filename with the label, what if they are using a program that will “helpfully” suggest options for autofilling each box? If you are using the labels positive, negative, or neutral, the last two both start with “ne”, and if an annotator gets tired or doesn't pay attention, she may find herself accidentally filling in the wrong
label. Figure 5-1 shows how easily this could happen in a standard spreadsheet editor.
In a situation like that, you might want to consider using a different set of words, such
as likes, dislikes, and indifferent.
Figure 5-1. A possible source of error in annotation
Of course, this doesn’t mean it’s impossible to complete a task by using a spreadsheet
and classifications that are a bit similar. In some cases, such circumstances are impossible
to avoid. However, it's never a bad idea to keep an eye out for places where mistakes can
easily slip in.
While we didn't discuss the movie review annotation scenario in Chapter 4, we have assumed here that we have a schema that contains three
categories. However, that is by no means the only way to frame this task
and to categorize movie reviews. In the Movie Review Corpus that
comes with the Natural Language Toolkit (NLTK), reviews are divided
into only positive and negative (based on the scores provided in the
reviews themselves), and RottenTomatoes.com also uses a binary classification. On the other hand, Metacritic.com rates everything on a scale
from 0 to 100.
Both of these websites provide annotation guidelines for reviews that
don’t give preassigned numeric ratings, and each of those websites has
its editors assign ratings based on their own systems (Metacritic.com;
RottenTomatoes.com).
Multiple Labels: Film Genres
As your tasks grow in complexity, there are more limiting factors for how to structure
your annotations. For example, there are a number of ways to approach the task of
labeling movie reviews that only allow one label per document, but what happens if it’s
possible for a document to have more than one label? In Chapter 4 we started discussing
a spec for a task involving labeling movie summaries with their associated genres. Let's expand on that example now, to see how we can handle more complex annotation tasks.
While it might be tempting to simply say, “Well, we'll only give a single label to each
movie,” attempting to follow that guideline becomes difficult quickly. Are romantic
comedies considered romances, or comedies? You could add “romantic comedy” as a
genre label, but will you create a new label for every movie that crosses over a genre line?
Such a task quickly becomes ridiculous, simply due to the number of possible combinations. So, define your genres and allow annotators to put as many labels as necessary on each movie (in Chapter 6 we'll discuss in more detail possible approaches to guidelines for such a task).
So how should this information be captured? Of the options listed for the movie review
task, some of them can be immediately discarded. Having your annotators change
the names of the files to contain the labels is likely to be cumbersome for both the
annotators and you: Casablanca-drama.txt is easy enough, but Spaceballs-
sciencefiction_comedy_action_parody.txt would be annoying for an annotator to create,
and equally annoying for you to parse into a more usable form (especially if spelling
errors start to sneak in).
Moving files into appropriately labeled folders is also more difficult with this task; a
copy of the file would have to be created for each label, and it would be much harder to
gather basic information such as how many labels, on average, each movie was given.
It would also be much, much harder for annotators to determine if they missed a label.
In Figure 5-1 we showed a sample spreadsheet with filenames and positive/negative/
neutral labels in different columns, with a different row for each review. While it would
certainly be possible to create a spreadsheet set up the same way to give to your annotators, it's not hard to imagine how error-prone that sort of input would be for a task
with even more category options and more potential columns per movie.
So where does that leave us? If none of the simpler ways of labeling data are available,
then it's probably time to look at annotation tools and XML representations of annotation data.
In this case, since the information you want to capture is metadata that’s relevant to the
entire document, you probably don’t need to worry about character offsets, so you can
have tags that look like this:
<GenreXML>
<FILM id = "f1" title = "Cowboys and Aliens" file_name = "film01.txt" />
<GENRE id = "g1" filmid = "f1" label = "western" />
<GENRE id = "g2" filmid = "f1" label = "sci-fi" />
<GENRE id = "g3" filmid = "f1" label = "action" />
</GenreXML>
This is a very simple annotation, with an equally simple DTD (Document Type Definition) [if you aren't sure how to read this DTD, refer back to the sidebar “What Is a DTD?” (page 68)]:
<!ENTITY name "GenreXML">
<!ELEMENT FILM (#PCDATA) >
<!ATTLIST FILM id ID #REQUIRED >
<!ATTLIST FILM title CDATA #IMPLIED >
<!ATTLIST FILM file_name CDATA #IMPLIED >
<!ELEMENT GENRE (#PCDATA) >
<!ATTLIST GENRE id ID #REQUIRED >
<!ATTLIST GENRE filmid CDATA #IMPLIED >
<!ATTLIST GENRE label ( action | adventure | classic | ... ) #REQUIRED >
This representation of the genre labeling task is not the only way to approach the problem
(in Chapter 4 we showed you a slightly different spec for the same task). Here, we have
two elements, film and genre, each with an ID number and relevant attributes; the
genre element is linked to the film it represents by the filmid attribute.
Don’t fall into the trap of thinking there is One True Spec for your task.
If you find that it's easier to structure your data in a certain way, or to add or remove elements or attributes, do it! Don't let your spec get in
the way of your goal.
By having the filename stored in the XML for the genre listing, it's possible to keep the annotation completely separate from the text of the file being annotated. However,
clearly the file_name attribute is not one that is required, and probably not one that
you would want an annotator to fill in by hand. But it is useful, and would be easy to
generate automatically during pre- or postprocessing of the annotation data.
Giving each tag an ID number (rather than only the FILM tags) may not seem very
important right now, but it’s a good habit to get into because it makes discussing and
modifying the data much easier, and can also make it easier to expand your annotation
task later if you need to.
At this point you may be wondering how all this extra information is going to help with
your task. There are a few reasons why you should be willing to take on this extra
overhead:
• Having an element that contains the film information allows the annotation to be kept either in the same file as the movie summary, or elsewhere without losing track of the data.
• Keeping data in a structured format allows you to more easily manipulate it later.
• Having annotation take the form of well-formatted XML can make it much easier to analyze later.
• Being able to create a structured representation of your spec helps cement your task, and can show you where problems are in how you are thinking about your goal.
• Representing your spec as a DTD (or other format) means you can use annotation tools to create your annotations. This can help cut down on spelling and other user-input errors.
Figure 5-2 shows what the film genre annotation task looks like in the Multipurpose
Annotation Environment (MAE), an annotation tool that requires only a DTD-like
document to set up and get running. As you can see, by having the genre options supplied
in the DTD, an annotator has only to create a new instance of the GENRE element and
select the attribute he wants from the list.
Figure 5-2. Genre annotation in MAE
The output from this annotation process would look like this:
<FILM id="f0" start="-1" end="-1" text="" title="Cowboys and Aliens" />
<GENRE id="g0" start="-1" end="-1" text="" label="action" />
<GENRE id="g1" start="-1" end="-1" text="" label="sci-fi" />
<GENRE id="g2" start="-1" end="-1" text="" label="western" />
There are a few more elements here than the ones specified in the DTD shown earlier
—most tools will require that certain parameters be met in order to work with a task,
but in most cases those changes are superficial. In this case, since MAE is usually used
to annotate parts of the text rather than create metatags, the DTD had to be changed to
allow MAE to make GENRE and FILM nonconsuming tags. That's why the start and end attributes are set to –1, to indicate that the scope of the tag isn't limited to certain characters in the text. You'll notice that here, the filmid attribute in the GENRE tag is not
present, and neither is the file_name attribute in the FILM tag. While it wouldn’t be
unreasonable to ask your annotators to assign that information themselves, it would be
easier—as well as both faster and more accurate—to do so with a program.
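A minimal sketch of that kind of postprocessing step is shown below; the filenames are hypothetical, and it assumes the output tags are stored in a well-formed XML file with a single root element:

import xml.etree.ElementTree as ET

# Minimal sketch (hypothetical filenames): after annotation, fill in the
# file_name attribute on the FILM tag and point each GENRE tag back at the
# FILM tag's ID, rather than asking annotators to type these values.
annotation_file = "film01_annotation.xml"
source_file = "film01.txt"

tree = ET.parse(annotation_file)
film = tree.getroot().find(".//FILM")
film.set("file_name", source_file)

for genre in tree.getroot().iter("GENRE"):
    genre.set("filmid", film.get("id"))

tree.write(annotation_file)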
If you plan to keep the stand-off annotation in the same file as the text that's being annotated, then you might not need to add the file information to each tag. However, annotation data can be a lot easier to analyze and manipulate if it doesn't have to be extracted from the text it's referring to, so keeping your tag information in different files that refer
back to the originals is generally a best practice.
Text Extent Annotation: Named Entities
The review classification and genre identification tasks are examples of annotation labels
that refer to the entirety of a document. However, many annotation tasks require a finer-
grained approach, where tags are applied to specific areas of the text, rather than all of
it at once. We already discussed many examples of this type of task: part-of-speech (POS)
tagging, Named Entity (NE) recognition, the time and event identification parts of
TimeML, and so on. Basically, any annotation project that requires sections of the text
to be given distinct labels falls into this category. We will refer to this as extent annotation, because it's annotating a text extent in the data that can be associated with character
locations.
In Chapter 4 we discussed the differences between stand-off and inline annotation, and
text extents are where the differences become important. The metadata-type tags used
for the document classification task could contain start and end indicators or could
leave them out; their presence in the annotation software was an artifact of the software
itself, rather than a statement of best practice. However, with stand-off annotation, it is
required that locational indicators are present in each tag. Naturally, there are multiple
ways to store this information, such as:
• Inline annotation
• Stand-off annotation by location in a sentence or paragraph
• Stand-off annotation by character location
In the following sections we will discuss the practical applications of each of these
methods, using Named Entity annotation as a case study.
As we discussed previously, NE annotation concerns marking up what you probably
think of as proper nouns—objects in the real world that have specific designators, not
just generic labels. So, “The Empire State Building” is an NE, while “the building over
there” is not. For now, we will use the following spec to describe the NE task:
<!ENTITY name "NamedEntityXML">
<!ELEMENT NE (#PCDATA) >
<!ATTLIST NE id ID #REQUIRED >
<!ATTLIST NE type ( person | title | country | building | business |...) #REQUIRED >
<!ATTLIST NE note CDATA #IMPLIED >
Inline Annotation
While we still strongly recommend not using this form of data storage for your annotation project, the fact remains that it is a common way to store data. The phrase “inline
annotation” refers to the annotation XML tags being present in the text that is being
annotated, and physically surrounding the extent that the tag refers to, like this:
<NE id="i0" type="building">The Massachusetts State House</NE> in <NE id="i1" type="city">Boston, MA</NE> houses the offices of many important state figures, including <NE id="i2" type="title">Governor</NE> <NE id="i3" type="person">Deval Patrick</NE> and those of the <NE id="i4" type="organization">Massachusetts General Court</NE>.
If nothing else, this format for annotation is extremely difficult to read. But more important, it changes the formatting of the original text. While in this small example there may not be anything special about the text's format, the physical structure of other
documents may well be important for later analysis, and inline annotation makes that
difficult to preserve or reconstruct. Additionally, if this annotation were to later be
merged with, for example, POS tagging, the headache of getting the two different tagsets
to overlap could be enormous.
Not all forms of inline annotation are in XML format. There are other ways to mark up
data that is inside the text, such as using parentheses to mark syntactic groups, as was
done in the following Penn TreeBank II example, taken from “The Penn TreeBank: Annotating Predicate Argument Structure” (Marcus et al. 1994):
(S (NP-SUBJ I)
(VP consider
(S (NP-SUBJ Kris)
(NP-PRD a fool))))
There are still many programs that provide output in this or a similar format (the
Stanford Dependency Parser is one example), and if you want to use tools that do this,
you may have to find a way to convert information in this format to stand-off annotation
to make it maximally portable to other applications.
Of course, there are some benefits to inline annotation: it becomes unnecessary to keep
special track of the location of the tags or the text that the tags are surrounding, because
those things are inseparable. Still, these benefits are fairly shortsighted, and we strongly
recommend not using this paradigm for annotation.
Another kind of inline annotation is commonly seen in POS tagging, or other tasks
where a label is assigned to only one word (rather than spanning many words). In fact,
you already saw an example of it in Chapter 1, in the discussion of the Penn TreeBank.
“/” From/IN the/DT beginning/NN ,/, it/PRP took/VBD a/DT man/NN with/IN extraordinary/JJ qualities/NNS to/TO succeed/VB
in/IN Mexico/NNP ,/, “/” says/VBZ Kimihide/NNP Takimura/NNP ,/, president/NN of/IN Mitsui/NNS group/NN ’s/POS
Kensetsu/NNP Engineering/NNP Inc./NNP unit/NN ./.
Here, each POS tag is appended as a suffix directly to the word it is referring to, without
any XML tags separating the extent from its label. Not only does this form of annotation
make the data difficult to read, but it also changes the composition of the words themselves. Consider how “group’s” becomes “group/NN ’s/POS”—the possessive “’s” has been separated from “group,” now making it even more difficult to reconstruct the original text. Or imagine trying to reconcile an annotation like this one with the NE example in the previous section! It would not be impossible, but it could certainly cause
headaches.
While we don't generally recommend using this format either, many existing POS taggers and other tools were originally written to provide output in this way, so it is something you should be aware of, as you may need to realign the original text with the new POS tags.
We are not, of course, suggesting that you should never use tools that
output information in formats other than some variant of stand-off annotation. Many of these tools are extremely useful and provide very
accurate output. However, you should be aware of problems that might
arise from trying to use them.
Another problem with this annotation format arises when it is applied to the NE task, which requires that a single tag apply to more than one word at the same time. There is an important distinction between applying the same
tag more than once in a document (as there is more than one NN tag in the Penn TreeBank
example), and applying one tag across a span of words. Grouping a set of words together
by using a single tag tells the reader something about that group that having the same
tag applied to each word individually does not. Consider these two examples:
<NE id="i0" type="building">The Massachusetts State House</NE> in <NE id="i1" type="city">Boston, MA</NE>

The/NE_building Massachusetts/NE_building State/NE_building House/NE_building in Boston/NE_city ,/NE_city MA/NE_city …
In the first example, it is clear that the phrase “The Massachusetts State House” is one unit as far as the annotation is concerned—the NE tag applies to the entire group. On the other hand, in the second example, the same tag is applied individually to each token, which makes it much harder to determine if each token is an NE on its own,
or if there is a connection between them. In fact, we end up tagging some tokens with
the wrong tag! Notice that the state “MA” has to be identified as “/NE_city” for the span
to be recognized as a city.
Stand-off Annotation by Tokens
One method that is sometimes used for stand-off annotation is tokenizing (i.e., separating) the text input and giving each token a number. The tokenization process is usually based on whitespace and punctuation, though the specific process can vary by program (e.g., some programs will split “’s” or “n’t” from “Meg’s” and “don’t,” and others will not).
The text in the appended annotation example has been tokenized—each word and
punctuation mark has been pulled apart.
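If you want to see how much tokenizers can differ, here is a minimal sketch comparing a plain whitespace split with NLTK's default tokenizer on an invented sentence (it assumes NLTK and its punkt models are installed):

import nltk

sentence = "Meg's dog doesn't like the rain."

# Whitespace tokenization keeps contractions and punctuation attached:
print(sentence.split())
# ["Meg's", 'dog', "doesn't", 'like', 'the', 'rain.']

# NLTK's tokenizer splits off "'s", "n't", and the final period, roughly:
# ['Meg', "'s", 'dog', 'does', "n't", 'like', 'the', 'rain', '.']
print(nltk.word_tokenize(sentence))

# Numbering the tokens gives you something to anchor stand-off tags to:
print(list(enumerate(nltk.word_tokenize(sentence), start=1)))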
Taking the tokenized Penn TreeBank text above as an example, there are a few different ways to identify the
text by assigning numbers to the tokens. One way is to simply number every token in
order, starting at 1 (or 0