Following Natural Language Route Instructions
Copyright by Matthew Tierney MacMahon 2007

The Dissertation Committee for Matthew Tierney MacMahon certifies that this is the approved version of the following dissertation:

Following Natural Language Route Instructions

Committee:

Benjamin J. Kuipers, Supervisor
Joydeep Ghosh
Jonas Kuhn
Dewayne E. Perry
Brian J. Stankiewicz

Following Natural Language Route Instructions

by

Matthew Tierney MacMahon, B.S., M.S.E.

Dissertation Presented to the Faculty of the Graduate School of The University of Texas at Austin in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

The University of Texas at Austin
August 2007

To my parents, Paul and B.J., for encouraging both wonder and accomplishment. To my wife, Sarah, for her unflagging love, support, and understanding. To all my friends, who have helped in innumerable ways.

Abstract

Following Natural Language Route Instructions

Publication No.

Matthew Tierney MacMahon, Ph.D.
The University of Texas at Austin, 2007

Supervisor: Benjamin J. Kuipers

Following natural language instructions requires transforming language into situated conditional procedures; robustly following instructions, despite the director's natural mistakes and omissions, requires the pragmatic combination of language, action, and domain knowledge. This dissertation demonstrates a software agent that parses, models, and executes human-written natural language instructions to accomplish complex navigation tasks. We compare its performance against people following the same instructions. By selectively removing various syntactic, semantic, and pragmatic abilities, this work empirically measures how often these abilities are necessary to correctly navigate along extended routes through unknown, large-scale environments to novel destinations. To study how route instructions are written and followed, this work presents a new corpus of 1520 free-form instructions from 30 directors for 252 routes in three virtual environments.
101 other people followed these instructions and rated them for quality, successfully reaching and identifying the destination on only approximately two-thirds of the trials. Our software agent, MARCO, followed the same instructions in the same environments with a success rate approaching human levels. Overall, instructions subjectively rated 4 or better of 6 comprise just over half of the corpus; MARCO performs at 88% of human performance on these instructions. MARCO's performance was a strong predictor of human performance and ratings of individual instructions. Ablation experiments demonstrate that implicit procedures are crucial for following verbal instructions using an approach integrating language, knowledge, and action. Other experiments measure the performance impact of linguistic, execution, and spatial abilities in successfully following natural language route instructions.

Acknowledgments

Like all large endeavors, this work would have never been accomplished without the support of others. Thanks to my wife, Sarah, who has supported me throughout the process, even when the end continued to slip from view. Thanks to my family, who supported me even when they didn't pretend to understand what I'm doing or why I'm doing it. Thanks to all of our friends who did the same. Thanks to my mom, who helped me get started with computer science, typing in BASIC programs on our Commodore 64. Thanks to my dad, who encouraged me to reach for the stars.

My friends helped me throughout, even when I asked for yet another ride, room, or spot to leave my things. I'd like to thank all of the Lipsters, especially Austin and Lisa, Ryan and Renee, Karl and Cheryl, Dung and Andrea, Dave and Jiseon, Anuj, Daniel, Paul, and Stacy. Thanks to JP, Kyler, Sahar, Jane, Matt, Matt, Amy, Travis, and Chaz for their thoughts on my work and for taking in a wayward computer geek in a Psych Lab.
Thanks especially to Travis and Chaz for running subjects for me and to JP for programming for the Psych Lab, so I wouldn't have to. Thanks to Pat, Joseph, Jeff, Aniket, Ram, Shilpa, Jonathon, and the rest of the Robotics lab for likewise taking in a wayward cognitive scientist who likes to play with robots. Thanks to the great people I worked with at NASA JSC, especially those who still ask how it's going and if I'll finish, especially Dave, Debbie, Scott, Pete, Eric, Rob, Jeff, Dan, and Rich. You got me hooked on reactive execution and human-robot interaction. Thanks to the gang at NRL, who similarly support inter-disciplinary research on human-robot interaction, especially Alan, Magda, Dennis, Bill, and Sam. This project grew out of my summers there, working on GRACE and GEORGE. Thanks to all the dispersed, diverse Grand Challenge team members, too. Thanks to the good folks at NASA Ames, for letting me play with their robots last summer and for their ideas and advice, especially Vandi, Ari, Hans, Michael, and Mike.

Thanks to all the great teachers I've had along the way, in Flower Mound, in Denton, in Palo Alto, in Berlin, in Vienna, in Houston, in Austin, in Washington, D.C., and in Mountain View. Thanks to Joydeep, Jonas, and Dewayne for advice, pointers, and for listening and reading. Thanks to Ben for asking for the impossible and to Brian for pointing out the possible and to both for supporting me throughout this strange quest, as mentors, as advisors, as bosses, and as friends. Thanks to all the people I forgot to thank in the rush to finally finish. Thanks most of all to everyone who went out of their way to help me through this, even with every choice I made to make things harder on myself.
This work was partially accomplished in three summer internships at the Naval Research Laboratory, in the Navy Center for Applied Research in Artificial Intelligence, and one summer internship at NASA Ames Research Center, in the Control Agent Architectures and Intelligent Robotics Groups. The remainder of the work was accomplished at the University of Texas at Austin, in the Shape and Space Laboratory and the Intelligent Robotics Laboratory, in rented rooms in Austin, San Antonio, and Oakland, on planes in between, in coffee shops around the world, and hither and yon.

This work was supported by AFOSR grants FA9550-04-1-0236, FA9550-05-1-0321 and NIH grant EY016089 to Brian J. Stankiewicz, by NSF grant IIS-0413257 to Benjamin J. Kuipers, and by support for Matt MacMahon under ONR work order N0001405WX30001 for the NRL Research Option, Coordinated Teams of Autonomous Systems, and from the California Space Grant Foundation.

Matthew Tierney MacMahon
The University of Texas at Austin
August 2007

Contents

Abstract
Acknowledgments
Contents
List of Tables
List of Figures

Chapter 1  Introduction
1.1  Route Instruction Domain and Related Work
     1.1.1  Route Instructions Understanding is Tractable
     1.1.2  Do what I mean, not (just) what I said
     1.1.3  Clear Evaluation and Comparison to People
     1.1.4  Real-World Applications
1.2  Language and Task Corpus Methodology
     1.2.1  General Definition of Language and Task Corpus Methodology
     1.2.2  Application of Language and Task Corpus Methodology to Study Spatial Route Instructions
1.3  Human Route Instruction Studies
1.4  Resolving Linguistic Ambiguity by Task Execution
1.5  MARCO Architecture
     1.5.1  Conceptual Structure of Instructions
1.6  MARCO Route Instruction Studies
     1.6.1  Ablation Studies of Natural Route Instructions
1.7  Summary

Chapter 2  Related Work in Spatial Language and Instruction Following
2.1  Psychological Studies of Spatial Language
     2.1.1  Spatial language and spatial cognition
2.2  Psychological Studies of Route Instructions
     2.2.1  Gary Allen and colleagues
     2.2.2  Michel Denis and colleagues
     2.2.3  Barbara Tversky and colleagues
     2.2.4  Other work
2.3  Route Instruction Generators
2.4  Automated Instruction Following and Analysis
     2.4.1  Computational Models of Spatial Prepositions
     2.4.2  Computational models of route instructions
     2.4.3  GRACE and GEORGE in the AAAI Robot Challenge
     2.4.4  Instruction-Based Learning (IBL) Project
     2.4.5  Spatial Ontologies and Spatial Instructions

Chapter 3  Language and Task Corpus Methodology
3.1  Language and Task Corpus Methodology for Spatial Route Instructions
3.2  Related methodologies for evaluating natural language understanding systems
3.3  Related methodologies for studying natural language tasks
     3.3.1  MAP-TASK Corpus Methodology
Chapter 4  Human Route Instruction Experiments
4.1  Motivating questions
4.2  Study overview and motivations
4.3  Apparatus
     4.3.1  Control of movement through the environment
4.4  Stimuli
     4.4.1  Environment Maps
     4.4.2  Environment Landmarks
4.5  Human Directors Learn, Navigate, and Describe
     4.5.1  Procedure
     4.5.2  Route Instruction Entry
     4.5.3  Experiment 1: Six directors across all three environments
     4.5.4  Experiment 2: Twelve directors each in one environment
     4.5.5  Experiment 3: Twelve directors with continuous motion control
4.6  Route Instruction Corpus Language Statistics
4.7  Human Followers Read, Navigate, and Evaluate
     4.7.1  Procedure
     4.7.2  Experiment 1: 24 people following 6 directors' instructions
     4.7.3  Experiment 2: 44 people following 18 directors' instructions
     4.7.4  Experiment 3: 24 people following 12 directors' instructions
4.8  Human Task Performance Overview
     4.8.1  Differences in Directors
     4.8.2  Differences in Human Followers between Corpus 1 and Corpus 2
     4.8.3  Gender-linked performance differences
4.9  Discussion
Chapter 5  Spatial Route Instructions in the MARCO Architecture
5.1  Understanding and Following Route Instructions in Context
5.2  Syntax Parser
     5.2.1  Training the Probabilistic Context-Free Grammar
     5.2.2  Robustness for the Syntax Parser
5.3  Content Framer
     5.3.1  Robustness to unknown words and sentence structures
5.4  Instruction Modeler
     5.4.1  Representing Referring Phrases as View Descriptions
     5.4.2  Representing Conditional Actions as Procedural Specifications
5.5  Executor: Interleaving Action and Perception
     5.5.1  Inferring procedures implicit in instructions
     5.5.2  Recognizing syntactic, semantic, pragmatic, and exploratory cues
     5.5.3  Executing an Example from the Route Instruction Corpus
     5.5.4  Other work on understanding implicit procedures
5.6  Robot Controller
5.7  View Description Matcher
5.8  Modeling Route Instructions in the HSSH Ontology
     5.8.1  Relation to the Spatial Semantic Hierarchy
     5.8.2  Modeling route instructions by topological maps
5.9  Extension to handle other sorts of ambiguity

Chapter 6  MARCO Route Instruction Corpus Experiments
6.1  MARCO Followers Model the Text and Navigate
     6.1.1  Apparatus
     6.1.2  Stimuli
     6.1.3  Procedure
     6.1.4  Evaluation
6.2  Full MARCO Performance
6.3  Comparing Implicit Procedures Inference to Fundamental Explicit Navigation Procedures
6.4  Implicit Procedures in Route Instructions
     6.4.1  Implicit Procedure Cues Results by Rating
6.5  Inferring Different Types of Implicit Procedures
     6.5.1  Implicit Procedure Experiment Results by Rating
6.6  Object, Structural and Appearance Landmarks
     6.6.1  Review of Landmark Types
     6.6.2  Landmark Recognition Ablation Study
     6.6.3  Landmark Recognition Ablation Results
6.7  Hybrid Spatial Semantic Hierarchy
     6.7.1  Review of the Hybrid Spatial Semantic Hierarchy
     6.7.2  Hybrid Spatial Semantic Hierarchy Ablation Experiment
     6.7.3  Hybrid Spatial Semantic Hierarchy Ablation Results
     6.7.4  Extensions to the Hybrid Spatial Semantic Hierarchy
6.8  Grammar Cross-validation
     6.8.1  Review of Cross-Validation Methodology
     6.8.2  MARCO Grammar Cross-Validation Experiments
     6.8.3  MARCO Grammar Cross-Validation Results
     6.8.4  Cross-Validation Discussion
6.9  Human-MARCO Discrepancy Analysis
6.10 Comparison to related work
     6.10.1  Comparison to the Instruction-Based Learning project
6.11 Conclusions from MARCO experiments

Chapter 7  Conclusions
     7.0.1  Future Work
7.1  Empirical examination of route instruction following

Appendix A  Human Experiment Materials
A.1  Software configuration
A.2  Running the experiment
A.3  Example Consent Form
A.4  Director Guide
     A.4.1  Exploration
     A.4.2  Navigation Quiz
     A.4.3  Giving Route Instructions
A.5  Director Key Meanings
A.6  Instruction Follower Guide
     A.6.1  Movement Controls
     A.6.2  Following Route Instructions
A.7  Follower Key Meanings

Appendix B  MARCO Ablation Options
B.1  Options: Fundamentals
B.2  Options: Conditionals
B.3  Options: Heuristics
B.4  Options: Recoveries
B.5  Options: Tweaks
B.6  Options: Linguistics
B.7  Options: Implicits
B.8  Options: Landmarks
B.9  Options: HSSH
B.10 Options: Comparison

Appendix C  Glossary and Language Model
C.1  Glossary
     C.1.1  Route Instruction Analysis Vocabulary
     C.1.2  Abbreviations
C.2  Route Instruction Grammar
     C.2.1  Verbs
     C.2.2  Nouns
     C.2.3  Turn Command Arguments
     C.2.4  Travel Command Arguments
     C.2.5  Description Utterance Arguments
     C.2.6  Adjectives
     C.2.7  Adjectival Phrases
C.3  Interfaces
     C.3.1  Simulation
C.4  Representation of Procedural Specifications
C.5  Representation of View Description

Bibliography

Vita

List of Tables

4.1  Characteristics of the three testing environments
4.2  Statistics per group of instructions by each director
4.3  Corpora text statistics averaged per director group
4.4  Most frequent words per group of instructions
4.5  Most distinctive words per group of instructions
6.1  Comparing performance of MARCO to people and to MARCO following only explicitly commanded procedures
6.2  Comparing performance of MARCO on inferring implicit procedures and executing types of explicitly commanded procedures
6.3  Comparing performance of MARCO versions without different implicit procedure cues
6.4  Comparing the performance of MARCO without implicit turns and travels
6.5  Comparing the performance of MARCO without the ability to recognize different kinds of landmarks
6.6  Performance across subjective ratings ablating representations from the HSSH
6.7  Performance on cross-validation runs by subjective rating
6.8  Discrepancy analysis for sampled instructions from Corpus 1
6.9  Discrepancy analysis for sampled instructions from Corpus 2
6.10 Discrepancy analysis for sampled instructions from Corpus 3
A.1  Combinations for director experiments 2 (Corpus 2) and 3 (Corpus 3)

List of Figures

4.1  Sample views of all objects, from the Medium environment
4.2  Sample views of all hallway textures, from the Compact environment
4.3  Example of route instruction window, with entered text
4.4  Dialog box for rating route instructions
4.5  Dialog box for rating navigation confidence
4.6  Occurrence and success rates over all human followers by mean rating
4.7  Mean performance over all human followers
4.8  Success rates over all human followers per Corpus 1 director
4.9  Success rates over all human followers per Corpus 2 director
4.10 Success rates over all human followers per Corpus 3 director
4.11 Success rates for all human followers from each experiment by instruction rating
4.12 Success rates for human followers and directors by gender
5.1  MARCO linguistic modules modeling a route instruction text
5.2  How interpreting an utterance depends on the follower's knowledge of its pose in the environment
6.1  Human visual view and MARCO symbolic view of a hallway
6.2  Human visual view and MARCO symbolic view of a shorter hallway
6.3  Human and full MARCO success rates versus post hoc human ratings
6.4  Comparing inferring implicit procedures vs. executing types of explicit procedures
6.5  Performance for people, MARCO, and MARCO without implicit procedure cues
6.6  Implicit turns have a higher impact than implicit travels
6.7  Success rates for MARCO without the ability to recognize types of landmarks
6.8  Performance graph ablating representations from the Hybrid Spatial Semantic Hierarchy
6.9  Success rates for MARCO under cross-validation across instruction quality
6.10 Success rates for MARCO under cross-validation per director
A.1  Vizard Experiment Starting Dialog

Chapter 1  Introduction

Since Turing's seminal paper (1950), a natural response to language has been regarded as the key test of intelligence.
Full and rich language use is perhaps the most defining characteristic of human intelligence. However, despite more than a half-century of trying, the goal of natural language interaction with an artificial partner remains distant.

One of the most practical applications of natural language is following verbal instructions. With instructions, one person, the director, uses language to guide others, the followers, in accomplishing complex tasks when the followers do not have the knowledge, expertise, authority, or time to plan. Let us distinguish instructions, which describe complex procedures of multiple actions, from commands, which each impel a single procedure. Instructions are ubiquitous across human tasks. Common applications include recipes, assembly instructions, spatial route instructions, repair manuals, tutoring, and coaching, as well as innumerable unnamed, ad hoc interactions where one person guides another through a procedure.

For perfectly clear, explicit, and correct instructions, the instruction follower can directly map the syntax and surface semantics of what was said into an imperative model of what to do. To handle instructions that are partially unclear, implicit, or incorrect, the follower must achieve a deeper understanding of the pragmatics of the instructions; that is, what the director meant the follower to do. How often instructions fail to be clear, complete, or correct is an empirical question that must be answered to understand the difficulty of following route instructions. This dissertation aims to provide a computational model of route instruction following with accuracy and robustness comparable to human performance, despite natural mistakes, omissions, and variation in the instructions.

Five questions must be answered to build and evaluate a natural language understanding system that follows instructions: (1) How do directors naturally describe the task in instructions?
(2) How do human followers act when following these instructions – particularly, how often do people succeed at the task given the instructions? (3) What challenges arise in following instructions and how can they be robustly handled? (4) What computational principles are necessary to follow route instructions, especially to handle under-specification and ambiguity? (5) How much does each system component impact the system's performance, measured by how often the system successfully follows natural instructions?

1.1 Route Instruction Domain and Related Work

Verbal route instructions are a type of instruction in which the director intends to guide a mobile agent, the follower, to a specific spatial destination. When following route instructions, the follower must parse and interpret the text, model the instruction's actions and descriptions, and perform perceptive and movement actions to navigate to the destination. Correctly following instructions requires executing both explicitly commanded procedures and procedures implicit in the instruction language.

Route instructions make a compelling domain for several reasons. Instructions require the integration of language, action, and reasoning skills, and so draw on research in linguistics, psychology, and artificial intelligence. Route instructions are constrained enough to be tractable, while being complex enough to be interesting. The evaluation of following route instructions is clear – whether the follower ends at the goal. Finally, spatial route instructions have compelling real-world applications.

1.1.1 Route Instructions Understanding is Tractable

Spatial route instructions have several distinct features within the space of instructions that make building a follower tractable.
First, there are well-described theories of acting, reasoning, and talking about large-scale spaces (Kuipers et al., 2004; Kuipers, 2000; Landau and Jackendoff, 1993; Siegel and White, 1975; Timpf et al., 1992; Yeap and Jefferies, 1999). Second, moving through large-scale space requires just two fundamental spatial actions, turns and travels, reducing the number of verbs to model. For the pure task of route following, few manner verbs, which specify how to perform actions, are necessary. Third, moving through space is inherently sequential, and the structure of route instructions tends to also be highly sequential. Some other types of instructions, such as recipes, may involve executing many extended actions in parallel. Finally, as discussed below, route instructions provide a clear and unambiguous evaluation metric. With multiple people following each set of instructions, or by measuring the efficiency of navigation, route instructions also allow measuring a gradation of success.

1.1.2 Do what I mean, not (just) what I said

Route instructions are not trivial, however. Route instructions allow a lot of variation in which landmarks and route attributes are described and how they are verbalized. When more than one route is possible, directors can also describe different procedures that will accomplish the task. Route instructions require more than mindlessly executing the explicit instructions when they are ambiguous, erroneous, or contain implicit procedures. The follower must determine which necessary, but unstated, procedures satisfy the stated or implicit preconditions. The follower must apply knowledge of large-scale space and knowledge of linguistic conventions in route instructions in order to succeed, despite the challenges of natural instructions.

1.1.3 Clear Evaluation and Comparison to People

Route instructions provide a clear evaluation metric: Does the follower understand the instructions well enough to arrive at the destination?
The task of navigating along a complex route through a large-scale, unknown space is a sequential decision-making problem. If any mistake or omission – by either the director or the follower – is not corrected, the follower will end up at the wrong destination. This distinguishes the task from other task-centered language studies, where people can correctly finish at a fuzzy distribution of places (Roy, 2005; Skubic et al., 2004b; Tellex and Roy, 2006), where the result of text understanding is text retrieval or generation (Manning and Schütze, 1999; Reiter and Dale, 1997), or where the success criterion is unclear (Anderson, 1984; Anderson et al., 1991; Carletta and Mellish, 1996; Levit and Roy, 2007).

The situated context of instruction-directed task execution gives an empirical view into when inference is required. There is a long history of work on drawing the pragmatic meaning from a text – what is entailed or implicated by what was said (Grice, 1967, 1975, 1989; Grodner and Sedivy, 2004; Perrault and Allen, 1980; Sedivy, 2003; Sperber and Wilson, 1986; Stone et al., 2003). However, this leaves open the question of how often these implicatures and entailments are necessary to understand the text, for instance, to follow instructions successfully.

Natural language under-specifies descriptions; route instructions are no exception. Only some aspects of some landmarks and some features of some actions along the route are described. The number of conceivable implicatures and entailments from a set of route instructions is large, if not infinite. By comparing the performance of software agents to human performance, we can discover which inferences are necessary to follow the instructions and how often people make them.

1.1.4 Real-World Applications

Route instructions are potentially useful anywhere a person is interacting with a mobile robot in a complex environment.
Route instructions can be used as an assistive technology, to ease (or allow) the control of smart wheelchairs (Beeson et al., 2007; Simpson, 2005; Tellex and Roy, 2006). Route instructions could help in Urban Search and Rescue, to guide a rescue or exploratory robot through an unknown environment even if communication is lost (Burke et al., 2004; Murphy, 2004). Route instructions would also assist astronauts supervising semi-autonomous robots (Burridge et al., 2003; Fong and Nourbakhsh, 2005; Kortenkamp et al., 1998).

1.2 Language and Task Corpus Methodology

This thesis looks at the problem of following natural instructions. Let us use the term natural instructions to refer to instructions that have the characteristics of naturally occurring instructions: instructions that are free-form, natural language with unconstrained vocabulary or grammar and containing naturally occurring mistakes and omissions. Specifically, to elicit natural instructions, directors learned the environment and navigation tasks through first-person experience, recalled the environment from memory while planning the task, and gave the instructions to unknown followers with minimal knowledge of the environments. Each of these factors increases the ecological validity of the instruction corpus (Cohen, 1995; Rosenthal and Rosnow, 1991).

1.2.1 General Definition of Language and Task Corpus Methodology

The first general contribution of this dissertation is a methodology for building software systems that follow free-form, natural language instructions. First, gather a corpus of people giving instructions to accomplish concrete tasks in the domain. Second, observe other people following and rating these instructions to accomplish the task. Third, build a working skeletal system able to interpret and execute the most common instructions in the domain.
Comparing people's performance in following the instructions with the system's performance shows where to focus development effort, as well as providing regression testing. Once the system approaches human performance, we can measure the impact of each component in the system – or each feature in the instructions – by measuring the performance of the system without that component. Note that while this work focuses on spatial route instructions, the language and task corpus methodology is general to any procedural instructions.

1.2.2 Application of Language and Task Corpus Methodology to Study Spatial Route Instructions

This dissertation applies the language and task corpus methodology to spatial route instructions, implementing a system that follows spatial route instructions through an unknown large-scale space. By ablating components from the route instruction following agent and observing their impact on performance, we measure the importance of general instruction-following abilities, domain-specific navigation and spatial reasoning abilities, and the combination of both general and task-specific abilities. These computational experiments reveal which skills are necessary to follow natural instructions and how often each is required.

1.3 Human Route Instruction Studies

We gathered a large corpus of route instructions from three human experiments, using the same apparatus, stimuli, and per-trial procedure. In the first study, six directors each gave instructions for 42 routes between named positions in three virtual environments. These instructions were used to develop a software route instruction follower, MARCO. In the second study, an evaluation corpus was gathered, where 24 directors each gave 42 instructions in one of the three environments. This study also varied the start and end points of the requested routes and tested both continuous and discrete movement through the environments.
The second half of the corpus is used to evaluate the generality of MARCO's methods across individual differences in linguistic and spatial reasoning for route instructions.

1.4 Resolving Linguistic Ambiguity by Task Execution

The challenges in instruction following, and the methods for handling them, occur at every level of the computational architecture. At the lexical level, the system should be robust to unknown words. At the syntactic level, it should handle the variety of utterance forms that people can understand, whether or not they are formally grammatical. At the semantic level, the system extracts the director's intended meaning across differences in surface form. The semantic model of the instructions captures both the imperative and declarative constraints in the instructions – both how the follower should act and what the follower should expect to observe. At the pragmatic level, the agent infers which procedures are required, whether they are left implicit – or even stated incorrectly – in the route instructions. We have implemented these methods as an integrated system, MARCO, and we have shown that they achieve comparable levels of performance to human followers, across the range of quality of human-provided route instructions. The architecture is described in detail in Chapter 5, and the individual challenges and methods are described in Chapter 6. The performance of the full agent approaches the performance of people following the instructions overall; for the best-rated instructions, MARCO performs nearly at human levels. This empirical validation supports the hypothesis that the MARCO architecture captures the challenges in spatial route instruction following and the methods that humans use to meet those challenges. The major conceptual contribution of the MARCO architecture is deferred handling of ambiguity. Both referring phrase resolution and modeling instructions as a sequence of procedures introduce semantic ambiguity.
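Deferred handling of ambiguity can be pictured as keeping a referring phrase unresolved – a predicate over percepts rather than a bound referent – until the agent is situated and can actually perceive the scene. The following is a minimal sketch of that idea; the function and variable names are invented for illustration and do not reflect MARCO's implementation:

```python
def make_referent(description: str):
    """Return a deferred referent: a predicate over percepts that is
    matched only once the agent is situated and perceiving the scene."""
    def matches(percepts: list[str]) -> list[str]:
        # Trivial matching by substring; a real system would match
        # structured view descriptions against perceptual features.
        return [p for p in percepts if description in p]
    return matches

# "the blue hall" stays ambiguous until a concrete scene is perceived.
the_blue_hall = make_referent("blue hall")
scene = ["blue hall ahead", "grey hall to the left"]
assert the_blue_hall(scene) == ["blue hall ahead"]
```

Because resolution happens only when the predicate is applied to a percept list, an ambiguous phrase costs nothing until the agent reaches the relevant point in the route, where it shares more perceptual common ground with the director.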
Referring phrases vastly under-specify the configuration, even combined with other knowledge explicit in the instructions or in unspoken shared heuristics – common sense. A paragraph of instructions vastly under-specifies the procedures necessary to accomplish the described task. Even a single instruction command can refer to a complex conditional procedure (Bugmann et al., 2004; Lauria et al., 2002a; Simmons et al., 2003; Tellex and Roy, 2006). MARCO handles what Sperber and Wilson (1986, 2004) called "contextual implication": conclusions derivable from both the linguistic input and the external context, but from neither alone. MARCO defers handling some linguistic ambiguity until it is situated at that point in the task and environment, when it has more perceptual and cognitive common ground with the director, by perceiving the environment at that point in the route.

1.5 MARCO Architecture

This dissertation details a system, MARCO, that interprets human free-form route instructions and follows the inferred model of the described route (MacMahon et al., 2006). This work builds on a rich literature studying different aspects of route instructions. Some work presents a model of route instructions, but does not apply the model to navigate (Anderson et al., 1991; Daniel et al., 2003; Denis et al., 1999; Klippel et al., 2005; Tversky and Lee, 1999; Vanetti and Allen, 1988). Other work concentrates on understanding single spatial commands in the small-scale space of a room (Skubic et al., 2004b; Tellex and Roy, 2006) or tabletop (Roy, 2005). Finally, other work follows instruction sequences in a large-scale space, but does not use spatial and linguistic knowledge to recover from instruction errors or to infer implicit procedures (Bugmann et al., 2004; Simmons et al., 2003).
MARCO is composed of six primary modules: three to interpret the route instruction text linguistically and three to interpret and execute the instructions in the context of the task and environment. MARCO's general framework is domain-independent, although extending the architecture to domains outside of large-scale spatial route instructions, such as telephone help or booking systems, remains future work.

1.5.1 Conceptual Structure of Instructions

MARCO's instruction modeler produces two interlinked models: an imperative, procedural model of the actions to be taken – a skeletal plan for the task – and declarative models of the expected environmental and task states. MARCO models instructions as a series of parametrized local procedures, called procedural specifications. Each procedure in the instructions is modeled as a procedural specification. Some specified conditions of procedures are modeled as internal state, such as estimates of distance traveled. External state is modeled as view descriptions, which constrain what the follower expects to see, given the referring phrases and commands in the route instructions. The procedures to fulfill each condition are modeled as embedded procedural specifications. A command to travel down a hallway is labeled Travel_p. Travel_p is composed of the explicit, simple causal travel_a action to reach the next place, but may also require other actions such as turn_a and verify_a. For instance, "Take the blue hall to the chair," may require travel_a actions to move to the blue hall, a turn_a action to face along the blue hall towards the chair, and the explicit travel_a actions along the blue hall until the chair is reached. However, since the number of actions to execute is not known a priori, the possible procedure sequences are represented as reactive procedures, which are invoked as necessary.
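The composition of explicit and implicit actions described above can be illustrated with a small data model. This is a toy reconstruction for exposition only – the class and field names are invented, and MARCO's actual representation is described in Chapter 5:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Action:
    """A simple causal action: travel_a, turn_a, or verify_a."""
    kind: str                      # "travel", "turn", or "verify"
    until: Optional[str] = None    # view description terminating the action

@dataclass
class ProcSpec:
    """A parametrized local procedure (e.g. Travel_p)."""
    name: str
    explicit: list = field(default_factory=list)   # commanded actions
    implicit: list = field(default_factory=list)   # inferred sub-procedures

# "Take the blue hall to the chair." -- one Travel_p whose explicit
# travel_a is preceded by implicit procedures to bring the blue hall
# into view and to face along it.
take_blue_hall = ProcSpec(
    name="Travel_p",
    explicit=[Action("travel", until="at chair")],
    implicit=[
        ProcSpec("Find_p", explicit=[Action("verify", until="blue hall in view")]),
        ProcSpec("Turn_p", explicit=[Action("turn", until="facing blue hall")]),
    ],
)
```

In this sketch, the embedded procedural specifications mirror the nesting in the text: the implicit Find_p and Turn_p are invoked reactively, only if the blue hall is not already in view or ahead.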
The first Travel_p procedure to the blue hall may, in turn, require a Turn_p procedure to face the blue hall and possibly a Find_p procedure to bring the blue hall into view. The expected observations of the blue hall and chair in this example are modeled as view descriptions, which model the required relative distance and position as well as the landmarks' type and appearance. For instance, at the beginning of the main Travel_p down the blue hallway, there should be some sort of blue path directly in front of the follower, and a chair in the distance on the part of that path in front of the follower. The instruction modeler also decomposes high-level commands into lower-level procedures. The concise "Take the third right to the end of the hall," is modeled and executed no differently than if the director had explicitly commanded, "Go down one path to the third place with a path to the right. Turn right there. Go down that path until you reach the end of that hall." This simplifies the execution code, by separating it from the surface form of the instructions. Though the implemented MARCO agent only follows route instructions through large-scale spaces, these knowledge representations are general to modeling complex reactive procedures. Some of the modeling methods and heuristics are specific to this domain, but many are generalizable.

1.6 MARCO Route Instruction Studies

MARCO follows the same instruction texts to navigate through the same environments as people do. This allows direct comparisons of MARCO, MARCO with abilities ablated, and people, each following the same instructions. For people, the results are the mean over runs from multiple participants following each instruction set, each beginning at the start location facing a random direction. For the MARCO cases, the results are the mean over runs facing each of the four directions at the start.
MARCO's input was from the hand-verified 'gold-standard' parse treebank, not the parser, but all other modeling was done autonomously.

1.6.1 Ablation Studies of Natural Route Instructions

The implementation of MARCO is configurable to easily remove some capabilities at runtime. This enables computational experiments running versions of MARCO with different language, action, perception, and reasoning abilities on the same instruction corpus. These experiments give insight into how people give and follow instructions, and how important these various capabilities are for constructing a software agent to follow the instructions.

How important are implicit procedures? One key finding is the importance of executing the implicit procedures in the instructions, not just those that are explicitly commanded. Executing only explicit procedures, MARCO succeeded on just 34% of the trials.

What are the cues for implicit procedures? Instructions can imply unstated, implicit procedures with four distinct cues. First, syntactic cues are domain-independent syntactic markers of an explicit condition which the follower must take unspecified procedures to achieve. Second, semantic cues require the decomposition of high-level procedures into low-level actions and task conditions. Third, pragmatic cues alter the interpretation of commands and descriptions, depending on their surrounding context in the instructions and in task execution. Fourth, exploratory procedures are taken to gain information needed to match referring phrases to the task and environment state. These four triggers of implicit procedures occur in instructions across all domains, though this work examines them in the context of spatial route instructions.
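The runtime-configurable ablations can be pictured as a set of capability flags consulted whenever the modeler would license an implicit procedure. The sketch below is a toy: the flag names and the trigger conditions are invented for illustration, and the real cue recognizers are far richer:

```python
from dataclasses import dataclass

@dataclass
class Abilities:
    """Capability flags an ablation experiment can switch off."""
    syntactic_cues: bool = True
    semantic_cues: bool = True
    pragmatic_cues: bool = True
    exploration: bool = True

def implicit_procedures(instruction: str, ab: Abilities) -> list:
    """Collect implicit procedures licensed by the enabled cue types (toy)."""
    procs = []
    text = instruction.lower()
    if ab.semantic_cues and "take" in text:
        # Semantic cue: a high-level verb decomposes into turn + travel.
        procs.append("decompose 'take' into turn and travel")
    if ab.exploration and "the" in text:
        # Exploratory procedure: look around to resolve a definite referent.
        procs.append("explore to bring the referent into view")
    return procs

full = implicit_procedures("Take the blue hall.", Abilities())
ablated = implicit_procedures("Take the blue hall.", Abilities(semantic_cues=False))
assert len(ablated) < len(full)
```

Running the same corpus under different flag settings, as the experiments below do, then isolates the contribution of each cue type to overall success.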
This section compares the results of MARCO running without the ability to recognize different kinds of cues for implicit procedures: cues implicit in the syntax, verb semantics, or discourse and utterance pragmatics, and exploratory procedures used to gain information. Without syntactic cues, MARCO follows only 36% of the instruction corpus; without semantic cues, 40%; without exploration procedures, 50%; and without pragmatic implicit procedures, 55%.

What kinds of landmarks are necessary? Previous work examined the role of object and structural landmarks in learning and navigating large-scale spaces (Stankiewicz and Kalia, 2007). A large body of related work has examined the role of landmarks in spatial route instructions (Daniel and Denis, 2004; Klippel and Winter, 2005; Lovelace et al., 1999; Michon and Denis, 2001; Nothegger et al., 2004; Raubal and Winter, 2002; Weissensteiner and Winter, 2004). We can measure how often structural and object landmarks are mentioned in route instructions by examining statistics available from our language modeling. We can examine when the landmarks are crucial for navigation, as opposed to used in elaboration, by selectively removing the ability to recognize these different types of landmarks. Doing so reveals that without Object Landmarks, MARCO succeeds on 49% of the corpus; without Intersection Landmarks, 48%; without Appearance Landmarks, 44%; without Causal Landmarks (paths and walls), 34%; and without Structural Landmarks, 27%.

What spatial cognition is necessary? Following spatial route instructions requires the ability to represent and manipulate space in several different ontologies (Beeson et al., 2007; Kuipers et al., 2004; Kuipers, 2000), as well as different skills at each level.
The Control level models spatial actions as control laws, requiring skills such as moving until a condition is met (closed-loop control) and moving an estimated distance or turning an estimated angle (open-loop control). The Local Metrical level models the local environment geometrically. Skills at this level include maintaining the relative position of landmarks that are no longer visible and perspective taking (Trafton et al., 2005; Tversky et al., 1999), the ability to reason about perspectives other than the current view. The Local Topological level models the navigational affordances of the local space, i.e. what sort of intersection it is, if any. At the Causal level, space is represented by how abstracted views are linked by abstracted actions (applications of control laws). The Global Topological level reasons about space as networks of places and paths. By selectively ablating spatial abilities in MARCO and measuring when the agent can no longer reach the destination, we learn when these skills are necessary to follow route instructions in large-scale spaces. These experiments reveal how often the diverse spatial reasoning and representation skills play crucial or elaborative roles in natural human route instructions. Without the Local Topological level, MARCO follows 47%; without the Global Topological level, 47%; without Open Loop Causal procedures, 25%; without the Local Metrical level, 17%; without Closed Loop Causal procedures, 7%; and without any Causal procedures, only 1% of the instruction corpus.

1.7 Summary

This dissertation describes a general architecture for following natural language instructions, with an implementation applied to following spatial route instructions through an unknown large-scale space. A new, large language and task corpus was collected, comprising human directors' exploration traces and route instructions, and human followers' navigation traces using these instructions, along with subjective ratings.
The performance of the full agent approaches the performance of people following the instructions overall, and is statistically equivalent to human performance for the best-rated instructions. By selectively removing language, action, perception, and spatial reasoning abilities, the evaluation measured the importance of each ability for following spatial route instructions.

Chapter 2
Related Work in Spatial Language and Instruction Following

Following instructions in general requires applying an inter-disciplinary understanding of natural language and acting in the world. An integrated cognitive system is required to correctly follow free-form, natural language instructions. Because route instructions require many different abilities to follow, they have been studied across the cognitive sciences. In cognitive psychology, route instructions show how people think about space (Anderson et al., 1991; Daniel et al., 2003; Denis et al., 1999; Lovelace et al., 1999; Tversky and Lee, 1999; Vanetti and Allen, 1988). In linguistics, route instructions show how people talk about space (Edmonds, 1993, 1994; Levelt, 1982; van der Zee and Slack, 2003). In artificial intelligence, spatial commands and route instructions are studied for natural human-robot interaction (Bugmann et al., 2004; Klippel et al., 2005; Simmons et al., 2003; Skubic et al., 2004b; Tellex and Roy, 2006). Successfully following route instructions requires integrating techniques from all of these fields.

2.1 Psychological Studies of Spatial Language

2.1.1 Spatial language and spatial cognition

Route instructions are a sequence of descriptions of spatial actions and configurations. Spatial cognition is a fundamental cognitive skill: across tasks and languages, linguistic and psychological studies have found strong evidence of a "Where" system that recognizes and verbalizes spatial relations separately from object geometry (the "What" system) (Jackendoff, 1983; Landau and Jackendoff, 1993).
Furthermore, when describing a route, people express similar information whether communicating in language (verbal instructions) or in pictures (maps) (Hayward and Tarr, 1995; Klippel et al., 2003; Tappe and Habel, 1998; Tversky and Lee, 1999). Route instructions are an interesting case of a more general class of problems: verbally describing a spatial environment. There has been a wealth of good work studying the different aspects of describing spatial relations, layouts, and scenes. Herskovits (1997) provides an invaluable survey of language about space. Mukerjee (1998) surveys cognitive representations of space, and in particular which are quantitative ("neat") and which are qualitative ("scruffy"). van der Zee and Slack (2003) recently collected essays surveying work on directional prepositions. Linde and Labov (1975) focus on one particular domain: describing large-scale space. This work presents a grammar for the sentence- and discourse-structure of verbal descriptions of apartments by their residents. Linde and Labov derive twelve rules of the grammar, delineating what is described, the order of the description, how sentences or subordinate clauses are formed, and the transition from action verbs to passive spatial layout descriptions. This engrossing paper describes both the semantics and pragmatics ("well-formedness") of large-scale space descriptions. The paper illustrates the derivation and application of the rules with examples of the collected apartment layout descriptions. The paper is a summary of Linde (1974). Another keystone work in this field is Landau and Jackendoff (1993): "'What' and 'Where' in Spatial Language and Spatial Cognition." Landau and Jackendoff survey a variety of linguistic and psychological evidence supporting independent mental representations and linguistic structures for describing objects and locations. Named objects specify detailed geometric characteristics, especially shape.
When referring to spatial relations and locations, however, the implicitly described geometry is more qualitative and vague. These distinct manners of reference to objects and locations, universal across human languages, mirror psychological and neurological evidence for separate modules: one reasoning about identifying an object and the other reasoning about spatial relations. The language used to describe space naturally parallels the underlying representations. Talmy (2000) focuses on the closed sets of prepositions that we primarily use to describe spatial relations. A spatial preposition in a language encodes a rich set of default attributes of the relation it is describing. For instance, if one says object A was across object B, we generally infer that A begins on one side of B and extends continuously past the other side on a fairly straight path. A and B are both likely significantly longer in one dimension than the other, their long axes should run approximately perpendicular to one another, and A is likely shorter than B. We would make significantly different assumptions if we heard that A was along, over, on, in, or around B. The complete set of these default attribute values may not be true for any given relation, but if too many are violated, the preposition was the wrong word to describe the situation. The MAP-TASK corpus is another influential study of spatial language. See Section 3.3.1 for a comparison of the MAP-TASK with our language and task corpus methodology.

2.2 Psychological Studies of Route Instructions

2.2.1 Gary Allen and colleagues

One of the most cited psychological studies of route instructions is Vanetti and Allen (1988). Vanetti and Allen looked for differences among subjects divided into four even groups by standardized testing: high spatial-high verbal, high spatial-low verbal, low spatial-high verbal, and low spatial-low verbal.
Interestingly, the two standardized tests of spatial ability measured small-scale spatial ability, but were good predictors of large-scale route planning ability. The subjects gave spoken verbal route instructions between two known buildings on their college campus and followed spoken verbal route instructions between two offices inside a campus building. The route across campus was "familiar to all subjects and not extensive." Spatial ability had a larger effect on the accuracy of subjects' described routes than verbal ability. However, subjects with high verbal ability were more likely to describe the key choice and termination points. The subjects' route instructions were not empirically tested by having others follow them, but by the experimenters coding them. Allen (2000) later aims at capturing the "best practices" of real-world direction giving. Allen summarizes experiments suggesting descriptives and delimiters should be inserted at choice points instead of en route. Allen finds men are more persistent in following instructions; in his experiments, men have fewer points where they claim the route instructions were insufficient than women do. Women's performance improves when environmental features (landmarks, relative spatial directions) are emphasized over metrical distances and cardinal directions, but women still perform worse than men. Allen also finds the quality of instructions is more important as the follower nears the destination than at the beginning of the route. Allen (2003) followed up with a study of how and when gestures accompany route instructions. Allen found deictic gestures were the most frequent, especially emphasizing right or left on turns. Iconic and jabbing emphasis gestures were less common than deictic gestures, and gestures were more common in general with rapid speech.
2.2.2 Michel Denis and colleagues

Denis (1997) breaks down route instruction-giving into three phases: activating relevant spatial knowledge, determining a route, and translating that route into a verbal output. Denis codes a set of route instructions by breaking down each utterance into "minimal units of information." Daniel et al. (2003) found good, poor, and "skeletal" instructions were differentiated by whether the proper action was associated with the proper landmark. Fontaine and Denis (1999) followed up by examining how route instructions for three-dimensional routes through the Paris Metro differ. They found that underground, people specified actions in relation to objects, especially signs, far more frequently than in open outdoor spaces. Michon and Denis (2001) guided subjects on long, but topologically simple, routes through Paris districts. The subjects then repeated the route on their own and gave route instructions into a tape recorder. This experiment found that landmarks, while mentioned all along a route, are most frequently mentioned "close to critical nodes," e.g. around a critical turn or when picking the correct street to exit a large square. Recent work elaborates on these themes (Denis et al., 1999; Mellet et al., 2000); for instance, Daniel and Denis (2004) extends the finding that "good", "poor", and "skeletal" instructions are differentiated by whether the proper action is associated with the proper landmark.

2.2.3 Barbara Tversky and colleagues

Tversky and Lee performed a series of studies on how people direct others on routes in the different modalities of route instructions and sketch maps. Students were asked to give route instructions from a campus landmark to a well-known restaurant in a neighboring town. One work, Tversky and Lee (1998), follows Denis (1997) in describing each segment of a route by its starting point, re-orientations, path progression, and a goal description.
"How Space Structures Language" likewise follows up on Talmy (1983, 2000). They find support for Talmy's schematization of space in the similar ways people describe routes verbally and pictorially. For instance, both verbal route instructions and pictorial route maps schematize information about a route into a series of generic turn and travel actions. In language, turns are represented by a couple of verbs or phrases, such as "turn," "make a," and "take a." In diagrams, turns are represented as arrows or sketches of intersections with orthogonal angles. Tversky and Lee (1999) find the same roles are played by elements in a sketch map and hypothesize that there may be a common cognitive representation that underlies the generation of each. This work was supported by further work in Tversky (2000). Agrawala and Stolte (2001) implemented principles from this work in a software system that rendered sketch-style route maps. Their evaluation found people preferred the sketch maps over, or in addition to, more accurate and detailed cartographic maps. Taylor et al. (2001) look at when and why people switch perspective while describing environments and routes. Taylor and Tversky (1996) examine the additional ambiguities introduced by describing an environment using linear, limited natural language as compared to using analog depictions. Another angle of Tversky's research has been investigating the role perspective plays in spatial description and route instructions (Tversky and Lee, 1999). Taylor and Tversky (1992) looked at the differences in mental representations of large-scale space that resulted from differences in the perspective of a spatial description. Subjects read descriptions written either from a survey perspective or as a narrative of a route. Tversky et al. (1999) examine the cognitive costs and benefits of changing perspective during verbal spatial description.
They review evidence of costs in terms of both effort and errors when readers are forced to change perspective. The authors propose that speakers and writers may switch perspectives to take advantage of relatively more salient objects and spatial relationships that are easier to describe and compute. The other proposed explanation is simply that people's heterogeneous mental representations of space encode different perspectives.

2.2.4 Other work

One of the earliest studies of route instructions was by Elliot and Lesk (1982). Edmonds (1993, 1994) examined reference resolution to previously unknown objects encountered in route instruction texts. Lovelace et al. (1999) had college freshmen write route instructions across their college campus. Subjects first described two previously known routes from a familiar part of campus while in a lab setting. Twice, the students were led along an unknown route on campus, then described it. The route instructions were rated and coded for mentioning certain features, such as when and where landmarks are mentioned. Subjects often omitted mentioning turns, short segments, and landmarks, especially when travel was constrained by environmental considerations. Good route instructions mentioned many landmarks along the paths, off the route, and at the choice points, in contrast to other studies. Buhl (2003) also looked at the effect of perspective (called "speaker orientation") on route instructions. Subjects gave route instructions to a listener with a different point of view; Buhl found subjects most often produced route instructions composed from their own perspective.

2.3 Route Instruction Generators

Davis (1986) implemented the "Back-Seat Driver" system, an early in-car navigation system, producing real-time route instructions for a driver.
Moulin and Kettani (1998)'s GRAAD software generates a logical specification of a route from a "Spatial Conceptual Map" and tests it by giving it to a virtual pedestrian in a simulated environment. This logical formulation is processed by another module to convert it into natural language by removing redundant information, matching logical terms with environment names, and matching logical relations with verbs. Gryl et al. (2002) later presented a richer conceptual model of English and French spatial expressions. Porzel et al. (2002) examine issues of how to linearize a representation of a two- or three-dimensional environment or scene into a one-dimensional string of words. Fraczak et al. (1998) examines automatically generating route instructions in underground, three-dimensional environments, such as subway stations. Skubic et al. (2001) generated spatial descriptions of small-scale space for a robot navigating within a room. Stocky (2002) implemented a kiosk system with a virtual avatar that used gesture and natural language route instructions to guide visitors to offices. From a hand-coded map, Stocky's software, MACK, generated spoken route instructions coordinated with the avatar pointing and highlighting a map. MACK also reasons about when to shift the perspective of route instruction-giving, based on Taylor and Tversky (1996). Kopp et al. (2007) and Striegnitz et al. (2005) continue to study generating route instructions in both text and gestures.

2.4 Automated Instruction Following and Analysis

2.4.1 Computational Models of Spatial Prepositions

Several software systems have implemented computational models of spatial prepositions. Winograd (1972) had one of the first implementations, with his SHRDLU system. SHRDLU executed single commands in a "Blocks World" domain, including planning to achieve commands with complex unsatisfied preconditions. SHRDLU performed using a controlled vocabulary of about fifty words, although it could learn nouns online.
SHRDLU had models of the preconditions of prepositions such as on. André et al. (1986) implemented a dialogue system that could find the reference objects for certain (German) spatial prepositions in a dialogue about sightseeing in a simulated city. Their system, CITYTOUR, could handle several basic and hedged prepositions to answer questions such as (translated to English) "Is the post office beside the church?" Regier and Carlson (2001) and Coventry and Garrod (2004) present implementations of systems that ground knowledge of small-scale spatial prepositions by modeling both geometric and causal relations between objects. Blisard, Skubic, and colleagues implemented a spatial referencing system on a mobile robot that can understand small-scale spatial prepositions such as front, left, right, and behind (Blisard and Skubic, 2005; Blisard et al., 2006; Skubic et al., 2004b). Their system can ground prepositional phrases using these spatial prepositions to an occupancy grid representation of the immediate space surrounding the robot. Their implementation can describe the locations of objects surrounding the robot, can answer simple questions, and can move to achieve single commands to locations in the small-scale space, e.g. "go behind the desk." Gorniak and Roy (2004) implemented a system that can find the referents of referring phrases given a visual scene by resolving color, spatial relations, grouping information, and anaphora. Their system, Bishop, resolved spatial prepositions with an implementation of Regier and Carlson (2001) to distinguish one object out of a group of distractors. The domain for this work was initially a simulated, abstracted tabletop environment of configurations of colored cones. It was later integrated with the Ripley robot in a physical tabletop environment (Roy et al., 2004). As discussed in Section 3.3.1, Levit and Roy (2007) implemented a system applying understanding of spatial language to accomplish the MAP-TASK.
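Grounding a preposition such as behind to an occupancy grid, as the systems above do, can be sketched as scoring grid cells by their position relative to the reference object as seen from the robot. This is a generic reconstruction of the idea, not the algorithm of any of the cited systems:

```python
import math

def behind_score(cell, ref, robot):
    """Score in [0, 1] for how well 'behind ref' describes a grid cell
    from the robot's viewpoint: highest directly beyond ref."""
    to_ref = (ref[0] - robot[0], ref[1] - robot[1])     # robot -> reference
    beyond = (cell[0] - ref[0], cell[1] - ref[1])       # reference -> cell
    dot = to_ref[0] * beyond[0] + to_ref[1] * beyond[1]
    norm = math.hypot(*to_ref) * math.hypot(*beyond)
    if norm == 0:
        return 0.0
    return max(0.0, dot / norm)   # cosine of the angular deviation

# A cell directly beyond the desk, along the robot's line to it, scores 1.0.
assert behind_score(cell=(4, 0), ref=(2, 0), robot=(0, 0)) == 1.0
```

To execute "go behind the desk," such a system could then pick the free cell with the highest score and navigate to it.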
2.4.2 Computational models of route instructions

Riesbeck (1980)’s system evaluated route instructions by high-level characteristics, independent of the environment. His natural language parsing and understanding program analyzed a set of route instructions for overall clarity and cruciality measures. Each motion must be described completely and precisely (clarity); additional descriptions provide checks but are not crucial. The software simulated the role of a person glancing over a route instruction text before navigating, while questions can still be asked, not of an agent following route instructions in the environment. Agre and Chapman (1990) discuss plans as communicative acts and instructions as communicating under-specified plans. They showed how route instructions do not uniquely specify action sequences, but instead constrain navigation by providing a plan skeleton, with exploration sub-goals the follower must accomplish. Chapman (1990) followed up on this theoretical paper by implementing the “Sonja” system, which interpreted spoken advice and instructions to better fight the monsters in her virtual dungeon. Alterman et al. (1991) implemented a system which reactively replans by reading the instructions when its naïve plan proves inadequate. It makes an inference graph by analyzing the keywords in instructions, simplifies the graph using graph summarization techniques, transforms the graph into a procedure, and resumes executing with the amended plan. The system, FLOABN, operated in a discrete event simulation. Example instructions focused on different ways of paying for phone calls. Zelek (1997) implemented a system that followed spatial instructions from a small, controlled vocabulary grammar, chosen from a graphical user interface. The system was able to execute two basic commands, travel_a and find_a, which are two of the actions in our current work.
The system also had models for two-dimensional spatial prepositions, although how these are resolved is not detailed. The system was evaluated on a physical robot. Webber et al. (1995) looked at the broader question of inferring an intended plan from any instructions. This work examined the linguistic and domain knowledge needed to get a virtual agent to follow instructions from various domains. They state:

A plan’s relationship with a set of instructions is also not rigid. It depends, inter alia, on various features of the instructions, including: (1) whether the instructions convey doctrine (general policy regarding behavior in some range of situations) or procedure (actions to be taken now or at some specified time in the future); (2) in the case of procedural instructions, whether they are given before, during, or after action; (3) whether the instructions are meant as advice, suggestion, order, request, warning, or tutorial.

Di Eugenio (1998) reports on the language system of this work. Her software analyzes general instructions, such as craft guides, matching the text against a plan library using plan recognition. The system integrates a lexical semantic ontology (Conceptual Semantics; Jackendoff, 1983) and a description logic based system. The major contribution is interpreting “purpose clauses” (do this to accomplish that or do this such that that is done), e.g. “Turn left to face the chair.” Purpose clauses help look up an appropriate plan in a plan database, as the purpose clause indicates the plan’s (or at least the utterance’s) goal. The system was integrated with the AnimNL system, a virtual-reality animation system able to simulate several tasks. The self-admitted lack in the system is an inability to synthesize meaning across the discourse; instead, it interprets each sentence in its own context. Other parts of Di Eugenio et al.’s work examined instructions for the role of free adjuncts (e.g.
“Facing the chair, move forward”) (Webber and Di Eugenio, 1990), negative imperatives (e.g. “Do not go down the blue hall.”) (Vander Linden and Di Eugenio, 1996), and handling standing orders with some autonomy (Bindiganavale et al., 2000). (All examples are from our domain, not their papers.) Müller et al. (2000) implemented a system that can follow a formal route description through an environment, with the intention of later adding a natural language understanding system. Descriptions follow the Tversky and Lee analysis, specifying where to turn or switch paths (Tversky and Lee, 1998). Frank (2003) suggested formalizing verbal route instructions into action schemas, considering the “pragmatic information content” of route instruction texts the same if they produce equivalent actions. A group in Bremen, Germany is building an intelligent wheelchair (Lankenau and Röfer, 2001; Mandel et al., 2005) that can share control with a human driver through a natural language interface by integrating a spatial ontology (Krieg-Brückner et al., 2004) and dialogue model into an agent control architecture (Ross et al., 2004). Other work examines how people direct a robot using natural language to one of a group of objects, particularly the dialogue strategies and spatial referencing used, and whether directors used open- or closed-loop commands (Moratz et al., 2003; Tenbrink, 2003; Tenbrink et al., 2002; Tenbrink and Moratz, 2003). Shimizu and Haas (2006) built a system that followed instructions through a simulated building. The system parsed free-form natural language instructions into a command template of a verb of travel_a or enter_a, a landmark of a door or hallway, a direction of left, right, or straight, and an ordinal of which hallway or door is referenced. The routes consisted of about two segments. This work attempts to learn to match word segments to action sequences and, for the corpus gathered, succeeds 77% of the time.
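The command template described above can be sketched as a small data structure. The field names and the toy keyword-spotting parser below are hypothetical stand-ins, far simpler than Shimizu and Haas's learned model:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CommandTemplate:
    verb: str                # 'travel' or 'enter'
    landmark: str            # 'door' or 'hallway'
    direction: str           # 'left', 'right', or 'straight'
    ordinal: Optional[int]   # which door/hallway, if stated

def parse_toy(text: str) -> CommandTemplate:
    """Toy keyword-spotting parser into the four-field template."""
    words = text.lower().split()
    ordinals = {'first': 1, 'second': 2, 'third': 3}
    return CommandTemplate(
        verb='enter' if 'enter' in words or 'take' in words else 'travel',
        landmark='door' if 'door' in words else 'hallway',
        direction=next((w for w in words if w in ('left', 'right', 'straight')),
                       'straight'),
        ordinal=next((ordinals[w] for w in words if w in ordinals), None),
    )

cmd = parse_toy("take the second door on your left")
```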
The route instructions were elicited by showing the director the route to follow, and routes that the experimenter could not follow were removed, leaving a corpus which should be 100% followable. Gorniak and Roy (2006, 2007) implemented a system that follows directives that one player gave another while solving a puzzle in a video game. Their system performs plan recognition by parsing utterances into an affordance filter, which, in turn, selects the most probable action on an object, given the utterance and situation. Placing the system in the same situations with the same linguistic inputs as a human player, the system selects the correct next action about 70% of the time, whereas plan recognition alone only predicts 50-60% of the next actions. This evaluation only allows the testing of one command at a time, and the commands are extremely simple, with 50% of the commands consisting of a single word. Tellex and Roy (2006) programmed “spatial routines,” or simple procedures, to execute single commands instantaneously to move within a room or corridor. The system understood eight commands, combinations of turn, go, and stop with parameters, such as “Go (straight|right|left)” and “go across the room.” The procedures included achieving preconditions and were evaluated by whether the simulated robot produced a similar path across the local small-scale space as people in response to the command.

2.4.3 GRACE and GEORGE in the AAAI Robot Challenge

Perzanowski et al. (1998, 2001) implemented a system that combined a speech recognizer, a deep parser, a dialogue model, hand gesture recognition, and a Palm Pilot control interface. A user could command the robot to move around a mapped, small-scale space by speaking and gesturing. GRACE extended this architecture, adding the ability to follow a route instruction series through an unmapped, unknown large-scale space (Simmons et al., 2003).
The robot GRACE navigated through a conference center by asking for and following route instructions. GRACE could string together several simple commands, using an instruction queue executor. They also handled implied new interim destinations (“Take the elevator”). In 2002, GRACE successfully, though haltingly, completed the Robot Challenge at AAAI 2002 (Simmons et al., 2003). The 2003 robots GRACE and GEORGE were beset by hardware, software, and communications problems that illustrate the need for more user visibility into the state of the system. Still, the robots were directed down a hallway, up a ramp and through a narrow doorway, and across an exhibition hall. GRACE and GEORGE had several major limitations. Most debilitating, the commercial speech recognition system was unreliable. The vocabulary and sentence structure were limited so only a trained operator could direct the robots. The navigation planning code relied on having a completed global metrical map, so navigating to unseen, unknown locations was extremely fragile. Crowds of people forming shifting walls further confused the robots. GRACE and GEORGE did not reason to infer implied actions. They had only one interpretation of the instructions, although this was checked with the director. The robots did not estimate the likelihood of action success, but instead asked the director.

2.4.4 Instruction-Based Learning (IBL) Project

The Instruction-Based Learning (IBL) for Mobile Robots project is another implementation of route instruction following on robots. Bugmann et al. (2001) presented a corpus of 96 spoken route instruction sets from participants guiding a human operator, who had remote control of a robot navigating through a tabletop model of a town center. They modeled the instructions as action schemas, called “functional primitives,” such as MOVE_FORWARD_UNTIL, TURN, IS_LOCATED, and GO_TO. Lauria et al.
(2002a) implemented a robotic system capable of following programs of functional primitives from this corpus, expanded to 144 route instructions. The 15 IBL functional primitives include procedures such as go_until_a, exit_roundabout_a, follow_road_until_a, and take_road_a, all of which would be modeled with our Travel_p procedure with various parameters.

2.4.5 Spatial Ontologies and Spatial Instructions

Software systems that analyze or follow route instructions can be distinguished by how they represent space. Freundschuh and Egenhofer (1997) survey a variety of spatial representation models and define broad categories based on (1) whether the objects in the space are manipulable, (2) whether the space requires locomotion to experience, and (3) the size, or scale, of the space. This work focuses on non-manipulable, large-scale spaces that cannot be experienced from any one perspective: the agent must turn (panoramic space) or move (environmental space) to see the space. Other work concentrates on understanding single spatial commands in the small-scale space of a room (Skubic et al., 2004b; Tellex and Roy, 2006) or tabletop (Roy, 2005). Finally, other work follows instruction sequences in a large-scale space, but does not use spatial and linguistic knowledge to recover from instruction errors or to infer implicit actions (Bugmann et al., 2004; Simmons et al., 2003).

Chapter 3

Language and Task Corpus Methodology

The language and task corpus methodology seeks to build and evaluate systems that will be robust to natural instructions from people. Natural instructions are not only instructions in an unconstrained, natural language, but they also represent the kinds of instructions that people give to one another. In the real world, instructions are often under-specified, with erroneous information or some necessary actions not explicitly stated. People are remarkably robust to the challenges of natural instructions.
We seek to build a system that can follow the same natural instructions that people can, especially those instructions that most people can and do follow successfully. The language and task corpus seeks to discover which instructions the “reasonable agent” can follow through this procedure: (1) collect natural instructions from many people, (2) give the instructions to many people to follow and rate, and (3) give software systems the same instructions to follow in the same environments. The language and task corpus can be used in development and evaluation, using human performance as a benchmark for the software system. Running systems with disabled components or alternative implementations and heuristics both measures the performance impact of each software component and makes predictions about how people process and execute natural instructions. The language and task corpus methodology with multiple human directors and followers is an excellent way to study human task-centered language use, as well as for system development, the focus of this dissertation. In fact, other route instruction researchers independently suggested, but did not implement, this methodology for psychology studies of spatial route instructions. Lovelace et al. (1999) rated route instructions subjectively, but could not verify that their poorly rated route instructions were actually functionally worse. They suggested a study where subjects followed a variety of route instructions for a variety of routes in a virtual reality environment to answer this crucial question. Our evaluation testbed ties together an instruction corpus, navigable environments, and action traces from human and artificial agents with linguistic and spatial reasoning abilities. This testbed of a route instruction text corpus tied to simulated environments presents a challenge task for researchers in natural language understanding and spatial reasoning.
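The three-step procedure above lends itself to a simple evaluation harness. The sketch below is illustrative, not the dissertation's actual experiment code; the trial layout and function names are assumptions. It computes per-instruction success rates for human followers and compares an agent configuration, such as one with a component disabled, against that human benchmark:

```python
from collections import defaultdict

def success_rates(trials):
    """trials: (follower_id, instruction_id, reached_destination) tuples,
    from either human followers or software-agent runs."""
    per_instruction = defaultdict(list)
    for _follower, instruction, success in trials:
        per_instruction[instruction].append(success)
    return {i: sum(s) / len(s) for i, s in per_instruction.items()}

def compare(human_trials, agent_trials):
    """Pair the human benchmark with an agent configuration's success rate
    on the instructions both attempted."""
    human, agent = success_rates(human_trials), success_rates(agent_trials)
    shared = human.keys() & agent.keys()
    return {i: (human[i], agent[i]) for i in shared}

# Hypothetical data: three human followers and one agent run on one route.
humans = [('h1', 'route_7', True), ('h2', 'route_7', True), ('h3', 'route_7', False)]
agent = [('full_agent', 'route_7', True)]
```

Running `compare` once per ablated configuration gives the per-component performance impact the methodology calls for.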
The methodology emphasizes understanding the gist of route instructions over every detail: the essential linguistic and spatial details separate navigation success from failure. However, to be tested, components must be integrated into a complete agent that can read the instructions and apply the understanding to act in the world. This work contributes an assessment of human performance for communicating route information through unfamiliar large-scale spaces. By comparing the performance of a computational model with and without the ability to infer implicit procedures, we measure how often understanding the unstated is necessary to succeed in this task. Though this ratio will change for other tasks and domains, the methodology of comparing human and automated systems on corpora of problems will generalize. We believe that the language and task corpus methodology described here will generalize to instructions about other complex procedural tasks, including cooking, first aid, furniture assembly, automobile repair, and many others. We believe these tasks should be similarly evaluated, with a testbed that demonstrates sufficient understanding by achieving a complex, situated task given diverse natural language instructions.

3.1 Language and Task Corpus Methodology for Spatial Route Instructions

Several decisions were made in how to gather the instruction corpus to elicit more natural instructions. Each of these factors increases the ecological validity of the instruction corpus (Cohen, 1995; Rosenthal and Rosnow, 1991). Directors learned the environment through a combination of undirected, free exploration and directed navigation task execution. In the free exploration phase, the directors could move around the environment in any pattern, discovering named positions by moving into them. The subjects had to actively choose how to move through the environment, exploring through a first-person perspective.
Subjects neither saw a map, nor were guided in their exploration. In some other studies, subjects passively observe or are led along a recorded route through the environment (Lovelace et al., 1999; Shimizu and Haas, 2006). Cognitive psychology studies have shown that active navigation leads to qualitatively different and better learning of the environment than passive observation of navigation, the “passenger effect” (Dayan and Thomas, 1995; van Asselen et al., 2006). In the navigation quiz phase, directors were placed at one of the named locations and asked to navigate to one of the other named positions. This both helped the subjects find parts of the environment they may not have found and, more importantly, forced them to reason and problem-solve using their cognitive map. Subjects must demonstrate competence in way-finding among the named positions throughout the environment. During the trials, the experiment program automatically provides feedback after each trial on whether the subject finished at the destination or if the route navigated was circuitous. However, subjects must learn how to overcome these deficiencies on their own. In the last phase, the directors give instructions. Directors are placed at the starting location and allowed to turn so that they can recognize the start. Then the directors must plan and describe a route to another position somewhere in the environment. The environment is not visible while the directors write the instructions. The directors are not shown the route to describe, but must apply their knowledge of the environment to plan the route. After giving the instructions, the directors are asked to follow them and rate themselves. With this information, we can measure when a director unknowingly described a route to a destination other than the one requested. The directors provide the instructions as typed natural language text, rather than as a spoken monologue or a spoken or written dialogue.
Writing reduces the disfluencies in the instructions by allowing the subjects time to think and edit. Since the instructions are given to unknown followers, directors cannot establish ad hoc conventions with the followers (Anderson et al., 1991). The director must give general instructions to a follower without knowledge of the environment. The problem of using dialogue to find or establish common ground in mental representations is interesting, but is a substantial task in itself (Anderson et al., 1991; Cohen, 1984; Edmonds, 1993, 1994; Garrod and Anderson, 1987; Garrod and Doherty, 1994; Heeman and Hirst, 1992; Pickering and Garrod, 2004; Schober, 1993). To understand how people describe routes in large-scale environments, we performed a series of experiments in virtual environments. The use of virtual environments had several benefits. First, all participants were guaranteed to have no initial experience in these environments. Second, we could record the learning process and exploration patterns of the directors. Third, we could move followers between environments easily, quickly, and without notice, to discourage learning the environments. Fourth, we could directly compare the performance of people navigating these virtual environments against an artificial agent navigating the same maps given the same instructions. The routes range from one travel action to many turns through complex environments. The followers had to identify the destination only from the route instructions, not from any distinctive marking or the trial automatically ending. Moreover, the instructions were far from perfect, with some providing minimal guidance and many with significant errors in turn direction, object identity, or distance estimates (MacMahon, 2005).

3.2 Related methodologies for evaluating natural language understanding systems

As we discussed in Section 2.4.4, Bugmann et al. (2004, 2001) enacted a similar methodology for their Instruction-Based Learning project.
The work in this dissertation is more easily and less expensively replicated, since no special robotic equipment or physical town model is needed. More importantly, our subjects learned the environments from the same first-person perspective as the human and software agents following the instructions, and wrote instructions from memory. Bugmann’s participants only saw an outside, panoramic perspective of the town model while directing. This difference in how environments are learned and perceived between the directors and followers leads to a class of errors not present in our approach. Specifically, directors may refer to information unavailable to followers. Conversely, while our directors may make errors while learning the map through navigation or recalling the map while directing, these errors are cognitively interesting and prevalent in the real world. Previous work has found differences in the types and rates of errors that directors make when the director is looking at a map – the “map-present condition” – or describing the route from memory – the “map-not-present condition” (Brown et al., 1998; Ward et al., 1986). In “Wizard of Oz” studies, another common paradigm, the experimenter simulates user interaction with a software system by having an expert “man behind the curtain” control the behavior of the system. The Instruction-Based Learning corpus was gathered through a “Wizard of Oz” methodology, as were several other prominent corpora of spatial language interaction with software systems (Green et al., 2006; Perzanowski et al., 2003; Skantze, 2005). The problem with “Wizard of Oz” studies for instructions is that the user interacts with one expert user, who may not have typical reactions to the instructions. Worse, it is just one interpretation of the instructions, not covering the individual differences in language interpretation or navigation strategy.
A variant of the “Wizard of Oz” method uses a static script for the system interaction (Tenbrink, 2003). A similar strategy is to have live interaction with a live system under development (Fischer, 2003; Hüttenrauch et al., 2004; Moratz et al., 2003; Tenbrink, 2003; Tenbrink and Moratz, 2003). This has the same problems as the “Wizard of Oz” studies, compounded by the fact that the follower is now a partially developed software system. In fact, these studies find that directors can spend a lot of effort trying to ascertain what linguistic and spatial abilities the system has. The interactions are not natural, but qualitatively differ from interactions with people, as those who run these studies have found. When the system is fairly developed, this can be a good evaluation, but it does not work well for development.

3.3 Related methodologies for studying natural language tasks

One common method of studying natural language tasks is for experts to annotate the corpus to create a treebank (Ellsworth et al., 2004; Johnson and Fillmore, 2000; Kingsbury et al., 2002; Palmer et al., 2005; Vander Linden and Di Eugenio, 1996). This methodology grows from the linguistic tradition and is thus best suited to study the language aspects of the tasks, but less so the action portions. In our study, the directors learn the environment from a first-person view of self-directed free exploration. The directors then describe the environment from memory and the followers navigate using the instructions with the same perspective the directors had. In contrast, in other experiments the directors learned the environment from an overhead view or map, often observed while giving instructions (Bugmann, 2003; Bugmann et al., 2004, 2001; Kyriacou et al., 2002, 2004; Lauria et al., 2001, 2002a,b).
Other route instruction studies (Lovelace et al., 1999; Taylor and Tversky, 1992; Vanetti and Allen, 1988) rate route instructions subjectively, but do not test whether navigation success is affected. In these studies, the experimenters code expected errors, but do not measure when people are affected by those errors when following the instructions. In other experiments, subjects are guided through a route and then told to describe it (Lovelace et al., 1999; Shimizu and Haas, 2006), or rely on previous experience outside of the experiment to give the participant familiarity with the route(s) (Daniel et al., 2003; Denis et al., 1999; Lovelace et al., 1999; Michon and Denis, 2001; Nothegger et al., 2004). Evidence has shown that different mechanisms are used in learning an environment or a route from maps versus from first-person experience (Garden et al., 2002), and there is also much research into how people pick the routes to describe (Dalton, 2003; Duckham and Kulik, 2003; Haigh et al., 1997; McDermott and Davis, 1984). The environments are virtual large-scale indoor environments. From any place in the environment, only a limited portion of the full environment can be seen. The environments have a maze-like layout, consisting entirely of corridors and intersections, with no large open areas. Other work has focused on the problems of understanding single spatial commands in small-scale spaces, such as navigating within a room or open field (Blisard and Skubic, 2005; Blisard et al., 2006; Skubic et al., 2004a,b; Tellex and Roy, 2006) or in manipulable space, such as a desktop (Pook and Ballard, 1996; Roy, 2005; Winograd, 1972; Yu and Ballard, 2004).

3.3.1 MAP-TASK Corpus Methodology

A classic psycho-linguistic experiment produced the MAP-TASK corpus (Anderson et al., 1991). The MAP-TASK is similar to following route instructions through large-scale spaces to some degree: participants use spatial language to describe and recreate a route.
However, looking at the details, this is a very different task. The participants in the MAP-TASK use the language of large-scale spatial navigation instructions — and some explicit references to the paper medium — to describe the task of drawing a line on a map. The follower in the MAP-TASK is performing a task akin to a would-be tourist at home, reading a Paris guide book and tracing a route on a map; in the route instruction following task, that tourist uses that book to navigate the streets of Paris. The MAP-TASK does not involve navigation, as both participants interact with the world only through a map. In the MAP-TASK, the participants have equal a priori, though differing, knowledge of the environment. Each has a map containing slightly different overhead sketches of the environment. Each can see their entire map at all times, and the only way to gain additional information about the environment is by talking to the other participant, not through action in the environment. The landmarks are sketched onto the map and labeled with descriptive noun phrases, e.g. “vast meadow,” so there is no variability in the noun phrases used to name referents or the landmarks referenced. The director instructs the follower to recreate, on the follower’s map, a route drawn only on the director’s map. The route is provided by the experimenters, not planned by the director, and is visible to the director throughout the experiment. In the MAP-TASK, the participants have equal knowledge of the environment, which both subjects view as an image annotated with labels, and they collaborate in real time to duplicate the route line shown to one participant. In the MAP-TASK, the evaluation of success is unclear: is the task to exactly or qualitatively reproduce the route line? How should the evaluator account for the missing landmarks on the follower’s map?
In our task, one participant has learned the environment through navigation, and must plan a route from memory, then write a text describing the whole route to an unknown follower, who will later use the instructions to navigate to the destination. The director chooses what to describe and how to describe it, without guidance from the experimenters such as the supplied labels in the MAP-TASK. The advantages of the MAP-TASK are that it controls for some of the individual differences in experience and ability on some dimensions of the task. The directors’ and followers’ only knowledge of the environment is what they see on the map and what they hear. Thus, the experimenters control both participants’ initial state of knowledge. Additionally, by selecting and labeling the landmarks, the experimenters greatly reduce the lexicon and referring phrase variation. Finally, the MAP-TASK uses paper (or any other image display) as its only apparatus, making it easy to replicate. Levit and Roy (2007) created a system to follow the instructions in the MAP-TASK corpus. Their system uses a dynamic programming approach to combine manually modeled and grouped “Navigation Information Units” to draw the best-fit graphical path on the director’s map. The system relies on the fact that it has, a priori, perfect and complete information from the graphical map, which was exactly the information the director had. The evaluation does not compare to the paths other people drew given the instructions, only to the experimenter-provided reference path. The system was also not evaluated drawing paths on the follower’s map, unlike the human followers in the MAP-TASK experiments.

Chapter 4

Human Route Instruction Experiments

Instructions imply two roles: the director and the follower. The director plans to accomplish the task and describes the procedure in the instructions. The follower understands the instructions and executes the procedure in the environment to accomplish the task.
Instructions from multiple directors will vary in errors, omissions, vocabulary, grammar, and information provided. When the environment allows multiple solutions to a task, the procedures described will also vary. Additionally, the directors may have learned different aspects of the environment, and have different strategies to describe tasks in the environment. One goal of this study is to collect a corpus covering all of this variability. Human followers are intelligent and actively attempt to match the directors’ intent to the task and environmental context. Followers will also have individual variations in how they interpret and execute instructions. The follower can often achieve the task despite gaps and explicit errors in the instructions. The crucial question for instructions is how often followers are able to accomplish the task. By giving many followers instructions from many directors over multiple tasks, we can measure how often people reach the destination across a wide variety of instructions.

4.1 Motivating questions

As detailed in Chapter 2, research in several fields has examined giving and following route instructions. Some researchers gather a corpus of instructions and have experts rate the instructions and code them for various attributes, e.g. landmark usage and errors in description. Other studies take a small number of instructions and have people navigate using the instructions through an environment. No other study gathers a corpus of instructions over many routes from one set of people and then has another set of people follow and rate the instructions. We examine the behavior of directors and followers in conjunction, but independently. We want to measure human variability in instruction-giving and instruction-following for complex routes through an unknown, large-scale space. This corpus allows us to answer the following questions: How do directors vary in describing the routes and how do followers vary in following the routes?
How do subjective ratings correspond to objective success rates in following the instructions? What is the distribution of good and poor instructions? Does a link between gender and spatial route instruction performance exist? To measure human linguistic and spatial behavior when giving and following spatial route instructions, we gathered a language and task corpus from human subjects in three experiments. The corpus consists of instructions from multiple directors over multiple routes that are followed and rated by multiple human followers.

4.2 Study overview and motivations

All three of our human studies share the same basic procedure. Directors learned the environments and wrote instructions. Followers read these instructions and followed them through the environments. Each of these roles is summarized here and described in detail later in the chapter. Directors learned a virtual environment through unguided, first-person exploration. Each director was tested by navigation tasks, to ensure he or she had sufficiently learned to navigate through the environment efficiently. Finally, each director planned the route from memory, typed route instructions, followed the route themselves, and rated their own instructions, for each of up to 42 routes through the environment between two named positions. These instructions were later followed and rated by a separate set of human followers. The followers did not have prior experience in these environments. Each follower followed instructions from multiple directors in all three environments, without ever repeating exactly the same route. In Experiment 1, we investigated how instructions vary across a small number of directors and whether a director’s style and quality varies across environments. Six directors each learned and gave 42 instructions in each of three virtual environments, a total of 126 sets of instructions each (756 requested instructions across all six directors).
Each set of instructions was followed and rated by six other people. In Experiment 2, we wanted to measure how much of human variability our six directors covered. We had twelve directors each learn one environment and asked each for instructions for 42 routes. Multiple people followed and evaluated each set of instructions from Experiment 2, as well as a sampling of instructions from Experiment 1. In Experiment 3, we tested how the subjects’ movement control affected instruction-following performance. In the previous studies, directors and followers moved through the environments using discrete actions triggered by the keyboard. For instance, pressing ‘8’ moved the camera along the corridor at a constant speed to the next possible intersection, unless a wall was ahead. In Experiment 3, both the directors and the followers moved the camera using continuous joystick control. In Experiments 1 and 2, the camera moved smoothly down the center of the hallways, but in Experiment 3, each person controlled the speed and heading continuously. This manipulation addresses how the discrete control affected the route instructions.

4.3 Apparatus

These experiments were performed in the Vizard 2.53g virtual reality engine (Vizard, 2006). Vizard provides an immersive environment with optical flow during movement, photo-realistic textures, and three-dimensional objects. The experiments were run on a Dell desktop computer with a 17" monitor at 1248x1024 resolution. Directors heard audio cues from either desktop-mounted speakers or headphones. Followers had no audio cues.

4.3.1 Control of movement through the environment

In Experiments 1 and 2, the numeric keypad controlled motion between places and poses at a place. If the participant pressed the ‘8’ key, the view moved forward down a hallway to the next place, or remained in place if a wall was immediately in front. The ‘4’ and ‘6’ keys rotated the view 90° to the left and right, respectively.
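The discrete control scheme above can be sketched as a small state-transition model. This is our illustrative reconstruction, not the actual Vizard experiment code: the grid representation, function names, and wall encoding are assumptions.

```python
# Illustrative sketch of the discrete keypad motion model described above:
# '8' moves forward one grid cell unless a wall blocks it; '4'/'6'
# rotate the view 90 degrees left/right. All names here are hypothetical.

HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]  # N, E, S, W as (dx, dy)

def step(pose, key, walls):
    """Apply one discrete keypress to a pose of ((x, y), heading_index)."""
    (x, y), h = pose
    if key == '4':                      # rotate 90 degrees left
        return (x, y), (h - 1) % 4
    if key == '6':                      # rotate 90 degrees right
        return (x, y), (h + 1) % 4
    if key == '8':                      # move forward if no wall ahead
        dx, dy = HEADINGS[h]
        nxt = (x + dx, y + dy)
        if ((x, y), nxt) in walls:      # blocked: remain in place
            return (x, y), h
        return nxt, h
    return (x, y), h                    # other keys: no effect

# Example: start at (0, 0) facing north, turn right, then move forward.
walls = set()
pose = ((0, 0), 0)
pose = step(pose, '6', walls)   # now facing east
pose = step(pose, '8', walls)   # advance to (1, 0)
```

In the continuous (joystick) condition of Experiment 3 this discretization disappears: heading and speed become real-valued, which is exactly the manipulation contrasted above.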
The views were placed so that the participant could see the presence or absence of any hallways immediately to their right and left. See Figures 4.2 and 4.1 for example views. In Experiment 3, the participants used a joystick to navigate through the environment, controlling the speed and heading by the direction and angle of the joystick. Unlike the discrete condition, movement was controlled in an analog fashion: the further the joystick was angled, the faster the camera moved through the virtual space. Additionally, the participant could turn to any angle and could stop moving or turning at any location. In the discrete motion condition, motion only stopped at designated poses.

Figure 4.1: Sample views of all objects (hatrack, sofa, lamp, chair, easel, barstool), from the Medium environment.

Figure 4.4.1: Maps of the three virtual environments (Compact, Medium, Sparse), which participants explored from a first-person perspective. Participants did not see these maps or any global representation of the environments. The three regions marked by dotted lines each have a unique wall hanging: fish, butterfly, or Eiffel Tower. Each long hallway has a unique flooring. Letters in the maps mark objects (e.g. ‘C’ is a chair). Numbers indicate the named positions.

4.4 Stimuli

4.4.1 Environment Maps

These environments and the experiment control software build on previous studies of spatial navigation (Kuipers et al., 2003; Stankiewicz and Eastman, 2008; Stankiewicz and Kalia, 2007; Stankiewicz et al., 2006, 2001). The environments were generated on a Cartesian grid and modeled in the Virtual Reality Modeling Language (VRML). The tests were run in three distinct environments, which were constructed from the same components. Figure 4.4.1 shows the global layout of each environment.
Name    | Positions (Dead, Mid, Int) | Straight paths | Avg. path length: Mean (Median) | Unforced decisions | Intersections per path length
Compact | 28 ( 1, 0, 27)             | 15             | 2.7 (2)                         | 71%                | 0.97
Medium  | 34 (10, 4, 20)             | 14             | 2.9 (2.5)                       | 44%                | 0.72
Sparse  | 37 ( 8, 7, 22)             | 19             | 2.1 (1)                         | 30%                | 0.81

Table 4.1: Characteristics of the three testing environments. Positions are either Dead-ends, Mid-path, or Intersections. Paths are defined by continuous straight segments.

All of the participants saw the environments from the first-person perspective of the virtual reality rendering. The three environments varied in the density of the layout, as measured by the shortest travel routes between the named positions and the average hallway length. We name each environment according to its density. The most compact environment (Compact) had a mean shortest route length of 4.2 (median 4); the most spread-out (Sparse) environment had a mean of 6.0 (median 6). The shortest route was a minimum of one travel action; the longest was a minimum of 13. Table 4.1 shows more statistics of the environments. Each environment has paths totaling 40 segments. The environments contain between 28 and 37 positions. The environments have between 14 and 19 paths, where a path is defined as hallway segments connected along a line. Certain positions in the environments, the named positions, were marked with audio cues for the directors. When the director encountered a named position in the Exploration phase, and at the beginning of the Navigation Quiz and Route Instruction Entry phases, a voice announced the position name, e.g. “Position 2” or “Position 4. Go to Position 7.” These positions are marked by the numbers on the maps in Figure 4.4.1. There were two sets of named positions. In Experiment 1, all directors learned and described routes from Position Set 1. In Experiments 2 and 3, half of the directors learned Position Set 1 and half Position Set 2. The layouts vary considerably.
Compact is densely connected with few short or dead-end paths. Compact has 96% of possible intersections connecting at least two paths. This count of intersections excludes positions that are only on one path: dead-end positions and positions in the middle of a hallway where the participant cannot turn onto another path. Sparse has an intermediate degree of connection, with 59% of possible intersections connected and many short paths and a long, sparsely connected loop. Medium also has 59% of possible intersections connected, with many dead ends off of relatively long paths. Finally, in the Compact environment, nearly three-fourths of the positions are decision points, where a navigator has more than one way to proceed forward, while less than one-third of the positions in Sparse present the way-finder with a decision.¹

4.4.2 Environment Landmarks

To provide useful cues for the directors, we placed 11 three-dimensional objects of six different types in each environment. The types of objects were an easel, a hatrack, a sofa, a chair, a stool, and a lamp. See Figure 4.1 for sample views of the different objects. The objects were placed at potential intersections in the environments. Some objects were repeated within an environment, and all object types occur in each environment. For instance, Figure 4.2 shows two different easels in the same (Compact) environment. The positions of the objects are marked by letters on the maps in Figure 4.4.1. The objects were chosen to be easily identified, common objects that could be shown at an apparently normal scale in relation to the hallway size and view height. They are all indoor furniture. Several are from a similar category of human seats: the chair, barstool, and sofa. The lamp and hatrack appear similar from a distance, but are distinct on a close view. The easel is more semantically and perceptually separate from the others, but occurs less often in everyday life than the other objects.
Furthermore, each environment was divided into three separate regions, designated by distinct pictures on the walls (see Figure 4.4.1). Finally, seven long hallways within each environment had a visually distinct texture mapped onto the floor. Figure 4.4.1 shows the layout for the three environments. An immediate forward view of each texture hallway can be seen in Figure 4.2. The hallway flooring textures were chosen to be distinct in both color and pattern. All textures were chosen to be recognizable floor or ground patterns. The non-distinct cement texture is the only one repeated on separate path-segment floors within an environment, and is also used for walls and ceilings. The cement texture was used on each path that was one or two segments long.

¹ Formally, a place with more than two gateways (Kuipers et al., 2004), eliminating dead-ends and corners.

4.5 Human Directors Learn, Navigate, and Describe

All three experiments share the same procedure for learning an environment and writing instructions for routes through it. We will describe the common director procedure and then describe the participants and manipulated variables of the individual experiments.

4.5.1 Procedure

The directors progressed through four phases, described in detail below. First, in the Introduction phase, the directors were briefed on the experiment, answered some demographic questions, and were acclimatized to the virtual environments, navigation interface, trial procedure, and text entry. Second, in the Exploration phase, the directors explored the environments by moving through the environment for a fixed distance traveled, equal to 120 hallway segments. Third, in the Navigation Quiz, the directors were asked to navigate between pairs drawn from the seven named positions that were announced in the Exploration phase.
If the directors demonstrated that they could reach each position and navigate relatively efficiently (within 65% of the shortest distance for that route), they progressed to the Route Instruction Entry phase. Otherwise, they repeated the Exploration and Navigation Quiz phases. In the Route Instruction Entry phase, the directors were repeatedly placed at a named position and asked to type instructions to another named position into a text entry window (see Figure 4.3). After entering the instructions, the director navigated the route he or she had just described and rated their own instructions (see Figures 4.4 and 4.5).

Figure 4.2: Sample views of all hallway textures, from the Compact environment.

Introduction Phase

First, each participant was briefed on the design and goals of the experiment and asked for informed consent. See Appendix A for this consent form. Second, the participant was asked a number of questions, including:

• Is your hometown rural, suburban or urban?
• Are your hometown’s streets not, mostly, or entirely laid out on a regular grid?
• How much first-person gaming experience do you have?
• Do you get motion sickness?
• What is the language you speak at home?

Next, the participant read an instruction sheet explaining how the experiment would be run. See Appendix A.4 for the written instructions they received. Finally, the participant was taken into the virtual reality lab and placed at a desktop computer. The participant was placed in a small demonstration environment and familiarized with the movement and audio cues. The participant ran through an abbreviated version of all three phases of the study in the demonstration environment: Exploration, Navigation Quiz, and Route Instruction Entry. Once the participant was comfortable with the experiment apparatus and procedure, the participant started the experiment in one of the three test environments.
Exploration Phase

In the Exploration phase, the directors learned the environment through free exploration. During this – and only this – phase, audio cues announced each of seven named positions whenever the director crossed one while moving through the environment. Until the director entered the immediate vicinity of a named position, there was no indication that the place was a named position. During this phase, the director was placed at one of the seven named positions in the environment and moved through the environment without guidance. After moving a distance equivalent to 120 hallway segments (the distance between possible intersections), the director’s knowledge of the environment was tested in the Navigation Quiz.

Navigation Quiz

Before asking for instructions, we wanted to ensure the directors had learned to navigate around the environment adequately. The Navigation Quiz tested the directors’ ability to way-find efficiently on the routes between the named positions. In the Route Instruction Entry phase, the directors would need to plan these routes from memory and describe them in instructions. In the Navigation Quiz, participants were placed at a named position, facing a random direction. The position name was announced by the computer. Participants were told to turn around to ensure that they recognized the starting location. Once the participants had oriented themselves, they pressed the ‘0’ key. At this point, the participant was asked to navigate to one of the other named positions. During this navigation, the audio cues for the named positions were off.
Once the participants believed that they had reached the destination, or had given up, they pressed the space bar.²

² Note that this differs from earlier spatial navigation experiments using this software (Kuipers et al., 2003; Stankiewicz and Eastman, 2008; Stankiewicz and Kalia, 2007; Stankiewicz et al., 2006, 2001), where the trial ended automatically when the participant encountered the destination, without the participant making any explicit termination action.

The color of the ‘curtain’ that covered the screen between trials indicated success. If a director did not end at the correct target, the screen turned red. If the director terminated at the intended destination, but took an inefficient path, the screen turned blue. Finally, if the route was fully satisfactory – relatively efficient and correctly terminated – the screen turned green. Navigation efficiency was measured by dividing the shortest-path travel distance between the two positions by the forward distance that the traveler actually moved. Thus, navigation efficiency normalizes the distance traveled by the length of the route. A traveler was deemed to have competently navigated the route if the navigation efficiency was 65% or higher for that route. After the directors were quizzed on one route, they pressed the space bar to move on to the next trial. To pass the Navigation Quiz, the directors had to correctly navigate to each named position at least once, with average navigation efficiency above 65%. If either (1) the director misidentified any destination four times or (2) at the end of 25 trials the director had more inefficient than efficient routes, the director participated in another Exploration phase followed by another Navigation Quiz.

4.5.2 Route Instruction Entry

During the Route Instruction Entry phase, directors typed instructions, followed the route, and rated their own instructions on quality and confidence of reaching the destination.
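The navigation-efficiency criterion used in the Navigation Quiz above can be sketched as follows. The function names are ours, not the experiment software’s; only the ratio and the 65% threshold come from the text.

```python
def navigation_efficiency(shortest_distance, distance_traveled):
    """Ratio of the shortest-path distance to the forward distance
    actually traveled (1.0 = optimal route)."""
    if distance_traveled <= 0:
        return 0.0
    return shortest_distance / distance_traveled

def route_competent(shortest_distance, distance_traveled, threshold=0.65):
    """A route counts as competently navigated at >= 65% efficiency."""
    return navigation_efficiency(shortest_distance, distance_traveled) >= threshold

# A 4-segment route traveled in 5 forward moves: efficiency 0.8, passes.
print(navigation_efficiency(4, 5))   # 0.8
print(route_competent(4, 5))         # True
print(route_competent(4, 7))         # False: 4/7 is below 0.65
```

Because the measure is a ratio, a one-segment detour costs far more on a short route than on a long one, which is why the quiz also averages efficiency across trials.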
Since there were 7 named positions in each environment, there were 42 possible routes in each environment (7 × 6 ordered pairs). The directors began each trial with a blank screen and pressed the space bar when ready, which revealed the environment. The director was placed at the start of a route and given an audio cue that announced the current position name. After this, the position name announcements were turned off for the remainder of the trial. Directors were instructed by another audio cue to turn around, to orient themselves. The directors were allowed to turn, but not move forward. When they were oriented, they pressed the ‘0’ key or the space bar. Next, the screen went blank and the directors were asked to type instructions to guide another subject from the current position to one of the other named positions in the environment. The top of the window said “Enter directions to get from Position X to Position Y,” where ‘X’ and ‘Y’ are the names of the starting (current) and destination positions. The directors typed into a text editor window, which allowed typing and editing text, including moving the cursor. The text entry was unconstrained, free-form text, including newlines. See Figure 4.3 for an example of the text entry window, as presented in the follower part of the study.

Figure 4.3: Example of route instruction window, with entered text.

The directors were instructed that each set of instructions had to stand alone, because the instructions would be followed by others in an arbitrary order and mixed with the instructions from other directors. The follower would be familiar with these kinds of environments, but not with this particular layout. The follower would be placed at the starting position, but facing an arbitrary direction. After the directors typed their instructions, they clicked the ‘OK’ button. At this point, the environment was revealed and the director was asked to navigate from this position to the specified destination.
At the end of navigation, the directors pressed the space bar to indicate they were finished. After navigating, directors were asked two questions (Figures 4.4 and 4.5): (1) How certain are you that you’ve reached the target position? (2) How good do you think your instructions were? Each question was answered on a Likert scale from 1 to 6. A scale with six discrete points was chosen so that the participants were forced to rate the instructions towards the good or bad side, with no absolutely neutral rating, as would be possible with an odd number of rating points.

Figure 4.4: Dialog box for rating route instructions.

Figure 4.5: Dialog box for rating navigation confidence.

After the directors rated their own performance on these two dimensions, the screen was blanked again and the director moved on to the next trial, if any. The directors were allowed to take breaks whenever necessary while the screen was blank between trials. When the directors had given directions for all routes in the environment, the experiment announced “Finished!” and exited.

4.5.3 Experiment 1: Six directors across all three environments

Experiment design

Experiment 1 is a within-subject design. Each director performed three sessions, one for each environment, on separate days. In each session, the director performed the Exploration, Navigation Quiz, and Route Instruction Entry phases, including multiple Exploration and Navigation Quiz rounds, if necessary. The six directors were each asked for 126 route instructions over all three sessions, 42 in each of the three environments, for a total of 756 possible instructions.

Participants

The six participants were students at the University of Texas at Austin or college-graduate community members. They were paid for their participation, which took six to twelve hours. The participants ranged in age from 21 to 29 (mean 24.0; median 22.5). By design, there were equal numbers of males and females, three each.
4.5.4 Experiment 2: Twelve directors each in one environment

Experiment design

The experiment is a 3 (environments) × 2 (position sets) × 2 (genders) design, for a total of 12 directors. The combination of conditions is given in Table A.1.

Participants

The twelve participants were drawn from students in the introductory psychology class at the University of Texas at Austin. They received course credit for their participation, which took about two hours. The participants ranged in age from 18 years, 2 months to 26 years, 2 months (mean 19.7; median 19.2). By design, there were equal numbers of males and females, six each.

4.5.5 Experiment 3: Twelve directors with continuous motion control

There are several possible differences between the discrete and continuous motion conditions. With discrete motion, all movement was controlled and homogeneous. All turns took the same time to cause the same change in angle; likewise, all travel actions covered the same distance at the same speed. With continuous motion, the participants could travel and turn at different speeds. They could stop turning or traveling at arbitrary poses. With discrete motion, some participants used the discrete movements as a count of distance, e.g. “Go forward three clicks.” With continuous motion, the actions were not inherently quantized or countable, though directors could still count intersections and turns or estimate travel time.

Experiment design

The experiment is a 3 (environments) × 2 (position sets) × 2 (genders) design, for a total of 12 directors. The combination of conditions is given in Table A.1.

Participants

The twelve participants were students from the introductory psychology class at the University of Texas at Austin. They received course credit for their participation, which took about two hours. The participants ranged in age from 18 years, 8 months to 24 years (mean 20.0; median 19.4). By design, there were equal numbers of males and females, six each.
4.6 Route Instruction Corpus Language Statistics

For some routes, the director either did not enter any text or entered only a comment, e.g. “I don’t know.” We omit training routes, empty route descriptions, and instructions where a director had described the same route previously.³ Tables 4.2, 4.3, 4.4, and 4.5 show the summary statistics for the instructions.

Name | Instructions | Vocabulary | Unique Voc. | Mean Words | Mean Sentences
Corpus 1
EDA  | 126 | 163 | 15 | 27 | 5.7
KXP  |  75 | 182 | 40 | 20 | 3.2
WLH  | 124 | 181 | 13 | 41 | 5.8
EMW  | 124 | 280 | 60 | 63 | 7.9
KLS  | 122 | 176 | 17 | 45 | 4.2
TJS  | 120 | 190 | 19 | 20 | 2.2
Corpus 2
JJL  |  30 | 124 | 13 | 55 | 5.2
JXF  |  42 | 153 | 11 | 33 | 3.5
MXM  |  24 |  98 |  4 | 33 | 3.6
MJB  |  42 | 187 | 19 | 66 | 5.6
PXL  |  41 | 186 | 25 | 40 | 2.9
QNL  |  42 | 114 |  7 | 22 | 3.2
BKW  |  42 | 148 |  9 | 40 | 3.9
BLO  |  38 |  90 |  6 | 27 | 3.4
JNN  |  24 | 174 | 19 | 64 | 6.0
LEN  |  42 | 141 |  8 | 39 | 2.7
MXP  |  42 | 116 |  6 | 34 | 2.9
TXG  |  34 | 144 | 20 | 47 | 4.4
Corpus 3
JTM  |  20 | 148 | 11 | 59 | 5.0
KAJ  |  41 | 120 |  9 | 44 | 4.1
KXK  |  42 | 109 |  5 | 33 | 2.6
MHH  |  37 | 119 |  4 | 28 | 3.6
RRE  |  18 |  83 |  2 | 27 | 2.9
WAB  |  41 | 119 |  2 | 28 | 2.5
ARL  |  42 | 166 | 11 | 48 | 3.3
JLM  |  40 | 158 | 20 | 42 | 4.0
JXL  |  42 | 108 |  5 | 26 | 2.9
LCT  |  32 | 180 | 22 | 35 | 3.4
SCD  |  33 | 165 | 13 | 63 | 7.5
SMA  |  42 | 140 |  7 | 31 | 1.9

Table 4.2: Statistics per group of instructions by each director.

Group    | n  | Words          | Vocabulary       | Sentences      | Instructions
Corpus 1 |  6 | 36.5, SEM: 6.9 | 195.3, SEM: 17.3 | 4.8, SEM: 0.8 | 115.2, SEM: 8.1
Corpus 2 | 12 | 42.1, SEM: 4.0 | 139.6, SEM:  9.3 | 3.9, SEM: 0.3 |  36.9, SEM: 2.1
Corpus 3 | 12 | 39.3, SEM: 3.7 | 134.6, SEM:  8.5 | 3.7, SEM: 0.4 |  35.8, SEM: 2.5
Discrete | 18 | 40.2, SEM: 3.5 | 158.2, SEM: 10.4 | 4.2, SEM: 0.4 |  63.0, SEM: 9.4
Male     | 15 | 37.7, SEM: 3.6 | 139.1, SEM:  8.9 | 4.0, SEM: 0.3 |  49.7, SEM: 8.6
Female   | 15 | 42.0, SEM: 3.6 | 158.4, SEM: 11.3 | 4.0, SEM: 0.5 |  54.6, SEM: 9.1

Table 4.3: Corpora text statistics averaged per director group, comparing instructions per experiment and comparing instructions from male and female directors across all experiments. SEM is the Standard Error of the Mean.
For each director, Table 4.2 shows the total number of instructions written, the vocabulary as the count of distinct tokens used, how many words are unique to that director, the mean number of words per instruction (total words divided by total instructions), and the mean number of sentences, after splitting run-on sentences (as explained below). Note the large differences in length and vocabulary across directors. Table 4.3 shows the mean numbers of words, vocabulary items used, sentences, and instructions for several groups of instructions. ‘Words’ is the mean over directors of the mean number of words used across all instructions per director. ‘Vocabulary’ is the mean number of distinct tokens used, including misspellings, but not punctuation. ‘Sentences’ is the mean number of sentences used, after splitting run-on sentences into no more than three clauses. ‘Instructions’ is the mean total number of instructions written, across the group of route instructions. Table 4.4 shows the most frequently used words per director. On one hand, this shows common words in the domain, best illustrated by the ‘All’ line giving the most frequent words across the full corpus.

³ In one case the director started over due to an experimental error. In another case, the director asked to explore the environment further after entering some instructions. In both cases, both in Experiment 1, the director gave some pairs of instructions for the same routes.
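The per-director statistics in Tables 4.2 and 4.3 can be computed roughly as below. This is a minimal sketch under our own assumptions (whitespace tokenization, lowercasing, punctuation left attached to words), not the dissertation’s exact tokenization procedure.

```python
from collections import Counter

def corpus_stats(instructions):
    """Sketch of Table 4.2-style statistics for one director's
    instruction set (a list of instruction strings). Illustrative only."""
    tokens = [w.lower() for text in instructions for w in text.split()]
    vocabulary = set(tokens)
    mean_words = len(tokens) / len(instructions) if instructions else 0.0
    return {
        'instructions': len(instructions),
        'vocabulary': len(vocabulary),       # count of distinct tokens
        'mean_words': mean_words,            # total words / total instructions
        'counts': Counter(tokens),           # basis for most-frequent-word tables
    }

stats = corpus_stats(["turn left at the chair",
                      "go forward until the lamp then stop"])
```

The ‘Unique Voc.’ column would additionally compare each director’s vocabulary set against the union of all other directors’ sets, and the group means in Table 4.3 average these per-director values.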
Group All Corpus 1 Corpus 2 Corpus 3 Discrete Male Female Corpus 1 EDA KXP WLH EMWC KLS TJS Corpus 2 JJL JXF MXM MJB PXL QNL BKW BLO JNN LEN MXP TXG Corpus 3 JTM KAJ KXK MHH RRE WAB ARL JLM JXL LCT SCD SMA Word 1 Word 2 Word 3 Word 4 Word 5 Word 6 you to you you to turn you to turn go is turn to to turn left left right you you go left forward right left left left hall right hall turn to go right intersection go right to position right forward right walk go to hall path hall turn make move to take then forward area turn intersection at down once to left go go is left left right forward intersection go right hallway is turn right with to go go you move across walk go go intersection forward turn you forward left hall you panel you make you hall move move left you take go floors turn take down turn is intersection forward continue until right down with move on stops hit position take then untill turn you take direction right with right until it you you now wall until left position left to stop left in floors right forward hall section go hall hallway you you hallway floors corridor is turn you tile left blue take on walk walk to you to move until take take until at in is brick you walk road stop walk go right position left position on is position make your hallway down end to reach stop is position at right until has once intersection at blue you intersection floors at intersection will on position

Table 4.4: Most frequent words per group of instructions. ‘Word 1’ is the most frequent word for that group, ‘Word 6’ is the 6th most frequent word. Lines indicate sets of groups: All, by experiment, both sets of discrete motion experiments, by gender, and by director from each of the three experiments.
Group Corpus 1 Corpus 2 Corpus 3 Discrete Corpus 3 Male Female Corpus 1 EDA KXP WLH EMWC KLS TJS Corpus 2 JJL JXF MXM MJB PXL QNL BKW BLO JNN LEN MXP TXG Corpus 3 JTM KAJ KXK MHH RRE WAB ARL JLM JXL LCT SCD SMA Word 1 Word 2 Word 3 Word 4 Word 5 Word 6 alley l run segment run alley tiled hatrack panel arrive tiled arrive untill bare containing stops turtle alley turtle head containing octagon rectangles board segments board panel blank alleys panels walking hatrack walking octagon passing rose eisle couch now couch rectangles stops walk area alley segment path rose once make move tiled take grey twice aesal carpet segments towards take times rt octagon bare plain movement so keep flooring containing coat middle back going alleys passing cement winding untill easel poster rectangles step across walk stops available sitting puke re come gray make l find panel them make tiling grey past foward continue road posters will tunnel panels scales stand hallway tiled twice now now path 1st makes place cross interection painting square floral stone hallway sure dark know eisle stand hex bricks three hanger easle set hanger patterned paths between segment crossing times u over closest dark degrees than move start section till area painters located easel post floors corridor road forward halls tile last follow hat corridor path gray turtle make exits once blocks head hit find continue floors pictures intersection rock colors has twice intersection destination stand army intersection blank granite flower shells onto big face easle tiles off orange stool space intersection artist shell intersect eiffel flooring block arrive bricks pink stand standing four hanger locate plain honeycomb

Table 4.5: Most frequent words per group of instructions. These are the words with the highest term frequency-inverse document frequency (TF-IDF) ratio. These words occur frequently in the listed instruction group, and seldom occur in other groups.
Lines indicate sets of instructions: by experiment, by motion type, by gender, and by director from each of the three experiments. Typos are by the directors.

At least one of these most frequent overall words occurs in the top six words of each director. On the other hand, the most frequent word lists also show the variation among directors. The different directors use different motion verbs (e.g. turn vs. take vs. make) and take different strategies in terms of landmark description (e.g. intersection, red, tile) or purely action description (e.g. director EDA). Some directors use a variety of verbs often (e.g. hit, stop, find, reach), while other directors use only simple turn and travel verbs, and the word ‘is’, frequently. Even some misspellings crop up in the most frequent words (e.g. untill). Table 4.5 conversely shows the most distinctive words for each group, compared to its peer groups. This is calculated by the term frequency-inverse document frequency (TF-IDF) ratio. Here the ‘documents’ being considered are the concatenations of all instructions by a director or group of directors. Words are weighted by their frequency within the instructions from the group of directors, then normalized by the inverse ratio of how many of the peer groups the word appears in. These words are frequently used by one group, but by few others in the peer groups. This table highlights the diversity of the directors in vocabulary and concepts used, as well as creativity in spelling (there must be fifty ways to spell easel).

4.7 Human Followers Read, Navigate, and Evaluate

All three experiments used the same procedure for people to follow and rate route instructions. This section describes the general procedure, then the design and participants of the follower experiments.

4.7.1 Procedure

Followers were acclimated to the navigation and experiment cues and then followed sets of route instructions from all directors in all environments.
Route Instruction Trial Procedure

For each route, followers were shown the instructions in a pop-up window (Figure 4.3). The navigation screen was initially blank, with the instruction window in front. Followers could look over the instructions for an unlimited time. Also, followers were told that they could re-examine the instructions at any time while navigating, so they did not need to memorize the instructions. When the followers were ready, they clicked the ‘OK’ button, which closed the text window. The follower was placed at the starting position facing one of the four directions (randomly selected). The virtual curtain was removed and the followers could see the navigation screen. The navigation control was identical to the control for the directors (Section 4.3.1). The followers never heard the named position announcements as the directors did. The followers navigated and recognized the destination from the text of the current instructions alone. By pressing the ‘d’ key, the follower could review the instructions at any time while navigating, but had to close the instruction pop-up window to resume way-finding. This allowed us to measure when the followers referred to the instructions. When the followers reached the destination or finished trying, they pressed the space bar. Each trial continued until the follower explicitly indicated completion, either when they found the destination or when they gave up. The followers were instructed to use common sense in following the instructions, as the instructions did contain some errors. After terminating, the followers were asked the same rating questions the directors used in self-rating (Figures 4.4 and 4.5): “How certain are you that you are at the destination?” “How good do you feel the directions were for getting to the destination?” Each question was answered on a Likert scale from 1 (poor instructions / certain not there) to 6 (excellent instructions / certain at destination).
Route Instruction Trial Sequence

After rating the instructions, the followers repeated the same procedure for the next set of instructions. The followers followed one set of route instructions for each route in all three environments. The sequencing of the route instructions was designed to discourage the followers from learning the environments or any particular director’s style. The followers changed environments every other route and had no direct indication of which environment they were in. The instruction sequences were also constrained by the directors and the endpoints of the routes. No follower ever navigated exactly the same route twice. No director was repeated within any four instruction sets. No position was repeated as a start or destination within three trials. After reading, following, and rating 126 routes, or when their time period had elapsed, the follower was finished participating in the experiment.

4.7.2 Experiment 1: 24 people following 6 directors’ instructions

Experiment design

In Experiment 1, the followers each followed instructions from all six directors from Experiment 1 in all three environments. The sequence of instructions is detailed above.

Participants

Participants were drawn from the pool for the introductory psychology class at the University of Texas at Austin and received one or two hours of course credit for their participation. The participants ranged in age from 19 years to 29 years (mean 20.1; median 20). By design, there were equal numbers of trials by males and females.

Route instruction preparation

In Experiment 1, the instructions were exactly as typed by the director, including any typos and line breaks. The only change was anonymizing the position names, to prevent the followers from relying on them. Only the numerical identifier was changed, replaced with a single letter: ‘X’ for destinations and ‘Y’ for starting positions.
For instance, “That is Position 3” became “That is Position X” and “From Pos-2, ...” became “From Pos-Y, ...”.

4.7.3 Experiment 2: 44 people following 18 directors’ instructions

Experiment design

In Experiment 2, the followers followed instructions from the six directors from Experiment 1 and the twelve directors from Experiment 2. The instructions were intermingled so that each follower followed and evaluated both Experiment 1 and Experiment 2 instructions. This measures whether followers in Experiment 2 followed or rated the instructions differently than the followers in Experiment 1. In Experiment 2, the followers were switched among six sets of named positions, two in each of the three environments. Following a later set of instructions in the same environments, a follower might proceed along a route to a different “Position 2.” The followers were warned that they would be switched among maps and position sets and should follow each set of instructions independently.

Participants

Participants were drawn from the pool for the introductory psychology class at the University of Texas at Austin and received one or two hours of course credit for their participation. The participants ranged in age from 18 years to 25 years, 7 months (mean 19.8; median 19.5). By design, there were equal numbers of males and females.

Route instruction preparation

For Experiment 2, the instructions presented to the followers were cleaned up for the parser. Run-on sentences were segmented into no more than three independent clauses per sentence and consistently punctuated with periods. Typos that split or joined words were fixed, though typos within words remained. The instructions were presented with each sentence starting on a new line. In this version, the position names were not altered, but subjects were told that the names were not consistent between trials.
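The position-name anonymization described for Experiment 1 can be sketched as a small text substitution. The regular expressions and function below are a hypothetical reconstruction for illustration, not the original preprocessing script:

```python
import re

def anonymize(text, dest_ids, start_ids):
    """Replace numeric position identifiers with 'X' (destinations) and
    'Y' (starting positions), preserving the surrounding surface form,
    e.g. 'Position 3' -> 'Position X', 'Pos-2' -> 'Pos-Y'."""
    for i in dest_ids:
        # \b keeps e.g. 'Pos-21' from matching when anonymizing id 2
        text = re.sub(rf"(Position |Pos-){i}\b", r"\g<1>X", text)
    for i in start_ids:
        text = re.sub(rf"(Position |Pos-){i}\b", r"\g<1>Y", text)
    return text
```

For example, `anonymize("That is Position 3. From Pos-2, go forward.", [3], [2])` yields the anonymized form used in the trials.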
4.7.4 Experiment 3: 24 people following 12 directors’ instructions

Experiment design

Participants

Participants were drawn from the pool for the introductory psychology class at the University of Texas at Austin and received one or two hours of course credit for their participation. The participants ranged in age from 18 years, 1 month to 29 years, 7 months (mean 20.2; median 19.3). By design, there were equal numbers of trials by males and females.

Route instruction preparation

For Experiment 3, route instructions were prepared as for Experiment 2.

4.8 Human Task Performance Overview

Figure 4.7 shows how the mean success rate of human followers varies with the mean post-hoc human rating, over instructions from all three experiments, by followers from all three experiments. The success rate is simply how often the follower reached the destination to which we asked the director to guide the follower. There was a very strong correlation between the mean success rate and the mean subjective rating: Spearman r(19) = 0.957, p ≤ 0.001. Previous work has also found a significant relationship between instruction quality and instruction-following performance (Daniel et al., 2003; Denis et al., 1999). For a breakdown of instructions by linguistic constructs used and spatial reasoning needed, see Chapter 6 and Appendix B.

Figure 4.6: Occurrence and success rates over all human followers by mean rating. Most instructions are good to excellent, with the median rating at 4.0 and the mode at 5.0. Gray bars indicate the total number of instructions with that mean rating in that corpus; colored bars, the number of successes of people following those instructions.

Figure 4.6 shows the distribution over post-hoc human rating, and success rates, for followers from Experiments 1, 2, and 3. The followers from Experiment 2 had a very slight performance increase over the followers from Experiment 1 on the instructions from Experiment 1, (M = 70%) vs. (M = 69%), t(638) = 2.46, p ≤ 0.014.
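The rating–success correlation reported above is a Spearman rank correlation. As an illustration, here is a plain-Python sketch of that statistic (Pearson correlation of the ranks, with average ranks for ties); the data in the test are invented, not the corpus values:

```python
def spearman_r(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks,
    assigning average ranks to tied values."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            # extend j over a block of tied values
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            for k in range(i, j + 1):        # average 1-based rank for the block
                r[order[k]] = (i + j) / 2.0 + 1
            i = j + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = float(len(xs))
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In the analysis above, the paired values would be the mean rating and mean success rate per rating bin; a perfectly monotone relationship gives r = 1.0.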
Figure 4.7: Mean performance over all human followers. The cumulative success rate (Y-axis) over all instructions from all three experiments by mean post-hoc instruction rating (X-axis). Success rate is how often followers finished at the intended destination for instructions with mean rating of r±. Data as of May 31, 2007. The annotations P10 ... P90 indicate the 10th ... 90th percentiles of ratings.

The followers from Experiment 3 only followed the instructions from Experiment 3, as both groups used continuous motion controlled by the joystick, whereas all people in Experiments 1 and 2 used discrete motion by the keyboard. The subjective ratings were not evenly distributed across instructions. Most instructions in this corpus were rated highly; the mode rating is 5.0. Among the middle-rated instructions – rated between 2.0 and 4.5 inclusive – the distribution was fairly even.

Figure 4.8: Success rates over all human followers per Corpus 1 director. The vertical bars show the success rate per director by followers from Experiment 1. Under the bars are the directors’ initials and the mean subjective rating (1-6 Likert scale) over all the directors’ instructions by all followers from all three experiments.

4.8.1 Differences in Directors

There were strong individual differences among the directors. In terms of human follower performance, directors ranged from 83% successful instructions down to 31%, even after the blank instructions were filtered out. The mean of the 30 directors’ mean performances was (M = 60.8%, SEM = 2.6%), while the mean success rate over all 1517 followed instructions was (M = 64.5%, SEM = 1.0%). The directors who gave better instructions also tended to give more non-blank instructions, skewing the overall mean up from the mean of directors’ means.

Figure 4.9: Success rates over all human followers per Corpus 2 director. The vertical bars show the success rate per director by followers from Experiment 2.
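The gap between the mean of directors’ means and the overall (pooled) mean is a weighting effect. The toy numbers below are invented, but they illustrate how a prolific, higher-success director pulls the pooled mean above the unweighted mean of per-director means:

```python
# director -> (instructions followed, successes); counts are invented
directors = {
    "A": (100, 80),   # prolific director with good instructions
    "B": (40, 16),    # sparse director with poor instructions
}

# unweighted mean of each director's own success rate
per_director_means = [s / n for n, s in directors.values()]
mean_of_means = sum(per_director_means) / len(directors)

# pooled mean over all followed instructions, weighting prolific directors more
pooled = (sum(s for _, s in directors.values())
          / sum(n for n, _ in directors.values()))
```

Here `mean_of_means` is 0.6 while `pooled` is about 0.686, mirroring the 60.8% vs. 64.5% pattern reported above.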
Are the differences between the directors’ success rates significantly different? To answer this question accurately, we need to control for the different sets of start and target position pairs across directors. In Experiment 1, the directors were asked to describe the same 126 routes in all three environments, but some did not write instructions for each route. In Experiments 2 and 3, the directors were each asked for instructions describing 42 routes in one environment. Moreover, half the later directors described routes with different starting and ending locations than the Experiment 1 directors. Since the routes vary in complexity and difficulty, a fair comparison controls for the routes described.

Assume a director d wrote instructions RI for routes R through one or more environments. We can collect the instructions RI′ from all other directors d′ describing these same routes R. We compare the mean performance of the human followers on the two sets of instructions (RI and RI′) describing the same routes R. This controls for the differences in complexity of the routes. We performed a two-tailed paired t-test on the mean success rate of all human followers to determine if the instructions RI from a director d differed in success rate from the instructions RI′ for the same routes R from all other directors d′. 24 of the 30 directors had a statistically significant difference in the success of the human followers on their instructions, compared to the instructions for the same routes from other directors. Figures 4.8, 4.9, and 4.10 show the success rates of followers on instructions from the directors of Experiments 1, 2, and 3, respectively.

Figure 4.10: Success rates over all human followers per Corpus 3 director. The vertical bars show the success rate per director by followers from Experiment 3.
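The per-director control described above amounts to a paired comparison over shared routes: for each route, pair the mean success rate under director d’s instructions with the mean success rate of all other directors’ instructions for that route. A minimal sketch of the paired t statistic, with invented per-route success rates, is:

```python
from math import sqrt

def paired_t(xs, ys):
    """Two-tailed paired t statistic for matched samples. The p-value
    would come from a t distribution with len(xs) - 1 degrees of freedom
    (e.g., via scipy.stats); only the statistic is computed here."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
    return mean / sqrt(var / n)

# hypothetical per-route mean success rates: director d vs. all others
d_rates = [0.5, 0.7, 0.9, 0.6]
other_rates = [0.4, 0.5, 0.8, 0.6]
t = paired_t(d_rates, other_rates)
```

Pairing by route is what removes route complexity from the comparison: each difference is between two instruction sets for the same start and destination.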
Note that followers from the first experiment did not follow instructions from Experiment 2, but followers from Experiment 2 did follow instructions from Experiment 1. There are significant differences in the success rates of different directors’ instructions, and these differences are reproducible across different people following the instructions.

Figure 4.11: Success rates for all human followers from each experiment by instruction rating. There were no significant differences in how people followed the instructions from Experiments 1, 2, and 3 across the quality spectrum.

4.8.2 Differences in Human Followers between Corpus 1 and Corpus 2

Figure 4.11 shows the performance of human followers from the two experiments on instructions from each experiment. In Experiment 1, the followers reached the intended destination with a mean of 68.5% (691 instructions, SEM = 1.3%). Followers in Experiment 2 following instructions from Experiment 1 succeeded with a mean of 69.7% (639 instructions, SEM = 1.9%). This difference is small and barely significant, t(638) = 2.46, p ≤ 0.014. We controlled for route complexity by comparing the success rates on all route instructions describing routes with the same starting and ending positions. When the routes followed are controlled for, the difference disappears entirely, each at 68%. This is the mean success rate of all the followers for all the instructions describing all routes with the same start and destination. Comparing the mean success rate by route controls for the complexity of the route across directors and followers. The 432 instructions from Experiment 2 were slightly worse, with the followers from Experiment 2 succeeding with a mean of only 64.0% overall (SEM = 1.9%), and 66% on the routes shared with the corpus from Experiment 1 (187 instructions, SEM = 0.6%, t(834) = 5.60, p ≤ 0.001).
This difference from the success rate of Experiment 2 followers on Experiment 2 instructions is significant, t(194) = 5.62, p ≤ 0.001, controlling for route. The best directors in Experiment 2 performed at the level of the best in Experiment 1, as can be seen in Figures 4.8, 4.9, and 4.10.

4.8.3 Gender-linked performance differences

For directors, there is a strong effect of gender (Figure 4.12). For the 720 instructions from male directors, the mean success rate for all human followers was 66.5% (SEM = 1.4%), but for the 797 instructions from female directors, the mean success rate was 62.6% (SEM = 1.3%), significant with an independent-samples t(1518) = 2.02, p ≤ 0.022. On the common routes, the mean performance of male directors was (M = 67.4%, SEM = 0.8%) versus female directors (M = 62.5%, SEM = 0.8%), significant at t(794) = 4.79, p ≤ 0.001. See the prior section for a discussion of the mean success rate per route statistic.

For followers, there was a small but significant effect of gender: for all 1363 instructions followed by both genders, the mean success rate per route was 65.5% for male followers (SEM = 0.5%) and 64.5% for female followers (SEM = 0.5%), different at t(1362) = 1.99, p ≤ 0.024.

Figure 4.12: Success rates for human followers and directors by gender. Under the bars are the mean subjective ratings (1-6 Likert scale) over all the directors’ instructions by all followers from all three experiments.

There are no strong interaction effects between the genders of the directors and followers. On the 637 instructions from male directors, male followers have a mean success rate per route of 66.5% (SEM = 0.9%) and female followers 67.2% (SEM = 1.0%), not a significant difference, t(636) = 0.70, p ≤ 0.485. Likewise, for the 712 instructions from female directors, controlling for route, male followers succeeded 63.9% (SEM = 0.9%) and female followers succeeded 61.5% (SEM = 0.9%), significant at t(712) = 2.71, p ≤ 0.007.
Regardless of the gender of the follower, people following instructions from female directors reached the intended destination 6-10% less often, across the three corpora. Male followers of men’s instructions (M = 68.2%, SEM = 0.9%) vs. women’s instructions (M = 63.3%, SEM = 0.9%), t(719) = 3.51, p ≤ 0.001. Female followers also succeeded significantly more often on men’s instructions than on other women’s instructions for the same routes: (M = 68.0%, SEM = 1.0%) vs. (M = 61.5%, SEM = 0.9%), t(708) = 5.16, p ≤ 0.001. Previous studies have also found that women are less accurate in giving instructions, particularly when the director is describing the route from memory (the “map-not-present condition”) (Brown et al., 1998; Ward et al., 1986). In both of these studies, accuracy was measured by an expert rating whether the director made a crucial error of omission (implicit action) or commission (explicit mistake). Some other researchers have also found that women have lower success rates than men in following route instructions (Allen, 2000), while others did not find this effect (Schmitz, 1999).

4.9 Discussion

This chapter presents a series of experiments gathering a large language and task corpus of spatial route instructions. These experiments were designed to gather instructions for complex spatial tasks using a methodology that elicits natural variation in the instructions. The instructions vary in vocabulary, grammar, style, spatial actions and landmarks, and explicit errors and implicit omissions. Because multiple directors describe each route and multiple people follow each set of instructions, the corpus measures human variability in generating and understanding spatial procedural instructions. This corpus is unique in tying together many participant-written instructions with multiple independent human follower action traces and subjective ratings for each route instruction text.
For the directors, the methodology ensures no a priori knowledge of the environments, measures how quickly and how well the directors learn to navigate an environment from a first-person perspective, and gathers many route instruction texts for routes throughout the environment. Directors plan and describe the routes from memory, a more cognitively challenging task than planning while looking at a map or other environmental model. The directors describe the routes in a paragraph of text that can be independently followed by multiple human or software followers. The routes described in these experiments require complex procedures to navigate through large-scale spaces – buildings where the destination is not visible from the starting location or reachable with one travel action. The task is similar to that faced by patients in large hospital complexes, by new students in large buildings or campuses, and, outdoors, by people navigating through homogeneous housing sub-divisions, where all the houses look alike and hard-to-read street signs mark similarly named streets.

For the human followers, the experiment collected independent action traces and subjective evaluations from several people following each set of instructions. This measures the behavior of different people in response to the same route instructions. Followers did not experience the environments outside of following instructions, and they were shuffled between environments to discourage learning the maps. Each follower of a particular set of route instructions was presented the same route instruction text in the same manner in the same environment. Multiple human followers allow exploration of individual differences in spatial and language ability, strategy used, and even luck when presented with truly ambiguous instructions. Thus the experiment measures how well any set of instructions works for guiding arbitrary, unknown followers through the large-scale space.
We measured large and statistically significant differences in both the directors’ route instruction verbal styles and the success rates of their instructions. This experiment measures a high correlation between subjective ratings of route instructions and the route instructions’ success rates. The tasks are challenging; the mean success rate across all followers is 64.3% (1522 instructions, SEM = 0.9%). We measure the statistical distribution of route instruction quality across directors, environments, and route complexity. The median instruction is rated 4, with 6 being excellent, but there is a long, fat tail of poorly rated instructions: 25% of the instructions are rated between 3 and 4, and the final 25% between 1 (the worst rating) and 3. Put another way, for these environments, with directors trained until they could navigate efficiently, one quarter of the instructions are excellent, one quarter good to very good, one quarter medium to good, and one quarter poorly rated, with objective success rates matching the subjective ratings. Across the different experiments, the same patterns occur in the data in the distribution of instruction quality and the strong correlation between quality and success rate.

This route instruction language and task corpus gives insight into open questions in the literature. For instance, by splitting directors and followers by gender, we find that both male and female followers succeed about 10% more often given instructions from male directors. There is a smaller, but still significant, gender-linked difference in the followers’ success rates: female followers succeeded at about the same rate as male followers on instructions from male directors, but about 3% less often than men on instructions from female directors. However, with the large variation in success rates, some of the women directors had very high success rates, and some of the male directors very poor success rates.
Moreover, we have not yet done an analysis controlling for experience playing first-person video games, which may partially explain these differences.

This chapter describes a human cognitive psychology experiment on how people learn, reason about, and describe complex routes through large-scale spaces. We collected a large language and task corpus of directors learning the environments and giving instructions. We also collected multiple followers applying each instruction text to perform the task and evaluating the instructions. From these experiments, we gain a better understanding of the variability in instruction-giving and instruction-following. Finally, the corpus shows us the distribution of route instruction quality and styles that an instruction follower must handle to robustly follow natural instructions.

Chapter 5

Spatial Route Instructions in the MARCO Architecture

5.1 Understanding and Following Route Instructions in Context

MARCO is an architecture for understanding and following natural language route instructions (MacMahon et al., 2006). MARCO is composed of six modules: three modules interpret the route instruction text linguistically; three modules interpret the instructions spatially, in the context of the task and environment. MARCO’s linguistic modules parse raw text and produce an imperative procedural model – a skeletal plan. The syntax parser models the surface syntactic structure of an utterance. The content framer abstracts away from arbitrary word order and formation to model the surface meaning of the utterance. The instruction modeler applies spatial and linguistic knowledge to combine information across phrases and sentences. Figure 5.1 shows the representations MARCO uses to model route instructions. MARCO’s executive modules apply the instructions to navigate through the world.
The executor reactively interleaves action and perception, acting to gain knowledge of the environment and executing the instructions in the context of this spatial model. The robot controller is an abstraction layer for particular robots’ motor and sensory capabilities. The view description matcher checks symbolic view descriptions against sensory observations and world models – checking the expected model against the observed model. Figure ?? shows the relation of the natural language understanding and robot control parts of the architecture.

The language understanding part of the architecture builds on ideas from the Nautilus natural language understanding system (Perzanowski et al., 2001; Simmons et al., 2003; Wauchope et al., 1997), from Reiter and Dale’s natural language generation architecture (1997), as well as from frame-based systems (Bindiganavale et al., 2000; Chang et al., 2002). The instruction execution engine follows in the tradition of reactive execution or sequencing code in the middle tier of three-tiered intelligent architectures (Bonnasso et al., 1997; Firby, 1989; Verma et al., 2005), with a prototype implementation programmed in TDL (Simmons and Apfelbaum, 1998) (see Section 2.4.3).

5.2 Syntax Parser

The syntax parser parses the raw route instruction text. Our implementation uses a probabilistic context-free grammar built with the Python Natural Language Toolkit (Bird and Loper, 2004). Instead of modeling part-of-speech syntax, our grammar directly models verb-argument structure, similarly to Baker et al. (1998), Bindiganavale et al. (2000), Chang et al. (2002), and Palmer et al. (2005). The top of Figure 5.1 provides an example of a parse tree. The verb-argument grammar is detailed in Appendix C.2. We aim for MARCO to understand the language of route instructions well enough to navigate. MARCO is not assisted by having a correct part-of-speech parse tree of a sentence without the semantics of the words and phrases.
Since MARCO needs a strong model of the verbs, adjuncts, and referring phrases in the domain to execute the instructions, it does not hurt to restrict the grammar to the domain. Moreover, this approach can model which arguments are optional and which are required for different verbs. Finally, MARCO attempts to execute instructions even when it does not understand some sentences or words. Since much of the complexity of instructions can be in elaboration sentences describing extra details, this is a viable strategy. The parser in the implementation is sufficient to investigate the issues of this thesis, although future work should look at integration with best-of-breed parsers.

Figure 5.1: MARCO linguistic modules modeling a route instruction text (top, e.g. “turn to face the green hallway,” “walk forward three times”) through the syntactic verb-argument and phrase structure (mid-top), the surface semantics frame (mid-bottom), and the imperative semantics of which procedure to take, given the context (bottom). The pipeline runs: route text → syntax parser → parse tree → content framer → content frame → instruction modeler → procedural specification → executor → instantiated reactive procedure.
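To illustrate the kind of probabilistic verb-argument parsing the syntax parser performs, here is a toy CKY/Viterbi parser over an invented grammar fragment. The category names echo the dissertation’s examples (TRAVEL, DIR), but none of these rules or probabilities come from MARCO’s actual NLTK-based grammar (Appendix C.2); this is a self-contained sketch:

```python
import math
from collections import defaultdict

# Invented probabilistic grammar in Chomsky normal form:
# (left-hand side, right-hand side, probability)
RULES = [
    ("TRAVEL", ("TRAVEL_V", "DIR"), 0.7),
    ("TRAVEL", ("TRAVEL_V", "DIST"), 0.3),
    ("TRAVEL_V", ("walk",), 0.6),
    ("TRAVEL_V", ("go",), 0.4),
    ("DIR", ("forward",), 0.8),
    ("DIR", ("back",), 0.2),
    ("DIST", ("twice",), 1.0),
]

def viterbi_parse(words):
    """CKY/Viterbi chart parse: return (log-prob, bracketed tree) for the
    most probable analysis, or None if the utterance is out of grammar
    (mirroring how MARCO skips utterances it cannot parse)."""
    n = len(words)
    chart = defaultdict(dict)  # (i, j) -> {nonterminal: (logp, tree)}
    for i, w in enumerate(words):
        for lhs, rhs, p in RULES:
            if rhs == (w,):
                chart[i, i + 1][lhs] = (math.log(p), "(%s %s)" % (lhs, w))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, rhs, p in RULES:
                    if len(rhs) != 2:
                        continue
                    b, c = rhs
                    if b in chart[i, k] and c in chart[k, j]:
                        lp = math.log(p) + chart[i, k][b][0] + chart[k, j][c][0]
                        if lhs not in chart[i, j] or lp > chart[i, j][lhs][0]:
                            tree = "(%s %s %s)" % (lhs, chart[i, k][b][1],
                                                   chart[k, j][c][1])
                            chart[i, j][lhs] = (lp, tree)
    return max(chart[0, n].values()) if chart[0, n] else None
```

For example, `viterbi_parse(["walk", "forward"])` returns the highest-probability bracketing, while an out-of-grammar word yields None rather than an error.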
A more traditional part-of-speech grammar and parser would require a more complex surface semantics interpretation module. One interesting approach would combine a re-ranking probabilistic parser (Charniak and Johnson, 2005) with statistical semantic role labeling (Gildea and Hockenmaier, 2003; Gildea and Palmer, 2002). The re-ranking could include measures of semantic well-formedness and of fit with the task and environment; Purver et al. (2006) discuss a similar approach to interpreting single commands in a restaurant recommendation domain. Another recent approach directly learns to translate text to a semantic representation (Ge and Mooney, 2005; Kate et al., 2005; Wong and Mooney, 2006).

5.2.1 Training the Probabilistic Context-Free Grammar

A referring phrase is a noun phrase that refers to some entity or attribute being described, analyzed on its semantic content instead of its syntactic makeup (Kuipers and Kassirer, 1987; Reiter and Dale, 1997). By tagging the referring phrases and verbs in a set of route instructions with semantic annotations for the phrase and argument types, this analysis characterized the surface meaning of route instruction utterances. From the hand-labeled text, a Probabilistic Context-Free Grammar (PCFG) was trained to parse and semantically tag route instruction texts. This parser labeled additional route instructions, and the automatically tagged route instructions were then hand-corrected. The tagged trees help bootstrap the process of learning the full language model. The architecture can either parse raw input text using the trained parser or load hand-corrected gold-standard parse trees.

5.2.2 Robustness for the Syntax Parser

Several techniques provide the syntax parser with robustness to out-of-vocabulary words and out-of-grammar utterances. The parser uses a maximum likelihood function to label novel words or familiar words used in a different linguistic context than previously seen.
When an unknown word is found, the grammar matches the rest of the utterance, with a placeholder for the unknown word. The likelihood of a non-terminal production yielding UNKNOWN is weighted by the lexical variance of the non-terminal. Open-set words, such as nouns and adjectives, tend to have a large number of different tokens, while closed-set words, such as prepositions, have very few (even counting typos separately). Thus, the grammar can guess the role of an unlearned word based on the utterance context and the variance of the different non-terminal tokens. For instance, given “Turn onto the grue hall,” the grammar will correctly guess that grue is an appearance term modifying the PATH phrase. When the grammar cannot find a parse for an utterance, the syntax parser will skip it. These measures prevent the processing of a paragraph of instructions from being derailed by one unknown word or grammatical construction. Often in route instructions, the more complex constructions are elaborations not strictly necessary to follow the instructions, especially if the follower is robust to gaps in the instructions. Of course, in some cases this will leave MARCO unable to interpret an instruction correctly, but then people are not always able to interpret an instruction correctly either. It is better to push forward with the well-understood part of the instructions than to attempt to integrate a poorly modeled utterance.

5.3 Content Framer

The Content Framer translates each utterance’s surface structure into a model of the surface meaning: a recursive attribute-value matrix that makes content readily accessible. The resulting content frame (see middle of Figure 5.1) models the nested structure and sense of an utterance by dropping punctuation, arbitrary text ordering, inflectional suffixes, and spelling variations. The Content Framer looks up nouns, verbs, adjectives, and adverbs in WordNet (Fellbaum, 1998; WordNet, 2005), an ontology for English.
This process abstracts over surface differences in morphology, spelling, and synonym usage. The WordNet ontology is used to look up the nearest known synonym or more abstract hypernym for each unknown word mentioned. MARCO can substitute a general concept, such as “couch,” when it does not know a sub-type, such as “futon.” This matching is currently a simple lookup for the nearest WordNet synonym set in a small domain dictionary, based on the lexical item and semantic role tagging in the parse tree.

From a semantic parse, as implemented in MARCO, the content frame is a small step. Keeping the two processes separate eases any later integration of traditional part-of-speech parsers. The content frame provides an information interface to the rest of MARCO, so changing the parser (e.g., to a statistical re-ranking parser or another semantic parser, as noted above) only involves altering the content framer. The final reason the content framer is separate is that it is simpler to correct the semantic parse trees and infer the word sense information than to hand-correct the word sense information directly.

5.3.1 Robustness to unknown words and sentence structures

When MARCO comes across a word that is not in its concept base, it searches for the nearest known synonym or more abstract hypernym using the WordNet ontology. For instance, when instructed to “face the futon,” MARCO will discover futon is not in its concept base, look it up in WordNet, find the broader concept of couch in its concept base, and stop turning when the view description matcher observes a couch. MARCO is also robust to unexpected input. If the content framer encounters a constituent that it cannot model, it will ignore it while modeling the remainder of the clause. Likewise, if the parser cannot parse one sentence from a set of route instructions, it will still parse the others. These techniques work well for two reasons.
First, route instructions often contain a great deal of redundant information, so failing to understand a phrase in one sentence is often not critical. Second, the essential information in route instructions is usually stated using a relatively small variety of verb frames for directing movements. Most of the novel sentence frames occur in the declarative descriptions between movement commands, functioning as elaborations. These are often not necessary if the imperative sentences are properly understood and applied.

5.4 Instruction Modeler

The instruction modeler translates the content frame’s representation of the surface meaning of an instruction element into an imperative model of what to do under which conditions – the procedural specification. The instruction modeler infers the imperative model from the instructions by applying linguistic knowledge of the verbs and prepositions of route instructions and spatial knowledge of how perception and action depend on the local spatial configuration in similar environments. Each verb frame is associated with a hand-coded procedure that builds a procedural specification from the content frame, based on recognizing frame arguments and idioms. The instruction modeler integrates content frames into a model of the entire instruction set. The imperative information in the instructions – what to do – is modeled as procedural specifications. The declarative information in the instructions – what to expect – is modeled as view descriptions. The procedural specifications use view descriptions to model the constraints on when to execute causal actions.

The instruction modeler builds a skeletal plan from the instructions by applying two kinds of knowledge. The follower needs both general knowledge about language – especially common imperative constructions – and domain-specific knowledge of the verbs, prepositions, and referring phrases of spatial route instructions.
Prepositions and verbs of motion in route instructions are relatively independent of where the instructions are followed (the environment) and who follows them (the agent). On the other hand, perceiving and acting on objects and object attributes are dependent on the agent and environment. The instruction modeler interprets referring phrases, declarative statements, and implied landmarks (such as a path for a travel verb) as view descriptions to be matched, while executing the task, against direct observations of views and synthesized models of scenes and the state of the instructed task. The instruction modeler interprets both the imperative commands and the declarative descriptions in the instructions as procedural specifications. A procedural specification instantiates a general reactive procedure with the verb arguments and adjuncts that are contingently achieved by other procedures. The stated and implied information about the world is modeled as view descriptions, which capture the entities mentioned and the structure of the relationships between them.

The instruction modeler also decomposes high-level commands into lower-level procedures. The concise “Take the third right to the end of the hall” is modeled and executed much as if the director had explicitly commanded, “Go down to the third place with a path to the right. Turn right there. Go down to the end of that hall.” This simplifies the execution code by separating it from the surface form of the instructions. Even a relatively explicit and simple command is interpreted as a reactive procedure composed of simpler procedures. For instance, “Take the blue hall to the chair” may require a Travel_p procedure to move to the blue hall, a Turn_p procedure to face along the blue hall towards the chair, and the explicit Travel_p procedure along the blue hall until the chair is reached.
The first Travelp to the blue hall may, in turn, require a Turnp procedure to face the blue hall and possibly a Findp procedure to locate the blue hall. Each of these procedures controls the sequencing of simple causal actions – e.g. Facep repeatedly executes the turna action to change orientation within a place, until the faced object is to the front. While currently implemented by sequencing discrete, causal actions, the representation only models the conditions and the actions. Another executor could instead directly call continuously operating control laws and execution monitors for termination, as done in other related work (Lauria et al., 2002a; Simmons et al., 2003; Tellex and Roy, 2006). The instruction modeler reasons about the semantics (meaning), anaphora (coreference resolution), and discourse pragmatics (inferring the conversational intent of an utterance) of route instruction texts. The module encapsulates the functionality of building compositional models of utterances, combining information within and between utterances.

5.4.1 Representing Referring Phrases as View Descriptions

A view description represents what the follower expects at a pose in the environment, given the implicit and explicit descriptions in the discourse. The view description is a structure modeling the relations that an expected object or structural entity has with the follower or other landmarks, given the instruction text and the follower’s spatial knowledge. For each expected landmark, the view description models the object’s type and the object’s location within the view relative to the observer (angle and distance). Additionally, the view description may model any mentioned constraints on the attributes of an entity and relationships between that entity and others. For each pose, the view description may be a list of several entities without any explicit relationships among them.
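One entity in a view description can be sketched as a small data structure. The class and field names here (`Entity`, `attrs`, `relations`) are assumptions for illustration, not the dissertation's actual data types.

```python
from dataclasses import dataclass, field
from typing import Optional

# A minimal sketch of one entity in a view description. The class and
# field names are illustrative assumptions, not the dissertation's types.
@dataclass
class Entity:
    type: str                                      # e.g. 'Path', 'Chair'
    side: Optional[str] = None                     # Left, Right, Front, At, ...
    dist: Optional[str] = None                     # distance range, e.g. '0', '1:'
    attrs: dict = field(default_factory=dict)      # appearance, length, subtype, count
    relations: list = field(default_factory=list)  # (relation, other entity) pairs

# "the green hall" -> a Path entity constrained only by appearance;
# side and dist stay unspecified, so they constrain nothing.
green_hall = Entity(type='Path', attrs={'appear': 'Green'})
```

Leaving `side` and `dist` as `None` models the minimal-commitment property described below: unspecified fields are simply absent constraints.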
The view description is a minimal model of what the follower expects: it neither over-commits to unspecified details nor enumerates all the possible worlds that would match the description. Instead, the view description hews close to what was said. For example, the until condition at the bottom of Figure 5.1 models the post-condition of the Turnp: the follower expects a Path with a Green appearance in front, but the path may be immediate in the view or off in the distance. The instruction modeler reduces a wide variety of surface forms to functionally equivalent relational models. The module builds view descriptions that combine noun phrases with prenominal modifiers (e.g. “the green hall”), postnominal modifiers (“the hallway with the grassy floor”), predicate adjectives (“The path will have grass on the ground.”), and dependent clauses (“the hallway that has green on the floor.”). The view descriptions from these phrases encode similar relationships about the grassy corridor being described, with slight differences in detail depending on the surface form of the phrasing. The view description matcher thus needs no linguistic knowledge to match the relational models to the perception of the hallways.

Position is encoded by two attributes: side and distance. These may be relative to an intersection, the follower, or another object, from the perspective of the follower. For the domain of route instructions, the default point of view is the follower. side may be Left, Right, Sides, Front, At, Left_Front, Right_Front, or unspecified. This should later be expanded by leveraging work modeling spatial prepositions and landmark positions, especially in small-scale space, such as (Blisard et al., 2006; Herskovits, 1985; Klippel and Winter, 2005; Regier and Carlson, 2001; Skubic et al., 2004b; Talmy, 2000). The attribute distance encodes the linear position front-to-back within the view.
Is the object of interest immediately close, a few intersections away, or off in the distance? (Note that encoding distance in the view is a different concept than distance along a path.) distance values are encoded as a distance range. For instance, an immediate object is ‘0’, an object at or past the next place is ‘1:’, an object a few intersections away might be ‘2:3’, and an unspecified distance can be represented as ‘0:’, though the last is usually not included in the view description, since it is not a constraint on the view. This representation of view distance is implementation-specific, but any route follower needs a similar representation of distance. At the least, the representation of view distance must be able to distinguish between immediate (here ‘0’) and distant (‘1:’). These distinguish, for instance, between facing towards an object and being at the object – the pre- and post-conditions of an until phrase. These three view distances – immediate, distant, and unspecified – account for all but 7 of the 7,015 view distances in the procedural specifications derived from our corpus. The others come from phrases such as “one space before the chair.”

Besides positional information, arbitrary other information may be encoded as an attribute of a view description. Attributes model a constraint on one aspect of the entity. The current implementation models the attributes appearance, length, subtype, and count. Attributes provide a constraint along one dimension, which may be precise (e.g. brown wooden) or broad (e.g. dark). In addition, each entity in the view description may also model relationships to other entities. Similarly to Herskovits (1985), we will call the focal entity the subject and the other entities mentioned the reference objects. For spatial prepositions, Talmy (2000) called these the Figure and Ground objects. In the current implementation, the modeled relationships are Between, Detail, Loc[ated], On, and Part[-of].
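The distance-range encoding described above can be sketched directly: a range string parses to numeric bounds, and an observed distance matches if it falls inside them. The helpers `parse_dist` and `in_range` are hypothetical, not the dissertation's implementation.

```python
# Sketch of the distance-range encoding: '0' is immediate, '1:' is at or
# past the next place, '2:3' a few intersections away. parse_dist and
# in_range are illustrative helpers, not the dissertation's code.
def parse_dist(spec):
    """Return (low, high) bounds; high=None means unbounded."""
    if ':' in spec:
        low, _, high = spec.partition(':')
        return int(low), (int(high) if high else None)
    return int(spec), int(spec)

def in_range(observed, spec):
    low, high = parse_dist(spec)
    return observed >= low and (high is None or observed <= high)
```

With this encoding, the immediate/distant distinction needed for until phrases falls out of the ranges: `'0'` matches only the current place, while `'1:'` matches anything farther.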
Between is a ternary relation that the subject is located between two other reference objects; i.e., if the follower is the subject, the reference objects will be on opposite sides of her. Detail just asserts an unspecified connection between two objects, which may be co-location (e.g. the end with the easel) or part-hood (e.g. the intersection with the brown hallway). Loc represents that the subject is co-located with (or at or in) the reference object. On represents that the subject is topologically on a reference entity, a path. Part represents that the entity is part of the other entity, though which is the part may not be apparent from the surface form. For instance, in the end of the hall, the subject the end is part of the reference entity the hall. However, in the intersection of the red and blue halls, the subject the intersection is composed of the path fragments referred to by the red and blue halls. Thus, Part only models that a part-hood relationship is present, leaving the view description matcher to interpret which entity is part of the other(s). The view description matching code must be domain-specific, so it will have better models of the domain entities than the more general instruction modeler.

Note that some of the attributes and relationships in the view description come from explicit mentions in the instruction text, while others are inferred from knowledge of the task. For instance, consider the command “go down the pink-flowered hall until you reach the red-brick hall.” Here, the type and appearance of the mentioned hallways are modeled from the text: Path(appear:[Rose]) and Path(appear:[Brick]).1 Knowledge of the Travelp command is needed to fill in the pre- and post-condition distances and positions.

1 [Rose] and [Brick] are domain-specific symbols that the view description matcher can match in the percept stream.
The path traveled along will be immediate to the follower and in front at the beginning of the procedure – Path(appear:[Rose], dist:’0’, side:[Front]). At the beginning of the Travelp, the destination path will be distant and in front of the follower (Path(appear:[Brick], dist:’1:’, side:[Front])), while at the end, it will be local, in an unspecified position (Path(appear:[Brick], dist:’0’)).

5.4.2 Representing Conditional Actions as Procedural Specifications

Route instructions require at least four low-level causal actions (Kuipers, 2000). A turna changes the agent’s orientation (pose) while remaining in the same location. A travela changes the agent’s location along a path without changing orientation. A verifya checks an observation against a description of an expected view. A declare-goala action terminates the instruction following process by declaring the agent is at the destination. Route instructions may contain other action types, such as “open the door” or “take the elevator to the 2nd floor.” These four causal actions are both necessary to follow almost all route instructions and sufficient for many route instructions.

The procedural specification captures the commands in route instructions by modeling which actions to take under which external (e.g. seeing a view) or internal conditions (e.g. estimating the distance traveled). Resolving some ambiguities is deferred until the follower observes the environmental context as it proceeds along the route. These procedural specifications are similar to Bindiganavale et al.’s Parameterized Action Representations (2000), Denis et al.’s “minimal units of information” (Daniel et al., 2003; Denis et al., 1999), Higher-Order Route Instruction Elements (Klippel et al., 2005), Spatial Routines in small-scale space (Tellex and Roy, 2006), and Navigational Information Units (Levit and Roy, 2007). Figure 5.1 shows the transformation from text to the imperative instruction model.
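A procedural specification whose pre-, while-, and until-conditions are view descriptions can be sketched with the pink-flowered/red-brick example above. The dictionary structure and the `travel_spec` helper are illustrative, not MARCO's actual classes.

```python
# Sketch of a procedural specification whose conditions are view
# descriptions, using the pink-flowered/red-brick example. The dict
# structure and travel_spec helper are illustrative, not MARCO's classes.
def travel_spec(pre=None, while_=None, until=None):
    return {'proc': 'Travel', 'pre': pre, 'while': while_, 'until': until}

# "go down the pink-flowered hall until you reach the red-brick hall":
# the text supplies types and appearances; the Travel action model fills
# in the distances and sides.
spec = travel_spec(
    pre={'type': 'Path', 'appear': 'Rose', 'dist': '0', 'side': 'Front'},
    until={'type': 'Path', 'appear': 'Brick', 'dist': '0'})
```

The unfilled `while_` slot stays `None`, mirroring how unstated conditions constrain nothing at execution time.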
Each clause is interpreted as a procedural specification, depending on the verb or on a heuristic match against the other constituents. Adverbs, verb objects, and prepositional phrases translate to pre-conditions, while-conditions, and post-conditions in procedural specifications. For instance, constituents may describe which path to take, how far to travel, or the views that will be seen during the procedure. This is similar in intent to work on combining the lexical semantics resource FrameNet with action schemas, allowing inference (Chang et al., 2002). The modeler recognizes termination conditions stated as purpose clauses (Di Eugenio, 1992), like “Turn so that you see a chair in front.” Other action verbs leave some arguments implicit; for example, “face” implies turning until the description is matched. Conditional procedures are modeled by embedding procedural specifications. Note the implicit travel procedure in “At the corner, turn left,” modeled as Turnp(direction:Left, precondition:Travelp(until:Corner(dist:’0’))).

5.5 Executor: Interleaving Action and Perception

The executor sequences causal actions given the environmental context and the state of following the route instructions. The executor interprets each procedure to execute causal actions, including verifying view descriptions to check the state of the world. For instance, given a Facep procedural specification, the executor continues to turna until a verifya reports that the view description has matched. The executor is equivalent to, and can be implemented by, the reactive task sequencing tier of the standard three-tiered intelligent architecture (Bonnasso et al., 1997). A predecessor implementation of the MARCO executor (MacMahon et al., 2004; Simmons et al., 2003) is written in TDL, the Task Description Language (Simmons and Apfelbaum, 1998). The executor algorithm need not know anything about how the follower moves through the environment or how the view descriptions are verified.
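The turn-until-verified loop described above for a Facep procedure can be sketched with stand-in callables for the causal actions; `execute_face`, `observe`, and `turn` are hypothetical names, and the equality check stands in for a real verifya match.

```python
# Sketch of executing a Face procedure: issue turn actions until a verify
# of the view description succeeds. Names are illustrative; the equality
# test stands in for real view description matching.
def execute_face(view_desc, observe, turn, max_turns=12):
    for _ in range(max_turns):
        if observe() == view_desc:   # stand-in for the verify action
            return True
        turn()                       # causal turn action within the place
    return False

# Simulated agent that comes to face the chair after two turns:
views = ['wall', 'hall', 'chair']
state = {'i': 0}
ok = execute_face('chair',
                  observe=lambda: views[state['i']],
                  turn=lambda: state.__setitem__('i', state['i'] + 1))
```

The bounded `max_turns` is one simple way to guarantee termination when the description never matches at the current place.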
Those actions and observations may be opaque to the executor stage, in an ontology it can pass through but not comprehend. Thus, an executor module may be run with differently implemented robot controllers, controlling different hardware or software robots with differing perceptual abilities.

The simplest route instruction executor is the naïve instruction queue executor. It steps through a list of instructions, attempting to execute each fully, without considering the instruction context, before moving to the next. Executing each instruction may consist of ensuring various preconditions, distance estimates, and postconditions are met. Each condition may entail moving, then verifying that the resulting view matches a view description.

The MARCO implementation uses a pragmatic instruction queue executor, with procedures that react to both the linguistic and the spatial context. Spatially, the procedures act to achieve preconditions; for instance, a Travelp procedure faces a path before moving forward. Both preconditions inherent in the procedure and those specifically mentioned by the director are treated in this way. Linguistically, the procedures execute differently based on their context in the instructions. For instance, a Travelp procedure is inserted between two consecutive turns, and unterminated Travelp procedures will look ahead for a destination in the upcoming instruction utterances. See Section 6.4 for an evaluation of the performance impact of considering the instruction context, as well as the spatial context, when following route instructions in our corpus.

This instruction following algorithm may be replaced with more sophisticated algorithms that leverage previous knowledge of the environment map, or knowledge of an individual director’s style, vocabulary, and common mistakes. The executor could also maintain more state about the route as traveled, building up a map through topological simultaneous localization and mapping (Kuipers et al., 2004).
With a local map and hypothesis tracking, the executor could implement back-tracking. This would handle some mistakes that the current MARCO cannot, both the follower’s errors (e.g. incorrect reference resolution) and the director’s (e.g. adding a spurious turn to the route).

5.5.1 Inferring procedures implicit in instructions

Implicit procedures are inferred using knowledge and reasoning about both language and the task, here large-scale spatial navigation. For instance, reading “Go down the hall to the chair,” MARCO interprets the phrase structure as along and until parameters of a Travelp procedure. Using spatial knowledge and the Travelp action model, MARCO infers the conditions of the Travelp procedure: (Pre) the path should be immediately in front and the chair should be in front in the distance, and (Post) the chair will be local to the agent.2

Both Grice’s conversational maxims (1975) and Relevance Theory (Sperber and Wilson, 2004) are linguistic theories of discourse – how sentences are strung together to form broader meaning. Each theory assumes that a cooperative speaker conveys meaning by crafting the discourse to clearly and concisely carry across the necessary concepts. We make the same broad assumptions as Grice: the director is cooperative with the follower, is descriptive enough to guide the follower (Maxim of Quantity), is generally accurate (Maxim of Quality), is relevant (Maxim of Relation), and is understandable (Maxim of Manner). However, some of these maxims must be relaxed to handle natural instructions. Because of individual variation and varying motivation, some subjects are too curt, others too prolix. The follower must be able to ignore extraneous information, but also to fill in implicit procedures. Though directors are generally accurate, they are fallible, especially in remembering turn directions and travel distances (MacMahon, 2005).
The follower must be able to account for discrepancies between the instructions and the environment. For instance, the follower should not go forward when the last command has it facing a wall, not a path. Finally, instructions may be ambiguous and under-specified a priori, but the spatial context and situated spatial reasoning will usually resolve the ambiguity.

2 Note that in earlier work, these were referred to as “implicit actions.” However, they correspond to closed-loop procedures of indeterminate length, not to a single causal action or fixed sequence of causal actions.

5.5.2 Recognizing syntactic, semantic, pragmatic, and exploratory cues

Instructions contain cues to implicit procedures across all levels of linguistic processing. Syntax, semantics, and discourse pragmatics are all needed for natural language understanding. Here, we describe recognizing the cues of implicit procedures that are primarily encoded at each level. It is difficult, and probably unnecessary, to draw a clean line demarcating the roles of syntax, semantics, and pragmatics in natural language understanding. However, the methods we present here are principled and should generalize to instruction understanding in different domains and to other applications of natural language understanding.

Syntactic Cues for Implicit Procedures

Syntactic cues are syntactic constructions that mark a condition for a procedure which may be achieved by another procedure. For instance, a locative phrase, such as “at the corner,” may require a Travelp procedure to reach the location. Similarly, a phrase describing a pose, e.g. “facing the hallway,” may require a Turnp before proceeding with the procedure in the main clause. The instruction modeler recognizes some linguistic conditional clauses as syntactic cues, e.g. “when,” “at,” and “with,” as well as purpose clauses (e.g. “Turn so that the chair is in front.”).
These conditionals are modeled as possibly requiring an embedded procedure to achieve. Syntactic cues are domain- and verb-independent. They are marked by a small set of grammatical constructions. Syntactic cues can be evaluated independently of the verb being modified: for instance, in the construction “at X, do Yp,” the agent can take actions to achieve X even without understanding Yp. For route instructions, X is a location, which may require a Travelp to achieve. In other domains, X might be some other point-like state, such as a temperature in a cooking recipe.

Though syntactic cues mark implicit procedures independently of spatial context and reasoning, the application of the cue to navigation is still a pragmatic contextual implication (Sperber and Wilson, 1986, 2004). That is, the agent can recognize through syntax that a procedure may be needed, but determining which procedure is pragmatic reasoning. Consider the sentences “Go down the hall toward the chair. At the blue hall, turn right.” Is the chair before, at, or after the blue hall during the first travel? Alternatively, the environment might constrain where the follower can turn when traveling toward the chair, and the blue hall may only be visible after the turn. In large-scale space, the follower can only determine which procedures are needed by moving through and perceiving the environment.

Semantic Cues for Implicit Procedures

Semantic cues may imply unstated procedures in two different ways. First, the semantic frame may encode a procedure (Bindiganavale et al., 2000; Chang et al., 2002; Tellex and Roy, 2006). For instance, “Take the third left,” implies a Travelp to the third intersection with a possible left turn, before the turna to the left. Second, a phrase may have the primary role of controlling the main procedure, but a secondary, implicit role of giving a precondition to satisfy.
For example, in “Walk down the hall until the chair,” the until phrase not only is the termination condition of the Travelp procedure, but also implies a Turnp to face the chair before beginning the Travelp. At the least, a chair should be possibly reachable, if not visible, in the direction faced – not, for instance, down a short dead-end path without a chair.

Semantic cues are domain-dependent, depending on the verb. Consider two similarly constructed instructions: “Turn left past the chair” and “Walk forward past the chair.” In the Turnp frame, the past phrase marks a precondition location: the chair must be behind the follower before the Turnp. In the Travelp frame, the past phrase implies a series of conditions: the chair is in front of the follower before the Travelp, beside the follower during the Travelp, and only behind the follower at the completion of the Travelp. Contrast this with the verb-independent implications of syntactic cues, such as at, explained in the last section.

Pragmatic Cues for Implicit Procedures

Pragmatic cues become important when the conventions or biases of the domain require interpreting some utterances differently than their surface meanings. Pragmatics can operate at either the utterance or the discourse level. At the level of a single utterance, idioms and conventions are interpreted differently than their syntactic and semantic models. For example, a declarative sentence with a pronoun and a locative expression may be interpreted as an imperative sentence, especially at the end of the instructions: e.g. “It’s at the end of the hall,” may mean “Go to the end of the hall.” Domain-specific discourse pragmatics can come into play to fill in procedures that are conventionally skipped in the instruction text, but are not implicit in syntactic or semantic preconditions. For instance, “Turn left and then left again,” usually implies a Travelp procedure between the two Turnps.
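The discourse heuristic just described, inserting an implicit Travel between two consecutive Turn procedures, can be sketched as a single pass over the plan; the dict-based procedure representation and the function name are illustrative, not MARCO's code.

```python
# Sketch of the discourse heuristic that inserts an implicit Travel
# between two consecutive Turn procedures ("Turn left and then left
# again"). The dict representation is illustrative, not MARCO's code.
def insert_implicit_travels(procs):
    out = []
    for proc in procs:
        if out and out[-1]['proc'] == 'Turn' and proc['proc'] == 'Turn':
            out.append({'proc': 'Travel', 'implicit': True})
        out.append(proc)
    return out

plan = insert_implicit_travels([{'proc': 'Turn', 'dir': 'Left'},
                                {'proc': 'Turn', 'dir': 'Left'}])
```

A single turn, or a turn already followed by an explicit travel, passes through unchanged.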
Additionally, pragmatic cues may trigger the combination of information from separate utterances into one procedural specification, or the interpretation of one utterance differently depending on the context of its prior and next utterances, if any. Early implementations of MARCO did not combine information across separate clauses, other than pronoun resolution (MacMahon and Stankiewicz, 2006; MacMahon et al., 2006); nor did other similar work in procedural route instruction following (Bugmann et al., 2004; Simmons et al., 2003). Each procedural specification has an index value and a pointer to the full instruction plan. MARCO has various heuristics that integrate information across utterances or act on the position of the utterance in the instruction sequence. One implemented cross-utterance pragmatic heuristic fills in a missing termination condition for a Travelp procedure (the until condition) from the linguistic context of the upcoming locative and descriptive phrases. Another heuristic inserts a Travelp forward when the final procedure is an explicit Turnp. Yet another inserts a Travelp to the next match when the until condition of a Travelp is met immediately after an explicit Turnp, e.g. “At the end of the hall, turn left and then turn right at the end of that hall.”3

Exploratory Implicit Procedures

Exploratory procedures arise when the follower acts to gain knowledge. These are not purely what have been called “knowledge-producing actions” (Scherl and Levesque, 2003), because they have side effects in the world. The above implicit procedure inferences can all be implemented in the instruction modeler or the executor. Though the executor primarily performs the procedures stated explicitly or implicitly in the route instructions, the executor also plans sequences to gain information and to achieve pre- and post-conditions of procedures. Exploratory procedures may be necessary to determine where a reference object is: e.g.
in “Go away from the chair,” the follower may Turnp to locate the chair. If the pre- and post-conditions of procedures are not met, the executor executes a contingent plan to achieve them. The procedures the follower takes depend on both the route instruction text and the text’s correspondence to the environment. For some exploratory procedures, such as “Go toward the longer end of the hall” or the above “Go away from the chair,” the follower may start in a pose satisfying the condition. However, without knowledge of what is behind, the follower cannot be certain that the condition is achieved. Thus the exploratory Turnp may undo an achieved condition in order to gain information about the world, before re-achieving the condition. Of course, if the follower already has that knowledge, no implicit procedures are necessary – the follower can proceed with the Travelp in these examples.

3 Note that the differences in determiners are not enough to differentiate between the halls, as the and that can refer to the same hallway, as in Face the blue hall and go to the end of that hall.

5.5.3 Executing an Example from the Route Instruction Corpus

Consider the sentence “Take the blue path to the chair.” Figure 5.2 shows how this instruction is applied to navigate given different maps and starting poses. Figure 5.2(a) shows the default assumption, that the previous instruction elements have moved the follower into position. If a blue path is immediately in front of the agent, it will execute the explicit Travelp procedure. In Figure 5.2(b), the blue path is visible immediately to one side, so the follower will Turnp to meet the precondition of Travelp along a path, though this procedure is not stated in the instruction (a semantic cue). In Figure 5.2(c), the blue path is visible to both sides, but the follower does not know which way the chair is.
The follower must make an exploratory Turnp to look down the blue hall in one direction; then, if it does not see the chair, Turnp around to face the chair. If the follower does not see a blue path immediately in front of it, but does see one off in the distance (Figure 5.2(d,e)), it will Travelp to the distant path, then Turnp onto it before proceeding. Figure 5.2(f) shows the agent making an exploratory Turnp to find the blue hall, a Travelp to reach it, another exploratory Turnp to find the chair, and only then the explicit Travelp command. If it does not see a blue path from any pose at its current location, it will move through the environment until it finds a match, with a small, but increasing, chance of giving up. This search behavior improves performance on poor instructions, while not significantly reducing the success rate of highly-rated instructions (MacMahon and Stankiewicz, 2006).

5.5.4 Other work on understanding implicit procedures

Most other work on implicit procedures in route instructions has focused on the semantic cues for implicit procedures, especially understanding how to achieve the preconditions of procedures. Tellex and Roy implement spatial routines that achieve the preconditions of commands within the perceptual surround of a robot (Tellex and Roy, 2006), for instance taking the spatial context into account to move to the next opening before executing a “Go right” command. Both Tellex and Roy (2006) and Simpson (2005) survey robots and smart wheelchairs that take spatial context into account when following commands. However, this and other work only accounts for single instructions and does not test when the implicit procedures are necessary in the linguistic context of a stream of instructions. Bugmann et al. (2004) implemented a robotic system capable of following programs of functional primitives from a corpus of 144 route instructions (see Section 2.4.4). They modeled 15 functional primitives, each taking an optional parameter list.
They split similar procedures into several procedures where we would have just one. For instance, they model go_untilp, exit_roundaboutp, follow_road_untilp, and take_roadp, all of which would be modeled with our Travelp procedure with various keyword parameters. Bugmann and colleagues handled some of the same implicit procedure cues as we do, although they do not break down how much each contributes to performance. They handle some semantic and syntactic cues, giving examples of “Turn left” meaning continue forward until a left turn is possible, then turn; and “At the second intersection, turn left,” meaning travel to the second intersection, then turn left. However, the examples they give as shortcomings of their agent show an inability to recognize discourse pragmatics and exploratory cues: not noticing a dead end on a wrongly described turn and not integrating information between sentences in “Pass first intersection. At the second intersection turn left.”

5.6 Robot Controller

The Robot Controller module executes the low-level turna, travela, verifya, and declare-goala actions. Robot controllers present a common interface to the executor, with domain-dependent implementations. The controller moves the agent for turnas and travelas and can verifya whether a view description matches an observation. The robot controller acts at the level of the Control tier of the 3T intelligent architecture. The robot controller in GRACE is written as commands and monitors in TDL. The robot controller is an interface, presenting the executor discrete causal actions such as turna, travela, and verifya, hiding the Control level of the SSH ontology (continuous sensorimotor experience and control laws). The turna and travela actions are translated into movement; the verifya actions into perception; and the declare-goala action into a signal to end the way-finding, possibly by stating that the robot is at the goal or by taking a special action in a simulation.
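The common controller interface can be sketched as an abstract class that concrete, domain-dependent controllers implement; the class and method names below are assumptions for illustration, not the actual TDL or MARCO interfaces.

```python
# Sketch of the controller interface the executor relies on: discrete
# causal actions with domain-dependent implementations. Class and method
# names are illustrative assumptions, not the actual interfaces.
class RobotController:
    def turn(self, direction): ...
    def travel(self): ...
    def verify(self, view_description): ...
    def declare_goal(self): ...

# A controller without peripheral vision might implement verify by
# checking each side hallway in turn; here each loop iteration stands in
# for turning to face a hall, looking, and turning back.
class NarrowFOVController(RobotController):
    def __init__(self, halls):
        self.halls = halls          # simulated side-hallway appearances
    def verify(self, view_description):
        want = view_description.get('appear')
        return any(hall == want for hall in self.halls)

ctrl = NarrowFOVController(['red', 'blue'])
```

A wide-field controller would instead implement `verify` by inspecting the periphery of a single observation, leaving the executor code unchanged.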
The actions the robot controller takes depend not only on the instructions, but also on models of the world and of the robot’s capabilities. For instance, consider the instruction “Move to the blue hall.” The executor executes the robot’s travela action and then the robot’s verifya action. A controller for a robot with peripheral vision would execute the verifya by analyzing the periphery of the view while facing along the travel direction. A controller for a robot without peripheral vision would turna to face each hallway, verifya whether it is blue, and otherwise turna back, all as part of the verifya action.

5.7 View Description Matcher

The view description matcher checks the symbolic view descriptions against sensory observations. The view description matcher treats the view description as structured constraints that the observation stream must meet. This defers handling many forms of ambiguity until the environment can provide some disambiguating context. The view description models the world by recognizable entities, such as objects or spatial configurations; their attributes, such as size or color; and the relations between them, including spatial, compositional, and logical relationships. The view description matcher is the code that grounds the symbolic relational representation in the sensory experience. This may involve resolving ambiguities among possible senses of a phrase. For instance, “Face the brick path” may mean “Face along the brick path immediately in front of you” or “Turn towards the brick path that is visible in the distance.” The view description verification code can check the observation for each of these meanings. For instance, given the instruction “Turn to face the blue path,” the view description would be Path(side:Front, appear:blue).
The view distance is unspecified, because it is unconstrained: the blue path may run forward from the agent p(distance:’0’, side:Front) or may be visible crossing this path in the distance p(distance:’1:’, side:Sides). MARCO checks for both cases while turning. The view description matcher performs unification (Russell and Norvig, 1995) between the under-specified constraints modeled in the view description and a perceptual model of the follower’s place in the environment. First, the view description is a partial description of the observations the follower should encounter at that point in executing the instructions, only mentioning some aspects of some of the landmarks the follower will encounter. Second, the descriptions of the attributes are also under-specified: a path “in front” may lead forward from the agent’s location or merely be visible in the distance. Even matching concrete referring phrases to percepts through object recognition is only a constraint: in the corpus for this dissertation, directors use “chair” to refer to three perceptually different pieces of furniture – a stool, a bench, and a dining chair – even at times when more than one is visible. Other noun phrases are even more vague, e.g. “piece of furniture” or “something.”

There is a clean interface layer between the executor and the view description matcher: the executor mostly treats the view descriptions as an opaque data type, formed by the instruction modeler and verified by the view description matcher. The executor only performs simple transformations on some parameters of the view descriptions, for instance, projecting a desired local view p(distance:’0’, side:At) into the distance p(distance:’1:’, side:Front). This is a form of perspective taking, a fundamental spatial skill in people (Schober, 1993; Trafton et al., 2005). The view description matcher will use whatever perceptual abilities the robot has available.
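The core of the matching idea, specified attributes must be consistent with the observation while unspecified ones constrain nothing, can be sketched with a naive helper. `match_entity` is an illustrative stand-in for the unification described above, not the matcher itself.

```python
# Naive sketch of matching a view description against one observed
# entity: every attribute the description specifies must agree; anything
# the description omits is not a constraint. match_entity is illustrative,
# not the dissertation's matcher.
def match_entity(description, observation):
    return all(observation.get(key) == value
               for key, value in description.items())

# "Turn to face the blue path": distance is unspecified, so both an
# immediate path and a distant one can satisfy the description.
desc      = {'type': 'Path', 'appear': 'blue', 'side': 'Front'}
immediate = {'type': 'Path', 'appear': 'blue', 'side': 'Front', 'dist': '0'}
distant   = {'type': 'Path', 'appear': 'blue', 'side': 'Front', 'dist': '1:'}
```

A full matcher would additionally unify relations between entities and handle vague type terms like “piece of furniture,” which this exact-equality sketch does not attempt.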
On a hardware robot, the concept of an intersection can be linked to the code that segments intersections in the laser scan and classifies the local path topology (e.g. as a dead end, “T”, or corner intersection (Kuipers et al., 2004)). With the simulation in this work, MARCO cannot directly observe intersection type, but must model it through the relative positions of the observed paths (see Figure ??). In these instructions, “the corner” usually refers to an “L” intersection: the intersection of two paths, each terminating at the corner. The view description matcher procedure for corner therefore looks for a path configuration (local topological map) that meets these constraints, either local to the follower or off in the distance. Other constraints, e.g. “the corner of the red and blue paths with a chair,” are applied to unify the constraints with the observation and perceptual model. The view description matcher encapsulates the following capabilities: object recognition (Modayil and Kuipers, 2006), local topology recognition and local metrical mapping (Kuipers et al., 2004), small-scale spatial referring phrase resolution (Regier and Carlson, 2001; Skubic et al., 2004b), perspective taking (Schober, 1993; Trafton et al., 2005), and visual search.

5.8 Modeling Route Instructions in the HSSH Ontology

The Spatial Semantic Hierarchy (SSH) (Kuipers, 2000) and its extension, the Hybrid Spatial Semantic Hierarchy (HSSH) (Kuipers et al., 2004; Modayil et al., 2004), are layered ontologies of space. The representations of the HSSH capture reasoning and acting in small- and large-scale spaces. Small-scale spaces are areas within the sensory surround of an agent, such as a room or an open field. The small-scale space for an agent is the area it can map by turning around and moving within its sensory horizon.
Large-scale spaces extend beyond the sensory horizon, whether because of opaque obstacles, such as in a building, or the natural limits of perception, such as in the open desert, plains, or ocean, where the agent can see to the horizon. Route instructions describe causal and topological structures annotated with metrical and rich view (object and landmark) information. A view abstracts the sensory image to a symbolic representation. In the SSH, the causal level discretizes continuous control motions into reliable discrete actions. At the causal level, motions are abstracted to either turn or travel actions. A turn_a action changes orientation within a place, while a travel_a moves the agent from one place to another. In the HSSH, the four levels are rethought as the cross-product of interpreting space locally or globally and metrically or topologically. The levels are the local metrical, local topological, global topological, and global metrical levels (Kuipers et al., 2004). In each, representations and actions at the higher (later) levels build on representations and actions at the lower levels. For instance, at the local metrical level, the agent maps obstacles and plans safe motion in the continuous world. At the local topological level, the agent recognizes, reasons about, and moves between symbolic intersections. The topological level of the Spatial Semantic Hierarchy represents the environment as places, paths, and regions, with the topological relations of connectivity, path order, boundary relations, and regional containment (Remolina and Kuipers, 2004). Parts of route instructions can be represented by the SSH causal and topological ontologies, with the actions annotated with metrical and view attributes. Route instructions include both causal actions (“Walk forward three steps.”) and topological actions (“Face along the blue path and follow it to the brick hall.”).
MARCO's executor module interprets reactive procedures in the spatial context to execute causal actions, although the robot may recognize and reason about topological entities such as paths and intersections. Each turn_a and travel_a moves the follower to the next pose, and each verify_a compares the view against the view description. However, moving up the ontology, an executor can reason about the spatial layout of the route. Explicitly reasoning at the topological level can help handle ambiguity in the language, interpretation, observation, execution, and map learning. The executor reasons at all four levels of the HSSH. In the SSH, these are the control, causal, topological, and metrical levels (Kuipers, 2000).

5.8.1 Relation to the Spatial Semantic Hierarchy

The Spatial Semantic Hierarchy is a rich but complex model for reasoning about space. Is the SSH excessive, or just complex enough for the needs of following route instructions? First, the Spatial Semantic Hierarchy is a well-developed, theoretically grounded cognitive spatial model. The SSH is supported by working implementations on mobile robots (Kuipers et al., 2004; Kuipers and Byun, 1991) and by modeling human navigation and spatial learning performance (Kuipers et al., 2003). The SSH is a lattice of representations that can model states of partial spatial knowledge, including partial maps as the environment is learned. For instance, the SSH can easily model knowing only some routes instead of the overall spatial network; knowing that two paths meet at a corner, but not the relative turn direction between the two; or knowing some path segment distances but not others. Route instructions must select certain aspects of the route to describe, and so by nature under-specify the route. Instruction texts do not often mention the absolute location of everything in the environment.
The SSH can model the director eliding a turn direction, occasionally mentioning a distance, and selectively noting landmarks and other perceptual features. The SSH has several advantages over global metrical maps, such as occupancy grids. With a known global metrical map, a robot can follow a set of route instructions, although finding the route in the metrical map will require processing on a hybrid, topological representation. However, since the global metrical map is a precise representation of what has been observed, it is difficult to reason about unseen, unknown places. To navigate with a global occupancy grid alone, the agent must know the exact dimensions of the hallways and rooms, which are never described in natural language route instructions. Generating route instructions using a metrical map is possible, but using a representation that more closely follows people's cognitive maps should produce instructions that are more robust and easier to follow. Moreover, the SSH first and fundamentally models the environment in terms of its navigational affordances: its paths, places, and intersection structure. This is exactly the sort of information described in route instructions, the navigational semantics of places on the route. To follow instructions, a high-level but semantically annotated map is necessary, where a metrically precise map of occupied space will fail. The agent must be able to recognize places, paths, and intersection types, not just the presence or absence of obstacles. The multi-tiered SSH ontology allows the follower to recognize and act at the different levels people do. See Section 6.7 for an experiment showing the necessity and impact of the different HSSH levels for following route instructions.

5.8.2 Modeling route instructions by topological maps

Route instructions fundamentally describe a route, a topological trace through a large-scale space.
Since the route is composed of topological path segments, one approach would be to model the instructions declaratively, as an under-specified topological map of the environment being traversed. Then, the follower could navigate using traditional methods for way-finding on a known map with uncertain movement (Cassandra et al., 1994; Fox et al., 1999; Kaelbling et al., 1998; Koenig and Simmons, 1996, 1998; Simmons and Koenig, 1995; Theocharous et al., 2004). Resolving linguistic and perceptual ambiguity while following route instructions is analogous to resolving perceptual aliasing. Route instructions do not completely specify the route, leaving spatial ambiguity. For instance, a turn direction may be unspecified, leaving topological ambiguity. Procedural specifications model the route instructions as a list of imperative procedures. The follower concentrates on inferring what the director intended the follower to do. However, route instructions also state or imply a spatial route layout. Another approach to following route instructions is therefore to extract the implied map of the route from the route instructions. The follower infers what the director intended the follower to know about the route map. Deriving the possible topological route maps from the route instructions is an attractive idea. Perhaps the hybrid topological SLAM algorithm in the HSSH that handles spatial and perceptual ambiguity (Beeson et al., 2007; Kuipers et al., 2004; Modayil et al., 2004) can handle the ambiguous maps derived from under-specified or linguistically ambiguous route instructions. That is, can the partial, ambiguous map of the environment derived from language understanding and the partial, ambiguous map learned from exploration be represented and reasoned about in the same way by the same cognitive processes?
Tractability of topological SLAM for route instruction following

After careful examination, we conclude that performing route instruction following by SLAM on a set of derived maps is intractable. Route instructions provide primarily local guidance information, without providing any global topological information. Moreover, the type of information that route instructions provide conflicts with the axioms of topological SLAM (Remolina and Kuipers, 2004). Instructions rarely mention all places between two turns. The topological reasoner, on the other hand, constrains search by assuming it knows all places along a traveled path segment. Therefore, a described travel procedure in route instructions cannot be treated the same as a travel trace along the described path segment by the topological mapper. Moreover, the information in the view descriptions is under-specified, leading to more perceptual aliasing of places. Finally, some of the information may be inaccurate. In fact, the least accurate information in route instructions is the causal information of turn directions and distance counts, which the topological mapper fundamentally relies upon (MacMahon, 2005). Though route instructions and sketch maps appear to be derived from the same underlying topological cognitive maps (Tversky and Lee, 1999), route instructions are less well specified. The sketch map provides a topological map of the route, with the decision points taken on the route explicitly connected by travel arcs. In other words, the number of turn and travel procedures, and the places where they begin and terminate, are explicit in complete route sketch maps. In verbal route instructions, some of these travel arcs and turn places may be implicit, so each set of route instructions may correspond to many topological route maps, and therefore many sketch maps. Even worse would be applying Markov localization in a straightforward way. Markov models rely on having a complete description of the task and environment.
For localization, this means the likelihood of moving to each pose from another given an action, and the likelihood of seeing each observation from each pose. Where route instructions under-specify the possible observations (that is, at almost all poses), we must account for this under-specification. The most direct approach treats the Markov observation set as the set of all possible feature vectors. Yet in the general case, the set of feature vectors for route instructions is unlimited, as the director may characterize various attributes of the environment at various levels of detail, approximation, and vagueness. However, Markov models require a finite set of observations, so this cannot work. A second approach is to treat the observation as a feature vector representing whether the perception matches the view description for each pose in the environment. The catch is that view descriptions are not mutually exclusive. Where one describes a chair in front, another will describe a path off to the right two intersections away. A priori, the two view descriptions may match the same pose, even if the poses are widely separated. However, other perceptions will match only one of these view descriptions. So the Markov observation must capture the set of view descriptions that the perception matches. This implies that for n poses with n different view descriptions, there will be 2^n possible observations – the power set of possible matches. The observation space explodes to the power set of combinations because different view descriptions will describe orthogonal aspects of the view, e.g. the blue hall and a chair in front. For any given percept, one, both, or neither of these descriptions could match. Describing a third attribute multiplies each combination by another true-or-false match. The exponential size of the observation space is bad enough on its own.
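The combinatorics above can be made concrete with a small calculation. The code below is purely illustrative: it enumerates the match vectors for a handful of independent view descriptions to show that the observation alphabet is the power set of descriptions.

```python
# Illustrative calculation of the observation-space explosion.
# If each percept is characterized by which of n independent view
# descriptions it matches, the observation alphabet is the power set
# of descriptions: 2^n symbols.
from itertools import product

def observation_space_size(n_view_descriptions):
    """Each description independently matches or fails, so 2^n observations."""
    return 2 ** n_view_descriptions

# Each additional described attribute doubles the observation alphabet.
assert observation_space_size(1) == 2
assert observation_space_size(10) == 1024
assert observation_space_size(20) == 1_048_576

# Equivalently, enumerate the match vectors for n = 3 descriptions,
# e.g. ("blue hall ahead", "chair in front", "path to the right"):
match_vectors = list(product([False, True], repeat=3))
assert len(match_vectors) == observation_space_size(3)  # 8 distinct observations
```

Since each description the director adds doubles this alphabet, richer instructions make the naive Markov formulation strictly harder, which is the paradox noted in the text.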
Worse, solving a POMDP is doubly exponential in the size of the observation space (Kaelbling et al., 1998). An observation space that scales exponentially with the size of the state space – the number of poses mentioned – adds a third level of exponentiation, so that solving even short routes is intractable. Worse still, this has the paradoxical result that the more detail the director provides (especially about poses between or past turn poses), the slower solving the POMDP becomes. As the director gives more information, each new bit of information multiplies the difficulty of the problem that the Markov follower must resolve, because there are more places the agent must rule out. Though the general case of performing topological SLAM on the complete route map is intractable, using Markov decision processes locally is an interesting avenue of future research. Once the agent has perceived the scene and modeled the local surround, the medium-scale space of the surrounding hallways is much better specified than from the instructions alone. The follower need not enumerate all the worlds possible from the instructions alone over the entire route. Instead, the follower can use the POMDP to make local decisions on a perceptually filled-in model of the local surround. This may still prove to be a practical way to handle locally resolvable ambiguities.

5.9 Extension to handle other sorts of ambiguity

To date in this architecture, we have applied deferred resolution of ambiguity to matching referring phrases to observations and to modeling the instruction text as an under-specified plan of action. In the first case, the ambiguities are in lexical and noun-phrase semantics and at the perceptual level. In the second case, they are in the semantics and pragmatics of verbs and sentences and at the procedural level.
These same principles for resolving ambiguity through interaction with the world can be applied to other processes in instruction following, such as selecting the best syntactic parse and pronoun resolution. Parsing is the problem of picking the most probable syntactic structure given a series of words. Pronoun resolution is the process of selecting the most probable link between pronouns and other entities in the discourse or shared context, including anaphora (matching a pronoun to an antecedent phrase), cataphora (matching to a following phrase), and exophora (matching to an entity in the world that is otherwise referenced in the discourse) (Bos, 2004; Byron et al., 2005; Kamp and Reyle, 1993). For the syntactic ambiguities of parsing, using a re-ranking parser such as that of Charniak and Johnson (2005), the executive layer can re-rank the list of most likely parses based either on higher-level semantic constraints within the utterance or on pragmatic constraints from the surrounding linguistic and task context. For instance, in “Move towards the chair on the blue floor,” the phrase “on the blue floor” may specify the location of the chair, the path for travel, or both. If the most likely initial parse is that the phrase modifies the chair, but the only visible chair is past the end of the blue path, then the system should re-rank the parse list. Similar ideas are seen in (Fleischman and Hovy, 2006; Fleischman and Roy, 2005; Skantze, 2005). Deferred resolution can be applied in a similar way to handle the semantic ambiguities of pronoun resolution. For instance, in “You should be facing flowered carpet and an easel, move to it,” does it refer to the carpet or the easel? This idea can build upon similarly motivated work applying semantic filtering and visual attention to reference resolution (Byron, 2002; Byron et al., 2005; Knees, 2002).
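The pragmatic re-ranking idea, whether applied to parse lists or to candidate pronoun antecedents, can be sketched as demoting readings that cannot be grounded in the observed scene. Everything below is a hypothetical illustration: the parse encoding, scores, and the feasibility test are assumptions of this sketch, not MARCO's code or the Charniak–Johnson parser's interface.

```python
# Sketch: re-rank an n-best list of readings by feasibility in the world.
# Each candidate pairs a parser score with a semantic reading; readings
# that cannot be grounded in the current scene are demoted.

def rerank(candidates, feasible):
    """Feasible readings first; within each group, higher parser score first."""
    return sorted(candidates, key=lambda c: (not feasible(c['reading']), -c['score']))

# "Move towards the chair on the blue floor" -- two attachment readings:
candidates = [
    {'score': 0.7, 'reading': 'chair modified by "on the blue floor"'},
    {'score': 0.3, 'reading': 'travel path is "on the blue floor"'},
]

# Suppose the only visible chair is past the end of the blue path, so the
# chair-modifier reading cannot be grounded in the scene.
infeasible = {'chair modified by "on the blue floor"'}
ranked = rerank(candidates, feasible=lambda r: r not in infeasible)
assert ranked[0]['reading'] == 'travel path is "on the blue floor"'
```

The same `rerank` skeleton applies to the “move to it” example: candidate antecedents (the carpet, the easel) would be scored by salience and filtered by whether moving to each referent is a feasible action.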
[Figure 5.2: How interpreting an utterance depends on the follower's knowledge of its pose in the environment. The example instruction is “Take the blue path to the chair,” modeled as Travel_p(along:Path(appear:Blue, side:Front), until:Chair(distance:'0', side:At)). For each combination of map and robot pose (a–f), the figure lists the implicit actions inferred and the worst-case actions taken, e.g. combinations of Explore, Precondition Turn, Turn, and Travel. The circle represents the follower, with the line indicating its front. The follower sees hallways to its side, but not down the side hallways.]

Chapter 6

MARCO Route Instruction Corpus Experiments

The route instruction language and task corpus captures how people describe large-scale space, in particular for the task of spatial way-finding through an unknown environment. The corpus measures the individual differences in how people describe a spatial route, including mistakes and omitted information. The action traces and post-hoc ratings of the human followers show how well people can apply these faulty, under-specified instructions to accomplish the concrete task of navigation. This corpus provides empirical data on the spatial, perceptual, linguistic, and executive skills needed to follow the imperfect route instructions that people actually give. By building a system that can approximate human performance, we can show what is sufficient to follow natural language route instructions. By ablating (selectively turning off) various skills, we can measure how often particular skills are necessary to follow natural language route instructions. This chapter first describes the general procedure for running MARCO through the instruction corpus.
Next, we present a series of ablation comparison experiments, examining the roles of implicit procedure inference, landmark recognition, and spatial reasoning skills in following natural language route instructions.

6.1 MARCO Followers Model the Text and Navigate

In general, MARCO follows the same basic procedure as described for the human followers in Section 4.7. For each instruction text in the corpus, MARCO is put at the starting place, facing an arbitrary direction. Like the human followers, MARCO does not have any a priori map or other knowledge of the environment or route layout. MARCO parses and models the instruction text to navigate to the destination, given first-person views of the virtual environment. When MARCO is finished following the instructions, it must explicitly take an action to terminate navigation, just as the people did. Unlike people, however, MARCO does not rate the instructions after following them.

6.1.1 Apparatus

We are interested in measuring how often different skills are necessary to follow natural language route instructions, but, at this point, not in the particular implementation of those skills, especially the perceptual skills. Therefore, MARCO navigates through a symbolic interface to the same environments that people saw as a three-dimensional scene. The correspondence between the visual and symbolic views is described in the next section. The symbolic navigation environment was coded in Python (Python, 2007). The code implements a discrete motion simulation of the environment by calculating the appropriate view and environmental observation after each action. The environmental representation used is a Markov model, though this is completely hidden from MARCO. Though all actions and observations are deterministic in the current experiment, the action model can handle both non-deterministic actions and observations. One advantage of this model is speed of simulation.
Since the model does not need to render a three-dimensional scene, as the Vizard simulation does, it can simulate a navigation run very quickly. This allows fast regression and ablation testing, running through the entire corpus in minutes per follower instance. Following 1500 instructions on a physical robot in a real-world environment in a comparable amount of time would be impossible.

6.1.2 Stimuli

The symbolic perceptions given to MARCO provide the same information, in the same structure, as the visual scenes presented to people. Figures 6.1 and 6.2 show two examples of the correspondence between the Vizard scene and the symbolic model view. In Vizard, the human participant can see bits of any local peripheral hallways and all the way down any straight hallway segment in front, until a wall blocks the view. The symbolic view represents the same information – which entities are visible to the immediate and peripheral front of the embedded way-finding agent. The symbolic view represents the base components of the environments – the object landmarks (Table 4.1) and hallway textures and wall pictures (Table 4.2).

[Figure 6.1: Human immersive virtual reality view and MARCO symbolic view of a hallway. The symbolic view is the list of tuples (End, Wall, End, Wall, Empty, Yellow), (Fish, Grass, Fish, Wall, Empty, Cement), (Fish, Grass, Fish, Wall, Hatrack, Brick). The symbols in the tuples correspond to the walls, paths, furniture, and floor textures visible at the same relative distance in the view.]

In the current experiments, MARCO receives a symbolic observation consisting of a list of tuples, one tuple per intersection visible in front. Each item in the tuple represents one of the components of a view of an intersection – the hallways or walls visible to the left, front, and right, any furniture in the intersection, and pictures visible on any corridor walls to the front. The relative ordering within each tuple corresponds to relative position within the visual scene.
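The list-of-tuples observation just described can be sketched concretely. The field names and values below are assumptions chosen for illustration; the dissertation fixes its own tuple ordering (see Figure 6.1), and this sketch only shows the shape of the representation and a simple visual-search check over it.

```python
# Illustrative sketch of a symbolic view: one tuple per visible intersection,
# ordered by distance in front of the agent. Field names are assumptions.
from collections import namedtuple

Intersection = namedtuple('Intersection', 'left front right furniture texture')

# A hypothetical view down a hallway: the nearest intersection first.
view = [
    Intersection(left='Wall', front='Open', right='Brick', furniture='Hatrack', texture='Brick'),
    Intersection(left='Grass', front='Open', right='Wall', furniture='Empty', texture='Cement'),
]

def visible(view, entity):
    """Visual search: is the named entity anywhere in the view ahead?"""
    return any(entity in tup for tup in view)

assert visible(view, 'Hatrack')      # furniture at the nearest intersection
assert visible(view, 'Grass')        # hallway visible one intersection away
assert not visible(view, 'Sofa')     # never present in this view
```

Note that, as the text explains, base constituents like `Hatrack` are represented directly, while referring phrases such as “the corner of the red and blue hallways with the chair” must be resolved compositionally against this structure.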
For instance, in Figure 6.1, the immediate intersection has a wall on the left and a brick hallway visible on the right, as does the lower line of the bottom tuple in the symbolic view. The relative ordering of the list of tuples corresponds to the relative ordering of the view down the path in front, if any.

[Figure 6.2: Human immersive virtual reality view and MARCO symbolic view of a shorter hallway. The symbolic view is the list of tuples (End, Wall, End, Wall, Empty, Wall), (Eiffel, Cement, Eiffel, Stone, Empty, Wall).]

Because the list of symbol tuples corresponds to the same entities used to derive the virtual reality visual scene, the visual and symbolic models contain nearly the same navigation information. The only difference is that the symbolic view does not model the position of object landmarks within the small-scale space of the intersection, so the symbolic view is slightly less informative than the visual scene. Note that the symbolic view represents the base constituents of the environment, but may not correspond to how the directors refer to the environment. For instance, while chair, hallway, and Eiffel Tower are all directly represented, other nouns, such as corner and furniture, are not, nor are more complex referring phrases such as “the corner of the red and blue hallways with the chair” or “the longer end of the yellow-tiled hall.” The agent has to have some knowledge of the ontology of the environment and how to match a complex referring phrase to a view, compositionally. This symbolic observation representation is an implementation detail that MARCO does not depend upon in any way. The instruction execution core is separated from the recognition and local perceptual modeling code by an abstraction layer.

6.1.3 Procedure

To follow a set of route instructions, like the human followers, MARCO is placed at the starting position, facing an arbitrary direction. It has the route instruction text and attempts to navigate to the destination.
In most of the experiments below, MARCO starts with the gold standard parse trees for the instructions, although Section 6.8 examines parsing the raw text using a learned grammar. MARCO navigates through the environment until the instructions end or until it cannot find a described view. After each trial, MARCO is placed at the next starting position and given a new set of instructions. MARCO builds up local perceptual models of the environment while traveling, for instance, to remember what is behind it. These local perceptual models are the symbolic views in each direction MARCO has seen, shifted as MARCO turns or moves forward. However, it does not build global models of the route or environment and does not remember its local models between trials. MARCO forgets the local perceptual map of one hallway as soon as it moves off of it.

6.1.4 Evaluation

We compare the empirical performance of MARCO to the performance of human followers on the instructions that people followed and rated. For the purposes of this comparison, a correct route navigation terminates at the requested destination. In some cases, the instructions contain errors or were written with the wrong destination in mind. For about 11.8% (179 of 1522) of the route instructions in the corpus, a majority of the human followers terminate at a common place other than the requested destination or the starting location. (MARCO terminates at this same non-target, non-start modal place in 81 of the 179 cases on at least one run.) Though followers who also reach this common point may be regarded as correctly following instructions that lead to a different destination, for the purposes of this study, they are incorrect. Since the human followers are held to the same standard, the comparisons are valid. We will use the performance of people following the instructions as the gold standard of how well the instructions can be followed.
This controls for both the quality of the instructions and the difficulty of the routes. Where people do poorly, we can expect the task is more difficult, so the performance of multiple followers on an instruction, and over a corpus of instructions, gives a benchmark for evaluating the performance of the system. Likewise, the performance of the full MARCO system can be used as a benchmark for the performance of ablated or differently configured versions of the system. This provides two benefits. First, these experiments measure how often a change makes a measurable difference in the performance of the system, the impact of that change. Second, these measurements give insight into how people generate, interpret, and follow route instructions, especially where the changes in performance with an ablated system match a drop in performance for some of our human followers. Let us call the performance difference, in percent from the baseline MARCO performance, the impact. For instance, if the baseline system successfully follows 66% of the route instruction corpus and an ablated version successfully follows 33% of the corpus, the impact of that feature is 50%: there is a 50% reduction in measured performance. This indicates how often the ablated feature was necessary to apply the instruction to accomplish the task, normalized by the baseline performance. The tables in this chapter present the ablation results using tables with common conventions, e.g. Table 6.3. Each row represents the mean performance over multiple runs for a group of human followers or a configuration of MARCO.

Figure 6.3: Human and full MARCO success rates, with standard error bars, versus post-hoc human subjective instruction rating. A rating of 1 indicates extremely poor instructions, while 6 is excellent. Success rate is how often, on average, the human followers (circles) and MARCO (squares) finished navigating at the intended position, for all instruction texts with mean human rating of r ± 0.125.
The graph on the left shows all data; the graphs on the right show each corpus separately. The same trends hold in all three corpora. Data as of May 31, 2007.

The columns show the results over instructions with mean post-hoc rating within the listed range. Statistical significance on a one-sided T-test is indicated by the annotations: significance at the 5% level is marked by ∗, at 1% by †, and at 0.1% by ‡. These tests and conventions are used throughout this chapter.

Table 6.1: Comparing performance of MARCO to people and to MARCO following only explicitly commanded procedures.

Quality Range   People   MARCO   MARCO w/o Implicit Procedures
1.0–2.5          35%‡     22%     5%‡
2.5–3.5          51%‡     33%     7%‡
3.5–5.0          73%‡     60%    22%‡
5.0–6.0          85%‡     79%∗   40%‡
All              64%‡     52%    20%‡
Impact          -20%       0%    62%

6.2 Full MARCO Performance

Figure 6.3 shows the evaluation results comparing people, the full MARCO, and the naïve approach of executing only the explicitly commanded procedures. Human participants were able to successfully find and identify the desired destination with an overall mean success rate of 64.3% (SEM = 0.9%; SEM is the standard error of the mean) over 1522 instructions in the three environments. This was fairly consistent across the corpora from the three experiments: Corpus 1 (M = 67.7%, SEM = 1.2%, N = 691); Corpus 2 (M = 64.0%, SEM = 1.9%, N = 432); Corpus 3 (M = 58.6%, SEM = 2.0%, N = 399). For people, the results are the mean over runs from multiple participants following each instruction set, each beginning at the start location facing a random direction. With full procedure inference, MARCO successfully followed 51.5% (SEM = 1.2%) of all 1522 route instruction texts followed by people. Further, MARCO's performance increases as the human instruction rating increases and as human performance increases. For the MARCO cases, the presented results are the mean over four runs, facing each of the four directions at the start.
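The impact measure reported in the ablation tables is a simple normalized performance drop, and can be stated as a one-line calculation. The sketch below uses the worked example from the text (66% baseline, 33% ablated); the function name is an assumption of this illustration.

```python
# Sketch of the "impact" measure: percent reduction in success rate
# relative to the baseline (full MARCO) performance.

def impact(baseline_success, ablated_success):
    """Impact in percent: how much of the baseline performance was lost."""
    return 100.0 * (baseline_success - ablated_success) / baseline_success

# Worked example from the text: 66% baseline, 33% ablated -> 50% impact.
assert impact(66.0, 33.0) == 50.0
# A feature whose removal changes nothing has zero impact.
assert impact(52.0, 52.0) == 0.0
```

A negative impact, as for the People column in the tables, simply means that group outperforms the MARCO baseline.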
While MARCO does not yet match human performance across route instructions of all qualities, the correlations from MARCO's mean performance to human mean performance (Rs = 0.958) and to human mean ratings (Rs = 0.970) are extremely strong. Over the three corpora, MARCO's performance differs significantly. MARCO was developed by examining the first corpus, so its performance there nearly matches human performance: (M = 66.0%, SEM = 1.7%), N = 691. Controlling for the routes followed, this difference from human performance is not statistically significant, t(690) = 1.31, p ≤ 0.096. On the second corpus, with hand-corrected parses and minor enhancements to MARCO, the performance is almost three-quarters of human performance: (M = 46.2%, SEM = 2.2%), N = 432, a significant difference from human performance, even controlling for the route followed: t(431) = 16.99, p ≤ 0.001. On the third corpus, with hand-corrected parses but no enhancements to MARCO after the Corpus 2 modifications, MARCO successfully followed M = 32.3% (SEM = 2.2%), N = 399, over half of human performance, even with worse instructions, significant at t(398) = 21.01, p ≤ 0.001. The performance of MARCO following only the explicitly commanded procedures is shown here as a baseline. By executing only explicitly commanded procedures, the ablated MARCO can follow only 19.8% of the 1522 instructions overall (SEM = 0.9%), and between 10% and 30% of the instructions per corpus. The next two sections break down the impact of ablating the ability to recognize different cues to implicit procedures (Section 6.4) and the impact of inferring different navigation procedures (Section 6.5).

6.3 Comparing Implicit Procedure Inference to Fundamental Explicit Navigation Procedures

To gauge how important inferring implicit procedures is to following natural route instructions, we can compare its impact to the impact of ablating fundamental spatial procedures.
The turn_a and travel_a causal actions are what move an agent through a large-scale space. By removing the ability to recognize the various ways of explicitly commanding Turn_p and Travel_p procedures, we can compare the impact of these core spatial behaviors to the linguistic behavior of inferring implicit procedures for following spatial route instructions.

Each type of causal action can be controlled by an open-loop or a closed-loop procedure. In open-loop procedures, termination is controlled by a relative offset: the procedure ends when an internal condition is met. The open-loop procedures in spatial route instructions are Turn Direction for a turn_a and Travel Distance for a travel_a. In closed-loop procedures, the procedure terminates when an external condition is met, i.e. when a view description is matched against the local perceptual model. In spatial route instructions, the closed-loop procedures are Face Description and Travel Until [Description]. Examples of open-loop causal procedures are traveling a distance of a count of intersections (e.g. "Go the third hallway") and turning to the next possible path (e.g. "Turn left"). Closed-loop causal procedures terminate when a view description is matched by observation.

[Figure 6.4: Comparing inferring implicit procedures vs. executing types of explicit procedures. The ability to infer implicit procedures is as important as the skills to execute all procedures in the instructions.]
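The open-loop versus closed-loop distinction can be sketched in code. This is an illustrative toy, not MARCO's API: the agent class and function names are ours. The open-loop procedure stops after an internal count, while the closed-loop procedure stops only when the percept matches the view description.

```python
# Toy sketch of open-loop vs. closed-loop termination (names are ours).

def travel_distance(agent, count):
    """Open-loop: 'Go the third hallway' -- stop after a fixed count."""
    for _ in range(count):
        agent.move_forward()            # terminates regardless of the view

def travel_until(agent, matches):
    """Closed-loop: 'Walk until the end of the hall' -- stop on a match."""
    while not matches(agent.view()):
        agent.move_forward()            # checks the percept at each step

class ToyAgent:
    # A 1-D hallway: the "view" is just the label at the current position.
    def __init__(self, hall):
        self.hall, self.pos = hall, 0
    def move_forward(self):
        self.pos += 1
    def view(self):
        return self.hall[self.pos]

a = ToyAgent(["open", "open", "chair", "dead end"])
travel_until(a, lambda v: v == "chair")
print(a.pos)   # 2 -- stopped where the description matched
```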
Examples of closed-loop causal procedures are turning to a view ("Face the blue path") or traveling until a description is met ("Walk forward until the end of the hall").

| Quality Range | People | Marco | w/o Turn Direction | w/o Distance Count | w/o Travel Until | w/o Implicit Procedures | w/o Face Description |
|---------------|--------|-------|--------------------|--------------------|------------------|-------------------------|----------------------|
| 1.0–2.5 | 35%‡ | 22% | 20%∗ | 9%‡ | 10%‡ | 5%‡ | 6%‡ |
| 2.5–3.5 | 52%‡ | 33% | 29% | 15%‡ | 7%‡ | 7%‡ | 7%‡ |
| 3.5–5.0 | 73%‡ | 60% | 50%‡ | 29%‡ | 26%‡ | 22%‡ | 14%‡ |
| 5.0–6.0 | 85%‡ | 79%∗ | 66%∗ | 49%‡ | 36%‡ | 40%‡ | 20%‡ |
| All | 64%‡ | 52% | 43%‡ | 27%‡ | 21%‡ | 20%‡ | 12%‡ |
| Impt | -20% | 0% | 16% | 48% | 59% | 62% | 76% |

Table 6.2: Comparing the performance of MARCO on inferring implicit procedures and on executing types of explicitly commanded procedures.

Just like the ability to infer implicit procedures, we can ablate the ability to recognize these types of explicitly commanded procedures. When ablating one type, the others remain in effect, so if the director used a redundant command, the other behavior will still be executed. For instance, given "Take the blue hall three intersections to the chair," without Travel Distance the follower will still navigate to the chair, and without Travel Until it will still navigate three intersections. Normally, with both behaviors, it will check both conditions, though favoring the more reliable, landmark-based closed-loop Travel Until when there is a conflict, after moving the described distance. When there is no redundant information, the follower will execute a fall-back default, such as traveling to the next intersection.

Here, we can see that the closed-loop procedures have the largest impact, while open-loop procedures have a smaller impact. In the route instruction task, the director does not know which way the follower is facing at the beginning of the trial. This is one reason that Face Description had the largest impact. Secondly, closed-loop procedures are both empirically and theoretically more reliable than open-loop procedures.
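The arbitration between redundant termination cues described above can be sketched as follows. This is our own simplification of the stated policy (move the commanded distance first, then let the more reliable landmark condition settle any conflict, with a one-step fall-back default); the agent class and names are hypothetical, not MARCO's code.

```python
# Hypothetical sketch of arbitrating redundant cues in
# "Take the blue hall three intersections to the chair."

class ToyFollower:
    def __init__(self, hall):
        self.hall, self.pos = hall, 0
    def move_forward(self):
        self.pos += 1
    def view(self):
        return self.hall[self.pos]

def travel(agent, count=None, until=None):
    if count is None and until is None:
        agent.move_forward()                  # fall-back: next intersection
        return
    for _ in range(count or 0):               # open-loop distance, if stated
        agent.move_forward()
    if until is not None:                     # landmark cue wins a conflict
        while not until(agent.view()):
            agent.move_forward()

# The director miscounted: the chair is four intersections away, not three.
f = ToyFollower(["start", "open", "open", "open", "chair"])
travel(f, count=3, until=lambda v: v == "chair")
print(f.view())   # chair -- the closed-loop cue corrected the distance
```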
Closed-loop procedures are inherently more reliable because they provide an independent way to check that the procedure was correctly completed. For this reason, the Spatial Semantic Hierarchy assumes both closed-loop control laws and causal actions are completely reliable (Kuipers, 2000). Empirically, the descriptions given for landmarks in closed-loop procedures are more reliable than the distance and turn direction parameters given for open-loop procedures in this corpus (MacMahon, 2005). For instance, landmark descriptions were about 10 times more reliable than turn directions (99.5% vs. 95% correct) (MacMahon, 2005). Closed-loop procedures allow the follower to recover from his or her own errors, while relying on descriptions that contain fewer mistakes by the director. (See Section 6.7 for further comparison of open-loop and closed-loop procedures.)

The surprise here is that inferring implicit procedures has a larger impact than the explicitly commanded open-loop spatial procedures of Turn Direction and Travel Distance and a comparable impact to the fundamental explicitly commanded closed-loop procedures of Travel Until and Face Description. Put another way, it is just as important to infer the implicit procedures that the director intended (what the director meant) as to execute common invocations of explicitly commanded procedures (what the director said to do).

6.4 Implicit Procedures in Route Instructions

Route instructions represent the transfer of knowledge from the director to the follower about specific spatial procedures and environment attributes. Route instructions are useful if followers reliably reach the intended destination. However, not all the procedures necessary to navigate the route are explicitly asserted as imperative commands. Some necessary procedures are left implicit, implied by syntactic, semantic, or pragmatic features of the instructions. The follower must infer the intended procedure sequence from the instruction text.
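As a concrete illustration of such inference, the sketch below (our simplification, not MARCO's implementation) inserts an implicit Travel when a Turn command's stated precondition, e.g. "At the end of the hall" in "At the end of the hall, turn right," does not hold at the follower's current pose.

```python
# Hypothetical sketch of inferring an implicit procedure from a stated
# precondition: if the follower is not yet at the described place, a
# Travel to that place is inserted before the explicit Turn.

def compile_instruction(precondition, command, at_precondition):
    """Return the procedure sequence, inserting an implicit Travel when
    the stated precondition does not already hold."""
    procedures = []
    if precondition and not at_precondition(precondition):
        procedures.append(("Travel", "until " + precondition))  # implicit
    procedures.append(command)                                   # explicit
    return procedures

# The follower is mid-hallway, so the precondition does not hold yet.
plan = compile_instruction(
    precondition="at the end of the hall",
    command=("Turn", "right"),
    at_precondition=lambda p: False,   # stub perception check
)
print(plan)  # [('Travel', 'until at the end of the hall'), ('Turn', 'right')]
```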
See Section 5.5.2 for further description of these cues and how they are implemented in MARCO. The likelihood of reaching the destination depends on both the instruction set and the follower's skills of navigation and interpretation. We present results for six types of followers: (1) human participants, (2) the full MARCO agent, (3) MARCO without syntactic cues for implicit procedures, (4) MARCO without semantic cues for implicit procedures, (5) MARCO without pragmatic cues for implicit procedures, and (6) MARCO without implicit exploratory procedures. Inferred procedures may be unnecessary, depending on the starting conditions and on how the follower interpreted and executed any previous instruction elements. This experiment measures the distribution of when the inferences are necessary. These cue types are further defined in Section 5.5.2.

[Figure 6.5: Performance for people, MARCO, and MARCO without implicit procedure cues, with standard error bars, plotted versus human instruction rating.]

Pragmatic (discourse) cues have the least effect on performance. Without pragmatic cues, the performance impact is 16%. Overall, performance without pragmatic cues was (M = 43.4%, SEM = 1.2%), N=1522, different from the full performance at t(1521) = 9.46, p ≤ 0.001. In Corpus 1, the impact was 17%; in Corpus 2, 13%; and in Corpus 3, 15%.

Without exploration procedures, the impact on MARCO's overall performance is 21%. In Corpus 1, the impact was 23%; in Corpus 2, 21%; and in Corpus 3, 32%. Exploratory procedures include procedures to gather the information needed to match a view description when the current view does not contain enough information to disambiguate. The canonical example of this is "Face the longer end of the hall." Even if the follower is already facing the longer portion of the hallway, it cannot know its state without seeing, or having seen, what is behind it. Exploratory procedures also include recovery procedures for when the follower is lost or when the described pose is visible in the distance.
Without recognizing the implicit procedures marked by syntactic cues, the impact is 38%. In Corpus 1, the impact was 41%; in Corpus 2, 38%; and in Corpus 3, 39%. This includes syntactic cues both for inferring a Travel_p implicit in a Turn_p command ("At the end of the hall, turn right.") and for a Turn_p implicit in a Travel_p procedure ("With your back to the wall, walk forward.").

Removing the recognition of semantic cues had the largest impact, decreasing performance by almost 52%. In Corpus 1, the impact was 45%; in Corpus 2, 61%; and in Corpus 3, 62%. Some semantic cues are in the preconditions of the stated procedure: for instance, in "Take the green path," a Travel_p is necessary when the green path is distant. Turns can also be implicit in a procedure's preconditions, e.g. a Turn_p to face the path in "Go down the brick hallway." An example of a verb frame encoding a procedure is "Take the second left," implying a Travel_p forward to the second intersection that has a path exiting to the left. See Appendix B.7 for more details of individual behaviors.

6.4.1 Implicit Procedure Cue Results by Rating

Ablating the various cues for implicit procedures had varying effects over the spectrum of instruction quality (Table 6.3).

| Quality Range | People | Marco | w/o Implicit Pragmatic | w/o Implicit Exploration | w/o Implicit Syntactic | w/o Implicit Semantic |
|---------------|--------|-------|------------------------|--------------------------|------------------------|-----------------------|
| 1.0–2.5 | 35%‡ | 22% | 18%† | 9%‡ | 13%‡ | 6%‡ |
| 2.5–3.5 | 52%‡ | 33% | 27%∗ | 19%‡ | 18%‡ | 11%‡ |
| 3.5–5.0 | 73%‡ | 60% | 51%† | 49%‡ | 37%‡ | 30%‡ |
| 5.0–6.0 | 85%‡ | 79%∗ | 67%∗ | 70% | 53%‡ | 46%‡ |
| All | 64%‡ | 52% | 43%‡ | 40%‡ | 32%‡ | 25%‡ |
| Impt | -20% | 0% | 16% | 23% | 38% | 52% |

Table 6.3: Comparing the performance of MARCO versions without different implicit procedure cues.

Exploration cues are particularly important for poor-quality instructions (58% impact). Often the worst instructions had gaps and errors necessitating exploration. However, exploration cues are rarely crucial to following the best-rated instructions; the 4% impact there is not significant.
Discourse pragmatic cues had an impact of only 8–19% across instructions of all qualities. However, the impact of pragmatic cues on the worst instructions was smaller (8%), significant at t(268) = 2.53, p ≤ 0.006. Inferences from pragmatic cues on poor instructions may be more challenging, or simply not yet implemented.

Semantic cues had a bigger impact on worse instructions than on better instructions, though their impact was significant across the quality spectrum, at between 37% and 74%. The impact scaled up as the instructions worsened. Since semantic cues are domain- or verb-dependent, they may require more effort to interpret, and may therefore cause people to rate an instruction worse if it requires inferring implicit procedures from semantic cues.

Inferring implicit procedures from syntactic cues has a consistently high impact across the quality spectrum. On the best instructions, the impact is 26%; for each of good, mediocre, and poor instructions, the impact is between 38% and 46%. Since syntactic cues are domain-independent, they may not require as much effort to interpret as the domain-dependent semantic cues. This would explain why the impact of inferring implicit procedures from syntactic cues is fairly evenly distributed across the quality spectrum.

| Quality Range | People | Marco | Marco w/o Implicit Procedures |
|---------------|--------|-------|-------------------------------|
| 1.0–2.5 | 35%‡ | 22% | 5%‡ |
| 2.5–3.5 | 52%‡ | 33% | 7%‡ |
| 3.5–5.0 | 73%‡ | 60% | 22%‡ |
| 5.0–6.0 | 85%‡ | 79%∗ | 40%‡ |
| All | 64%‡ | 52% | 20%‡ |
| Impt | -20% | 0% | 62% |

Table 6.4: Comparing the performance of MARCO without implicit turns and travels.

6.5 Inferring Different Types of Implicit Procedures

Besides the different cues for implicit procedures, there are also different inference products: the implicit procedures themselves. This section looks at the impact of inferring only Turn_p procedures, only Travel_p procedures, both types of procedures, and neither.
This chapter is an update of (MacMahon et al., 2006), with a further developed MARCO model and a separate evaluation corpus. The last section examined the impact of the cues for implicit procedures; this section examines the impact of the products of inferring implicit procedures. We present results for five types of followers: (1) human participants, (2) the full MARCO model, (3) MARCO without Turn_p inference, (4) MARCO without Travel_p inference, and (5) MARCO without either Turn_p or Travel_p inference (only explicitly commanded procedures). For people, the results are the mean over runs from multiple participants following each instruction set, each beginning at the start location facing a random direction. For the MARCO cases, the presented results are the mean over four runs, facing each of the four directions at the start.

Without inferring Travel_p procedures (inferring only implicit Turn_p procedures), MARCO's performance drops to 32.5% (SEM = 1.1%), an impact of 37%. Across the corpora, the impact is 29%, 52%, and 43%. In all corpora, the difference between the full MARCO and the MARCO that did not infer Travel_p procedures is significant (t(690) = 12.19, p ≤ 0.001; t(431) = 11.76, p ≤ 0.001; t(398) = 7.19, p ≤ 0.001).

If MARCO does not execute implicit Turn_p procedures, but only implicit Travel_p procedures, performance slips to 25.3% (SEM = 0.9%) of the 1522 instructions followed by people. Across the corpora, the impact is 47%, 57%, and 54%, a very consistent drop-off in performance. In all corpora, the difference between the full MARCO and the MARCO that did not infer Turn_p procedures is significant (t(690) = 20.02, p ≤ 0.001; t(431) = 13.90, p ≤ 0.001; t(398) = 9.56, p ≤ 0.001). For all corpora, implicit Travel_p procedures have less impact than implicit Turn_p procedures.

[Figure 6.6: Implicit turns have a higher impact than implicit travels. Inferring turns literally gets the agent on the right path, where it gets new information.]
This difference was significant overall (t(1521) = 7.34, p ≤ 0.001) and in Corpus 1 (t(690) = 7.82, p ≤ 0.001) and Corpus 3 (t(398) = 2.35, p ≤ 0.010), but not in Corpus 2 (t(431) = 1.25, p ≤ 0.105). It makes some sense for an implicit Turn_p to be more crucial than an implicit Travel_p, as the turn_a causal action drastically changes the information available by bringing a new path into view. A travel_a causal action, on the other hand, does not change the information available as much for an agent with distal perception.

Following purely explicit instructions, without inferring either Turn_p or Travel_p procedures, MARCO can successfully follow just 19.8% of the routes in the 1522-instruction corpus. The effects of inferring implicit Turn_p and Travel_p procedures are neither fully independent nor fully dependent; both are critical for following route instructions. Across the corpora, the impact varies between 54% and 78%. For all instructions, the difference from inferring only implicit Turn_p procedures (t(1521) = 9.21, p ≤ 0.001) and from inferring only implicit Travel_p procedures (t(1521) = 15.97, p ≤ 0.001) is significant. The difference from inferring only implicit Turn_p procedures is significant in Corpus 1 (t(690) = 12.92, p ≤ 0.001), Corpus 2 (t(431) = 8.00, p ≤ 0.001), and Corpus 3 (t(398) = 4.89, p ≤ 0.001). The impact of inferring neither implicit Turn_p nor Travel_p procedures, relative to inferring only implicit Travel_p procedures, is significant in Corpus 1 (t(690) = 5.23, p ≤ 0.001), Corpus 2 (t(431) = 8.62, p ≤ 0.001), and Corpus 3 (t(398) = 2.49, p ≤ 0.007).

6.5.1 Implicit Procedure Experiment Results by Rating

Inferring implicit procedures is essential for following nearly all of the lowest-rated instructions in this corpus, and still has an impact of nearly 50% even for following the highest-rated instructions. Table 6.3 summarizes the results graphed in Figure 6.5 across broad classes of post-hoc human subject instruction ratings.
In this discussion, r denotes the mean rating of an instruction set from the six human followers. For poor instructions, r ≤ 3.5 out of 6, MARCO is effectively crippled without inferring implicit procedures, with an impact of 77% on instructions rated 1 ≤ r ≤ 2.5 and 79% on instructions rated 2.5 < r ≤ 3.5. On good but not excellent instructions (3.5 < r ≤ 5.0), MARCO can follow a significant number of instructions without inferring implicit procedures, but performs much better when inferring procedures (impact 79%), especially Turn_p procedures (impact of 65%). Making an implicit turn_a can put the follower on the correct path, revealing a view down the path that shares very little information with views facing in other directions. A Travel_p moves the agent to a new place, but does not bring as much new information into view, since the followers are able to see distant objects.

On the best instructions, those rated r > 5, the full MARCO system and people had only a small significant difference in performance: 7%, t(778) = 1.92, p ≤ 0.028. Without inferring implicit procedures, MARCO had a significant decrease in performance even on the best instructions, with a 45% impact. Inferring implicit procedures accounts for nearly all of the success on poor instructions and about half of the success even on the best instructions. All of these trends are consistent across all three corpora.

6.6 Object, Structural and Appearance Landmarks

Landmarks are crucial to any theory of route instructions. Landmarks are the reference points that directors use to describe the route and that followers use to orient themselves to the route. To navigate to the destination, the follower must match the landmark descriptions with observations seen along the way.

6.6.1 Review of Landmark Types

Directors can refer to several types of landmarks. Stankiewicz and Kalia (2007) distinguish object landmarks and structural landmarks.
Object landmarks are discrete objects in the environment that are independent of the navigational structure of the environment, such as the pictures on the walls and the furniture on the floor. Structural landmarks are spatially extended and based on the navigational structure of the environment (how the paths connect and terminate) and include dead ends, T intersections, and descriptions of hallway length.

This distinction between object and structural landmarks is common. Hansen et al. (2006) split landmarks into Point, Linear, and Area landmarks at their "Conceptual" level; our object landmarks fall into their Point landmarks, and our structural landmarks are Linear or Area. Sandstrom et al. (1998) similarly split landmarks into "landmarks" (object landmarks) and "geometric information" (structural landmarks). Sorrows and Hirtle (1999) distinguish visual, cognitive, and structural landmarks, which Raubal and Winter (2002) formalize into a computational model accounting for "visual, semantic, and structural" attributes. Their structural landmarks correspond to ours, and our object landmarks fall into their broader visual landmark category. Siegel and White (1975) propose that these types of landmarks play primary roles at different stages in learning large-scale spatial environments.

In this analysis, we decompose landmarks by the complexity of the required perception. Like the hierarchical lattice of spatial ontologies in the Hybrid Spatial Semantic Hierarchy (HSSH) (Kuipers et al., 2004; Kuipers, 2000), simpler perceptual attributes and landmark types are required for more complex types. The landmark types examined here are:

1. Appearance Landmarks: Characterized by a single perceptual attribute, including color (red, blue), luminance (dark, light), and texture (wooden, brick).

2. Object Landmarks: Discrete, localized objects, e.g. the furniture in some intersections (chair, sofa, lamp).

3. Structural Landmarks: Describe navigable space.
We break these into:

(a) Causal Landmarks: Local, simple structural landmarks that form the components of more complex descriptions: walls, hallways, and places.

(b) Intersection Landmarks: Compound structural landmarks describing the local topology, or shape, of intersections (T intersection, dead end, path to the right).

6.6.2 Landmark Recognition Ablation Study

To test the navigation impact of different types of landmarks in this route instruction corpus, we ran an ablation study removing the ability to recognize appearance, object, intersection, and all structural landmarks.

| Quality Range | People | Marco | w/o Object Landmarks | w/o Intersection Landmarks | w/o Appearance Landmarks | w/o Structural Landmarks |
|---------------|--------|-------|----------------------|----------------------------|--------------------------|--------------------------|
| 1.0–2.5 | 35%‡ | 22% | 15%‡ | 15%‡ | 13%‡ | 9%‡ |
| 2.5–3.5 | 52%‡ | 33% | 22%‡ | 22%‡ | 18%‡ | 11%‡ |
| 3.5–5.0 | 73%‡ | 60% | 46%‡ | 46%‡ | 39%‡ | 25%‡ |
| 5.0–6.0 | 85%‡ | 79%∗ | 62%‡ | 60%‡ | 57%‡ | 37%‡ |
| All | 64%‡ | 52% | 39%‡ | 38%‡ | 33%‡ | 21%‡ |
| Impt | -20% | 0% | 25% | 25% | 35% | 58% |

Table 6.5: Comparing the performance of MARCO without the ability to recognize different kinds of landmarks.

For this study, landmarks are any description of the world along the route, what MARCO models as view descriptions. In this study, we test how much the ability to recognize each type of landmark contributes to successfully following the route instructions. Where we have ablated the landmark recognition, the recognition code always returns True, which allows matching the referring phrase by using the other types of landmark information. For instance, told to "Take the blue path," without Structural landmarks (specifically, Causal landmarks), MARCO will face something blue (in this study, always a path). Given this description without Appearance landmarks, MARCO will face a path, whether it is blue or not. Without either Appearance or Structural landmarks, any view will match. Thus, a description incorporating at least two types of landmarks (e.g.
"Face the chair on the rose hallway.") may be robust to the loss of any one, depending on the environment and on whether the additional information is elaborative (redundant) or contrastive (necessary). In this study, the three types of landmarks were each ablated separately. Additionally, we tested the result of ablating only Intersection Structural landmarks, but not Causal landmarks. Recognizing Intersection landmarks without the Causal landmarks of paths and walls is not possible, by definition.

[Figure 6.7: Success rates for MARCO without the ability to recognize types of landmarks.]

6.6.3 Landmark Recognition Ablation Results

The impact of ablating the different landmark types accords with the predictions suggested by the HSSH and with previous psychological studies (Stankiewicz and Kalia, 2007). Structural landmarks, which encode information about the navigational structure of space, have the highest impact. Appearance and Object landmarks have the next highest impact, followed by Intersection landmarks alone. Simpler landmarks and landmarks directly involved in the task have higher impact than more complex landmarks. This pattern of impact matches the expected pattern from the literature. This is evidence for a hierarchical ontology of spatial and perceptual attributes: the more fundamental entities of Causal landmarks have higher impact than landmarks built on them (Intersection) or based on attributes peripheral to the task (Object and Appearance). Structural landmarks refer to the navigational structure of space, and so the language of route instructions is structured around Structural landmarks. Over the entire tested corpus, the impact of running without the ability to recognize Structural landmarks is 58%, i.e. the ablated model succeeds 42% as often as the full MARCO. On the best-rated instructions (mean rating 5.0–6.0), the impact is 49%. For middle-rated instructions, the impact is 67% and 59% for instructions rated 2.5–3.5 and 3.5–5.0, respectively.
For the worst-rated instructions, the impact is 60%.

Appearance landmarks have less impact (35%), but more than any other remaining type of landmark. Due to the prevalence of different floor textures, and because the textures are distributed along long, straight paths, appearance references are extremely common in this study. Moreover, simple appearance references tend to be contrastive between the available paths. The impact is 22% for the best instructions, 34% for the good instructions, 47% for adequate instructions, and 44% for the worst instructions. Appearance landmarks are simple to perceive and describe.

Object landmarks have the next biggest impact in this study, at 25%, although they are relatively sparsely distributed in the environments. Nearly every hallway has a distinct floor texture, giving it a distinctive appearance, and every place can be described in terms of its intersection structure; however, only about 25% of the places in the tested environments have an object. Overall, the impact of object landmarks is more evenly distributed across the rating spectrum. For the best instructions, the impact is about 14%; for 3.5–5.0 rated instructions, the impact is 23%; for 2.5–3.5 rated instructions, the impact is 34%; and for the worst, the impact is 32%.

Intersection landmarks alone have about the same impact in this study as object landmarks. This can be explained by two factors. First, the local topology of the intersection (its shape) is a relatively complex concept. Second, many intersections are described in the instructions by the appearance of their component paths, and this is usually sufficient in these environments to uniquely identify an intersection. Intersection landmarks have an impact of 25%, with the two best-rated classes of instructions around 20% impact and the worst two classes of rated instructions around 33%.
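The ablation scheme of Section 6.6.2, in which an ablated recognizer always returns True, can be sketched as follows. This is a hypothetical reconstruction (the names are ours, not MARCO's code): with appearance recognition ablated, "Take the blue path" matches any path, as described above.

```python
# Hypothetical sketch of landmark-type ablation: an ablated recognizer
# vacuously matches, so the other landmark types carry the description.

ABLATED = set()   # e.g. {"appearance"} or {"structural"}

def recognize(kind, predicate):
    if kind in ABLATED:
        return True               # ablated: always matches
    return predicate()

def matches_blue_path(view):
    """'the blue path': an Appearance attribute on a Causal landmark."""
    return (recognize("appearance", lambda: view["color"] == "blue")
            and recognize("structural", lambda: view["is_path"]))

view = {"color": "red", "is_path": True}
print(matches_blue_path(view))    # False: wrong color
ABLATED.add("appearance")
print(matches_blue_path(view))    # True: any path now matches
```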
Note that many of the worst instructions simply state a description of the destination, in terms like "Position 7 is at the intersection of the yellow and grey floored halls." or "Position 4 is at the dead end of the yellow fish hall with yellow floors." All of these differences are consistent and significant for all rating groups of instructions and across all three corpora.

6.7 Hybrid Spatial Semantic Hierarchy

6.7.1 Review of the Hybrid Spatial Semantic Hierarchy

The original Spatial Semantic Hierarchy (SSH) (Kuipers, 2000, 2006; Remolina and Kuipers, 2004) represents large-scale space with four distinct representations: (1) control laws operating on perceptual attributes that allow reliable motion among distinctive states; (2) causal schemas that model movement between distinctive states, and views that represent what can be observed at a state, which together represent the local navigation affordances of the continuous world as a deterministic finite automaton; (3) a topological model consisting of places, paths, and regions, which represents the global navigation structure of the continuous world as a topological map or set of maps; and (4) local metrical information, such as the distances between places on a path and the angles between path segments at a place.

The Hybrid Spatial Semantic Hierarchy (HSSH) builds on the SSH by synthesizing a better representation of small-scale space (Beeson et al., 2007; Kuipers et al., 2004; Modayil et al., 2004). The local perceptual map (LPM) is a representation of the positions of objects, obstacles, and hazards in the space around the agent, such as an occupancy grid (Moravec and Elfes, 1985), built directly from ranged perception, such as a laser range finder, sonar, or vision. The LPM represents the metrical layout of navigable space in the local surround of the agent.
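A local perceptual map can be sketched as a small occupancy grid centered on the agent. This is a toy of our own: real occupancy grids (Moravec and Elfes, 1985) typically use log-odds Bayesian updates, where this sketch just nudges each cell's occupancy probability toward what a ranged percept reports.

```python
# Toy occupancy-grid LPM (our simplification, not the HSSH codebase).

SIZE = 5            # a 5x5 local surround, agent at the center
PRIOR = 0.5         # unknown cells start at even odds

def make_lpm():
    return [[PRIOR] * SIZE for _ in range(SIZE)]

def update(lpm, row, col, occupied, weight=0.3):
    """Nudge a cell toward occupied (1.0) or free (0.0) by `weight`."""
    target = 1.0 if occupied else 0.0
    lpm[row][col] += weight * (target - lpm[row][col])

lpm = make_lpm()
update(lpm, 0, 2, occupied=True)    # a wall ahead of the agent
update(lpm, 1, 2, occupied=False)   # free space in between
print(round(lpm[0][2], 2), round(lpm[1][2], 2))   # 0.65 0.35
```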
The local topological map describes a place by the structure of its navigational affordances: how the agent can move through, in, and out of the place. Specifically, the local topology of a place can be represented by the small-scale star modeling the relationships among the path segments entering and exiting the place.

The Hybrid Spatial Semantic Hierarchy provides the representations needed to model the entities and actions involved in navigating guided by route instructions. The control laws at the local metrical level provide the basic skills needed for locomotion. The causal procedures at the local topological level provide behaviors matching descriptions of turns ("Turn left" and "Take a right") and travel procedures ("Go forward twice" and "Move to the next place"). The local topological level allows the description of intersections, e.g. "Take the second left" and "at the four way intersection of the red hall and the rose halls." The global topological level is needed for reasoning about paths: for example, when told "Follow the hallway around to the chair," the follower should take forced-choice turns or default procedures to proceed along the hall until reaching the chair.

We make one distinction over previously published descriptions of the HSSH: the Causal level is split into open-loop and closed-loop behaviors. In the control level of the SSH, open-loop and closed-loop control laws (a.k.a. trajectory-following and hill-climbing control laws) describe different behaviors on the same representations. Similarly, open-loop causal behaviors have a termination condition that is a relative offset, but do not have a description of the resulting pose.
This distinction appears in the route instruction literature, usually discussed as whether or not navigation and description strategies use landmarks (Allen, 2000; Brown et al., 1998; Burnett, 2000; Dabbs et al., 1988; Denis, 1997; Geldof, 2003; Jackson, 1998; Lawton, 2001; Lovelace et al., 1999; Michon and Denis, 2001; Raubal and Winter, 2002; Tenbrink et al., 2002; Tenbrink and Klippel, 2005; Ward et al., 1986).

6.7.2 Hybrid Spatial Semantic Hierarchy Ablation Experiment

We have run an experiment examining the impact of the HSSH ontologies on following route instructions, building on the landmark recognition ablation study (Section 6.6). Here, we have categorized the landmark recognition and navigation skills by the HSSH representations each requires, as detailed below. Below, we introduce each category and then name the behaviors and representations ablated for that HSSH level.

1. Causal: Causal control laws, plus causal representations and behaviors. Causal Landmarks, Travel To Next, Travel Between Turns, Travel On Final Turn.
   (a) Open-loop causal control laws: Simple turn_a (Right|Left) and travel Forward causal actions, Distance Count.
   (b) Closed-loop causal control laws: Repeat a causal action until a view description is matched. Travel Until, Face Description, Turn Until View, Object Landmarks.
2. Local Metrical: Building and reasoning with a model of the local sensory surround. Travel To Distant View, View Memory, Perspective Taking, Face Distance.
3. Local Topological: Representations and reasoning about local navigational affordances. Recognize Structural and Intersection Landmarks, Turn Toward Path, Face Distance, Face Until.
4. Topological: Representations and reasoning about places and paths. Use Follow, Use Find, Turn Between Travels, Travel Between Turns.

6.7.3 Hybrid Spatial Semantic Hierarchy Ablation Results

The route navigation in these environments is entirely dependent on the Causal level.
Without Causal procedures and representations, hardly any of the instructions can be followed, only about 1% of the corpus (98% impact). Causal representations model moving discretely between places on a path and between paths at a place; they are fundamental to navigation through large-scale spaces.

[Figure 6.8: Route instructions depend on Hybrid Spatial Semantic Hierarchy representations with relative impact as the theory predicts.]

Closed-Loop Causal behaviors have more impact (90%) than Open-Loop Causal behaviors (54%). Part of this is that most of the instructions require a Closed-Loop Causal procedure to accomplish an initial orientation to the route, either with an explicitly commanded procedure, such as "Face the chair," or an implicit one, e.g. "With your back to the wall, turn right." Closed-Loop Causal procedures are used slightly more often over the entire corpus than Open-Loop Causal procedures: on 702 of 771 instructions versus 667 of 771 in Corpus 1 (including instructions that people did not follow). However, this is not enough to explain the discrepancy in impact.

| Quality Range | People | Marco | w/o Topological | w/o Open Loop Causal | w/o Local Topological | w/o Local Metrical | w/o Closed Loop Causal | w/o Causal |
|---------------|--------|-------|-----------------|----------------------|-----------------------|--------------------|------------------------|------------|
| 1.0–2.5 | 35%‡ | 22% | 7%‡ | 9%‡ | 6%‡ | 8%‡ | 3%‡ | 1%‡ |
| 2.5–3.5 | 52%‡ | 33% | 18%‡ | 13%‡ | 8%‡ | 8%‡ | 2%‡ | 0%‡ |
| 3.5–5.0 | 73%‡ | 60% | 47%‡ | 25%‡ | 27%‡ | 14%‡ | 7%‡ | 1%‡ |
| 5.0–6.0 | 85%‡ | 79%∗ | 69% | 43%‡ | 45%‡ | 16%‡ | 9%‡ | 1%‡ |
| All | 64%‡ | 52% | 38%‡ | 24%‡ | 23%‡ | 12%‡ | 5%‡ | 1%‡ |
| Impt | -20% | 0% | 26% | 54% | 55% | 77% | 90% | 98% |

Table 6.6: Comparing the performance of MARCO variants without different levels of the Hybrid Spatial Semantic Hierarchy.

See Section 6.4.1
Without View Memory, the follower cannot form a perceptual model of the local surround (81% impact) (Beeson et al., 2007; Kuipers et al., 2004; Modayil et al., 2004). Without Perspective Taking, an agent cannot use the local perceptual model to project what its view would be if it were at a different pose (67% impact) (Schober, 1993; Taylor and Tversky, 1996; Trafton et al., 2005). Local Metrical spatial representations and reasoning skills allow the agent to move beyond purely reactive way-finding behaviors and behaviors with only impoverished state (e.g. maintaining a Distance Count over a Travel_p). Local Metrical behaviors have less impact on the worst instructions. Since these often describe only the destination, and not the route, reacting to immediate observations is usually sufficient, when these instructions are useful at all.

Local Topological has a significant but smaller impact, reliable across instruction quality (55% impact). The highest impact components of the Local Topological ablation are Intersection Landmarks (26% impact) and Face [Travel] Until (33% impact). With this behavior, the follower turns to ensure that the destination of the current Travel_p procedure is visible to the front in the distance before starting to move forward. The follower uses Perspective Taking to project the view description of the until condition of a Travel_p into the distance, then executes a Turn_p procedure to face the projected view description. This is a Local Topological behavior, because it relies on the topological knowledge that the destination should be in front of the traveler and either visible, or at least possible if the traveler can only see a limited distance along the path. Without Local Topological models (Beeson et al., 2007; Kuipers et al., 2004; Modayil et al., 2004), the agent cannot categorize intersections into common types, such as dead ends, four-way intersections, and corners.
This ablation has its highest impact on the worst instructions, where the director may rely on intersection shape alone to distinguish a place from other places along a hallway.

Finally, the Topological ontology has a significant impact of 26%. The highest impact components of the Topological level are the high-level procedures Find_p and Follow_p and the Travel To Next heuristic. Find_p is the procedure that performs undirected search until the desired view is visible in the distance, then executes a Travel_p procedure to reach it (impact 17%). Follow_p makes forced-choice turns along a constrained path (impact 4%). Travel To Next goes to the next match when a Travel Until is matched directly after an explicit Turn_p (impact 9%), e.g. "Turn right. Go to the corner," when the right turn is also at a corner.

Overall, the ontologies line up in impact as the HSSH theory predicts, with the lower levels having larger impact than the higher levels. The Causal level is fundamental to travel through large-scale spaces. Causal behaviors can be described by both open- and closed-loop descriptions, but closed-loop behaviors are more reliable for both director and follower. Local Metrical representations allow the follower to model the local layout and to project what the view would be after moving to another place within the view (perspective taking). This allows the director to describe the world with a richer vocabulary, which reduces the number of poses that are linguistically aliased, that is, poses that can only be described in the same ways. The Local Topological level brings a common and concise vocabulary of intersection types, which allows the description of places as a whole, instead of with longer constructions detailing the positions of the component walls and paths. Finally, the Topological level allows the director to rely on the follower to infer unstated turns, reducing the communication cost without greatly reducing the likelihood of reaching the destination.
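The impact percentages above can be recovered from the overall success rates in Table 6.6. The sketch below assumes one plausible reading of the impact metric (defined in Section 6.4.1): the relative drop in success rate versus the full MARCO agent. Under this assumption, the numbers it produces match the Impt row of Table 6.6 to within rounding of the tabulated success rates.

```python
# Success rates over all instructions (the "All" column of Table 6.6).
full_marco = 0.52  # full MARCO agent

ablations = {
    "w/o Topological": 0.38,
    "w/o Open Loop Causal": 0.24,
    "w/o Local Topological": 0.23,
    "w/o Local Metrical": 0.12,
    "w/o Closed Loop Causal": 0.05,
    "w/o Causal": 0.01,
}

def impact(ablated, baseline=full_marco):
    """Relative drop in success rate caused by an ablation."""
    return 1.0 - ablated / baseline

for name, rate in ablations.items():
    print(f"{name}: {impact(rate):.0%}")
```

The same formula is consistent with the cross-validation results later in this chapter (e.g. 51% versus 67% yields roughly a 24% impact).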
6.7.4 Extensions to the Hybrid Spatial Semantic Hierarchy

The work extending the Spatial Semantic Hierarchy to the HSSH takes advantage of improved sensors, processors, and algorithms to add models of small-scale space to the SSH, which only modeled large-scale space. The ability to reliably map and localize in the immediate sensory surround greatly reduces the problems of mapping globally (Kuipers et al., 2004; Modayil et al., 2004). In the SSH, observations were abstracted to views, an opaque representation of what the agent saw at a pose. In the HSSH, observations at a place are combined into a rich model of the metrical shape and topological navigational affordances (how the agent can exit a place).

This dissertation suggests that the Hybrid Spatial Semantic Hierarchy will need to be similarly extended to include models of medium-scale space. The ontology and global maps of large-scale space cover any space larger than the sensory horizon. The ontology and local maps of small-scale space cover the local place neighborhood immediately around the agent, which the agent can fully perceive with turns or local motion while remaining completely localized. The experiments in this dissertation show that people often rely on perception and reasoning about medium-scale space: distant places, visible and accessible from the immediate place. The route instruction director relies on the follower to be able to perceive and reason about distant places that are partially visible. Medium-scale spatial reasoning and perception are required to handle explicit instructions, such as "Go toward the end of the hall with the chair," and implicit procedures for ensuring that the destination is either visible on the path ahead or that the path continues out of view.

6.8 Grammar Cross-Validation

6.8.1 Review of Cross-Validation Methodology

Cross-validation is an evaluation technique for testing machine learning systems (Mitchell, 1997).
In cross-validation, a large corpus of labeled examples is repeatedly partitioned into disjoint training and test sets. For each pair of sets, the learning algorithm is given both the data and labels in the training set to learn the model, and is then evaluated by labeling the test set. This process is repeated for different partitions of the data, so that each example in the corpus is used several times, in different training and test sets. Repetition ensures that a particular split of examples into training and test sets does not bias the results, and it also measures the variability across runs.

One commonly used methodology is "leave-one-out" cross-validation. The data set is shuffled and split into equal parts by some criterion; for instance, we will perform cross-validation by splitting the corpus by the instructions' directors and by the map the route is in. In "leave-one-out" cross-validation, the training set is all of the data other than one group (e.g. one director or map). The left-out group is the test set. The process is then repeated, using each group as the test set in turn, with the remainder comprising the training set. After training and testing on all partitions of the data, the results are averaged across runs to give the mean performance and variability.

6.8.2 MARCO Grammar Cross-Validation Experiments

Cross-validation allows us to test the generalization of the grammar and parser under several different scenarios: how consistent the grammar is across directors and across environments. Rather than splitting the route instruction corpus into training and test partitions randomly, we can split the corpus in controlled ways. We have run experiments with the corpus split by environment and by director. In each of these studies, the Probabilistic Context-Free Grammar is trained on part of the corpus and then the MARCO agent is run using the output of the parser.
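The grouped "leave-one-out" splitting described above can be sketched as follows. This is a minimal illustration, not the actual experiment code; the instruction records and their texts are hypothetical, though the director identifiers and map names follow the corpus.

```python
def leave_one_out_splits(corpus, group_key):
    """Yield (held_out, train, test) partitions, holding out one group at a time.

    `corpus` is a list of instruction records (dicts); `group_key` selects
    the grouping criterion, e.g. "director" or "map".
    """
    groups = sorted({item[group_key] for item in corpus})
    for held_out in groups:
        train = [item for item in corpus if item[group_key] != held_out]
        test = [item for item in corpus if item[group_key] == held_out]
        yield held_out, train, test

# Hypothetical miniature corpus: real director/map identifiers, invented texts.
corpus = [
    {"director": "EDA", "map": "Compact0", "text": "Walk forward twice."},
    {"director": "EDA", "map": "Sparse0", "text": "Turn left at the chair."},
    {"director": "WLH", "map": "Compact0", "text": "Go to the end of the hall."},
]

for held_out, train, test in leave_one_out_splits(corpus, "director"):
    print(held_out, len(train), len(test))
```

The same function serves both studies: passing `"map"` as the group key gives the cross-validation over environments, and `"director"` gives the cross-validation over directors.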
All other parts of the agent remain the same; only the parse trees are generated by the trained grammar instead of being loaded from hand-corrected trees. Both of these studies were run only on Corpus 1.

For the cross-validation over maps study, the grammar is trained on instructions from all directors in two of the environments and then run on the instructions from the other environment. This tests how much variability there is in the grammar across environments. Additionally, it tests the robustness of the agent to imperfect parses. In the cross-validation over directors study, the grammar is trained on instructions from five of the directors and then tested on the instructions from the remaining director. This tests how variable the instruction grammar is across the different directors. It is one measure of how similar the styles of the directors are, at least at the surface level, and gives some prediction of how well the system will do on instructions from a new director.

6.8.3 MARCO Grammar Cross-Validation Results

Overall, the cross-validation experiments show the system is fairly robust across environments and across all but one of the Corpus 1 directors. With cross-validation over maps, MARCO succeeds on 51% of the corpus, a performance impact of 24% relative to using the gold-standard parses. With cross-validation over directors, MARCO succeeds on 38% of the corpus, a performance impact of 43% relative to using the gold-standard parses. Table 6.7 and Figure 6.9 show the results across the instruction quality spectrum.

Quality Range                         1.0-2.5  2.5-3.5  3.5-5.0  5.0-6.0  All
Human                                 33%∗     46%∗     75%      85%†     69%
MARCO                                 25%      38%      73%      89%      67%
- w/ Cross-validation over Maps       22%      26%‡     55%‡     70%‡     51%‡
- w/ Cross-validation over Directors  17%∗     25%‡     41%‡     48%‡     38%‡

Table 6.7: Performance on cross-validation runs by subjective rating.
Note that in the directors cross-validation study, the PCFG actually trains on five-sixths of the corpus, while in the map cross-validation study, each training set is two-thirds of the corpus. Despite having more training examples in the director cross-validation study, the system performs significantly better in the map cross-validation. This indicates that, for these directors, language use differs less between environments than it does between directors.

In the map cross-validation study, the two highest rated classes of instructions each show about the same performance impact, roughly 24%, from parsing with a learned, imperfect syntactic model rather than with the hand-corrected gold-standard parses. For instructions rated 2.5-3.5, parsing with the model learned on the other environments has a larger impact (34%), perhaps because when the director is unsure of the route, the syntactic style changes and becomes more variable to account for differing kinds of uncertainty. On the other hand, for the worst instructions (1.0-2.5), there is no significant impact from applying the learned model. When a director can only give minimal guidance, usually he or she is only describing the destination, which may be done in the same ways across environments.

In the director cross-validation study, the impact is greater across the quality levels, but more consistent. Again the impact on the two best rated classes is similar, at about 45%. Here, the impact on the worse rated instructions is higher, at 34% for the 2.5-3.5 rated instructions and 28% for the lowest rated class, all significant differences from the gold-standard parse.

Figure 6.9: Success rates for MARCO under cross-validation across instruction quality.

Figure 6.10 shows the performance of the cross-validation learned models for each director. The performance varies considerably across directors, as some directors have a distinctive style, using sentence structures not seen from the other directors.
For instance, EDA tends to use a simple but unique sentence structure, "Walk forward COUNT." For example, no other director in Corpus 1 uses the word "twice," and EDA uses it in over half (67) of his instructions. EDA is very consistent across environments, with the map cross-validation parser yielding a 72% performance on his instructions, only an 11% impact off of the gold-standard parses. However, in the director cross-validation, MARCO has a 61% impact on EDA's instructions relative to the gold-standard, and a 61% impact even relative to the map cross-validation parser. On all instructions from directors other than EDA, the impact of learning the syntax from other directors is only 37% compared to the gold-standard parses and 13% off of the performance of MARCO with the parser trained on instructions from all directors in the other two environments.

Figure 6.10: Success rates for MARCO under cross-validation per director.

Two of the directors, TJS and WLH, share almost all of their important syntactic variation with other directors. These directors had performance impacts of only 5% and 6%, respectively, comparing MARCO's performance with a grammar trained on instructions from the other directors against its performance with a grammar trained on instructions from the other environments. The other three directors had intermediate impacts (17-24%), measuring the drop in performance from the map cross-validation to the director cross-validation. Only EDA (61%) was above the mean impact (26%) for this comparison.

6.8.4 Cross-Validation Discussion

These results, and the results of running on Corpus 2 and Corpus 3, show how MARCO handles unseen route instructions, using a naïve PCFG parser (see Section 5.2). The other experiments in this chapter show how MARCO performs on the gold-standard, hand-corrected parse trees. These two extremes, the naïve parser and the gold-standard parser, provide bounds for the expected performance of MARCO with a state-of-the-art parser.
With a better parser or more training, MARCO should perform closer to the gold-standard performance level and certainly better than our naïve parser. Section 5.2 discusses more complex parsers that might be incorporated.

The second conclusion to draw from the cross-validation studies is that the idiosyncratic individual differences between directors have a higher impact than the differences for the same set of directors between environments. Training for a particular user would allow the system to learn his or her quirks in sentence structure, word usage, and implicit procedures. On the other hand, the cross-director performance impact should be mitigated as the system trains on more types of users. MARCO performs adequately even with a naïve parser and partial training on some of the instructions. The performance profile across instruction quality has the same shape as the performance on gold-standard parses, with performance scaling linearly with quality, albeit at a lower level.

Part of the parsing problem, especially across individuals, is that the grammar training program makes no effort to generalize the grammar beyond the instances in the training set, for example by handling optional arguments or arguments in arbitrary order (Goan et al., 1996; Li and Abe, 1998; McClosky et al., 2006). Even the best parser will not have 100% accuracy; the state of the art is just over 90% per sentence (McClosky et al., 2006). When interpreting a paragraph of route instructions, there is a chance that each utterance will be misinterpreted, and that chance accumulates over the instruction set. Even one mis-parsed sentence can throw off the execution, even through something as simple as an attachment error. To use a classic example, if told to "Look for the man with a telescope," the follower will fail if looking for "a man with a telescope" instead of looking through a telescope, or vice versa.
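The accumulation of per-sentence parse errors over a paragraph can be made concrete. Assuming, purely for illustration, that errors on different sentences are independent and occur at the roughly 90% per-sentence accuracy cited above, the chance that an entire instruction set parses correctly decays geometrically with its length:

```python
per_sentence_accuracy = 0.90  # roughly the cited state-of-the-art per-sentence rate

def paragraph_accuracy(n_sentences, p=per_sentence_accuracy):
    """Probability that all sentences parse correctly, assuming independence."""
    return p ** n_sentences

for n in (1, 5, 10):
    print(f"{n:2d} sentences: {paragraph_accuracy(n):.0%}")
```

Even at 90% per sentence, a ten-sentence instruction set parses fully correctly only about a third of the time, which is why a single attachment error can be enough to derail execution.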
These parsing risks point to the need for a parser that can re-rank the possible parses in response to linguistic, spatial, and task context (Chang and Mok, 2006) (see also Section 5.2). An orthogonal approach uses dialogue acts to clarify or correct the parses, which complements both the methods in MARCO and better parsers (MacMahon, 2005; Weng et al., 2006).

6.9 Human-MARCO Discrepancy Analysis

Finally, we want to compare the performance of people and the full MARCO agent on particular instructions. We closely examined the outcome of a random selection of route instructions from all three corpora. For each director, three instructions were selected at random from the set of all instructions that were successfully followed to the intended destination, on at least one trial, by either a human follower or MARCO. This section presents the results of comparing the performance of the full MARCO agent and people following these instructions.

Since there were 30 directors in the three corpora, there were 90 instructions in the sample. MARCO and people performed about the same, within a ±25% success rate, on 46 of the 90 instructions. On 31, people outperformed MARCO, but on 13 instructions MARCO reached the destination with a success rate more than 25 percentage points higher than people. When comparing the success rates, remember we are comparing the mean success rate of a population of human users, with many strategies and individual differences, against one version of MARCO running repeatedly. If tested repeatedly, individual people might show the same pattern as MARCO of following some instructions reliably and failing reliably on others. Likewise, other variants of MARCO perform differently, and occasionally better, on particular instructions, especially MARCO variants with different heuristics or parameters for accepting loose matches of descriptions.
Rtg   Ppl  MAR  Direct  Map       S  T  Note
5.1   100  100  EDA     Compact0  6  3  Same route
4.9   88   100  EDA     Sparse0   1  3  Same route
5.4   88   100  EDA     Sparse0   4  7  Same route
4.7   57   0    EMW     Sparse0   5  3  Missed locative phrase: "empty intersection between two which contain objects"
4.6   86   100  EMW     Medium0   3  6  Same route
5.2   80   100  EMW     Medium0   7  5  Same route
5.5   100  100  KLS     Compact0  4  1  Same route
5.6   100  100  KLS     Compact0  4  5  Same route
5     100  100  KLS     Sparse0   7  1  Same route
4.7   86   100  KXP     Sparse0   6  7  Same route, one overshot
2.5   25   50   KXP     Medium0   4  1  No orient, few landmarks
4.4   89   0    KXP     Medium0   4  7  Missed intersection phrase: "stop when the first fish hallway to your left appears"
4.4   100  100  TJS     Compact0  1  5  Unnecessary search
2.7   86   75   TJS     Sparse0   3  4  No route description
4.9   100  100  TJS     Medium0   2  7  Same route
5.4   100  100  WLH     Sparse0   2  4  Same route
2.6   63   100  WLH     Sparse0   4  2  Recovers from missed anaphora
3.9   78   100  WLH     Medium0   2  6  Same route
Mean  82   82           All

Table 6.8: Discrepancy analysis for sampled instructions from Corpus 1. The Rtg column shows the post-hoc human rating for the instruction text (sample mean 4.4). The Ppl and MAR columns show the mean success rate for people and MARCO, respectively, following the instruction text. The Direct column shows the identifier for the director, Map is the environment the route traversed, S is the starting location, and T is the target location.

In Corpus 1, the development corpus, there were only two instructions where people succeeded much more often than MARCO (see Table 6.8). In both cases, MARCO was not able to interpret a rare, complex phrase describing a location. In the first case, "the empty intersection between two which contain objects," MARCO misses that the count "two" refers to other intersections, each of which contains an object. In the second, MARCO interprets "appears" too literally, as when the path is visible in the distance, not waiting until it is immediately to the left. In each case, the view description representation could represent the same concepts, and in fact, the instruction modeler
In each case, the view description representation could represent the same concepts, and in fact, the instruction modeler 146 accurately modeled other phrasings of the similar descriptions. On one route instruction set in this sample, M ARCO out-performed people by more than 25 percentage points. In this case, there is a confusing part about halfway through the instructions, where some of the human followers give up. In Corpus 2, after running M ARCO as developed using Corpus 1, we added heuristics, language recognition code, and an attribute to the view description representation to follow more instructions As Table 6.9 shows four instructions in this sample that the amended M ARCO follows with a success rate more than 25 percentage points greater than people’s success rate. Without the post-Corpus 2 modifications to M ARCO, M ARCO follows 7 of the instructions less often, and 4 more often, for a total of a 9% success rate gain on this sample (8% impact over all instructions in all corpora). People succeeded on 13 of the sampled instructions much more often than M ARCO and 19 instructions that were followed at about the same success rate. However, since M ARCO fails consistently on some instructions that people follow extremely reliably, M ARCO’s mean success rate on this sample is 26 percentage points below people’s. Note that the instructions M ARCO failed on more often are mostly poorly rated, with a mean rating of 3.7, vs 4.2 for instructions with approximately the same success rates, and 3.3 for instructions where M ARCO succeeds much more often. Where people did better on Corpus 2 instructions, in six cases, M ARCO did not correctly model a rarely occurring phrase, split evenly between phrases describing intersections, distances, and other spatial relations. In four cases, M ARCO did not properly model the discourse by combining information across utterances. Most of these combined initial descriptions with elaborations. 
This sort of discourse reasoning is complex, but necessary to achieve human performance levels. The remaining three cases are from an un-modeled word, a mis-parsed sentence, and instructions where people were able to recover from a lack of initial orientation despite few landmarks in the route description.

Table 6.10 shows the discrepancy analysis for Corpus 3. For this corpus, MARCO was run on hand-corrected parses, with the Corpus 2 enhancements, but no modifications after examining Corpus 3. Of the 36 instructions in the sample, MARCO and people succeed within 25 percentage points on 12, exactly one-third. People succeeded more often on 16 of the sampled instructions and MARCO succeeded more often on the remaining 8. Here, there is not a significant difference between the ratings among these groups.

Of the 16 instructions where people out-perform MARCO, six are from discourse modeling errors or gaps, four are from phrase modeling errors or gaps, three are from missing heuristics for under-specified routes, two are from missed words, and one each is from a missed sentence and a missed anaphoric reference. For the discourse modeling errors, four involve missing the elaboration of descriptions by separate utterances and two involve interpreting a redundant "stop" command as an implicit Travel_p. Of the phrase modeling errors, two are complex descriptions of intersections and another is a complex locative expression. In three cases, people are able to follow an under-specified route more often than MARCO, although in another three, MARCO is more successful.
In three cases (including one of the elaboration cases), MARCO interprets a command more literally than people, by rejecting an L intersection as not matching the description "dead-end" and by not matching a coat-rack near, but not at, the end of the hall as "the end where there is a coat hanger." Finally, in one case, MARCO recognizes the reference "the other" as referring to an easel, but does not recognize that it must find a different easel than the one in view.

6.10 Comparison to Related Work

MARCO is an embedded agent experiencing the world from a first-person perspective, and therefore sometimes needs to act to gain information about unseen parts of the world, not just localize itself in a completely known world model. This differentiates the problem from the "kidnapped robot problem" of localization to a known map in large-scale space (Kaelbling et al., 1998) and from relating spatial language to a map (Levit and Roy, 2007) or another scene within the perceptual horizon (Roy, 2005; Skubic et al., 2004b; Yu and Ballard, 2004). However, to build a full system to follow arbitrary instructions, MARCO would need to be extended to incorporate modules to represent, communicate, and reason about such small-scale spatial relations and actions.

Tellex and Roy (2006, 2007) implement spatial routines that achieve the preconditions of commands within the perceptual surround of a robot, for instance taking the spatial context into account to move to the next opening before executing a "Go right" command. However, this work only accounts for single commands and does not test when the implicit actions are necessary in the linguistic context of a stream of instructions. We share the same general approach as Tellex and Roy: grounding a semantic parse of spatial instructions into a hand-coded procedure that executes autonomously. We have focused on conditionals, complex commands, and series of commands across large-scale space.
Tellex and Roy have focused on executing single commands across local small-scale space and on spatial referring expressions. Small-scale spatial relations are implemented only simply in MARCO, while spatial routines are a more principled model of small-scale spatial relations, especially recognition.

6.10.1 Comparison to the Instruction-Based Learning Project

MARCO has several significant differences from Bugmann and colleagues' Instruction-Based Learning (IBL) project (see Section 2.4.4; Bugmann et al., 2004). Our ablation experiments allow us to measure how often these differences affect successfully following the route instructions. One difference is that the IBL agent did not combine information across utterances. Overall, the pragmatic cues for implicit procedures that MARCO recognizes, but the IBL agent does not, account for 15% of the successful runs of MARCO on this corpus. The IBL agent also lacks the ability to reason about some of the local topological and local metrical entities, according to the examples the authors list outside the IBL agent's capabilities (Bugmann et al., 2004). For MARCO, these capabilities together account for a 4% impact on these corpora. This is a conservative estimate of the differences between MARCO and the IBL agent.

Bugmann et al. (2004) compared human and software instruction following performance. This work compared the performance of a robot navigating through a tabletop model environment given (1) an automated system translating the instructions from speech into programs, (2) an automated system following programs translated by hand from speech, and (3) people following the same instructions.
People were able to reach the destination on 83% of the instructions, the robot followed hand-translated programs on 63% of the routes, and 14% of the routes automatically translated into programs "would actually lead the robot to the goal." Though our success rates are not directly comparable, since they start with raw speech and control a physical robot, our automated success rates are much closer to our human rates. Their environment had fewer places, paths, and strong visual features than ours, but had more diverse intersections in a realistic town street layout. Their basic instruction-following method is similar to our work, but seems less robust to errors and omissions in the instructions, due to the spatial and linguistic knowledge we model. Our work is more easily and less expensively replicated, since no special robotic equipment or physical town model is needed.

More importantly, our subjects learned the environments from the same first-person perspective as the human and software agents following the instructions, and wrote instructions from memory. Bugmann's participants only saw an outside, panoramic perspective of the town model while directing. This difference in how directors and followers learn and perceive the environments leads to a class of errors not present in our approach: directors may refer to information unavailable to followers. Conversely, while our directors may make errors while learning the map through navigation or recalling the map while directing, these errors are cognitively interesting and prevalent in the real world. (See further discussion in Chapter 3.)

Like the IBL work, we find route instructions require an open set of procedures to execute. For deployed systems, this implies several options for robust interaction. The first option is to train users on a restricted set of actions, relations, and entities with a limited vocabulary and grammar.
This is viable for long-term users of systems, especially professionals, such as urban search and rescue first responders and astronauts, and motivated users such as the disabled. For a robot or computer service that must interact with untrained users, one option is for the system to use dialogue actions to clarify or rephrase utterances that are not understood (MacMahon, 2005). MARCO does not yet model certain types of linguistic behaviors that are important to instructions, such as verb aspect (Narayanan, 1997) or negative imperatives (Vander Linden and Di Eugenio, 1996). Finally, systems could use learning techniques and context information to learn the meanings of unknown words (Regier and Carlson, 2001; Roy, 2005; Siskind, 1990, 1995; Yu and Ballard, 2004), and perhaps programming-by-demonstration to learn new procedures (Jung et al., 2006; Nicolescu and Matarić, 2003).

6.11 Conclusions from MARCO Experiments

Overall, one essential finding of this dissertation is the impact of implicit procedures in following complex procedural instructions. Inferring implicit procedures is just as important to the success of following spatial route instructions as recognizing and executing each type of explicitly commanded basic open- and closed-loop turn and travel procedure. Further, this dissertation proposes four kinds of cues to implicit procedures that will occur in all kinds of instructions: syntactic cues, semantic cues, pragmatic cues, and exploratory procedures. We measure how often each of these is necessary to follow spatial route instructions, finding that syntactic and semantic cues have the highest impact, but pragmatic and exploratory cues are also frequently crucial to infer the correct procedure from under-specified instructions.

Our ablation experiment on the spatial skills and representations at the different levels of the Hybrid Spatial Semantic Hierarchy provides empirical support for the theory.
The experiments first show that all of the spatial ontologies are necessary for following spatial route instructions. Second, the HSSH theory predicts a hierarchy of representations built up from lower levels. Our experiments find that the impact of ablating representation and reasoning at each level matches the predicted order of the spatial models, providing empirical evidence that the model captures essential characteristics of human spatial cognition.

The discrepancy analysis experiment shows several areas for improvement in MARCO. First, MARCO needs to better combine information between utterances, especially by modeling which utterances are elaborations of others. For instance, people, but not MARCO, combine this information in "Position 3 is located in the hallway of butterflies. It has wood floors all around it." Second, MARCO should handle more spatial references, perhaps by building on top of work such as (Blisard et al., 2006; Klippel and Winter, 2005; Perzanowski et al., 2003; Skubic et al., 2001, 2004b; Tellex and Roy, 2006). Third, MARCO should have a mechanism for handling small errors in landmark descriptions, such as mis-characterizing an intersection, mis-remembering an object, or approximating a spatial relation. Adding a best-fit match of the view description could also handle stochastic perception on a mobile robot with noisy sensors.

Rtg  Ppl  MAR  Direct  Map       S  T  Note
5    0    25   BKW     Medium1   3  5  Wrong turn direction
5.2  100  100  BKW     Medium1   7  2  Same route
5.3  100  100  BKW     Medium1   7  6  Same route
4    100  100  BLO     Compact0  5  6  Same route
3.5  100  0    BLO     Compact0  6  3  Imprecise: Dead end is T intersection
3    100  0    BLO     Compact0  7  6  Missed attachment of distance: "Go straight down the red brick floor two stops..."
5.3 3.5 3 2.2 5 3.7 4.3 4.8 3.2 3 5 3.5 100 100 100 17 100 67 67 100 75 100 100 100 0 100 100 50 100 100 75 75 100 0 25 0 JJL JJL JJL JNN JNN JNN JXF JXF JXF LEN LEN LEN Compact0 Compact0 Compact0 Sparse0 Sparse0 Sparse0 Compact1 Compact1 Compact1 Compact1 Compact1 Compact1 3 3 6 3 3 7 2 7 7 2 6 7 1 7 2 2 4 3 5 3 6 4 1 6 Missed implicit turn between travels Same route Same route Confusing route description Same route Same route, some overshot Handles negation Extra search w/o phrase Same route, unknown word ‘crack’ Not modeling discourse elaboration Not modeling discourse elaboration Missed distance phrase The last intersection before the T intersection 3.8 6 83 100 100 0 5.2 5 1.8 1.9 4 3 100 100 40 14 100 100 100 100 75 0 100 0 MJB MJB Sparse1 Sparse1 3 3 4 7 Same route No model for locative phrase on the square just past the first segment of grass. MJB MXM MXM MXM MXP MXP Sparse1 Sparse0 Sparse0 Sparse0 Medium0 Medium0 4 3 4 7 4 4 1 4 7 5 3 5 Same route Same route Underspecified, ambiguous route Underspecified, ambiguous route Same route No model of locative phrase the first set of butterfly pictures, 2 facing each other 2.3 5.5 67 50 25 0 MXP PXL Medium0 Medium0 7 2 4 3 No orientation, no landmarks No model of locative phrase until where you hit a gray wall in one step 2 100 0 PXL Medium0 2 5 Mis-model distance phrase one step away from the direction of the painting stand. 2.5 100 25 PXL Medium0 3 4 Mis-model intersection phrase at a 3 street corssing in which one floor is designed ... 5 5.5 5.7 1.8 4 3.3 100 50 100 50 100 100 100 100 100 25 50 100 QNL QNL QNL TXG TXG TXG Medium1 Medium1 Medium1 Sparse1 Sparse1 Sparse1 4 5 6 3 4 6 6 Same route 4 Reference to turn dir, distance error 7 Recovers from misinterpreting phrase 1 Underspecified route, Position landmarks 2 Named landmark as reference 2 Same route 153 Table 6.9: Discrepancy analysis for sampled instructions from Corpus 2. 
Rating  People %  MARCO %  Director  Env       Start  End  Discrepancy
3.5       75         0     ARL       Sparse1     3     6   No route, imprecise: dead end is an L int.
3        100         0     ARL       Sparse1     4     3   No route, discourse elaboration missed
3        100         0     ARL       Sparse1     5     6   No route, imprecise: dead end is an L int.
1        100         0     JLM       Compact0    1     2   Missed anaphora: "You are at one easel. Find the other."
4        100         0     JLM       Compact0    1     7   Discourse elaboration not modeled: "the four way intersection that meets these specificiations."
4        100       100     JLM       Compact0    4     2   Extra turn, travel ignored after termination
4          0        25     JTM       Sparse0     3     4   Distance error
3         50         0     JTM       Sparse0     4     1   Extra travel for redundant stop
5         50         0     JTM       Sparse0     6     5   Extra travel for "Move once and stop."
2         50        75     JXL       Compact1    2     4   Missing turn in instructions
2        100       100     JXL       Compact1    4     5   Negative advice
3         50         0     JXL       Compact1    7     2   Ambiguous sentence
4         50       100     KAJ       Compact1    4     2   Same route, some undershot: "move forward twice and stop until the end where there is a coat hanger." (not quite at end)
5.7       67       100     KAJ       Compact1    4     6   Same route, some undershot
2         33       100     KAJ       Compact1    7     5   Recovers from wrong turn direction
4.5       50       100     KXK       Medium0     3     7   Same route, some make extra turn
3         75       100     KXK       Medium0     7     4   Same route
2.5       25         0     KXK       Medium0     7     6   Underspecified route
1          0        50     LCT       Medium0     1     7   Underspecified route
4         50         0     LCT       Medium0     4     5   Underspecified route
1        100         0     LCT       Medium0     6     4   Missed discourse elaboration
4         50       100     MHH       Sparse1     4     1   Same route, some undershoot
3.5       50         0     MHH       Sparse1     4     5   Missed locative phrase: "until the left right before the yellow"
1         50        50     MHH       Sparse1     7     4   Missing turns in instructions
2.7       33         0     RRE       Sparse0     1     5   Incongruent, ambiguous route
1          0        75     RRE       Sparse0     1     7   Incongruent, ambiguous route
2.5       50         0     RRE       Sparse0     5     4   Incongruent, ambiguous route
2.5       50         0     SCD       Sparse0     2     6   Missed intersection phrase: "Walk to the back of this corridor."
5.5      100       100     SCD       Sparse0     5     3   Same route, reference between them
3        100        75     SCD       Sparse0     7     3   Missed phrase
3        100       100     SMA       Medium1     4     5   No route, only destination description
5        100         0     SMA       Medium1     5     1   Missed intersection phrase: "paths described as left and right are perpendicular"
2.5       33       100     SMA       Medium1     7     6   No route, only destination description
4.7      100       100     WAB       Compact0    6     2   Same route
4.7      100       100     WAB       Compact0    6     4   Same route
5        100         0     WAB       Compact0    6     5   Discourse elaboration, missed phrase: "last set of blue tiles"

Table 6.10: Discrepancy analysis for sampled instructions from Corpus 3.

Chapter 7

Conclusions

This dissertation presents the language and task corpus methodology, which investigates what is necessary to understand natural instructions in order to follow them as well as people do. We applied the language and task corpus methodology to spatial route instructions, tying together a novel instruction corpus, navigable environments, and human and artificial agents performing complex linguistic and spatial reasoning tasks. We present a software system, MARCO, that approaches human levels of performance in applying instruction texts to navigate a described route through an unknown, large-scale space. Comparing the performance of MARCO model variants, we find that inferring implicit procedures, in addition to executing explicitly commanded procedures, is essential to following poorly-rated instructions and crucial even on about half of highly-rated instructions. This dissertation shows the Hybrid Spatial Semantic Hierarchy (HSSH) applied both to human-robot interaction and to performing complex navigation tasks (Sections 6.7 and 5.8). We show that the HSSH hierarchy of spatial representations is both sufficient and necessary to represent the procedures and structural landmarks in human natural language route instructions (Sections 6.6.2 and 6.7). The representations of the HSSH allow MARCO to represent the simple and complex spatial entities, relationships, and procedures needed to execute route instructions.
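The ordering of HSSH levels referred to in the ablation experiments can be written down as a simple ordered enumeration. This is illustrative only; the level names follow the text, and the class is not MARCO's actual code.

```python
# Hedged sketch: the HSSH levels named in this dissertation, ordered from
# lower to higher. The HSSH theory predicts each level builds on the ones
# below it, which is what the ablation order tests.
from enum import IntEnum

class HSSHLevel(IntEnum):
    CAUSAL = 1
    LOCAL_METRICAL = 2
    LOCAL_TOPOLOGICAL = 3
    GLOBAL_TOPOLOGICAL = 4

def lower_levels(level):
    """Levels below the given one, on which the theory predicts it depends."""
    return [l for l in HSSHLevel if l < level]
```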
These experiments show that the lower levels of the HSSH (Causal and Local Metrical) have a very high impact on following people's natural route instructions, while the higher levels (Local Topological and Global Topological) are less often crucial to following spatial route instructions. A Global Metrical map, however, is not needed to follow route instructions. One result of this research is a software system, MARCO, that parses a route instruction text and follows the described route by executing navigation procedures. MARCO connects the linguistic information in instruction texts with semantic representations – imperative reactive procedures grounded in navigation actions and declarative structured relations grounded in active perception. One application for MARCO is as a human-robot interface for mobile robots, which would be useful in domains such as urban search and rescue, astronaut EVA assistance, in-car navigation systems, and intelligent wheelchairs.

7.0.1 Future Work

MARCO does not yet match human performance, but captures the important spatial representations and procedures for navigating through large-scale spaces. In the changes from Corpus 1, the development corpus, to Corpus 2, only five out of 17 changes altered the spatial or procedural representation; the remainder were linguistic, programming MARCO to recognize different ways of expressing the same concepts. In the discrepancy analysis (Section 6.9), nearly all the remaining differences in performance between MARCO and people are linguistic. In total, 90 instructions were analyzed over the three corpora, and on only 31 did MARCO substantially underperform people, by more than 25 percentage points. Conversely, on 13 route instructions, MARCO succeeded substantially more often.
Of the 31 where MARCO under-performed, twelve reflect syntactic parsing or phrase modeling issues (which the representation could handle, but the modeler could not), ten were discourse modeling challenges of combining descriptions across utterances, and three involved unknown words not matched in WordNet. Only four were spatial or procedural; most were cases where people were able to recover from errors in the instructions, but MARCO was not. Much work remains on the linguistic front; less work remains in modeling basic large-scale spatial relations and procedures. The MARCO architecture presented here is a snapshot of an iterative development process, not a turn-key system. Though MARCO is currently capable of following most natural instructions in the development and test corpora as well as people do, some issues remain. First, some unmet challenges remain in the initial development and testing corpora, especially in the worst-rated instructions. Additional methods are necessary if the developer wishes to equal human performance on all instructions, no matter how flawed or how rare. MARCO would be more capable if extended with bootstrap learning; planning, acting, and perceiving under uncertainty; interactive dialogue; and modeling other kinds of actions and relations, especially in small-scale space. We have shown the initial and post-analysis performance on a novel language and task corpus, documenting the changes necessary for MARCO to approach human performance on these new instructions. While novel challenges will become less frequent as the development (or training) corpus widens, further unknown challenges may lurk in unseen instructions from other directors or for related tasks. Future work could apply the findings of this dissertation – how people produce and follow route instructions – to the task of generating route instructions. Route instruction generation systems, such as map kiosks, web, phone, and in-car services, could be improved.
By recognizing and carefully describing the more difficult segments in route instructions, a system might generate route instruction texts that are more natural, easier to follow, and more reliable. Another application could check route instruction texts from other sources for completeness, clarity, coherence, conciseness, and likelihood of success. Additionally, the system could produce instructions matching the follower's preferred style.

7.1 Empirical examination of route instruction following

In the beginning of this dissertation, we proposed five questions that must be answered to build an instruction-following system for any domain. We can now answer these questions for spatial route instructions and suggest features that will generalize to other domains. To discover how people describe spatial routes in navigation instructions, we gathered a corpus of over 1500 route instructions from 30 directors. The instruction elicitation procedure was designed so that directors produce natural instructions, including natural errors and omissions, as in a naturalistic setting. Specifically, the directors were recalling an environment each learned through first-person experience, planning the route from memory, and writing free-form route instructions without limits on grammar, vocabulary, or style. To measure the quality of these instructions and establish a performance standard, we gave the instructions to 100 other people to follow. These human followers had to rely on the instructions to navigate, because they had no experience in these environments outside of following the instructions. The route instructions from our directors varied substantially in style, subjective quality, and success rate. People following the instructions reached the intended destination on 25% to 95% of trials, for the lowest- and highest-rated route instructions respectively.
The major conceptual contribution of the MARCO architecture is deferred handling of ambiguity, both for referring phrase resolution and for modeling instructions as a sequence of procedures. Referring phrases vastly under-specify the configuration, even combined with other knowledge explicit in the instructions or in unspoken shared common heuristics – common sense. Instructions likewise under-specify the sequence of procedures a follower must execute to accomplish the described task. The performance of MARCO on natural instructions shows the robustness and sufficiency of letting the environment resolve both referring phrase and procedural ambiguity. By selectively removing language, action, perception, and spatial reasoning abilities, the evaluation measured the importance of each ability for following spatial route instructions. These experiments reveal the importance of various skills for correctly following instructions. Even where MARCO does not equal human performance, the relative drop-off in performance, the impact, reveals the role of the representation, behavior, or heuristic in instruction following. From an engineering point of view, this is useful for building systems that follow instructions, especially by concentrating on high-impact components. From a scientific point of view, this is important as a computational model of route instruction following at the level of information processing. The performance of the model exposes individual differences between and within directors at the deep level of what is required to follow the instructions, rather than what is mentioned in passing.

Appendix A

Human Experiment Materials

A.1 Software configuration

To run the experiment, you need a licensed copy of the WorldViz Vizard Virtual Reality software (Vizard, 2006) installed. These experiments were run with Vizard 2.53g and Python 2.4 (Python, 2007). You will also need to install the Tkinter widget kit (Lundh, 1999).
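A quick sanity check of the software prerequisites might look like the following. This is a hypothetical helper written for a modern Python 3 interpreter, where the widget kit is imported as tkinter rather than Python 2's Tkinter; it is not part of the original experiment scripts.

```python
# Hedged sketch of a prerequisite check for the experiment software.
# The function name and messages are illustrative assumptions.
import sys

def check_environment(min_version=(2, 4)):
    """Return a list of problems found with the experiment prerequisites."""
    problems = []
    if sys.version_info[:2] < min_version:
        problems.append("Python %d.%d or newer required" % min_version)
    try:
        import tkinter  # noqa: F401 -- named 'Tkinter' under Python 2
    except ImportError:
        problems.append("Tkinter widget kit not installed")
    return problems
```

Running the check before launching the experiment scripts surfaces a missing widget kit or a too-old interpreter before any subject is seated.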
Next, check the code out of the cvs repository:

    stankiewiczlab.psy.utexas.edu:/Volumes/LabData/LabUsers/cvs

Check out the DirectionVizard module. This is an alias that will check out the experiment control scripts, the shared libraries (NavigationModules) and media (SharedMedia), and the subject information spreadsheet (SubjectInfo).

Gender  Env    Pos Set  Motion (Exp. 2)  Motion (Exp. 3)
Male    Grid   0        Discrete         Continuous
Male    Grid   1        Discrete         Continuous
Male    Jelly  0        Discrete         Continuous
Male    Jelly  1        Discrete         Continuous
Male    L      0        Discrete         Continuous
Male    L      1        Discrete         Continuous
Female  Grid   0        Discrete         Continuous
Female  Grid   1        Discrete         Continuous
Female  Jelly  0        Discrete         Continuous
Female  Jelly  1        Discrete         Continuous
Female  L      0        Discrete         Continuous
Female  L      1        Discrete         Continuous

Table A.1: Combinations for director experiments 2 (Corpus 2) and 3 (Corpus 3). ‘Grid’ refers to the Compact map, ‘L’ to the Medium map, and ‘Jelly’ to the Sparse map.

A.2 Running the experiment

First, get consent, add the subject’s demographic information to SubjectInfo/SubjectInfo.xls, and brief the subject with the appropriate instructions. Run Directions.py in Vizard. You should see the option screen shown in Figure A.1. To run a director, enter the subject’s ID and the environment name in the ‘Message’ field, as I’ve done with MTM Grid. To use the alternate set of position names (“Position Set 1”), select the HMD option. The SubjectID is the subject’s initials, with X as a middle initial if the subject has no middle name. If the SubjectID would be the same as another subject’s ID, change the middle initial. The codes for the three environments are Grid, Jelly, and L. To use the joystick or arrow keys for Continuous Movement, enter the letter ‘C’ after the environment code in the Message field.
If you enter ‘D’ or no code, the experiment will use the Discrete motion code. Allow the director to get acclimatized to the interface in one environment, then press the escape key to exit the experiment, and restart the experiment in the trial environment. The Option 2 check box can be used to skip the training phase for the director, during acclimatization or if you need to restart the experiment for the director. To run a follower, select Option 1 and enter into the Message field (1) the subject’s ID, (2) the string All for the environment name, and (3) the run number, for instance MTM All 2. To use the alternate set of position names, select the HMD option. This will give the follower one set of instructions for each route in all three environments, in a carefully specified order. If Tracker is checked, Vizard will save a snapshot of each view while navigating. After the subject finishes running, add their logs (in SubjectLogs/) to cvs and check them in.

A.3 Example Consent Form

Navigating through Complex Environments

You are invited to participate in a study of human navigation. This study is part of an ongoing research program in the area of human spatial navigation carried out by Brian Stankiewicz, Ph.D. Participation in this study is completely voluntary. If you decide to participate, you will be one of approximately fifty people in the study. If you decide to participate, the total length of the experiment is approximately one hour. For your participation you will receive course credit. This experiment is a study in human spatial navigation and involves learning the spatial layout of and traveling through environments that will be displayed on a computer screen. The study’s purpose is to inform us about how humans navigate through complex environments. The risks in the study are no more than those encountered while working on a computer.
If you are experiencing any physical discomfort, please inform the experimenter immediately and the experimenter will try to alleviate any discomfort. Any information that is obtained in connection with this study and that can be identified with you will remain confidential and will be disclosed only with your permission. Oftentimes in this line of research, subjects’ initials are used to differentiate between different subjects in a research paper. Your name will not be associated with these initials in any paper or presentation. Your decision to participate or not to participate will not affect your present or future relationship with The University of Texas at Austin. If you have any questions about the study, please ask me. If you have any questions later, you may call Professor Brian Stankiewicz at 512-232-9373. If you have any questions or concerns about your treatment as a research participant in this study, call Professor Lisa Leiden, Ph.D., Chair of The University of Texas at Austin Institutional Review Board for the Protection of Human Subjects, 512-471-8871 or email: orsc@uts.cc.utexas.edu. You will be given a copy of this consent form for your records. You are making a decision whether or not to participate. Your signature below indicates that you have read the information provided above and have decided to participate in the study. If you later decide that you do not want to participate in the study, simply tell me. You may discontinue your participation in this study at any time.

___________________________________________
Printed Name of Participant

___________________________________________   ________________________
Signature of Participant                      Date

___________________________________________   ________________________
Signature of Investigator                     Date

A.4 Director Guide

This study investigates how people give route instructions in an indoor environment. The study will be conducted in an unfamiliar virtual building.
During the experiment you will participate in three basic tasks: (1) Exploration, (2) Navigation Quiz, and (3) Giving route instructions.

A.4.1 Exploration

Before giving instructions you will explore a virtual building. You will move through the building by making key presses. Using the number pad, you will move forward by pressing the ‘8’ key, rotate right by pressing the ‘6’ key, and rotate left by pressing the ‘4’ key. As you navigate through the building you will hear a series of ‘target locations’, specified by a voice announcing the position number when you walk over the position. For example, you might hear “Position 2” when you walk into an intersection. Your goal in this task is to learn the building well enough to navigate and give route instructions between these target locations. You will have 120 forward movements to learn the building.

A.4.2 Navigation Quiz

After the building exploration phase, we will quiz how well you can navigate through the environment. You will be placed at one of the target positions. Press the space bar to show the environment. You can turn around to orient yourself, but you cannot move forward yet. When you are ready to move, press the ‘0’ (zero) key on the numeric keypad. The computer will then instruct you to move to a particular target location, e.g., “Go to Position 4.” Your task is to move to that target location taking the shortest distance path. The announcement sound for the target location is turned off. When you have done your best to reach the goal, press the space bar. If you were not at the correct target, the screen will turn red. If you reached the target location by a longer than necessary path, the screen will turn blue. If you go to the target location by taking a short path, the screen will turn green. You will then be transported to a new location for another trial. Press the space bar again to show the environment.
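The screen-color feedback rule described in the Navigation Quiz can be summarized in a few lines. This is a hypothetical helper for illustration; the function and parameter names are not from the experiment scripts.

```python
# Hedged sketch of the quiz feedback rule: red for the wrong target,
# blue for the right target via a longer-than-necessary path, and
# green for the right target via a short path.
def quiz_feedback(reached, target, path_length, shortest_length):
    """Return the screen color shown after a navigation quiz trial."""
    if reached != target:
        return "red"
    if path_length > shortest_length:
        return "blue"
    return "green"
```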
When you have shown you can efficiently navigate through the environment (i.e., take the shortest path for seven straight test trials), you will move on to the next phase of the experiment. If after 25 trials you still need more experience, you will participate in another Exploration phase followed by another Navigation Quiz.

A.4.3 Giving Route Instructions

After you’ve passed the Navigation Quiz, you will be asked to give route instructions. As in the Navigation Quiz, you will be placed at one of the target positions and be allowed to turn around to orient yourself. Press the space bar again to show the environment. The computer will tell you which target location you are currently at in the environment (e.g., “Position 2”) and tell you to “Turn around,” so you can see the position. When you have re-oriented yourself and are ready to give instructions, press the ‘0’ key on the numeric keypad. If you recognize or remember the position, you do not need to turn around, and can press ‘0’ immediately. Until you press ‘0’, you can turn but not move forward. A text entry window will appear asking for route instructions from the current location to another target location, e.g., “Enter directions to get from Position 2 to Position 6.” These instructions will be given to another subject who does not know this environment, but has experience in a practice environment. Your task is to write instructions that will reliably direct this subject from the current position to the specified target location. Use complete sentences and end them with periods. Each set of directions should stand alone. The followers will receive your and others’ directions in a mixed order. Do your best if you cannot give complete or exact instructions. The other people following your instructions will be placed at the starting position facing any one of the four directions. The direction the follower will start facing is not related to the directions you face while orienting yourself at the location.
After finishing the instructions, you will be asked to move to that target location and press the space bar when you are done. After moving, you will answer two questions: (1) How certain are you that you’ve reached the target location? (2) How good do you think your instructions were? You will alternate between giving instructions, navigating through the environment, and rating your performance until all positions have been used. To familiarize you with the experiment you will run through a practice environment. Use this time to become comfortable with how to move through the environment and how to give instructions. If you have any questions, please ask at any time.

A.5 Director Key Meanings

8      Move Forward
4      Turn Left
6      Turn Right
0      Done with Turning Around
space  Show Environment / At Destination
t      Repeat the current target position
w      Which hotspot

A.6 Instruction Follower Guide

This study investigates how people follow route instructions in an indoor environment. We are interested in finding out what makes a good set of route instructions and how people follow instructions, even when the instructions contain mistakes. The study will be conducted in a virtual building. During the experiment you will read and follow sets of route instructions that guide you through the building.

A.6.1 Movement Controls

You will move through the building by making key presses. Using the number pad, you will move forward by pressing the ‘8’ key, rotate right by pressing the ‘6’ key, and rotate left by pressing the ‘4’ key.

A.6.2 Following Route Instructions

You will follow sets of route instructions that someone else gave. These instructions vary in how and how well they describe the route and destination, so just try to follow each to the best of your ability. The computer will display a set of instructions on the computer screen in a popup window. Read through the instructions and click OK when you are ready. Then, you will be placed in the environment at the starting position.
Some of the instructions may have mistakes or even be blank. Try your best to get to the destination, even if the instructions are incomplete or there are small mistakes (for instance, it says “left” when there is only a right). You may not be facing the same direction as the direction giver was when the instructions were written. You can reread the instructions while you are moving by pressing the ‘d’ (instructions) key. When you have done your best to navigate to the end position, press the space bar. After moving, you will answer two questions on popup windows: (1) How certain are you that you’ve reached the target location? (2) How good do you think the instructions were? After you’ve answered both questions, the screen will be blank again. When you are ready for the next set of instructions, press the space bar again. You can take a break any time you like while the screen is blank between each direction set. To familiarize you with the experiment you will be run in a set of “practice” sessions. You will be following instructions in a small environment. The purpose of these practice sessions is to get you familiar with how to move through the environment and how to follow instructions. If you have any questions, please ask at any time. Once you are comfortable with the experiment, we will start the full set, which is about 120 sets of instructions.

A.7 Follower Key Meanings

8      Move Forward
4      Turn Left
6      Turn Right
space  Next Directions / At Destination
d      Reshow the directions while moving

The starting dialog provides check boxes for Position Group 1 (0), for Followers (Directors), to skip Training, for Fullscreen Display, and to take snapshots, along with a required SubjID Env field and optional [subjNum][startNum] fields.

Figure A.1: Vizard Experiment Starting Dialog

Appendix B

MARCO Ablation Options

This section describes the ablation options used to configure MARCO. Most options are Boolean on/off switches.
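Such Boolean switches might be organized as a simple configuration object. This sketch is hypothetical: the attribute names mirror a few of the options documented in this appendix, but the class is not MARCO's actual code.

```python
# Hedged sketch of MARCO-style Boolean ablation switches. Only a few of
# the documented options are shown; defaults model the full system.
from dataclasses import dataclass

@dataclass
class AblationOptions:
    distance_count: bool = True
    travel_until: bool = True
    face_description: bool = True
    turn_between_travels: bool = True

    def ablate(self, name):
        """Return a copy with one skill switched off for an ablation run."""
        copy = AblationOptions(**vars(self))
        setattr(copy, name, False)
        return copy
```

An ablation experiment would then run the agent once per option, each time with a single switch turned off, and measure the resulting drop in success rate (the option's impact).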
This section notes the name of each option, a description of it, examples of usage scenarios, and any fallback default behavior.

B.1 Options: Fundamentals

Fundamental abilities: skills that are explicitly used in instructions, e.g. counting, traveling until.

Distance Count (Impact: 48%): Move to a distance count. Examples: “Move forward to the second alley.” “Move forward once.”

Face Description (Impact: 75%): Turn until a view description is met. Examples: “Face the chair.” “With your back to the blank wall, ....”

Perspective Taking (Impact: 67%): Project a different perspective from the current view. Examples: “to your left is a chair.” “Go to the right of the stand”

Travel Until (Impact: 59%): Travel until the described view. Examples: “Go to the end of this hallway.” “When you get to the red brick hall, ...”

Turn Direction (Impact: 15%): Turn towards a specified direction. Examples: “Turn left.” “Take a right.”

Use Find (Impact: 17%): Use the find behavior. Examples: “it’s at the corner of the yellow floor and the grassy floor.” “Find the blue road.”

Use Follow (Impact: 2%): Use the follow behavior. Examples: “Follow the alley around ....” “go all the way down the winding hall”

View Memory (Impact: 81%): Remember views even after turning or traveling. Examples: “go down toward the longer end of the hallway.” “Going away from the coat hanger”

B.2 Options: Conditionals

Conditional abilities: skills that are explicit in instructions, but with conditionals.

Declare Goal Cond (Impact: 18%): Move to satisfy a condition on DeclareGoal. Examples: “when you come to the intersection with ..., you are at position 6” “once in the L you’re in position 5.”

Distance Before (Impact: 2%): Move to distance targets of X before Y. Examples: “Stop at last intersection before the bench” “one intersection before the coat rack will be 3”

Distance Past (Impact: 3%): Move to distance targets of X past Y. Examples: “One segment past the chair is Position 3.” “...until two sections past the lamp”

Face Away (Impact: 1%): Face the away condition of a Turn. Examples: “Go away from the clothes hanger.” “Face away from the dead end.”

Face Explicit (Impact: 11%): Face the explicit argument of Face. Examples: “Face the wall ....” “face the easel and dark gray stone floor.”

Face Onto (Impact: 3%): Face the onto condition of a Turn. Examples: “take a left onto the blue path.” “turn right onto the stone”

Face Purpose (Impact: 7%): Face the purpose condition of a Turn. Examples: “turn to face the long red hallway.” “turn left so that you face another bench”

Face Toward (Impact: 1%): Face the toward condition of a Travel. Examples: “go towards the red brick floor.” “move toward the bench.”

Face Travel Args (Impact: 8%): Face Travel arguments. Examples: “take the yellow path to the wood path intersection” “Go down the short hall”

Face View (Impact: 3%): Face a standalone description of a view. Examples: “you should have butterfly pictures in front of you” “you see a little bit of yellow on the floor”

Stop Cond (Impact: 4%): Interpret ‘stop when COND’ as ‘go until COND.’ Examples: “stop at the first intersection” “stop when the first fish hallway to your left appears”

Travel Location (Impact: 1%): Move to the syntactically marked precondition location of a Travel. Examples: “at the wood path intersection, take the wood path” “from the easel, move one block”

Travel Past (Impact: 1%): Travel past conditions. Examples: “move one place past the brick floor tiling” “pass the easle that is sitting in the hall”

Travel Precond (Impact: 8%): Turn to the syntactically marked precondition of a Travel. Examples: “facing the easel, go forward.” “with your back to the wall move along the green ...”

Turn Location (Impact: 7%): Move to the syntactically marked precondition location of a Turn. Examples: “once you hit the dining chair, turn right” “At the brick hallway, turn towards ...”

Turn Postcond (Impact: 2%): Travel to the syntactically marked postcondition pose of a Turn. Examples: “take a right to the end of the hall” “take a right onto the green path all the way to the end of the hall”

Turn Precond (Impact: 2%): Turn to the precondition pose of a Turn. Examples: “With your back facing the wall turn right” “facing the long aisle turn left”

B.3 Options: Heuristics

Heuristic hints: strategies that enforce simple implicit inferences, e.g. face an open path before traveling.
Face Distance Impact: 4% Face at least one of the distance units (e.g. intersections, streets, movements) of Travels. Face Past Impact: 2% Face objects the Travel will pass. Face Until Impact: 33% Face the termination condition of Travels. Face Until Post Dist Impact: 2% Re-face the termination condition of Travels after going distance, to catch over-estimates. Look Ahead For Travel Term Impact: 6% Look ahead in instuctions for termination for travel actions. Look Ahead For Travel Term Desc Impact: 4% Look ahead in instuctions for termination for travel actions in descriptions. Go foward. Walk towards the brick hallway. there is a chair. Look for butterflies At one end of the brick hallway, Look Ahead For Travel Term Loc Impact: 2% Look ahead in instuctions for termination for travel actions in location phrases. go towards the easle but stop at the closest concrete square closest to the easle take a left onto the pink path. at the next intersection, ... Propagate Context Info Impact: 1% Propagate the context information to embedded compound action specifications. Reverse Turn Impact: 3% If last action was a turn, but now facing a wall, turn around. Travel Between Turns Impact: 1% Make a travel between consecutive turns. take a left then a right. left. left. Travel On Final Turn Impact: 4% Make a final travel forward when the last action was a turn. with your back to the wall turn left and move one block. turn right. go down the butterfly walled/blue floored hallway. make a left at the hatrack. Travel On Final View Impact: 3% Make a final travel forward when the last action was a view. turn left and you should see a barstool. position 7. right. this is You should be able to see the grassy hall to your This is Position 1. Travel To Next Impact: 5% Travel to next match when last action was turn. ...take a left. Go down to the corner. go until hall ends. take a left. go until hall ends. Turn Between Travels Impact: 1% Make a turn between consecutive travels. 
Examples: "Go forward down the hall until a hall opens to your left. Go one segment." / "yellow hall to wooden hall. Go forward then one space forward."
Turn Explicit (Impact: 1%): Turn towards an explicitly mentioned direction before checking the condition.
Turn Post Reset Cache (Impact: 1%): Reset the anaphora cache after interpreting a turn.
Turn Pre Reset Cache (Impact: 1%): Reset the anaphora cache before interpreting a turn.
Turn Term Reset Cache (Impact: 1%): Reset the anaphora cache before interpreting a turn termination.
Turn Toward Path (Impact: 2%): If there is no explicit turn direction, turn towards a visible path instead of a wall.

B.4 Options: Recoveries

Error recovery: Strategies that attempt to recover from a failed action.

Check After Turn Find (Impact: 1%): Check the until condition of the find between the turn and travel.
Find Face (Impact: 16%): Find when Face does not find a satisfying pose visible from this place.
Find Face Travel (Impact: 2%): Use Find as a fallback for Face even if not ImplicitTravel.
Travel To Distant View (Impact: 2%): Travel when facing a satisfying pose distantly visible from this place. Examples: "at the next corner, take a right at the lamp. go forward." / "then left all the way down the blue hall"

B.5 Options: Tweaks

Tweaks: Parameters that have small effects on how a behavior is executed.

Declare Goal For Position (Impact: 1%): DeclareGoal whenever a position is mentioned.
Travel Empty (Impact: 1%): Travel forward without arguments.
Travel No Termination (Impact: 2%): Travel forward without termination conditions.

B.6 Options: Linguistics

Linguistic parameters: Parameters that affect the interpretation of text.

Declare Goal Idiom (Impact: 8%): Treat 'This is (it|Position X)' as an idiom. Examples: "this is 6." / "Position 6 is the next intersection as you follow the red-brick hallway."
Face Declaratives (Impact: 26%): Enforce declaratives with Face statements. Examples: "to your right you should see an alley with grey carpet." / "there is a bench there."
Fuzzy Meanings (Impact: 7%): Use broader definitions of concepts. Examples: "chair" / "gray"
Raw Reference Resolution (Impact: 4%): Fill a reference phrase or pronoun with the corresponding noun phrase, even if not limited by syntax. Examples: "Face toward the hatrack. Walk the one segment to it." / "there should be blue carpet on the first alley. walk to that and turn right."
Recognize Across (Impact: 3%): Recognize across phrases. Examples: "move across one yellow panel." / "turn right across 2 black stone floors."
Recognize Arrive Frame (Impact: 2%): Handle arrive frames as travel until. Examples: "turn right when you reach the end." / "Once you get to the floral carpeted hallway, look for the easle."
Recognize Complex Expressions (Impact: 6%): Recognize complex expressions. Component options: Recognize Fictive Turn Intersections, Recognize Struct Frame, Recognize Pass Frame, Recognize Dir Turn, Recognize Distal Determiners, Recognize Last, Recognize Negative, Recognize Negative Compound, Recognize Standalone Arrive, Recognize Struct Agent Frame, Recognize Until Loc Dist, Recognize Until View.
Recognize Count (Impact: 5%): Recognize counts in noun phrases. Examples: "walk past two chairs and to the lamp." / "This empty intersection is Position 7."
Recognize Dir Turn (Impact: 3%): Handle e.g. '(until) the last right.' Examples: "there should be only once choice, to turn right." / "this left turn should have a yellow floor."
Recognize Distal Determiners (Impact: 2%): Recognize determiners like 'that' and 'other' as marking distant entities. Examples: "go down the red hall until you see the blue hall. at that intersection stop." / "then turn left until you see another bench and move to it."
Recognize Fictive Turn Intersections (Impact: 2%): Handle 'where you (can) turn|travel to the right...' Examples: "move forward until you can turn left again." / "go straight until you can either go left or towards a dead end."
Recognize Last (Impact: 4%): Recognize 'last' as an order adjective. Examples: "until you come to the last empty intersection before the easle" / "Follow this down untill you come to the second to last left."
Recognize Negative (Impact: 0%): Recognize negative phrases.
Examples: "when there is not a wall to your left, go straight." / "...toward the pictures of not-butterflies"
Recognize Negative Compound (Impact: 1%): Recognize negative compound phrases. Example: "not a bench or a stool."
Recognize Noun Noun Compound (Impact: 2%): Recognize noun-noun compounds. Examples: "down the long butterfly hallway, with blue walls." / "Face the pink-flowered carpet hall"
Recognize Pathdir (Impact: 2%): Handle 'Only one way to go.' Examples: "only one way to go" / "face the direction with the easel."
Recognize Standalone Arrive (Impact: 2%): Handle sentences like 'You will hit an intersection.' Examples: "the very first section you come to, you will be at 5" / "you will hit an intersection with black stone floors."
Recognize Struct Agent Frame (Impact: 2%): Handle sentences like 'You will have to take a right.' Example: "You will have to take a right onto a floor that is black."
Recognize Struct Frame (Impact: 2%): Handle frames with structural descriptions as verbs, like 'where the paths cross.' Examples: "when road ends, go right." / "then a left down the red hall until it intersects with the rose hall"
Recognize Structural (Impact: 3%): Handle structural adjectives, especially long and short. Examples: "turn until you face the short end of the hallway with blue flooring." / "go down the longer part of the hall with blue rectangles."
Recognize Take Turn Frame (Impact: 10%): Handle 'Take ((the nth)|a) (right|left) [turn].' Examples: "make a left." / "take your second right."
Recognize Until Loc Dist (Impact: 2%): Handle Travel UNTIL LOC DIST.
Recognize Until View (Impact: 2%): Handle Turn until VIEW. Examples: "turn until you face the hallwya with the green floor" / "from one turn until you see a corner of blue carpet in a side alley."
Reference Resolution (Impact: 2%): Fill a reference phrase or pronoun with the corresponding noun phrase. Examples: "you should see an alley to your left. take it." / "you will be looking for a pink flowered path. When you reach this path, ..."
Spellcheck (Impact: 2%): Do spellchecking and replace unknown words with known words.

B.7 Options: Implicits

Implicit Actions (Impact: 61%): Infer and perform implicit procedures.
Component options: Implicit Travel, Implicit Turn.
Implicit Exploration (Impact: 22%): Active resolution of referring phrases. Component options: Travel To Distant View, Find Face, Recognize Structural.
Implicit Pragmatic (Impact: 15%): Discourse and idiomatic cues. Component options: Declare Goal Idiom, Recognize Pathdir, Look Ahead For Travel Term, Travel On Final Turn, Travel Between Turns, Turn Between Travels, Travel To Next, Travel On Final View, Propagate Context Info.
Implicit Pragmatic Cross Utterance (Impact: 9%): Pragmatics across utterances. Component options: Look Ahead For Travel Term, Travel On Final Turn, Travel Between Turns, Turn Between Travels, Travel To Next, Travel On Final View, Propagate Context Info.
Implicit Pragmatic Per Utterance (Impact: 8%): Pragmatics per utterance. Component options: Declare Goal Idiom, Recognize Pathdir.
Implicit Semantic (Impact: 51%): Complex action frames. Component options: Recognize Take Turn Frame, Recognize Arrive Frame, Face Travel Args, Face Distance, Face Until, Face Past, Turn Toward Path.
Implicit Syntactic (Impact: 37%): Prepositional phrase markings. Component options: Face Purpose, Travel Precond, Turn Precond, Turn Postcond, Declare Goal Cond, Turn Location, Travel Location, Stop Cond, Declare Goal For Position.
Implicit Travel (Impact: 36%): Infer and perform implicit travels.
Implicit Turn (Impact: 50%): Infer and perform implicit turns.

B.8 Options: Landmarks

Recognize different sorts of landmarks.

Appearance Landmarks (Impact: 35%): Recognizing simple perceptual attributes, e.g. color, texture. Examples: "take one movement towards the blue corridor" / "Go to the hallway that has the blue tiles and the orange butterflies on the wall."
Causal Landmarks (Impact: 47%): Recognizing simple structural landmarks, e.g. paths, walls, positions. Examples: "with your back to the wall turn right." / "take a left onto the black path"
Intersection Landmarks (Impact: 25%): Recognizing intersection structural landmarks, e.g. dead ends, T intersections, corners.
Examples: "the dead end is position 4." / "at the first intersection after the lamp,"
Object Landmarks (Impact: 24%): Recognizing distinct objects, e.g. furniture and pictures. Examples: "The intersection with the chair is Position 4." / "Go towards the coat rack and take a left at the coat rack."
Structural Landmarks (Impact: 58%): Recognizing simple structural landmarks. Component options: Causal Landmarks, Intersection Landmarks.

B.9 Options: HSSH

Divide methods by what level of the SSH they require.

Causal (Impact: 98%): All Causal actions. Component options: Distance Count, Turn Direction, Travel Until, Face Description, Object Landmarks, Causal Landmarks.
Closed Loop Causal (Impact: 89%): Closed-loop causal control laws. Component options: Travel Until, Face Description, Object Landmarks.
Local Metrical (Impact: 76%): Local metrical. Component options: Travel To Distant View, View Memory, Perspective Taking, Face Distance.
Local Topological (Impact: 55%): Local topological. Component options: Recognize Structural, Intersection Landmarks, Turn Toward Path, Reverse Turn, Look Ahead For Travel Term, Recognize Count, Recognize Dir Turn, Recognize Pathdir, Face Until, Face Until Post Dist.
Open Loop Causal (Impact: 54%): Open-loop causal control laws. Component options: Distance Count, Turn Direction.
Topological (Impact: 25%): Topological. Component options: Use Follow, Use Find, Travel To Next, Travel Between Turns, Turn Between Travels, Travel On Final Turn.

B.10 Options: Comparison

Comparisons to different agents or development baselines.

Corpus2 Options (Impact: 8%): Options added when examining corpus 2.
Component options: Declare Goal For Position, Propagate Context Info, Raw Reference Resolution, Recognize Across, Recognize Count, Recognize Dir Turn, Recognize Distal Determiners, Recognize Last, Recognize Negative, Recognize Negative Compound, Recognize Standalone Arrive, Recognize Struct Agent Frame, Recognize Until Loc Dist, Recognize Until View, Reverse Turn, Spellcheck, Travel On Final View.
IBL Options (Impact: 4%): Options that distinguish Marco from the IBL agent. Component options: Reverse Turn, Turn Toward Path, Face Distance, Implicit Pragmatic.

Appendix C

Glossary and Language Model

C.1 Glossary

C.1.1 Route Instruction Analysis Vocabulary

Route Instruction Units
Utterance: Sentence or fragment of natural language text.
Command: An explicit imperative utterance.
Instructions: Linguistic or pictorial description of a procedure of explicit commands and descriptive utterances, describing what the follower agent should do.
Route Instruction: Navigation task-specific description of where and how the follower agent should move.
Procedural Specification: A representation of the constraints that route instructions place on a reactive procedure, consisting of the parameters, constraints, and conditions governing the activation of discrete Causal actions. Equivalent to tasks at the sequencing layer of multi-tiered architectures. A task-dependent representation of how the follower agent achieves an instruction, grounded in the follower's capabilities.

C.1.2 Abbreviations
POMDP: Partially Observable Markov Decision Process
SSH: Spatial Semantic Hierarchy
HSSH: Hybrid Spatial Semantic Hierarchy
VRML: Virtual Reality Modeling Language

C.2 Route Instruction Grammar

C.2.1 Verbs
TURN: Change in orientation or switch in paths: turn, take (a left), make (a right turn), go (left)
ORIENT: Change in rotation relative to an external frame of reference: face, put (your back against the wall), stand (so that ...)
TRAVEL: Move between places: walk, go, move, follow, take (the path)
STOP: Cease motion: stop
PASS: Move past an entity: pass
ARRIVE: Move until a description: get to, meet, come to, hit, enter
FIND: Undirected search: find, look for, go (to where ...)
IS: Be verb: is, match, are
HAS: Possession verb: contain, has
SEE: Local observation verb: see, look
LOC: Location verb: standing, located
STRUCT: Describe environment structure by a movement metaphor: intersect, meet, brings, hits, runs into

C.2.2 Nouns
PLACE: Place name or description: position 2, a spot with a chair
PATH: Way to travel between places: hallway, road, path
STRUCT: Local place topology: end of the hall, corner, 'T' intersection
REGION: Large-scale space region: the Eiffel Tower area
DIST_UNIT: Units of distance: segments, movements, times, PATH, STRUCT
SIDE: A direction or directions: the sides, direction, way, the back
OBJ: Discrete localized object: furniture, wall, picture
AGENT: Agent that can act in the world: you, yourself, the follower
VIEW: Description of view: to, from
REF: Reference: it, that, there
VIEWDESC: Description of the expected view: PATH | OBJ | STRUCT | REGION | VIEW | DESC | REF

C.2.3 Turn Command Arguments

Phrase / Argument / Action Model:
- COND / DESC / Face_p(faced:Desc)
- LOC / VIEWD / Travel_p(until:ViewD)
- DIR / DIR / Turn_p(direction:Dir)
- DIR / SIDE / Turn_p(direction:Dir)
- PURPOSE / DESC / Face_p(faced:Desc)
- TOWARD / VIEWD / Face_p(faced:ViewD)
- AWAY / VIEWD / Face_p(faced:ViewD)
- ONTO / PATH / Face_p(faced:Path)

COND: After this condition holds, evaluate the other arguments.
LOC: Where to turn: take a right, left
DIR: Turn direction relative to the agent.
SIDE: Facing so that the referent is on the named side.
PURPOSE: Turn so that this condition holds.
TOWARD: Facing the referent: face ...
AWAY: Facing away from the referent: away from the rose path intersection
ONTO: Turn onto this path.

C.2.4 Travel Command Arguments
Phrase / Argument / Action Model:
- ALONG / PATH / Face_p(faced:ViewD)
- COND / DESC / Face_p(faced:Desc)
- LOC / VIEWD / Travel_p(until:ViewD)
- TOWARD / VIEWD / Face_p(faced:ViewD)
- AWAY / VIEWD / Face_p(faced:ViewD)
- DIR / DIR / Turn_p(direction:Dir)
- DIR / SIDE / Turn_p(direction:Dir)
- PAST / VIEWD / Travel_p(past:ViewD)
- DIST / DIST / Travel_p(dist:Dist)
- UNTIL / VIEWD / Travel_p(until:ViewD)

ALONG: Travel on this path: along the grass carpet, down the hall with the yellow stone floors, take it
TOWARD: Direction on a path toward an allocentric entity: toward the hat rack, in the direction of the red floor
AWAY: Direction on a path facing away from the referent: away from the bench
DIR: Travel direction on a path relative to the agent: straight, left, right
PAST: Move past an entity on this path, in the direction of the referent but further than it: past the stool, passing the lamp, beyond that
QUAL-DIST: Qualitative description of distance: all of the way down, far, very end
QUANT-DIST: COUNT (DIST-UNIT): Quantitative description of distance: one movement, two alleys away from, about five intersections, first left
UNTIL: Continue travel until the condition holds: stop at the easel, to the lamp, until you reach the pink walkway

C.2.5 Description Utterance Arguments

DESC:
- LOC* (there | where | COND)* OBJ | VIEW | STRUCT | LOC | POSITION V/IS (Appear | Side | Loc)
- (there | where) V/IS OBJ | VIEW | STRUCT | LOC
- PATH V/STRUCT PATH
- (PATH | STRUCT | LOC | REF) V/HAS (PATH | STRUCT | LOC | REF)
- LOC* ARRIVE

COND: Conditional for (ARRIVE | VIEW | ORIENT | LOC | DESC): when ..., once ..., if ...
ARRIVE: Description of the termination of motion: to an intersection with ...,
you get to the ..., you come ...

C.2.6 Adjectives
APPEAR: Appearance, floor color and texture, pictures: blue, green octagon, fish on the wall, wooden floored
STRUCTURAL: Length and shape of path: long, short
STRUCT-TYPE: Local topology and geometry of environment structure: four-way, 'T'

C.2.7 Adjectival Phrases
ON: P/ON PATH. The entity is on this path: in this hall, on the long path
BETWEEN: P/BETWEEN VIEWDESC VIEWDESC. The entity is between two other entities: between the chair and the hatrack, between the two intersections containing furniture
SIDE: P/SIDE DIR VIEWDESC?. Location of the object relative to another entity: in front of the easel, to your left, behind you, on one side
AGAINST: P/AGAINST VIEWDESC. The object is in contact with or adjacent to the referent: to the wall, against the chair
LOC: P/LOC VIEWDESC. Location of the object: by the coat rack, in the corner, at the end of the hall
DETAIL: (P/DETAIL | V/HAS) OBJ. Descriptive detail: with the blue floor, containing an easel
PART: P/PART OBJ. Composition of the object: end, intersection

C.3 Interfaces

C.3.1 Simulation Inputs and Outputs

Agent Actions: travel_a, turn-left_a, turn-right_a, declare-goal_a
Test Actions: teleport_a, set-goal_a
Observations: A variable-length list of tuples, representing the view to the perceptual horizon (end of the hall). See the description below.

Observation Type
The observation is a discrete symbolic representation of each component of the view visible to the end of the hallway in front. The list is composed of an arbitrary number of six-item tuples, ⟨left, middle, right, f-left, front, f-right⟩. This represents the view of a place, its peripheral hallways, and the floor and walls of the hallway segment immediately in front. Positions left, middle, and right describe the floor texture, if any, and are drawn from the set {Wall, Rose, Wood, Grass, Cement, BlueTile, Brick, Stone, Honeycomb}.
Positions f-left and f-right describe the corridor walls, if visible, and are drawn from the set {End, Butterfly, Fish, Eiffel}. Position front describes any object visible within the intersection and is drawn from the set {Chair, Sofa, Barstool, Hatrack, Easel, Lamp, Empty}. See Figures 6.1 and 6.2 for examples.

C.4 Representation of Procedural Specifications

See Section 5.4.2 for more explanation.

Common parameters
precondition: Pre-condition to achieve prior to executing the procedure.
postcondition: Post-condition to achieve while executing the procedure.
location: Where to perform the procedure. A locative pre-condition.

Procedure-specific parameters
Verify_p
  description: View description to verify.
Travel_p
  along: Description of the path to travel along. A topological pre-/during-condition.
  distance: Estimate of the distance to travel. May be discrete (2nd left), continuous (three meters), or qualitative (all the way down). An internal during-/post-condition.
  face: View description of the view in front of the traveler at the beginning of travel. A perceptual pre-condition.
  past: View description to achieve during forward motion. A locative during-condition.
  until: View description of the travel destination. A locative post-condition.
Turn_p
  direction: Relative direction to turn.
Face_p
  faced: View description to face at the end of the procedure. A perceptual post-condition.
DeclareGoal_p
  goal: Name of the goal to assert.

C.5 Representation of View Description

See Section 5.4.1 for more explanation.

type: Kind of entity. Currently implemented types:
  Path: Traversable linear structure an agent can travel_a along.
  Pathdir: Part of a path proceeding in one direction.
  Segment: Length of path between two places.
  Obj: Three-dimensional object located at a point, such as furniture, pictures, and walls.
  Struct: Structural landmark such as an intersection or block.
  Region: Large-scale region of space.
value: Token representing an instance of the type, recognizable to the view description matching code.
dist: Distance within the view, if known. Currently a string representing a distance range in the view, e.g. '0' is immediate, '1' is one intersection away, and '1:' is distal.

Relations between entities
Between: Ternary relation that the subject is located between two other reference objects, i.e. if the follower is the subject, the reference objects will be on opposite sides of her.
Detail: Asserts an unspecified connection between two objects, which may be colocation (e.g. the end with the easel) or part (e.g. the intersection with the brown hallway).
Loc: Represents that the subject is co-located with (or at or in) the reference object.
On: Represents that the subject is topologically on a reference entity, a path.
Part: Represents that the entity is part of the other entity, though which is the part may not be apparent from the surface form.
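The discrete observation format of Section C.3.1 can be sketched as a typed tuple. This is an illustration, not the dissertation's code: the `ViewTuple` class and `is_valid` helper are assumptions, the token sets follow the text, and the assignment of floors to left/middle/right, walls to f-left/f-right, and the object to front is read from the surrounding description.

```python
from typing import NamedTuple

# Token sets from Section C.3.1 (names follow the text).
FLOORS = {"Wall", "Rose", "Wood", "Grass", "Cement",
          "BlueTile", "Brick", "Stone", "Honeycomb"}
WALLS = {"End", "Butterfly", "Fish", "Eiffel"}
OBJECTS = {"Chair", "Sofa", "Barstool", "Hatrack", "Easel", "Lamp", "Empty"}

class ViewTuple(NamedTuple):
    """One six-item element of the variable-length view list: a place,
    its peripheral hallways, and the hallway segment in front."""
    left: str     # floor texture of the left peripheral hallway
    middle: str   # floor texture at the place itself
    right: str    # floor texture of the right peripheral hallway
    f_left: str   # wall texture of the segment ahead, left side
    front: str    # object visible within the intersection
    f_right: str  # wall texture of the segment ahead, right side

def is_valid(v: ViewTuple) -> bool:
    """Check each position against its token set."""
    return (v.left in FLOORS and v.middle in FLOORS and v.right in FLOORS
            and v.f_left in WALLS and v.f_right in WALLS
            and v.front in OBJECTS)
```

A full observation would then be a list of such tuples, one per visible place out to the perceptual horizon.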
196 Alterman, R., Zito-Wolf, R., and Carpenter, T. (1991). Interaction, comprehension, and instruction usage. J. of the Learning Sciences, 1(3/4):361–398. Anderson, A. (1984). Semantic and Social Pragmatic Aspects of Meaning in Task-Oriented Dialogue. PhD thesis, Univ. of Glasgow. Anderson, A., Bader, M., Bard, E. G., Boyle, E., Doherty, G., Garrod, S., Isard, S., Kowtko, J., McAllister, J., Miller, J., Sotillo, C., Thompson, H. S., and Weiniert, R. (1991). The HCRC map task corpus. Lang. & Speech, 34(4):351–366. André, E., Bosch, G., Herzog, G., and Rist, T. (1986). Coping with the intrinsic and deictic use of spatial prepositions. In Proc. of AI II: Methodology, Systems, Applications, pages 375–382, Amsterdam, The Netherlands. Baker, C. F., Fillmore, C. J., and Lowe, J. B. (1998). The Berkeley FrameNet project. In Proc. of 36th Ann. Meeting of the ACL (ACL-96), Montréal, QB, Canada. Morgan Kaufmann. Beeson, P., MacMahon, M., Modayil, J., Murarka, A., Kuipers, B., , and Stankiewicz, B. (2007). Integrating multiple representations of spatial knowledge for mapping, navigation, and communication. In Proc. of AAAI Spring Symp. on Interaction Challenges for Intelligent Assistants, Stanford, CA. Bindiganavale, R., Schuler, W., Allbeck, J. M., Badler, N. I., Joshi, A. K., and Palmer, M. (2000). Dynamically altering agent behaviors using natural language instructions. In Proc. of 4th Intl. Conf. on Auton. Agents, pages 293–300, Barcelona, Spain. Bird, S. and Loper, E. (2004). NLTK: The Natural Language Toolkit. In Proc. of 42nd Ann. Meeting of the ACL (ACL-04), Barcelona, Spain. Blisard, S. and Skubic, M. (2005). Modeling spatial referencing language for human-robot interaction. In Proc. of IEEE Intl. Ws. on Robot and Human Interactive Communication (RO-MAN), Nashville, TN. Blisard, S., Skubic, M., Luke, III, R. H., and Keller, J. M. (2006). 3-D modeling of spatial referencing language for human-robot interaction. In Goodrich et al. (2006), pages 329– 330. Blocher, A. 
and Stopp, E. (1998). Time-dependent generation of minimal sets of spatial descriptions. In Olivier and Gapp (1998), pages 57–72. Bonnasso, R. P., Firby, R. J., Gat, E., Kortenkamp, D., Miller, D. P., and Slack, M. G. (1997). Experiences with an archtitecture for intelligent, reactive agents. J. of Exptl. and Theo. AI , 9(1):237–256. Bos, J. (2004). Computational semantics in discourse: Underspecification, resolution, and inference. J. of Logic, Language and Information, 13(2):139–157. Brown, L. N., Lahar, C. J., and Mosley, J. L. (1998). Age and gender-related differences in strategy use for route information: A map-present direction-giving paradigm. Env. & Behavior, 30:123–143. Bugmann, G. (2003). Challenges in verbal instruction of domestic robots. In Proc. of 1st Intl. Ws. on Adv. in Service Rob. (ASER ’03), pages 112–116, Bardolino, Italy. Bugmann, G., Klein, E., Lauria, S., and Kyriacou, T. (2004). Corpus-based robotics: A route instruction example. In Proc. of Intelligent Auton. Sys. , pages 96–103, Amsterdam. Bugmann, G., Lauria, S., Kyriacou, T., Klein, E., Bos, J., and Coventry, K. (2001). Using verbal instructions for route learning: Instruction analysis. In Proc. of Torwards Intelligent Mobile Robots Conf., Manchester, UK. Buhl, H. M. (2003). Partner orientation and speaker’s knowledge as conflicting parameters in language production. J. of Psycholinguistic Res. , 30(6):549–567. Burke, J. L., Murphy, R. R., Coovert, M. D., and Riddle, D. L. (2004). Moonlight in Miami: Field study of human-robot interaction in the context of an urban search and rescue disaster response training exercise. Human-Computer Interaction, 19(1–2):85– 116. Burnett, G. E. (2000). “Turn right at the traffic lights:” The requirements for landmarks in vehicle navigation systems. J. of Navigation, 53(3):499–510. Burridge, R. R., Graham, J., Shillcutt, K., Hirsh, R., and Kortenkamp, D. (2003). Experiments with an EVA assistant robot. In Proc. of 7th Intl. Symp. 
on Artificial Intelligence, Robotics and Automation in Space (i-SAIRAS-03), Nara, Japan. Byron, D. K. (2002). Resolving pronominal reference to abstract entities. In Proc. of 40th Ann. Meeting of the ACL (ACL-02), pages 80–87, Philadelphia, PA. Byron, D. K., Mampilly, T., Sharma, V., and Xu, T. (2005). Utilizing visual attention for cross-modal coreference interpretation. In Proc. of Context-05, volume 3554 of LNCS, pages 83–96. Carletta, J. and Mellish, C. S. (1996). Risk-taking and recovery in task-oriented dialogue. J. of Pragmatics, 26(1):71–107. Cassandra, A. R., Kaelbling, L. P., and Littman, M. L. (1994). Acting optimally in partially observable stochastic domains. In Proc. of 12th Natl. Conf. on AI (AAAI-94), pages 1023– 1028, Seattle, WA. AAAI Press/The MIT Press. Chang, N. and Mok, E. (2006). A structured context model for grammar learning. In Proc. of Intl. JointConf. on Neural Networks, Vancouver, BC. Chang, N., Narayanan, S., and Petruck, M. R. (2002). Putting frames in perspective. In Proc. of 19th Intl. Conf. on Compl. Ling. (COLING-02), Taipei, Taiwan. Chapman, D. (1990). Instruction use in situated activity. PhD thesis, MIT, Dept. of Elec. Eng. & Comp. Sci. , Cambridge, MA. Also Available as MIT, AI Lab. Technical Report 1204. Charniak, E. and Johnson, M. (2005). Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proc. of 43rd Ann. Meeting of the ACL (ACL-05), pages 173– 180, Ann Arbor, MI. Chewar, C. M. and McCrickard, D. S. (2002). Dynamic route descriptions: Tradeoffs by usage goals and user characteristics. In Proc. of 2nd Intl. Symp. on Smart Graphics, pages 71–78. ACM Press. Cohen, P. R. (1984). The pragmatics of referring and the modality of communication. Compl. Intelligence, 10(2):97–146. Cohen, P. R. (1995). Empirical Methods for Artificial Intelligence. MIT Press, Cambridge, MA. Cohen, P. R. and Levesque, H. J. (1990a). Persistence, intention, and commitment. In Cohen et al. (1990), pages 33–69. Cohen, P. R. 
and Levesque, H. J. (1990b). Rational interaction as the basis for communication. In Cohen et al. (1990), pages 221–255. Cohen, P. R., Morgan, J., and Pollack, M. E., editors (1990). Intentions in Communication. MIT Press, Cambridge, MA. Cohen, P. R. and Perrault, C. R. (2003). Elements of a plan-based theory of speech acts. In Huget, M.-P., editor, Communication in Multiagent Systems: Agent Communication Languages and Conversation Policies, volume 2650 of LNCS, pages 1–36. Springer. COSIT-99 (1999). Spatial Information Theory: Cognitive and Computational Foundations of Geog. Info. Sci. (COSIT ’99), volume 1661 of LNCS, Stade, Germany. Coventry, K. R. and Garrod, S. C. (2004). Saying, Seeing, and Acting: The Psychological Semantics of Spatial Prepositions. Essays in Cognitive Psychology. Psychology Press, Hove and New York. Coventry, K. R. and Oliver, P., editors (2002). Spatial Language : Cognitive and Computational Perspectives. Kluwer Academic Publishers, Boston, MA. Dabbs, J. M., Chang, E.-L., Strong, R. A., and Milun, R. (1988). Spatial ability, navigation strategy, and geographic knowledge among men and women. Evolution and Human Behavior, 19:89–98. Dale, R., Geldof, S., and Prost, J.-P. (2002). Generating more natural route descriptions. In 2002 Australasian Ws. on Nat. Lang. Proc. , pages 41–48, Canberra, Australia. Dale, R., Geldof, S., and Prost, J.-P. (2003). CORAL: Using natural language generation for navigational assistance. In Oudshoorn, M., editor, Proc. of 26th Australasian Comp. Sci. Conf. (ACSC2003), Adelaide, Australia. Dalton, R. C. (2003). The secret is to follow your nose: Route path selection and angularity. Env. & Behavior, 35(1):107–131. Daniel, M.-P. and Denis, M. (2004). The production of route directions: Investigating conditions that favour conciseness in spatial discourse. Applied Cog. Psych. , 18:57–75. Daniel, M.-P., Tom, A., Manghi, E., and Denis, M. (2003). Testing the value of route directions through navigational performance. 
Spatial Cogn. & Compn. , 3(4):269–289. Davis, J. R. (1986). Giving directions: A voice interface to an urban navigation program. American Voice I/O Society, pages 77–84. Dayan, A. and Thomas, J. R. (1995). Development of automatic and effortful processes in memory for spatial location of movement. Human Performance, 8(1):51–66. Denis, M. (1997). The description of routes: A cognitive approach to the production of spatial discourse. Current Psych. of Cogn. , 16(4):409–458. Denis, M., Pazzaglia, F., Cornoldi, C., and Bertolo, L. (1999). Spatial discourse and navigation: An analysis of route directions in the city of Venice. Applied Cog. Psych. , 13(2):145–174. Di Eugenio, B. (1992). Understanding natural language instructions: the case of purpose clauses. In Proc. of 30th Ann. Meeting of the ACL (ACL-92), pages 120–127, Newark, DE. Morgan Kaufmann. Di Eugenio, B. (1998). An action representation formalism to interpret natural language instructions. Compl. Intelligence, 14(1):89–133. Dixon, P. (1987a). The processing of organizational and component step information in written directions. J. of Memory & Lang. , 26(1):24–35. Dixon, P. (1987b). The structure of mental plans for following directions. J. of Exptl. Psych. , 13(1):18–26. Dixon, P., Faries, J., and Gabrys, G. (1988). The role of explicit action statements in understanding and using written directions. J. of Memory & Lang. , 27(6):649–667. Duckham, M. and Kulik, L. (2003). “Simplest” paths: Automated route selection for navigation. In Kuhn, W., Worboys, M. F., and Timpf, S., editors, Proc. of COSIT-03, number 2825 in LNCS, pages 169–185, Kartause Ittingen, Switzerland. Springer-Verlag. Edmonds, P. G. (1993). A computational model of collaboration on reference in directiongiving dialogues. Master’s thesis, Univ. of Toronto, Dept. of Comp. Sci. , Toronto, Canada. Published asTechnical Report CSRI-289. Edmonds, P. G. (1994). Collaboration on reference to objects that are not mutually known. In Proc. of 15th Intl. 
Conf. on Compl. Ling. (COLING-94), pages 1118–1122, Kyoto, Japan. Elliot, R. J. and Lesk, M. E. (1982). Route finding in street maps by computers and people. In Proc. of 2nd Natl. Conf. on AI (AAAI-82), pages 258–261, Pittsburgh, PA. Ellsworth, M., Erk, K., Kingsbury, P., and Pado, S. (2004). PropBank, SALSA, and FrameNet: How design determines product. In Proc. of LREC 2004 Ws. on Building Lexical Resources from Semantically Annotated Corpora, pages 17–23, Lisbon, Portugal. Fellbaum, C., editor (1998). WordNet: An Electronic Lexical Database. MIT Press, Cambridge,MA. Ferguson, G. and Allen, J. F. (1993). Generic plan recognition for dialogue systems. In Proc. of Ws. on Human Language Technology, pages 171–176, Plainsboro, New Jersey. Ferguson, G. and Allen, J. F. (1998). TRIPS: An integrated intelligent problem-solving assistant. In Proc. of 15th Natl. Conf. on AI (AAAI-98), Madison, WI. Firby, R. J. (1989). Adaptive Execution in Complex Dynamic Worlds. PhD thesis, Yale Univ. , Comp. Sci. Dept. , New Haven, CT. Fischer, K. (2003). Linguistic methods for investigating concepts in use. In Stolz, T. and Kolbe, K., editors, Methodologie in der Linguistik, pages 39–62. Peter Lang, Frankfurt a.M., Germany. Fleischman, M. and Hovy, E. (2006). Taking advantage of the situation: Non-linguistic context for natural language interfaces to interactive virtual environments. In Proc. of 11th Intl. Conf. on Intelligent User Interfaces (IUI ’06), Sydney, Australia. Fleischman, M. and Roy, D. (2005). Intentional context in situated natural language learning. In Proc. of Conf. on Natural Language Learning, Ann Arbor, MI. Fong, T. and Nourbakhsh, I. (2005). exploration. ACM Interactions. Interaction challenges in human-robot space Fontaine, S. and Denis, M. (1999). The production of route instructions in underground and urban environments. In COSIT-99 (1999), pages 83–94. Fox, D., Burgard, W., and Thrun, S. (1999). Markov localization for mobile robots in dynamic environments. J. 
of AI Res. , 11:391–427. Fraczak, L., Lapalme, G., and Zock, M. (1998). Automatic generation of subway directions: Salience gradation as a factor for determining message and form. In Hovy, E., editor, Proc. of Ninth Intl. Ws. on Nat. Lang. Gen. , pages 58–67. Assoc. Compl. Lang. , New Brunswick, New Jersey. Frank, A. U. (2003). Pragmatic information content: How to measure the information in a route description. In Duckham, M., editor, Foundations of Geog. Info. Sci. , pages 47–68. Taylor & Francis. Freksa, C., Brauer, W., Habel, C., and Wender, K. F., editors (2000). Spatial Cognition II, Integrating Abstract Theories, Empirical Studies, Formal Methods, and Practical Applications, volume 1849 of LNCS. Springer. Freksa, C., Brauer, W., Habel, C., and Wender, K. F., editors (2003). Spatial Cognition III: Routes and Navigation, Human Memory and Learning, Spatial Representation and Spatial Learning, volume 2685 of LNCS. Springer. Freksa, C., Habel, C., and Wender, K. F., editors (1998). Spatial Cognition, An Interdisciplinary Approach to Representing and Processing Spatial Knowledge, volume 1404 of LNCS, Berlin. Springer. Freksa, C., Knauff, M., Krieg-Brückner, B., Nebel, B., and Barkowsky, T., editors (2004). Spatial Cognition IV: Reasoning, Action, Interaction: International Conference Spatial Cognition 2004, volume 3343 of LNCS, Frauenchiemsee, Germany. Springer; Berlin. Freundschuh, S. and Egenhofer, M. (1997). Human conceptions of spaces: Implications for GIS. Trans. on Geog. Info. Sci. , 2(4):361–375. Garden, S., Cornoldi, C., and Logie, R. H. (2002). Visuo-spatial working memory in navigation. Applied Cog. Psych. , 16(1):35–50. Garrod, S. (1989). Conceptual and semantic co-ordination in dialogue: Implications for the design of interactive natural language interfaces. In Peckham, J., editor, Recent Developments and Applications of Natural Language Processing, UNICOM applied information technology reports, pages 262–272. Kogan Page. Garrod, S. and Anderson, A. 
(1987). Saying what you mean in dialogue: A study in conceptual and semantic co-ordination. Cognition, 27(2):181–218. Garrod, S. and Doherty, G. (1994). Conversation, co-ordination and convention: An empirical investigation of how groups establish linguistic conventions. Cognition, 53:181–215. Garrod, S. C. and Sanford, A. J. (1989). Discourse models as interfaces between language and the spatial world. J. of Semantics, 6:147–160. Ge, R. and Mooney, R. J. (2005). A statistical semantic parser that integrates syntax and semantics. In Proc. of the Ninth Conf. on Computational Natural Language Learning, pages 9–16, Ann Arbor, MI. Geldof, S. (2003). Corpus analysis for NLG. In Reiter, E., Horacek, H., and van Deemter, K., editors, Proc. of 9th European Ws. on Nat. Lang. Gen. (ENLG’03), Budapest, Hungary. Gildea, D. and Hockenmaier, J. (2003). Identifying semantic roles using combinatory categorial grammar. In 2003 Conf. on Empirical Methods in Nat. Lang. Proc. (EMNLP), pages 57–64, Sapporo, Japan. Gildea, D. and Palmer, M. (2002). The necessity of syntactic parsing for predicate argument recognition. In Proc. of 40th Ann. Meeting of the ACL (ACL-02), pages 239–246, Philadelphia, PA. Goan, T., Benson, N., and Etzioni, O. (1996). A grammar inference algorithm for the world wide web. In Proc. of AAAI Spring Symp. on Machine Learning in Information Access, Stanford, CA. Golding, J. M., Graesser, A. C., and Hauselt, J. (1996). The process of answering direction-giving questions when someone is lost on a university campus: The role of pragmatics. Applied Cog. Psych. , 10(1):23–39. Goodrich, M. A., Schultz, A. C., and Bruemmer, D. J., editors (2006). Proc. of 1st ACM Conf. on Human-Robot Interaction, Salt Lake City, UT. Gorniak, P. and Roy, D. (2004). Grounded semantic composition for visual scenes. J. of AI Res. , 21:429–470. Gorniak, P. and Roy, D. (2006). Perceived affordances as a substrate for linguistic concepts. In Proc. of 28th Ann. 
Meeting of the Cog. Sci. Society (CogSci-06), Vancouver, BC. Gorniak, P. and Roy, D. (2007). Situated language understanding as filtering perceived affordances. Cog. Sci. In press. Green, A., Hüttenrauch, H., Topp, E. A., and Eklundh, K. S. (2006). Developing a contextualized multimodal corpus for human-robot interaction. In Proc. of 5th Intl. Conf. on Language Resources and Evaluation (LREC2006), Genoa, Italy. Grice, H. P. (1967). Logic and conversation. William James Lectures, Harvard Univ. Published in Grice (1989). Grice, H. P. (1975). Logic and conversation. In Cole, P. and Morgan, J. L., editors, Speech Acts, volume 3 of Syntax and Semantics, pages 43–58. Academic Press, New York. Grice, H. P. (1989). Studies in the Way of Words. Harvard Univ. Press, Cambridge, MA. Grodner, D. J. and Sedivy, J. C. (2004). The effect of speaker-specific information on pragmatic inferences. In Gibson, E. and Pearlmutter, N., editors, The Processing and Acquisition of Reference. MIT Press. Gruenstein, A. (2002). Conversational interfaces: A domain-independent architecture for task-oriented dialogues. Master’s thesis, Stanford Univ. , Symbolic Sys. Program, Stanford, CA. Gryl, A., Moulin, B., and Kettani, D. (2002). A conceptual model for representing verbal expressions used in route directions. In Coventry and Oliver (2002), pages 19–42. Guhe, M., Habel, C., and Tschander, L. (2003). Describing motion events: Incremental representations for incremental processing. In Proc. of 5th Intl. Ws. on Compl. Semantics (IWCS-5), pages 410–424, Tilburg, The Netherlands. Gupta, R. and Hennacy, K. (2005). Commonsense reasoning about task instructions. In Thórisson, K. R., Vilhjalmsson, H., and Marsella, S. C., editors, Proc. of AAAI Ws. on Modular Construction of Human-Like Intelligence, pages 86–91, Pittsburgh, PA. AAAI Press. Gupta, R. and Kochenderfer, M. J. (2004). Common sense data acquisition for indoor mobile robots. In Proc. of 19th Natl. Conf. on AI (AAAI-2004), Menlo Park, CA. Habel, C. 
(2003). Incremental generation of multimodal route instructions. In AAAI Spring Symp. on Natural language generation in spoken and written dialogue, Stanford, CA. Haigh, K. Z., Shewchuk, J. R., and Veloso, M. M. (1997). Exploiting domain geometry in analogical route planning. J. of Exptl. and Theo. AI , 9:509–541. Hansen, S., Richter, K.-F., and Klippel, A. (2006). Landmarks in OpenLS: A data structure for cognitive ergonomic route directions. In Raubal, M., Miller, H., Frank, A. U., and Goodchild, M. F., editors, Proc. of GIScience, number 4197 in LNCS, Berlin. Hayward, W. G. and Tarr, M. J. (1995). Spatial language and spatial representation. Cognition, 55(1):39–84. Heeman, P. A. and Hirst, G. (1992). Collaborating on referring expressions. Technical Report CSRI-289, Comp. Sci. Dept. Univ. of Rochester, Rochester, NY. Herskovits, A. (1985). Semantics and pragmatics of locative expressions. Cog. Sci. , 9(3):341–378. Herskovits, A. (1997). Language, spatial cognition, and vision. In Stock, O., editor, Spatial and Temporal Reasoning, pages 155–202. Kluwer Academic Publishers, Boston, MA. Huffman, S. B. and Laird, J. E. (1995). Flexibly instructable agents. J. of AI Res. , 3:271–324. Hüttenrauch, H., Green, A., Norman, M., Oestreicher, L., and Eklundh, K. S. (2004). Involving users in the design of a mobile office robot. IEEE Trans. on Sys. , Man & Cybernetics – Part C:App. & Reviews, 34(2):113–124. Jackendoff, R. (1983). Semantics and Cognition. MIT Press. Jackson, P. G. (1998). In search of better route instructions. Ergonomics, 41(7):1001–1013. Johnson, C. and Fillmore, C. J. (2000). The FrameNet tagset for frame-semantic and syntactic coding of predicate-argument structure. In Proc. of North American Assoc. Compl. Lang. Jung, H., Allen, J., Chambers, N., Galescu, L., Swift, M., and Taysom, W. (2006). One-shot procedure learning from instruction and observation. In Proc. of Intl. FLAIRS Conf.: Special Track on Nat. Lang. & Knowledge Representation. 
Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. AI , 101:99–134. Kamp, H. and Reyle, U. (1993). From discourse to logic: Introduction to model-theoretic semantics of natural language, formal logic and discourse representation theory. Kluwer Academic Publishers, London, Boston, Dordrecht. Studies in Linguistics and Philosophy, Volume 42. Kate, R. J., Wong, Y. W., and Mooney, R. J. (2005). Learning to transform natural to formal languages. In Proc. of 20th Natl. Conf. on AI (AAAI-2005), pages 1062–1068, Pittsburgh, PA. Kelleher, J., Costello, F. J., and van Genabith, J. (2005). Dynamically structuring, updating and interrelating representations of visual and linguistic discourse context. AI , 167(1–2):62–102. Spec. Issue on Connecting Language to the World. Kingsbury, P., Palmer, M., and Marcus, M. (2002). Adding semantic annotation to the Penn Treebank. In Proc. of Human Lang. Technology, San Diego, CA. Klippel, A., Tappe, H., and Habel, C. (2003). Pictorial representations of routes: Chunking route segments during comprehension. In Freksa et al. (2003), pages 11–33. Klippel, A., Tappe, H., Kulik, L., and Lee, P. U. (2005). Wayfinding choremes - a language for modeling conceptual route knowledge. J. of Visual Languages & Computing, 16(4):311–329. Klippel, A. and Winter, S. (2005). Structural salience of landmarks for route directions. In Cohn, A. G. and Mark, D. M., editors, Proc. of COSIT-05, number 3693 in LNCS, pages 347–362, Ellicottville, New York. Springer-Verlag. Knees, M. H. (2002). Designing an anaphora resolution algorithm for route instructions. Master’s thesis, Division of Informatics, Univ. of Edinburgh. M.Sc. in Cognitive Science and Natural Language. Koenig, S. and Simmons, R. G. (1996). Unsupervised learning of probabilistic models for robot navigation. In Proc. of IEEE Intl. Conf. on Rob. & Autom. (ICRA-96), pages 2301–2308, Los Alamitos, CA. IEEE Computer Society Press. Koenig, S. 
and Simmons, R. G. (1998). Xavier: a robot navigation architecture based on partially observable Markov decision process models. In Kortenkamp, D., Bonasso, R. P., and Murphy, R., editors, Artificial Intelligence and Mobile Robots: Case Studies of Successful Robot Systems, pages 91–122. AAAI Press/The MIT Press, Menlo Park, CA. Kopp, S., Tepper, P. A., Ferriman, K., and Cassell, J. (2007). Trading spaces: How humans and humanoids use speech and gesture to give directions. Spatial Cogn. & Compn. In press. Kortenkamp, D., MacMahon, M., Ryan, D., Bonasso, R. P., and Moreland, L. (1998). Applying a layered control architecture to a free-flying space camera. In Proc. of IEEE Intl. Joint Symposia on Intelligence & Sys. , pages 188–194, Rockville, MD, USA. Krieg-Brückner, B., Frese, U., Lüttich, K., Mandel, C., Mossakowski, T., and Ross, R. J. (2004). Specification of an ontology for route graphs. In Freksa et al. (2004), pages 390–412. Krieg-Brückner, B., Röfer, T., Carmesin, H.-O., and Müller, R. (1998). A taxonomy of spatial knowledge for navigation and its application to the Bremen autonomous wheelchair. In Freksa et al. (1998), pages 373–398. Kuipers, B. and Kassirer, J. (1987). Knowledge acquisition by analysis of verbatim protocols. In Kidd, A., editor, Knowledge Acquisition for Expert Systems. Plenum, New York. Kuipers, B., Modayil, J., Beeson, P., MacMahon, M., and Savelli, F. (2004). Local metrical and global topological maps in the Hybrid Spatial Semantic Hierarchy. In Proc. of IEEE Intl. Conf. on Rob. & Autom. (ICRA-04), New Orleans, LA. IEEE Computer Society Press. Kuipers, B. J. (2000). The Spatial Semantic Hierarchy. AI , 119:191–233. Kuipers, B. J. (2006). An intellectual history of the Spatial Semantic Hierarchy. In Jefferies, M. and Yeap, A. W.-K., editors, Robot and Cognitive Approaches to Spatial Mapping. Springer-Verlag. Kuipers, B. J. and Byun, Y.-T. (1991). 
A robot exploration and mapping strategy based on a semantic hierarchy of spatial representations. J. of Rob. & Auton. Sys. , 8:47–63. Kuipers, B. J., Tecuci, D. G., and Stankiewicz, B. J. (2003). The skeleton in the cognitive map: A computational and empirical exploration. Env. & Behavior, 35(1):80–106. Kyriacou, T., Bugmann, G., and Lauria, S. (2002). Vision-based urban navigation procedures for verbally instructed robots. In Proc. of IEEE/RSJ Intl. Conf. on Intelligent Robots & Sys. (IROS-02). Kyriacou, T., Bugmann, G., and Lauria, S. (2004). Vision-based urban navigation procedures for verbally instructed robots. Rob. & Auton. Sys. Landau, B. and Jackendoff, R. (1993). ‘What’ and ‘Where’ in spatial language and spatial cognition. The Behavioral & Brain Scis. , 16(2):217–265. Lankenau, A. and Röfer, T. (2001). A versatile and safe mobility assistant. IEEE Robotics & Autom. Magazine, 8(1):29–37. Lauria, S., Bugmann, G., Kyriacou, T., Bos, J., and Klein, E. (2001). Training personal robots using natural language instruction. IEEE Intelligent Sys. , pages 2–9. Lauria, S., Bugmann, G., Kyriacou, T., and Klein, E. (2002a). Mobile robot programming using natural language. Rob. & Auton. Sys. , 38(3–4):171–181. Lauria, S., Kyriacou, T., Bugmann, G., Bos, J., and Klein, E. (2002b). Converting natural language route instructions into robot-executable procedures. In Proc. of IEEE Intl. Ws. on Robot and Human Interactive Communication (RO-MAN), pages 223–228, Berlin, Germany. Lawton, C. A. (2001). Gender and regional differences in spatial referents used in direction giving. Sex Roles, 44(5):321–337. Lemon, O., Gruenstein, A., and Peters, S. (2002). Collaborative activities and multi-tasking in dialogue systems: Towards natural dialogue with robots. Traitement Automatique des Langues (TAL), 43(2):131–154. Spec. Issue on Dialogue. Levelt, W. J. M. (1982). Cognitive styles in the use of spatial direction terms. In Jarvalla, R. J. 
and Klein, W., editors, Speech, Place, and Action: Studies in Deixis and Related Topics, pages 251–268. Wiley, Chichester. Levit, M. and Roy, D. (2007). Interpretation of spatial language in a map navigation task. IEEE Trans. on Sys. , Man & Cybernetics – Part B: Cybernetics. In press. Li, H. and Abe, N. (1998). Generalizing case frames using a thesaurus and the MDL principle. Compl. Ling. , 24(2):217–244. Lieberman, H. and Maulsby, D. (1996). Instructible agents: Software that just keeps getting better. IBM Systems Journal, 35(3–4):539–556. Linde, C. (1974). The Linguistic Encoding of Spatial Information. PhD thesis, Columbia University. Linde, C. and Labov, W. (1975). Spatial structures as a site for the study of language and thought. Language, 51(4):924–939. Liu, H. and Lieberman, H. (2005). Programmatic semantics for natural language interfaces. In Proc. of ACM Conf. on Human Factors in Computing Systems (CHI ’05), pages 1597–1600, Portland, OR. Ligozat, G. (2000). From language to motion, and back: Generating and using route descriptions. In Christodoulakis, D., editor, Natural Language Processing, volume 1835 of LNCS, pages 328–345. Springer, Patras, Greece. Lovelace, K. L., Hegarty, M., and Montello, D. R. (1999). Elements of good route directions in familiar and unfamiliar environments. In COSIT-99 (1999), pages 56–82. Lundh, F. (1999). An introduction to Tkinter. http://www.pythonware.com/library/tkinter/introduction/. Maass, W. (1994). From vision to multimodal communication: Incremental route descriptions. AI Rvw. , 8(2-3):159–174. Maass, W. (1995). How spatial information connects visual perception and natural language generation in dynamic environments. In Frank, A. U. and Kuhn, W., editors, Proc. of COSIT-95, volume 988 of LNCS, pages 223–240, Semmering, Austria. Springer. MacMahon, M. (2005). Understanding and following route instructions through large-scale space. In Proc. of Ws. on Spatial Language and Dialogue, Delmenhorst, Germany. 
MacMahon, M., Adams, W., Bugajska, M., Perzanowski, D., Schultz, A., and Thomas, S. (2004). Adjustable autonomy for route-direction following. In Proc. of AAAI Spring Symp. on Interaction between Humans & Auton. Systems over Extended Operation, Stanford, CA. MacMahon, M. and Stankiewicz, B. (2006). Human and automated indoor route instruction following. In Proc. of 28th Ann. Meeting of the Cog. Sci. Society (CogSci-06), pages 1759–1764, Vancouver, BC. MacMahon, M., Stankiewicz, B., and Kuipers, B. (2006). Walk the talk: Connecting language, knowledge, and action in route instructions. In Proc. of 21st Natl. Conf. on AI (AAAI-2006), pages 1475–1482, Boston, MA. Mandel, C., Huebner, K., and Vierhuff, T. (2005). Towards an autonomous wheelchair: Cognitive aspects in service robotics. In Proc. of Towards Autonomous Robotic Systems (TAROS 2005), pages 165–172, London, UK. Manning, C. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA. McClosky, D., Charniak, E., and Johnson, M. (2006). Effective self-training for parsing. In Proc. of North American Assoc. Compl. Lang. McDermott, D. and Davis, E. (1984). Planning routes through uncertain territory. AI , 22:107–156. Mellet, E., Briscogne, S., Tzourio-Mazoyer, N., Ghaem, O., Petit, L., Zago, L., Etard, O., Berthoz, A., Mazoyer, B., and Denis, M. (2000). Neural correlates of topographic mental exploration: The impact of route versus survey perspective learning. NeuroImage, 12(5):588–600. Michon, P.-E. and Denis, M. (2001). When and why are visual landmarks used in giving directions? In Montello (2001), pages 292–305. Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, Boston. Modayil, J., Beeson, P., and Kuipers, B. (2004). Using the topological skeleton for scalable global metrical map-building. In Proc. of IEEE/RSJ Intl. Conf. on Intelligent Robots & Sys. (IROS-04), pages 1530–1536, Sendai, Japan. Modayil, J. and Kuipers, B. (2006). 
Autonomous shape model learning for object localization and recognition. In Proc. of IEEE Intl. Conf. on Rob. & Autom. , pages 2991–2996. Montello, D. R., editor (2001). Spatial Information Theory: Foundations of Geog. Info. Sci. (COSIT ’01), volume 2205 of LNCS, Morro Bay, CA. Springer. Moratz, R. and Tenbrink, T. (2006). Spatial reference in linguistic human-robot interaction: Iterative, empirically supported development of a model of projective relations. Spatial Cogn. & Compn. , 6(1):63–106. Moratz, R., Tenbrink, T., Bateman, J., and Fischer, K. (2003). Spatial knowledge representation for human-robot interaction. In Freksa et al. (2003), pages 263–286. Moravec, H. and Elfes, A. (1985). High resolution maps from wide angle sonar. In Proc. of IEEE Intl. Conf. on Rob. & Autom. , pages 116–121. Moulin, B. and Kettani, D. (1998). Combining a logical and an analogical framework for route generation and description. Annals of Mathematics and Artificial Intelligence, 24(1-4):155–179. Mukerjee, A. (1998). Neat vs scruffy: A survey of computational models for spatial expressions. In Olivier and Gapp (1998). Müller, R., Röfer, T., Lankenau, A., Musto, A., Stein, K., and Eisenkolb, A. (2000). Coarse qualitative descriptions in robot navigation. In Freksa et al. (2000), pages 265–276. Murphy, R. R. (2004). Human-robot interaction in rescue robotics. IEEE Trans. on Sys. , Man & Cybernetics – Part C:App. & Reviews, 34(2):138–153. Narayanan, S. (1997). Talking the talk is like walking the walk: A computational model of verbal aspect. In Proc. of 19th Ann. Meeting of the Cog. Sci. Society (CogSci-97), Stanford, CA. Nicolescu, M. and Matarić, M. J. (2003). Natural methods for learning and generalization in human-robot domains. In Proc. of the Second Intl. Conf. on Auton. Agents, Melbourne, Australia. Nothegger, C., Winter, S., and Raubal, M. (2004). Selection of salient features for route directions. Spatial Cogn. & Compn. , 4(2):113–136. Olivier, P. 
and Gapp, K.-P., editors (1998). Representation and processing of spatial expressions. Lawrence Erlbaum Associates, Mahwah, NJ. Palmer, M., Kingsbury, P., and Gildea, D. (2005). The Proposition Bank: An annotated corpus of semantic roles. Compl. Ling. , 31(1):71–106. Perrault, C. R. and Allen, J. F. (1980). A plan-based analysis of indirect speech acts. American J. of Compl. Ling. , 6(3-4). Perzanowski, D., Brock, D., Adams, W., Bugajska, M., Schultz, A. C., Trafton, J. G., Blisard, S., and Skubic, M. (2003). Finding the FOO: A pilot study for a multimodal interface. In Proc. of IEEE Sys. , Man & Cybernetics Conf., pages 3218–3223, Washington, DC. Perzanowski, D., Schultz, A., Adams, W., Bugajska, M., Abramson, M., MacMahon, M., Atrash, A., and Coblenz, M. (2002). “Excuse me, where’s the registration desk?”: Report on integrating systems for the Robot Challenge AAAI 2002. In Human-Robot Interaction, Papers from the 2002 AAAI Fall Symp., pages 63–72. AAAI Press. Technical Report FS-02-03. Perzanowski, D., Schultz, A. C., and Adams, W. (1998). Integrating natural language and gesture in a robotics domain. In Proc. of Intl. Symp. on Intelligent Control, pages 247–252, Washington, DC. IEEE Computer Society Press. Perzanowski, D., Schultz, A. C., Adams, W., Marsh, E., and Bugajska, M. (2001). Building a multimodal human-robot interface. IEEE Intelligent Sys. , pages 16–21. Pickering, M. J. and Garrod, S. (2004). Toward a mechanistic psychology of dialogue. The Behavioral & Brain Scis. , 27(2):169–226. Pook, P. K. and Ballard, D. H. (1996). Deictic human/robot interaction. Rob. & Auton. Sys. , 18(1–2):259–269. Proc. of Intl. Ws. on Biorobotics: Human-Robot Symbiosis. Porzel, R., Jansche, M., and Meyer-Klabunde, R. (2002). Generating spatial descriptions from a cognitive point of view. In Coventry and Oliver (2002), pages 185–208. Purver, M., Ratiu, F., and Cavedon, L. (2006). Robust interpretation in dialogue by combining confidence scores with contextual features. 
In Proc. of Intl. Conf. on Spoken Language Processing (Interspeech/ICSLP), pages 1–4, Pittsburgh, PA. Python (2007). Python programming language. http://www.python.org/. Raubal, M. and Winter, S. (2002). Enriching wayfinding instructions with local landmarks. In Egenhofer, M. J. and Mark, D. M., editors, Proc. of GIScience 2002, volume 2478 of LNCS, pages 243–259, Boulder, CO. Springer. Regier, T. and Carlson, L. A. (2001). Grounding spatial language in perception: An empirical and computational investigation. J. of Exptl. Psych. , 130(2):273–298. Reiter, E. and Dale, R. (1997). Building applied natural language generation systems. J. of Natural Lang. Eng. , 1(1):1–32. Remolina, E. and Kuipers, B. (2004). Towards a general theory of topological maps. AI , 152(1):47–104. Richter, K.-F., Klippel, A., and Freksa, C. (2004). Shortest, fastest – but what next? A different approach to route directions. In Raubal, M., Sliwinski, A., and Kuhn, W., editors, Geoinformation und Mobilität - von der Forschung zur praktischen Anwendung. Beiträge zu den Münsteraner GI-Tagen 2004, IfGIprints, pages 205–217. Institut für Geoinformatik; Münster. Riesbeck, C. (1980). “You can’t miss it!”: Judging the clarity of directions. Cog. Sci. , 4:285–303. Rogers, S., Fiechter, C.-N., and Langley, P. (1999). An adaptive interactive agent for route advice. In Proc. of Third Intl. Conf. on Auton. Agents, pages 198–205. ACM Press. Rosenthal, R. and Rosnow, R. L. (1991). Essentials of Behavioral Research: Methods and Data Analysis. McGraw-Hill, 2nd edition. Ross, R. J., Shi, H., Vierhuff, T., Krieg-Brückner, B., and Bateman, J. (2004). Towards dialogue based shared control of navigating robots. In Freksa et al. (2004), pages 478–499. Roy, D. (2005). Semiotic schemas: a framework for grounding language in action and perception. AI , 167(1–2):170–205. Roy, D., Hsiao, K.-Y., and Mavridis, N. (2004). Mental imagery for a conversational robot. IEEE Trans. on Sys. , Man & Cybernetics – Part B: Cybernetics, 34(3):1374–1383. 
Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice Hall Series in Artificial Intelligence. Prentice Hall, Upper Saddle River, NJ, 1st edition. Sandstrom, N. J., Kaufman, J., and Huettel, S. A. (1998). Males and females use different distal cues in a virtual environment navigation task. Cog. Brain Research, 6(4):351–360. Scherl, R. B. and Levesque, H. J. (2003). Knowledge, action, and the frame problem. Artificial Intelligence, 144(1-2):1–39. Schmitz, S. (1999). Gender differences in acquisition of environmental knowledge related to wayfinding behavior, spatial anxiety and self-estimated environmental competencies. Sex Roles, 41(1-2):71–93. Schober, M. F. (1993). Spatial perspective-taking in conversation. Cognition, 47(1):1–24. Sedivy, J. C. (2003). Pragmatic versus form-based accounts of referential contrast: Evidence for effects of informativity expectations. J. of Psycholinguistic Research, 32(1):3–23. Shi, H. and Tenbrink, T. (2005). Telling Rolland where to go: HRI dialogues on route navigation. In Proc. of WoSLaD Workshop on Spatial Language and Dialogue. Shimizu, N. and Haas, A. (2006). Extracting frame-based knowledge representation from route instructions. In HLT-NAACL Ws. on Computationally Hard Problems and Joint Inference in Speech and Language Processing. Late Breaking Paper. Siegel, A. W. and White, S. H. (1975). The development of spatial representations of large-scale environments. In Reese, H. W., editor, Advances in Child Development and Behavior, volume 10, pages 9–55. Academic Press, New York. Simmons, R. and Apfelbaum, D. (1998). A task description language for robot control. In Proc. of IEEE/RSJ Intl. Conf. on Intelligent Robots & Sys. (IROS-98). 
Simmons, R., Goldberg, D., Goode, A., Montemerlo, M., Roy, N., Sellner, B., Urmson, C., Schultz, A., Abramson, M., Adams, W., Atrash, A., Bugajska, M., Coblenz, M., MacMahon, M., Perzanowski, D., Horswill, I., Zubek, R., Kortenkamp, D., Wolfe, B., Milam, T., and Maxwell, B. (2003). GRACE: An autonomous robot for the AAAI Robot Challenge. AI Magazine, 24(2):51–72. Simmons, R. G. and Koenig, S. (1995). Probabilistic robot navigation in partially observable environments. In Proc. of 14th Intl. Joint Conf. on AI (IJCAI-95), pages 1080–1087, Montreal, Canada. Intl. Joint Conf. on AI , AAAI Press/The MIT Press. Simpson, R. C. (2005). Smart wheelchairs: A literature review. J. of Rehabilitation Research & Development, 42(4):423–436. Siskind, J. M. (1990). Acquiring core meanings of words, represented as Jackendoff-style conceptual structures, from correlated streams of linguistic and non-linguistic input. In Proc. of 28th Ann. Meeting of the ACL (ACL-90), pages 143–156, Pittsburgh, PA. Siskind, J. M. (1995). Grounding language in perception. AI Rvw. , 8(5-6):371–391. Siskind, J. M. (2001). Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic. J. of AI Res. , 15:31–90. Skantze, G. (2005). Exploring human error recovery strategies: Implications for spoken dialogue systems. Speech Communication, 45(3):325–341. Spec. Issue on Error Handling in Spoken Dialogue Systems. Skubic, M., Blisard, S., Bailey, C., Adams, J., and Matsakis, P. (2004a). Qualitative analysis of sketched route maps: Translating a sketch into linguistic descriptions. IEEE Trans. on Sys. , Man & Cybernetics – Part B: Cybernetics, 34(2):1275–1282. Skubic, M., Matsakis, P., Forrester, B., and Chronis, G. (2001). Generating linguistic spatial descriptions from sonar readings using the histogram of forces. In Proc. of IEEE Intl. Conf. on Rob. & Autom. (ICRA-01). Skubic, M., Perzanowski, D., Blisard, S., Schultz, A., Adams, W., Bugajska, M., and Brock, D. (2004b). 
Spatial language for human-robot dialogs. IEEE Trans. on Sys. , Man & Cybernetics – Part C:App. & Reviews, 34(2):154–167. Sorrows, M. E. and Hirtle, S. C. (1999). The nature of landmarks for real and electronic spaces. In COSIT-99 (1999), pages 37–50. Sperber, D. and Wilson, D. (1986). Relevance: Communication and Cognition. Blackwell Publishers, Oxford. Sperber, D. and Wilson, D. (2004). Relevance Theory. In Horn, L. R. and Ward, G., editors, The Handbook of Pragmatics, pages 607–632. Blackwell, Oxford. Stankiewicz, B. and Eastman, K. (2008). Lost in Virtual Space II: The role of proprioception and discrete actions when navigating with uncertainty. ACM Trans. on Applied Perception. under review. Stankiewicz, B. and Kalia, A. (2007). Acquisition and retention of structural versus object landmark knowledge when navigating through a large-scale space. J. of Exptl. Psych. : Human Perception & Performance. in press. Stankiewicz, B. J., Legge, G. E., Mansfield, J. S., and Schlicht, E. J. (2006). Lost in virtual space: Studies in human and ideal spatial navigation. J. of Exptl. Psych. : Human Perception & Performance, 32(3):686–704. Stankiewicz, B. J., Legge, G. E., and Schlicht, E. (2001). The effect of layout complexity on human and ideal navigation performance. J. of Vision, 1(3). Stocky, T. A. (2002). Conveying routes: Multimodal generation and spatial intelligence in embodied conversational agents. Master’s thesis, Mass. Inst. of Technology, Cambridge, MA. Stoia, L., Byron, D., Shockley, D., and Fosler-Lussier, E. (2006). Sentence planning for realtime navigational instruction. In Proc. of Human Language Technology Conf. of the NAACL, pages 157–160, New York City, USA. Stone, M., Doran, C., Webber, B., Bleam, T., and Palmer, M. (2003). Microplanning with communicative intentions: The SPUD system. Compl. Intelligence, 19(4):311–381. Striegnitz, K., Tepper, P., Lovett, A., and Cassell, J. (2005). 
Knowledge representation for generating locating gestures in route directions. In Proc. of WoSLaD Workshop on Spatial Language and Dialogue, Delmenhorst, Germany. Talmy, L. (1983). How language structures space. In Pick, Jr., H. L. and Acredolo, L. P., editors, Spatial orientation: Theory, research and application, pages 225–282. Plenum, NY. Talmy, L. (2000). Toward A Cognitive Semantics, volume I of Language, Speech, and Communication. MIT Press, Cambridge, MA. Tappe, H. and Habel, C. (1998). Verbalization of dynamic sketch maps: Layers of representation and their interaction. In Proc. of 20th Ann. Meeting of the Cog. Sci. Society (CogSci-98), Madison, WI. Taylor, H. A. and Tversky, B. (1992). Spatial mental models derived from survey and route descriptions. J. of Memory & Lang. , 31(2):261–292. Taylor, H. A. and Tversky, B. (1996). Perspective in spatial descriptions. J. of Memory & Lang. , 35(3):371–391. Taylor, H. A., Uttal, D. H., Fisher, J., and Mazepa, M. (2001). Ambiguity in acquiring spatial representation from descriptions compared to depictions: The role of spatial orientation. In Montello (2001), pages 278–291. Tellex, S. and Roy, D. (2006). Spatial routines for a simulated speech-controlled vehicle. In Goodrich et al. (2006). Tellex, S. and Roy, D. (2007). Grounding language in spatial routines. In Proc. of AAAI Spring Symp. on Control Mechanisms for Spatial Knowledge Processing in Cognitive / Intelligent Systems, Stanford, CA. Tenbrink, T. (2003). Conveying spatial information in linguistic human-robot interaction. In Proc. of Ws. on Semantics and Pragmatics of Dialogue, pages 207–8. Tenbrink, T., Fischer, K., and Moratz, R. (2002). Spatial strategies in linguistic human-robot communication. In Freksa, C., editor, Künstliche Intelligenz Themenheft 4/02 Spatial Cognition, pages 19–23. arenDTaP Verlag. Tenbrink, T. and Klippel, A. (2005). Achieving reference via contrast in route instructions and spatial object identification. In Ws. 
on Reference, 21st Scandinavian Conference of Linguistics. Tenbrink, T. and Moratz, R. (2003). Group-based spatial reference in linguistic human-robot interaction. In Proc. of European Cognitive Science Conf. (EuroCogSci-03), Osnabrück, Germany. Tews, A. D., Matarić, M. J., and Sukhatme, G. S. (2003). A scalable approach to human-robot interaction. In Proc. of IEEE Intl. Conf. on Rob. & Autom. , pages 1665–1670, Taipei, Taiwan. Theocharous, G., Murphy, K., and Kaelbling, L. P. (2004). Representing hierarchical POMDPs as DBNs for multi-scale robot localization. In Proc. of IEEE Intl. Conf. on Rob. & Autom. 2004. Timpf, S., Volta, G. S., Pollock, D. W., and Egenhofer, M. J. (1992). A conceptual model of wayfinding using multiple levels of abstraction. In Campari, I., Frank, A. U., and Formentini, U., editors, Theories and Methods of Spatio-Temporal Reasoning in Geographic Space, volume 639 of Lecture Notes in Computer Science, pages 348–367. Springer-Verlag. Trafton, J. G., Cassimatis, N. L., Bugajska, M. D., Brock, D. P., Mintz, F. E., and Schultz, A. C. (2005). Enabling effective human-robot interaction using perspective-taking in robots. IEEE Trans. on Sys. , Man & Cybernetics – Part A:Sys. & Humans, 35(4):460–470. Spec. Issue on Human-Robot Interaction. Tschander, L. B., Schmidtke, H. R., Eschenbach, C., Habel, C., and Kulik, L. (2003). A geometric agent following route instructions. In Freksa et al. (2003), pages 89–111. Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59(236):433–460. Tversky, B. (2000). Some ways that maps and diagrams communicate. In Freksa et al. (2000), pages 72–79. Tversky, B., Lee, P., and Mainwaring, S. (1999). Why do speakers mix perspectives? Spatial Cogn. & Compn. , 1:399–412. Tversky, B. and Lee, P. U. (1998). How space structures language. In Freksa et al. (1998), pages 157–176. Tversky, B. and Lee, P. U. (1999). Pictorial and verbal tools for conveying routes. In COSIT-99 (1999), pages 51–64. 
van Asselen, M., Fritschy, E., and Postma, A. (2006). The influence of intentional and incidental learning on acquiring spatial knowledge during navigation. Psych. Res. , 70(2):151–156. van der Zee, E. and Slack, J., editors (2003). Representing Direction in Language and Space. Number 1 in Explorations in Langauge and Space. Oxford Univ. Press. Vander Linden, K. and Di Eugenio, B. (1996). A corpus study of negative imperatives in natural language instructions. In Proc. of 16th Intl. Conf. on Compl. Ling. (COLING-96), Copenhagen, Denmark. Vander Linden, K. and Martin, J. H. (1995). Expressing local rhetorical relations in instructional text: A case-study of the purpose relation. Compl. Ling. , 21(1):29–57. Vanetti, E. J. and Allen, G. L. (1988). Communicating environmental knowledge: The impact of verbal and spatial abilities on the production and comprehension of route directions. Env. & Behavior, 20:667–682. Varges, S. (2005). Spatial descriptions as referring expressions in the MapTask domain. In Proc. of 10th European Ws. on Nat. Lang. Gen. (ENLG’05), Aberdeen, Scotland. Verma, V., Estlin, T., Jónsson, A., Pasareanu, C., Simmons, R., and Tso, K. (2005). Plan execution interchange language (PLEXIL) for executable plans and command sequences. In Proc. of Intl. Symp. on Artificial Intelligence, Robotics and Automation in Space (iSAIRAS), Munich, Germany. Vizard (2006). WorldViz http://www.worldviz.com/vizard.htm. Vizard virtual reality software. Ward, S. L., Newcombe, N., and Overton, W. F. (1986). Turn left at the church or three miles north: A study of direction giving and sex differences. Env. & Behavior, 18(2):192–213. Wauchope, K., Everett, S., Perzanowski, D., and Marsh, E. (1997). Natural language in four spatial interfaces. In Proc. of 5th Applied Nat. Lang. Proc. Conf. on, pages 8–11, Washington, DC. Webber, B., Badler, N., Di Eugenio, B., Geib, C., Levison, L., and Moore, M. (1995). Instructions, intentions and expectations. AI , 73(1–2):253–269. Spec. 
Issue on “Compl. Res. on Interaction and Agency, Pt. 2”. Webber, B. and Di Eugenio, B. (1990). Free adjuncts in natural language instructions. In Proc. of 13th Intl. Conf. on Compl. Ling. (COLING-90), pages 395–400, Helsinki, Finland. Weissensteiner, E. and Winter, S. (2004). Landmarks in the communication of route directions. In Egenhofer, M. J., Freksa, C., and Miller, H., editors, Proc. of GIScience 2004, volume 3234 of LNCS, pages 313–326, Adelphi, MD. Springer. Weng, F., Varges, S., Raghunathan, B., Ratiu, F., Pon-Barry, H., Lathrop, B., Zhang, Q., Scheideck, T., Bratt, H., Xu, K., Purver, M., Mishra, R., Raya, M., Peters, S., Meng, Y., Cavedon, L., and Shriberg, L. (2006). CHAT: A Conversational Helper for Automotive Tasks. In Proc. of Intl. Conf. on Spoken Language Processing (Interspeech/ICSLP), pages 1061–1064, Pittsburgh, PA. Werner, S., Krieg-Brückner, B., Mallot, H. A., Schweizer, K., and Freksa, C. (1997). Spatial cognition: The role of landmark, route, and survey knowledge in human and robot navigation. In Jarke, M., Pasedach, K., and Pohl, K., editors, Informatik 97, pages 41–50. Berlin, Heidelberg, New York; Springer. Winograd, T. (1972). Understanding Natural Language. Academic Press, New York. Wong, Y. W. and Mooney, R. J. (2006). Learning for semantic parsing with statistical machine translation. In Proc. of 2006 Human Language Technology Conference - North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT/NAACL-2006), New York City, NY. WordNet (2005). WordNet: a lexical database for the English language. http://wordnet.princeton.edu/. Yeap, W. K. and Jefferies, M. E. (1999). Computing a representation of the local environment. AI , 107:265–301. Yu, C. and Ballard, D. H. (2004). On the integration of grounding language and learning objects. In Proc. of 19th Natl. Conf. on AI (AAAI-2004), pages 488–493, San Jose, CA. Zelek, J. S. (1997). 
Vita

Matthew Tierney MacMahon was born in N. Tarrytown, New York on September 10, 1974, the son of B.J. and Paul MacMahon. Matt attended Marcus High School in Flower Mound, Texas and graduated from the Texas Academy of Mathematics and Science, Denton, Texas, in 1993. He earned a Bachelor of Science from the Symbolic Systems Program, an interdisciplinary cognitive science program at Stanford University, in 1997. For three years, Matt worked as a contractor at NASA Johnson Space Center in Houston, Texas, as an Intelligent Systems Integration Engineer. In 2000, Matt entered the Graduate School of The University of Texas at Austin, where he earned a Master of Science in Engineering degree in 2002. During his summers in graduate school, Matt worked for the Naval Research Laboratory in Washington, DC, and NASA Ames Research Center in Moffett Field, CA. In 2003, Matt married Sarah Piper in Austin, Texas. Matt has published papers in the Proceedings of the International Conference on Field and Service Robotics, the National Conference on Artificial Intelligence, the Cognitive Science Society, the IEEE International Conference on Robotics and Automation, the International Conference on Autonomous Agents, the IEEE Aerospace Conference, and the IEEE International Joint Symposia on Intelligence and Systems, as well as articles in AI Magazine, the Journal of Autonomous Agents and Multi-Agent Systems, and Connection Science.

Permanent Address: Matt@MacMahon.org, 12706 Theriot Trail, Austin, Texas.

This dissertation was typeset with LaTeX 2ε by the author. LaTeX 2ε is an extension of LaTeX. LaTeX is a collection of macros for TeX. TeX is a trademark of the American Mathematical Society.
The macros used in formatting this dissertation were written by Dinesh Das, Department of Computer Sciences, The University of Texas at Austin, and extended by Bert Kay and James A. Bednar.