Applying Computational Models of Spatial - IA de Investigación especializada en el MIT

Applying Computational Models of Spatial
Prepositions to Visually Situated Dialog

John D. Kelleher∗
Dublin Institute of Technology

Fintan J. Costello∗∗
University College Dublin

This article describes the application of computational models of spatial prepositions to visually
situated dialog systems. In these dialogs, spatial prepositions are important because people
often use them to refer to entities in the visual context of a dialog. We ﬁrst describe a generic
architecture for a visually situated dialog system and highlight the interactions between the
spatial cognition module, which provides the interface to the models of prepositional semantics,
and the other components in the architecture. Following this, we present two new computational
models of topological and projective spatial prepositions. The main novelty within these models
is the fact that they account for the contextual effect which other distractor objects in a visual
scene can have on the region described by a given preposition. We next present psycholinguistic
tests evaluating our approach to distractor interference on prepositional semantics, and illustrate
how these models are used for both interpretation and generation of prepositional expressions.

1. Introducción

A growing number of computer applications share a visualized (virtual or real) espacio
with the user, for example graphic design programs, computer games, navigation aids,
robot systems, Etcétera. If these systems are to be equipped with dialog interfaces,
they must be able to participate in visually situated dialog. Visually situated dialog is
spoken from a particular point of view within a physical or simulated context. De
theoretical linguistic and cognitive perspectives, visually situated dialog systems are
interesting as they provide ideal testbeds for investigating the interaction between
language and vision. From a human–computer interaction (HCI) perspectiva, visually
situated dialog systems promise many advantages to users interacting with these
sistemas. In this article we describe computational models for the interpretation and
generation of visually situated locative expressions involving topological and projective
spatial prepositions.

Contributions An inherent aspect of visually situated dialog is reference to objects
in the physical environment in which the dialog occurs. People often use locative

∗ School of Computing, Dublin Institute of Technology, Kevin Street, Dublín 8, Irlanda. Correo electrónico:

john.kelleher@comp.dit.ie.

∗∗ School of Computer Science and Informatics, University College Dublin, Belﬁeld, Dublín 4, Irlanda.

Correo electrónico: ﬁntan.costello@ucd.ie.

Envío recibido: 31 Julio 2006; revised submission received: 30 Marzo 2007; accepted for publication:
4 Julio 2007.

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 35, Número 2

expresiones, in particular spatial prepositions, to pick out objects in the visual envi-
ambiente. In this article we present computational models of the semantics of spatial
prepositions and illustrate how these models can be used in a visually situated dialog
system for reference resolution and generation. These models are designed to handle
reference resolution and generation in complex visual environments containing multi-
ple objects, and to account for the contextual inﬂuence which the presence of multiple
objects has on the semantics of spatial prepositions. In this our models move beyond
other accounts, which typically do not model the contextual inﬂuence of other objects
on spatial semantics. Because most real-world visual scenes are complex and contain
multiple objects, our models for the semantics of spatial prepositions are important for
visually situated dialog systems intended to operate usefully in the real world.

Overview We begin in Section 2 by describing some terminology we use when
discussing locative expressions. En la sección 3 we present an abstract architecture for
a visually situated dialog system and, using this architecture, illustrate how the spa-
tial reasoning component of the architecture interacts with the other components of
the system. En la sección 4 we review psycholinguistic data on the semantics of spatial
prepositions. Sección 5 reviews previous computational models of spatial prepositional
semantics. Sección 6 presents our computational models accounting for the semantics of
spatial prepositions and the inﬂuence of visual context on those semantics, y Sección 7
presents psycholinguistic evaluation of these models. Sección 8 presents applications of
the models in implemented systems. Sección 8.1 presents an application of our models
to the interpretation of locative expressions, based on Kelleher, Kruijff, and Costello
(2006), y Sección 8.2 presents algorithms which use these models to generate locative
expressions to identify objects in visual scenes from Kelleher and Kruijff (2006).

2. Terminology

Our computational models are designed to interpret and generate locative expressions
involving spatial prepositions. The term locative expression describes “an expression
involving a locative prepositional phrase together with whatever the phrase modiﬁes
(noun, cláusula, etc.)" (Herskovits 1986, página 7). In this article we use the term target
(t) to refer to the object that is being located by a locative expression and the term
landmark1 (l) to refer to the object relative to which the target’s location is described;
see Example (1). We will use the term distractor to describe any object in the visual
context that is neither the landmark nor the target.

Ejemplo 1
[The man]T near [the table]l.

The English lexicon of spatial prepositions numbers above 80 miembros (not consid-
ering compounds such as right next to) (Landau 1996). Within this set a distinction can
be made between static and dynamic prepositions: static prepositions primarily2 denote

1 There is a wealth of terms used in the literature describing locative expressions. The terms local object,
ﬁgure object, and trajector are all equivalent to our term target while the terms reference object, ground,
and relatum are equivalent to our term landmark.

2 Static prepositions can be used in dynamic contexts, Por ejemplo, the man ran behind the house, y

dynamic prepositions can be used in static ones, Por ejemplo, the tree lay across the road.

272

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Kelleher and Costello

Computational Models of Spatial Prepositions for VSD

Cifra 1
Architecture of a visually situated dialog system.

the location of an object, dynamic prepositions primarily denote the path of an object
(Jackendoff 1983; Herskovits 1986), see Examples (2) y (3).

Ejemplo 2
The tree is [behind]static the house.

Ejemplo 3
The man walked [across]dynamic the road.

En general, the set of static prepositions can be decomposed into two sets called
topological and projective. Topological prepositions are the category of prepositions
referring to a region that is proximal to the landmark; Por ejemplo, en, cerca. A menudo, el
distinctions between the semantics of the different topological prepositions is based
on pragmatic constraints, for example the use of at licenses the target to be in contact
with the landmark, while the use of near does not. Projective prepositions describe a
region projected from the landmark in a particular direction, with the speciﬁcation of
the direction dependent on the frame of reference3 being used; Por ejemplo, to the right
de, to the left of.

3. Visually Situated Dialog System Architecture

In this section we present an abstract implementation-independent architecture for a
visually situated dialog system and highlight the role played by spatial reasoning in the
functioning of the system. En particular, we describe how models of spatial prepositional
semantics are important for reference resolution and generation.

The distinguishing characteristic of a visually situated dialog system is that the
system has the ability to visually perceive the environment in which a dialog is situated.
Como consecuencia, these systems use both visual and linguistic contextual information to
understand user commands and to generate linguistic descriptions of the environment.
Cifra 1 illustrates the visual dialog system architecture we will describe. The arrows in
the ﬁgure represent data ﬂows through the system; the boxes are the main information
processing components.

3 In the context of projective prepositions, a frame of reference consists of six half-line axes with a shared
origen; en Inglés, these axes are usually labelled front, atrás, bien, izquierda, arriba, abajo. In English, tres
different frames of reference are distinguished: absolute, intrinsic, and viewer-centered. Curiosamente
sin embargo, although the use of a tripartite system is common in European languages, this is not universal,
with many languages taking different approaches here. We direct the interested reader to Levinson (1996,
2003) and Levelt (1996) for further discussion on frames of reference.

273

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 35, Número 2

Cifra 2
Example input and output data from a vision subsystem.

There are two information inputs into this system: the vision subsystem and the
speech interpretation pipeline. The vision subsystem directly updates the system’s
representation of the visual context. The basic requirements for the vision subsystem
are that it is able to detect and categorize the objects in the visual context and can
provide geometric positioning information for each visible object. Cifra 2 ilustra
the analysis that a vision subsystem may generate for a given scene.

The speech interpretation pipeline begins with speech recognition. This module
takes a speech utterance from the user and creates a string representation of it. El
parser uses this string to construct a structured representation of the input. Parsers
range in function from wide-coverage syntactic focused parsers, such as Cahill et al.’s
(2004) probabilistic Lexical-Functional Grammar (LFG) parser, to narrow coverage se-
mantic based parsers, for example the CoSy parser (Kruijff, Kelleher, and Hawes 2006).
Cifra 3 illustrates the types of analyses produced by these different types of parsers for
the input string is the box near the ball? The parse tree on the left was generated using a
probabilistic wide-coverage LFG parser.4 The parse tree provides a syntactic analysis of
the input string.

Generally, parsers developed for interactive dialog systems integrate semantic, como
well as syntactic, information in their grammars. In these parsers the elements in the
lexicon and grammar are based on an analysis of the entities and relations of the
speciﬁc domain the system is designed for. These parsers sacriﬁce coverage for depth of
análisis. For a dialog system, the advantage of this deeper analysis is that the semantic
information in the parser’s output can be used by the dialog manager to relate the input
to the rest of the dialog. The parse structure on the right of Figure 3 illustrates the type of
semantically rich representation that an interactive dialog system parser might produce
(this particular representation was generated by the CoSy parser).

The CoSy parser uses a Combinatory Categorial Grammar that represents linguistic
meaning using an ontologically rich sorted relational structure (Baldridge and Kruijff

4 A demo of the parser is available at: http://lfg-demo.computing.dcu.ie/lfgparser.html. The parser

also provides detailed LFG f-structures for input strings.

274

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Kelleher and Costello

Computational Models of Spatial Prepositions for VSD

Cifra 3
Example parse structures for the string is the box near the ball?

2002, 2003). Within this representation the statement b2:phys-obj means that the referent
b2 is of type phys-obj (es decir., a physical object as deﬁned by the ontology the grammar in-
dexes). The semantic contribution of the prepositional phrase near the ball is represented
by the structure and its subcomponents. This structure describes a locative
prepositional phrase containing a static preposition that locates the referent b2 in the
region r that is proximal to the landmark described by the subcomponent. Él
should be noted that the syntactic and semantic representation of prepositions within
grammars is an area of ongoing research (see Gawron 1986; Tseng 2000; Beermann and
Hellan 2004). The analysis presented here of the prepositional phrase near the ball is
intended to illustrate some of the semantic features that prepositions may introduce
into a grammar and is not intended as a comprehensive account of how prepositions
should be grammatically represented.

The ﬁnal stage in the interpretation pipeline is to categorize how the utterance
relates to the current dialog context. This categorization is driven by the dialog manager
and involves interpreting an utterance as a dialog act (Bunt 1994; Carletta et al. 1997;
Klein 1999). One of the important tasks in this process is resolving the references
in the input. Como consecuencia, the dialog manager may invoke the reference resolution
component. Reference resolution is one of two functions in the architecture where
spatial reasoning plays an important role. From a computational perspective, reference
resolution involves two main tasks:

Creating and maintaining a model of what the system considers as
mutual knowledge (this model should contain all the objects that are
available for reference and their properties)

2. Matching the representation introduced by a given referring expression

to an element (or elements) in the set of possible referents

In a visually situated dialog a referring expression may be exophoric (es decir., denote
an object in the visual context which has not yet been mentioned in the dialog) or it
may be anaphoric (es decir., access a representation of a previous referring expression in the
dialog context). People often use the spatial location of an object, described using spatial
prepositions, when making exophoric references. Como resultado, in order to interpret these
references the system must have access to models of the semantics of the prepositions

275

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 35, Número 2

Cifra 4
The mapping performed by the spatial reasoning module from qualitative to geometric
representations during the interpretation of a locative expression.

usado. In this architecture this access is provided through the spatial reasoning compo-
próximo. Cifra 4 illustrates the translation between the qualitative, parser-generated, y
the geometric, vision subsystem–generated representations that must be performed in
order to interpret a spatial locative expression.

At different stages during a dialog the dialog manager may recognize that the
system needs to generate a response to the last input utterance. Por ejemplo, el
utterance may have been a question, such as where is x? or which x?. In such cases,
the dialog manager informs the content planner of this. The role of the content planner
is to determine the semantic content that should be included in the system’s output,
rather than the linguistic realization of this content. En efecto, the content planner may
generate a logical representation closer to the parse structure on the right of Figure 3
than to a natural language description.

Generating referring expressions (GRE) is a key stage in content planning. GRE is
the second function in the architecture where spatial reasoning plays an important role.
The function of the GRE component is to determine the set of properties that distinguish
a particular target object from the other objects in the scene. Por ejemplo, in response
to a question such as which x? the GRE component may determine that a color and type
description is sufﬁcient to distinguish the target object, resulting in an answer such as the
blue x being linguistically realized. Sin embargo, it may be the case that the location of the
target in the scene is the only way to distinguish it. In such cases, the GRE component
needs access to computational models of the spatial prepositions if it is to determine
which spatial relation is most suitable. Cifra 5 illustrates the translation from a geo-
metric to qualitative representation that is performed during the GRE process by the
spatial reasoning module when a locative description is being generated by the system.
Once the content planning and GRE processes have been completed, the realizer
determines a surface linguistic form in which this content can be conveyed. Finalmente, el
speech synthesis systems generate the speech output for the linguistic string created by
the realizer.

4. Psycholinguistic Data on Spatial Prepositions

Spatial reasoning is a complex activity that involves at least two levels of process-
En g: a geometric level where metric, topological, and projective properties are handled

276

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Kelleher and Costello

Computational Models of Spatial Prepositions for VSD

Cifra 5
The mapping performed by the spatial reasoning module from geometric to qualitative to
representations during the generation of a locative description.

(Herskovits 1986), and a functional level where the normal function of an entity affects
the spatial relationships attributed to it in a context (Coventry and Garrod 2004).

There has been much experimental work done on spatial reasoning and language.
Some of this work has focused on functional aspects of prepositional semantics (p.ej.,
Hayward and Tarr 1995; Coventry 1998; Garrod, Ferrier, and Campbell 1999), y algunos
on geometric factors (Gapp 1995; Logan and Sadler 1996; Regier and Carlson 2001). En
this article we are primarily concerned with the geometric semantics of prepositions
y, como consecuencia, our review will focus on the experimental data that addresses geo-
metric factors. We will begin by reviewing the experimental data describing topological
spatial prepositions. Following this, we will then review data relating to projective
prepositions.

Topological prepositions denote a region that is proximal to a landmark. Subse-
quently we discuss previous psycholinguistic experiments, focusing on how contex-
tual factors such as distance, tamaño, and salience may affect proximity. We also present
examples showing that the location of other objects in a scene may interfere with the
acceptability of a proximal description to locate a target relative to a landmark.

Logan and Sadler (1996) examined the semantics of several spatial prepositions. En
their experiments, a human subject was shown sentences of the form the X is [relation]
the O, each with a picture of a spatial conﬁguration of an O in the center of an invisible
7 × 7 cell grid, and an X in one of the 48 surrounding positions. The subject then had
to rate how well the sentence described the picture, on a scale from 1 (bad) a 9 (bien).
Cifra 6 gives the mean goodness rating for the relation “near to” as a function of the
position occupied by X (Logan and Sadler 1996). It is clear from Figure 6 that ratings
diminish as the distance between X and O increases, but also that even at the extremes
of the grid the ratings were still above 1 (minimum rating).

Besides distance there are also other factors that determine the applicability of a
proximal relation. Por ejemplo, given prototypical size, the region denoted by near the
building is larger than that of near the apple. Además, an object’s salience could inﬂuence
the determination of the proximal region associated with it; as with size, the more salient
an object is the larger the proximal region associated with it (Gapp 1994).

Finalmente, the two scenes in Figure 7 show interference as a contextual factor. Para el
scene on the left we can use the blue box is near the black box to describe object (C). Este
seems inappropriate in the scene on the right. Placing an object (d) beside (b) appears

277

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 35, Número 2

Cifra 6
A 7 × 7 cell grid with mean goodness ratings for the relation the X is near O as a function of the
position occupied by X.

to interfere with the appropriateness of using a proximal relation to locate (C) relative to
(b), even though the absolute distance between (C) y (b) has not changed.

There are several important features that are evident from these data. Primero, given a
contexto, subjects have the ability to grade the applicability of a spatial relation. logan
and Sadler (1996) introduced the term spatial template to describe the representation of
the regions of acceptability associated with a preposition. A spatial template is centered
on the landmark and identiﬁes for each point in its space the acceptability of the
spatial relationship between the landmark and the target appearing at that point being
described by the preposition. Segundo, there is empirical evidence pointing to the effects
of distance between that landmark and the target, and landmark salience and size on the
applicability of a proximity-based preposition. Finalmente, the examples presented point to
the fact that the location of other distractor objects in context may also interfere with the
applicability of a preposition. (The model of proximity we present in Section 6 captures
all these factors.)

Cifra 8 is a representation of the spatial template for the projective preposition
above described in Logan and Sadler (1996). The main points of note relating to these
data are that there are three regions in the spatial template (bien, acceptable, y
bad) and these regions are symmetric around the canonical direction of the preposition
with acceptability approaching 0 as the angular deviation from the canonical direction
approaches 90 degrees. Sin embargo, it should be noted that these data were gathered dur-
ing an interpretation task and that the task may have affected the subjects’ responses.

Cifra 7
Proximity and distractor interference.

278

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Kelleher and Costello

Computational Models of Spatial Prepositions for VSD

Cifra 8
Spatial template for the preposition above (Logan and Sadler 1996), where LM represents the
landmark and the arrow shows the canonical direction associated with the preposition.

Although the subjects may have rated some of the areas on the far right and left of
the landmark as acceptable with respect to interpreting an utterance such as above the
landmark, this does not mean that they would use the word above to describe a target
object in these regions relative to the landmark. This highlights the fact that people may
be more accommodating when they are interpreting a locative description (Por ejemplo,
they may extend the allowable angular deviation to 90 degrees) but be more speciﬁc
when generating a locative description.

5. Previous Models of Topological and Projective Spatial Prepositions

There has been much research on the formal properties and interactions of topological
relaciones, for example Cohn et al. (1997) and Kuipers (2000). Sin embargo, before these
higher-level frameworks can be applied to real-world data, a model of proximity that
is capable of segmenting a region at the metric or geometric level is required. En
this geometric level previous approaches to modeling topological prepositions have
adopted one of two approaches to deﬁning the region of proximity. The ﬁrst is to adopt a
Voronoi segmentation of space. Under this approach the region considered as proximal
to an object is the area surrounding it that is closer to it than to any other object in the
scene. The second is to deﬁne the proximal region in terms of the size of the landmark.
Por ejemplo, Gapp (1995) deﬁnes the area of proximity as the region within ten times
the size of the landmark object in each direction. Sin embargo, neither of these approaches
consider the effect that the locations of other objects in the scene have on the proximity.
Como consecuencia, they cannot distinguish between the different context provided by the
two images in Figure 7.

Several models of projective prepositions have been proposed (Yamada 1993;
Olivier and Tsujii 1994; Gapp 1995; Fuhr et al. 1998; Regier and Carlson 2001; Kelleher
and van Genabith 2006). Yamada (1993) introduced the concept of a potential ﬁeld
function to capture the gradation of applicability across the region described by the
preposition. Later work (Olivier and Tsujii 1994; Gapp 1995) highlighted the issue of
deﬁning the intended frame of reference. Building on this work and the psycholin-
guistic results of Carlson-Radvansky and Logan (1997), Kelleher and van Genabith

279

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 35, Número 2

(2006) developed a computational model that constructed a modiﬁed spatial template in
situations where frame of reference ambiguity occurred. Fuhr et al. (1998) used models
of prepositional semantics in order to interpret natural language commands to a robotic
arm. Fuhr et al. segmented the space around an object into different regions based
on the sides and vertices of the object’s bounding box. One of the drawbacks of this
sistema, sin embargo, was that it could not distinguish between the position of two or more
objects that were fully enclosed within a given region. Finalmente, Regier and Carlson (2001)
developed a vector sum algorithm to compute the applicability of a projective relation
between a landmark and a target. Sin embargo, as with previous topological models, ninguno
of these models consider the inﬂuence of other objects in the context of the landmark
target relationship. Por ejemplo, the introduction of the long black object into image
2 En figura 9 affects the interpretability of a reference such as the blue square above the
white rectangle. In the next section we describe new models designed to account for the
inﬂuence of other objects in the semantics of spatial prepositions.

6. Models of Visual Context in Topological and Projective Spatial Prepositions

If a computational model is going to accommodate the gradation of applicability across
a preposition’s spatial template it must deﬁne the semantics of the preposition as
some sort of continuum function. A potential ﬁeld model is one form of continuum
measure that is widely used (Yamada 1993; Gapp 1994; Olivier and Tsujii 1994; Regier
and Carlson 2001). Using this approach, a model of a preposition’s spatial template
is constructed using a set of normalized equations that, for a given origin and point,
computes a value that represents the cost of accepting that point as the interpretation of
the preposition.

Each equation used to construct the potential ﬁeld representation of a preposition’s
spatial template models a different geometric constraint speciﬁed by the preposition’s
semantics. Por ejemplo, for topological prepositions such as near, an equation inversely
proportional to the distance between a point and a landmark would be used, mientras
for projective prepositions such as to the right of, an equation modeling the angular de-
viation of a point from the idealized direction denoted by the preposition would be in-
cluded in the construction set; Gapp (1995) and Logan and Sadler (1996) both noted that
acceptability of a projective preposition being used to describe a location approaches
0 as the angular deviation of that location approaches 90 degrees. The potential ﬁeld
is then constructed by assigning each point in the ﬁeld an overall potential by inte-
grating the results computed for that point by each of the equations in the construc-
tion set.

Cifra 9
Projective prepositions and distractor interference.

280

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Kelleher and Costello

Computational Models of Spatial Prepositions for VSD

This potential ﬁeld approach does not, sin embargo, account for the inﬂuence of other
objects in the visual scene on the semantics of a topological or projective preposition.
The basic idea in our computational models is to extend the potential ﬁeld approach by
overlaying the potential ﬁelds for each object in a visual scene and combining those
ﬁelds to produce relative potential ﬁelds for topological or projective prepositions.
These relative potential ﬁelds represent the semantics of those prepositions as modiﬁed
by the presence of other objects in the visual scene.

6.1 Computational Model of Topological Prepositions

In this section we describe a model of relative proximity that uses (1) the distance
between objects, (2) the size and salience of the landmark object, y (3) the location
of other objects in the scene. Our model is based on ﬁrst computing absolute proximity
between each point and each landmark in a scene, and then combining or overlaying
the resulting absolute proximity ﬁelds to compute the relative proximity of each point
to each landmark.

6.1.1 Computing Absolute Proximity Fields. We ﬁrst compute for each landmark an ab-
solute proximity ﬁeld giving each point’s proximity to that landmark, independent of
proximity to any other landmark. We compute ﬁelds on the projection of the scene onto
the 2D-plane, represented as a 2D-array of points. At each point P in that array, el
absolute proximity for landmark L is

proxabs(l, PAG) = (1 − distnormalized(l, PAG)) ∗ salience(l)

(1)

In this equation the absolute proximity for a point P and a landmark L is a function of
both the distance between the point and the location of the landmark, and the salience
of the landmark.

To represent distance we use a normalized distance function distnormalized(l, PAG),
which returns a value between 0 y 1.5 The smaller the distance between L and P,
the higher the absolute proximity value returned, eso es, the more acceptable it is to say
that P is close to L. In this way, this component of the absolute proximity ﬁeld captures
the gradual gradation in applicability evident in Logan and Sadler (1996).

We model the inﬂuence of visual and discourse salience on absolute proximity as
a function salience(l), returning a value between 0 y 1 that represents the relative
salience of the landmark L in the scene (Ecuación (2)). For the current purposes we
assume that the relative salience of an object is the average of its visual salience (Svis)
and discourse salience (Sdisc).6

prominencia(l) = (Svis(l) + Sdisc(l))/2

(2)

5 We normalize by computing the distance between the two points, and then dividing this distance by the

maximum distance between point L and any point in the scene.

6 Hay, por supuesto, many other operators that could be used to combine visual and linguistic salience,
such as maximum (MAX(Svis(l), Sdisc(l))) or probabilistic OR (Svis(l) + Sdisc(l) − (Svis(l) × Sdisc(l))).
We currently have no way of deciding among these operators. Fortunately, sin embargo, the modular nature
of our framework would allow us to change the computation of relative salience without impacting other
aspects of our model, should evidence in favor of one or other operator become available.

281

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 35, Número 2

Visual salience Svis is computed using the algorithm of Kelleher and van Genabith
(2004). Computing a relative salience for each object in a scene is based on its perceivable
size and its centrality relative to the viewer’s focus of attention. The algorithm returns
scores in the range of 0 a 1. As the algorithm captures object size, we can model
the effect of landmark size on proximity through the salience component of absolute
proximity. The discourse salience (Sdisc) of an object is computed based on recency of
mencionar (Hajicov´a 1993) except we represent the maximum overall salience in the scene
como 1, and use 0 to indicate that the landmark is not salient in the current context.

Cifra 10 shows computed absolute proximity with salience values of 1, 0.6, y
0.5, for points from the upper-left to the lower-right of a 2D plane, with the landmark
at the center of that plane. The graph shows how salience inﬂuences absolute proximity
in our model: For a landmark with high salience, points far from the landmark can still
have high absolute proximity to it.

6.1.2 Computing Relative Proximity Fields. Once we have constructed absolute proximity
ﬁelds for the landmarks in a scene, our next step is to overlay these ﬁelds to produce a
measure of relative proximity to each landmark at each point. For this we ﬁrst select
a landmark, and then iterate over each point in the scene comparing the absolute
proximity of the selected landmark at that point with the absolute proximity of all other
landmarks at that point. The relative proximity of a selected landmark at a point is
equal to the absolute proximity ﬁeld for that landmark at that point, minus the highest
absolute proximity ﬁeld for any other landmark at that point:

proxrel(PAG, l) = proxabs(PAG, l) − MAX
∀LX(cid:2)=L

proxabs(PAG, LX)

(3)

The idea here is that the other landmark with the highest absolute proximity is acting
in competition with the selected landmark. If that other landmark’s absolute proximity

Cifra 10
Absolute proximity ratings for landmark L centered in a 2D plane, points ranging from plane’s
upper-left corner ((cid:4)−3,−3(cid:5)) to lower right corner ((cid:4)3,3(cid:5)).

282

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Kelleher and Costello

Computational Models of Spatial Prepositions for VSD

is higher than the absolute proximity of the selected landmark, the selected landmark’s
relative proximity for the point will be negative. If the competing landmark’s absolute
proximity is slightly lower than the absolute proximity of the selected landmark, el
selected landmark’s relative proximity for the point will be positive, but low. Only when
the competing landmark’s absolute proximity is signiﬁcantly lower than the absolute
proximity of the selected landmark will the selected landmark have a high relative
proximity for the point in question.

In Equation (3) the proximity of a given point to a selected landmark rises as
that point’s distance from the landmark decreases (the closer the point is to the land-
mark, the higher its proximity score for the landmark will be), but falls as that point’s
distance from some other landmark decreases (the closer the point is to some other
landmark, the lower its proximity score for the selected landmark will be). Cifra 11
shows the relative proximity ﬁelds of two landmarks, L1 and L2, computed using
Ecuación (3) in a 1-dimensional (linear) espacio. The two landmarks have different de-
grees of salience: a salience of 0.5 for L1 and of 0.6 for L2 (represented by the different
sizes of the landmarks). In this ﬁgure, any point where the relative proximity for one
particular landmark is above the zero line represents a point which is proximal to
that landmark, rather than to the other landmark. The extent to which that point is
above zero represents its degree of proximity to that landmark. The overall proximal
area for a given landmark is the overall area for which its relative proximity ﬁeld is
above zero. The left and right borders of the ﬁgure represent the boundaries (walls) de
the area.

Cifra 11 illustrates three main points. Primero, the overall size of a landmark’s prox-
imal area is a function of the landmark’s position relative to the other landmark and
to the boundaries. Por ejemplo, landmark L2 has a large open space between it and
the right boundary: Most of this space falls into the proximal area for that landmark.
Landmark L1 falls into quite a narrow space between the left boundary and L2. L1
thus has a much smaller proximal area in the ﬁgure than L2. Segundo, the relative
proximity ﬁeld for a landmark is a function of that landmark’s salience. This can be
seen in Figure 11 by considering the space between the two landmarks. In that space
the width of the proximal area for L2 is greater than that of L1, because L2 is more
salient.

Cifra 11
Graph of relative proximity ﬁelds for two landmarks L1 and L2. Relative proximity ﬁelds were
computed with salience scores of 0.5 for L1 and 0.6 for L2.

283

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 35, Número 2

Cifra 12
Example scene.

The third point concerns areas of ambiguous proximity in Figure 11: areas in which
neither of the landmarks have a signiﬁcantly higher relative proximity than the other.
There are two such areas in the Figure. The ﬁrst is between the two landmarks, en
the region where one relative proximity ﬁeld line crosses the other. These points are
ambiguous in terms of relative proximity because these points are equidistant from
those two landmarks. The second ambiguous area is at the extreme right of the space
como se muestra en la figura 11. This area is ambiguous because this area is distant from both
landmarks: Points in this area would not be judged proximal to either landmark. El
question of ambiguity in relative proximity judgments is considered in more detail in
Sección 8.1.

We will illustrate the different stages of the proximity model using the situation
illustrated in Figure 12. The task is to decide whether the target object is proximal to
the landmark object. Cifra 13 illustrates the absolute potential ﬁeld for the landmark

Cifra 13
The absolute proximity ﬁelds for the landmark.

284

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Kelleher and Costello

Computational Models of Spatial Prepositions for VSD

Cifra 14
The absolute proximity ﬁelds for the landmark and the distractor.

object. Cifra 14 illustrates the absolute potential ﬁelds for the landmark and the
distractor object. Cifra 15 illustrates the relative proximity ﬁeld that results from the
interaction between the landmark and distractors absolute proximity ﬁelds. Cifra 16
illustrates the application of the threshold to the landmark’s relative proximity ﬁeld. Si
the target object is located in the region where the landmark’s relative proximity ﬁeld

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

285

Cifra 15
The landmark’s relative proximity ﬁeld.

Ligüística computacional

Volumen 35, Número 2

Cifra 16
Applying the threshold to the landmark’s relative proximity ﬁeld.

is above the threshold the target is deemed to be proximal to the landmark. Cifra 16
demonstrates the contextual inﬂuence which the distractor object has on the landmark’s
relative proximity ﬁeld: The ﬁeld shrinks on the side of the landmark near the distractor,
but expands on the side away from the landmark.

6.2 Computational Model of Projective Prepositions

The two main factors that impact on the applicability of a projective preposition de-
scribing the spatial relationship between a target object and a landmark are the angular
deviation of the target object’s position from the canonical direction described by the
preposition relative to the landmark and the distance of the target object from the
landmark.

The vector originating from the center of the landmark to the viewer’s position
describes the canonical search axis for in front of. We can produce the search vectors
for the other projective prepositions (behind, izquierda, bien) by rotating this front vector on a
horizontal plane. Once the canonical vector (cid:1)c for a given projective preposition has been
selected, the angular deviation of a given point P position relative to the landmark L can
be computed using Equation (4):

angle( (cid:1)LP,(cid:1)C ) = cos

−1






 (cid:1)LP • (cid:1)C
(cid:3)
(cid:3)
(cid:3)
(cid:3)
(cid:3) (cid:1)LP
(cid:3) |(cid:1)C |

(4)

dónde (cid:1)LP is the vector from landmark L to point P and (cid:1)c is the canonical vector for the
projective preposition in question. This equation gives the angle between (cid:1)LP and that
canonical vector.

286

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Kelleher and Costello

Computational Models of Spatial Prepositions for VSD

Using this equation, and the normalized distance measure described in Section 6.1,
we deﬁne an absolute potential ﬁeld for the acceptability of a projective preposition
with canonical vector (cid:1)c for landmark L as follows:

projabs(l, PAG,(cid:1)C ) = 0 si (angle( (cid:1)LP,(cid:1)C ) > 90)o(distnormalized(l, PAG) = 0),

= (angle( (cid:1)LP,(cid:1)C )/distnormalized(l, PAG)) de lo contrario

(5)

In this equation, if the angle between a point P and the canonical vector (cid:1)c is greater than
90 degrees, or if the distance between the landmark and the point is 0, the acceptability
of that point for the projective preposition is 0. De lo contrario, the acceptability of that
point is equal to the angle between that point and the canonical vector, divided by the
normalized distance between that point and the landmark.

We use this absolute potential ﬁeld for projective prepositions in the same way
that we used the absolute ﬁeld for proximity in our model of topological prepositions.
Once we have computed the absolute potential ﬁeld for each point relative to the
landmark we then do the same process for each of the distractor landmarks. Nosotros
then overlay the landmark applicabilities with those of the distractors by subtracting
the maximum applicability of any of the distractors at a point from the landmark’s
applicability at that point, producing a relative potential ﬁeld for the projective prepo-
sition as in Equation (6):

projrel(PAG, l,(cid:1)C ) = projabs(PAG, l(cid:1)C ) − MAX
∀LX(cid:2)=L

projabs(PAG, LX,(cid:1)C )

(6)

We then apply a threshold, and the region above this threshold is taken to deﬁne
the area described by the projective preposition. Note that we can use Equation (6) a
compute relative potential ﬁelds for various different projective prepositions (in front of,
behind, izquierda, bien, arriba, abajo) by selecting the different canonical vectors corresponding
to those prepositions.

Figures 17 a través de 20 illustrate the different stages in this process. In these images
the origin is at the front right corner, the x-axis runs from right to left, the y-axis
from front to back, and the z-axis is the vertical. The higher the z-axis value the more
applicability the preposition. Cifra 17 deﬁnes the baseline applicability of z = 0.1. Nosotros
use this baseline because dividing an angular deviation by distance will never result in
a zero value; rather applicability will approach 0 asymptotically. The baseline provides
a cut-off point for applicability. Cifra 18 illustrates the potential ﬁeld computed for
right of a landmark positioned at x = 100, y = 200, z = 0 with a search axis of x = 1,
y = 0. Cifra 19 illustrates the potential ﬁelds computed for right of the landmark and
a distractor object positioned at x = 150, y = 400, z = 0. Finalmente, Cifra 20 illustrates the
potential ﬁeld that results for right of the landmark when the distractor potential ﬁeld is
subtracted from it. Cifra 20 demonstrates the contextual inﬂuence which the distractor
object has on the landmark’s relative potential ﬁeld for the preposition: the size of the
ﬁeld is reduced by the presence of the distractor object.

7. Psycholinguistic Evaluations of Our Models

We now describe an experiment which tests our approach to relative proximity by
examining the changes in people’s judgments of the appropriateness of the expression
near being used to describe the relationship between a target and landmark object in

287

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 35, Número 2

Cifra 17
A baseline applicability is set to z = 0.1.

an image where a distractor object is present. All objects in these images were colored
shapes: circles, triangles, or squares.

7.1 Materials and Procedure

All images used in this experiment contained a central landmark object and a target
object, usually with a third distractor object. The landmark was always placed in the

Cifra 18
The potential ﬁeld describing the absolute applicability model for right of the landmark.

288

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Kelleher and Costello

Computational Models of Spatial Prepositions for VSD

Cifra 19
The potential ﬁelds describing the absolute applicability model for right of the landmark and
right of a distractor object.

middle of a 7 × 7 grid. Images were divided into eight groups of six images each. Cada
image in a group contained the target object placed in one of six different cells on the
grid, numbered from 1 a 6. Cifra 21 shows how we number these target positions
according to their nearness to the landmark.

Groups are organized according to the presence and position of a distractor object.
In group a the distractor is directly above the landmark, in group b the distractor is

Cifra 20
The resulting potential ﬁeld for right of the landmark with the baseline applied to it.

289

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

a
r
t
i
C
mi
–
pag
d

F
/

3
5
2
2
7
1
1
7
9
8
6
1
2
/
C
oh

yo
i
.

0
6
–
7
8
–
pag
r
mi
pag
1
4
pag
d

b
y
gramo
tu
mi
s
t

oh
norte
0
8
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Ligüística computacional

Volumen 35, Número 2

Cifra 21
Relative locations of landmark (l) target positions (1. . . 6) and distractor landmark positions
(a. . . gramo) in images used in the experiment.

rotated 45 degrees clockwise from the vertical, in group c it is directly to the right of
the landmark, in d it is rotated 135 degrees clockwise from the vertical, etcétera. El
distractor object is always the same distance from the central landmark. Además de
the distractor groups a,b,C,d,mi,F, and g, there is an eighth group, group x, in which no
distractor object occurs.

In the experiment, each image was displayed with a sentence of the form The

is near the
, with a description of the target and landmark, respectivamente. The sentence
was presented under the image. Twelve participants took part in this experiment. Todo
participants were native English speakers and all volunteered to take part. Participantes
were not linguists and were naive to the formal interpretation of spatial prepositions
and to the hypotheses being tested in the experiment. Participants were asked to rate
the acceptability of the sentence as a description of the image using a 10-point scale,
with zero denoting not acceptable at all; 4 o 5 denoting moderately acceptable; y 9
perfectly acceptable. Cifra 22 illustrates a trial from the experiment. Each participant
rated every image in the experiment. Images were presented in random order to control
for learning effects.

7.2 Results and Discussion

There was signiﬁcant agreement between participants across all 48 images. el promedio
pair-wise correlation between participants’ responses was r = 0.68. There was a signiﬁ-
cant correlation of responses between every pair of participants (pag < 0.01 for all pairs). We assess participants’ responses by comparing their average proximity judgments with those predicted by the absolute proximity equation (Equation (1)), and by the relative proximity equation (Equation (3)). For both equations we assume that all objects have a salience score of 1. With salience equal to 1, the absolute proximity equation relates proximity between target and landmark objects to the distance between those two objects, so that the closer the target is to the landmark the higher its proximity will be. With salience equal to 1, the relative proximity equation relates proximity to both distance between target and landmark and distance between target and distractor, so 290 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 5 2 2 7 1 1 7 9 8 6 1 2 / c o l i . 0 6 - 7 8 - p r e p 1 4 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Kelleher and Costello Computational Models of Spatial Prepositions for VSD Figure 22 An example trial from the proximity experiment. that the proximity of a given target object to a landmark rises as that target’s distance from the landmark decreases but falls as the target’s distance from some other distractor object decreases. It should be noted that proximity scores in both Equations (1) and (3) are multiplied by a constant salience and that the evaluations we describe below (correlation, multiple regression) factor out multiplication by a constant. Consequently, choosing a particular value for salience does not affect our evaluation results. In analyzing our results we are comparing our basic equation for absolute proximity (Equation (1), in which proximity falls with increasing distance between target and landmark) with the “relative proximity” extension of this equation (Equation (3), in which proximity falls with increasing distance between target and landmark, but rises with distance to distractor). Because both equations are quite similar (both are based on target–landmark distance, which is obviously the prime factor in proximity judgments), we expect both equations to produce quite similar responses. We expect, however, that the relative proximity equations will produce responses which are reliably closer to people’s proximity judgments than those produced by the absolute proximity equation. We initially used Spearman’s rank-order correlation to compare people’s average proximity scores with those produced by Equation (1) (absolute proximity) and Equa- tion (3) (relative proximity) for each group. For each group this analysis replaces each proximity score with its rank within that group, and then compares the ranks. Where the ranks returned by an equation and the ranks from participants’ average proximity scores are identical, the correlation will be 1.0; where the ranks differ, the correlation will drop. For the absolute proximity equation, the correlation was 1.0 in six of the groups, and .94 in the two remaining groups (group c and group g). For the relative proximity equation, the Spearman’s rank-order correlation with people’s responses was 1.0 in each of the eight groups. The fact that the relative proximity equation has a rank- order correlation of 1.0 in all groups while the absolute proximity equation fails to reach 1.0 in two groups (predicting proximity-ranks incorrectly in those two groups) suggests 291 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 5 2 2 7 1 1 7 9 8 6 1 2 / c o l i . 0 6 - 7 8 - p r e p 1 4 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 35, Number 2 that the relative-proximity equation is a better model of people’s proximity responses. However, the fact that there are so many correlations of 1.0 means that Spearman’s rank- order correlation is not particularly useful in distinguishing between the two equations. We therefore use Pearson’s product-moment correlation to compare people’s average proximity scores with those produced by the absolute and relative proximity equations. Rather than comparing ranks, this analysis compares actual proximity values. Figure 23 shows the product-moment correlations between people’s average prox- imity ratings and those produced by Equation (1) (absolute proximity) and by Equa- tion (3) (relative proximity) for the eight groups in the experiment. In analyzing these correlations we had two concerns: ﬁrst, to see whether, for each individual group, the correlation produced by Equation (3) was reliably different from that produced by Equation (1); and second, to see whether across all the groups, the correlation produced by Equation (3) was reliably higher than that produced by Equation (1). In regard to the ﬁrst question, we did not expect there to be particularly large differences in correlation between the two equations, because both are based on target–landmark distance. Because we know target–landmark distance to be a good predictor of people’s proximity judgments we expected Equation (1) to have a high correlation with people’s proximity judgments, and we expected Equation (3) to improve on that correlation. However, because the correlation from Equation (1) was already high, any improvement in correlation from Equation (3) would be relatively small. Indeed this is what is seen across the seven groups of interest: The average correlation from Equation (1) is high (average 0.93), the average correlation from Equation (3) is higher (average 0.99), but the difference between the two correlations is relatively small. Using Fisher’s technique for comparing correlation coefﬁcients we ﬁnd no reliable difference between correlation coefﬁcients in any group. Given that the correlations for both Equations (1) and (3) are high we examined whether the results returned by Equation (3) were reliably closer to human judgments than those from Equation (1). For the 42 images where a distractor object was present we recorded which equation gave a result that was closer to the participants’ normalized average for that image. In 28 cases Equation (3) was closer, and in 14 Equation (1) was closer (a 2:1 advantage for Equation (3), signiﬁcant in a sign test: n+ = 28, n− = 14, Z = 2.2, p < 0.05). We conclude that proximity judgments for objects in our experiment are best represented by relative proximity as computed in Equation (3). These results support our “relative” model of proximity.7 In addition to these analyses, we also carried out a multiple regression analysis of participants’ responses in the experiment, with target–landmark distance and target– distractor distance as the predictor variables, and participant response as the depen- dent variable. Because our experiment involved repeated-measures data, we followed the procedure for regression analysis of repeated-measures data described by Lorch and Myers (1990). This involves computing individual multiple regression for each participant in our experiment, and then using a t-test to analyze the regression coef- ﬁcients produced for target–distractor distance and target–landmark distance in those equations, across all participants. Recall that in our relative proximity equation (Equa- tion (3)) target–landmark distance had a negative coefﬁcient (as target–landmark dis- tance increased, judgments of target–landmark proximity fell) whereas target–distractor 7 Note that, in order to display the relationship between proximity values given by participants, computed in Equation (1), and computed in Equation (3), the values displayed in Figure 23 are normalized so that proximity values have a mean of 0 and a standard deviation of 1. This normalization simply means that all values fall in the same region of the scale, and can be easily compared visually. 292 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 5 2 2 7 1 1 7 9 8 6 1 2 / c o l i . 0 6 - 7 8 - p r e p 1 4 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Kelleher and Costello Computational Models of Spatial Prepositions for VSD l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 5 2 2 7 1 1 7 9 8 6 1 2 / c o l i . 0 6 - 7 8 - p r e p 1 4 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 293 Figure 23 Comparison between normalized proximity scores observed and computed for each group. Computational Linguistics Volume 35, Number 2 distance had a positive coefﬁcient (as target–distractor distance increased, judgments of target–landmark proximity increased). Our prediction, therefore, is that across these multiple regression analyses of participants’ responses, the target–landmark distance variable will reliably have a negative coefﬁcient, whereas the target–distractor variable will reliably have a positive coefﬁcient. Table 1 shows the regression coefﬁcients obtained for the target–landmark distance variable and the target–distractor distance variable, across the 12 participants in our experiment. As this table shows, the regression coefﬁcient for target–landmark distance was signiﬁcantly more likely to be negative (as predicted) whereas the regression coefﬁcient for target–distractor distance was signiﬁcantly more likely to be positive (again, as predicted). A single-group t-test showed that both target–landmark regres- sion coefﬁcients and target–distractor regression coefﬁcients reliably differed from zero (t(11) = −8.64, p < 0.01; t(11) = 2.23, p < 0.05) indicating that both of these predictor variables had a signiﬁcant and reliable effect on participants’ responses in the exper- iment. There was no concern about collinearity between predictor variables in these regression analyses, as the correlation between those variables (r = 0.38, %var = 0.14) was much lower than that between the predictor variables and the dependent variable (r = 0.93 or higher). Together these regression results, the sign-test results, and the comparative correlations described earlier all support the model of relative proximity as described in Equation (3). 8. Applications of the Models The model of proximity presented here has been implemented and used as a compo- nent in a human–robot dialog system (Kelleher and Kruijff 2006; Kelleher, Kruijff, and Costello 2006). The proximity and projective models have also been integrated into the LIVE virtual environment (Kelleher, Costello, and van Genabith 2005). In this section we will describe how the models are used in these systems to interpret and generate locative expressions. Table 1 Regression coefﬁcients from individual analyses of subjects data in proximity experiment. participant target–landmark distance target–distractor distance −2.02 −1.53 −3.06 −2.45 −2.23 −0.97 −3.09 −1.78 −0.80 −1.48 −3.29 −3.29 −2.17 0.88 −8.64* 1 2 3 4 5 6 7 8 9 10 11 12 M SE t *p < 0.01, **p < 0.05. 294 0.27 0.09 0.03 −0.02 0.06 0.32 0.42 0.02 0.16 −0.24 0.01 0.44 0.13 0.20 2.23** l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 5 2 2 7 1 1 7 9 8 6 1 2 / c o l i . 0 6 - 7 8 - p r e p 1 4 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Kelleher and Costello Computational Models of Spatial Prepositions for VSD 8.1 Interpreting Spatial References We use the computational models of Section 6 to interpret spatial references to objects. In this section we illustrate how we use our model of relative proximity to ground the interpretation of a locative expression containing a topological preposition. Returning to the architecture described in Section 3, the basic steps triggering the interpretation of a locative are: (1) the user utters a command, such as pick up the ball near the red box, (2) the speech recognition module processes the speech signal and passes the resulting string to the parser, (3) the parser constructs a formal representation of the meaning of the utterance, (4) the dialog manager categorizes the utterance to be a command and, also, recognizes the need to resolve the referring expression the ball near the red box. At this point the reference resolution module is triggered. The ﬁrst stage in reference resolution is to retrieve the context against which the reference is to be resolved. This involves accessing the context model and retrieving the set of currently accessible objects. This set is then subdivided into the set of objects fulﬁlling the landmark description, the set of objects fulﬁlling the description of the target object, and the set of objects fulﬁlling neither description. For each candidate landmark and each object that is neither a candidate landmark nor a candidate target we compute an absolute proximity ﬁeld. For each landmark we convert its absolute proximity ﬁeld into a relative proximity ﬁeld by overlaying the absolute proximity ﬁelds of the other landmarks and the other objects in the context that are neither candidate landmarks nor target objects. For this we iterate over each point in the scene, and compare the competing absolute proximity scores at each point. If the primary landmark’s (i.e., the landmark with the highest relative proximity at the point) relative proximity exceeds the next highest relative proximity score at a given point by more than a predeﬁned conﬁdence interval, the point is in the proximity region anchored around the primary landmark. Otherwise, we take it as ambiguous and not in the proximal region that is being interpreted. The motivation for the conﬁdence interval is to capture situations where the difference in relative proximity scores between the primary landmark and one or more landmarks at a given point is relatively small. Figure 24 illustrates the parsing of a scene into the regions “near” two landmarks. The relative proximity ﬁelds of the two landmarks are identical to those in Figure 11, using a conﬁdence interval of 0.1. Ambiguous points are where the proximity ambiguity series is plotted at 0.5. The regions “near” each landmark are those areas of the graph where each landmark’s relative proximity series is the highest plot on the graph. Figure 24 illustrates an important aspect of our model: the comparison of relative proximity ﬁelds naturally deﬁnes the extent of vague proximal regions. For example, see the region right of L2 in Figure 24. The extent of L2’s proximal region in this direction is bounded by the interference effect of L1’s relative proximity ﬁeld. Because the landmarks’ relative proximity scores converge, the area on the far right of the image is ambiguous with respect to which landmark it is proximal to. In effect, the model captures the fact that the area is relatively distant from both landmarks. In Section 8.2 we describe a cognitive load hierarchy of prepositions and how we use this to generate locative expressions. Following this hierarchy, objects located in the area on the far right of the image should be described with a projective relation such as to the right of L2 rather than a proximal relation like near L2. 8.1.1 An Example. To illustrate the model further we will apply the model to a real scene. Figure 25 shows a real scene on the left-hand side, and a rendering of the scene analysis on the right-hand side. For the shown scene analysis we have assumed all objects to 295 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 5 2 2 7 1 1 7 9 8 6 1 2 / c o l i . 0 6 - 7 8 - p r e p 1 4 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 35, Number 2 Figure 24 Graph of ambiguous regions overlaid on relative proximity ﬁelds for landmarks L1 and L2, with conﬁdence interval = 0.1 and different salience scores for L1 (0.5) and L2 (0.6). Locations of landmarks are marked on the x-axis. have an equal salience: on the left, the blue ball; in the middle, the red ball; and on the right, the green ball. As the analysis correctly shows, each object has a proximity potential ﬁeld (shown in its own color) but, due to interference between potential ﬁelds, we see that proximity is usually ambiguous between at least two landmarks. The regions that are ambiguous between two landmarks are colored using a mixture of the colors. The white area denotes the regions deﬁned as being ambiguous between the three objects. Imagine we now place a second blue ball in the scene and the user inputs the command pick up the blue ball near the red ball. As explained previously, when the system starts interpreting this reference it will split the context into a set of candidate target objects, consisting of the two blue balls in the scene, the set of candidate landmarks, consisting of the one red ball, and the set of remaining objects, the green ball. It will then compute proximity ﬁelds for each of the candidate landmarks and the other objects in the scene that are not candidate targets. It will then overlay these proximity ﬁelds Figure 25 Scene analysis. 296 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 5 2 2 7 1 1 7 9 8 6 1 2 / c o l i . 0 6 - 7 8 - p r e p 1 4 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Kelleher and Costello Computational Models of Spatial Prepositions for VSD to compute the relative proximity ﬁelds around each landmark. Figure 26 illustrates the resulting proximity ﬁelds. As can be seen from the image the original blue ball is inside the red ball’s proximity ﬁeld and consequently it will be selected as the ball to be picked up. This analysis highlights two important aspects of the model. First, we can observe an interference effect between the red ball and the green ball: The potential ﬁeld repre- senting proximity to the red ball forms an ellipsoid, being inhibited to the right through interference with the potential ﬁeld of the green ball. Second, the proximity ﬁeld of the red ball is much larger then that of the green ball; this is due to the relatively high linguistic salience of the red ball compared to the green ball due to it being mentioned in the reference. 8.2 Generating References In this section we illustrate how we use our models of the semantics of spatial preposi- tions to guide the generation of a locative expression in visual situated contexts. In the architecture described earlier, the GRE component is triggered by the content manager. Similar to reference resolution, GRE will ﬁrst retrieve the context from the context model and generate the reference relative to this context. If a locative expression is necessary the GRE component has three things to decide: (1) what properties of the target object to include, (2) which object in the scene should be used as a landmark and how should that be described, and (3) which spatial relation to use (and hence which preposition to use). Several GRE algorithms have addressed the issue of generating locative expressions (Dale and Haddock 1991; Horacek 1997; Gardent 2002; Krahmer and Theune 2002; l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 5 2 2 7 1 1 7 9 8 6 1 2 / c o l i . 0 6 - 7 8 - p r e p 1 4 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 297 Figure 26 Interpreting the blue ball near the red ball. Computational Linguistics Volume 35, Number 2 Varges 2004). However, all these algorithms assume the GRE component has access to a predeﬁned scene model that deﬁnes all the spatial relations between all the entities in the scene. For many visually situated dialog systems, in particular robotic dialog systems, this assumption is a serious drawback for these algorithms. If an agent wishes to generate a contextually appropriate reference it cannot assume the availability of a domain model, rather it must dynamically construct one. Moreover, constructing a model containing all the spatial relationships between all the entities in the domain is prone to combinatorial explosion, both in terms of the number of objects in the context (the location of each object in the scene must be checked against all the other objects in the scene) and number of inter-object spatial relations (as a greater number of spatial relations will require a greater number of comparisons between each pair of objects). Furthermore, the context-free a priori construction of such an exhaustive scene model is cognitively implausible. Psychological research indicates that spatial relations are not preattentively perceptually available (Treisman and Gormican 1988). Rather, their perception requires attention (Logan 1994, 1995). These ﬁndings point to subjects constructing contextually dependent reduced relational scene models, rather than an exhaustive context-free model. The approach we adopt to generating locative expressions addresses the issue of combinatorial explosion inherent in relational scene model construction by incremen- tally creating a series of reduced scene models. Within each scene model only one spatial relation is considered and only a subset of objects are considered as candidate landmarks. This reduces both the number of relations that must be computed over each object pair and the number of object pairs. The decision as to which relations should be included in each scene model is guided by a cognitively-motivated hierarchy of spatial relations. The set of candidate landmarks in a given scene is dependent on the set of objects in the scene that fulﬁll the description of the target object and the semantic relation that is being considered. We use Dale and Reiter’s (1995) incremental GRE algorithm as the starting point for the generation framework. The incremental algorithm iterates through the proper- ties of the target object and for each property computes the set of distractor objects for which the conjunction of the properties selected so far, and the current property, hold. A property is added to the list of selected properties if it reduces the size of the distractor object set. The algorithm succeeds when all the distractors have been ruled out; it fails if all the properties have been processed and there are still some distractor objects. The algorithm can be reﬁned by ordering the checking of properties according to ﬁxed preferences; for example, ﬁrst a taxonomic description of the target, second an absolute property such as color, third a relative property such as size. Dale and Reiter also stipulate that the type description of the target should be included in the description even if its inclusion does not distinguish the target from any of the distractors; see Algorithm 1. Dale and Reiter argue that this algorithm has a polynomial complexity and that the theoretical run time can be characterized as nd × nl: the run time depends solely on the number of distractor objects nd and the number of properties considered in iterations nl. If we assume that nd and nl are both proportional to n, the number of ob- jects being considered, then the complexity of the incremental algorithm is of order n2. The incremental algorithm generates a description (in terms of type, color, and size) which distinguishes a given target object from a set of distractor objects (if such a description exists). However, we wish to generate locative expressions which iden- tify objects, rather than simple descriptions. These locative expressions may contain a description of a landmark object (in terms of type, color, or size), of a target object (type, color, or size), and a topological or projective preposition relating those two 298 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 5 2 2 7 1 1 7 9 8 6 1 2 / c o l i . 0 6 - 7 8 - p r e p 1 4 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Kelleher and Costello Computational Models of Spatial Prepositions for VSD Algorithm 1 The Basic Incremental Algorithm Require: T = target object; D = set of distractor objects. Initialize: P = {type, color, size}; DESC = {} for i = 0 to |P| do if |D| (cid:10)= 0 then D(cid:11) = {x : x ∈ D, Pi(x) = Pi(T)} if |D(cid:11)| < |D| then DESC = DESC ∪ Pi(T) D = {x : x ∈ D, Pi(x) = Pi(T)} end if else Distinguishing description generated if type(x) (cid:10)∈ DESC then DESC = DESC ∪ type(x) end if return DESC end if end for Failed to generate distinguishing description return DESC objects. To generate such locative expressions we repeatedly call the basic incremental algorithm for a sequence of different possible spatial relations. The fact that each call to the algorithm uses a different spatial relation results in a different set of objects from the context being deﬁned as candidate landmarks for each function call. If a given spatial relation allows the basic incremental algorithm to generate a description which distinguishes the target object from the set of distractor objects, that spatial relation is used to generate an expression identifying that object. Otherwise we move on and call the basic incremental algorithm for the next spatial relation in our sequence. When generating a referring expression, we use a sequence of possible forms of reference ordered by assumed cognitive load (see Figure 27), with simpler forms of reference (those identifying object type, for example) coming early in the sequence and more complex forms (those involving projective prepositions, for example) coming later. This means that our approach will preferentially produce simpler expressions to identify an object, and only if no such simple expressions can be found which distin- guish that object successfully will more complex topological or projective prepositions Figure 27 Cognitive load of reference forms. 299 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 5 2 2 7 1 1 7 9 8 6 1 2 / c o l i . 0 6 - 7 8 - p r e p 1 4 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 35, Number 2 be produced. Our sequence of relations can be extended to include relations of ternary and higher arity such as the ball between the box and the triangle or the ball near the box and the triangle. We use the models of topological and projective prepositions described in Sec- tions 6.1 and 6.2 to deﬁne the regions around a landmark to which a given topological or projective description applies. If the target or one of the distractor objects is the only ob- ject within that region around a given landmark, this is taken to represent a contrastive use of a preposition relative to that landmark. If that region contains more than one object from the target and distractor object set, then it is a relative use of the preposition. 8.2.1 Landmarks and Distinguishing Descriptions. In order to use a locative expression, an object in the context must be selected to function as the landmark. An implicit assumption in selecting an object to function as a landmark is that the hearer can easily identify and locate the object within the context. As shown in Example (4), a landmark can be the speaker, the hearer, the scene, an object in the scene, or a group of objects in the scene.8 Example 4 • • • • • the ball on my right [speaker] the ball to your left [hearer] the ball on the right [scene] the ball to the left of the box [an object in the scene] the ball in the middle [group of objects] Clearly, deciding which objects in a given visual context can function as landmarks is a complex process. Some of the factors effecting this decision are object salience and the functional relationships between objects. However, one basic constraint on landmark selection is that the landmark should be distinguishable from the target. For example, in the context provided by Figure 28 the ball has a relatively high salience, because it is a singleton, despite the fact that it is smaller and geometrically less complex than the other ﬁgures. Moreover, in this context, the ball is the only object in the scene that can function as a landmark without recourse to using the scene itself or a grouping of objects in the scene. Given the context in Figure 28 and all other factors being equal, using a locative such as the man to the left of the man would be much less helpful than using the man to the right of the ball. Following this observation, we treat an object as a candidate landmark if the following conditions are met: 1. 2. The object is not the target. The object is not a member of the distractor set. Furthermore, a target landmark is a member of the candidate landmark set that stands in relation to the target under the relation being considered and a distractor landmark 8 See Gorniak and Roy (2004) for further discussion on the use of spatial extrema of the scene and groups of objects in the scene as landmarks. 300 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 5 2 2 7 1 1 7 9 8 6 1 2 / c o l i . 0 6 - 7 8 - p r e p 1 4 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Kelleher and Costello Computational Models of Spatial Prepositions for VSD Figure 28 Visual context used to illustrate the relative semantics of topological and projective prepositions. is a member of the candidate landmark set that stands in relation to a distractor object under the relation being considered. Using these categories of landmark we can deﬁne a distinguishing locative de- scription as a locative description where there is a target landmark that can be dis- tinguished using the basic incremental algorithm from all the members of the set of distractor landmarks which stand under the relation used in the locative expression. Given this, our approach is to try to generate a distinguishing description using the standard incremental algorithm. If this fails, we divide the context into three compo- nents: the target, the distractor objects, and the set of candidate landmarks. We then begin to iterate through the hierarchy of relations and for each relation we create a context model that deﬁnes the set of target and distractor landmarks. Once a context model has been created we iterate through the target landmarks (using a salience order- ing if there is more than one) and try to create a distinguishing locative description. A distinguishing locative description is created by using the basic incremental algorithm to distinguish the target landmark from the distractor landmarks. If we succeed in generating a distinguishing locative description we return the description and stop processing. Algorithm 2 The Locative Incremental Algorithm Require: T = target object; D = set of distractor objects; R = hierarchy of relations. DESC = Basic-Incremental-Algorithm(T,D) if DESC (cid:10)= Distinguishing then create CL the set of candidate landmarks CL = {x : x (cid:10)= T, DESC(x) = false} for i = 0 to |R| do create a context model for relation Ri consisting of TL the set of target landmarks and DL the set of distractor landmarks TL = {y : y ∈ CL, Ri(T, y) = true} DL = {z : z ∈ CL, Ri(D, z) = true} for j = 0 to |TL| by salience(TL) do LANDDESC = Basic-Incremental-Algorithm(TLj, DL) if LANDDESC = Distinguishing then Distinguishing locative generated return {DESC,Ri,LANDDESC} end if end for end for end if FAIL 301 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 5 2 2 7 1 1 7 9 8 6 1 2 / c o l i . 0 6 - 7 8 - p r e p 1 4 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 35, Number 2 If we cannot create a distinguishing locative description we move on to the next, more complex spatial relation in the sequence of spatial relations, and attempt to gener- ate a distinguishing locative description using that relation. This process continues until either a distinguishing expression is produced or no possible spatial relations remain. This algorithm runs the basic incremental algorithm a number of times for each candidate relation in the list of possible relations. The length of this list will be a constant; call it R. For each candidate relation, the number of times the incremental algorithm runs is equal to the number of TL objects (the number of objects which don’t fulﬁll the description of the target created by the current run of the incremental algorithm, and which the target object stands under the currently selected relation to). Call the number of TL objects nTL and note that nTL must be less than, and proportional to, n (the total number of objects). The number of times the basic incremental algorithm can run, in our system, is then proportional to NTL × R; replacing with nTL with n gives n × R runs of the basic incremental algorithm. Inserting the complexity of the basic incremental algorithm into this, we get an overall complexity of n2 × n × R = n3 × R, which although worse than the basic incremental algorithm’s n2 complexity, is still polynomial. This algorithm cannot generate embedded locative descriptions, such as the bag on the chair near the window, because it does not use spatial relations as properties to describe the landmark. However, these descriptions can be generated if needed by replacing the call to the basic incremental algorithm for the landmark object with a call to the whole locative expression algorithm, using the target landmark as the target object and the set of distractor landmarks as the distractors. A nice consequence of this approach to generating embedded locative descriptions is that inﬁnite descriptions (e.g., the bag on the chair supporting the bag on the chair ...) will not be generated as the target object is excluded from the context that the landmark’s description is generated in. However, the cost of being able to generate these embedded descriptions is a higher exponential complexity. 8.2.2 An Example. We can illustrate the framework using the visual context provided by the scene on the left of Figure 29. This context consists of two red boxes R1 and Figure 29 A visual scene and the topological analysis of R1 and R2. 302 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 5 2 2 7 1 1 7 9 8 6 1 2 / c o l i . 0 6 - 7 8 - p r e p 1 4 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Kelleher and Costello Computational Models of Spatial Prepositions for VSD R2 and two blue balls B1 and B2. Imagine that we want to refer to B1. We begin by calling the locative incremental algorithm, Algorithm 2. This in turn calls the basic incremental algorithm, Algorithm 1, which will return the property ball. However, this is not sufﬁcient to create a distinguishing description as B2 is also a ball. In this context the set of candidate landmarks equals {R1,R2} and the ﬁrst relation in the hierarchy is topological proximity, which we model as described in Section 6.1. The image on the right of Figure 29 illustrates the analysis of the scene using this framework: The green region on the left deﬁnes the area deemed to be proximal to R1, and the yellow region on the right deﬁnes the area deemed to be proximal to R2. It is evident that B1 is in the area proximal to R1; consequently R1 is classiﬁed as a target landmark. As none of the distractors (i.e., B2) are located in a region that is proximal to a candidate landmark there are no distractor landmarks. As a result when the basic incremental algorithm is called to create a distinguishing description for the target landmark R1 it will return box and this will be deemed to be a distinguishing locative description. The overall algorithm will then return the vector {ball, proximal, box} which would result in the realizer generating a reference of the form: the ball near the box. 9. Conclusions and Future Work In this article we have described the application of computational models of spatial prepositions to visually situated dialog systems. These computational models allow sys- tems to both interpret and generate expressions which refer to topological and projective relations between objects in the visual environment. The computational models of spa- tial prepositions we present are designed to handle reference resolution and generation in complex visual environments containing multiple objects. In particular, these models are designed to account for the contextual inﬂuence which the presence of multiple objects has on the semantics of topological and projective prepositions. In this respect our computational models move beyond other accounts of the semantics of spatial prepositions, which typically do not model the contextual inﬂuence of other objects on spatial semantics. Because most real-world visual scenes are complex and contain multiple objects, our computational models for the semantics of spatial prepositions are important for visually situated dialog systems intended to operate successfully in the real world. Clearly there are many interesting areas for future work. To date our research has focused on a small number of static topological and projective prepositions. We feel, however, that our framework will apply usefully to a range of other more complex static and dynamic prepositions, for example: between, among, within, along, beside, around. These prepositions either involve several objects or multiple areas and, consequently, our account of the effect of distractor objects on the target–landmark relationship could provide a worthwhile perspective on their semantics. This leads to another promising area for future work. Although our current model was designed to accommodate multiple distractor objects, our empirical studies have focused on cases where there is only one distractor. An important aim for future research is to extend these studies and test the model in situations with multiple distractors. From a theoretical point of view, we feel that our approach to the semantics of spa- tial prepositions illustrates an important point for researchers working on the semantics of natural language in general: that it is possible to investigate and model semantics not solely as a linguistic phenomenon, but also in terms of non-linguistic factors such as the visual environment in which language is used. For example, in the psychological evaluations described in Section 7 we found that the semantic applicability of “near” to 303 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 5 2 2 7 1 1 7 9 8 6 1 2 / c o l i . 0 6 - 7 8 - p r e p 1 4 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 35, Number 2 the relationship between a target and a landmark object was reliably inﬂuenced by the presence and location of a third, distractor, object, which was not part of the linguistic context. That the semantics of language is inﬂuenced by non-linguistic factors is an old point and an obvious one: however, we think that our research on visually-situated dialog systems makes a useful contribution by showing that these systems provide ideal testbeds for investigating the interaction between language and vision, and for developing detailed and useful computational models of how those interactions work. References Baldridge, J. and G. J. M. Kruijff. 2002. Coupling CCG and hybrid logic dependency semantics. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02), pages 319–326, Philadelphia, PA. Baldridge, J. and G. J. M. Kruijff. 2003. Multi-modal combinatory categorial grammar. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL-03), volume 1, pages 211–218, Budapest. Beermann, D. and L. Hellan. 2004. Semantic decomposition in a computational HPSG grammar: A treatment of aspect and context-dependent directionals. In Proceedings of the HPSG04 Conference, pages 357–377, Leuven. Bunt, H. 1994. Context and dialogue control. Think, 3:19–31. Cahill, A., M. Burke, R. O’Donovan, J. van Genabith, and A. Way. 2004. Long-distance dependency resolution in automatically acquired wide-coverage PCFG-based LFG approximations. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 320–327, Barcelona. Carletta, J., A. Isard, S. Isard, J. C. Kowtko, G. Doherty-Sneddon, and A. H. Anderson. 1997. The reliability of a dialogue structure coding scheme. Computational Linguistics, 23(1):13–32. Carlson-Radvansky, L. A. and G. D. Logan. 1997. The inﬂuence of reference frame selection on spatial template construction. Journal of Memory and Language, 37:411–437. Cohn, A. G., B. Bennett, J. M. Gooday, and N. Gotts. 1997. RCC: A calculus for region based qualitative spatial reasoning. GeoInformatica, 1:275–316. Coventry, K. R. 1998. Spatial prepositions, functional relations, and lexical speciﬁcation. In P. Olivier and K. P. Gapp, editors, Representation and Processing of Spatial Expressions. Lawrence Erlbaum Associates, Hillsdale, NJ, pages 247–262. 304 Coventry, K. R. and S. Garrod. 2004. Saying, Seeing and Acting. The Psychological Semantics of Spatial Prepositions. Essays in Cognitive Psychology Series. Lawrence Erlbaum Associates, Hillsdale, NJ. Dale, R. and N. Haddock. 1991. Generating referring expressions involving relations. In Proceedings of the 5th Conference of the European Chapter of the Association for Computational Linguistics (EACL-91), pages 161–166, Berlin. Dale, R. and E. Reiter. 1995. Computatinal interpretations of the gricean maxims in the generation of referring expressions. Cognitive Science, 18:233–263. Fuhr, T., G. Socher, C. Scheering, and G. Sagerer. 1998. A three-dimensional spatial model for the interpretation of image data. In P. Olivier and K. P. Gapp, editors, Representation and Processing of Spatial Expressions. Lawrence Erlbaum Associates, Hillsdale, NJ, pages 103–118. Gapp, K. P. 1994. Basic meanings of spatial relations: Computation and evaluation in 3D space. In Proceedings of the 12th National Conference on Artiﬁcial Intelligence (AAAI-94), Seattle, WA, pages 1393–1398. Gapp, K. P. 1995. An empirically validated model for computing spatial relations. In Proceedings of the 19th German Conference on Artiﬁcial Intelligence (KI-95), Bielefeld, Germany, pages 245–256. Gardent, C. 2002. Generating minimal deﬁnite descriptions. In Proceedings of the 40th Annual Meeting of the Association of Computational Linguistics (ACL-02), pages 96–103, Philadelphia, PA. Garrod, S., G. Ferrier, and S. Campbell. 1999. In and on: Investigating the functional geometry of spatial prepositions. Cognition, 72:167–189. Gawron, J. M. 1986. Situations and prepositions. Linguistics and Philosophy, 9(3):327–382. Gorniak, P. and D. Roy. 2004. Grounded semantic composition for visual scenes. Journal of Artiﬁcial Intelligence Research, 21:429–470. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 5 2 2 7 1 1 7 9 8 6 1 2 / c o l i . 0 6 - 7 8 - p r e p 1 4 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Kelleher and Costello Computational Models of Spatial Prepositions for VSD Hajicov´a, E. 1993. Issues of Sentence Structure and Discourse Patterns, volume 2 of Theoretical and Computational Linguistics. Charles University Press, Prague. Hayward, W. G. and M. J. Tarr. 1995. Spatial language and spatial representation. Cognition, 55:39–84. Herskovits, A. 1986. Language and Spatial Cognition: An Interdisciplinary Study of Prepositions in English. Studies in Natural Language Processing. Cambridge University Press, Cambridge, UK. Horacek, H. 1997. An algorithm for generating referential descriptions with ﬂexible interfaces. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pages 206–213, Madrid. Jackendoff, R. 1983. Semantics and Cognition. Current Studies in Linguistics. The MIT Press, Cambridge, MA. Kelleher, J., F. Costello, and J. van Genabith. 2005. Dynamically structuring, updating and interrelating representations of visual and lingusitic discourse context. Artiﬁcial Intelligence, 167(1–2):62–102. Kelleher, J. and J. van Genabith. 2004. Visual salience and reference resolution in simulated 3D environments. AI Review, 21(3-4):253–267. Kelleher, J. and J. van Genabith. 2006. A computational model of the referential semantics of projective prepositions. In P. Saint-Dizier, editor, Syntax and Semantics of Prepositions, Speech and Language Processing. Kluwer Academic Publishers, Dordrecht, The Netherlands, pages 199–216. Kelleher, J. D. and G. J. Kruijff. 2006. Incremental generation of spatial referring expressions in situated dialog. In Proceedings of the 3rd Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING/ACL-06), pages 1041–1048, Sydney. Kelleher, J. D., G. J. Kruijff, and F. Costello. 2006. Proximity in context: an empirically grounded computation model of proximity for processing topological spatial expressions. In Proceedings of the 3rd Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING/ACL-06), pages 745–752, Sydney. Klein, M. 1999. An overview of the state of the art of coding schemes for dialogue act annotation. In V. Matousek, P. Mautner, J. Ocelikov´a, and P. Sojka, editors, Text, Speech and Dialogue (TSD’99), Lecture Notes in Computer Science. Springer, Berlin/Heidelberg, pages 274–297. Krahmer, E. and M. Theune. 2002. Efﬁcient context-sensitive generation of referring expressions. In K. van Deemter and R. Kibble, editors, Information Sharing: Reference and Presupposition in Language Generation and Interpretation. CLSI Publications, Stanford, CA, pages 223–263. Kruijff, Geert-Jan, John Kelleher, and Nick Hawes. 2006. Information fusion for visual reference resolution in dynamic situated dialogue. In Elisabeth Andre, Laila Dybkjaer, Wolfgang Minker, Heiko Neumann, and Michael Weber, editors, In Proceedings of Perception and Interactive Technologies (PIT06), volume 4021 of Lecture Notes in Computer Science. Springer Berlin/Heidelberg, pages 117–128. Kuipers, Benjamin. 2000. The spatial semantic hierarchy. Artiﬁcial Intelligence, 19:191–233. Landau, B. 1996. Multiple geometric representations of objects in language and language learners. In P. Bloom, M. Peterson, L. Nadel, and M. Garrett, editors, Language and Space. MIT Press, Cambridge, MA, pages 317–363. Levelt, W. J. M. 1996. Perspective taking and ellipsis in spatial descriptions. In M. Bloom, P. Peterson, L. Nadell, and M. Garrett, editors, Language and Space. MIT Press, Cambridge, MA, pages 77–108. Levinson, S. 1996. Frame of reference and Molyneux’s question: Crosslinguistic evidence. In M. Bloom, P. Peterson, L. Nadell, and M. Garrett, editors, Language and Space. MIT Press, Cambridge, MA, pages 109–170. Levinson, S. 2003. Space in Language and Cognition: Explorations in Cognitive Diversity. Cambridge University Press, Cambridge, UK. Logan, G. D. 1994. Spatial attention and the apprehension of spatial relations. Journal of Experimental Psychology: Human Perception and Performance, 20:1015–1036. Logan, G. D. 1995. Linguistic and conceptual control of visual spatial attention. Cognitive Psychology, 12:523–533. Logan, G. D. and D. D. Sadler. 1996. A computational analysis of the apprehension of spatial relations. In M. Bloom, P. Peterson, L. Nadell, and M. Garrett, editors, Language and 305 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 5 2 2 7 1 1 7 9 8 6 1 2 / c o l i . 0 6 - 7 8 - p r e p 1 4 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Computational Linguistics Volume 35, Number 2 Space. MIT Press, Cambridge, MA, pages 493–529. Lorch, R. F. and J. L. Myers. 1990. Regression analyses of repeated measures data in cognitive research. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(1):149–157. Olivier, P. and J. Tsujii. 1994. Quantitative perceptual representation of prepositional semantics. Artiﬁcial Intelligence Review, 8:147–158. Regier, T and L. Carlson. 2001. Grounding spatial language in perception: An empirical and computational investigation. Journal of Experimental Psychology: General, 130(2):273–298. Treisman, A. and S. Gormican. 1988. Feature analysis in early vision: Evidence from search assymetries. Psychological Review, 95:15–48. Tseng, J. L. 2000. The Representation and Selection of Prepositions. Ph.D. thesis, University of Edinburgh. Varges, S. 2004. Overgenerating referring expressions involving relations and booleans. In Proceedings of the 3rd International Conference on Natural Language Generation (INLG-04), pages 171–181, Brighton. Yamada, A. 1993. Studies in Spatial Descriptions Understanding Based on Geometric Constraints Satisfaction. Ph.D. thesis, University of Kyoto. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / c o l i / l a r t i c e - p d f / / / / 3 5 2 2 7 1 1 7 9 8 6 1 2 / c o l i . 0 6 - 7 8 - p r e p 1 4 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 306 Applying Computational Models of Spatial image

Descargar PDF