Visual Spatial Reasoning


Fangyu Liu Guy Emerson Nigel Collier
University of Cambridge, UK
{fl399, gete2, nhc30}@cam.ac.uk

Abstract

Spatial relations are a basic part of human cognition. However, they are expressed in natural language in a variety of ways, and previous work has suggested that current vision-and-language models (VLMs) struggle to capture relational information. In this paper, we present Visual Spatial Reasoning (VSR), a dataset containing more than 10k natural text-image pairs with 66 types of spatial relations in English (e.g., under, in front of, facing). While using a seemingly simple annotation format, we show how the dataset includes challenging linguistic phenomena, such as varying reference frames. We demonstrate a large gap between human and model performance: the human ceiling is above 95%, while state-of-the-art models only achieve around 70%. We observe that VLMs' by-relation performances have little correlation with the number of training examples and that the tested models are in general incapable of recognising relations concerning the orientations of objects.1

1 Introduction

Multimodal NLP research has developed rap-
idly in recent years, with substantial performance
gains on tasks such as visual question answering
(VQA) (Antol et al., 2015; Johnson et al., 2017;
Goyal et al., 2017; Hudson and Manning, 2019;
Zellers et al., 2019), vision-language reasoning
or entailment (Suhr et al., 2017, 2019; Xie et al.,
2019; Liu et al., 2021), and referring expression
comprehension (Yu et al., 2016; Liu et al., 2019).
Existing benchmarks, such as NLVR2 (Suhr et al.,
2019) and VQA (Goyal et al., 2017), define ge-
neric paradigms for testing vision-language mod-
els (VLMs). However, as we further discuss in
§ 2, these benchmarks are not ideal for prob-
ing VLMs as they typically conflate multiple
sources of error and do not allow controlled
analysis on specific linguistic or cognitive prop-
erties, making it difficult to categorize and fully

1Data and code: github.com/cambridgeltl/visual-spatial-reasoning.


understand the model failures. In particular, spa-
tial reasoning has been found to be particularly
challenging for current models, and much more
challenging than capturing properties of individ-
ual entities (Kuhnle et al., 2018; Cirik et al.,
2018; Akula et al., 2020), even for state-of-the-
art models such as CLIP (Radford et al., 2021;
Subramanian et al., 2022).

Another line of work generates synthetic data-
sets in a controlled manner to target specific
relations and properties when testing VLMs,
e.g., CLEVR (Liu et al., 2019) and ShapeWorld
(Kuhnle and Copestake, 2018). However, syn-
thetic datasets may accidentally overlook chal-
lenges (such as orientations of objects, which
we will discuss in § 5), and using natural images
allows us to explore a wider range of language
use.

To address the lack of probing evaluation
benchmarks in this field, we present VSR (Vi-
sual Spatial Reasoning), a controlled dataset that
explicitly tests VLMs for spatial reasoning. We
choose spatial reasoning as the focus because it
is one of the most fundamental capabilities for
both humans and VLMs. Such relations are cru-
cial to how humans organize their mental space
and make sense of the physical world, and are
therefore fundamental for a grounded semantic model
(Talmy, 1983).

The VSR dataset contains natural image-text
pairs in English, with the data collection process
explained in § 3. Each example in the dataset
consists of an image and a natural language de-
scription that states a spatial relation of two objects
presented in the image (two examples are shown
in Figure 1 and Figure 2). A VLM needs to classify
the image-caption pair as either true or false, indi-
cating whether the caption correctly describes
the spatial relation. The dataset covers 66 spatial
relations and has >10k data points, using 6,940
images from MS COCO (Lin et al., 2014).
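To make the task format concrete, the following is a minimal sketch (not the authors' released code) of how a VSR example can be represented and how accuracy over the binary classification is computed; the field names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class VSRExample:
    image_path: str   # e.g., a COCO image file
    caption: str      # e.g., "The potted plant is at the right side of the bench."
    label: bool       # True if the caption correctly describes the spatial relation

def accuracy(predictions: list[bool], examples: list[VSRExample]) -> float:
    """Fraction of image-caption pairs whose true/false prediction matches the gold label."""
    correct = sum(p == ex.label for p, ex in zip(predictions, examples))
    return correct / len(examples)
```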

Situating one object in relation to another re-
quires a frame of reference: a system of coor-
dinates against which the objects can be placed.

Transactions of the Association for Computational Linguistics, vol. 11, pp. 635–651, 2023. https://doi.org/10.1162/tacl_a_00566
Action Editor: Mohit Bansal. Submission batch: 8/2022; Revision batch: 12/2022; Published 6/2023.
© 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


Drawing on detailed studies of more than forty typologically diverse languages, Levinson (2003) concludes that the diversity can be reduced to three major types: intrinsic, relative, and absolute. An intrinsic frame is centered on an object, e.g., behind the chair, meaning at the side with the backrest. A relative frame is centered on a viewer, e.g., behind the chair, meaning further away from someone's perspective. An absolute frame uses fixed coordinates, e.g., north of the chair, using cardinal directions. In English, absolute frames are rarely used when describing relations on a small scale, and they do not appear in our dataset. However, intrinsic and relative frames are widely used, and present an important source of variation. We discuss the impact on data collection in § 3.2, and analyse the collected data in § 4.

We test four popular VLMs, i.e., VisualBERT
(Li et al., 2019), LXMERT (Tan and
Bansal, 2019), ViLT (Kim et al., 2021), and CLIP
(Radford et al., 2021) on VSR, with results
given in § 5. While the human ceiling is above
95%, all four models struggle to reach 70% ac-
curacy. We conduct comprehensive analysis on
the failures of the investigated VLMs and high-
light that (1) positional encodings are extremely
important for the VSR task; (2) models’ by-
relation performance barely correlates with the
number of training examples; (3) in fact, several
spatial relations that concern orientation of ob-
jects are especially challenging for current VLMs;
and (4) VLMs have extremely poor generalization
on unseen concepts.

2 Related Work

2.1 Comparison with Synthetic Datasets

Synthetic language-vision reasoning datasets,
e.g., SHAPES (Andreas et al., 2016), CLEVR
(Liu et al., 2019), NLVR (Suhr et al., 2017), and
ShapeWorld (Kuhnle and Copestake, 2018), en-
able full control of dataset generation and could
potentially benefit probing of spatial reasoning
capability of VLMs. They share a similar goal
to us, to diagnose and pinpoint weaknesses in
VLMs. However, synthetic datasets necessarily
simplify the problem as they have inherently
bounded expressivity. In CLEVR, objects can only
be spatially related via four relationships: ‘‘left’’,
‘‘right’’, ‘‘behind’’, and ‘‘in front of’’, while VSR
covers 66 relations.

Synthetic data does not always accurately re-
flect the challenges of reasoning in the real world.
For example, objects like spheres, which often
appear in synthetic datasets, do not have orien-
tations. In real images, orientations matter and
human language use depends on that. Further-
more, synthetic images do not take the scene as
a context into account. The interpretation of ob-
ject relations can depend on such scenes (e.g., the
degree of closeness can vary in open space and
indoor scenes).

Last but not least, the vast majority of spa-
tial relationships cannot be determined by rules.
Even for the seemingly simple relationships like
‘‘left/right of’’, the determination of two objects’

Figure 1: Caption: The potted plant is at the right side of the bench. Label: True. Image source: Antoine K. ‘‘Texting’’, uploaded November 5, 2010. https://www.flickr.com/photos/ktoine/5149301465/ (CC BY-SA 2.0).


Figure 2: Caption: The cow is ahead of the person. Label: False. Image source: ccarlstead. ‘‘Holy cow’’, Joshua Tree National Park, uploaded March 24, 2023. https://www.flickr.com/photos/ccarlstead/6863977248/ (CC BY-NC-ND 2.0).



spatial relationships can depend on the observer’s
viewpoint, whether the object has a front, and if so,
what their orientations are, etc.

2.2 Spatial Relations in Existing
Vision-language Datasets

Several existing vision-language datasets with
natural images also contain spatial relations (e.g.,
NLVR2, COCO, and VQA datasets). Suhr et al.
(2019) summarize that there are 9 prevalent lin-
guistic phenomena/challenges in NLVR2 (Suhr
et al., 2019), such as coreference, existential
quantifiers, hard cardinality, spatial relations,
etc., and 4 in VQA datasets (Antol et al., 2015;
Hudson and Manning, 2019). However, dif-
ferent challenges are entangled in these datasets.
Sentences contain complex lexical and syntac-
tic information and can thus conflate different
sources of error, making it hard to identify the ex-
act challenge and preventing categorised analysis.
Yatskar et al. (2016) extract 6 types of visual spa-
tial relations directly from MS COCO images with
annotated bounding boxes. But rule-based auto-
matic extraction can be restrictive as most relations
are complex and cannot be identified relying on
bounding boxes. Recently, Rösch and Libovický
(2022) extract captions that contain 28 positional
keywords from MS COCO and swap the keywords
with their antonyms to construct a challenging
probing dataset. However, the COCO captions
also have the error-conflation problem. Also, the
number of examples and types of relations are
restricted by COCO captions.

Visual Genome (Krishna et al., 2017) also con-
tains annotations of objects’ relations including
spatial relations. However, it is only a collection
of true statements and contains no negative ones,
so cannot be framed as a binary classification task.
It is non-trivial to automatically construct neg-
ative examples since multiple relations can be
plausible for a pair of objects in a given image.
Relation classifiers are harder to learn than ob-
ject classifiers on this dataset (Liu and Emerson,
2022).

Parcalabescu et al. (2022) propose a bench-
mark called VALSE for testing VLMs’ capabil-
ities on various linguistic phenomena. VALSE
has a subset focusing on ‘‘relations’’ between
objects. It uses texts modified from COCO's orig-
inal captions. However, it is a zero-shot bench-
mark without a training set, containing just 535 data

points. Thus, it is not ideal for large-scale probing
on a wide spectrum of spatial relations.

2.3 Spatial Reasoning Without Grounding

There has also been interest in probing models’
spatial reasoning capability without visual input.
For example, Collell et al. (2018), Mirzaee et al.
(2021), and Liu et al. (2022) probe pretrained
text-only models or VLMs’ spatial reasoning ca-
pabilities with text-only questions. However, a
text-only dataset cannot evaluate how a model
relates language to grounded spatial information.
In contrast, VSR focuses on the joint understand-
ing of vision and language input.

2.4 Spatial Reasoning as a Sub-component

Last but not least, some vision-language tasks
and models require spatial reasoning as a sub-
component. For example, Lei et al. (2020) pro-
pose TVQA+, a spatio-temporal video QA dataset
containing bounding boxes for objects referred
in the questions. Models then need to simulta-
neously conduct QA while detecting the correct
object of interest. Christie et al. (2016) propose
a method for simultaneous image segmentation
and prepositional phrase attachment resolution.
Models have to reason about objects’ spatial rela-
tions in the visual scene to determine the assign-
ment of prepositional phrases. However, if spatial
reasoning is only a sub-component of a task, er-
ror analysis becomes more difficult. In contrast,
VSR provides a focused evaluation of spatial
relations, which are particularly challenging for
current models.

3 Dataset Creation

In this section we detail how VSR is constructed.
The data collection process can generally be split
into two phases: (1) contrastive caption gener-
ation (§ 3.1) and (2) second-round validation
(§ 3.2). We then discuss annotator hiring and
payment (§ 3.3), dataset splits (§ 3.4), and
human ceiling and agreement of VSR (§ 3.5).

3.1 Contrastive Template-based Caption Generation (Figure 3)

In order to highlight spatial relations and avoid
annotators frequently choosing trivial relations
(such as ‘‘near to’’), we use a contrastive caption
generation approach. Specifically, first, a pair of images, each containing two concepts of interest, would be randomly sampled from MS COCO (we use the train and validation sets of COCO 2017). Second, an annotator would be given a template containing the two concepts and is required to choose a spatial relation from a pre-defined list (Table 1) that makes the caption correct for one image but incorrect for the other image. We will detail these steps and explain the rationales in the following.


Category      Spatial Relations

Adjacency     adjacent to, alongside, at the side of, at the right side of, at the left side of, attached to, at the back of, ahead of, against, at the edge of
Directional   off, past, toward, down, deep down∗, up∗, away from, along, around, from∗, into, to∗, across, across from, through, down from
Orientation   facing, facing away from, parallel to, perpendicular to
Projective    on top of, beneath, beside, behind, left of, right of, under, in front of, below, above, over, in the middle of
Proximity     by, close to, near, far from, far away from
Topological   connected to, detached from, has as a part, part of, contains, within, at, on, in, with, surrounding, among, consists of, out of, between, inside, outside, touching
Unallocated   beyond, next to, opposite to, after∗, among, enclosed by

Table 1: The 71 available spatial relations; 66 of them appear in our final dataset (∗ indicates not used).

Image Pair Sampling. MS COCO 2017 contains 123,287 images and has labeled the segmentation and classes of 886,284 instances (individual objects). Leveraging the segmentation, we first randomly select two concepts (e.g., ‘‘cat’’ and ‘‘laptop’’ in Figure 3), then retrieve all images containing the two concepts in COCO 2017 (train and validation sets). Then, images that contain multiple instances of either concept are filtered out to avoid referencing ambiguity. For the single-instance images, we also filter out any image with instance pixel area size < 30,000, to prevent extremely small instances. After these filtering steps, we randomly sample a pair from the remaining images. We repeat this process to obtain a large number of individual image pairs for caption generation.
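A minimal sketch of this sampling procedure, assuming the standard pycocotools interface and local COCO 2017 annotation files (the paths and the helper name are illustrative, not the authors' released code):

```python
import random
from pycocotools.coco import COCO

def sample_image_pair(coco: COCO, concept_a: str, concept_b: str, min_area: float = 30_000):
    """Return two COCO image ids, each containing exactly one (large enough) instance of both concepts."""
    cat_a = coco.getCatIds(catNms=[concept_a])
    cat_b = coco.getCatIds(catNms=[concept_b])
    # Images that contain both concepts.
    img_ids = coco.getImgIds(catIds=cat_a + cat_b)
    valid = []
    for img_id in img_ids:
        anns_a = coco.loadAnns(coco.getAnnIds(imgIds=img_id, catIds=cat_a))
        anns_b = coco.loadAnns(coco.getAnnIds(imgIds=img_id, catIds=cat_b))
        # Keep single-instance images only, to avoid referencing ambiguity.
        if len(anns_a) != 1 or len(anns_b) != 1:
            continue
        # Filter out extremely small instances (pixel area < 30,000).
        if anns_a[0]["area"] < min_area or anns_b[0]["area"] < min_area:
            continue
        valid.append(img_id)
    return random.sample(valid, 2)  # one contrastive image pair

# coco = COCO("annotations/instances_train2017.json")  # path is an assumption
# pair = sample_image_pair(coco, "cat", "laptop")
```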
Fill in the Blank: Template-based Caption Generation. Given a pair of images, the annotator needs to come up with a valid caption that makes it a correct description for one image but incorrect for the other. In this way, the annotator should focus on the key difference between the two images (which should be a spatial relation between the two objects of interest) and choose a caption that differentiates the two. Similar paradigms are also used in the annotation of previous vision-language reasoning datasets such as NLVR(2) (Suhr et al., 2017, 2019) and MaRVL (Liu et al., 2021). To regularize annotators from writing modifiers and differentiating the image pair with things beyond accurate spatial relations, we opt for a template-based classification task instead of free-form caption writing.2 Additionally, the template-generated dataset can be easily categorized based on relations and their categories.

Specifically, the annotator would be given instance pairs as shown in Figure 3. The caption template has the format of ‘‘The ENT1 (is) ___ the ENT2.’’, and the annotators are instructed to select a relation from a fixed set to fill in the slot. The copula ‘‘is’’ can be omitted for grammaticality. For example, for ‘‘contains’’ and ‘‘has as a part’’, ‘‘is’’ should be discarded in the template when extracting the final caption. The fixed set of spatial relations enables us to obtain full control of the generation process. The full list of used relations is given in Table 1. The list contains 71 spatial relations and is adapted from the summarized relation table of Marchi Fagundes et al. (2021). We made minor changes to filter out clearly unusable relations, made relation names grammatical under our template, and reduced repeated relations.

Figure 3: An annotation example of concepts ‘‘cat’’ & ‘‘laptop’’ in contrastive caption generation. The example generates two data points for our dataset: one ‘‘True’’ instance when the completed caption is paired with image 2 (right) and one ‘‘False’’ instance when paired with image 1 (left). Figure 3a source: Jeremy Zawodny. ‘‘Thunder-Man’’, uploaded October 16, 2007. https://www.flickr.com/photos/jzawodn/1590039572/ (CC BY-NC 2.0). Figure 3b source: Chris Jobling. ‘‘Day 42: Cat and mouse?’’, uploaded September 30, 2008. https://www.flickr.com/photos/51214457@N00/2901947727 (CC BY-SA 2.0).

2Hendricks and Nematzadeh (2021) propose a zero-shot probing benchmark of similar spirit for verb understanding. All captions are simplified as subject-verb-object triplets.
In our final dataset, 66 out of the 71 available relations are actually included (the other 6 are either not selected by annotators or are selected but the captions did not pass the validation phase).

3.2 Second-round Human Validation

In the second-round validation, every annotated data point is reviewed by at least 3 additional human annotators (validators). Given a data point (consisting of an image and a caption), the validator gives either a True or False label as shown in Figure 4 (the original label is hidden). In our final dataset, we exclude instances with fewer than 2 validators agreeing with the original label.

Design Choice on Reference Frames. During validation, a validator needs to decide whether a statement is true or false for an image. However, as discussed in § 1, interpreting a spatial relation requires choosing a frame of reference. For some images, a statement can be both true and false, depending on the choice. As a concrete example, in Figure 1, while the potted plant is on the left side from the viewer's perspective (relative frame), the potted plant is at the right side if the bench is used to define the coordinate system (intrinsic frame). In order to ensure that annotations are consistent across the dataset, we communicated to the annotators that, for relations such as ‘‘left’’/‘‘right’’ and ‘‘in front of’’/‘‘behind’’, they should consider both possible reference frames, and assign the label True when a caption is true from either the intrinsic or the relative frame. Only when a caption is incorrect under both reference frames (e.g., if the caption is ‘‘The potted plant is under the bench.’’ for Figure 1) should a False label be assigned. On a practical level, this adds difficulty to the task, since a model cannot naively rely on pixel locations of the objects in the images, but also needs to correctly identify orientations of objects. However, the task is well-defined: A model that can correctly simulate both reference frames would be able to perfectly solve this task. From a theoretical perspective, by involving more diverse reference frames, we are also demonstrating the complexity of human cognitive processes when understanding a scene, since different people approach a scene with different frames. Attempting to enforce a specific reference frame would be methodologically difficult and result in an unnaturally restricted dataset.

Figure 4: A second-round validation example. Image source: Marisa McClellan. ‘‘Becky's grilled pizza’’, uploaded May 31, 2011. https://www.flickr.com/photos/marusula/5779127081/ (CC BY-NC-ND 2.0).

3.3 Annotator Hiring and Organization

Annotators were hired from prolific.co. We required them to (1) have at least a bachelor's degree, (2) be fluent in English, and (3) have a >99% historical approval rate on the platform. All
annotators were paid 12 GBP per hour.

For caption generation, we released the task
with batches of 200 instances and the annotator
was required to finish a batch in 80 minutes. 一个
annotator could not take more than one batch per


split        train   dev     test    total

random       7,680   1,097   2,195   10,972
zero-shot    4,713   231     616     5,560

Table 2: Statistics of the random and zero-shot splits.

day. In this way we had a diverse set of annotators
and could also prevent annotators from becoming
fatigued. For second-round validation, we grouped
500 data points in one batch and an annotator was
asked to label each batch in 90 minutes.

In total, 24 annotators participated in caption
generation and 45 participated in validation. Four
people participated in both phases, which should
have minimally impacted the validation quality.
The annotators had diverse demographic back-
grounds: They were born in 15 countries, were
living in 13 countries, and had 12 nationalities.
Fifty annotators were born and living in the same
country while others had moved to different ones.
The vast majority of our annotators were resid-
ing in the UK (32), South Africa (9), and Ireland
(7). The ratio for holding a bachelor/master/PhD
as the highest degree was: 12.5%/76.6%/10.9%.
Only 7 annotators were non-native English speak-
ers while the other 58 were native speakers. In our
sample, 56.7% of the annotators self-identified as
female and 43.3% as male.

3.4 Dataset Splits

We split the 10,972 validated data points into
train/dev/test sets in two different ways. The stats
of the two splits are shown in Table 2. In the
following, we explain how they are created. Ran-
dom split: We split the dataset randomly into
train/dev/test with a ratio of 70/10/20. Concept
zero-shot split: We create another concept zero-
shot split where train/dev/test have no overlap-
ping concepts. That is, if ‘‘dog’’ appears in the
train set, then it does not appear in dev or test sets.
This is done by randomly grouping concepts into
three sets with a ratio of 50/20/30 of all concepts.
This reduces the dataset size, since data points
involving concepts from different parts of the
train/dev/test split must be filtered out. The con-
cept zero-shot split is a more challenging setup
since the model has to learn concepts and the re-
lations in a compositional way instead of remem-
bering the co-occurrence statistics of the two.
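A minimal sketch of how such a concept zero-shot split can be constructed (a simplified illustration under the 50/20/30 concept grouping described above, not the authors' exact script):

```python
import random

def concept_zero_shot_split(examples, seed=0):
    """examples: list of dicts with keys 'subj' and 'obj' (the two concepts) plus other fields."""
    rng = random.Random(seed)
    concepts = sorted({c for ex in examples for c in (ex["subj"], ex["obj"])})
    rng.shuffle(concepts)
    n = len(concepts)
    groups = {
        "train": set(concepts[: int(0.5 * n)]),
        "dev": set(concepts[int(0.5 * n): int(0.7 * n)]),
        "test": set(concepts[int(0.7 * n):]),
    }
    split = {"train": [], "dev": [], "test": []}
    for ex in examples:
        for name, group in groups.items():
            # Keep an example only if BOTH of its concepts fall in the same group;
            # cross-group examples are filtered out, which shrinks the dataset.
            if ex["subj"] in group and ex["obj"] in group:
                split[name].append(ex)
    return split
```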

3.5 Human Ceiling and Agreement

We randomly sample 500 data points from the
test set of the final random split for computing
the human ceiling and inter-annotator
agreement. We hide the labels of the 500 ex-
amples and two additional annotators are asked
to label True/False for them. On average, the
two annotators achieve an accuracy of 95.4% on
the VSR task. We further compute the Fleiss’
kappa among the original annotation and the pre-
dictions of the two humans. The Fleiss’ kappa
score is 0.895, indicating near-perfect agreement
according to Landis and Koch (1977).
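For reference, the agreement statistic can be reproduced with standard tooling; a minimal sketch using statsmodels (the label matrix below is an illustrative placeholder, not the released annotations):

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings: one row per data point, one column per rater (original label + 2 extra annotators),
# with True/False labels encoded as 1/0.
ratings = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    # ... 500 sampled data points in total
])

table, _ = aggregate_raters(ratings)  # counts of each category per item
print(fleiss_kappa(table))            # the paper reports 0.895 on its 500-example sample
```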

4 Dataset Analysis

In this section we compute some basic statis-
tics of our collected data (§ 4.1), analyze
where human annotators have agreed/disagreed
(§ 4.2), and present a case study on refer-
ence frames (§ 4.3).

4.1 Basic Statistics of VSR

After the first phase of contrastive template-
based caption generation (§ 3.1), we collected
12,809 raw data points. In the phase of the
second-round validation (§ 3.2), we collected
39,507 validation labels. Every data point received
at least 3 validation labels. For 69.1% of the data
points, all validators agree with the original label.
We find that 85.6% of the data points have at least
2/3 of annotators agreeing with the original label. We
use 2/3 as the threshold and exclude all instances
with lower validation agreement. After excluding
the other instances, 10,972 data points remained and
are used as our final dataset.

Here we provide basic statistics of the two
components in the VSR captions: The concepts
and the relations. Figure 5 demonstrates the rela-
tion distribution. ‘‘touching’’ is most frequently
used by annotators. The relations that reflect the
most basic relative coordinates of objects are
also very frequent, e.g., ‘‘behind’’, ‘‘in front
of’’, ‘‘on’’, ‘‘under’’, ‘‘at the left/right side of’’.
Figure 6 shows the distribution of concepts in the
dataset. Note that the set of concepts is bounded
by MS COCO and the distribution also largely
follows MS COCO. Animals such as ‘‘cat’’,
‘‘dog’’, and ‘‘person’’ are the most frequent. In-
door objects such as ‘‘dining table’’ and ‘‘bed’’
are also very dominant. In Figure 6, we separate


Figure 5: Relation distribution of the final dataset (sorted by frequency). The top 40 most frequent relations are
included. It is clear that the relations follow a long-tailed distribution.

Figure 6: Concept distribution. Only concepts with frequency > 100 are included.

the concepts that appear at ENT1 and ENT2 po-
sitions of the sentence and their distributions are
generally similar.

4.2 Where Do Annotators Disagree?

While we propose using data points with high
validation agreement for model evaluation and
发展, the unfiltered dataset is a valuable
resource for understanding cognitive and linguis-
tic phenomena. We sampled 100 examples where
annotators disagree, and found that around 30 of
them are caused by annotation errors but the rest
are genuinely ambiguous and can be interpreted
in different ways. This shows a level of intrinsic
ambiguity of the task and variation among people.
Along with the validated VSR dataset, we
also release the full unfiltered dataset, with an-
notators’ and validators’ metadata, as a second
version to facilitate linguistic studies. For exam-
ple, researchers could investigate questions such
as where disagreement is more likely to happen
and how people from different regions or cul-
tural backgrounds might perceive spatial relations
differently.

To illustrate this, the probability of two ran-
domly chosen annotators disagreeing with each

other is given for each relation in Figure 7. Some
of the relations with high disagreement can be
interpreted in the intrinsic reference frame, which
requires identifying the orientations of objects, for
example, ‘‘at the side of’’ and ‘‘in front of’’. Other
relations have a high level of vagueness, e.g., for
the notion of closeness: ‘‘near’’ and ‘‘close to’’.
In contrast, part-whole relations, such as ‘‘has as
a part’’, ‘‘part of’’, and in/out relations such as
‘‘within’’, ‘‘into’’, ‘‘outside’’, and ‘‘inside’’ have
the least disagreement.
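The per-relation disagreement probability plotted in Figure 7 can be estimated as follows; this is our reading of the statistic (the average chance that two validators drawn at random for an item give different labels), so treat the formula as an assumption rather than the authors' exact script:

```python
from collections import defaultdict

def disagreement_probability(items):
    """items: list of (relation, labels) where labels is the list of True/False votes an item received."""
    per_relation = defaultdict(list)
    for relation, labels in items:
        n = len(labels)
        if n < 2:
            continue
        t = sum(labels)   # number of True votes
        f = n - t         # number of False votes
        # Probability that two validators drawn without replacement disagree.
        per_relation[relation].append(2 * t * f / (n * (n - 1)))
    return {rel: sum(ps) / len(ps) for rel, ps in per_relation.items()}
```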

4.3 Case Study: Reference Frames

It is known that the relative reference frame is
often preferred in English, at least in standard
varieties. For example, Edmonds-Wathen (2012)
compares Standard Australian English and Abo-
riginal English, as spoken by school children at a
school on Croker Island, investigating the use of
the relations ‘‘in front of’’ and ‘‘behind’’ when de-
scribing simple line drawings of a person and
a tree. Speakers of Standard Australian English
were found to prefer the relative frame, while
speakers of Aboriginal English were found to
prefer the intrinsic frame.


Figure 7: Per-relation probability of two randomly chosen annotators disagreeing with each other (sorted
from high to low). Only relations with > 20 data points are included in the figure.

Our methodology allows us to investigate ref-
erence frame usage across a wide variety of spa-
tial relations, using a wide selection of natural
images. To understand the frequency of annota-
tors using relative vs. intrinsic frames, we label
instances’ reference frames and study their distri-
butions. The majority of examples that can be
interpreted differently under different reference
frames are left/right-related relations (i.e., ‘‘left/
right of’’ and ‘‘at the left/right side of’’). We find
all left/right-related true3 statements and classify
them into three categories: (1) intrinsic, (2) rel-
ative, and (3) both (the caption is correct under
either the intrinsic or the relative frame of reference).
Among the 616 instances, 68 (11%) and 518
(84%) use intrinsic and relative frames respec-
tively, and 30 (5%) can be interpreted with both
frames. Since the vast majority of our annotators
were native English speakers (91%), and all were
university-educated, our finding is consistent with
previous work suggesting that the relative frame
is the most common frame in standard varieties
of English.

Besides the overall trend, the use of reference
frames can vary with the circumstances. Related
patterns have been studied in cognitive science.
example, Vukovic and Williams (2015) find
a three-way interaction between linguistic cues,
spatial configurations in an image, and a person’s
own preferences on reference frames.

We investigated whether reference to a per-
son in the image might influence how annotators
comprehend the scene. 198 out of 616 in-
stances involve ‘‘person’’ in the caption. And out
of the 198 human-involved instances, 32 (16%)

use an intrinsic frame and 154 (78%) use a rela-
tive frame (12, i.e., 6%, can be interpreted with
both frames), while the proportions were 9%
and 87% for instances not involving ‘‘person’’.
This is a statistically significant difference (using a
two-tailed Fisher's exact test, p = 0.0054 if ignor-
ing both-frame cases, and p = 0.0045 if group-
ing both-frame and intrinsic cases). In other
words, this suggests that the involvement of a
human can more likely prompt the use of the in-
trinsic frame.
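The significance test can be reproduced from the counts reported above (32 intrinsic vs. 154 relative for ‘‘person’’-involving instances, and 36 vs. 364 for the remaining instances, obtained by subtraction); a minimal SciPy sketch, ignoring the both-frame cases as in the first reported p-value:

```python
from scipy.stats import fisher_exact

# Rows: involves "person" / does not; columns: intrinsic frame / relative frame.
table = [[32, 154],
         [36, 364]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(round(p_value, 4))  # expected to be close to the reported p = 0.0054
```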

5 Experiments

In this section, we test VLMs on VSR. We first in-
troduce baselines and experimental configurations
in § 5.1, then experimental results and analysis in
§ 5.2. Then we discuss the role of frame of ref-
erence using experiments in § 5.3, and finally
conduct sample efficiency analysis in § 5.4.

5.1 Baselines and Experiment Configurations

Baselines. For finetuning-based experiments,
we test three popular VLMs: VisualBERT (Li
et al., 2019),4 LXMERT (Tan and Bansal, 2019),5
and ViLT (Kim et al., 2021).6 All three models
are stacked Transformers (Vaswani et al., 2017)
that take image-text pairs as input. The difference
mainly lies in how or whether they encode the
position information of objects. We report only
finetuned results but not direct inferences from
off-the-shelf checkpoints since some of their
pretraining objectives are inconsistent with the

3According to our guideline, false statements are interpreted as false under both frames.
4huggingface.co/uclanlp/visualbert-nlvr2-coco-pre.
5huggingface.co/unc-nlp/lxmert-base-uncased.
6huggingface.co/dandelin/vilt-b32-mlm.


binary classification task of VSR, thus requiring additional engineering.

We additionally test the alt-text pretrained dual-encoder CLIP (Radford et al., 2021) as an off-the-shelf baseline model (no finetuning).7 We follow Booth (2023) to construct the negation or antonym of each individual relation. For example, ‘‘facing’’ → ‘‘facing away from’’ and ‘‘ahead of’’ → ‘‘not ahead of’’. For each sample, we compare the embedding similarity of the image-caption pair and that of the negated caption. If the original pair has a higher probability then the model prediction is True, otherwise False. We call this method CLIP (w/ prompting). We only report direct prompting results without finetuning since CLIP finetuning is expensive.

model         lr     batch size   epoch   token length
VisualBERT    2e-6   32           100     32
LXMERT        1e-5   32           100     32
ViLT          1e-5   12           30      max

Table 3: A listing of hyperparameters used for all VLMs (‘‘lr’’: learning rate).

model↓                 random split   zero-shot split
human ceiling          95.4           -
CLIP (w/ prompting)    56.0           54.5
VisualBERT             55.2±1.4       51.0±1.9
ViLT                   69.3±0.9       63.0±0.9
LXMERT                 70.1±0.9       61.2±0.4

Table 4: Model performance on VSR test set. CLIP is applied without finetuning but with carefully engineered prompts while the other three smaller models are finetuned on the training set.
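The CLIP (w/ prompting) decision rule described above can be sketched as follows; this is an illustration using the generic transformers CLIP interface and a public checkpoint (the checkpoint choice is an assumption; the paper's footnote points to a LAION ViT-H-14 model), not the exact evaluation script:

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # checkpoint is an assumption
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_with_prompting(image: Image.Image, caption: str, negated_caption: str) -> bool:
    """Predict True if the image matches the original caption better than its negation/antonym."""
    inputs = processor(text=[caption, negated_caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # similarity of the image to each caption
    return bool(logits[0] > logits[1])

# Example: clip_with_prompting(img, "The cat is facing the laptop.",
#                               "The cat is facing away from the laptop.")
```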

Experimental Configurations. We save check-
points every 100 iterations and use the best-
performing checkpoint on the dev set for testing. All
models are run three times using three random
seeds. All models are trained with the AdamW
optimizer (Loshchilov and Hutter, 2019). The
hyperparameters we used for training the three
VLMs are listed in Table 3.
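A minimal sketch of the corresponding optimizer setup under the Table 3 hyperparameters (model construction and data loading omitted; this is illustrative, not the released training code):

```python
import torch

# Learning rates, batch sizes, and epochs from Table 3.
HPARAMS = {
    "visualbert": {"lr": 2e-6, "batch_size": 32, "epochs": 100},
    "lxmert":     {"lr": 1e-5, "batch_size": 32, "epochs": 100},
    "vilt":       {"lr": 1e-5, "batch_size": 12, "epochs": 30},
}

def make_optimizer(model: torch.nn.Module, name: str) -> torch.optim.Optimizer:
    # AdamW (decoupled weight decay), as used for all three finetuned VLMs.
    return torch.optim.AdamW(model.parameters(), lr=HPARAMS[name]["lr"])
```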

5.2 Experimental Results

In this section, we provide both quantitative and
qualitative results of the four baselines. Through
analyzing the failure cases of the models, we also
highlight the key abilities needed to solve this
dataset.

As shown in Table 4, the best-performing mod-
els on the random split are LXMERT and ViLT,
reaching around 70% accuracy, while Visual-
BERT is just slightly better than the chance level.
On the zero-shot split, all models’ performance
decline substantially and the best model, ViLT,
only obtains 63.0% accuracy. The off-the-shelf
CLIP model obtains around 55% on both sets, in-
dicating its weaknesses in spatial reasoning, echo-
ing Subramanian et al.'s (2022) findings. Overall,
these results lag behind the human ceiling by more
than 25% and highlight that there is substantial
room for improving current VLMs.

Explicit Positional Information Matters. Both
LXMERT and ViLT outperform VisualBERT by
large margins (>10%) on both splits. This is ex-
pected since LXMERT and ViLT encode explicit
positional information while VisualBERT does
不是. LXMERT has position features as part of
the input which encode the relative coordinates of
objects within the image. ViLT slices an image
into patches (instead of object regions) and uses
positional encodings to signal the patches’ rela-
tive positions. VisualBERT, however, has no ex-
plicit position encoding. Bugliarello et al. (2021)
and Rösch and Libovický (2022) also highlight
the importance of positional encodings of VLMs,
which agrees with our observations.

Random Split vs. Zero-shot Split.
It is worth
noting that the performance gap between the ran-
dom and zero-shot splits is large. As we will show
in § 5.4, the underlying cause is not likely to
be the number of training examples, but rather
that concept zero-shot learning is fundamentally
a challenging task. The gap suggests that disen-
tangling representations of concepts and relations
is challenging for current models.

Sensitivity to Random Seeds. Model per-
formance varies by about one to two percentage
points. These fluctuations illustrate the impor-
tance of always reporting the average performance
of multiple runs to make sure the conclusion is
reliable.

7huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K.

Performance by Relation. We give perfor-
mance by relation for all three finetuned models


Figure 8: Performance (accuracy) by relation on the random (a) and zero-shot (b) split test sets. Relation order is
sorted by frequency (high to low from left to right). Only relations with more than 15 and 5 occurrences on the
random and zero-shot tests, respectively, are shown.

on both random and zero-shot splits in Figure 8.
The order from left to right is sorted by the fre-
quency of relations in the dataset (within each
split). 有趣的是, there does not seem to be any
correlation between performance and frequency
of the relation, hinting that specific relations are
hard not due to an insufficient number of train-
ing examples but because they are fundamentally
challenging for current VLMs. Any relation that
requires recognising orientations of objects seems
to be hard, e.g., ‘‘facing’’, ‘‘facing away
from’’, ‘‘parallel to’’, and ‘‘at the back of’’. As an
example, LXMERT failed on the two examples
in Figure 9, which require understanding the front
of a hair drier and a person respectively. In this
regard, left-right relations such as ‘‘at the left/right
side of’’ and ‘‘left/right of’’ are difficult because
the intrinsic reference frame requires understand-
ing the orientation of objects. For example, in
Figure 1, all three models predicted False, but
in the intrinsic frame (i.e., from the bench's point
of view), the potted plant is indeed at the right.

To get a more high-level understanding of
the relations’ performance, we group model per-
formance by the categories of Marchi Fagundes
et al. (2021): ‘‘Adjacency’’, ‘‘Directional’’, ‘‘Or-
ientation’’, ‘‘Projective’’, ‘‘Proximity’’, ‘‘Topo-
logical’’, and ‘‘Unallocated’’ (also shown in
Table 1). The results are shown in Figure 10.
‘‘Orientation’’ is the worst performing group on
the random split, and on average all models’
performance is close to the chance level. When
comparing random and zero-shot splits, perfor-
mance has declined to some extent for almost all
categories and models. The decrease in ‘‘Proxim-
ity’’ is particularly drastic across all models—it
declined from close to 75% accuracy in random
split to chance level in zero-shot split. ‘‘Prox-
imity’’ contains relations such as ‘‘close to’’,
‘‘near’’, and ‘‘far from’’. We believe it is due
to the fact that the notion of proximity is rela-
tive and very much dependent on the nature of
the concept and its frequent physical context. For
例子, for a ‘‘person’’ to be ‘‘near’’ an indoor


Figure 10: Performance by categories of relations, on
the random (a) and zero-shot (b) split test sets. For
legend information, see Figure 8.

models predicted wrongly. Some other examples
require common sense. For example, in Figure 2,
we can infer the person's and the cow's moving
direction and can then judge if the cow is ahead
of the person. LXMERT failed on this example.
In Figure 3 (right), the model needs to infer that
the main body of the cat is hidden behind the
laptop. Interestingly, all three models predicted
this example correctly.

5.3 Case Study on Reference Frames
As discussed in § 4.3, different frames of ref-
erence can be used in natural language and it
would be helpful to understand whether our mod-
els recognise them. We argue that the task of
identifying frame of reference itself is very hard
for current models. However, learning to recog-
nize frames of reference helps the task of visual
spatial reasoning.

Firstly, we conduct a case study on left/right-
related relations. We additionally label the ref-
erence frames of all true statements containing
any of the left/right-related relations. We exclude
all data points that can be interpreted in both in-
trinsic and relative frames to slightly reduce the
complexity of the task. Then we finetune a ViLT
checkpoint to predict the reference frame based
on the true statement and the image. The model’s

Figure 9: LXMERT failed on both examples. Figure 9a source: Austin & Zak. ‘‘three minutes with dryer on high’’, uploaded February 23, 2008. https://www.flickr.com/photos/zakh/2285744646/ (CC BY-NC-SA 2.0). Figure 9b source: Carrick. ‘‘Elsa and Maia working hard sanding down the bench’’, uploaded April 17, 2012. https://www.flickr.com/photos/carrickg/6941414780/ (CC BY-NC 2.0).

concept such as ‘‘oven’’ is very different from
a person being ‘‘near’’ a frequent outdoor object
such as ‘‘train’’ or ‘‘truck’’. Since the zero-shot
split prevents models from seeing test concepts
during training, the models have a poor grasp of
what counts as ‘‘close to’’ or ‘‘far from’’ for these
concepts, thus generalizing poorly.

Other Errors. While certain relations are in-
trinsically hard, we have observed other types of
errors that are not bounded to specific relations.
Here we give a few examples. Some instances re-
quire complex reasoning. In Figure 11, the model
needs to recognize that both the cow and the back
of the car are in the car’s side mirror and also infer
the relative position of the back of the car and the
cow. It is perhaps no surprise that two of the three


Reference frame prediction task
model↓    Precision    Recall      F1
ViLT      59.2±3.7     59.7±5.8    56.9±4.4

VSR task (left/right subset)
model↓               Accuracy
ViLT                 54.2±0.6
ViLT + rf trained    59.2±1.8

Table 5: ViLT model performance on the reference frame prediction task (upper half; we report macro-averaged Precision/Recall/F1 since the binary classification task is imbalanced); and on the VSR task using the original pretrained checkpoint or the checkpoint trained on the reference frame prediction task (accuracy reported).

Figure 11: Caption: The cow is at the back of the car. Label: True. LXMERT and VisualBERT predicted False. Image source: shorty76. ‘‘Side Mirror View’’, uploaded December 26, 2008. https://www.flickr.com/photos/shorty76/3136942358/ (CC BY-NC-ND 2.0).

performance on test set is shown in the upper half
of Table 5. We can see that reference frame pre-
diction is an extremely hard task for the model.
This is presumably because it requires taking into
account a 3D viewpoint and simulating transfor-
mations between different viewpoints.

Secondly, we use this model trained with refer-
ence frame labels to initialize the VSR task model
and further finetune it on the VSR task (only the
left/right relations). The test results are shown in
the lower part of Table 5.8 We see a clear posi-
tive transfer from reference frame prediction task
to the VSR task. This suggests that learning to re-
cognise reference frames can indeed help down-
stream visual spatial reasoning. This makes sense
since simulating the transformation of intrinsic/
relative frames could be an intermediate reasoning
step in detecting whether a statement is true/false.
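A minimal sketch of this two-stage setup, assuming the transformers ViLT interface with a linear classification head (a simplified illustration, not the authors' training code):

```python
import torch
from torch import nn
from transformers import ViltModel, ViltProcessor

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
encoder = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")

class BinaryHead(nn.Module):
    """ViLT encoder plus a linear head; reused for both binary tasks."""
    def __init__(self, encoder: ViltModel):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(encoder.config.hidden_size, 2)

    def forward(self, **inputs):
        pooled = self.encoder(**inputs).pooler_output
        return self.classifier(pooled)

# Stage 1: finetune on reference frame prediction (intrinsic vs. relative).
rf_model = BinaryHead(encoder)
# ... train rf_model on (image, true statement) -> frame label ...

# Stage 2: initialise the VSR left/right classifier from the stage-1 encoder
# and finetune further on (image, caption) -> True/False.
vsr_model = BinaryHead(rf_model.encoder)
```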

5.4 Sample Efficiency

In order to understand the correlation between
model performance and the number of training
examples, we conduct sample efficiency analysis
on VSR. The results are plotted in Figure 12.
For the minimum resource scenario, we randomly
sample 100 shots from the training sets of each
split. Then we gradually increase the number of
training examples to be 25%, 50%, and 75% of
the whole training sets. Both LXMERT and ViLT

8Note that the reference frame train/dev/test sets are de-
rived from the VSR task split—so no data leakage is possible
from train to dev and test sets even after the intermediate
pretraining.



have a reasonably good few-shot capability and
can be quite performant with 25% of training data.
LXMERT, especially, reaches above 55% accu-
racy with 100 shots on both splits. The zero-shot
split is substantially harder and most models ap-
pear to have already plateaued at around 75% of
the training set. For the random split, all models
are increasing performance with more data points,
though improvement slows down substantially
for LXMERT and ViLT after 75% of training
data. The fact that LXMERT has the best over-
all few-shot capability may be suggesting that
LXMERT’s pretrained object detector has a strong
inductive bias for the VSR dataset as it does not
need to learn to recognise concept boundaries and
classes from scratch. However, this advantage
from LXMERT seems to fade away as the number
of training examples increases.
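A small sketch of how the training subsets for this analysis can be drawn (illustrative only):

```python
import random

def training_subsets(train_set, seed=0):
    """Return the 100-shot and 25/50/75/100% training subsets used for the sample efficiency runs."""
    rng = random.Random(seed)
    shuffled = train_set[:]
    rng.shuffle(shuffled)
    subsets = {"100-shot": shuffled[:100]}
    for frac in (0.25, 0.50, 0.75, 1.0):
        subsets[f"{int(frac * 100)}%"] = shuffled[: int(frac * len(shuffled))]
    return subsets
```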

6 Conclusion and Future Directions

We have presented Visual Spatial Reasoning
(VSR), a controlled probing dataset for testing
the capability of vision-language models (VLMs)
of recognising and reasoning about spatial rela-
tions in natural image-text pairs. We made a series
of linguistic observations on the variability of
spatial language when collecting VSR. We high-
lighted the diverse use of reference frames among
annotators, and also the ambiguous nature of
certain spatial relations. We tested four popular
VLMs on VSR, and found they perform more than
25% below the human ceiling. On a more chal-
lenging concept zero-shot split, the tested VLMs


Figure 12: Sample efficiency analysis: model performance under different amounts of training data (100-shot,
25%, 50%, 75%, and 100% of the training set). Results on both the random and zero-shot split test sets are shown.
As training data increases, the performance plateaus on both sets but the flattening trend is more obvious on
the zero-shot split.

struggled to reach 60% accuracy and their per-
formance plateaued even with increased training
examples. Among the finetuning-based VLMs,
ViLT and LXMERT outperformed VisualBERT,
and we noted that the explicit positional in-
formation in the former two models is crucial in
the task. CLIP with prompt engineering achieved
slightly better than random performance, suggest-
ing poor capability in spatial reasoning. We also
performed a by-relation analysis and found that
the models’ performance on certain relations have
little correlation with the number of training ex-
amples, and certain relations are inherently more
challenging. We identified orientation as the most
difficult category of relations for VLMs. Prox-
imity is another challenging category, especially
in the zero-shot setup as this relation is highly
concept-dependent. We hope the task serves as a
useful tool for testing and probing future VLMs.
In future work, we plan to more extensively
investigate whether large-scale pretrained dual-
encoders such as CLIP (Radford et al., 2021),
ALIGN (Jia et al., 2021), and LiT (Zhai et al.,
2022) can properly recognize spatial relations,
especially in the finetuning setup. A compari-
son of dual- and cross-encoders’ performance on
each spatial relation might guide future model
design. Recently, Alayrac et al. (2022), Chen
et al. (2023), and Huang et al. (2023) proposed
ultra-large-scale VLMs. It would be interesting to
see if VLMs have better spatial reasoning ca-
pability when scaled up. Another direction is
extending VSR to cover more languages and cul-
tures (Liu et al., 2021; Bugliarello et al., 2022)
and test multilingual VLMs. Along the same line,
since we have also collected the metadata of

annotators, the VSR corpus can be used as a re-
source for investigating research questions such
as: How is ‘‘space’’ described among different
dialects of English? How is ‘‘space’’ perceived
among different populations? We hope that the
annotation process of VSR can also serve as a
basis for future cross-lingual and cross-cultural
sociolinguistic research.

Acknowledgments

We thank the TACL reviewers and the action
editor for their thoughtful comments. We thank
Qian Wang and Rongtian Ye for helping trial the
annotation scheme; Zihao Fu for helping set up
the annotation server. The project is funded by
Cambridge Language Sciences Incubator Fund.
FL is supported by Grace & Thomas C.H. Chan
Cambridge Scholarship.

References

Arjun Akula, Spandana Gella, Yaser Al-Onaizan,
Song-Chun Zhu, and Siva Reddy. 2020.
Words aren’t enough, their order matters: 在
the robustness of grounding visual referring
expressions. In Proceedings of the 58th Annual
Meeting of the Association for Computational
Linguistics, pages 6555–6565, Online. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/2020.acl-main.586

Jean-Baptiste Alayrac, Jeff Donahue, Pauline
Luc, Antoine Miech, Iain Barr, Yana Hasson,
Karel Lenc, Arthur Mensch, Katherine
Millican, Malcolm Reynolds, Roman Ring,
Eliza Rutherford, Serkan Cabi, Tengda Han,

Zhitao Gong, Sina Samangooei, Marianne
Monteiro, Jacob Menick, Sebastian Borgeaud,
Andrew Brock, Aida Nematzadeh, Sahand
Sharifzadeh, Mikolaj Binkowski, Ricardo
Barreira, Oriol Vinyals, Andrew Zisserman,
and Karen Simonyan. 2022. Flamingo: A visual
language model for few-shot learning.
In Advances in Neural Information Process-
ing Systems 35: Annual Conference on Neural
Information Processing Systems 2022, Novem-
ber 28 – December 9, 2022, New Orleans,
LA, USA.

Jacob Andreas, Marcus Rohrbach, Trevor
Darrell, and Dan Klein. 2016. Neural module
networks. In 2016 IEEE Conference on Com-
puter Vision and Pattern Recognition, CVPR
2016, Las Vegas, NV, USA, June 27–30, 2016,
pages 39–48. IEEE Computer Society.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu,
Margaret Mitchell, Dhruv Batra, C. 劳伦斯
Zitnick, and Devi Parikh. 2015. VQA: Visual
question answering. In 2015 IEEE Interna-
tional Conference on Computer Vision, ICCV
2015, Santiago, Chile, December 7–13, 2015,
pages 2425–2433. IEEE Computer Society.
https://doi.org/10.1109/ICCV.2015.279

Joe Booth. 2023. CLIP visual spatial reasoning.
https://github.com/Sohojoe/CLIP visual
-spatial-reasoning. GitHub repository.

Emanuele Bugliarello, Ryan Cotterell, Naoaki
Okazaki, and Desmond Elliott. 2021. Multi-
modal pretraining unmasked: A meta-analysis
and a unified framework of vision-and-language
BERTs. Transactions of the Association for
Computational Linguistics, 9:978–994. https://
doi.org/10.1162/tacl_a_00408

Emanuele Bugliarello, Fangyu Liu, Jonas Pfeiffer,
Siva Reddy, Desmond Elliott, Edoardo Maria
Ponti, and Ivan Vuli´c. 2022. IGLUE: A bench-
mark for transfer learning across modalities,
tasks, and languages. In Proceedings of the 39th
International Conference on Machine Learn-
ing, volume 162 of Proceedings of Machine
Learning Research, pages 2370–2392. PMLR.

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ
Piergiovanni, Piotr Padlewski, Daniel Salz,
Sebastian Goodman, Adam Grycner, Basil
Mustafa, Lucas Beyer, Alexander Kolesnikov,
Joan Puigcerver, Nan Ding, Keran Rong,
Hassan Akbari, Gaurav Mishra, Linting

Xue, Ashish V. Thapliyal, James Bradbury,
Weicheng Kuo, Mojtaba Seyedhosseini, Chao
Jia, Burcu Karagol Ayan, Carlos Riquelme
Ruiz, Andreas Peter Steiner, Anelia Angelova,
Xiaohua Zhai, Neil Houlsby,
and Radu
Soricut. 2023. PaLI: A jointly-scaled multi-
lingual language-image model. In The Elev-
enth International Conference on Learning
Representations.

Gordon Christie, Ankit Laddha, Aishwarya
Agrawal, Stanislaw Antol, Yash Goyal, Kevin
Kochersberger, and Dhruv Batra. 2016. Re-
solving language and vision ambiguities to-
gether: Joint segmentation & prepositional
attachment resolution in captioned scenes. In
Proceedings of the 2016 Conference on Empir-
ical Methods in Natural Language Processing,
pages 1493–1503, Austin, Texas. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/D16-1156

Volkan Cirik, Louis-Philippe Morency, 和
Taylor Berg-Kirkpatrick. 2018. Visual refer-
ring expression recognition: What do systems
actually learn? 在诉讼程序中
这 2018
Conference of the North American Chapter
of the Association for Computational Linguis-
抽动症: 人类语言技术, 体积 2
(Short Papers), pages 781–787, New Orleans,
Louisiana. Association for Computational Lin-
语言学. https://doi.org/10.18653/v1
/N18-2123

Guillem Collell, Luc Van Gool, and Marie-
Francine Moens. 2018. Acquiring common
sense spatial knowledge through implicit spatial
templates. In Proceedings of the Thirty-Second
AAAI Conference on Artificial Intelligence
(AAAI-18), the 30th innovative Applications
of Artificial Intelligence (IAAI-18), 和
8th AAAI Symposium on Educational Advances
in Artificial Intelligence (EAAI-18), New Or-
leans, Louisiana, 美国, February 2–7, 2018,
pages 6765–6772. AAAI Press.

Cris Edmonds-Wathen. 2012. False friends in the
multilingual mathematics classroom. In 12th
International Congress on Mathematical Ed-
ucation Topic Study Group 28, 8–15 July,
2012, Seoul, Korea, pages 5857–5866.

Yash Goyal, Tejas Khot, Douglas Summers-Stay,
Dhruv Batra, and Devi Parikh. 2017. Mak-
ing the V in VQA matter: Elevating the role

of image understanding in visual question
answering. In 2017 IEEE Conference on Com-
puter Vision and Pattern Recognition, CVPR
2017, Honolulu, HI, USA, July 21–26, 2017,
pages 6325–6334. IEEE Computer Society.
https://doi.org/10.1109/CVPR.2017
.670

Lisa Anne Hendricks and Aida Nematzadeh. 2021.
Probing image-language transformers for verb
理解. In Findings of the Association
for Computational Linguistics: ACL-IJCNLP
2021, pages 3635–3644, Online. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/2021.findings-acl
.318

Shaohan Huang, Li Dong, Wenhui Wang, Yaru
Hao, Saksham Singhal, Shuming Ma, Tengchao
Lv, Lei Cui, Owais Khan Mohammed, Qiang
Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck,
Vishrav Chaudhary, Subhojit Som, Xia Song,
and Furu Wei. 2023. Language is not all you
need: Aligning perception with language mod-
els. arXiv preprint arXiv:2302.14045.

Drew A. Hudson and Christopher D. 曼宁.
2019. GQA: A new dataset for real-world
visual reasoning and compositional question
answering. In IEEE Conference on Computer
Vision and Pattern Recognition, CVPR 2019,
Long Beach, CA, 美国, June 16–20, 2019,
pages 6700–6709. Computer Vision Founda-
tion / IEEE. https://doi.org/10.1109
/CVPR.2019.00686

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting
陈, Zarana Parekh, Hieu Pham, Quoc Le,
Yun-Hsuan Sung, Zhen Li, and Tom Duerig.
2021. Scaling up visual and vision-language
representation learning with noisy text supervi-
sion. In Proceedings of the 38th International
Conference on Machine Learning, 体积 139
of Proceedings of Machine Learning Research,
pages 4904–4916. PMLR.

Justin Johnson, Bharath Hariharan, Laurens van
der Maaten, Li Fei-Fei, C. Lawrence Zitnick,
and Ross B. Girshick. 2017. CLEVR: A diag-
nostic dataset for compositional language and
elementary visual reasoning. 在 2017 IEEE
Conference on Computer Vision and Pattern
Recognition, CVPR 2017, Honolulu, HI, USA,
July 21–26, 2017, pages 1988–1997. IEEE

Computer Society. https://doi.org/10
.1109/CVPR.2017.215

Wonjae Kim, Bokyung Son, and Ildoo Kim.
2021. ViLT: Vision-and-language transformer
without convolution or region supervision. In
Proceedings of the 38th International Confer-
ence on Machine Learning, volume 139 of
Proceedings of Machine Learning Research,
pages 5583–5594. PMLR.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin
约翰逊, Kenji Hata, Joshua Kravitz, Stephanie
Chen, Yannis Kalantidis, Li-Jia Li, David A.
Shamma, Michael S. Bernstein, and Li Fei-Fei.
2017. Visual Genome: Connecting language
and vision using crowdsourced dense image
annotations. International Journal of Com-
puter Vision, 123(1):32–73. https://doi
.org/10.1007/s11263-016-0981-7

Alexander Kuhnle and Ann Copestake. 2018.
Deep learning evaluation using deep linguis-
tic processing. In Proceedings of the Workshop
on Generalization in the Age of Deep Learn-
ing, pages 17–23, New Orleans, Louisiana.
Association for Computational Linguistics.

Alexander Kuhnle, Huiyuan Xie, and Ann
Copestake. 2018. How clever is the FiLM
model, and how clever can it be? In Proceed-
ings of the European Conference on Computer
Vision (ECCV) Workshops, pages 162–172.
https://doi.org/10.1007/978-3-030
-11018-5 15

J. Richard Landis and Gary G. 科赫. 1977.
The measurement of observer agreement for
categorical data. Biometrics, 33(1):159–174.
https://doi.org/10.2307/2529310,
PMID: 843571

Jie Lei, Licheng Yu, Tamara Berg, and
Mohit Bansal. 2020. TVQA+: Spatio-temporal
grounding for video question answering. In
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics,
pages 8211–8225, Online. Association for
Computational Linguistics. https://doi
.org/10.18653/v1/2020.acl-main
.730

Stephen C. Levinson. 2003. Space in Language
and Cognition: Explorations in Cognitive
Diversity. Language Culture and Cognition.
Cambridge University Press. https://doi
.org/10.1017/CBO9780511613609

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui
Hsieh, and Kai-Wei Chang. 2019. VisualBERT:
A simple and performant baseline for vision
and language. ArXiv preprint, abs/1908.03557.

Tsung-Yi Lin, Michael Maire, Serge Belongie,
James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollár, and C. Lawrence Zitnick. 2014.
Microsoft COCO: Common objects in context.
In European Conference on Computer Vision,
pages 740–755. Springer. https://doi.org/10.1007/978-3-319-10602-1_48

Fangyu Liu, Emanuele Bugliarello, Edoardo
Maria Ponti, Siva Reddy, Nigel Collier, and
Desmond Elliott. 2021. Visually grounded rea-
soning across languages and cultures. In Pro-
ceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing,
pages 10467–10485, Online and Punta Cana,
Dominican Republic. Association for Compu-
tational Linguistics.

Runtao Liu, Chenxi Liu, Yutong Bai, and Alan L.
Yuille. 2019. CLEVR-Ref+: Diagnosing visual
reasoning with referring expressions. In IEEE
Conference on Computer Vision and Pattern
Recognition, CVPR 2019, Long Beach, CA,
USA, June 16–20, 2019, pages 4185–4194.
Computer Vision Foundation / IEEE. https://
doi.org/10.1109/CVPR.2019.00431

Xiao Liu, Da Yin, Yansong Feng, and Dongyan
Zhao. 2022. Things not written in text: Explor-
ing spatial commonsense from visual signals.
In Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 2365–2376, Dublin, Ireland. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/2022.acl-long.168

Yinhong Liu and Guy Emerson. 2022. Learn-
ing functional distributional semantics with
visual data. In Proceedings of the 60th Annual
Meeting of
the Association for Computa-
tional Linguistics (Volume 1: Long Papers), pages 3976–3988, Dublin, Ireland. Association for Computational Linguistics.

Ilya Loshchilov and Frank Hutter. 2019. De-
coupled weight decay regularization. In 7th
International Conference on Learning Repre-
sentations, ICLR 2019, New Orleans, LA, USA,
May 6–9, 2019. OpenReview.net.

Cristiane Kutianski Marchi Fagundes, Kristin
Stock, and Luciene Delazari. 2021. A cross-
linguistic study of spatial
location descrip-
tions in New Zealand English and Brazilian
Portuguese natural language. Transactions in
GIS, 25(6):3159–3187. https://doi.org
/10.1111/tgis.12815

Roshanak Mirzaee, Hossein Rajaby Faghihi, Qiang Ning, and Parisa Kordjamshidi. 2021. SPARTQA: A textual question answering benchmark for spatial reasoning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4582–4598, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.364

Letitia Parcalabescu, Michele Cafagna, Lilitta
Muradjan, Anette Frank, Iacer Calixto, and Al-
bert Gatt. 2022. VALSE: A task-independent
benchmark for vision and language models cen-
tered on linguistic phenomena. In Proceedings
of the 60th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 1: Long Papers), pages 8253–8280, Dublin, Ireland. Association for Computational Lin-
语言学. https://doi.org/10.18653/v1
/2022.acl-long.567

Alec Radford, Jong Wook Kim, Chris Hallacy,
Aditya Ramesh, Gabriel Goh, Sandhini
Agarwal, Girish Sastry, Amanda Askell,
Pamela Mishkin, Jack Clark, Gretchen Krueger,
and Ilya Sutskever. 2021. Learning transfer-
able visual models from natural language super-
vision. In Proceedings of the 38th International Conference on Machine Learning, volume 139
of Proceedings of Machine Learning Research,
pages 8748–8763. PMLR.

Philipp J. Rösch and Jindřich Libovický. 2022.
Probing the role of positional information in
vision-language models. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1031–1041, Seattle, United States. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.findings-naacl.77

Sanjay Subramanian, William Merrill, Trevor
Darrell, Matt Gardner, Sameer Singh, and Anna
Rohrbach. 2022. ReCLIP: A strong zero-shot baseline for referring expression comprehension. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5198–5215, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.357

Alane Suhr, Mike Lewis, James Yeh, and Yoav
Artzi. 2017. A corpus of natural language for
visual reasoning. In Proceedings of the 55th
Annual Meeting of the Association for Compu-
tational Linguistics (Volume 2: Short Papers), pages 217–223, Vancouver, Canada. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/P17-2034

Alane Suhr, Stephanie Zhou, Ally Zhang, Iris
Zhang, Huajun Bai, and Yoav Artzi. 2019. A
corpus for reasoning about natural language
grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6418–6428, Florence, Italy. Association for Computational Linguistics.

Leonard Talmy. 1983. How language structures
space. In Herbert L. Pick and Linda P. Acredolo, editors, Spatial Orientation: Theory, Research, and Application, pages 225–282. Plenum Press. https://doi.org/10.1007/978-1-4615-9325-6_11

Hao Tan and Mohit Bansal. 2019. LXMERT:
Learning cross-modality encoder representa-
tions from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5100–5111, Hong Kong, China. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/D19-1514

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017. At-
tention is all you need. In Advances in Neural
Information Processing Systems 30: Annual

Conference on Neural Information Process-
ing Systems 2017, December 4–9, 2017, Long Beach, CA, USA, pages 5998–6008.

Nikola Vukovic and John N. Williams. 2015.
Individual differences in spatial cognition in-
fluence mental simulation of language. Cogni-
tion, 142:110–122.

Ning Xie, Farley Lai, Derek Doran, and Asim
Kadav. 2019. Visual entailment: A novel task
for fine-grained image understanding. ArXiv
preprint, abs/1901.06706. An earlier version of
this paper was published at the NeurIPS 2018
ViGIL workshop.

Mark Yatskar, Vicente Ordonez, and Ali Farhadi.
2016. Stating the obvious: Extracting visual
common sense knowledge. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 193–198, San Diego, California. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/N16-1023

Licheng Yu, Patrick Poirson, Shan Yang,
Alexander C. Berg, and Tamara L. Berg.
2016. Modeling context in referring expres-
sions. In European Conference on Computer Vision, pages 69–85. Springer.

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In IEEE
Conference on Computer Vision and Pattern
Recognition, CVPR 2019, Long Beach, CA,
USA, June 16–20, 2019, pages 6720–6731.
Computer Vision Foundation / IEEE. https://
doi.org/10.1109/CVPR.2019.00688

Xiaohua Zhai, Xiao Wang, Basil Mustafa,
Andreas Steiner, Daniel Keysers, Alexander
Kolesnikov, and Lucas Beyer. 2022. LiT: Zero-
shot transfer with locked-image text tuning.
In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition,
pages 18123–18133. https://doi.org/10
.1109/CVPR52688.2022.01759
