Word Representation Learning in Multimodal Pre-Trained
Transformers: An Intrinsic Evaluation

Sandro Pezzelle, Ece Takmaz, Raquel Fernández
Institute for Logic, Language and Computation
University of Amsterdam, The Netherlands

{s.pezzelle|e.takmaz|raquel.fernandez}@uva.nl

Abstract

This study carries out a systematic intrinsic evaluation of the semantic representations learned by state-of-the-art pre-trained multimodal Transformers. These representations are claimed to be task-agnostic and shown to help on many downstream language-and-vision tasks. However, the extent to which they align with human semantic intuitions remains unclear. We experiment with various models and obtain static word representations from the contextualized ones they learn. We then evaluate them against the semantic judgments provided by human speakers. In line with previous evidence, we observe a generalized advantage of multimodal representations over language-only ones on concrete word pairs, but not on abstract ones. On the one hand, this confirms the effectiveness of these models to align language and vision, which results in better semantic representations for concepts that are grounded in images. On the other hand, models are shown to follow different representation learning patterns, which sheds some light on how and when they perform multimodal integration.

1 Introduction

Increasing evidence indicates that the meaning of words is multimodal: Human concepts are grounded in our senses (Barsalou, 2008; De Vega et al., 2012), and the sensory-motor experiences humans have with the world play an important role in determining word meaning (Meteyard et al., 2012). Since (at least) the first operationalizations of the distributional hypothesis, however, standard NLP approaches to derive meaning representations of words have solely relied on information extracted from large text corpora, based on the generalized assumption that the meaning of a word can be inferred from the effects it has on its linguistic context (Harris, 1954; Firth, 1957).

Language-only semantic representations, from pioneering 'count' vectors (Landauer and Dumais, 1997; Turney and Pantel, 2010; Pennington et al., 2014) to either static (Mikolov et al., 2013) or contextualized (Peters et al., 2018; Devlin et al., 2019) neural network-based embeddings, have proven extremely effective in many linguistic tasks and applications, for which they constantly increased state-of-the-art performance. However, they naturally have no connection with the real-world referents they denote (Baroni, 2016). As such, they suffer from the symbol grounding problem (Harnad, 1990), which in turn limits their cognitive plausibility (Rotaru and Vigliocco, 2020).

To overcome this limitation, several methods have been proposed to equip language-only representations with information from concurrent modalities, particularly vision. Until not long ago, the standard approach aimed to leverage the complementary information conveyed by language and vision—for example, that bananas are yellow (vision) and rich in potassium (language)—by building richer multimodal representations (Beinborn et al., 2018). Overall, these representations have proved advantageous over purely textual ones in a wide range of tasks and evaluations, including the approximation of human semantic similarity/relatedness judgments provided by benchmarks like SimLex999 (Hill et al., 2015) or MEN (Bruni et al., 2014). This was taken as evidence that leveraging multimodal information leads to more human-like, full-fledged semantic representations of words (Baroni, 2016).

More recently, the advent of Transformer-based pre-trained models such as BERT (Devlin et al., 2019) has favored the development of a plethora of multimodal models (Li et al., 2019; Tan and Bansal, 2019; Lu et al., 2019; Chen et al., 2020; Tan and Bansal, 2020) aimed to solve downstream language and vision tasks such as Visual Question
Answering (Antol et al., 2015) and Visual Dialogue (De Vries et al., 2017; Das et al., 2017). Similarly to the revolution brought about by Transformer-based language-only models to NLP (see Tenney et al., 2019), these systems have rewritten the recent history of research on language and vision by setting new state-of-the-art results on most of the tasks. Moreover, similarly to their language-only counterparts, these systems have been claimed to produce all-purpose, 'task-agnostic' representations ready-made for any task. While there has been quite a lot of interest in understanding the inner mechanisms of BERT-like models (see the interpretability line of research referred to as BERTology; Rogers et al., 2020) and the nature of their representations (Mickus et al., 2020; Westera and Boleda, 2019), comparably less attention has been paid to analyzing the multimodal equivalents of these models. In particular, no work has explicitly investigated how the representations learned by these models compare to those by their language-only counterparts, which were recently shown to outperform standard static representations in approximating people's semantic intuitions (Bommasani et al., 2020).

In this work, we therefore focus on the representations learned by state-of-the-art multimodal pre-trained models, and explore whether, and to what extent, leveraging visual information makes them closer to human representations than those produced by BERT. Following the approach proposed by Bommasani et al. (2020), we derive static representations from the contextualized ones produced by these Transformer-based models. We then analyze the quality of such representations by means of the standard intrinsic evaluation based on correlation with human similarity judgments.

We evaluate LXMERT (Tan and Bansal, 2019), UNITER (Chen et al., 2020), ViLBERT (Lu et al., 2019), VisualBERT (Li et al., 2019), and Vokenization (Tan and Bansal, 2020) on five human judgment benchmarks1 and show that: (1) in line with previous work, multimodal models outperform purely textual ones in the representation of concrete, but not abstract, words; (2) representations by Vokenization stand out as the overall best-performing multimodal ones; and (3) multimodal models differ with respect to how and when they integrate information from language and vision, as revealed by their learning patterns across layers.

1 Data and code can be found at https://github.com/sandropezzelle/multimodal-evaluation.

2 Related Work

2.1 Evaluating Language Representations

Evaluating the intrinsic quality of learned semantic representations has been one of the main, long-standing goals of NLP (for a recent overview of the problem and the proposed approaches, see Navigli and Martelli, 2019; Taieb et al., 2020). In contrast to extrinsic evaluations that measure the effectiveness of task-specific representations in performing downstream NLU tasks (e.g., those contained in the GLUE benchmark; Wang et al., 2019), the former approach tests whether, and to what extent, task-agnostic semantic representations (i.e., not learned nor fine-tuned to be effective on some specific tasks) align with those by human speakers. This is typically done by measuring the correlation between the similarities computed on system representations and the semantic similarity judgments provided by humans, a natural testbed for distributional semantic models (Landauer and Dumais, 1997). Lastra-Díaz et al. (2019) provide a recent, comprehensive survey on methods, benchmarks, and results.

In the era of Transformers, recent work has ex-
plored the relationship between the contextualized
representations learned by these models and the
static ones learned by distributional semantic models
(DSMs). On a formal level, some work has ar-
gued that this relation is not straightforward since
only context-invariant—but not contextualized—
representations may adequately account for ex-
pression meaning (Westera and Boleda, 2019).
In parallel, Mickus et al. (2020) focused on
BERT and explored to what extent the seman-
tic space learned by this model is comparable to
that by DSMs. Though an overall similarity was
reported, BERT's next-sentence-prediction objec-
tive was shown to partly obfuscate this relation.
A more direct exploration of the intrinsic seman-
tic quality of BERT representations was carried
out by Bommasani et al. (2020). In their work,
BERT’s contextualized representations were first
turned into static ones by means of simple meth-
ods (see Section 4.2) and then evaluated against
several similarity benchmarks. These representa-
tions were shown to outperform traditional ones,
which revealed that pooling over many contexts
improves embeddings’ representational quality.


最近, Ilharco et al. (2021) probed the represen-
tations learned by purely textual language models
in their ability to perform language grounding.
Though far from human performance, they were
shown to learn nontrivial mappings to vision.

2.2 Evaluating Multimodal Representations

Since the early days of DSMs, many approaches have been proposed to enrich language-only representations with information from images. Bruni et al. (2012, 2014) equipped textual representations with low-level visual features, and reported an advantage over language-only representations in terms of correlation with human judgments. An analogous pattern of results was obtained by Kiela and Bottou (2014) and Kiela et al. (2016), who concatenated visual features obtained with convolutional neural networks (CNNs) with skip-gram linguistic representations. Lazaridou et al. (2015) further improved over these techniques by means of a model trained to optimize the similarity of words with their visual representations, an approach similar to that by Silberer and Lapata (2014). Extensions of these latter methods include the model by Zablocki et al. (2018), which leverages information about the visual context in which objects appear; and Wang et al. (2018), where three dynamic fusion methods were proposed to learn to assign importance weights to each modality. More recently, some work has explored the quality of representations learned from images only (Lüddecke et al., 2019) or by combining language, vision, and emojis (Rotaru and Vigliocco, 2020). In parallel, new evaluation methods based, for example, on decoding brain activity (Davis et al., 2019) or success on tasks such as image retrieval (Kottur et al., 2016) have been proposed. This mass of studies has overall demonstrated the effectiveness of multimodal representations in approximating human semantic intuitions better than purely textual ones. However, this advantage has been typically reported for concrete, but not abstract, concepts (Hill and Korhonen, 2014).

In recent years, the revolution brought about
by Transformer-based multimodal models has
fostered research that sheds light on their inner
workings. One approach has been to use probing
tasks: Cao et al. (2020) focused on LXMERT
and UNITER and systematically compared the
two models with respect to, for example, the de-
gree of integration of the two modalities at each

layer or the role of various attention heads (for
a similar analysis on VisualBERT, see Li et al.,
2020). Using two tasks (image-sentence verifica-
tion and counting) as testbeds, Parcalabescu et al.
(2021) highlighted capabilities and limitations of
various pre-trained models to integrate modalities
or handle dataset biases. Another line of work
has explored the impact of various experimental
choices, such as pre-training tasks and data, loss
functions and hyperparameters, on the performance
of pre-trained multimodal models (Singh et al.,
2020; Hendricks et al., 2021). Since all of these
aspects have proven to be crucial for these mod-
els, Bugliarello et al. (2021) proposed VOLTA,
a unified framework to pre-train and evaluate
Transformer-based models with the same data,
tasks and visual features.

Despite the renewed interest in multimodal models, to the best of our knowledge no work has explored, to date, the intrinsic quality of the task-agnostic representations built by various pre-trained Transformer-based models. In this work, we tackle this problem for the first time.

3 Data

We aim to evaluate how the similarities between the representations learned by pre-trained multimodal Transformers align with the similarity judgments by human speakers, and how these representations compare to those by textual Transformers such as BERT. To do so, we need data that (1) is multimodal, that is, where some text (language) is paired with a corresponding image (vision), and (2) includes most of the words making up the word pairs for which human semantic judgments are available. In the following, we describe the semantic benchmarks used for evaluation and the construction of our multimodal dataset.

3.1 Semantic Benchmarks

We experiment with five human judgment benchmarks used for intrinsic semantic evaluation in both language-only and multimodal work: RG65 (Rubenstein and Goodenough, 1965), WordSim-353 (Finkelstein et al., 2002), SimLex999 (Hill et al., 2015), MEN (Bruni et al., 2014), and SimVerb3500 (Gerz et al., 2016). These benchmarks have a comparable format, namely, they contain N ⟨w1, w2, score⟩ samples, where w1 and w2 are two distinct words, and score is a bounded value—that we normalize to range in [0, 1]—which stands for the degree of semantic similarity or relatedness between w1 and w2: the higher the value, the more similar the pair.


                                  --------------- original ---------------    ------------- found in VICO -------------
benchmark      rel.  PoS          # pairs    # W     concr. (#)                # pairs (%)     # W (%)        concr. (#)
RG65           S     N            65         48      4.37 (65)                 65 (100%)       48 (100%)      4.37 (65)
WordSim353     R     N, V, Adj    353        437     3.82 (331)                306 (86.7%)     384 (87.9%)    3.91 (300)
SimLex999      S     N, V, Adj    999        1028    3.61 (999)                957 (95.8%)     994 (99.5%)    3.65 (957)
MEN            R     N, V, Adj    3000       752     4.41 (2954)               2976 (99.2%)    750 (99.7%)    4.41 (2930)
SimVerb3500    S     V            3500       827     3.08 (3487)               2890 (82.6%)    729 (88.2%)    3.14 (2890)
total                             7917       2453                              7194 (90.9%)    2278 (92.9%)

Table 1: Statistics of the benchmarks before (original) and after (found in VICO) filtering them based on VICO: rel. refers to the type of semantic relation, i.e., (S)imilarity or (R)elatedness; # W to the number of unique words present; concr. to average concreteness of the pairs (in brackets, # found in Brysbaert et al., 2014). Within found in VICO, percentages in brackets refer to the coverage compared to original.

At the same time, these benchmarks differ in several respects, namely, (1) the type of semantic relation they capture (i.e., similarity or relatedness); (2) the parts-of-speech (PoS) they include; (3) the number of pairs they contain; (4) the size of their vocabulary (i.e., the number of unique words present); and (5) the words' degree of concreteness, which previous work found to be particularly relevant for evaluating the performance of multimodal representations (see Section 2.2). We report descriptive statistics of all these relevant features in Table 1 (original section). For concreteness, we report a single score for each benchmark: the higher, the more concrete. We obtained this score (1) by taking, for each word, the corresponding 5-point human rating collected by Brysbaert et al. (2014);2 (2) by computing the average concreteness of each pair; and (3) by averaging over the entire benchmark.
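As an illustration of this format and of the concreteness statistic in Table 1, the sketch below uses assumed variable names (it is not the authors' code): `pairs` is a list of (w1, w2, score) tuples, `scale_max` the upper bound of a benchmark's rating scale, and `concreteness` a mapping from words to the 1-5 ratings of Brysbaert et al. (2014); the exact score normalization used in the paper is not spelled out here.

```python
def normalize(pairs, scale_max):
    """Rescale bounded human judgments to [0, 1] (one possible normalization)."""
    return [(w1, w2, score / scale_max) for w1, w2, score in pairs]

def benchmark_concreteness(pairs, concreteness):
    """Average pair-level concreteness over the pairs covered by the ratings."""
    covered = [(w1, w2) for w1, w2, _ in pairs
               if w1 in concreteness and w2 in concreteness]
    pair_scores = [(concreteness[w1] + concreteness[w2]) / 2 for w1, w2 in covered]
    return sum(pair_scores) / len(pair_scores), len(covered)
```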

3.2 Dataset

Previous work evaluating the intrinsic quality of multimodal representations has faced the issue of limited vocabulary coverage in the datasets used. As a result, only a subset of the tested benchmarks has often been evaluated (e.g., 29% of word pairs in SimLex999 and 42% in MEN, reported by Lazaridou et al., 2015). To overcome this issue, we jointly consider two large multimodal datasets: Common Objects in Contexts (COCO; Lin et al., 2014) and Visual Storytelling (VIST; Huang et al., 2016). The former contains samples where a natural image is paired with a free-form, crowdsourced description (or caption) of its visual content. The latter contains samples where a natural image is paired with both a description of its visual content (DII, Descriptions of Images in Isolation) and a fragment of a story invented based on a sequence of five images to which the target image belongs (SIS, Stories of Images in Sequences). Both DII and SIS contain crowdsourced, free-form text. In particular, we consider the entire COCO 2017 data (concatenation of train and val splits), which consists of 616,767 ⟨image, description⟩ samples. As for VIST, we consider the train, val, and test splits of both DII and SIS, which sum up to 401,600 ⟨image, description/story⟩ samples.

2 Participants were instructed that concrete words refer to things/actions that can be experienced through our senses, while meanings of abstract words are defined by other words.

By concatenating VIST and COCO, we obtain a dataset containing 1,018,367 ⟨image, sentence⟩ samples, that we henceforth refer to as VICO. Thanks to the variety of images and, in particular, the types of text it contains, the concatenated dataset proves to be very rich in terms of lexicon, an essential desideratum for having broad coverage of the word pairs in the semantic benchmarks. We investigate this by considering all the 7917 word pairs making up the benchmarks and checking, for each pair, whether both of its words are present at least once in VICO. We find 7194 pairs made up of 2278 unique words. As can be seen in Table 1 (found in VICO section), this is equivalent to around 91% of total pairs found (min. 83%, max. 100%), with an overall vocabulary coverage of around 93% (min. 88%, max. 100%). This is reflected in a pattern of average concreteness scores that is essentially equivalent to original. Figure 1 reports this pattern in a boxplot.
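A minimal sketch of this coverage check (illustrative only; whitespace tokenization and lower-casing are assumptions, as the paper does not specify the preprocessing):

```python
def coverage(benchmark_pairs, vico_sentences):
    """benchmark_pairs: list of (w1, w2); vico_sentences: iterable of sentences."""
    vocab = set()
    for sentence in vico_sentences:
        vocab.update(sentence.lower().split())        # assumed tokenization
    kept_pairs = [(w1, w2) for w1, w2 in benchmark_pairs
                  if w1 in vocab and w2 in vocab]
    kept_words = {w for pair in kept_pairs for w in pair}
    return kept_pairs, kept_words                     # e.g., 7194 pairs, 2278 words
```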


Figure 1: Concreteness of word pairs found in VICO. Concreteness ranges from 1 to 5. The horizontal line shows the median; the marker, the mean.

         # samples    # imgs    sent. L    # W     # WP
COCO     50452        39767     13.4       1988    6076
VIST     63256        40528     14.7       2250    7122
total    113708       80295     14.7       2278    7194

Table 2: Dataset statistics. # imgs: number of unique images; sent. L: average sentence length; # W, # WP: resp., number of words and word pairs.

Since experimenting with more than 1 million ⟨image, sentence⟩ samples turns out to be computationally highly demanding, for efficiency reasons we extract a subset of VICO such that: (1) all 2278 words found in VICO (hence, the vocabulary) and the corresponding 7194 pairs are present at least once among its sentences; (2) its size is around an order of magnitude smaller than VICO; (3) it preserves the word frequency distribution observed in VICO. We obtain a subcorpus including 113,708 unique ⟨image, sentence⟩ samples, that is, around 11% of the whole VICO. Since all the experiments reported in the paper are performed on this subset, from now on we will simply refer to it as our dataset. Some of its descriptive statistics are reported in Table 2.3 Interestingly, VIST samples contain more vocabulary words compared to COCO (2250 vs. 1988 words), which is reflected in higher coverage of word pairs (7122 vs. 6076).

3 The average frequency of our vocabulary words is 171 (min. 1, max. 8440). 61 words (3%) have frequency 1.
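The paper does not detail the exact subsampling procedure; one simple way to obtain a subset with these three properties is sketched below (hypothetical helper, illustrative only): keep enough samples to cover every vocabulary word at least once, then top up with random samples, which preserves the frequency distribution in expectation.

```python
import random

def extract_subset(samples, vocab, target_size, seed=0):
    """samples: list of (image_id, sentence); vocab: words to cover at least once."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    covered, subset, leftover = set(), [], []
    for image_id, sentence in shuffled:
        words = set(sentence.lower().split()) & vocab
        if words - covered:                  # sample adds at least one new word
            subset.append((image_id, sentence))
            covered |= words
        else:
            leftover.append((image_id, sentence))
    subset += leftover[:max(0, target_size - len(subset))]
    return subset
```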

4 Experiments

In our experiments, we build representations for each of the words included in our semantic benchmarks by means of various language-only and multimodal models (Section 4.1). In all cases, representations are extracted from the samples included in our dataset. In the language-only models, representations are built based on only the sentence; in the multimodal models, based on the sentence and its corresponding image (or, for Vokenization, just the sentence but with visual supervision in pre-training, as explained later). Since representations by most of the tested models are contextualized, we make them static by means of an aggregation method (Section 4.2). In the evaluation (Section 4.3), we test the ability of these representations to approximate human semantic judgments.

4.1 Models

Language-Only Models We experiment with one distributional semantic model producing static representations, namely, GloVe (Pennington et al., 2014), and one producing contextualized representations, namely, the pre-trained Transformer-based BERT (Devlin et al., 2019). For GloVe, following Bommasani et al. (2020) we use its 300-d word representations pre-trained on 6B tokens from Wikipedia 2014 and Gigaword 5.4 As for BERT, we experiment with its standard 12-layer version (BERT-base).5 This is the model serving as the backbone of all the multimodal models we test, which allows for direct comparison.

4 http://nlp.stanford.edu/data/glove.6B.zip.
5 We adapt the code from: https://github.com/rishibommasani/Contextual2Static.

Multimodal Models We experiment with five pre-trained Transformer-based multimodal models. Four of them are both pre-trained and evaluated using multimodal data, that is, they produce representations based on a sentence and an image (Language and Vision; LV) at both training and inference time: LXMERT (Tan and Bansal, 2019), UNITER (Chen et al., 2020), ViLBERT (Lu et al., 2019), and VisualBERT (Li et al., 2019). One of them, in contrast, is visually supervised during training, but only takes Language as input during inference: Vokenization (Tan and Bansal, 2020). All five models are similar in three main respects: (1) they have BERT as their backbone; (2) they produce contextualized representations; and (3) they have multiple layers from which such representations can be extracted.


As for the LV models, we use reimplementations by the VOLTA framework (Bugliarello et al., 2021).6 This has several advantages since all the models: (1) are initialized with BERT weights;7 (2) use the same visual features, namely, 36 regions of interest extracted by Faster R-CNN with a ResNet-101 backbone (Anderson et al., 2018);8 and (3) are pre-trained in a controlled setting using the same exact data (Conceptual Captions; Sharma et al., 2018), tasks, and objectives, that is, Masked Language Model (MLM), masked object classification with KL-divergence, and image-text matching (ITM), a binary classification problem to predict whether an image and text pair match. This makes the four LV models directly comparable to each other, with no confounds. Most importantly, each model is reimplemented as a particular instance of a unified mathematical framework based on the innovative gated bimodal Transformer layer. This general layer can be used to model both intra-modal and inter-modal interactions, which makes it suitable to reimplement both single-stream models (where language and vision are jointly processed by a single encoder; UNITER, VisualBERT) and dual-stream models (where the two modalities are first processed separately and then integrated; LXMERT, ViLBERT).

As for Vokenization, we use the original implementation by Tan and Bansal (2020). The model is essentially a visually supervised language model which, during training, extracts multimodal alignments to language-only data by contextually mapping words to images. Compared to LV models where alignment between language and vision is performed at the ⟨sentence, image⟩ level, in Vokenization the mapping is done at the token level (the image is named voken). It is worth mentioning that Vokenization is pre-trained with less textual data compared to the standard BERT, the model used to initialize all LV architectures. For comparison, in Table 3 we report the tasks and data used to pre-train each of the tested models. None of the tested LV models were pre-trained with data present in our dataset. For Vokenization, we cannot exclude that some COCO samples of our dataset were also used in the TIM task.

6 https://github.com/e-bug/volta.
7 Including LXMERT, which was initialized from scratch in its original implementation.
8 Our code to extract visual features for all our images is adapted from: https://github.com/airsplay/py-bottom-up-attention/blob/master/demo/demo_feature_extraction_attr.ipynb.

        pre-training task(s)                           pre-training data
GloVe   Unsupervised vector learning                   Wikipedia 2014 + Gigaword 5
BERT    Masked Language Model (MLM)                    English Wikipedia + BooksCorpus
        + Next Sentence Prediction (NSP)
LV*     Masked Language Model (MLM)                    Conceptual Captions
        + Masked Object Classification KL
        + Image-Text Matching (ITM)
Vok.    Token-Image Matching (TIM)*                    COCO + Visual Genome
        Masked Language Model (MLM)                    English Wikipedia + Wiki103

Table 3: Overview of tasks and data used to pre-train models. LV refers to the 4 multimodal models; Vok. to Vokenization. *Initialized with BERT.

4.2 Aggregation Method

With the exception of GloVe, all our tested models build contextualized representations. We closely follow the method proposed by Bommasani et al. (2020) and, for each word in our benchmarks, we compute a single, static context-agnostic representation using the samples included in our dataset. This involves two steps: (1) subword pooling and (2) context combination. A schematic illustration of both these steps is shown in Figure 2, where we exemplify the general architecture of a LV model. Subword pooling is the operation by which we construct a word representation from the tokens produced by the BERT tokenizer. Since, during tokenization, some words (e.g., 'donut') get decomposed into N subwords (don, ##ut), we apply a function to combine the representations s1, . . . , sN produced for each subword token tk, . . . , tk+N−1 into a contextualized word-level representation wc. We take the corresponding model hidden states as our representations, and use arithmetic mean as the combination function, following Bommasani et al. (2020):

    wc = mean(s1, . . . , sN)    (1)

For each word, we then compute a static representation from its contextualized representations. This is done via context combination, where we aggregate the contextualized representations wc1, . . . , wcM of the same word found in M sentences (contexts). We obtain a single static representation for the word, again using arithmetic mean:

    w = mean(wc1, . . . , wcM)    (2)
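A minimal sketch of Eqs. (1) and (2), assuming the per-subword hidden states of one layer have already been extracted for every occurrence (context) of the target word:

```python
import torch

def static_embedding(occurrences):
    """occurrences: list of tensors, one per context, each of shape (n_subwords, 768)."""
    # Eq. (1): subword pooling -> one contextualized vector w_c per occurrence
    contextualized = [states.mean(dim=0) for states in occurrences]
    # Eq. (2): context combination -> a single static vector w for the word
    return torch.stack(contextualized).mean(dim=0)
```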


Figure 2: A schematic illustration of our method to obtain static representations of words, e.g., 'donut'.

Thus, for each of the 2278 words in our vocabulary we obtain a single static 768-d representation w. The operations of aggregating subwords (Eq. 1) and contextualized representations (Eq. 2) are carried out for each layer of each tested model.

For BERT, we consider 13 layers, from 0 (the input embedding layer) to 12. For Vokenization, we consider its 12 layers, from 1 to 12. For LV models, we consider the part of VOLTA's gated bimodal layer processing the language input, and extract activations from each of the feed-forward layers following a multi-head attention block. In LXMERT, there are 5 such layers: 21, 24, 27, 30, 33; in both UNITER and VisualBERT, 12 layers: 2, 4, . . . , 24; in ViLBERT, 12 layers: 14, 16, . . . , 36.9 Representations are obtained by running the best snapshot of each pre-trained model10 on our samples in evaluation mode, i.e., without fine-tuning nor updating the model's weights.
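For the language-only case, the per-layer hidden states can be extracted with the HuggingFace transformers library as sketched below; this is only an illustration of the procedure for BERT (the multimodal models are run through their respective implementations, e.g., VOLTA), not the authors' exact pipeline.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()                                  # evaluation mode, no weight updates

sentence = "There is a decorated cake with zebra and giraffe print."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# 13 tensors of shape (1, seq_len, 768): embedding layer 0 plus layers 1-12
hidden_states = outputs.hidden_states
```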

4.3 评估

To evaluate the intrinsic quality of the obtained representations, we compare the semantic space of each model with that of human speakers. For each word pair in each semantic benchmark described in Section 3.1, we compute the cosine similarity between the representations of the words in the pair:

    similarity = cosine(w1, w2)    (3)

For each benchmark, we then compute Spearman's rank ρ correlation between the similarities obtained by the model and the normalized human judgments: The higher the correlation, the more aligned the two semantic spaces are.
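A compact sketch of this evaluation step (Eq. (3) plus the rank correlation), assuming the static embeddings are stored in a dictionary; scipy's spearmanr is one way to compute the correlation:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate(embeddings, pairs):
    """embeddings: dict word -> vector; pairs: list of (w1, w2, human_score)."""
    model_sims = [cosine(embeddings[w1], embeddings[w2]) for w1, w2, _ in pairs]
    human_sims = [score for _, _, score in pairs]
    rho, _ = spearmanr(model_sims, human_sims)    # Spearman's rank correlation
    return rho
```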

9For reproducibility reasons, we report VOLTA’s indexes.
10 https://github.com/e-bug/volta/blob/main/MODELS.md.

5 Results

In Table 4, we report the best results obtained by all tested models on the five benchmarks. In brackets we report the number of the model layer.

Language-Only Models We notice that BERT evaluated on our dataset (hence, just BERT) systematically outperforms GloVe. This is in line with Bommasani et al. (2020), and replicates their findings that, as compared to standard static embeddings, averaging over contextualized representations by Transformer-based models is a valuable method for obtaining semantic representations that are more aligned to those of humans.

It is interesting to note, moreover, that the results we obtain with BERT actually outperform the best results reported by Bommasani et al. (2020) using the same model on 1M Wikipedia contexts (BERT-1M-Wiki). This is intriguing since it suggests that building representations using a dataset of visually grounded language, as we do, is not detrimental to the representational power of the resulting embeddings. Since this comparison is partially unfair due to the different methods employed in selecting language contexts, we also obtain results on a subset of Wikipedia that we extract using the method described for VICO (see Section 3.2),11 and which is directly comparable to our dataset. As can be seen, representations built on this subset of Wikipedia (BERT-Wiki ours) turn out to perform better than those by BERT for WordSim353, SimLex999, and SimVerb3500 (the least concrete benchmarks—see Table 1); worse for RG65 and MEN (the most concrete ones). This pattern of results indicates that visually grounded language is different from encyclopedic language, which in turn has an impact on the resulting representations.


11This subset contains 127,246 unique sentences.


Model             input    Spearman ρ correlation (layer)
                           RG65           WS353          SL999          MEN            SVERB
BERT-1M-Wiki*     L        0.7242 (1)     0.7048 (1)     0.5134 (3)     –              0.3948 (4)
BERT-Wiki ours    L        0.8107 (1)     0.7262 (1)     0.5213 (0)     0.7176 (2)     0.4039 (4)
GloVe             L        0.7693         0.6097         0.3884         0.7296         0.2183
BERT              L        0.8124 (2)     0.7096 (1)     0.5191 (0)     0.7368 (2)     0.4027 (3)
LXMERT            LV       0.7821 (27)    0.6000 (27)    0.4438 (21)    0.7417 (33)    0.2443 (21)
UNITER            LV       0.7679 (18)    0.6813 (2)     0.4843 (2)     0.7483 (20)    0.3926 (10)
ViLBERT           LV       0.7927 (20)    0.6204 (14)    0.4729 (16)    0.7714 (26)    0.3875 (14)
VisualBERT        LV       0.7592 (2)     0.6778 (2)     0.4797 (4)     0.7512 (20)    0.3833 (10)
Vokenization      LV       0.8456 (9)     0.6818 (3)     0.4881 (9)     0.8068 (10)    0.3439 (9)

Table 4: Spearman's rank ρ correlation between similarities computed with representations by all tested models and human similarity judgments in the five evaluation benchmarks: the higher the better. Results in bold are the highest in the column among models run on our dataset (that is, all but the top 2 models). Results underlined are the highest among LV models. *Original results from Bommasani et al. (2020).

Figure 3: Highest ρ by each model on WordSim353, SimLex999 and MEN. Each barplot reports results on both the whole benchmark (Table 4) and the most concrete subset of it (Table 5). Best viewed in color.

Multimodal Models Turning to multimodal models, we observe that they outperform BERT on two benchmarks, RG65 and MEN. Although Vokenization is found to be the best-performing architecture on both of them, all multimodal models surpass BERT on MEN (see rightmost panel of Figure 3; dark blue bars). In contrast, no multimodal model outperforms or is on par with BERT on the other three benchmarks (Figure 3 shows the results on WordSim353 and SimLex999). This indicates that multimodal models have an advantage on benchmarks containing more concrete word pairs (recall that MEN and RG65 are the overall most concrete benchmarks; see Table 1); in contrast, leveraging visual information appears to be detrimental for more abstract word pairs, a pattern that is very much in line with what was reported for previous multimodal models (Bruni et al., 2014; Hill and Korhonen, 2014). Among multimodal models, Vokenization stands out as the overall best-performing model. This indicates that grounding a masked language model is an effective way to obtain semantic representations that are intrinsically good, as well as being effective in downstream NLU tasks (Tan and Bansal, 2020). Among the models using an actual visual input (LV), ViLBERT turns out to be best-performing on high-concreteness benchmarks, while UNITER is the best model on more abstract benchmarks. This pattern could be due to the different embedding layers of these models, which are shown to play an important role (Bugliarello et al., 2021).

Concreteness Our results show a generalized advantage of multimodal models on more concrete benchmarks. This seems to indicate that visual information is beneficial for representing concrete words. However, it might still be that models are just better at representing the specific words contained in these benchmarks. To further investigate this point, for each benchmark we extract the subset of pairs where both words have concreteness ≥4 out of 5 in Brysbaert et al. (2014).

Model           input    concr.    Spearman ρ correlation (layer)
                                   RG65           WS353          SL999          MEN            SVERB
BERT            L        ≥ 4       0.8321 (2)     0.6138 (1)     0.4864 (0)     0.7368 (2)     0.1354 (3)
LXMERT          LV       ≥ 4       0.8648 (27)    0.6606 (27)    0.5749 (21)    0.7862 (33)    0.1098 (21)
UNITER          LV       ≥ 4       0.8148 (18)    0.5943 (2)     0.4975 (2)     0.7755 (20)    0.1215 (10)
ViLBERT         LV       ≥ 4       0.8374 (20)    0.5558 (14)    0.5534 (16)    0.7910 (26)    0.1529 (14)
VisualBERT      LV       ≥ 4       0.8269 (2)     0.6043 (2)     0.4971 (4)     0.7727 (20)    0.1310 (10)
Vokenization    LV       ≥ 4       0.8708 (9)     0.6133 (3)     0.5051 (9)     0.8150 (10)    0.1390 (9)
# pairs (%)                        44 (68%)       121 (40%)      396 (41%)      1917 (65%)     210 (7%)

Table 5: Correlation on the concrete subsets (concr. ≥ 4) of the five evaluation benchmarks. Results in bold are the highest in the column. Results underlined are the highest among LV multimodal models.

For each model, we consider the results by the layer which is best-performing on the whole benchmark. Table 5 reports the results of this analysis, along with the number (and %) of word pairs considered. For all benchmarks, there is always at least one multimodal model that outperforms BERT. This pattern is crucially different from that observed in Table 4, and confirms that multimodal models are better than language-only ones at representing concrete words, regardless of their PoS. Zooming into the results, we note that Vokenization still outperforms other multimodal models on both RG65 and MEN (see rightmost panel of Figure 3; light blue bars), while LXMERT turns out to be the best-performing model on both WordSim-353 and SimLex999 (see left and middle panels of Figure 3; light blue bars). These results suggest that this model is particularly effective in representing highly concrete words, but fails with abstract ones, which could cause the overall low correlations in the full benchmarks (Table 4). ViLBERT obtains the best results on SimVerb-3500, thus confirming the good performance of this model in representing verbs/actions seen also in Table 4. However, the low correlation that all models achieve on this subset indicates that they all struggle to represent the meaning of verbs that are deemed very concrete. This finding appears to be in line with the generalized difficulty in representing verbs reported by Hendricks and Nematzadeh (2021). Further work is needed to explore this issue.

6 Analysis

We perform analyses aimed at shedding light on commonalities and differences between the various models. In particular, we explore how model performance evolves through layers (Section 6.1), and how various models compare to humans at the level of specific word pairs (Section 6.2).

6.1 Layers

Table 4 reports the results by the best-performing layer of each model. Figure 4 complements these numbers by showing, for each model, how performance changes across various layers. For BERT, Bommasani et al. (2020) found an advantage of earlier layers in approximating human semantic judgments. We observe the same exact pattern, with earlier layers (0–3) achieving the best correlation scores on all benchmarks and later layers experiencing a significant drop in performance. As for multimodal models, previous work (Cao et al., 2020) experimenting with UNITER and LXMERT revealed rather different patterns between the two architectures. For the former, a higher degree of integration between language and vision was reported in later layers; as for the latter, such integration appeared to be in place from the very first multimodal layer. Cao et al. (2020) hypothesized this pattern to be representative of the different behaviors exhibited by single-stream (UNITER, VisualBERT) vs. dual-stream (LXMERT, ViLBERT) models. If a higher degree of integration between modalities leads to better semantic representations, we should observe an advantage of later layers in UNITER and VisualBERT, but not in LXMERT and ViLBERT. In particular, we expect this to be the case for benchmarks where the visual modality plays a bigger role, i.e., the more concrete RG65 and MEN.

As can be seen in Figure 4, LXMERT exhibits a rather flat pattern of results, which overall confirms the observation that, in this dual-stream model, integration of language and vision is in place from the very first multimodal layer.


Figure 4: Correlation by all tested models on the five benchmarks across layers. Best viewed in color.

Conversely, we notice that single-stream UNITER achieves the best correlation on RG65 and MEN towards the end of its pipeline (at layers 18 and 20, respectively), which supports the hypothesis that later representations are more multimodal. The distinction between single- and dual-stream models appears less clear-cut in the other two architectures (not explored by Cao et al., 2020). Though ViLBERT (dual-stream) achieves generally good results in earlier layers, the best correlation on RG65 and MEN is reached in middle layers. As for VisualBERT (single-stream), consistently with the expected pattern the best correlation on MEN is achieved at one of the latest layers; however, the best correlation on RG65 is reached at the very first multimodal layer. Overall, our results mirror the observations by Cao et al. (2020) for LXMERT and UNITER. However, the somewhat mixed pattern observed for the other models suggests more complex interactions between the two modalities. As for Vokenization, there is a performance drop at the last two layers, but otherwise its performance constantly increases through the layers and reaches the highest peaks toward the end.

Taken together, the results of this analysis confirm that various models differ with respect to how they represent and process the inputs and to how and when they perform multimodal integration.

6.2 Pair-Level Analysis

Correlation results are not informative about (1) which word pairs are more or less challenging for the models, nor about (2) how various models compare to each other in dealing with specific word pairs. Intuitively, this could be tested by comparing the raw similarity values output by a given model to both human judgments and scores by other models. However, this turns out not to be sound in practice due to the different ranges of values produced. For example, some models output generally low cosine values, while others produce generally high scores,12 which reflects differences in the density of the semantic spaces they learn. To compare similarities more fairly, for each model we consider the entire distribution of cosine values obtained in a given benchmark, rank it in descending order (from highest to lowest similarity values) and split it in five equally-sized bins, that we label highest, high, medium, low, lowest. We do the same for human similarity scores. Then, for each word pair, we check whether it is assigned the same similarity 'class' by both humans and the model. We focus on the three overall best-performing models, namely, BERT (L), ViLBERT (LV), and Vokenization (LV).

12Differences also emerge between various model layers.
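The five-bin 'similarity class' assignment just described can be sketched as follows (assumed helper names, not the authors' code):

```python
import numpy as np

LABELS = ["highest", "high", "medium", "low", "lowest"]

def to_classes(values):
    """Split a list of similarity values into five equally-sized, ranked bins."""
    order = np.argsort(values)[::-1]              # indices from highest to lowest
    classes = np.empty(len(values), dtype=object)
    for label, chunk in zip(LABELS, np.array_split(order, 5)):
        classes[chunk] = label
    return classes

def aligned(model_sims, human_sims):
    """Proportion of pairs assigned the same class by model and humans."""
    return float(np.mean(to_classes(model_sims) == to_classes(human_sims)))
```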


benchmark    most similar pairs according to humans                       least similar pairs according to humans
RG65         gem, jewel; midday, noon; automobile, car;                   cord, smile; noon, string; rooster, voyage;
             cemetery, graveyard; cushion, pillow                         fruit, furnace; autograph, shore
WS353        coast, shore; money, cash; midday, noon;                     king, cabbage; chord, smile; noon, string;
             journey, voyage; dollar, buck                                professor, cucumber; rooster, voyage
SL999        stupid, dumb; creator, maker; vanish, disappear;             cliff, tail; container, mouse; ankle, window;
             quick, rapid; insane, crazy                                  new, ancient; shrink, grow
MEN          sun, sunlight; cat, kitten; automobile, car;                 feather, truck; angel, gasoline; bakery, zebra;
             river, water; stair, staircase                               bikini, pizza; muscle, tulip
SVERB        repair, fix; triumph, win; build, construct;                 drive, breed; die, grow; visit, giggle;
             flee, escape; rip, tear                                      miss, catch; shut, vomit

Table 6: Similarity assigned by BERT, ViLBERT, and Vokenization to most (left) and least (right) similar pairs according to humans. Dark green indicates highest assigned similarity; light green, high; yellow, medium; orange, low; red, lowest. Best viewed in color.


We perform a qualitative analysis by focusing on 5 pairs for each benchmark with the highest and lowest semantic similarity/relatedness according to humans. Table 6 reports the results of this analysis through colors. Dark green and red indicate alignment between humans and models on most similar and least similar pairs, respectively. At first glance, we notice a prevalence of dark green on the left section of the table, which lists 5 of the most similar pairs according to humans; a prevalence of red on the right section, which lists the least similar ones. This clearly indicates that the three models are overall effective in capturing similarities of words, mirroring the results reported in Table 4. Consistently, we notice that model representations are generally more aligned in some benchmarks compared to others: consider, for example, RG65 vs. SimLex999 or SimVerb3500. Moreover, some models appear to be more aligned than others in specific benchmarks: for example, in the highly concrete MEN, Vokenization is much more aligned than BERT on the least similar cases. In contrast, BERT is more aligned with humans than are multimodal models on the most similar pairs of SimLex999, to which ViLBERT (and, to a lesser extent, Vokenization) often assigns low and medium similarities.

                  RG65    WS353   SL999   MEN     SVERB
all
  BERT            0.52    0.39    0.38    0.41    0.31
  ViLBERT         0.49    0.37    0.35    0.43    0.30
  Vokenization    0.60    0.39    0.35    0.45    0.29
similar
  BERT            0.62    0.45    0.41    0.44    0.33
  ViLBERT         0.50    0.43    0.38    0.47    0.31
  Vokenization    0.73    0.48    0.38    0.46    0.29
dissimilar
  BERT            0.46    0.43    0.39    0.42    0.32
  ViLBERT         0.50    0.39    0.33    0.44    0.33
  Vokenization    0.54    0.41    0.36    0.48    0.31

Table 7: Proportion of aligned cases between humans and the models when considering all pairs in the benchmarks (all), their highest + high partition (similar), and lowest + low partition (dissimilar).

These qualitative observations are in line with the numbers reported in Table 7, which refer to the proportion of aligned cases between humans and the models within each benchmark. Interestingly, all models display

a comparable performance when dealing with semantically similar and dissimilar pairs; that is, none of the models is biased toward one or the other extreme of the similarity scale.

Figure 5: Four ⟨image, caption⟩ samples from our dataset (in brackets, we indicate the source: VIST/COCO). For the high-similarity SL999 pair creator, maker (top), multimodal models perform worse than BERT. An opposite pattern is observed for the low-similarity MEN pair bakery, zebra (bottom). The pair bakery, zebra is highly concrete, while creator, maker is not.

Some interesting observations can be made by zooming into some specific word pairs in Table 6: for example, creator, maker, one of the most similar pairs in SimLex999 (a pair with low concreteness), is assigned the highest class by BERT; low and medium by ViLBERT and Vokenization, respectively. This suggests that adding visual information has a negative impact on the representation of these words. As shown in Figure 5 (top), this could be due to the (visual) specialization of these two words in our dataset, where creator appears to be usually used to refer to a human agent, while maker typically refers to some machinery. This confirms that multimodal models effectively leverage visual information, which leads to rather dissimilar representations. Another interesting case is bakery, zebra, one of MEN's least similar pairs (and highly concrete), which is assigned to low and lowest by ViLBERT and Vokenization, respectively, to medium by BERT. In this case, adding visual information has a positive role toward moving one representation away from the other, which is in line with human intuitions. As for the relatively high similarity assigned by BERT to this pair, a manual inspection of the dataset reveals the presence of samples where the word zebra appears in bakery contexts; for example, "There is a decorated cake with zebra and giraffe print" or "A zebra and giraffe themed cake sits on a silver plate". We conjecture these co-occurrence patterns may play a role in the non-grounded representations of these words.

To provide a more quantitative analysis of contrasts across models, we compute the proportion of word pairs in each benchmark for which all / none of the 3 models assign the target similarity class; BERT assigns the target class, but neither of the multimodal models do (B only); both multimodal models are correct but BERT is not (MM only). We report the numbers in Table 8. It can be noted that, in MEN, the proportion of MM only cases is higher compared to B only; that is, visual information helps more than harms in this benchmark. An opposite pattern is observed for, as an example, SimLex999.

           RG65    WS353   SL999   MEN     SVERB
all        0.31    0.20    0.18    0.19    0.13
none       0.22    0.43    0.44    0.32    0.51
B only     0.08    0.06    0.10    0.08    0.08
MM only    0.08    0.05    0.05    0.09    0.05

Table 8: Proportion of pairs where all / none of the 3 models or only either BERT (B only) or multimodal (MM only) models are aligned to humans.
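Given per-pair class assignments as in Section 6.2, the breakdown in Table 8 can be computed along these lines (illustrative sketch with assumed variable names):

```python
def breakdown(model_classes, human_classes):
    """model_classes: dict model name -> list of classes; human_classes: list."""
    hits = {name: [c == h for c, h in zip(cls, human_classes)]
            for name, cls in model_classes.items()}
    b, v, k = hits["BERT"], hits["ViLBERT"], hits["Vokenization"]
    n = len(human_classes)
    return {
        "all":     sum(bi and vi and ki for bi, vi, ki in zip(b, v, k)) / n,
        "none":    sum(not (bi or vi or ki) for bi, vi, ki in zip(b, v, k)) / n,
        "B only":  sum(bi and not (vi or ki) for bi, vi, ki in zip(b, v, k)) / n,
        "MM only": sum(vi and ki and not bi for bi, vi, ki in zip(b, v, k)) / n,
    }
```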

7 Conclusion

Language is grounded in the world. Thus, a priori, representations extracted from multimodal data should better account for the meaning of words. We investigated the representations obtained by Transformer-based pre-trained multimodal models—which are claimed to be general-purpose semantic representations—and performed a systematic intrinsic evaluation of how the semantic spaces learned by these models correlate with human semantic intuitions. Though with some limitations (see Faruqui et al., 2016; Collell Talleda and


Moens, 2016), this evaluation is simple and interpretable, and provides a more direct way to assess the representational power of these models compared to evaluations based on task performance (Tan and Bansal, 2020; Ma et al., 2021). Moreover, it allows us to probe these models on a purely semantic level, which can help answer important theoretical questions regarding how they build and represent word meanings, and how these mechanisms compare to previous methods (see Mickus et al., 2020, for a similar discussion).

We proposed an experimental setup that makes evaluation of various models comparable while maximizing coverage of the human judgment data. All the multimodal models we tested—LXMERT, UNITER, ViLBERT, VisualBERT, and Vokenization—show higher correlations with human judgments than language-only BERT for more concrete words. These results confirm the effectiveness of Transformer-based models in aligning language and vision. Among these, Vokenization exhibits the most robust results overall. This suggests that the token-level approach to visual supervision used by this model in pre-training may lead to more fine-grained alignment between modalities. In contrast, the sentence-level regime of the other models may contribute to more uncertainty and less well defined multimodal word representations. Further work is needed to better understand the relation between these different methods.

Acknowledgments

We kindly thank Emanuele Bugliarello for the ad-
vice and indications he gave us to use the VOLTA
framework. We are grateful to the anonymous
TACL reviewers and to the Action Editor Jing
Jiang for the valuable comments and feedback.
They helped us significantly to broaden the anal-
ysis and improve the clarity of the manuscript.
This project has received funding from the Euro-
pean Research Council (ERC) under the European
Union’s Horizon 2020 research and innovation
programme (grant agreement no. 819455).

References

Peter Anderson, Xiaodong He, Chris Buehler,
Damien Teney, Mark Johnson, Stephen Gould,
and Lei Zhang. 2018. Bottom-up and top-down
attention for image captioning and visual ques-

tion answering. In Proceedings of the IEEE
Conference on Computer Vision and Pattern
Recognition, pages 6077–6086. https://
doi.org/10.1109/CVPR.2018.00636

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu,
Margaret Mitchell, Dhruv Batra, C Lawrence
Zitnick, and Devi Parikh. 2015. VQA: Visual
question answering. In Proceedings of the IEEE
International Conference on Computer Vision,
pages 2425–2433. https://doi.org/10
.1109/ICCV.2015.279

Marco Baroni. 2016. Grounding distributional semantics in the visual world. Language and Linguistics Compass, 10(1):3–13. https://doi.org/10.1111/lnc3.12170

Lawrence W. Barsalou. 2008. Grounded cognition. Annual Review of Psychology, 59:617–645. https://doi.org/10.1146/annurev.psych.59.103006.093639, PMID: 17705682

Lisa Beinborn, Teresa Botschen, and Iryna Gurevych. 2018. Multimodal grounding for language processing. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2325–2339. Association for Computational Linguistics.

Rishi Bommasani, Kelly Davis, and Claire Cardie. 2020. Interpreting pretrained contextualized representations via reductions to static embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4758–4781. https://doi.org/10.18653/v1/2020.acl-main.431

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47. https://doi.org/10.1613/jair.4135

Elia Bruni, Jasper Uijlings, Marco Baroni, and Nicu Sebe. 2012. Distributional semantics with eyes: Using image analysis to improve computational representations of word meaning. In Proceedings of the 20th ACM International Conference on Multimedia, pages 1219–1228. https://doi.org/10.1145/2393347.2396422

Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2014. Concreteness ratings for 40 thousand generally known English

word lemmas. Behavior Research Methods,
46(3):904–911. https://doi.org/10.3758
/s13428-013-0403-5, PMID: 24142837

Emanuele Bugliarello, Ryan Cotterell, Naoaki Okazaki, and Desmond Elliott. 2021. Multimodal pretraining unmasked: A meta-analysis and a unified framework of vision-and-language BERTs. Transactions of the Association for Computational Linguistics. https://doi.org/10.1162/tacl_a_00408

Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, and Jingjing Liu. 2020. Behind the scene: Revealing the secrets of pre-trained vision-and-language models. In European Conference on Computer Vision, pages 565–580. Springer. https://doi.org/10.1007/978-3-030-58539-6_34

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: Universal image-text representation learning. In European Conference on Computer Vision, pages 104–120. Springer. https://doi.org/10.1007/978-3-030-58577-8_7

Guillem Collell Talleda and Marie-Francine Moens. 2016. Is an image worth more than a thousand words? On the fine-grain semantic differences between visual and linguistic representations. In Proceedings of the 26th International Conference on Computational Linguistics, pages 2807–2817. ACL.

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Christopher Davis, Luana Bulat, Anita Lilla Verő, and Ekaterina Shutova. 2019. Deconstructing multimodality: Visual properties and visual context in human semantic processing. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 118–124.

Manuel De Vega, Arthur Glenberg, and Arthur Graesser. 2012. Symbols and Embodiment: Debates on Meaning and Cognition, Oxford University Press.

Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. 2017. GuessWhat?! Visual object discovery through multi-modal dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5503–5512. https://doi.org/10.1109/CVPR.2017.475

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Manaal Faruqui, Yulia Tsvetkov, Pushpendre
Rastogi, and Chris Dyer. 2016. Problems with
evaluation of word embeddings using word sim-
ilarity tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representa-
tions for NLP, pages 30–35. https://土井
.org/10.18653/v1/W16-2506

Lev Finkelstein, Evgeniy Gabrilovich, Yossi
Matias, Ehud Rivlin, Zach Solan, Gadi
Wolfman, and Eytan Ruppin. 2002. Placing
search in context: The concept
revisited.
ACM Transactions on Information Systems,
20(1):116–131. https://doi.org/10.1145
/503104.503110

John R. Firth. 1957. A synopsis of linguistic theory, 1930–1955. Studies in Linguistic Analysis.

Daniela Gerz, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimVerb-3500: A large-scale evaluation set of verb similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2173–2182. https://doi.org/10.18653/v1/D16-1235

Stevan Harnad. 1990. The symbol grounding problem. Physica D: Nonlinear Phenomena,
42(1-3):335–346. https://doi.org/10
.1016/0167-2789(90)90087-6

Zellig S. Harris. 1954. Distributional structure.
Word, 10(2-3):146–162. https://doi.org
/10.1080/00437956.1954.11659520

Lisa Anne Hendricks, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, and Aida Nematzadeh. 2021. Decoupling the role of data, attention, and losses in multimodal Transformers. Transactions of the Association for
Computational Linguistics. https://doi.org/10.1162/tacl_a_00385

Lisa Anne Hendricks and Aida Nematzadeh.
2021. Probing image-language Transformers
for verb understanding. arXiv 预印本 arXiv:
2106.09141. https://doi.org/10.18653
/v1/2021.findings-acl.318

Felix Hill and Anna Korhonen. 2014. Learning abstract concept embeddings from multi-modal data: Since you probably can’t see what I mean. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 255–265. https://doi.org/10.3115/v1/D14-1032

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695. https://doi.org/10.1162/COLI_a_00237

Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. 2016. Visual storytelling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1233–1239. https://doi.org/10.18653/v1/N16-1147

Gabriel Ilharco, Rowan Zellers, Ali Farhadi, and Hannaneh Hajishirzi. 2021. Probing contextual language models for common ground with visual representations. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5367–5377, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.422

Douwe Kiela, Anita Lilla Verő, and Stephen Clark. 2016. Comparing data sources and architectures for deep visual representation learning in semantics. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 447–456. https://doi.org/10.18653/v1/D16-1043

Satwik Kottur, Ramakrishna Vedantam, José M. F. Moura, and Devi Parikh. 2016. Visual word2vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4985–4994. https://doi.org/10.1109/CVPR.2016.539

Thomas K. Landauer and Susan T. Dumais. 1997.
A solution to Plato’s problem: The latent se-
mantic analysis theory of acquisition, induction,
and representation of knowledge. Psychologi-
cal Review, 104(2):211. https://doi.org
/10.1037/0033-295X.104.2.211

Juan J. Lastra-Díaz, Josu Goikoetxea, Mohamed Ali Hadj Taieb, Ana García-Serrano, Mohamed
Ben Aouicha, and Eneko Agirre. 2019. A re-
producible survey on word embeddings and
ontology-based methods for word similarity:
linear combinations outperform the state of the
艺术. Engineering Applications of Artificial In-
telligence, 85:645–665. https://doi.org
/10.1016/j.engappai.2019.07.010

Angeliki Lazaridou, Nghia The Pham, and Marco
Baroni. 2015. Combining language and vi-
sion with a multimodal skip-gram model. 在
诉讼程序 2015 Conference of the
North American Chapter of the Association for
计算语言学: Human Language
Technologies, pages 153–163. https://doi.org/10.3115/v1/N15-1016

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui
Hsieh, and Kai-Wei Chang. 2019. VisualBERT:
A simple and performant baseline for vision and
语言. arXiv 预印本 arXiv:1908.03557.

Douwe Kiela and Léon Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 36–45. https://doi.org/10.3115/v1/D14-1005

Liunian Harold Li, Mark Yatskar, Da Yin,
Cho-Jui Hsieh, and Kai-Wei Chang. 2020.
What does BERT with vision look at? In
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics,
pages 5265–5275. https://doi.org/10
.18653/v1/2020.acl-main.469

Tsung-Yi Lin, Michael Maire, Serge Belongie,
James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollár, and C. Lawrence Zitnick. 2014.
Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer. https://doi.org/10.1007/978-3-319-10602-1_48

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Timo Lüddecke, Alejandro Agostini, Michael Fauth, Minija Tamosiunaite, and Florentin Wörgötter. 2019. Distributional semantics of
objects in visual scenes in comparison to text.
人工智能, 274:44–65. https://
doi.org/10.1016/j.artint.2018.12.009

Chunpeng Ma, Aili Shen, Hiyori Yoshikawa,
Tomoya Iwakura, Daniel Beck, and Timothy
Baldwin. 2021. On the (in)effectiveness of images for text classification. In Proceedings of
the 16th Conference of the European Chapter of
the Association for Computational Linguistics:
Main Volume, pages 42–48.

Lotte Meteyard, Sara Rodriguez Cuadrado,
Bahador Bahrami, and Gabriella Vigliocco.
2012. Coming of age: A review of embodi-
ment and the neuroscience of semantics. Cortex,
48(7):788–804. https://doi.org/10.1016
/j.cortex.2010.11.002, PMID: 21163473

Timothee Mickus, Mathieu Constant, Denis
Paperno, and Kees van Deemter. 2020. 什么
do you mean, BERT? Assessing BERT as a
Distributional Semantics Model. Proceedings of the Society for Computation in Linguistics, Volume 3.

Tomás Mikolov, Kai Chen, Greg Corrado, and
Jeffrey Dean. 2013. Efficient estimation of word
representations in vector space. In 1st Inter-
national Conference on Learning Representa-
系统蒸发散, ICLR 2013, Scottsdale, Arizona, 美国,
May 2–4, 2013, Workshop Track Proceedings.

Roberto Navigli and Federico Martelli. 2019.
An overview of word and sense similarity.
Natural Language Engineering, 25(6):693–714.

https://doi.org/10.1017/S135132491
9000305

Letitia Parcalabescu, Albert Gatt, Anette Frank,
and Iacer Calixto. 2021. Seeing past words:
Testing the cross-modal capabilities of pre-
trained V&L models on counting tasks. In Pro-
ceedings of the ‘Beyond Language: Multimodal
Semantic Representations’ Workshop.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. https://doi.org/10.3115/v1/D14-1162

Matthew Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextual-
ized word representations. 在诉讼程序中
这 2018 Conference of the North American
Chapter of the Association for Computational
语言学: 人类语言技术,
体积 1 (Long Papers), pages 2227–2237.
https://doi.org/10.18653/v1/N18
-1202

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866. https://doi.org/10.1162/tacl_a_00349

Armand S. Rotaru and Gabriella Vigliocco. 2020. Constructing semantic models from words, images, and emojis. Cognitive Science, 44(4):e12830. https://doi.org/10.1111/cogs.12830, PMID: 32237093

Herbert Rubenstein and John B. Goodenough.
1965. Contextual correlates of synonymy.
Communications of the ACM, 8(10):627–633.
https://doi.org/10.1145/365628.365657

Piyush Sharma, Nan Ding, Sebastian Goodman,
and Radu Soricut. 2018. Conceptual captions:
A cleaned, hypernymed, image alt-text dataset
for automatic image captioning. In Proceedings
of the 56th Annual Meeting of the Associa-
tion for Computational Linguistics (体积 1:
Long Papers), pages 2556–2565. https://
doi.org/10.18653/v1/P18-1238

Carina Silberer and Mirella Lapata. 2014. Learning grounded meaning representations with autoencoders. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 721–732. https://doi.org/10.3115/v1/P14-1068

Amanpreet Singh, Vedanuj Goswami, and Devi
Parikh. 2020. Are we pretraining it right? Dig-
ging deeper into visio-linguistic pretraining.
arXiv 预印本 arXiv:2004.08744.

Mohamed Ali Hadj Taieb, Torsten Zesch, and Mohamed Ben Aouicha. 2020. A survey
of semantic relatedness evaluation datasets
and procedures. Artificial Intelligence Review,
53(6):4407–4448. https://doi.org/10
.1007/s10462-019-09796-3

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5100–5111. Association for Computational Linguistics, Hong Kong, China. https://doi.org/10.18653/v1/D19-1514

Hao Tan and Mohit Bansal. 2020. Vokenization: Improving language understanding via contextualized, visually-grounded supervision. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2066–2080. https://doi.org/10.18653/v1/2020.emnlp-main.162

Ian Tenney, Dipanjan Das, and Ellie Pavlick.
2019. BERT Rediscovers the Classical NLP
Pipeline. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 4593–4601. https://doi.org/10.18653/v1/P19-1452

Peter D. Turney and Patrick Pantel. 2010. From
frequency to meaning: Vector space models
of semantics. Journal of Artificial Intelligence
研究, 37:141–188. https://doi.org
/10.1613/jair.2934

Alex Wang, Amanpreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel R.
Bowman. 2019. GLUE: A multi-task bench-
mark and analysis platform for natural language
understanding. In 7th International Confer-
ence on Learning Representations, ICLR 2019.
https://doi.org/10.18653/v1/W18
-5446

Shaonan Wang, Jiajun Zhang, and Chengqing
Zong. 2018. Learning multimodal word repre-
sentation via dynamic fusion methods. In Pro-
ceedings of the AAAI Conference on Artificial
Intelligence, volume 32.

Matthijs Westera and Gemma Boleda. 2019. Don’t blame distributional semantics if it can’t do entailment. In Proceedings of the 13th International Conference on Computational Semantics - Long Papers, pages 120–133. https://doi.org/10.18653/v1/W19-0410

Eloi Zablocki, Benjamin Piwowarski, Laure
Soulier, and Patrick Gallinari. 2018. 学习
multi-modal word representation grounded in
visual context. In Proceedings of the AAAI Con-
ference on Artificial Intelligence, volume 32.
