Word Representation Learning in Multimodal Pre-Trained
Transformers: An Intrinsic Evaluation
Sandro Pezzelle, Ece Takmaz, Raquel Fernández
Institute for Logic, Language and Computation
University of Amsterdam, The Netherlands
{s.pezzelle|e.takmaz|raquel.fernandez}@uva.nl
Abstract
This study carries out a systematic intrin-
sic evaluation of the semantic representations
learned by state-of-the-art pre-trained multi-
modal Transformers. These representations are
claimed to be task-agnostic and shown to help
on many downstream language-and-vision
tasks. However, the extent to which they align
with human semantic intuitions remains un-
clear. We experiment with various models and
obtain static word representations from the
contextualized ones they learn. We then eval-
uate them against the semantic judgments pro-
vided by human speakers. In line with previous
evidence, we observe a generalized advantage
of multimodal representations over language-
only ones on concrete word pairs, but not on
abstract ones. On the one hand, this confirms
the effectiveness of these models to align lan-
guage and vision, which results in better se-
mantic representations for concepts that are
grounded in images. On the other hand, mod-
els are shown to follow different represen-
tation learning patterns, which sheds some
light on how and when they perform multi-
modal integration.
1 Introduction
Increasing evidence indicates that the meaning
of words is multimodal: Human concepts are
grounded in our senses (Barsalou, 2008; De Vega
et al., 2012), and the sensory-motor experiences
humans have with the world play an important role
in determining word meaning (Meteyard et al.,
2012). Since (at least) the first operationalizations
of the distributional hypothesis, however, standard
NLP approaches to derive meaning representa-
tions of words have solely relied on information
extracted from large text corpora, based on the
generalized assumption that the meaning of a
word can be inferred from the effects it has on
its linguistic context (Harris, 1954; Firth, 1957).
Language-only semantic representations, from pi-
oneering ‘count’ vectors (Landauer and Dumais,
1997; Turney and Pantel, 2010; Pennington et al.,
2014) to either static (Mikolov et al., 2013)
or contextualized (Peters et al., 2018; Devlin
et al., 2019) neural network-based embeddings,
have proven extremely effective in many lin-
guistic tasks and applications, for which they
constantly increased state-of-the-art performance.
However, they naturally have no connection with
the real-world referents they denote (Baroni,
2016). As such, they suffer from the symbol
grounding problem (Harnad, 1990), which in
turn limits their cognitive plausibility (Rotaru and
Vigliocco, 2020).
To overcome this limitation, several methods
have been proposed to equip language-only rep-
resentations with information from concurrent
modalities, particularly vision. Until not long ago,
the standard approach aimed to leverage the com-
plementary information conveyed by language and
vision—for example, that bananas are yellow (vi-
sion) and rich in potassium (language)—by build-
ing richer multimodal representations (Beinborn
et al., 2018). In general, these representations have
proved advantageous over purely textual ones in
a wide range of tasks and evaluations, including
the approximation of human semantic similarity/
relatedness judgments provided by benchmarks
like SimLex999 (Hill et al., 2015) or MEN (Bruni
et al., 2014). This was taken as evidence that
leveraging multimodal information leads to more
human-like, full-fledged semantic representations
of words (Baroni, 2016).
More recently, the advent of Transformer-based
pre-trained models such as BERT (Devlin et al.,
2019) has favored the development of a plethora
of multimodal models (Li et al., 2019; Tan and
Bansal, 2019; Lu et al., 2019; Chen et al., 2020;
Tan and Bansal, 2020) aimed to solve downstream
language and vision tasks such as Visual Question
Transactions of the Association for Computational Linguistics, vol. 9, pp. 1563–1579, 2021. https://doi.org/10.1162/tacl_a_00443
Action Editor: Jing Jiang. Submission batch: 6/2021; Revision batch: 7/2021; Published 12/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Answering (Antol et al., 2015) and Visual Dia-
logue (De Vries et al., 2017; Das et al., 2017).
Similarly to the revolution brought about by
Transformer-based language-only models to NLP
(see Tenney et al., 2019), these systems have
rewritten the recent history of research on lan-
guage and vision by setting new state-of-the-art
results on most of the tasks. Moreover, similarly
to their language-only counterparts, these systems
have been claimed to produce all-purpose, ‘task-
agnostic’ representations ready-made for any task.
While there has been quite a lot of interest in un-
derstanding the inner mechanisms of BERT-like
modelos (see the interpretability line of research
referred to as BERTology; Rogers et al., 2020)
and the nature of their representations (Mickus
et al., 2020; Westera and Boleda, 2019), compa-
rably less attention has been paid to analyzing the
multimodal equivalents of these models. In par-
ticular, no work has explicitly investigated how
the representations learned by these models com-
pare to those by their language-only counterparts,
which were recently shown to outperform standard
static representations in approximating people’s
semantic intuitions (Bommasani et al., 2020).
In this work, we therefore focus on the repre-
sentations learned by state-of-the-art multimodal
pre-trained models, and explore whether, and to
what extent, leveraging visual information makes
them closer to human representations than those
produced by BERT. Following the approach pro-
posed by Bommasani et al. (2020), we derive static
representations from the contextualized ones pro-
duced by these Transformer-based models. We
then analyze the quality of such representations
by means of the standard intrinsic evaluation based
on correlation with human similarity judgments.
We evaluate LXMERT (Tan and Bansal, 2019),
UNITER (Chen et al., 2020), ViLBERT (Lu et al.,
VisualBERT (Li et al., 2019), and Vok-
enization (Tan and Bansal, 2020) on five human
judgment benchmarks1 and show that: (1) in line
with previous work, multimodal models outper-
form purely textual ones in the representation of
concrete, but not abstract, words; (2) representa-
tions by Vokenization stand out as the overall best-
performing multimodal ones; and (3) multimodal
models differ with respect to how and when they
integrate information from language and vision, as
revealed by their learning patterns across layers.

1Data and code can be found at https://github.com/sandropezzelle/multimodal-evaluation.
2 Related Work
2.1 Evaluating Language Representations
Evaluating the intrinsic quality of learned seman-
tic representations has been one of the main, largo-
standing goals of NLP (for a recent overview of
the problem and the proposed approaches, ver
Navigli and Martelli, 2019; Taieb et al., 2020).
In contrast to extrinsic evaluations that measure
the effectiveness of task-specific representations
in performing downstream NLU tasks (e.g., those
contained in the GLUE benchmark; Wang et al.,
2019), the former approach tests whether, and
to what extent, task-agnostic semantic representa-
tions (i.e., not learned nor fine-tuned to be effective
on some specific tasks) align with those by human
speakers. This is typically done by measuring the
correlation between the similarities computed on
system representations and the semantic similarity
judgments provided by humans, a natural testbed
for distributional semantic models (Landauer and
Dumais, 1997). Lastra-Díaz et al. (2019) pro-
vide a recent, comprehensive survey on methods,
benchmarks, and results.
In the era of Transformers, recent work has ex-
plored the relationship between the contextualized
representations learned by these models and the
static ones learned by distributional semantic models
(DSMs). On a formal level, some work has ar-
gued that this relation is not straightforward since
only context-invariant—but not contextualized—
representations may adequately account for ex-
pression meaning (Westera and Boleda, 2019).
In parallel, Mickus et al. (2020) focused on
BERT and explored to what extent the seman-
tic space learned by this model is comparable to
that by DSMs. Though an overall similarity was
reported, BERT’s next-sentence-prediction objec-
tive was shown to partly obfuscate this relation.
A more direct exploration of the intrinsic seman-
tic quality of BERT representations was carried
out by Bommasani et al. (2020). In their work,
BERT’s contextualized representations were first
turned into static ones by means of simple meth-
ods (see Section 4.2) and then evaluated against
several similarity benchmarks. These representa-
tions were shown to outperform traditional ones,
which revealed that pooling over many contexts
improves embeddings’ representational quality.
Recently, Ilharco et al. (2021) probed the represen-
tations learned by purely textual language models
in their ability to perform language grounding.
Though far from human performance, they were
shown to learn nontrivial mappings to vision.
2.2 Evaluating Multimodal Representations
Since the early days of DSMs, many approaches
have been proposed to enrich language-only rep-
resentations with information from images. Bruni
et al. (2012, 2014) equipped textual representa-
tions with low-level visual features, and reported
an advantage over language-only representations
in terms of correlation with human judgments.
An analogous pattern of results was obtained by
Kiela and Bottou (2014) and Kiela et al. (2016),
who concatenated visual features obtained with
convolutional neural networks (CNNs) with skip-
gram linguistic representations. Lazaridou et al.
(2015) further improved over these techniques by
means of a model trained to optimize the similar-
ity of words with their visual representations, un
approach similar to that by Silberer and Lapata
(2014). Extensions of these latter methods include
the model by Zablocki et al. (2018), which lever-
ages information about the visual context in which
objects appear; and Wang et al. (2018), where
three dynamic fusion methods were proposed to
learn to assign importance weights to each modal-
idad. More recently, some work has explored the
quality of representations learned from images
alone (Lüddecke et al., 2019) or by combining lan-
guage, vision, and emojis (Rotaru and Vigliocco,
2020). In parallel, new evaluation methods based,
for example, on decoding brain activity (Davis
et al., 2019) or success on tasks such as image
retrieval (Kottur et al., 2016) have been proposed.
This mass of studies has overall demonstrated
the effectiveness of multimodal representations in
approximating human semantic intuitions better
than purely textual ones. However, this advantage
has been typically reported for concrete, but not
abstract, concepts (Hill and Korhonen, 2014).
In recent years, the revolution brought about
by Transformer-based multimodal models has
fostered research that sheds light on their inner
workings. One approach has been to use probing
tasks: Cao et al. (2020) focused on LXMERT
and UNITER and systematically compared the
two models with respect to, for example, the de-
gree of integration of the two modalities at each
layer or the role of various attention heads (for
a similar analysis on VisualBERT, see Li et al.,
2020). Using two tasks (image-sentence verifica-
tion and counting) as testbeds, Parcalabescu et al.
(2021) highlighted capabilities and limitations of
various pre-trained models to integrate modalities
or handle dataset biases. Another line of work
has explored the impact of various experimental
choices, such as pre-training tasks and data, loss
functions and hyperparameters, on the performance
of pre-trained multimodal models (Singh et al.,
2020; Hendricks et al., 2021). Since all of these
aspects have proven to be crucial for these mod-
els, Bugliarello et al. (2021) proposed VOLTA,
a unified framework to pre-train and evaluate
Transformer-based models with the same data,
tasks and visual features.
Despite the renewed interest in multimodal
models, to the best of our knowledge no work has
explored, to date, the intrinsic quality of the task-
agnostic representations built by various pre-
trained Transformer-based models. In this work,
we tackle this problem for the first time.
3 Data
We aim to evaluate how the similarities between
the representations learned by pre-trained mul-
timodal Transformers align with the similarity
judgments by human speakers, and how these
representations compare to those by textual Trans-
formers such as BERT. To do so, we need data that
(1) is multimodal, that is, where some text (lan-
guage) is paired with a corresponding image (vi-
sion), and (2) includes most of the words making
up the word pairs for which human semantic judg-
ments are available. In what follows, we describe
the semantic benchmarks used for evaluation and
the construction of our multimodal dataset.
3.1 Semantic Benchmarks
We experiment with five human judgment bench-
marks used for intrinsic semantic evaluation in
both language-only and multimodal work: RG65
(Rubenstein and Goodenough, 1965), WordSim-
353 (Finkelstein et al., 2002), SimLex999 (Hill
et al., 2015), MEN (Bruni et al., 2014), and
SimVerb3500 (Gerz et al., 2016). These bench-
marks have a comparable format, namely, they
contain N ⟨w1, w2, score⟩ samples, where w1
and w2 are two distinct words, and score is a
bounded value—that we normalize to range in
                                    ------------- original -------------    ------------ found in VICO ------------
benchmark      rel.  PoS            # pairs   # W.    concr. (#)            # pairs (%)    # W. (%)      concr. (#)
RG65           S     N              65        48      4.37 (65)             65 (100%)      48 (100%)     4.37 (65)
WordSim353     R     N, V, Adj      353       437     3.82 (331)            306 (86.7%)    384 (87.9%)   3.91 (300)
SimLex999      S     N, V, Adj      999       1028    3.61 (999)            957 (95.8%)    994 (99.5%)   3.65 (957)
MEN            R     N, V, Adj      3000      752     4.41 (2954)           2976 (99.2%)   750 (99.7%)   4.41 (2930)
SimVerb3500    S     V              3500      827     3.08 (3487)           2890 (82.6%)   729 (88.2%)   3.14 (2890)
total                               7917      2453                          7194 (90.9%)   2278 (92.9%)

Table 1: Statistics of the benchmarks before (original) and after (found in VICO) filtering them based
on VICO: rel. refers to the type of semantic relation, i.e., (S)imilarity or (R)elatedness; # W to the
number of unique words present; concr. to average concreteness of the pairs (in brackets, # found in
Brysbaert et al., 2014). Within found in VICO, percentages in brackets refer to the coverage compared
to original.
[0, 1]—which stands for the degree of semantic
similarity or relatedness between w1 and w2: The
higher the value, the more similar the pair. At
the same time, these benchmarks differ in several
respects, namely, (1) the type of semantic rela-
tion they capture (i.e., similarity or relatedness);
(2) the parts-of-speech (PoS) they include; (3)
the number of pairs they contain; (4) the size of
their vocabulary (i.e., the number of unique words
present); and (5) the words' degree of concrete-
ness, which previous work found to be particularly
relevant for evaluating the performance of multi-
modal representations (see Section 2.2). We report
descriptive statistics of all these relevant features
in Table 1 (original section). For concreteness,
we report a single score for each benchmark: the
higher, the more concrete. We obtained this score
(1) by taking, for each word, the corresponding
5-point human rating collected by Brysbaert et al.
(2014);2 (2) by computing the average concrete-
ness of each pair; and (3) by averaging over the
entire benchmark.
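As a rough sketch (not the released code), the three steps above amount to the following, assuming the Brysbaert et al. (2014) norms are available as a word-to-rating dictionary; all names are ours:

```python
# Sketch of the benchmark-level concreteness score (Section 3.1).
# `norms` maps a word to its 5-point concreteness rating (Brysbaert et al., 2014);
# `pairs` is the list of (w1, w2) word pairs of a benchmark.
def benchmark_concreteness(pairs, norms):
    pair_scores = []
    for w1, w2 in pairs:
        if w1 in norms and w2 in norms:                      # only pairs covered by the norms
            pair_scores.append((norms[w1] + norms[w2]) / 2)  # step (2): pair average
    # step (3): average over the covered pairs; also report how many pairs were found
    return sum(pair_scores) / len(pair_scores), len(pair_scores)
```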
3.2 Dataset
Previous work evaluating the intrinsic quality of
multimodal representations has faced the issue of
limited vocabulary coverage in the datasets used.
As a consequence, only a subset of the tested
benchmarks has often been evaluated (e.g., 29%
of word pairs in SimLex999 and 42% in MEN,
reported by Lazaridou et al., 2015). To overcome
this issue, we jointly consider two large mul-
timodal datasets: Common Objects in Contexts
(COCO; Lin et al., 2014) and Visual Storytelling
2Participants were instructed that concrete words refer to
things/actions that can be experienced through our senses,
while meanings of abstract words are defined by other words.
(VIST; Huang et al., 2016). The former contains
samples where a natural image is paired with a
free-form, crowdsourced description (or caption)
of its visual content. The latter contains samples
where a natural image is paired with both a de-
scription of its visual content (DII, Descriptions
of Images in Isolation) and a fragment of a story
invented based on a sequence of five images to
which the target image belongs (SIS, Stories of
Images in Sequences). Both DII and SIS contain
crowdsourced, free-form text. In particular, we
consider the entire COCO 2017 data (the con-
catenation of train and val splits), which consists
of 616,767 ⟨image, description⟩ samples. As for
VIST, we consider the train, val, and test splits
of both DII and SIS, which sum up to 401,600
⟨image, description/story⟩ samples.
By concatenating VIST and COCO, we obtain a
dataset containing 1,018,367 ⟨image, sentence⟩
samples, that we henceforth refer to as VICO.
Thanks to the variety of images and, in particu-
lar, the types of text it contains, the concatenated
dataset proves to be very rich in terms of lexicon,
an essential desideratum for having broad cover-
age of the word pairs in the semantic benchmarks.
We investigate this by considering all the 7917
word pairs making up the benchmarks and check-
En g, for each pair, whether both of its words are
present at least once in VICO. We find 7194 pairs
made up of 2278 unique words. As can be seen in
Table 1 (found in VICO section), this is equivalent
to around 91% of total pairs found (min. 83%,
max. 100%), with an overall vocabulary coverage
of around 93% (min. 88%, max. 100%). This is re-
flected in a pattern of average concreteness scores
that is essentially equivalent to original. Figure 1
reports this pattern in a boxplot.
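As a minimal sketch (variable names are ours), the coverage check described above amounts to:

```python
# Sketch of the coverage check on VICO (Section 3.2). `vico_vocab` is the set of
# word types occurring at least once in VICO; `benchmark_pairs` the 7917 pairs.
def coverage(benchmark_pairs, vico_vocab):
    found = [(w1, w2) for w1, w2 in benchmark_pairs
             if w1 in vico_vocab and w2 in vico_vocab]
    found_words = {w for pair in found for w in pair}
    return found, found_words

# len(found) / len(benchmark_pairs) then gives the ~91% pair coverage reported above.
```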
Figure 1: Concreteness of word pairs found in VICO. Concreteness ranges from 1 to 5. The horizontal line shows the median; the marker, the mean.

Since experimenting with more than 1 million ⟨image, sentence⟩ samples turns out to be computationally highly demanding, for efficiency reasons we extract a subset of VICO such that: (1) all 2278 words found in VICO (hence, the vocabulary) and the corresponding 7194 pairs are present at least once among its sentences; (2) its size is around an order of magnitude smaller than VICO; (3) it preserves the word frequency distribution observed in VICO. We obtain a subcorpus including 113,708 unique ⟨image, sentence⟩ samples, that is, around 11% of the whole VICO. Since all the experiments reported in the paper are performed on this subset, from now on we will simply refer to it as our dataset. Some of its descriptive statistics are reported in Table 2.3 Interestingly, VIST samples contain more vocabulary words compared to COCO (2250 vs. 1988 words), which is reflected in higher coverage of word pairs (7122 vs. 6076).

        # samples   # imgs   sent. l   # W.   # WP
COCO    50452       39767    13.4      1988   6076
VIST    63256       40528    14.7      2250   7122
total   113708      80295    14.7      2278   7194

Table 2: Dataset statistics. # imgs: number of unique images; sent. l: average sentence length; # W., # WP: resp., number of words and word pairs.

3The average frequency of our vocabulary words is 171 (min. 1, max. 8440). 61 words (3%) have frequency 1.

4 Experiments

In our experiments, we build representations for each of the words included in our semantic benchmarks by means of various language-only and multimodal models (Section 4.1). In all cases, representations are extracted from the samples included in our dataset. In the language-only models, representations are built based on only the sentence; in the multimodal models, based on the sentence and its corresponding image (or, for Vokenization, just the sentence but with visual supervision in pre-training, as explained later). Since representations by most of the tested models are contextualized, we make them static by means of an aggregation method (Section 4.2). In the evaluation (Section 4.3), we test the ability of these representations to approximate human semantic judgments.

4.1 Models

Language-Only Models We experiment with one distributional semantic model producing static representations, namely, GloVe (Pennington et al., 2014), and one producing contextualized representations, namely, the pre-trained Transformer-based BERT (Devlin et al., 2019). For GloVe, following Bommasani et al. (2020) we use its 300-d word representations pre-trained on 6B tokens from Wikipedia 2014 and Gigaword 5.4 As for BERT, we experiment with its standard 12-layer version (BERT-base).5 This is the model serving as the backbone of all the multimodal models we test, which allows for direct comparison.

4http://nlp.stanford.edu/data/glove.6B.zip.
5We adapt the code from: https://github.com/rishibommasani/Contextual2Static.

Multimodal Models We experiment with five pre-trained Transformer-based multimodal models. Four of them are both pre-trained and evaluated using multimodal data, that is, they produce representations based on a sentence and an image (Language and Vision; LV) at both training and inference time: LXMERT (Tan and Bansal, 2019), UNITER (Chen et al., 2020), ViLBERT (Lu et al., 2019), and VisualBERT (Li et al., 2019). One of them, in contrast, is visually supervised during training, but only takes Language as input during inference: Vokenization (Tan and Bansal, 2020). All five models are similar in three main respects: (1) they have BERT as their backbone; (2) they produce contextualized representations; and (3) they have multiple layers from which such representations can be extracted.
As for the LV models, we use reimplementa-
tions by the VOLTA framework (Bugliarello et al.,
2021).6 This has several advantages since all the
modelos: (1) are initialized with BERT weights;7
(2) use the same visual features, a saber, 36 re-
gions of interest extracted by Faster R-CNN with
a ResNet-101 backbone (Anderson et al., 2018);8
y (3) are pre-trained in a controlled setting using
the same exact data (Conceptual Captions; Sharma
et al., 2018), tasks, and objectives, that is, Masked
Language Model (MLM), masked object clas-
sification with KL-divergence, and image-text
matching (ITM), a binary classification problem
to predict whether an image and text pair match.
This makes the four LV models directly com-
parable to each other, with no confounds. Most
importantly, each model is reimplemented as
a particular instance of a unified mathematical
framework based on the innovative gated bimodal
Transformer layer. This general layer can be used
to model both intra-modal and inter-modal inter-
actions, which makes it suitable to reimplement
both single-stream models (where language and
vision are jointly processed by a single encoder;
UNITER, VisualBERT) and dual-stream models
(where the two modalities are first processed sepa-
rately and then integrated; LXMERT, ViLBERT).
As for Vokenization, we use the original im-
plementation by Tan and Bansal (2020). This
model is essentially a visually supervised language
model which, during training, extracts multimodal
alignments to language-only data by contextually
mapping words to images. Compared to LV mod-
els where alignment between language and vision
is performed at the ⟨sentence, image⟩ level, in
Vokenization the mapping is done at the token
level (the image is named voken). It is worth men-
tioning that Vokenization is pre-trained with less
textual data compared to the standard BERT, the
model used to initialize all LV architectures. For
comparison, in Table 3 we report the tasks and
data used to pre-train each of the tested models.
None of the tested LV models were pre-trained
with data present in our dataset. For Vokenization,
we cannot exclude that some COCO samples of
our dataset were also used in the TIM task.
6https://github.com/e-bug/volta.
7Including LXMERT, which was initialized from scratch
in its original implementation.
8Our code to extract visual features for all our images
is adapted from: https://github.com/airsplay
/py-bottom-up-attention/blob/master/demo
/demo_feature_extraction_attr.ipynb.
        pre-training task(s)                    pre-training data
GloVe   Unsupervised vector learning            Wikipedia 2014 + Gigaword 5
BERT    Masked Language Model (MLM)             English Wikipedia + BooksCorpus
        + Next Sentence Prediction (NSP)
LV*     Masked Language Model (MLM)             Conceptual Captions
        + Masked Object Classification KL
        + Image-Text Matching (ITM)
Vok.    Token-Image Matching (TIM)*             COCO + Visual Genome
        Masked Language Model (MLM)             English Wikipedia + Wiki103

Table 3: Overview of tasks and data used to pre-
train models. LV refers to the 4 multimodal mod-
els; Vok. to Vokenization. *Initialized with BERT.
4.2 Aggregation Method
With the exception of GloVe, all our tested
models build contextualized representations.
We closely follow the method proposed by
Bommasani et al. (2020) and, for each word in
our benchmarks, we compute a single, static
context-agnostic representation using the samples
included in our dataset. This involves two steps:
(1) subword pooling and (2) context combination.
A schematic illustration of both these steps is
shown in Figure 2, where we exemplify the
general architecture of a LV model. Subword
pooling is the operation by which we construct a
word representation from the tokens produced by
the BERT tokenizer. Since, during tokenization,
some words (e.g., 'donut') get decomposed into
N subwords (don, ##ut), we apply a function to
combine the representations s1, . . . , sN produced
for each subword token tk, . . . , tk+N−1 into a
contextualized word-level representation wc. We
take the corresponding model hidden states as
our representations, and use arithmetic mean as
the combination function, following Bommasani
et al. (2020):
wc = mean(s1, . . . , sN )          (1)
For each word, we then compute a static repre-
sentation from its contextualized representations.
This is done via context combination, where we
aggregate the contextualized representations wc1,
. . . , wcM of the same word found in M sentences
(contextos). We obtain a single static representation
for the word, again using arithmetic mean:
w = mean(wc1, . . . , wcM )          (2)
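The two pooling steps of Eqs. (1)–(2) can be sketched as follows (a minimal NumPy illustration, not the released implementation; it assumes the per-token hidden states of a given layer have already been extracted and aligned to word occurrences):

```python
import numpy as np

# `subword_states` is a list with one (n_subwords, 768) array per occurrence
# of the target word (i.e., one per sentence/context in our dataset).
def word_level(subword_states_one_context):
    # Eq. (1): subword pooling -> one contextualized word vector w_c
    return subword_states_one_context.mean(axis=0)

def static_representation(subword_states):
    # Eq. (2): context combination -> one static 768-d vector w for the word
    contextualized = [word_level(s) for s in subword_states]
    return np.stack(contextualized).mean(axis=0)
```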
Figure 2: A schematic illustration of our method to obtain static representations of words, e.g., 'donut'.
Como resultado, for each of the 2278 words in our
vocabulary we obtain a single static 768-d rep-
resentation w. The operations of aggregating sub-
words (Eq. 1) and contextualized representations
(Eq. 2) are carried out for each layer of each
tested model.
For BERT, we consider 13 layers, from 0 (the
input embedding layer) to 12. For Vokenization,
we consider its 12 layers, from 1 to 12. For LV
models, we consider the part of VOLTA's gated
bimodal layer processing the language input, and
extract activations from each of the feed-forward
layers following a multi-head attention block. In
LXMERT, there are 5 such layers: 21, 24, 27, 30,
33; in both UNITER and VisualBERT, 12 layers:
2, 4, . . . , 24; in ViLBERT, 12 layers: 14, 16,
. . . , 36.9 Representations are obtained by running
the best snapshot of each pre-trained model10
on our samples in evaluation mode, i.e., without
fine-tuning nor updating the model’s weights.
4.3 Evaluation
To evaluate the intrinsic quality of the obtained
representaciones, we compare the semantic space
of each model with that of human speakers. For
each word pair in each semantic benchmark de-
scribed in Section 3.1, we compute the cosine
similarity between the representations of the words
in the pair:
similarity = cosine(w1, w2)          (3)
For each benchmark, we then compute Spear-
man’s rank ρ correlation between the similarities
obtained by the model and the normalized human
judgments: The higher the correlation, the more
aligned the two semantic spaces are.
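In code, the evaluation reduces to a few lines (a sketch under the assumption that the static vectors are stored in a word-to-array dictionary; SciPy's spearmanr provides the rank correlation):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    # Eq. (3): cosine similarity between two word vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate(benchmark, vectors):
    """`benchmark` is a list of (w1, w2, human_score); `vectors` maps words to arrays."""
    model_sims, human_scores = [], []
    for w1, w2, score in benchmark:
        if w1 in vectors and w2 in vectors:
            model_sims.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(score)
    rho, _ = spearmanr(model_sims, human_scores)  # Spearman's rank correlation
    return rho
```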
9For reproducibility reasons, we report VOLTA’s indexes.
10https://github.com/e-bug/volta/blob/main/MODELS.md.
5 Results
In Table 4, we report the best results obtained
by all tested models on the five benchmarks. In
brackets we report the number of the model layer.
Language-Only Models We notice that BERT
evaluated on our dataset (hence, just BERT) sys-
tematically outperforms GloVe. This is in line
with Bommasani et al. (2020), and replicates their
findings that, as compared to standard static em-
beddings, averaging over contextualized represen-
tations by Transformer-based models is a valuable
method for obtaining semantic representations
that are more aligned to those of humans.
It is interesting to note, moreover, that the re-
sults we obtain with BERT actually outperform the
best results reported by Bommasani et al. (2020)
using the same model on 1M Wikipedia contexts
(BERT-1M-Wiki). This is intriguing since it sug-
gests that building representations using a dataset
of visually grounded language, as we do, is not
detrimental to the representational power of the
resulting embeddings. Since this comparison is
partially unfair due to the different methods em-
ployed in selecting language contexts, we also
obtain results on a subset of Wikipedia that we
extract using the method described for VICO (see
Sección 3.2),11 and which is directly comparable to
our dataset. As can be seen, representations built
on this subset of Wikipedia (BERT-Wiki ours)
turn out to perform better than those by BERT
for WordSim353, SimLex999, and SimVerb3500
(the least concrete benchmarks—see Table 1);
worse for RG65 and MEN (the most concrete
ones). This pattern of results indicates that visually
grounded language is different from encyclopedic
one, which in turn has an impact on the resulting
representations.
11This subset contains 127,246 unique sentences.
model            input   RG65          WS353         SL999         MEN           SVERB
BERT-1M-Wiki*    L       0.7242 (1)    0.7048 (1)    0.5134 (3)    –             0.3948 (4)
BERT-Wiki ours   L       0.8107 (1)    0.7262 (1)    0.5213 (0)    0.7176 (2)    0.4039 (4)
GloVe            L       0.7693        0.6097        0.3884        0.7296        0.2183
BERT             L       0.8124 (2)    0.7096 (1)    0.5191 (0)    0.7368 (2)    0.4027 (3)
LXMERT           LV      0.7821 (27)   0.6000 (27)   0.4438 (21)   0.7417 (33)   0.2443 (21)
UNITER           LV      0.7679 (18)   0.6813 (2)    0.4843 (2)    0.7483 (20)   0.3926 (10)
ViLBERT          LV      0.7927 (20)   0.6204 (14)   0.4729 (16)   0.7714 (26)   0.3875 (14)
VisualBERT       LV      0.7592 (2)    0.6778 (2)    0.4797 (4)    0.7512 (20)   0.3833 (10)
Vokenization     LV      0.8456 (9)    0.6818 (3)    0.4881 (9)    0.8068 (10)   0.3439 (9)

Table 4: Spearman's rank ρ correlation (best-performing layer in brackets) between similarities computed with
representations by all tested models and human similarity judgments in the five evaluation benchmarks: the higher
the better. Results in bold are the highest in the column among models run on our dataset (that is, all but the top
2 models). Results underlined are the highest among LV models. *Original results from Bommasani et al. (2020).
Figure 3: Highest ρ by each model on WordSim353, SimLex999 and MEN. Each barplot reports results on both
the whole benchmark (Table 4) and the most concrete subset of it (Table 5). Best viewed in color.
Multimodal Models Turning to multimodal
models, we observe that they outperform BERT
on two benchmarks, RG65 and MEN. Though
Vokenization is found to be the best-performing
architecture on both of them, all multimodal mod-
els surpass BERT on MEN (see rightmost panel
of Figure 3; dark blue bars). In contrast, no multi-
modal model outperforms or is on par with BERT
on the other three benchmarks (Figure 3 shows
the results on WordSim353 and SimLex999). This
indicates that multimodal models have an advan-
tage on benchmarks containing more concrete
word pairs (recall that MEN and RG65 are the
overall most concrete benchmarks; see Table 1);
in contrast, leveraging visual information appears
to be detrimental for more abstract word pairs, a
pattern that is very much in line with what was
reported for previous multimodal models (Bruni
et al., 2014; Hill and Korhonen, 2014). Among
multimodal models, Vokenization stands out as
the overall best-performing model. This indicates
that grounding a masked language model is an ef-
fective way to obtain semantic representations that
are intrinsically good, as well as being effective in
downstream NLU tasks (Tan and Bansal, 2020).
Among the models using an actual visual input
(LV ), ViLBERT turns out to be best-performing
on high-concreteness benchmarks, while UNITER
is the best model on more abstract benchmarks.
This pattern could be due to the different embed-
ding layers of these models, which are shown to
play an important role (Bugliarello et al., 2021).
Concreteness Our results show a generalized
advantage of multimodal models on more concrete
benchmarks. This seems to indicate that visual in-
formation is beneficial for representing concrete
words. However, it might still be that models are
just better at representing the specific words con-
tained in these benchmarks. To further investigate
this point, for each benchmark we extract the sub-
set of pairs where both words have concreteness
≥ 4 out of 5 in Brysbaert et al. (2014). For
each model, we consider the results by the layer
which is best-performing on the whole benchmark.

model          input   concr.   RG65          WS353         SL999         MEN           SVERB
BERT           L       ≥ 4      0.8321 (2)    0.6138 (1)    0.4864 (0)    0.7368 (2)    0.1354 (3)
LXMERT         LV      ≥ 4      0.8648 (27)   0.6606 (27)   0.5749 (21)   0.7862 (33)   0.1098 (21)
UNITER         LV      ≥ 4      0.8148 (18)   0.5943 (2)    0.4975 (2)    0.7755 (20)   0.1215 (10)
ViLBERT        LV      ≥ 4      0.8374 (20)   0.5558 (14)   0.5534 (16)   0.7910 (26)   0.1529 (14)
VisualBERT     LV      ≥ 4      0.8269 (2)    0.6043 (2)    0.4971 (4)    0.7727 (20)   0.1310 (10)
Vokenization   LV      ≥ 4      0.8708 (9)    0.6133 (3)    0.5051 (9)    0.8150 (10)   0.1390 (9)
# pairs (%)                     44 (68%)      121 (40%)     396 (41%)     1917 (65%)    210 (7%)

Table 5: Correlation on the concrete subsets (concr. ≥ 4) of the five evaluation benchmarks. Results in
bold are the highest in the column. Results underlined are the highest among LV multimodal models.
Table 5 reports the results of this analysis, along
with the number (and %) of word pairs considered.
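A sketch of this analysis, reusing the hypothetical evaluate function from Section 4.3 and again assuming the Brysbaert et al. (2014) norms as a dictionary:

```python
# Sketch of the concreteness analysis (Section 5): keep only pairs where both
# words have a concreteness rating >= 4, then recompute the correlation.
def concrete_subset(benchmark, norms, threshold=4.0):
    return [(w1, w2, score) for w1, w2, score in benchmark
            if norms.get(w1, 0) >= threshold and norms.get(w2, 0) >= threshold]

# e.g.: rho_concrete = evaluate(concrete_subset(men_pairs, norms), vectors)
```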
For all benchmarks, there is always at least one
multimodal model that outperforms BERT. This
pattern is crucially different from that observed
in Table 4, and confirms that multimodal models
are better than language-only ones at representing
concrete words, regardless of their PoS. Zooming
into the results, we note that Vokenization still
outperforms other multimodal models on both
RG65 and MEN (see rightmost panel of Figure 3;
light blue bars), while LXMERT turns out to
be the best-performing model on both WordSim-
353 and SimLex999 (see left and middle pan-
els of Figure 3; light blue bars). These results
suggest that this model is particularly effective
in representing highly concrete words, but fails
with abstract ones, which could cause the overall
low correlations in the full benchmarks (Table 4).
ViLBERT obtains the best results on SimVerb-
3500, thus confirming the good performance of
this model in representing verbs/actions seen also
in Table 4. However, the low correlation that all
models achieve on this subset indicates that they
all struggle to represent the meaning of verbs
that are deemed very concrete. This finding ap-
pears to be in line with the generalized difficulty
in representing verbs reported by Hendricks and
Nematzadeh (2021). Further work is needed to
explore this issue.
6 Analysis
We perform analyses aimed at shedding light on
commonalities and differences between the vari-
ous models. In particular, we explore how model
performance evolves through layers (Sección 6.1),
and how various models compare to humans at
the level of specific word pairs (Sección 6.2).
6.1 Layers
Table 4 reports the results by the best-performing
layer of each model. Figure 4 complements these
numbers by showing, for each model, how perfor-
mance changes across various layers. For BERT,
Bommasani et al. (2020) found an advantage of
earlier layers in approximating human semantic
judgments. We observe the same exact pattern,
with earlier layers (0–3) achieving the best corre-
lation scores on all benchmarks and later layers
experiencing a significant drop in performance.
As for multimodal models, previous work (Cao
et al., 2020) experimenting with UNITER and
LXMERT revealed rather different patterns be-
tween the two architectures. For the former, a
higher degree of integration between language
and vision was reported in later layers; as for the
latter, such integration appeared to be in place
from the very first multimodal layer. Cao et al.
(2020) hypothesized this pattern to be represen-
tative of the different behaviors exhibited by
single-stream (UNITER, VisualBERT) vs. dual-
stream (LXMERT, ViLBERT) models. If a higher
degree of integration between modalities leads to
better semantic representations, we should observe
an advantage of later layers in UNITER and Visu-
alBERT, but not in LXMERT and ViLBERT. In
particular, we expect this to be the case for bench-
marks where the visual modality plays a bigger
role, i.e., the more concrete RG65 and MEN.
As can be seen in Figure 4, LXMERT exhib-
its a rather flat pattern of results, which overall
Figure 4: Correlation by all tested models on the five benchmarks across layers. Best viewed in color.
confirms the observation that, in this dual-stream
modelo, integration of language and vision is in
place from the very first multimodal layer. Con-
versely, we notice that single-stream UNITER
achieves the best correlation on RG65 and MEN
towards the end of its pipeline (at layers 18 and
20, respectively), which supports the hypothesis
that later representations are more multimodal.
The distinction between single- and dual-stream
models appears less clear-cut in the other two
architectures (not explored by Cao et al., 2020).
Though ViLBERT (dual-stream) achieves gener-
ally good results in earlier layers, the best corre-
lation on RG65 and MEN is reached in middle
layers. As for VisualBERT (single-stream), con-
sistently with the expected pattern the best cor-
relation on MEN is achieved at one of the latest
layers; however, the best correlation on RG65 is
reached at the very first multimodal layer. Overall,
our results mirror the observations by Cao et al.
(2020) for LXMERT and UNITER. However, the
somewhat mixed pattern observed for the other
models suggests more complex interactions be-
tween the two modalities. As for Vokenization,
there is a performance drop at the last two layers,
but otherwise its performance constantly increases
through the layers and reaches the highest peaks
toward the end.
Taken together, the results of this analysis con-
firm that various models differ with respect to how
they represent and process the inputs and to how
and when they perform multimodal integration.
6.2 Pair-Level Analysis
Correlation results are not informative about (1)
which word pairs are more or less challenging
for the models, nor about (2) how various models
compare to each other in dealing with specific
word pairs. Intuitively, this could be tested by
comparing the raw similarity values output by a
given model to both human judgments and scores
by other models. However, this turns out not to be
sound in practice due to the different ranges of val-
ues produced. For example, some models output
generally low cosine values, while others produce
generally high scores,12 which reflects differences
in the density of the semantic spaces they learn. To
compare similarities more fairly, for each model
we consider the entire distribution of cosine val-
ues obtained in a given benchmark, rank it in
descending order (from highest to lowest similar-
ity values) and split it in five equally-sized bins,
that we label highest, high, medium, low, lowest.
We do the same for human similarity scores.
Then, for each word pair, we check whether it
is assigned the same similarity ‘class’ by both
humans and the model. We focus on the three
12Differences also emerge between various model layers.
overall best-performing models, namely, BERT (L), ViLBERT (LV), and Vokenization (LV).

[Table 6: Similarity assigned by BERT, ViLBERT, and Vokenization to most (left) and least (right) similar pairs according to humans. Dark green indicates highest assigned similarity; light green, high; yellow, medium; orange, low; red, lowest. Best viewed in color.]
We perform a qualitative analysis by focusing
on 5 pairs for each benchmark with the highest and
lowest semantic similarity/relatedness according
to humans. Table 6 reports the results of this anal-
ysis through colors. Dark green and red indicate
alignment between humans and models on most
similar and least similar pairs, respectively. At
first glance, we notice a prevalence of dark green
on the left section of the table, which lists 5 of the
most similar pairs according to humans; a preva-
lence of red on the right section, which lists the
least similar ones. This clearly indicates that the
three models are overall effective in capturing sim-
ilarities of words, mirroring the results reported in
Table 4. Consistently, we notice that model rep-
resentations are generally more aligned in some
benchmarks compared to others: consider, for ex-
ample, RG65 vs. SimLex999 or SimVerb3500.
Moreover, some models appear to be more aligned
than others in specific benchmarks: For example,
in the highly concrete MEN, Vokenization is much
more aligned than BERT on the least similar cases.
In contrast, BERT is more aligned with humans
than are multimodal models on the most similar
              RG65   WS353   SL999   MEN    SVERB
all
BERT          0.52   0.39    0.38    0.41   0.31
ViLBERT       0.49   0.37    0.35    0.43   0.30
Vokenization  0.60   0.39    0.35    0.45   0.29
similar
BERT          0.62   0.45    0.41    0.44   0.33
ViLBERT       0.50   0.43    0.38    0.47   0.31
Vokenization  0.73   0.48    0.38    0.46   0.29
dissimilar
BERT          0.46   0.43    0.39    0.42   0.32
ViLBERT       0.50   0.39    0.33    0.44   0.33
Vokenization  0.54   0.41    0.36    0.48   0.31

Table 7: Proportion of aligned cases between
humans and the models when considering all
pairs in the benchmarks (all), their highest + high
partition (similar), and lowest + low partition
(dissimilar).
pairs of SimLex999, to which ViLBERT (y, a
a lesser extent, Vokenization), often assigns low
and medium similarities. These qualitative obser-
vations are in line with the numbers reported in
Table 7, which refer to the proportion of aligned
cases between humans and the models within
each benchmark. Interestingly, all models display
a comparable performance when dealing with semantically similar and dissimilar pairs; that is, none of the models is biased toward one or the other extreme of the similarity scale.

Some interesting observations can be made by zooming into some specific word pairs in Table 6: For example, creator, maker, one of the most similar pairs in SimLex999 (a pair with low concreteness), is assigned the highest class by BERT; low and medium by ViLBERT and Vokenization, respectively. This suggests that adding visual information has a negative impact on the representation of these words. As shown in Figure 5 (top), this could be due to the (visual) specialization of these two words in our dataset, where creator appears to be usually used to refer to a human agent, while maker typically refers to some machinery. This confirms that multimodal models effectively leverage visual information, which leads to rather dissimilar representations. Another interesting case is bakery, zebra, one of MEN's least similar pairs (and highly concrete), which is assigned to low and lowest by ViLBERT and Vokenization, respectively, and to medium by BERT. In this case, adding visual information has a positive role toward moving one representation away from the other, which is in line with human intuitions. As for the relatively high similarity assigned by BERT to this pair, a manual inspection of the dataset reveals the presence of samples where the word zebra appears in bakery contexts; for example, "There is a decorated cake with zebra and giraffe print" or "A zebra and giraffe themed cake sits on a silver plate". We conjecture these co-occurrence patterns may play a role in the non-grounded representations of these words.

[Figure 5: Four ⟨image, caption⟩ samples from our dataset (in brackets, we indicate the source: VIST/COCO). For the high-similarity SL999 pair creator, maker (top), multimodal models perform worse than BERT. An opposite pattern is observed for the low-similarity MEN pair bakery, zebra (bottom). The pair bakery, zebra is highly concrete, while creator, maker is not.]

To provide a more quantitative analysis of contrasts across models, we compute the proportion of word pairs in each benchmark for which all / none of the 3 models assign the target similarity class; BERT assigns the target class, but neither of the multimodal models do (B only); both multimodal models are correct but BERT is not (MM only). We report the numbers in Table 8. It can be noted that, in MEN, the proportion of MM only cases is higher compared to B only; that is, visual information helps more than harms in this benchmark. An opposite pattern is observed for, as an example, SimLex999.

          RG65   WS353   SL999   MEN    SVERB
all       0.31   0.20    0.18    0.19   0.13
none      0.22   0.43    0.44    0.32   0.51
B only    0.08   0.06    0.10    0.08   0.08
MM only   0.08   0.05    0.05    0.09   0.05

Table 8: Proportion of pairs where all / none of
the 3 models or only either BERT (B only) or mul-
timodal (MM only) models are aligned to humans.
7 Conclusion
Language is grounded in the world. Thus, a priori,
representations extracted from multimodal data
should better account for the meaning of words.
We investigated the representations obtained by
Transformer-based pre-trained multimodal models
—which are claimed to be general-purpose seman-
tic representations—and performed a systematic
intrinsic evaluation of how the semantic spaces
learned by these models correlate with human se-
mantic intuitions. Though with some limitations
(see Faruqui et al., 2016; Collell Talleda and
Moens, 2016), this evaluation is simple and in-
terpretable, and provides a more direct way to
assess the representational power of these models
compared to evaluations based on task perfor-
mance (Tan and Bansal, 2020; Ma et al., 2021).
Moreover, it allows us to probe these models on a
purely semantic level, which can help answer im-
portant theoretical questions regarding how they
build and represent word meanings, and how these
mechanisms compare to previous methods (see
Mickus et al., 2020, for a similar discussion).
We proposed an experimental setup that
makes evaluation of various models comparable
while maximizing coverage of the human judgment
data. All the multimodal models we tested—
LXMERT, UNITER, ViLBERT, VisualBERT,
and Vokenization—show higher correlations with
human judgments than language-only BERT for
more concrete words. These results confirm the
effectiveness of Transformer-based models in
aligning language and vision. Among these, Vok-
enization exhibits the most robust results overall.
This suggests that the token-level approach to vi-
sual supervision used by this model in pre-training
may lead to more fine-grained alignment between
modalities. In contrast, the sentence-level regime
of the other models may contribute to more un-
certainty and less well defined multimodal word
representations. Further work is needed to better
understand the relation between these different
methods.
Acknowledgments
We kindly thank Emanuele Bugliarello for the ad-
vice and indications he gave us to use the VOLTA
framework. We are grateful to the anonymous
TACL reviewers and to the Action Editor Jing
Jiang for the valuable comments and feedback.
They helped us significantly to broaden the anal-
ysis and improve the clarity of the manuscript.
This project has received funding from the Euro-
pean Research Council (ERC) under the European
Union's Horizon 2020 research and innovation
programme (grant agreement no. 819455).
References
Peter Anderson, Xiaodong He, Chris Buehler,
Damien Teney, Mark Johnson, Stephen Gould,
and Lei Zhang. 2018. Bottom-up and top-down
attention for image captioning and visual ques-
tion answering. In Proceedings of the IEEE
Conference on Computer Vision and Pattern
Reconocimiento, pages 6077–6086. https://
doi.org/10.1109/CVPR.2018.00636
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu,
Margaret Mitchell, Dhruv Batra, C Lawrence
Zitnick, and Devi Parikh. 2015. VQA: Visual
question answering. In Proceedings of the IEEE
International Conference on Computer Vision,
pages 2425–2433. https://doi.org/10
.1109/ICCV.2015.279
Marco Baroni. 2016. Grounding distributional se-
mantics in the visual world. Language and
Linguistics Compass, 10(1):3–13. https://
doi.org/10.1111/lnc3.12170
Lawrence W. Barsalou. 2008. Grounded cognition.
Annual Review of Psychology, 59:617–645.
https://doi.org/10.1146/annurev
.psych.59.103006.093639, PubMed:
17705682
Lisa Beinborn, Teresa Botschen, and Iryna
Gurevych. 2018. Multimodal grounding for
language processing. In Proceedings of the
27th International Conference on Computa-
tional Linguistics, pages 2325–2339. Associa-
tion for Computational Linguistics.
Rishi Bommasani, Kelly Davis, and Claire
Cardie. 2020. Interpreting pretrained contextu-
alized representations via reductions to static
embeddings. In Proceedings of the 58th Annual
Meeting of the Association for Computational
Lingüística, pages 4758–4781. https://doi
.org/10.18653/v1/2020.acl-main.431
Elia Bruni, Nam-Khanh Tran, and Marco Baroni.
2014. Multimodal distributional
semantics.
Journal of Artificial Intelligence Research,
49:1–47. https://doi.org/10.1613/jair
.4135
Elia Bruni, Jasper Uijlings, Marco Baroni, y
Nicu Sebe. 2012. Distributional semantics with
ojos: Using image analysis to improve com-
putational representations of word meaning.
In Proceedings of the 20th ACM International
Conference on Multimedia, pages 1219–1228.
https://doi.org/10.1145/2393347
.2396422
Marc Brysbaert, Amy Beth Warriner, and
Victor Kuperman. 2014. Concreteness ratings
for 40 thousand generally known English
word lemmas. Behavior Research Methods,
46(3):904–911. https://doi.org/10.3758
/s13428-013-0403-5, PubMed: 24142837
Emanuele Bugliarello, Ryan Cotterell, Naoaki Okazaki, and Desmond Elliott. 2021. Multimodal pretraining unmasked: A meta-analysis and a unified framework of vision-and-language BERTs. Transactions of the Association for Computational Linguistics. https://doi.org/10.1162/tacl_a_00408
Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu,
Yen-Chun Chen, and Jingjing Liu. 2020.
Behind the scene: Revealing the secrets of
pre-trained vision-and-language models. In European Conference on Computer Vision, pages 565–580. Springer. https://doi.org/10.1007/978-3-030-58539-6_34
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: Universal image-text representation learning. In European Conference on Computer Vision, pages 104–120. Springer. https://doi.org/10.1007/978-3-030-58577-8_7
Guillem Collell Talleda and Marie-Francine
Moens. 2016. Is an image worth more than
a thousand words? On the fine-grain seman-
tic differences between visual and linguistic
representations. In Proceedings of the 26th International Conference on Computational Linguistics, pages 2807–2817. ACL.
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi
Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In
Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR).
Christopher Davis, Luana Bulat, Anita Lilla Verő, and Ekaterina Shutova. 2019. Deconstructing multimodality: Visual properties and visual context in human semantic processing. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 118–124.
Manuel De Vega, Arthur Glenberg, and Arthur
Graesser. 2012. Symbols and Embodiment:
Debates on Meaning and Cognition, Oxford
University Press.
Harm De Vries, Florian Strub, Sarath Chandar,
Olivier Pietquin, Hugo Larochelle, and Aaron
Courville. 2017. GuessWhat?! Visual object
discovery through multi-modal dialogue. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition,
pages 5503–5512. https://doi.org/10
.1109/CVPR.2017.475
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Manaal Faruqui, Yulia Tsvetkov, Pushpendre
Rastogi, and Chris Dyer. 2016. Problems with
evaluation of word embeddings using word sim-
ilarity tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representa-
tions for NLP, pages 30–35. https://doi
.org/10.18653/v1/W16-2506
Lev Finkelstein, Evgeniy Gabrilovich, Yossi
Matias, Ehud Rivlin, Zach Solan, Gadi
Wolfman, and Eytan Ruppin. 2002. Placing
search in context: The concept
revisited.
ACM Transactions on Information Systems,
20(1):116–131. https://doi.org/10.1145
/503104.503110
John R. Firth. 1957. A synopsis of linguistic the-
ory, 1930–1955. Studies in Linguistic Analysis.
Daniela Gerz, Ivan Vulic, Felix Hill, Roi Reichart,
and Anna Korhonen. 2016. SimVerb-3500: A
large-scale evaluation set of verb similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing,
pages 2173–2182. https://doi.org/10
.18653/v1/D16-1235
Stevan Harnad. 1990. The symbol grounding
problema. Physica D: Nonlinear Phenomena,
42(1-3):335–346. https://doi.org/10
.1016/0167-2789(90)90087-6
Zellig S. Harris. 1954. Distributional structure.
Word, 10(2-3):146–162. https://doi.org
/10.1080/00437956.1954.11659520
Lisa Anne Hendricks, John Mellor, Rosalia
Schneider, Jean-Baptiste Alayrac, and Aida
Nematzadeh. 2021. Decoupling the role of
data, attention, and losses in multimodal Transformers. Transactions of the Association for
Computational Linguistics. https://doi
.org/10.1162/tacl_a_00385
Lisa Anne Hendricks and Aida Nematzadeh.
2021. Probing image-language Transformers
for verb understanding. arXiv preprint arXiv:
2106.09141. https://doi.org/10.18653
/v1/2021.findings-acl.318
Felix Hill and Anna Korhonen. 2014. Learning abstract concept embeddings from multi-modal data: Since you probably can't see what I mean. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 255–265. https://
doi.org/10.3115/v1/D14-1032
Felix Hill, Roi Reichart, and Anna Korhonen.
2015. Simlex-999: Evaluating semantic models
with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695. https://doi.org/10.1162/COLI_a_00237
Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. 2016. Visual storytelling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1233–1239. https://doi.org/10.18653/v1/N16-1147
Gabriel Ilharco, Rowan Zellers, Ali Farhadi, and Hannaneh Hajishirzi. 2021. Probing contextual language models for common ground with visual representations. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5367–5377, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.422
Douwe Kiela and Léon Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 36–45. https://doi.org/10.3115/v1/D14-1005
Douwe Kiela, Anita Lilla Verő, and Stephen Clark. 2016. Comparing data sources and architectures for deep visual representation learning in semantics. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 447–456. https://doi.org/10.18653/v1/D16-1043
Satwik Kottur, Ramakrishna Vedantam, José M. F. Moura, and Devi Parikh. 2016. Visual word2vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4985–4994. https://doi.org/10.1109/CVPR.2016.539
Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211. https://doi.org/10.1037/0033-295X.104.2.211
Juan J. Lastra-Díaz, Josu Goikoetxea, Mohamed Ali Hadj Taieb, Ana García-Serrano, Mohamed Ben Aouicha, and Eneko Agirre. 2019. A reproducible survey on word embeddings and ontology-based methods for word similarity: Linear combinations outperform the state of the art. Engineering Applications of Artificial Intelligence, 85:645–665. https://doi.org/10.1016/j.engappai.2019.07.010
Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. 2015. Combining language and vision with a multimodal skip-gram model. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 153–163. https://doi.org/10.3115/v1/N15-1016
Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2020. What does BERT with vision look at? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5265–5275. https://doi.org/10.18653/v1/2020.acl-main.469
Tsung-Yi Lin, Michael Maire, Serge Belongie,
James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollár, and C Lawrence Zitnick. 2014.
Microsoft COCO: Common objects in context.
In European Conference on Computer Vision,
pages 740–755. Springer. https://doi.org/10.1007/978-3-319-10602-1_48
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan
Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Timo Lüddecke, Alejandro Agostini, Michael Fauth, Minija Tamosiunaite, and Florentin Wörgötter. 2019. Distributional semantics of
objects in visual scenes in comparison to text.
Artificial Intelligence, 274:44–65. https://
doi.org/10.1016/j.artint.2018.12.009
Chunpeng Ma, Aili Shen, Hiyori Yoshikawa,
Tomoya Iwakura, Daniel Beck, and Timothy
Baldwin. 2021. On the (in)effectiveness of images for text classification. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 42–48.
Lotte Meteyard, Sara Rodriguez Cuadrado,
Bahador Bahrami, and Gabriella Vigliocco.
2012. Coming of age: A review of embodi-
ment and the neuroscience of semantics. Cortex,
48(7):788–804. https://doi.org/10.1016
/j.cortex.2010.11.002, PubMed: 21163473
Timothee Mickus, Mathieu Constant, Denis Paperno, and Kees van Deemter. 2020. What do you mean, BERT? Assessing BERT as a Distributional Semantics Model. Proceedings of the Society for Computation in Linguistics, volume 3.
Tomáš Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2–4, 2013, Workshop Track Proceedings.
Roberto Navigli and Federico Martelli. 2019.
An overview of word and sense similarity.
Natural Language Engineering, 25(6):693–714.
https://doi.org/10.1017/S135132491
9000305
Letitia Parcalabescu, Albert Gatt, Anette Frank,
and Iacer Calixto. 2021. Seeing past words:
Testing the cross-modal capabilities of pre-
trained V&L models on counting tasks. In Pro-
ceedings of the ‘Beyond Language: Multimodal
Semantic Representations’ Workshop.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. https://doi.org/10.3115/v1/D14-1162
Matthew Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextual-
ized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.
https://doi.org/10.18653/v1/N18
-1202
Anna Rogers, Olga Kovaleva,
and Anna
Rumshisky. 2020. A primer in BERTology:
What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866. https://doi.org
/10.1162/tacl_a_00349
Armand S. Rotaru and Gabriella Vigliocco.
2020. Constructing semantic models from
words, images, and emojis. Cognitive Science,
44(4):e12830. https://doi.org/10.1111
/cogs.12830, PubMed: 32237093
Herbert Rubenstein and John B. Goodenough.
1965. Contextual correlates of synonymy.
Communications of the ACM, 8(10):627–633.
https://doi.org/10.1145/365628.365657
Piyush Sharma, Nan Ding, Sebastian Goodman,
and Radu Soricut. 2018. Conceptual captions:
A cleaned, hypernymed, image alt-text dataset
for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565. https://
doi.org/10.18653/v1/P18-1238
Carina Silberer and Mirella Lapata. 2014. Learning grounded meaning representations with autoencoders. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 721–732. https://doi.org/10.3115/v1/P14-1068
Amanpreet Singh, Vedanuj Goswami, and Devi
Parikh. 2020. Are we pretraining it right? Dig-
ging deeper into visio-linguistic pretraining.
arXiv preprint arXiv:2004.08744.
Mohamed Ali Hadj Taieb, Torsten Zesch, and
Mohamed Ben Aouicha. 2020. A survey
of semantic relatedness evaluation datasets
and procedures. Artificial Intelligence Review,
53(6):4407–4448. https://doi.org/10
.1007/s10462-019-09796-3
Hao Tan and Mohit Bansal. 2019. LXMERT:
Learning cross-modality encoder representa-
tions from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5100–5111. Association for Computational Linguistics, Hong Kong, China. https://
doi.org/10.18653/v1/D19-1514
Hao Tan and Mohit Bansal. 2020. Vokenization:
Improving language understanding via con-
textualized, visually-grounded supervision. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP), pages 2066–2080. https://doi
.org/10.18653/v1/2020.emnlp-main.162
Ian Tenney, Dipanjan Das, and Ellie Pavlick.
2019. BERT Rediscovers the Classical NLP
Pipeline. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Lingüística, pages 4593–4601. https://doi
.org/10.18653/v1/P19-1452
Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188. https://doi.org
/10.1613/jair.2934
Alex Wang, Amanpreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel R.
Bowman. 2019. GLUE: A multi-task bench-
mark and analysis platform for natural language
understanding. In 7th International Confer-
ence on Learning Representations, ICLR 2019.
https://doi.org/10.18653/v1/W18
-5446
Shaonan Wang, Jiajun Zhang, and Chengqing
Zong. 2018. Learning multimodal word repre-
sentation via dynamic fusion methods. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
Matthijs Westera and Gemma Boleda. 2019. Don’t
blame distributional semantics if it can’t do
entailment. In Proceedings of the 13th Interna-
tional Conference on Computational Semantics-
Artículos largos, pages 120–133. https://doi
.org/10.18653/v1/W19-0410
Eloi Zablocki, Benjamin Piwowarski, Laure
Soulier, and Patrick Gallinari. 2018. Learning
multi-modal word representation grounded in
visual context. In Proceedings of the AAAI Con-
ference on Artificial Intelligence, volume 32.