Multi-SimLex: A Large-Scale Evaluation of
Multilingual and Crosslingual Lexical
Semantic Similarity

Ivan Vulić♠
Language Technology Lab
University of Cambridge
iv250@cam.ac.uk

Simon Baker♠
Language Technology Lab
University of Cambridge
sb895@cam.ac.uk

Edoardo Maria Ponti♠
Language Technology Lab
University of Cambridge
ep490@cam.ac.uk

Ulla Petti
Language Technology Lab
University of Cambridge
ump20@cam.ac.uk

Ira Leviant
Faculty of Industrial Engineering and
Management, Technion, IIT
ira.leviant@campus.technion.ac.il

Kelly Wing
Language Technology Lab
University of Cambridge
lkw33cam@gmail.com

Olga Majewska
Language Technology Lab
University of Cambridge
om304@cam.ac.uk

All data are available at https://multisimlex.com/

♠ Equal contribution.

Submission received: 11 March 2020; revised version received: 17 July 2020; accepted for publication:
3 October 2020.

https://doi.org/10.1162/coli_a_00391

© 2020 Association for Computational Linguistics
Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
(CC BY-NC-ND 4.0) license



Eden Bar
Faculty of Industrial Engineering and
Management, Technion, IIT
edenb@campus.technion.ac.il

Matt Malone
Language Technology Lab
University of Cambridge
mm2289@cam.ac.uk

Thierry Poibeau
LATTICE Lab, CNRS and ENS/PSL and
Univ. Sorbonne Nouvelle
thierry.poibeau@ens.fr

Roi Reichart
Faculty of Industrial Engineering and
Management, Technion, IIT
roiri@ie.technion.ac.il

Anna Korhonen
Language Technology Lab
University of Cambridge
alk23@cam.ac.uk

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering
data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin
Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each
language data set is annotated for the lexical relation of semantic similarity and contains 1,888
semantically aligned concept pairs, providing a representative coverage of word classes (nouns,
verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness
levels. In addition, owing to the alignment of concepts across languages, we provide a suite
of 66 crosslingual semantic similarity data sets. Because of its extensive size and language
coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and
analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array
of recent state-of-the-art monolingual and crosslingual representation models, including static
and contextualized word embeddings (such as fastText, monolingual and multilingual BERT,
XLM), externally informed lexical representations, as well as fully unsupervised and (weakly)
supervised crosslingual word embeddings. We also present a step-by-step data set creation
protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make
these contributions (the public release of Multi-SimLex data sets, their creation protocol, strong
baseline results, and in-depth analyses, which can be helpful in guiding future developments in
multilingual lexical semantics and representation learning) available via a Web site that will
encourage community effort in the further expansion of Multi-SimLex to many more languages.
Such a large-scale semantic resource could inspire significant further advances in NLP across
languages.


1. Introduction


The lack of annotated training and evaluation data for many tasks and domains hinders
the development of computational models for the majority of the world's languages
(Snyder and Barzilay 2010; Adams et al. 2017; Ponti et al. 2019a; Joshi et al. 2020). The
necessity to guide and advance multilingual and crosslingual NLP through annotation
efforts that follow crosslingually consistent guidelines has been recently recognized
by collaborative initiatives such as the Universal Dependencies (UD) project (Nivre
et al. 2019). The latest version of UD (as of July 2020) covers about 90 languages.
Crucially, this resource continues to steadily grow and evolve through the contributions
of annotators from across the world, extending the UD's reach to a wide array
of typologically diverse languages. Besides steering research in multilingual parsing
(Zeman et al. 2018; Kondratyuk and Straka 2019; Doitch et al. 2019) and crosslingual
parser transfer (Rasooli and Collins 2017; Lin et al. 2019; Rotman and Reichart 2019),
the consistent annotations and guidelines have also enabled a range of insightful
comparative studies focused on the languages' syntactic (dis)similarities (Chen and Gerdes
2017; Bjerva and Augenstein 2018; Bjerva et al. 2019; Ponti et al. 2018a; Pires, Schlinger,
and Garrette 2019).

Inspired by the UD work and its substantial impact on research in (multilingual)
syntax, in this article we introduce Multi-SimLex, a suite of manually and consistently
annotated semantic data sets for 12 different languages, focused on the fundamental
lexical relation of semantic similarity on a continuous scale (i.e., the gradience/strength of
semantic similarity) (Budanitsky and Hirst 2006; Hill, Reichart, and Korhonen 2015). For
any pair of words, this relation measures whether (and to what extent) their referents
share the same (functional) features (e.g., lion – cat), as opposed to general cognitive
association (e.g., lion – zoo) captured by co-occurrence patterns in texts (i.e., the
distributional information).1 Data sets that quantify the strength of semantic similarity
between concept pairs, such as SimLex-999 (Hill, Reichart, and Korhonen 2015) or
SimVerb-3500 (Gerz et al. 2016), have been instrumental in improving models for
distributional semantics and representation learning. Discerning between semantic similarity
and relatedness/association is not only crucial for theoretical studies on lexical semantics
(see §2), but has also been shown to benefit a range of language understanding
tasks in NLP. Examples include dialog state tracking (Mrkšić et al. 2017; Ren et al. 2018),
spoken language understanding (Kim et al. 2016; Kim, de Marneffe, and Fosler-Lussier
2016), text simplification (Glavaš and Vulić 2018; Ponti et al. 2018b; Lauscher et al. 2019),
and dictionary and thesaurus construction (Cimiano, Hotho, and Staab 2005; Hill et al.
2016).

Despite the proven usefulness of semantic similarity data sets, they are available
only for a small and typologically narrow sample of resource-rich languages such as
German, Italian, and Russian (Leviant and Reichart 2015), whereas some language
types and low-resource languages typically lack similar evaluation data. Even if some
resources do exist, they are limited in their size (e.g., 500 pairs in Turkish [Ercan and
Yıldız 2018], 500 in Farsi [Camacho-Collados et al. 2017], or 300 in Finnish [Venekoski
and Vankka 2017]) and coverage (e.g., all data sets that originated from the original
English SimLex-999 contain only highly frequent concepts, and are dominated by nouns).
This is why, as our departure point, we introduce a larger and more comprehensive
English word similarity data set spanning 1,888 concept pairs (see §4).

1 This lexical relation is, somewhat imprecisely, also termed true or pure semantic similarity (Hill, Reichart,
and Korhonen 2015; Kiela, Hill, and Clark 2015); see the ensuing discussion in §2.1.


More importantly, semantic similarity data sets in different languages have been
created using heterogeneous construction procedures, with different guidelines for
translation and annotation, as well as different rating scales. For example, some data
sets were obtained by directly translating the English SimLex-999 in its entirety (Leviant
and Reichart 2015; Mrkšić et al. 2017) or in part (Venekoski and Vankka 2017). Other data
sets were created from scratch (Ercan and Yıldız 2018), and yet others sampled English
concept pairs differently from SimLex-999 and then translated and reannotated them
in target languages (Camacho-Collados et al. 2017). This heterogeneity makes these
data sets incomparable and precludes systematic crosslinguistic analyses. In this article,
consolidating the lessons learned from previous data set construction paradigms,
we propose a carefully designed translation and annotation protocol for developing
monolingual Multi-SimLex data sets with aligned concept pairs for typologically
diverse languages. We apply this protocol to a set of 12 languages, including a mixture of
major languages (e.g., Mandarin, Russian, and French) as well as several low-resource
ones (e.g., Kiswahili, Welsh, and Yue Chinese). We demonstrate that our proposed data
set creation procedure yields data with high inter-annotator agreement rates (e.g., the
average mean inter-annotator agreement over all 12 languages is Spearman's ρ = 0.740,
ranging from ρ = 0.667 for Russian to ρ = 0.812 for French).

The unified construction protocol and the alignment between concept pairs enable a
series of quantitative analyses. Preliminary studies on the influence that polysemy and
crosslingual variation in lexical categories (see §2.3) have on similarity judgments are
provided in §5. Data created according to the Multi-SimLex protocol also allow for probing
into whether similarity judgments are universal across languages, or rather depend on
linguistic affinity (in terms of linguistic features, phylogeny, and geographical location).
We investigate this question in §5.4. Naturally, Multi-SimLex data sets can be used as
an intrinsic evaluation benchmark to assess the quality of lexical representations based
on monolingual, joint multilingual, and transfer learning paradigms. We conduct a
systematic evaluation of several state-of-the-art representation models in §7, showing
that there are large gaps between human and system performance in all languages. The
proposed construction paradigm also supports the automatic creation of 66 crosslingual
Multi-SimLex data sets by interleaving the monolingual ones. We outline the construction
of the crosslingual data sets in §6, and then present a quantitative evaluation of a
series of cutting-edge crosslingual representation models on this benchmark in §8.

Contributions. We now summarize the main contributions of this work:

1) Building on lessons learned from prior work, we create a more
comprehensive lexical semantic similarity data set for the English
language spanning a total of 1,888 concept pairs balanced with respect to
similarity, frequency, and concreteness, and covering four word classes:
nouns, verbs, adjectives, and, for the first time, adverbs. This data set
serves as the main source for the creation of equivalent data sets in several
other languages.

2) We present a carefully designed and rigorous language-agnostic
translation and annotation protocol. These well-defined guidelines will
facilitate the development of future Multi-SimLex data sets for other
languages. The proposed protocol eliminates some crucial issues with
prior efforts focused on the creation of multilingual semantic resources,


namely: i) limited coverage; ii) heterogeneous annotation guidelines; and
iii) concept pairs that are semantically incomparable across different
languages.

3) We offer to the community manually annotated evaluation sets of 1,888
concept pairs across 12 typologically diverse languages, and 66 large
crosslingual evaluation sets. To the best of our knowledge, Multi-SimLex is
the most comprehensive evaluation resource to date focused on the
relation of semantic similarity.

4) We benchmark a wide array of recent state-of-the-art monolingual and
crosslingual word representation models across our sample of languages.
The results can serve as strong baselines that lay the foundation for future
improvements.

5) We present a first large-scale evaluation study on the ability of encoders
pretrained on language modeling (such as BERT [Devlin et al. 2019] and
XLM [Conneau and Lample 2019]) to reason over word-level semantic
similarity in different languages. To our surprise, the results show that
monolingual pretrained encoders, even when presented with word types
out of context, are sometimes competitive with static word embedding
models such as fastText (Bojanowski et al. 2017) or word2vec (Mikolov
et al. 2013). The results also reveal a huge gap in performance between
massively multilingual pretrained encoders and language-specific
encoders in favor of the latter: Our findings support other recent empirical
evidence related to the "curse of multilinguality" (Conneau et al. 2019;
Bapna and Firat 2019) in representation learning.

6) We make all of these resources available on a Web site that facilitates easy
creation, submission, and sharing of Multi-SimLex-style data sets for a
larger number of languages. We hope that this will yield an even larger
repository of semantic resources that inspire future advances in NLP
within and across languages.

In light of the success of UD (Nivre et al. 2019), we hope that our initiative will
instigate a collaborative public effort with established and clear-cut guidelines that will
result in additional Multi-SimLex data sets for a large number of languages in the near
future. Moreover, we hope that it will provide the means to advance our understanding
of distributional and lexical semantics across a large number of languages. All
monolingual and crosslingual Multi-SimLex data sets, along with detailed translation and
annotation guidelines, are available online at: https://multisimlex.com/.

2. Lexical Semantic Similarity

2.1 Similarity and Association

The focus of the Multi-SimLex initiative is on the lexical relation of "true/pure" semantic
similarity, as opposed to the broader conceptual association. For any pair of words, this
relation measures whether their referents share the same features. For example, graffiti

and frescos are similar to the extent that they are both forms of painting and appear
on walls. This relation can be contrasted with the cognitive association between two
words, which often depends on how much their referents interact in the real world,
or are found in the same situations. For example, a painter is easily associated with
frescos, although they lack any physical commonalities. Association is also known in the
literature under other names: relatedness (Budanitsky and Hirst 2006), topical similarity
(McKeown et al. 2002), and domain similarity (Turney 2012).

Semantic similarity and association overlap to some degree, but do not coincide
(Kiela, Hill, and Clark 2015; Vulić, Kiela, and Korhonen 2017). In fact, there exist plenty
of pairs that are intuitively associated but not similar. Pairs where the converse is true
can also be encountered, although more rarely. An example is synonym pairs in which one
word is common and the other infrequent, such as to seize and to commandeer. Hill, Reichart,
and Korhonen (2015) revealed that although similarity measures based on the WordNet
graph (Wu and Palmer 1994) and human judgments of association in the University
of South Florida Free Association Database (Nelson, McEvoy, and Schreiber 2004) do
correlate, a number of pairs follow opposite trends. Several studies on human cognition
also point in the same direction. For example, semantic priming can be triggered by
similar words without association (Lucas 2000). On the other hand, a connection with
cue words is established more quickly for topically related words than for similar words
in free association tasks (De Deyne and Storms 2008).

A key property of semantic similarity is its gradience: Pairs of words can be similar
to different degrees. On the other hand, the relation of synonymy is binary: Pairs of
words are synonyms if they can be substituted in all contexts (or most contexts, in a
looser sense); otherwise they are not. Although synonyms can be conceived as lying
at one extreme of the semantic similarity continuum, it is crucial to note that their
definition is stated in purely relational terms, rather than invoking their referential
properties (Lyons 1977; Cruse 1986; Coseriu 1967). This makes behavioral studies on
semantic similarity fundamentally different from lexical resources like WordNet (Miller
1995), which include paradigmatic relations (such as synonymy).

2.2 Similarity for NLP: Intrinsic Evaluation and Semantic Specialization

The ramifications of the distinction between similarity and association are profound
for distributional semantics. This paradigm of lexical semantics is grounded in the
distributional hypothesis, formulated by Firth (1957) and Harris (1951). According to
this hypothesis, the meaning of a word can be recovered empirically from the contexts
in which it occurs within a collection of texts. Because both pairs of topically related
words and pairs of purely similar words tend to appear in the same contexts, their
associated meaning representations confound the two distinct relations (Hill, Reichart, and
Korhonen 2015; Schwartz, Reichart, and Rappoport 2015; Vulić et al. 2017b). As a result,
distributional methods obscure a crucial facet of lexical meaning.

This limitation is also reflected in word embeddings (WEs), representations of words
as low-dimensional vectors that have become indispensable for a wide range of NLP
applications (Collobert et al. 2011; Chen and Manning 2014; Melamud et al. 2016, inter
alia). In particular, it involves both static WEs learned from co-occurrence patterns
(Mikolov et al. 2013; Levy and Goldberg 2014; Bojanowski et al. 2017) and contextualized
WEs learned from modeling word sequences (Peters et al. 2018; Devlin et al. 2019,
inter alia). As a result, in the induced representations, geometrical closeness (measured,
e.g., through cosine distance) conflates genuine similarity with broad relatedness. For


instance, the vectors for antonyms such as sober and drunk, by definition dissimilar,
might be neighbors in the semantic space under the distributional hypothesis. Similar
to work on distributional representations that predated the WE era (Sahlgren 2006),
Turney (2012), Kiela and Clark (2014), and Melamud et al. (2016) demonstrated that
different choices of hyperparameters in WE algorithms (such as the context window)
emphasize different relations in the resulting representations. Likewise, Agirre et al. (2009)
and Levy and Goldberg (2014) discovered that WEs learned from texts annotated with
syntactic information mirror similarity better than simple local bag-of-words
neighborhoods.

The failure of WEs to capture semantic similarity, in turn, affects model performance
in several NLP applications where such knowledge is crucial. In particular, Natural
Language Understanding tasks such as statistical dialog modeling, text simplification,
or semantic text similarity (Mrkšić et al. 2016; Kim et al. 2016; Ponti et al. 2019c), among
others, suffer the most. As a consequence, resources providing clean information on
semantic similarity are key in mitigating the side effects of the distributional signal. In
particular, such databases can be used for the intrinsic evaluation of specific WE models
as a proxy for their reliability in downstream applications (Collobert and Weston 2008;
Baroni and Lenci 2010; Hill, Reichart, and Korhonen 2015); intuitively, the more WEs are
misaligned with human judgments of similarity, the more their performance on actual
tasks is expected to be degraded. Moreover, word representations can be specialized
(a.k.a. retrofitted) by disentangling word relations of similarity and association. In
particular, linguistic constraints sourced from external databases (such as synonyms
from WordNet) can be injected into WEs (Faruqui et al. 2015; Wieting et al. 2015; Mrkšić
et al. 2017; Lauscher et al. 2019; Kamath et al. 2019, inter alia) in order to enforce
a particular relation in a distributional semantic space while preserving the original
adjacency properties.

2.3 Similarity and Language Variation: Semantic Typology

In this work, we tackle the concept of (true and gradient) semantic similarity from a
multilingual perspective. Although the same meaning representations may be shared
by all human speakers at a deep cognitive level, there is no one-to-one mapping between
the words in the lexicons of different languages. This makes the comparison of similarity
judgments across languages difficult, because the meaning overlap of translationally
equivalent words is sometimes far from exact. This results from the fact that the
way languages "partition" semantic fields is partially arbitrary (Trier 1931), although
constrained crosslingually by common cognitive biases (Majid et al. 2007). For example,
consider the field of colors: English distinguishes between green and blue, whereas Murle
(South Sudan) has a single word for both (Kay and Maffi 2013).2

In general, semantic typology studies the variation in lexical semantics across the
world's languages. According to Evans (2011), the ways languages categorize concepts
into the lexicon follow three main axes: 1) granularity: what is the number of categories
in a specific domain?; 2) boundary location: where do the lines marking different
categories lie?; 3) grouping and dissection: what are the membership criteria of a category;
which instances are considered to be more prototypical? Different choices with respect

2 Note that there is also inherent intra-language variability that can affect concept categorization; e.g.,
Vejdemo (2018) studies the monolingual variability in the domain of colors. Our annotation protocols and
models do not specifically cater to, nor measure, this fine-grained intra-language phenomenon.


to these axes lead to different lexicalization patterns.3 For instance, distinct senses of a
polysemous word in English, such as skin (referring to both the body and fruit), may be
assigned separate words in other languages, such as Italian pelle and buccia, respectively
(Rzymski et al. 2020). We later analyze whether similarity scores obtained from native
speakers also loosely follow the patterns described by semantic typology.

3. Previous Work and Evaluation Data

Word Pair Data Sets. Rich expert-created resources such as WordNet (Miller 1995;
Fellbaum 1998), VerbNet (Kipper Schuler 2005; Kipper et al. 2008), or FrameNet (Baker,
Fillmore, and Lowe 1998) encode a wealth of semantic and syntactic information, but are
expensive and time-consuming to create. The scale of this problem is multiplied
by the number of languages under consideration. Therefore, crowd-sourcing with
non-expert annotators has been adopted as a quicker alternative to produce smaller and
more focused semantic resources and evaluation benchmarks. This alternative practice
has had a profound impact on distributional semantics and representation learning
(Hill, Reichart, and Korhonen 2015). Whereas some prominent English word pair data
sets such as WordSim-353 (Finkelstein et al. 2002), MEN (Bruni, Tran, and Baroni 2014),
or Stanford Rare Words (Luong, Socher, and Manning 2013) did not discriminate
between similarity and relatedness, the importance of this distinction was established
by Hill, Reichart, and Korhonen (2015) (see again the discussion in §2.1) through the
creation of SimLex-999. This inspired other similar data sets that focused on different
lexical properties. For example, SimVerb-3500 (Gerz et al. 2016) provided similarity
ratings for 3,500 English verbs, whereas CARD-660 (Pilehvar et al. 2018) aimed at
measuring the semantic similarity of infrequent concepts.

Semantic Similarity Data Sets in Other Languages. Motivated by the impact of data sets
such as SimLex-999 and SimVerb-3500 on representation learning in English, a line of
related work put the focus on creating similar resources in other languages. The dominant
approach is translating and reannotating the entire original English SimLex-999 data
set, as done previously for German, Italian, and Russian (Leviant and Reichart 2015),
Hebrew and Croatian (Mrkšić et al. 2017), and Polish (Mykowiecka, Marciniak, and
Rychlik 2018). Venekoski and Vankka (2017) applied this process only to a subset of
300 concept pairs from the English SimLex-999. On the other hand, Camacho-Collados
et al. (2017) sampled a new set of 500 English concept pairs to ensure wider topical
coverage and balance across similarity spectra, and then translated those pairs to
German, Italian, Spanish, and Farsi (SEMEVAL-500). A similar approach was followed
by Ercan and Yıldız (2018) for Turkish, by Huang et al. (2019) for Mandarin Chinese,
and by Sakaizawa and Komachi (2018) for Japanese. Netisopakul, Wohlgenannt, and
Pulich (2019) translated the concatenation of SimLex-999, WordSim-353, and the
English SEMEVAL-500 into Thai and then reannotated it. Finally, Barzegar et al. (2018)
translated English SimLex-999 and WordSim-353 to 11 resource-rich target languages
(German, French, Russian, Italian, Dutch, Chinese, Portuguese, Swedish, Spanish,
Arabic, Farsi), but they did not provide details concerning the translation process and the

3 More formally, colexification is a phenomenon in which different meanings can be expressed by the same
word in a language (François 2008). For example, the two senses that are distinguished in English as time
and weather are co-lexified in Croatian: the word vrijeme is used in both cases.


resolution of translation disagreements. More importantly, they also did not reannotate
the translated pairs in the target languages. As we discussed in §2.3 and reiterate
later in §5, semantic differences among languages can have a profound impact on the
annotation scores; in particular, we show in §5.4 that these differences even roughly
define language clusters based on language affinity.

A core issue with the current data sets concerns the lack of one unified procedure
that ensures the comparability of resources in different languages. Further, concept
pairs for different languages are sourced from different corpora (e.g., direct translation
of the English data versus sampling from scratch in the target language). Moreover,
the previous SimLex-based multilingual data sets inherit the main deficiencies of the
English original version, such as the focus on nouns and highly frequent concepts.
Finally, prior work mostly focused on languages that are widely spoken, and did not
account for the variety of the world's languages. Our long-term goal is devising a
standardized methodology to extend the coverage also to languages that are
resource-lean and/or typologically diverse (e.g., Welsh and Kiswahili, as in this work).

Multilingual Data Sets for Natural Language Understanding. The Multi-SimLex initiative
and corresponding data sets are also aligned with the recent efforts on procuring
multilingual benchmarks that can help advance computational modeling of natural
language understanding across different languages. For example, pretrained multilingual
language models such as multilingual BERT (Devlin et al. 2019) or XLM (Conneau and
Lample 2019) are typically probed on XNLI test data (Conneau et al. 2018b) for
crosslingual natural language inference. XNLI was created by translating examples from the
English MultiNLI data set, and projecting its sentence labels (Williams, Nangia, and
Bowman 2018). Other recent multilingual data sets target the task of question answering
based on reading comprehension: i) MLQA (Lewis et al. 2019) includes 7 languages;
ii) XQuAD (Artetxe, Ruder, and Yogatama 2019) 10 languages; and iii) TyDiQA (Clark
et al. 2020) 9 widely spoken typologically diverse languages. While MLQA and XQuAD
result from the translation of an English data set, TyDiQA was built independently
in each language. Another multilingual data set, PAWS-X (Yang et al. 2019), focused
on the paraphrase identification task and was created by translating the original English
PAWS (Zhang, Baldridge, and He 2019) into 6 languages. XCOPA (Ponti et al. 2020) is a
crosslingual data set for the evaluation of crosslingual causal commonsense reasoning,
obtained through translation of the English COPA data (Roemmele, Bejan, and Gordon
2011) into 11 target languages. A large number of tasks have recently been integrated into
unified multilingual evaluation suites: XTREME (Hu et al. 2020) and XGLUE (Liang
et al. 2020). We believe that Multi-SimLex can substantially contribute to this endeavor
by offering a comprehensive multilingual benchmark for the fundamental lexical-level
relation of semantic similarity. In future work, Multi-SimLex also offers an opportunity
to investigate the correlations between word-level semantic similarity and performance
in downstream tasks such as QA and NLI across different languages.

4. The Base for Multi-SimLex: Extending English SimLex-999

In this section, we discuss the design principles behind the English (ENG) Multi-SimLex
data set, which is the basis for all the Multi-SimLex data sets in other languages, as
detailed in §5. We first argue that a new, more balanced, and more comprehensive
evaluation resource for lexical semantic similarity in English is necessary. We then
describe how the 1,888 word pairs contained in the ENG Multi-SimLex were selected


in such a way as to represent various linguistic phenomena within a single integrated
resource.

Construction Criteria. The following criteria have to be satisfied by any high-quality
semantic evaluation resource, as argued by previous studies focused on the creation of
such resources (Hill, Reichart, and Korhonen 2015; Gerz et al. 2016; Vulić et al. 2017a;
Camacho-Collados et al. 2017, inter alia):

(C1) Representative and diverse. The resource must cover the full range of diverse
concepts occurring in natural language, including different word classes (e.g., nouns,
verbs, adjectives, adverbs), concrete and abstract concepts, a variety of lexical fields,
and different frequency ranges.

(C2) Clearly defined. The resource must provide a clear understanding of which
semantic relation exactly is annotated and measured, possibly contrasting it with other
relations. For example, the original SimLex-999 and SimVerb-3500 explicitly focus on
true semantic similarity and distinguish it from broader relatedness captured by data
sets such as MEN (Bruni, Tran, and Baroni 2014) or WordSim-353 (Finkelstein et al.
2002).

(C3) Consistent and reliable. The resource must ensure consistent annotations obtained
from non-expert native speakers following simple and precise annotation guidelines.

In choosing the word pairs and constructing ENG Multi-SimLex, we adhere to these
requirements. In addition, we follow good practices established by the research on related
resources. In particular, since the introduction of the original SimLex-999 data set (Hill,
Reichart, and Korhonen 2015), follow-up work has improved its construction protocol
across several aspects, including: 1) coverage of more lexical fields, for example, by
relying on a diverse set of Wikipedia categories (Camacho-Collados et al. 2017); 2)
infrequent/rare words (Pilehvar et al. 2018); 3) focus on particular word classes, for
example, verbs (Gerz et al. 2016); and 4) annotation quality control (Pilehvar et al. 2018).
Our goal is to make use of these improvements toward a larger, more representative,
and more reliable lexical similarity data set in English and, as a consequence, in all other
languages.

The Final Output: English Multi-SimLex. In order to ensure that criterion C1 is satisfied,
we consolidate and integrate the data already carefully sampled in prior work into a
single, comprehensive, and representative data set. This way, we can control for
diversity, frequency, and other properties while avoiding performing this time-consuming
selection process from scratch. Note that, on the other hand, the word pairs chosen for
English are scored from scratch as part of the entire Multi-SimLex annotation process,
introduced later in §5. We now describe the external data sources for the final set of
word pairs:

1) Source: SimLex-999 (Hill, Reichart, and Korhonen 2015). The English
Multi-SimLex was initially conceived as an extension of the original
SimLex-999 data set. Therefore, we include all 999 word pairs from
SimLex, which span 666 noun pairs, 222 verb pairs, and 111 adjective
pairs. While SimLex-999 already provides examples representing different

POS classes, it does not have a sufficient coverage of different linguistic
phenomena: For example, it contains only very frequent concepts, and it
does not provide a representative set of verbs (Gerz et al. 2016).

2) Source: SemEval-17: Task 2 (henceforth SEMEVAL-500; Camacho-Collados
et al. 2017). We start from the full data set of 500 concept pairs to extract a
total of 334 concept pairs for English Multi-SimLex a) which contain only
single-word concepts, b) which are not named entities, c) where POS tags
of the two concepts are the same, d) where both concepts occur in the
top 250K most frequent word types in the English Wikipedia, and e) which
do not already occur in SimLex-999. The original concepts were sampled
so as to span all the 34 domains available as part of BabelDomains
(Camacho-Collados and Navigli 2017), which roughly correspond to the
main high-level Wikipedia categories. This ensures topical diversity in our
sub-sample.

3) Source: CARD-660 (Pilehvar et al. 2018). Sixty-seven word pairs are taken
from this data set focused on rare word similarity, applying the same
selection criteria a) to e) utilized for SEMEVAL-500. Words are controlled for
frequency based on their occurrence counts from the Google News data
and the ukWaC corpus (Baroni et al. 2009). CARD-660 contains some
words that are very rare (logboat), domain-specific (erythroleukemia), and
slang (2mrw), which might be difficult to translate and annotate across a
wide array of languages. Hence, we opt for retaining only the concept
pairs above the threshold of the top 250K most frequent Wikipedia
concepts, as above.

4) Source: SimVerb-3500 (Gerz et al. 2016). Because both CARD-660 and
SEMEVAL-500 are heavily skewed toward noun pairs, and nouns also
dominate the original SimLex-999, we also extract additional verb pairs
from the verb-specific similarity data set SimVerb-3500. We randomly
sample 244 verb pairs from SimVerb-3500 that represent all similarity
spectra. In particular, we add 61 verb pairs for each of the similarity
intervals: [0, 1.5), [1.5, 3), [3, 4.5), [4.5, 6] (see the sampling sketch after
this list). Because verbs in SimVerb-3500 were originally chosen from
VerbNet (Kipper, Snyder, and Palmer 2004; Kipper et al. 2008), they cover
a wide range of verb classes and their related linguistic phenomena.

5) Source: University of South Florida (USF; Nelson, McEvoy, and Schreiber
2004) norms, the largest database of free association for English. In order to
improve the representation of different POS classes, we sample additional
adjectives and adverbs from the USF norms following the procedure
established by Hill, Reichart, and Korhonen (2015) and Gerz et al. (2016).
This yields an additional 122 adjective pairs, but only a limited number of
adverb pairs (e.g., later – never, now – here, once – twice). Therefore, we also
create a set of adverb pairs semi-automatically by sampling adjectives that
can be derivationally transformed into adverbs (e.g., by adding the suffix -ly)
from the USF, and assessing the correctness of such derivation in WordNet.
The resulting pairs include, for example, primarily – mainly, softly – firmly,
roughly – reliably, and so on. We include a total of 123 adverb pairs in
the final English Multi-SimLex. Note that this is the first time that adverbs
are included in any semantic similarity data set.
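The interval-stratified sampling described in item 4 can be sketched as follows. This is a hypothetical reimplementation (field layout and random seed are our assumptions, not the authors' released code):

import random

def stratified_verb_sample(pairs, per_interval=61, seed=0):
    """pairs: list of (word1, word2, similarity_score) tuples from SimVerb-3500.
    Draws the same number of pairs from each similarity band; the last band,
    [4.5, 6], is closed on the right."""
    rng = random.Random(seed)
    bands = [(0.0, 1.5), (1.5, 3.0), (3.0, 4.5), (4.5, 6.0)]
    sample = []
    for lo, hi in bands:
        in_band = [p for p in pairs
                   if lo <= p[2] < hi or (hi == 6.0 and p[2] == 6.0)]
        # assumes each band contains at least per_interval candidate pairs
        sample.extend(rng.sample(in_band, per_interval))
    return sample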


Fulfillment of Construction Criteria. The final ENG Multi-SimLex data set spans 1,051 noun
pairs, 469 verb pairs, 245 adjective pairs, and 123 adverb pairs.4 As mentioned earlier,
criterion C1 has been fulfilled by relying only on word pairs that already underwent
meticulous sampling processes in prior work, integrating them into a single resource.
As a consequence, Multi-SimLex allows for fine-grained analyses over different POS
classes, concreteness levels, similarity spectra, frequency intervals, relation types,
morphology, and lexical fields; and it also includes some challenging orthographically
similar examples (e.g., infection – inflection).5 We ensure that criteria C2 and C3 are satisfied
by using annotation guidelines similar to those of SimLex-999, SimVerb-3500, and
SEMEVAL-500, which explicitly target semantic similarity. In what follows, we outline
the carefully tailored process of translating and annotating Multi-SimLex data sets in all
target languages.

5. Multi-SimLex: Translation and Annotation

We now detail the development of the final Multi-SimLex resource, describing our lan-
guage selection process, as well as translation and annotation of the resource, including
the steps taken to ensure and measure the quality of this resource. We also provide key
data statistics and preliminary crosslingual comparative analyses.

Language Selection. Multi-SimLex comprises eleven languages in addition to English.
The main objective for our inclusion criteria has been to balance language prominence
(by number of speakers of the language) for maximum impact of the resource, mientras
simultaneously having a diverse suite of languages based on their typological features
(such as morphological type and language family). Table 1 summarizes key information
about the languages currently included in Multi-SimLex. We have included a mixture
of fusional, agglutinative, isolating, and introflexive languages that come from eight
different language families. This includes languages that are very widely used such
as Chinese Mandarin and Spanish, and low-resource languages such as Welsh and
Kiswahili. We acknowledge that, despite a good balance between typological diversity
and language prominence, the initial Multi-SimLex language sample still contains some
gaps as it does not cover languages from some language families and geographical
regions such as the Americas or Australia. This is mostly due to added difficulty for
the authors to reach trusted translators and annotators for these languages. This also
indicates why Multi-SimLex has been envisioned as a collaborative community project:
We hope to further include additional languages and inspire other researchers that work
more closely with under-resourced languages to contribute to the effort over the lifetime
of this project.

4 There is a very small number of adjective and verb pairs extracted from CARD-660 and SEMEVAL-500 as
well. For example, the total number of verbs is 469, since we augment the original 222 SimLex-999 verb
pairs with 244 SimVerb-3500 pairs and 3 SEMEVAL-500 pairs; and similarly for adjectives.

5 Unlike SEMEVAL-500 and CARD-660, we do not explicitly control for the equal representation of concept
pairs across each similarity interval for several reasons: a) Multi-SimLex contains a substantially larger
number of concept pairs, so it is possible to extract balanced samples from the full data; b) such balance,
even if imposed on the English data set, would be distorted in all other monolingual and crosslingual
data sets; c) balancing over similarity intervals arguably does not reflect a true distribution "in the wild"
where most concepts are only loosely related or completely unrelated.


Table 1
The list of 12 languages in the Multi-SimLex multilingual suite along with their corresponding
language family (IE = Indo-European), broad morphological type, and their ISO 639-3 code.
The number of speakers is based on the total count of L1 and L2 speakers, according to
ethnologue.com.

Language          ISO 639-3  Family        Type           # Speakers
Chinese Mandarin  CMN        Sino-Tibetan  Isolating      1.116 B
Welsh             CYM        IE: Celtic    Fusional       0.7 M
English           ENG        IE: Germanic  Fusional       1.132 B
Estonian          EST        Uralic        Agglutinative  1.1 M
Finnish           FIN        Uralic        Agglutinative  5.4 M
French            FRA        IE: Romance   Fusional       280 M
Hebrew            HEB        Afro-Asiatic  Introflexive   9 M
Polish            POL        IE: Slavic    Fusional       50 M
Russian           RUS        IE: Slavic    Fusional       260 M
Spanish           SPA        IE: Romance   Fusional       534.3 M
Kiswahili         SWA        Niger-Congo   Agglutinative  98 M
Yue Chinese       YUE        Sino-Tibetan  Isolating      73.5 M

The work on data collection can be divided into two crucial phases: 1) a translation
phase, where the extended English data set with 1,888 pairs (described in §4) is
translated into eleven target languages; and 2) an annotation phase, where human raters
score each pair in the translated set as well as in the English set. Detailed guidelines for
both phases are available online at: https://multisimlex.com.6

5.1 Word Pair Translation

Translators for each target language were instructed to find direct or approximate
translations for the 1,888 word pairs that satisfy the following rules. (1) All pairs in
the translated set must be unique (i.e., no duplicate pairs); (2) translating two words
from the same English pair into the same word in the target language is not allowed
(e.g., it is not allowed to translate car and automobile to the same Spanish word coche);
(3) the translated pairs must preserve the semantic relations between the two words
when possible. This means that, when multiple translations are possible, the translation
that best conveys the semantic relation between the two words found in the original
English pair is selected; (4) if it is not possible to use a single-word translation in the
target language, then a multiword expression can be used to convey the nearest possible
semantics given the above points (e.g., the English word homework is translated into the
Polish multiword expression praca domowa).
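As an illustration, rules (1) and (2) lend themselves to a simple automatic check; the sketch below is hypothetical and not part of the released protocol:

def validate_pairs(pairs):
    """pairs: list of (word1, word2) tuples in the target language.
    Returns a list of rule violations."""
    problems = []
    seen = set()
    for w1, w2 in pairs:
        key = tuple(sorted((w1, w2)))
        if key in seen:
            problems.append(("duplicate pair", w1, w2))   # violates rule (1)
        seen.add(key)
        if w1 == w2:
            problems.append(("identical words", w1, w2))  # violates rule (2)
    return problems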

Satisfying these rules when finding appropriate translations for each pair, while
keeping to the spirit of the intended semantic relation in the English version, is
not always straightforward. For example, kinship terminology in Sinitic languages
(Mandarin and Yue) uses different terms depending on whether the family member
is older or younger, and whether the family member comes from the mother's side or

6 All translators and annotators are native speakers of each target language, who are either bilingual or

proficient in English. They were reached through personal contacts of the authors or were recruited from
the pool of international students at the University of Cambridge.


Table 2
Inter-translator agreement (% of matched translated words) by independent translators using a
randomly selected 100-pair English sample from the Multi-SimLex data set, and the
corresponding 100-pair samples from the other data sets.

Languages:  CMN    CYM    EST    FIN    FRA    HEB    POL    RUS    SPA    SWA    YUE    Avg
Nouns       84.5   80.0   90.0   87.3   78.2   98.2   90.0   95.5   85.5   80.0   77.3   86.0
Adjectives  88.5   88.5   61.5   73.1   69.2   100.0  84.6   100.0  69.2   88.5   84.6   82.5
Verbs       88.0   74.0   82.0   76.0   78.0   100.0  74.0   100.0  74.0   76.0   86.0   82.5
Adverbs     92.9   100.0  57.1   78.6   92.9   100.0  85.7   100.0  85.7   85.7   78.6   87.0
Overall     86.5   81.0   82.0   82.0   78.0   99.0   85.0   97.5   80.5   81.0   80.5   84.8

the father's side. In Mandarin, brother has no direct translation and can be translated as
either 哥哥 (older brother) or 弟弟 (younger brother). Therefore, in such cases, the
translators are asked to choose the best option given the semantic context (relation) expressed by
the pair in English; otherwise, to select one of the translations arbitrarily. This is also
used to remove duplicate pairs in the translated set, by differentiating the duplicates
using a variant at each instance. Further, many translation instances were resolved using
near-synonymous terms in the translation. For example, the words in the pair wood –
timber can only be directly translated in Estonian to puit, and are not distinguishable.
Therefore, the translators approximated the translation for timber with the compound
noun puitmaterjal (literally: wood material) in order to produce a valid pair in the target
language. In some cases, a less formal yet frequent variant is used as a translation. For
example, the words in the pair physician – doctor both translate to the same word in
Estonian (arst); the less formal word doktor is used as a translation of doctor to generate
a valid pair.

We measure the quality of the translated pairs by using a random sample
of 100 pairs (out of the 1,888 pairs) to be translated by an independent translator for
each target language. The sample is proportionally stratified according to the
part-of-speech categories. The independent translator is given identical instructions to the main
translator; we then measure the percentage of matched translated words between the
two translations of the sample set. Table 2 summarizes the inter-translator agreement
results for all languages and by part-of-speech subsets. Overall, across all languages, the
agreement is 84.8%, which is similar to prior work (Camacho-Collados et al. 2017; Vulić,
Ponzetto, and Glavaš 2019).
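The agreement figure itself is straightforward to compute; a minimal sketch, assuming the two translations are index-aligned lists of (word1, word2) tuples:

def match_percentage(main, independent):
    """Percentage of matched translated words between two translations
    of the same sample; each pair contributes two word slots."""
    matched = sum(int(a == c) + int(b == d)
                  for (a, b), (c, d) in zip(main, independent))
    return 100.0 * matched / (2 * len(main))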

5.2 Guidelines and Word Pair Scoring

Across all languages, 145 human annotators were asked to score all 1,888 pairs (in their
given language). We finally collect at least ten valid annotations for each word pair in
each language. All annotators were required to abide by the following instructions:

1. Each annotator must assign an integer score between 0 and 6 (inclusive)
indicating how semantically similar the two words in a given pair are. A
score of 6 indicates very high similarity (i.e., perfect synonymy), and zero
indicates no similarity.

2. Each annotator must score the entire set of 1,888 pairs in the data set. The
pairs must not be shared between different annotators.


3. Annotators are able to break the workload over a period of approximately
2–3 weeks, and are able to use external sources (e.g., dictionaries, thesauri,
WordNet) if required.

4. Annotators are kept anonymous, and are not able to communicate with
each other during the annotation process.

The selection criteria required that all annotators be native speakers of the target
language. Preference was given to annotators with university education, but this was not
required. Annotators were asked to complete a spreadsheet containing the translated
pairs of words, as well as the part-of-speech, and a column in which to enter the
score. The annotators did not have access to the original pairs in English.

To ensure the quality of the collected ratings, we have used an adjudication protocol
similar to the one proposed and validated by Pilehvar et al. (2018). It consists of the
following three rounds:

Round 1: All annotators are asked to follow the instructions outlined above, and to rate
all 1,888 pairs with integer scores between 0 and 6.

Round 2: We compare the scores of all annotators and identify the pairs for each
annotator that have shown the most disagreement. We ask the annotators to reconsider
the assigned scores for those pairs only. The annotators may choose to either change
or keep the scores. As in the case with Round 1, the annotators have no access to the
scores of the other annotators, and the process is anonymous. This process gives
annotators a chance to correct errors or reconsider their judgments, and has been
shown to be very effective in reaching consensus, as reported by Pilehvar et al. (2018).
We used a very similar procedure as Pilehvar et al. (2018) to identify the pairs with the
most disagreement: for each annotator, we mark the i-th pair if the rated score s_i falls
within s_i ≥ µ_i + 1.5 or s_i ≤ µ_i − 1.5, where µ_i is the mean of the other annotators' scores.
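In other words, a pair is flagged for an annotator whenever that annotator's score deviates from the mean of the other annotators' scores by 1.5 points or more. A minimal sketch of this rule (hypothetical code, not the authors' implementation):

import numpy as np

def pairs_to_reconsider(scores, annotator, margin=1.5):
    """scores: array of shape (num_annotators, num_pairs).
    Returns the indices of the pairs flagged for the given annotator."""
    others = np.delete(scores, annotator, axis=0)
    mu = others.mean(axis=0)          # per-pair mean of the other annotators
    s = scores[annotator]
    return np.where((s >= mu + margin) | (s <= mu - margin))[0]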

Round 3: We compute the average agreement for each annotator (with the other
annotators) by measuring the average Spearman's correlation against all other
annotators. We discard the scores of annotators that have shown the least average
agreement with all other annotators, while maintaining at least ten annotators per
language by the end of this round. The actual process is done in multiple iterations:
(S1) we measure the average agreement of each annotator with every other annotator
(this corresponds to the APIAA measure, see later); (S2) if we still have more than 10
valid annotators and the lowest average score is higher than in the previous iteration,
we remove the lowest-scoring annotator and rerun S1. A sketch of this loop is given
below. Table 3 shows the number of annotators at both the start (Round 1) and end
(Round 3) of our process for each language.
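The following is a hedged reimplementation of the (S1)/(S2) loop, under the assumption that each annotator's ratings are stored as one row of a score matrix:

import numpy as np
from scipy.stats import spearmanr

def average_agreements(score_rows):
    """S1: average Spearman correlation of each annotator with all others."""
    n = len(score_rows)
    return np.array([np.mean([spearmanr(score_rows[i], score_rows[j])[0]
                              for j in range(n) if j != i])
                     for i in range(n)])

def filter_annotators(score_rows, min_annotators=10):
    """S2: iteratively drop the least-agreeing annotator while more than
    min_annotators remain and removal keeps raising the lowest average."""
    rows = list(score_rows)
    previous_low = -np.inf
    while len(rows) > min_annotators:
        agreements = average_agreements(rows)
        low = agreements.min()
        if low <= previous_low:       # no further improvement: stop
            break
        previous_low = low
        rows.pop(int(agreements.argmin()))
    return rows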

We measure the agreement between annotators using two metrics: average pairwise
inter-annotator agreement (APIAA) and average mean inter-annotator agreement
(AMIAA). Both of these use Spearman's correlation (ρ) between annotators' scores; the
only difference is how the correlations are averaged. They are computed as follows:

\[
\text{APIAA} = \frac{2 \sum_{i,j} \rho(s_i, s_j)}{N(N-1)}, \qquad
\text{AMIAA} = \frac{\sum_{i} \rho(s_i, \mu_i)}{N}, \quad
\text{where} \quad \mu_i = \frac{\sum_{j,\, j \neq i} s_j}{N-1}
\tag{1}
\]

where ρ(s_i, s_j) is the Spearman's correlation between annotator i's and annotator j's
scores (s_i, s_j) over all pairs in the data set, and N is the number of annotators. APIAA has
been used widely as the standard measure for inter-annotator agreement, including in
the original SimLex paper (Hill, Reichart, and Korhonen 2015). It simply averages the
pairwise Spearman's correlation between all annotators. On the other hand, AMIAA
compares the Spearman's correlation of each held-out annotator with the average of all
the other N − 1 annotators, and then averages across all N "held-out" annotators. It
smooths individual annotator effects and arguably serves as a better upper bound than
APIAA (Gerz et al. 2016; Vulić et al. 2017a; Pilehvar et al. 2018, inter alia).
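A minimal sketch of both metrics in Equation (1), assuming scores is a NumPy array of shape (N, num_pairs) with one row per annotator (illustrative code, not the authors' released implementation):

import numpy as np
from scipy.stats import spearmanr

def apiaa(scores):
    """Average pairwise inter-annotator agreement."""
    n = scores.shape[0]
    total = sum(spearmanr(scores[i], scores[j])[0]
                for i in range(n) for j in range(i + 1, n))
    # 2 * (sum over unordered pairs) / (N (N - 1)) = mean over all pairs
    return 2.0 * total / (n * (n - 1))

def amiaa(scores):
    """Average mean inter-annotator agreement: each annotator is compared
    against the mean ratings of the remaining N - 1 annotators."""
    n = scores.shape[0]
    rhos = []
    for i in range(n):
        mu_i = (scores.sum(axis=0) - scores[i]) / (n - 1)
        rhos.append(spearmanr(scores[i], mu_i)[0])
    return float(np.mean(rhos))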


Table 3
Number of human annotators. R1 = Annotation Round 1, R3 = Round 3.

Languages:  CMN  CYM  ENG  EST  FIN  FRA  HEB  POL  RUS  SPA  SWA  YUE
R1: Start    13   12   14   12   13   10   11   12   12   12   11   13
R3: End      11   10   13   10   10   10   10   10   10   10   10   11

Table 4
Average pairwise inter-annotator agreement (APIAA). A score of 0.6 and above indicates strong
agreement.

Languages:  CMN    CYM    ENG    EST    FIN    FRA    HEB    POL    RUS    SPA    SWA    YUE
Nouns       0.661  0.622  0.659  0.558  0.647  0.698  0.538  0.606  0.524  0.582  0.626  0.727
Adjectives  0.757  0.698  0.823  0.695  0.721  0.741  0.683  0.699  0.625  0.640  0.658  0.785
Verbs       0.694  0.604  0.707  0.580  0.644  0.691  0.615  0.593  0.555  0.588  0.631  0.760
Adverbs     0.699  0.593  0.695  0.579  0.646  0.595  0.561  0.543  0.535  0.563  0.562  0.716
Overall     0.680  0.619  0.698  0.583  0.646  0.697  0.572  0.609  0.530  0.576  0.623  0.733

Table 5
Average mean inter-annotator agreement (AMIAA). A score of 0.6 and above indicates strong
agreement.

Languages:  CMN    CYM    ENG    EST    FIN    FRA    HEB    POL    RUS    SPA    SWA    YUE
Nouns       0.757  0.747  0.766  0.696  0.766  0.809  0.680  0.717  0.657  0.710  0.725  0.804
Adjectives  0.800  0.789  0.865  0.790  0.792  0.831  0.754  0.792  0.737  0.743  0.686  0.811
Verbs       0.774  0.733  0.811  0.715  0.757  0.808  0.720  0.722  0.690  0.710  0.702  0.784
Adverbs     0.749  0.693  0.777  0.697  0.748  0.729  0.645  0.655  0.608  0.671  0.623  0.716
Overall     0.764  0.742  0.794  0.715  0.760  0.812  0.699  0.723  0.667  0.703  0.710  0.792


We present the respective APIAA and AMIAA scores in Table 4 and Table 5 for all
part-of-speech subsets, as well as the agreement on the full data sets. As reported in
prior work (Gerz et al. 2016; Vulić et al. 2017a), AMIAA scores are typically higher than
APIAA scores. Crucially, the results indicate "strong agreement" (across all languages)
using both measurements. The languages with the highest annotator agreement were
French (FRA) and Yue Chinese (YUE), while Russian (RUS) had the lowest overall IAA
scores. These scores, however, are still considered to constitute "moderately strong
agreement."


Table 6
Fine-grained distribution of concept pairs over different rating intervals ([0, 1), [1, 2), [2, 3),
[3, 4), [4, 5), [5, 6]) in each Multi-SimLex language (CMN, CYM, ENG, EST, FIN, FRA, HEB,
POL, RUS, SPA, SWA, YUE), reported as percentages. The total number of concept pairs in each
data set is 1,888.
[per-cell percentages lost in extraction]

5.3 Data Analysis

Similarity Score Distributions. Across all languages, the average score (mean = 1.61,
median = 1.1) is on the lower side of the similarity scale. However, looking closer
at the scores of each language in Table 6, we observe notable differences in both the
averages and the spread of scores. Notably, French has the highest average of similarity
scores (mean = 2.61, median = 2.5), and Kiswahili has the lowest average (mean = 1.28,
median = 0.5). Russian has the lowest spread (σ = 1.37), and Polish has the largest
(σ = 1.62). All of the languages are strongly correlated with each other, as shown in
Figure 1, where all of the Spearman's correlation coefficients are greater than 0.6 for all
language pairs. Languages that share the same language family are highly correlated
(e.g., CMN-YUE, RUS-POL, EST-FIN). In addition, we observe high correlations between
English and most other languages, as expected. This is due to the effect of using English
as the base/anchor language to create the data set. Put simply, if one translates to
two languages L1 and L2 starting from the same set of pairs in English, it is highly likely
that L1 and L2 will diverge from English in different ways. Therefore, the similarity
between L1-ENG and L2-ENG is expected to be higher than between L1-L2, especially
if L1 and L2 are typologically dissimilar languages (e.g., HEB-CMN, see Figure 1). This
phenomenon is well documented in related prior work (Leviant and Reichart 2015;
Camacho-Collados et al. 2017; Mrkšić et al. 2017; Vulić, Ponzetto, and Glavaš 2019).
Although we acknowledge this as an artifact of the data set design, it would otherwise
be impossible to construct a semantically aligned and comprehensive data set across a
large number of languages.

We also report differences in the distribution of the frequency of words among the
languages in Multi-SimLex. Figure 2 shows six example languages, where each bar
segment shows the proportion of words in each language that occur in the given frequency
range. For example, the 10K–20K segment of the bars represents the proportion of words
in the data set that occur in the list of most frequent words between the frequency ranks
of 10,000 and 20,000 in that language; likewise for the other intervals. Frequency lists for
the presented languages are derived from Wikipedia and Common Crawl corpora.7
Although many concept pairs are direct or approximate translations of English pairs,
we can see that the frequency distribution does vary across different languages, and

7 Frequency lists were obtained from fastText word vectors, which are sorted by frequency:

https://fasttext.cc/docs/en/crawl-vectors.html.


Figure 1
Spearman’s correlation coefficient (ρ) of the similarity scores for all languages in Multi-SimLex.


Figure 2
A distribution over different frequency ranges for words from Multi-SimLex data sets for
selected languages. Multiword expressions are excluded from the analysis.

is also related to inherent language properties. For example, in Finnish and Russian,
although we use infinitive forms of all verbs, conjugated verb inflections are often more
frequent in raw corpora than the corresponding infinitive forms. The variance can also
be partially explained by the difference in monolingual corpora size used to derive the
frequency rankings in the first place: Absolute vocabulary sizes are expected to fluctuate
across different languages. Nevertheless, it is also important to note that the data sets also
contain subsets of lower-frequency and rare words, which can be used for rare word
evaluations in multiple languages, in the spirit of Pilehvar et al. (2018)'s English rare
word data set.
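
The frequency-range analysis itself is simple to replicate. The sketch below assumes a fastText .vec file whose entries are sorted by corpus frequency (see footnote 7) and a hypothetical list `concepts` of single-word expressions from one Multi-SimLex data set:

    from collections import Counter

    def frequency_buckets(vec_path, concepts,
                          bins=(10_000, 20_000, 50_000, 100_000)):
        # The rank of a word is its line index in the frequency-sorted file.
        ranks = {}
        with open(vec_path, encoding="utf-8") as f:
            next(f)  # skip the "<vocab_size> <dim>" header line
            for rank, line in enumerate(f):
                ranks.setdefault(line.split(" ", 1)[0], rank)
        buckets = Counter()
        for word in concepts:
            rank = ranks.get(word)
            if rank is None:
                buckets["unseen"] += 1
            else:
                buckets[next((f"<{b}" for b in bins if rank < b),
                             f">={bins[-1]}")] += 1
        return buckets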


Table 7
Examples of concept pairs with their similarity scores from four languages. For brevity, only the
original English concept pair is included, but note that the pair is translated to all target
languages, see §5.1.

Word Pair                    POS   ENG   SPA   SWA   CYM
Similar average rating
unlikely – friendly          ADV   0     0     0     0
book – literature            N     2.5   2.3   2.1   2.3
vanish – disappear           V     5.2   5.3   5.5   5.3
Different average rating
regular – average            ADJ   4     4.1   0.5   0.8
care – caution               N     4.1   5.7   0.2   3.1
One language higher
large – big                  ADJ   5.9   2.7   3.8   3.8
bank – seat                  N     0     5.1   0     0.1
sunset – evening             N     1.6   1.5   5.5   2.8
purely – completely          ADV   2.3   2.3   1.1   5.4
One language lower
woman – wife                 N     0.9   2.9   4.1   4.8
amazingly – fantastically    ADV   5.1   0.4   4.1   4.1
wonderful – terrific         ADJ   5.3   5.4   0.9   5.7
promise – swear              V     4.8   5.3   4.3   0

Crosslinguistic Differences. Table 7 shows some examples of average similarity scores
of English, Spanish, Kiswahili, and Welsh concept pairs. Recall that the scores
range from 0 to 6: The higher the score, the more similar the participants found the
concepts in the pair. The examples from Table 7 show evidence of both the stability of
average similarity scores across languages (unlikely – friendly, book – literature, and vanish
– disappear), as well as language-specific differences (care – caution). Some differences in
similarity scores seem to group languages into clusters. For example, the word pair
regular – average has an average similarity score of 4.0 and 4.1 in English and Spanish,
respectively, whereas in Kiswahili and Welsh the average similarity score of this pair is
0.5 and 0.8. We analyze this phenomenon in more detail in §5.4.

There are also examples for each of the four languages having a notably higher or
lower similarity score for the same concept pair than the three other languages. For
example, large – big in English has an average similarity score of 5.9, whereas Spanish,
Kiswahili, and Welsh speakers rate the closest concept pair in their native language to
have a similarity score between 2.7 and 3.8. What is more, woman – wife receives an
average similarity of 0.9 in English, 2.9 in Spanish, and greater than 4.0 in Kiswahili and
Welsh. The examples from Spanish include banco – asiento (bank – seat), which receives
an average similarity score of 5.1, whereas in the other three languages the similarity score
for this word pair does not exceed 0.1. At the same time, the average similarity score
of espantosamente – fantásticamente (amazingly – fantastically) is much lower in Spanish
(0.4) than in other languages (4.1 – 5.1). In Kiswahili, an example of a word pair with a
higher similarity score than the rest would be machweo – jioni (sunset – evening), having
an average score of 5.5, while the other languages receive 2.8 or less; a notably lower
similarity score is given to wa ajabu – mkubwa sana (wonderful – terrific), getting 0.9, while
the other languages receive 5.3 or more. Welsh examples include yn llwyr – yn gyfan
gwbl (purely – completely), which scores 5.4 among Welsh speakers but 2.3 or less in other
languages, whereas addo – tyngu (promise – swear) is rated as 0 by all Welsh annotators,
but in the other three languages 4.3 or more on average.

There can be several explanations for the differences in similarity scores across
languages, including but not limited to cultural context, polysemy, metonymy, translation,
regional and generational differences, and most commonly, the fact that words
and meanings do not exactly map onto each other across languages. For example, it
is likely that the other three languages do not have two separate words for describing
the concepts in the concept pair big – large, and the translators had to opt for similar
lexical items that were more distant in meaning, explaining why in English the concept
pair received a much higher average similarity score than in other languages. A similar
issue related to the mapping problem across languages arose in the Welsh concept
pair yn llwyr – yn gyfan gwbl, where Welsh speakers agreed that the two concepts
are very similar. When asked, bilingual speakers considered the two Welsh concepts
more similar than the English equivalents purely – completely, potentially explaining why
a higher average similarity score was reached in Welsh. The example of woman – wife
can illustrate cultural differences or another translation-related issue where the word
'wife' did not exist in some languages (for example, Estonian), and therefore had to be
described using other words, affecting the comparability of the similarity scores. This
was also the case with the football – soccer concept pair. The pair bank – seat demonstrates
the effect of the polysemy mismatch across languages: Whereas 'bank' has two different
meanings in English, neither of them is similar to the word 'seat', but in Spanish, 'banco'
can mean 'bank', but it can also mean 'bench'. Quite naturally, Spanish speakers gave
the pair banco – asiento a higher similarity score than the speakers of languages where
this polysemy did not occur.

An example of metonymy affecting the average similarity score can be seen in
the Kiswahili version of the word pair sunset – evening (machweo – jioni). The average
similarity score for this pair is much higher in Kiswahili, likely because the word
'sunset' can act as a metonym of 'evening'. The low similarity score of wonderful – terrific
in Kiswahili (wa ajabu – mkubwa sana) can be explained by the fact that while 'mkubwa
sana' can be used as 'terrific' in Kiswahili, it technically means 'very big', adding to
the examples of translation- and mapping-related effects. The word pair amazingly
– fantastically (espantosamente – fantásticamente) brings out another translation-related
problem: the accuracy of the translation. Although 'espantosamente' could arguably be
translated to 'amazingly', its more common meanings include 'frightfully', 'terrifyingly',
and 'shockingly', explaining why the average similarity score differs from the rest of the
languages. Another problem was brought out by addo – tyngu (promise – swear) in Welsh,
where 'tyngu' may not have been a commonly used or even a known word choice
for annotators, pointing out potential regional or generational differences in language
use.

Table 8 presents examples of concept pairs from English, Spanish, Kiswahili, and
Welsh on which the participants agreed the most. For example, in English all participants
rated the similarity of trial – test to be 4 or 5. In Spanish and Welsh, all participants
rated start – begin to correspond to a score of 5 or 6. In Kiswahili, money – cash received a
similarity rating of 6 from every participant. Although there are numerous examples of
concept pairs in these languages where the participants agreed on a similarity score of
4 or higher, it is worth noting that none of these languages had a single pair where all
participants agreed on either a 1-2, 2-3, or 3-4 similarity rating. Interestingly, in English
all pairs where all the participants agreed on a 5-6 similarity score were adjectives.


Table 8
Examples of concept pairs with their similarity scores from four languages where all participants
show strong agreement in their rating.

Language    Word Pair                  POS   Rating all participants agree with
ENG         trial – test               N     4–5
SWA         archbishop – bishop        N     4–5
SPA, CYM    start – begin              V     5–6
ENG         smart – intelligent        ADJ   5–6
ENG, SPA    quick – rapid              ADJ   5–6
SPA         circumstance – situation   N     5–6
CYM         football – soccer          N     5–6
SWA         football – soccer          N     6
SWA         pause – wait               V     6
SWA         money – cash               N     6
CYM         friend – buddy             N     6

5.4 Effect of Language Affinity on Similarity Scores

Based on the analysis in Figure 1 and inspecting the anecdotal examples in the previous
section, it is evident that the correlation between similarity scores across languages is
not random. To corroborate this intuition, we visualize the vectors of similarity scores
for each single language by reducing their dimensionality to 2 via principal component
analysis (Pearson 1901). The resulting scatter plot in Figure 3 reveals that languages


Figure 3
Principal component analysis of the language vectors resulting from the concatenation of
similarity judgments for all pairs.


from the same family or branch have similar patterns in the scores. In particular, Russian
and Polish (both Slavic), Finnish and Estonian (both Uralic), Cantonese and Mandarin
Chinese (both Sinitic), and Spanish and French (both Romance) are all neighbors.

In order to quantify exactly the effect of language affinity on the similarity scores,
we run correlation analyses between these scores and language features. In particular, we
extract feature vectors from URIEL (Littell et al. 2017), a massively multilingual typological
database that collects and normalizes information compiled by grammarians
and field linguists about the world's languages. We focus on information
about geography (the areas where the language speakers are concentrated), family (the
phylogenetic tree each language belongs to), and typology (including syntax, phonological
inventory, and phonology).8 In addition, we consider typological representations
of languages that are not manually crafted by experts, but rather learned from texts
(Östling and Tiedemann 2017). We experiment with the vectors of Malaviya, Neubig,
and Littell (2017), who proposed to construct such representations by training language-identifying
vectors end-to-end as part of neural machine translation models.

The vector of similarity judgments and the vector of linguistic features for a given
language have different dimensionality. Hence, we first construct a distance matrix for
each vector space, such that both columns and rows are language indices, and each cell
value is the cosine distance between the vectors of the corresponding language pair.
Given a set of L languages, each resulting matrix S has dimensionality R^{|L|×|L|} and
is symmetric. To estimate the correlation between the matrix for similarity judgments
and each of the matrices for linguistic features, we run a Mantel test (Mantel 1967),
a non-parametric statistical test based on matrix permutations that takes into account
inter-dependencies among pairwise distances.
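
A minimal sketch of such a permutation-based Mantel test, assuming `a` and `b` are the two |L| × |L| cosine-distance matrices (similarity judgments vs. one set of URIEL features); this is a plain NumPy implementation, not a specific library API:

    import numpy as np

    def mantel(a, b, permutations=10_000, seed=0):
        rng = np.random.default_rng(seed)
        iu = np.triu_indices_from(a, k=1)      # upper-triangle distances only
        x, y = a[iu], b[iu]
        r_obs = np.corrcoef(x, y)[0, 1]
        n = a.shape[0]
        perm_rs = np.empty(permutations)
        for k in range(permutations):
            p = rng.permutation(n)             # permute rows and columns jointly
            perm_rs[k] = np.corrcoef(a[p][:, p][iu], y)[0, 1]
        p_val = (np.sum(np.abs(perm_rs) >= abs(r_obs)) + 1) / (permutations + 1)
        z = (r_obs - perm_rs.mean()) / perm_rs.std()
        return r_obs, p_val, z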

The results of the Mantel test reported in Table 9 show that there exist statistically
significant correlations between similarity judgments and geography, family, and syntax,
given that p < 0.05 and z > 1.96. The correlation coefficient is particularly strong
for geography (r = 0.647) and syntax (r = 0.649). The former result is intuitive, because
languages in contact easily borrow and loan lexical units, and cultural interactions
may result in similar cognitive categorizations. The result for syntax, by contrast, cannot
be explained so easily, as formal properties of language do not affect lexical semantics.
Instead, we conjecture that, although no causal relation is present, both syntactic
features and similarity judgments might be linked to a common explanatory variable
(such as geography). In fact, several syntactic properties are not uniformly spread
across the globe. For example, languages with Verb–Object–Subject word order are mostly
concentrated in Oceania (Dryer 2013). In turn, geographical proximity leads to similar
judgment patterns, as mentioned above. On the other hand, we find no correlation with
phonology and inventory, as expected, nor with the bottom–up typological features
from Malaviya, Neubig, and Littell (2017).

6. Crosslingual Multi-SimLex Data Sets

A crucial advantage of having semantically aligned monolingual data sets across different
languages is the potential to create crosslingual semantic similarity data sets. Such data
sets allow for probing the quality of crosslingual representation learning algorithms
(Camacho-Collados et al. 2017; Conneau et al. 2018a; Chen and Cardie 2018; Doval et al.
2018; Ruder, Vulić, and Søgaard 2019; Conneau and Lample 2019; Ruder, Søgaard, and
Vulić 2019) as an intrinsic evaluation task. However, the crosslingual data sets previous

8 For the extraction of these features, we used lang2vec: github.com/antonisa/lang2vec.


Table 9
Mantel test on the correlation between similarity judgments from Multi-SimLex and linguistic
features from typological databases.

Features                                Dimension   Mantel r   Mantel p   Mantel z
geography                                     299      0.647     0.007*      3.443
family                                       3718      0.329     0.023*      2.711
syntax                                        103      0.649     0.007*      3.787
inventory                                     158      0.155      0.459      0.782
phonology                                      28      0.397      0.046      1.943
Malaviya, Neubig, and Littell (2017)          512     −0.431      0.264     −1.235

work relied upon (Camacho-Collados et al. 2017) were limited to a homogeneous set of
high-resource languages (e.g., English, German, Italian, Spanish) and a small number
of concept pairs (all less than 1K pairs). We address both problems by 1) using a typologically
more diverse language sample, and 2) relying on a substantially larger English
data set as a source for the crosslingual data sets: 1,888 pairs in this work versus 500
pairs in the work of Camacho-Collados et al. (2017). As a result, each of our crosslingual
data sets contains a substantially larger number of concept pairs, as shown in Table 11.
The crosslingual Multi-SimLex data sets are constructed automatically, leveraging word
pair translations and annotations collected in all 12 languages. This yields a total of
66 crosslingual data sets, one for each possible combination of languages. Table 11
provides the final number of concept pairs, which lie between 2,031 and 3,480 pairs
for each crosslingual data set, whereas Table 10 shows some sample pairs with their
corresponding similarity scores.

The automatic creation and verification of crosslingual data sets closely follows the
procedure first outlined by Camacho-Collados, Pilehvar, and Navigli (2015) and later
adopted by Camacho-Collados et al. (2017) (for semantic similarity) and Vulić, Ponzetto,
and Glavaš (2019) (for graded lexical entailment). First, given two languages, we intersect
their aligned concept pairs obtained through translation. For example, starting from
the aligned pairs attroupement – foule in French and rahvasumm – rahvahulk in Estonian,
we construct two crosslingual pairs attroupement – rahvahulk and rahvasumm – foule. The
scores of crosslingual pairs are then computed as averages of the two corresponding
monolingual scores. Finally, in order to filter out concept pairs whose semantic meaning
was not preserved during this operation, we retain only crosslingual pairs for which the

Table 10
Example concept pairs with their scores from a selection of crosslingual Multi-SimLex data sets.

Pair       Concept-1        Concept-2          Score
CYM-ENG    rhyddid          liberty            5.37
CYM-POL    plentynaidd      niemądry           2.15
SWA-ENG    kutimiza         accomplish         5.24
FIN-SPA    tietämättömyys   inteligencia       0.55
SPA-FRA    ganador          candidat           2.15
POL-SWA    prawdopodobnie   uwezekano          4.05
FIN-SWA    psykologia       sayansi            2.20
ENG-FRA    normally         quotidiennement    2.41
FIN-SPA    auto             bicicleta          0.85
CYM-SWA    sefyllfa         mazingira          1.90
EST-SPA    armee            legión             3.25
FIN-EST    halveksuva       põlglik            5.55
POL-ENG    grawitacja       meteor             0.27
[remaining rows, including all pairs with Chinese-script concepts, were lost in extraction]


Table 11
The sizes of all monolingual (main diagonal) and crosslingual data sets.

        CMN    CYM    ENG    EST    FIN    FRA    HEB    POL    RUS    SPA    SWA    YUE
CMN   1,888
CYM   3,085  1,888
ENG   3,151  3,380  1,888
EST   3,188  3,305  3,364  1,888
FIN   3,137  3,274  3,352  3,386  1,888
FRA   2,243  2,301  2,284  2,787  2,682  1,888
HEB   3,056  3,209  3,274  3,358  3,243  2,903  1,888
POL   3,009  3,175  3,274  3,310  3,294  2,379  3,201  1,888
RUS   3,032  3,196  3,222  3,339  3,257  2,219  3,226  3,209  1,888
SPA   3,116  3,205  3,318  3,312  3,256  2,645  3,256  3,250  3,189  1,888
SWA   2,807  2,926  2,828  2,845  2,900  2,031  2,775  2,819  2,855  2,811  1,888
YUE   3,480  3,062  3,099  3,080  3,063  2,313  3,005  2,950  2,966  3,053  2,821  1,888

Figure 4
(a) Rating distribution and (b) distribution of pairs over the four POS classes in crosslingual
Multi-SimLex data sets averaged across each of the 66 language pairs (y-axes plot percentages as
the total number of concept pairs varies across different crosslingual data sets). Minimum and
maximum percentages for each rating interval and POS class are also plotted.

corresponding monolingual scores (s_s, s_t) differ by at most one fifth of the full scale (i.e.,
|s_s − s_t| ≤ 1.2). This heuristic mitigates the noise due to crosslingual semantic shifts
(Camacho-Collados et al. 2017; Vulić, Ponzetto, and Glavaš 2019). We refer the reader
to the work of Camacho-Collados, Pilehvar, and Navigli (2015) for a detailed technical
description of the procedure.
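
The construction step can be summarized with the following sketch, assuming each monolingual data set is represented as a hypothetical dict mapping an aligned (English-anchored) pair ID to the two translated words and their monolingual score:

    def build_crosslingual(data_l1, data_l2, max_gap=1.2):
        """data_lX: {pair_id: ((word_a, word_b), score)}; returns triples."""
        pairs = []
        for pid in data_l1.keys() & data_l2.keys():
            (a1, b1), s1 = data_l1[pid]
            (a2, b2), s2 = data_l2[pid]
            # Retain the pair only if the two monolingual scores differ by
            # at most one fifth of the full [0, 6] scale.
            if abs(s1 - s2) > max_gap:
                continue
            avg = (s1 + s2) / 2.0
            pairs.append((a1, b2, avg))   # e.g., attroupement - rahvahulk
            pairs.append((a2, b1, avg))   # e.g., rahvasumm - foule
        return pairs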

To assess the quality of the resulting crosslingual data sets, we conducted a
verification experiment similar to Vulić, Ponzetto, and Glavaš (2019). We randomly sampled
300 concept pairs in the English-Spanish, English-French, and English-Mandarin
crosslingual data sets. Afterward, we asked bilingual native speakers to provide
similarity judgments for each pair. The Spearman's correlation score ρ between automatically
induced and manually collected ratings achieves ρ ≥ 0.90 on all samples, which
confirms the viability of the automatic construction procedure.

Score and Class Distributions. The summary of score and class distributions across all 66
crosslingual data sets is provided in Figure 4a and Figure 4b, respectively. First, it is
obvious that the distribution over the four POS classes largely adheres to that of the
original monolingual Multi-SimLex data sets, and that the variance is quite low: For
example, the ENG-FRA data set contains the lowest proportion of nouns (49.21%) and
the highest proportion of verbs (27.1%), adjectives (15.28%), and adverbs (8.41%). On
the other hand, the distribution over similarity intervals in Figure 4a shows a much
greater variance. This is again expected, as this pattern resurfaces in the monolingual data
sets (see Table 6). It is also evident that the data are skewed toward lower-similarity
concept pairs. However, due to the joint size of all crosslingual data sets (see Table 11),
even the least represented intervals contain a substantial number of concept pairs. For
instance, the RUS-YUE data set contains the fewest highly similar concept pairs (in the
interval [4, 6]) of all 66 crosslingual data sets. Nevertheless, the absolute number of pairs
(138) in that interval for RUS-YUE is still substantial. If needed, this makes it possible
to create smaller data sets that are balanced across the similarity spectra through sub-sampling.

7. Monolingual Evaluation of Representation Learning Models

After the numerical and qualitative analyses of the Multi-SimLex data sets provided
in §§ 5.3–5.4, we now benchmark a series of representation learning models on the
new evaluation data. We evaluate standard static word embedding algorithms such
as fastText (Bojanowski et al. 2017), as well as a range of more recent text encoders
pretrained on language modeling objectives, such as multilingual BERT (Devlin et al. 2019). These
experiments provide strong baseline scores on the new Multi-SimLex data sets and offer
a first large-scale analysis of pretrained encoders on word-level semantic similarity
across diverse languages. In addition, the experiments, now enabled by Multi-SimLex,
aim to answer several important questions. (Q1) Is it viable to extract high-quality word-level
representations from pretrained encoders receiving subword-level tokens as input?
Are such representations competitive with standard static word-level embeddings?
(Q2) What are the implications of monolingual pretraining versus (massively) multilingual
pretraining for performance? (Q3) Do lightweight unsupervised post-processing
techniques improve word representations consistently across different languages? (Q4)
Can we effectively transfer available external lexical knowledge from resource-rich
languages to resource-lean languages in order to learn word representations that distinguish
between true similarity and conceptual relatedness (see the discussion in §2.3)?

7.1 Models in Comparison

Static Word Embeddings in Different Languages. First, we evaluate a standard method for
inducing non-contextualized (i.e., static) word embeddings across a plethora of different
languages: FASTTEXT (FT) vectors (Bojanowski et al. 2017) are currently the most popular
and robust choice given 1) the availability of pretrained vectors in a large number
of languages (Grave et al. 2018) trained on large Common Crawl (CC) plus Wikipedia
(Wiki) data, and 2) their superior performance across a range of NLP tasks (Mikolov
et al. 2018). In fact, FASTTEXT is an extension of the standard word-level CBOW and
skip-gram word2vec models (Mikolov et al. 2013) that takes into account subword-level
information, namely, the constituent character n-grams of each word (Zhu, Vulić, and
Korhonen 2019). For this reason, FASTTEXT is also better suited for modeling rare words
and morphologically rich languages.9

9 We have also trained standard word-level CBOW and skip-gram with negative sampling (SGNS) on full
Wikipedia dumps for several languages, but our preliminary experiments have verified that they
under-perform compared to FASTTEXT. This finding is consistent with other recent studies demonstrating
the usefulness of subword-level information (Vania and Lopez 2017; Mikolov et al. 2018; Zhu, Vulić, and
Korhonen 2019; Zhu et al. 2019). Therefore, we do not report the results with CBOW and SGNS for brevity.


We rely on 300-dimensional FT word vectors trained on CC+Wiki and available
online for 157 languages.10 The word vectors for all languages are obtained by CBOW
with position-weights, with character n-grams of length 5, a window of size 5, 10
negative examples, and 10 training epochs. We also probe another (older) collection
of FT vectors, pretrained on full Wikipedia dumps of each language.11 The vectors are
300-dimensional, trained with the skip-gram objective for 5 epochs, with 5 negative
examples, a window size set to 5, and relying on all character n-grams of length 3 to
6. Following prior work, we trim the vocabularies for all languages to the 200K most
frequent words and compute representations for multiword expressions by averaging
the vectors of their constituent words.
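
In practice, embedding the Multi-SimLex concepts with these vectors amounts to a lookup plus averaging. A minimal sketch (the 200K cutoff and the plain-text .vec format follow the description above):

    import numpy as np

    def load_vectors(path, max_words=200_000):
        vectors = {}
        with open(path, encoding="utf-8") as f:
            next(f)                              # header: "<size> <dim>"
            for i, line in enumerate(f):
                if i >= max_words:
                    break
                word, rest = line.rstrip().split(" ", 1)
                vectors[word] = np.array(rest.split(), dtype=np.float32)
        return vectors

    def embed_concept(concept, vectors):
        # Multiword expressions: average the vectors of constituent words.
        vecs = [vectors[w] for w in concept.split() if w in vectors]
        return np.mean(vecs, axis=0) if vecs else None   # None marks an OOV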

Unsupervised Post-Processing. Further, we consider a variety of unsupervised post-processing
steps that can be applied post-training on top of any pretrained input word
embedding space without any external lexical semantic resource. So far, the usefulness
of such methods has been verified only on the English language through benchmarks
for lexical semantics and sentence-level tasks (Mu, Bhat, and Viswanath 2018). In this
article, we assess whether unsupervised post-processing is beneficial also in other languages.
To this end, we apply the following post hoc transformations on the initial word
embeddings:

(1) Mean centering (MC) is applied after unit length normalization to ensure

that all vectors have a zero mean, and is commonly applied in data mining
and analysis (Bro and Smilde 2003; van den Berg et al. 2006).

(2)

All-but-the-top (ABTT) (Mu, Bhat, and Viswanath 2018; Tang, Mousavi, and
de Sa 2019) eliminates the common mean vector and a few top dominating
directions (according to principal component analysis) from the input
distributional word vectors, since they do not contribute toward
distinguishing the actual semantic meaning of different words. The
method contains a single (tunable) hyperparameter d_A, which denotes the
number of dominating directions to remove from the initial
representations. Previous work has verified the usefulness of ABTT in
several English lexical semantic tasks such as semantic similarity, word
analogies, and concept categorization, as well as in sentence-level text
classification tasks (Mu, Bhat, and Viswanath 2018).

(3)

UNCOVEC (Artetxe et al. 2018) adjusts the similarity order of an arbitrary
input word embedding space, and can emphasize either syntactic or
semantic information in the transformed vectors. In short, it transforms
the input space X into an adjusted space XW_α through a linear map W_α
controlled by a single hyperparameter α. The nth-order similarity
transformation of the input word vector space X (for which n = 1) can be
obtained as M_n(X) = M_1(X W_{(n−1)/2}), with W_α = QΓ^α, where Q and Γ are
the matrices obtained via the eigendecomposition of X^T X = QΓQ^T. Γ is a
diagonal matrix containing the eigenvalues of X^T X; Q is an orthogonal matrix
with the eigenvectors of X^T X as columns. While the motivation for the
UNCOVEC method does originate from adjusting discrete similarity
10 https://fasttext.cc/docs/en/crawl-vectors.html.
11 https://fasttext.cc/docs/en/pretrained-vectors.html.


orders, note that α is in fact a continuous real-valued hyperparameter that
can be carefully tuned. For more technical details we refer the reader to the
original work of Artetxe et al. (2018).

As mentioned, all post-processing methods can be seen as unsupervised retrofitting
methods that, given an arbitrary input vector space X, produce a perturbed/transformed
output vector space X′, but, unlike common retrofitting methods (Faruqui
et al. 2015; Mrkšić et al. 2017), the perturbation is completely unsupervised (i.e., self-contained)
and does not inject any external (semantic similarity-oriented) knowledge
into the vector space. Note that different perturbations can also be stacked: for example,
we can apply UNCOVEC and then use ABTT on top of the output UNCOVEC
vectors. When using UNCOVEC and ABTT we always length-normalize and mean-center
the data first (i.e., we apply the simple MC normalization). Finally, we tune the two
hyperparameters d_A (for ABTT) and α (for UNCOVEC) on the English Multi-SimLex and use
the same values on the data sets of all other languages; we report results with d_A = 3
or d_A = 10, and α = −0.3.
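
The three transformations operate on a plain embedding matrix and can be sketched as follows (a minimal sketch under our reading of the methods, with X holding one row per word type; the exact ABTT and UNCOVEC formulations are in the papers cited above):

    import numpy as np

    def mean_center(X):
        X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit length first
        return X - X.mean(axis=0)

    def abtt(X, d_a=3):
        # Remove the mean and the top d_a principal directions.
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        top = Vt[:d_a]                            # dominating directions
        return Xc - Xc @ top.T @ top

    def uncovec(X, alpha=-0.3):
        # W_alpha = Q Gamma^alpha, with X^T X = Q Gamma Q^T.
        eigvals, Q = np.linalg.eigh(X.T @ X)
        eigvals = np.clip(eigvals, 1e-12, None)   # numerical safety
        return X @ Q @ np.diag(eigvals ** alpha)

    # Stacking, e.g., configuration (1)+(2)+(5)+(3) from Table 12:
    # X_out = abtt(uncovec(mean_center(X)), d_a=3)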

Contextualized Word Embeddings. We also evaluate the capacity of unsupervised pretraining
architectures based on language modeling objectives to reason over lexical semantic
similarity. To the best of our knowledge, our article is the first study performing such
an analysis. State-of-the-art models such as BERT (Devlin et al. 2019), XLM (Conneau and
Lample 2019), or RoBERTa (Liu et al. 2019b) are typically very deep neural networks
based on the Transformer architecture (Vaswani et al. 2017). They receive subword-level
tokens as inputs (such as WordPieces [Schuster and Nakajima 2012]) to tackle data
sparsity. As output, they return contextualized embeddings, dynamic representations
for words in context.

To represent words or multiword expressions through a pretrained model, we
follow prior work (Liu et al. 2019a) and compute an input item's representation by 1)
feeding it to a pretrained model in isolation; then 2) averaging the H hidden representations
(bottom-to-top) for each of the item's constituent subwords; and then finally 3)
averaging the resulting subword representations to produce the final d-dimensional representation,
where d is the embedding and hidden-layer dimensionality (e.g., d = 768
with BERT). We opt for this approach due to its proven viability and simplicity (Liu
et al. 2019a), as it does not require any additional corpora to condition the induction of
contextualized embeddings.12 Other ways to extract the representations from pretrained
models (Aldarmaki and Diab 2019; Wu et al. 2019; Cao, Kitaev, and Klein 2020) are
beyond the scope of this work, and we will experiment with them in the future.
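
A minimal sketch of this encoding strategy with the HuggingFace Transformers interface (M-BERT shown; treating the embedding output plus the first Transformer layers as the first H = 4 hidden representations is our assumption here):

    import torch
    from transformers import AutoModel, AutoTokenizer

    NAME = "bert-base-multilingual-uncased"
    tok = AutoTokenizer.from_pretrained(NAME)
    model = AutoModel.from_pretrained(NAME, output_hidden_states=True).eval()

    def encode(expression, n_layers=4):
        batch = tok(expression, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).hidden_states        # tuple of (1, T, d)
        layers = torch.stack(hidden[:n_layers]).mean(0)  # average first H layers
        subwords = layers[0, 1:-1]                       # drop [CLS] and [SEP]
        return subwords.mean(0)                          # average over subwords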

En otras palabras, we treat each pretrained encoder ENC as a black-box function to
encode a single word or a multiword expression x in each language into a d-dimensional
contextualized representation xENC ∈ Rd = ENC(X) (p.ej., re = 768 with BERT). As mul-
tilingual pretrained encoders, we experiment with the multilingual BERT model
(M-BERT) (Devlin et al. 2019) and XLM (Conneau and Lample 2019). M-BERT is
pretrained on monolingual Wikipedia corpora of 102 idiomas (comprising all
Multi-SimLex languages) with a 12-layer Transformer network, and yields 768-
dimensional representations. Because the concept pairs in Multi-SimLex are lowercased,

12 We also tested another encoding method where we fed pairs instead of single words/concepts into the
pretrained encoder. The rationale is that the other concept in the pair can be used as a disambiguation
signal. However, this method consistently led to sub-par performance across all experimental runs.


we use the uncased version of M-BERT.13 M-BERT comprises all Multi-SimLex languages,
and its evident ability to perform crosslingual transfer (Pires, Schlinger, and Garrette
2019; Wu and Dredze 2019; Wang et al. 2020) also makes it a convenient baseline model
for crosslingual experiments later in §8. The second multilingual model we consider,
XLM-100,14 is pretrained on Wikipedia dumps of 100 languages, and encodes each
concept into a 1,280-dimensional representation. In contrast to M-BERT, XLM-100 drops
the next-sentence prediction objective and adds a crosslingual masked language modeling
objective. For both encoders, the representations of each concept are computed as
averages over the first H = 4 hidden layers in all experiments.15

Besides M-BERT and XLM, covering multiple languages, we also analyze the performance
of "language-specific" BERT and XLM models for the languages where they are
available: Finnish, Spanish, English, Mandarin Chinese, and French. The main goal of
this comparison is to study the differences in performance between multilingual "one-size-fits-all"
encoders and language-specific encoders. For all experiments, we rely on
the pretrained models released in the Transformers repository (Wolf et al. 2019).16

Unsupervised post-processing steps devised for static word embeddings (i.e.,
mean-centering, ABTT, UNCOVEC) can also be applied on top of contextualized embeddings
if we predefine a vocabulary of word types V that will be represented in a word
vector space X. We construct such a V for each language as the intersection of word types
covered by the corresponding CC+Wiki fastText vectors and the (single-word or multiword)
expressions appearing in the corresponding Multi-SimLex data set.
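
A sketch of this setup, assuming a hypothetical `fasttext_vocab` set and the expressions of one Multi-SimLex data set; encoding every type in V (e.g., with a function like encode() above) yields a static matrix X to which the post-processing steps apply unchanged:

    import numpy as np

    vocab = sorted(set(fasttext_vocab) & set(multisimlex_expressions))
    X = np.vstack([encode(w).numpy() for w in vocab])  # one row per word type
    X = abtt(mean_center(X), d_a=3)                    # post-process as before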

Finally, note that it is not feasible to evaluate the full range of available pretrained
encoders within the scope of this work. Our main intention is to provide a first set of
baseline results on Multi-SimLex by benchmarking a sample of the most popular encoders,
at the same time also investigating other important questions such as the performance of
static versus contextualized word embeddings, or multilingual versus language-specific
pretraining. Another purpose of the experiments is to outline the wide potential and
applicability of the Multi-SimLex data sets for multilingual and crosslingual representation
learning evaluation.

7.2 Results and Discussion

The results we report are Spearman’s ρ coefficients of the correlation between the ranks
derived from the scores of the evaluated models and the human scores provided in each
Multi-SimLex data set. The main results with static and contextualized word vectors
for all test languages are summarized in Table 12. The scores reveal several interesting
patterns, and also pinpoint the main challenges for future work.
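
The evaluation protocol itself reduces to a few lines; a sketch, assuming an `embed` function such as the ones above (OOV pairs are excluded, as in Table 12):

    import numpy as np
    from scipy.stats import spearmanr

    def evaluate(pairs, gold_scores, embed):
        preds, golds = [], []
        for (w1, w2), gold in zip(pairs, gold_scores):
            v1, v2 = embed(w1), embed(w2)
            if v1 is None or v2 is None:          # skip OOV concept pairs
                continue
            cos = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            preds.append(cos)
            golds.append(gold)
        return spearmanr(preds, golds).correlation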

State-of-the-Art Representation Models. The absolute scores of CC+Wiki FT, Wiki FT, and
M-BERT are not directly comparable, because these models have different coverage. In
particular, Multi-SimLex contains some out-of-vocabulary (OOV) words whose static
13 https://github.com/google-research/bert/blob/master/multilingual.md.
14 https://github.com/facebookresearch/XLM.
15 In our preliminary experiments on several language pairs, we have also verified that this choice is
superior to: a) using the output of only the embedding layer, and b) averaging over all hidden layers (i.e.,
H = 12 for the BERT-BASE architecture). Likewise, using the special prepended '[CLS]' token rather than
the constituent subwords to encode a concept also led to much worse performance across the board.
16 github.com/huggingface/transformers. The full list of currently supported pretrained encoders is

available here: huggingface.co/models.


Table 12
A summary of results (Spearman's ρ correlation scores) on the full monolingual Multi-SimLex
data sets for 12 languages. We benchmark fastText word embeddings trained on two different
corpora (CC+Wiki and only Wiki) as well as the multilingual M-BERT model (see §7.1). Results
with the initial word vectors are reported (i.e., without any unsupervised post-processing), as
well as with different unsupervised post-processing methods, described in §7.1. The language
codes are provided in Table 1. The numbers in parentheses refer to the number of OOV concepts
excluded from the computation.

Languages:              CMN   CYM   ENG   EST   FIN   FRA   HEB   POL   RUS   SPA   SWA   YUE
FASTTEXT (CC+Wiki)     (272) (151)  (12) (319) (347)  (43)  (66) (326) (291)  (46) (222)   (–)
(1) FT:INIT            .534  .363  .528  .469  .607  .578  .450  .405  .422  .511  .439    –
(2) FT:+MC             .539  .393  .535  .473  .621  .584  .480  .412  .424  .516  .469    –
(3) FT:+ABTT (−3)      .557  .389  .536  .495  .642  .610  .501  .427  .459  .523  .473    –
(4) FT:+ABTT (−10)     .583  .384  .551  .476  .651  .623  .503  .455  .500  .542  .462    –
(5) FT:+UNCOVEC        .572  .387  .550  .465  .642  .595  .501  .435  .437  .525  .437    –
(1)+(2)+(5)+(3)        .574  .386  .549  .476  .655  .604  .503  .442  .452  .528  .432    –
(1)+(2)+(5)+(4)        .577  .376  .542  .455  .652  .613  .510  .466  .491  .540  .424    –
FASTTEXT (Wiki)        (429) (282)   (6) (343) (345)  (73)  (62) (354) (343)  (57) (379) (677)
(1) FT:INIT            .315  .318  .436  .400  .575  .444  .428  .370  .359  .432  .332  .376
(2) FT:+MC             .373  .337  .445  .404  .583  .463  .447  .383  .378  .447  .373  .427
(3) FT:+ABTT (−3)      .459  .343  .453  .404  .584  .487  .447  .387  .394  .456  .423  .429
(4) FT:+ABTT (−10)     .496  .323  .460  .385  .581  .494  .460  .401  .400  .477  .406  .399
(5) FT:+UNCOVEC        .518  .328  .469  .375  .568  .483  .449  .389  .387  .469  .386  .394
(1)+(2)+(5)+(3)        .526  .323  .470  .369  .564  .495  .448  .392  .392  .473  .388  .388
(1)+(2)+(5)+(4)        .526  .307  .471  .355  .548  .495  .450  .394  .394  .476  .382  .396
M-BERT                   (0)   (0)   (0)   (0)   (0)   (0)   (0)   (0)   (0)   (0)   (0)   (0)
(1) M-BERT:INIT        .408  .033  .138  .085  .162  .115  .104  .069  .085  .145  .125  .404
(2) M-BERT:+MC         .458  .044  .256  .122  .173  .183  .128  .097  .123  .203  .128  .469
(3) M-BERT:+ABTT (−3)  .487  .056  .321  .137  .200  .287  .144  .126  .197  .299  .135  .492
(4) M-BERT:+ABTT (−10) .456  .056  .329  .122  .164  .306  .121  .126  .183  .315  .136  .467
(5) M-BERT:+UNCOVEC    .464  .063  .317  .144  .213  .288  .164  .144  .198  .287  .143  .464
(1)+(2)+(5)+(3)        .464  .083  .326  .130  .201  .304  .149  .122  .199  .295  .148  .456
(1)+(2)+(5)+(4)        .444  .086  .326  .112  .179  .305  .135  .127  .187  .285  .119  .447
AMIAA (Overall)        .764  .742  .794  .715  .760  .812  .699  .723  .667  .703  .710  .792

FT embeddings are not available.17 On the other hand, M-BERT has perfect coverage.
A general comparison between CC+Wiki and Wiki FT vectors, however, supports the
intuition that larger corpora (such as CC+Wiki) yield higher correlations. Another
finding is that a single massively multilingual model such as M-BERT cannot produce
semantically rich word-level representations. Whether this actually happens because
the training objective is different—or because the need to represent 100+ languages
reduces its language-specific capacity—is investigated further below.

The overall results also clearly indicate that (i) there are differences in performance
across different monolingual Multi-SimLex data sets, and (ii) unsupervised post-processing
is universally useful and can lead to large improvements in correlation
scores for many languages. In what follows, we delve deeper into these analyses.

17 We acknowledge that it is possible to approximate word-level representations of OOVs with FT by
summing the constituent n-gram embeddings as proposed by Bojanowski et al. (2017). However, we do
not perform this step, as the resulting embeddings are typically of much lower quality than non-OOV
embeddings (Zhu, Vulić, and Korhonen 2019).


Impact of Unsupervised Post-Processing. First, the results in Table 12 suggest that applying
dimension-wise mean centering to the initial vector spaces has a positive impact on word
similarity scores in all test languages and for all models, both static and contextualized
(see the +MC rows in Table 12). Mimno and Thompson (2017) show that distributional
word vectors have a tendency toward narrow clusters in the vector space (i.e., they
occupy a narrow cone in the vector space and are therefore anisotropic [Mu, Bhat, and
Viswanath 2018; Ethayarajh 2019]), and are prone to the undesired effect of hubness
(Radovanović, Nanopoulos, and Ivanović 2010; Lazaridou, Dinu, and Baroni 2015).18
Applying dimension-wise mean centering has the effect of spreading the vectors across
the hyperplane and mitigating the hubness issue, which consequently improves word-level
similarity, as it emerges from the reported results. Previous work has already
validated the importance of mean centering for clustering-based tasks (Suzuki et al.
2013), bilingual lexicon induction with crosslingual word embeddings (Artetxe, Labaka,
and Agirre 2018a; Zhang et al. 2019; Vulić et al. 2019), and for modeling lexical semantic
change (Schlechtweg et al. 2019). However, to the best of our knowledge, the results
summarized in Table 12 are the first evidence that also confirms its importance for
semantic similarity in a wide array of languages. In sum, as a general rule of thumb,
we suggest always mean-centering representations for semantic tasks.

The results further indicate that additional post-processing methods such as ABTT
and UNCOVEC on top of mean-centered vector spaces can lead to further gains in most
languages. The gains are even visible for languages that start from high correlation
scores: for example, CMN with CC+Wiki FT increases from 0.534 to 0.583, from 0.315
to 0.526 with Wiki FT, and from 0.408 to 0.487 with M-BERT. Similarly, for RUS with
CC+Wiki FT we can improve from 0.422 to 0.500, and for FRA the scores improve from
0.578 to 0.613. There are additional similar cases reported in Table 12.

Overall, the unsupervised post-processing techniques seem universally useful
across languages, but their efficacy and relative performance do vary across different
languages. Note that we have not carefully fine-tuned the hyperparameters of the
evaluated post-processing methods, so additional small improvements can be expected
for some languages. The main finding, however, is that these post-processing techniques
are robust for semantic similarity computations beyond English, and are truly language
independent. For example, removing dominant latent (PCA-based) components from
word vectors emphasizes semantic differences between different concepts, as only shared
non-informative latent semantic knowledge is removed from the representations.

In summary, pretrained word embeddings do contain more information pertaining
to semantic similarity than revealed in the initial vectors. This way, we have corroborated
the hypotheses from prior work (Mu, Bhat, and Viswanath 2018; Artetxe et al.
2018) that were not previously empirically verified on other languages due to a shortage
of evaluation data; this gap has now been filled with the introduction of the Multi-SimLex
data sets. In all follow-up experiments, we always explicitly denote which post-processing
configuration is used in evaluation.

POS-Specific Subsets. We present the results for subsets of word pairs grouped by POS
class in Table 13. Prior work based on English data showed that representations for
nouns are typically of higher quality than those for the other POS classes (Schwartz,

18 Hubness can be defined as the tendency of some points/vectors (i.e., "hubs") to be nearest neighbors of
many points in a high-dimensional (vector) space (Radovanović, Nanopoulos, and Ivanović 2010;
Lazaridou, Dinu, and Baroni 2015; Conneau et al. 2018a).


Table 13
Spearman's ρ correlation scores over the four POS classes represented in Multi-SimLex data sets.
In addition to the word vectors considered earlier in Table 12, we also report scores for another
contextualized model, XLM-100. The numbers in parentheses refer to the total number of
POS-class pairs in the original ENG data set and, consequently, in all other monolingual data
sets. For the comparison with the average human performance, we refer the reader to the APIAA
and AMIAA scores provided previously in Table 4 and Table 5, respectively.

Languages:         CMN   CYM   ENG   EST   FIN   FRA   HEB   POL   RUS   SPA   SWA   YUE
FASTTEXT (CC+Wiki): FT:INIT
NOUNS (1,051)     .561  .497  .592  .627  .709  .641  .560  .538  .526  .583  .544  .426
VERBS (469)       .511  .265  .408  .379  .527  .551  .458  .384  .464  .499  .391  .252
ADJ (245)         .448  .338  .564  .401  .546  .616  .467  .284  .349  .401  .344  .288
ADV (123)         .622  .187  .482  .378  .547  .648  .491  .266  .514  .423  .172  .103
FASTTEXT (CC+Wiki): FT:+ABTT (−3)
NOUNS             .601  .512  .599  .621  .730  .653  .592  .585  .578  .605  .553  .431
VERBS             .583  .305  .454  .379  .575  .602  .520  .390  .475  .526  .381  .314
ADJ               .526  .372  .601  .427  .592  .646  .483  .316  .409  .411  .402  .312
ADV               .675  .150  .504  .397  .546  .695  .491  .230  .495  .416  .223  .081
M-BERT: M-BERT:+ABTT (−3)
NOUNS             .517  .091  .446  .191  .210  .364  .191  .188  .266  .418  .142  .539
VERBS             .511  .005  .200  .039  .077  .248  .038  .107  .181  .266  .091  .503
ADJ               .227  .050  .226  .028  .128  .193  .044  .046  .002  .099  .192  .267
ADV               .282  .012  .343  .112  .173  .390  .326  .036  .046  .207  .161  .049
XLM-100: XLM:+ABTT (−3)
ALL               .498  .096  .270  .118  .203  .234  .195  .106  .170  .289  .130  .506
NOUNS             .551  .132  .381  .193  .238  .234  .242  .184  .292  .378  .165  .559
VERBS             .544  .038  .169  .006  .190  .132  .136  .073  .095  .243  .047  .570
ADJ               .356  .140  .256  .081  .179  .185  .150  .046  .022  .100  .220  .291
ADV               .284  .017  .040  .086  .043  .027  .221  .014  .022  .315  .095  .156

Reichart, and Rappoport 2015, 2016; Vulić et al. 2017b). We observe a similar trend in
other languages as well. This pattern is consistent across different representation models
and can be attributed to several reasons. First, verb representations need to express a
rich range of syntactic and semantic behaviors rather than purely referential features
(Gruber 1976; Levin 1993; Kipper et al. 2008). Second, low correlation scores on the
adjective and adverb subsets in some languages (e.g., POL, CYM, SWA) might be due to their
low frequency in monolingual texts, which yields unreliable representations. In general,
the variance in performance across different word classes warrants further research in
class-specific representation learning (Baker, Reichart, and Korhonen 2014; Vulić et al.
2017b). The scores further attest the usefulness of unsupervised post-processing, as
almost all class-specific correlation scores are improved by applying mean-centering
and ABTT. Finally, the results for M-BERT and XLM-100 in Table 13 further confirm that
massively multilingual pretraining cannot yield reasonable semantic representations
for many languages: Indeed, for some classes they display no correlation with human
ratings at all.

Differences across Languages. Naturally, the results from Tables 12 and 13 also reveal
that there is variation in the performance of both static word embeddings and pretrained
encoders across different languages. Among other causes, the lowest absolute scores
with FT are reported for languages with the least resources available to train monolingual


word embeddings, such as Kiswahili, Welsh, and Estonian. The low performance on
Welsh is especially indicative: Figure 1 shows that the ratings in the Welsh data set
match up very well with the English ratings, but we cannot achieve the same level
of correlation in Welsh with Welsh FT word embeddings. The difference in performance
between two closely related languages, EST (low-resource) and FIN (high-resource),
provides additional evidence in this respect.

The highest reported scores with M-BERT and XLM-100 are obtained for Mandarin
Chinese and Yue Chinese: This effectively points to the weaknesses of massively multilingual
training with a joint subword vocabulary spanning 102 and 100 languages.
Because of the difference in scripts, "language-specific" subwords for YUE and CMN
do not need to be shared across a vast number of languages, and the quality of their
representation remains unscathed. This effectively means that M-BERT's subword vocabulary
contains plenty of CMN-specific and YUE-specific subwords that are exploited
by the encoder when producing M-BERT-based representations. Simultaneously, higher
scores with M-BERT (and XLM in Table 13) are reported for resource-rich languages such
as French, Spanish, and English, which are better represented in M-BERT's training data,
while we observe large performance losses for lower-resource languages: These artifacts
of massively multilingual training with M-BERT and XLM and lower performance in
low-resource languages were further validated recently (Lauscher et al. 2020; Wu and
Dredze 2020). We also observe lower absolute scores (and a larger number of OOVs) for
languages with very rich and productive morphological systems such as the two Slavic
languages (Polish and Russian) and Finnish. Because Polish and Russian are known
to have large Wikipedias and Common Crawl data (Conneau et al. 2019) (e.g., their
Wikipedias are in the top 10 largest Wikipedias worldwide), the problem with coverage
can be attributed exactly to the proliferation of morphological forms in those languages.

Finally, although Table 12 does reveal that unsupervised post-processing is useful
for all languages, it also demonstrates that peak scores are achieved with different post-processing
configurations. This finding suggests that more careful language-specific
fine-tuning is indeed needed to refine word embeddings toward semantic similarity.
We plan to inspect the relationship between post-processing techniques and linguistic
properties in more depth in future work.

Multilingual vs. Language-Specific Contextualized Embeddings. Recent work has shown
that—despite the usefulness of massively multilingual models such as M-BERT and
XLM-100 for zero-shot crosslingual transfer (Pires, Schlinger, and Garrette 2019; Wu and
Dredze 2019)—stronger results in downstream tasks for a particular language can be
achieved by pretraining language-specific models on language-specific data.

In this experiment, motivated by the low results of M-BERT and XLM-100 (see
again Table 13), we assess whether monolingual pretrained encoders can produce higher-quality
word-level representations than multilingual models. Therefore, we evaluate
language-specific BERT and XLM models for a subset of the Multi-SimLex languages
for which such models are currently available: Finnish (Virtanen et al. 2019) (BERT-BASE
architecture, uncased), French (Le et al. 2019) (the FlauBERT model based on
XLM), English (BERT-BASE, uncased), Mandarin Chinese (BERT-BASE) (Devlin et al. 2019),
and Spanish (BERT-BASE, uncased). In addition, we also evaluate a series of pretrained
encoders available for English: (i) BERT-BASE, BERT-LARGE, and BERT-LARGE with whole
word masking (WWM) from the original work on BERT (Devlin et al. 2019), (ii) monolingual
"English-specific" XLM (Conneau and Lample 2019), and (iii) two models that
use parameter reduction techniques to build more compact encoders: ALBERT-B uses a

878

yo

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

4
6
4
8
4
7
1
8
8
8
2
8
7
/
C
oh

yo
i

_
a
_
0
0
3
9
1
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Vuli´c et al.

Multi-SimLex

configuration similar to BERT-BASE, while ALBERT-L is similar to BERT-LARGE, but with
an 18× reduction in the number of parameters (Lan et al. 2020).19

Figure 5
(a) Monolingual vs. multilingual. A performance comparison between monolingual pretrained
language encoders and massively multilingual encoders. For four languages (CMN, ENG, FIN,
SPA), we report the scores with monolingual uncased BERT-BASE architectures and the multilingual
uncased M-BERT model, while for FRA we report the results of the multilingual XLM-100
architecture and a monolingual French FlauBERT model (Le et al. 2019), which is based on the
same architecture as XLM-100. (b) Pretrained ENG encoders. A comparison of various pretrained
encoders available for English. All these models are post-processed via ABTT (−3).

19 All models and their further specifications are available at the following link:
https://huggingface.co/models.

From the results in Figure 5, it is clear that monolingual pretrained encoders yield
much more reliable word-level representations. The gains are visible even for languages
such as CMN, which showed reasonable performance with M-BERT, and are substantial
on all test languages. This further confirms the validity of language-specific pretraining
in lieu of multilingual training, if sufficient monolingual data are available. Additional
comparisons along these axes are available in related work (Vulić et al. 2020). Moreover,
a comparison of pretrained English encoders in Figure 5b largely follows the intuition:
The larger BERT-LARGE model yields slight improvements over BERT-BASE, and we can
improve a bit more by relying on word-level (i.e., lexical-level) masking. Finally, lightweight
ALBERT model variants are quite competitive with the original BERT models,
with only modest drops reported, and ALBERT-L again outperforms ALBERT-B. Overall,
it is interesting to note that the scores obtained with monolingual pretrained encoders
are on a par with or even outperform static FT word embeddings: This is a very intriguing
finding per se, as it shows that such subword-level models trained on large corpora
can implicitly capture rich lexical semantic knowledge.
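For illustration, the sketch below shows one simple way to obtain word-level vectors from a pretrained encoder (mean-pooling the subword vectors of a word encoded in isolation, as one instantiation of the aggregation described in §7.1) and to score word pairs against human ratings with Spearman's ρ. It is a simplified stand-in: the exact pooling configuration used in our experiments may differ, and the word pairs shown are toy examples.

    import torch
    from scipy.stats import spearmanr
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    model.eval()

    def word_vector(word: str) -> torch.Tensor:
        # Encode the word in isolation; mean-pool its subword vectors,
        # skipping the [CLS] and [SEP] special tokens.
        inputs = tokenizer(word, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
        return hidden[1:-1].mean(dim=0)

    def spearman_score(pairs):
        # pairs: iterable of (word1, word2, human_rating) triples.
        cosines, ratings = [], []
        for w1, w2, rating in pairs:
            sim = torch.cosine_similarity(word_vector(w1), word_vector(w2), dim=0)
            cosines.append(sim.item())
            ratings.append(rating)
        return spearmanr(cosines, ratings).correlation

    print(spearman_score([("vanish", "disappear", 9.8), ("happy", "young", 1.1)]))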

Similarity-Specialized Word Embeddings. Conflating distinct lexico-semantic relations is
a well-known property of distributional representations (Turney and Pantel 2010;
Melamud et al. 2016). Semantic specialization fine-tunes distributional spaces to emphasize
a particular lexico-semantic relation in the transformed space by injecting external
lexical knowledge (Glavaš, Ponti, and Vulić 2019). Explicitly discerning between true
semantic similarity (as captured in Multi-SimLex) and broad conceptual relatedness
benefits a number of tasks, as discussed in §2.1.20 Because most languages lack dedicated
lexical resources, however, one viable strategy to steer monolingual word vector
spaces to emphasize semantic similarity is through crosslingual transfer of lexical
knowledge, usually through a shared crosslingual word vector space (Ruder, Vulić,
and Søgaard 2019). Therefore, we evaluate the effectiveness of specialization transfer
methods using Multi-SimLex as our multilingual test bed.

We evaluate a current state-of-the-art crosslingual specialization transfer method
with minimal requirements, put forth recently by Ponti et al. (2019c).21 In a nutshell,
their LI-POSTSPEC method is a multistep procedure that operates as follows. First, the
knowledge about semantic similarity is extracted from WordNet in the form of triplets,
that is, linguistic constraints (w1, w2, r), where w1 and w2 are two concepts, and r is
a relation between them obtained from WordNet (e.g., synonymy or antonymy). The
goal is to “attract” synonyms closer to each other in the transformed vector space, as
they reflect true semantic similarity, and to “repel” antonyms further apart. In the second
step, the linguistic constraints are translated from English to the target language via a
shared crosslingual word vector space. To this end, following Ponti et al. (2019c) we rely
on crosslingual word embeddings (CLWEs) (Joulin et al. 2018) available online, which
are based on Wiki FT vectors.22 Following that, a constraint refinement step is applied
in the target language, which aims to eliminate the noise inserted during the translation
process. This is done by training a relation classification tool: It is trained again
on the English linguistic constraints and then used on the translated target language
constraints, where the transfer is again enabled via a shared crosslingual word vector
space.23 Finally, a state-of-the-art monolingual specialization procedure from Ponti et al.
(2018b) injects the (now target language) linguistic constraints into the target language
distributional space.
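The constraint-translation step above can be pictured with the following sketch, which translates each English constraint by nearest-neighbor lookup in a shared crosslingual space. All names here are illustrative assumptions rather than the authors' code, and the actual method additionally applies the relation-classifier refinement described in the text.

    import numpy as np

    def translate_constraints(constraints, eng_vecs, tgt_matrix, tgt_words):
        """Map English (w1, w2, relation) triplets into a target language.

        eng_vecs: dict of English words -> vectors already projected into
                  the shared crosslingual space.
        tgt_matrix: (V, d) matrix of target-language vectors in that space;
        tgt_words: list of the V corresponding target words."""
        T = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)

        def nearest(word):
            if word not in eng_vecs:
                return None
            v = eng_vecs[word]
            return tgt_words[int((T @ (v / np.linalg.norm(v))).argmax())]

        translated = []
        for w1, w2, rel in constraints:
            t1, t2 = nearest(w1), nearest(w2)
            if t1 is not None and t2 is not None and t1 != t2:
                # Still noisy: a relation classifier later filters these.
                translated.append((t1, t2, rel))
        return translated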

The scores are summarized in Table 14. Semantic specialization with LI-POSTSPEC
leads to substantial improvements in correlation scores for the majority of the target
languages, demonstrating the importance of external semantic similarity knowledge
for semantic similarity reasoning. However, we also observe deteriorated performance
for the three target languages that can be considered the lowest-resource ones in our
set: CYM, SWA, YUE. We hypothesize that this occurs due to the inferior quality of
the underlying monolingual Wikipedia word embeddings, which generates a chain of
error accumulations. In particular, poor distributional word estimates compromise the
alignment of the embedding spaces, which in turn results in increased translation noise
and reduced refinement ability of the relation classifier. On a high level, this “poor get
poorer” observation again points to the fact that one of the primary causes of low performance
of resource-low languages in semantic tasks is the sheer lack of even unlabeled
data for distributional training. On the other hand, as we see from Table 13, typological
dissimilarity between the source and the target does not deteriorate the effectiveness
of semantic specialization. In fact, LI-POSTSPEC does yield substantial gains also for
typologically distant targets such as HEB, CMN, and EST. The critical problem indeed
seems to be insufficient raw data for monolingual distributional training.

20 For an overview of specialization methods for semantic similarity, we refer the interested reader to the
recent tutorial (Glavaš, Ponti, and Vulić 2019).

21 We have also evaluated other specialization transfer methods (e.g., Glavaš and Vulić 2018; Ponti
et al. 2018b), but they are consistently outperformed by the method of Ponti et al. (2019c).

22 https://fasttext.cc/docs/en/aligned-vectors.html; for target languages for which there are no
pretrained CLWEs, we induce them following the same procedure of Joulin et al. (2018).

23 We again follow Ponti et al. (2019c) and use a state-of-the-art relation classifier (Glavaš and Vulić 2018).
We refer the reader to the original work for additional technical details related to the classifier design.


Table 14
The impact of vector space specialization for semantic similarity. The scores are reported using
the current state-of-the-art specialization transfer LI-POSTSPEC method of Ponti et al. (2019c),
relying on English as a resource-rich source language and the external lexical semantic
knowledge from the English WordNet.

Languages:       CMN    CYM    ENG    EST    FIN    FRA    HEB    POL    RUS    SPA    SWA    YUE
                 (6)    (282)  (429)  (343)  (73)   (345)  (62)   (354)  (343)  (379)  (57)   (677)
FASTTEXT (Wiki)
FT:INIT          .318   .315    –     .400   .444   .428   .575   .370   .359   .332   .432   .376
LI-POSTSPEC      .584   .204    –     .515   .601   .510   .619   .531   .547   .635   .238   .267


8. Crosslingual Evaluation

Similar to the monolingual evaluation in §7, we now evaluate several state-of-the-art
crosslingual representation models on the suite of 66 automatically constructed crosslingual
Multi-SimLex data sets. Again, note that evaluating the full range of crosslingual
models available in the rich prior work on crosslingual representation learning is well
beyond the scope of this article. We therefore focus our crosslingual analyses on several
well-established and indicative state-of-the-art crosslingual models, again spanning
both static and contextualized crosslingual word embeddings.

8.1 Models in Comparison

Static Word Embeddings. We rely on a state-of-the-art mapping-based method for the
induction of CLWEs: VECMAP (Artetxe, Labaka, and Agirre 2018b). The core idea behind
such mapping-based or projection-based approaches is to learn a post hoc alignment
of independently trained monolingual word embeddings (Ruder, Vulić, and Søgaard
2019). Such methods have gained popularity because of their conceptual simplicity and
competitive performance coupled with reduced bilingual supervision requirements:
They support CLWE induction with as little as a few thousand word translation
pairs as the bilingual supervision (Mikolov, Le, and Sutskever 2013; Xing et al. 2015;
Upadhyay et al. 2016; Ruder, Søgaard, and Vulić 2019). More recent work has shown that
CLWEs can be induced with even weaker supervision from small dictionaries spanning
several hundred pairs (Vulić and Korhonen 2016; Vulić et al. 2019), identical strings
(Smith et al. 2017), or even only shared numerals (Artetxe, Labaka, and Agirre 2017).
In the extreme, fully unsupervised projection-based CLWEs extract such seed bilingual
lexicons from scratch on the basis of monolingual data only (Conneau et al. 2018a;
Artetxe, Labaka, and Agirre 2018b; Hoshen and Wolf 2018; Alvarez-Melis and Jaakkola
2018; Chen and Cardie 2018; Mohiuddin and Joty 2019, inter alia).
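As a concrete picture of the mapping-based approach, the sketch below solves the orthogonal Procrustes problem over a seed dictionary, which is the core alignment step shared by most projection-based CLWE methods (the orthogonality constraint follows Xing et al. 2015). It is a simplified stand-in under these assumptions, not the full VECMAP pipeline.

    import numpy as np

    def procrustes_map(X_src: np.ndarray, X_tgt: np.ndarray) -> np.ndarray:
        """Solve min_W ||X_src @ W - X_tgt||_F subject to W orthogonal.

        X_src, X_tgt: (n_pairs, dim) matrices holding the vectors of the
        source and target words from the seed translation dictionary,
        row-aligned pair by pair."""
        U, _, Vt = np.linalg.svd(X_src.T @ X_tgt)
        return U @ Vt  # apply as: mapped_source = X_source_full @ W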

Recent empirical studies (Glavaš et al. 2019; Vulić et al. 2019; Doval et al. 2019) have
compared a variety of unsupervised and weakly supervised mapping-based CLWE
methods, and VECMAP emerged as the most robust and very competitive choice. Therefore,
we focus on 1) its fully unsupervised variant (UNSUPER) in our comparisons. For
several language pairs, we also report scores with two other VECMAP model variants: 2)
a supervised variant that learns a mapping based on an available seed lexicon (SUPER),
and 3) a supervised variant with self-learning (SUPER+SL) that iteratively increases the
seed lexicon and improves the mapping gradually. For a detailed description of these
variants, we refer the reader to recent work (Artetxe, Labaka, and Agirre 2018b; Vulić
et al. 2019). We again use CC+Wiki FT vectors as initial monolingual word vectors,
except for YUE where Wiki FT is used. The seed dictionaries of two different sizes (1k and
5k translation pairs) are based on PanLex (Kamholz, Pool, and Colowick 2014), and are
taken directly from prior work (Vulić et al. 2019),24 or extracted from PanLex following
the same procedure as in the prior work.
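The self-learning variant can be sketched as the loop below (reusing procrustes_map from the previous sketch): alternate between solving Procrustes on the current dictionary and re-inducing a larger dictionary from mutual nearest neighbors in the mapped space. This is a bare-bones approximation; the actual VECMAP implementation adds unit-length normalization, CSLS retrieval, stochastic dictionary induction, and other refinements.

    import numpy as np

    def self_learning(X_src, X_tgt, seed_src, seed_tgt, iters=5):
        """X_src, X_tgt: (V, d) unit-normalized monolingual matrices.
        seed_src, seed_tgt: index lists for the initial seed dictionary.
        Assumes procrustes_map from the previous sketch is in scope."""
        src_idx, tgt_idx = list(seed_src), list(seed_tgt)
        W = None
        for _ in range(iters):
            W = procrustes_map(X_src[src_idx], X_tgt[tgt_idx])
            sims = (X_src @ W) @ X_tgt.T   # cosine, given unit-length rows
            fwd = sims.argmax(axis=1)      # best target for each source word
            bwd = sims.argmax(axis=0)      # best source for each target word
            # Keep mutual nearest neighbours as the new (larger) dictionary.
            src_idx = [i for i in range(len(fwd)) if bwd[fwd[i]] == i]
            tgt_idx = [int(fwd[i]) for i in src_idx]
        return W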


Table 15
Spearman’s ρ correlation scores on all 66 crosslingual data sets. The scores below the main
diagonal are computed based on crosslingual word embeddings (CLWEs) induced by aligning
CC+Wiki FT in all languages (except for YUE where we use Wiki FT) in a fully unsupervised way
(i.e., without any bilingual supervision). We rely on a standard CLWE mapping-based (i.e.,
alignment) approach: VECMAP (Artetxe, Labaka, and Agirre 2018b). The scores above the main
diagonal are computed by obtaining 768-dimensional word-level vectors from pretrained
multilingual BERT (M-BERT) following the procedure described in §7.1. For both fully
unsupervised VECMAP and M-BERT, we report the results with unsupervised post-processing
enabled: All 2 × 66 reported scores are obtained using the +ABTT (−10) variant.

        CMN    CYM    ENG    EST    FIN    FRA    HEB    POL    RUS    SPA    SWA    YUE
CMN      –     .076   .348   .139   .154   .392   .190   .207   .227   .300   .049   .484
CYM     .041    –     .087   .017   .049   .095   .033   .072   .085   .089   .002   .083
ENG     .565   .004    –     .168   .159   .401   .171   .182   .236   .309   .014   .357
EST     .014   .097   .335    –     .143   .161   .100   .113   .083   .134   .025   .124
FIN     .049   .020   .542   .530    –     .195   .077   .110   .111   .157   .029   .167
FRA     .224   .015   .662   .559   .533    –     .191   .229   .297   .382   .038   .382
HEB     .202   .110   .516   .465   .445   .469    –     .095   .154   .181   .038   .185
POL     .121   .028   .464   .415   .465   .534   .412    –     .139   .183   .013   .205
RUS     .032   .037   .511   .408   .476   .529   .430   .390    –     .248   .037   .226
SPA     .546   .048   .498   .450   .490   .600   .462   .398   .419    –     .055   .313
SWA     −.01   .116   .029   .006   .013   −.05   .033   .052   .035   .045    –     .043
YUE     .004   .047   .059   .004   .002   .059   .001   .074   .032   .089   −.02    –


Contextualized Crosslingual Word Embeddings. We again evaluate the capacity of (massively)
multilingual pretrained language models, M-BERT and XLM-100, to reason over
crosslingual lexical similarity. Implicitly, such an evaluation also assesses “the intrinsic
quality” of shared crosslingual word-level vector spaces induced by these methods, and
their ability to boost crosslingual transfer between different language pairs. We rely on
the same procedure of aggregating the models’ subword-level parameters into word-level
representations, already described in §7.1.

As in monolingual settings, we can apply unsupervised post-processing steps such
as ABTT to both static and contextualized crosslingual word embeddings.

8.2 Results and Discussion

Main Results and Differences across Language Pairs. A summary of the results on the 66
crosslingual Multi-SimLex data sets is provided in Table 15 and Figure 6a. The findings
confirm several interesting trends from our previous monolingual experiments (§7.2),
and also corroborate several hypotheses and findings from prior work, now on a large
sample of language pairs and for the task of crosslingual semantic similarity.

24 https://github.com/cambridgeltl/panlex-bli.


Cifra 6
Further performance analyses of crosslingual Multi-SimLex data sets. (a) Spearman’s ρ
correlation scores averaged over all 66 crosslingual Multi-SimLex data sets for two pretrained
multilingual encoders (M-BERT and XLM). The scores are obtained with different configurations
that exclude (INIT) or enable unsupervised post-processing. (b) A comparison of various
pretrained encoders available for the English-French language pair; see the main text for a short
description of each benchmarked pretrained encoder.

First, we observe that the fully unsupervised VECMAP model, despite being the
most robust fully unsupervised method at present, fails to produce a meaningful
crosslingual word vector space for a large number of language pairs (see the bottom
triangle of Table 15): Many correlation scores are in fact no-correlation results, accentuating
the problem of fully unsupervised crosslingual learning for typologically diverse
languages and with smaller amounts of monolingual data (Vulić et al. 2019). The scores
are particularly low across the board for lower-resource languages such as Welsh and
Kiswahili. It also seems that the lack of monolingual data is a larger problem than
typological dissimilarity between language pairs, as we do observe reasonably high
correlation scores with VECMAP for language pairs such as CMN-SPA, HEB-EST, and
RUS-FIN. Nevertheless, typological differences (e.g., morphological richness) still play an
important role, as we observe very low scores when pairing CMN with morphologically
rich languages such as FIN, EST, POL, and RUS. Similar to the prior work of Vulić et al. (2019)
and Doval et al. (2019), and given the fact that unsupervised VECMAP is the most robust
unsupervised CLWE method at present (Glavaš et al. 2019), our results again question
the usefulness of fully unsupervised approaches for a large number of languages,
and call for further developments in the area of unsupervised and weakly supervised
crosslingual representation learning.

The scores of M-BERT and XLM-100 lead to similar conclusions as in the monolingual
settings.25 Reasonable correlation scores are achieved only for a small subset of
resource-rich language pairs (e.g., ENG, FRA, SPA, CMN) that dominate the multilingual
M-BERT training. Interestingly, the scores indicate a much higher performance of language
pairs where YUE is one of the languages when we use M-BERT instead of VECMAP.
This boils down again to the fact that YUE, due to its specific language script, has a
good representation of its words and subwords in the shared M-BERT vocabulary. At
the same time, a reliable VECMAP mapping between YUE and other languages cannot
be found due to a small monolingual YUE corpus. In cases when VECMAP does not
yield a degenerate crosslingual vector space starting from two monolingual ones, the
final correlation scores seem substantially higher than the ones obtained by the single
massively multilingual M-BERT model.

25 The XLM-100 scores are not reported for brevity; they largely follow the patterns observed with M-BERT.
The aggregated scores between the two encoders are also very similar, as indicated by Figure 6a.


Table 16
Results on a selection of crosslingual Multi-SimLex data sets where the fully unsupervised
(UNSUPER) CLWE variant yields reasonable performance. We also show the results with
supervised VECMAP without self-learning (SUPER) and with self-learning (+SL), with two seed
dictionary sizes: 1k and 5k pairs; see §8.1 for more detail. Highest scores for each language pair
are in bold.

             CMN-ENG  ENG-FRA  ENG-SPA  ENG-RUS  EST-FIN  EST-HEB  FIN-HEB  FRA-SPA  POL-RUS  POL-SPA
UNSUPER       .565     .662     .498     .511     .510     .465     .445     .600     .390     .398
SUPER (1k)    .575     .602     .453     .376     .378     .363     .442     .588     .399     .406
+SL (1k)      .577     .703     .547     .548     .591     .513     .488     .639     .439     .456
SUPER (5k)    .587     .704     .542     .535     .518     .473     .585     .631     .455     .463
+SL (5k)      .581     .707     .548     .551     .556     .525     .589     .645     .432     .476


Finally, the results in Figure 6a again verify the usefulness of unsupervised post-processing
also in crosslingual settings. We observe improved performance with both
M-BERT and XLM-100 when mean centering (+MC) is applied, and further gains can
be achieved by using ABTT on the mean-centered vector spaces. A similar finding also
holds for static crosslingual word embeddings,26 where applying ABTT (−10) yields
higher scores on 61 of the 66 language pairs.

Fully Unsupervised vs. Weakly Supervised Crosslingual Embeddings. The results in
Table 15 indicate that fully unsupervised crosslingual learning fails for a large number
of language pairs. However, recent work (Vulić et al. 2019) has noted that these suboptimal
non-alignment solutions with the UNSUPER model can be avoided by relying
on (weak) crosslingual supervision spanning only several thousands or even hundreds
of word translation pairs. Therefore, we examine 1) if we can further improve the results
on crosslingual Multi-SimLex by resorting to (at least some) crosslingual supervision for
resource-rich language pairs; and 2) if such available word-level supervision can also
be useful for a range of languages that displayed near-zero performance in Table 15. In
other words, we test if recent “tricks of the trade” used in the rich literature on CLWE
learning translate into gains on the crosslingual Multi-SimLex data sets.

First, we reassess the findings established on the bilingual lexicon induction task
(Søgaard, Ruder, and Vulić 2018; Vulić et al. 2019): Using at least some crosslingual
supervision is always beneficial compared to using no supervision at all. We report
improvements over the UNSUPER model for all 10 language pairs in Table 16, even
though the UNSUPER method initially produced strong correlation scores. The importance
of self-learning increases with decreasing seed dictionary size, and the +SL
model always outperforms UNSUPER with 1k seed pairs; we observe the same
patterns also with even smaller dictionary sizes than reported in Table 16 (250 and
500 seed pairs). Along the same lines, the results in Table 17 indicate that at least some
supervision is crucial for the success of static CLWEs on resource-leaner language pairs.

26 Note that VECMAP does mean centering by default as one of its preprocessing steps prior to learning the
mapping function (Artetxe, Labaka, and Agirre 2018b; Vulić et al. 2019).


Table 17
Results on a selection of crosslingual Multi-SimLex data sets where the fully unsupervised
(UNSUPER) CLWE variant fails to learn a coherent shared crosslingual space. See also the caption
of Table 16.

CMN-FIN CMN-RUS

UNSUPER

SUPER (1k)
+SL (1k)

.049
.410
.590

.032
.388
.537

CMN-YUE
.004
.372
.458

CYM-FIN CYM-FRA

.020
.384
.471

.015
.475
.578

CYM-POL
.028
.326
.380

FIN-SWA
.013
.206
.264

We note substantial improvements on all language pairs; in fact, the VECMAP model is
able to learn a more reliable mapping when starting from clean supervision. We again note
large gains with self-learning.

Multilingual vs. Bilingual Contextualized Embeddings. Similar to the monolingual settings,
we also inspect whether massively multilingual training in fact dilutes the knowledge
necessary for crosslingual reasoning on a particular language pair. Therefore, we compare
the 100-language XLM-100 model with i) a variant of the same model trained via
masked language modeling on a smaller set of 17 languages (XLM-17); ii) a variant of the
same model trained specifically for the particular language pair via masked language
modeling (XLM-2); and iii) a variant of the bilingual XLM-2 model that also leverages
bilingual knowledge from parallel sentences during joint training (XLM-2++, trained
based on masked language modeling and translation language modeling objectives).
We again use the pretrained models made available by Conneau and Lample (2019),
and we refer to the original work for further technical details (e.g., regarding the core
difference between the pretraining objectives for XLM-2 and XLM-2++).

The results are summarized in Figure 6b, and they confirm the intuition that
massively multilingual pretraining can damage performance even on resource-rich
languages and language pairs. We observe a steep rise in performance when the multilingual
model is trained on a much smaller set of languages (17 vs. 100), and further
improvements can be achieved by training a dedicated bilingual model. Finally, leveraging
bilingual parallel data seems to offer additional slight gains, but the tiny difference
between XLM-2 and XLM-2++ also suggests that this rich bilingual information is not
used in the optimal way within the XLM architecture for semantic similarity.

In summary, these results indicate that, in order to improve performance in crosslingual
transfer tasks, more work should be invested into 1) pretraining dedicated language
pair-specific models, and 2) creative ways of leveraging available crosslingual
supervision (e.g., word translation pairs, parallel or comparable corpora) (Liu et al.
2019a; Wu et al. 2019; Cao, Kitaev, and Klein 2020) with pretraining paradigms such
as BERT and XLM. Using such crosslingual supervision could lead to similar benefits as
those indicated by the results obtained with static crosslingual word embeddings (see Table 16
and Table 17). We believe that Multi-SimLex can serve as a valuable means to track and
guide future progress in this research area.

9. Conclusion and Future Work

We have presented Multi-SimLex, a resource containing human judgments on the semantic
similarity of word pairs for 12 monolingual and 66 crosslingual data sets. The
languages covered are typologically diverse and also include under-resourced ones,
such as Welsh and Kiswahili. The resource covers an unprecedented amount of 1,888
word pairs, carefully balanced according to their similarity score, frequency, concreteness,
part-of-speech class, and lexical field. In addition to Multi-SimLex, we release the
detailed protocol we followed to create this resource. We hope that our consistent
guidelines will encourage researchers to translate and annotate Multi-SimLex–style data
sets for additional languages. This can help create a hugely valuable, large-scale
semantic resource for multilingual NLP research.

The core Multi-SimLex we release with this article already enables researchers to
carry out novel linguistic analyses, and it establishes a benchmark for evaluating
representation learning models. Based on our preliminary analyses, we found that
speakers of closely related languages tend to express equivalent similarity judgments.
In particular, geographical proximity seems to play a greater role than family membership
in determining the similarity of judgments across languages. Moreover, we tested
several state-of-the-art word embedding models, both static and contextualized ones,
as well as several (supervised and unsupervised) post-processing techniques, on Multi-SimLex.
This provides future endeavors to improve multilingual representation learning
with challenging baselines. Moreover, our results provide several important insights
for research on both monolingual and crosslingual word representations:

1) Unsupervised post-processing techniques (mean centering, elimination of
top principal components, adjusting similarity orders) are always
beneficial independently of the language, although the combination
leading to the best scores is language-specific and hence needs to be tuned.

2) Similarity rankings obtained from word embeddings for nouns are better
aligned with human judgments than those for all the other part-of-speech classes
considered here (verbs, adjectives, and, for the first time, adverbs). This
confirms previous generalizations based on experiments on English.

3) The factor having the greatest impact on the quality of word
representations is the availability of raw texts to train them in the first
place, rather than language properties (such as family, geographical area,
typological features).

4) Massively multilingual pretrained encoders such as M-BERT (Devlin et al.
2019) and XLM-100 (Conneau and Lample 2019) fare quite poorly on our
benchmark, whereas pretrained encoders dedicated to a single language
are more competitive with static word embeddings such as fastText
(Bojanowski et al. 2017). Moreover, for language-specific encoders,
parameter reduction techniques reduce performance only marginally.

5) Techniques to inject clean lexical semantic knowledge from external
resources into distributional word representations were proven to be
effective in emphasizing the relation of semantic similarity. In particular,
methods capable of transferring such knowledge from resource-rich to
resource-lean languages (Ponti et al. 2019c) increased the correlation with
human judgments for most languages, except for those with limited
unlabelled data.

Future work can expand our preliminary, yet large-scale study on the ability of
pretrained encoders to reason over word-level semantic similarity in different languages.
For example, we have highlighted how sharing the same encoder parameters
across multiple languages may harm performance. However, it remains unclear if, and
to what extent, the input language embeddings present in XLM-100 but absent in

M-BERT help mitigate this issue. Moreover, pretrained language embeddings can be
obtained both from typological databases (Littell et al. 2017) and from neural architectures
(Malaviya, Neubig, and Littell 2017). Plugging these embeddings into the encoders
in lieu of embeddings trained end-to-end, as suggested by prior work (Tsvetkov et al.
2016; Ammar et al. 2016; Ponti et al. 2019b), might extend the coverage to more resource-lean
languages.

Another important follow-up analysis might involve the comparison of the performance
of representation learning models on multilingual data sets for both word-level
semantic similarity and sentence-level natural language understanding. In particular,
Multi-SimLex fills a gap in available resources for multilingual NLP and might help
understand how lexical and compositional semantics interact if put alongside existing
resources such as XNLI (Conneau et al. 2018b) for natural language inference or PAWS-X
(Yang et al. 2019) for crosslingual paraphrase identification. Further, the Multi-SimLex
annotation could turn out to be a unique source of evidence to study the effects of polysemy
in human judgments on semantic similarity: For equivalent word pairs in multiple
languages, are the similarity scores affected by how many senses the two words (or
multiword expressions) incorporate? Finally, similar to recent work on multilingual
data set creation in NLP (see §3), Multi-SimLex has been created through translation
from English, aiming to ensure as much comparability as possible between concept
pairs in different languages; however, a very interesting path for future work is studying
how different departure points (e.g., English vs. Hebrew vs. Mandarin as the source
language) affect the obtained data, and whether there are any differing or persistent translation
artifacts (Artetxe, Labaka, and Agirre 2020). In the long run, how can we measure to
what degree shared cultural, societal, and economic models shape and affect semantic
similarity reasoning (Quinn and Holland 1987)?

Although Multi-SimLex makes a large leap toward more inclusive and larger-scale
lexical semantic evaluation resources, there are other interesting research challenges
remaining related to data collection, which are envisioned as future research—for example,
developing similar resources for non-written signed languages. Another clear
idea is creating a dedicated resource which targets the similarity of multiword expressions
mono- and crosslingually (e.g., measuring the similarity between martial law and coup
d'état).

In light of the success of initiatives like Universal Dependencies for multilingual
treebanks, we hope that making Multi-SimLex and its guidelines available will encourage
other researchers to expand our current sample of languages (e.g., Arabic has
been added since the completion of this article). We particularly encourage the creation
and submission of comparable Multi-SimLex data sets for under-resourced and typologically
diverse languages in future work. In particular, we have made a Multi-SimLex
community Web site available to facilitate easy creation, gathering, dissemination, and
use of annotated data sets: https://multisimlex.com/.

Acknowledgments
This work is supported by the ERC
Consolidator Grant LEXICAL: Lexical
Acquisition Across Languages (No 648909).
Thierry Poibeau is partly supported by a
PRAIRIE 3IA Institute fellowship
(“Investissements d'avenir” program,
reference ANR-19-P3IA-0001).

References
Adams, Oliver, Adam Makarucha, Graham Neubig, Steven Bird, and Trevor Cohn. 2017.
Cross-lingual word embeddings for low-resource language modeling. In Proceedings of
EACL, pages 937–947. Valencia. DOI: https://doi.org/10.18653/v1/E17-1088

Agirre, Eneko, Enrique Alfonseca, Keith Hall, Jana Kravalová, Marius Pasca, and Aitor
Soroa. 2009. A study on similarity and relatedness using distributional and
WordNet-based approaches. In Proceedings of NAACL-HLT, pages 19–27. Boulder, CO.
DOI: https://doi.org/10.3115/1620754.1620758


Aldarmaki, Hanan and Mona Diab. 2019. Context-aware cross-lingual mapping. In
Proceedings of NAACL-HLT, pages 3906–3911. Minneapolis, MN. DOI:
https://doi.org/10.18653/v1/N19-1391

Alvarez-Melis, David and Tommi Jaakkola. 2018. Gromov-Wasserstein alignment of word
embedding spaces. In Proceedings of EMNLP, pages 1881–1890. Brussels. DOI:
https://doi.org/10.18653/v1/D18-1214

Ammar, Waleed, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah Smith. 2016.
Many languages, one parser. Transactions of the ACL, 4:431–444. DOI:
https://doi.org/10.1162/tacl_a_00109

Artetxe, Mikel, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word
embeddings with (almost) no bilingual data. In Proceedings of ACL, pages 451–462.
Vancouver. DOI: https://doi.org/10.18653/v1/P17-1042

Artetxe, Mikel, Gorka Labaka, and Eneko Agirre. 2018a. Generalizing and improving
bilingual word embedding mappings with a multi-step framework of linear
transformations. In Proceedings of AAAI, pages 5012–5019. New Orleans, LA.

Artetxe, Mikel, Gorka Labaka, and Eneko Agirre. 2018b. A robust self-learning method
for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of
ACL, pages 789–798. Melbourne. DOI: https://doi.org/10.18653/v1/P18-1073

Artetxe, Mikel, Gorka Labaka, and Eneko Agirre. 2020. Translation artifacts in
cross-lingual transfer learning. In Proceedings of EMNLP, volume abs/2004.04721. Online.

Artetxe, Mikel, Gorka Labaka, Iñigo Lopez-Gazpio, and Eneko Agirre. 2018. Uncovering
divergent linguistic information in word embeddings with lessons for intrinsic and
extrinsic evaluation. In Proceedings of CoNLL, pages 282–291. Brussels. DOI:
https://doi.org/10.18653/v1/K18-1028

Artetxe, Mikel, Sebastian Ruder, and Dani Yogatama. 2019. On the cross-lingual
transferability of monolingual representations. CoRR, abs/1910.11856. DOI:
https://doi.org/10.18653/v1/2020.acl-main.421

Baker, Collin F., Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet
project. In Proceedings of ACL, pages 86–90. Montréal. DOI:
https://doi.org/10.3115/980845.980860


Baker, Simon, Roi Reichart, and Anna Korhonen. 2014. An unsupervised model for
instance level subcategorization acquisition. In Proceedings of EMNLP, pages 278–289.
Doha. DOI: https://doi.org/10.3115/v1/D14-1034, PMID: 25178667

Bapna, Ankur and Orhan Firat. 2019. Simple, scalable adaptation for neural machine
translation. In Proceedings of EMNLP, pages 1538–1548. Hong Kong. DOI:
https://doi.org/10.18653/v1/N19-1191

Baroni, Marco, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The
WaCky wide web: A collection of very large linguistically processed web-crawled
corpora. Language Resources and Evaluation, 43(3):209–226. DOI:
https://doi.org/10.1007/s10579-009-9081-4

Baroni, Marco and Alessandro Lenci. 2010. Distributional memory: A general framework
for corpus-based semantics. Computational Linguistics, 36(4):673–721. DOI:
https://doi.org/10.1162/coli_a_00016

Barzegar, Siamak, Brian Davis, Manel Zarrouk, Siegfried Handschuh, and André Freitas.
2018. SemR-11: A multi-lingual gold-standard for semantic similarity and relatedness
for eleven languages. In Proceedings of LREC, pages 3912–3916. Miyazaki.

Bjerva, Johannes and Isabelle Augenstein. 2018. From phonology to syntax:
Unsupervised linguistic typology at different levels with language embeddings. In
Proceedings of NAACL-HLT, pages 907–916. New Orleans, LA. DOI:
https://doi.org/10.18653/v1/N18-1083

Bjerva, Johannes, Robert Östling, Maria Han Veiga, Jörg Tiedemann, and Isabelle
Augenstein. 2019. What do language representations really represent? Computational
Linguistics, 45(2):381–389. DOI: https://doi.org/10.1162/coli_a_00351

Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching
word vectors with subword information. Transactions of the ACL, 5:135–146. DOI:
https://doi.org/10.1162/tacl_a_00051

Bro, Rasmus and Age K. Smilde. 2003. Centering and scaling in component analysis.
Journal of Chemometrics, 17(1):16–33. DOI: https://doi.org/10.1002/cem.773


Bruni, Elia, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional
semantics. Journal of Artificial Intelligence Research, 49:1–47. DOI:
https://doi.org/10.1613/jair.4135

Budanitsky, Alexander and Graeme Hirst. 2006. Evaluating WordNet-based measures of
lexical semantic relatedness. Computational Linguistics, 32(1):13–47. DOI:
https://doi.org/10.1162/coli.2006.32.1.13

Camacho-Collados, Jose and Roberto Navigli. 2017. BabelDomains: Large-scale domain
labeling of lexical resources. In Proceedings of EACL, pages 223–228. Valencia. DOI:
https://doi.org/10.18653/v1/E17-2036

Camacho-Collados, Jose, Mohammad Taher Pilehvar, Nigel Collier, and Roberto Navigli.
2017. SemEval-2017 task 2: Multilingual and cross-lingual semantic word similarity. In
Proceedings of SEMEVAL, pages 15–26. Vancouver. DOI:
https://doi.org/10.18653/v1/S17-2002

Camacho-Collados, José, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. A
framework for the construction of monolingual and cross-lingual word similarity
datasets. In Proceedings of ACL, pages 1–7. Beijing. DOI:
https://doi.org/10.3115/v1/P15-2001

Cao, Steven, Nikita Kitaev, and Dan Klein. 2020. Multilingual alignment of contextual
word representations. In Proceedings of ICLR. Online.

Chen, Danqi and Christopher D. Manning. 2014. A fast and accurate dependency parser
using neural networks. In Proceedings of EMNLP, pages 740–750. Doha. DOI:
https://doi.org/10.3115/v1/D14-1082

Chen, Xilun and Claire Cardie. 2018. Unsupervised multilingual word embeddings. In
Proceedings of EMNLP, pages 261–270. Brussels. DOI:
https://doi.org/10.18653/v1/D18-1024

Chen, Xinying and Kim Gerdes. 2017. Classifying languages by dependency structure.
Typologies of delexicalized universal dependency treebanks. In Proceedings of the 4th
International Conference on Dependency Linguistics (DepLing), pages 54–63. Pisa.

Cimiano, Philipp, Andreas Hotho, and Steffen Staab. 2005. Learning concept hierarchies
from text corpora using formal concept analysis. Journal of Artificial Intelligence
Research, 24:305–339. DOI: https://doi.org/10.1613/jair.1648

Clark, Jonathan H., Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski,
Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for
information-seeking question answering in typologically diverse languages.
Transactions of the ACL, 8:454–470. DOI: https://doi.org/10.1162/tacl_a_00317

Collobert, Ronan and Jason Weston. 2008. A unified architecture for natural language
processing: Deep neural networks with multitask learning. In Proceedings of ICML,
pages 160–167. Helsinki. DOI: https://doi.org/10.1145/1390156.1390177

Collobert, Ronan, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and
Pavel P. Kuksa. 2011. Natural language processing (almost) from scratch. Journal of
Machine Learning Research, 12:2493–2537.

Conneau, Alexis, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume
Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin
Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. CoRR,
abs/1911.02116. DOI: https://doi.org/10.18653/v1/2020.acl-main.747

Conneau, Alexis and Guillaume Lample. 2019. Cross-lingual language model
pretraining. In Proceedings of NeurIPS, pages 7057–7067. Vancouver.

Conneau, Alexis, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and
Hervé Jégou. 2018a. Word translation without parallel data. In Proceedings of ICLR.
Vancouver.

Conneau, Alexis, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman,
Holger Schwenk, and Veselin Stoyanov. 2018b. XNLI: Evaluating cross-lingual
sentence representations. In Proceedings of EMNLP, pages 2475–2485. Brussels. DOI:
https://doi.org/10.18653/v1/D18-1269

Coseriu, Eugenio. 1967. Lexikalische Solidaritäten. Poetica, 1:293–303.

Cruse, David Alan. 1986. Lexical Semantics, Cambridge University Press.

De Deyne, Simon and Gert Storms. 2008. Word associations: Network and semantic
properties. Behavior Research Methods, 40(1):213–231. DOI:
https://doi.org/10.3758/BRM.40.1.213, PMID: 18411545

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT:
Pre-training of deep bidirectional transformers for language understanding. In
Proceedings of NAACL-HLT, pages 4171–4186. Minneapolis, MN.


Doitch, Amichay, Ram Yazdi, Tamir Hazan, and Roi Reichart. 2019. Perturbation based
learning for structured NLP tasks with application to dependency parsing.
Transactions of the ACL, 7:643–659. DOI: https://doi.org/10.1162/tacl_a_00291

Doval, Yerai, Jose Camacho-Collados, Luis Espinosa-Anke, and Steven Schockaert. 2018.
Improving cross-lingual word embeddings by meeting in the middle. In Proceedings of
EMNLP, pages 294–304. Brussels. DOI: https://doi.org/10.18653/v1/D18-1027

Doval, Yerai, Jose Camacho-Collados, Luis Espinosa-Anke, and Steven Schockaert. 2019.
On the robustness of unsupervised and semi-supervised cross-lingual word
embedding learning. CoRR, abs/1908.07742.

Dryer, Matthew S. 2013. Order of subject, object and verb. In Dryer, Matthew S. and
Martin Haspelmath, editors, The World Atlas of Language Structures Online, Max Planck
Institute for Evolutionary Anthropology, Leipzig, pages 330–333.

Ercan, Gökhan and Olcay Taner Yıldız. 2018. AnlamVer: Semantic model evaluation
dataset for Turkish word similarity and relatedness. In Proceedings of COLING,
pages 3819–3836. Santa Fe, NM.

Ethayarajh, Kawin. 2019. How contextual are contextualized word representations?
Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of
EMNLP, pages 55–65. Hong Kong. DOI: https://doi.org/10.18653/v1/D19-1006

Evans, Nicholas. 2011. Semantic typology. In The Oxford Handbook of Linguistic Typology.
Oxford University Press, pages 504–533. DOI:
https://doi.org/10.1093/oxfordhb/9780199281251.013.0024

Faruqui, Manaal, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and
Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of
NAACL-HLT, pages 1606–1615. Denver, CO. DOI:
https://doi.org/10.3115/v1/N15-1184

Fellbaum, Christiane. 1998. WordNet, MIT Press. DOI:
https://doi.org/10.7551/mitpress/7287.001.0001

Finkelstein, Lev, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi
Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited.
ACM Transactions on Information Systems, 20(1):116–131. DOI:
https://doi.org/10.1145/503104.503110

Firth, John R. 1957. A synopsis of linguistic theory, 1930-1955. Studies in Linguistic
Analysis. Blackwell, pages 1–32.

François, Alexandre. 2008. Semantic maps and the typology of colexification. In Martine
Vanhove, editor, From Polysemy to Semantic Change: Towards a Typology of Lexical
Semantic Associations, Studies in Language Companion Series, 106. John Benjamins
Publishing Company, pages 163–215. DOI: https://doi.org/10.1075/slcs.106.09fra

Gerz, Daniela, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. 2016.
SimVerb-3500: A large-scale evaluation set of verb similarity. In Proceedings of EMNLP,
pages 2173–2182. Austin, TX. DOI: https://doi.org/10.18653/v1/D16-1235

Glavaš, Goran, Edoardo Maria Ponti, and Ivan Vulić. 2019. Semantic specialization of
distributional word vectors. In Proceedings of EMNLP: Tutorial Abstracts. Hong Kong.
See https://www.aclweb.org/anthology/D19-2007.

Glavaš, Goran and Ivan Vulić. 2018. Discriminating between lexico-semantic relations
with the specialization tensor model. In Proceedings of NAACL-HLT, pages 181–187.
New Orleans, LA. DOI: https://doi.org/10.18653/v1/N18-2029

Glavaš, Goran and Ivan Vulić. 2018. Explicit retrofitting of distributional word vectors.
In Proceedings of ACL, pages 34–45. Melbourne. DOI:
https://doi.org/10.18653/v1/P18-1004

Glavaš, Goran, Robert Litschko, Sebastian Ruder, and Ivan Vulić. 2019. How to
(properly) evaluate cross-lingual word embeddings: On strong baselines, comparative
analyses, and some misconceptions. In Proceedings of ACL, pages 710–721. Florence.
DOI: https://doi.org/10.18653/v1/P19-1070

Grave, Edouard, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov.
2018. Learning word vectors for 157 languages. In Proceedings of LREC,
pages 3483–3487. Miyazaki.

Gruber, Jeffrey. 1976. Lexical Structures in Syntax and Semantics, volume 25,
North-Holland.

Harris, Zellig S. 1951. Methods in Structural Linguistics, University of Chicago Press.

Hill, Felix, KyungHyun Cho, Anna Korhonen, and Yoshua Bengio. 2016. Learning to
understand phrases by embedding the dictionary. Transactions of the ACL, 4:17–30.
DOI: https://doi.org/10.1162/tacl_a_00080


Hill, Felix, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic
models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.
DOI: https://doi.org/10.1162/tacl_a_00080

Hoshen, Yedid and Lior Wolf. 2018. Non-adversarial unsupervised word translation. In
Proceedings of EMNLP, pages 469–478. Brussels. DOI:
https://doi.org/10.18653/v1/D18-1043

Hu, Junjie, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin
Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for
evaluating cross-lingual generalization. In Proceedings of ICML, pages 4411–4421.
Online.

Huang, Junjie, Fanchao Qi, Chenghao Yang, Zhiyuan Liu, and Maosong Sun. 2019.
COS960: A Chinese word similarity dataset of 960 word pairs. CoRR, abs/1906.00247.

Joshi, Pratik, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020.
The state and fate of linguistic diversity and inclusion in the NLP world. In
Proceedings of ACL, pages 6282–6293. Online. DOI:
https://doi.org/10.18653/v1/2020.acl-main.560

Joulin, Armand, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Edouard Grave.
2018. Loss in translation: Learning bilingual word mapping with a retrieval criterion.
In Proceedings of EMNLP, pages 2979–2984. Brussels. DOI:
https://doi.org/10.18653/v1/D18-1330

Kamath, Aishwarya, Jonas Pfeiffer, Edoardo Maria Ponti, Goran Glavaš, and Ivan Vulić.
2019. Specializing distributional vectors of all words for lexical entailment. In
Proceedings of the 4th Workshop on Representation Learning for NLP, pages 72–83.
Florence. DOI: https://doi.org/10.18653/v1/W19-4310

Kamholz, David, Jonathan Pool, and Susan M. Colowick. 2014. PanLex: Building a
resource for panlingual lexical translation. In Proceedings of LREC, pages 3145–3150.
Reykjavik.

Kay, Paul and Luisa Maffi. 2013. Green and blue. In Dryer, Matthew S. and Martin
Haspelmath, editors, The World Atlas of Language Structures Online, Max Planck
Institute for Evolutionary Anthropology, Leipzig, pages 540–541.

Kiela, Douwe and Stephen Clark. 2014. A systematic study of semantic vector space
model parameters. In Proceedings of the 2nd Workshop on Continuous Vector Space Models
and their Compositionality (CVSC), pages 21–30. Gothenburg. DOI:
https://doi.org/10.3115/v1/W14-1503

Kiela, Douwe, Felix Hill, and Stephen Clark. 2015. Specializing word embeddings for
similarity or relatedness. In Proceedings of EMNLP, pages 2044–2048. Lisbon. DOI:
https://doi.org/10.18653/v1/D15-1242

Kim, Joo-Kyung, Marie-Catherine de Marneffe, and Eric Fosler-Lussier. 2016. Adjusting
word embeddings with semantic intensity orders. In Proceedings of the 1st Workshop on
Representation Learning for NLP, pages 62–69. Berlin. DOI:
https://doi.org/10.18653/v1/W16-1607

Kim, Joo-Kyung, Gokhan Tur, Asli Celikyilmaz, Bin Cao, and Ye-Yi Wang. 2016. Intent
detection using semantically enriched word embeddings. In SLT, pages 414–419. San
Diego, CA. DOI: https://doi.org/10.1109/SLT.2016.7846297

Kipper, Karin, Anna Korhonen, Neville Ryant, and Martha Palmer. 2008. A large-scale
classification of English verbs. Language Resources and Evaluation, 42(1):21–40. DOI:
https://doi.org/10.1007/s10579-007-9048-2

Kipper, Karin, Benjamin Snyder, and Martha Palmer. 2004. Extending a verb-lexicon
using a semantically annotated corpus. In Proceedings of LREC, pages 1557–1560.
Lisbon.

Kipper Schuler, Karin. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon.
Ph.D. thesis, University of Pennsylvania.

Kondratyuk, Dan and Milan Straka. 2019. 75 languages, 1 model: Parsing Universal
Dependencies universally. In Proceedings of EMNLP-IJCNLP, pages 2779–2795. Hong
Kong. DOI: https://doi.org/10.18653/v1/D19-1279

Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma,
and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language
representations. In Proceedings of ICLR. Online.

Lauscher, Anne, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš. 2020. From zero to
hero: On the limitations of zero-shot cross-lingual transfer with multilingual
transformers. In Proceedings of EMNLP, pages 4483–4499. Online.

Lauscher, Anne, Ivan Vulić, Edoardo Maria Ponti, Anna Korhonen, and Goran Glavaš.
2019. Informing unsupervised pretraining with external linguistic knowledge. arXiv
preprint arXiv:1909.02339.


Lazaridou, Angeliki, Georgiana Dinu, y

Marco Baroni. 2015. Hubness and
pollution: Delving into cross-space
mapping for zero-shot learning. En
Proceedings of ACL, pages 270–280. Beijing.
DOI: https://doi.org/10.3115/v1
/P15-1027

Le, Hang, Lo¨ıc Vial, Jibril Frej, Vincent

Segonne, Maximin Coavoux, Benjamín
Lecouteux, Alexandre Allauzen, Benoˆıt
Crabb´e, Laurent Besacier, and Didier
Schwab. 2019. FlauBERT: Unsupervised
language model pre-training for French.
CORR, abs/1912.05372.

Leviant, Ira and Roi Reichart. 2015. Separated
by an un-common language: Towards
judgment language informed vector space
modelado. CORR, abs/1508.00106.

Levin, Beth. 1993. English Verb Classes and
Alternation, A Preliminary Investigation.
The University of Chicago Press.
Exacción, Omer and Yoav Goldberg. 2014.

Dependency-based word embeddings.
In Proceedings of ACL, pages 302–308.
baltimore, Maryland. DOI: https://doi
.org/10.3115/v1/P14-2050, PMID:
25270273

Luis, Patrick S. h., Barlas Oguz, Ruty
Rinott, Sebastián Riedel, and Holger
Schwenk. 2019. MLQA: Evaluating
cross-lingual extractive question
answering. CORR, abs/1910.07475. DOI:
https://doi.org/10.18653/v1/2020
.acl-main.653

Liang, Yaobo, Nan Duan, Yeyun Gong, Y
Wu, Fenfei Guo, Weizhen Qi, Ming Gong,
Linjun Shou, Daxin Jiang, Guihong Cao,
Xiaodong Fan, Bruce Zhang, Rahul
Agrawal, Edward Cui, Sining Wei, Taroon
Bharti, Ying Qiao, Jiun-Hung Chen,
Winnie Wu, Shuguang Liu, fan yang,
Rangan Majumder, y Ming Zhou. 2020.
XPEGAMENTO: A new benchmark dataset for
cross-lingual pre-training, comprensión
y generación. CORR, abs/2004.01401.
lin, Yu Hsiang, Chian-Yu Chen, Juan Lee,
Zirui Li, Yuyan Zhang, Mengzhou Xia,
Shruti Rijhwani, Junxian He, Zhisong
zhang, Xuezhe Ma, Antonios
Anastasopoulos, Patricio Littell, y
Graham Neubig. 2019. Choosing transfer
languages for cross-lingual learning. En
Proceedings of ACL, páginas 3125–3135.
Florencia. DOI: https://doi.org/10
.18653/v1/P19-1301

Littell, Patrick, David R Mortensen, Ke Lin,
Katherine Kairis, Carlisle Turner, y
Lori Levin. 2017. Uriel and lang2vec:
Representing languages as typological,
geographical, and phylogenetic vectors. En

892

Proceedings of EACL, pages 8–14. Valencia.
DOI: https://doi.org/10.18653/v1
/E17-2002

Liu, Qianchu, Diana McCarthy, Ivan Vulic,

and Anna Korhonen. 2019a. Investigating
cross-lingual alignment methods for
contextualized embeddings with token-
level evaluation. In Proceedings of CoNLL,
pages 33–43. Hong Kong. DOI: https://
doi.org/10.18653/v1/K19-1004

Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer
Exacción, mike lewis, Lucas Zettlemoyer, y
Veselin Stoyanov. 2019b. RoBERTa: A
robustly optimized BERT pretraining
acercarse. CORR, abs/1907.11692.
lucas, Margery. 2000. Semantic priming
without association: A meta-analytic
revisar. Boletín psiconómico & Revisar,
7(4):618–630. DOI: https://doi.org
/10.3758/BF03212999, PMID: 11206202

Luong, Thang, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of CoNLL, pages 104–113. Sofia.

Lyons, John. 1977. Semantics, volume 2. Cambridge University Press. DOI: https://doi.org/10.1017/CBO9781139165693

Majid, Asifa, Melissa Bowerman, Miriam van Staden, and James S. Boster. 2007. The semantic categories of cutting and breaking events: A cross-linguistic perspective. Cognitive Linguistics, 18(2):133–152. DOI: https://doi.org/10.1515/COG.2007.005

Malaviya, Chaitanya, Graham Neubig, and Patrick Littell. 2017. Learning language representations for typology prediction. In Proceedings of EMNLP, pages 2529–2535. Copenhagen. DOI: https://doi.org/10.18653/v1/D17-1268

Mantel, Nathan. 1967. The detection of disease clustering and a generalized regression approach. Cancer Research, 27(2 Part 1):209–220.

McKeown, Kathleen R., Regina Barzilay, David Evans, Vasileios Hatzivassiloglou, Judith L. Klavans, Ani Nenkova, Carl Sable, Barry Schiffman, and Sergey Sigelman. 2002. Tracking and summarizing news on a daily basis with Columbia's Newsblaster. In Proceedings of HLT, pages 280–285. San Diego, CA. DOI: https://doi.org/10.3115/1289189.1289212

Melamud, Oren, David McClosky, Siddharth Patwardhan, and Mohit Bansal. 2016. The role of context types and dimensionality in
learning word embeddings. In Proceedings of NAACL-HLT, pages 1030–1040. San Diego, CA. DOI: https://doi.org/10.18653/v1/N16-1118

Mikolov, Tomas, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of LREC, pages 52–55. Miyazaki.

Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NeurIPS, pages 3111–3119. South Lake Tahoe, NV.

Miller, George A. 1995. WordNet: A lexical database for English. Communications of the ACM, pages 39–41. DOI: https://doi.org/10.1145/219717.219748

Mimno, David and Laure Thompson. 2017. The strange geometry of skip-gram with negative sampling. In Proceedings of EMNLP, pages 2873–2878. Copenhagen. DOI: https://doi.org/10.18653/v1/D17-1308

Mohiuddin, Tasnim and Shafiq Joty. 2019. Revisiting adversarial autoencoder for unsupervised word translation with cycle consistency and improved training. In Proceedings of NAACL-HLT, pages 3857–3867. Minneapolis, MN. DOI: https://doi.org/10.18653/v1/N19-1386

Mrkšić, Nikola, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Maria Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of NAACL-HLT, pages 142–148. San Diego, CA. DOI: https://doi.org/10.18653/v1/N16-1018

Mrkšić, Nikola, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017. Semantic specialisation of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the ACL, 5:309–324. DOI: https://doi.org/10.1162/tacl_a_00063

Mu, Jiaqi, Suma Bhat, and Pramod Viswanath. 2018. All-but-the-top: Simple and effective postprocessing for word representations. In Proceedings of ICLR. Vancouver.

Mykowiecka, Agnieszka, Małgorzata Marciniak, and Piotr Rychlik. 2018. SimLex-999 for Polish. In Proceedings of LREC, pages 2398–2402. Miyazaki.

Nelson, Douglas L., Cathy L. McEvoy, and Thomas A. Schreiber. 2004. The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, 36(3):402–407. DOI: https://doi.org/10.3758/BF03195588, PMID: 15641430

Netisopakul, Ponrudee, Gerhard Wohlgenannt, and Aleksei Pulich. 2019. Word similarity datasets for Thai: Construction and evaluation. CoRR, abs/1904.04307. DOI: https://doi.org/10.1109/ACCESS.2019.2944151
Nivre, Joakim, Mitchell Abrams, Željko Agić, Lars Ahrenberg, Gabrielė Aleksandravičiūtė, Lene Antonsen, Katya Aplonova, Maria Jesus Aranzabe, Gashaw Arutie, Masayuki Asahara, et al. 2019. Universal Dependencies 2.4. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Östling, Robert and Jörg Tiedemann. 2017. Continuous multilinguality with language vectors. In Proceedings of EACL, pages 644–649. Valencia, Spain. DOI: https://doi.org/10.18653/v1/E17-2102

Pearson, Karl. 1901. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572. DOI: https://doi.org/10.1080/14786440109462720

Peters, Matthew, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237. New Orleans, LA. DOI: https://doi.org/10.18653/v1/N18-1202

Pilehvar, Mohammad Taher, Dimitri Kartsaklis, Victor Prokhorov, and Nigel Collier. 2018. Card-660: Cambridge rare word dataset – a reliable benchmark for infrequent word representation models. In Proceedings of EMNLP, pages 1391–1401. Brussels. DOI: https://doi.org/10.18653/v1/D18-1169

Pires, Telmo, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of ACL, pages 4996–5001. Florence. DOI: https://doi.org/10.18653/v1/P19-1493


Ponti, Edoardo Maria, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. XCOPA: A multilingual dataset for causal commonsense reasoning. arXiv preprint arXiv:2005.00333. Online.

Ponti, Edoardo Maria, Helen O'Horan, Yevgeni Berzak, Ivan Vulić, Roi Reichart, Thierry Poibeau, Ekaterina Shutova, and Anna Korhonen. 2019a. Modeling language variation and universals: A survey on typological linguistics for natural language processing. Computational Linguistics, 45(3):559–601. DOI: https://doi.org/10.1162/coli_a_00357

Ponti, Edoardo Maria, Roi Reichart, Anna Korhonen, and Ivan Vulić. 2018a. Isomorphic transfer of syntactic structures in cross-lingual NLP. In Proceedings of ACL, pages 1531–1542. Melbourne. DOI: https://doi.org/10.18653/v1/P18-1142

Ponti, Edoardo Maria, Ivan Vulić, Ryan Cotterell, Roi Reichart, and Anna Korhonen. 2019b. Towards zero-shot language modeling. In Proceedings of EMNLP-IJCNLP, pages 2893–2903. Hong Kong. DOI: https://doi.org/10.18653/v1/D19-1288

Ponti, Edoardo Maria, Ivan Vulić, Goran Glavaš, Nikola Mrkšić, and Anna Korhonen. 2018b. Adversarial propagation and zero-shot cross-lingual transfer of word vector specialization. In Proceedings of EMNLP, pages 282–293. Brussels. DOI: https://doi.org/10.18653/v1/D18-1026

Ponti, Edoardo Maria, Ivan Vulić, Goran Glavaš, Roi Reichart, and Anna Korhonen. 2019c. Cross-lingual semantic specialization via lexical relation induction. In Proceedings of EMNLP, pages 2206–2217. Hong Kong. DOI: https://doi.org/10.18653/v1/D19-1226

Quinn, Naomi and Dorothy Holland. 1987. Culture and cognition. Cultural Models in Language and Thought, pages 3–40. DOI: https://doi.org/10.1017/CBO9780511607660.002, PMID: 3552892

Radovanović, Miloš, Alexandros Nanopoulos, and Mirjana Ivanović. 2010. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 11:2487–2531.
Rasooli, Mohammad Sadegh and Michael Collins. 2017. Cross-lingual syntactic transfer with limited resources. Transactions of the ACL, 5:279–293. DOI: https://doi.org/10.1162/tacl_a_00061

Ren, Liliang, Kaige Xie, Lu Chen, and Kai Yu. 2018. Towards universal dialogue state tracking. In Proceedings of EMNLP, pages 2780–2786. Brussels. DOI: https://doi.org/10.18653/v1/D18-1299

Roemmele, Melissa, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Proceedings of the 2011 AAAI Spring Symposium Series. Palo Alto, CA.

Rotman, Guy and Roi Reichart. 2019. Deep contextualized self-training for low resource dependency parsing. Transactions of the ACL, 7:695–713. DOI: https://doi.org/10.1162/tacl_a_00294

Ruder, Sebastian, Anders Søgaard, and Ivan Vulić. 2019. Unsupervised cross-lingual representation learning. In Proceedings of ACL: Tutorial Abstracts, pages 31–38. Florence. DOI: https://doi.org/10.18653/v1/P19-4007

Ruder, Sebastian, Ivan Vulić, and Anders Søgaard. 2019. A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research, 65:569–631. DOI: https://doi.org/10.1613/jair.1.11640
Rzymski, Christoph, Tiago Tresoldi, Simon J.

Greenhill, Mei-Shin Wu, Nathanael E.
Schweikhard, Maria Koptjevskaja-Tamm,
Volker Gast, Timotheus A. Bodt, Abbie
Hantgan, Gereon A. Kaiping, et al. 2020.
The database of cross-linguistic
colexifications, reproducible analysis of
cross-linguistic polysemies. Scientific Data,
7(1):1–12. DOI: https://doi.org
/10.1038/s41597-019-0341-x, PMID:
31932593, PMCID: PMC6957499

Sahlgren, Magnus. 2006. The Word-Space Model: Using Distributional Analysis to Represent Syntagmatic and Paradigmatic Relations Between Words in High-Dimensional Vector Spaces. Ph.D. thesis, Stockholm University.

Sakaizawa, Yuya and Mamoru Komachi.
2018. Construction of a Japanese word
similarity dataset. In Proceedings of LREC,
pages 948–951. Miyazaki.

Schlechtweg, Dominik, Anna Hätty, Marco Del Tredici, and Sabine Schulte im Walde. 2019. A wind of change: Detecting and evaluating lexical semantic change across times and domains. In Proceedings of ACL, pages 732–746. Florence. DOI: https://doi.org/10.18653/v1/P19-1072

Schuster, Mike and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In International Conference on Acoustics, Speech

D
oh
w
norte
oh
a
d
mi
d

F
r
oh
metro
h

t
t

pag

:
/
/

d
i
r
mi
C
t
.

metro

i
t
.

mi
d
tu
/
C
oh

yo
i
/

yo

a
r
t
i
C
mi

pag
d

F
/

/

/

/

4
6
4
8
4
7
1
8
8
8
2
8
7
/
C
oh

yo
i

_
a
_
0
0
3
9
1
pag
d

.

F

b
y
gramo
tu
mi
s
t

t

oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3

Vuli´c et al.

Multi-SimLex

and Signal Processing, pages 5149–5152. Kyoto. DOI: https://doi.org/10.1109/ICASSP.2012.6289079
Schwartz, Roy, Roi Reichart, and Ari Rappoport. 2015. Symmetric pattern based word embeddings for improved word similarity prediction. In Proceedings of CoNLL, pages 258–267. Beijing. DOI: https://doi.org/10.18653/v1/K15-1026

Schwartz, Roy, Roi Reichart, and Ari Rappoport. 2016. Symmetric patterns and coordinations: Fast and enhanced representations of verbs and adjectives. In Proceedings of NAACL-HLT, pages 499–505. San Diego, CA. DOI: https://doi.org/10.18653/v1/N16-1060, PMCID: PMC4832855

Smith, Samuel L., David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In Proceedings of ICLR (Conference Track). Toulon.

Snyder, Benjamin and Regina Barzilay. 2010. Climbing the Tower of Babel: Unsupervised multilingual learning. In Proceedings of ICML, pages 29–36. Haifa.
Søgaard, Anders, Sebastian Ruder, and Ivan Vulić. 2018. On the limitations of unsupervised bilingual dictionary induction. In Proceedings of ACL, pages 778–788. Melbourne. DOI: https://doi.org/10.18653/v1/P18-1072

Suzuki, Ikumi, Kazuo Hara, Masashi Shimbo, Marco Saerens, and Kenji Fukumizu. 2013. Centering similarity measures to reduce hubs. In Proceedings of EMNLP, pages 613–623. Seattle, WA.

Tang, Shuai, Mahta Mousavi, and Virginia R. de Sa. 2019. An empirical study on post-processing methods for word embeddings. CoRR, abs/1905.10971.
Trier, Jost. 1931. Der Deutsche Wortschatz im Sinnbezirk des Verstandes: Die Geschichte eines sprachlichen Feldes. 1. Von den Anfängen bis zum Beginn des 13. Jahrhunderts. Ph.D. thesis, University of Bonn.

Tsvetkov, Yulia, Sunayana Sitaram, Manaal Faruqui, Guillaume Lample, Patrick Littell, David Mortensen, Alan W. Black, Lori Levin, and Chris Dyer. 2016. Polyglot neural language models: A case study in cross-lingual phonetic representation learning. In Proceedings of NAACL-HLT, pages 1357–1366. San Diego, CA. DOI: https://doi.org/10.18653/v1/N16-1161

Turney, Peter D. 2012. Domain and function:
A dual-space model of semantic relations
and compositions. Journal of Artificial
Intelligence Research, 44:533–585. DOI:
https://doi.org/10.1613/jair.3640
Turney, Peter D. and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188. DOI: https://doi.org/10.1613/jair.2934
Upadhyay, Shyam, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual models of word embeddings: An empirical comparison. In Proceedings of ACL, pages 1661–1670. Berlin. DOI: https://doi.org/10.18653/v1/P16-1157

van den Berg, Robert A., Huub C. J. Hoefsloot, Johan A. Westerhuis, Age K. Smilde, and Mariët J. van der Werf. 2006. Centering, scaling, and transformations: Improving the biological information content of metabolomics data. BMC Genomics, 7(1):142. DOI: https://doi.org/10.1186/1471-2164-7-142, PMID: 16762068, PMCID: PMC1534033

Vania, Clara and Adam Lopez. 2017. From characters to words to in between: Do we capture morphology? In Proceedings of ACL, pages 2016–2027. Vancouver. DOI: https://doi.org/10.18653/v1/P17-1184

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NeurIPS, pages 6000–6010. Long Beach, CA.

Vejdemo, Susanne. 2018. Lexical change often begins and ends in semantic peripheries: Evidence from color linguistics. Pragmatics & Cognition, 25(1):50–85. DOI: https://doi.org/10.1075/pc.00005.vej

Venekoski, Viljami and Jouko Vankka. 2017. Finnish resources for evaluating language model semantics. In Proceedings of NODALIDA, pages 231–236. Gothenburg.

Virtanen, Antti, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. CoRR, abs/1912.07076.
Vulić, Ivan, Daniela Gerz, Douwe Kiela, Felix Hill, and Anna Korhonen. 2017a. HyperLex: A large-scale evaluation of graded lexical entailment. Computational Linguistics, 43(4):781–835. DOI: https://doi.org/10.1162/COLI_a_00301

Vulić, Ivan, Goran Glavaš, Roi Reichart, and Anna Korhonen. 2019. Do we really need
fully unsupervised cross-lingual embeddings? In Proceedings of EMNLP, pages 4407–4418. Hong Kong. DOI: https://doi.org/10.18653/v1/D19-1449

Vulić, Ivan, Douwe Kiela, and Anna Korhonen. 2017. Evaluation by association: A systematic study of quantitative word association evaluation. In Proceedings of EACL, pages 163–175. Valencia. DOI: https://doi.org/10.18653/v1/E17-1016

Vulić, Ivan and Anna Korhonen. 2016. On the role of seed lexicons in learning bilingual word embeddings. In Proceedings of ACL, pages 247–257. Berlin. DOI: https://doi.org/10.18653/v1/P16-1024

Vulić, Ivan, Edoardo Maria Ponti, Robert Litschko, Goran Glavaš, and Anna Korhonen. 2020. Probing pretrained language models for lexical semantics. In Proceedings of EMNLP, pages 7222–7240. Online.

Vulić, Ivan, Simone Paolo Ponzetto, and Goran Glavaš. 2019. Multilingual and cross-lingual graded lexical entailment. In Proceedings of ACL, pages 4963–4974. Florence. DOI: https://doi.org/10.18653/v1/P19-1490

Vulić, Ivan, Roy Schwartz, Ari Rappoport, Roi Reichart, and Anna Korhonen. 2017b. Automatic selection of context configurations for improved class-specific word representations. In Proceedings of CoNLL, pages 112–122. Vancouver. DOI: https://doi.org/10.18653/v1/K17-1013

Wang, Zihan, Karthikeyan K, Stephen Mayhew, and Dan Roth. 2020. Cross-lingual ability of multilingual BERT: An empirical study. In Proceedings of ICLR. Online.

Wieting, John, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the ACL, 3:345–358. DOI: https://doi.org/10.1162/tacl_a_00143

Williams, Adina, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT, pages 1112–1122. New Orleans, LA. DOI: https://doi.org/10.18653/v1/N18-1101
Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Wu, Shijie, Alexis Conneau, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Emerging cross-lingual structure in pretrained language models. CoRR, abs/1911.01464.

Wu, Shijie and Mark Dredze. 2019. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of EMNLP, pages 833–844. Hong Kong. DOI: https://doi.org/10.18653/v1/D19-1077

Wu, Shijie and Mark Dredze. 2020. Are all languages created equal in multilingual BERT? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120–130. Online. DOI: https://doi.org/10.18653/v1/2020.repl4nlp-1.16
Wu, Zhibiao and Martha Palmer. 1994. Verb semantics and lexical selection. In Proceedings of ACL, pages 133–138. Las Cruces, NM. DOI: https://doi.org/10.3115/981732.981751

Xing, Chao, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of NAACL-HLT, pages 1006–1011. Denver, CO. DOI: https://doi.org/10.3115/v1/N15-1104

Yang, Yinfei, Yuan Zhang, Chris Tar, and Jason Baldridge. 2019. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of EMNLP, pages 3687–3692. Hong Kong. DOI: https://doi.org/10.18653/v1/D19-1382

Zeman, Daniel, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21. Brussels.

Zhang, Mozhi, Keyulu Xu, Ken-ichi Kawarabayashi, Stefanie Jegelka, and Jordan Boyd-Graber. 2019. Are girls neko or shōjo? Cross-lingual alignment of non-isomorphic embeddings with iterative normalization. In Proceedings of ACL, pages 3180–3189. Florence. DOI: https://doi.org/10.18653/v1/P19-1307

Zhang, Yuan, Jason Baldridge, and Luheng He. 2019. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of NAACL-HLT, pages 1298–1308. Minneapolis, MN.


Zhu, Yi, Benjamin Heinzerling, Ivan Vulić, Michael Strube, Roi Reichart, and Anna Korhonen. 2019. On the importance of subword information for morphological tasks in truly low-resource languages. In Proceedings of CoNLL, pages 216–226. Hong Kong. DOI: https://doi.org/10.18653/v1/K19-1021

Zhu, Yi, Ivan Vulić, and Anna Korhonen. 2019. A systematic study of leveraging subword information for learning word representations. In Proceedings of NAACL-HLT, pages 912–932. Minneapolis, MN. DOI: https://doi.org/10.18653/v1/N19-1097
