Hierarchical Mapping for Crosslingual Word Embedding Alignment

Ion Madrazo Azpiazu and Maria Soledad Pera

Department of Computer Science
Boise State University
{ionmadrazo,solepera}@boisestate.edu

Abstract

The alignment of word embedding spaces in different languages into a common crosslingual space has recently been in vogue. Strategies that do so compute pairwise alignments and then map multiple languages to a single pivot language (most often English). These strategies, however, are biased towards the choice of the pivot language, given that language proximity and the linguistic characteristics of the target language can strongly impact the resultant crosslingual space to the detriment of topologically distant languages. We present a strategy that eliminates the need for a pivot language by learning the mappings across languages in a hierarchical way. Experiments demonstrate that our strategy significantly improves vocabulary induction scores in all existing benchmarks, as well as in a new non-English-centered benchmark we built, which we make publicly available.

1 Introduction

Word embeddings have changed how we build text processing applications, given their capabilities for representing the meaning of words (Mikolov et al., 2013a; Pennington et al., 2014; Bojanowski et al., 2017). Traditional embedding-generation strategies create different embeddings for the same word depending on the language. Even if the embeddings themselves are different across languages, their distributions tend to be consistent—the relative distances across word embeddings are preserved regardless of the language (Mikolov et al., 2013b). This behavior has been exploited for crosslingual embedding generation by aligning any two monolingual embedding spaces into one (Dinu et al., 2014; Xing et al., 2015; Artetxe et al., 2016).

Alignment techniques have been successful in generating bilingual embedding spaces that can later be merged into a crosslingual space using a pivoting language, English being the most common choice. Unfortunately, mapping one language into another suffers from a neutrality problem, as the resultant bilingual space is impacted by language-specific phenomena and corpus-specific biases of the target language (Doval et al., 2018). To address this issue, Doval et al. (2018) propose mapping any two languages into a different middle space. This mapping, however, precludes the use of a pivot language for merging multiple bilingual spaces into a crosslingual one, limiting the solution to a bilingual scenario. Additionally, the pivoting strategy suffers from a generalized bias problem, as languages that are the most similar to the pivot obtain a better alignment and are therefore better represented in the crosslingual space. This is because language proximity is a key factor when learning alignments, as evidenced by the results in Artetxe et al. (2017), which indicate that when using English (Indo-European) as a pivot, the vocabulary induction results for Finnish (Uralic) are about 10 points below the rest of the Indo-European languages under study.

If we want to incorporate all languages into the same crosslingual space regardless of their characteristics, we need to go beyond the train-bilingual/merge-by-pivoting (TB/MP) model, and instead seek solutions that can directly generate crosslingual spaces without requiring a bilingual step. This motivates the design of HCEG (Hierarchical Crosslingual Embedding Generation), the hierarchical pivotless approach for generating crosslingual embedding spaces that we present in this paper. HCEG addresses both the language proximity and target-space bias problems by learning a compositional mapping across multiple languages in a hierarchical fashion. This is accomplished by taking advantage of a language family tree for aggregating multiple languages into a single crosslingual space. What distinguishes HCEG from TB/MP strategies is that it does not need to include the pivot language

in all mapping functions. This enables the option
to learn mappings between typologically similar
languages, known to yield better quality mappings
(Artetxe et al., 2017).
The main contributions of our work include:

• A strategy1 that leverages a language family
tree for learning mapping matrices that are
composed hierarchically to yield crosslingual
embedding spaces for language families.

• An analysis of the benefits of hierarchically
generating mappings across multiple lan-
guages compared to traditional unsupervised
and supervised TB/MP alignment strategies.

2 Related Work

Recent interest in crosslingual word embedding generation has led to manifold strategies that can be classified into four groups (Ruder et al., 2017): (1) Mapping techniques that rely on a bilingual lexicon for mapping an already trained monolingual space into another (Mikolov et al., 2013b; Artetxe et al., 2017; Doval et al., 2018); (2) Pseudo-crosslingual techniques that generate synthetic crosslingual corpora that are then used in a traditional monolingual strategy, by randomly replacing words of a text with their translations (Gouws and Søgaard, 2015; Duong et al., 2016) or by combining texts in various languages into one (Vulić and Moens, 2016); (3) Approaches that only optimize for a crosslingual objective function, which require parallel corpora in the form of aligned sentences (Hermann and Blunsom, 2013; Lauly et al., 2014) or texts (Søgaard et al., 2015); and (4) Approaches using a joint objective function that optimizes both mono- and crosslingual loss, which rely on parallel corpora aligned at the word (Zou et al., 2013; Luong et al., 2015) or sentence level (Gouws et al., 2015; Coulmance et al., 2015).

A key factor for crosslingual embedding generation techniques is the amount of supervised signal needed. Parallel corpora are a scarce resource—even nonexistent for some isolated or low-resource languages. Thus, we focus on mapping-based strategies, which can go from requiring just a bilingual lexicon (Mikolov et al., 2013b) to absolutely no supervised signal (Artetxe et al., 2018). This aligns with one of the premises for our research: to enable the generation of a single crosslingual embedding space for as many languages as possible.

1 Resources can be found at https://github.com/ionmadrazo/HCEG.
Mikolov et al. (2013b) first introduced a mapping strategy for aligning two monolingual spaces that learns a linear transformation from source to target space using stochastic gradient descent. This approach was later enhanced with the use of least squares for finding the optimal solution, L2-normalizing the word embeddings, or constraining the mapping matrix to be orthogonal (Dinu et al., 2014; Shigeto et al., 2015; Xing et al., 2015; Artetxe et al., 2016; Smith et al., 2017); enhancements that soon became standard in the area. These models, however, are affected by hubness, where some words tend to be in the neighborhood of an exceptionally large number of other words, causing problems when using nearest-neighbor as the retrieval algorithm, and neutrality, where the resultant crosslingual space is highly conditioned by the characteristics of the language used as target. Hubness was addressed by a correction applied to nearest-neighbor retrieval, either using an inverted softmax (Smith et al., 2017) or cross-domain similarity local scaling (Conneau et al., 2017), later incorporated as part of the training loss (Joulin et al., 2018). Neutrality was noticed by Doval et al. (2018), who proposed using two independent linear transformations so that the resulting crosslingual space lies at a middle point between the two languages rather than on the target language, and is therefore not biased towards either language.

Other important trends in the area concentrate on (i) the search for unsupervised techniques for learning mapping functions (Conneau et al., 2017; Artetxe et al., 2018) and their versatility in dealing with low-resource languages (Vulić et al., 2019); (ii) the long-tail problem, where most existing crosslingual embedding generation strategies tend to under-perform (Braune et al., 2018; Czarnowska et al., 2019); and (iii) the formulation of more robust evaluation procedures oriented to determining the quality of generated crosslingual spaces (Glavas et al., 2019; Litschko et al., 2019).

Most existing works focus on a bilingual scenario. Yet, there is increasing interest in designing strategies that directly consider more than two languages at training time, thus creating fully multilingual spaces that do not depend on the

TB/MP model for multilingual inference (Kementchedjhieva et al., 2018). Attempts to do so include the efforts by Søgaard et al. (2015), who leverage an inverted index based on the Wikipedia multilingual links to generate multilingual word representations. Wada et al. (2019) instead use a sentence-level neural language model for directly learning multilingual word embeddings, as a result bypassing the need for mapping functions. In the paradigm of aligning pre-trained word embeddings, where we focus, Heyman et al. (2019) propose a technique that iteratively builds a multilingual space starting from a monolingual space and incrementally incorporating languages into it. Even if this strategy deviates from the traditional TB/MP model, it still preserves the idea of having a pivot language. Chen and Cardie (2018) separate the mapping functions into encoders and decoders, which are not language-pair dependent, unlike those in the TB/MP model. This removes the need for a pivot language, given that the multilingual space is now latent among all encoders and decoders and not centered in a specific language. The same pivot-removal effect is achieved by the strategy introduced in Jawanpuria et al. (2019), which generalizes a bilingual word embedding strategy into a multilingual counterpart by inducing a Mahalanobis similarity metric in the common space. These two strategies, however, still consider all languages equidistant to each other, ignoring the similarities and differences that lie among them.

Our work is inspired by Doval et al. (2018) and Chen and Cardie (2018), in the sense that it focuses on obtaining a non-biased or neutral crosslingual space that does not need to be centered in English (or any other pivot language) as the primary source. This neutrality is obtained by a compositional mapping strategy that hierarchically combines mapping functions in order to generate a single, non-language-centered crosslingual space, enabling a better mapping for languages that are distant or not typologically related to English.

3 Proposed Strategy

A language family tree is a natural categorization of languages that has historically been used by linguists as a reference that encodes similarities and differences across languages (Comrie, 1989). For example, based on the relative distances among languages in the tree illustrated in Figure 1, we infer that Spanish and Portuguese are relatively similar to each other, given that they are part of the same Italic family. At the same time, both languages are farther apart from English than from each other, and are radically different with respect to Finnish.

A language family tree offers a natural organization that can be exploited when building crosslingual spaces that integrate typologically diverse languages. We leverage this structure in HCEG in order to generate a hierarchically compositional crosslingual word embedding space. Unlike traditional TB/MP strategies that generate a single crosslingual space, the result of HCEG is a set of transformation matrices that can be used to hierarchically compose the space required in each use case. This maximizes the typological intra-similarity among the languages used for generating the embedding space, while minimizing the differences across languages that can hinder the quality of the crosslingual embedding space. Thus, if an external application only considers languages that are Germanic, then it can just use the Germanic crosslingual space generated by HCEG, whereas if it needs languages beyond Germanic it can utilize a higher-level family, such as Indo-European. This cannot be done with the traditional TB/MP model. In that case, if an application is, for example, using only Uralic languages, then it would be forced to use an English-centered crosslingual space; this would result in a decrease in the quality of the crosslingual space used because of the potentially bad quality of mappings between typologically different languages, such as Uralic and Indo-European languages (Artetxe et al., 2017).

3.1 Definitions

Let L = {l_1, . . . , l_{|L|}} be the set of languages considered, F = {f_1, . . . , f_{|F|}} a set of language families, and S = L ∪ F = {s_1, . . . , s_{|F|+|L|}} a set of possible language spaces. Let X_l ∈ R^{V_l × d} be the set of word embeddings in language l, where V_l is the vocabulary of l and d is the number of dimensions of each embedding. Consider T as a language family tree (exemplified in Figure 1). The nodes in T represent language spaces in S, while each edge represents a transformation between the two nodes attached to it—that is, W_{s_a ← s_b} ∈ R^{d × d} refers to the transformation from


Figure 1: Sample language tree representation, simplified for illustration purposes (Lewis and Gary, 2015).

space s_b to space s_a. For notation ease, we refer to W_{s_a *← s_b} as the transformation that results from aggregating all transformations in the path from s_b to s_a, using the dot product:

W_{s_a *← s_b} = W_{s_a ← s_{t1}} W_{s_{t1} ← s_{t2}} W_{s_{t2} ← s_b}     (1)

where the path from s_a to s_b is s_a, s_{t1}, s_{t2}, s_b; s_{t1} and s_{t2} are intermediate spaces between s_a and s_b. Finally, P is a set of bilingual lexicons, where P_{l1,l2} ∈ {0, 1}^{V_{l1} × V_{l2}} is a bilingual lexicon with word pairs in languages l_1 and l_2. P_{l1,l2}(i, j) = 1 if the ith word of V_{l1} and the jth word of V_{l2} are aligned, and P_{l1,l2}(i, j) = 0 otherwise.

Example. Consider the set of embeddings for English X_en, the transformation that converts embeddings in the English space to the Germanic language family space W_{s_ge *← s_en}, and the English embeddings transformed to the Germanic space, W_{s_ge *← s_en} X_en. HCEG makes it so that W_{s_ge *← s_en} X_en and W_{s_ge *← s_de} X_de (the transformed embeddings of English and German) are in the same Germanic embedding space, while W_{s_in *← s_en} X_en and W_{s_in *← s_es} X_es (the transformed embeddings of English and Spanish) are in the same Indo-European embedding space. In the rest of this section we describe HCEG in detail. Values given to each hyperparameter mentioned in this section are defined in Section 4.4.
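To make Equation (1) concrete, the following numpy sketch composes per-edge mapping matrices along a path in the family tree. The toy tree, matrix values, and function names are our own illustration of the definition above, not the released HCEG code.

```python
# Minimal sketch of Equation (1): composing per-edge mappings along a path
# in the language family tree. The toy tree and matrices are illustrative.
import numpy as np

d = 4  # embedding dimensionality (300 for the FastText vectors used in the paper)
rng = np.random.default_rng(0)

# One d x d matrix per tree edge, e.g. W[("germanic", "en")] maps English
# embeddings into the Germanic family space.
W = {
    ("germanic", "en"): rng.normal(size=(d, d)),
    ("indo_european", "germanic"): rng.normal(size=(d, d)),
}

def compose(path, W):
    """Return W_{s_a *<- s_b} for a path [s_b, ..., s_a] of tree nodes."""
    out = np.eye(d)
    # Walk the path child -> parent, left-multiplying each edge matrix.
    for child, parent in zip(path[:-1], path[1:]):
        out = W[(parent, child)] @ out
    return out

# English word embeddings (unit-length rows, as after normalization).
X_en = rng.normal(size=(10, d))
X_en /= np.linalg.norm(X_en, axis=1, keepdims=True)

# Map English embeddings first to the Germanic space, then to Indo-European.
W_in_from_en = compose(["en", "germanic", "indo_european"], W)
X_en_in_indo_european = X_en @ W_in_from_en.T
print(X_en_in_indo_european.shape)  # (10, 4)
```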

3.2 Embedding Normalization

When dealing with embeddings generated from different sources and languages, it is important to normalize them. To do so, HCEG follows a normalization sequence shown to be beneficial (Artetxe et al., 2018), which consists of length normalization, mean centering, and a second length normalization. The last length normalization allows computing cosine similarity between embeddings more efficiently, simplifying it to a dot product, given that the embeddings are of unit length.
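A minimal sketch of this normalization sequence, assuming a (vocabulary × dimensions) embedding matrix; the helper name is ours.

```python
# Sketch of the normalization sequence: length normalization, mean centering,
# and a second length normalization.
import numpy as np

def normalize_embeddings(X):
    """X: (vocab_size, d) matrix of word embeddings."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit length
    X = X - X.mean(axis=0, keepdims=True)              # mean centering
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit length again
    return X

rng = np.random.default_rng(0)
X = normalize_embeddings(rng.normal(size=(1000, 300)))
# After the final step, cosine similarity reduces to a dot product:
cos_w0_w1 = X[0] @ X[1]
```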

3.3 Word Pairs

In order to generate a crosslingual embedding space, HCEG requires a set P of aligned words across different languages. When using HCEG in a supervised way, P can be any existing resource consisting of bilingual lexicons, such as the ones described in Section 4.1. However, the proposed strategy is best exploited when using unsupervised lexicon induction techniques, as they enable generating input lexicons for any pair of languages needed. Unlike TB/MP strategies, which can only take advantage of signal that involves the pivot language, HCEG can use signal across all combinations of languages. For example, a TB/MP model where English is the pivot can only use lexicons composed of English words. Instead, HCEG can exploit bilingual lexicons from other languages, such as Spanish-Portuguese or Spanish-Dutch, which, using the language tree in Figure 1, would reinforce the training of W_{s_it ← s_es}, W_{s_it ← s_pt} and W_{s_it ← s_es}, W_{s_in ← s_it}, W_{s_in ← s_ge}, W_{s_ge ← s_du}, respectively.

When using HCEG in unsupervised mode, P
needs to be automatically inferred. Yet, computing
each Pl1,l2 ∈ P given two monolingual embedding
matrices Xl1 and Xl2 is not a trivial task, as Xl1 and

Figure 2: Distributions of word rankings across languages. The coordinates of each dot (representing a word pair) are determined by the position of the word pair in the frequency ranking of each of the languages. Numbers are written in thousands. Scores computed using FastText embedding rankings (Grave et al., 2018) and MUSE crosslingual pairs (Conneau et al., 2017). Pearson's correlation (ρ) computed using the full set of word pairs; figures generated using a random sample of 500 word pairs for illustration purposes.

X_{l2} are not aligned in vocabulary or dimension axes. Artetxe et al. (2018) leverage the fact that the relative distances among words are maintained across languages (Mikolov et al., 2013b), and thus propose using a language-agnostic representation M_l for generating an initial alignment P_{l1,l2}:

M_l = sorted(X_l X_l^⊤)     (2)

where, given that X_l is length normalized, X_l X_l^⊤ computes a matrix of dimensions V_l × V_l containing in each row the cosine similarities of the corresponding word embedding with respect to all other word embeddings. The values in each row are then sorted to generate a distribution representation of each word that, in an ideal case where the isometry assumption holds perfectly, would be language agnostic. Using the embedding representations M_{l1} and M_{l2}, P_{l1,l2} can be computed by assigning each word its most similar representation as its pair, that is, P_{l1,l2}(i, j) = 1 if:

j = arg max_{1≤j≤V_{l2}} M_{l1}(i, ·) M_{l2}(j, ·)     (3)

where M_{l1}(i, ·) is the ith row of M_{l1} and M_{l2}(j, ·) is the jth row of M_{l2}.

The results in Artetxe et al. (2018) indicate
that this assumption is strong enough to generate
an initial alignment across languages. Tuttavia,
as we demonstrate in Section 3.3, the quality
of this type of initial alignment is dependent on
the languages used, making this initialization not
applicable for languages that are typologically too
distant from each other—a statement also echoed
by Artetxe et al. (2018) and Søgaard et al. (2018).

To ensure a more robust initialization, we enhance the strategy presented in Artetxe et al. (2018) by introducing a new signal based on the frequency of use of words. Lin et al. (2012) found that the top-2 most frequent words tend to be consistent across different languages. Motivated by this result, we measure to what extent the frequency rankings of words correlate across languages. As shown in Figure 2, the word-frequency rankings are strongly correlated across languages, meaning that popular words tend to be popular regardless of the language. We exploit this behavior in order to reduce the search space of Equation (3) as follows:

j = arg max_{i−t ≤ j ≤ i+t} M_{l1}(i, ·) M_{l2}(j, ·)     (4)

where t is a value used to determine the search window. Note that we assume the embeddings in any matrix X_l are sorted by frequency rank, namely, the embedding in the first row represents the most frequent word of language l. Apart from improving the overall quality of the inferred lexicons (see Section 5.1), incorporating a frequency-ranking-based search as part of the initialization reduces the computation time needed, as the search space is considerably reduced.
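The following sketch illustrates how Equations (2) and (4) can be combined: each word's sorted similarity profile is compared only against candidate words whose frequency rank falls inside the ±t window. It is an illustration under the definitions above (and assumes the top-n embeddings fit in memory), not the authors' implementation.

```python
# Illustrative sketch of the frequency-constrained initialization.
import numpy as np

def similarity_profiles(X):
    # X is length-normalized and sorted by frequency rank (row 0 = most frequent).
    sims = X @ X.T               # (V, V) cosine similarities
    return np.sort(sims, axis=1)  # sorted rows = language-agnostic profile (Eq. 2)

def init_lexicon(X1, X2, t=1000):
    M1, M2 = similarity_profiles(X1), similarity_profiles(X2)
    V1, V2 = len(M1), len(M2)
    width = min(M1.shape[1], M2.shape[1])  # truncate so the two profiles are comparable
    pairs = []
    for i in range(V1):
        lo, hi = max(0, i - t), min(V2, i + t + 1)  # frequency-rank window (Eq. 4)
        scores = M2[lo:hi, :width] @ M1[i, :width]
        pairs.append((i, lo + int(np.argmax(scores))))
    return pairs
```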

3.4 Objective Function

Unlike traditional objective functions that optimize a transformation matrix for two languages at a time, the goal of HCEG is to simultaneously optimize the set of all transformation matrices W such that the loss function L is minimized:

arg min_W L     (5)


L is a linear combination of three different losses:

L = β1 × Lalign + β2 × Lorth + β3 × Lreg (6)

where Lalign, Lorth, Lreg, represent the alignment,
orthogonality, and regularization losses, and β1,
β2, β3 are their weights.

L_align gauges the extent to which training word pairs align. This is done by computing the sum of the cosine similarities among all word pairs in P:

L_align = − Σ_{P_{l1,l2} ∈ P} P_{l1,l2} ( W_{s_{d_{l1,l2}} *← s_{l1}} X_{l1} · W_{s_{d_{l1,l2}} *← s_{l2}} X_{l2} )     (7)

where s_{d_{l1,l2}} refers to the space in the lowest common parent node for s_{l1} and s_{l2} in T (e.g., s_{d_{es,en}} = s_in in Figure 1). We found that using s_{d_{l1,l2}} instead of the space in the root node of T improves the overall performance of HCEG, apart from reducing the time taken for training (see Section 5.3).

Several researchers have found it beneficial to enforce orthogonality in the transformation matrices W (Xing et al., 2015; Artetxe et al., 2016; Smith et al., 2017). This constraint ensures that the original quality of the embeddings is not degraded when transforming them to a crosslingual space. For this reason, we incorporate an orthogonality constraint L_orth into our loss function in Equation 8, with I being the identity matrix.

L_orth = Σ_{W_{s1←s2} ∈ W} ‖ I − W_{s1←s2} W_{s1←s2}^⊤ ‖     (8)

We also find it beneficial to include a regularization term in L:

L_reg = Σ_{W_{s1←s2} ∈ W} ‖ W_{s1←s2} ‖²     (9)
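As an illustration, the three terms of Equations (6)–(9) can be computed for one batch as follows; the function and argument names are ours, the beta weights follow Section 4.4, and the gradient step itself is omitted.

```python
# Sketch of the combined loss over a batch of aligned, already-mapped pairs.
import numpy as np

def hceg_loss(pairs, W, betas=(0.98, 0.01, 0.01)):
    """pairs: list of (x1, x2) arrays of shape (n, d); rows are aligned word
    embeddings already mapped into the lowest common ancestor space;
    W: dict of d x d transformation matrices."""
    b1, b2, b3 = betas
    # Alignment: negative sum of cosine similarities (rows are unit length).
    l_align = -sum(np.sum(x1 * x2) for x1, x2 in pairs)
    # Orthogonality: how far each mapping is from being orthogonal.
    l_orth = sum(np.linalg.norm(np.eye(w.shape[0]) - w @ w.T) for w in W.values())
    # Regularization: squared Frobenius norm of each mapping.
    l_reg = sum(np.linalg.norm(w) ** 2 for w in W.values())
    return b1 * l_align + b2 * l_orth + b3 * l_reg
```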

3.5 Learning the Parameters

HCEG utilizes stochastic gradient descent for tuning the parameters in W with respect to the training word pairs in P. In each iteration, L is computed and backpropagated in order to tune each transformation matrix in W such that L is minimized. Batching is used to reduce the computational load in each iteration. A batch of word pairs P̂ is sampled from P by randomly selecting α_lpairs language pairs as well as α_wpairs word pairs in each P̂_{l1,l2} ∈ P̂—for example, a batch might consist of 10 P̂_{l1,l2} matrices, each containing 500 aligned words.

Iterations are grouped into epochs of α_iter iterations, at the end of which L is computed for the whole P. We take a conservative approach as the convergence criterion: if no improvement is found in L in the last α_conv epochs, the training loop stops.

We achieve the best convergence time by initializing each W_{s1←s2} ∈ W to be orthogonal. We tried several methods for orthogonal initialization, such as simply initializing to the identity matrix. However, we obtained the most consistent results using the random semi-orthogonal initialization introduced by Saxe et al. (2013).
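One common way to obtain such a random (semi-)orthogonal matrix, in the spirit of Saxe et al. (2013), is to take the Q factor of a QR decomposition of a Gaussian matrix, as in the sketch below; the helper is our own.

```python
# Random orthogonal initialization via QR decomposition of a Gaussian matrix.
import numpy as np

def random_orthogonal(d, seed=0):
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    # Fix the signs so the result is uniformly distributed over orthogonal matrices.
    return q * np.sign(np.diag(r))

W0 = random_orthogonal(300)
assert np.allclose(W0 @ W0.T, np.eye(300), atol=1e-6)
```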

3.6 Iterative Refinement

As shown by Artetxe et al. (2017), the initial lexicon P is iteratively improved by using the generated crosslingual space for inferring a new lexicon P′ at the end of each learning phase described in Section 3.5. More specifically, when computing each P′_{l1,l2} ∈ P′, P′_{l1,l2}(i, j) is 1 (0 otherwise) if

j = arg max_j ( W_{s_{d_{l1,l2}} *← s_{l1}} X_{l1}(i, ·) ) · ( W_{s_{d_{l1,l2}} *← s_{l2}} X_{l2}(j, ·) )     (10)

Potentially, any new bilingual lexicon P′_{l1,l2} can be inferred and included in P′ at the end of each learning phase. However, as the cardinality of L grows, this process can take a prohibitive amount of time given the combinatorial explosion. Therefore, in practice, we only infer P′_{l1,l2} following a criterion intended to maximize lexicon quality. P′_{l1,l2} is inferred for languages l_1 and l_2 only if l_1 and l_2 are siblings in T (they share the same parent node) or l_1 and l_2 are the best representatives of their corresponding families. A language is deemed the best representative of its family if it is the most frequently spoken2 language in its subtree. For example, in Figure 1, Spanish is the best representative for the Italic family, but not for Indo-European, for which English is used.
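A small sketch of this selection criterion, with an illustrative tree encoding of our own (dictionaries mapping a language to its family and a family to its most spoken member):

```python
# Decide whether a new bilingual lexicon should be inferred for (l1, l2):
# only for siblings in the family tree or for two family representatives.
def should_infer(l1, l2, parent, best_representative):
    """parent: dict mapping a language to its family node;
    best_representative: dict mapping a family node to its most spoken language."""
    if parent[l1] == parent[l2]:                        # siblings
        return True
    return (best_representative[parent[l1]] == l1 and   # both are the best
            best_representative[parent[l2]] == l2)      # representatives of their families

parent = {"es": "italic", "pt": "italic", "en": "germanic", "de": "germanic"}
best_representative = {"italic": "es", "germanic": "en"}
assert should_infer("es", "pt", parent, best_representative)      # siblings
assert should_infer("es", "en", parent, best_representative)      # two representatives
assert not should_infer("pt", "de", parent, best_representative)  # neither condition holds
```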

This criterion not only reduces the amount of time required to infer P′ but also improves overall HCEG performance. This is due to a better utilization of the hierarchical characteristics of our crosslingual space, only inferring bilingual lexicons from typologically related languages or their best representatives in terms of resource quality.

2 Based on numbers reported by Lewis and Gary (2015).


3.7 Retrieval Criterion

As discussed in Section 2, one of the issues affecting nearest-neighbor retrieval is hubness (Dinu et al., 2014), where certain words are in the neighborhood of an abnormally large number of other words, causing the nearest-neighbor algorithm to incorrectly prioritize hub words. To address this issue, we use Cross-domain Similarity Local Scaling (CSLS) (Conneau et al., 2017) as the retrieval algorithm during both training and prediction time. CSLS is a rectification of nearest-neighbor retrieval that avoids hubness by counterbalancing the cosine similarity between two embeddings with a factor consisting of the average similarity of each embedding to its k closest neighbors. Following the criteria in Conneau et al. (2017), we set the number of neighbors used by CSLS to k = 10.
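For reference, a minimal numpy sketch of CSLS retrieval with k = 10, assuming both embedding matrices are length normalized and already mapped into a common space; the function name is ours.

```python
# Minimal CSLS retrieval sketch (Conneau et al., 2017) with k = 10.
import numpy as np

def csls_retrieve(Xs, Xt, k=10):
    """Xs, Xt: length-normalized embeddings mapped to a common space.
    Returns, for every source row, the index of the best target row."""
    sims = Xs @ Xt.T                                    # cosine similarities
    # Average similarity of each embedding to its k nearest neighbors
    # on the other side (the local scaling correction).
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)  # (|Vs|,)
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)  # (|Vt|,)
    csls = 2 * sims - r_src[:, None] - r_tgt[None, :]
    return csls.argmax(axis=1)
```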

4 Evaluation Framework

We describe below the evaluation setup used for conducting the experiments presented in Section 5.

4.1 Word Pair Datasets

Dinu-Artetxe. The Dinu-Artetxe dataset, presented by Dinu et al. (2014) and enhanced by Artetxe et al. (2016), is one of the first benchmarks for evaluating crosslingual embeddings. It is composed of English-centered bilingual lexicons for Italian, Spanish, German, and Finnish.

MUSE. The MUSE dataset (Conneau et al., 2017) contains bilingual lexicons for all combinations of German, English, Spanish, French, Italian, and Portuguese. Additionally, it includes word pairs for 44 languages with respect to English.

Panlex. Dinu-Artetxe and MUSE are both English-centered datasets, given that most (if not all) of their word pairs have English as their source or target language. This makes these datasets suboptimal for our purpose of generating and evaluating a non-language-centered crosslingual space. For this reason, we generated a dataset using Panlex (Kamholz et al., 2014), a panlingual lexical database. This dataset (made public in our repository) includes bilingual lexicons for all combinations of the 157 languages for which FastText is available, totalling 24,492 bilingual lexicons. Each of the lexicons was generated by randomly sampling 5k words from the top-200k words in the embedding set for the source language and translating them to the target language using the Panlex database. We find it important to highlight that this dataset contains considerably more noise than the other datasets, given that Panlex is generated in an automatic way and is not as finely curated by humans as the previous datasets. We still find comparisons using this dataset fair, given that its noisy nature should affect all strategies equally.

4.2 Language Selection and Family Tree

As previously stated, we aim to generate a single crosslingual space for as many languages as possible. We started with the 157 languages for which FastText embeddings are available (Grave et al., 2018). We then removed languages that did not meet both of the following criteria: (1) there must exist a bilingual lexicon with at least 500 word pairs for the language in any of the datasets described in Section 4.1, and (2) the embedding set provided by FastText must contain at least 20k words. The first criterion is a minimal condition for evaluation, while the second one is necessary for the unsupervised initialization strategy. The criteria are met by 107 languages, which are the ones used in our experiments. Their corresponding ISO-639 codes can be seen later in Table 5. We use the language family tree defined by Lewis and Gary (2015).

4.3 Framework

For experimental purposes, each dataset described in Section 4.1 is split into training and testing sets. We use the original train-test splits for Dinu-Artetxe and MUSE. For Panlex, we generate a split by randomly sampling word pairs—keeping 80% for training and the remaining 20% for testing. For development and parameter tuning purposes, we use a disjoint set of word pairs specifically created for this purpose based on the Panlex lexical database. This development set contains 10 different languages of varied popularity. None of the word pairs present in this development set are part of either the train or test sets.

4.4 Hyperparameters

The following hyperparameters were manually tuned using the development set described in Section 4.3: β1 = 0.98, β2 = 0.01, β3 = 0.01, t = 1000, α_lpairs = 128, α_wpairs = 2048, α_iter = 5000, α_conv = 25.

Figure 3: Number of correct word pairs inferred using the unsupervised initialization technique presented by Artetxe et al. (2018) and the frequency-based technique described in Section 3.3.


5 Evaluation

We discuss below the results of the study conducted over 107 languages to assess HCEG.

5.1 Unsupervised Initialization

We first evaluate the performance of the unsupervised initialization strategy described in Section 3.3, and compare it with the state-of-the-art strategy proposed by Artetxe et al. (2018). In this case, we run both initialization strategies using the top-20k FastText embeddings (Grave et al., 2018) for all pairwise combinations of the 107 languages we study. For each language pair, we measure how many of the inferred word pairs are present in the corresponding lexicons in the MUSE and Panlex datasets. For MUSE, our proposed initialization strategy (frequency based) obtains an average of 48.09 correct pairs, an improvement with respect to the 29.62 obtained by the strategy proposed by Artetxe et al. (2018). For Panlex, the respective average correct pair counts are 1.05 and 0.55. Both differences are statistically
significant (P < 0.01) using a paired t-test. The noticeable difference across datasets is due to how the sampling was done for generating the datasets: MUSE contains a considerably higher number of frequent words in comparison to Panlex, making the latter a relatively harder dataset for vocabulary induction. In Figure 3 we illustrate the results of each strategy grouped by language-pair similarity. This similarity is based on the number of common parents the two languages share. For example, in Figure 1, Spanish has a similarity of 3, 2, and 1 with Portuguese, English, and Finnish, respec- tively. As we see in Figure 3, similarity is a factor that strongly determines the quality of the alignment generated by the unsupervised initialization. Even if this phenomenon affects both analyzed strategies, our proposed frequency- based initialization strategy consistently obtains a few more correct word pairs for the least similar language pairs, which, as we show in Table 4, are key for generating a correct mapping for those languages. 5.2 State-of-the-Art Comparison In order to contextualize the performance of HCEG with respect to the state-of-the-art (listed in Tables 1 and 2), we measure the word translation prediction capabilities of each of the strategies. We do so using Precision@1 for bilingual lexicon induction as a means to quantify vocabulary induction performance. Scores reported hereafter are average Precision@1 in percentage form, for each of the words in the testing set. When applicable, we report results for both the supervised (HCEG-S) and unsupervised (HCEG- U) versions of HCEG. In the supervised mode, we train one single model per dataset using all the training word pairs available. We then use this model for computing all pairwise scores. In the unsupervised mode, unless explicitly stated otherwise, we train a single model regardless of the dataset used for testing purposes. This means that, in some cases, the unsupervised mode leverages monolingual data beyond the languages used for testing, as it uses all 107 language embeddings. We found it unfair to train a supervised model using 368 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 2 0 1 9 2 3 5 9 6 / / t l a c _ a _ 0 0 3 2 0 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 d e s i v r e p u S Method en-it en-de en-fi en-es Mikolov (2013b) 34.93* 35.00* 25.91* 27.73* Faruqui (2014) 38.40* 37.13* 27.60* 26.80* Shigeto (2015) 41.53* 43.07* 31.04* 33.73* Dinu (2014) 38.93* 29.14* 30.40* 37.7 Lazaridou (2015) 40.2 - - Xing (2015) 36.87* 41.27* 28.23* 31.20* Zhang (2016) 36.73* 40.80* 28.16* 31.07* Artetxe (2016) 41.87* 30.62* 31.40* 39.27 Artetxe (2017) 40.87 39.67 28.72 Smith (2017) 43.33* 29.42* 35.13* 43.1 32.94 44.13 Artetxe (2018a) 45.27 - - 45.5 Jouling (2018) 36.0 49.1 Jawanpuria (2019) mul 48.7 36.1 49.3 48.3 Jawanpuria (2019) 36.60 - 36.0 39.3 - - i . Artetxe (2017) 25 m Smith (2017) cog e S Artetxe (2017) num 37.27 39.9 39.40 39.60 - 40.27 28.16 - 26.47 - - - d e s i v r e p u s n U Zhang (2017), λ = 1 Zhang (2017) λ = 10 Conneau (2017) code Conneau (2017) paper 45.1 Artetxe (2018) HCEG-U 0.00* 0.00* 0.00* 0.00* 0.01* 0.00* 45.15* 46.83* 0.38* 0.01* 0.01* 32.63 48.19 34.82 48.18 48.13 49.02 0.00* 0.01* 35.38* 35.44* 37.33 42.15 Table 1: Results using the Dinu-Artetxe dataset. Scores marked with (*) were reported by Artetxe et al. (2018); the remaining ones were reported in the corresponding original papers. 
the Dinu-Artetxe dataset given that it only contains four bilingual lexicons, not enough for training our tree structure. Thus, only unsupervised results are shown for that dataset. in most cases, As shown in Table 1, the unsupervised version of HCEG achieves, the best performance among all unsupervised strategies, even improving over state-of-the-art supervised models in some cases. The improvement is most noticeable for Italian and Spanish, where HCEG-U obtains an improvement of 1 and 3 points, respectively. A similar behavior can be seen in Table 2, where we describe the results on the MUSE dataset. Spanish, along with Catalan, Italian, and Portuguese, obtains a sub- stantially larger improvement compared with other languages. We attribute this to the fact that Spanish is the second most resourceful language in terms of corpora after English. This makes the quality of Spanish word embeddings comparably better than other languages, which as a result improves the mapping quality of typologically related languages, such as Portuguese, Italian, or Catalan. To further contextualize the performance of HCEG-U, in terms of its capability for generating in an unsupervised crosslingual embeddings fashion, we conducted further experiments. In Table 3, we summarize the results obtained from comparing HCEG-U with other unsupervised strategies focused on learning crosslingual word embeddings. In our comparisons we include (i) a direct bilingual learning baseline that simply learns a bilingual mapping using two monolingual word embeddings (Conneau et al., 2017), (ii) a pivot-based strategy that can leverage a third lan- guage for learning a crosslingual space (Conneau et al., 2017), and (iii) a fully multilingual, pivotless strategy that aggregates languages into a joint space in an iterative manner (Chen and 369 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 2 0 1 9 2 3 5 9 6 / / t l a c _ a _ 0 0 3 2 0 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Conneau Joulin Artetxe (2018) (2018) (2017) HCEG-S HCEG-U− HCEG-U bg ca cs da de el es et fi fr he hr hu id it mk nl no pl pt ro ru sk sl sv tr uk vi Avg. 57.5 70.9 64.5 67.4 72.7 58.5 83.5 45.7 59.5 82.4 54.1 52.2 64.9 67.9 77.9 54.6 75.3 67.4 66.9 80.3 68.1 63.7 55.3 50.4 60.0 59.2 49.3 55.8 63.8 63.9 73.8 68.2 71.1 76.9 62.7 86.4 49.5 65.8 84.7 57.8 55.6 69.3 69.7 81.5 59.9 79.7 71.2 70.5 82.9 74.0 67.1 59.0 54.2 63.7 61.9 51.5 55.8 67.4 65.8 76.3 70.2 70.3 79.1 67.8 88.6 55.8 68.1 87.6 61.1 57.6 69.6 75.5 83.3 63.5 79.9 69.9 72.0 85.5 75.4 69.5 62.0 60.1 66.2 68.7 56.4 3.9 68.2 64.1 73.1 68.2 68.8 75.8 65.3 86.8 53.5 65.2 85.4 59.5 54.8 66.8 73.2 81.3 62.3 79.4 69.5 70.7 83.8 72.8 68.1 59.6 57.7 65.0 66.3 53.8 55.5 68.1 64.0 74.2 65.9 71.9 75.2 66.4 86.4 53.4 65.4 85.7 61.4 54.1 64.5 73.5 79.7 62.5 79.7 69.3 70.5 83.2 71.7 69.1 56.7 59.7 64.8 66.6 55.7 55.6 68.1 67.5 77.7 71.7 72.7 79.0 68.5 90.4 57.3 68.3 88.3 63.0 58.2 70.1 75.6 85.6 64.9 81.9 71.9 72.8 87.8 76.0 69.8 62.4 61.1 68.0 70.0 56.4 58.3 71.2 Table 2: Results on the MUSE dataset. Scores from Artetxe et al. (2018) were obtained using the scripts shared by the authors. All the other scores were reported in Joulin et al. (2018). HCEG-U− only considers the 29 languages in the experiment for training. Cardie, 2018). From the reported results, we see that HCEG-U− outperforms all other considered strategies for 24 out of 30 language pairs. 
Highest improvements are found for languages of the Italic family (Spanish, Portuguese, Italian, and French). We observe that HCEG-U− under-performed when the corresponding experiment involved the German language as source or target. We attribute this behavior to the fact that the Italic family is predominant in the languages explored in this experiment. (2018), we limited the monolingual data that HCEG-U− used to the six languages considered in this experiment (results that are reported in Table 3). However, in order to show the full poten- tial of HCEG-U, we also include results achieved when using 107 languages (column HCEG-U). As seen in Tables 2 and 3, the differences between HCEG-U− and HCEG-U are considerable, mani- festing the capabilities of the proposed model to take advantage of monolingual data in multiple languages at the same time. In order to perform a fair comparison with respect to the work proposed by Chen and Cardie The importance of explicitly considering topological connections among languages to 370 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 2 0 1 9 2 3 5 9 6 / / t l a c _ a _ 0 0 3 2 0 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Method Type en-de en-fr en-es en-it en-pt de-fr de-es de-it de-pt fr-es fr-it fr-pt es-it es-pt it-pt Conneau (2017) Direct 74.0 74.0 Conneau (2017) Pivot 74.8 Multi Chen (2018) HCEG-U− 74.5 Multi 82.3 81.7 82.3 81.7 82.4 82.5 82.8 82.7 77.0 80.7 77.0 80.7 78.8 81.5 79.5 81.7 73.0 65.7 71.9 66.1 76.7 69.6 73.5 68.0 66.5 58.5 68.0 57.4 72.0 63.2 72.2 63.3 83.1 83.0 77.9 83.3 87.3 80.5 81.1 79.7 74.7 81.9 85.0 78.9 83.9 83.5 79.3 84.5 87.8 82.3 84.4 83.9 79.8 86.0 88.9 83.6 HCEG-U Multi 79.4 88.4 89.8 85.4 88.1 77.4 72.3 76.5 66.7 89.1 86.1 84.8 89.4 89.7 86.3 Method Type de-en fr-en es-en it-en pt-en fr-de es-de it-de pt-de es-fr it-fr pt-fr it-es pt-es pt-it Conneau (2017) Direct 72.2 72.2 Conneau (2017) Pivot 72.9 Multi Chen (2018) HCEG-U− 72.4 Multi 82.1 83.3 82.1 83.3 81.8 83.7 82.6 84.1 77.7 80.1 77.7 80.1 77.4 79.9 77.8 80.3 69.7 68.8 68.1 67.9 71.2 69.0 71.2 67.8 62.5 60.5 66.1 63.1 69.5 65.7 69.6 65.6 87.6 83.9 87.7 92.1 80.6 86 84.7 86.5 82.6 85.8 91.3 79.2 86.9 88.1 86.3 88.2 92.7 82.6 87.5 88.8 87.0 89.5 94.0 83.9 HCEG-U Multi 78.6 88.2 91.0 85.8 87.5 75.4 71.2 73.9 68.6 90.6 91.0 90.2 91.4 94.3 87.1 Table 3: Comparison of unsupervised crosslingual embedding learning strategies under different merging scenarios in the MUSE dataset. Direct indicates a traditional bilingual scenario where a mapping from source to target is learned. Pivot uses an auxiliary pivot language (English) for merging multiple languages into the same space. Multi merges all languages into the same space without using a pivot. All scores except HCEG-U were originally reported by Chen and Cardie (2018). HCEG-U− only considers the six languages in the experiment for training. Note that HCEG-U is excluded when highlighting the best model (bold), given that it uses monolingual data beyond what other models do. enhance mappings become more evident when analyzing the data in Table 5. Here we include the pairing that yielded the best and worst mapping for each language, as well as the position of English in the quality ranking. English and Spanish have a strong quality mapping with respect to each other, Spanish being the language with which English obtains the best mapping and English is the second-best mapped language for Spanish. 
Additionally, Spanish is the language with which Italian, Portuguese, and Catalan obtain the best mapping quality. On the other side of the spectrum, the worst mappings are dominated by two languages, Georgian and Vietnamese, with 40 languages having these two language as worst; this is followed by Maltese, Albanian, and Finnish, with 8 occurrences each. This is not unexpected, as these languages are relatively isolated in the language family tree, and also have a low number of speakers. We also see that English is usually on the top side of the ranking for most languages. For languages that are completely isolated, such as Basque and Yoruba, English tends to be their best mapped language. From this we surmise that when typological relations are lacking, the quality of the embedding space is the only aspect the mapping strategy can rely on. Given space constraints, we cannot show the vocabulary induction scores for the 24,492 language pairs in the Panlex dataset. Instead, we group the results using two variables: the sum of number of speakers for each of the two languages, and the minimum similarity (as defined in Section 5.1) for each language with respect to English. We rely on these variables for grouping purposes as they align with two of our objectives for designing HCEG: (1) remove the bias towards the pivot language (English), and (2) improve the performance of low-resource languages by taking advantage of typologically similar languages. Figure 4 captures the improvement (2.7 on average) of HCEG-U over the strategy introduced in Artetxe et al. (2018) (the best-performing benchmark), grouped by the aforementioned variables. We excluded Hindi and Chinese from the figure, as they made any pattern hard to observe given their high number of speakers. The sum of number of speakers axis was also logarithmically scaled to facilitate visualization. The figure captures an evident trend in the simi- larity axis. The lower the similarity of the lan- guage with respect to English, the higher the improvement achieved by HCEG-U. This can be attributed to the manner in which TB/MP models generated the space using English as primary resource, hindering the potential quality of languages that are distant from it. Additionally, we see a less-prominent but existing trend in 371 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 3 2 0 1 9 2 3 5 9 6 / / t l a c _ a _ 0 0 3 2 0 p d . f b y g u e s t t o n 0 8 S e p e m b e r 2 0 2 3 Description Dinu-ArtetxeMUSEPanlex d¬Hierarchy e s i v r e p u S ¬Orthogonal Init. ¬Iterative Refinement All vs All Inference World langs. as root HCEG-S d¬Hierarchy ¬Orthogonal Init. ¬Iterative Refinement All vs All Inference World langs. as root e s i v r e p u s n U ¬Freq. based Init. HCEG-U - - - - - - 40.2 43.2 0.09 39.3 42.8 41.2 43.5 66.7 67.8 65.4 66.3 67.5 68.1 32.0 36.5 35.1 36.6 35.7 37.3 28.1 67.9 34.7 71.0 0.02 0.08 34.6 69.4 33.8 70.2 68.0 31.1 71.2 35.8 Figure 4: Improvement over the strategy proposed by Artetxe et al. (2018) in Panlex, in terms of language similarity and number of speakers. Darker denotes larger improvement. the speaker sum axis. Despite some exceptions, HCEG-U obtains higher differences with respect to Artetxe et al. (2018) the less spoken a language is. A behavior that is similar in essence to a Pareto front can also be depicted from the figure. 
Even if both variables contribute to the difference in improvement of HCEG-U, one variable needs to compensate for the other in order to maximize accuracy. In other words, the improvement is higher the fewer speakers the language pair has or the more distant the two languages are from English, but when both variables go to the extreme, the improvement decreases. The aforementioned trends serve as evidence that the hierarchical structure is indeed important when building a crosslingual space that considers typologically diverse languages, validating our premises for designing HCEG. 5.3 Ablation Study In order to assess the validity of each functionality included as part of HCEG, we conducted an ablation study. We summarize the results of this study in Table 4, where the symbol ¬ indicates that the subsequent feature is ablated in the model. For example, ¬Hierarchy indicates that the Hierarchy structure is removed, replacing it by a structure where each language needs just one transformation matrix to reach the World languages space. Table 4: Ablation study. As indicated by the ablation results, the hierarchical structure is indeed a key part of HCEG, considerably reducing its performance when removed, and having its strongest effect in the dataset with the highest number of languages (i.e., Panlex). The importance of the Iterative Refinement strategy is also noticeable, making the unsupervised version of HCEG useless when removed. The Frequency-based initialization is also a characteristic that considerably improves the results of HCEG-U. Looking deeper into the data, we found 2,198 language pairs (about 9% of all pairs) that obtained a vocabulary induction accuracy close to 0 (<0.05) without using this initialization, but were able to produce enough signal to yield more substantial accuracy values (>10.0) when using the Frequency-based
initialization. Finally, the design decisions that we initially took for reducing training time—(i) the orthogonal initialization, (ii) the heuristic-based inference, and (iii) using the lowest common root for computing the loss function—also have a positive effect on the performance of HCEG.

5.4 Influence of Pivot Choice

One of the premises for building HCEG was to design a strategy that would not require pivots for achieving a single space with multiple word embeddings, given that a pivot induces a bias into the final space that can hinder the quality of the mapping for languages that are too distant from it. In this section we describe the results of experiments conducted for measuring the effect pivot selection can have on the performance of the mapping. To do so, we measure the

l B,W,E
af nl,fi,4
als en,vi,1
am arz,Di,80 cs
an es,ka,17
arz mt,ja,3
as bn,vi,4
ast es,ja,20
ba tt,sq,34
bar de,fi,6
be ru,vi,4
bg mk,ka,9
bn as,vi,6
br cy,ka,18
bs
sr,ka,2
ca es,mt,5
ce en,sq,1

ceb tl,li,22
ckb tg,tr,19
sk,vi,12

tk,ka,13
tr
jv id,scn,34 my zh,mk,19 sco en,mt,1
gd,tt,12
ga
tt
ba,sa,9
sd bn,tl,5
nds nl,vi,3
ka en,bs,1
ga,vi,2
gd
ug tr,vls,4
si
af,ka,4
nl
kk ky,vi,51
dv,ka,5
gl
pt,ka,16
uk ru,fi,19
sk cs,vi,5
sv,vi,3
NO
km vi,nl,4
gom mr,fi,10
ur hi,eo,10
sr,vi,6
sl
es,my,3
oc
kn ta,lt,55
pa,ka,3
gu
vec pms,tr,2
so arz,sq,73
pa
arz,mk,10 ko en,af,1
he
gu,vi,6
vi
km,vls,3
sq en,tt,1
pam id,sr,18
ur,ka,5
CIAO
pl
hr
vls nl,eo,8
hr,vi,4
sr
cs,vi,4
sr,tt,5
pms vec,sah,7 su id,mk,37 wa fr,fi,7
hsb pl,am,3
es,mt,5
yo en,lt,1
sv da,vi,5
pt
fi,ckb,9
hu
en,bn,1
ta ml,mt,3
qu
en,fi,1
hy
zh my,Di,10
es,vi,6
ta,mk,15
te
ro
jv,vi,3
id
uk,su,20
ckb,ka,13
tg
ru
id,sq,6
ilo
en,vls,1
th
sa
sv,ka,3
hu,COME,24 È
CIAO,ka,2
tr,lt,7
tk
sah tr,ka,2
es,mt,5
Esso
Esso,vi,5
ceb,ru,47
tl
scn it,ka,21
en,vi,1
ja
en,eo,1

ky kk,af,17
la
es,mt,3
lb de,ka,2
nl,ka,7
li
lt
ru,mt,5
mg id,sq,44
mk bg,vi,4
ml
ta,sq,29
mr si,ka,21
mt arz,tt,70

cv tr,sq,2
cy br,fi,2
sv,fi,4
da
Di
lb,mt,5
dv si,ka,3
en,eo,1
el
en es,gv,-
eo en,sq,1
es
pt,vi,2
eu en,lt,1
fi
fr
fy

Table 5: Best (B), worst (W), and English mapping ranking (E) for each language (l).

Tested language families (column headings, rotated in the original layout): Afro-Asiatic, Austronesian, Indo-European/Balto-Slavic, Indo-European/Germanic, Indo-European/Indo-Iranian, Indo-European/Italic, Sino-Tibetan, Turkic, Uralic; final column: Avg.

Pivot Language Family

en
arz
id
ru
Di
CIAO
es
pt
zh
tr
hu

27.3 28.7 32.1 39.8 31.4 40.4 27.3 26.9 28.3 31.4
Indo-European/Germanic
30.2 27.1 28.1 32.1 28.3 33.4 25.1 23.4 27.1 28.3
Afro-Asiatic
27.1 30.3 27.7 31.1 28.3 32.5 25.8 24.6 27.6 28.3
Austronesian
Indo-European/Balto-Slavic 26.3 26.3 34.2 38.2 28.5 37.3 24.6 22.5 26.8 29.4
Indo-European/Germanic
25.1 26.9 25.1 37.6 27.3 37.2 24.7 23.7 25.6 28.1
Indo-European/Indo-Iranian 26.3 27.1 26.1 33.7 32.3 34.2 23.4 25.6 26.4 28.3
26.9 26.7 30.6 38.5 31.0 41.5 26.8 26.7 28.4 30.8
Indo-European/Italic
26.0 26.6 30.4 37.9 27.7 41.3 25.9 26.4 26.5 29.9
Indo-European/Italic
25.1 27.3 25.3 23.4 26.1 24.8 29.3 25.7 27.6 26.1
Sino-Tibetan
24.9 25.3 25.5 28.2 27.8 28.6 25.3 28.7 27.3 26.8
Turkic
25.4 25.8 25.8 31.8 26.4 32.8 25.5 21.9 30.1 27.3
Uralic

Table 6: Results obtained by existing bilingual mapping strategies using different pivots on the Panlex dataset. Values in each cell indicate the average performance obtained for each of the pairwise combinations of languages under the family noted in the corresponding column title. For example, the first cell indicates the average score obtained for all possible combinations of Afro-Asiatic languages using English as a pivot. Results are averaged across the strategies presented in Conneau et al. (2017) and Artetxe et al. (2018) in order to avoid system-specific biases.

performance of state-of-the-art bilingual mapping strategies in a pivot-based inference scenario. We use 11 different pivots and average the results of two different strategies—Conneau et al. (2017) and Artetxe et al. (2018)—grouped by several language families. As depicted by the results presented in Table 6, selecting a pivot that belongs to the family of the languages being

tested is always the best choice. In cases where we considered multiple pivots of the same family, the most resource-rich language resulted in the best option, namely, Spanish in the case of the Italic family and English for the Germanic family. On average, English is the best choice of pivot if all language families need to be considered, followed by Spanish and Portuguese. This validates two of the design decisions for HCEG, that is, the need to avoid selecting a pivot and the importance of using the languages with the largest speaker base when performing language transfer.

6 Conclusion and Future Work

We have introduced HCEG, a crosslingual space learning strategy that does not depend on a pivot language; instead, it takes advantage of the natural hierarchy existing among languages. Results from extensive studies on 107 languages demonstrate that the proposed strategy outperforms existing crosslingual space generation techniques, in terms of vocabulary induction, for both popular and not so popular languages. HCEG improves the mapping quality of many low-resource languages. We noticed, however, that this improvement mostly happens when a language has more typologically related counterparts. Therefore, as future work, we intend to investigate other techniques that can help improve the quality of mapping for typologically isolated low-resource languages. Additionally, it is important to note that the time complexity required by the proposed algorithm is N(N−1), with N being the number of languages considered. For the traditional TB/MP strategy, complexity is limited to learning from N language pairs. Therefore, we plan on exploring strategies to reduce the number of language pairs that need to be learned for creating the crosslingual space. Finally, we will explore different data-driven strategies for building the tree structure, such as geographical proximity or lexical overlap, which could lead to better optimized arrangements of the crosslingual space.

References

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289–2294.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462. ACL.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 789–798.

Piotr Bojanowski, Edouard Grave, Armand Joulin,
and Tomas Mikolov. 2017. Enriching word vec-
tors with subword information. Transactions of
the Association for Computational Linguistics,
5:135–146.

Fabienne Braune, Viktor Hangya, Tobias Eder, and Alexander Fraser. 2018. Evaluating bilingual word embeddings on the long tail. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 188–193.

Xilun Chen and Claire Cardie. 2018. Unsupervised multilingual word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 261–270.

Bernard Comrie. 1989. Language Universals and Linguistic Typology: Syntax and Morphology, University of Chicago Press.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.

Jocelyn Coulmance, Jean-Marc Marty, Guillaume Wenzek, and Amine Benhalloum. 2015. Trans-gram, fast cross-lingual word-embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1109–1113.

Paula Czarnowska, Sebastian Ruder, Édouard Grave, Ryan Cotterell, and Ann Copestake. 2019. Don't forget the long tail! A comprehensive analysis of morphological generalization in bilingual lexicon induction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 973–982.

Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. 2014. Improving zero-shot learning by mitigating the hubness problem. arXiv preprint arXiv:1412.6568.

Yerai Doval, Jose Camacho-Collados, Luis Espinosa Anke, and Steven Schockaert. 2018. Improving cross-lingual word embeddings by meeting in the middle. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 294–304. ACL.

Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, and Trevor Cohn. 2016. Learning crosslingual word embeddings without bilingual corpora. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1285–1295.

Goran Glavas, Robert Litschko, Sebastian Ruder, and Ivan Vulic. 2019. How to (properly) evaluate cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions. arXiv preprint arXiv:1902.00508.

Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. BilBOWA: Fast bilingual distributed representations without word alignments. In International Conference on Machine Learning, pages 748–756.

Stephan Gouws and Anders Søgaard. 2015. Simple task-specific bilingual word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1386–1390.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Karl Moritz Hermann and Phil Blunsom. 2013. Multilingual distributed representations without word alignment. arXiv preprint arXiv:1312.6173.

Geert Heyman, Bregt Verreet, Ivan Vulić, and Marie-Francine Moens. 2019. Learning unsupervised multilingual word embeddings with incremental multilingual hubs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1890–1902.

Pratik Jawanpuria, Arjun Balgovind, Anoop Kunchukuttan, and Bamdev Mishra. 2019. Learning multilingual word embeddings in latent metric space: A geometric approach. Transactions of the Association for Computational Linguistics, 7:107–120.

Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Edouard Grave. 2018. Loss in translation: Learning bilingual word mapping with a retrieval criterion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2979–2984.

David Kamholz, Jonathan Pool, and Susan Colowick. 2014. PanLex: Building a resource for panlingual lexical translation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pages 3145–3150.

Yova Kementchedjhieva, Sebastian Ruder, Ryan Cotterell, and Anders Søgaard. 2018. Generalizing Procrustes analysis for better bilingual dictionary induction. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 211–220.

Stanislas Lauly, Alex Boulanger, and Hugo Larochelle. 2014. Learning multilingual word representations using a bag-of-words autoencoder. arXiv preprint arXiv:1401.1803.

M. Paul Lewis, Gary F. Simons, and Charles D. Fennig (eds.). 2015. Ethnologue: Languages of the World, pages 233–62.

Yuri Lin, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant, Will Brockman, and Slav Petrov. 2012. Syntactic annotations for the
Google Books Ngram corpus. In Proceedings of the ACL 2012 System Demonstrations, pages 169–174. ACL.

Robert Litschko, Goran Glavaš, Ivan Vulic, and Laura Dietz. 2019. Evaluating resource-lean cross-lingual embedding models in unsupervised retrieval. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1109–1112. ACM.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 151–159.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2017. A survey of cross-lingual word embedding models. arXiv preprint arXiv:1706.04902.

Andrew M. Saxe, James L. McClelland, and Surya Ganguli. 2013. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.

Yutaro Shigeto, Ikumi Suzuki, Kazuo Hara, Masashi Shimbo, and Yuji Matsumoto. 2015. Ridge regression, hubness, and zero-shot learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 135–151. Springer.

Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.

Anders Søgaard, Željko Agić, Héctor Martínez Alonso, Barbara Plank, Bernd Bohnet, and Anders Johannsen. 2015. Inverted indexing for cross-lingual NLP. In The 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015).

Anders Søgaard, Sebastian Ruder, and Ivan Vulić. 2018. On the limitations of unsupervised bilingual dictionary induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 778–788.

Ivan Vulić, Goran Glavaš, Roi Reichart, and Anna Korhonen. 2019. Do we really need fully unsupervised cross-lingual embeddings? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4398–4409.

Ivan Vulić and Marie-Francine Moens. 2016. Bilingual distributed word representations from document-aligned comparable data. Journal of Artificial Intelligence Research, 55:953–994.

Takashi Wada, Tomoharu Iwata, and Yuji Matsumoto. 2019. Unsupervised multilingual word embedding with limited resources using neural language models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3113–3124.

Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1006–1011.

Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1393–1398.
