Hierarchical Mapping for Crosslingual Word Embedding Alignment
Ion Madrazo Azpiazu and Maria Soledad Pera
Department of Computer Science
Boise State University
{ionmadrazo,solepera}@boisestate.edu
Abstract
The alignment of word embedding spaces in
different languages into a common crosslingual
space has recently been in vogue. Strategies
that do so compute pairwise alignments
and then map multiple languages to a single
pivot language (most often English). These
strategies, however, are biased towards the
choice of the pivot language, given that
language proximity and the linguistic
characteristics of the target language can strongly
impact the resultant crosslingual space in
detriment of topologically distant languages. We
present a strategy that eliminates the need
for a pivot language by learning the mappings
across languages in a hierarchical way. Experi-
ments demonstrate that our strategy signifi-
cantly improves vocabulary induction scores
in all existing benchmarks, as well as in a new
non-English–centered benchmark we built,
which we make publicly available.
1 Introduction
Word embeddings have changed how we build text
processing applications, given their capabilities
for representing the meaning of words (Mikolov
et al., 2013a; Pennington et al., 2014; Bojanowski
et al., 2017). Traditional embedding-generation
strategies create different embeddings for the same
word depending on the language. Even if the
embeddings themselves are different across lan-
guages, their distributions tend to be consistent—
the relative distances across word embeddings
are preserved regardless of the language (Mikolov
et al., 2013b). This behavior has been exploited
for crosslingual embedding generation by aligning
any two monolingual embedding spaces into one
(Dinu et al., 2014; Xing et al., 2015; Artetxe et al.,
2016).
Alignment techniques have been successful in
generating bilingual embedding spaces that can
later be merged into a crosslingual space using
a pivoting language, English being the most
common choice. Unfortunately, mapping one
language into another suffers from a neutrality
problem, as the resultant bilingual space is
impacted by language-specific phenomena and
corpus-specific biases of the target language
(Doval et al., 2018). To address this issue,
Doval et al. (2018) propose mapping any two
languages into a different middle space. This
mapping, however, precludes the use of a pivot
language for merging multiple bilingual spaces
into a crosslingual one, limiting the solution to a
bilingual scenario. In addition, the pivoting
strategy suffers from a generalized bias problem,
as languages that are the most similar to the
pivot obtain a better alignment and are therefore
better represented in the crosslingual space. This is
because language proximity is a key factor when
learning alignments. This is evidenced by the
results in Artetxe et al. (2017), which indicate
that when using English (Indo-European) as a
pivot, the vocabulary induction results for Finnish
(Uralic) are about 10 points below the rest of the
Indo-European languages under study.
If we want to incorporate all languages into
the same crosslingual space regardless of their
characteristics, we need to go beyond the
train-bilingual/merge-by-pivoting (TB/MP) model,
and instead seek solutions that can directly
generate crosslingual spaces without requiring
a bilingual step. This motivates the design of
HCEG (Hierarchical Crosslingual Embedding
Generation), the hierarchical pivotless approach
for generating crosslingual embedding spaces
that we present in this paper. HCEG addresses
both the language proximity and target-space bias
problems by learning a compositional mapping
across multiple languages in a hierarchical
fashion. This is accomplished by taking advantage of
a language family tree for aggregating multiple
languages into a single crosslingual space. What
distinguishes HCEG from TB/MP strategies is
that it does not need to include the pivot language
Transactions of the Association for Computational Linguistics, vol. 8, pp. 361–376, 2020. https://doi.org/10.1162/tacl_a_00320
Action Editor: Eneko Agirre. Submission batch: 7/2019; Revision batch: 2/2020; Published 7/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
in all mapping functions. This enables the option
to learn mappings between typologically similar
languages, known to yield better quality mappings
(Artetxe et al., 2017).
The main contributions of our work include:
• A strategy1 that leverages a language family
tree for learning mapping matrices that are
composed hierarchically to yield crosslingual
embedding spaces for language families.
• An analysis of the benefits of hierarchically
generating mappings across multiple lan-
guages compared to traditional unsupervised
and supervised TB/MP alignment strategies.
2 Related Work
Recent interest in crosslingual word embedding
generation has led to manifold strategies that
can be classified into four groups (Ruder et al.,
2017): (1) Mapping techniques that rely on a
bilingual lexicon for mapping an already trained
monolingual space into another (Mikolov et al.,
2013b; Artetxe et al., 2017; Doval et al., 2018);
(2) Pseudo-crosslingual techniques that generate
synthetic crosslingual corpora that are then used
in a traditional monolingual strategy, by randomly
replacing words of a text with their translations
(Gouws and Søgaard, 2015; Duong et al., 2016)
or by combining texts in various languages into
one (Vulić and Moens, 2016); (3) Approaches
that only optimize for a crosslingual objective
function, which require parallel corpora in the
form of aligned sentences (Hermann and Blunsom,
2013; Lauly et al., 2014) or texts (Søgaard et al.,
2015); and (4) Approaches using a joint objective
function that optimizes both mono- and crosslingual
loss, which rely on parallel corpora aligned
at the word (Zou et al., 2013; Luong et al., 2015)
or sentence level (Gouws et al., 2015; Coulmance
et al., 2015).
A key factor for crosslingual embedding
generation techniques is the amount of supervised
signal needed. Parallel corpora are a scarce
resource, even nonexistent for some isolated
or low-resource languages. Thus, we focus on
mapping-based strategies that can go from
requiring just a bilingual lexicon (Mikolov et al.,
2013b) to absolutely no supervised signal (Artetxe
et al., 2018). This aligns with one of the premises
for our research: to enable the generation of a
single crosslingual embedding space for as many
languages as possible.
1 Resources can be found at https://github.com/ionmadrazo/HCEG.
Mikolov et al. (2013b) first introduced a
mapping strategy for aligning two monolingual
spaces that learns a linear transformation from
source to target space using stochastic gradient
descent. This approach was later enhanced with
the use of least squares for finding the optimal
solution, L2-normalizing the word embeddings, or
constraining the mapping matrix to be orthogonal
(Dinu et al., 2014; Shigeto et al., 2015; Xing
et al., 2015; Artetxe et al., 2016; Smith et al.,
2017); enhancements that soon became standard
in the area. These models, however, are affected
by hubness, where some words tend to be in the
neighborhood of an exceptionally large number
of other words, causing problems when using
nearest-neighbor as the retrieval algorithm, and
neutrality, where the resultant crosslingual space
is highly conditioned by the characteristics of the
language used as target. Hubness was addressed by
a correction applied to nearest-neighbor retrieval,
using either an inverted softmax (Smith et al.,
2017) or cross-domain similarity local scaling
(Conneau et al., 2017), later incorporated as
part of the training loss (Joulin et al., 2018).
Neutrality was noticed by Doval et al. (2018), for
which they proposed using two independent linear
transformations so that the resulting crosslingual
space is in a middle point between the two
languages rather than just on the target language,
and therefore not biased towards either language.
Other important trends in the area concentrate
on (i) the search for unsupervised techniques
for learning mapping functions (Conneau et al.,
2017; Artetxe et al., 2018) and their versatility
in dealing with low-resource languages (Vulić
et al., 2019); (ii) the long-tail problem, where
most existing crosslingual embedding generation
strategies tend to under-perform (Braune et al.,
2018; Czarnowska et al., 2019); and (iii) the
formulation of more robust evaluation procedures
oriented to determining the quality of generated
crosslingual spaces (Glavas et al., 2019; Litschko
et al., 2019).
Most existing works focus on a bilingual
scenario. Yet, there is increasing interest
in designing strategies that directly consider more
than two languages at training time, thus creating
fully multilingual spaces that do not depend on the
TB/MP model (Kementchedjhieva et al., 2018)
for multilingual inference. Attempts to do so
include the efforts by Søgaard et al. (2015), who
leverage an inverted index based on the Wikipedia
multilingual links to generate multilingual word
representations. Wada et al. (2019) instead use a
sentence-level neural language model for directly
learning multilingual word embeddings, as a
result bypassing the need for mapping functions.
In the paradigm of aligning pre-trained word
embeddings, where we focus, Heyman et al.
(2019) propose a technique that iteratively builds
a multilingual space starting from a monolingual
space and incrementally incorporating languages
into it. Even if this strategy deviates from the
traditional TB/MP model, it still preserves the idea of
having a pivot language. Chen and Cardie (2018)
separate the mapping functions into encoders and
decoders, which are not language-pair dependent,
unlike those in the TB/MP model. This removes
the need for a pivot language, given that the
multilingual space is now latent among all encoders
and decoders and not centered in a specific
language. The same pivot-removal effect is
achieved by the strategy introduced in Jawanpuria
et al. (2019), which generalizes a bilingual word
embedding strategy into a multilingual counterpart
by inducing a Mahalanobis similarity metric in the
common space. These two strategies, however,
still consider all languages equidistant to each
other, ignoring the similarities and differences
that lie among them.
Our work is inspired by Doval et al. (2018)
and Chen and Cardie (2018), in the sense that
it focuses on obtaining a non-biased or neutral
crosslingual space that does not need to be cen-
tered in English (or any other pivot language)
as the primary source. This neutrality is obtained
by a compositional mapping strategy that hierar-
chically combines mapping functions in order to
generate a single, non-language-centered crosslingual
space, enabling a better mapping for
languages that are distant from, or typologically
unrelated to, English.
3 Proposed Strategy
A language family tree is a natural categorization
of languages that has historically been used by
linguists as a reference that encodes similarities
and differences across languages (Comrie, 1989).
For example, based on the relative distances
among languages in the tree illustrated in Figure 1,
we infer that both Spanish and Portuguese are
relatively similar to each other, given that they are
part of the same Italic family. At the same time,
both languages are farther apart from English than
from each other, and are radically different with respect
to Finnish.
A language family tree offers a natural
organization that can be exploited when building
crosslingual spaces that integrate typologically
diverse languages. We leverage this structure
in HCEG, in order to generate a hierarchically
compositional crosslingual word embedding
space. Unlike traditional TB/MP strategies that
generate a single crosslingual space, the result
of HCEG is a set of transformation matrices
that can be used to hierarchically compose the
space required in each use-case. This maximizes
the typological intra-similarity among languages
used for generating the embedding space, while
minimizing the differences across languages
that can hinder the quality of the crosslingual
embedding space. Thus, if an external application
only considers languages that are Germanic, then
it can just use the Germanic crosslingual space
generated by HCEG, whereas if it needs languages
beyond Germanic it can utilize a higher level
family, such as the Indo-European. This cannot
be done with the traditional TB/MP model. In that
case, if an application is, for example, only using
Uralic languages, then it would be forced to use an
English-centered crosslingual space; this would
result in a decrease in the quality of the crosslingual
space used because of the potential bad quality
of mappings between typologically different
languages, such as Uralic and Indo-European
languages (Artetxe et al., 2017).
3.1 Definitions
Let $L = \{l_1, \dots, l_{|L|}\}$ be the set of languages
considered, $F = \{f_1, \dots, f_{|F|}\}$ a set of language
families, and $S = L \cup F = \{s_1, \dots, s_{|F|+|L|}\}$ a
set of possible language spaces. Let $X_l \in \mathbb{R}^{V_l \times d}$
be the set of word embeddings in language $l$,
where $V_l$ is the vocabulary of $l$ and $d$ is the number
of dimensions of each embedding. Consider $T$ as
a language family tree (exemplified in Figure 1).
The nodes in $T$ represent language spaces in
$S$, while each edge represents a transformation
between the two nodes attached to it; that is,
$W_{s_a \leftarrow s_b} \in \mathbb{R}^{d \times d}$ refers to the transformation from
space $s_b$ to space $s_a$. For notation ease, we refer
to $W_{s_a \overset{*}{\leftarrow} s_b}$ as the transformation that results from
aggregating all transformations in the path from
$s_b$ to $s_a$, using the dot product:
$$ W_{s_a \overset{*}{\leftarrow} s_b} = W_{s_a \leftarrow s_{t_1}} W_{s_{t_1} \leftarrow s_{t_2}} W_{s_{t_2} \leftarrow s_b} \qquad (1) $$
where the path from $s_a$ to $s_b$ is $s_a, s_{t_1}, s_{t_2}, s_b$; $s_{t_1}$
and $s_{t_2}$ are intermediate spaces between $s_a$ and $s_b$.
Finally, $P$ is a set of bilingual lexicons, where
$P_{l_1,l_2} \in \{0, 1\}^{V_{l_1} \times V_{l_2}}$ is a bilingual lexicon with
word pairs in languages $l_1$ and $l_2$. $P_{l_1,l_2}(i, j) = 1$
if the $i$th word of $V_{l_1}$ and the $j$th word of $V_{l_2}$ are
aligned, $P_{l_1,l_2}(i, j) = 0$ otherwise.
Figure 1: Sample language tree representation simplified for illustration purposes (Lewis and Gary,
2015).
Example. Consider the set of embeddings for
English $X_{en}$, the transformation that converts
embeddings in the English space to the
Germanic language family space $W_{s_{ge} \overset{*}{\leftarrow} s_{en}}$,
and the English embeddings transformed to the
Germanic space $W_{s_{ge} \overset{*}{\leftarrow} s_{en}} X_{en}$. HCEG makes
it so that $W_{s_{ge} \overset{*}{\leftarrow} s_{en}} X_{en}$ and $W_{s_{ge} \overset{*}{\leftarrow} s_{de}} X_{de}$ (the
transformed embeddings of English and German)
are in the same Germanic embedding space,
while $W_{s_{in} \overset{*}{\leftarrow} s_{en}} X_{en}$ and $W_{s_{in} \overset{*}{\leftarrow} s_{es}} X_{es}$ (the
transformed embeddings of English and Spanish)
are in the same Indo-European embedding space.
In the rest of this section we describe HCEG
in detail. Values given to each hyperparameter
mentioned in this section are defined in
Section 4.4.
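As a rough illustration of Equation (1), the sketch below composes per-edge matrices along the path from a language up to one of its ancestor spaces. The tree, the node names, and the helper functions `path_to_ancestor` and `compose` are hypothetical and only mirror the simplified tree of Figure 1; the matrices are random placeholders rather than trained mappings.

```python
import numpy as np

# Hypothetical child -> parent map for a simplified tree (illustrative names only).
PARENT = {"en": "ge", "de": "ge", "es": "it", "pt": "it", "ge": "in", "it": "in"}

def path_to_ancestor(node, ancestor):
    """Return the (parent, child) edges walked from `node` up to `ancestor`."""
    edges = []
    while node != ancestor:
        parent = PARENT[node]
        edges.append((parent, node))
        node = parent
    return edges

def compose(W, source, target):
    """Multiply edge matrices so the result maps `source` space into `target` space."""
    d = next(iter(W.values())).shape[0]
    total = np.eye(d)
    for edge in path_to_ancestor(source, target):
        # Edges closer to the target are applied later, so they multiply on the left.
        total = W[edge] @ total
    return total

rng = np.random.default_rng(0)
d = 4
# One placeholder matrix per tree edge, keyed as (parent, child).
W = {(p, c): rng.normal(size=(d, d)) for c, p in PARENT.items()}
W_in_from_es = compose(W, "es", "in")  # equals W[("in", "it")] @ W[("it", "es")]
```

With embeddings stored one word per row, as in $X_l \in \mathbb{R}^{V_l \times d}$, the composed matrix would be applied as `X @ W_in_from_es.T`; this row-vector convention is an assumption of the sketch.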
3.2 Embedding Normalization
When dealing with embeddings generated from
different sources and languages, it is important
to normalize them. To do so, HCEG follows
a normalization sequence shown to be beneficial
(Artetxe et al., 2018), which consists of length
normalization, mean centering, and a second
length normalization. The last length normalization
allows computing cosine similarity between
embeddings in a more efficient manner, simplifying
the computation of cosine similarity to a
dot product given that the embeddings are of
unit-length.
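The normalization sequence can be sketched in a few lines of NumPy. This is a minimal sketch, assuming one embedding per row and no all-zero rows; the function name `normalize_embeddings` is ours and not part of the released HCEG code.

```python
import numpy as np

def normalize_embeddings(X):
    """Length-normalize, mean-center, then length-normalize again (rows are words)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    X = X - X.mean(axis=0, keepdims=True)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = normalize_embeddings(rng.normal(size=(5, 3)))
# With unit-length rows, plain dot products are cosine similarities.
cosine = X @ X.T
```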
3.3 Word Pairs
In order to generate a crosslingual embedding
space, HCEG requires a set $P$ of aligned words
across different languages. When using HCEG in
a supervised way, $P$ can be any existing resource
consisting of bilingual lexicons, such as the ones
described in Section 4.1. However, the proposed
strategy is put to best advantage when using
unsupervised lexicon induction techniques, as
they enable generating input lexicons for any pair
of languages needed. Unlike TB/MP strategies
that can only take advantage of signal that involves
the pivot language, HCEG can use signal across
all combinations of languages. For example, a
TB/MP model where English is the pivot can
only use lexicons composed of English words.
Instead, HCEG can exploit bilingual lexicons from
other languages, such as Spanish-Portuguese or
Spanish-Dutch, which, if using the language tree in
Figure 1, would reinforce the training of $W_{s_{it} \leftarrow s_{es}}$,
$W_{s_{it} \leftarrow s_{pt}}$ and $W_{s_{it} \leftarrow s_{es}}$, $W_{s_{in} \leftarrow s_{it}}$, $W_{s_{in} \leftarrow s_{ge}}$,
$W_{s_{ge} \leftarrow s_{du}}$, respectively.
Figure 2: Distributions of word rankings across languages. The coordinates of each dot (representing
a word pair) are determined by the position of the word pair in the frequency ranking of each of the
languages. Numbers are written in thousands. Scores computed using FastText embedding rankings
(Grave et al., 2018) and MUSE crosslingual pairs (Conneau et al., 2017). Pearson's correlation (ρ)
computed using the full set of word pairs; figures generated using a random sample of 500 word pairs
for illustration purposes.
When using HCEG in unsupervised mode, $P$
needs to be automatically inferred. Yet, computing
each $P_{l_1,l_2} \in P$ given two monolingual embedding
matrices $X_{l_1}$ and $X_{l_2}$ is not a trivial task, as $X_{l_1}$ and
$X_{l_2}$ are not aligned in vocabulary or dimension
axes. Artetxe et al. (2018) leverage the fact that
the relative distances among words are maintained
across languages (Mikolov et al., 2013b), and thus
propose using a language-agnostic representation
$M_l$ for generating an initial alignment $P_{l_1,l_2}$:
$$ M_l = \mathrm{sorted}(X_l X_l^\top) \qquad (2) $$
where, given that $X_l$ is length normalized,
$X_l X_l^\top$ computes a matrix of dimensions $V_l \times V_l$
containing in each row the cosine similarities of
the corresponding word embedding with respect
to all other word embeddings. The values in each
row are then sorted to generate a distribution
representation of each word that, in an ideal case
where the isometry assumption holds perfectly,
would be language agnostic. Using the
representations $M_{l_1}$ and $M_{l_2}$, $P_{l_1,l_2}$ can be
computed by assigning each word its most similar
representation as its pair, that is, $P_{l_1,l_2}(i, j) = 1$ if:
$$ j = \arg\max_{1 \le j \le V_{l_2}} M_{l_1}(i, *)\, M_{l_2}(j, *)^\top \qquad (3) $$
where $M_{l_1}(i, *)$ is the $i$th row of $M_{l_1}$ and $M_{l_2}(j, *)$
is the $j$th row of $M_{l_2}$.
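Equations (2) and (3) can be sketched as follows. This is only an illustration under stated assumptions: both embedding matrices are length-normalized and truncated to the same number of words (so that rows of the two sorted similarity matrices are comparable), and the function names `sorted_similarity` and `initial_lexicon` are hypothetical.

```python
import numpy as np

def sorted_similarity(X):
    """Eq. (2): sort each row of X X^T; X is assumed length-normalized."""
    return np.sort(X @ X.T, axis=1)

def initial_lexicon(X1, X2):
    """Eq. (3): P[i, j] = 1 when j maximizes the dot product of M1[i] and M2[j]."""
    M1, M2 = sorted_similarity(X1), sorted_similarity(X2)
    scores = M1 @ M2.T
    P = np.zeros(scores.shape, dtype=int)
    P[np.arange(len(M1)), scores.argmax(axis=1)] = 1
    return P

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)
# An orthogonal rotation changes the coordinates but not the sorted similarities.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
P = initial_lexicon(X, X @ Q)
```

The rotation example reflects why the representation is described as language agnostic under perfect isometry: `sorted_similarity(X)` and `sorted_similarity(X @ Q)` coincide up to floating-point error.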
The results in Artetxe et al. (2018) indicate
that this assumption is strong enough to generate
an initial alignment across languages. However,
as we demonstrate in Section 3.3, the quality
of this type of initial alignment is dependent on
the languages used, making this initialization not
applicable for languages that are typologically too
distant from each other, a statement also echoed
by Artetxe et al. (2018) and Søgaard et al. (2018).
To ensure a more robust initialization, we
enhance the strategy presented in Artetxe et al.
(2018) by introducing a new signal based on the
frequency of use of words. Lin et al. (2012) found
that the top-2 most frequent words tend to be
consistent across different languages. Motivated
by this result, we measure to what extent the
frequency rankings of words correlate across
languages. As shown in Figure 2, the
word-frequency rankings are strongly correlated across
languages, meaning that popular words tend to
be popular regardless of the language. We exploit
this behavior in order to reduce the search space
of Equation (3) as follows:
$$ j = \arg\max_{i - t \le j \le i + t} M_{l_1}(i, *)\, M_{l_2}(j, *)^\top \qquad (4) $$
where $t$ is a value used to determine the search
window. Note that we assume the embeddings in
any matrix $X_l$ are sorted in descending order of
frequency, namely, the embedding in the first row
represents the most frequent word of language $l$.
Apart from improving the overall quality of the
inferred lexicons (see Section 5.1), incorporating
a frequency ranking based search as part of the
initialization reduces the computation time needed
as the search space is considerably reduced.
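The frequency-window restriction can be sketched as below. The sketch assumes that the window is centered on the frequency rank of the source word, that rows are ordered from most to least frequent, and that both languages contribute the same number of rows; `windowed_lexicon` is a hypothetical helper name.

```python
import numpy as np

def windowed_lexicon(M1, M2, t):
    """For word i, search only candidates whose frequency rank lies in [i - t, i + t]."""
    n = len(M1)
    pairs = np.empty(n, dtype=int)
    for i in range(n):
        lo, hi = max(0, i - t), min(n, i + t + 1)
        # argmax is relative to the window, so shift it back by the window start.
        pairs[i] = lo + np.argmax(M2[lo:hi] @ M1[i])
    return pairs

rng = np.random.default_rng(2)
M = np.sort(rng.normal(size=(8, 8)), axis=1)
pairs = windowed_lexicon(M, M, t=2)
```

Each word is compared with at most 2t + 1 candidates instead of the full vocabulary, which is where the reduction in computation time comes from.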
3.4 Objective Function
Unlike traditional objective functions that optimize
a transformation matrix for two languages
at a time, the goal of HCEG is to simultaneously
optimize the set of all transformation matrices $W$
such that the loss function $\mathcal{L}$ is minimized:
$$ \arg\min_{W} \mathcal{L} \qquad (5) $$
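$\mathcal{L}$ is a linear combination of three different losses:
$$ \mathcal{L} = \beta_1 \times \mathcal{L}_{align} + \beta_2 \times \mathcal{L}_{orth} + \beta_3 \times \mathcal{L}_{reg} \qquad (6) $$
where $\mathcal{L}_{align}$, $\mathcal{L}_{orth}$, and $\mathcal{L}_{reg}$ represent the alignment,
orthogonality, and regularization losses, and $\beta_1$,
$\beta_2$, $\beta_3$ are their weights.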
$\mathcal{L}_{align}$ gauges the extent to which training word
pairs align. This is done by computing the sum of
the cosine similarity among all word pairs in $P$:
$$ \mathcal{L}_{align} = - \sum_{P_{l_1,l_2} \in P} P_{l_1,l_2} \left( W_{s_{d_{l_1,l_2}} \overset{*}{\leftarrow} s_{l_1}} X_{l_1} \cdot W_{s_{d_{l_1,l_2}} \overset{*}{\leftarrow} s_{l_2}} X_{l_2} \right) \qquad (7) $$
where $s_{d_{l_1,l_2}}$ refers to the space in the lowest
common parent node for $s_{l_1}$ and $s_{l_2}$ in $T$ (e.g.,
$s_{d_{es,en}} = s_{in}$ in Figure 1). We found that using
$s_{d_{l_1,l_2}}$ instead of the space in the root node of $T$
improves the overall performance of HCEG, apart
from reducing the time taken for training (see
Section 5.3).
Several researchers have found it beneficial to
enforce orthogonality in the transformation matrices
$W$ (Xing et al., 2015; Artetxe et al., 2016;
Smith et al., 2017). This constraint ensures that the
original quality of the embeddings is not degraded
when transforming them to a crosslingual space.
For this reason, we incorporate an orthogonality
constraint $\mathcal{L}_{orth}$ into our loss function in
Equation (8), with $I$ being the identity matrix.
$$ \mathcal{L}_{orth} = \sum_{W_{s_1 \leftarrow s_2} \in W} \left\| I - W_{s_1 \leftarrow s_2} W_{s_1 \leftarrow s_2}^\top \right\| \qquad (8) $$
We also find it beneficial to include a regularization
term in $\mathcal{L}$:
$$ \mathcal{L}_{reg} = \sum_{W_{s_1 \leftarrow s_2} \in W} \left\| W_{s_1 \leftarrow s_2} \right\|_2 \qquad (9) $$
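To make the combination in Equation (6) concrete, the following sketch evaluates the three terms for already-mapped, row-aligned embeddings. It is an illustration under assumptions: both matrix norms are taken to be Frobenius norms (the text does not state the norm type), the default weights are the values listed in Section 4.4, and `hceg_loss` is a hypothetical name rather than a function from the released code.

```python
import numpy as np

def hceg_loss(aligned_pairs, matrices, betas=(0.98, 0.01, 0.01)):
    """Weighted sum of alignment, orthogonality, and regularization terms.

    aligned_pairs: list of (A, B); row k of A and row k of B are an aligned word
                   pair, both already mapped into their lowest common space.
    matrices:      list of d x d transformation matrices.
    """
    b1, b2, b3 = betas
    # Negative sum of dot products (cosine similarities for unit-length rows).
    align = -sum(np.sum(A * B) for A, B in aligned_pairs)
    # Distance of W W^T from the identity (Frobenius norm, an assumption).
    orth = sum(np.linalg.norm(np.eye(len(W)) - W @ W.T) for W in matrices)
    # Size of each matrix (Frobenius norm, an assumption).
    reg = sum(np.linalg.norm(W) for W in matrices)
    return b1 * align + b2 * orth + b3 * reg

A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
loss = hceg_loss([(A, A)], [np.eye(3)])
```

In this toy case the two pairs are perfectly aligned and the single matrix is orthogonal, so only the alignment and regularization terms contribute.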
3.5 Learning the Parameters
HCEG utilizes stochastic gradient descent for
tuning the parameters in $W$ with respect to the
training word pairs in $P$. In each iteration, $\mathcal{L}$
is computed and backpropagated in order to tune
each transformation matrix in $W$ such that $\mathcal{L}$
is minimized. Batching is used to reduce the
computational load in each iteration. A batch of
word pairs $\hat{P}$ is sampled from $P$ by randomly
selecting $\alpha_{lpairs}$ language pairs as well as $\alpha_{wpairs}$
word pairs in each $\hat{P}_{l_1,l_2} \in \hat{P}$; for example, a
batch might consist of 10 $\hat{P}_{l_1,l_2}$ matrices each
containing 500 aligned words.
Iterations are grouped into epochs of $\alpha_{iter}$
iterations, at the end of which $\mathcal{L}$ is computed
for the whole $P$. We take a conservative approach
as convergence criterion: if no improvement is
found in $\mathcal{L}$ in the last $\alpha_{conv}$ epochs, the training
loop stops.
We achieve the best convergence time by initializing
each $W_{s_1 \leftarrow s_2} \in W$ to be orthogonal. We tried
several methods for orthogonal initialization, such
as simply initializing to the identity matrix.
However, we obtained the most consistent results
using the random semi-orthogonal initialization
introduced by Saxe et al. (2013).
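Two pieces of this training procedure lend themselves to a short sketch: a random orthogonal initialization and the patience-based stopping rule. The QR-based construction below is one common way to draw a random orthogonal matrix and is used here as a stand-in; it is not claimed to be the exact procedure of Saxe et al. (2013). Both function names are hypothetical.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Draw a random d x d orthogonal matrix via QR decomposition of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.normal(size=(d, d)))
    # Flip column signs so the result does not depend on the QR sign convention.
    return Q * np.sign(np.diag(R))

def should_stop(epoch_losses, patience):
    """True when the last `patience` epochs did not improve on the earlier best loss."""
    if len(epoch_losses) <= patience:
        return False
    return min(epoch_losses[-patience:]) >= min(epoch_losses[:-patience])

W0 = random_orthogonal(5, np.random.default_rng(0))
```

In the setting described above, `patience` would play the role of $\alpha_{conv}$ and `epoch_losses` would hold the value of the loss computed over the whole lexicon set at the end of each epoch.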
3.6 Iterative Refinement
As shown by Artetxe et al. (2017), the initial
lexicon $P$ is iteratively improved by using the
generated crosslingual space for inferring a new
lexicon $P'$ at the end of each learning phase
described in Section 3.5. More precisely, when
computing each $P'_{l_1,l_2} \in P'$, $P'_{l_1,l_2}(i, j)$ is 1 (0
otherwise) if
$$ j = \arg\max_{j} \; W_{s_{d_{l_1,l_2}} \overset{*}{\leftarrow} s_{l_1}} X_{l_1}(i, *) \cdot \left( W_{s_{d_{l_1,l_2}} \overset{*}{\leftarrow} s_{l_2}} X_{l_2}(j, *) \right)^\top \qquad (10) $$
Potentially, any new bilingual lexicon $P'_{l_1,l_2}$ can
be inferred and included in $P'$ at the end of each
learning phase. However, as the cardinality of $L$
grows, this process can take a prohibitive amount
of time given combinatorial explosion. Therefore,
in practice, we only infer $P'_{l_1,l_2}$ following a
criterion intended to maximize lexicon quality.
$P'_{l_1,l_2}$ is inferred for languages $l_1$ and $l_2$ only if $l_1$
and $l_2$ are siblings in $T$ (they share the same parent
node) or $l_1$ and $l_2$ are the best representatives of
their corresponding family. A language is deemed
the best representative of its family if it is the
most frequently spoken language in its subtree
(based on numbers reported by Lewis and Gary (2015)).
For example, in Figure 1, Spanish is the best
representative for the Italic family, but not for
Indo-European, for which English is used.
The set criterion not only reduces the amount
of time required to infer $P'$ but also improves
overall HCEG performance. This is due to a better
utilization of the hierarchical characteristics of
our crosslingual space, only inferring bilingual
lexicons from typologically related languages or
their best representatives in terms of resource
quality.
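The pair-selection criterion can be sketched on a toy tree. Several things here are assumptions made for illustration: the tree is a simplified stand-in for Figure 1, the speaker scores are made-up placeholders chosen only to reproduce the ordering stated in the text (they are not real statistics), and the "best representatives" rule is read as pairing the representatives of sibling families.

```python
from itertools import combinations

# Hypothetical simplified tree (child -> parent).
PARENT = {"es": "italic", "pt": "italic", "en": "germanic", "de": "germanic",
          "italic": "indo-european", "germanic": "indo-european",
          "fi": "uralic", "indo-european": "root", "uralic": "root"}
# Placeholder popularity scores, NOT real speaker counts.
SPEAKERS = {"es": 3, "pt": 2, "en": 4, "de": 1, "fi": 0.5}

def ancestors(node):
    chain = []
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain

def best_representative(family):
    """Most widely spoken language among the leaves below `family`."""
    leaves = [lang for lang in SPEAKERS if family in ancestors(lang)]
    return max(leaves, key=SPEAKERS.get)

def pairs_to_infer():
    """Pair sibling languages, and the best representatives of sibling families."""
    selected = set()
    for a, b in combinations(PARENT, 2):
        if PARENT[a] != PARENT[b]:
            continue  # only nodes sharing a parent are paired
        rep_a = a if a in SPEAKERS else best_representative(a)
        rep_b = b if b in SPEAKERS else best_representative(b)
        selected.add(tuple(sorted((rep_a, rep_b))))
    return selected

selected_pairs = pairs_to_infer()
```

On this toy tree only four of the ten possible language pairs are selected, which illustrates how the criterion keeps lexicon inference tractable as the number of languages grows.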
3.7 Retrieval Criterion
As discussed in Section 2, one of the issues affecting
nearest-neighbor retrieval is hubness (Dinu
et al., 2014), where certain words are in the
surroundings of an abnormally large number of
other words, causing the nearest-neighbor algorithm
to incorrectly prioritize hub words. To
address this issue, we use Cross-domain Similarity
Local Scaling (CSLS) (Conneau et al., 2017)
as the retrieval algorithm during both training
and prediction time. CSLS is a rectification for
nearest-neighbor retrieval that avoids hubness by
counterbalancing the cosine similarity between
two embeddings by a factor consisting of the
average similarity of each embedding with its
k closest neighbors. Following the criteria in
Conneau et al. (2017), we set the number of
neighbors used by CSLS to k = 10.
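A compact sketch of CSLS scoring follows, using the formulation of Conneau et al. (2017): twice the cosine similarity minus the mean similarity of each embedding to its k nearest neighbors in the other language. It assumes unit-length embeddings that already live in a shared space; `csls_scores` is a hypothetical helper name.

```python
import numpy as np

def csls_scores(S, T, k=10):
    """CSLS score matrix between source rows S (n_s x d) and target rows T (n_t x d)."""
    cos = S @ T.T
    # Mean similarity of each source word to its k nearest target words.
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)
    # Mean similarity of each target word to its k nearest source words.
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

scores = csls_scores(np.eye(3), np.eye(3), k=1)
translations = scores.argmax(axis=1)  # retrieved target index for each source word
```

Words that are close to many others get a large neighborhood average subtracted from all of their scores, which is what penalizes hubs during retrieval.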
4 Evaluation Framework
We describe below the evaluation set up used for
conducting the experiments presented in Section 5.
4.1 Word Pair Datasets
Dinu-Artetxe. The Dinu-Artetxe dataset, presented
by Dinu et al. (2014) and enhanced by
Artetxe et al. (2016), is one of the first
benchmarks for evaluating crosslingual embeddings.
It is composed of English-centered bilingual
lexicons for Italian, Spanish, German, and Finnish.
MUSE. The MUSE dataset (Conneau et al.,
2017) contains bilingual lexicons for all combinations
of German, English, Spanish, French,
Italian, and Portuguese. In addition, it includes
word pairs for 44 languages with respect to
English.
Panlex. Dinu-Artetxe and MUSE are both
English-centered datasets, given that most (if not
all) of their word pairs have English as their
source or target language. This makes the datasets
suboptimal for our purpose of generating and
evaluating a non-language centered crosslingual
space. For this reason, we generated a dataset
using Panlex (Kamholz et al., 2014), a panlingual
lexical database. This dataset (made public in
our repository) includes bilingual lexicons for all
combinations of 157 languages for which FastText
is available, totalling 24,492 bilingual lexicons.
Each of the lexicons was generated by randomly
sampling 5k words from the top-200k words in
the embedding set for the source language, and
translating them to the target language using the
Panlex database. We find it important to highlight
that this dataset contains considerably more noise
than other datasets given that Panlex is generated
in an automatic way and is not as finely curated
by humans as previous datasets. We still find
comparisons using this dataset fair, given that its
noisy nature should affect all strategies equally.
4.2 Language Selection and Family Tree
As previously stated, we aim to generate a single
crosslingual space for as many languages as
possible. We started with the 157 languages for
which FastText embeddings are available (Grave
et al., 2018). We then removed languages that did
not meet both of the following criteria: (1) there
must exist a bilingual lexicon with at least 500
word pairs for the language in any of the datasets
described in Section 4.1, and (2) the embedding
set provided by FastText must contain at least 20k
words. The first criterion is a minimal condition
for evaluation, while the second one is necessary
for the unsupervised initialization strategy. The
criteria are met by 107 languages, which are the
ones used in our experiments. Their corresponding
ISO-639 codes can be seen later in Table 5. We
use the language family tree defined by Lewis and
Gary (2015).
4.3 Framework
For experimental purposes, each dataset described
in Section 4.1 is split into training and testing
sets. We use the original train-test splits for
Dinu-Artetxe and MUSE. For Panlex, we generate a split
by randomly sampling word pairs, keeping 80% for
training and the remaining 20% for testing.
For development and parameter tuning purposes,
we use a disjoint set of word pairs specifically
created for this purpose based on the Panlex
lexical database. This development set contains 10
different languages with varied popularity. None
of the word pairs present in this development set
are part of either the train or test sets.
4.4 Hyperparameters
The following hyperparameters were manually
tuned using the development set described in
Section 4.3: $\beta_1 = 0.98$, $\beta_2 = 0.01$, $\beta_3 = 0.01$,
$t = 1000$, $\alpha_{lpairs} = 128$, $\alpha_{wpairs} = 2048$,
$\alpha_{iter} = 5000$, $\alpha_{conv} = 25$.
Figure 3: Number of correct word pairs inferred using the unsupervised initialization technique presented
by Artetxe et al. (2018) and the Frequency based technique described in Section 3.3.
5 Evaluation
We discuss below the results of the study
conducted over 107 languages to assess HCEG.
5.1 Unsupervised Initialization
We first evaluate the performance of the
unsupervised initialization strategy described in
Section 3.3, and compare it with the state-of-the-art
strategy proposed by Artetxe et al. (2018).
In this case, we run both initialization strategies
using the top-20k FastText embeddings (Grave
et al., 2018) for all pairwise combinations of
the 107 languages we study. For each language
pair, we measure how many of the inferred word
pairs are present in the corresponding lexicons
in the MUSE and Panlex datasets. For MUSE,
our proposed initialization strategy (Frequency
based) obtains an average of 48.09 correct pairs, an
improvement with respect to the 29.62 obtained by
the strategy proposed by Artetxe et al. (2018). For
Panlex, the respective average correct pair counts
are 1.05 and 0.55. Both differences are statistically
significant (p < 0.01) using a paired t-test. The
noticeable difference across datasets is due to how
the sampling was done for generating the datasets:
MUSE contains a considerably higher number of
frequent words in comparison to Panlex, making
the latter a relatively harder dataset for vocabulary
induction. In Figure 3 we illustrate the results of
each strategy grouped by language-pair similarity.
This similarity is based on the number of common
parents the two languages share. For example, in
Figure 1, Spanish has a similarity of 3, 2, and
1 with Portuguese, English, and Finnish, respec-
tively. As we see in Figure 3, similarity is a
factor that strongly determines the quality of
the alignment generated by the unsupervised
initialization. Even if this phenomenon affects
both analyzed strategies, our proposed frequency-
based initialization strategy consistently obtains a
few more correct word pairs for the least similar
language pairs, which, as we show in Table 4,
are key for generating a correct mapping for those
languages.
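The similarity measure used above (the number of common parents two languages share) can be sketched in a few lines of Python. This is only an illustration: the tree paths below are hypothetical node names chosen so that the Spanish example is reproduced, and they are not taken from the actual tree in Figure 1.

```python
from typing import Dict, List

# Hypothetical root-to-language paths in a language family tree. The node
# names are illustrative assumptions, not the tree used in the paper.
TREE_PATHS: Dict[str, List[str]] = {
    "es": ["World", "Indo-European", "Italic"],
    "pt": ["World", "Indo-European", "Italic"],
    "en": ["World", "Indo-European", "Germanic"],
    "fi": ["World", "Uralic"],
}


def similarity(lang_a: str, lang_b: str) -> int:
    """Count the ancestors (parents) shared by two languages."""
    shared = 0
    for node_a, node_b in zip(TREE_PATHS[lang_a], TREE_PATHS[lang_b]):
        if node_a != node_b:
            break  # the two paths diverge below this point
        shared += 1
    return shared


if __name__ == "__main__":
    for other in ("pt", "en", "fi"):
        print("es", other, similarity("es", other))
```

Under these assumed paths, Spanish obtains a similarity of 3, 2, and 1 with Portuguese, English, and Finnish, matching the example in the text.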
5.2 State-of-the-Art Comparison
In order to contextualize the performance of
HCEG with respect to the state-of-the-art (listed in
Tables 1 and 2), we measure the word translation
prediction capabilities of each of the strategies. We
do so using Precision@1 for bilingual lexicon
induction as a means to quantify vocabulary
induction performance. Scores reported hereafter
are average Precision@1 in percentage form, for
each of the words in the testing set.
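To make the metric concrete, the following minimal sketch computes Precision@1 over a toy test lexicon. It assumes plain cosine nearest-neighbour retrieval in an already-aligned space; the retrieval criterion, the function names, and the toy vectors are all assumptions made for illustration and may differ from what the evaluated systems actually use.

```python
import math
from typing import Dict, List, Set

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def precision_at_1(
    src_emb: Dict[str, Vector],
    tgt_emb: Dict[str, Vector],
    gold: Dict[str, Set[str]],
) -> float:
    """Percentage of test words whose nearest target word is a gold translation."""
    hits = 0
    for word, translations in gold.items():
        query = src_emb[word]
        nearest = max(tgt_emb, key=lambda cand: cosine(query, tgt_emb[cand]))
        hits += nearest in translations
    return 100.0 * hits / len(gold)


# Toy, already-aligned 2-d embeddings (made-up values for illustration).
SRC = {"perro": [1.0, 0.1], "gato": [0.1, 1.0], "casa": [0.8, 0.6]}
TGT = {"dog": [0.9, 0.2], "cat": [0.2, 0.9], "house": [0.5, -1.0]}
GOLD = {"perro": {"dog"}, "gato": {"cat"}, "casa": {"house"}}

if __name__ == "__main__":
    print(f"Precision@1: {precision_at_1(SRC, TGT, GOLD):.2f}")
```

In this toy setting two of the three test words retrieve a correct translation, so the score is two thirds expressed as a percentage.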
When applicable, we report results for both the
supervised (HCEG-S) and unsupervised (HCEG-
U) versions of HCEG. In the supervised mode,
we train one single model per dataset using all
the training word pairs available. We then use
this model for computing all pairwise scores. In
the unsupervised mode, unless explicitly stated
otherwise, we train a single model regardless of the
dataset used for testing purposes. This means that,
in some cases, the unsupervised mode leverages
monolingual data beyond the languages used for
testing, as it uses all 107 language embeddings. We
found it unfair to train a supervised model using
              Method                   en-it    en-de    en-fi    en-es
Supervised    Mikolov (2013b)          34.93*   35.00*   25.91*   27.73*
              Faruqui (2014)           38.40*   37.13*   27.60*   26.80*
              Shigeto (2015)           41.53*   43.07*   31.04*   33.73*
              Dinu (2014)              37.7     38.93*   29.14*   30.40*
              Lazaridou (2015)         40.2     -        -        -
              Xing (2015)              36.87*   41.27*   28.23*   31.20*
              Zhang (2016)             36.73*   40.80*   28.16*   31.07*
              Artetxe (2016)           39.27    41.87*   30.62*   31.40*
              Artetxe (2017)           39.67    40.87    28.72    -
              Smith (2017)             43.1     43.33*   29.42*   35.13*
              Artetxe (2018a)          45.27    44.13    32.94    36.60
              Joulin (2018)            45.5     -        -        -
              Jawanpuria (2019) mul    48.7     49.1     36.0     36.0
              Jawanpuria (2019)        48.3     49.3     36.1     39.3
Semi.         Artetxe (2017) 25        37.27    39.60    28.16    -
              Smith (2017) cog         39.9     -        -        -
              Artetxe (2017) num       39.40    40.27    26.47    -
Unsupervised  Zhang (2017), λ = 1      0.00*    0.00*    0.00*    0.00*
              Zhang (2017), λ = 10     0.00*    0.00*    0.01*    0.01*
              Conneau (2017) code      45.15*   46.83*   0.38*    35.38*
              Conneau (2017) paper     45.1     0.01*    0.01*    35.44*
              Artetxe (2018)           48.13    48.19    32.63    37.33
              HCEG-U                   49.02    48.18    34.82    42.15

Table 1: Results using the Dinu-Artetxe dataset. Scores
marked with (*) were reported by Artetxe et al. (2018); the
remaining ones were reported in the corresponding original
papers.
the Dinu-Artetxe dataset given that it only contains
four bilingual lexicons, not enough for training our
tree structure. Thus, only unsupervised results are
shown for that dataset.
As shown in Table 1, the unsupervised version
of HCEG achieves, in most cases, the best
performance among all unsupervised strategies,
even improving over state-of-the-art supervised
models in some cases. The improvement is
most noticeable for Italian and Spanish, where
HCEG-U obtains an improvement of 1 and 3
points, respectively. A similar behavior can be
seen in Table 2, where we describe the results
on the MUSE dataset. Spanish, along with
Catalan, Italian, and Portuguese, obtains a
substantially larger improvement compared with
other languages. We attribute this to the fact that
Spanish is the second most resource-rich language
in terms of corpora after English. This makes the
quality of Spanish word embeddings comparably
better than that of other languages, which as a result
improves the mapping quality of typologically
related languages, such as Portuguese, Italian, or
Catalan.
To further contextualize the performance of
HCEG-U, in terms of its capability for generating
crosslingual embeddings in an unsupervised
fashion, we conducted further experiments. In
Table 3, we summarize the results obtained from
comparing HCEG-U with other unsupervised
strategies focused on learning crosslingual word
embeddings. In our comparisons we include (i)
a direct bilingual learning baseline that simply
learns a bilingual mapping using two monolingual
word embeddings (Conneau et al., 2017), (ii) a
pivot-based strategy that can leverage a third lan-
guage for learning a crosslingual space (Conneau
et al., 2017), and (iii) a fully multilingual,
pivotless strategy that aggregates languages into
a joint space in an iterative manner (Chen and
Lang   Conneau   Joulin    Artetxe   HCEG-S    HCEG-U−   HCEG-U
       (2017)    (2018)    (2018)
bg     57.5      63.9      65.8      64.1      64.0      67.5
ca     70.9      73.8      76.3      73.1      74.2      77.7
cs     64.5      68.2      70.2      68.2      65.9      71.7
da     67.4      71.1      70.3      68.8      71.9      72.7
de     72.7      76.9      79.1      75.8      75.2      79.0
el     58.5      62.7      67.8      65.3      66.4      68.5
es     83.5      86.4      88.6      86.8      86.4      90.4
et     45.7      49.5      55.8      53.5      53.4      57.3
fi     59.5      65.8      68.1      65.2      65.4      68.3
fr     82.4      84.7      87.6      85.4      85.7      88.3
he     54.1      57.8      61.1      59.5      61.4      63.0
hr     52.2      55.6      57.6      54.8      54.1      58.2
hu     64.9      69.3      69.6      66.8      64.5      70.1
id     67.9      69.7      75.5      73.2      73.5      75.6
it     77.9      81.5      83.3      81.3      79.7      85.6
mk     54.6      59.9      63.5      62.3      62.5      64.9
nl     75.3      79.7      79.9      79.4      79.7      81.9
no     67.4      71.2      69.9      69.5      69.3      71.9
pl     66.9      70.5      72.0      70.7      70.5      72.8
pt     80.3      82.9      85.5      83.8      83.2      87.8
ro     68.1      74.0      75.4      72.8      71.7      76.0
ru     63.7      67.1      69.5      68.1      69.1      69.8
sk     55.3      59.0      62.0      59.6      56.7      62.4
sl     50.4      54.2      60.1      57.7      59.7      61.1
sv     60.0      63.7      66.2      65.0      64.8      68.0
tr     59.2      61.9      68.7      66.3      66.6      70.0
uk     49.3      51.5      56.4      53.8      55.7      56.4
vi     55.8      55.8      3.9       55.5      55.6      58.3
Avg.   63.8      67.4      68.2      68.1      68.1      71.2

Table 2: Results on the MUSE dataset. Scores from Artetxe et al.
(2018) were obtained using the scripts shared by the authors. All the
other scores were reported in Joulin et al. (2018). HCEG-U− only
considers the 29 languages in the experiment for training.
Cardie, 2018). From the reported results, we see
that HCEG-U− outperforms all other considered
strategies for 24 out of 30 language pairs. The highest
improvements are found for languages of the Italic
family (Spanish, Portuguese, Italian, and French).
We observe that HCEG-U− under-performed
when the corresponding experiment involved the
German language as source or target. We attribute
this behavior to the fact that the Italic family
is predominant in the languages explored in this
experiment.
In order to perform a fair comparison with
respect to the work proposed by Chen and Cardie
(2018), we limited the monolingual data that
HCEG-U− used to the six languages considered
in this experiment (results that are reported in
Table 3). However, in order to show the full
potential of HCEG-U, we also include results achieved
when using 107 languages (column HCEG-U). As
seen in Tables 2 and 3, the differences between
HCEG-U− and HCEG-U are considerable,
manifesting the capabilities of the proposed model to
take advantage of monolingual data in multiple
languages at the same time.
The importance of explicitly considering
topological connections among languages to
Method          Type    en-de  en-fr  en-es  en-it  en-pt  de-fr  de-es  de-it  de-pt  fr-es  fr-it  fr-pt  es-it  es-pt  it-pt
Conneau (2017)  Direct  74.0   82.3   81.7   77.0   80.7   73.0   65.7   66.5   58.5   83.1   83.0   77.9   83.3   87.3   80.5
Conneau (2017)  Pivot   74.0   82.3   81.7   77.0   80.7   71.9   66.1   68.0   57.4   81.1   79.7   74.7   81.9   85.0   78.9
Chen (2018)     Multi   74.8   82.4   82.5   78.8   81.5   76.7   69.6   72.0   63.2   83.9   83.5   79.3   84.5   87.8   82.3
HCEG-U−         Multi   74.5   82.8   82.7   79.5   81.7   73.5   68.0   72.2   63.3   84.4   83.9   79.8   86.0   88.9   83.6
HCEG-U          Multi   79.4   88.4   89.8   85.4   88.1   77.4   72.3   76.5   66.7   89.1   86.1   84.8   89.4   89.7   86.3

Method          Type    de-en  fr-en  es-en  it-en  pt-en  fr-de  es-de  it-de  pt-de  es-fr  it-fr  pt-fr  it-es  pt-es  pt-it
Conneau (2017)  Direct  72.2   82.1   83.3   77.7   80.1   69.7   68.8   62.5   60.5   86     87.6   83.9   87.7   92.1   80.6
Conneau (2017)  Pivot   72.2   82.1   83.3   77.7   80.1   68.1   67.9   66.1   63.1   84.7   86.5   82.6   85.8   91.3   79.2
Chen (2018)     Multi   72.9   81.8   83.7   77.4   79.9   71.2   69.0   69.5   65.7   86.9   88.1   86.3   88.2   92.7   82.6
HCEG-U−         Multi   72.4   82.6   84.1   77.8   80.3   71.2   67.8   69.6   65.6   87.5   88.8   87.0   89.5   94.0   83.9
HCEG-U          Multi   78.6   88.2   91.0   85.8   87.5   75.4   71.2   73.9   68.6   90.6   91.0   90.2   91.4   94.3   87.1

Table 3: Comparison of unsupervised crosslingual embedding learning strategies under different
merging scenarios in the MUSE dataset. Direct indicates a traditional bilingual scenario where a
mapping from source to target is learned. Pivot uses an auxiliary pivot language (English) for merging
multiple languages into the same space. Multi merges all languages into the same space without using
a pivot. All scores except HCEG-U were originally reported by Chen and Cardie (2018). HCEG-U−
only considers the six languages in the experiment for training. Note that HCEG-U is excluded when
highlighting the best model (bold), given that it uses monolingual data beyond what other models do.
enhance mappings becomes more evident when
analyzing the data in Table 5. Here we include
the pairing that yielded the best and worst
mapping for each language, as well as the position
of English in the quality ranking. English and
Spanish have a strong quality mapping with
respect to each other, Spanish being the language
with which English obtains the best mapping and
English being the second-best mapped language for
Spanish. Additionally, Spanish is the language
with which Italian, Portuguese, and Catalan obtain
the best mapping quality. On the other side of the
spectrum, the worst mappings are dominated by
two languages, Georgian and Vietnamese, with
40 languages having these two languages as worst;
this is followed by Maltese, Albanian, and Finnish,
with 8 occurrences each. This is not unexpected,
as these languages are relatively isolated in the
language family tree, and also have a low number
of speakers. We also see that English is usually
near the top of the ranking for most languages.
For languages that are completely isolated, such
as Basque and Yoruba, English tends to be their
best mapped language. From this we surmise that
when typological relations are lacking, the quality
of the embedding space is the only aspect the
mapping strategy can rely on.
Given space constraints, we cannot show
the vocabulary induction scores for the 24,492
language pairs in the Panlex dataset. Instead,
we group the results using two variables: the
sum of number of speakers for each of the two
languages, and the minimum similarity (as defined
in Section 5.1) for each language with respect to
English. We rely on these variables for grouping
purposes as they align with two of our objectives
for designing HCEG: (1) remove the bias towards
the pivot language (English), and (2) improve the
performance of low-resource languages by taking
advantage of typologically similar languages.
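The grouping described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the use of the order of magnitude of the speaker sum as a bucket, and all numeric values are hypothetical and only meant to show how two grouping variables yield one averaged cell per group.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (speakers of language 1, speakers of language 2,
#  similarity of language 1 to English, similarity of language 2 to English,
#  improvement in Precision@1). The layout is an assumption for illustration.
Record = Tuple[float, float, int, int, float]


def group_key(rec: Record) -> Tuple[int, int]:
    """Bucket a language pair by log-scaled speaker sum and minimum similarity."""
    speakers_a, speakers_b, sim_a, sim_b, _ = rec
    magnitude = int(math.log10(speakers_a + speakers_b))  # order of magnitude
    return magnitude, min(sim_a, sim_b)


def mean_improvement(records: Iterable[Record]) -> Dict[Tuple[int, int], float]:
    """Average the improvement of all language pairs falling in each bucket."""
    buckets: Dict[Tuple[int, int], List[float]] = defaultdict(list)
    for rec in records:
        buckets[group_key(rec)].append(rec[4])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


if __name__ == "__main__":
    # Made-up records, not results from the paper.
    toy = [
        (4e6, 5e6, 1, 2, 4.0),
        (3e6, 2e6, 1, 1, 6.0),
        (4e8, 5e7, 3, 2, 1.0),
    ]
    print(mean_improvement(toy))
```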
Figure 4 captures the improvement (2.7 on
average) of HCEG-U over the strategy introduced
in Artetxe et al. (2018) (the best-performing
benchmark), grouped by the aforementioned
variables. We excluded Hindi and Chinese from
the figure, as they made any pattern hard to
observe given their high number of speakers.
The sum of number of speakers axis was also
logarithmically scaled to facilitate visualization.
The figure captures an evident trend in the simi-
larity axis. The lower the similarity of the lan-
guage with respect to English, the higher the
improvement achieved by HCEG-U. This can
be attributed to the manner in which TB/MP
models generated the space using English as
primary resource, hindering the potential quality
of languages that are distant from it. Additionally,
we see a less-prominent but existing trend in
the speaker sum axis. Despite some exceptions,
HCEG-U obtains higher differences with respect
to Artetxe et al. (2018) the less spoken a language
is. A behavior that is similar in essence to a Pareto
front can also be observed in the figure. Even
if both variables contribute to the difference in
improvement of HCEG-U, one variable needs to
compensate for the other in order to maximize
accuracy. In other words, the improvement is
higher the fewer speakers the language pair has
or the more distant the two languages are from
English, but when both variables go to the extreme,
the improvement decreases. The aforementioned
trends serve as evidence that the hierarchical
structure is indeed important when building a
crosslingual space that considers typologically
diverse languages, validating our premises for
designing HCEG.

Figure 4: Improvement over the strategy proposed
by Artetxe et al. (2018) in Panlex, in terms
of language similarity and number of speakers.
Darker denotes larger improvement.

5.3 Ablation Study
In order to assess the validity of each functionality
included as part of HCEG, we conducted an
ablation study. We summarize the results of this
study in Table 4, where the symbol ¬ indicates that
the subsequent feature is ablated in the model. For
example, ¬Hierarchy indicates that the Hierarchy
structure is removed, replacing it by a structure
where each language needs just one transformation
matrix to reach the World languages space.

              Description              Dinu-Artetxe   MUSE    Panlex
Supervised    ¬Hierarchy               -              66.7    32.0
              ¬Orthogonal Init.        -              67.8    36.5
              ¬Iterative Refinement    -              65.4    35.1
              All vs All Inference     -              66.3    36.6
              World langs. as root     -              67.5    35.7
              HCEG-S                   -              68.1    37.3
Unsupervised  ¬Hierarchy               40.2           67.9    28.1
              ¬Orthogonal Init.        43.2           71.0    34.7
              ¬Iterative Refinement    0.09           0.08    0.02
              All vs All Inference     39.3           69.4    34.6
              World langs. as root     42.8           70.2    33.8
              ¬Freq. based Init.       41.2           68.0    31.1
              HCEG-U                   43.5           71.2    35.8

Table 4: Ablation study.
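To illustrate what the ¬Hierarchy ablation changes, the sketch below maps a vector to the root of a small tree by applying one transformation per edge on the path, in contrast with a flat structure in which every language would need a single matrix. The tree fragment, the per-edge parameterization, and the matrices are hypothetical and serve only as an assumed illustration, not as the model's actual parameters.

```python
from typing import Dict, List

Matrix = List[List[float]]
Vector = List[float]

# Hypothetical fragment of a language tree (child -> parent).
PARENT: Dict[str, str] = {
    "es": "Italic",
    "Italic": "Indo-European",
    "Indo-European": "World",
}

ROT90: Matrix = [[0.0, -1.0], [1.0, 0.0]]    # orthogonal: 90-degree rotation
IDENTITY: Matrix = [[1.0, 0.0], [0.0, 1.0]]  # orthogonal: no change

# One illustrative matrix per edge, keyed by the child node (made-up values).
EDGE_MATRIX: Dict[str, Matrix] = {
    "es": ROT90,
    "Italic": ROT90,
    "Indo-European": IDENTITY,
}


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix by a column vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def map_to_root(language: str, vector: Vector) -> Vector:
    """Apply the transformation of every edge on the path up to the root."""
    node = language
    while node in PARENT:
        vector = matvec(EDGE_MATRIX[node], vector)
        node = PARENT[node]
    return vector


def path_length(language: str) -> int:
    """Number of transformations needed to reach the root."""
    steps, node = 0, language
    while node in PARENT:
        steps, node = steps + 1, PARENT[node]
    return steps


if __name__ == "__main__":
    print(path_length("es"), map_to_root("es", [1.0, 0.0]))
```

In the flat alternative, `path_length` would be 1 for every language, so no intermediate family-level space is shared among related languages.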
As indicated by the ablation results, the
hierarchical structure is indeed a key part of
HCEG: performance drops considerably when it
is removed, with the strongest effect in
the dataset with the highest number of languages
(i.e., Panlex). The importance of the Iterative
Refinement strategy is also noticeable, as the
unsupervised version of HCEG becomes useless when
it is removed. The Frequency-based initialization is
also a characteristic that considerably improves
the results of HCEG-U. Looking deeper into
the data, we found 2,198 language pairs (about
9% of all pairs) that obtained a vocabulary
induction accuracy close to 0 (<0.05) without
using this initialization, but were able to produce
enough signal to yield more substantial accuracy
values (>10.0) when using the Frequency-based
initialization. Finally, the design decisions that we
initially took for reducing training time, namely (i) the
orthogonal initialization, (ii) the heuristic based
inference, and (iii) using the lowest common
root for computing the loss function, also have a
positive effect on the performance of HCEG.
5.4 Influence of Pivot Choice
One of the premises for building HCEG was to
design a strategy that would not require pivots
for achieving a single space with multiple word
embeddings, given that a pivot induces a bias
into the final space that can hinder the quality
of the mapping for languages that are too distant
from it. In this section we describe the results of
experiments conducted for measuring the effect
pivot selection can have on the performance
of the mapping. For doing so, we measure the
L B,W,E   L B,W,E   L B,W,E   L B,W,E   L B,W,E   L B,W,E   L B,W,E
af nl,fi,4
als en,vi,1
am arz,de,80 cs
an es,ka,17
arz mt,ja,3
as bn,vi,4
ast es,ja,20
ba tt,sq,34
bar de,fi,6
be ru,vi,4
bg mk,ka,9
bn as,vi,6
br cy,ka,18
bs
sr,ka,2
ca es,mt,5
ce en,sq,1
ceb tl,li,22
ckb tg,tr,19
sk,vi,12
tk,ka,13
tr
jv id,scn,34 my zh,mk,19 sco en,mt,1
gd,tt,12
ga
tt
ba,sa,9
sd bn,tl,5
nds nl,vi,3
ka en,bs,1
ga,vi,2
gd
ug tr,vls,4
si
af,ka,4
nl
kk ky,vi,51
dv,ka,5
gl
pt,ka,16
uk ru,fi,19
sk cs,vi,5
sv,vi,3
no
km vi,nl,4
gom mr,fi,10
ur hi,eo,10
sr,vi,6
sl
es,my,3
oc
kn ta,lt,55
pa,ka,3
gu
vec pms,tr,2
so arz,sq,73
pa
arz,mk,10 ko en,af,1
he
gu,vi,6
vi
km,vls,3
sq en,tt,1
pam id,sr,18
ur,ka,5
hi
pl
hr
vls nl,eo,8
hr,vi,4
sr
cs,vi,4
sr,tt,5
pms vec,sah,7 su id,mk,37 wa fr,fi,7
hsb pl,am,3
es,mt,5
yo en,lt,1
sv da,vi,5
pt
fi,ckb,9
hu
en,bn,1
ta ml,mt,3
qu
en,fi,1
hy
zh my,de,10
es,vi,6
ta,mk,15
te
ro
jv,vi,3
id
uk,su,20
ckb,ka,13
tg
ru
id,sq,6
ilo
en,vls,1
th
sa
sv,ka,3
hu,als,24 is
hi,ka,2
tr,lt,7
tk
sah tr,ka,2
es,mt,5
it
it,vi,5
ceb,ru,47
tl
scn it,ka,21
en,vi,1
ja
en,eo,1
ky kk,af,17
la
es,mt,3
lb de,ka,2
nl,ka,7
li
lt
ru,mt,5
mg id,sq,44
mk bg,vi,4
ml
ta,sq,29
mr si,ka,21
mt arz,tt,70
cv tr,sq,2
cy br,fi,2
sv,fi,4
da
de
lb,mt,5
dv si,ka,3
en,eo,1
el
en es,gv,-
eo en,sq,1
es
pt,vi,2
eu en,lt,1
fi
fr
fy
Table 5: Best (B), worst (W), and English mapping ranking (E) for each language (L).
Pivot  Language Family             Afro-Asiatic  Austronesian  IE/Balto-Slavic  IE/Germanic  IE/Indo-Iranian  IE/Italic  Sino-Tibetan  Turkic  Uralic  Avg.
en     Indo-European/Germanic      27.3          28.7          32.1             39.8         31.4             40.4       27.3          26.9    28.3    31.4
arz    Afro-Asiatic                30.2          27.1          28.1             32.1         28.3             33.4       25.1          23.4    27.1    28.3
id     Austronesian                27.1          30.3          27.7             31.1         28.3             32.5       25.8          24.6    27.6    28.3
ru     Indo-European/Balto-Slavic  26.3          26.3          34.2             38.2         28.5             37.3       24.6          22.5    26.8    29.4
de     Indo-European/Germanic      25.1          26.9          25.1             37.6         27.3             37.2       24.7          23.7    25.6    28.1
hi     Indo-European/Indo-Iranian  26.3          27.1          26.1             33.7         32.3             34.2       23.4          25.6    26.4    28.3
es     Indo-European/Italic        26.9          26.7          30.6             38.5         31.0             41.5       26.8          26.7    28.4    30.8
pt     Indo-European/Italic        26.0          26.6          30.4             37.9         27.7             41.3       25.9          26.4    26.5    29.9
zh     Sino-Tibetan                25.1          27.3          25.3             23.4         26.1             24.8       29.3          25.7    27.6    26.1
tr     Turkic                      24.9          25.3          25.5             28.2         27.8             28.6       25.3          28.7    27.3    26.8
hu     Uralic                      25.4          25.8          25.8             31.8         26.4             32.8       25.5          21.9    30.1    27.3

Table 6: Results obtained by existing bilingual mapping strategies using different pivots on the
Panlex dataset. Values in each cell indicate the average performance obtained for each of the pairwise
combinations of languages under the family noted in the corresponding column title (IE abbreviates
Indo-European). For example, the
first cell indicates the average score obtained for all possible combinations of afro-asiatic languages
using English as a pivot. Results are averaged across the strategy presented in Conneau et al. (2017)
and Artetxe et al. (2018) in order to avoid system-specific biases.
performance of state-of-the-art bilingual mapping
strategies in a pivot-based inference scenario. We
use 11 different pivots and average the results of
two different strategies, Conneau et al. (2017)
and Artetxe et al. (2018), grouped by several
language families. As depicted by the results
presented in Table 6, selecting a pivot that
belongs to the family of the languages being
tested is always the best choice. In cases where we
considered multiple pivots of the same family, the
most resource-rich language resulted in the best
option, namely, Spanish in the case of the Italic
family and English for the Germanic family. On
average, English is the best choice of pivot if all
language families need to be considered, followed
by Spanish and Portuguese. This validates two
of the design decisions for HCEG, that is, the
need to avoid selecting a pivot and the importance
of using the languages with the largest speaker base
when performing language transfer.
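The observations above can be checked directly against the values reported in Table 6. The sketch below looks up the best pivot for each family column; the scores are copied from the table, while the column order (families listed alphabetically, followed by the average) is an assumption about the table layout, and the function name is hypothetical.

```python
from typing import Dict, List

# Assumed column order of Table 6 (IE = Indo-European).
FAMILIES: List[str] = [
    "Afro-Asiatic", "Austronesian", "IE/Balto-Slavic", "IE/Germanic",
    "IE/Indo-Iranian", "IE/Italic", "Sino-Tibetan", "Turkic", "Uralic", "Avg.",
]

# Scores from Table 6: one row per pivot, one value per column above.
TABLE_6: Dict[str, List[float]] = {
    "en":  [27.3, 28.7, 32.1, 39.8, 31.4, 40.4, 27.3, 26.9, 28.3, 31.4],
    "arz": [30.2, 27.1, 28.1, 32.1, 28.3, 33.4, 25.1, 23.4, 27.1, 28.3],
    "id":  [27.1, 30.3, 27.7, 31.1, 28.3, 32.5, 25.8, 24.6, 27.6, 28.3],
    "ru":  [26.3, 26.3, 34.2, 38.2, 28.5, 37.3, 24.6, 22.5, 26.8, 29.4],
    "de":  [25.1, 26.9, 25.1, 37.6, 27.3, 37.2, 24.7, 23.7, 25.6, 28.1],
    "hi":  [26.3, 27.1, 26.1, 33.7, 32.3, 34.2, 23.4, 25.6, 26.4, 28.3],
    "es":  [26.9, 26.7, 30.6, 38.5, 31.0, 41.5, 26.8, 26.7, 28.4, 30.8],
    "pt":  [26.0, 26.6, 30.4, 37.9, 27.7, 41.3, 25.9, 26.4, 26.5, 29.9],
    "zh":  [25.1, 27.3, 25.3, 23.4, 26.1, 24.8, 29.3, 25.7, 27.6, 26.1],
    "tr":  [24.9, 25.3, 25.5, 28.2, 27.8, 28.6, 25.3, 28.7, 27.3, 26.8],
    "hu":  [25.4, 25.8, 25.8, 31.8, 26.4, 32.8, 25.5, 21.9, 30.1, 27.3],
}


def best_pivot(family: str) -> str:
    """Return the pivot with the highest score in the given column."""
    column = FAMILIES.index(family)
    return max(TABLE_6, key=lambda pivot: TABLE_6[pivot][column])


if __name__ == "__main__":
    for family in FAMILIES:
        print(f"{family:16s} -> {best_pivot(family)}")
```

Under this reading of the table, every family column is maximized by a pivot from the same family, and the average column is maximized by English.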
6 Conclusion and Future Work
We have introduced HCEG, a crosslingual
space learning strategy that does not depend on
a pivot language; instead, it takes advantage of
the natural hierarchy existing among languages.
Results from extensive studies on 107 languages
demonstrate that the proposed strategy
outperforms existing crosslingual space generation
techniques, in terms of vocabulary induction,
for both popular and not so popular languages.
HCEG improves the mapping quality of many
low-resource languages. We noticed, however, that
this improvement mostly happens when a language
has more typologically related counterparts.
Therefore, as future work, we intend to
investigate other techniques that can help improve
the quality of mapping for typologically isolated
low-resource languages. Additionally, it is
important to note that the time complexity required by
the proposed algorithm is N(N − 1), with N being
the number of languages considered. For the
traditional TB/MP strategy, complexity is limited to
learning from N language pairs. Therefore, we
plan on exploring strategies to reduce the
number of language pairs that need to be learned
for creating the crosslingual space. Finally, we
will explore different data-driven strategies for
building the tree structure, such as geographical
proximity or lexical overlap, which could lead to
better optimized arrangements of the crosslingual
space.
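As a quick illustration of the gap between the two costs mentioned above, the following sketch counts the language pairs each strategy learns from, taking the stated figures at face value (N(N − 1) for the proposed algorithm and N for the pivot-based TB/MP strategy). The function names are hypothetical.

```python
def pairs_hierarchical(n_languages: int) -> int:
    """Ordered language pairs, N(N - 1), as stated for the proposed algorithm."""
    return n_languages * (n_languages - 1)


def pairs_pivot(n_languages: int) -> int:
    """Language pairs learned by a pivot-based (TB/MP) strategy, taken as N."""
    return n_languages


if __name__ == "__main__":
    n = 107  # number of languages considered in the study
    print(pairs_hierarchical(n), pairs_pivot(n))
```

For the 107 languages studied, this amounts to 11,342 pairs versus 107, which motivates the plan to reduce the number of pairs that need to be learned.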
References
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289–2294.
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462. ACL.
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 789–798.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
Fabienne Braune, Viktor Hangya, Tobias Eder, and Alexander Fraser. 2018. Evaluating bilingual word embeddings on the long tail. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 188–193.
Xilun Chen and Claire Cardie. 2018. Unsupervised multilingual word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 261–270.
Bernard Comrie. 1989. Language Universals and Linguistic Typology: Syntax and Morphology, University of Chicago Press.
Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.
Jocelyn Coulmance, Jean-Marc Marty, Guillaume Wenzek, and Amine Benhalloum. 2015. Trans-gram, fast cross-lingual word-embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1109–1113.
Paula Czarnowska, Sebastian Ruder, Édouard Grave, Ryan Cotterell, and Ann Copestake. 2019. Don't forget the long tail! A comprehensive analysis of morphological generalization in bilingual lexicon induction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 973–982.
Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. 2014. Improving zero-shot learning by mitigating the hubness problem. arXiv preprint arXiv:1412.6568.
Yerai Doval, Jose Camacho-Collados, Luis Espinosa Anke, and Steven Schockaert. 2018. Improving cross-lingual word embeddings by meeting in the middle. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 294–304. ACL.
Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, and Trevor Cohn. 2016. Learning crosslingual word embeddings without bilingual corpora. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1285–1295.
Goran Glavaš, Robert Litschko, Sebastian Ruder, and Ivan Vulić. 2019. How to (properly) evaluate cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions. arXiv preprint arXiv:1902.00508.
Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. BilBOWA: Fast bilingual distributed representations without word alignments. In International Conference on Machine Learning, pages 748–756.
Stephan Gouws and Anders Søgaard. 2015. Simple task-specific bilingual word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1386–1390.
Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
Karl Moritz Hermann and Phil Blunsom. 2013. Multilingual distributed representations without word alignment. arXiv preprint arXiv:1312.6173.
Geert Heyman, Bregt Verreet, Ivan Vulić, and Marie-Francine Moens. 2019. Learning unsupervised multilingual word embeddings with incremental multilingual hubs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1890–1902.
Pratik Jawanpuria, Arjun Balgovind, Anoop Kunchukuttan, and Bamdev Mishra. 2019. Learning multilingual word embeddings in latent metric space: a geometric approach. Transactions of the Association for Computational Linguistics, 7:107–120.
Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Hervé Jégou, and Edouard Grave. 2018. Loss in translation: Learning bilingual word mapping with a retrieval criterion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2979–2984.
David Kamholz, Jonathan Pool, and Susan Colowick. 2014. Panlex: Building a resource for panlingual lexical translation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), pages 3145–3150.
Yova Kementchedjhieva, Sebastian Ruder, Ryan Cotterell, and Anders Søgaard. 2018. Generalizing procrustes analysis for better bilingual dictionary induction. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 211–220.
Stanislas Lauly, Alex Boulanger, and Hugo Larochelle. 2014. Learning multilingual word representations using a bag-of-words autoencoder. arXiv preprint arXiv:1401.1803.
M. Paul Lewis and F. Gary. 2015. Simons, and Charles D. Fennig (eds.). 2013. Ethnologue: Languages of the world, pages 233–62.
Yuri Lin, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant, Will Brockman, and Slav Petrov. 2012. Syntactic annotations for the
Google Books Ngram Corpus. In Proceedings of the ACL 2012 System Demonstrations, pages 169–174. ACL.
Robert Litschko, Goran Glavaš, Ivan Vulić, and Laura Dietz. 2019. Evaluating resource-lean cross-lingual embedding models in unsupervised retrieval. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1109–1112. ACM.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 151–159.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2017. A survey of cross-lingual word embedding models. arXiv preprint arXiv:1706.04902.
Andrew M. Saxe, James L. McClelland, and Surya Ganguli. 2013. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.
Yutaro Shigeto, Ikumi Suzuki, Kazuo Hara, Masashi Shimbo, and Yuji Matsumoto. 2015. Ridge regression, hubness, and zero-shot learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 135–151. Springer.
Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.
Anders Søgaard, Željko Agić, Héctor Martínez Alonso, Barbara Plank, Bernd Bohnet, and Anders Johannsen. 2015. Inverted indexing for cross-lingual NLP. In The 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015).
Anders Søgaard, Sebastian Ruder, and Ivan Vulić. 2018. On the limitations of unsupervised bilingual dictionary induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 778–788.
Ivan Vulić, Goran Glavaš, Roi Reichart, and Anna Korhonen. 2019. Do we really need fully unsupervised cross-lingual embeddings? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4398–4409.
Ivan Vulić and Marie-Francine Moens. 2016. Bilingual distributed word representations from document-aligned comparable data. Journal of Artificial Intelligence Research, 55:953–994.
Takashi Wada, Tomoharu Iwata, and Yuji Matsumoto. 2019. Unsupervised multilingual word embedding with limited resources using neural language models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3113–3124.
Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1006–1011.
Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1393–1398.