An Empirical Study on Crosslingual Transfer
in Probabilistic Topic Models
Shudong Hao
Bard College at Simon’s Rock
Division of Science, Mathematics, and
Computing
shao@simons-rock.edu
Michael J. Paul
University of Colorado
Department of Information Science
mpaul@colorado.edu
Probabilistic topic modeling is a common first step in crosslingual tasks to enable knowledge
transfer and extract multilingual features. Although many multilingual topic models have been
developed, their assumptions about the training corpus are quite varied, and it is not clear how
well the different models can be utilized under various training conditions. In this article, the
knowledge transfer mechanisms behind different multilingual topic models are systematically
studied, and through a broad set of experiments with four models on ten languages, we provide
empirical insights that can inform the selection and future development of multilingual topic
models.
1. Introduction
Popularized by Latent Dirichlet Allocation (Blei, Ng, and Jordan 2003), probabilistic
topic models have been an important tool for analyzing large collections of texts (Blei
2012, 2018). Their simplicity and interpretability make topic models popular for many
natural language processing tasks, such as discovery of document networks (Chen
et al. 2013; Chang and Blei 2009) and authorship attribution (Seroussi, Zukerman, and
Bohnert 2014).
Topic models take a corpus D as input, where each document d ∈ D is usually
represented as a sparse vector in a vocabulary space, and project these documents
to a lower-dimensional topic space. In this sense, topic models are often used as a
dimensionality reduction technique to extract representative and human-interpretable
features.
Text collections, however, are often not in a single language, and thus there has been
a need to generalize topic models from monolingual to multilingual settings. Given a
corpus D^{(1,...,l)} in languages ℓ ∈ {1, . . . , l}, multilingual topic models learn topics in
Submission received: 11 October 2018; revised version received: 22 November 2019; accepted for publication:
25 November 2019.
https://doi.org/10.1162/COLI_a_00369
© 2020 Association for Computational Linguistics
Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
(CC BY-NC-ND 4.0) license
each of the languages. From a human’s view, each topic should be related to the same
theme, even if the words are not in the same language (Figure 1(b)). From a machine’s
view, the word probabilities within a topic should be similar across languages, such that
the low-dimensional representation of documents is not dependent on the language.
In other words, the topic space in multilingual topic models is language agnostic
(Figure 1(a)).
This article presents two major contributions to multilingual topic models. We first
provide an alternative view of multilingual topic models by explicitly formulating a
crosslingual knowledge transfer process during posterior inference (Section 3). Based on
this analysis, we unify different multilingual topic models by defining a function called
the transfer operation. This function provides an abstracted view of the knowledge
transfer mechanism behind these models, while enabling further generalizations and
improvements. Using this formulation, we analyze several existing multilingual topic
models (Section 4).
Second, in our experiments we compare four representative models under different
training conditions (Section 5). The models are trained and evaluated on ten languages
from various language families to increase language diversity in the experiments. In
particular, we include five languages with relatively high resources and five others
with low resources. To quantitatively evaluate the models, we focus on topic quality
in Section 5.3.1, and performance of downstream tasks using crosslingual document
classification in Section 5.3.2. We investigate how sensitive the models are to different
language resources (i.e., parallel/comparable corpora and dictionaries), and analyze
what factors cause this difference (Sections 6 and 7).
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
C
oh
yo
i
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
/
/
/
4
6
1
9
5
1
8
4
7
8
1
2
/
C
oh
yo
i
_
a
_
0
0
3
6
9
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Figure 1
Overview of multilingual topic models. (a) Multilingual topic models project language-specific
and high-dimensional features from the vocabulary space to a language-agnostic and
low-dimensional topic space. This figure shows a t-SNE (Maaten and Hinton 2008)
representation of a real data set. (b) Multilingual topic models produce theme-aligned topics for
all languages. From a human’s view, each topic contains different languages but the words are
describing the same thing.
2. Background
We first review monolingual topic models, focusing on Latent Dirichlet Allocation, and
then describe two families of multilingual extensions. Based on the types of supervision
added to multilingual topic models, we separate the two model families into document-
level and word-level supervision.
Topic models provide a high-level view of latent thematic structures in a corpus.
Two main branches for topic models are non-probabilistic approaches such as Latent
Semantic Analysis (LSA; Deerwester et al. 1990) and Non-Negative Matrix Factorization
(Xu, Liu, and Gong 2003), and probabilistic ones such as Latent Dirichlet Allocation
(LDA; Blei, Ng, and Jordan 2003) and probabilistic LSA (pLSA; Hofmann 1999). All
these models were originally developed for monolingual data and later adapted to
multilingual situations. Though there has been work to adapt non-probabilistic models,
for example, based on “pseudo-bilingual” corpora approaches (Littman, Dumais, and
Landauer 1998), most multilingual topic models that are trained on multilingual cor-
pora are based on probabilistic models, especially LDA. Therefore, our work is focused
on the probabilistic topic models, and in the following section we start by describing
LDA.
2.1 Monolingual Topic Models
The most popular topic model is LDA, introduced by Blei, Ng, and Jordan (2003).
This model assumes each document d is represented by a multinomial distribution θd
over topics, and each “topic” k is a multinomial distribution φ(k) over the vocabulary
V. In the generative process, each θ and φ are generated from Dirichlet distributions
parameterized by α and β, respectively. The hyperparameters for Dirichlet distributions
can be asymmetric (Wallach, Mimno, and McCallum 2009), though in this work we use
symmetric priors. Figure 2 shows the plate notation of LDA.
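To make the generative story concrete, here is a minimal NumPy sketch of LDA’s generative process as just described. It is purely illustrative: the variable names and the sizes (K, V, D, doc_len) are ours, and the hyperparameter values are placeholders rather than the settings used in the experiments later.

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, D, doc_len = 20, 5000, 100, 150        # topics, vocabulary size, documents, tokens per document
    alpha, beta = 0.1, 0.01                      # symmetric Dirichlet hyperparameters

    phi = rng.dirichlet(np.full(V, beta), size=K)          # topic-word distributions phi^(k)
    corpus = []
    for d in range(D):
        theta_d = rng.dirichlet(np.full(K, alpha))         # document-topic distribution theta_d
        z = rng.choice(K, size=doc_len, p=theta_d)         # topic assignment for each token
        w = np.array([rng.choice(V, p=phi[k]) for k in z]) # observed tokens
        corpus.append((z, w))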
2.2 Multilingual Topic Models
We now describe a variety of multilingual topic models, organized into two families
based on the type of supervision they use. Later, in Section 4, we focus on a subset of the
models described here for deeper analysis using our knowledge transfer formulation,
selecting the most general and representative models.
2.2.1 Document Level. The first model proposed to process multilingual corpora using
LDA is the Polylingual Topic Model (PLTM; Mimno et al. 2009; Ni et al. 2009). This
model extracts language-consistent topics from parallel or highly comparable multi-
lingual corpora (Por ejemplo, Wikipedia articles aligned across languages), assuming
that document translations share the same topic distributions. This model has been
Figure 2
Plate notation of LDA. α and β are Dirichlet hyperparameters for θ and {φ^{(k)}}^K_{k=1}. Topic
assignments are denoted as z, and w denotes observed tokens.
extensively used and adapted in various ways for different crosslingual tasks (Krstovski
and Smith 2011; Moens and Vulic 2013; Vulić and Moens 2014; Liu, Duh, and Matsumoto
2015; Krstovski and Smith 2016).
In the generative process, PLTM first generates language-specific topic-word dis-
tributions φ^{(ℓ,k)} ∼ Dir(β^{(ℓ)}), for topics k = 1, . . . , K and languages ℓ = 1, . . . , l. Then,
for each document tuple d = (d^{(1)}, . . . , d^{(l)}), it generates a tuple-topic distribution θd ∼
Dir(α). Every topic in this document tuple is generated from θd, and the word tokens in
this document tuple are then generated from language-specific word distributions φ^{(ℓ,k)}
for each language. To apply PLTM, the corpus must be parallel or closely comparable
to provide document-level supervision. We refer to this as the document links model
(DOCLINK).
Models that transfer knowledge on the document level have many variants, incluir-
ing SOFTLINK (Hao and Paul 2018), comparable bilingual LDA (C-BILDA; Heyman,
Vulic, and Moens 2016), the partially connected multilingual topic model (PCMLTM;
Liu, Duh, and Matsumoto 2015), and multi-level hyperprior polylingual topic model
(MLHPLTM; Krstovski, Smith, and Kurtz 2016). SOFTLINK generalizes DOCLINK by
using a dictionary, so that documents can be linked based on overlap in their vocab-
ulary, even if the corpus is not parallel or comparable. C-BILDA is a direct extension
of DOCLINK that also models language-specific distributions to distinguish topics that
are shared across languages from language-specific topics. PCMLTM adds an additional
observed variable to indicate the absence of a language in a document tuple. MLHPLTM
uses a hierarchy of hyperparameters to generate section-topic distributions. Este modelo
was motivated by applications to scientific research articles, where each section s has its
own topic distribution θ(s) shared by both languages.
2.2.2 Word Level. Instead of document-level connections between languages, Boyd-
Graber and Blei (2009) and Jagarlamudi and Daumé III (2010) proposed to model con-
nections between languages through words using a multilingual dictionary and apply
hyper-Dirichlet Type-I distributions (Andrzejewski, Zhu, and Craven 2009; Dennis III
1991). We refer to these approaches as the vocabulary links model (VOCLINK).
Specifically, VOCLINK uses a dictionary to create a tree structure where each internal
node contains word translations, and words that are not translated are attached directly
to the root of the tree r as leaves. In the generative process, for each language ℓ, VOCLINK
first generates K multinomial distributions over all internal nodes and word types that
are not translated, φ^{(r,ℓ,k)} ∼ Dir(β^{(r,ℓ)}), where β^{(r,ℓ)} is a vector of Dirichlet priors from
the root r to internal nodes and untranslated words in language ℓ. Then, under each internal
node i, for each language ℓ, VOCLINK generates a multinomial φ^{(i,ℓ,k)} ∼ Dir(β^{(i,ℓ)})
over word types in language ℓ under the node i. Note that both β^{(r,ℓ)} and β^{(i,ℓ)} are
vectors. In the first vector β^{(r,ℓ)}, each cell is parameterized by a scalar β′ and scaled by
the number of word translations under that internal node. The second vector β^{(i,ℓ)}
is a symmetric hyperparameter where every cell uses the same scalar β′′. See Figure 3
for an illustration.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
C
oh
yo
i
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
/
/
/
4
6
1
9
5
1
8
4
7
8
1
2
/
C
oh
yo
i
_
a
_
0
0
3
6
9
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
De este modo, to draw a word in language (cid:96) is equivalent to generating a path from the root
to leaf nodes:
r
i, i
→
→
(cid:0)
w((cid:96))
o
r
→
(cid:1)
(cid:0)
w((cid:96))
:
(cid:1)
Pr
r
(cid:0)
i, i
r
→
Pr
w((cid:96))
w((cid:96))
|
|
→
→
k
= Pr (i
k)
k
(cid:1)
= Pr
|
·
w((cid:96))
|
Pr
w((cid:96))
k
(cid:0)
k, i
|
(cid:1)
(1)
(2)
98
(cid:0)
(cid:1)
(cid:0)
(cid:1)
Figure 3
An illustration of the tree structure used in word-level models. The hyperparameters β^{(r,ℓ)} and β^{(i,ℓ)}
are both vectors, and β′ and β′′ are scalars. In the figure, i1 has three translations, so the
corresponding hyperparameter β^{(r,EN)}_1 = β^{(r,SV)}_1 = 3β′.
Document-topic distributions θd are generated in the same way as monolingual
LDA, because no document translation is required.
The use of dictionaries to model similarities across topic-word distributions has
been formulated in other ways as well. PROBBILDA (Ma and Nasukawa 2017) uses
inverted indexing (Søgaard et al. 2015) to encode assumptions that word translations
are generated from same distributions. PROBBILDA does not use tree structures in the
parameters as in VOCLINK, but the general idea of sharing distributions among word
translations is similar. Gutiérrez et al. (2016) use part-of-speech taggers to separate
topic words (nouns) and perspective words (adjectives and verbs), developed for the
application of detecting cultural differences, such as how different languages have
different perspectives on the same topic. Topic words are modeled in the same way
as in VOCLINK, whereas perspective words are modeled in a monolingual fashion.
3. Crosslingual Transfer in Probabilistic Topic Models
Conceptually, the term “knowledge transfer” indicates that there is a process of carrying
information from a source to a destination. Using the representations of graphical
modelos, the process can be visualized as the dependence of random variables. Para
ejemplo, X
Y implies that the generation of variable Y is conditioned on X, y
thus the information of X is carried to Y. If X represents a probability distribution, el
distribution of Y is informed by X, presenting a process of knowledge transfer, as we
define it in this work.
In our study, “knowledge” can be loosely defined as K multinomial distributions
over the vocabularies: {φ^{(k)}}^K_{k=1}. Thus, to study the transfer mechanisms in topic models
is to reveal how the models transfer {φ^{(k)}}^K_{k=1} from one language to another. To date,
this transfer process has not been obvious in most models, because typical multilingual
topic models assume the tokens in multiple languages are generated jointly.
In this section, we present a reformulation of these models that breaks down the co-
generation assumption of current models and instead explicitly shows the dependencies
between languages. Starting with a simple example in Section 3.1, we show that our
alternative formulation derives the same collapsed Gibbs sampler, and thus the same
posterior distribution over samples, as in the original model. With this prerequisite, in
Section 3.3 we introduce the transfer operation, which will be used to generalize and
extend current multilingual topic models in Section 4.
3.1 Transfer Dependencies
We start with a simple graphical model, where θ ∈ R^K_+ is a K-dimensional categorical
distribution, drawn from a Dirichlet parameterized by α, a symmetric hyperparameter
(Figure 4(a)). Using θ, the model generates two variables, X and Y, and we use x and y
to denote the generated observations. In the co-generation assumption, the variables X
and Y are generated from the same θ at the same time, without dependencies between
each other. Thus, we call this the joint model, denoted as G(X,Y), and the probability of the
sample (x, y) is Pr(x, y; α, G(X,Y)).
According to Bayes’ theorem, there are two equivalent ways to expand the proba-
bility of (x, y):

Pr(x, y; α) = Pr(x | y; α) · Pr(y; α)    (3)

Pr(x, y; α) = Pr(y | x; α) · Pr(x; α)    (4)

where we notice that the generated sample is conditioned on another sample: Pr(x | y; α)
and Pr(y | x; α), which fits into our concept of “transfer.” We show both cases in
Figures 4(b) and 4(c), and denote the graphical structures as G(X|Y) and G(Y|X), respec-
tively, to show the dependencies between the two variables.
In this formulation, the model generates θx from Dirichlet(α) first and uses θx to
generate the sample of x. Using the histogram of x, denoted as nx = [n_{1|x}, n_{2|x}, . . . , n_{K|x}],
Figure 4
(a) The co-generation assumption generates x and y at the same time from the same θ. (b) To
make the transfer process clear, we make the generation of y conditional on x and highlight the
dependency in red. Because both x and y are exchangeable, the dependency can go the other
way, as shown in (C).
where n_{k|x} is the number of instances of x assigned to category k, together with the hyper-
parameter α, the model then generates a categorical distribution θ_{y|x} ∼ Dir(nx + α),
from which the sample y is drawn.
This differs from the original joint model in that the original parameter vector θ has
been replaced with two variable-specific parameter vectors. The next section derives
posterior inference with Gibbs sampling after integrating out the θ parameters, y
we show that the sampler for each of two model formulations is equivalent and thus
samples from an equivalent posterior distribution over x and y.
3.2 Collapsed Gibbs Sampling
General approaches to infer posterior distributions over graphical model variables in-
clude Gibbs sampling, variational inference, and hybrid approaches (Kim, Voelker, and
Saul 2013). We focus on collapsed Gibbs sampling (Griffiths and Steyvers 2004), which
marginalizes out the parameters (θ in the example above) to focus on the variables of
interest (x and y in the example).
Continuing with the example from the previous section, in each iteration of Gibbs
sampling (a “sweep” of samples), the sampler goes through each example in the data,
which can be viewed as sampling from the full posterior of a joint model G(X,Y), as in
Figure 5(a). Thus, when sampling an instance xi ∈ x, the collapsed conditional likeli-
hood is
Pr(x = k | x_−, y; α) = Pr(x = k, x_−, y; α) / Pr(x_−, y; α)    (5)

= [Γ(αk + n_{k|x} + n_{k|y}) / Γ(N_x + N_y + 1^⊤α)] · [Γ(N^{(−i)}_x + N_y + 1^⊤α) / Γ(αk + n^{(−i)}_{k|x} + n_{k|y})]    (6)

= (n^{(−i)}_{k|x} + n_{k|y} + αk) / (N^{(−i)}_x + N_y + 1^⊤α)    (7)
where x_− is the set of tokens excluding the current one and n^{(−i)}_{k|x} is the number of
instances of x assigned to category k except the current xi. Note that in this equation, α
is the hyperparameter for the Dirichlet prior, which gets added to the counts in the
formula after integrating out the parameters θ.
Figure 5
Sampling from a joint model G(X,Y) (a) and two conditional models G(X|Y) and G(Y|X) (b) yields the
same MAP estimates.
Using our formulation from the previous section, we can separate each sweep into
two subprocedures, one for each variable. When sampling an instance of xi ∈ x, the
histogram of the sample y is fixed, and therefore it is sampling from the conditional model
G(X | Y). Thus, the conditional likelihood is
Pr(x = k | x_−; y, α, G(X|Y)) = Pr(x = k, x_−; y, α) / Pr(x_−; y, α)    (8)

= [Γ(n_{k|x} + (n_{k|y} + αk)) / Γ(N_x + (N_y + 1^⊤α))] · [Γ(N^{(−i)}_x + (N_y + 1^⊤α)) / Γ(n^{(−i)}_{k|x} + (n_{k|y} + αk))]    (9)

= (n^{(−i)}_{k|x} + (n_{k|y} + αk)) / (N^{(−i)}_x + (N_y + 1^⊤α))    (10)
where the hyperparameter for variable X and category k becomes n_{k|y} + αk. Similarly,
when sampling yi ∈ y, which is generated from the model G(Y|X), the conditional likeli-
hood is
Pr(y = k | y_−; x, α, G(Y|X)) = (n^{(−i)}_{k|y} + (n_{k|x} + αk)) / (N^{(−i)}_y + (N_x + 1^⊤α))    (11)
with n_{k|x} + αk as the hyperparameter for Y. This process is shown in Figure 5(b).
From the calculation perspective, although the meanings of Equations (7), (10), and
(11) are different, their formulae are identical. This allows us to analyze similar models
using the conditional formulation without changing the posterior estimation. A similar
approach is the pseudo-likelihood approximation, where a joint model is reformulated
as the combination of two conditional models, and the optimal parameters for the
pseudo-likelihood function are the same as for the original joint likelihood function
(Besag 1975; Koller and Friedman 2009; Leppä-aho et al. 2017).
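As a numerical sanity check on this equivalence, the short sketch below evaluates the collapsed conditional of the joint model (Equation (7)) and the conditional-model version (Equation (10)) on a small synthetic sample and confirms that they coincide. The counts and hyperparameters are made up for illustration; this is not code from the original study.

    import numpy as np

    K = 3
    alpha = np.full(K, 0.1)
    rng = np.random.default_rng(1)
    x = rng.integers(K, size=50)           # categorical sample x
    y = rng.integers(K, size=40)           # categorical sample y
    n_x = np.bincount(x, minlength=K)      # histogram n_{k|x}
    n_y = np.bincount(y, minlength=K)      # histogram n_{k|y}

    i = 0                                  # resample the first instance of x
    n_x_minus = n_x.copy()
    n_x_minus[x[i]] -= 1                   # remove x_i from the counts

    # Equation (7): collapsed conditional under the joint model G(X,Y)
    p_joint = (n_x_minus + n_y + alpha) / (n_x_minus.sum() + n_y.sum() + alpha.sum())

    # Equation (10): conditional model G(X|Y), where n_{k|y} + alpha_k plays the role of the prior
    prior_from_y = n_y + alpha
    p_cond = (n_x_minus + prior_from_y) / (n_x_minus.sum() + prior_from_y.sum())

    assert np.allclose(p_joint, p_cond)    # identical formulae, hence identical samplers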
3.3 Transfer Operation
Now that we have made the transfer process explicit and shown that this alternative
formulation yields the same collapsed posterior, we are able to describe a similar process in
detail in the context of multilingual topic models.
If we treat X and Y in the previous example as two languages, and the samples
x and y as either words, tokens, or documents from the two languages, we have a
bilingual data set (X, y). Topic models have more complex graphical structures, dónde
the examples (tokens) are organized within certain scopes (p.ej., documentos). To define
the transfer process for a specific topic model, when generating samples in one language
based on the transfer process of the model, we have to specify what examples we want
to use from another language, how much, and where we want to use them. To this end,
we define the transfer operation, which allows us to examine different models under a
unified framework to compare them systematically.
Definition 1 (Transfer operation)
Let Ω ∈ R^M be the target distribution of knowledge transfer with dimensionality M.
A transfer operation on Ω from language ℓ1 to ℓ2 is defined as a function

h_Ω : R^{L2×L1} × N^{L1×M} × R^{L2×M}_+ → R^{L2×M}    (12)

where L1 and L2 are the relevant dimensionalities for languages ℓ1 and ℓ2, respectively.
In this definition, the first argument of the transfer operation is where the two lan-
guages connect to each other, and can be defined as any bilingual supervision needed to
enable transfer. The actual values of L1 and L2 depend on specific models. In an example
of generating a document in language ℓ2, L1 is the number of documents in language
ℓ1 and L2 = 1, and δ ∈ R^{L1} could be a binary vector where δi = 1 if document i is the
translation of the current document in ℓ2, or zero otherwise. This is the core of crosslingual
transfer through the transfer operation; later we will see that different multilingual topic
models mostly only differ in the input of this argument, and designing this matrix is
critical for efficient knowledge transfer.
The second argument in the transfer operation is the sufficient statistics of the trans-
fer source (ℓ1 in the definition). After generating instances in language ℓ1, the statistics
are organized into a matrix. The last argument is a prior distribution over the possible
target distributions Ω.
The output of the transfer operation depends on and has the same dimensionality
as the target distribution, which will be used as the prior to generate a multinomial dis-
tribution. Let Ω be the target distribution from which a topic of language ℓ2 is generated:
z ∼ Multinomial(Ω). With a transfer operation, a topic is generated as follows:

Ω ∼ Dirichlet(h_Ω(δ, N^{(ℓ1)}, ξ))    (13)

z ∼ Multinomial(Ω)    (14)

where δ is the bilingual supervision, N^{(ℓ1)} the generated sample of language ℓ1, and ξ a
prior distribution with the same dimensionality as Ω. See Figure 6 for an illustration.
In summary, this definition highlights three elements that are necessary to enable
transfer:
(1) language transformations or supervision from the transfer source to destination;
(2) data statistics in the source; y
(3) a prior on the destination.
In the next section, we show how different topic models can be formulated with transfer
operaciones, as well as how transfer operations can be used in the design of new models.
4. Representative Models
In this section, we describe four representative multilingual topic models in terms of
the transfer operation formulation. These are also the models we will experiment on in
Sección 5. The plate notations of these models are shown in Figure 7, and we provide
notations frequently used in these models in Table 1.
Figure 6
An illustration of a transfer operation on a 3-dimensional Dirichlet distribution. The first
argument of h_Ω is a bilingual supervision δ, which is a 3 × 3 matrix, where L1 = L2 = 3,
indicating word translations between two languages. The second argument N^{(ℓ1)} is the statistics
(or histogram) from the sample in language ℓ1, whose dimension is aligned with δ, and M = 1.
With ξ as the prior knowledge (a symmetric hyperparameter), the result of h_Ω is then used as
hyperparameters for the Dirichlet distribution.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
C
oh
yo
i
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
/
/
/
4
6
1
9
5
1
8
4
7
8
1
2
/
C
oh
yo
i
_
a
_
0
0
3
6
9
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Figure 7
Plate notations of DOCLINK, C-BILDA, SOFTLINK, and VOCLINK (from left to right). We use red
lines to make the knowledge transfer component clear. Note that in VOCLINK we assume every
word is translated, so the plate notation does not include untranslated words.
Table 1
Notation table.

z             The topic assignment to a token.
w^{(ℓ)}       A word type in language ℓ.
V^{(ℓ)}       The size of the vocabulary in language ℓ.
D^{(ℓ)}       The size of the corpus in language ℓ.
D^{(ℓ1,ℓ2)}   The number of document pairs in languages ℓ1 and ℓ2.
α             A symmetric Dirichlet prior vector of size K, where K is the number of topics, and each cell is denoted as αk.
θ_{d,ℓ}       Multinomial distribution over topics for a document d in language ℓ.
β^{(ℓ)}       A symmetric Dirichlet prior vector of size V^{(ℓ)}, where V^{(ℓ)} is the size of the vocabulary in language ℓ.
β^{(r,ℓ)}     An asymmetric Dirichlet prior vector of size I + V^{(ℓ,−)}, where I is the number of internal nodes in a Dirichlet tree and V^{(ℓ,−)} the number of untranslated words in language ℓ. Each cell is denoted as β^{(r,ℓ)}_i, indicating a scalar prior to a specific node i or an untranslated word type.
β^{(i,ℓ)}     A symmetric Dirichlet prior vector of size V^{(ℓ)}_i, where V^{(ℓ)}_i is the number of word types in language ℓ under internal node i.
φ^{(ℓ,k)}     Multinomial distribution over word types in language ℓ for topic k.
φ^{(r,ℓ,k)}   Multinomial distribution over internal nodes in a Dirichlet tree for topic k.
φ^{(i,ℓ,k)}   Multinomial distribution over all word types in language ℓ under internal node i for topic k.
4.1 Standard Models
Typical multilingual topic models are designed based on simple observations of mul-
tilingual data, such as parallel corpora and dictionaries. We focus on three popular
modelos, and re-formulate them using the conditional generation assumption and the
transfer operation we introduced in the previous sections.
4.1.1 DOCLINK. The document links model (DOCLINK) uses parallel/comparable data
sets, so that each bilingual document pair shares the same distribution over topics.
Assume the document d in language ℓ1 is paired with d in language ℓ2. Thus, the transfer
target distribution is θ_{d,ℓ2} ∈ R^K, where K is the number of topics. For a document d_{ℓ2}, let
δ ∈ N^{D^{(ℓ1)}}_+ be an indicator vector to indicate if a document d_{ℓ1} is a translation or compa-
rable document to d_{ℓ2},

δ_{d_{ℓ1}} = 1{d_{ℓ2} and d_{ℓ1} are translations}    (15)
where D^{(ℓ1)} is the number of documents in language ℓ1. Thus, the transfer operation for
each document d_{ℓ2} can be defined as

h_{θ_{d,ℓ2}}(δ, N^{(ℓ1)}, α) = δ · N^{(ℓ1)} + α    (16)

where N^{(ℓ1)} ∈ N^{D^{(ℓ1)}×K} is the sufficient statistics from language ℓ1, and each cell n_{dk} is
the count of topic k appearing in document d. We call this a “document-level” model,
because the transfer target distribution is document-wise.
On the other hand, DOCLINK does not have any word-level knowledge, such as
dictionaries, so the transfer operation on φ in DOCLINK is straightforward. For every
topic k = 1, . . . , K and each word type w regardless of its language,

h_{φ^{(ℓ2,k)}}(0, N^{(ℓ1)}, β^{(ℓ2)}) = 0 · N^{(ℓ1)} + β^{(ℓ2)} = β^{(ℓ2)}    (17)

where β^{(ℓ2)} ∈ R^{V^{(ℓ2)}}_+ is a symmetric Dirichlet prior for the topic-vocabulary distributions
φ^{(ℓ2,k)}, and V^{(ℓ2)} is the size of the vocabulary in language ℓ2.
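A minimal sketch of Equations (15)-(17) for a single target document follows; the helper names are ours and the counts are invented for illustration.

    import numpy as np

    def doclink_delta(num_src_docs, linked_src_index=None):
        """Equation (15): indicator vector over source-language documents."""
        delta = np.zeros(num_src_docs)
        if linked_src_index is not None:                 # None means no translation exists
            delta[linked_src_index] = 1.0
        return delta

    def doclink_theta_prior(delta, topic_counts_l1, alpha):
        """Equation (16): h_theta = delta . N^(l1) + alpha."""
        return delta @ topic_counts_l1 + alpha

    K, D1 = 4, 3
    N_l1 = np.array([[5, 0, 2, 1],                       # doc-by-topic counts in language l1
                     [0, 3, 0, 0],
                     [1, 1, 1, 1]])
    alpha = np.full(K, 0.1)

    delta = doclink_delta(D1, linked_src_index=2)        # the target document is paired with source doc 2
    print(doclink_theta_prior(delta, N_l1, alpha))       # -> [1.1 1.1 1.1 1.1]
    # Equation (17): the word-level operation ignores N^(l1) and simply returns beta^(l2).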
4.1.2 C-BILDA. As a variation of DOCLINK, C-BILDA has all of the components of DOCLINK
and has the same transfer operations on θ and φ as in Equations (16) y (17), so this
model is considered as a document-level model as well. Recall that C-BILDA addition-
ally models topic-language distributions η.1 For each document pair d and each topic
k, a bivariate Bernoulli distribution over the two languages η^{(k,d)} ∈ R^2_+ is drawn from a
Beta distribution parameterized by (χ^{(d,ℓ1)}, χ^{(d,ℓ2)}):

η^{(k,d)} ∼ Beta(χ^{(d,ℓ1)}, χ^{(d,ℓ2)})    (18)

ℓ^{(k,m)} ∼ Bernoulli(η^{(k,d)})    (19)

where ℓ^{(k,m)} is the language of the m-th token assigned to topic k in the entire document
pair d. Intuitively, η^{(k,d)}_ℓ is the probability of generating a token in language ℓ given the
current document pair d and topic k.
Before diving into the specific definition of the transfer operation for this model,
we need to take a closer look at the generative process of C-BILDA first, because in this
modelo, language itself is a random variable as well. We describe the generative process
in terms of the conditional formulation where one language is conditioned on the other.
As usual, a monolingual model first generates documents in ℓ1, and at this point each
document pair d only has tokens in one language. Then for each document pair d,
the conditional model additionally generates a number of topics z using the transfer
operation on θ as defined in Equation (16). Instead of directly drawing a new word
type in language ℓ2 according to z, C-BILDA adds a step to generate a language ℓ′ from
η^{(z,d)}. Because the current token is supposed to be in language ℓ2, if ℓ′ ≠ ℓ2, this token
is dropped, and the model keeps drawing the next topic z; otherwise, a word type is
drawn from φ^{(z,ℓ2)} and attached to the document pair d. Once this process is over, each
1 The original notation for topic-language distribution is δ (Heyman, Vulic, and Moens 2016). To avoid
confusion in Equation (15), we change to η. We also follow the original paper where the model is for a
bilingual case.
Figure 8
An illustration of the difference between DOCLINK and C-BILDA in the sequential generating process.
DOCLINK uses a transfer operation on θ to generate topics and then word types in Swedish (SV).
Additionally, C-BILDA uses a transfer operation on η to generate a language label according to a
topic z. If the language generated is in Swedish, it draws a word type from the vocabulary;
de lo contrario, the token is discarded.
document pair d contains tokens from two languages, and by separating the tokens
based on their languages we can obtain the corresponding set of comparable document
pairs. Conceptually, C-BILDA adds an additional “selector” in the generative process to
decide if a topic should appear more in (cid:96)2 based on topics in (cid:96)1. We use Figure 8 as an
illustration to show the difference between DOCLINK and C-BILDA.
It is clear that the generation of tokens in language ℓ2 is affected by that of language
ℓ1; thus we define an additional transfer operation on η^{(k,d)}. The bilingual supervision δ
is the same as in Equation (15), which is a vector of dimension D^{(ℓ1)} indicating document
translations. We denote the statistics term N^{(ℓ1)}_k ∈ R^{D^{(ℓ1)}×2}, where each cell in the first
column, n_{dk}, is the count of topic k in document d, while the second column is a zero
vector. Lastly, the prior term is also a two-dimensional vector χ^{(d)} = (χ^{(d,ℓ1)}, χ^{(d,ℓ2)}).
Together, we have the transfer operation defined as

h_{η^{(k,d)}}(δ, N^{(ℓ1)}_k, χ^{(d)}) = δ · N^{(ℓ1)}_k + χ^{(d)}    (20)
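The sketch below works through Equation (20) for one document pair and one topic k. It is our own illustration under the paper's notation: the counts and the χ values are invented, and the zero second column encodes that only tokens from ℓ1 have been generated so far.

    import numpy as np

    def cbilda_eta_prior(delta, topic_k_counts_l1, chi):
        """Equation (20): Beta parameters for the language selector of topic k."""
        N_k = np.stack([topic_k_counts_l1,
                        np.zeros_like(topic_k_counts_l1)], axis=1)   # shape (D1, 2)
        return delta @ N_k + chi

    delta = np.array([0.0, 1.0, 0.0])            # the document pair is linked to source document 1
    counts_topic_k = np.array([2.0, 7.0, 4.0])   # occurrences of topic k in each source document
    chi = np.array([2.0, 2.0])                   # (chi^(d,l1), chi^(d,l2))
    print(cbilda_eta_prior(delta, counts_topic_k, chi))   # -> [9. 2.]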
4.1.3 VOCLINK. Jagarlamudi and Daumé III (2010) and Boyd-Graber and Blei (2009)
introduced another type of multilingual topic model, which uses a dictionary for word-
level supervision instead of parallel/comparable documents as supervision, and we
call this model VOCLINK.2 Because no document-level supervision is used, the transfer
operation on θ is simply defined as

h_{θ_{d,ℓ2}}(0, N^{(ℓ1)}, α) = 0 · N^{(ℓ1)} + α = α    (21)

We now construct the transfer operation on the topic-word distribution φ based
on the tree-structured priors in VOCLINK (Figure 3). Recall that each word w^{(ℓ)} is asso-
ciated with at least one path, denoted as λ_{w^{(ℓ)}}. If w^{(ℓ)} is translated, the path is λ_{w^{(ℓ)}} =
(r → i, i → w^{(ℓ)}), where r is the root and i an internal node; otherwise, the path is
simply the edge from the root to that word. Thus, on the first level of the tree, the Dirichlet
2 Although some models, as in Hu et al. (2014b), transfer knowledge at both document and word levels,
in this analysis, we only focus on the word level where no transfer happens on the document level. El
generalization simply involves using the same transfer operation on θ that is used in DOCLINK.
distribution φ^{(r,ℓ2,k)} is of dimension I + V^{(ℓ2,−)}, where I is the number of internal nodes
(i.e., word translation entries), and V^{(ℓ2,−)} are the untranslated word types in language
ℓ2. Let δ ∈ R^{(I+V^{(ℓ2,−)})×V1}_+ be an indicator matrix where V1 is the number of translated
words in language ℓ1, and each cell is

δ_{i,w^{(ℓ1)}} = 1{w^{(ℓ1)} is under node i}    (22)

Given a topic k, the statistics argument N^{(ℓ1)} ∈ R^{V1} is a vector where each cell n_w
is the count of word w assigned to topic k. Note that in the tree structure, the prior for
the Dirichlet is asymmetric and is scaled by the number of translations under each internal
node. Thus, the transfer operation on φ^{(r,ℓ2,k)} is

h_{φ^{(r,ℓ2,k)}}(δ, N^{(ℓ1)}, β^{(r,ℓ2)}) = δ · N^{(ℓ1)} + β^{(r,ℓ2)}    (23)

Under each internal node, the Dirichlet is only related to specific languages, so no
transfer happens, and the transfer operation on φ^{(i,ℓ2,k)} for an internal node i is simply
β^{(i,ℓ2)}:

h_{φ^{(i,ℓ2,k)}}(0, N^{(ℓ1)}, β^{(i,ℓ2)}) = 0 · N^{(ℓ1)} + β^{(i,ℓ2)} = β^{(i,ℓ2)}    (24)
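As an illustration of Equations (22) and (23), the sketch below builds the first-level Dirichlet prior for one topic over a toy tree with two internal nodes and one untranslated ℓ2 word. The indicator matrix, counts, and node sizes are invented; only the scaling by β′ follows Figure 3.

    import numpy as np

    # Rows: 2 internal nodes + 1 untranslated l2 word; columns: 3 translated l1 word types.
    delta = np.array([[1.0, 1.0, 0.0],     # node i1 covers l1 words 0 and 1
                      [0.0, 0.0, 1.0],     # node i2 covers l1 word 2
                      [0.0, 0.0, 0.0]])    # the untranslated l2 word has no l1 counterpart
    n_l1 = np.array([4.0, 1.0, 6.0])       # topic-k counts of the translated l1 words

    beta_prime = 0.01
    entries_under_node = np.array([3, 2, 1])            # translations under i1, i2, and the lone word
    beta_r_l2 = beta_prime * entries_under_node         # asymmetric prior beta^(r,l2), as in Figure 3

    # Equation (23): Dirichlet prior over the first level of the tree for topic k.
    print(delta @ n_l1 + beta_r_l2)                     # -> [5.03 6.02 0.01]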
4.2 SOFTLINK: A Transfer Operation-Based Model
We have formulated three representative multilingual topic models by defining transfer
operations for each model above. Our recent work, called SOFTLINK (Hao and Paul
2018), is explicitly designed according to the understanding of this transfer process. We
present this model as a demonstration of how transfer operations can be used to build
new multilingual topic models, which might not have an equivalent formulation using
the standard co-generation model, by modifying the transfer operation.
In DOCLINK, the supervision argument δ in the transfer operation is constructed
using comparable data sets. This requirement, however, substantially limits the data
that can be used. Moreover, the supervision δ is also limited by the data; if there is
no translation available to a target document, δ is an all-zero vector, and the transfer
operation defined in Equation (16) will cancel out all the available information N((cid:96)1 )
for the target document, which is an ineffective use of the resource. Unlike parallel
corpora, dictionaries are widely available and often easy to obtain for many languages.
Thus, the general idea of SOFTLINK is to use a dictionary to retrieve as much information
as possible from ℓ1 to construct δ in a way that links potentially comparable documents
together, even if the corpus itself does not explicitly link together documents.
Specifically, for a document d_{ℓ2}, instead of a pre-defined indicator vector, SOFTLINK
defines δ as a probability distribution over all documents in language ℓ1:
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
C
oh
yo
i
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
/
/
/
4
6
1
9
5
1
8
4
7
8
1
2
/
C
oh
yo
i
_
a
_
0
0
3
6
9
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
δ_{d_{ℓ1}} ∝ |{w^{(ℓ1)}} ∩ {w^{(ℓ2)}}| / |{w^{(ℓ1)}} ∪ {w^{(ℓ2)}}|    (25)

where {w^{(ℓ)}} contains all the word types that appear in document d_ℓ, and {w^{(ℓ1)}} ∩ {w^{(ℓ2)}}
indicates all word pairs (w^{(ℓ1)}, w^{(ℓ2)}) in a dictionary as translations. Thus, δ_{d_{ℓ1}}
can be interpreted as the “probability” of d_{ℓ1} being the translation of d_{ℓ2}. We call δ the
transfer distribution. See Figure 9 for an illustration.
Figure 9
An example of how different inputs to the transfer operation result in different Dirichlet priors
through DOCLINK and SOFTLINK. The middle is a mini-corpus in language ℓ1 and each
document’s topic histogram. When a document in ℓ2 is not a translation of any of those in ℓ1,
DOCLINK defines δ as an all-zero vector, which leads to an uninformative symmetric prior. In
contrast, SOFTLINK uses a dictionary to create δ as a distribution so that the topic histogram of
each document in ℓ1 can still be proportionally transferred.
In our initial work, we show that instead of a dense distribution, it is more efficient
to make the transfer distributions sparse by thresholding,
˜δ_{d_{ℓ1}} ∝ 1{δ_{d_{ℓ1}} > π · max(δ)} · δ_{d_{ℓ1}}    (26)

where π ∈ [0, 1] is a fixed threshold parameter. With the same definition of N^{(ℓ1)} and α
as in Equation (16) and δ defined as in Equation (25), SOFTLINK completes the same transfer
operations,

h_{θ_{d,ℓ2}}(δ, N^{(ℓ1)}, α) = δ · N^{(ℓ1)} + α    (27)

h_{φ^{(ℓ2,k)}}(0, N^{(ℓ1)}, β^{(ℓ2)}) = 0 · N^{(ℓ1)} + β^{(ℓ2)} = β^{(ℓ2)}    (28)
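The sketch below computes a SOFTLINK transfer distribution for one target document and plugs it into the same document-level operation as DOCLINK (Equations (25)-(27)). The toy dictionary and documents are invented; the threshold matches the π = 0.8 setting reported later in Table 5.

    import numpy as np

    def softlink_delta(target_types, source_docs_types, dictionary, pi=0.8):
        """Equations (25)-(26): dictionary-based transfer distribution with thresholding."""
        scores = []
        for src_types in source_docs_types:
            pairs = {(w1, w2) for w1 in src_types for w2 in target_types}
            overlap = len(pairs & dictionary)            # word pairs that are dictionary translations
            union = len(src_types | target_types)        # pooled vocabulary of the two documents
            scores.append(overlap / union if union else 0.0)
        delta = np.array(scores)
        if delta.max() > 0:
            delta = np.where(delta > pi * delta.max(), delta, 0.0)   # Equation (26)
            delta = delta / delta.sum()
        return delta

    dictionary = {("hund", "dog"), ("katt", "cat"), ("fisk", "fish")}   # (l1, l2) translation pairs
    target = {"dog", "cat", "fish"}
    sources = [{"hund", "katt", "bil"}, {"fisk", "träd"}, {"stad", "gata"}]
    delta = softlink_delta(target, sources, dictionary)
    # Equation (27): h_theta = delta . N^(l1) + alpha, exactly as in DOCLINK but with this soft delta.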
4.3 Summary: Transfer Levels and Transfer Models
We categorize transfer operations into two groups based on the target transfer distribu-
ción. Document-level operations transfer knowledge on distributions related to the entire
documento, such as θ in DOCLINK, C-BILDA, and SOFTLINK, and η in C-BILDA. Word-level
operations transfer knowledge on those related to the entire vocabulary or specific word
types, such as φ in VOCLINK.
When a model only has transfer operations on just one specific level, we also use
the transfer level to refer to the model. For example, DOCLINK, C-BILDA, and SOFTLINK are
all document-level models, while VOCLINK is a word-level model. Those that transfer
knowledge on multiple levels, such as Hu et al. (2014b), are called mixed-level models.
We summarize the transfer operation definitions for different models in Table 2,
and add monolingual LDA as a reference to show how transfer operations are defined
when no transfer takes place. We will experiment on the four multilingual models in
Secciones 4.1.1 a través de 4.2.
5. Experiment Settings
From the discussion above, we are able to describe various multilingual topic models by
defining different transfer operations, which explicitly represent the language transfer
proceso. When designing and applying those transfer operations in practice, alguno
Table 2
Summary of transfer operations defined in the compared models, where we assume the
direction of transfer is from ℓ1 to ℓ2.

Model      Document level                               Word level                  Parameters of h
LDA        α                                            β^{(ℓ2)}                    —
DOCLINK    δ · N^{(ℓ1)} + α                             β^{(ℓ2)}                    δ: indicator vector; N^{(ℓ1)}: doc-by-topic matrix; supervision: comparable documents
C-BILDA    δ · N^{(ℓ1)} + α,  δ · N^{(ℓ1)}_k + χ^{(d)}  β^{(ℓ2)}                    δ: indicator vector; N^{(ℓ1)}: doc-by-topic matrix; supervision: comparable documents
SOFTLINK   δ · N^{(ℓ1)} + α                             β^{(ℓ2)}                    δ: transfer distribution; N^{(ℓ1)}: doc-by-topic matrix; supervision: dictionary
VOCLINK    α                                            δ · N^{(ℓ1)} + β^{(r,ℓ2)}   δ: indicator vector; N^{(ℓ1)}: node-by-word matrix; supervision: dictionary
natural questions arise, such as which transfer operation is more effective in what
type of situation, and how to design a model that is more generalizable regardless of
availability of multilingual resources.
To study the model behaviors empirically, we train the four models described
in the previous section—DOCLINK, C-BILDA, SOFTLINK, and VOCLINK—in ten lan-
guages. Considering the resources available, we separate the ten languages into two
groups: high-resource languages (HIGHLAN) and low-resource languages (LOWLAN).
For HIGHLAN, we have relatively abundant resources such as dictionary entries and
document translations. We additionally use these languages to simulate the settings of
LOWLAN by training multilingual topic models with different amounts of resources. For
LOWLAN, we use all resources available to verify experiment results and conclusions
from HIGHLAN.
5.1 Language Groups and Preprocessing
We separate the ten languages into two groups: HIGHLAN and LOWLAN. In this section,
we describe the preprocessing details of these languages.
5.1.1 HIGHLAN. Languages in this group have a relatively large amount of resources,
and have been widely experimented on in multilingual studies. Considering language
diversidad, we select representative languages from five different families: Arábica (AR,
Semitic), Alemán (DE, Germanic), Español (ES, Romance), Russian (RU, Slavic), y
Chino (ZH, Sinitic). We follow standard preprocessing procedures: We first use stem-
mers to process both documents and dictionaries (segmenter for Chinese), then we
remove stopwords based on a fixed list and the most 100 frequent word types in the
training corpus. The tools for preprocessing are listed in Table 3.
5.1.2 LOWLAN. Languages in this group have much fewer resources than those in HIGH-
LAN, considered as low-resource languages. We similarly select five languages from
different families: Amharic (AM, Afro-Asiatic), Aymara (AY, Aymaran), Macedonian
(MK, Indo-European), Swahili (SW, Niger-Congo), and Tagalog (TL, Austronesian). Nota
Table 3
List of sources of stemmers and stopwords used in the experiments for HIGHLAN.

Language   Family     Stemmer                          Stopwords
EN         Germanic   SnowBallStemmer 3                NLTK
DE         Germanic   SnowBallStemmer                  NLTK
ES         Romance    SnowBallStemmer                  NLTK
RU         Slavic     SnowBallStemmer                  NLTK
AR         Semitic    Assem’s Arabic Light Stemmer 4   GitHub 5
ZH         Sinitic    Jieba 6                          GitHub
that some of these are not strictly “low-resource” compared with many endangered
languages. For the truly low-resource languages, it is very difficult to test the models
with enough data, and, therefore, we choose languages that are understudied in natural
language processing literature.
Preprocessing in this language group needs more consideration. Because they rep-
resent low-resource languages that most natural language processing tools are not
available for, we do not use a fixed stopword list. Stemmers are also not available for
these languages, so we do not apply stemming.
5.2 Training Sets and Model Configurations
There are many resources available for multilingual research, such as the European
Parliament Proceedings parallel corpus (EUROPARL; Koehn 2005), the Bible, y
Wikipedia. EuroParl provides a perfectly parallel corpus with precise translations, pero
it only contains 21 European languages, which limits its generalizability to most of the
languages. The Bible, on the other hand, is also perfectly parallel and is widely available
en 2,530 languages.7 Its disadvantages, sin embargo, are that the contents are very limited
(mostly about family and religion), the data set size is small (1,189 chapters), and many
languages do not have digital format (Christodoulopoulos and Steedman 2015).
Compared with EUROPARL and the Bible, Wikipedia provides comparable docu-
ments in many languages with a large range of content, making it a very popular choice
for many multilingual studies. In our experiments, we create ten bilingual Wikipedia
corpora, each containing documents in one of the languages in either HIGHLAN or
LOWLAN, paired with documents in English (EN). Though most multilingual topic
models are not restricted to training bilingual corpora paired with English, this is a
helpful way to focus our experiments and analysis.
We present the statistics of the training corpus of Wikipedia and the dictionary we
use (from Wiktionary) in the experiments in Table 4. Note that we train topic models on
3 http://snowball.tartarus.org.
4 http://arabicstemmer.com.
5 https://github.com/6/stopwords-json.
6 https://github.com/fxsjy/jieba.
7 https://www.unitedbiblesocieties.org/.
Table 4
Statistics of the training Wikipedia corpus and Wiktionary.

                 English (EN)                      Paired language                   Wiktionary
           #docs   #tokens     #types       #docs   #tokens    #types        #entries
HIGHLAN
  AR       2,000   616,524      48,133      2,000   181,946    25,510        16,127
  DE       2,000   332,794      35,921      2,000   254,179    55,610        32,225
  ES       2,000   369,181      37,100      2,000   239,189    30,258        31,563
  RU       2,000   410,530      39,870      2,000   227,987    37,928        33,574
  ZH       2,000   392,745      38,217      2,000   168,804    44,228        23,276
LOWLAN
  AM       2,000   3,589,268   161,879      2,000   251,708    65,368         4,588
  AY       2,000   1,758,811    84,064      2,000   169,439    24,136         1,982
  MK       2,000   1,777,081   100,767      2,000   489,953    87,329         6,895
  SW       2,000   2,513,838   143,691      2,000   353,038    46,359        15,257
  TL       2,000   2,017,643   261,919      2,000   232,891    41,618         6,552
bilingual pairs, where one of the languages is always English, so in the table we show
statistics of English in every bilingual pair as well.
Finally, we summarize the model configurations in Table 5. The goal of this study
is to bring current multilingual topic models together, studying their corresponding
strengths and limitations. To keep the experiments as comparable as possible, we use
constant hyperparameters that are consistent across the models. For all models, we set
the Dirichlet hyperparameter αk = 0.1 for each topic k = 1, . . . , k. We run 1,000 Gibbs
sampling iterations on the training set and 200 iterations on the test sets. The number of
topics K is set to 20 by default for efficiency reasons.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
C
oh
yo
i
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
/
/
/
4
6
1
9
5
1
8
4
7
8
1
2
/
C
oh
yo
i
_
a
_
0
0
3
6
9
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Table 5
Model specifications.
Modelo
Hyperparameters
DOCLINK We set β to be a symmetric vector where each cell βi = 0.01 for all
word types of all the languages, and use the MALLET implemen-
tation for training (McCallum 2002). To enable consistent com-
parison, we disable hyperparameter optimization provided in the
package.
C-BILDA
Following the experiment results from Heyman, Vulic, and Moens
(2016), we set χ = 2 to make the results more competitive to
DOCLINK. The rest of the settings are the same as for DOCLINK.
SOFTLINK We use the document-wise thresholding approach for calculating
the transfer distributions. The focus threshold is set to 0.8. The rest
of the settings are the same as for DOCLINK.
VOCLINK We set the scalar β′ = 0.01 for the hyperparameter β^{(r,ℓ)} from the root
to both internal nodes or leaves. For those from internal nodes to
leaves, we set β′′ = 100, following the settings in Hu et al. (2014b).
5.3 Evaluation
We evaluate all models using both intrinsic and extrinsic metrics. Intrinsic evaluation
is used to measure the topic quality or coherence learned from the training set, and
extrinsic evaluation measures performance after applying the trained distributions to
downstream crosslingual applications. For all the following experiments and tasks,
we start by analyzing languages in HIGHLAN. Then we apply the analyzed results to
LOWLAN.
We choose topic coherence (Hao, Boyd-Graber, and Paul 2018) and crosslingual
document classification (Smet, Tang, and Moens 2011) as intrinsic and extrinsic eval-
uation tasks, respectivamente. The reason for choosing these two tasks is that they examine
the models from different angles: Topic coherence looks at topic-word distributions,
whereas classification focuses on document-topic distributions. Other evaluation tasks,
such as word translation detection and crosslingual information retrieval, also utilize
the trained distributions, but here we focus on a straightforward and representative
task.
5.3.1 Intrinsic Evaluation: Topic Quality. Intrinsic evaluation refers to evaluating the
learned model directly without applying it to any particular task; for topic models,
this is usually based on the quality of the topics. Standard evaluation measures for
monolingual models, such as perplexity (or held-out likelihood; Wallach et al. 2009) and
Normalized Pointwise Mutual Information (NPMI; Lau, Newman, and Baldwin 2014),
could potentially be considered for crosslingual models. However, when evaluating
multilingual topics, how words in different languages make sense together is also a
critical criterion in addition to coherence within each of the languages.
In monolingual studies, Chang et al. (2009) show that held-out likelihood is not
always positively correlated with human judgments of topics. Held-out likelihood is
additionally suboptimal for multilingual topic models, because this measure is only
calculated within each language, and the important crosslingual information is ignored.
Crosslingual Normalized Pointwise Mutual Information (CNPMI; Hao, Boyd-
Graber, and Paul 2018) is a measure designed specifically for multilingual topic models.
Extended from the widely used NPMI to measure topic quality in multilingual set-
tings, CNPMI uses a parallel reference corpus to extract crosslingual coherence. CNPMI
correlates well with bilingual speakers’ judgments on topic quality and predictive
performance in downstream applications. Therefore, we use CNPMI for intrinsic
evaluations.
Definition 2 (Crosslingual Normalized Pointwise Mutual Information, CNPMI)
Let W^{(ℓ1,ℓ2)}_C be the set of top C words in a bilingual topic, and R^{(ℓ1,ℓ2)} a parallel reference
corpus. The CNPMI of this topic is calculated as
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
C
oh
yo
i
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
/
/
/
4
6
1
9
5
1
8
4
7
8
1
2
/
C
oh
yo
i
_
a
_
0
0
3
6
9
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
CNPMI(W^{(ℓ1,ℓ2)}_C) = (1 / C²) Σ_{wi,wj ∈ W^{(ℓ1,ℓ2)}_C} [ log( Pr(wi, wj) / (Pr(wi) Pr(wj)) ) / (−log Pr(wi, wj)) ]    (29)
where $w_i$ and $w_j$ are from languages $\ell_1$ and $\ell_2$, respectively. Let $d = \left(d_{\ell_1}, d_{\ell_2}\right)$ be a pair of
parallel documents from the reference corpus $\mathcal{R}^{(\ell_1,\ell_2)}$, whose size is denoted as $\left|\mathcal{R}^{(\ell_1,\ell_2)}\right|$.
$\left|\left\{d : w_i \in d_{\ell_1}, w_j \in d_{\ell_2}\right\}\right|$ is the number of parallel document pairs in which $w_i$ and $w_j$
appear. The co-occurrence probability of a word pair and the probability of a single
word are calculated as
$$\Pr\left(w_i, w_j\right) \triangleq \frac{\left|\left\{d : w_i \in d_{\ell_1}, w_j \in d_{\ell_2}\right\}\right|}{\left|\mathcal{R}^{(\ell_1,\ell_2)}\right|} \tag{30}$$

$$\Pr\left(w_i\right) \triangleq \frac{\left|\left\{d : w_i \in d_{\ell_1}\right\}\right|}{\left|\mathcal{R}^{(\ell_1,\ell_2)}\right|} \tag{31}$$
Intuitivamente, a coherent topic should contain words that make sense or fit in a spe-
cific context together. In the multilingual case, CNPMI measures how likely it is that
a bilingual word pair appears in a similar context provided by the parallel reference
cuerpo. We provide toy examples in Figure 10, where we show three bilingual topics.
In Topic A, both languages are about “language,” and all the bilingual word pairs have
high probability of appearing in the same comparable document pairs. Topic A is
therefore coherent crosslingually and is expected to have a high CNPMI score. Although we
can identify the themes within each language in Topic B, eso es, education in English
and biology in Swahili, most of the bilingual word pairs do not make sense or appear in
the same context, which gives us a low CNPMI score. The last topic is not coherent even
within each language, so it has low CNPMI as well. Through this example, we see that
CNPMI detects crosslingual coherence in multiple ways, unlike other intrinsic measures
that might be adapted for crosslingual models.
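To make the computation concrete, the following sketch estimates CNPMI for a single bilingual topic from a list of parallel reference document pairs. The function and variable names are ours, and the handling of word pairs that never co-occur is an assumed convention rather than the authors' implementation.

```python
from itertools import product
from math import log

def cnpmi(topic_l1, topic_l2, ref_pairs):
    """Estimate CNPMI for one bilingual topic (Equations 29-31).

    topic_l1, topic_l2: the top-C topic words in languages l1 and l2.
    ref_pairs: list of (doc_l1, doc_l2) parallel documents, each given as a set of word types.
    """
    n = len(ref_pairs)
    total = 0.0
    for wi, wj in product(topic_l1, topic_l2):
        # Document-level probabilities estimated from the parallel reference corpus.
        p_i = sum(1 for d1, _ in ref_pairs if wi in d1) / n
        p_j = sum(1 for _, d2 in ref_pairs if wj in d2) / n
        p_ij = sum(1 for d1, d2 in ref_pairs if wi in d1 and wj in d2) / n
        if p_ij == 0.0:
            continue  # assumption: pairs that never co-occur contribute 0 (conventions vary)
        p_ij = min(p_ij, 1.0 - 1e-12)  # guard against -log(1) = 0 in the denominator
        total += log(p_ij / (p_i * p_j)) / (-log(p_ij))
    return total / (len(topic_l1) * len(topic_l2))
```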
En nuestros experimentos, we use 10, 000 linked Wikipedia article pairs for each language
pair (EN, (cid:96)) (20, 000 en total) as the reference corpus, and set C = 10 by default. Nota
that HIGHLAN has more Wikipedia articles, and we make sure the articles used for
evaluating CNPMI scores do not appear in the training set. Sin embargo, for LOWLAN,
because the number of linked Wikipedia articles is extremely limited, we use all the
available pairs to evaluate CNPMI scores. The statistics are shown in Table 6.
Cifra 10
CNPMI measures how likely a bilingual word pair appears in a similar context in two languages,
provided by a reference corpus. Topic A has a high CNPMI score because both languages are
talking about the same theme. Both Topic B and Topic C are incoherent multilingual topics,
although Topic B is coherent within each language.
[Figure 10 examples: Topic A (English-Amharic), CNPMI = 0.3632; Topic B (English-Swahili), CNPMI = 0.0094; Topic C (English-Macedonian), CNPMI = 0.0643.]
Mesa 6
Statistics of Wikipedia corpus for topic coherence evaluation (CNPMI).
#docs
10,000
10,000
10,000
10,000
10,000
AR
DE
ES
RU
ZH
Inglés
#tokens
3,597,322
2,155,680
3,021,732
3,016,795
1,982,452
4,316
AM
4,187
AY
10,000
MK
SW 10,000
6,471
TL
9,632,700
5,231,260
11,080,304
13,931,839
7,720,517
#types
128,926
103,812
149,423
154,442
112,174
269,772
167,531
301,026
341,231
645,534
HIGHLAN
LOWLAN
Paired language
#tokens
#docs
#types
10,000
10,000
10,000
10,000
10,000
4,316
4,187
10,000
10,000
6,471
996,801
1,459,015
1,737,312
2,299,332
1,335,922
403,158
280,194
3,175,182
1,755,514
1,124,049
64,197
166,763
142,086
284,447
144,936
91,295
32,424
245,687
134,152
83,967
5.3.2 Extrinsic Evaluation: Crosslingual Classification. Crosslingual document classification
is the most common downstream application for multilingual topic models (Smet, Tang,
and Moens 2011; Vulić et al. 2015; Heyman, Vulic, and Moens 2016). Typically, a model
is trained on a multilingual training set D^(ℓ1,ℓ2) in languages ℓ1 and ℓ2. Using the trained
topic-vocabulary distributions φ, the model infers topics in test sets D′^(ℓ1) and D′^(ℓ2).
In multilingual topic models, document-topic distributions θ can be used as features
for classification, where the inferred $\hat{\theta}_{d,\ell_1}$ vectors in language ℓ1 train a classifier tested by the
$\hat{\theta}_{d,\ell_2}$ vectors in language ℓ2. A better classification performance indicates more consistent
features across languages. See Figure 11 for an illustration. In our experiments,
we use a linear support vector machine to train multilabel classifiers with five-fold
cross-validation. Then, we use micro-averaged F-1 scores to evaluate and compare
performance across different models.
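As a sketch of this protocol using scikit-learn (our own choice of library; variable names such as `theta_l1` stand for the inferred document-topic vectors and are hypothetical), a multilabel one-vs-rest linear SVM is fit on language ℓ1 and scored with micro-averaged F-1 on language ℓ2; the five-fold cross-validation reported above is omitted here for brevity:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def crosslingual_f1(theta_l1, labels_l1, theta_l2, labels_l2):
    """Train on inferred document-topic vectors in language l1, test in l2.

    theta_l1, theta_l2: arrays of shape (num_docs, K) with inferred topic proportions.
    labels_l1, labels_l2: lists of label sets, e.g. {"technology", "culture"}.
    """
    binarizer = MultiLabelBinarizer().fit(labels_l1 + labels_l2)
    y_train = binarizer.transform(labels_l1)
    y_test = binarizer.transform(labels_l2)

    # One linear SVM per label (one-vs-rest) gives a multilabel classifier.
    clf = OneVsRestClassifier(LinearSVC())
    clf.fit(np.asarray(theta_l1), y_train)
    y_pred = clf.predict(np.asarray(theta_l2))
    return f1_score(y_test, y_pred, average="micro")
```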
For crosslingual classification, we also require held-out test data with labels or
anotaciones. En nuestros experimentos, we construct test sets from two sources: TED Talks
2013 (TED) and Global Voices (GV). TED contains parallel documents in all languages in
HIGHLAN, whereas GV contains all languages from both HIGHLAN and LOWLAN.
Cifra 11
An illustration of crosslingual document classification. After training multilingual topic models,
θ of unseen documents in both
the topics,
θd, (cid:96)1 as features and the labels y
idiomas. A classifier is trained with the inferred distributions
in language (cid:96)1, and predicts labels in language (cid:96)2.
are used to infer document-topic distributions
Fi((cid:96), k)
{
(cid:98)
(cid:98)
}
(cid:98)
Using the two multilingual sources, we create two types of test sets for HIGHLAN—
TED + TED and TED + GV, and only one type for LOWLAN—TED+GV. In TED+TED, nosotros
infer document-topic distributions on documents from TED in English and the paired
idioma. This only applies to HIGHLAN, because TED do not have documents in
LOWLAN. In TED+GV, we infer topics on English documents from TED, and infer topics
on documents from GV in the paired language (both HIGHLAN and LOWLAN). El
two types of test sets also represent different application situations. TED + TED implies
that the test documents in both languages are parallel and come from the same source,
whereas TED + GV represents how the topic model performs when the two languages
have different data sources.
Both corpora are retrieved from http://opus.nlpl.eu/ (Tiedemann 2012). The la-
bels, sin embargo, are manually retrieved from http://ted.com/ and http://globalvoices.
org/. In TED corpus, each document is a transcript of a talk and is assigned to multiple
categories on the Web page, such as “technology,” “arts,” and so forth. We collect all
categories for the entire TED corpus, and use the three most frequent categories—
tecnología, cultura, science—as document labels. Similarmente, in GV corpus, each document
is a news story, and has been labeled with multiple categories on the Web page of the
story. Because the two sets in TED + GV are from different sources, and training and
testing are only possible when both sets share the same labels, we apply the same three
labels from TED to GV as well. This processing requires minor mappings, Por ejemplo,
from “arts-culture” in GV to “culture” in TED. The data statistics are presented in Table 7.
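A minimal sketch of this label harmonization (the mapping dictionary is illustrative; only the "arts-culture" → "culture" entry is taken from the text):

```python
# Collapse Global Voices categories onto the three TED labels used for classification.
# Only "arts-culture" -> "culture" is taken from the text; the other keys are illustrative.
GV_TO_TED = {
    "technology": "technology",
    "arts-culture": "culture",
    "science": "science",
}
TED_LABELS = {"technology", "culture", "science"}

def harmonize(gv_categories):
    """Map a GV document's category list onto the shared TED label set."""
    mapped = {GV_TO_TED.get(c.lower()) for c in gv_categories}
    return sorted(mapped & TED_LABELS)
```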
6. Document-Level Transfer and Its Limitations
We first explore the empirical characteristics of document-level transfer, using DOC-
LINK, C-BILDA, and SOFTLINK.
Multilingual corpora can be loosely categorized into three types: parallel, comparable, and incomparable.
Mesa 7
Statistics of TED Talks 2013 (TED) and Global Voices (GV) cuerpo.
Corpus statistics
#types
#tokens
#docs
Label distributions
#tecnología
cultura
ciencia
AR
DE
ES
RU
ZH
AR
DE
ES
RU
ZH
AM
AY
MK
SW
TL
1,112
1,063
1,152
1,010
1,123
2,000
1,481
2,000
2,000
2,000
39
674
1,992
1,383
254
1,066,754
774,734
933,376
831,873
1,032,708
325,879
269,470
367,631
488,878
528,370
10,589
66,076
388,713
359,066
26,072
15,124
19,826
13,088
17,020
19,594
13,072
16,031
11,104
16,157
18,194
4,047
4,939
29,022
14,072
6,138
384
364
401
346
386
510
346
457
516
499
3
76
343
137
32
304
289
312
275
315
489
344
387
369
366
3
100
426
110
67
290
276
295
261
290
33
42
38
62
56
1
46
182
71
19
TED
GV
(HIGHLAN)
GV
(LOWLAN)
A parallel corpus contains exact document translations across languages, of which EUROPARL and the Bible, discussed before, are examples.
A comparable corpus contains document pairs (in the bilingual case), where each doc-
ument in one language has a related counterpart in the other language. Sin embargo, estos
document pairs are not exact translations of each other, and they can only be connected
through a loosely defined “theme.” Wikipedia is an example, where document pairs are
linked by article titles. Incomparable corpora contain potentially unrelated documents
across languages, with no explicit indicators of document pairs.
With different levels of comparability come different availabilities of such corpora:
It is much harder to find parallel corpora in low-resource languages. Therefore, we
first focus on HIGHLAN, and use Wikipedia to simulate the low-resource situation in
Sección 6.1, where we find that DOCLINK and C-BILDA are very sensitive to the training
cuerpo, and thus might not be the best option when it comes to low-resource languages.
We then examine LOWLAN in Section 6.2.
6.1 Sensitivity to Training Corpus
We first vary the comparability of the training corpus and study how different models
behave under different situations. All models are potentially affected by the compara-
bility of the training set, although only DOCLINK and C-BILDA explicitly rely on this
information to define transfer operations. This experiment shows that models transfer-
ring knowledge on the document level (DOCLINK and C-BILDA) are very sensitive to the
training set, but can be almost entirely insensitive with appropriate modifications to the
transfer operation as in SOFTLINK.
6.1.1 Experiment Settings. For each language pair (EN, ℓ), we construct a random subsample
of 2,000 documents from Wikipedia in each language (4,000 in total). To vary
the comparability, we vary the proportion of linked Wikipedia articles between the two
idiomas, de 0.0, 0.01, 0.05, 0.1, 0.2, 0.4, 0.8, a 1. When the percentage is zero, el
bilingual corpus is entirely incomparable, eso es, no document-level translations can
be found in another language, and DOCLINK and C-BILDA degrade into monolingual
LDAs. The indicator matrix used by transfer operations in Section 4.1.1 is a zero matrix
δ = 0. When the percentage is one, meaning each document from one language is linked
to one document from another language, the corpus is considered fully comparable,
and δ is an identity matrix 1. Any number between 0 y 1 makes the corpus partially
comparable to different degrees. The CNPMI and crosslingual classification results are
como se muestra en la figura 12, and the shades indicate the standard deviations across five Gibbs
sampling chains. For VOCLINK and SOFTLINK, we use all the dictionary entries.
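As a concrete picture of this setup, the sketch below builds the indicator matrix δ for a given proportion of kept links; the function and argument names are our own, and the construction is a simplified reading of the experiment, not the authors' code.

```python
import numpy as np

def build_doc_link_matrix(num_docs_l1, num_docs_l2, linked_pairs, proportion, seed=0):
    """Indicator matrix delta for DOCLINK/C-BILDA under partial comparability.

    linked_pairs: list of (i, j) indices of Wikipedia articles linked across languages.
    proportion:   fraction of links kept (0.0 -> monolingual LDA, 1.0 -> fully comparable).
    """
    rng = np.random.default_rng(seed)
    keep = rng.choice(len(linked_pairs),
                      size=int(proportion * len(linked_pairs)),
                      replace=False)
    delta = np.zeros((num_docs_l2, num_docs_l1), dtype=np.int8)
    for idx in keep:
        i, j = linked_pairs[idx]
        delta[j, i] = 1   # document j in l2 is supervised by document i in l1
    return delta
```

With proportion = 0 this yields the zero matrix (monolingual LDAs), and with proportion = 1 and one-to-one links it yields (a permutation of) the identity matrix described above.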
6.1.2 Resultados. In terms of topic coherence (CNPMI), both DOCLINK and C-BILDA have
competitive performance on CNPMI, and achieve full potential when the corpus is
fully comparable. Como se esperaba, models transferring knowledge at the document level
(DOCLINK and C-BILDA) are very sensitive to the training corpus: The more aligned
the corpus is, the better topics the model learns. For the word-level model, VOCLINK
roughly stays at the same performance level, which is also expected, because this model
does not use linked documents as supervision. Sin embargo, its performance on Russian is
surprisingly low compared with other languages and models. In the next section, nosotros
will look closer at this problem by investigating the impact of dictionaries.
It is notable that SOFTLINK, a document-level model, is also insensitive to the
training corpus and outperforms other models most of the time.
Cifra 12
Both SOFTLINK and VOCLINK stay at a stable performance level of either CNPMI or F-1 scores,
whereas DOCLINK and C-BILDA expectedly have better performance as there are more linked
Wikipedia articles.
Recall that on the document level, SOFTLINK defines the transfer operation on document-topic distributions θ, similarly to DOCLINK and C-BILDA, but using dictionary resources. This implies that
good design of the supervision δ in the transfer operation could lead to a more stable
performance across different training situations.
When it comes to the classification task, the F-1 scores of DOCLINK and C-BILDA
have very large variations, and the increasing trend of F-1 scores is less obvious than
with CNPMI. This is especially true when the percentage of linked documents is very
pequeño. For one, when the percentage is small, the transfer on the document level is
less constrained, leaving the projection of two languages into the same topic space
less predictive. The evaluation scope of CNPMI is actually much smaller and more
concentrated than classification, because it only focuses on the top C words, which does
not lead to large variations.
One consistent result we notice is that SOFTLINK still performs well on classifi-
cation with very small variations and stable F-1 scores, which again benefits from
the definition of transfer operation in SOFTLINK. When transferring topics to another
idioma, SOFTLINK uses dictionary constraints as in VOCLINK, but instead of a simple
one-on-one word type mapping, it expands the transfer scope to the entire document.
Además, SOFTLINK distributionally transfers knowledge from the entire corpus in
another language, which actually reinforces the transfer efficiency without relying on
direct supervision at the document level.
6.2 Performance on LOWLAN
En esta sección, we take a look at languages in LOWLAN. For SOFTLINK and VOCLINK,
we use all dictionary entries to train languages in LOWLAN, because the sizes of
dictionaries in these languages are already very small. We again use a subsample of
2, 000 Wikipedia document pairs with English to make the results comparable with
HIGHLAN. En figura 13(a), we also present results of models for HIGHLAN using fully
comparable training corpora and full dictionaries for direct comparison of the effect of
language resources.
In most cases, transfer on document level (particularly C-BILDA) performs better
than on word levels, in both HIGHLAN and LOWLAN. Considering the number of
dictionary entries available from Table 4, it is reasonable to suspect that the dictionary
is a major factor affecting the performance of word-level transfer.
Por otro lado, although SOFTLINK does not model vocabularies directly as in
VOCLINK, transferring knowledge at the document level with a limited dictionary still
yields competitive CNPMI scores. Por lo tanto, in this experiment on LOWLAN, we see that
with the same lexicon resource, it is generally more efficient to transfer knowledge at the
document level. We will also explore this in detail in Section 7.
We also present a comparison of micro-averaged F-1 scores between HIGHLAN
and LOWLAN in Figure 13(b). The test set used for this comparison is TED + GV, desde
TED does not have articles available in LOWLAN. También, languages such as Amharic
(AM) have fewer than 50 GV articles available, which is an extremely small number for
training a robust classifier, so in these experiments, we only train classifiers on English
(TED articles) and test them on languages in HIGHLAN and LOWLAN (GV articles).
Similarmente, the classification results are generally better in document-level transfer,
and both C-BILDA and SOFTLINK give similar scores. Sin embargo, it is worth noting that
VOCLINK has very large variations in all languages, and the F-1 scores are very low.
This again suggests that transferring knowledge on the word level is less effective, y
en la sección 7 we study in detail why this is the case.
7. Word-Level Transfer and Its Limitations
In the previous section, we compared different multilingual topic models with a focus
on document-level models. We draw conclusions that DOCLINK and C-BILDA are very
sensitive to the training corpus, which is natural due to their definition of supervision
as a one-to-one document pair mapping. Por otro lado, the word-level model
VOCLINK in general has lower performance, especially with LOWLAN, even if the
corpus is entirely comparable.
(a) CNPMI score comparison of different models and languages with cardinality C = 10.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
C
oh
yo
i
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
/
/
/
4
6
1
9
5
1
8
4
7
8
1
2
/
C
oh
yo
i
_
a
_
0
0
3
6
9
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
(b) Micro-averaged F-1 scores of different models and languages on TED+GV corpora.
Cifra 13
Topic quality evaluation and classification performance on both HIGHLAN and LOWLAN. Nosotros
notice that VOCLINK has lower CNPMI and F-1 scores in general, with large standard deviations.
C-BILDA, por otro lado, outperforms other models in most of the languages.
One interesting result we observed from the previous section is that SOFTLINK
and VOCLINK use the same dictionary resource while transferring topics on different
niveles, and SOFTLINK generally has better performance than VOCLINK. Por lo tanto, en esto
sección, we explore the characteristics of the word-level model VOCLINK and compare it
with SOFTLINK to study why it does not use the same dictionary resource as effectively.
Para tal fin, we first vary the amount of dictionary entries available and compare
how SOFTLINK and VOCLINK perform (Sección 7.1). Based on the results, we analyze
word-level transfer from three different angles: dictionary usage (Sección 7.2) as an
intuitive explanation of the models, topic analysis (Sección 7.3) from a more qualitative
perspectiva, and comparing transfer strength (Sección 7.4) as a quantitative analysis.
7.1 Sensitivity to Dictionaries
Word-level models such as VOCLINK use a dictionary as supervision, and thus will
naturally be affected by the dictionary used. Although SOFTLINK transfers knowledge
on the document level, it uses the dictionary to calculate the transfer distributions used
in its document-level transfer operation. En esta sección, we focus on the comparison of
SOFTLINK and VOCLINK.
7.1.1 Sampling the Dictionary Resource. The dictionary is the essential part of SOFTLINK
and VOCLINK and is used in different ways to define transfer operations. The availabil-
ity of dictionaries, sin embargo, varies among different languages. From Table 4, we notice
that for LOWLAN the number of available dictionary entries is very limited, cual
suggests it could be a major factor affecting the performance of word-level topic models.
Por lo tanto, in this experiment, we sample different numbers of dictionary entries in
HIGHLAN to study how this alters performance of SOFTLINK and VOCLINK.
Given a bilingual dictionary, we add only a proportion of entries in it to SOFTLINK
and VOCLINK. As in the previous experiments varying the proportion of document
Enlaces, we change the proportion from 0, 0.01, 0.05, 0.1, 0.2, 0.4, 0.8, a 1.0. Cuando el
proportion is 0, both SOFTLINK and VOCLINK become monolingual LDA and no transfer
happens; when the proportion is 1, both models reach their highest potential with all the
dictionary entries available.
We also sample the dictionary in two ways: random- and frequency-based. In
random-based, the entries are randomly chosen from the dictionary, and the five chains
have different entries added to the models. In frequency-based, we select the most
frequent word types from the training corpus.
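A sketch of the two sampling schemes (our own helper; the frequency ranking keys on the English side of each entry, which is one reading of the text):

```python
import random
from collections import Counter

def sample_dictionary(entries, proportion, scheme, corpus_counts=None, seed=0):
    """Keep a proportion of bilingual dictionary entries.

    entries: list of (english_word, foreign_word) pairs.
    scheme:  "random"    -> uniform sample; vary `seed` across chains;
             "frequency" -> keep entries whose English side is most frequent
                            in the training corpus (an assumption about the ranking key).
    """
    n = int(proportion * len(entries))
    if scheme == "random":
        rng = random.Random(seed)
        return rng.sample(entries, n)
    if scheme == "frequency":
        counts = corpus_counts or Counter()
        ranked = sorted(entries, key=lambda e: counts.get(e[0], 0), reverse=True)
        return ranked[:n]
    raise ValueError(f"unknown scheme: {scheme}")
```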
Cifra 14 shows a detailed comparison among different evaluations and languages.
Como se esperaba, adding more dictionary entries helps both SOFTLINK and VOCLINK, con
increasing CNPMI scores and F-1 scores in general. Sin embargo, we notice that adding more
dictionary entries can boost SOFTLINK’s performance very quickly, whereas the increase
in VOCLINK’s CNPMI scores is slower. Similar trends can be observed in the classification
task as well, where adding more words does not necessarily increase VOCLINK’s F-1
puntuaciones, and the variations are very high.
This comparison provides an interesting insight into increasing lexical resources
efficiently. In some applications, especially those related to low-resource languages, the number
of available lexicon resources is very small, and one way to solve this problem is to
incorporate human feedback, such as interactive topic modeling proposed by Hu et al.
(2014a). In our case, a native speaker of the low-resource language could provide word
translations that could be incorporated into topic models. Because of limited time and
financial budget, sin embargo, it is impossible to translate all the word types that appear in
the corpus, so the challenge is how to boost the performance of the target task as much
as possible with less effort from humans. In this comparison, we see that if the target
task is to train coherent multilingual topics, training SOFTLINK is a more efficient choice
than VOCLINK.
Cifra 14
SOFTLINK produces better topics and is more capable of crosslingual classification tasks than
VOCLINK when the number of dictionary entries is very limited.
7.1.2 Varying Comparability of the Corpus. For SOFTLINK and VOCLINK, the dictionary is
only one aspect of the training situation. As discussed in our document-level experi-
mentos, the training corpus is also an important factor that could affect the performance
of all topic models. Although corpus comparability is not an explicit requirement of
SOFTLINK and VOCLINK, the comparability of the corpus might affect the coverage pro-
vided by the dictionary or affect performance in other ways. In SOFTLINK, comparabilidad
could also affect the transfer operator’s ability to find similar documents to link to. En
this section, we study the relationship between dictionary coverage and comparability
of the training corpus.
Similar to the previous section, we vary the dictionary coverage from 0.01, 0.05,
0.1, 0.2, 0.4, 0.8, a 1, using the frequency-based method as in the last experiment. Nosotros
also vary the number of linked Wikipedia articles from 0, 0.01, 0.05, 0.1, 0.2, 0.4, 0.8,
a 1. We present CNPMI scores in Figure 15(a), where the results are averaged over all
five languages in HIGHLAN. It is clear that SOFTLINK outperforms VOCLINK, a pesar de todo
of training corpus and dictionary size. This implies that SOFTLINK could potentially
learn coherent multilingual topics even when the training conditions are unfavorable:
Por ejemplo, when the training corpus is incomparable and there is only a small number
of dictionary entries.
(a) Average CNPMI scores on multilingual topic coherence.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
C
oh
yo
i
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
/
/
/
4
6
1
9
5
1
8
4
7
8
1
2
/
C
oh
yo
i
_
a
_
0
0
3
6
9
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
(b) Multilabel crosslingual document classification F-1 scores in HIGHLAN.
Cifra 15
Adding more dictionary entries has a higher impact on word-level model VOCLINK. SOFTLINK
learns better quality topics than VOCLINK. SOFTLINK also generally performs better on
classification.
The results of crosslingual classification are shown in Figure 15(b). When the test
sets are from the same source (TED + TED), SOFTLINK utilizes the dictionary more effi-
ciently and performs better than VOCLINK. In particular, the F-1 scores of SOFTLINK using
only 20% of the dictionary entries already outperform those of VOCLINK using the full dictio-
nary. A similar comparison can also be drawn when the test sets are from different
sources such as TED + GV.
7.1.3 Discusión. From the results so far, it is empirically clear that transferring knowl-
edge on the word level tends to be less efficient than the document level. This is arguably
counter-intuitive. Recall that the goal of multilingual topic models is to let semantically
related words and translations have similar distributions over topics. The word-level
model VOCLINK directly uses this information—dictionary entries—to define transfer
operaciones, yet its CNPMI scores are lower. In the following sections, por lo tanto, we try to
explain this apparent contradiction. We first analyze the dictionary usage of VOCLINK
(Sección 7.2), and then lead our discussion on the transfer strength comparisons between
document and word levels for all models (Secciones 7.3 y 7.4).
7.2 Dictionary Usage
En la práctica, the assumption of VOCLINK is also often weakened by another important
factor: the presence of word translations in the training corpus. Given a word pair
(w^(ℓ1), w^(ℓ2)), the assumption of VOCLINK is valid only when both words appear in the
training corpus in their respective languages. If w^(ℓ2) is not in D^(ℓ2), w^(ℓ1) will be treated
as an untranslated word instead. Figure 16 shows an example of how tree structures in
VOCLINK are affected by the corpus and the dictionary.
En figura 17, we present the statistics of word types from different sources on
a logarithmic scale. “Dictionary” is the number of word types that appeared in the
original dictionary as shown in the last column of Table 4, and we apply the same
preprocessing to the dictionary as to the training corpus to make sure the quantities are
comparable. “Training set” is the number of word types that appeared in the training
set, and “Linked by VOCLINK” is the number of word types that are actually used in
VOCLINK, that is, the number of non-zero entries in δ in the transfer operation.
Cifra 16
The dictionary used by VOCLINK is affected by its overlap with the corpus. In this example, el
three entries in Dictionary A can all be found in the corpus, so the tree structure has all of them.
Sin embargo, only one entry in Dictionary B can be found in the corpus. Although the Swedish
word “heterotrofa” is also in the dictionary, its English translation cannot be found in the corpus,
so Dictionary B ends up as a tree with only one entry.
Cifra 17
The number of word types that are linked in VOCLINK is far smaller than the number in the
original dictionary, and even than the number of word types in the training sets.
colocar, and “Linked by VOCLINK” is the number of word types that are actually used in
VOCLINK, eso es, the number of non-zero entries in δ in the transfer operation.
Note that even when we use the complete dictionary to create the tree structure in
VOCLINK, in LOWLAN, there are far more word types in the training set than those in
the dictionary. En otras palabras, the supervision matrix δ used by hφ(r,k) is never actually
full rank, y por lo tanto, the full potential of VOCLINK is very difficult to achieve due to
the properties of the training corpus. This situation is as if the document-level model
DOCLINK had only half of the linked documents in the training corpus.
Por otro lado, we notice that in HIGHLAN, the number of word types in
the dictionary is usually comparable to that of the training set (except in AR). Para
LOWLAN, sin embargo, the situation is quite the contrary: There are more word types in the
training set than in the dictionary. De este modo, the availability of sufficient dictionary entries
is especially a problem for LOWLAN.
We conclude from Figure 15(a) that adding more dictionary entries will slowly
improve VOCLINK, but even when there are enough dictionary items, due to model
suposiciones, VOCLINK will not achieve its full potential unless every word in the train-
ing corpus is in the dictionary. A possible solution is to first extract word alignments
from parallel corpora, and then create a tree structure using those word alignments, como
experimented in Hu et al. (2014b). Sin embargo, when parallel corpora are available, nosotros
have shown that document-level models such as DOCLINK work better anyway, y el
accuracy of word aligners is another possible limitation to consider.
7.3 Topic Analysis
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
C
oh
yo
i
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
/
/
/
4
6
1
9
5
1
8
4
7
8
1
2
/
C
oh
yo
i
_
a
_
0
0
3
6
9
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Whereas VOCLINK uses a dictionary to directly model word translations, SOFTLINK
uses the same dictionary to define the supervision in transfer operation differently on
the document level. Experiments show that transferring knowledge on the document
level with a dictionary (es decir., SOFTLINK) is more efficient, resulting in stable and low-
variance topic qualities in various training situations. A natural question is why the
same resource results in different performance on different levels of transfer operations.
To answer this question from another angle, we further look into the actual topics
trained from SOFTLINK and VOCLINK in this section. The general idea is to look into
the same topic output from SOFTLINK and VOCLINK and see what topic words they
+), and what words they have exclusively, denoted as
have in common (denoted as
,VOC are
,VOC for SOFTLINK and VOCLINK, respectivamente. The words in
,SOFT and
W.
−
W.
−
W.
−
W.
those with lower topic coherence and are thus the key to understanding the suboptimal
performance of VOCLINK.
7.3.1 Aligning Topics. To this end, the first step is to align possible topics between
VOCLINK and SOFTLINK, since the initialization of Gibbs samplers is random. Let
$\{\mathcal{W}_k^{\mathrm{VOC}}\}_{k=1}^{K}$ and $\{\mathcal{W}_k^{\mathrm{SOFT}}\}_{k=1}^{K}$ be the K topics learned by VOCLINK and SOFTLINK, respectively,
from the same training conditions. For each topic pair $(k, k')$ we calculate the
Jaccard index $J\left(\mathcal{W}_k^{\mathrm{VOC}}, \mathcal{W}_{k'}^{\mathrm{SOFT}}\right)$, one for each language, and use the average over the two
languages as the matching score $m_{k,k'}$ of the topic pair:

$$m_{k,k'} = \frac{1}{2}\left( J\left(\mathcal{W}_{k,\ell_1}^{\mathrm{VOC}}, \mathcal{W}_{k',\ell_1}^{\mathrm{SOFT}}\right) + J\left(\mathcal{W}_{k,\ell_2}^{\mathrm{VOC}}, \mathcal{W}_{k',\ell_2}^{\mathrm{SOFT}}\right) \right) \tag{32}$$

where $J(X, Y)$ is the Jaccard index between sets X and Y. Thus, there are $K^2$ matching
scores with a number of topics K. We set a threshold of 0.8, so that a matching score is
valid only when it is greater than $0.8 \cdot \max m_{k,k'}$ over all the $K^2$ scores. For each topic k, if
its matching score is valid, we align $\mathcal{W}_k^{\mathrm{VOC}}$ with $\mathcal{W}_{k'}^{\mathrm{SOFT}}$, and treat them as potentially the
same topic. When multiple matching scores are valid, we use the topic with the highest
score and ignore the rest.
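A sketch of this alignment procedure (our own function and variable names; topics are assumed to be given as their top-word sets per language):

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def align_topics(voc_topics, soft_topics, threshold=0.8):
    """Align VOCLINK and SOFTLINK topics by averaged per-language Jaccard index.

    voc_topics, soft_topics: dicts k -> (words_l1, words_l2) for each of the K topics.
    Returns a dict mapping each aligned VOCLINK topic k to a SOFTLINK topic k'.
    """
    scores = {(k, kp): 0.5 * (jaccard(voc_topics[k][0], soft_topics[kp][0]) +
                              jaccard(voc_topics[k][1], soft_topics[kp][1]))
              for k in voc_topics for kp in soft_topics}          # Equation (32)
    cutoff = threshold * max(scores.values())                     # 0.8 * max m_{k,k'}
    alignment = {}
    for k in voc_topics:
        # Keep only the highest-scoring match for topic k, and only if it is valid.
        kp, m = max(((kp, scores[(k, kp)]) for kp in soft_topics), key=lambda x: x[1])
        if m > cutoff:
            alignment[k] = kp
    return alignment
```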
7.3.2 Comparing Document Frequency. Using the approximate alignment algorithm we
described above, we are now able to compare each aligned topic pair between VOCLINK
and SOFTLINK.
For a word type w, we define the document frequency as the percentage of documents
where w appears. A low document frequency of word w implies that w only
appears in a small number of documents. For every aligned topic pair $\left(\mathcal{W}_i, \mathcal{W}_j\right)$, where
$\mathcal{W}_i$ and $\mathcal{W}_j$ are topic word sets from SOFTLINK and VOCLINK, respectively, we have three
sets of topic words derived from this pair:

$$\mathcal{W}_{+} = \mathcal{W}_i \cap \mathcal{W}_j \tag{33}$$

$$\mathcal{W}_{-,\mathrm{VOC}} = \mathcal{W}_j \setminus \mathcal{W}_i \tag{34}$$

$$\mathcal{W}_{-,\mathrm{SOFT}} = \mathcal{W}_i \setminus \mathcal{W}_j \tag{35}$$
Then we calculate the average document frequencies over all the words in each of
the sets, and we show the results in Figure 18.
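For one aligned topic pair, this comparison can be sketched as follows (our own helper; `doc_sets` is assumed to map each word type to the set of training documents containing it):

```python
def avg_doc_frequency(words, doc_sets, num_docs):
    """Average fraction of documents in which each word appears."""
    if not words:
        return 0.0
    return sum(len(doc_sets.get(w, ())) for w in words) / (len(words) * num_docs)

def compare_topic_pair(soft_words, voc_words, doc_sets, num_docs):
    """Document frequencies of W+, W-,VOC and W-,SOFT (Equations 33-35)."""
    w_i, w_j = set(soft_words), set(voc_words)   # W_i from SOFTLINK, W_j from VOCLINK
    return {
        "W+":      avg_doc_frequency(w_i & w_j, doc_sets, num_docs),
        "W-,VOC":  avg_doc_frequency(w_j - w_i, doc_sets, num_docs),
        "W-,SOFT": avg_doc_frequency(w_i - w_j, doc_sets, num_docs),
    }
```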
We observe that the average document frequencies over words in W−,VOC are consistently
lower in every language, whereas those in W+ are higher. This implies that
VOCLINK tends to give rare words higher probability in the topic-word distributions.
In other words, VOCLINK gives high probabilities to words that only appear in specific
contexts, such as named entities. Thus, when evaluating topics using a reference corpus,
the co-occurrence of such words with other words is relatively low due to lack of that
specific context in the reference corpus.

We show an example of an aligned topic in Figure 19. In this example, we see
that although both VOCLINK and SOFTLINK can discover semantically coherent words
shown in W+, VOCLINK focuses more on words that only appear in specific contexts:
There are many words (mostly named entities) in W−,VOC that only appear in
one document. Due to lack of this very specific context in the reference corpora, the
co-occurrence of these words with other more general words is likely to be zero, resulting
in lower CNPMI.
Cifra 18
Average document frequencies of
the triangle markers.
W.
,VOC are generally lower than
−
,SOFT and
−
W.
W.
+, shown in
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
C
oh
yo
i
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
/
/
/
4
6
1
9
5
1
8
4
7
8
1
2
/
C
oh
yo
i
_
a
_
0
0
3
6
9
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Cifra 19
An example of real data showing the topic words of SOFTLINK and VOCLINK. Words that appear
in both models are in
+; words that only appear in SOFTLINK or VOCLINK are included in
,SOFT or
−
W.
−
W.
W.
,VOC, respectivamente.
7.4 Comparing Transfer Strength

While we have looked at the topics to explain what kind of words produced by VOCLINK
make the model's performance lower than SOFTLINK, in this section, we try to
explain why this happens by analyzing their transfer operations. Recall that VOCLINK
defines transfer operations on topic-node distributions $\{\phi_{k,r}\}_{k=1}^{K}$ (Equation (23)), while
SOFTLINK defines transfer on document-topic distributions θ. The difference between
transfer levels with the same resources leads to a suspicion that the document level has a
“stronger” transfer power.
The first question is to understand how this transfer operation actually functions
in the training of topic models. During Gibbs sampling of monolingual LDA, the
conditional distribution for a token, denoted as $\mathcal{P}$, is calculated by conditioning on all
the other tokens and their topics, and can be factorized into two conditionals: document-level
$\mathcal{P}_{\theta}$ and word-level $\mathcal{P}_{\phi}$. Let the current token be of word type w, and $\mathbf{w}_{-}$ and $\mathbf{z}_{-}$
all the other words and their current topic assignments in the corpus. The conditional
$\mathcal{P}$ is then

$$\mathcal{P}_k = \Pr\left(z = k \mid w, \mathbf{w}_{-}, \mathbf{z}_{-}\right) \tag{36}$$

$$\propto \left( n_{k|d} + \alpha_k \right) \cdot \frac{n_{w|k} + \beta_w}{n_{\cdot|k} + \mathbf{1}^{\top}\boldsymbol{\beta}} \tag{37}$$

$$= \mathcal{P}_{\theta_k} \cdot \mathcal{P}_{\phi_k} \tag{38}$$

where $n_{k|d}$ is the number of topic k in document d, $n_{w|k}$ the number of word type w
in topic k, $n_{\cdot|k}$ the number of tokens assigned to topic k, and $\mathbf{1}$ an all-one vector. In
this equation, the final conditional distribution can be treated as a “vote” from the two
conditionals: $\mathcal{P}_{\theta}$ and $\mathcal{P}_{\phi}$ (Yuan et al. 2015). If $\mathcal{P} = \mathcal{P}_{\theta}$, meaning the conditional on the document
level θ dominates the decision of choosing a topic while the conditional on the word
level φ is a uniform distribution, then $\mathcal{P}_{\phi}$ is uninformative.

We apply this idea to multilingual topic models. For a token in language $\ell_2$, we let w
be its word type, and $\mathcal{P}$ can also generally be factorized into two individual conditionals,

$$\mathcal{P}_k = \Pr\left(z = k \mid w, \mathbf{w}_{-}, \mathbf{z}_{-}\right) \tag{39}$$

$$\propto \underbrace{\left[ n_{k|d} + h_{\theta}\left(\delta, \mathbf{n}^{(\ell_1)}, \alpha\right)_k \right]}_{\mathcal{P}_{\mathrm{DOC},k}} \cdot \underbrace{\left[ \frac{n_{w|k} + h_{\phi}\left(\delta', \mathbf{n}^{(\ell_1)}, \beta\right)_w}{n_{\cdot|k} + \mathbf{1}^{\top} h_{\phi}\left(\delta', \mathbf{n}^{(\ell_1)}, \beta\right)} \right]}_{\mathcal{P}_{\mathrm{VOC},k}} \tag{40}$$

$$= \mathcal{P}_{\mathrm{DOC},k} \cdot \mathcal{P}_{\mathrm{VOC},k} \tag{41}$$

where the transfer operation is clearly incorporated into the calculation of the conditional,
and $\mathcal{P}_{\mathrm{DOC}}$ and $\mathcal{P}_{\mathrm{VOC}}$ are conditional distributions on the document and word levels,
respectively. Thus, it is easy to see how transfers on different levels contribute to the
decision of a topic. This is also where our comparison of “transfer strength” starts.

To apply this idea, for each token, we first obtain the three distributions described
before: $\mathcal{P}$, $\mathcal{P}_{\mathrm{DOC}}$, and $\mathcal{P}_{\mathrm{VOC}}$. Then we calculate cosine similarities $\cos(\mathcal{P}_{\mathrm{DOC}}, \mathcal{P})$ and
$\cos(\mathcal{P}_{\mathrm{VOC}}, \mathcal{P})$. If $r = \cos(\mathcal{P}_{\mathrm{DOC}}, \mathcal{P}) / \cos(\mathcal{P}_{\mathrm{VOC}}, \mathcal{P}) > 1$, we know that $\mathcal{P}_{\mathrm{DOC}}$ is dominant and helps shape the
conditional distribution $\mathcal{P}$; in other words, the document-level transfer is stronger. We
calculate the ratio of similarities $r = \cos(\mathcal{P}_{\mathrm{DOC}}, \mathcal{P}) / \cos(\mathcal{P}_{\mathrm{VOC}}, \mathcal{P})$ for all the tokens in every model, and take
the model-wise average over all the tokens (Figure 20). The most balanced situation is
r = 1, meaning transfers on both word and document levels are contributing equally to
the conditional distributions.
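The per-token measurement can be sketched as follows (our own notation; `p_doc` and `p_voc` stand for the two bracketed factors of Equation (40), assumed to be available from the sampler's state):

```python
import numpy as np

def transfer_strength(p_doc, p_voc):
    """Ratio r = cos(P_DOC, P) / cos(P_VOC, P) for one token.

    p_doc, p_voc: unnormalized K-dimensional factors of Equation (40).
    """
    p_doc = np.asarray(p_doc, dtype=float)
    p_voc = np.asarray(p_voc, dtype=float)
    p = p_doc * p_voc                      # full conditional, up to normalization
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos(p_doc, p) / cos(p_voc, p)   # r > 1: document-level transfer dominates
```

Since cosine similarity is scale-invariant, the unnormalized factors can be compared directly; averaging r over all tokens of a trained model gives the model-wise values summarized in Figure 20.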
From the results, we notice that both DOCLINK and C-BILDA have stronger transfer
strength on the document level, which means that the transfer operations on the document
level are actually informing the decision of a token's topic. However, we also
notice that VOCLINK has very comparable transfer strength to DOCLINK and C-BILDA,
which makes less sense, because VOCLINK defines transfer operations on the word level.
This implies that transferring knowledge on the word level is weaker.
Cifra 20
Comparisons of transfer strength. A value of one (shown as the red dotted line) means an equal
balance of transfer between document and word levels. We notice SOFTLINK has the most
balanced transfer strength, whereas VOCLINK has stronger transfer at the document level
although its transfer operation is defined on the word level.
This also explains why, in the previous section, VOCLINK tends to find topic words appearing in only a few
documentos.
It is also interesting to see SOFTLINK having a relatively good balance between doc-
ument and word levels, with consistently the most balanced transfer strengths across
all models and languages.
8. Remarks and Conclusions
Multilingual topic models use corpora in multiple languages as input with additional
language resources as supervision. The traits of these models inevitably lead to a
wide variety of training scenarios, especially when a language’s resources are scarce,
whereas most previous studies on multilingual topic models have not analyzed in depth
the appropriateness of different models for different training situations and resource
availability. Por ejemplo, experiments are most often done in European languages, con
models that are typically trained on parallel or comparable corpora.
The contributions of our study are providing a unifying framework of these differ-
ent models, and systematically analyzing their efficacy in different training situations.
We conclude by summarizing our findings along two dimensions: training corpora
characteristics and dictionary characteristics, since these are the necessary components
to enable crosslingual knowledge transfer.
8.1 Model Selection
Document-level models are shown to work best when the corpus is parallel or at least
comparable. In terms of learning high-quality topics, DOCLINK and C-BILDA yield very
similar results. Sin embargo, since C-BILDA has a “language selector” mechanism in the
generative process, it is slightly more efficient for training Wikipedia articles in low-
resource languages, where the document lengths have large gaps compared to English.
SOFTLINK, por otro lado, only needs a small dictionary to enable document-level
transfer, and yields very competitive results. This is especially useful for low-resource
languages when the dictionary size is small and only a small number of comparable
document pairs are available for training.
It is harder for word-level models to achieve the full potential of transfer, due to limits in
the dictionary size and training sets, and unrealistic assumptions of the generative pro-
cess regarding dictionary coverage. The representative model, VOCLINK, has similarly
good performance on document classification as other models, but the topic qualities
according to coherence-based metrics are lower. Compared to SOFTLINK, which also
requires a dictionary as resource, directly modeling word translations in VOCLINK turns
out to be a less efficient way of transferring dictionary knowledge. Por lo tanto, cuando
using dictionary information, we recommend SOFTLINK over VOCLINK.
8.2 Crosslingual Representations
As an alternative method to learning crosslingual representations, crosslingual word
embeddings have been gaining attention (Ruder, Vulic, and Søgaard 2019; Upadhyay
et al. 2016). Recent crosslingual embedding architectures have been applied to a wider
range of applications in natural language processing, and achieve state-of-the-art per-
rendimiento. Similar to the topic space in multilingual topic models, crosslingual em-
beddings learn semantically consistent features in a shared embedding space for all
idiomas.
Both approaches—topic modeling and embedding—have advantages and limita-
ciones. Multilingual topic models still rely on supervised data to learn crosslingual rep-
resentaciones. The choice of such supervision and model is important, which leads to our
main discussion of this work. Topic models have the advantage of being interpretable.
Embedding methods are powerful in many natural language processing tasks, y el
representations are more fine-grained. Recent advancements in crosslingual embedding
training do not require crosslingual supervision resources such as dictionary or parallel
datos (casa de arte, Labaka, and Agirre 2018; Lample et al. 2018), which is a large step
toward generalization of crosslingual modeling. Although how to interpret the results
and how to reduce the heavy computing resources required remain open problems,
embedding-based methods are a promising research direction.
Relations to Topic Models. A very common strategy for learning crosslingual embeddings
is to use supervision or a sub-objective to learn a projection matrix
that projects independently trained monolingual embeddings into a shared crosslingual
space (Dinu and Baroni 2014; Faruqui and Dyer 2014; Tsvetkov and Dyer 2016; Vulić and
Korhonen 2016).
In multilingual topic models, the supervision matrix δ plays the role of a projection
matrix between languages. In DOCLINK, for example, $\delta_{d_{\ell_2}, d_{\ell_1}}$ projects document $d_{\ell_2}$ to
the document space of $\ell_1$ (Equation (15)). SOFTLINK provides a simple extension by
forming δ into a matrix of transfer distributions based on word-level document similarities.
VOCLINK applies projections in the form of word translations.
De este modo, we can see that the formation of projection matrices in multilingual topic
models is still static and restricted to an identity matrix or a simple pre-calculated
matrix. A generalization would be to add learning the projection matrix itself as an
objective into multilingual topic models. This could be a way to improve VOCLINK
by extending word associations to polysemy across languages, and making it less
dependent on context.
8.3 Future Directions
Our study inspires future work in two directions. The first direction is to increase
the efficiency of word-level knowledge transfer. For example, it is possible to use
collocation information of translated words to transfer knowledge, though cautiously, to
untranslated words. It has been shown that word-level models can help find new word
translations, Por ejemplo, by using the existing dictionary as “seed,” and gradually
adding more internal nodes to the tree structure using trained topic-word distributions.
Además, our analysis showed the benefits of using a “language selector” in C-
BILDA to make the generative process of DOCLINK more realistic, and one could also
implement a similar mechanism in VOCLINK to make the conditional distributions for
tokens less dependent on specific context.
The second direction is more general. By systematically synthesizing various mod-
els and abstracting the knowledge transfer mechanism through an explicit transfer
operación, we can construct models that shape the probabilistic distributions of a target
language using that of a source language. By defining different transfer operations,
more complex and robust models can be developed, and this transfer formulation may
provide new ways of constructing models beyond a traditional joint formulation (Hao
and Paul 2019). Por ejemplo, SOFTLINK is generalization DOCLINK based on transfer
operations that does not have an equivalent joint formulation. This framework for
thinking about multilingual topic models may lead to new ideas for other models.
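As a very rough illustration of this framing (the interface below is an abstraction for exposition, not the article's formal definition of the transfer operation, and all names and sizes are assumptions), a transfer operation can be thought of as a function that maps already-estimated source-language quantities and supervision into prior parameters for the target language:

```python
import numpy as np

def transfer_operation(source_counts, delta):
    # Abstract transfer operation: map source-language document-topic counts
    # (source docs x topics) through the transfer matrix delta (target docs x
    # source docs) to obtain pseudo-counts that shape the target documents'
    # Dirichlet priors. DOCLINK corresponds to delta being an identity pairing;
    # SOFTLINK corresponds to rows of delta being transfer distributions.
    return delta @ source_counts

# Illustrative use with made-up sizes: 4 target documents, 3 source documents,
# 5 topics; alpha is a symmetric base prior augmented by the transferred counts.
rng = np.random.default_rng(1)
source_counts = rng.integers(0, 20, size=(3, 5)).astype(float)
delta = np.full((4, 3), 1.0 / 3.0)  # SOFTLINK-like uniform transfer
alpha = 0.1 + transfer_operation(source_counts, delta)
```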
Referencias
Andrzejewski, David, Xiaojin Zhu, and Mark
Craven. 2009. Incorporating domain
knowledge into topic modeling via
Dirichlet forest priors. In Proceedings of the
26th Annual International Conference on
Machine Learning, pages 25–32, Montréal.
Artetxe, Mikel, Gorka Labaka, and Eneko
Agirre. 2018. A robust self-learning
method for fully unsupervised
cross-lingual mappings of word
embeddings. In Proceedings of the 56th
Annual Meeting of the Association for
Computational Linguistics, pages 789–798,
Melbourne.
Besag, Julian. 1975. Statistical analysis of
non-lattice data. Journal of the Royal
Statistical Society. Series D (The Statistician),
24:179–195.
Blei, David M.. 2012. Probabilistic topic
modelos. Communications of the ACM,
55(4):77–84.
Blei, David M.. 2018. Technical perspective:
expressive probabilistic models and
scalable method of moments.
Communications of the ACM, 61(4):84.
Blei, David M., Andrew Y. Ng, and Michael I.
Jordan. 2003. Latent Dirichlet allocation.
Journal of Machine Learning Research,
3:993–1022.
Boyd-Graber, Jordan L. and David M. Blei.
2009. Multilingual topic models for
unaligned text. In UAI 2009, Proceedings of
the Twenty-Fifth Conference on Uncertainty in
Artificial Intelligence, pages 75–82,
Montréal.
Chang, Jonathan and David M. Blei. 2009.
Relational topic models for document
networks. In Proceedings of the Twelfth
International Conference on Artificial
Intelligence and Statistics, AISTATS 2009,
pages 81–88, Clearwater Beach, Florida.
Chang, Jonathan, Jordan L. Boyd-Graber,
Sean Gerrish, Chong Wang, and David M.
Blei. 2009. Reading tea leaves: How
humans interpret topic models. In
Advances in Neural Information Processing
Systems, pages 288–296, Vancouver.
Chen, Ning, Jun Zhu, Fei Xia, and Bo Zhang.
2013. Generalized relational topic models
with data augmentation. In IJCAI 2013,
Proceedings of the 23rd International Joint
Conference on Artificial Intelligence,
pages 1273–1279, Beijing, China.
Christodoulopoulos, Christos and Mark
Steedman. 2015. A massively parallel
corpus: The Bible in 100 languages.
Language Resources and Evaluation,
49(2):375–395.
Deerwester, Scott C., Susan T. Dumais,
Thomas K. Landauer, George W. Furnas,
and Richard A. Harshman. 1990. Indexing
by latent semantic analysis. Journal of the
American Society for Information Science,
41(6):391–407.
Dennis III, Samuel Y. 1991. On the hyper-
Dirichlet type 1 and hyper-Liouville
distributions. Communications in
Statistics — Theory and Methods,
20(12):4069–4081.
Dinu, Georgiana and Marco Baroni. 2014.
Improving zero-shot learning by
mitigating the hubness problem. CoRR,
abs/1412.6568.
Faruqui, Manaal and Chris Dyer. 2014.
Improving vector space word
representations using multilingual
correlation. In Proceedings of the 14th
Conference of the European Chapter of the
Asociación de Lingüística Computacional,
pages 462–471, Gothenburg.
Griffiths, Thomas L. and Mark Steyvers.
2004. Finding scientific topics. Proceedings
of the National Academy of Sciences,
101(suppl 1):5228–5235.
Guti´errez, mi. Dario, Ekaterina Shutova,
Patricia Lichtenstein, Gerardo de Melo, y
Luca Gilardi. 2016. Detecting cross-cultural
differences using a multilingual topic
modelo. Transactions of the Association for
Ligüística computacional, 4:47–60.
Hao, Shudong, Jordan L. Boyd-Graber, and
Michael J. Paul. 2018. Lessons from the
Bible on modern topics: Low-resource
multilingual topic model evaluation. In
Proceedings of the 2018 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, NAACL-HLT 2018,
pages 1090–1100, New Orleans, LA.
Hao, Shudong and Michael J. Paul. 2018.
Learning multilingual topics from
incomparable corpora. In Proceedings of the
27th International Conference on
Computational Linguistics, COLING 2018,
pages 2595–2609, Santa Fe, NM.
Hao, Shudong and Michael J. Paul. 2019.
Analyzing Bayesian crosslingual transfer
in topic models. In Proceedings of the 2019
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT
2019, pages 1551–1565, Minneapolis, MN.
Heyman, Geert, Ivan Vulić, and
Marie-Francine Moens. 2016. C-BiLDA:
Extracting cross-lingual topics from
non-parallel texts by distinguishing shared
from unshared content. Data Mining and
Knowledge Discovery, 30(5):1299–1323.
Hofmann, Thomas. 1999. Probabilistic latent
semantic indexing. In SIGIR ’99:
Proceedings of the 22nd Annual International
ACM SIGIR Conference on Research and
Development in Information Retrieval,
pages 50–57, Berkeley, CA.
Hu, Yuening, Jordan L. Boyd-Graber, Brianna
Satinoff, and Alison Smith. 2014a.
Interactive topic modeling. Machine
Aprendiendo, 95(3):423–469.
Hu, Yuening, Ke Zhai, Vladimir Eidelman,
and Jordan L. Boyd-Graber. 2014b.
Polylingual tree-based topic models for
translation domain adaptation. In
Proceedings of the 52nd Annual Meeting
of the Association for Computational
Linguistics, ACL 2014, pages 1166–1176,
Baltimore, MD.
Jagarlamudi, Jagadeesh and Hal Daumé III.
2010. Extracting multilingual topics from
unaligned comparable corpora. In
Advances in Information Retrieval, 32nd
European Conference on IR Research, ECIR
2010, pages 444–456, Milton Keynes.
kim, Do-kyum, Geoffrey M. Voelker, y
Lawrence K. Saul. 2013. A variational
approximation for topic modeling of
hierarchical corpora. In Proceedings of the
30th International Conference on Machine
Aprendiendo, ICML 2013, pages 55–63,
Atlanta, Georgia.
Koehn, Philipp. 2005. Europarl: A parallel
corpus for statistical machine translation.
MT Summit, 5:79–86.
Koller, Daphne and Nir Friedman. 2009.
Probabilistic Graphical Models – Principles
and Techniques. MIT Press.
Krstovski, Kriste and David A. Smith. 2011.
A minimally supervised approach for
detecting and ranking document
translation pairs. In Proceedings of the Sixth
Workshop on Statistical Machine Translation,
WMT@EMNLP 2011, pages 207–216,
Edinburgh.
Krstovski, Kriste and David A. Smith. 2016.
Bootstrapping translation detection and
sentence extraction from comparable
corpora. In NAACL HLT 2016, The 2016
Conference of the North American Chapter of
the Association for Computational Linguistics:
Human Language Technologies,
pages 1127–1132, San Diego, California.
Krstovski, Kriste, David A. Smith, and
Michael J. Kurtz. 2016. Online multilingual
topic models with multi-level hyperpriors.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
C
oh
yo
i
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
/
/
/
4
6
1
9
5
1
8
4
7
8
1
2
/
C
oh
yo
i
_
a
_
0
0
3
6
9
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
Hao and Paul
Crosslingual Transfer in Topic Models
In NAACL HLT 2016, The 2016 Conference of
the North American Chapter of the Association
para Lingüística Computacional: Humano
Language Technologies, pages 454–459,
San Diego, California.
Lample, Guillaume, Alexis Conneau,
Marc’Aurelio Ranzato, Ludovic Denoyer,
and Hervé Jégou. 2018. Word translation
without parallel data. In 6th International
Conference on Learning Representations,
ICLR 2018, Vancouver.
Lau, Jey Han, David Newman, and Timothy
Baldwin. 2014. Machine reading tea leaves:
automatically evaluating topic coherence
and topic model quality. In Proceedings of
the 14th Conference of the European Chapter of
the Association for Computational Linguistics,
EACL 2014, pages 530–539, Gothenburg.
Leppä-aho, Janne, Johan Pensar, Teemu
Roos, and Jukka Corander. 2017. Aprendiendo
Gaussian graphical models with
fractional marginal pseudo-likelihood.
International Journal of Approximate
Reasoning, 83:21–42.
Littman, Michael L., Susan T. Dumais, y
Thomas K. Landauer. 1998. Automatic
cross-language information retrieval using
latent semantic indexing. In G.
Grefenstette, editor, Cross-Language
Information Retrieval, pages 51–62, Springer.
Liu, Xiaodong, Kevin Duh, and Yuji
Matsumoto. 2015. Multilingual topic
models for bilingual dictionary extraction.
ACM Transactions on Asian & Low-Resource
Language Information Processing,
14(3):11:1–11:22.
Ma, Tengfei and Tetsuya Nasukawa. 2017.
Inverted bilingual topic models for lexicon
extraction from non-parallel data. In
Proceedings of the Twenty-Sixth International
Joint Conference on Artificial Intelligence,
IJCAI 2017, pages 4075–4081, Melbourne.
Maaten, Laurens van der and Geoffrey
Hinton. 2008. Visualizing data using
t-SNE. Journal of Machine Learning Research,
9(Nov):2579–2605.
McCallum, Andrew Kachites. 2002.
MALLET: A machine learning for
language toolkit. http://mallet.cs.
umass.edu.
Mimno, David M., Hanna M. Wallach, Jason
Naradowsky, David A. Smith, and
Andrew McCallum. 2009. Polylingual
topic models. In Proceedings of the 2009
Conference on Empirical Methods in Natural
Language Processing, EMNLP 2009,
pages 880–889, Singapore.
Moens, Marie-Francine and Ivan Vulić. 2013.
Monolingual and cross-lingual
probabilistic topic models and their
applications in information retrieval. In
Advances in Information Retrieval – 35th
European Conference on IR Research, ECIR
2013, pages 874–877, Moscow.
Ni, Xiaochuan, Jian-Tao Sun, Jian Hu, and
Zheng Chen. 2009. Mining multilingual
topics from Wikipedia. In Proceedings of the
18th International Conference on World Wide
Web, WWW 2009, pages 1155–1156,
Madrid.
Ruder, Sebastian, Ivan Vulić, and Anders
Søgaard. 2019. A survey of cross-
lingual word embedding models.
Journal of Artificial Intelligence Research,
65:569–631.
Seroussi, Yanir, Ingrid Zukerman, and Fabian
Bohnert. 2014. Authorship attribution with
topic models. Ligüística computacional,
40(2):269–310.
Smet, Wim De, Jie Tang, and Marie-Francine
Moens. 2011. Knowledge transfer across
multilingual corpora via latent topics. En
Advances in Knowledge Discovery and Data
Minería – 15th Pacific-Asia Conference,
PAKDD 2011, pages 549–560, Shenzhen.
Søgaard, Anders, Željko Agić,
Héctor Martínez Alonso, Barbara Plank,
Bernd Bohnet, and Anders Johannsen.
2015. Inverted indexing for cross-lingual
NLP. In Proceedings of the 53rd Annual
Meeting of the Association for Computational
Linguistics and the 7th International Joint
Conference on Natural Language Processing of
the Asian Federation of Natural Language
Procesando, LCA 2015, pages 1713–1722,
Beijing.
Tiedemann, Jörg. 2012. Parallel data, tools
and interfaces in OPUS. In Proceedings of
the Eighth International Conference on
Language Resources and Evaluation, LREC
2012, pages 2214–2218, Istanbul.
Tsvetkov, Yulia and Chris Dyer. 2016.
Cross-lingual bridges with models of
lexical borrowing. Journal of Artificial
Intelligence Research, 55:63–93.
Upadhyay, Shyam, Manaal Faruqui, Chris
Dyer, and Dan Roth. 2016. Cross-lingual
models of word embeddings: An empirical
comparison. In Proceedings of the 54th
Annual Meeting of the Association for
Computational Linguistics, ACL 2016,
pages 1661–1670, Berlin.
Vulić, Ivan and Anna Korhonen. 2016. On the
role of seed lexicons in learning bilingual
word embeddings. In Proceedings of the
54th Annual Meeting of the Association for
Computational Linguistics, ACL 2016,
pages 247–257, Berlin.
Vulić, Ivan and Marie-Francine Moens. 2014.
Probabilistic models of cross-lingual
semantic similarity in context based on
latent cross-lingual concepts induced from
comparable data. In Proceedings of the 2014
Conference on Empirical Methods in Natural
Language Processing, EMNLP 2014,
pages 349–362, Doha.
Vulić, Ivan, Wim De Smet, Jie Tang, and
Marie-Francine Moens. 2015. Probabilistic
topic modeling in multilingual settings:
An overview of its methodology and
applications. Information Processing &
Management, 51(1):111–147.
Wallach, Hanna M., David M.. Mimno, y
Andrew McCallum. 2009. Rethinking
LDA: Why priors matter. In Advances in
Neural Information Processing Systems 22,
pages 1973–1981, Vancouver.
Wallach, Hanna M., Iain Murray, Ruslan
Salakhutdinov, and David M. Mimno.
2009. Evaluation methods for topic
modelos. In Proceedings of the 26th Annual
International Conference on Machine
Aprendiendo, ICML 2009, pages 1105–1112,
Montréal.
Xu, Wei, Xin Liu, and Yihong Gong. 2003.
Document clustering based on
non-negative matrix factorization. In
SIGIR 2003: Proceedings of the 26th
Annual International ACM SIGIR
Conference on Research and Development in
Information Retrieval, pages 267–273,
Toronto.
Yuan, Jinhui, Fei Gao, Qirong Ho, Wei Dai,
Jinliang Wei, Xun Zheng, Eric Po Xing,
Tie-Yan Liu, and Wei-Ying Ma. 2015.
LightLDA: Big topic models on modest
computer clusters. In Proceedings of the 24th
International Conference on World Wide
Web, WWW 2015, pages 1351–1361,
Florence.
yo
D
oh
w
norte
oh
a
d
mi
d
F
r
oh
metro
h
t
t
pag
:
/
/
d
i
r
mi
C
t
.
metro
i
t
.
mi
d
tu
/
C
oh
yo
i
/
yo
a
r
t
i
C
mi
–
pag
d
F
/
/
/
/
4
6
1
9
5
1
8
4
7
8
1
2
/
C
oh
yo
i
_
a
_
0
0
3
6
9
pag
d
.
F
b
y
gramo
tu
mi
s
t
t
oh
norte
0
7
S
mi
pag
mi
metro
b
mi
r
2
0
2
3
134