Parameter Space Factorization for Zero-Shot Learning
across Tasks and Languages
Edoardo M. Pontiκ, Ivan Vulićκ, Ryan Cotterellκ,ζ,
Marinela Parovićκ, Roi Reichartτ, Anna Korhonenκ
κUniversity of Cambridge  ζETH Zürich  τTechnion, IIT
κ{ep490,iv250,rdc42,mp939,alk23}@cam.ac.uk
τroiri@ie.technion.ac.il
Abstract
Most combinations of NLP tasks and lan-
guage varieties lack in-domain examples for
supervised training because of the paucity
of annotated data. How can neural models
make sample-efficient generalizations from
task–language combinations with available
data to low-resource ones? In this work, we
propose a Bayesian generative model for the
space of neural parameters. We assume that
this space can be factorized into latent vari-
ables for each language and each task. We infer
the posteriors over such latent variables based
on data from seen task–language combinations
through variational inference. This enables
zero-shot classification on unseen combina-
tions at prediction time. For instance, given
training data for named entity recognition
(NER) in Vietnamese and for part-of-speech
(POS) tagging in Wolof, our model can per-
form accurate predictions for NER in Wolof.
In particular, we experiment with a typolog-
ically diverse sample of 33 languages from
4 continents and 11 families, and show that
our model yields comparable or better results
than state-of-the-art, zero-shot cross-lingual
transfer methods. Our code is available at
github.com/cambridgeltl/parameter
-factorization.
1 Introduction
The annotation efforts in NLP have achieved im-
pressive feats, such as the Universal Dependen-
cies (UD) project (Nivre et al., 2019), which now
includes 83 languages. But even UD covers only
a meager subset of the world’s estimated 8,506
languages (Hammarström et al., 2016). What is
more, the Association for Computational Linguis-
tics Wiki1 lists 24 separate NLP tasks. Labeled
data, which is both costly and labor-intensive, is
missing for many of such task–language combi-
nations. This shortage hinders the development
of computational models for the majority of the
world’s languages (Snyder and Barzilay, 2010;
Ponti et al., 2019a).
A common solution is transferring knowledge
across domains, such as tasks and languages
(Yogatama et al., 2019; Talmor and Berant, 2019),
which holds promise to mitigate the lack of
training data inherent to a large spectrum of NLP
applications (Täckström et al., 2012; Agić et al.,
2016; Ammar et al., 2016; Ponti et al., 2018;
Ziser and Reichart, 2018, inter alia). In the most
extreme scenario, zero-shot learning, no annotated
examples are available for the target domain.
In particular, zero-shot transfer across languages
implies a change in the data domain, and leverages
information from resource-rich languages to
tackle the same task in a previously unseen target
language (Lin et al., 2019; Rijhwani et al., 2019;
Artetxe and Schwenk, 2019; Ponti et al., 2019a,
inter alia). Zero-shot transfer across tasks within
the same language (Ruder et al., 2019a), on the
other hand, implies a change in the space of labels.
As our main contribution, we propose a
Bayesian generative model of the neural parame-
ter space. We assume this to be structured, and for
this reason factorizable into task- and language-
specific latent variables.2 By performing transfer
of knowledge from both related tasks and related
languages (i.e., from seen combinations), our
model allows for zero-shot prediction on unseen
task–language combinations. For instance, the
1aclweb.org/aclwiki/State_of_the_art.
2By latent variable we mean every variable that has to
be inferred from observed (directly measurable) variables.
To avoid confusion, we use the terms seen and unseen when
referring to different task–language combinations.
Transactions of the Association for Computational Linguistics, vol. 9, pp. 410–428, 2021. https://doi.org/10.1162/tacl_a_00374
Action Editor: Jacob Eisenstein. Submission batch: 2/2020; Revision batch: 2/2021; Published 4/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
availability of annotated data for part-of-speech
(POS) tagging in Wolof and for named-entity
recognition (NER) in Vietnamese supplies plenty
of information to infer a task-agnostic represen-
tation for Wolof and a language-agnostic repre-
sentation for NER. Conditioning on these, the
appropriate neural parameters for Wolof NER
can be generated at evaluation time. While this
idea superficially resembles matrix completion for
collaborative filtering (Mnih and Salakhutdinov,
2008; Dziugaite and Roy, 2015), the neural
parameters are latent and are non-identifiable.
Rather than recovering missing entries from par-
tial observations, in our approach we reserve
latent variables for each language and each task to
tie together neural parameters for combinations
that have either of them in common.
We adopt a Bayesian perspective towards infer-
ence. The posterior distribution over the model’s
latent variables is approximated through stochastic
variational inference (Hoffman et al., 2013, SVI).
Given the enormous number of parameters, we
also explore a memory-efficient inference scheme
based on a diagonal plus low-rank approximation
of the covariance matrix. This guarantees that our
model remains both expressive and tractable.
We evaluate the model on two sequence label-
ing tasks: POS tagging and NER, relying on a
typologically representative sample of 33 lan-
guages from 4 continents and 11 families. The
results clearly indicate that our generative model
surpasses standard baselines based on cross-
lingual transfer 1) from the (typologically) nearest
source language; 2) from the source language with
the most abundant in-domain data (English); and
3) from multiple source languages, in the form
of either a multi-task, multi-lingual model with
parameter sharing (Wu and Dredze, 2019) or an
ensemble of task- and language-specific models
(Rahimi et al., 2019).
2 Bayesian Generative Model
In this work, we propose a Bayesian generative
model for multi-task, multi-lingual NLP. We train
a single Bayesian neural network for several
tasks and languages jointly. Formally, we con-
sider a set T = {t1, . . . , tn} of n tasks and a set
L = {l1, . . . , lm} of m languages. The core mod-
eling assumption we make is that the parameter
space of the neural network is structured: Spe-
cifically, we posit that certain parameters corre-
Figure 1: A graph (plate notation) of the generative
model based on parameter space factorization. Shaded
circles refer to observed variables.
spond to tasks and others correspond to languages.
This structure assumption allows us to general-
ize to unseen task–language pairs. In this regard,
the model is reminiscent of matrix factorization
as applied to collaborative filtering (Mnih and
Salakhutdinov, 2008; Dziugaite and Roy, 2015).
We now describe our generative model in three
steps that match the nesting level of the plates in
the diagram in Figure 1. Equivalently, the reader
can follow the nesting level of the for loops in
Algorithm 1 for an algorithmic illustration of the
generative story.
(1) Sampling Task and Language Representa-
tions: To kick off our generative process, we
first sample a latent representation for each
of the tasks and languages from multivari-
ate Gaussians: ti ∼ N (µti, Σti ) ∈ Rh and
lj ∼ N (µlj , Σlj ) ∈ Rh, respectively. While
we present the model in its most general form,
we take µti = µlj = 0 and Σti = Σlj = I
for the experimental portion of this paper.
(2) Sampling Task–Language-specific Param-
eters: Afterward, to generate task–language-
specific neural parameters, we sample θij
from N (fψ(ti, lj), diag(fφ(ti, lj))) ∈ Rd
where fψ(ti, lj) and fφ(ti, lj) are learned
deep feed-forward neural networks
fψ : Rh → Rd and fφ : Rh → Rd≥0, parametrized
by ψ and φ, respectively, similar to Kingma
and Welling (2014). These transform the
Algorithm 1 Generative Model of Neural Parameters for Multi-task, Multi-lingual NLP.
1:  for ti ∈ T :
2:      ti ∼ N (µti, Σti)
3:  for lj ∈ L :
4:      lj ∼ N (µlj, Σlj)
5:  for ti ∈ T :
6:      for lj ∈ L :
7:          µθij = fψ(ti, lj)
8:          Σθij = fφ(ti, lj)
9:          θij ∼ N (µθij, Σθij)
10:         for xijk ∈ Xij :
11:             yijk ∼ p(· | xijk, θij)
latent representations into the mean µθij and
the diagonal of the covariance matrix σ²θij for
the parameters θij associated with ti and
lj. The feed-forward network fψ just has
a final linear layer, as the mean can range
over Rd, whereas fφ has a final softplus
(defined in Section 3) layer to ensure it
ranges only over Rd≥0. Following Stolee and
Patterson (2019), the networks fψ and fφ
take as input a linear function of the task and
language vectors: t ⊕ l ⊕ (t − l) ⊕ (t ⊙ l),
where ⊕ stands for concatenation and ⊙ for
element-wise multiplication. The sampled
neural parameters θij are partitioned into a
weight Wij ∈ Re×c and a bias bij ∈ Rc, and
reshaped appropriately. Hence, the dimen-
sionality of the Gaussian is chosen to reflect
the number of parameters in the affine layer,
d = e · c + c, where e is the dimensionality
of the input token embeddings (detailed in
the next paragraph) and c is the maximum
number of classes across tasks.3 The number
of hidden layers and the hidden size of fψ
and fφ are hyper-parameters discussed in
Section 4.2. We tie the parameters ψ and φ
for all layers except for the last to reduce the
parameter count. We note that the space of
parameters for all tasks and languages forms
a tensor Θ ∈ Rn×m×d, where d is the number
of parameters of the largest model.
(3) Sampling Task Labels: Finally, we sam-
ple the kth label yijk for the ith task and the
jth language from a final softmax: p(yijk |
xijk, θij) = softmax(Wij BERT(xijk) + bij)
where BERT(xijk) ∈ Re is the multi-lingual
BERT (Pires et al., 2019) encoder. The incor-
poration of m-BERT as a pre-trained mul-
tilingual embedding allows for enhanced
cross-lingual transfer.
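To make the three sampling steps concrete, here is a minimal, self-contained sketch of the generative story in PyTorch-style Python. All names and dimensions are illustrative assumptions (the generator networks are much shallower than in the actual model), and `encode` is a random stand-in for the frozen multilingual BERT encoder, not the released implementation.

```python
# Illustrative sketch of Algorithm 1 (toy sizes; not the authors' code).
import torch

h, e, c = 100, 768, 17          # latent size, token-embedding size, max #classes
n_tasks, n_langs, K = 2, 3, 5   # tasks, languages, tokens per pair (toy values)
d = e * c + c                   # parameters of one affine classifier head

# Shared generator networks f_psi (mean) and f_phi (diagonal of the covariance)
f_psi = torch.nn.Sequential(torch.nn.Linear(4 * h, 400), torch.nn.ReLU(),
                            torch.nn.Linear(400, d))
f_phi = torch.nn.Sequential(torch.nn.Linear(4 * h, 400), torch.nn.ReLU(),
                            torch.nn.Linear(400, d), torch.nn.Softplus())

def encode(token_ids):
    """Stand-in for BERT(x): one e-dimensional vector per token (random here)."""
    return torch.randn(len(token_ids), e)

# (1) sample task and language representations from N(0, I)
tasks = torch.randn(n_tasks, h)
langs = torch.randn(n_langs, h)

for i in range(n_tasks):
    for j in range(n_langs):
        t, l = tasks[i], langs[j]
        inp = torch.cat([t, l, t - l, t * l])          # t ⊕ l ⊕ (t−l) ⊕ (t⊙l)
        mu, var = f_psi(inp), f_phi(inp)
        theta = mu + var.sqrt() * torch.randn(d)       # (2) θij ~ N(mu, diag(var))
        W, b = theta[:e * c].view(c, e), theta[e * c:] # reshape into affine head
        logits = encode(range(K)) @ W.T + b            # (3) softmax over classes
        y = torch.distributions.Categorical(logits=logits).sample()
```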
Consider the Cartesian product of all tasks
and languages T × L. We can decompose this
product into seen task–language pairs S and un-
seen task–language pairs U, i.e., T × L = S ⊔ U.
Naturally, we are only able to train our model on
the seen task–language pairs S. However, as we
estimate all task–language parameter vectors θij
jointly, our model allows us to draw inferences
about the parameters for pairs in U as well. The
intuition for why this should work is as follows: By
observing multiple pairs where the task (language)
is the same but the language (task) varies, the
model learns to distill the relevant knowledge for
zero-shot learning because our generative model
structurally enforces disentangled representations
—separating representations for the tasks from
the representations for the languages rather than
lumping them together into a single entangled
representation (Wu and Dredze, 2019, inter alia).
Furthermore, the neural networks fψ and fφ map-
ping the task- and language-specific latent vari-
ables to neural parameters are shared, allowing the
model to generalize across task–language pairs.
3 Variational Inference
Exact computation of the posterior over the
latent variables p(θ, t, l | x) is intractable. Thus,
we need to resort to an approximation. In this
work, we consider variational inference as our
approximate inference scheme. Variational infer-
ence finds an approximate posterior over the latent
variables by minimizing the variational gap, which
may be expressed as the Kullback–Leibler (KL)
divergence between the variational approximation
q(θ, t, l) and the true posterior p(θ, t, l | x). In
our work, we employ the following variational
distributions:
qλ = N (mt, St),   mt ∈ Rh, St ∈ Rh×h        (1)
qν = N (ml, Sl),   ml ∈ Rh, Sl ∈ Rh×h        (2)
qξ = N (fψ(t, l), diag(fφ(t, l)))            (3)
3Different tasks might involve different numbers of classes; the number of parameters hence varies. The extra dimensions not needed for a task can be considered as padded with zeros.
KL(q(θ, t, l) || p(θ, t, l | x))
    = − E_{t∼qλ} E_{l∼qν} E_{θ∼qξ} log [ p(θ, t, l | x) / q(θ, t, l) ]
    = − E_{t∼qλ} E_{l∼qν} E_{θ∼qξ} [ log p(θ, t, l, x) − log p(x) − log q(θ, t, l) ]
    = log p(x) − E_{t∼qλ} E_{l∼qν} E_{θ∼qξ} log [ p(θ, t, l, x) / q(θ, t, l) ]  ≜  log p(x) − L        (4)

log p(x) = log ∫∫∫ p(x, θ, t, l) dθ dt dl
         = log ∫∫∫ p(x | θ) p(θ | t, l) p(t) p(l) dθ dt dl
         = log ∫∫∫ qλ(t) qν(l) qξ(θ | t, l) [ p(x | θ) p(θ | t, l) p(t) p(l) / ( qλ(t) qν(l) qξ(θ | t, l) ) ] dθ dt dl
         = log E_{t∼qλ} E_{l∼qν} E_{θ∼qξ} [ p(x | θ) p(θ | t, l) p(t) p(l) / ( qλ(t) qν(l) qξ(θ | t, l) ) ]
         ≥ E_{t∼qλ} E_{l∼qν} E_{θ∼qξ} log [ p(x | θ) p(θ | t, l) p(t) p(l) / ( qλ(t) qν(l) qξ(θ | t, l) ) ]  ≜  L        (5)

L = E_{t∼qλ} E_{l∼qν} [ E_{θ∼qξ} log p(x | θ) + E_{θ∼qξ} log ( p(θ | t, l) / qξ(θ | t, l) ) + log ( p(t) / qλ(t) ) + log ( p(l) / qν(l) ) ]
  = E_{θ∼qξ} log p(x | θ)   [requires approximation]
    − ( KL(qλ(t) || p(t)) + KL(qν(l) || p(l)) + KL(qξ(θ | t, l) || p(θ | t, l)) )   [closed-form solution]        (6)
We note the unusual choice to tie parameters between the generative model and the variational family in Equation (3); however, we found that this choice performs better in our experiments.
Through a standard algebraic manipulation in Equation (4), the KL-divergence for our generative model can be shown to equal the difference between the marginal log-likelihood log p(x), which is independent of q(·), and the so-called evidence lower bound (ELBO) L. Thus, approximate inference becomes an optimization problem where maximizing L results in minimizing the KL-divergence. One derives L by expanding the marginal log-likelihood as in
Equation (5) by means of Jensen’s inequality. We
also show that L can be further broken into a
series of terms as illustrated in Equation (6). In
particular, we see that it is only the first term in
the expansion that requires approximation. The
subsequent terms are KL-divergences between
variational and true distributions that have closed-
form solution due to our choice of prior. Due
to the parameter-tying scheme above, the KL-
divergence in Equation (6) between the variational
distribution qξ(θ | t, l) and the prior distribution
p(θ | t, l) is zero.
In general, the covariance matrices St and Sl in
Equation (1) and Equation (2) will require O(h2)
space to store. As h is often very large, it is imprac-
tical to materialize either matrix in its entirety.
Thus, in this work, we experiment with smaller
matrices that have a reduced memory footprint;
specifically, we consider a diagonal covariance
matrix and a diagonal plus low-rank covariance
structure. A diagonal covariance matrix makes
computation feasible with a complexity of O(h);
this, however, comes at the cost of not letting
parameters influence each other, and thus failing
to capture their complex interactions. To allow
for a more expressive variational family, we also
consider a covariance matrix that is the sum of a
diagonal matrix and a low-rank matrix:
St = diag(δt²) + Bt Bt⊤        (7)
Sl = diag(δl²) + Bl Bl⊤        (8)
where B ∈ Rh×k ensures that rank(BB⊤) ≤ k, and
diag(δ) is diagonal. We can store this structured
covariance matrix in O(kh) space.
By definition, covariance matrices must be symmetric
and positive semi-definite. The first property holds by
construction. The second property is enforced by a
softplus parameterization, where softplus(·) ≜ ln(1 + exp(·)).
Specifically, we define δ² = softplus(ρ) and we optimize over ρ.
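As a quick illustration of the memory argument, the following sketch (hypothetical names, toy sizes) stores only ρ and B, i.e., O(kh) numbers, and materializes S only to check that the resulting covariance is indeed symmetric positive definite.

```python
# Sketch: a diagonal-plus-low-rank covariance stored in O(kh) memory (illustrative only).
import numpy as np

h, k = 100, 10
rng = np.random.default_rng(0)

rho = rng.normal(size=h)            # unconstrained parameters for the diagonal
B = rng.normal(size=(h, k))         # low-rank factor

def softplus(x):
    return np.log1p(np.exp(x))

delta_sq = softplus(rho)            # delta^2 >= 0 guarantees positive semi-definiteness

# Only for checking: materialize S = diag(delta^2) + B B^T explicitly.
S = np.diag(delta_sq) + B @ B.T
assert np.allclose(S, S.T)                       # symmetric by construction
assert np.all(np.linalg.eigvalsh(S) > 0)         # positive definite thanks to softplus
print(S.shape, "stored implicitly with", rho.size + B.size, "numbers")
```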
3.1 Stochastic Variational Inference
To speed up the training time, we make use of
stochastic variational inference (Hoffman et al.,
2013). In this setting, we randomly sample a task
ti ∈ T and language lj ∈ L among seen combi-
nations during each training step,4 and randomly
select a batch of examples from the dataset for the
sampled task–language pair. We then optimize the
parameters of the feed-forward neural networks ψ
and φ as well as the parameters of the variational
approximation to the posterior mt, ml, ρt, ρl, Bt,
and Bl with a stochastic gradient-based optimizer
(discussed in Section 4.2).
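The overall training procedure can be pictured with the following toy loop, in which the whole model is collapsed into a handful of free variational parameters and the data are random labels. It only illustrates the control flow assumed here (sample a seen task–language pair, sample a mini-batch, take a stochastic gradient step on a Monte Carlo ELBO estimate); the architecture, likelihood, and hyper-parameters are placeholders, not those of the paper.

```python
# Toy SVI loop over seen task-language pairs (purely illustrative setup).
import random
import torch

h = 8
seen_pairs = [("POS", "wo"), ("NER", "vi")]            # seen combinations only
m_t = {t: torch.zeros(h, requires_grad=True) for t in {"POS", "NER"}}
m_l = {l: torch.zeros(h, requires_grad=True) for l in {"wo", "vi"}}
opt = torch.optim.Adam(list(m_t.values()) + list(m_l.values()), lr=1e-2)

def neg_elbo(t_vec, l_vec, batch):
    t_sample = t_vec + torch.randn_like(t_vec)          # unit-variance posterior (toy)
    l_sample = l_vec + torch.randn_like(l_vec)
    logits = (t_sample * l_sample).sum() + torch.zeros(len(batch))  # stand-in likelihood
    nll = torch.nn.functional.binary_cross_entropy_with_logits(logits, batch)
    kl = 0.5 * (t_vec.pow(2).sum() + l_vec.pow(2).sum())            # KL to N(0, I), up to a constant
    return nll + kl / 100                                # KL weighted by 1/|K| (toy |K|)

for step in range(200):
    task, lang = random.choice(seen_pairs)               # sample a seen pair
    batch = torch.randint(0, 2, (8,)).float()            # stand-in mini-batch of labels
    loss = neg_elbo(m_t[task], m_l[lang], batch)
    opt.zero_grad(); loss.backward(); opt.step()
```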
The KL divergence terms and their gradients
in the ELBO appearing in Equation (6) can be
computed in closed form as the relevant densities
are Gaussian (Duchi, 2007, p. 13). Moreover, they
can be calculated for Gaussians with diagonal
and diagonal plus low-rank covariance structures
without explicitly unfolding the full matrix. For a
choice of prior p = N (0, I) and a diagonal plus
low-rank covariance structure, we have:
KL(q || p) = (1/2) [ Σ_{i=1}^{h} ( mi² + δi² + Σ_{j=1}^{k} bij² ) − h − lndet(S) ]        (9)
where bij is the element in the i-th row and j-th
column of B. The last term can be estimated with-
out computing the full matrix explicitly thanks
to the generalization of the matrix–determinant
lemma,5 which, applied to the factored covariance
structure, yields:
4As an alternative, we experimented with a setup where sampling probabilities are proportional to the number of examples of each task–language combination, but this achieved similar performances on the development sets.
5det(A + U V ⊤) = det(I + V ⊤A−1U ) · det(A). Note that
the lemma assumes that A is invertible.
lndet(S) = ln [ det( I + B⊤ diag(δ−2) B ) ] + Σ_{i=1}^{h} ln(δi²)        (10)
where I ∈ Rk. The KL divergence for the variant
with diagonal covariance is just a special case of
Equation (9) with bij = 0.
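A small numerical check of Equations (9) and (10), under the assumption of a standard Gaussian prior; the structured computation never forms the h × h covariance except in the dense reference used only for verification (all names and sizes are illustrative).

```python
# Sketch: KL(N(m, S) || N(0, I)) for S = diag(delta^2) + B B^T, without materializing S.
import numpy as np

h, k = 100, 10
rng = np.random.default_rng(1)
m = rng.normal(size=h)
delta_sq = np.log1p(np.exp(rng.normal(size=h)))   # softplus-parametrized diagonal
B = rng.normal(size=(h, k))

# log det(S) via the matrix-determinant lemma (Equation 10)
logdet = np.linalg.slogdet(np.eye(k) + (B.T * (1.0 / delta_sq)) @ B)[1] \
         + np.log(delta_sq).sum()

# Equation (9): 0.5 * [ sum_i (m_i^2 + delta_i^2 + sum_j b_ij^2) - h - log det(S) ]
kl_structured = 0.5 * ((m**2).sum() + delta_sq.sum() + (B**2).sum() - h - logdet)

# Dense reference computation for comparison
S = np.diag(delta_sq) + B @ B.T
kl_dense = 0.5 * (np.trace(S) + m @ m - h - np.linalg.slogdet(S)[1])
assert np.isclose(kl_structured, kl_dense)
```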
However, as stated before, the following expec-
tation does not admit a closed-form solution. Thus
we consider a Monte Carlo approximation:
E_{θ∼qξ} log p(x | θ) = ∫ qξ(θ) log p(x | θ) dθ ≈ (1/V) Σ_{v=1}^{V} log p(x | θ(v)),   where θ(v) ∼ qξ        (11)
where V is the number of Monte Carlo samples
taken. In order to allow the gradient to easily
flow through the generated samples, we adopt
the re-parametrization trick (Kingma and Welling,
2014). Specifically, we exploit the following iden-
tities ti = µti + σti ⊙ ε and lj = µlj + σlj ⊙ ε, where ε ∼ N (0, I) and ⊙ is the Hadamard product. For the diagonal plus low-rank covariance structure, we exploit the identity:

µ + δ ⊙ ε + Bζ        (12)

where ε ∈ Rh, ζ ∈ Rk, and both are sampled from N (0, I). The mean µθij and the diagonal of the covariance matrix σ²θij are deterministically computed given the above samples, and the parameters θij are sampled from N (µθij, diag(σ²θij)), again with the re-parametrization trick.
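The identity in Equation (12) can be sanity-checked empirically: with δ = sqrt(softplus(ρ)), samples built as µ + δ ⊙ ε + Bζ have covariance approaching diag(δ²) + BB⊤. The sketch below uses toy values and hypothetical names.

```python
# Sketch of the re-parametrization in Equation (12) for a diag-plus-low-rank Gaussian.
import numpy as np

h, k, V = 100, 10, 100_000
rng = np.random.default_rng(2)
mu = rng.normal(size=h)
delta = np.sqrt(np.log1p(np.exp(rng.normal(size=h))))   # delta = sqrt(softplus(rho))
B = 0.1 * rng.normal(size=(h, k))

eps = rng.normal(size=(V, h))
zeta = rng.normal(size=(V, k))
samples = mu + delta * eps + zeta @ B.T                  # V reparametrized samples

# Their empirical covariance approaches diag(delta^2) + B B^T for large V
emp_cov = np.cov(samples, rowvar=False)
target = np.diag(delta**2) + B @ B.T
print(np.abs(emp_cov - target).max())                    # small for large V
```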
3.2 Posterior Predictive Distribution
During test time, we perform zero-shot predictions
on an unseen task–language pair by plugging
in the posterior means (under the variational
approximation) into the model. As an alternative,
we experimented with ensemble predictions
through Bayesian model averaging. That is, for
data for seen combinations xS and data for unseen
combinations xU, the true predictive posterior can be approximated as p(xU | xS) = ∫ p(xU | θ, xS) qξ(θ | xS) dθ ≈ (1/V) Σ_{v=1}^{V} p(xU | θ(v), xS), where the θ(v) are V = 100 Monte Carlo samples from the posterior qξ. Performances on the development sets are comparable to simply plugging in the posterior mean.
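A sketch of this model-averaged prediction for a single token, with toy posterior moments standing in for the ones inferred by the model (all names and values are illustrative):

```python
# Sketch of Bayesian model averaging at test time: average class probabilities
# over V parameter samples from the approximate posterior (toy classifier).
import numpy as np

rng = np.random.default_rng(3)
e, c, V = 16, 5, 100
x = rng.normal(size=e)                         # one encoded token (stand-in for BERT(x))
mu_theta = rng.normal(size=(c, e + 1))         # posterior mean of W and b (toy values)
sd_theta = 0.1 * np.ones((c, e + 1))           # posterior std of W and b (toy values)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

probs = np.zeros(c)
for _ in range(V):
    theta = mu_theta + sd_theta * rng.normal(size=mu_theta.shape)   # one posterior draw
    W, b = theta[:, :e], theta[:, e]
    probs += softmax(W @ x + b) / V            # running average over samples

plug_in = softmax(mu_theta[:, :e] @ x + mu_theta[:, e])             # posterior-mean plug-in
print(probs, plug_in)
```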
4 Experimental Setup
4.1 Data
We select NER and POS tagging as our exper-
imental tasks because their datasets encompass
an ample and diverse sample of languages, and
are common benchmarks for resource-poor NLP
(Cotterell and Duh, 2017, inter alia). In particular,
we opt for WikiANN (Pan et al., 2017) for the NER
task and Universal Dependencies 2.4 (UD; Nivre
et al., 2019) for POS tagging. Our sample of
languages is chosen from the intersection of those
available in WikiANN and UD. However, we
remark that this sample is heavily biased towards
the Indo-European family (Gerz et al., 2018).
Instead, the selection should be: i) typologically
diverse, to ensure that the evaluation scores truly
reflect the expected cross-lingual performance
(Ponti et al., 2020); ii) a mixture of resource-rich
and low-resource languages, to recreate a realis-
tic setting and to allow for studying the effect of
data size. Hence, we further filter the languages in
order to make the sample more balanced. In par-
ticular, we sub-sample Indo-European languages
by including only resource-poor ones, and keep
all the languages from other families. Our final
sample comprises 33 languages from 4 continents
(17 from Asia, 11 from Europe, 4 from Africa,
and 1 from South America) and from 11 fami-
lies (6 Uralic, 6 Indo-European, 5 Afroasiatic, 3
Niger-Congo, 3 Turkic, 2 Austronesian, 2 Dra-
vidian, 1 Austroasiatic, 1 Kra-Dai, 1 Tupian, 1
Sino-Tibetan), as well as 2 isolates. The full list of
language ISO 639-2 codes is reported in Figure 2.
In order to simulate a zero-shot setting, we hold
out in turn half of all possible task–language pairs
and regard them as unseen, while treating the
others as seen pairs. The partition is performed in
such a way that a held-out pair has data available
for the same task in a different language, and for
the same language in a different task.6 Under this
constraint, pairs are assigned to train or evaluation
at random.7
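A sketch of this controlled split on a toy inventory of tasks and languages (rejection sampling over random partitions; names and sizes are illustrative, not the 33-language sample used in the experiments):

```python
# Sketch of the controlled split: each held-out (task, language) pair must have the
# same task seen in some other language and the same language seen in some other task.
import itertools
import random

tasks = ["POS", "NER"]
langs = ["wo", "vi", "am", "ug"]
pairs = list(itertools.product(tasks, langs))

def valid(seen, unseen):
    return all(any(t == t2 and l != l2 for t2, l2 in seen) and
               any(l == l2 and t != t2 for t2, l2 in seen)
               for t, l in unseen)

random.seed(0)
while True:
    random.shuffle(pairs)                      # random assignment under the constraint
    seen, unseen = pairs[:len(pairs) // 2], pairs[len(pairs) // 2:]
    if valid(seen, unseen):
        break
print("seen:", seen)
print("unseen:", unseen)
```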
6We use the controlled partitioning for the following reason. If a language lacks data both for NER and for POS, the proposed factorization method cannot provide estimates for its posterior. We leave model extensions that can handle such cases for future work.
7See Section 5.2 for further experiments on splits controlled for language distance and sample size.
We randomly split the WikiANN datasets into training, development, and test portions with a proportion of 80-10-10. We use the provided splits for UD; if the training set for a language is missing, we treat the test set as such when the language is held out, and as a training set when it is among the seen pairs.8
4.2 Hyper-parameters
The multilingual M-BERT encoder is initialized
with parameters pre-trained on masked language
modeling and next sentence prediction on 104
languages (Devlin et al., 2019).9 We opt for the
cased BERT-BASE architecture, which consists of
12 layers with 12 attention heads and a hidden size
of 768. As a consequence, this is also the dimen-
sion e of each encoded WordPiece unit, a subword
unit obtained through BPE (Wu et al., 2016). The
dimension h of the multivariate Gaussian for task
and language latent variables is set to 100. The
deep feed-forward networks fψ and fφ have 6
layers with a hidden size of 400 for the first layer,
768 for the internal layers, and ReLU non-linear
activations. Their depth and width were selected
based on validation performance.
The expectations over latent variables in Equa-
tion (6) are approximated through 3 Monte Carlo
samples per batch during training. The KL terms
are weighted with 1/|K| uniformly across training,
where |K| is the number of mini-batches.10 We ini-
tialize all the means m of the variational approxi-
mation with a random sample from N (0, 0.1), and
the parameters for covariance matrices S of the
variational approximation with a random sample
from U(0, 0.5) following Stolee and Patterson
(2019). We choose k = 10 as the number of
columns of B so it fits into memory. The maxi-
mum sequence length for inputs is limited to 250.
The batch size is set to 8, and the best setting for
the Adam optimizer (Kingma and Ba, 2015) was
found to be an initial learning rate of 5·10−6 based
on grid search. In order to avoid over-fitting, we
perform early stopping with a patience of 10 and
a validation frequency of 2.5K steps.
4.3 Baselines
We consider four baselines for cross-lingual trans-
fer that also use BERT as an encoder shared across
all languages.
8Note that, in the second case, no evaluation takes place on such a language.
9Available at github.com/google-research/bert/blob/master/multilingual.md.
10We found this weighting strategy to work better than annealing as proposed by Blundell et al. (2015).
First Baseline. A common approach is transfer
from the nearest source (NS) language, which
selects the most compatible source to a target
language in terms of similarity. In particular, the
selection can be based on family membership
(Zeman and Resnik, 2008; Cotterell and Heigold,
2017; Kann et al., 2017), typological features
(Deri and Knight, 2016), KL-divergence between
part-of-speech trigram distributions (Rosa and
Žabokrtský, 2015; Agić, 2017), tree edit distance
of delexicalized dependency parses (Ponti et al.,
2018), or a combination of the above (Lin et al.,
2019). In our work, during evaluation, we choose
the classifier associated with the observed lan-
guage with the highest cosine similarity between
its typological features and those of the held-out
language. These features are sourced from URIEL
(Littell et al., 2017) and contain information about
family, area, syntax, and phonology.
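Schematically, the selection amounts to a cosine-similarity argmax over typological vectors; the sketch below uses random stand-ins for the URIEL features (real features would come from URIEL/lang2vec, and the vector length here is arbitrary).

```python
# Sketch of nearest-source selection via cosine similarity over typological vectors.
import numpy as np

rng = np.random.default_rng(4)
features = {lang: rng.random(289) for lang in ["wo", "vi", "tr", "am"]}   # toy vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_source(target, seen_langs):
    # pick the seen language most similar to the held-out target language
    return max(seen_langs, key=lambda l: cosine(features[target], features[l]))

print(nearest_source("wo", ["vi", "tr", "am"]))
```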
Second Baseline. We also consider transfer
from the largest source (LS) language, that is,
the language with most training examples. This
approach has been adopted by several recent
works on cross-lingual transfer (Conneau et al.,
2018; Artetxe et al., 2020, inter alia). In our imple-
mentation, we always select the English classifier
for prediction.11 In order to make this baseline
comparable to our model, we adjust the number
of English NER training examples to the sum of
the examples available for all seen languages S.12
Third Baseline. Next, we apply a protocol de-
signed by Rahimi et al. (2019) for weighting the
predictions of a classifier ensemble according to
their reliability. For a specific task, the reliability
of each language-specific classifier is estimated
through a Bayesian graphical model. Intuitively,
this model learns from error patterns, which be-
have more randomly for untrustworthy models
and more consistently for the others. Among the
protocols proposed in the paper, we opt for BEA
in its zero-shot, token-based version, as it achieves
the highest scores in a setting comparable to the
current experiment. We refer to the original paper
for the details.13
11We include English to make the baseline more com-
petitive, but note that this language is not available for our
generative model as it is both Indo-European and resource-
rich.
12The number of NER training examples is 1,093,184 for
the first partition and 520,616 for the second partition.
13We implemented this model through the original code at
github.com/afshinrahimi/mmner.
Fourth Baseline. Finally, we take inspiration from
Wu and Dredze (2019). The joint multilingual
(JM) baseline, contrary to the previous baselines,
consists of two classifiers (one for POS tagging
and another for NER) shared among all observed
languages for a specific task. We follow the orig-
inal implementation of Wu and Dredze (2019),
closely adopting all recommended hyper-parameters
and strategies, such as freezing the parameters of
all encoder layers below the 3rd for sequence label-
ing tasks.
It must be noted that the number of parameters
in our generative model scales better than base-
lines with language-specific classifiers, but worse
than those with language-agnostic classifiers, as
the number of languages grows. However, even in
the second case, increasing the depth of baseline
networks to match the parameter count is detri-
mental if the BERT encoder is kept trainable, which
was also verified in previous work (Peters et al.,
2019).
5 Results and Discussion
5.1 Zero-shot Transfer
Firstly, we present the results for zero-shot predic-
tion based on our generative model using both of
the approximate inference schemes (with diago-
nal covariance PF-d and factor covariance PF-lr).
Table 1 summarizes the results on the two tasks
of POS tagging and NER averaged across all
languages. Our model (in both its variants) outper-
forms the four baselines on both tasks, including
state-of-the-art alternative methods. In particu-
lar, PF-d and PF-lr gain 4.49 / 4.20 in accuracy
(∼7%) for POS tagging and 7.29 / 7.73 in F1 score
(∼10%) for NER on average compared to transfer
from the largest source (LS), the strongest baseline
for single-source transfer. Compared to multilin-
gual joint transfer from multiple sources (JM), our
two variants gain 0.95 / 0.67 in accuracy (∼1%) for
POS tagging and +0.61 / +1.05 in F1 score (∼1%).
More details about the individual results on each
task–language pair are provided in Figure 2, which
includes the mean of the results over 3 separate
runs. Overall, we obtain improvements in 23/33
languages for NER and on 27/45 treebanks for
POS tagging, which further supports the benefits
of transferring both from tasks and languages.
Considering the baselines, the relative perfor-
mance of LS versus NS is an interesting finding
per se. LS largely outperforms NS on both POS
Figure 2: Results for NER (top) and POS tagging (bottom): Four baselines for cross-lingual transfer compared to
Matrix Factorization with diagonal covariance and diagonal plus low-rank covariance.
Task |      NS      |      LS      |     BEA      |      JM      |     PF-d     |    PF-lr
POS  | 42.84 ± 1.23 | 60.51 ± 0.43 | 47.65 ± 1.54 | 64.04 ± 0.18 | 65.00 ± 0.12 | 64.71 ± 0.18
NER  | 74.16 ± 0.56 | 78.97 ± 0.56 | 66.45 ± 0.56 | 85.65 ± 0.13 | 86.26 ± 0.17 | 86.70 ± 0.10

Table 1: Results per task averaged across all languages.
tagging and NER. This shows that having more
data is more informative than relying primarily on
similarity according to linguistic properties. This
finding contradicts the received wisdom (Rosa and
Žabokrtský, 2015; Cotterell and Heigold, 2017;
Lin et al., 2019, inter alia) that related languages
tend to be the most reliable source. We conjecture
that this is due to the pre-trained multi-lingual BERT
encoder, which helps to bridge the gap between
unrelated languages (Wu and Dredze, 2019).
The two baselines that hinge upon transfer
from multiple sources lie on opposite sides of the
spectrum in terms of performance. On the one
hand, BEA achieves the lowest average score for
NER, and surpasses only NS for POS tagging.
We speculate that this is due to the following: i)
adapting the protocol from Rahimi et al. (2019) to
our model implies assigning a separate classifier
head to each task–language pair, each of which is
exposed to fewer examples compared to a shared
one. This fragmentation fails to take advantage of
the massively multilingual nature of the encoder;
ii) our language sample is more typologically
diverse, which means that most source languages
are unreliable predictors. On the other hand, JM
yields extremely competitive scores. Similarly
to our model, it integrates knowledge from mul-
tiple languages and tasks. The extra boost in
our model stems from its ability to disentangle
each aspect of such knowledge and recombine it
appropriately.
Moreover, comparing the two approximate in-
ference schemes from Section 3.1, PF-lr obtains
a small but statistically significant improvement
over PF-d in NER, whereas they achieve the
same performance on POS tagging. This means
that the posterior is modeled well enough by a
Gaussian where covariance among co-variates is
negligible.
We see that even for the best model (PF-lr) there
is a wide variation in the scores for the same task
across languages. POS tagging accuracy ranges
from 12.56 ± 4.07 in Guaraní to 86.71 ± 0.67
in Galician, and NER F1 scores range from
49.44 ± 0.69 in Amharic to 96.20 ± 0.11 in Upper
Sorbian. Part of this variation is explained by
the fact that the multilingual BERT encoder is not
pre-trained in a subset of these languages (e.g.,
Amharic, Guaraní, Uyghur). Another cause is
more straightforward: The scores are expected to
be lower in languages for which we have fewer
training examples in the seen task–language pairs.
         |    |L| = 11     |    |L| = 22
Task     |   Sim     Dif   |   Sim     Dif
POS      |  72.44   53.25  |  66.59   63.22
NER      |  89.51   81.73  |  86.78   85.12

Table 2: Average performance when relying on |L| similar (Sim) versus different (Dif) languages in the train and evaluation sets.
5.2 Language Distance and Sample Size
While we designed the language sample to be both
realistic and representative of the cross-lingual
variation, there are several factors inherent to
a sample that can affect the zero-shot transfer
performance: i) language distance, the similarity
between seen and held-out languages; and ii)
sample size, the number of seen languages. In
order to disentangle these factors, we construct
subsets of size |L| so that training and evaluation
languages are either maximally similar (Sim) or
maximally different (Dif ). As a proxy measure,
we consider as ‘similar’ languages belonging to
the same family. In Table 2, we report the perfor-
mance of parameter factorization with diagonal
plus low-rank covariance (PF-lr), the best model
from Section 5.1, for each of these subsets.
Based on Table 2, there emerges a trade-off
between language distance and sample size. In
particular, performance is higher in Sim subsets
compared to Dif subsets for both tasks (POS and
NER) and for both sample sizes |L| ∈ {11, 22}. In
larger sample sizes, the average performance in-
creases for Dif but decreases for Sim. Intuitively,
languages with labeled data for several relatives
benefit from small, homogeneous subsets. Intro-
ducing further languages introduces noise. In-
stead, languages where this is not possible (such as
isolates) benefit from an increase in sample size.
5.3 Entropy of the Predictive Distribution
A notable problem of point estimate methods
is their tendency to assign most of the proba-
bility mass to a single class even in scenarios
with high uncertainty. Zero-shot transfer is one
such scenario, because it involves drastic
distribution shifts in the data (Rabanser et al.,
2019). A key advantage of Bayesian inference, in-
stead, is marginalization over parameters, which
yields smoother posterior predictive distributions
(Kendall and Gal, 2017; Wilson, 2019).
Figure 3: Entropy of the posterior predictive distributions over classes for each test example. The higher the
entropy, the more uncertain the prediction.
We run an analysis of predictions based on
(approximate) Bayesian model averaging. First,
we randomly sample 800 examples from each test
set of a task–language pair. For each example,
we predict a distribution over classes Y through
model averaging based on 10 samples from the
posteriors. We then measure the prediction entropy of each example—that is, H(p) = −Σ_{y=1}^{|Y|} p(Y = y) ln p(Y = y)—whose plot is shown in Figure 3.
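For reference, a few lines computing H(p), with the convention 0 · ln 0 = 0; the uniform 9-class case reproduces the ≈ 2.2 bound mentioned in footnote 14 (the function name is ours).

```python
# Sketch: prediction entropy H(p) = -sum_y p(y) ln p(y) for an averaged distribution.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0                        # define 0 * ln 0 = 0
    return float(-(p[nz] * np.log(p[nz])).sum())

print(entropy(np.full(9, 1 / 9)))     # uniform over 9 NER classes: ln 9 ≈ 2.20
print(entropy([1.0, 0.0, 0.0]))       # all mass on one class: 0.0
```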
Entropy is a measure of uncertainty. Intuitively,
the uniform categorical distribution (maximum
uncertainty) has the highest entropy, whereas if
the whole probability mass falls into a single
class (maximum confidence), then the entropy
H = 0.14 As it emerges from Figure 3, predictions
in certain languages tend to have higher entropy
on average, such as in Amharic, Guaraní, Uyghur,
or Assyrian Neo-Aramaic. This aligns well with
the performance metrics in Figure 2. In practice,
languages with low scores tend to display high
entropy in the predictive distribution, as expected.
To verify this claim, we measure the Pearson's correlation between entropies of each task–language pair in Figure 3 and performance metrics. We find a very strong negative correlation with a coefficient of ρ = −0.914 and a two-tailed p-value of 1.018 × 10−26.
14The maximum entropy is ≈ 2.2 for 9 classes as in NER and ≈ 2.83 for 17 classes as in POS tagging.
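The correlation itself is a standard Pearson computation; a toy sketch with hypothetical numbers follows (the real inputs are the per-pair mean entropies and the scores from Figures 2 and 3; scipy's pearsonr is assumed available).

```python
# Sketch: correlate per-pair mean prediction entropy with performance (toy numbers).
import numpy as np
from scipy.stats import pearsonr

entropies = np.array([0.1, 0.4, 0.9, 1.5, 2.0])     # hypothetical per-pair mean entropy
scores = np.array([95.0, 88.0, 74.0, 55.0, 40.0])   # hypothetical F1 / accuracy
rho, p = pearsonr(entropies, scores)
print(rho, p)                                        # strongly negative, as in Section 5.3
```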
6 Related Work
Our approach builds on ideas from several dif-
ferent fields: cross-lingual transfer in NLP, with
a particular focus on sequence labeling tasks, as
well as matrix factorization, contextual parameter
generation, and neural Bayesian methods.
Cross-Lingual Transfer for Sequence Labeling.
One of the two dominant approaches for cross-
lingual transfer is projecting annotations from a
source language text to a target language text.
This technique was pioneered by Yarowsky et al.
(2001) and Hwa et al. (2005) for parsing, and later
extended to applications such as POS tagging (Das
and Petrov, 2011; Garrette et al., 2013; T¨ackstr¨om
et al., 2012; Duong et al., 2014; Huck et al., 2019)
and NER (Ni et al., 2017; Enghoff et al., 2018;
Agerri et al., 2018; Jain et al., 2019). This requires
tokens to be aligned through a parallel corpus, a
machine translation system, or a bilingual dictio-
nary (Durrett et al., 2012; Mayhew et al., 2017).
However, creating machine translation and word-
alignment systems demands parallel texts in the
first place, while automatically induced bilingual
lexicons are noisy and offer only limited cover-
age (Artetxe et al., 2018; Duan et al., 2020).
Furthermore, errors inherent
to such systems
cascade along the projection pipeline (Agi´c et al.,
2015).
The second approach, model transfer, offers
higher flexibility (Conneau et al., 2018). The main
idea is to train a model directly on the source data,
and then deploy it onto target data (Zeman and
Resnik 2008). Crucially, bridging between differ-
ent lexica requires input features to be language-
agnostic. While originally this implied delexi-
calization, replacing words with universal POS
tags (McDonald et al., 2011; Dehouck and Denis,
2017), cross-lingual Brown clusters (Täckström
et al., 2012; Rasooli and Collins, 2017), or cross-
lingual knowledge base grounding through wikifi-
cation (Camacho-Collados et al., 2016; Tsai et al.,
2016), more recently these have been supplanted
by cross-lingual word embeddings (Ammar et al.
2016; Zhang et al., 2016; Xie et al., 2018; Ruder
et al., 2019b) and multilingual pretrained language
models (Devlin et al., 2019; Conneau et al., 2020).
An orthogonal research thread regards the
selection of the source language(s). In particular,
multi-source transfer was shown to surpass single-
best source transfer in NER (Fang and Cohn, 2017;
Rahimi et al., 2019) and POS tagging (Enghoff
et al., 2018; Plank and Agić, 2018). Our parameter
space factorization model can be conceived as
an extension of multi-source cross-lingual model
transfer to a cross-task setting.
Data Matrix Factorization. Although we are
the first to propose a factorization of the param-
eter space for unseen combinations of tasks and
languages, the factorization of data for collab-
orative filtering and social recommendation is
an established research area. In particular, the
missing values in sparse data structures such
as user-movie review matrices can be filled via
probabilistic matrix factorization (PMF) through
a linear combination of user and movie matrices
(Mnih and Salakhutdinov, 2008; Ma et al., 2008;
Shan and Banerjee, 2010, inter alia) or through
neural networks (Dziugaite and Roy, 2015). Infer-
ence for PMF can be carried out through MAP
inference (Dziugaite and Roy, 2015), Markov
chain Monte Carlo (Salakhutdinov and Mnih,
2008) or stochastic variational inference (Stolee
and Patterson, 2019). Contrary to prior work, we
perform factorization on latent variables (task-
and language-specific parameters) rather than
observed ones (data).
Contextual Parameter Generation. Our model
is reminiscent of the idea that parameters can be
conditioned on language representations, as pro-
posed by Platanios et al. (2018). However, since
this approach is limited to a single task and a joint
learning setting, it is not suitable for generalization
in a zero-shot transfer setting.
Bayesian Neural Models. So far, these models
have found only limited application in NLP for
resource-poor languages, despite their desirable
properties. Firstly,
they can incorporate priors
over parameters to endow neural networks with
the correct inductive biases towards language:
Ponti et al. (2019b) constructed a prior imbued
with universal
linguistic knowledge for zero-
and few-shot character-level language modeling.
Secondly, they avoid the risk of over-fitting by
taking into account uncertainty. For instance,
Shareghi et al. (2019) and Doitch et al. (2019)
use a perturbation model to sample high-quality
and diverse solutions for structured prediction in
cross-lingual parsing.
7 Conclusion
The main contribution of our work is a Bayesian
generative model for multiple NLP tasks and
languages. At its core lies the idea that the space
of neural weights can be factorized into latent
variables for each task and each language. While
training data are available only for a meager sub-
set of task–language combinations, our model
opens up the possibility to perform prediction in
novel, undocumented combinations at evaluation
time. We performed inference through stochastic
variational methods, and ran experiments on zero-
shot named entity recognition (NER) and part-
of-speech (POS) tagging in a typologically diverse
set of 33 languages. Based on the reported results,
we conclude that leveraging the information from
tasks and languages simultaneously is superior
to model transfer from English (relying on more
abundant in-task data in the source language),
from the most typologically similar language
(relying on prior information on language related-
ness), or from multiple source languages. More-
over, we found that the entropy of predictive
posterior distributions obtained through Bayesian
model averaging correlates almost perfectly with
the error rate in the prediction. As a consequence,
our approach holds promise to alleviating data
paucity issues for a wide spectrum of languages
and tasks, and to make knowledge transfer more
robust to uncertainty.
Finally, we remark that our model is amenable to being extended to multilingual tasks beyond sequence labeling—such as natural language inference (Conneau et al., 2018) and question answering (Artetxe et al., 2020; Lewis et al., 2019; Clark et al., 2020)—and to zero-shot transfer across combinations of multiple modalities (e.g., speech, text, and vision) with tasks and languages. We leave these exciting research threads for future research.
Acknowledgments
We would like to thank action editor Jacob
Eisenstein and the three anonymous reviewers
at TACL. This work is supported by the ERC
Consolidator Grant LEXICAL (no 648909) and
the Google Faculty Research Award 2018. RR
was partially funded by ISF personal grant no.
1625/18.
References
Rodrigo Agerri, Xavier G´omez Guinovart,
German Rigau, and Miguel Anxo Solla Portela.
2018. Developing new linguistic resources and
tools for the Galician language. In Proceedings
of LREC.
Željko Agić. 2017. Cross-lingual parser selection
for low-resource languages. In Proceedings of
the NoDaLiDa 2017 Workshop on Universal
Dependencies (UDW 2017), pages 1–10.
Željko Agić, Dirk Hovy, and Anders Søgaard.
2015. If all you have is a bit of the Bible: Learn-
ing POS taggers for truly low-resource lan-
guages. In Proceedings of ACL, pages 268–272.
Željko Agić, Anders Johannsen, Barbara Plank, Héctor Martínez Alonso, Natalie Schluter, and Anders Søgaard. 2016. Multilingual projection for parsing truly low-resource languages. Transactions of the ACL, 4:301–312. DOI: https://doi.org/10.1162/tacl_a_00100
Waleed Ammar, George Mulcaire, Miguel
Ballesteros, Chris Dyer, and Noah A. Smith.
2016. Many languages, one parser. Transac-
tions of the ACL, 4:431–444. DOI: https://doi.org/10.1162/tacl_a_00109
Mikel Artetxe, Gorka Labaka, Eneko Agirre, and
Kyunghyun Cho. 2018. Unsupervised neural
machine translation. In Proceedings of ICLR.
DOI: https://doi.org/10.18653/v1
/D18-1399
Mikel Artetxe, Sebastian Ruder, and Dani
Yogatama. 2020. On the cross-lingual transfer-
ability of monolingual representations. In Pro-
ceedings of ACL. DOI: https://doi.org
/10.18653/v1/2020.acl-main.421
Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the ACL, 7:597–610. DOI: https://doi.org/10.1162/tacl_a_00288
Johannes Bjerva and Isabelle Augenstein. 2018. From phonology to syntax: Unsupervised linguistic typology at different levels with language embeddings. In Proceedings of NAACL-HLT, pages 907–916. DOI: https://doi.org/10.18653/v1/N18-1083
Johannes Bjerva, Robert Östling, Maria Han Veiga, Jörg Tiedemann, and Isabelle Augenstein. 2019. What do language representations really represent? Computational Linguistics, 45(2):381–389. DOI: https://doi.org/10.1162/coli_a_00351
Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. Weight uncertainty in neural networks. In Proceedings of ICML, pages 1613–1622.
José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence, 240:36–64. DOI: https://doi.org/10.1016/j.artint.2016.07.005
Jonathan H. Clark, Eunsol Choi, Michael Collins,
Dan Garrette, Tom Kwiatkowski, Vitaly
Nikolaev, and Jennimaria Palomaki. 2020.
Tydi qa: A benchmark for information-seeking
question answering in typologically diverse
languages. Transactions of the Association for
Computational Linguistics.
Alexis Conneau, Kartikay Khandelwal, Naman
Goyal, Vishrav Chaudhary, Guillaume Wenzek,
Francisco Guzm´an, Edouard Grave, Myle Ott,
Luke Zettlemoyer, and Veselin Stoyanov.
2020. Unsupervised cross-lingual representa-
tion learning at scale. In Proceedings of ACL.
DOI: https://doi.org/10.18653/v1
/2020.acl-main.747
Alexis Conneau, Guillaume Lample, Ruty Rinott,
Adina Williams, Samuel R. Bowman, Holger
Schwenk, and Veselin Stoyanov. 2018. XNLI:
Evaluating cross-lingual sentence representa-
In Proceedings of EMNLP, pages
tions.
2475–2485. DOI: https://doi.org/10
.18653/v1/D18-1269
Ryan Cotterell and Kevin Duh. 2017. Low-
resource named entity recognition with cross-
lingual, character-level neural conditional random
fields. In Proceedings of IJNLP, pages 91–96.
Taipei, Taiwan. DOI: https://doi.org
/10.18653/v1/D17-1078
Ryan Cotterell
and Georg Heigold. 2017.
Cross-lingual character-level neural morpho-
logical tagging. In Proceedings of EMNLP,
pages 748–759.
Dipanjan Das and Slav Petrov. 2011. Unsuper-
vised part-of-speech tagging with bilingual
graph-based projections. In Proceedings of
ACL, pages 600–609.
Mathieu Dehouck and Pascal Denis. 2017.
Delexicalized word embeddings for cross-
lingual dependency parsing. In Proceedings of
EACL, pages 241–250.
Aliya Deri and Kevin Knight. 2016. Grapheme-
to-phoneme models for (almost) any language.
In Proceedings of ACL, pages 399–408.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
Amichay Doitch, Ram Yazdi, Tamir Hazan, and Roi Reichart. 2019. Perturbation based learning for structured NLP tasks with application to dependency parsing. Transactions of the ACL, 7:643–659. DOI: https://doi.org/10.1162/tacl_a_00291
Xiangju Duan, Baijun Ji, Hao Jia, Min Tan,
Min Zhang, Boxing Chen, Weihua Luo, and
Yue Zhang. 2020. Bilingual dictionary based
neural machine translation without using par-
allel sentences. In Proceedings of ACL. DOI:
https://doi.org/10.18653/v1/2020
.acl-main.143
John Duchi. 2007, Derivations for linear algebra
and optimization, University of California,
Berkeley.
Long Duong, Trevor Cohn, Karin Verspoor,
Steven Bird, and Paul Cook. 2014. What can we
get from 1000 tokens? A case study of multilin-
gual POS tagging for resource-poor languages.
In Proceedings of EMNLP, pages 886–897.
DOI: https://doi.org/10.3115/v1
/D14-1096
Greg Durrett, Adam Pauls, and Dan Klein. 2012.
Syntactic transfer using a bilingual lexicon. In
Proceedings of EMNLP-CoNLL, pages 1–11.
Gintare Karolina Dziugaite and Daniel M. Roy.
2015. Neural network matrix factorization.
arXiv preprint arXiv:1511.06443.
Jan Vium Enghoff, Søren Harrison, and Željko
Agić. 2018. Low-resource named entity recog-
nition via multi-source projection: Not quite
there yet? In Proceedings of the 2018 EMNLP
Workshop W-NUT: The 4th Workshop on Noisy
User-generated Text, pages 195–201. DOI:
https://doi.org/10.18653/v1/W18
-6125
Meng Fang and Trevor Cohn. 2017. Model trans-
fer for tagging low-resource languages using
a bilingual dictionary. In Proceedings of ACL,
pages 587–593. DOI: https://doi.org
/10.18653/v1/P17-2093
Dan Garrette, Jason Mielens, and Jason Baldridge.
2013. Real-world semi-supervised learning of
POS-taggers for low-resource languages. In
Proceedings of ACL, pages 583–592.
Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti,
Roi Reichart, and Anna Korhonen. 2018. On the
relation between linguistic typology and (limi-
tations of) multilingual language modeling. In
Proceedings of EMNLP, pages 316–327. DOI:
https://doi.org/10.18653/v1/D18
-1029
Harald Hammarström, Robert Forkel, Martin
Haspelmath, and Sebastian Bank, editors . 2016.
Glottolog 2.7, Max Planck Institute for the
Science of Human History, Jena.
Matthew D. Hoffman, David M. Blei, Chong
Wang, and John Paisley. 2013. Stochastic
variational inference. The Journal of Machine
Learning Research, 14(1):1303–1347.
Matthias Huck, Diana Dutka, and Alexander
Fraser. 2019. Cross-lingual annotation pro-
jection is effective for neural part-of-speech
tagging. In Proceedings of the Sixth Work-
shop on NLP for Similar Languages, Varieties
and Dialects, pages 223–233. DOI: https://
doi.org/10.18653/v1/W19-1425
Rebecca Hwa, Philip Resnik, Amy Weinberg,
Clara I. Cabezas, and Okan Kolak. 2005.
Bootstrapping parsers via syntactic projection
across parallel texts. Natural Language Engi-
neering, 11(3):311–325. DOI: https://doi
.org/10.1017/S1351324905003840
Alankar Jain, Bhargavi Paranjape, and Zachary C. Lipton. 2019. Entity projection via machine translation for cross-lingual NER. In Proceedings of EMNLP-IJCNLP, pages 1083–1092. DOI: https://doi.org/10.18653/v1/D19-1100
Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351. DOI: https://doi.org/10.1162/tacl_a_00065
Katharina Kann, Ryan Cotterell, and Hinrich
Schütze. 2017. One-shot neural cross-lingual
transfer for paradigm completion. In Proceed-
ings of ACL, pages 1993–2003.
Alex Kendall and Yarin Gal. 2017. What
uncertainties do we need in Bayesian deep
learning for computer vision? In Proceedings
of NeurIPS, pages 5574–5584.
Diederik P. Kingma and Jimmy L. Ba. 2015.
Adam: A method for stochastic optimization.
In Proceedings of ICLR.
Diederik P. Kingma and Max Welling. 2014.
Auto-encoding variational Bayes. In Proceed-
ings of ICLR.
Patrick S. H. Lewis, Barlas Oğuz, Ruty Rinott,
Sebastian Riedel, and Holger Schwenk. 2019.
MLQA: Evaluating cross-lingual extractive
question answering. CoRR, abs/1910.07475.
Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui
Li, Yuyan Zhang, Mengzhou Xia, Shruti
Rijhwani, Junxian He, Zhisong Zhang, Xuezhe
Ma, Antonios Anastasopoulos, Patrick Littell,
and Graham Neubig. 2019. Choosing transfer
languages for cross-lingual learning. In Pro-
ceedings of ACL, pages 3125–3135.
Patrick Littell, David R. Mortensen, Ke Lin,
Katherine Kairis, Carlisle Turner, and Lori
Levin. 2017. URIEL and lang2vec: Repre-
senting languages as typological, geographical,
and phylogenetic vectors. In Proceedings of
EACL, pages 8–14. DOI: https://doi
.org/10.18653/v1/E17-2002
Hao Ma, Haixuan Yang, Michael R. Lyu, and
Irwin King. 2008. SoRec: Social recommenda-
tion using probabilistic matrix factorization. In
Proceedings of CIKM, pages 931–940. DOI:
https://doi.org/10.1145/1458082
.1458205, PMID: 19021718
Chaitanya Malaviya, Graham Neubig, and Patrick
Littell. 2017. Learning language representations
for typology prediction. In Proceedings of
EMNLP, pages 2529–2535.
Stephen Mayhew, Chen-Tse Tsai, and Dan Roth.
2017. Cheap translation for cross-lingual named
entity recognition. In Proceedings of EMNLP,
pages 2536–2545.
Ryan McDonald, Slav Petrov, and Keith Hall.
2011. Multi-source transfer of delexicalized
dependency parsers. In Proceedings of EMNLP,
pages 62–72.
Andriy Mnih and Ruslan Salakhutdinov. 2008.
Probabilistic matrix factorization. In Proceed-
ings of NeurIPS, pages 1257–1264.
Jian Ni, Georgiana Dinu, and Radu Florian.
2017. Weakly supervised cross-lingual named
entity recognition via effective annotation and
representation projection. In Proceedings of
ACL, pages 1470–1480.
Joakim Nivre, Mitchell Abrams,
ˇZeljko Agi´c,
Lars Ahrenberg, Gabriel˙e Aleksandraviˇci¯ut˙e,
Lene Antonsen, Katya Aplonova, Maria Jesus
Aranzabe, Gashaw Arutie, Masayuki Asahara,
Luma Ateyah, Mohammed Attia, Aitziber
Atutxa, Liesbeth Augustinus, Elena Badmaeva,
Miguel Ballesteros, Esha Banerjee, Sebastian
Bank, Verginica Barbu Mititelu, Victoria
Basmov, John Bauer, Sandra Bellato, Kepa
Bengoetxea, Yevgeni Berzak, Irshad Ahmad
Bhat, Riyaz Ahmad Bhat, Erica Biagetti,
Eckhard Bick, Agn˙e Bielinskien˙e, Rogier
Blokland, Victoria Bobicev, Lo¨ıc Boizou,
Emanuel Borges V¨olker, Carl B¨orstell, Cristina
Bosco, Gosse Bouma, Sam Bowman, Adriane
Boyd, Kristina Brokait˙e, Aljoscha Burchardt,
Marie Candito, Bernard Caron, Gauthier Caron,
G¨uls¸en Cebiro˘glu Eryi˘git, Flavio Massimiliano
Cecchini, Giuseppe G. A. Celano, Slavom´ır
ˇC´epl¨o, Savas Cetin, Fabricio Chalub, Jinho
Choi, Yongseok Cho, Jayeol Chun, Silvie
Cinkov´a, Aur´elie Collomb, C¸ a˘gri C¸ ¨oltekin,
Miriam Connor, Marine Courtin, Elizabeth
Davidson, Marie-Catherine
de Marneffe,
Valeria de Paiva, Arantza Diaz de Ilarraza,
Carly Dickerson, Bamba Dione, Peter Dirix,
Kaja Dobrovoljc, Timothy Dozat, Kira
Droganova, Puneet Dwivedi, Hanne Eckhoff,
Marhaba Eli, Ali Elkahky, Binyam Ephrem,
Tomaˇz Erjavec, Aline Etienne, Rich´ard Farkas,
Hector Fernandez Alcalde, Jennifer Foster,
Cl´audia Freitas, Kazunori Fujita, Katar´ına
Gajdoˇsov´a, Daniel Galbraith, Marcos Garcia,
Moa G¨ardenfors, Sebastian Garza, Kim Gerdes,
Filip Ginter, Iakes Goenaga, Koldo Gojenola,
Memduh G¨okirmak, Yoav Goldberg, Xavier
G´omez Guinovart, Berta Gonz´alez Saavedra,
Matias Grioni, Normunds Gr¯uz¯ıtis, Bruno
Guillaume, C´eline Guillot-Barbance, Nizar
Habash, Jan Hajiˇc, Jan Hajiˇc jr., Linh H`a M˜y,
Na-Rae Han, Kim Harris, Dag Haug, Johannes
Heinecke, Felix Hennig, Barbora Hladk´a,
Jaroslava Hlav´aˇcov´a, Florinel Hociung, Petter
Hohle, Jena Hwang, Takumi Ikeda, Radu Ion,
Elena Irimia, Olájídé Ishola, Tomáš Jelínek,
Anders Johannsen, Fredrik Jørgensen, Hüner
Kasikara, Andre Kaasen, Sylvain Kahane,
Hiroshi Kanayama, Jenna Kanerva, Boris Katz,
Tolga Kayadelen, Jessica Kenney, V´aclava
Kettnerov´a,
Jesse Kirchner, Arne K¨ohn,
Kamil Kopacewicz, Natalia Kotsyba, Jolanta
Kovalevskait˙e, Simon Krek, Sookyoung Kwak,
Veronika Laippala, Lorenzo Lambertino, Lucia
Lam, Tatiana Lando, Septina Dian Larasati,
Alexei Lavrentiev, John Lee, Phuong Lˆe Hong,
Alessandro Lenci, Saran Lertpradit, Herman
Leung, Cheuk Ying Li, Josie Li, Keying Li,
KyungTae Lim, Yuan Li, Nikola Ljubeˇsi´c,
Olga Loginova, Olga Lyashevskaya, Teresa
Lynn, Vivien Macketanz, Aibek Makazhanov,
Michael Mandl, Christopher Manning, Ruli
Manurung, C˘at˘alina M˘ar˘anduc, David Mareˇcek,
Katrin Marheinecke, H´ector Mart´ınez Alonso,
Andr´e Martins, Jan Maˇsek, Yuji Matsumoto,
Ryan McDonald,
Sarah McGuinness,
Gustavo Mendonça, Niko Miekka, Margarita
Misirpashayeva, Anna Missil¨a, C˘at˘alin Mititelu,
Yusuke Miyao, Simonetta Montemagni, Amir
More, Laura Moreno Romero, Keiko Sophie
Mori, Tomohiko Morioka, Shinsuke Mori,
Shigeki Moro, Bjartur Mortensen, Bohdan
Moskalevskyi, Kadri Muischnek, Yugo
Murawaki, Kaili M¨u¨urisep, Pinkey Nainwani,
Juan Ignacio Navarro Horñiacek, Anna Nedoluzhko, Gunta Nešpore-Berzkalne, Lương Nguyễn Thị, Huyền Nguyễn Thị Minh,
Yoshihiro Nikaido, Vitaly Nikolaev, Rattima
Nitisaroj, Hanna Nurmi, Stina Ojala, Ad´eday
Ol´u`okun, Mai Omura, Petya Osenova, Robert
¨Ostling, Lilja Øvrelid, Niko Partanen, Elena
Pascual, Marco Passarotti, Agnieszka Patejuk,
Guilherme Paulino-Passos, Angelika Peljak-
Lapi´nska, Siyao Peng, Cenel-Augusto Perez,
Guy Perrier, Daria Petrova, Slav Petrov, Jussi
Piitulainen, Tommi A Pirinen, Emily Pitler,
Barbara Plank, Thierry Poibeau, Martin Popel,
Lauma Pretkalnin¸a, Sophie Pr´evost, Prokopis
Prokopidis, Adam Przepi´orkowski, Tiina
Puolakainen, Sampo Pyysalo, Andriela R¨a¨abis,
Alexandre Rademaker, Loganathan Ramasamy,
Taraka Rama, Carlos Ramisch, Vinit
Ravishankar, Livy Real, Siva Reddy, Georg
Rehm, Michael Rießler, Erika Rimkut˙e, Larissa
Rinaldi, Laura Rituma, Luisa Rocha, Mykhailo
Romanenko, Rudolf Rosa, Davide Rovati,
Valentin Ros¸ca, Olga Rudina, Jack Rueter,
Shoval Sadde, Benoît Sagot, Shadi Saleh,
Alessio Salomoni, Tanja Samardˇzi´c, Stephanie
Samson, Manuela Sanguinetti, Dage S¨arg,
Baiba Saul¯ıte, Yanin Sawanakunanon, Nathan
Schneider, Sebastian Schuster, Djam´e Seddah,
Wolfgang Seeker, Mojgan Seraji, Mo Shen,
Atsuko Shimada, Hiroyuki Shirasu, Muh
Shohibussirri, Dmitry
Sichinava, Natalia
Silveira, Maria Simi, Radu Simionescu, Katalin
Simk´o, M´aria ˇsimkov´a, Kiril Simov, Aaron
Smith, Isabela Soares-Bastos, Carolyn Spadine,
Antonio Stella, Milan Straka, Jana Strnadov´a,
Alane Suhr, Umut Sulubacak, Shingo Suzuki,
Zsolt Sz´ant´o, Dima Taji, Yuta Takahashi, Fabio
Tamburini, Takaaki Tanaka, Isabelle Tellier,
Guillaume Thomas, Liisi Torga, Trond
Trosterud, Anna Trukhina, Reut Tsarfaty,
Francis Tyers, Sumire Uematsu, Zdeˇnka
Ureˇsov´a, Larraitz Uria, Hans Uszkoreit,
Sowmya Vajjala, Daniel van Niekerk, Gertjan
van Noord, Viktor Varga, Eric Villemonte de
la Clergerie, Veronika Vincze, Lars Wallin,
Abigail Walsh, Jing Xian Wang, Jonathan
North Washington, Maximilan Wendt, Seyi
Williams, Mats Wir´en, Christian Wittern,
Tsegay Woldemariam, Tak-sum Wong, Alina
Wr´oblewska, Mary Yako, Naoki Yamazaki,
Chunxiao Yan, Koichi Yasuoka, Marat M.
Yavrumyan, Zhuoran Yu, Zdenˇek ˇzabokrtsk´y,
Amir Zeldes, Daniel Zeman, Manying Zhang,
and Hanzhi Zhu. 2019. Universal Dependen-
cies 2.4. LINDAT/CLARIN digital library at
the Institute of Formal and Applied Linguistics
( ´UFAL), Faculty of Mathematics and Physics,
Charles University.
Robert ¨Ostling and J¨org Tiedemann. 2017. Con-
tinuous multilinguality with language vectors.
In Proceedings of EACL, pages 644–649.
Xiaoman Pan, Boliang Zhang, Jonathan May, Joel
Nothman, Kevin Knight, and Heng Ji. 2017.
Cross-lingual name tagging and linking for 282
languages. In Proceedings of ACL, volume 1,
pages 1946–1958.
Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 7–14. DOI: https://doi.org/10.18653/v1/W19-4302
Telmo Pires, Eva Schlinger, and Dan Garrette.
2019. How multilingual is multilingual BERT?
In Proceedings of ACL, pages 4996–5001. DOI:
https://doi.org/10.18653/v1/P19
-1493
Barbara Plank and ˇZeljko Agi´c. 2018. Distant
supervision from disparate sources for low-
resource part-of-speech tagging. In Proceed-
ings of EMNLP, pages 614–620. DOI:
https://doi.org/10.18653/v1/D18
-1061
Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contextual parameter generation for universal neural machine translation. In Proceedings of EMNLP, pages 425–435. DOI: https://doi.org/10.18653/v1/D18-1039
Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. XCOPA: A multilingual dataset for causal commonsense reasoning. In Proceedings of EMNLP.
Edoardo Maria Ponti, Helen O'Horan, Yevgeni Berzak, Ivan Vulić, Roi Reichart, Thierry Poibeau, Ekaterina Shutova, and Anna Korhonen. 2019a. Modeling language variation and universals: A survey on typological linguistics for natural language processing. Computational Linguistics, 45(3):559–601. DOI: https://doi.org/10.1162/coli_a_00357
Edoardo Maria Ponti, Roi Reichart, Anna
Korhonen, and Ivan Vuli´c. 2018. Isomorphic
transfer of syntactic structures in cross-lingual
NLP. In Proceedings of ACL, pages 1531–1542.
Edoardo Maria Ponti, Ivan Vuli´c, Ryan Cotterell,
Roi Reichart, and Anna Korhonen. 2019b.
Towards zero-shot language modeling. In Pro-
ceedings of EMNLP, pages 2900–2910.
Stephan Rabanser, Stephan Günnemann, and Zachary Lipton. 2019. Failing loudly: An empirical study of methods for detecting dataset shift. In Proceedings of NeurIPS, pages 1394–1406.
Afshin Rahimi, Yuan Li, and Trevor Cohn. 2019.
Massively multilingual transfer for NER. In
Proceedings of ACL, pages 151–164. DOI:
https://doi.org/10.18653/v1/P19
-1015
Mohammad Sadegh Rasooli
and Michael
Collins. 2017. Cross-lingual syntactic transfer
with limited resources. Transactions of
the
Association for Computational Linguistics,
5:279–293.
Shruti Rijhwani, Jiateng Xie, Graham Neubig,
and Jaime G. Carbonell. 2019. Zero-shot
neural transfer for cross-lingual entity linking.
In Proceedings of AAAI, pages 6924–6931.
DOI: https://doi.org/10.1609/aaai
.v33i01.33016924
Rudolf Rosa and Zdenˇek ˇZabokrtsk´y. 2015.
KLcpos3 – a language similarity measure for
delexicalized parser transfer. In Proceedings of ACL, pages 243–249. DOI: https://doi.org/10.3115/v1/P15-2040
Sebastian Ruder, Matthew E. Peters, Swabha
Swayamdipta, and Thomas Wolf. 2019a. Trans-
fer learning in natural language processing.
In Proceedings of NAACL-HLT: Tutorials,
pages 15–18. DOI: https://doi.org/10
.18653/v1/N19-5004
Sebastian Ruder, Ivan Vuli´c, and Anders Søgaard.
2019b. A survey of cross-lingual embedding
models. Journal of Artificial Intelligence Re-
search, 65:569–631. DOI: https://doi.org/10.1613/jair.1.11640
Ruslan Salakhutdinov and Andriy Mnih. 2008.
Bayesian probabilistic matrix factorization
using Markov chain Monte Carlo. In Proceedings of ICML, pages 880–887. DOI: https://doi.org/10.1145/1390156.1390267
Hanhuai Shan and Arindam Banerjee. 2010.
Generalized probabilistic matrix factorizations
for collaborative filtering. In Proceedings of
ICDM, pages 1025–1030. DOI: https://
doi.org/10.1109/ICDM.2010.116
Ehsan Shareghi, Yingzhen Li, Yi Zhu, Roi
Reichart, and Anna Korhonen. 2019. Bayesian
learning for neural dependency parsing. In Pro-
ceedings of NAACL-HLT, pages 3509–3519.
Benjamin Snyder and Regina Barzilay. 2010.
Climbing the tower of Babel: Unsupervised
multilingual learning. In Proceedings of ICML,
pages 29–36.
Jake Stolee and Neill Patterson. 2019. Matrix factorization with neural networks and stochastic variational inference. University of Toronto.
Oscar T¨ackstr¨om, Ryan McDonald, and Jakob
Uszkoreit. 2012. Cross-lingual word clusters
for direct transfer of linguistic structure. In
Proceedings of NAACL-HLT, pages 477–487.
Alon Talmor and Jonathan Berant. 2019. Mul-
tiQA: An empirical investigation of generaliza-
tion and transfer in reading comprehension.
In Proceedings of ACL, pages 4911–4921.
DOI: https://doi.org/10.18653/v1
/P19-1485
Chen-Tse Tsai, Stephen Mayhew, and Dan Roth.
2016. Cross-lingual named entity recognition
via wikification. In Proceedings of CoNLL, pages 219–228. DOI: https://doi.org/10.18653/v1/K16-1022
Andrew Gordon Wilson. 2019. The case for
Bayesian deep learning. NYU Courant Tech-
nical Report.
Shijie Wu and Mark Dredze. 2019. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of EMNLP-IJCNLP, pages 833–844.
Yonghui Wu, Mike Schuster, Zhifeng Chen,
Quoc V. Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao,
Klaus Macherey, and others. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. Google.
Jiateng Xie, Zhilin Yang, Graham Neubig,
Noah A. Smith, and Jaime Carbonell. 2018.
Neural cross-lingual named entity recognition
with minimal resources. In Proceedings of
EMNLP, pages 369–379.
David Yarowsky, Grace Ngai, and Richard
Wicentowski. 2001. Inducing multilingual text
analysis tools via robust projection across
aligned corpora. In Proceedings of the First
International Conference on Human Language
Technology Research, pages 1–8. DOI:
https://doi.org/10.3115/1072133
.1072187
Dani Yogatama, Cyprien de Masson d’Autume,
Jerome Connor, Tomas Kocisky, Mike
Chrzanowski, Lingpeng Kong, Angeliki
Lazaridou, Wang Ling, Lei Yu, Chris Dyer,
and others. 2019. Learning and evaluating
general linguistic intelligence. arXiv preprint
arXiv:1901.11373v1.
Daniel Zeman
and Philip Resnik.
2008.
Cross-language parser adaptation between
related languages. In Proceedings of IJCNLP,
pages 35–42.
Yuan Zhang, David Gaddy, Regina Barzilay,
and Tommi Jaakkola. 2016. Ten pairs to
tag – multilingual POS tagging via coarse
mapping between embeddings. In Proceedings of NAACL-HLT, pages 1307–1317. DOI: https://doi.org/10.18653/v1/N16-1156
Yftah Ziser and Roi Reichart. 2018. Deep
pivot-based modeling for cross-language cross-
domain transfer with minimal guidance. In
Proceedings of EMNLP, pages 238–249. DOI:
https://doi.org/10.18653/v1/D18
-1022
A KL-divergence of Gaussians
If both q ≜ N(µ, Σ) and p ≜ N(m, S) are multivariate Gaussians, their KL-divergence can be computed analytically as follows:
Figure 4: Samples from the posteriors of 4 languages,
PCA-reduced to 4 dimensions.
KL(q || p) = 1/2 [ ln(|S| / |Σ|) − d + tr(S⁻¹Σ) + (m − µ)⊤S⁻¹(m − µ) ]    (13)

By substituting m = 0 and S = I, it is trivial to obtain Equation (9).
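To make the closed form concrete, here is a minimal NumPy sketch (illustrative only; the function name and the randomly generated test covariance are ours, not taken from the paper's released code) that evaluates Equation (13) and checks that substituting m = 0 and S = I reduces it to the familiar KL term against a standard-normal prior:

import numpy as np

def kl_gaussians(mu, Sigma, m, S):
    # Closed-form KL(q || p) for q = N(mu, Sigma) and p = N(m, S), as in Eq. (13).
    d = mu.shape[0]
    S_inv = np.linalg.inv(S)
    diff = m - mu
    return 0.5 * (np.log(np.linalg.det(S) / np.linalg.det(Sigma)) - d
                  + np.trace(S_inv @ Sigma) + diff @ S_inv @ diff)

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)  # a well-conditioned covariance for q

# With m = 0 and S = I the expression reduces to
# 0.5 * (tr(Sigma) + mu^T mu - d - ln|Sigma|), the usual prior-matching term.
kl = kl_gaussians(mu, Sigma, np.zeros(d), np.eye(d))
kl_standard = 0.5 * (np.trace(Sigma) + mu @ mu - d - np.log(np.linalg.det(Sigma)))
assert np.isclose(kl, kl_standard)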
B Visualization of the Learned Posteriors
The approximate posteriors of the latent variables can be visualized in order to study the learned representations for languages. Previous work (Johnson et al., 2017; Östling and Tiedemann, 2017; Malaviya et al., 2017; Bjerva and Augenstein, 2018) induced point estimates of language representations from artificial tokens concatenated to every input sentence, or from the aggregated values of the hidden state of a neural encoder. The information contained in such representations depends on the task (Bjerva and Augenstein, 2018), but mainly reflects the structural properties of each language (Bjerva et al., 2019).
In our work, due to the estimation procedure,
languages are represented by full distributions
rather than point estimates. Inspecting the learned representations, we find that language similarities do not appear to follow the structural properties of languages. This is most likely because parameter factorization takes place after the multilingual BERT encoding, which blends the structural differences across languages. A fair comparison with previous work without such an encoder is left for future investigation.
As an example, consider two pairs of languages
from two distinct families: Yoruba and Wolof
are Niger-Congo languages from the Atlantic-Congo branch, whereas Tamil and Telugu are Dravidian. We take 1,000
samples from the approximate posterior over
the latent variables for each of these languages. In
particular, we focus on the variational scheme
with a low-rank covariance structure. We then
reduce the dimensionality of each sample to 4
through PCA,15 and we plot the density along
each resulting dimension in Figure 4. We observe
that density areas of each dimension do not nec-
essarily overlap between members of the same
family. Hence, the learned representations depend
on more than genealogy.
15Note that the dimensionality-reduced samples are also Gaussian, since PCA is a linear method.
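As a worked example of the visualization procedure, the sketch below draws samples from low-rank-covariance Gaussian posteriors, reduces them to 4 dimensions with PCA, and plots per-dimension densities. It is an illustrative reconstruction rather than the paper's released code: we assume the common low-rank parameterization Σ = BB⊤ + diag(d²), replace the learned variational parameters with random stand-ins, and rely on scikit-learn and matplotlib for PCA and plotting.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def sample_posterior(mu, B, log_d, n_samples, rng):
    # Reparameterized samples from N(mu, B B^T + diag(exp(log_d)^2)).
    eps_lr = rng.normal(size=(n_samples, B.shape[1]))  # low-rank noise
    eps_diag = rng.normal(size=(n_samples, mu.size))   # diagonal noise
    return mu + eps_lr @ B.T + eps_diag * np.exp(log_d)

rng = np.random.default_rng(0)
dim, rank, n = 64, 10, 1000
languages = ["Yoruba", "Wolof", "Tamil", "Telugu"]
# Random stand-ins for the learned variational parameters (mu, B, log_d).
posteriors = {lang: (rng.normal(size=dim),
                     0.1 * rng.normal(size=(dim, rank)),
                     -2.0 * np.ones(dim))
              for lang in languages}
samples = {lang: sample_posterior(mu, B, log_d, n, rng)
           for lang, (mu, B, log_d) in posteriors.items()}

# Fit PCA on the pooled samples, then project every language to 4 dimensions.
pca = PCA(n_components=4).fit(np.vstack(list(samples.values())))
reduced = {lang: pca.transform(s) for lang, s in samples.items()}

# Overlay histogram-based densities of the four languages, one panel per dimension.
fig, axes = plt.subplots(1, 4, figsize=(16, 3))
for i, ax in enumerate(axes):
    for lang, r in reduced.items():
        ax.hist(r[:, i], bins=40, density=True, histtype="step", label=lang)
    ax.set_title(f"PCA dimension {i + 1}")
axes[0].legend()
plt.tight_layout()
plt.show()

With the trained model's variational parameters in place of the random stand-ins, this yields the kind of per-dimension density comparison shown in Figure 4, where overlapping curves would indicate similar language representations.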