Parameter Space Factorization for Zero-Shot Learning
across Tasks and Languages
Edoardo M. Ponti κ, Ivan Vulić κ, Ryan Cotterell κ,ζ,
Marinela Parović κ, Roi Reichart τ, Anna Korhonen κ
κ University of Cambridge   ζ ETH Zürich   τ Technion, IIT
κ {ep490,iv250,rdc42,mp939,alk23}@cam.ac.uk
τ roiri@ie.technion.ac.il
Abstract

Most combinations of NLP tasks and language varieties lack in-domain examples for supervised training because of the paucity of annotated data. How can neural models make sample-efficient generalizations from task–language combinations with available data to low-resource ones? In this work, we propose a Bayesian generative model for the space of neural parameters. We assume that this space can be factorized into latent variables for each language and each task. We infer the posteriors over such latent variables based on data from seen task–language combinations through variational inference. This enables zero-shot classification on unseen combinations at prediction time. For instance, given training data for named entity recognition (NER) in Vietnamese and for part-of-speech (POS) tagging in Wolof, our model can perform accurate predictions for NER in Wolof. In particular, we experiment with a typologically diverse sample of 33 languages from 4 continents and 11 families, and show that our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods. Our code is available at github.com/cambridgeltl/parameter-factorization.
1 Introduction
The annotation efforts in NLP have achieved impressive feats, such as the Universal Dependencies (UD) project (Nivre et al., 2019), which now includes 83 languages. But even UD covers only a meager subset of the world's estimated 8,506 languages (Hammarström et al., 2016). What is more, the Association for Computational Linguistics Wiki1 lists 24 separate NLP tasks. Labeled data, which is both costly and labor-intensive, is missing for many of such task–language combinations. This shortage hinders the development of computational models for the majority of the world's languages (Snyder and Barzilay, 2010; Ponti et al., 2019a).
A common solution is transferring knowledge across domains, such as tasks and languages (Yogatama et al., 2019; Talmor and Berant, 2019), which holds promise to mitigate the lack of training data inherent to a large spectrum of NLP applications (Täckström et al., 2012; Agić et al., 2016; Ammar et al., 2016; Ponti et al., 2018; Ziser and Reichart, 2018, inter alia). In the most extreme scenario, zero-shot learning, no annotated examples are available for the target domain. In particular, zero-shot transfer across languages implies a change in the data domain, and leverages information from resource-rich languages to tackle the same task in a previously unseen target language (Lin et al., 2019; Rijhwani et al., 2019; Artetxe and Schwenk, 2019; Ponti et al., 2019a, inter alia). Zero-shot transfer across tasks within the same language (Ruder et al., 2019a), on the other hand, implies a change in the space of labels.
As our main contribution, we propose a Bayesian generative model of the neural parameter space. We assume this to be structured, and for this reason factorizable into task- and language-specific latent variables.2 By performing transfer of knowledge from both related tasks and related languages (i.e., from seen combinations), our model allows for zero-shot prediction on unseen task–language combinations.

1 aclweb.org/aclwiki/State of the art.
2 By latent variable we mean every variable that has to be inferred from observed (directly measurable) variables. To avoid confusion, we use the terms seen and unseen when referring to different task–language combinations.
Transactions of the Association for Computational Linguistics, vol. 9, pp. 410–428, 2021. https://doi.org/10.1162/tacl_a_00374
Action Editor: Jacob Eisenstein. Submission batch: 2/2020; Revision batch: 2/2021; Published 4/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
For example, the availability of annotated data for part-of-speech (POS) tagging in Wolof and for named-entity recognition (NER) in Vietnamese supplies plenty of information to infer a task-agnostic representation for Wolof and a language-agnostic representation for NER. Conditioning on these, the appropriate neural parameters for Wolof NER
can be generated at evaluation time. While this
idea superficially resembles matrix completion for
collaborative filtering (Mnih and Salakhutdinov,
2008; Dziugaite and Roy, 2015),
the neural
parameters are latent and are non-identifiable.
Rather than recovering missing entries from par-
tial observations, in our approach we reserve
latent variables to each language and each task to
tie together neural parameters for combinations
that have either of them in common.
We adopt a Bayesian perspective towards inference. The posterior distribution over the model's
latent variables is approximated through stochastic
variational inference (Hoffman et al., 2013, SVI).
Given the enormous number of parameters, we
also explore a memory-efficient inference scheme
based on a diagonal plus low-rank approximation
of the covariance matrix. This guarantees that our
model remains both expressive and tractable.
We evaluate the model on two sequence label-
ing tasks: POS tagging and NER, relying on a
typologically representative sample of 33 lan-
guages from 4 continents and 11 families. The results clearly indicate that our generative model surpasses standard baselines based on cross-lingual transfer 1) from the (typologically) nearest source language; 2) from the source language with the most abundant in-domain data (English); and
3) from multiple source languages, in the form
of either a multi-task, multi-lingual model with
parameter sharing (Wu and Dredze, 2019) or an
ensemble of task- and language-specific models
(Rahimi et al., 2019).
2 Bayesian Generative Model
In this work, we propose a Bayesian generative model for multi-task, multi-lingual NLP. We train a single Bayesian neural network for several tasks and languages jointly. Formally, we consider a set T = {t1, . . . , tn} of n tasks and a set L = {l1, . . . , lm} of m languages. The core modeling assumption we make is that the parameter space of the neural network is structured: specifically, we posit that certain parameters correspond to tasks and others correspond to languages.

Figure 1: A graph (plate notation) of the generative model based on parameter space factorization. Shaded circles refer to observed variables.
This structure assumption allows us to general-
ize to unseen task–language pairs. In this respect,
the model is reminiscent of matrix factorization
as applied to collaborative filtering (Mnih and
Salakhutdinov, 2008; Dziugaite and Roy, 2015).
We now describe our generative model in three
steps that match the nesting level of the plates in
the diagram in Figure 1. Equivalently, the reader
can follow the nesting level of the for loops in
Algorithm 1 for an algorithmic illustration of the
generative story.
(1) Sampling Task and Language Representations: To kick off our generative process, we first sample a latent representation for each of the tasks and languages from multivariate Gaussians: ti ∼ N(µti, Σti) ∈ Rh and lj ∼ N(µlj, Σlj) ∈ Rh, respectively. While we present the model in its most general form, we take µti = µlj = 0 and Σti = Σlj = I for the experimental portion of this paper.
(2) Sampling Task–Language-specific Parameters: Afterward, to generate task–language-specific neural parameters, we sample θij from N(fψ(ti, lj), diag(fφ(ti, lj))) ∈ Rd, where fψ(ti, lj) and fφ(ti, lj) are learned deep feed-forward neural networks fψ : Rh → Rd and fφ : Rh → Rd≥0, parametrized by ψ and φ, respectively, similar to Kingma and Welling (2014). These transform the
Algorithm 1 Generative Model of Neural Parameters for Multi-task, Multi-lingual NLP.
1: for ti ∈ T :
2:     ti ∼ N(µti, Σti)
3: for lj ∈ L :
4:     lj ∼ N(µlj, Σlj)
5: for ti ∈ T :
6:     for lj ∈ L :
7:         µθij = fψ(ti, lj)
8:         Σθij = fφ(ti, lj)
9:         θij ∼ N(µθij, Σθij)
10:        for xijk ∈ Xij :
11:            yijk ∼ p(· | xijk, θij)
latent representations into the mean µθij and the diagonal of the covariance matrix σ²θij for the parameters θij associated with ti and lj. The feed-forward network fψ just has a final linear layer, as the mean can range over Rd, whereas fφ has a final softplus (defined in Section 3) layer to ensure it ranges only over Rd≥0. Following Stolee and Patterson (2019), the networks fψ and fφ take as input a linear function of the task and language vectors: t ⊕ l ⊕ (t − l) ⊕ (t ⊙ l), where ⊕ stands for concatenation and ⊙ for element-wise multiplication. The sampled neural parameters θij are partitioned into a weight Wij ∈ Re×c and a bias bij ∈ Rc, and reshaped appropriately. Thus, the dimensionality of the Gaussian is chosen to reflect the number of parameters in the affine layer, d = e · c + c, where e is the dimensionality of the input token embeddings (detailed in the next paragraph) and c is the maximum number of classes across tasks.3 The number of hidden layers and the hidden size of fψ and fφ are hyper-parameters discussed in Section 4.2. We tie the parameters ψ and φ
for all layers except for the last to reduce the
parameter count. We note that the space of
parameters for all tasks and languages forms
a tensor Θ ∈ Rn×m×d, where d is the number
of parameters of the largest model.
(3) Sampling Task Labels: Finally, we sam-
ple the kth label yijk for the ith task and the
jth language from a final softmax: p(yijk |
xijk, θij) = softmax(Wij BERT(xijk) + bij)
where BERT(xijk) ∈ Re is the multi-lingual
BERT (Pires et al., 2019) encoder. The incor-
poration of m-BERT as a pre-trained mul-
tilingual embedding allows for enhanced
cross-lingual transfer.
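To make the generative story concrete, the sketch below walks through steps (1)–(3) for a single task–language pair. It is only an illustration under our own naming conventions: PyTorch is assumed, `trunk`, `head_mu`, and `head_sigma` stand in for fψ and fφ (with their shared layers), and the random token representation replaces a real m-BERT forward pass.

```python
import torch
import torch.nn.functional as F

h, e, c = 100, 768, 17          # latent size, m-BERT hidden size, max classes (illustrative values)
d = e * c + c                   # parameters of the affine layer: weight plus bias

# f_psi and f_phi share all layers except the last (here collapsed into one shared trunk).
trunk = torch.nn.Sequential(torch.nn.Linear(4 * h, 400), torch.nn.ReLU())
head_mu = torch.nn.Linear(400, d)        # final linear layer: mean ranges over R^d
head_sigma = torch.nn.Linear(400, d)     # softplus applied below: variance ranges over R^d_{>=0}

def generate_parameters(t, l):
    """Map a task vector t and a language vector l to classifier parameters theta_ij."""
    x = torch.cat([t, l, t - l, t * l], dim=-1)       # t ⊕ l ⊕ (t − l) ⊕ (t ⊙ l)
    z = trunk(x)
    mu, sigma2 = head_mu(z), F.softplus(head_sigma(z))
    theta = mu + sigma2.sqrt() * torch.randn(d)       # theta_ij ~ N(mu, diag(sigma2))
    W, b = theta[: e * c].view(c, e), theta[e * c:]   # reshape into weight and bias
    return W, b

# Generative story: sample latent vectors, then parameters, then a label distribution.
t_i = torch.randn(h)                                  # t_i ~ N(0, I)
l_j = torch.randn(h)                                  # l_j ~ N(0, I)
W, b = generate_parameters(t_i, l_j)
token_repr = torch.randn(e)                           # stand-in for BERT(x_ijk)
p_y = F.softmax(W @ token_repr + b, dim=-1)           # p(y_ijk | x_ijk, theta_ij)
```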
Consider the Cartesian product of all tasks and languages T × L. We can decompose this product into seen task–language pairs S and unseen task–language pairs U, i.e., T × L = S ⊔ U. Naturally, we are only able to train our model on the seen task–language pairs S. However, as we estimate all task–language parameter vectors θij jointly, our model allows us to draw inferences about the parameters for pairs in U as well. The intuition for why this should work is as follows: by observing multiple pairs where the task (language) is the same but the language (task) varies, the model learns to distill the relevant knowledge for zero-shot learning, because our generative model structurally enforces disentangled representations—separating representations for the tasks from the representations for the languages rather than lumping them together into a single entangled representation (Wu and Dredze, 2019, inter alia). Moreover, the neural networks fψ and fφ mapping the task- and language-specific latent variables to neural parameters are shared, allowing the model to generalize across task–language pairs.
3 Variational Inference

Exact computation of the posterior over the latent variables p(θ, t, l | X) is intractable. Thus, we need to resort to an approximation. In this work, we consider variational inference as our approximate inference scheme. Variational inference finds an approximate posterior over the latent variables by minimizing the variational gap, which may be expressed as the Kullback–Leibler (KL) divergence between the variational approximation q(θ, t, l) and the true posterior p(θ, t, l | X). In our work, we employ the following variational distribution:

qλ(t) = N(mt, St),    mt ∈ Rh, St ∈ Rh×h    (1)
qν(l) = N(ml, Sl),    ml ∈ Rh, Sl ∈ Rh×h    (2)
qξ(θ | t, l) = N(fψ(t, l), diag(fφ(t, l)))    (3)

3 Different tasks might involve different numbers of classes; the number of parameters hence varies. The extra dimensions not needed for a task can be considered as padded with zeros.
KL(q(θ, t, l) || p(θ, t, l | X))
    = − E_{t∼qλ} E_{l∼qν} E_{θ∼qξ} [ log ( p(θ, t, l | X) / q(θ, t, l) ) ]
    = − E_{t∼qλ} E_{l∼qν} E_{θ∼qξ} [ log p(θ, t, l, X) − log p(X) − log q(θ, t, l) ]
    = log p(X) − E_{t∼qλ} E_{l∼qν} E_{θ∼qξ} [ log ( p(θ, t, l, X) / q(θ, t, l) ) ]
    ≜ log p(X) − L    (4)

log p(X) = log ∫∫∫ p(X, θ, t, l) dθ dt dl
    = log ∫∫∫ p(X | θ) p(θ | t, l) p(t) p(l) dθ dt dl
    = log ∫∫∫ ( qλ(t) qν(l) qξ(θ | t, l) / ( qλ(t) qν(l) qξ(θ | t, l) ) ) p(X | θ) p(θ | t, l) p(t) p(l) dθ dt dl
    = log E_{t∼qλ} E_{l∼qν} E_{θ∼qξ} [ p(X | θ) p(θ | t, l) p(t) p(l) / ( qλ(t) qν(l) qξ(θ | t, l) ) ]
    ≥ E_{t∼qλ} E_{l∼qν} E_{θ∼qξ} [ log ( p(X | θ) p(θ | t, l) p(t) p(l) / ( qλ(t) qν(l) qξ(θ | t, l) ) ) ] ≜ L    (5)

L = E_{t∼qλ} E_{l∼qν} [ E_{θ∼qξ} [ log p(X | θ) + log ( p(θ | t, l) / qξ(θ | t, l) ) ] + log ( p(t) / qλ(t) ) + log ( p(l) / qν(l) ) ]
    = E_{θ∼qξ} [ log p(X | θ) ]                                                          ← requires approximation
      − ( KL(qλ(t) || p(t)) + KL(qν(l) || p(l)) + KL(qξ(θ | t, l) || p(θ | t, l)) )      ← closed-form solution    (6)
We note the unusual choice to tie parameters between the generative model and the variational family in Equation (3); however, we found that this choice performs better in our experiments. Through a standard algebraic manipulation in Equation (4), the KL-divergence for our generative model can be shown to equal the difference between the marginal log-likelihood log p(X), which is independent of q(·), and the so-called evidence lower bound (ELBO) L. Thus, approximate inference becomes an optimization problem where maximizing L results in minimizing the KL-divergence. One derives L by expanding the marginal log-likelihood as in Equation (5) by means of Jensen's inequality. We also show that L can be further broken into a series of terms as illustrated in Equation (6). In particular, we see that it is only the first term in the expansion that requires approximation. The subsequent terms are KL-divergences between variational and true distributions that have a closed-form solution due to our choice of prior. Due to the parameter-tying scheme above, the KL-divergence in Equation (6) between the variational distribution qξ(θ | t, l) and the prior distribution p(θ | t, l) is zero.
In general, the covariance matrices St and Sl in Equation (1) and Equation (2) will require O(h²) space to store. As h is often very large, it is impractical to materialize either matrix in its entirety. Hence, in this work, we experiment with smaller matrices that have a reduced memory footprint; specifically, we consider a diagonal covariance matrix and a diagonal plus low-rank covariance structure. A diagonal covariance matrix makes computation feasible with a complexity of O(h); this, however, comes at the cost of not letting parameters influence each other, and thus failing to capture their complex interactions. To allow for a more expressive variational family, we also consider a covariance matrix that is the sum of a diagonal matrix and a low-rank matrix:
diagonal matrix and a low-rank matrix:
St = diag(δ2
Sl = diag(δ2
t ) + BtB⊤
t
我 ) + BlB⊤
我
(8)
(7)
where B ∈ Rh×k ensures that rank(BB⊤) ≤ k, and diag(δ) is diagonal. We can store this structured covariance matrix in O(kh) space.
By definition, covariance matrices must be sym-
metric and positive semi-definite. The first prop-
erty holds by construction. The second property is
enforced by a softplus parameterization where
softplus(·) ≜ ln(1 + exp(·)). Specifically, we define δ² = softplus(ρ) and we optimize over ρ.
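For instance, the covariance structure and its softplus parameterization can be expressed as in the following sketch (PyTorch assumed; variable names are ours). The full h × h matrix is built here only for illustration—training never materializes it, as described next.

```python
import torch
import torch.nn.functional as F

h, k = 100, 10                              # latent size and rank of the low-rank factor

rho = torch.randn(h, requires_grad=True)    # unconstrained; delta^2 = softplus(rho) >= 0
B = torch.randn(h, k, requires_grad=True)   # low-rank factor, stored in O(kh) space

def covariance(rho, B):
    """S = diag(softplus(rho)) + B B^T: symmetric and positive semi-definite by construction."""
    return torch.diag(F.softplus(rho)) + B @ B.T

S = covariance(rho, B)                      # h x h, for inspection only
```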
3.1 Stochastic Variational Inference
To speed up the training time, we make use of
stochastic variational inference (Hoffman et al.,
2013). In this setting, we randomly sample a task
ti ∈ T and language lj ∈ L among seen combi-
nations during each training step,4 and randomly
select a batch of examples from the dataset for the
sampled task–language pair. We then optimize the
parameters of the feed-forward neural networks ψ
and φ as well as the parameters of the variational
approximation to the posterior mt, ml, ρt, ρl, Bt,
and Bl with a stochastic gradient-based optimizer
(discussed in Section 4.2).
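A single SVI update then looks roughly as follows. This is a schematic outline under our own assumptions: `seen_pairs`, `datasets`, `elbo`, and `params` are placeholders for the data pipeline, the objective in Equation (6), and the variational plus network parameters; it is not code from the released implementation.

```python
import random
import torch

# Placeholder: the real `params` would contain m_t, m_l, rho_t, rho_l, B_t, B_l, psi, and phi.
params = [torch.randn(10, requires_grad=True)]
optimizer = torch.optim.Adam(params, lr=5e-6)

def elbo(task, lang, batch, num_samples=3):
    # Stand-in for the Monte Carlo estimate of E[log p(X | theta)] minus the closed-form KL terms.
    return -(params[0] ** 2).sum()

def svi_step(seen_pairs, datasets):
    task, lang = random.choice(seen_pairs)             # sample a seen task-language combination
    batch = random.sample(datasets[(task, lang)], 8)   # mini-batch of 8 examples
    loss = -elbo(task, lang, batch, num_samples=3)     # maximize the ELBO = minimize its negative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```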
The KL divergence terms and their gradients in the ELBO appearing in Equation (6) can be computed in closed form, as the relevant densities are Gaussian (Duchi, 2007, p. 13). Moreover, they can be calculated for Gaussians with diagonal and diagonal plus low-rank covariance structures without explicitly unfolding the full matrix. For a choice of prior p = N(0, I) and a diagonal plus low-rank covariance structure, we have:

KL(q || p) = 1/2 [ Σ_{i=1}^{h} ( m_i² + δ_i² + Σ_{j=1}^{k} b_ij² ) − h − ln det(S) ]    (9)

where b_ij is the element in the i-th row and j-th column of B. The last term can be estimated without computing the full matrix explicitly thanks to the generalization of the matrix–determinant lemma,5 which, applied to the factored covariance structure, yields:

ln det(S) = ln det(I + B⊤ diag(δ⁻²) B) + Σ_{i=1}^{h} ln(δ_i²)    (10)

where I ∈ Rk×k. The KL divergence for the variant with diagonal covariance is just a special case of Equation (9) with b_ij = 0.

4 As an alternative, we experimented with a setup where sampling probabilities are proportional to the number of examples of each task–language combination, but this achieved similar performances on the development sets.
5 det(A + UV⊤) = det(I + V⊤A⁻¹U) · det(A). Note that the lemma assumes that A is invertible.
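The sketch below (our own code, PyTorch assumed) mirrors Equations (9) and (10): the log-determinant is obtained from a k × k matrix rather than the full h × h covariance, and the result can be checked against a naive dense computation on a small example.

```python
import torch
import torch.nn.functional as F

def kl_diag_lowrank_vs_std_normal(m, delta2, B):
    """KL( N(m, diag(delta2) + B B^T) || N(0, I) ) via Equations (9)-(10)."""
    h, k = B.shape
    # ln det(S) without building S: matrix-determinant lemma on a k x k matrix.
    small = torch.eye(k) + B.T @ torch.diag(1.0 / delta2) @ B
    logdet_S = torch.logdet(small) + torch.log(delta2).sum()
    trace_plus_sq = (m ** 2).sum() + delta2.sum() + (B ** 2).sum()
    return 0.5 * (trace_plus_sq - h - logdet_S)

# Sanity check against the dense formula on a small example.
h, k = 6, 2
m, rho, B = torch.randn(h), torch.randn(h), torch.randn(h, k)
delta2 = F.softplus(rho)
S = torch.diag(delta2) + B @ B.T
dense = 0.5 * ((m ** 2).sum() + torch.trace(S) - h - torch.logdet(S))
assert torch.allclose(kl_diag_lowrank_vs_std_normal(m, delta2, B), dense, atol=1e-5)
```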
However, as stated before, the following expectation does not admit a closed-form solution. Thus we consider a Monte Carlo approximation:

E_{θ∼qξ} [ log p(X | θ) ] = ∫ qξ(θ) log p(X | θ) dθ ≈ 1/V Σ_{v=1}^{V} log p(X | θ^(v)),  where θ^(v) ∼ qξ    (11)

where V is the number of Monte Carlo samples taken. In order to allow the gradient to easily flow through the generated samples, we adopt the re-parametrization trick (Kingma and Welling, 2014). Specifically, we exploit the following identities: ti = µti + σti ⊙ ǫ and lj = µlj + σlj ⊙ ǫ, where ǫ ∼ N(0, I) and ⊙ is the Hadamard product. For the diagonal plus low-rank covariance structure, we exploit the identity:

µ + δ ⊙ ǫ + B ζ    (12)

where ǫ ∈ Rh, ζ ∈ Rk, and both are sampled from N(0, I). The mean µθij and the diagonal of the covariance matrix σ²θij are deterministically computed given the above samples, and the parameters θij are sampled from N(µθij, diag(σ²θij)), again with the re-parametrization trick.
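As a sketch (our own code, PyTorch assumed), a reparametrized draw from the diagonal plus low-rank Gaussian can be written as:

```python
import torch

def sample_lowrank_gaussian(mu, delta, B):
    """Differentiable draw from N(mu, diag(delta**2) + B B^T), following Equation (12)."""
    h, k = B.shape
    eps = torch.randn(h)          # eps ~ N(0, I_h)
    zeta = torch.randn(k)         # zeta ~ N(0, I_k)
    return mu + delta * eps + B @ zeta

# Because the draw is a deterministic function of (mu, delta, B) and the noise,
# gradients flow through it; its covariance is diag(delta**2) + B B^T by construction.
```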
3.2 Posterior Predictive Distribution

During test time, we perform zero-shot predictions on an unseen task–language pair by plugging the posterior means (under the variational approximation) into the model. As an alternative, we experimented with ensemble predictions through Bayesian model averaging. That is, for data for seen combinations xS and data for unseen combinations xU, the true predictive posterior p(xU | xS) = ∫ p(xU | θ, xS) qξ(θ | xS) dθ can be approximated as 1/V Σ_{v=1}^{V} p(xU | θ^(v), xS), where the θ^(v) are V = 100 Monte Carlo samples from the posterior qξ. Performances on the development sets are comparable to simply plugging in the posterior mean.
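A minimal sketch of this model-averaged prediction (our own code; `logits_fn` and `posterior_sample_fn` are placeholders for the task–language classifier and the variational posterior qξ, respectively):

```python
import torch
import torch.nn.functional as F

def predictive_distribution(logits_fn, posterior_sample_fn, x, num_samples=100):
    """Approximate p(y | x, x_S) by averaging class probabilities over posterior draws."""
    samples = [F.softmax(logits_fn(posterior_sample_fn(), x), dim=-1)
               for _ in range(num_samples)]
    return torch.stack(samples).mean(dim=0)
```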
4 Experimental Setup
4.1 Data
We select NER and POS tagging as our experimental tasks because their datasets encompass an ample and diverse sample of languages, and are common benchmarks for resource-poor NLP (Cotterell and Duh, 2017, inter alia). In particular, we opt for WikiANN (Pan et al., 2017) for the NER task and Universal Dependencies 2.4 (UD; Nivre et al., 2019) for POS tagging. Our sample of languages is chosen from the intersection of those available in WikiANN and UD. However, we remark that this sample is heavily biased towards the Indo-European family (Gerz et al., 2018). Instead, the selection should be: i) typologically diverse, to ensure that the evaluation scores truly reflect the expected cross-lingual performance (Ponti et al., 2020); ii) a mixture of resource-rich and low-resource languages, to recreate a realistic setting and to allow for studying the effect of data size. Hence, we further filter the languages in order to make the sample more balanced. In particular, we sub-sample Indo-European languages by including only resource-poor ones, and keep all the languages from other families. Our final sample comprises 33 languages from 4 continents (17 from Asia, 11 from Europe, 4 from Africa, and 1 from South America) and from 11 families (6 Uralic, 6 Indo-European, 5 Afroasiatic, 3 Niger-Congo, 3 Turkic, 2 Austronesian, 2 Dravidian, 1 Austroasiatic, 1 Kra-Dai, 1 Tupian, 1 Sino-Tibetan), as well as 2 isolates. The full list of language ISO 639-2 codes is reported in Figure 2.

In order to simulate a zero-shot setting, we hold out in turn half of all possible task–language pairs and regard them as unseen, while treating the others as seen pairs. The partition is performed in such a way that a held-out pair has data available for the same task in a different language, and for the same language in a different task.6 Under this constraint, pairs are assigned to train or evaluation at random.7
We randomly split the WikiANN datasets into training, development, and test portions with a proportion of 80–10–10. We use the provided splits for UD; if the training set for a language is missing, we treat the test set as such when the language is held out, and as a training set when it is among the seen pairs.8

6 We use the controlled partitioning for the following reason. If a language lacks data both for NER and for POS, the proposed factorization method cannot provide estimates for its posterior. We leave model extensions that can handle such cases for future work.
7 See Section 5.2 for further experiments on splits controlled for language distance and sample size.
4.2 Hyper-parameters

The multilingual M-BERT encoder is initialized with parameters pre-trained on masked language modeling and next sentence prediction on 104 languages (Devlin et al., 2019).9 We opt for the cased BERT-BASE architecture, which consists of 12 layers with 12 attention heads and a hidden size of 768. As a result, this is also the dimension e of each encoded WordPiece unit, a subword unit obtained through BPE (Wu et al., 2016). The dimension h of the multivariate Gaussian for task and language latent variables is set to 100. The deep feed-forward networks fψ and fφ have 6 layers with a hidden size of 400 for the first layer, 768 for the internal layers, and ReLU non-linear activations. Their depth and width were selected based on validation performance.

The expectations over latent variables in Equation (6) are approximated through 3 Monte Carlo samples per batch during training. The KL terms are weighted with 1/|K| uniformly across training, where |K| is the number of mini-batches.10 We initialize all the means m of the variational approximation with a random sample from N(0, 0.1), and the parameters for covariance matrices S of the variational approximation with a random sample from U(0, 0.5), following Stolee and Patterson (2019). We choose k = 10 as the number of columns of B so it fits into memory. The maximum sequence length for inputs is limited to 250. The batch size is set to 8, and the best setting for the Adam optimizer (Kingma and Ba, 2015) was found to be an initial learning rate of 5·10⁻⁶ based on grid search. In order to avoid over-fitting, we perform early stopping with a patience of 10 and a validation frequency of 2.5K steps.
4.3 Baselines
We consider four baselines for cross-lingual trans-
fer that also use BERT as an encoder shared across
all languages.
8 Note that, in the second case, no evaluation takes place on such language.
9 Available at github.com/google-research/bert/blob/master/multilingual.md.
10 We found this weighting strategy to work better than annealing as proposed by Blundell et al. (2015).
First Baseline. A common approach is transfer
from the nearest source (NS) language, which selects the most compatible source to a target language in terms of similarity. In particular, the selection can be based on family membership (Zeman and Resnik, 2008; Cotterell and Heigold, 2017; Kann et al., 2017), typological features (Deri and Knight, 2016), KL-divergence between part-of-speech trigram distributions (Rosa and Žabokrtský, 2015; Agić, 2017), tree edit distance of delexicalized dependency parses (Ponti et al., 2018), or a combination of the above (Lin et al.,
2019). In our work, during evaluation, we choose
the classifier associated with the observed lan-
guage with the highest cosine similarity between
its typological features and those of the held-out
language. These features are sourced from URIEL (Littell et al., 2017) and contain information about family, area, syntax, and phonology.
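In sketch form, this selection rule amounts to a cosine-similarity argmax over typological vectors (our own code; the vectors below are random placeholders, whereas the real ones would be the URIEL features from lang2vec):

```python
import torch

def nearest_source(target_vec, source_vecs):
    """Return the seen language whose typological vector is most cosine-similar to the target's."""
    def cosine(a, b):
        return float(torch.dot(a, b) / (a.norm() * b.norm()))
    return max(source_vecs, key=lambda lang: cosine(target_vec, source_vecs[lang]))

# Placeholder usage: real vectors would encode family, area, syntax, and phonology features.
sources = {"vie": torch.rand(103), "tur": torch.rand(103)}
print(nearest_source(torch.rand(103), sources))
```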
Second Baseline. We also consider transfer from the largest source (LS) language, that is, the language with the most training examples. This approach has been adopted by several recent works on cross-lingual transfer (Conneau et al., 2018; Artetxe et al., 2020, inter alia). In our implementation, we always select the English classifier
for prediction.11 In order to make this baseline
comparable to our model, we adjust the number
of English NER training examples to the sum of
the examples available for all seen languages S.12
Third Baseline. Next, we apply a protocol designed by Rahimi et al. (2019) for weighting the predictions of a classifier ensemble according to their reliability. For a specific task, the reliability of each language-specific classifier is estimated through a Bayesian graphical model. Intuitively, this model learns from error patterns, which behave more randomly for untrustworthy models and more consistently for the others. Among the
protocols proposed in the paper, we opt for BEA
in its zero-shot, token-based version, as it achieves
the highest scores in a setting comparable to the
current experiment. We refer to the original paper
for the details.13
11We include English to make the baseline more com-
petitive, but note that this language is not available for our
generative model as it is both Indo-European and resource-
rich.
12 The number of NER training examples is 1,093,184 for the first partition and 520,616 for the second partition.
13We implemented this model through the original code at
github.com/afshinrahimi/mmner.
Fourth Baseline. Finally, we take inspiration from Wu and Dredze (2019). The joint multilingual (JM) baseline, contrary to the previous baselines,
consists of two classifiers (one for POS tagging
and another for NER) shared among all observed
languages for a specific task. We follow the orig-
inal implementation of Wu and Dredze (2019),
closely adopting all recommended hyper-parameters
and strategies, such as freezing the parameters of
all encoder layers below the 3rd for sequence label-
ing tasks.
It must be noted that the number of parameters in our generative model scales better than baselines with language-specific classifiers, but worse than those with language-agnostic classifiers, as the number of languages grows. However, even in the second case, increasing the depth of the baseline networks to match the parameter count is detrimental if the BERT encoder is kept trainable, which was also verified in previous work (Peters et al., 2019).
5 Results and Discussion
5.1 Zero-shot Transfer
Firstly, we present the results for zero-shot prediction based on our generative model using both of the approximate inference schemes (with diagonal covariance PF-d and factor covariance PF-lr). Table 1 summarizes the results on the two tasks of POS tagging and NER averaged across all languages. Our model (in both its variants) outperforms the four baselines on both tasks, including state-of-the-art alternative methods. In particular, PF-d and PF-lr gain 4.49 / 4.20 in accuracy (∼7%) for POS tagging and 7.29 / 7.73 in F1 score (∼10%) for NER on average compared to transfer from the largest source (LS), the strongest baseline for single-source transfer. Compared to multilingual joint transfer from multiple sources (JM), our two variants gain 0.95 / 0.67 in accuracy (∼1%) for POS tagging and +0.61 / +1.05 in F1 score (∼1%). More details about the individual results on each task–language pair are provided in Figure 2, which includes the mean of the results over 3 separate runs. Overall, we obtain improvements in 23/33 languages for NER and on 27/45 treebanks for POS tagging, which further supports the benefits of transferring both from tasks and languages.

Considering the baselines, the relative performance of LS versus NS is an interesting finding in itself. LS largely outperforms NS on both POS
Figure 2: Results for NER (top) and POS tagging (bottom): Four baselines for cross-lingual transfer compared to Matrix Factorization with diagonal covariance and diagonal plus low-rank covariance.
Task | BEA          | NS           | LS           | JM           | PF-d         | PF-lr
POS  | 47.65 ± 1.54 | 42.84 ± 1.23 | 60.51 ± 0.43 | 64.04 ± 0.18 | 65.00 ± 0.12 | 64.71 ± 0.18
NER  | 66.45 ± 0.56 | 74.16 ± 0.56 | 78.97 ± 0.56 | 85.65 ± 0.13 | 86.26 ± 0.17 | 86.70 ± 0.10

Table 1: Results per task averaged across all languages.
tagging and NER. This shows that having more
data is more informative than relying primarily on
similarity according to linguistic properties. This finding contradicts the received wisdom (Rosa and Žabokrtský, 2015; Cotterell and Heigold, 2017; Lin et al., 2019, inter alia) that related languages tend to be the most reliable source. We conjecture that this is due to the pre-trained multilingual BERT encoder, which helps to bridge the gap between unrelated languages (Wu and Dredze, 2019).

The two baselines that hinge upon transfer from multiple sources lie on opposite sides of the spectrum in terms of performance. On the one hand, BEA achieves the lowest average score for
NER, and surpasses only NS for POS tagging.
We speculate that this is due to the following: i) adapting the protocol from Rahimi et al. (2019) to our model implies assigning a separate classifier head to each task–language pair, each of which is exposed to fewer examples compared to a shared one. This fragmentation fails to take advantage of the massively multilingual nature of the encoder; ii) our language sample is more typologically diverse, which means that most source languages are unreliable predictors. On the other hand, JM yields extremely competitive scores. Similarly to our model, it integrates knowledge from multiple languages and tasks. The extra boost in our model stems from its ability to disentangle each aspect of such knowledge and recombine it appropriately.
Moreover, comparing the two approximate in-
ference schemes from Section 3.1, PF-lr obtains
a small but statistically significant improvement
over PF-d in NER, whereas they achieve the
same performance on POS tagging. This means
that the posterior is modeled well enough by a
Gaussian where covariance among co-variates is
negligible.
We see that even for the best model (PF-lr) there is a wide variation in the scores for the same task across languages. POS tagging accuracy ranges from 12.56 ± 4.07 in Guaraní to 86.71 ± 0.67 in Galician, and NER F1 scores range from 49.44 ± 0.69 in Amharic to 96.20 ± 0.11 in Upper Sorbian. Part of this variation is explained by the fact that the multilingual BERT encoder is not pre-trained on a subset of these languages (e.g., Amharic, Guaraní, Uyghur). Another cause is more straightforward: The scores are expected to be lower in languages for which we have fewer training examples in the seen task–language pairs.
Task | |L| = 11 Sim | |L| = 11 Dif | |L| = 22 Sim | |L| = 22 Dif
POS  | 72.44        | 53.25        | 66.59        | 63.22
NER  | 89.51        | 81.73        | 86.78        | 85.12

Table 2: Average performance when relying on |L| similar (Sim) versus different (Dif) languages in the train and evaluation sets.
5.2 Language Distance and Sample Size

While we designed the language sample to be both realistic and representative of the cross-lingual variation, there are several factors inherent to a sample that can affect the zero-shot transfer performance: i) language distance, the similarity between seen and held-out languages; and ii) sample size, the number of seen languages. In order to disentangle these factors, we construct subsets of size |L| so that training and evaluation languages are either maximally similar (Sim) or maximally different (Dif). As a proxy measure, we consider as 'similar' languages belonging to the same family. In Table 2, we report the performance of parameter factorization with diagonal plus low-rank covariance (PF-lr), the best model from Section 5.1, for each of these subsets.

Based on Table 2, there emerges a trade-off between language distance and sample size. In particular, performance is higher in Sim subsets compared to Dif subsets for both tasks (POS and NER) and for both sample sizes |L| ∈ {11, 22}. At larger sample sizes, the average performance increases for Dif but decreases for Sim. Intuitively, languages with labeled data for several relatives benefit from small, homogeneous subsets. Introducing further languages introduces noise. Instead, languages where this is not possible (e.g., isolates) benefit from an increase in sample size.
5.3 Entropy of the Predictive Distribution
A notable problem of point estimate methods
is their tendency to assign most of the proba-
bility mass to a single class even in scenarios
with high uncertainty. Zero-shot transfer is one
of such scenarios, as it involves drastic distribution shifts in the data (Rabanser et al., 2019). A key advantage of Bayesian inference, instead, is marginalization over parameters, which yields smoother posterior predictive distributions
(Kendall and Gal, 2017; Wilson, 2019).
Figure 3: Entropy of the posterior predictive distributions over classes for each test example. The higher the entropy, the more uncertain the prediction.
We run an analysis of predictions based on (approximate) Bayesian model averaging. First, we randomly sample 800 examples from each test set of a task–language pair. For each example, we predict a distribution over classes Y through model averaging based on 10 samples from the posteriors. We then measure the prediction entropy of each example—that is, H(p) = −Σ_{y∈Y} p(Y = y) ln p(Y = y)—whose plot is shown in Figure 3.
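The per-example entropy is straightforward to compute from the averaged class distribution; a small sketch (our own code, PyTorch assumed):

```python
import math
import torch

def prediction_entropy(probs, eps=1e-12):
    """H(p) = -sum_y p(y) ln p(y) for a class distribution obtained via model averaging."""
    probs = probs.clamp_min(eps)               # guard against log(0)
    return float(-(probs * probs.log()).sum())

# A uniform distribution over the 9 NER classes attains the maximum entropy ln(9) ≈ 2.2,
# while a one-hot (fully confident) prediction has entropy 0.
assert abs(prediction_entropy(torch.full((9,), 1.0 / 9)) - math.log(9)) < 1e-5
```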
Entropy is a measure of uncertainty. Intuitively, the uniform categorical distribution (maximum uncertainty) has the highest entropy, whereas if the whole probability mass falls into a single class (maximum confidence), then the entropy is H = 0.14 As emerges from Figure 3, predictions in certain languages tend to have higher entropy in general, such as in Amharic, Guaraní, Uyghur, or Assyrian Neo-Aramaic. This aligns well with the performance metrics in Figure 2. In practice, languages with low scores tend to display high entropy in the predictive distribution, as expected.
To verify this claim, we measure the Pearson's correlation between the entropies of each task–language pair in Figure 3 and the performance metrics. We find a very strong negative correlation with a coefficient of ρ = −0.914 and a two-tailed p-value of 1.018 × 10⁻²⁶.

14 The maximum entropy is ≈ 2.2 for 9 classes as in NER and ≈ 2.83 for 17 classes as in POS tagging.
6 Related Work

Our approach builds on ideas from several different fields: cross-lingual transfer in NLP, with a particular focus on sequence labeling tasks, as well as matrix factorization, contextual parameter generation, and neural Bayesian methods.
Cross-Lingual Transfer for Sequence Labeling. One of the two dominant approaches for cross-lingual transfer is projecting annotations from a source language text to a target language text. This technique was pioneered by Yarowsky et al. (2001) and Hwa et al. (2005) for parsing, and later extended to applications such as POS tagging (Das and Petrov, 2011; Garrette et al., 2013; Täckström et al., 2012; Duong et al., 2014; Huck et al., 2019) and NER (Ni et al., 2017; Enghoff et al., 2018; Agerri et al., 2018; Jain et al., 2019). This requires tokens to be aligned through a parallel corpus, a machine translation system, or a bilingual dictionary (Durrett et al., 2012; Mayhew et al., 2017). However, creating machine translation and word-alignment systems demands parallel texts in the first place, while automatically induced bilingual lexicons are noisy and offer only limited coverage (Artetxe et al., 2018; Duan et al., 2020). Furthermore, errors inherent to such systems cascade along the projection pipeline (Agić et al., 2015).
The second approach, model transfer, offers higher flexibility (Conneau et al., 2018). The main idea is to train a model directly on the source data, and then deploy it onto target data (Zeman and Resnik, 2008). Crucially, bridging between different lexica requires input features to be language-agnostic. While originally this implied delexicalization, replacing words with universal POS tags (McDonald et al., 2011; Dehouck and Denis, 2017), cross-lingual Brown clusters (Täckström et al., 2012; Rasooli and Collins, 2017), or cross-lingual knowledge base grounding through wikification (Camacho-Collados et al., 2016; Tsai et al., 2016), more recently these have been supplanted by cross-lingual word embeddings (Ammar et al., 2016; Zhang et al., 2016; Xie et al., 2018; Ruder et al., 2019b) and multilingual pretrained language models (Devlin et al., 2019; Conneau et al., 2020).

An orthogonal research thread regards the selection of the source language(s). In particular, multi-source transfer was shown to surpass single-best source transfer in NER (Fang and Cohn, 2017; Rahimi et al., 2019) and POS tagging (Enghoff et al., 2018; Plank and Agić, 2018). Our parameter space factorization model can be conceived as an extension of multi-source cross-lingual model transfer to a cross-task setting.
Data Matrix Factorization. Although we are the first to propose a factorization of the parameter space for unseen combinations of tasks and languages, the factorization of data for collaborative filtering and social recommendation is an established research area. In particular, the missing values in sparse data structures such as user–movie review matrices can be filled via probabilistic matrix factorization (PMF) through a linear combination of user and movie matrices (Mnih and Salakhutdinov, 2008; Ma et al., 2008; Shan and Banerjee, 2010, inter alia) or through neural networks (Dziugaite and Roy, 2015). Inference for PMF can be carried out through MAP inference (Dziugaite and Roy, 2015), Markov chain Monte Carlo (Salakhutdinov and Mnih, 2008), or stochastic variational inference (Stolee and Patterson, 2019). Contrary to prior work, we perform factorization on latent variables (task- and language-specific parameters) rather than observed ones (data).
Contextual Parameter Generation. Our model is reminiscent of the idea that parameters can be conditioned on language representations, as proposed by Platanios et al. (2018). However, since this approach is limited to a single task and a joint learning setting, it is not suitable for generalization in a zero-shot transfer setting.
Bayesian Neural Models. To date, these models have found only limited application in NLP for resource-poor languages, despite their desirable properties. Firstly, they can incorporate priors over parameters to endow neural networks with the correct inductive biases towards language: Ponti et al. (2019b) constructed a prior imbued with universal linguistic knowledge for zero- and few-shot character-level language modeling. Secondly, they avoid the risk of over-fitting by taking into account uncertainty. For example, Shareghi et al. (2019) and Doitch et al. (2019) use a perturbation model to sample high-quality and diverse solutions for structured prediction in cross-lingual parsing.
7 Conclusion

The main contribution of our work is a Bayesian generative model for multiple NLP tasks and languages. At its core lies the idea that the space of neural weights can be factorized into latent variables for each task and each language. While training data are available only for a meager subset of task–language combinations, our model opens up the possibility to perform prediction in novel, undocumented combinations at evaluation time. We performed inference through stochastic variational methods, and ran experiments on zero-shot named entity recognition (NER) and part-of-speech (POS) tagging in a typologically diverse set of 33 languages. Based on the reported results, we conclude that leveraging the information from tasks and languages simultaneously is superior to model transfer from English (relying on more abundant in-task data in the source language), from the most typologically similar language
(relying on prior information on language relatedness), or from multiple source languages. Moreover, we found that the entropy of predictive posterior distributions obtained through Bayesian model averaging correlates almost perfectly with the error rate in the prediction. As a result, our approach holds promise to alleviating data paucity issues for a wide spectrum of languages and tasks, and to make knowledge transfer more robust to uncertainty.

Finally, we remark that our model is amenable to be extended to multilingual tasks beyond sequence labeling—such as natural language inference (Conneau et al., 2018) and question answering (Artetxe et al., 2020; Lewis et al., 2019; Clark et al., 2020)—and to zero-shot transfer across combinations of multiple modalities (e.g., speech, text, and vision) with tasks and languages. We leave these exciting research threads for future research.
Acknowledgments

We would like to thank action editor Jacob Eisenstein and the three anonymous reviewers at TACL. This work is supported by the ERC Consolidator Grant LEXICAL (no. 648909) and the Google Faculty Research Award 2018. RR was partially funded by ISF personal grant no. 1625/18.
References
Rodrigo Agerri, Xavier Gómez Guinovart, German Rigau, and Miguel Anxo Solla Portela. 2018. Developing new linguistic resources and tools for the Galician language. In Proceedings of LREC.

Željko Agić. 2017. Cross-lingual parser selection for low-resource languages. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), pages 1–10.

Željko Agić, Dirk Hovy, and Anders Søgaard. 2015. If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages. In Proceedings of ACL, pages 268–272.

Željko Agić, Anders Johannsen, Barbara Plank, Héctor Martínez Alonso, Natalie Schluter, and Anders Søgaard. 2016. Multilingual projection for parsing truly low-resource languages. Transactions of the ACL, 4:301–312. DOI: https://doi.org/10.1162/tacl_a_00100

Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Many languages, one parser. Transactions of the ACL, 4:431–444. DOI: https://doi.org/10.1162/tacl_a_00109
Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. Unsupervised neural machine translation. In Proceedings of ICLR. DOI: https://doi.org/10.18653/v1/D18-1399

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of ACL. DOI: https://doi.org/10.18653/v1/2020.acl-main.421

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the ACL, 7:597–610. DOI: https://doi.org/10.1162/tacl_a_00288

Johannes Bjerva and Isabelle Augenstein. 2018. From phonology to syntax: Unsupervised linguistic typology at different levels with language embeddings. In Proceedings of NAACL-HLT, pages 907–916. DOI: https://doi.org/10.18653/v1/N18-1083

Johannes Bjerva, Robert Östling, Maria Han Veiga, Jörg Tiedemann, and Isabelle Augenstein. 2019. What do language representations really represent? Computational Linguistics, 45(2):381–389. DOI: https://doi.org/10.1162/coli_a_00351
Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. Weight uncertainty in neural networks. In Proceedings of ICML, pages 1613–1622.

José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence, 240:36–64. DOI: https://doi.org/10.1016/j.artint.2016.07.005
Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of ACL. DOI: https://doi.org/10.18653/v1/2020.acl-main.747

Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of EMNLP, pages 2475–2485. DOI: https://doi.org/10.18653/v1/D18-1269
Ryan Cotterell and Kevin Duh. 2017. Low-resource named entity recognition with cross-lingual, character-level neural conditional random fields. In Proceedings of IJCNLP, pages 91–96, Taipei, Taiwan. DOI: https://doi.org/10.18653/v1/D17-1078

Ryan Cotterell and Georg Heigold. 2017. Cross-lingual character-level neural morphological tagging. In Proceedings of EMNLP, pages 748–759.

Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of ACL, pages 600–609.
Mathieu Dehouck and Pascal Denis. 2017. Delexicalized word embeddings for cross-lingual dependency parsing. In Proceedings of EACL, pages 241–250.

Aliya Deri and Kevin Knight. 2016. Grapheme-to-phoneme models for (almost) any language. In Proceedings of ACL, pages 399–408.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
Amichay Doitch, Ram Yazdi, Tamir Hazan, and Roi Reichart. 2019. Perturbation based learning for structured NLP tasks with application to dependency parsing. Transactions of the ACL, 7:643–659. DOI: https://doi.org/10.1162/tacl_a_00291

Xiangju Duan, Baijun Ji, Hao Jia, Min Tan, Min Zhang, Boxing Chen, Weihua Luo, and Yue Zhang. 2020. Bilingual dictionary based neural machine translation without using parallel sentences. In Proceedings of ACL. DOI: https://doi.org/10.18653/v1/2020.acl-main.143

John Duchi. 2007. Derivations for linear algebra and optimization. University of California, Berkeley.

Long Duong, Trevor Cohn, Karin Verspoor, Steven Bird, and Paul Cook. 2014. What can we get from 1000 tokens? A case study of multilingual POS tagging for resource-poor languages. In Proceedings of EMNLP, pages 886–897. DOI: https://doi.org/10.3115/v1/D14-1096
Greg Durrett, Adam Pauls, and Dan Klein. 2012.
Syntactic transfer using a bilingual lexicon. 在
Proceedings of EMNLP-CoNLL, pages 1–11.
Gintare Karolina Dziugaite and Daniel M. Roy. 2015. Neural network matrix factorization. arXiv preprint arXiv:1511.06443.

Jan Vium Enghoff, Søren Harrison, and Željko Agić. 2018. Low-resource named entity recognition via multi-source projection: Not quite there yet? In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 195–201. DOI: https://doi.org/10.18653/v1/W18-6125
Meng Fang and Trevor Cohn. 2017. Model trans-
fer for tagging low-resource languages using
a bilingual dictionary. In Proceedings of ACL,
pages 587–593. DOI: https://doi.org/10.18653/v1/P17-2093
Dan Garrette, Jason Mielens, and Jason Baldridge.
2013. Real-world semi-supervised learning of
POS-taggers for low-resource languages. In Proceedings of ACL, pages 583–592.

Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of EMNLP, pages 316–327. DOI: https://doi.org/10.18653/v1/D18-1029

Harald Hammarström, Robert Forkel, Martin Haspelmath, and Sebastian Bank, editors. 2016. Glottolog 2.7, Max Planck Institute for the Science of Human History, Jena.

Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347.
Matthias Huck, Diana Dutka, and Alexander Fraser. 2019. Cross-lingual annotation projection is effective for neural part-of-speech tagging. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 223–233. DOI: https://doi.org/10.18653/v1/W19-1425
Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara I. Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(3):311–325. DOI: https://doi.org/10.1017/S1351324905003840
Alankar Jain, Bhargavi Paranjape, and Zachary C. Lipton. 2019. Entity projection via machine translation for cross-lingual NER. In Proceedings of EMNLP-IJCNLP, pages 1083–1092. DOI: https://doi.org/10.18653/v1/D19-1100

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351. DOI: https://doi.org/10.1162/tacl_a_00065
Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2017. One-shot neural cross-lingual transfer for paradigm completion. In Proceedings of ACL, pages 1993–2003.

Alex Kendall and Yarin Gal. 2017. What uncertainties do we need in Bayesian deep learning for computer vision? In Proceedings of NeurIPS, pages 5574–5584.

Diederik P. Kingma and Jimmy L. Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In Proceedings of ICLR.

Patrick S. H. Lewis, Barlas Oğuz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019. MLQA: Evaluating cross-lingual extractive question answering. CoRR, abs/1910.07475.
Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In Proceedings of ACL, pages 3125–3135.

Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of EACL, pages 8–14. DOI: https://doi.org/10.18653/v1/E17-2002
Hao Ma, Haixuan Yang, Michael R. Lyu, and Irwin King. 2008. SoRec: Social recommendation using probabilistic matrix factorization. In Proceedings of CIKM, pages 931–940. DOI: https://doi.org/10.1145/1458082.1458205, PMID: 19021718

Chaitanya Malaviya, Graham Neubig, and Patrick Littell. 2017. Learning language representations for typology prediction. In Proceedings of EMNLP, pages 2529–2535.
Stephen Mayhew, Chen-Tse Tsai, and Dan Roth.
2017. Cheap translation for cross-lingual named
entity recognition. In Proceedings of EMNLP,
pages 2536–2545.
Ryan McDonald, Slav Petrov, and Keith Hall.
2011. Multi-source transfer of delexicalized
dependency parsers. In Proceedings of EMNLP,
pages 62–72.
Andriy Mnih and Ruslan Salakhutdinov. 2008.
Probabilistic matrix factorization. In Proceed-
ings of NeurIPS, pages 1257–1264.
Jian Ni, Georgiana Dinu, and Radu Florian. 2017. Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In Proceedings of ACL, pages 1470–1480.
乔金·尼弗尔, Mitchell Abrams,
ˇZeljko Agi´c,
Lars Ahrenberg, Gabriel˙e Aleksandraviˇci¯ut˙e,
Lene Antonsen, Katya Aplonova, Maria Jesus
Aranzabe, Gashaw Arutie, Masayuki Asahara,
Luma Ateyah, Mohammed Attia, Aitziber
Atutxa, Liesbeth Augustinus, Elena Badmaeva,
Miguel Ballesteros, Esha Banerjee, Sebastian
Bank, Verginica Barbu Mititelu, 维多利亚
Basmov, John Bauer, Sandra Bellato, Kepa
Bengoetxea, Yevgeni Berzak, Irshad Ahmad
Bhat, Riyaz Ahmad Bhat, Erica Biagetti,
Eckhard Bick, Agn˙e Bielinskien˙e, Rogier
Blokland, Victoria Bobicev, Lo¨ıc Boizou,
Emanuel Borges V¨olker, Carl B¨orstell, Cristina
Bosco, Gosse Bouma, Sam Bowman, Adriane
Boyd, Kristina Brokait˙e, Aljoscha Burchardt,
Marie Candito, Bernard Caron, Gauthier Caron,
G¨uls¸en Cebiro˘glu Eryi˘git, Flavio Massimiliano
Cecchini, Giuseppe G. A. Celano, Slavom´ır
ˇC´epl¨o, Savas Cetin, Fabricio Chalub, Jinho
Choi, Yongseok Cho, Jayeol Chun, 西尔维娅
辛科瓦, Aur´elie Collomb, C¸ a˘gri C¸ ¨oltekin,
Miriam Connor, Marine Courtin, Elizabeth
戴维森, Marie-Catherine
de Marneffe,
Valeria de Paiva, Arantza Diaz de Ilarraza,
Carly Dickerson, Bamba Dione, Peter Dirix,
Kaja Dobrovoljc, Timothy Dozat, Kira
Droganova, Puneet Dwivedi, Hanne Eckhoff,
Marhaba Eli, Ali Elkahky, Binyam Ephrem,
Tomaˇz Erjavec, Aline Etienne, 理查德·法卡斯,
Hector Fernandez Alcalde, Jennifer Foster,
Cl´audia Freitas, Kazunori Fujita, Katar´ına
Gajdoˇsov´a, Daniel Galbraith, Marcos Garcia,
Moa G¨ardenfors, Sebastian Garza, Kim Gerdes,
菲利普·金特, Iakes Goenaga, Koldo Gojenola,
Memduh G¨okirmak, Yoav Goldberg, Xavier
G´omez Guinovart, Berta Gonz´alez Saavedra,
Matias Grioni, Normunds Gr¯uz¯ıtis, Bruno
424
Guillaume, C´eline Guillot-Barbance, Nizar
Habash, Jan Hajiˇc, Jan Hajiˇc jr., Linh H`a M˜y,
Na-Rae Han, Kim Harris, Dag Haug, 约翰内斯
Heinecke, Felix Hennig, Barbora Hladk´a,
Jaroslava Hlav´aˇcov´a, Florinel Hociung, Petter
Hohle, Jena Hwang, Takumi Ikeda, Radu Ion,
Elena Irimia, 氧. l´aj´ıd´e Ishola, Tom´aˇs Jel´ınek,
Anders Johannsen, Fredrik Jøorgensen, H¨uner
Kasikara, Andre Kaasen, Sylvain Kahane,
Hiroshi Kanayama, Jenna Kanerva, Boris Katz,
Tolga Kayadelen, Jessica Kenney, V´aclava
Kettnerov´a,
Jesse Kirchner, Arne K¨ohn,
Kamil Kopacewicz, Natalia Kotsyba, Jolanta
Kovalevskait˙e, Simon Krek, Sookyoung Kwak,
Veronika Laippala, Lorenzo Lambertino, Lucia
Lam, Tatiana Lando, Septina Dian Larasati,
Alexei Lavrentiev, John Lee, Phuong Lˆe Hong,
Alessandro Lenci, Saran Lertpradit, Herman
Leung, Cheuk Ying Li, Josie Li, Keying Li,
KyungTae Lim, Yuan Li, Nikola Ljubeˇsi´c,
Olga Loginova, Olga Lyashevskaya, Teresa
林恩, Vivien Macketanz, Aibek Makazhanov,
Michael Mandl, Christopher Manning, Ruli
Manurung, C˘at˘alina M˘ar˘anduc, David Mareˇcek,
Katrin Marheinecke, H´ector Mart´ınez Alonso,
Andr´e Martins, Jan Maˇsek, Yuji Matsumoto,
Ryan McDonald,
Sarah McGuinness,
Gustavo Mendonça, Niko Miekka, Margarita
Misirpashayeva, Anna Missil¨a, C˘at˘alin Mititelu,
Yusuke Miyao, Simonetta Montemagni, Amir
更多的, Laura Moreno Romero, Keiko Sophie
森, Tomohiko Morioka, Shinsuke Mori,
Shigeki Moro, Bjartur Mortensen, Bohdan
Moskalevskyi, Kadri Muischnek, Yugo
Murawaki, Kaili M¨u¨urisep, Pinkey Nainwani,
Juan Ignacio Navarro Horñiacek, Anna
Nedoluzhko, Gunta Nešpore-Berzkalne, Luong
Nguyˆen Thi, Huyˆen Nguyˆen Thi Minh,
Yoshihiro Nikaido, Vitaly Nikolaev, Rattima
Nitisaroj, Hanna Nurmi, Stina Ojala, Ad´eday
Olúòkun, Mai Omura, Petya Osenova, Robert
Östling, Lilja Øvrelid, Niko Partanen, Elena
Pascual, Marco Passarotti, Agnieszka Patejuk,
Guilherme Paulino-Passos, Angelika Peljak-
Lapi´nska, Siyao Peng, Cenel-Augusto Perez,
Guy Perrier, Daria Petrova, Slav Petrov, Jussi
Piitulainen, Tommi A Pirinen, Emily Pitler,
Barbara Plank, Thierry Poibeau, Martin Popel,
Lauma Pretkalnin¸a, Sophie Pr´evost, Prokopis
Prokopidis, Adam Przepi´orkowski, Tiina
Puolakainen, Sampo Pyysalo, Andriela R¨a¨abis,
Alexandre Rademaker, Loganathan Ramasamy,
Taraka Rama, Carlos Ramisch, Vinit
Ravishankar, Livy Real, Siva Reddy, Georg
Rehm, Michael Rießler, Erika Rimkut˙e, Larissa
Rinaldi, Laura Rituma, Luisa Rocha, Mykhailo
Romanenko, Rudolf Rosa, Davide Rovati,
Valentin Ros¸ca, Olga Rudina, Jack Rueter,
Shoval Sadde, Benoît Sagot, Shadi Saleh,
Alessio Salomoni, Tanja Samardˇzi´c, Stephanie
Samson, Manuela Sanguinetti, Dage S¨arg,
Baiba Saul¯ıte, Yanin Sawanakunanon, Nathan
施耐德, Sebastian Schuster, Djam´e Seddah,
Wolfgang Seeker, Mojgan Seraji, Mo Shen,
Atsuko Shimada, Hiroyuki Shirasu, Muh
Shohibussirri, Dmitry
Sichinava, Natalia
Silveira, Maria Simi, Radu Simionescu, Katalin
Simkó, Mária Šimková, Kiril Simov, Aaron
Smith, Isabela Soares-Bastos, Carolyn Spadine,
Antonio Stella, Milan Straka, Jana Strnadov´a,
Alane Suhr, Umut Sulubacak, Shingo Suzuki,
Zsolt Szántó, Dima Taji, Yuta Takahashi, Fabio
Tamburini, Takaaki Tanaka, Isabelle Tellier,
Guillaume Thomas, Liisi Torga, Trond
Trosterud, Anna Trukhina, Reut Tsarfaty,
Francis Tyers, Sumire Uematsu, Zdeˇnka
Ureˇsov´a, Larraitz Uria, Hans Uszkoreit,
Sowmya Vajjala, Daniel van Niekerk, Gertjan
van Noord, Viktor Varga, Eric Villemonte de
la Clergerie, Veronika Vincze, Lars Wallin,
Abigail Walsh, Jing Xian Wang, Jonathan
North Washington, Maximilian Wendt, Seyi
Williams, Mats Wirén, Christian Wittern,
Tsegay Woldemariam, Tak-sum Wong, Alina
Wr´oblewska, Mary Yako, Naoki Yamazaki,
Chunxiao Yan, Koichi Yasuoka, Marat M.
Yavrumyan, Zhuoran Yu, Zdeněk Žabokrtský,
Amir Zeldes, Daniel Zeman, Manying Zhang,
and Hanzhi Zhu. 2019. Universal Dependen-
cies 2.4. LINDAT/CLARIN digital library at
the Institute of Formal and Applied Linguistics
( ´UFAL), Faculty of Mathematics and Physics,
Charles University.
Robert Östling and Jörg Tiedemann. 2017. Con-
tinuous multilinguality with language vectors.
In Proceedings of the EACL, pages 644–649.
Xiaoman Pan, Boliang Zhang, Jonathan May, Joel
Nothman, Kevin Knight, and Heng Ji. 2017.
Cross-lingual name tagging and linking for 282
languages. In Proceedings of ACL, Volume 1,
pages 1946–1958.
Matthew E. Peters, Sebastian Ruder, and Noah A.
史密斯. 2019. To tune or not to tune? Adapting
pretrained representations to diverse tasks. In
Proceedings of the 4th Workshop on Represen-
tation Learning for NLP (RepL4NLP-2019),
pages 7–14. DOI: https://doi.org/10
.18653/v1/W19-4302, PMCID: PMC6351953
Telmo Pires, Eva Schlinger, and Dan Garrette.
2019. How multilingual is multilingual BERT?
In Proceedings of ACL, pages 4996–5001. DOI:
https://doi.org/10.18653/v1/P19
-1493
Barbara Plank and ˇZeljko Agi´c. 2018. Distant
supervision from disparate sources for low-
resource part-of-speech tagging. In Proceed-
ings of EMNLP, pages 614–620. DOI:
https://doi.org/10.18653/v1/D18
-1061
Emmanouil Antonios
Platanios, Mrinmaya
Sachan, Graham Neubig, and Tom Mitchell.
2018. Contextual parameter generation for
universal neural machine translation. In Pro-
ceedings of EMNLP, pages 425–435. DOI:
https://doi.org/10.18653/v1/D18
-1039
Edoardo Maria Ponti, Goran Glavaš, Olga
Majewska, Qianchu Liu, Ivan Vulić, and
Anna Korhonen. 2020. XCOPA: A multilingual
dataset for causal commonsense reasoning. In
Proceedings of EMNLP.
Edoardo Maria Ponti, Helen O’Horan, Yevgeni
Berzak, Ivan Vulić, Roi Reichart, Thierry
Poibeau, Ekaterina Shutova, and Anna
Korhonen. 2019a. Modeling language varia-
tion and universals: A survey on typological
linguistics for natural language processing.
Computational Linguistics, 45(3):559–601. DOI:
https://doi.org/10.1162/coli_a_00357
Edoardo Maria Ponti, Roi Reichart, 安娜
科尔霍宁, and Ivan Vuli´c. 2018. Isomorphic
transfer of syntactic structures in cross-lingual
NLP. In Proceedings of ACL, pages 1531–1542.
Edoardo Maria Ponti, Ivan Vuli´c, Ryan Cotterell,
Roi Reichart, and Anna Korhonen. 2019乙.
Towards zero-shot language modeling. In Pro-
ceedings of EMNLP, pages 2900–2910.
Stephan Rabanser, Stephan G¨unnemann, 和
Zachary Lipton. 2019. Failing loudly: An
empirical study of methods for detecting
dataset shift. In Proceedings of NeurIPS,
pages 1394–1406.
Afshin Rahimi, Yuan Li, and Trevor Cohn. 2019.
Massively multilingual transfer for NER. In
Proceedings of ACL, pages 151–164. DOI:
https://doi.org/10.18653/v1/P19
-1015
Mohammad Sadegh Rasooli and Michael
Collins. 2017. Cross-lingual syntactic transfer
with limited resources. Transactions of the
Association for Computational Linguistics,
5:279–293.
Shruti Rijhwani, Jiateng Xie, Graham Neubig,
and Jaime G. Carbonell. 2019. Zero-shot
neural transfer for cross-lingual entity linking.
In Proceedings of AAAI, pages 6924–6931.
DOI: https://doi.org/10.1609/aaai
.v33i01.33016924
Rudolf Rosa and Zdenˇek ˇZabokrtsk´y. 2015.
KLcpos3 – a language similarity measure for
delexicalized parser transfer. In Proceedings
of ACL, pages 243–249. DOI: https://
doi.org/10.3115/v1/P15-2040, PMID:
26076412
Sebastian Ruder, Matthew E. Peters, Swabha
Swayamdipta, and Thomas Wolf. 2019a. Trans-
fer learning in natural language processing.
In Proceedings of NAACL-HLT: Tutorials,
pages 15–18. DOI: https://doi.org/10
.18653/v1/N19-5004
Sebastian Ruder, Ivan Vuli´c, and Anders Søgaard.
2019b. A survey of cross-lingual embedding
models. Journal of Artificial Intelligence Re-
搜索, 65:569–631. DOI: https://doi.org
.org/10.1613/jair.1.11640
Ruslan Salakhutdinov and Andriy Mnih. 2008.
Bayesian probabilistic matrix factorization
using Markov chain Monte Carlo. In Pro-
ceedings of ICML, pages 880–887. DOI:
https://doi.org/10.1145/1390156
.1390267
Hanhuai Shan and Arindam Banerjee. 2010.
Generalized probabilistic matrix factorizations
for collaborative filtering. In Proceedings of
ICDM, pages 1025–1030. DOI: https://
doi.org/10.1109/ICDM.2010.116
Ehsan Shareghi, Yingzhen Li, Yi Zhu, Roi
Reichart, and Anna Korhonen. 2019. Bayesian
learning for neural dependency parsing. In Pro-
ceedings of NAACL-HLT, pages 3509–3519.
Benjamin Snyder and Regina Barzilay. 2010.
Climbing the tower of Babel: Unsupervised
multilingual learning. In Proceedings of ICML,
pages 29–36.
Jake Stolee and Neill Patterson. 2019. Matrix
factorization with neural networks and sto-
chastic variational inference. University of
Toronto.
Oscar T¨ackstr¨om, Ryan McDonald, and Jakob
Uszkoreit. 2012. Cross-lingual word clusters
for direct transfer of linguistic structure. 在
Proceedings of NAACL-HLT, pages 477–487.
Alon Talmor and Jonathan Berant. 2019. Mul-
tiQA: An empirical investigation of generaliza-
tion and transfer in reading comprehension.
In Proceedings of ACL, pages 4911–4921.
DOI: https://doi.org/10.18653/v1
/P19-1485
Chen-Tse Tsai, Stephen Mayhew, and Dan Roth.
2016. Cross-lingual named entity recognition
via wikification. In Proceedings of CoNLL,
pages 219–228. DOI: https://doi.org
/10.18653/v1/K16-1022
Andrew Gordon Wilson. 2019. The case for
Bayesian deep learning. NYU Courant Tech-
nical Report.
Shijie Wu and Mark Dredze. 2019. Beto,
Bentz, Becas: The surprising cross-lingual
effectiveness of BERT. In Proceedings of
EMNLP-IJCNLP, pages 833–844.
Yonghui Wu, Mike Schuster, Zhifeng Chen,
Quoc V. Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao,
Klaus Macherey, et al. 2016. Google’s
neural machine translation system: Bridging the
gap between human and machine translation.
Google.
Jiateng Xie, Zhilin Yang, Graham Neubig,
诺亚A. 史密斯, and Jaime Carbonell. 2018.
Neural cross-lingual named entity recognition
with minimal resources. 在诉讼程序中
EMNLP, pages 369–379.
David Yarowsky, Grace Ngai, and Richard
Wicentowski. 2001. Inducing multilingual text
analysis tools via robust projection across
aligned corpora. In Proceedings of the First
International Conference on Human Language
Technology Research, pages 1–8. DOI:
https://doi.org/10.3115/1072133
.1072187
Dani Yogatama, Cyprien de Masson d’Autume,
Jerome Connor, Tomas Kocisky, Mike
Chrzanowski, Lingpeng Kong, Angeliki
Lazaridou, Wang Ling, Lei Yu, Chris Dyer,
et al. 2019. Learning and evaluating
general linguistic intelligence. arXiv 预印本
arXiv:1901.11373v1.
Daniel Zeman and Philip Resnik. 2008.
Cross-language parser adaptation between
related languages. In Proceedings of IJCNLP,
pages 35–42.
Yuan Zhang, David Gaddy, Regina Barzilay,
and Tommi Jaakkola. 2016. Ten pairs to
tag – multilingual POS tagging via coarse
mapping between embeddings. In Proceedings
of the 2016 Conference of the North American
Chapter of the Association for Computational
语言学: 人类语言技术,
pages 1307–1317. Association for Computa-
tional Linguistics, San Diego, California. DOI:
https://doi.org/10.18653/v1/N16
-1156
Yftah Ziser and Roi Reichart. 2018. Deep
pivot-based modeling for cross-language cross-
domain transfer with minimal guidance. In
Proceedings of EMNLP, pages 238–249. DOI:
https://doi.org/10.18653/v1/D18
-1022
A KL-divergence of Gaussians
If both p ≜ N(µ, Σ) and q ≜ N(m, S) are
multivariate Gaussians, their KL-divergence can
be computed analytically as follows:
Figure 4: Samples from the posteriors of 4 languages,
PCA-reduced to 4 dimensions.
$$\mathrm{KL}(q \,\|\, p) = \frac{1}{2}\Big[\ln\frac{|S|}{|\Sigma|} - d + \mathrm{tr}(S^{-1}\Sigma) + (m - \mu)^{\top} S^{-1} (m - \mu)\Big] \qquad (13)$$
By substituting m = 0 and S = I, it is trivial to
obtain Equation (9).
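As a sanity check on Equation (13), the closed form can be coded directly. The following is a minimal numpy sketch (variable names are illustrative, not taken from the released code); with m = 0 and S = I it should recover the simplified expression the text refers to as Equation (9).

```python
# Analytic KL-divergence between two multivariate Gaussians q = N(m, S) and
# p = N(mu, Sigma), following the convention of Equation (13) above.
import numpy as np

def gaussian_kl(m, S, mu, Sigma):
    """KL(q || p) for q = N(m, S), p = N(mu, Sigma), as in Equation (13)."""
    d = m.shape[0]
    S_inv = np.linalg.inv(S)
    diff = m - mu
    term_logdet = np.log(np.linalg.det(S) / np.linalg.det(Sigma))
    term_trace = np.trace(S_inv @ Sigma)
    term_quad = diff @ S_inv @ diff
    return 0.5 * (term_logdet - d + term_trace + term_quad)

# Substituting m = 0 and S = I yields the simplified KL term against a
# standard-normal component (Equation (9) in the main text).
d = 4
mu, Sigma = np.random.randn(d), np.diag(np.random.rand(d) + 0.5)
print(gaussian_kl(np.zeros(d), np.eye(d), mu, Sigma))
print(0.5 * (-np.log(np.linalg.det(Sigma)) - d + np.trace(Sigma) + mu @ mu))
```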
B Visualization of the Learned Posteriors
The approximate posteriors of the latent variables
can be visualized in order to study the learned
representations for languages. Previous work
(Johnson et al., 2017; Östling and Tiedemann,
2017; Malaviya et al., 2017; Bjerva and
Augenstein, 2018) induced point estimates of
language representations from artificial tokens
concatenated to every input sentence, or from the
aggregated values of the hidden state of a neu-
ral encoder. The information contained in such
representations depends on the task (Bjerva and
Augenstein, 2018), but mainly reflects the struc-
tural properties of each language (Bjerva et al.,
2019).
In our work, due to the estimation procedure,
languages are represented by full distributions
rather than point estimates. By inspecting the
learned representations, language similarities do
not appear to follow the structural properties of
语言. This is most likely due to the fact that
parameter factorization takes place after the multi-
lingual BERT encoding, which blends the structural
differences across languages. A fair comparison
with previous works without such an encoder is
left for future investigation.
As an example, consider two pairs of languages
from two distinct families: Yoruba and Wolof
are Niger-Congo from the Atlantic-Congo branch,
Tamil and Telugu are Dravidian. We take 1,000
samples from the approximate posterior over
the latent variables for each of these languages. In
particular, we focus on the variational scheme
with a low-rank covariance structure. We then
reduce the dimensionality of each sample to 4
through PCA,15 and we plot the density along
each resulting dimension in Figure 4. We observe
that density areas of each dimension do not nec-
essarily overlap between members of the same
family. Hence, the learned representations depend
on more than genealogy.
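The procedure behind Figure 4 can be sketched as follows. The snippet below is schematic: the posterior means and low-rank-plus-diagonal covariances are random placeholders standing in for the learned variational parameters of each language, and the language codes are only labels for this example.

```python
# Draw samples from each language's approximate Gaussian posterior, reduce
# them to 4 dimensions with PCA, and compare per-dimension statistics.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
dim, rank, n_samples = 64, 8, 1000
languages = ["yo", "wo", "ta", "te"]  # Yoruba, Wolof, Tamil, Telugu

samples = {}
for lang in languages:
    mean = rng.normal(size=dim)                  # placeholder posterior mean
    factor = rng.normal(size=(dim, rank)) * 0.1  # low-rank covariance factor
    diag = rng.uniform(0.01, 0.1, size=dim)      # diagonal covariance term
    cov = factor @ factor.T + np.diag(diag)
    samples[lang] = rng.multivariate_normal(mean, cov, size=n_samples)

# Fit PCA on the pooled samples so all languages share the same projection.
pca = PCA(n_components=4).fit(np.vstack(list(samples.values())))
reduced = {lang: pca.transform(s) for lang, s in samples.items()}

# If density regions overlapped within a family, the per-component means of
# related languages would lie close to each other.
for lang, r in reduced.items():
    print(lang, np.round(r.mean(axis=0), 2))
```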
15Note that the dimensionality-reduced samples are also
Gaussian since PCA is a linear method.