Improving Low-Resource Cross-lingual Parsing
with Expected Statistic Regularization
Thomas Effland
Columbia University, USA
teffland@cs.columbia.edu
Michael Collins
Google Research, USA
mjcollins@google.com
Abstract
We present Expected Statistic Regularization
(ESR), a novel regularization technique
that utilizes low-order multi-task structural sta-
tistics to shape model distributions for semi-
supervised learning on low-resource datasets.
We study ESR in the context of cross-lingual
transfer for syntactic analysis (POS tagging
and labeled dependency parsing) and present
several classes of low-order statistic functions
that bear on model behavior. Experimentally,
we evaluate the proposed statistics with ESR
for unsupervised transfer on 5 diverse target
languages and show that all statistics, when
estimated accurately, yield improvements to
both POS and LAS, with the best statistic
improving POS by +7.0 and LAS by +8.5
on average. We also present semi-supervised
transfer and learning curve experiments that
show ESR provides significant gains over
strong cross-lingual-transfer-plus-fine-tuning
baselines for modest amounts of labeled data.
These results indicate that ESR is a promis-
ing and complementary approach to model-
transfer approaches for cross-lingual parsing.1
1 Introduction
In recent years, great strides have been made
on linguistic analysis for low-resource languages.
These gains are largely attributable to transfer
approaches from (1) massive pretrained multilin-
gual language model (PLM) encoders (Devlin
et al., 2019; Liu et al., 2019b); (2) multi-task
training across related syntactic analysis tasks
(Kondratyuk and Straka, 2019); and (3) multilin-
gual training on diverse high-resource languages
(Wu and Dredze, 2019; Ahmad et al., 2019;
Kondratyuk and Straka, 2019). Combined, these
approaches have been shown to be particularly
effective for cross-lingual syntactic analysis, as
shown by UDify (Kondratyuk and Straka, 2019).
1 We have published our implementation and experiments at
https://github.com/teffland/expected-statistic-regularization.
However, even with the improvements brought
about by these techniques, transferred models still
make syntactically implausible predictions on
low-resource languages, and these error rates in-
crease dramatically as the target languages be-
come more distant from the source languages
(He et al., 2019; Meng et al., 2019). In par-
ticular, transferred models often fail to match
many low-order statistics concerning the typol-
ogy of the task structures. We hypothesize that
enforcing regularity with respect to estimates of
these structural statistics—effectively using them
as weak supervision—is complementary to cur-
rent transfer approaches for low-resource cross-
lingual parsing.
To this end, we introduce Expected Statistic
Regularization (ESR), a novel differentiable loss
that regularizes models on unlabeled target datasets
by minimizing deviation of descriptive statistics
of model behavior from target values. The
class of descriptive statistics usable by ESR is
expressive and powerful. For example, they may
describe cross-task interactions, encouraging the
model to obey structural patterns that are not
explicitly tractable in the model factorization.
Additionally, the statistics may be derived from
constraints dictated by the task formalism itself
(such as ruling out invalid substructures) or by
numerical parameters that are specific to the
target dataset distribution (such as relative sub-
structure frequencies). In the latter case, we also
contribute a method for selecting those parame-
ters using small amounts of labeled data, based
on the bootstrap (Efron, 1979).
Although ESR is applicable to a variety of
problems, we study it using modern cross-lingual
syntactic analysis on the Universal Dependencies
data, building off of the strong model-transfer
framework of UDify (Kondratyuk and Straka,
2019). We show that ESR is complementary to
transfer-based approaches for building parsers on
low-resource languages. We present several inter-
esting classes of statistics for the tasks and perform
Transactions of the Association for Computational Linguistics, vol. 11, pp. 122–138, 2023. https://doi.org/10.1162/tacl_a_00537
Action Editor: Alexander Clark. Submission batch: 4/2022; Revision batch: 7/2022; Published 1/2023.
© 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
extensive experiments in both oracle unsuper-
vised and realistic semi-supervised cross-lingual
multi-task parsing scenarios, with particularly en-
couraging results that significantly outperform
state-of-the-art approaches for semi-supervised
scenarios. We also present ablations that justify
key design choices.
2 Expected Statistic Regularization
We consider structured prediction in an abstract
setting where we have inputs x ∈ X , output struc-
tures y ∈ Y, and a conditional model pθ(y|x) ∈ P
with parameters θ ∈ Θ, where P is the distribution
space and Θ is the parameter space. In this sec-
tion we assume that the setting is semi-supervised,
with a small labeled dataset $D_L$ and a large unlabeled
dataset $D_U$; let $D_L = \{(x_i, y_i)\}_{i=1}^{m}$ be
the labeled dataset of size m and similarly define
$D_U = \{x_i\}_{i=m+1}^{m+n}$ as the unlabeled dataset.
Our approach centers around a vectorized sta-
tistic function f that maps unlabeled samples and
models to real vectors of dimension df :
$f : D \times P \to \mathbb{R}^{d_f}$    (1)
where D is the set of unlabeled datasets of any
size, (i.e., DU ∈ D). The purpose of f is to sum-
marize various properties of the model using the
sample. For example, if the task is part-of-speech
tagging, one possible component of f could be
the expected proportion of NOUN tags in the unlabeled
data $D_U$. In addition to f, we assume that
we are given vectors of target statistics $t \in \mathbb{R}^{d_f}$ and
margins of uncertainty $\sigma \in \mathbb{R}^{d_f}$ as its supervision
signal. We will discuss the details of f, t, and σ
shortly but first describe the overall objective.
2.1 Semi-Supervised Objective
Given labeled and unlabeled data DL and DU , we
propose the following semi-supervised objective
O, which breaks down into a sum of supervised
and unsupervised terms L and C:
$\hat{\theta} = \arg\min_{\theta \in \Theta} O(\theta; D_L, D_U)$    (2)
$O(\theta; D_L, D_U) = L(\theta; D_L) + \alpha C(\theta; D_U)$    (3)
where α > 0 is a balancing coefficient. The
supervised objective L can be any suitable super-
vised loss; here we will use the negative log-
likelihood of the data under the model. Our
contribution is the unsupervised objective C.
For C, we propose to minimize a distance
function ℓ between the target statistics t
and the value of the statistics f calculated using
unlabeled data and the model pθ. (ℓ also takes
into account the uncertainty margins σ.) A simple
objective would be:
$C(\theta; D_U) = \ell(t, \sigma, f(D_U, p_\theta))$
This is a dataset-level loss that penalizes divergence
from the target statistics. However, evaluating it
requires a pass over the full unlabeled dataset for
every update, which is not amenable to the mini-batch
SGD training that modern hardware requires. Instead,
we propose to optimize this loss in expectation
over unlabeled mini-batch samples $D_U^k$, where
k is the mini-batch size and $D_U^k$ is sampled
uniformly with replacement from $D_U$. Then, C is
given by:
$C(\theta; D_U) = \mathbb{E}_{D_U^k}[\ell(t, \sigma, f(D_U^k, p_\theta))]$    (4)
This objective penalizes the model if the statistic
f, when applied to samples of unlabeled
data $D_U^k$, deviates from the targets t and thus
pushes the model toward satisfying these target
statistics.
Importantly, the objective in Eq. 4 is more
general than typical objectives in that the outer
loss function ℓ does not necessarily break down
into a sum over individual input examples; the
aggregation over examples is done inside f:
$\ell(t, \sigma, f(D_U, p_\theta)) \neq \sum_{x \in D_U} \ell(t, \sigma, f(x, p_\theta))$    (5)
This generality is useful because components of
f may describe statistics that aggregate over inputs,
estimating expected quantities concerning
sample-level regularities of the structures. In contrast,
the right-hand side of Eq. 5 is more stringent,
imposing that the statistic be the same for all
instances of x. In practice, aggregating over the batch
before applying the loss reduces noise compared to a
per-sentence loss, as shown in Section 5.3.1.
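The inequality in Eq. 5 can be illustrated with a minimal sketch. It assumes a single statistic (the expected proportion of NOUN tags) and uses a plain absolute difference as a stand-in for the distance function; the names `abs_distance` and `noun_proportion` are hypothetical and not taken from the released code.

```python
def abs_distance(target, value):
    """Toy stand-in for the distance function (an assumption, not the paper's loss)."""
    return abs(value - target)


def noun_proportion(batch):
    """Expected NOUN proportion over all tokens in a batch.

    batch: list of sentences; each sentence is a list of per-token
    model probabilities p(tag = NOUN | x).
    """
    total_tokens = sum(len(sentence) for sentence in batch)
    expected_nouns = sum(p for sentence in batch for p in sentence)
    return expected_nouns / total_tokens


batch = [
    [1.0, 1.0],  # a sentence the model tags entirely as NOUN
    [0.0, 0.0],  # a sentence the model tags with no NOUN at all
]
target = 0.5

# Left-hand side of Eq. 5: aggregate over the batch first, then apply the loss.
batch_loss = abs_distance(target, noun_proportion(batch))

# Right-hand side of Eq. 5: apply the loss to each sentence, then sum.
per_sentence_loss = sum(abs_distance(target, noun_proportion([s])) for s in batch)
```

In this toy batch the aggregate proportion matches the target exactly, so the batch-level loss is zero, while the per-sentence variant penalizes both sentences for not individually matching a corpus-level proportion.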
2.2 The Statistic Function f
In principle the vectorized statistic function f
could be almost any function of the unlabeled data
and model, provided it is possible to obtain its
gradients with respect to the model parameters θ.
In this work, however, we assume f has the
following three-layer structure.
First, let g be another vectorized function of
‘‘sub-statistics’’ that may have a different dimen-
sionality than f and takes individual x, y pairs
as input:
$g : X \times Y \to \mathbb{R}^{d_g}$    (6)
Then let $\bar{g}$ be the expected value of g under the
model pθ summed over the sample $D_U$:
$\bar{g} = \sum_{x \in D_U} \mathbb{E}_{p_\theta(y|x)}[g(x, y)]$    (7)
Given $\bar{g}$, let the j'th component of f be the result
of an aggregating function $h_j : \mathbb{R}^{d_g} \to \mathbb{R}$ on $\bar{g}$:
$f_j(D_U, p_\theta) = h_j(\bar{g})$    (8)
The individual components $g_i$ will mostly be
counting functions that tally various substructures
in the data. The $\bar{g}_i$'s then are expected substructure
counts in the sample, and the $h_j$'s aggregate
small subsets of these intermediate counts in
different ways to compute various marginal probabilities.
Again, in general f does not need to
follow this structure and any suitable statistic function
can be incorporated into the regularization
term proposed in Eq. 4.
In some cases, when the structure of g does not
follow the model factorization either additively or
multiplicatively, computation of the model expectation
$\mathbb{E}_{p_\theta(y|x)}[g(x, y)]$ in Eq. 7 is intractable.
In these situations, standard Monte Carlo approximation
breaks differentiability of the objective
with respect to the model parameters θ and cannot
be used. To remedy this, we propose to
use the "Stochastic Softmax" differentiable sampling
approximation from Paulus et al. (2020) to
allow optimization of these functions. We propose
several such statistics in the application (see
Section 4.3).
2.3 The Distance Function ℓ
For the distance function ℓ, we propose to use a
smoothed hinge loss (Girshick, 2015) that adapts
with the margins σ. Letting $\bar{f} = f(D_U^k, p_\theta)$, the
i'th component of ℓ is given by:
$\ell_i = \begin{cases} \frac{(\bar{f}_i - t_i)^2}{2\sigma_i} & \text{if } |\bar{f}_i - t_i| < \sigma_i \\ |\bar{f}_i - t_i| - \sigma_i & \text{else} \end{cases}$    (9)
The total loss ℓ is then the sum of its components:
$\ell(t, \sigma, f(D_U^k, p_\theta)) = \sum_i \ell_i(t_i, \sigma_i, \bar{f}_i)$    (10)
We choose this function because it is robust to
outliers, adapts its width to the margin parameter
$\sigma_i$, and expresses a preference for $\bar{f}_i = t_i$ (as
opposed to max-margin losses). We give an ablation
study in Section 5.3.2 justifying its use.
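The three-layer statistic (Eqs. 6 to 8) and the distance function (Eqs. 9 and 10) can be sketched together in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the sub-statistics are POS tag counts, the aggregators are tag proportions, and all function names are hypothetical. The loss follows Eq. 9 as printed here; for reference, Girshick's smooth L1 loss subtracts half the transition width in its linear branch. A real training setup would compute the same quantities with an automatic differentiation library so that gradients reach θ.

```python
def expected_counts(batch_marginals, tags):
    """g-bar (Eq. 7): expected tag counts summed over a mini-batch.

    batch_marginals: list of sentences; each sentence is a list of dicts
    mapping tag -> p(tag | x) for one token.
    """
    g_bar = {tag: 0.0 for tag in tags}
    for sentence in batch_marginals:
        for token_dist in sentence:
            for tag in tags:
                g_bar[tag] += token_dist.get(tag, 0.0)
    return g_bar


def proportion(g_bar, tag):
    """h_j (Eq. 8): turn expected counts into a marginal probability."""
    total = sum(g_bar.values())
    return g_bar[tag] / total if total > 0 else 0.0


def esr_component(f_i, t_i, sigma_i):
    """One component of the distance function, following Eq. 9 as printed."""
    diff = abs(f_i - t_i)
    if diff < sigma_i:
        return diff ** 2 / (2.0 * sigma_i)
    return diff - sigma_i


def esr_loss(f, t, sigma):
    """Total loss (Eq. 10): the sum of the per-statistic components."""
    return sum(esr_component(fi, ti, si) for fi, ti, si in zip(f, t, sigma))


tags = ["NOUN", "VERB"]
batch = [
    [{"NOUN": 0.9, "VERB": 0.1}, {"NOUN": 0.2, "VERB": 0.8}],
    [{"NOUN": 0.7, "VERB": 0.3}],
]
g_bar = expected_counts(batch, tags)
f = [proportion(g_bar, tag) for tag in tags]
loss = esr_loss(f, t=[0.5, 0.5], sigma=[0.2, 0.05])
```

The first statistic falls inside its margin and lands in the quadratic branch, while the second has a tighter margin and lands in the linear branch, which shows how σ controls the strength of each component.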
3 Choosing the Targets and Margins
There are several possible approaches to choos-
ing the targets t and margins σ, and in general
they can differ based on the individual statistics.
For some statistics it may be possible to specify
the targets and margins using prior knowledge or
formal constraints from the task. In other cases,
estimating the targets and margins may be more
difficult. Depending on the problem context, one
may be able to estimate them from related tasks
or domains (such as neighboring languages for
cross-lingual parsing). Here, we propose a general
method that estimates the statistics using labeled
data, and is applicable to semi-supervised scenar-
ios where at least a small amount of labeled data
is available.
The ideal targets are the expected statistics
under the "true" model p∗: $t^* = \mathbb{E}_{D_U^k}[f(D_U^k, p^*)]$,
where k is the batch size. We
can estimate this expectation using labeled data
$D_L$ and bootstrap sampling (Efron, 1979). Utilizing
$D_L$ as a set of point estimates for p∗, we
sample B total minibatches of k labeled examples
uniformly with replacement from $D_L$ and
calculate the statistic f for each of these minibatch
datasets. We then compute the target statistic
as the sample mean:
$t = \frac{1}{B} \sum_{i=1}^{B} f(D_L^{(i)}), \quad |D_L^{(i)}| = k, \ \forall i$    (11)
where we have slightly abused notation by writing
f (DL) to mean f computed using the inputs {x :
(x, y) ∈ DL} and the point estimates p∗(y|x) =
1, ∀(x, y) ∈ DL.
In addition to estimating the target statistics for
small batch sizes, the bootstrap gives us a way to
estimate the natural variation of the statistics for
small sample sizes. To this end, we propose to
utilize the standard deviations from the bootstrap
samples as our margins of uncertainty σ:
$\sigma = \sqrt{\frac{1}{B-1} \sum_{i=1}^{B} (f(D_L^{(i)}) - t)^2}$    (12)
This allows our loss function ℓ to adapt to more
or less certain statistics. If some statistics are
naturally too variable to serve as effective supervision,
they will automatically contribute weakly
to ℓ and have little impact on the model training.
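The bootstrap estimates of Eqs. 11 and 12 can be sketched as follows. This is a minimal illustration using only the Python standard library; `bootstrap_targets` and `gold_noun_proportion` are hypothetical names, and the toy statistic (the gold NOUN proportion in a mini-batch) stands in for the vectorized f evaluated on gold structures.

```python
import random
import statistics


def bootstrap_targets(labeled, statistic, k, B, seed=0):
    """Estimate targets t (Eq. 11) and margins sigma (Eq. 12).

    labeled:   list of labeled examples (D_L)
    statistic: function mapping a mini-batch of examples to a list of floats
    k:         mini-batch size
    B:         number of bootstrap samples (must be at least 2)
    """
    rng = random.Random(seed)
    # Draw B mini-batches of size k uniformly with replacement.
    samples = [statistic(rng.choices(labeled, k=k)) for _ in range(B)]
    dims = len(samples[0])
    t = [statistics.mean([s[j] for s in samples]) for j in range(dims)]
    # statistics.stdev uses the B - 1 denominator, matching Eq. 12.
    sigma = [statistics.stdev([s[j] for s in samples]) for j in range(dims)]
    return t, sigma


def gold_noun_proportion(minibatch):
    """Toy statistic: proportion of gold NOUN tags in the mini-batch."""
    tokens = [tag for sentence in minibatch for tag in sentence]
    return [sum(tag == "NOUN" for tag in tokens) / len(tokens)]


labeled = [
    ["NOUN", "VERB"],
    ["DET", "NOUN", "VERB"],
    ["NOUN", "NOUN", "VERB", "ADP"],
    ["VERB"],
]
t, sigma = bootstrap_targets(labeled, gold_noun_proportion, k=8, B=1000)
```

A statistic that barely varies across bootstrap samples receives a small margin and therefore a sharp penalty, whereas a highly variable statistic receives a wide margin and a weaker one.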
4 Application to Cross-Lingual Parsing
Now that we have described our general approach,
in this section we lay out a proposal for applying
it to cross-lingual joint POS tagging and depen-
dency parsing. We choose to apply our method
to this problem because it is an ideal testbed for
controlled experiments in semi-supervised struc-
tured prediction. By their nature, the parsing tasks
admit many types of interesting statistics that cap-
ture cross-task, universal, and language-specific
facts about the target test distributions.
We evaluate in two different transfer settings:
oracle unsupervised and realistic semi-supervised.
In the oracle unsupervised setting, there is no
supervised training data available for the target
languages (and the L term is dropped from Eq. 3),
but we use target values and margins calculated
from the held-out supervised data. This setting
allows us to understand the impact of our regularizer
in isolation without the confounding effects
of direct supervision or inaccurate targets.
In the semi-supervised experiments, we vary the
amounts of supervised data, and calculate the tar-
gets from the small supervised data samples. This
is a realistic application of our approach that may
be applied to low-resource learning scenarios.
4.1 Problem Setup and Data
We use the Universal Dependencies (Nivre, 2020)
v2.8 (UD) corpus as data. In UD, syntactic anno-
tation is formulated as a labeled bilexical depen-
dency tree, connecting words in a sentence, with
additional part-of-speech (POS) tags annotated
for each word. The labeled tree can be broken
down into two parts: the arcs that connect the
head words to child words, forming a tree, and the
dependency labels assigned to each of those arcs.
Due to the definition of UD syntax, each word is
the child of exactly one arc, and so both the at-
tachments and labels can be written as sequences
that align with the words in the sentence.
More formally then, for each labeled sentence
$x_{1:n}$ of length n, the full structure y is given by
the three sequences $y = (t_{1:n}, e_{1:n}, r_{1:n})$, where
$t_{1:n}$, $t_i \in T$ are the POS tags, $e_{1:n}$,
$e_i \in \{1, \ldots, n\}$ are the head attachments, and $r_{1:n}$,
$r_i \in R$ are the dependency labels.
4.2 The Model and Training
We now turn to the parsing model that is used
as the basis for our approach. Though the gen-
eral ideas of our approach are adaptable to other
models, we choose to use the UDify architecture
because it is one of the state-of-the-art multilin-
gual parsers for UD.
4.2.1 The UDify Model
The UDify model is based on trends in state-
of-the-art parsing, combining a multilingual pre-
trained transformer
language model encoder
(mBERT) with a deep biaffine arc-factored pars-
ing decoder, following Dozat and Manning (2017).
These encodings are additionally used to predict
POS tags with a separate decoder. The full details
are given in Kondratyuk and Straka (2019), but
here it suffices to characterize the parser by its
top-level probabilistic factorization:
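$p(t_{1:n}, e_{1:n}, r_{1:n} \mid x_{1:n}) = p(e_{1:n} \mid x_{1:n})\, p(t_{1:n} \mid x_{1:n})\, p(r_{1:n} \mid e_{1:n}, x_{1:n})$    (13)
$\qquad = p(e_{1:n} \mid x_{1:n}) \prod_{i=1}^{n} p(t_i \mid x_{1:n})\, p(r_i \mid e_i, x_{1:n})$    (14)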
This model is scant on explicit joint factors,
following recent trends in structured prediction
that forgo higher-arity factors, instead opting for
shared underlying contextual representations produced
by mBERT that implicitly contain information
about the sentence and structure as a whole.
This factorization will prove useful in Section 4.3,
where it will allow us to compute many of the
supervision statistics under the model exactly.
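To make the point about exact computation concrete, the sketch below computes expected (child tag, label) counts for one sentence under the factorization of Eq. 14. It assumes that per-word arc marginals p(e_i = h | x) are already available from whatever inference routine the parser provides; that input, the dictionary layout, and the function name are illustrative assumptions, not details taken from the paper.

```python
def expected_child_label_counts(tag_probs, arc_marginals, label_probs):
    """Expected counts of (child tag, label) pairs for one sentence.

    tag_probs[i][a]      = p(t_i = a | x)
    arc_marginals[i][h]  = p(e_i = h | x)            (assumed precomputed)
    label_probs[i][h][b] = p(r_i = b | e_i = h, x)
    """
    counts = {}
    for i, tag_dist in enumerate(tag_probs):
        # Marginal label distribution for word i, summing out its head.
        label_marginal = {}
        for h, p_arc in arc_marginals[i].items():
            for b, p_label in label_probs[i][h].items():
                label_marginal[b] = label_marginal.get(b, 0.0) + p_arc * p_label
        # Tags are independent of arcs and labels given x (Eq. 14),
        # so the joint expectation is a product of marginals.
        for a, p_tag in tag_dist.items():
            for b, p_label in label_marginal.items():
                counts[(a, b)] = counts.get((a, b), 0.0) + p_tag * p_label
    return counts


# A one-word toy sentence whose only candidate head is position 0.
tag_probs = [{"NOUN": 0.8, "VERB": 0.2}]
arc_marginals = [{0: 1.0}]
label_probs = [{0: {"nsubj": 0.5, "obj": 0.5}}]
counts = expected_child_label_counts(tag_probs, arc_marginals, label_probs)
```

Because each word contributes total probability mass of one, the expected counts sum to the sentence length, which is a convenient sanity check.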
4.2.2 Training
The UDify approach to training is simple: It
begins with a multilingual PLM, mBERT, then
fine-tunes the parsing architecture on the con-
catenation of the source languages. With vanilla
UDify, transfer to target languages is zero-shot.
Our approach begins with these two training
steps from UDify, then adds a third: adapting to
the target language using the target statistics and
possibly small amounts of supervised data (Eq. 3).
4.3 Typological Statistics as Supervision
We now discuss a series of statistics that we
will use as weak supervision. Most of the pro-
posed statistics describe various probabilities for
different (but related) grammatical substructures
and can ultimately be broken down into ra-
tios of ‘‘count’’ functions (sums of indicators),
which tally various types of events in the data.
We propose statistics that cover surface level
(POS-only), single-arc, two-arc, and single-head
substructures, as well as conditional variants. Due
to space constraints, we omit their mathematical
descriptions.
Surface Level: One simple set of descriptive
statistics are the unigram and bigram distributions
over POS tags. POS unigrams can capture some
basic relative frequencies, such as our expectation
that nouns and verbs are common to all languages.
POS bigrams will allow us to capture simple
word-order preferences.
Single-Arc: This next set of statistical families
all capture information about various choices in
single-arc substructures. A single arc substructure
carries up to 5 pieces of information: the arc’s
direction, label, and distance, as well as the tags
for the head and child words. Various subsets of
these capture differing forms of regularity, such
as "the probability of seeing tag $t_h$ head an arc
with label r in direction d".
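The single-arc information described above can be sketched as a small extraction routine over one gold tree. The conventions here are assumptions made for illustration: heads are 1-indexed with 0 marking the root attachment (which is skipped), and direction records whether the child lies to the left or right of its head. The function names are hypothetical.

```python
from collections import Counter


def single_arc_events(tags, heads, labels):
    """List the five pieces of information carried by each arc.

    tags:   POS tag per word
    heads:  1-indexed head position per word; 0 marks the root (assumption)
    labels: dependency label per word
    """
    events = []
    for child, (head, label) in enumerate(zip(heads, labels), start=1):
        if head == 0:  # skip the root attachment
            continue
        events.append({
            "head_tag": tags[head - 1],
            "child_tag": tags[child - 1],
            "label": label,
            "direction": "left" if child < head else "right",
            "distance": abs(head - child),
        })
    return events


def relative_frequencies(events, keys):
    """Relative frequency of each combination of the selected fields."""
    counts = Counter(tuple(event[k] for k in keys) for event in events)
    total = sum(counts.values())
    return {combo: n / total for combo, n in counts.items()}


# "the dog barks": the <- dog <- barks
tags = ["DET", "NOUN", "VERB"]
heads = [2, 3, 0]
labels = ["det", "nsubj", "root"]
events = single_arc_events(tags, heads, labels)
probs = relative_frequencies(events, ("head_tag", "label", "direction"))
```

Choosing a different subset of keys yields a different statistic family, for example `("child_tag", "label", "direction")` for the child-side variant.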
Universally Impossible Arcs: In addition to
many single-arc variants, we also consider the
specific subset of (head tag, label, child tag)
single-arc triples that are never seen in any UD
data. These combinations correspond to the impossible
arrangements that do not "type-check"
within the UD formalism and are interesting in that
they could in principle be specified by a linguist
without any labeled data whatsoever. As such,
they represent a particularly attractive use-case
of our approach, where a domain expert could
rule out all invalid substructures dictated from
the task formalism without the model having to
learn it implicitly from the training data. With
complex structures, this can be a large proportion
of the possibilities: in UD we can rule out 93.2%
(9,966/10,693) of the combinations.
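The construction of the impossible set and the arithmetic behind the reported proportion can be sketched as below. The toy inventories are invented for illustration. The inventory sizes of 17 tags and 37 labels are an assumption based on the UD universal tag and relation sets; they are not stated in the text, although their product does reproduce the 10,693 total.

```python
from itertools import product


def impossible_triples(tags, labels, observed):
    """All (head tag, label, child tag) triples never observed in the data."""
    all_triples = set(product(tags, labels, tags))
    return all_triples - set(observed)


# Toy inventories (hypothetical) to show the construction.
toy_tags = ["NOUN", "VERB", "DET"]
toy_labels = ["det", "nsubj"]
toy_observed = {("NOUN", "det", "DET"), ("VERB", "nsubj", "NOUN")}
toy_impossible = impossible_triples(toy_tags, toy_labels, toy_observed)

# Arithmetic check of the proportion reported in the text.
NUM_TAGS = 17    # assumption: size of the UD universal POS tag set
NUM_LABELS = 37  # assumption: size of the UD universal relation set
total = NUM_TAGS * NUM_LABELS * NUM_TAGS
share = 9966 / total  # 9,966 impossible triples, as reported in the text
```

Each impossible triple then becomes one statistic whose expected frequency under the model should be zero.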
Two-Arc: We also consider substructures spanning
two connected arcs in the tree. They may
be useful because they cover many important
typological phenomena, such as subject-object-verb
ordering. They also have been known to be
strong features in higher-order parsing models,
such as the parser of Carreras (2007), but are also
known to be intractable in non-projective parsers
(McDonald and Pereira, 2006).
Following McDonald and Pereira (2006), we
distinguish between two different patterns of
neighboring arcs: siblings and grandchildren. Sibling
arc pairs consist of two arcs that share a single
head word, while grandchild arc pairs share
an intermediate word that is the child of one arc
and the head of another.
Head-Valency: One interesting statistic that
does not fall into the other categories is the valency
of a particular head tag. This corresponds
to the count of outgoing arcs headed by some
tag. We convert this into a probability by using
a binning function that allows us to quantify the
"probability that some tag heads between a and
b children". Like the two-arc statistics, expected
valency statistics are intractable under the model
and we must approximate their computation.
Conditional Variants: Further, each of these
statistics can be described in conditional terms, as
opposed to their full joint realizations. To do this,
we simply divide the joint counts by the counts
of the conditioned-upon sub-events. Conditional
variants may be useful because they do not express
preferences for probabilities of the sub-events on
the right side of the conditioning bar, which may
be hard to estimate.
Average Entropy: In addition to the above proposed
relative frequency statistics, we also include
average per-token, per-edge, and MST tree entropies
as additional regularization statistics that
are always used. Though we do not show it here,
each of these functions may be formulated as a
statistic within our approach. The inclusion of
these statistics amounts to a form of Entropy Regularization
(Grandvalet and Bengio, 2004) that
keeps the models from optimizing the other ESR
constraints with degenerate constant predictions
(Mann and McCallum, 2010).
5 Oracle Unsupervised Experiments
We begin with oracle unsupervised transfer experiments
that evaluate the potential of many
Language     Code  Treebank    Family          Train Sents  UDPRE LAS
Arabic       ar    PADT        Semitic         6.1k         80.5
Basque       eu    BDT         Basque          5.4k         77.0
Chinese      zh    GSD         Sino-Tibetan    4.0k         62.3
English      en    EWT         IE, Germanic    12.5k        88.1
Finnish      fi    TDT         Uralic          12.2k        84.4
Hebrew       he    HTB         Semitic         5.2k         80.5
Hindi        hi    HDTB        IE, Indic       13.3k        87.0
Italian      it    ISDT        IE, Romance     13.1k        91.8
Japanese     ja    GSD         Japanese        7.1k         73.6
Korean       ko    GSD         Korean          4.4k         79.0
Russian      ru    SynTagRus   IE, Slavic      15.0k∗       89.1
Swedish      sv    Talbanken   IE, Germanic    4.3k         85.7
Turkish      tr    IMST        Turkic          3.7k         61.7
---------------------------------------------------------------------
German       de    HDT         IE, Germanic    153.0k       82.7
Indonesian   id    GSD         Austronesian    4.5k         50.4
Maltese      mt    MUDT        Semitic         1.1k         20.9
Persian      fa    PerDT       IE, Iranian     26.2k        57.0
Vietnamese   vi    VTB         Austro-Asiatic  1.4k         48.1
Table 1: Training and Evaluation Treebank Details. The final column shows UDPRE test set performance
after UDify training (evaluation treebank performance is zero-shot). (∗): downsampled to the same 15k
sentences as Üstün et al. (2020) to reduce training time and balance the data.
types of statistics and some ablations. In this set-
ting, we do not assume any labeled data in the
target language, but do assume accurate target
statistics and margins, calculated from held-out
training data using the method of Section 3. This
allows us to study the potential of our proposed
ESR regularization term C on its own and with-
out the confounds of supervised data or inaccu-
rate targets.
5.1 Experimental Setup
Next we describe setup details for the experi-
ments. These settings additionally apply to the
rest of the experiments unless otherwise stated.
5.1.1 Datasets
In all experiments, the models are first initial-
ized from mBERT, then trained using the UDify
code (Kondratyuk and Straka, 2019) on 13 diverse
treebanks, following Kulmizev et al. (2019) and
Üstün et al. (2020). This model, further referred
to as UDPRE, is used as the foundation for all
approaches.
As discussed in Kulmizev et al. (2019), these
13 training treebanks were selected to give a
diverse sample of languages, taking into account
factors such as language families, scripts, mor-
phological complexity, and annotation quality.
We evaluate all proposed methods on 5 held-out
languages, similarly selected for diversity in
language typology, but with the additional factor
of transfer performance of the UDPRE baseline.2
A summary table of these training and eval-
uation treebanks is given in Table 1.
5.1.2 Approaches
We compare our approach to two strong base-
lines in all experiments, based on recent advances
in the literature for cross-lingual parsing. These
baselines are implemented in our code so that we
may fairly compare them in all of our experiments.
• UDPRE: The first baseline is the UDify
(Kondratyuk and Straka, 2019) model-transfer
approach. Multilingual model-transfer alone
2 While we would like to evaluate on as many UD treebanks
as possible, budgetary constraints required that we
restrict the number of test languages when experimenting
with settings that combinatorially vary in other dimensions.
We do however experiment with more languages in
Section 6.2.
is currently one of the state-of-the-art ap-
proaches to cross-lingual parsing and is a
strong baseline in its own right.
• UDPRE-PPT: We also apply the Parsimo-
nious Parser Transfer (PPT) approach from
Kurniawan et al. (2021). PPT is a nuanced
self-training approach, extending Täckström
et al. (2013), that encourages the model to
concentrate its mass on its most likely pre-
dicted parses for the target treebank. We use
their loss implementation, but apply it to
our UDPRE base model (instead of their
weaker base model) for a fair comparison,
so this approach combines UDify with PPT.
• UDPRE-ESR: Our proposed approach, Ex-
pected Statistic Regularization (ESR), ap-
plied to UDPRE as an unsupervised-only
objective. In individual experiments we will
specify the statistics used for regularization.
5.1.3 Training and Evaluation Details
For metrics, we report accuracy for POS tagging,
coarse-grained labeled attachment score (LAS)
for dependency trees, and their average as a single
summary score. The metrics are computed using
the official CoNLL-18 evaluation script.3 For all
scenarios, we use early-stopping for model se-
lection, measuring the POS-LAS average on the
specified development sets.
We tune learning rates and α for each pro-
posed loss variant at the beginning of the first
experiment with a low-budget grid search, using
the settings that achieve best validation metric on
average across the 5 language validation sets for
all remaining experiments with that variant. We
find generally that a base learning rate of $2 \times 10^{-5}$
and α = 0.01 worked well for all variants of
our method. We train all models using AdamW
(Loshchilov and Hutter, 2019) on a slanted trian-
gular learning rate schedule (Devlin et al., 2019)
with 500 warmup steps. Also, since the datasets
vary in size, we normalize the training schedule to
25 epochs at 1000 steps per epoch. We use a batch
size of 8 sentences for training and estimating
statistic targets. When bootstrapping estimates for
t and σ we use B = 1000 samples.
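The reported schedule can be sketched as a function of the training step. The exact shape is an assumption: linear warmup over the 500 warmup steps followed by linear decay to zero at the end of the 25 × 1000 normalized steps. The function name and default arguments are illustrative, with the defaults taken from the values reported above.

```python
def slanted_triangular_lr(step, base_lr=2e-5, warmup_steps=500,
                          total_steps=25 * 1000):
    """Learning rate at a given step.

    Assumed shape: linear warmup to base_lr, then linear decay to zero.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)


# Learning rates at a few points of the 25,000-step schedule.
schedule = [slanted_triangular_lr(s) for s in (0, 250, 500, 12750, 25000)]
```

Normalizing every run to the same number of steps keeps the schedule identical across treebanks of very different sizes.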
3 https://universaldependencies.org/conll18/evaluation.html.
5.2 Assessing the Proposed Statistics
In this experiment we evaluate 32 types of statis-
tics from Section 4.3 for transfer of the UDPRE
model (pretrained on 13 languages) to the target
languages. The purpose of this experiment is to
get a sense of the effectiveness of each statistic
for improving model-based cross-lingual trans-
fer.4 To prevent overfitting to the test sets for later
experiments, all metrics for this experiment are
calculated on the development sets.
Results: The results of the experiment are pre-
sented in Table 2, ranked from best to worst.
Due to space constraints, we only show the top
10 statistics in addition to the Universal-Arc
statistic. Generally we find that all of the 32
proposed statistics improve upon the UDPRE and
UDPRE-PPT models on average, with many ex-
hibiting large boosts. The best performing statistic
concerns (Child Tag, Label, Direction) substruc-
tures, yielding an average improvement of +7.0
POS and +8.5 LAS, an average relative error
rate reduction of 23.5%. Many other statistics
are not far behind, and overall statistics that bear
on the child tag and dependency label had the
highest impact. This indicates that, with accurate
target estimates, the proposed statistics are highly
complementary to multilingual parser pretraining
(UDPRE) and substantially improve transfer qual-
ity in the unsupervised setting. By comparison, the
PPT approach provides marginal gains to UDPRE
of only +1.4 average POS and +1.5 average LAS.
Another interesting result is that several of the
intractable two-arc statistics were among the best
statistics overall, indicating that the use of the
differentiable SST approximation does not pre-
clude the applicability of intractable statistics. For
example, the directed grandchild statistic of cooccurrences
of incoming and outgoing edges for
certain tags was the second highest performing,
with an average improvement of +6.2 POS accuracy
and +8.7 LAS in Table 2 (21.3% average error rate
reduction).
Results for the conditional variants (not shown)
were less positive. Generally, conditional vari-
ants were worse than their full joint counterparts
(e.g., ‘‘Child | Label’’ and ‘‘Label | Child’’ are
worse than ‘‘Child, Label’’), performing worse in
15/16 cases. This makes sense, as we are using
4 While it would also be possible to try out different
combinations of the various statistics, due to cost considerations
we leave these experiments to future work.
                                POS                                  LAS
Statistic                       de     id     fa     vi     mt       de     id     fa     vi     mt       avg
UDPRE                           89.3   80.3   64.7   83.0   41.4     82.7   50.4   57.0   48.1   20.9     61.8
UDPRE-PPT                       +0.4   +5.6   −1.5   −0.1   +3.1     +0.2   +8.1   −5.5   −0.3   +4.6     +1.5
-------------------------------------------------------------------------------------------------------------
†Child, Label                   +3.3   +5.7   +4.0   +8.0   +14.0    +3.5   +10.1  +18.4  +0.5   +10.2    +7.8
∗†Child, Label, Grand-label     +1.5   +2.8   +5.9   +5.3   +15.7    +2.5   +9.2   +16.2  +3.2   +12.5    +7.5
†Head, Child, Label             +2.5   +4.9   +4.1   +7.2   +14.2    +2.8   +9.9   +17.2  +1.3   +10.2    +7.4
†Head, Label                    +1.7   +3.2   +6.3   +5.2   +14.2    +2.7   +9.0   +16.4  +3.8   +11.0    +7.3
†Head, Label | Child            +3.0   +5.2   +5.7   +5.8   +10.9    +2.8   +9.7   +15.3  +2.1   +5.6     +6.6
†Label                          −0.2   +4.4   +5.1   +0.0   +11.3    +2.9   +8.4   +17.1  +4.0   +9.5     +6.2
†Label, Distance                −0.2   +3.7   +3.9   −0.1   +11.7    +2.8   +8.9   +16.3  +4.1   +9.4     +6.0
∗†Head, Sibling Children Tags   +2.2   +5.1   +7.8   +5.6   +14.0    +1.6   +2.7   +11.3  −3.0   +11.1    +5.8
†Head, Child                    +2.3   +5.3   +5.9   +3.7   +14.0    +1.9   +3.3   +12.2  −0.7   +10.1    +5.8
†Label | Child                  +1.2   +4.3   +4.8   −0.5   +10.0    +3.3   +9.5   +16.6  +2.0   +5.6     +5.7
Universal Arc                   +2.4   +4.7   +1.4   +1.4   +8.7     +2.1   +8.1   +4.0   −3.1   +3.9     +3.4
Table 2: Unsupervised Oracle Statistic Variant Results. (Top): Baseline methods that do not use
ESR. (Bottom): Various statistics used by ESR as unsupervised loss on top of UDPRE. Scores are
measured on target treebank development sets. The statistics used in later experiments are
"Child, Label" and "Universal Arc". (∗): All statistics with ∗ are intractable and utilize the SST
relaxation of Paulus et al. (2020). (†): All statistics with † also include directional information.
accurate statistics and full joints are strictly more
expressive.
This experiment gives a broad but shallow view
into the effectiveness of the various proposed
statistics. In the rest of the experiments, we eval-
uate the following two variants in more depth:
1. ESR-CLD, which supervises target propor-
tions for (Child Tag, Label, Direction) triples.
This is the ‘‘Child, Label’’ row in Table 2.
2. ESR-UNIARC, which supervises the 9,966
universally impossible (Head Tag, Child Tag,
Label) arcs that do not require labeled data
to estimate. All of these combinations have
targets values of t = 0 and margins σ = 0.
This is the ‘‘Universal Arc’’ row in Table 2.
We choose these two because ESR-CLD is the
best performing statistic overall and ESR-UNIARC
is unique in that it does not require labeled data
to estimate; we do not evaluate others because of
cost considerations.
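As a rough illustration of what the ESR-CLD statistic measures, the sketch below counts (Child Tag, Label, Direction) triples in gold trees and normalizes them into proportions. The `Token` layout, the function name `cld_proportions`, the direction names, the treatment of the root, and the toy sentence are all assumptions made for this example; ESR itself applies such statistics to model distributions on unlabeled data rather than to gold trees.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    """One word of a dependency tree (hypothetical layout for this sketch)."""
    tag: str    # POS tag of the word (the "child" tag)
    head: int   # 1-based position of the head word, 0 for the root
    label: str  # dependency label of the arc from the head to this word


def cld_proportions(sentences):
    """Return the proportion of each (child tag, label, direction) triple."""
    counts = Counter()
    for sent in sentences:
        for position, tok in enumerate(sent, start=1):
            if tok.head == 0:
                direction = "root"        # assumption: root arcs get their own value
            elif tok.head < position:
                direction = "head-left"   # the head precedes the child
            else:
                direction = "head-right"  # the head follows the child
            counts[(tok.tag, tok.label, direction)] += 1
    total = sum(counts.values())
    return {triple: n / total for triple, n in counts.items()}


# Toy sentence "the dog barks": "the" attaches to "dog", "dog" to "barks" (root).
toy = [[Token("DET", 2, "det"), Token("NOUN", 3, "nsubj"), Token("VERB", 0, "root")]]
proportions = cld_proportions(toy)
```

Proportions computed this way from labeled trees play the role of the targets t; in the oracle setting of Table 2 they would come from the full target treebank.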
5.3 Ablation Studies
Next, we perform two ablation experiments to
evaluate key design choices of the proposed approach.
First, we evaluate the use of batch-level
aggregation in the statistics before the loss, versus
the more standard approach of loss-per-sentence.
Second, we evaluate the proposed form of $\ell$.
We compare the two aggregation variants using
the CLD (Child Tag, Label, Direction) statistic
(ESR-CLD). We report test set results averaged
over all 5 languages. We use the same
hyperparameters selected in Section 5.2.

Aggregation Variant     POS avg   LAS avg   avg
Loss per sentence       77.1      58.5      67.8
Loss per batch (ESR)    79.9      60.4      70.1

Table 3: Loss Aggregation Ablation Results. Loss
per batch outperforms loss per sentence for both
POS and LAS on average.
5.3.1 Batch-Level Loss Ablation
In this ablation, we evaluate a key feature of our
proposal: the aggregation of the statistic over
the batch before loss computation (Eq. 5), versus
the more standard approach, which is to apply the
loss per sentence. The former, ‘‘Loss per batch’’,
has the form $\ell(t, \sigma, f(D_U, p_\theta))$, while the latter,
‘‘Loss per sentence’’, has the form:
$$\sum_{x \in D_U} \ell(t, \sigma, f(x, p_\theta)).$$
The significance of this difference is that ‘‘Loss
per batch’’ allows for the variation in individual
sentences to somewhat average out and hence is
less noisy, while ‘‘Loss per sentence’’ requires
that each sentence individually satisfy the targets.
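The difference between the two aggregation schemes can be sketched in a few lines. The example below uses the max-margin hinge max{0, |t − x| − σ} as a stand-in for the loss (the smoothed version used by ESR is not reproduced here), treats the batch statistic as a plain average of per-sentence values, and uses made-up values for a single scalar statistic. All function and variable names are illustrative assumptions.

```python
def hinge(t, sigma, x):
    """Max-margin hinge: zero inside the margin, linear outside it."""
    return max(0.0, abs(t - x) - sigma)


def loss_per_batch(t, sigma, sentence_stats):
    """Aggregate the statistic over the batch first, then apply the loss once."""
    batch_stat = sum(sentence_stats) / len(sentence_stats)
    return hinge(t, sigma, batch_stat)


def loss_per_sentence(t, sigma, sentence_stats):
    """Apply the loss to every sentence separately and sum the results."""
    return sum(hinge(t, sigma, x) for x in sentence_stats)


# Hypothetical statistic values for four sentences; target 0.30, margin 0.05.
stats = [0.10, 0.50, 0.20, 0.40]
batch_loss = loss_per_batch(0.30, 0.05, stats)        # batch mean is about 0.30, inside the margin
sentence_loss = loss_per_sentence(0.30, 0.05, stats)  # every sentence falls outside the margin
```

In this toy case the batch as a whole matches the target, so the batch-level loss is zero, while the per-sentence loss penalizes all four sentences even though their deviations cancel out.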
Results: The results are presented in Table 3.
From the table we can see that ‘‘Loss per batch’’
has an average POS of 79.9 and average LAS of
60.4, compared to ‘‘Loss per sentence’’ with an average
POS of 77.1 and LAS of 58.5, which amounts
to improvements of +2.8 POS and +1.9 LAS. This
indicates that applying the loss at the batch level
confers an advantage over applying it per sentence.
5.3.2 Smooth Hinge-Loss Ablation
Next, we evaluate the efficacy of the proposed
smoothed hinge-loss distance function $\ell$. We
compare to using just L1 or L2, uninterpolated
and with no margin parameters (σ = 0). We also
compare to ‘‘Hard L1’’, which is the max-margin
hinge $\ell(t, \sigma, x) = \max\{0, |t - x| - \sigma\}$.
We use the same experimental setup as the previous
ablation.
Results: The results are presented in Table 4.
From the table we can see that the Smooth L1
loss outperforms the other variants.

$\ell$ Variant            POS avg   LAS avg   avg
L2 (σ = 0)                78.0      58.2      68.1
L1 (σ = 0)                78.5      60.3      69.5
Hard L1 (max-margin)      78.4      59.9      69.2
Smooth L1 (ESR)           79.9      60.4      70.1

Table 4: Loss Function Ablation Results. The
Smooth L1 loss outperforms the other simpler
loss variants for both POS and LAS, averaged
over 5 languages.
6 Realistic Semi-Supervised Experiments
The previous experiments considered an unsupervised
transfer scenario without labeled data.
In these next experiments we turn to a realistic
semi-supervised application of our approach
where we have access to limited labeled data for
the target treebank.
6.1 Learning Curves
In this experiment we present learning curves for
the approaches, varying the amount of labeled
data $|D_L^{\mathrm{train}}| \in \{50, 100, 500, 1000\}$. To make
experiments realistic, we calculate the target statistics
t and margins σ from the small subsampled
labeled training datasets using Eqs. 11 and 12.
We study two distinct settings. First, we study
the multi-source domain-adaptation transfer setting,
UDPRE. Second, we study our approach in
a more standard semi-supervised scenario where
we cannot utilize intermediate on-task pretraining
and domain-adaptation, instead learning on the
target dataset starting ‘‘from scratch’’ with the
pretrained PLM (MBERT).
We use the same baselines as before, but augment
each with a supervised fine-tuning loss on
the supervised data in addition to any unsupervised
losses. We refer to these models as
UDPRE-FT, UDPRE-FT-PPT, and UDPRE-FT-ESR.
That is, models with FT in the name have some
supervised fine-tuning in the target language.
In these experiments, we subsample labeled
training data 3 times for each setting. We report
averages over all 5 languages, 3 supervised
subsample runs each, for a total of 15 runs per
method and dataset size. We also use subsampled
development sets so that model selection is more
realistic.5 For development sets we subsample
the data to a size of $|D_L^{\mathrm{dev}}| = \min(100, |D_L^{\mathrm{train}}|)$,
which reflects a 50/50 train/dev split until $|D_L| \geq 200$,
at which point we maximize training data
and only hold out 100 sentences for validation.
We use the same hyperparameters as before,
except we use 40 epochs with 200 steps per epoch
as the training schedule, mixing supervised and
unsupervised data at a rate of 1:4.
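The development-set sizing rule, a development set of min(100, training size) sentences, and the length of the training schedule reduce to simple arithmetic, sketched here for the four training sizes. The function name `dev_size` is an assumption made for this example.

```python
def dev_size(train_size: int) -> int:
    """Development-set size for a given labeled training-set size."""
    return min(100, train_size)


for n in (50, 100, 500, 1000):
    # Up to 100 training sentences the split is 50/50; beyond that,
    # only 100 sentences are held out for validation.
    print(f"train={n:4d}  dev={dev_size(n):3d}  total={n + dev_size(n):4d}")

total_steps = 40 * 200  # 40 epochs with 200 steps per epoch
```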
6.1.1 UDPRE Transfer
In this experiment, we evaluate in the multilingual
transfer scenario by initializing from UDPRE. In
addition to the two chosen realistic ESR variants,
we also experiment with an ‘‘oracle’’ version
of ESR-CLD, called ESR-CLD∗, that uses target
statistics estimated from the full training data. This
allows us to see if small-sample estimates cause
a degradation in performance compared to accu-
rate large-sample estimates.
Results: Learning curves for the different ap-
proaches, averaged over all 3 runs for all 5 lan-
guages (15 total), are given in Figure 1. From the
figure we can discern several encouraging results.
ESR-CLD and ESR-UNIARC add significant
benefit to fine-tuning for small data. Both vari-
ants significantly outperform the baselines at 50
and 100 labeled examples. For example, relative
to UDPRE-FT, the ESR-CLD model yielded gains
of +2 POS, +3.6 LAS at 50 examples and +1.8
5As is argued by Oliver et al. (2018), using a realisti-
cally sized development set is overlooked in much of the
semi-supervised literature, leading to inappropriately strong
model selection and overly optimistic results.
Figure 1: Multi-Source UDPRE Transfer Learning Curves. Baseline approaches are dotted, while ESR variants
are solid. All curves show the average of 15 runs across 5 different languages with 3 randomly sampled labeled
datasets per language. The plots indicate a significant advantage of ESR over the baselines in low-data regions.
POS, +3.2 LAS at 100 labeled examples. At 500
and 1000 examples, however, we begin to see
diminishing benefits to ESR on top of fine-tuning.
ESR-UNIARC is much more effective in conjunction
with fine-tuning. Compared to the
unsupervised experiment in Section 5.2, where
it ranked 25/32, the ESR-UNIARC statistic is
much more competitive with the more detailed
ESR-CLD statistics. One potential explanation
is that without labeled data (as in Section 5.2)
the ESR-UNIARC statistic is under-specified (the
727 allowed arcs are all free to take any value),
whereas the inclusion of some labeled data in this
experiment fills this gap by implicitly indicating
target proportions for the allowed arcs. This sug-
gests that an approach which combines UniArc
constraints with elements of self-training (like
PPT) that supervise the ‘‘free’’ non-zero com-
binations could potentially be a useful approach
to zero-shot transfer. However, we leave this to
future work.
Small-data estimates for ESR-CLD are as
good as accurate estimates. Comparing ESR-
CLD to the unrealistic ESR-CLD∗, we find no
significant difference between the two, indicating
that, at least for the CLD statistic, using target
estimates from small samples is as good as large-
sample estimates. This may be due in part to the
margin estimates σ, which are wider for the small
samples and somewhat mitigate their inaccuracies.
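One way to see why small-sample margins come out wider is a bootstrap sketch like the following. It assumes, purely for illustration, that the target t is the mean of a scalar statistic over resampled sentences and that the margin σ is the standard deviation of that mean across resamples; the estimators actually used are those of Eqs. 11 and 12 and may differ. The function name, the synthetic data, and the number of resamples are all hypothetical.

```python
import random
import statistics


def bootstrap_target_and_margin(values, num_resamples=200, seed=0):
    """Estimate (t, sigma) for a scalar statistic by resampling with replacement."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_resamples):
        resample = [rng.choice(values) for _ in values]
        means.append(sum(resample) / len(resample))
    return statistics.mean(means), statistics.stdev(means)


# Synthetic per-sentence statistic values (hypothetical, not from the paper).
data_rng = random.Random(1)
population = [data_rng.random() for _ in range(1000)]

t_small, sigma_small = bootstrap_target_and_margin(population[:50])
t_large, sigma_large = bootstrap_target_and_margin(population)
# With far fewer sentences the resampled means spread out more,
# so sigma_small is expected to exceed sigma_large.
```

Under this reading, a wider margin leaves the model more slack around a noisier target, which is one plausible mechanism for the robustness observed above.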
PPT adds little benefit to fine-tuning. Relative
to UDPRE-FT, the UDPRE-FT-PPT baseline does
not yield much gain, with a maximum average
improvement of +0.3 POS and +0.7 LAS over all
dataset sizes. This indicates that fine-tuning and
PPT-style self-training may be redundant.
6.1.2 MBERT Transfer
In this experiment, we consider a counterfactual
setting: What if the UD data was not a mas-
sively multilingual dataset where we can utilize
multilingual model-transfer, and instead was an
isolated dataset with no related data to transfer
from? This situation reflects the more standard
semi-supervised learning setting, where we are
given a new task, some labeled and unlabeled
data, and must build a model ‘‘from scratch’’ on
that data.
For this experiment, we repeat the learning
curve setting from Section 6.1.1, but initialize
our model directly with MBERT, skipping the
intermediate UDPRE training.
Results: Learning curves for the different ap-
proaches, averaged over all 3 runs for all 5 lan-
guages (15 total), are given in Figure 2. The results
from this experiment are encouraging; ESR has
even greater benefits when fine-tuning directly
from MBERT than in the previous experiment,
indicating that our general approach may be even more
useful outside of domain-adaptation conditions.
6.2 Low-Resource Transfer
In previous experiments, we limited the number of
evaluation treebanks to 5 to allow for variation in
other dimensions (i.e., constraint types, loss types,
differing amounts of labeled data). In this experiment,
we expand the number of treebanks and
evaluate transfer performance in a low-resource
setting with only $|D_L^{\mathrm{train}}| = 50$ labeled sentences in
the target treebank, comparing UDPRE, UDPRE-FT,
and UDPRE-FT-ESR-CLD. As before, we subsample
3 small datasets per treebank and calculate
Figure 2: ‘‘From Scratch’’ MBERT Transfer Learning Curves. Baseline approaches are dotted, while ESR variants
are solid. All curves show the average of 15 runs across 5 different languages with 3 randomly sampled labeled
datasets per language. The plots indicate a significant advantage of ESR over the baselines in all regions.
the target statistics t and margins σ from these
to make transfer results realistic.
We select evaluation treebanks according to
the following criteria. For each unique language
in UD v2.8 that is not one of the 13 training lan-
guages, we select the largest treebank, and keep
it if it has at least 250 train sentences and a development
set, so that we can get reasonable variability
in the subsamples. This process yields 44 diverse
evaluation treebanks.
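The selection procedure can be written as a short filter. The sketch below is illustrative only: the `Treebank` record, its field names, and the toy entries (including the sizes and the training-language set) are assumptions, not the actual UD v2.8 metadata.

```python
from dataclasses import dataclass


@dataclass
class Treebank:
    name: str
    language: str
    train_sentences: int
    has_dev: bool


def select_eval_treebanks(treebanks, training_languages, min_train=250):
    """Pick the largest treebank per unseen language, then apply the size/dev filter."""
    largest = {}
    for tb in treebanks:
        if tb.language in training_languages:
            continue  # skip languages seen during multilingual training
        best = largest.get(tb.language)
        if best is None or tb.train_sentences > best.train_sentences:
            largest[tb.language] = tb
    # Keep a language only if its largest treebank passes both checks.
    return [tb for tb in largest.values()
            if tb.train_sentences >= min_train and tb.has_dev]


# Toy metadata with made-up names and sizes.
toy = [
    Treebank("A-Big", "A", 5000, True),
    Treebank("A-Small", "A", 300, True),
    Treebank("B-Only", "B", 100, True),    # too few training sentences
    Treebank("C-Only", "C", 900, False),   # no development set
    Treebank("D-Train", "D", 8000, True),  # D is a training language
]
selected = select_eval_treebanks(toy, training_languages={"D"})
```

Note that the filter is applied after choosing the largest treebank, so a language whose largest treebank fails the checks is dropped rather than replaced by a smaller one.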
Results: The results of this experiment are given
in Table 5. From the table we can see that
our approach ESR (UDPRE-FT-ESR-CLD) outperformed
supervised fine-tuning (UDPRE-FT) in
many cases, often by a large margin. On average,
UDPRE-FT-ESR-CLD outperformed UDPRE-FT by
+2.6 POS and +2.3 LAS across the 44 languages.
Further, UDPRE-FT-ESR-CLD outperformed
zero-shot transfer, UDPRE, by +10.0 POS
and +14.7 LAS on average.
Interestingly, we found that there were sev-
eral cases of large performance gains while there
were no cases of large performance declines. For
example, ESR improved LAS scores by +17.3
for Wolof, +16.8 for Maltese, and +12.5 for
Scottish Gaelic, and 9/44 languages saw LAS
improvements ≥ +5.0, while the largest decline
was only −2.5. Additionally, ESR improved POS
scores by +20.9 for Naija, +11.2 for Welsh, and
9/44 languages saw POS improvements ≥ +5.0.
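The LAS figures quoted here can be checked directly against the Δ LAS column of Table 5. The snippet below copies the 44 Δ LAS values in table order and recomputes the summary counts; the variable names are illustrative.

```python
# Δ LAS (ESR minus FT) for the 44 treebanks, in the order listed in Table 5.
delta_las = [
    17.3, 16.8, 12.5, 9.3, 8.8, 8.1, 5.7, 5.5, 5.5, 4.1, 3.9,
    3.9, 3.6, 3.0, 2.3, 1.8, 1.5, 1.4, 1.3, 1.2, 1.1, 0.4,
    0.2, 0.1, -0.1, -0.1, -0.1, -0.1, -0.2, -0.3, -0.5, -0.6, -0.6,
    -0.7, -0.8, -0.9, -0.9, -1.0, -1.3, -1.3, -1.3, -1.5, -1.8, -2.5,
]

num_large_gains = sum(d >= 5.0 for d in delta_las)  # treebanks with gains of at least +5.0
num_declines = sum(d < 0 for d in delta_las)        # treebanks with negative Δ LAS
largest_decline = min(delta_las)                    # most negative Δ LAS
```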
The cases of performance decline for LAS
merit further analysis. Of the 20 languages with
negative Δ LAS, 18 are modern languages
spoken in continental Europe (mostly Slavic and
Romance), while only 5 of the 24 languages with
positive Δ LAS meet this criterion. We hypothesize
that this tendency is due to the training
data used for pretraining MBERT, which was
heavily skewed towards this category (Devlin
et al., 2019). This suggests that ESR is particularly
helpful in cases of transfer to domains that
are underrepresented in pretraining.
7 Related Work
Related work generally falls into two categories:
weak supervision and cross-lingual transfer.
Weak Supervision: Supervising models with
signals weaker than fully labeled data has been and
continues to be a popular topic of interest. Current
trends in weak supervision focus on generating
instance-level supervision, using weak information
such as: relations between multiple tasks
(Greenberg et al., 2018; Ratner et al., 2018;
Ben Noach and Goldberg, 2019); labeled features
(Druck et al., 2008; Ratner et al., 2016;
Karamanolakis et al., 2019a); coarse-grained labels
(Angelidis and Lapata, 2018; Karamanolakis
et al., 2019b); dictionaries and distant supervision
(Bellare and McCallum, 2007; Carlson et al.,
2009; Liu et al., 2019a; Üstün et al., 2020); or
some combination thereof (Ratner et al., 2016;
Karamanolakis et al., 2019a).
In contrast, our work is more closely related
to older work on population-level supervision.
These techniques include Constraint-Driven Learn-
ing (CODL) (Cha), posterior regularization (PR)
(Ganchev et al., 2010), the measurements frame-
work of Liang et al. (2009), and the generalized
expectation criteria (GEC) (Druck et al., 2008,
2009; Mann and McCallum, 2010).
                                               POS                          LAS
Treebank                    Family             UDPRE  FT     ESR    Δ       UDPRE  FT     ESR    Δ
Wolof-WTB                   Northern Atlantic  40.6   79.5   85.4   +5.9    12.7   55.9   73.3   +17.3
Maltese-MUDT                Semitic            35.1   82.6   91.8   +9.2    16.0   57.5   74.2   +16.8
Scottish Gaelic-ARCOSG      Celtic             45.7   66.0   75.9   +9.9    24.4   56.4   68.9   +12.5
Faroese-FarPaHC             Germanic           74.7   86.2   87.2   +1.1    43.0   71.4   80.7   +9.3
Gothic-PROIEL               Germanic           30.1   67.6   71.7   +4.1    12.6   45.8   54.6   +8.8
Welsh-CCG                   Celtic             71.9   74.7   85.8   +11.2   54.8   69.4   77.6   +8.1
Western Armenian-ArmTDP     Armenian           80.6   84.9   87.1   +2.2    60.4   67.0   72.7   +5.7
Telugu-MTG                  Dravidian          82.0   81.6   81.6   0.0     70.9   74.6   80.1   +5.5
Vietnamese-VTB              Viet-Muong         67.0   85.6   88.5   +2.9    46.3   55.3   60.8   +5.5
Turkish German-SAGT         Code Switch        76.8   84.4   85.8   +1.4    48.0   58.0   62.1   +4.1
Afrikaans-AfriBooms         Germanic           90.7   88.0   91.3   +3.3    62.0   79.4   83.4   +3.9
Hungarian-Szeged            Ugric              87.9   79.9   89.7   +9.7    74.0   77.8   81.7   +3.9
Galician-CTG                Romance            91.8   89.0   91.2   +2.2    60.5   74.3   77.8   +3.6
Marathi-UFAL                Marathi            71.4   81.1   82.3   +1.1    44.9   59.5   62.5   +3.0
Naija-NSC                   Creole             46.5   68.0   88.9   +20.9   27.9   71.1   73.4   +2.3
Greek-GDT                   Greek              87.1   92.8   92.5   −0.3    78.7   86.3   88.0   +1.8
Tamil-TTB                   Dravidian          72.3   72.4   79.6   +7.2    46.7   64.9   66.4   +1.5
Indonesian-GSD              Austronesian       82.3   89.8   90.2   +0.5    58.3   72.9   74.3   +1.4
Uyghur-UDT                  Turkic             23.7   59.8   65.5   +5.6    14.0   38.0   39.2   +1.3
Old French-SRCMF            Romance            65.3   74.2   76.2   +2.0    44.0   56.7   57.8   +1.2
Old Church Slavonic-PROIEL  Slavic             37.3   54.7   61.0   +6.3    19.2   39.0   40.1   +1.1
Portuguese-GSD              Romance            92.1   89.6   92.8   +3.3    74.4   84.1   84.5   +0.4
Danish-DDT                  Germanic           92.0   92.7   92.1   −0.6    71.0   75.5   75.7   +0.2
Armenian-ArmTDP             Armenian           84.7   88.1   88.0   −0.1    64.1   69.0   69.2   +0.1
Spanish-AnCora              Romance            94.5   95.2   95.4   +0.2    77.8   83.0   82.9   −0.1
Catalan-AnCora              Romance            92.9   94.4   94.6   +0.3    75.8   82.5   82.4   −0.1
Serbian-SET                 Slavic             91.2   90.7   93.1   +2.4    81.6   86.5   86.4   −0.1
Slovak-SNK                  Slavic             91.5   91.5   92.0   +0.5    81.6   84.0   83.9   −0.1
Romanian-Nonstandard        Romance            79.2   83.3   85.0   +1.7    54.5   63.6   63.4   −0.2
Polish-PDB                  Slavic             89.7   90.4   90.9   +0.5    76.0   79.7   79.4   −0.3
German-HDT                  Germanic           89.6   94.4   94.2   −0.2    83.0   88.2   87.7   −0.5
Lithuanian-ALKSNIS          Baltic             87.0   87.4   87.4   0.0     65.4   69.2   68.6   −0.6
Latin-ITTB                  Italic             73.8   80.9   81.7   +0.8    51.7   64.3   63.7   −0.6
Bulgarian-BTB               Slavic             91.9   94.7   94.6   −0.1    78.0   84.4   83.7   −0.7
Czech-PDT                   Slavic             90.6   92.1   92.7   +0.6    78.1   81.9   81.1   −0.8
Persian-PerDT               Iranian            79.1   91.0   90.8   −0.2    48.4   74.6   73.7   −0.9
Slovenian-SSJ               Slavic             89.2   90.9   91.2   +0.3    79.6   84.5   83.5   −0.9
Croatian-SET                Slavic             91.4   91.7   92.1   +0.4    80.0   84.1   83.1   −1.0
Urdu-UDTB                   Indic              86.9   90.0   88.2   −1.8    68.7   75.7   74.4   −1.3
Ukrainian-IU                Slavic             91.5   92.0   92.4   +0.3    79.6   81.2   80.0   −1.3
Dutch-Alpino                Germanic           90.0   90.6   90.6   0.0     78.9   81.6   80.3   −1.3
Norwegian-Bokmaal           Germanic           91.7   91.8   92.1   +0.3    80.8   82.5   81.0   −1.5
Belarusian-HSE              Slavic             91.5   91.6   91.9   +0.3    78.9   79.8   78.1   −1.8
Estonian-EDT                Finnic             89.1   89.6   89.2   −0.4    70.4   71.4   68.9   −2.5
Average                                        77.3   84.7   87.3   +2.6    59.0   71.4   73.7   +2.3

Table 5: Low-Resource Semi-Supervised Transfer Results. Transfer results for 44 unseen test languages
using 50 labeled sentences in the target language, averaged over 3 subsampled datasets. ‘‘FT’’ refers
to the UDPRE-FT fine-tuning baseline, ‘‘ESR’’ refers to our UDPRE-FT-ESR-CLD approach, and Δ refers
to the absolute difference of ESR minus FT. Best performing methods are bolded. Results are ordered
from best to worst Δ LAS.
Our work can be seen as an extension of GEC
to more expressive expectations and to modern
mini-batch SGD training. There are two more
recent works that touch on these ideas, but both
have significant downsides compared to our ap-
proach. Meng et al. (2019) use a PR approach
inspired by Ganchev and Das (2013) for cross-
lingual parsing, but must use very simple con-
straints and require a slow inference procedure
that can only be used at test time. Ben Noach
and Goldberg (2019) utilize GEC with mini-
batch training, but focus on using related tasks for
computing simpler constraints and do not adapt
their targets to small batch sizes.
Cross-Lingual Transfer: Earlier trends in
cross-lingual transfer for parsing used delexicalization
(Zeman and Resnik, 2008; McDonald et al.,
2011; Täckström et al., 2013) and then aligned
multilingual word vector-based approaches (Guo
et al., 2015; Ammar et al., 2016; Rasooli and
Collins, 2017; Ahmad et al., 2019). With the rapid
rise of language-model pretraining (Peters et al.,
2018; Devlin et al., 2019; Liu et al., 2019b), recent
research has focused on multilingual PLMs and
multi-task fine-tuning to achieve generalization
in transfer. Wu and Dredze (2019) showed that
a multilingual PLM afforded surprisingly effective
cross-lingual transfer using only English as
the fine-tuning language. Kondratyuk and Straka
(2019) extended this approach by fine-tuning a
PLM on the concatenation of all treebanks. Tran
and Bisazza (2019), however, show that transfer
to distant languages benefits less.
Other recent successes have been found with
linguistic side-information (Meng et al., 2019;
Üstün et al., 2020), careful methodology for
source-treebank selection (Tiedemann and Agic,
2016; Tran and Bisazza, 2019; Lin et al., 2019;
Glavaš and Vulić, 2021), self-training (Kurniawan
et al., 2021), and paired bilingual text for annotation
projection (Rasooli and Tetreault, 2015;
Rasooli and Collins, 2019; Liu et al., 2020; Shi
et al., 2022).
8 Conclusion
We have presented Expected Statistic Regulariza-
tion, a general approach to weak supervision for
structured prediction, and studied it in the con-
text of modern cross-lingual multi-task syntactic
parsing. We evaluated a wide range of expressive
structural statistics in idealized and realistic trans-
fer scenarios and have shown that the proposed
approach is effective and complementary to the
state-of-the-art model-transfer approaches.
Acknowledgments
We would like to thank Chris Kedzie, Giannis
Karamanolakis, and the reviewers for helpful con-
versations and feedback.
References
Wasi Uddin Ahmad, Zhisong Zhang, Xuezhe
Ma, Kai-Wei Chang, and Nanyun Peng. 2019.
Cross-lingual dependency parsing with unla-
beled auxiliary languages. In Proceedings of
the 23rd Conference on Computational Natural
Language Learning (CoNLL), pages 372–382,
Hong Kong, China. Association for Computational
Linguistics. https://doi.org/10.18653/v1/K19-1035
Waleed Ammar, George Mulcaire, Miguel
Ballesteros, Chris Dyer, and Noah A. Smith.
2016. Many languages, one parser. Transactions
of the Association for Computational
Linguistics, 4:431–444. https://doi.org/10.1162/tacl_a_00109
Stefanos Angelidis and Mirella Lapata. 2018.
Multiple instance learning networks for fine-
grained sentiment analysis. Transactions of
the Association for Computational Linguistics,
6:17–31. https://doi.org/10.1162/tacl_a_00002
Kedar Bellare and Andrew McCallum. 2007.
Learning extractors from unlabeled text us-
ing relevant databases. In Sixth international
workshop on information integration on the
web (AAAI).
Matan Ben Noach and Yoav Goldberg. 2019.
Transfer learning between related tasks us-
ing expected label proportions. In Proceedings
of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the
9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP),
pages 31–42, Hong Kong, China. Association
for Computational Linguistics.
Andrew Carlson, Scott Gaffney, and Flavian
Vasile. 2009. Learning a named entity tagger
from gazetteers with the partial perceptron. In
AAAI Spring Symposium: Learning by Reading
and Learning to Read, pages 7–13.
Xavier Carreras. 2007. Experiments with a
higher-order projective dependency parser. In
Proceedings of the 2007 Joint Conference on
Empirical Methods in Natural Language Pro-
cessing and Computational Natural Language
Learning (EMNLP-CoNLL), pages 957–961,
Prague, Czech Republic. Association for Com-
putational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for Com-
putational Linguistics.
Timothy Dozat and Christopher D. Manning.
2017. Deep biaffine attention for neural depen-
dency parsing. In 5th International Confer-
ence on Learning Representations, ICLR 2017,
Toulon, France, April 24–26, 2017, Conference
Track Proceedings. OpenReview.net.
Gregory Druck, Gideon Mann, and Andrew
McCallum. 2008. Learning from labeled fea-
tures using generalized expectation criteria. In
Proceedings of the 31st Annual International
ACM SIGIR Conference on Research and De-
velopment in Information Retrieval, SIGIR ’08,
pages 595–602, New York, NY, USA. Association
for Computing Machinery.
https://doi.org/10.1145/1390334.1390436
Gregory Druck, Gideon Mann, and Andrew
McCallum. 2009. Semi-supervised learning of
dependency parsers using generalized expecta-
tion criteria. In Proceedings of the Joint Con-
ference of the 47th Annual Meeting of the ACL
and the 4th International Joint Conference on
Natural Language Processing of the AFNLP,
pages 360–368, Suntec, Singapore. Association
for Computational Linguistics.
Bradley Efron. 1979. Bootstrap methods: An-
other look at the jackknife. Annals of Statistics,
7:1–26. https://doi.org/10.1214/aos/1176344552
Kuzman Ganchev and Dipanjan Das. 2013. Cross-
lingual discriminative learning of sequence
models with posterior regularization. In Pro-
ceedings of the 2013 Conference on Empirical
Methods in Natural Language Processing,
pages 1996–2006, Seattle, Washington, USA.
Association for Computational Linguistics.
Kuzman Ganchev, João Graça, Jennifer
Gillenwater, and Ben Taskar. 2010. Posterior
regularization for structured latent variable
models. Journal of Machine Learning
Research, 11:2001–2049.
Ross Girshick. 2015. Fast R-CNN. In Proceedings
of the IEEE International Conference on
Computer Vision, pages 1440–1448.
Goran Glavaš and Ivan Vulić. 2021. Climbing the
tower of treebanks: Improving low-resource
dependency parsing via hierarchical source selection.
In Findings of the Association for
Computational Linguistics: ACL-IJCNLP 2021,
pages 4878–4888, Online. Association for
Computational Linguistics.
https://doi.org/10.18653/v1/2021.findings-acl.431
Yves Grandvalet and Yoshua Bengio. 2004.
Semi-supervised learning by entropy mini-
mization. In Advances in Neural Information
Processing Systems, volume 17. MIT Press.
Nathan Greenberg, Trapit Bansal, Patrick Verga,
and Andrew McCallum. 2018. Marginal likeli-
hood training of BiLSTM-CRF for biomedical
named entity recognition from disjoint label
sets. In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Pro-
cessing, pages 2824–2829, Brussels, Belgium.
Association for Computational Linguistics.
Jiang Guo, Wanxiang Che, David Yarowsky,
Haifeng Wang, and Ting Liu. 2015. Cross-lingual
dependency parsing based on distributed
representations. In Proceedings of
the 53rd Annual Meeting of the Association
for Computational Linguistics and the 7th
International Joint Conference on Natural Language
Processing (Volume 1: Long Papers),
pages 1234–1244, Beijing, China. Association
for Computational Linguistics.
https://doi.org/10.3115/v1/P15-1119
Junxian He, Zhisong Zhang, Taylor Berg-
Kirkpatrick, and Graham Neubig. 2019. Cross-
lingual syntactic transfer through unsupervised
adaptation of invertible projections. In Pro-
ceedings of the 57th Annual Meeting of the
Association for Computational Linguistics,
pages 3211–3223, Florence, Italy. Association
for Computational Linguistics.
Giannis Karamanolakis, Daniel Hsu, and Luis
Gravano. 2019a. Leveraging just a few key-
words for fine-grained aspect detection through
weakly supervised co-training. In Proceedings
of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the
9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP),
pages 4611–4621, Hong Kong, China. Association
for Computational Linguistics.
https://doi.org/10.18653/v1/D19-1468
Giannis Karamanolakis, Daniel Hsu, and Luis
Gravano. 2019b. Weakly supervised attention
networks for fine-grained opinion mining and
public health. In Proceedings of the 5th Work-
shop on Noisy User-generated Text (W-NUT
2019), pages 1–10, Hong Kong, China. Asso-
ciation for Computational Linguistics.
Dan Kondratyuk and Milan Straka. 2019. 75 lan-
guages, 1 model: Parsing Universal Dependen-
cies universally. In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 2779–2795,
Hong Kong, China. Association for Computational
Linguistics. https://doi.org/10.18653/v1/D19-1279
Artur Kulmizev, Miryam de Lhoneux, Johannes
Gontrum, Elena Fano, and Joakim Nivre.
2019. Deep contextualized word embeddings in
transition-based and graph-based dependency
parsing - a tale of two parsers revisited. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 2755–2768, Hong Kong, China. Associ-
ation for Computational Linguistics.
Kemal Kurniawan, Lea Frermann, Philip Schulz,
and Trevor Cohn. 2021. PPT: Parsimonious
parser transfer for unsupervised cross-lingual
adaptation. In Proceedings of the 16th Con-
ference of the European Chapter of the Asso-
ciation for Computational Linguistics: Main
Volume, pages 2907–2918, Online. Association
for Computational Linguistics.
https://doi.org/10.18653/v1/2021.eacl-main.254
Percy Liang, Michael I. Jordan, and Dan Klein.
2009. Learning from measurements in expo-
nential families. In Proceedings of the 26th
Annual International Conference on Machine
Learning, ICML ’09, pages 641–648, New
York, NY, USA. Association for Computing
Machinery. https://doi.org/10.1145/1553374.1553457
Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui
Li, Yuyan Zhang, Mengzhou Xia, Shruti
Rijhwani, Junxian He, Zhisong Zhang, Xuezhe
Ma, Antonios Anastasopoulos, Patrick Littell,
and Graham Neubig. 2019. Choosing transfer
languages for cross-lingual learning. In
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 3125–3135, Florence, Italy. Association
for Computational Linguistics.
Lu Liu, Yi Zhou, Jianhan Xu, Xiaoqing
Zheng, Kai-Wei Chang, and Xuanjing Huang.
2020. Cross-lingual dependency parsing by
POS-guided word reordering. In Findings of
the Association for Computational Linguistics:
EMNLP 2020, pages 2938–2948, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.findings-emnlp.265
Tianyu Liu, Jin-Ge Yao, and Chin-Yew Lin.
2019a. Towards improving neural named
entity recognition with gazetteers. In Proceedings
of the 57th Annual Meeting of the
Association for Computational Linguistics,
pages 5301–5307, Florence, Italy. Association
for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019b. RoBERTa: A robustly opti-
mized BERT pretraining approach. arXiv pre-
print arXiv:1907.11692v1.
Ilya Loshchilov and Frank Hutter. 2019. De-
coupled weight decay regularization. In 7th
International Conference on Learning Repre-
sentations, ICLR 2019, New Orleans, LA, USA,
May 6-9, 2019. OpenReview.net.
Gideon S. Mann and Andrew McCallum.
2010. Generalized expectation criteria for
semi-supervised learning with weakly labeled
data. Journal of Machine Learning Research,
11:955–984.
Ryan McDonald and Fernando Pereira. 2006.
Online learning of approximate dependency
parsing algorithms. In 11th Conference of
the European Chapter of the Association
for Computational Linguistics, pages 81–88,
Trento, Italy. Association for Computational
Linguistics.
Ryan McDonald, Slav Petrov, and Keith Hall.
2011. Multi-source transfer of delexicalized
dependency parsers. In Proceedings of the
2011 Conference on Empirical Methods in
Natural Language Processing, pages 62–72,
Edinburgh, Scotland, UK. Association for
Computational Linguistics.
Tao Meng, Nanyun Peng, and Kai-Wei Chang.
2019. Target language-aware constrained infer-
ence for cross-lingual dependency parsing. In
Proceedings of the 2019 Conference on Empiri-
cal Methods in Natural Language Processing
and the 9th International Joint Conference
on Natural Language Processing (EMNLP-
IJCNLP), pages 1117–1128, Hong Kong, China.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/D19-1103
Joakim Nivre. 2020. Multilingual dependency
parsing from universal dependencies to sesame
street. In Text, Speech, and Dialogue - 23rd
International Conference, TSD 2020, Brno,
Czech Republic, September 8–11, 2020, Pro-
ceedings, volume 12284 of Lecture Notes in
Computer Science, pages 11–29. Springer.
Avital Oliver, Augustus Odena, Colin Raffel,
Ekin Dogus Cubuk, and Ian J. Goodfellow.
2018. Realistic evaluation of deep semi-supervised
learning algorithms. In Advances in Neural
Information Processing Systems 31: Annual
Conference on Neural Information Processing
Systems 2018, NeurIPS 2018, December 3–8,
2018, Montréal, Canada, pages 3239–3250.
Max B. Paulus, Dami Choi, Daniel Tarlow,
Andreas Krause, and Chris J. Maddison. 2020.
Gradient estimation with stochastic softmax
tricks. In Advances in Neural Information Pro-
cessing Systems 33: Annual Conference on
Neural Information Processing Systems 2020,
NeurIPS 2020, December 6–12, 2020, virtual.
Matthew E. Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextu-
alized word representations. In Proceedings of
the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 2227–2237,
New Orleans, Louisiana. Association for
Computational Linguistics.
https://doi.org/10.18653/v1/N18-1202
Mohammad Sadegh Rasooli and Michael Collins.
2017. Cross-lingual syntactic transfer with lim-
ited resources. Transactions of the Association
for Computational Linguistics, 5:279–293.
Mohammad Sadegh Rasooli and Michael Collins.
2019. Low-resource syntactic transfer with
unsupervised source reordering. In Proceedings
of the 2019 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers),
pages 3845–3856, Minneapolis, Minnesota.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/N19-1385
Mohammad Sadegh Rasooli and Joel R. Tetreault.
2015. Yara parser: A fast and accurate de-
pendency parser. ArXiv, abs/1503.06733.
Alexander Ratner, Braden Hancock, Jared
Dunnmon, Roger E. Goldman, and Christopher
Ré. 2018. Snorkel metal: Weak supervision for
multi-task learning. In Proceedings of the Second
Workshop on Data Management for End-To-End
Machine Learning, DEEM@SIGMOD
2018, Houston, TX, USA, June 15, 2018,
pages 3:1–3:4. ACM.
https://doi.org/10.1145/3209889.3209898
Alexander J. Ratner, Christopher De Sa, Sen
Wu, Daniel Selsam, and Christopher Ré. 2016.
Data programming: Creating large training
sets, quickly. In Advances in Neural Information
Processing Systems 29: Annual Conference
on Neural Information Processing Systems
2016, December 5–10, 2016, Barcelona, Spain,
pages 3567–3575.
Freda Shi, Kevin Gimpel, and Karen Livescu.
2022. Substructure distribution projection for
zero-shot cross-lingual dependency parsing. In
Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 6547–6563,
Dublin, Ireland. Association for Computational
Linguistics.
https://doi.org/10.18653/v1/2022.acl-long.452
Oscar Täckström, Ryan McDonald, and Joakim
Nivre. 2013. Target language adaptation of
discriminative transfer parsers. In Proceedings
of the 2013 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 1061–1071, Atlanta, Georgia. Associa-
tion for Computational Linguistics.
Jörg Tiedemann and Željko Agić. 2016. Synthetic
treebanking for cross-lingual dependency
parsing. Journal of Artificial Intelligence Research,
55:209–248.
https://doi.org/10.1613/jair.4785
Ke Tran and Arianna Bisazza. 2019. Zero-shot
dependency parsing with pre-trained multilin-
gual sentence representations. In Proceedings
of the 2nd Workshop on Deep Learning Ap-
proaches for Low-Resource NLP (DeepLo
2019), pages 281–288, Hong Kong, China.
Association for Computational Linguistics.
Ahmet Üstün, Arianna Bisazza, Gosse Bouma,
and Gertjan van Noord. 2020. Udapter:
Language adaptation for truly universal dependency
parsing. In Proceedings of the 2020
Conference on Empirical Methods in Natural
Language Processing, EMNLP 2020, Online,
November 16–20, 2020, pages 2302–2315.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.emnlp-main.180
Shijie Wu and Mark Dredze. 2019. Beto, bentz,
becas: The surprising cross-lingual effective-
ness of BERT. In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th Interna-
tional Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 833–844,
Hong Kong, China. Association for Computa-
tional Linguistics.
Daniel Zeman and Philip Resnik. 2008. Cross-
language parser adaptation between related
languages. In Proceedings of the IJCNLP-08
Workshop on NLP for Less Privileged
Languages.