Improving Low-Resource Cross-lingual Parsing
with Expected Statistic Regularization

Thomas Effland
Columbia University, USA
teffland@cs.columbia.edu

Michael Collins
Google Research, USA
mjcollins@google.com

Abstract

We present Expected Statistic Regulariza-
tion (ESR), a novel regularization technique
that utilizes low-order multi-task structural sta-
tistics to shape model distributions for semi-
supervised learning on low-resource datasets.
We study ESR in the context of cross-lingual
transfer for syntactic analysis (POS tagging
and labeled dependency parsing) and present
several classes of low-order statistic functions
that bear on model behavior. Experimentally,
we evaluate the proposed statistics with ESR
for unsupervised transfer on 5 diverse target
languages and show that all statistics, when
estimated accurately, yield improvements to
both POS and LAS, with the best statistic
improving POS by +7.0 and LAS by +8.5
on average. We also present semi-supervised
transfer and learning curve experiments that
show ESR provides significant gains over
strong cross-lingual-transfer-plus-fine-tuning
baselines for modest amounts of labeled data.
These results indicate that ESR is a promis-
ing and complementary approach to model-
transfer approaches for cross-lingual parsing.1

1 Introduction

In recent years, great strides have been made
on linguistic analysis for low-resource languages.
These gains are largely attributable to transfer
approaches from (1) massive pretrained multilin-
gual language model (PLM) encoders (Devlin
et al., 2019; Liu et al., 2019b); (2) multi-task
training across related syntactic analysis tasks
(Kondratyuk and Straka, 2019); and (3) multilin-
gual training on diverse high-resource languages
(Wu and Dredze, 2019; Ahmad et al., 2019;
Kondratyuk and Straka, 2019). Combined, these
approaches have been shown to be particularly
effective for cross-lingual syntactic analysis, as
shown by UDify (Kondratyuk and Straka, 2019).

1 We have published our implementation and experiments at
https://github.com/teffland/expected-statistic-regularization.

However, even with the improvements brought
about by these techniques, transferred models still
make syntactically implausible predictions on
low-resource languages, and these error rates in-
crease dramatically as the target languages be-
come more distant from the source languages
(He et al., 2019; Meng et al., 2019). In par-
ticular, transferred models often fail to match
many low-order statistics concerning the typol-
ogy of the task structures. We hypothesize that
enforcing regularity with respect to estimates of
these structural statistics—effectively using them
as weak supervision—is complementary to cur-
rent transfer approaches for low-resource cross-
lingual parsing.

To this end, we introduce Expected Statistic
Regularization (ESR), a novel differentiable loss
that regularizes models on unlabeled target data-
sets by minimizing deviation of descriptive statis-
tics of model behavior from target values. The
class of descriptive statistics usable by ESR is
expressive and powerful. For example, they may
describe cross-task interactions, encouraging the
model to obey structural patterns that are not
explicitly tractable in the model factorization.
Additionally, the statistics may be derived from
constraints dictated by the task formalism itself
(such as ruling out invalid substructures) or by
numerical parameters that are specific to the
target dataset distribution (such as relative sub-
structure frequencies). In the latter case, we also
contribute a method for selecting those parame-
ters using small amounts of labeled data, based
on the bootstrap (Efron, 1979).

Although ESR is applicable to a variety of
problems, we study it using modern cross-lingual
syntactic analysis on the Universal Dependencies
data, building off of the strong model-transfer
framework of UDify (Kondratyuk and Straka,
2019). We show that ESR is complementary to
transfer-based approaches for building parsers on
low-resource languages. We present several inter-
esting classes of statistics for the tasks and perform

Transactions of the Association for Computational Linguistics, vol. 11, pp. 122–138, 2023. https://doi.org/10.1162/tacl_a_00537
Action Editor: Alexander Clark. Submission batch: 4/2022; Revision batch: 7/2022; Published 1/2023.
© 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

extensive experiments in both oracle unsuper-
vised and realistic semi-supervised cross-lingual
multi-task parsing scenarios, with particularly en-
couraging results that significantly outperform
state-of-the-art approaches for semi-supervised
scenarios. We also present ablations that justify
key design choices.

2 Expected Statistic Regularization

We consider structured prediction in an abstract
setting where we have inputs x ∈ X , output struc-
tures y ∈ Y, and a conditional model pθ(y|x) ∈ P
with parameters θ ∈ Θ, where P is the distribution
space and Θ is the parameter space. In this sec-
tion we assume that the setting is semi-supervised,
with a small labeled dataset DL and a large un-
labeled dataset DU ; let $D_L = \{(x_i, y_i)\}_{i=1}^{m}$ be
the labeled dataset of size m and similarly define
$D_U = \{x_i\}_{i=m+1}^{m+n}$ as the unlabeled dataset.
Our approach centers around a vectorized sta-
tistic function f that maps unlabeled samples and
models to real vectors of dimension df :

$$f : D \times P \to \mathbb{R}^{d_f} \tag{1}$$

where D is the set of unlabeled datasets of any
size (i.e., DU ∈ D). The purpose of f is to sum-
marize various properties of the model using the
sample. For example, if the task is part-of-speech
tagging, one possible component of f could be
the expected proportion of NOUN tags in the unla-
beled data DU . In addition to f , we assume that
we are given vectors of target statistics $t \in \mathbb{R}^{d_f}$ and
margins of uncertainty $\sigma \in \mathbb{R}^{d_f}$ as its supervision
signal. We will discuss the details of f , t, and σ
shortly but first describe the overall objective.

2.1 Semi-Supervised Objective
Given labeled and unlabeled data DL and DU , we
propose the following semi-supervised objective
O, which breaks down into a sum of supervised
and unsupervised terms L and C:

$$\hat{\theta} = \arg\min_{\theta \in \Theta} O(\theta; D_L, D_U) \tag{2}$$

$$O(\theta; D_L, D_U) = L(\theta; D_L) + \alpha\, C(\theta; D_U) \tag{3}$$

where α > 0 is a balancing coefficient. The
supervised objective L can be any suitable super-
vised loss; here we will use the negative log-
likelihood of the data under the model. Our
contribution is the unsupervised objective C.

For C, we propose to minimize some dis-
tance function ℓ between the target statistics t
and the value of the statistics f calculated using
unlabeled data and the model pθ. (ℓ will also take
into account the uncertainty margins σ.) A sim-
ple objective would be:

$$C(\theta; D_U) = \ell(t, \sigma, f(D_U, p_\theta))$$

This is a dataset-level loss penalizing divergences
from the target level statistics. The problem with
this approach is that it is not amenable to mod-
ern hardware constraints requiring SGD. Instead,
we propose to optimize this loss in expectation
over unlabeled mini-batch samples $D_U^k$, where
k is the mini-batch size and $D_U^k$ is sampled uni-
formly with replacement from DU . Then, C is
given by:

$$C(\theta; D_U) = \mathbb{E}_{D_U^k}\left[\ell(t, \sigma, f(D_U^k, p_\theta))\right] \tag{4}$$

This objective penalizes the model if the sta-
tistic f , when applied to samples of unlabeled
data $D_U^k$, deviates from the targets t and thus
pushes the model toward satisfying these target
statistics.

Importantly, the objective in Eq. 4 is more
general than typical objectives in that the outer
loss function ℓ does not necessarily break down
into a sum over individual input examples; the
aggregation over examples is done inside f :

$$\ell(t, \sigma, f(D_U, p_\theta)) \neq \sum_{x \in D_U} \ell(t, \sigma, f(x, p_\theta)) \tag{5}$$

This generality is useful because components of
f may describe statistics that aggregate over in-
puts, estimating expected quantities concerning
sample-level regularities of the structures. In con-
trast, the right-hand side of Eq. 5 is more stringent,
imposing that the statistic be the same for all in-
stances of x. In practice, this loss reduces noise
compared to a per-sentence loss, as is shown in
Section 5.3.1.
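
The difference between the two sides of Eq. 5 can be seen on a toy example. The sketch below is illustrative only: it uses plain absolute deviation as a stand-in for the distance function (not the smoothed loss the paper defines in Section 2.3), a single hypothetical statistic (the expected share of NOUN tags), and made-up model marginals and target.

```python
def abs_deviation(f_bar, target):
    """Stand-in distance: plain absolute deviation from the target."""
    return abs(f_bar - target)


def noun_share(sentences):
    """Statistic f: expected share of NOUN tags, pooled over every token given.

    Each sentence is a list of per-token model probabilities p(tag = NOUN | x).
    """
    expected_nouns = sum(sum(sentence) for sentence in sentences)
    n_tokens = sum(len(sentence) for sentence in sentences)
    return expected_nouns / n_tokens


# Hypothetical model marginals for a mini-batch of three sentences.
batch = [[0.9, 0.1, 0.1, 0.1], [0.5, 0.5], [0.0, 0.1, 0.2]]
target = 0.28  # hypothetical target NOUN share

# Left-hand side of Eq. 5: aggregate inside f, then apply the distance once.
loss_per_batch = abs_deviation(noun_share(batch), target)

# Right-hand side of Eq. 5: apply the distance to every sentence separately.
loss_per_sentence = sum(abs_deviation(noun_share([s]), target) for s in batch)

print(f"per batch: {loss_per_batch:.4f}  per sentence: {loss_per_sentence:.4f}")
```

With these numbers the pooled NOUN share is 2.5/9 (about 0.278), so the batch-level loss is close to zero, while the per-sentence deviations (0.02, 0.22, and 0.18) sum to 0.42 even though the batch as a whole nearly matches the target.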

2.2 The Statistic Function f

In principle the vectorized statistic function f
could be almost any function of the unlabeled data
and model, provided it is possible to obtain its
gradients with respect to the model parameters θ.
However, in this work we will assume f has the
following three-layer structure.

First, let g be another vectorized function of
‘‘sub-statistics’’ that may have a different dimen-
sionality than f and takes individual x, y pairs
as input:

$$g : X \times Y \to \mathbb{R}^{d_g} \tag{6}$$

We choose this function because it is robust to
outliers, adapts its width to the margin parameter
σi, and expresses a preference for fi = ti (as
opposed to max-margin losses). We give an abla-
tion study in Section 5.3.2 justifying its use.

Then let ¯g be the expected value of g under the
model pθ summed over the sample DU :

$$\bar{g} = \sum_{x \in D_U} \mathbb{E}_{p_\theta(y|x)}\left[g(x, y)\right] \tag{7}$$

Given ¯g, let the j’th component of f be the result
of an aggregating function $h_j : \mathbb{R}^{d_g} \to \mathbb{R}$ on ¯g:

$$f_j(D_U, p_\theta) = h_j(\bar{g}) \tag{8}$$

The individual components gi will mostly be
counting functions that tally various substructures
in the data. The ¯gi’s then are expected substruc-
ture counts in the sample, and the hj’s aggregate
small subsets of these intermediate counts in
different ways to compute various marginal prob-
abilities. Again, in general f does not need to
follow this structure and any suitable statistic func-
tion can be incorporated into the regularization
term proposed in Eq. 4.
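
As a concrete reading of this three-layer structure, the sketch below instantiates it for POS unigram proportions. It is a minimal illustration, assuming a tagger whose distribution factorizes over tokens so that expected tag counts are just sums of per-token marginals; the tag set, function names, and probabilities are all hypothetical.

```python
TAGS = ["NOUN", "VERB", "DET"]  # hypothetical, tiny tag set


def expected_tag_counts(marginals):
    """E_{p(y|x)}[g(x, y)] for one sentence, where g counts tags.

    By linearity of expectation this is the sum of per-token marginals.
    """
    counts = {tag: 0.0 for tag in TAGS}
    for token_probs in marginals:
        for tag in TAGS:
            counts[tag] += token_probs[tag]
    return counts


def g_bar(batch):
    """Eq. 7: expected sub-statistic counts summed over the sample."""
    total = {tag: 0.0 for tag in TAGS}
    for marginals in batch:
        for tag, count in expected_tag_counts(marginals).items():
            total[tag] += count
    return total


def h(g_bar_vec, tag):
    """Eq. 8: one aggregator h_j, here a ratio of expected counts."""
    return g_bar_vec[tag] / sum(g_bar_vec.values())


# Hypothetical per-token tag marginals for a sample of two sentences.
batch = [
    [{"NOUN": 0.7, "VERB": 0.2, "DET": 0.1}, {"NOUN": 0.1, "VERB": 0.8, "DET": 0.1}],
    [{"NOUN": 0.2, "VERB": 0.1, "DET": 0.7}],
]
counts = g_bar(batch)
f = {tag: h(counts, tag) for tag in TAGS}
print(f)
```

Here the expected NOUN count is 0.7 + 0.1 + 0.2 = 1.0 out of 3 tokens, so the NOUN component of f is one third. Every step is a sum or a ratio of model probabilities, which is what keeps the statistic differentiable with respect to the model parameters in an autodiff framework.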

In some cases, when the structure of g does not
follow the model factorization either additively or
multiplicatively, computation of the model ex-
pectation $\mathbb{E}_{p_\theta(y|x)}[g(x, y)]$ in Eq. 7 is intractable.
In these situations, standard Monte Carlo approx-
imation breaks differentiability of the objective
with respect to the model parameters θ and can-
not be used. To remedy this, we propose to
use the ‘‘Stochastic Softmax’’ differentiable sam-
pling approximation from Paulus et al. (2020) to
allow optimization of these functions. We pro-
pose several such statistics in the application (see
Section 4.3).
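
To give a feel for what a differentiable sampling approximation looks like, the sketch below shows the simplest categorical case: perturb the logits with Gumbel noise and pass them through a temperature softmax. This is only an illustration of the general idea; it is not the structured (tree-level) relaxation that the paper takes from Paulus et al. (2020), and the function name, logits, and temperature are hypothetical.

```python
import math
import random


def relaxed_categorical_sample(logits, temperature, rng):
    """Return a soft one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    perturbed = []
    for logit in logits:
        u = max(rng.random(), 1e-12)      # guard against log(0)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        perturbed.append((logit + gumbel) / temperature)
    top = max(perturbed)                  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in perturbed]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
sample = relaxed_categorical_sample([2.0, 0.5, -1.0], temperature=0.5, rng=rng)
print(sample)
```

The output is a probability vector rather than a hard choice, and lower temperatures push it closer to one-hot. Because the randomness enters only through the additive noise, the same computation written in an autodiff framework passes gradients back to the logits, which a hard Monte Carlo sample would not.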

2.3 The Distance Function ℓ

For the distance function ℓ, we propose to use a smoothed hinge loss (Girshick, 2015) that adapts with the margins σ. Letting $\bar{f} = f(D_U^k, p_\theta)$, the i’th component of ℓ is given by:

$$\ell_i = \begin{cases} \dfrac{(\bar{f}_i - t_i)^2}{2\sigma_i} & \text{if } |\bar{f}_i - t_i| < \sigma_i \\[1ex] |\bar{f}_i - t_i| - \sigma_i & \text{else} \end{cases} \tag{9}$$

The total loss ℓ is then the sum of its components:

$$\ell(t, \sigma, f(D_U^k, p_\theta)) = \sum_i \ell_i(t_i, \sigma_i, \bar{f}_i) \tag{10}$$

3 Choosing the Targets and Margins

There are several possible approaches to choosing the targets t and margins σ, and in general they can differ based on the individual statistics. For some statistics it may be possible to specify the targets and margins using prior knowledge or formal constraints from the task. In other cases, estimating the targets and margins may be more difficult. Depending on the problem context, one may be able to estimate them from related tasks or domains (such as neighboring languages for cross-lingual parsing). Here, we propose a general method that estimates the statistics using labeled data, and is applicable to semi-supervised scenarios where at least a small amount of labeled data is available.

The ideal targets are the expected statistics under the ‘‘true’’ model p∗: $t^* = \mathbb{E}_{D_U^k}[f(D_U^k, p^*)]$, where k is the batch size. We can estimate this expectation using labeled data DL and bootstrap sampling (Efron, 1979). Utilizing DL as a set of point estimates for p∗, we sample B total minibatches of k labeled examples uniformly with replacement from DL and calculate the statistic f for each of these minibatch datasets. We then compute the target statistic as the sample mean:

$$t = \frac{1}{B} \sum_{i=1}^{B} f(D_L^{(i)}), \qquad |D_L^{(i)}| = k, \; \forall i \tag{11}$$

where we have slightly abused notation by writing f (DL) to mean f computed using the inputs {x : (x, y) ∈ DL} and the point estimates p∗(y|x) = 1, ∀(x, y) ∈ DL.
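
The bootstrap procedure of Eq. 11 can be sketched in a few lines. The example below is a minimal illustration under assumed inputs: a tiny hypothetical labeled set of tag sequences and a single statistic (the observed share of NOUN tags). It also records the spread of the resampled values, which is the quantity the text turns to next when choosing margins.

```python
import random
import statistics


def noun_share(minibatch):
    """Statistic f on labeled data: observed share of NOUN tags.

    Gold tags play the role of the point estimates p*(y|x) = 1.
    """
    tags = [tag for sentence in minibatch for tag in sentence]
    return sum(tag == "NOUN" for tag in tags) / len(tags)


def bootstrap_target_and_spread(labeled, stat_fn, k, B, seed=0):
    """Resample B minibatches of k sentences and summarize the statistic."""
    rng = random.Random(seed)
    values = []
    for _ in range(B):
        # Uniform sampling with replacement from the labeled data.
        minibatch = [rng.choice(labeled) for _ in range(k)]
        values.append(stat_fn(minibatch))
    target = statistics.mean(values)   # sample mean, as in Eq. 11
    spread = statistics.stdev(values)  # sample standard deviation (1 / (B - 1))
    return target, spread


# Hypothetical labeled data: each sentence is a list of gold POS tags.
labeled = [
    ["DET", "NOUN", "VERB"],
    ["NOUN", "VERB", "DET", "NOUN"],
    ["VERB"],
    ["DET", "NOUN"],
]
t, sigma = bootstrap_target_and_spread(labeled, noun_share, k=2, B=1000)
print(f"target = {t:.3f}, spread = {sigma:.3f}")
```

Resampling at the same minibatch size k that is used during training matters here: the spread of a statistic over 2 sentences is much larger than over 200, and the estimate should reflect the batch size the loss will actually see.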
In addition to estimating the target statistics for small batch sizes, the bootstrap gives us a way to estimate the natural variation of the statistics for small sample sizes. To this end, we propose to utilize the standard deviations from the bootstrap samples as our margins of uncertainty σ:

$$\sigma = \sqrt{\frac{1}{B - 1} \sum_{i=1}^{B} \left(f(D_L^{(i)}) - t\right)^2} \tag{12}$$

This allows our loss function ℓ to adapt to more or less certain statistics. If some statistics are naturally too variable to serve as effective supervision, they will automatically have weak contribution to ℓ and little impact on the model training.

4 Application to Cross-Lingual Parsing

Now that we have described our general approach, in this section we lay out a proposal for applying it to cross-lingual joint POS tagging and dependency parsing. We choose to apply our method to this problem because it is an ideal testbed for controlled experiments in semi-supervised structured prediction. By their nature, the parsing tasks admit many types of interesting statistics that capture cross-task, universal, and language-specific facts about the target test distributions.

We evaluate in two different transfer settings: oracle unsupervised and realistic semi-supervised. In the oracle unsupervised setting, there is no supervised training data available for the target languages (and the L term is dropped from Eq. 3), but we use target values and margins calculated from the held-out supervised data. This setting allows us to understand the impact of our regularizer in isolation without the confounding effects of direct supervision or inaccurate targets. In the semi-supervised experiments, we vary the amounts of supervised data, and calculate the targets from the small supervised data samples. This is a realistic application of our approach that may be applied to low-resource learning scenarios.

4.1 Problem Setup and Data

We use the Universal Dependencies (Nivre, 2020) v2.8 (UD) corpus as data. In UD, syntactic annotation is formulated as a labeled bilexical dependency tree, connecting words in a sentence, with additional part-of-speech (POS) tags annotated for each word. The labeled tree can be broken down into two parts: the arcs that connect the head words to child words, forming a tree, and the dependency labels assigned to each of those arcs. Due to the definition of UD syntax, each word is the child of exactly one arc, and so both the attachments and labels can be written as sequences that align with the words in the sentence.

More formally then, for each labeled sentence $x_{1:n}$ of length n, the full structure y is given by the three sequences $y = (t_{1:n}, e_{1:n}, r_{1:n})$, where $t_{1:n}$, $t_i \in T$ are the POS tags, $e_{1:n}$, $e_i \in \{1, \ldots, n\}$ are the head attachments, and $r_{1:n}$, $r_i \in R$ are the dependency labels.

4.2 The Model and Training

We now turn to the parsing model that is used as the basis for our approach. Though the general ideas of our approach are adaptable to other models, we choose to use the UDify architecture because it is one of the state-of-the-art multilingual parsers for UD.

4.2.1 The UDify Model

The UDify model is based on trends in state-of-the-art parsing, combining a multilingual pretrained transformer language model encoder (mBERT) with a deep biaffine arc-factored parsing decoder, following Dozat and Manning (2017). These encodings are additionally used to predict POS tags with a separate decoder.

The full details are given in Kondratyuk and Straka (2019), but here it suffices to characterize the parser by its top-level probabilistic factorization:

$$p(t_{1:n}, e_{1:n}, r_{1:n} \mid x_{1:n}) = p(e_{1:n} \mid x_{1:n})\, p(t_{1:n} \mid x_{1:n})\, p(r_{1:n} \mid e_{1:n}, x_{1:n}) \tag{13}$$

$$= p(e_{1:n} \mid x_{1:n}) \prod_{i=1}^{n} p(t_i \mid x_{1:n})\, p(r_i \mid e_i, x_{1:n}) \tag{14}$$

This model is scant on explicit joint factors, following recent trends in structured prediction that forgo higher-arity factors, instead opting for shared underlying contextual representations produced by mBERT that implicitly contain information about the sentence and structure as a whole. This factorization will prove useful in Section 4.3 where it will allow us to compute many of the supervision statistics under the model exactly.

4.2.2 Training

The UDify approach to training is simple: It begins with a multilingual PLM, mBERT, then fine-tunes the parsing architecture on the concatenation of the source languages. With vanilla UDify, transfer to target languages is zero-shot. Our approach begins with these two training steps from UDify, then adds a third: adapting to the target language using the target statistics and possibly small amounts of supervised data (Eq. 3).

4.3 Typological Statistics as Supervision

We now discuss a series of statistics that we will use as weak supervision. Most of the proposed statistics describe various probabilities for different (but related) grammatical substructures and can ultimately be broken down into ratios of ‘‘count’’ functions (sums of indicators), which tally various types of events in the data. We propose statistics that cover surface level (POS-only), single-arc, two-arc, and single-head substructures, as well as conditional variants.
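
Because the factorization in Eq. 14 treats each word's tag as independent of its head and label given the input, expected counts of joint (child tag, label) events can be assembled from per-token marginals. The sketch below is a hedged illustration of that computation: it assumes the arc marginals p(e_i = j | x) are already available as inputs (computing them from a tree distribution is not shown), uses 0 as a hypothetical root index, ignores arc direction, and uses made-up probabilities throughout.

```python
def expected_child_label_counts(tag_probs, arc_marginals, label_probs):
    """Expected counts of (child tag, label) pairs for one sentence.

    tag_probs[i][a]      : p(t_i = a | x)
    arc_marginals[i][j]  : p(e_i = j | x), marginal that token i attaches to head j
    label_probs[i][j][b] : p(r_i = b | e_i = j, x)
    """
    counts = {}
    for i, tags in enumerate(tag_probs):
        # Marginalize the head out of the label distribution for token i.
        label_marginal = {}
        for head, p_arc in arc_marginals[i].items():
            for label, p_label in label_probs[i][head].items():
                label_marginal[label] = label_marginal.get(label, 0.0) + p_arc * p_label
        # Tag and label are independent given x, so their probabilities multiply.
        for tag, p_tag in tags.items():
            for label, p_label in label_marginal.items():
                key = (tag, label)
                counts[key] = counts.get(key, 0.0) + p_tag * p_label
    return counts


# Hypothetical two-word sentence; head index 0 stands for the root.
tag_probs = [{"NOUN": 0.8, "VERB": 0.2}, {"NOUN": 0.3, "VERB": 0.7}]
arc_marginals = [{0: 0.1, 2: 0.9}, {0: 0.9, 1: 0.1}]
label_probs = [
    {0: {"root": 1.0}, 2: {"nsubj": 0.9, "obj": 0.1}},
    {0: {"root": 1.0}, 1: {"nsubj": 0.5, "obj": 0.5}},
]

counts = expected_child_label_counts(tag_probs, arc_marginals, label_probs)
total = sum(counts.values())
proportions = {key: value / total for key, value in counts.items()}
print(proportions[("NOUN", "nsubj")])
```

For the first word the label marginal of nsubj is 0.9 × 0.9 = 0.81, and for the second it is 0.1 × 0.5 = 0.05, so the expected (NOUN, nsubj) count is 0.8 × 0.81 + 0.3 × 0.05 = 0.663 out of 2 expected arcs. Dividing expected counts in this way is one example of the ratio-of-counts form described above.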
Due to space constraints, we omit their mathematical descriptions.

Surface Level: One simple set of descriptive statistics is the unigram and bigram distributions over POS tags. POS unigrams can capture some basic relative frequencies, such as our expectation that nouns and verbs are common to all languages. POS bigrams will allow us to capture simple word-order preferences.

Single-Arc: This next set of statistical families all capture information about various choices in single-arc substructures. A single-arc substructure carries up to 5 pieces of information: the arc’s direction, label, and distance, as well as the tags for the head and child words. Various subsets of these capture differing forms of regularity, such as ‘‘the probability of seeing tag $t_h$ head an arc with label r in direction d’’.

Universally Impossible Arcs: In addition to many single-arc variants, we also consider the specific subset of (head tag, label, child tag) single-arc triples that are never seen in any UD data. These combinations correspond to the impossible arrangements that do not ‘‘type-check’’ within the UD formalism and are interesting in that they could in principle be specified by a linguist without any labeled data whatsoever. As such, they represent a particularly attractive use-case of our approach, where a domain expert could rule out all invalid substructures dictated from the task formalism without the model having to learn it implicitly from the training data. With complex structures, this can be a large proportion of the possibilities: in UD we can rule out 93.2% (9,966/10,693) of the combinations.

Two-Arc: We also consider substructures spanning two connected arcs in the tree. They may be useful because they cover many important typological phenomena, such as subject-object-verb ordering. They also have been known to be strong features in higher-order parsing models, such as the parser of Carreras (2007), but are also known to be intractable in non-projective parsers (McDonald and Pereira, 2006).

Following McDonald and Pereira (2006), we distinguish between two different patterns of neighboring arcs: siblings and grandchildren. Sibling arc pairs consist of two arcs that share a single head word, while grandchild arc pairs share an intermediate word that is the child of one arc and the head of another.

Head-Valency: One interesting statistic that does not fall into the other categories is the valency of a particular head tag. This corresponds to the count of outgoing arcs headed by some tag. We convert this into a probability by using a binning function that allows us to quantify the ‘‘probability that some tag heads between a and b children’’. Like the two-arc statistics, expected valency statistics are intractable under the model and we must approximate their computation.

Conditional Variants: Further, each of these statistics can be described in conditional terms, as opposed to their full joint realizations. To do this, we simply divide the joint counts by the counts of the conditioned-upon sub-events. Conditional variants may be useful because they do not express preferences for probabilities of the sub-events on the right side of the conditioning bar, which may be hard to estimate.

Average Entropy: In addition to the above proposed relative frequency statistics, we also include average per-token, per-edge, and MST tree entropies as additional regularization statistics that are always used. Though we do not show it here, each of these functions may be formulated as a statistic within our approach. The inclusion of these statistics amounts to a form of Entropy Regularization (Grandvalet and Bengio, 2004) that keeps the models from optimizing the other ESR constraints with degenerate constant predictions (Mann and McCallum, 2010).

5 Oracle Unsupervised Experiments

We begin with oracle unsupervised transfer experiments that evaluate the potential of many types of statistics and some ablations. In this setting, we do not assume any labeled data in the target language, but do assume accurate target statistics and margins, calculated from held-out training data using the method of Section 3. This allows us to study the potential of our proposed ESR regularization term C on its own and without the confounds of supervised data or inaccurate targets.

Language     Code  Treebank    Family          Train Sents  UDPRE LAS
Arabic       ar    PADT        Semitic         6.1k         80.5
Basque       eu    BDT         Basque          5.4k         77.0
Chinese      zh    GSD         Sino-Tibetan    4.0k         62.3
English      en    EWT         IE, Germanic    12.5k        88.1
Finnish      fi    TDT         Uralic          12.2k        84.4
Hebrew       he    HTB         Semitic         5.2k         80.5
Hindi        hi    HDTB        IE, Indic       13.3k        87.0
Italian      it    ISDT        IE, Romance     13.1k        91.8
Japanese     ja    GSD         Japanese        7.1k         73.6
Korean       ko    GSD         Korean          4.4k         79.0
Russian      ru    SynTagRus   IE, Slavic      15.0k∗       89.1
Swedish      sv    Talbanken   IE, Germanic    4.3k         85.7
Turkish      tr    IMST        Turkic          3.7k         61.7
---------------------------------------------------------------------
German       de    HDT         IE, Germanic    153.0k       82.7
Indonesian   id    GSD         Austronesian    4.5k         50.4
Maltese      mt    MUDT        Semitic         1.1k         20.9
Persian      fa    PerDT       IE, Iranian     26.2k        57.0
Vietnamese   vi    VTB         Austro-Asiatic  1.4k         48.1

Table 1: Training and Evaluation Treebank Details. The final column shows UDPRE test set performance after UDify training (evaluation treebank performance is zero-shot). (∗): downsampled to the same 15k sentences as Üstün et al. (2020) to reduce training time and balance the data.

5.1 Experimental Setup

Next we describe setup details for the experiments. These settings additionally apply to the rest of the experiments unless otherwise stated.

5.1.1 Datasets

In all experiments, the models are first initialized from mBERT, then trained using the UDify code (Kondratyuk and Straka, 2019) on 13 diverse treebanks, following Kulmizev et al. (2019) and Üstün et al. (2020). This model, further referred to as UDPRE, is used as the foundation for all approaches. As discussed in Kulmizev et al. (2019), these 13 training treebanks were selected to give a diverse sample of languages, taking into account factors such as language families, scripts, morphological complexity, and annotation quality. We evaluate all proposed methods on 5 held-out languages, similarly selected for a diversity in language typologies, but with the additional factor of transfer performance of the UDPRE baseline.2 A summary table of these training and evaluation treebanks is given in Table 1.

2 While we would like to evaluate on as many UD treebanks as possible, budgetary constraints required that we restrict the number of test languages when experimenting with settings that combinatorially vary in other dimensions. We do however experiment with more languages in Section 6.2.

5.1.2 Approaches

We compare our approach to two strong baselines in all experiments, based on recent advances in the literature for cross-lingual parsing. These baselines are implemented in our code so that we may fairly compare them in all of our experiments.

• UDPRE: The first baseline is the UDify (Kondratyuk and Straka, 2019) model-transfer approach. Multilingual model-transfer alone is currently one of the state-of-the-art approaches to cross-lingual parsing and is a strong baseline in its own right.

• UDPRE-PPT: We also apply the Parsimonious Parser Transfer (PPT) approach from Kurniawan et al. (2021). PPT is a nuanced self-training approach, extending Täckström et al. (2013), that encourages the model to concentrate its mass on its most likely predicted parses for the target treebank. We use their loss implementation, but apply it to our UDPRE base model (instead of their weaker base model) for a fair comparison, so this approach combines UDify with PPT.

• UDPRE-ESR: Our proposed approach, Expected Statistic Regularization (ESR), applied to UDPRE as an unsupervised-only objective. In individual experiments we will specify the statistics used for regularization.

5.1.3 Training and Evaluation Details

For metrics, we report accuracy for POS tagging, coarse-grained labeled attachment score (LAS) for dependency trees, and their average as a single summary score. The metrics are computed using the official CoNLL-18 evaluation script.3 For all scenarios, we use early-stopping for model selection, measuring the POS-LAS average on the specified development sets.

3 https://universaldependencies.org/conll18/evaluation.html

We tune learning rates and α for each proposed loss variant at the beginning of the first experiment with a low-budget grid search, using the settings that achieve best validation metric on average across the 5 language validation sets for all remaining experiments with that variant. We find generally that a base learning rate of $2 \times 10^{-5}$ and α = 0.01 worked well for all variants of our method. We train all models using AdamW (Loshchilov and Hutter, 2019) on a slanted triangular learning rate schedule (Devlin et al., 2019) with 500 warmup steps. Also, since the datasets vary in size, we normalize the training schedule to 25 epochs at 1000 steps per epoch. We use a batch size of 8 sentences for training and estimating statistic targets. When bootstrapping estimates for t and σ we use B = 1000 samples.

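
The summary score used for model selection in this subsection can be sketched as follows. This is a simplified illustration, not the official CoNLL-18 script: it assumes gold tokenization (so gold and predicted tokens align one to one) and takes "coarse-grained" to mean that label subtypes after a colon are ignored; the function names and the toy sentence are hypothetical.

```python
def coarse(label):
    """Drop any subtype, e.g. 'nsubj:pass' -> 'nsubj' (assumed meaning of coarse-grained)."""
    return label.split(":")[0]


def evaluate(gold, pred):
    """Return (POS accuracy, LAS, their average) in percent.

    gold and pred are aligned lists of (pos_tag, head_index, dep_label) per token.
    """
    n = len(gold)
    pos_correct = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # An arc counts for LAS only if both the head and the coarse label match.
    las_correct = sum(
        g[1] == p[1] and coarse(g[2]) == coarse(p[2]) for g, p in zip(gold, pred)
    )
    pos = 100.0 * pos_correct / n
    las = 100.0 * las_correct / n
    return pos, las, (pos + las) / 2.0


# Hypothetical four-token sentence; head index 0 stands for the root.
gold = [("DET", 2, "det"), ("NOUN", 3, "nsubj"), ("VERB", 0, "root"), ("NOUN", 3, "obj")]
pred = [("DET", 2, "amod"), ("NOUN", 3, "nsubj:pass"), ("VERB", 0, "root"), ("VERB", 2, "obj")]

pos, las, summary = evaluate(gold, pred)
print(pos, las, summary)
```

In this toy case three of four tags are right (75.0), and two of four arcs have both the correct head and coarse label (50.0), giving a summary score of 62.5. The second token still counts toward LAS because only the subtype of its label differs.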
5.2 Assessing the Proposed Statistics

In this experiment we evaluate 32 types of statistics from Section 4.3 for transfer of the UDPRE model (pretrained on 13 languages) to the target languages. The purpose of this experiment is to get a sense of the effectiveness of each statistic for improving model-based cross-lingual transfer.4 To prevent overfitting to the test sets for later experiments, all metrics for this experiment are calculated on the development sets.

Results: The results of the experiment are presented in Table 2, ranked from best to worst. Due to space constraints, we only show the top 10 statistics in addition to the Universal-Arc statistic. Generally we find that all of the 32 proposed statistics improve upon the UDPRE and UDPRE-PPT models on average, with many exhibiting large boosts. The best performing statistic concerns (Child Tag, Label, Direction) substructures, yielding an average improvement of +7.0 POS and +8.5 LAS, an average relative error rate reduction of 23.5%. Many other statistics are not far behind, and overall statistics that bear on the child tag and dependency label had the highest impact. This indicates that, with accurate target estimates, the proposed statistics are highly complementary to multilingual parser pretraining (UDPRE) and substantially improve transfer quality in the unsupervised setting. By comparison, the PPT approach provides marginal gains to UDPRE of only +1.4 average POS and +1.5 average LAS.

Another interesting result is that several of the intractable two-arc statistics were among the best statistics overall, indicating that the use of the differentiable SST approximation does not preclude the applicability of intractable statistics. For example, the directed grandchild statistic of cooccurrences of incoming and outgoing edges for certain tags was the second highest performing, with an average improvement of +6.2 POS accuracy and +8.7 LAS according to the per-language values in Table 2 (21.3% average error rate reduction).

Results for the conditional variants (not shown) were less positive. Generally, conditional variants were worse than their full joint counterparts (e.g., “Child | Label” and “Label | Child” are worse than “Child, Label”), performing worse in 15/16 cases. This makes sense, as we are using accurate statistics and full joints are strictly more expressive.

4 While it would also be possible to try out different combinations of the various statistics, due to cost considerations we leave these experiments to future work.

                                 POS                                  LAS
Statistic                        de     id     fa     vi     mt       de     id     fa     vi     mt       avg
UDPRE                            89.3   80.3   64.7   83.0   41.4     82.7   50.4   57.0   48.1   20.9     61.8
UDPRE-PPT                        +0.4   +5.6   −1.5   −0.1   +3.1     +0.2   +8.1   −5.5   −0.3   +4.6     +1.5

†Child, Label                    +3.3   +5.7   +4.0   +8.0   +14.0    +3.5   +10.1  +18.4  +0.5   +10.2    +7.8
∗†Child, Label, Grand-label      +1.5   +2.8   +5.9   +5.3   +15.7    +2.5   +9.2   +16.2  +3.2   +12.5    +7.5
†Head, Child, Label              +2.5   +4.9   +4.1   +7.2   +14.2    +2.8   +9.9   +17.2  +1.3   +10.2    +7.4
†Head, Label                     +1.7   +3.2   +6.3   +5.2   +14.2    +2.7   +9.0   +16.4  +3.8   +11.0    +7.3
†Head, Label | Child             +3.0   +5.2   +5.7   +5.8   +10.9    +2.8   +9.7   +15.3  +2.1   +5.6     +6.6
†Label                           −0.2   +4.4   +5.1   +0.0   +11.3    +2.9   +8.4   +17.1  +4.0   +9.5     +6.2
†Label, Distance                 −0.2   +3.7   +3.9   −0.1   +11.7    +2.8   +8.9   +16.3  +4.1   +9.4     +6.0
∗†Head, Sibling Children Tags    +2.2   +5.1   +7.8   +5.6   +14.0    +1.6   +2.7   +11.3  −3.0   +11.1    +5.8
†Head, Child                     +2.3   +5.3   +5.9   +3.7   +14.0    +1.9   +3.3   +12.2  −0.7   +10.1    +5.8
†Label | Child                   +1.2   +4.3   +4.8   −0.5   +10.0    +3.3   +9.5   +16.6  +2.0   +5.6     +5.7
Universal Arc                    +2.4   +4.7   +1.4   +1.4   +8.7     +2.1   +8.1   +4.0   −3.1   +3.9     +3.4

Table 2: Unsupervised Oracle Statistic Variant Results. (Top): Baseline methods that do not use ESR. (Bottom): Various statistics used by ESR as unsupervised loss on top of UDPRE. Scores are measured on target treebank development sets. Bold names mark statistics used in later experiments (“Child, Label” and “Universal Arc”). (∗): All statistics with ∗ are intractable and utilize the SST relaxation of Paulus et al. (2020). (†): All statistics with † also include directional information.

This experiment gives a broad but shallow view into the effectiveness of the various proposed statistics. In the rest of the experiments, we evaluate the following two variants in more depth:

1. ESR-CLD, which supervises target proportions for (Child Tag, Label, Direction) triples. This is the “Child, Label” row in Table 2.

2. ESR-UNIARC, which supervises the 9,966 universally impossible (Head Tag, Child Tag, Label) arcs that do not require labeled data to estimate. All of these combinations have target values of t = 0 and margins σ = 0. This is the “Universal Arc” row in Table 2.

We choose these two because ESR-CLD is the best performing statistic overall and ESR-UNIARC is unique in that it does not require labeled data to estimate; we do not evaluate others because of cost considerations.

5.3 Ablation Studies

Next, we perform two ablation experiments to evaluate key design choices of the proposed approach. First, we evaluate the use of batch-level aggregation in the statistics before the loss, versus the more standard approach of loss-per-sentence. In the second, we evaluate the proposed form of $\ell$.

We compare the two aggregation variants using the CLD (Child Tag, Label, Direction) statistic (ESR-CLD). We report test set results averaged over all 5 languages. We use the same hyperparameters selected in Section 5.2.

Aggregation Variant       POS avg   LAS avg   avg
Loss per sentence         77.1      58.5      67.8
Loss per batch (ESR)      79.9      60.4      70.1

Table 3: Loss Aggregation Ablation Results. Loss per batch outperforms loss per sentence for both POS and LAS on average.

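The difference between the two aggregation schemes compared in this ablation (one loss on the batch-pooled statistic versus a sum of per-sentence losses) can be illustrated with a toy calculation. In the sketch below, a max-margin L1 hinge, max{0, |t − x| − σ}, stands in for the loss; the counts, target, and margin are hypothetical, the statistic is taken to be a simple pooled proportion, and fixed numbers replace the model's expected counts, so this is an illustration under those assumptions rather than the training code.

```python
from typing import Sequence


def hard_l1(target: float, margin: float, value: float) -> float:
    """Max-margin L1 distance max{0, |t - x| - sigma} (stand-in for the loss)."""
    return max(0.0, abs(target - value) - margin)


def proportion(counts: Sequence[float], totals: Sequence[float]) -> float:
    """Statistic pooled over a set of sentences: total count / total arcs."""
    return sum(counts) / sum(totals)


def loss_per_batch(target, margin, counts, totals) -> float:
    """Aggregate the statistic over the whole batch, then apply the loss once."""
    return hard_l1(target, margin, proportion(counts, totals))


def loss_per_sentence(target, margin, counts, totals) -> float:
    """Apply the loss to each sentence's own statistic and sum the results."""
    return sum(hard_l1(target, margin, c / n) for c, n in zip(counts, totals))


# Hypothetical batch of four sentences: (expected) count of one substructure
# per sentence, and the number of arcs in each sentence.
counts = [1.0, 3.0, 2.0, 2.0]
totals = [10.0, 10.0, 10.0, 10.0]
target, margin = 0.20, 0.02

batch_loss = loss_per_batch(target, margin, counts, totals)
sentence_loss = loss_per_sentence(target, margin, counts, totals)
print(batch_loss)               # 0.0
print(round(sentence_loss, 2))  # 0.16
```

Here the batch as a whole matches the target proportion exactly, so the batch-level loss is zero, while the per-sentence variant still penalizes the two sentences that deviate individually.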
5.3.1 Batch-Level Loss Ablation

In this ablation, we evaluate a key feature of our proposal: the aggregation of the statistic over the batch before loss computation (Eq. 5), versus the more standard approach, which is to apply the loss per sentence. The former, “Loss per batch”, has the form:

$$\ell(t, \sigma, f(D_U, p_\theta))$$

while the latter, “Loss per sentence”, has the form:

$$\sum_{x \in D_U} \ell(t, \sigma, f(x, p_\theta)).$$

The significance of this difference is that “Loss per batch” allows for the variation in individual sentences to somewhat average out and hence is less noisy, while “Loss per sentence” requires that each sentence individually satisfy the targets.

Results: The results are presented in Table 3. From the table we can see that “Loss per batch” has an average POS of 79.9 and average LAS of 60.4, compared to “Loss per sentence” with average POS of 77.1 and LAS of 58.5, which amount to +2.8 POS and +1.9 LAS improvements. This indicates that applying the loss at the batch level confers an advantage over applying it per sentence.

5.3.2 Smooth Hinge-Loss Ablation

Next, we evaluate the efficacy of the proposed smoothed hinge-loss distance function $\ell$. We compare to using just L1 or L2 uninterpolated and with no margin parameters (σ = 0). We also compare to the “Hard L1”, which is the max-margin hinge $\ell(t, \sigma, x) = \max\{0, |t - x| - \sigma\}$. We use the same experimental setup as the previous ablation.

Results: The results are presented in Table 4. From the table we can see that the Smooth L1 loss outperforms the other variants.

$\ell$ Variant            POS avg   LAS avg   avg
L2 (σ = 0)                78.0      58.2      68.1
L1 (σ = 0)                78.5      60.3      69.5
Hard L1 (max-margin)      78.4      59.9      69.2
Smooth L1 (ESR)           79.9      60.4      70.1

Table 4: Loss Function Ablation Results. The Smooth L1 loss outperforms the other simpler loss variants for both POS and LAS, averaged over 5 languages.

6 Realistic Semi-Supervised Experiments

The previous experiments considered an unsupervised transfer scenario without labeled data. In these next experiments we turn to a realistic semi-supervised application of our approach where we have access to limited labeled data for the target treebank.

6.1 Learning Curves

In this experiment we present learning curves for the approaches, varying the amount of labeled data $|D_L^{\text{train}}| \in \{50, 100, 500, 1000\}$. To make experiments realistic, we calculate the target statistics t and margins σ from the small subsampled labeled training datasets using Eqs. 11 and 12.

We study two distinct settings. First, we study the multi-source domain-adaptation transfer setting, UDPRE. Second, we study our approach in a more standard semi-supervised scenario where we cannot utilize intermediate on-task pretraining and domain-adaptation, instead learning on the target dataset starting “from scratch” with the pretrained PLM (MBERT).

We use the same baselines as before, but augment each with a supervised fine-tuning loss on the supervised data in addition to any unsupervised losses. We refer to these models as UDPRE-FT, UDPRE-FT-PPT, and UDPRE-FT-ESR. That is, models with FT in the name have some supervised fine-tuning in the target language.

In these experiments, we subsample labeled training data 3 times for each setting. We report averages over all 5 languages, 3 supervised subsample runs each, for a total of 15 runs per method and dataset size. We also use subsampled development sets so that model selection is more realistic.5 For development sets we subsample the data to a size of $|D_L^{\text{dev}}| = \min(100, |D_L^{\text{train}}|)$, which reflects a 50/50 train/dev split until $|D_L| \geq 200$, at which point we maximize training data and only hold out 100 sentences for validation.

5 As is argued by Oliver et al. (2018), using a realistically sized development set is overlooked in much of the semi-supervised literature, leading to inappropriately strong model selection and overly optimistic results.

We use the same hyperparameters as before, except we use 40 epochs with 200 steps per epoch as the training schedule, mixing supervised and unsupervised data at a rate of 1:4.

6.1.1 UDPRE Transfer

In this experiment, we evaluate in the multilingual transfer scenario by initializing from UDPRE. In addition to the two chosen realistic ESR variants, we also experiment with an “oracle” version of ESR-CLD, called ESR-CLD∗, that uses target statistics estimated from the full training data. This allows us to see if small-sample estimates cause a degradation in performance compared to accurate large-sample estimates.

Results: Learning curves for the different approaches, averaged over all 3 runs for all 5 languages (15 total), are given in Figure 1. From the figure we can discern several encouraging results.

Figure 1: Multi-Source UDPRE Transfer Learning Curves. Baseline approaches are dotted, while ESR variants are solid. All curves show the average of 15 runs across 5 different languages with 3 randomly sampled labeled datasets per language. The plots indicate a significant advantage of ESR over the baselines in low-data regions.

ESR-CLD and ESR-UNIARC add significant benefit to fine-tuning for small data. Both variants significantly outperform the baselines at 50 and 100 labeled examples. For example, relative to UDPRE-FT, the ESR-CLD model yielded gains of +2 POS, +3.6 LAS at 50 examples and +1.8 POS, +3.2 LAS at 100 labeled examples.
At 500 and 1000 examples, however, we begin to see diminishing benefits to ESR on top of fine-tuning.

ESR-UNIARC is much more effective in conjunction with fine-tuning. Compared to the unsupervised experiment in Section 5.2, where it ranked 25/32, the ESR-UNIARC statistic is much more competitive with the more detailed ESR-CLD statistics. One potential explanation is that without labeled data (as in Section 5.2) the ESR-UNIARC statistic is under-specified (the 727 allowed arcs are all free to take any value), whereas the inclusion of some labeled data in this experiment fills this gap by implicitly indicating target proportions for the allowed arcs. This suggests that an approach which combines UniArc constraints with elements of self-training (like PPT) that supervise the “free” non-zero combinations could potentially be a useful approach to zero-shot transfer. However, we leave this to future work.

Small-data estimates for ESR-CLD are as good as accurate estimates. Comparing ESR-CLD to the unrealistic ESR-CLD∗, we find no significant difference between the two, indicating that, at least for the CLD statistic, using target estimates from small samples is as good as large-sample estimates. This may be due in part to the margin estimates σ, which are wider for the small samples and somewhat mitigate their inaccuracies.

PPT adds little benefit to fine-tuning. Relative to UDPRE-FT, the UDPRE-FT-PPT baseline does not yield much gain, with a maximum average improvement of +0.3 POS and +0.7 LAS over all dataset sizes. This indicates that fine-tuning and PPT-style self-training may be redundant.

6.1.2 MBERT Transfer

In this experiment, we consider a counterfactual setting: What if the UD data was not a massively multilingual dataset where we can utilize multilingual model-transfer, and instead was an isolated dataset with no related data to transfer from?
This situation reflects the more standard semi-supervised learning setting, where we are given a new task, some labeled and unlabeled data, and must build a model “from scratch” on that data.

For this experiment, we repeat the learning curve setting from Section 6.1.1, but initialize our model directly with MBERT, skipping the intermediate UDPRE training.

Results: Learning curves for the different approaches, averaged over all 3 runs for all 5 languages (15 total), are given in Figure 2. The results from this experiment are encouraging; ESR has even greater benefits when fine-tuning directly from MBERT than in the previous experiment, indicating that our general approach may be even more useful outside of domain-adaptation conditions.

Figure 2: “From Scratch” MBERT Transfer Learning Curves. Baseline approaches are dotted, while ESR variants are solid. All curves show the average of 15 runs across 5 different languages with 3 randomly sampled labeled datasets per language. The plots indicate a significant advantage of ESR over the baselines in all regions.

6.2 Low-Resource Transfer

In previous experiments, we limited the number of evaluation treebanks to 5 to allow for variation in other dimensions (i.e., constraint types, loss types, differing amounts of labeled data). In this experiment, we expand the number of treebanks and evaluate transfer performance in a low-resource setting with only $|D_L^{\text{train}}| = 50$ labeled sentences in the target treebank, comparing UDPRE, UDPRE-FT, and UDPRE-FT-ESR-CLD. As before, we subsample 3 small datasets per treebank and calculate the target statistics t and margins σ from these to make transfer results realistic.

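To make the idea of estimating targets from a small labeled subsample concrete, the sketch below computes empirical (Child Tag, Label, Direction) proportions from a handful of annotated sentences. It is a minimal illustration under stated assumptions: the token tuple layout, the `direction` convention (including a separate "root" value), and the function names are hypothetical, the targets are taken to be plain normalized counts, and the margins σ are not computed here because the corresponding equations (Eqs. 11 and 12) are not reproduced in this section.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Token = (position, UPOS tag, head position, dependency label); head 0 marks the root.
Token = Tuple[int, str, int, str]
Triple = Tuple[str, str, str]


def direction(position: int, head: int) -> str:
    """Assumed convention: where the child sits relative to its head."""
    if head == 0:
        return "root"
    return "left" if position < head else "right"


def cld_proportions(sentences: List[List[Token]]) -> Dict[Triple, float]:
    """Normalized counts of (child tag, label, direction) triples over a sample."""
    counts: Counter = Counter()
    for sentence in sentences:
        for position, tag, head, label in sentence:
            counts[(tag, label, direction(position, head))] += 1
    total = sum(counts.values())
    return {triple: n / total for triple, n in counts.items()}


# Hypothetical two-sentence labeled sample ("Dogs bark", "She sees dogs").
sample = [
    [(1, "NOUN", 2, "nsubj"), (2, "VERB", 0, "root")],
    [(1, "PRON", 2, "nsubj"), (2, "VERB", 0, "root"), (3, "NOUN", 2, "obj")],
]

targets = cld_proportions(sample)
print(targets[("VERB", "root", "root")])  # 0.4
```

With only 50 sentences, many triples will have very few observations, which is consistent with the text's remark that margins estimated from small samples are wider.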
We select evaluation treebanks according to the following criteria. For each unique language in UD v2.8 that is not one of the 13 training languages, we select the largest treebank, and keep it if it has at least 250 train sentences and a development set, so that we can get reasonable variability in the subsamples. This process yields 44 diverse evaluation treebanks.

Results: The results of this experiment are given in Table 5. From the table we can see that our approach, ESR (UDPRE-FT-ESR-CLD), outperformed supervised fine-tuning (UDPRE-FT) in many cases, often by a large margin. On average, UDPRE-FT-ESR-CLD outperformed UDPRE-FT by +2.6 POS and +2.3 LAS across the 44 languages. Further, UDPRE-FT-ESR-CLD outperformed zero-shot transfer, UDPRE, by +10.0 POS and +14.7 LAS on average.

Interestingly, we found that there were several cases of large performance gains while there were no cases of large performance declines. For example, ESR improved LAS scores by +17.3 for Wolof, +16.8 for Maltese, and +12.5 for Scottish Gaelic, and 9/44 languages saw LAS improvements ≥ +5.0, while the largest decline was only −2.5. Additionally, ESR improved POS scores by +20.9 for Naija and +11.2 for Welsh, and 9/44 languages saw POS improvements ≥ +5.0.

The cases of performance decline for LAS merit further analysis. Of the 20 languages with negative Δ LAS, 18 are modern languages spoken in continental Europe (mostly Slavic and Romance), while only 5 of the 24 languages with positive Δ LAS meet this criterion. We hypothesize that this tendency is due to the training data used for pretraining MBERT, which was heavily skewed towards this category (Devlin et al., 2019). This suggests that ESR is particularly helpful in cases of transfer to domains that are underrepresented in pretraining.

7 Related Work

Related work generally falls into two categories: weak supervision and cross-lingual transfer.

Weak Supervision: Supervising models with signals weaker than fully labeled data has been and continues to be a popular topic of interest. Current trends in weak supervision focus on generating instance-level supervision, using weak information such as: relations between multiple tasks (Greenberg et al., 2018; Ratner et al., 2018; Ben Noach and Goldberg, 2019); labeled features (Druck et al., 2008; Ratner et al., 2016; Karamanolakis et al., 2019a); coarse-grained labels (Angelidis and Lapata, 2018; Karamanolakis et al., 2019b); dictionaries and distant supervision (Bellare and McCallum, 2007; Carlson et al., 2009; Liu et al., 2019a; Üstün et al., 2020); or some combination thereof (Ratner et al., 2016; Karamanolakis et al., 2019a).

In contrast, our work is more closely related to older work on population-level supervision. These techniques include Constraint-Driven Learning (CODL) (Cha), posterior regularization (PR) (Ganchev et al., 2010), the measurements framework of Liang et al. (2009), and the generalized expectation criteria (GEC) (Druck et al., 2008, 2009; Mann and McCallum, 2010).

                                                   POS                            LAS
Treebank                      Family               UDPRE  FT     ESR    Δ         UDPRE  FT     ESR    Δ
Wolof-WTB                     Northern Atlantic    40.6   79.5   85.4   +5.9      12.7   55.9   73.3   +17.3
Maltese-MUDT                  Semitic              35.1   82.6   91.8   +9.2      16.0   57.5   74.2   +16.8
Scottish Gaelic-ARCOSG        Celtic               45.7   66.0   75.9   +9.9      24.4   56.4   68.9   +12.5
Faroese-FarPaHC               Germanic             74.7   86.2   87.2   +1.1      43.0   71.4   80.7   +9.3
Gothic-PROIEL                 Germanic             30.1   67.6   71.7   +4.1      12.6   45.8   54.6   +8.8
Welsh-CCG                     Celtic               71.9   74.7   85.8   +11.2     54.8   69.4   77.6   +8.1
Western Armenian-ArmTDP       Armenian             80.6   84.9   87.1   +2.2      60.4   67.0   72.7   +5.7
Telugu-MTG                    Dravidian            82.0   81.6   81.6   0.0       70.9   74.6   80.1   +5.5
Vietnamese-VTB                Viet-Muong           67.0   85.6   88.5   +2.9      46.3   55.3   60.8   +5.5
Turkish German-SAGT           Code Switch          76.8   84.4   85.8   +1.4      48.0   58.0   62.1   +4.1
Afrikaans-AfriBooms           Germanic             90.7   88.0   91.3   +3.3      62.0   79.4   83.4   +3.9
Hungarian-Szeged              Ugric                87.9   79.9   89.7   +9.7      74.0   77.8   81.7   +3.9
Galician-CTG                  Romance              91.8   89.0   91.2   +2.2      60.5   74.3   77.8   +3.6
Marathi-UFAL                  Marathi              71.4   81.1   82.3   +1.1      44.9   59.5   62.5   +3.0
Naija-NSC                     Creole               46.5   68.0   88.9   +20.9     27.9   71.1   73.4   +2.3
Greek-GDT                     Greek                87.1   92.8   92.5   −0.3      78.7   86.3   88.0   +1.8
Tamil-TTB                     Dravidian            72.3   72.4   79.6   +7.2      46.7   64.9   66.4   +1.5
Indonesian-GSD                Austronesian         82.3   89.8   90.2   +0.5      58.3   72.9   74.3   +1.4
Uyghur-UDT                    Turkic               23.7   59.8   65.5   +5.6      14.0   38.0   39.2   +1.3
Old French-SRCMF              Romance              65.3   74.2   76.2   +2.0      44.0   56.7   57.8   +1.2
Old Church Slavonic-PROIEL    Slavic               37.3   54.7   61.0   +6.3      19.2   39.0   40.1   +1.1
Portuguese-GSD                Romance              92.1   89.6   92.8   +3.3      74.4   84.1   84.5   +0.4
Danish-DDT                    Germanic             92.0   92.7   92.1   −0.6      71.0   75.5   75.7   +0.2
Armenian-ArmTDP               Armenian             84.7   88.1   88.0   −0.1      64.1   69.0   69.2   +0.1
Spanish-AnCora                Romance              94.5   95.2   95.4   +0.2      77.8   83.0   82.9   −0.1
Catalan-AnCora                Romance              92.9   94.4   94.6   +0.3      75.8   82.5   82.4   −0.1
Serbian-SET                   Slavic               91.2   90.7   93.1   +2.4      81.6   86.5   86.4   −0.1
Slovak-SNK                    Slavic               91.5   91.5   92.0   +0.5      81.6   84.0   83.9   −0.1
Romanian-Nonstandard          Romance              79.2   83.3   85.0   +1.7      54.5   63.6   63.4   −0.2
Polish-PDB                    Slavic               89.7   90.4   90.9   +0.5      76.0   79.7   79.4   −0.3
German-HDT                    Germanic             89.6   94.4   94.2   −0.2      83.0   88.2   87.7   −0.5
Lithuanian-ALKSNIS            Baltic               87.0   87.4   87.4   0.0       65.4   69.2   68.6   −0.6
Latin-ITTB                    Italic               73.8   80.9   81.7   +0.8      51.7   64.3   63.7   −0.6
Bulgarian-BTB                 Slavic               91.9   94.7   94.6   −0.1      78.0   84.4   83.7   −0.7
Czech-PDT                     Slavic               90.6   92.1   92.7   +0.6      78.1   81.9   81.1   −0.8
Persian-PerDT                 Iranian              79.1   91.0   90.8   −0.2      48.4   74.6   73.7   −0.9
Slovenian-SSJ                 Slavic               89.2   90.9   91.2   +0.3      79.6   84.5   83.5   −0.9
Croatian-SET                  Slavic               91.4   91.7   92.1   +0.4      80.0   84.1   83.1   −1.0
Urdu-UDTB                     Indic                86.9   90.0   88.2   −1.8      68.7   75.7   74.4   −1.3
Ukrainian-IU                  Slavic               91.5   92.0   92.4   +0.3      79.6   81.2   80.0   −1.3
Dutch-Alpino                  Germanic             90.0   90.6   90.6   0.0       78.9   81.6   80.3   −1.3
Norwegian-Bokmaal             Germanic             91.7   91.8   92.1   +0.3      80.8   82.5   81.0   −1.5
Belarusian-HSE                Slavic               91.5   91.6   91.9   +0.3      78.9   79.8   78.1   −1.8
Estonian-EDT                  Finnic               89.1   89.6   89.2   −0.4      70.4   71.4   68.9   −2.5
Average                                            77.3   84.7   87.3   +2.6      59.0   71.4   73.7   +2.3

Table 5: Low-Resource Semi-Supervised Transfer Results. Transfer results for 44 unseen test languages using 50 labeled sentences in the target language, averaged over 3 subsampled datasets. “FT” refers to the UDPRE-FT fine-tuning baseline, “ESR” refers to our UDPRE-FT-ESR-CLD approach, and Δ refers to the absolute difference of ESR minus FT. Best performing methods are bolded. Results are ordered from best to worst Δ LAS.

Our work can be seen as an extension of GEC to more expressive expectations and to modern mini-batch SGD training. There are two more recent works that touch on these ideas, but both have significant downsides compared to our approach. Meng et al. (2019) use a PR approach inspired by Ganchev and Das (2013) for cross-lingual parsing, but must use very simple constraints and require a slow inference procedure that can only be used at test time. Ben Noach
and Goldberg (2019) utilize GEC with mini-batch training, but focus on using related tasks for computing simpler constraints and do not adapt their targets to small batch sizes.

Cross-Lingual Transfer: Earlier trends in cross-lingual transfer for parsing used delexicalization (Zeman and Resnik, 2008; McDonald et al., 2011; Täckström et al., 2013) and then aligned multilingual word vector-based approaches (Guo et al., 2015; Ammar et al., 2016; Rasooli and Collins, 2017; Ahmad et al., 2019). With the rapid rise of language-model pretraining (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019b), recent research has focused on multilingual PLMs and multi-task fine-tuning to achieve generalization in transfer. Wu and Dredze (2019) showed that a multilingual PLM afforded surprisingly effective cross-lingual transfer using only English as the fine-tuning language. Kondratyuk and Straka (2019) extended this approach by fine-tuning a PLM on the concatenation of all treebanks. Tran and Bisazza (2019), however, show that transfer to distant languages benefits less.

Other recent successes have been found with linguistic side-information (Meng et al., 2019; Üstün et al., 2020), careful methodology for source-treebank selection (Tiedemann and Agic, 2016; Tran and Bisazza, 2019; Lin et al., 2019; Glavaš and Vulić, 2021), self-training (Kurniawan et al., 2021), and paired bilingual text for annotation projection (Rasooli and Tetreault, 2015; Rasooli and Collins, 2019; Liu et al., 2020; Shi et al., 2022).

8 Conclusion

We have presented Expected Statistic Regularization, a general approach to weak supervision for structured prediction, and studied it in the context of modern cross-lingual multi-task syntactic parsing.
We evaluated a wide range of expressive structural statistics in idealized and realistic transfer scenarios and have shown that the proposed approach is effective and complementary to the state-of-the-art model-transfer approaches.

Acknowledgments

We would like to thank Chris Kedzie, Giannis Karamanolakis, and the reviewers for helpful conversations and feedback.

References

Wasi Uddin Ahmad, Zhisong Zhang, Xuezhe Ma, Kai-Wei Chang, and Nanyun Peng. 2019. Cross-lingual dependency parsing with unlabeled auxiliary languages. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 372–382, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/K19-1035

Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Many languages, one parser. Transactions of the Association for Computational Linguistics, 4:431–444. https://doi.org/10.1162/tacl_a_00109

Stefanos Angelidis and Mirella Lapata. 2018. Multiple instance learning networks for fine-grained sentiment analysis. Transactions of the Association for Computational Linguistics, 6:17–31. https://doi.org/10.1162/tacl_a_00002

Kedar Bellare and Andrew McCallum. 2007. Learning extractors from unlabeled text using relevant databases. In Sixth international workshop on information integration on the web (AAAI).

Matan Ben Noach and Yoav Goldberg. 2019. Transfer learning between related tasks using expected label proportions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 31–42, Hong Kong, China. Association for Computational Linguistics.

Andrew Carlson, Scott Gaffney, and Flavian Vasile. 2009. Learning a named entity tagger from gazetteers with the partial perceptron. In AAAI Spring Symposium: Learning by Reading and Learning to Read, pages 7–13.

Xavier Carreras. 2007.
Experiments with a higher-order projective dependency parser. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 957–961, Prague, Czech Republic. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net.

Gregory Druck, Gideon Mann, and Andrew McCallum. 2008. Learning from labeled features using generalized expectation criteria. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, pages 595–602, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/1390334.1390436

Gregory Druck, Gideon Mann, and Andrew McCallum. 2009. Semi-supervised learning of dependency parsers using generalized expectation criteria. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 360–368, Suntec, Singapore.
Association for Computational Linguistics.

Bradley Efron. 1979. Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7:1–26. https://doi.org/10.1214/aos/1176344552

Kuzman Ganchev and Dipanjan Das. 2013. Cross-lingual discriminative learning of sequence models with posterior regularization. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1996–2006, Seattle, Washington, USA. Association for Computational Linguistics.

Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11:2001–2049.

Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448.

Goran Glavaš and Ivan Vulić. 2021. Climbing the tower of treebanks: Improving low-resource dependency parsing via hierarchical source selection. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4878–4888, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-acl.431

Yves Grandvalet and Yoshua Bengio. 2004. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, volume 17. MIT Press.

Nathan Greenberg, Trapit Bansal, Patrick Verga, and Andrew McCallum. 2018. Marginal likelihood training of BiLSTM-CRF for biomedical named entity recognition from disjoint label sets. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2824–2829, Brussels, Belgium. Association for Computational Linguistics.

Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2015. Cross-lingual dependency parsing based on distributed representations. In Proceedings of
the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1234–1244, Beijing, China. Association for Computational Linguistics. https://doi.org/10.3115/v1/P15-1119

Junxian He, Zhisong Zhang, Taylor Berg-Kirkpatrick, and Graham Neubig. 2019. Cross-lingual syntactic transfer through unsupervised adaptation of invertible projections. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3211–3223, Florence, Italy. Association for Computational Linguistics.

Giannis Karamanolakis, Daniel Hsu, and Luis Gravano. 2019a. Leveraging just a few keywords for fine-grained aspect detection through weakly supervised co-training. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4611–4621, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1468

Giannis Karamanolakis, Daniel Hsu, and Luis Gravano. 2019b. Weakly supervised attention networks for fine-grained opinion mining and public health. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 1–10, Hong Kong, China. Association for Computational Linguistics.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China.
Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1279

Artur Kulmizev, Miryam de Lhoneux, Johannes Gontrum, Elena Fano, and Joakim Nivre. 2019. Deep contextualized word embeddings in transition-based and graph-based dependency parsing - a tale of two parsers revisited. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2755–2768, Hong Kong, China. Association for Computational Linguistics.

Kemal Kurniawan, Lea Frermann, Philip Schulz, and Trevor Cohn. 2021. PPT: Parsimonious parser transfer for unsupervised cross-lingual adaptation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2907–2918, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.eacl-main.254

Percy Liang, Michael I. Jordan, and Dan Klein. 2009. Learning from measurements in exponential families. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 641–648, New York, NY, USA. Association for Computing Machinery. https://doi.org/10.1145/1553374.1553457

Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3125–3135, Florence, Italy. Association for Computational Linguistics.

Lu Liu, Yi Zhou, Jianhan Xu, Xiaoqing Zheng, Kai-Wei Chang, and Xuanjing Huang. 2020. Cross-lingual dependency parsing by POS-guided word reordering. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2938–2948, Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.findings-emnlp.265

Tianyu Liu, Jin-Ge Yao, and Chin-Yew Lin. 2019a. Towards improving neural named entity recognition with gazetteers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5301–5307, Florence, Italy. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692v1.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Gideon S. Mann and Andrew McCallum. 2010. Generalized expectation criteria for semi-supervised learning with weakly labeled data. Journal of Machine Learning Research, 11:955–984.

Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 81–88, Trento, Italy. Association for Computational Linguistics.

Ryan McDonald, Slav Petrov, and Keith Hall. 2011. Multi-source transfer of delexicalized dependency parsers. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 62–72, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Tao Meng, Nanyun Peng, and Kai-Wei Chang. 2019. Target language-aware constrained inference for cross-lingual dependency parsing.
In Proceedings of the 2019 Conference on Empiri- cal Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP- IJCNLP), pages 1117–1128, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19 -1103 Joakim Nivre. 2020. Multilingual dependency parsing from universal dependencies to sesame street. In Text, Speech, and Dialogue - 23rd International Conference, TSD 2020, Brno, Czech Republic, September 8–11, 2020, Pro- ceedings, volume 12284 of Lecture Notes in Computer Science, pages 11–29. Springer. Avital Oliver, Augustus Odena, Colin Raffel, Ekin Dogus Cubuk, and Ian J. Goodfellow. 2018. Realistic evaluation of deep semi- supervised learning algorithms. In Advances Information Processing Systems in Neural 31: Annual Conference on Neural Informa- tion Processing Systems 2018, NeurIPS 2018, December 3–8, 2018, Montr´eal, Canada, pages 3239–3250. Max B. Paulus, Dami Choi, Daniel Tarlow, Andreas Krause, and Chris J. Maddison. 2020. Gradient estimation with stochastic softmax tricks. In Advances in Neural Information Pro- cessing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextu- alized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Com- putational Linguistics. https://doi.org /10.18653/v1/N18-1202 Mohammad Sadegh Rasooli and Michael Collins. 2017. Cross-lingual syntactic transfer with lim- ited resources. Transactions of the Association for Computational Linguistics, 5:279–293. Mohammad Sadegh Rasooli and Michael Collins. 2019. 
Low-resource syntactic transfer with unsupervised source reordering. In Proceedings of the 2019 Conference of the North American the Association for Compu- Chapter of tational Linguistics: Human Language Tech- nologies, Volume 1 (Long and Short Papers), pages 3845–3856, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19 -1385 Mohammad Sadegh Rasooli and Joel R. Tetreault. 2015. Yara parser: A fast and accurate de- pendency parser. ArXiv, abs/1503.06733. Alexander Ratner, Braden Hancock, Jared Dunnmon, Roger E. Goldman, and Christopher R´e. 2018. Snorkel metal: Weak supervision for multi-task learning. In Proceedings of the Sec- ond Workshop on Data Management for End- To-End Machine Learning, DEEM@SIGMOD 2018, Houston, TX, USA, June 15, 2018, pages 3:1–3:4. ACM. https://doi.org /10.1145/3209889.3209898 Alexander J. Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, and Christopher R´e. 2016. Data programming: Creating large training sets, quickly. In Advances in Neural Informa- tion Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain, pages 3567–3575. Freda Shi, Kevin Gimpel, and Karen Livescu. 2022. Substructure distribution projection for zero-shot cross-lingual dependency parsing. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6547–6563, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653 /v1/2022.acl-long.452 Oscar T¨ackstr¨om, Ryan McDonald, and Joakim language adaptation of Nivre. 2013. Target 137 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 3 7 2 0 6 7 8 3 0 / / t l a c _ a _ 0 0 5 3 7 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 discriminative transfer parsers. 
In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1061–1071, Atlanta, Georgia. Associa- tion for Computational Linguistics. J¨org Tiedemann and Zeljko Agic. 2016. Syn- thetic treebanking for cross-lingual dependency parsing. Journal of Artificial Intelligence Re- search, 55:209–248. https://doi.org/10 .1613/jair.4785 Ke Tran and Arianna Bisazza. 2019. Zero-shot dependency parsing with pre-trained multilin- gual sentence representations. In Proceedings of the 2nd Workshop on Deep Learning Ap- proaches for Low-Resource NLP (DeepLo 2019), pages 281–288, Hong Kong, China. Association for Computational Linguistics. Ahmet ¨Ust¨un, Arianna Bisazza, Gosse Bouma, and Gertjan van Noord. 2020. Udapter: Language adaptation for truly universal depen- dency parsing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16–20, 2020, pages 2302–2315. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020 .emnlp-main.180 Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effective- ness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Interna- tional Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844, Hong Kong, China. Association for Computa- tional Linguistics. Daniel Zeman and Philip Resnik. 2008. Cross- language parser adaptation between related languages. In Proceedings of the IJCNLP- 08 Workshop on NLP for Less Privileged Languages. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 3 7 2 0 6 7 8 3 0 / / t l a c _ a _ 0 0 5 3 7 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 138