Surface Statistics of an Unknown Language Indicate How to Parse It
Dingquan Wang and Jason Eisner
Department of Computer Science, Johns Hopkins University
{wdd,jason}@cs.jhu.edu
Abstract
We introduce a novel framework for delex-
icalized dependency parsing in a new lan-
guage. We show that useful features of the
target language can be extracted automati-
cally from an unparsed corpus, which con-
sists only of gold part-of-speech (POS) se-
quences. Providing these features to our
neural parser enables it to parse sequences
like those in the corpus. Strikingly, our sys-
tem has no supervision in the target lan-
guage. Rather, it is a multilingual sys-
tem that is trained end-to-end on a vari-
ety of other languages, so it learns a fea-
ture extractor that works well. We show
experimentally across multiple languages:
(1) Features computed from the unparsed
corpus improve parsing accuracy. (2) In-
cluding thousands of synthetic languages
in the training yields further improvement.
(3) Despite being computed from unparsed
corpora, our learned task-specific features
beat previous work’s interpretable typolog-
ical features that require parsed corpora or
expert categorization of the language. Our
best method improved attachment scores on
held-out test languages by an average of 5.6
percentage points over past work that does
not inspect the unparsed data (McDonald
et al., 2011), and by 20.7 points over past
“grammar induction” work that does not use
training languages (Naseem et al., 2010).
1 Introduction
Dependency parsing is one of the core natural
language processing tasks.
It aims to parse a
given sentence into its dependency tree: a directed
graph of
labeled syntactic relations between
words. Supervised dependency parsers—which
are trained using a “treebank” of known parses in
the target language—have been very successful
(McDonald, 2006; Nivre, 2008; Kiperwasser
and Goldberg, 2016). By contrast, the progress
of unsupervised dependency parsers has been
slow, and they have apparently not been used in
any downstream NLP systems (Mareček, 2016).
An unsupervised parser does not have access
to a treebank, but only to a corpus of unparsed
sentences in the target language.
Unsupervised parsing has been studied for
decades. The most common approach is gram-
mar induction (Lari and Young, 1990; Carroll
and Charniak, 1992; Klein and Manning, 2004).
Grammar induction induces an explicit grammar
from the unparsed corpus, such as a probabilis-
tic context-free grammar (PCFG), and uses that to
parse sentences of the language. This approach
has encountered two major difficulties:
• Search error: Most formulations of gram-
mar induction involve optimizing a highly
non-convex objective function such as likeli-
hood. The optimization is typically NP-hard
(Cohen and Smith, 2012), and approximate
local search methods tend to get stuck in lo-
cal optima.
• Model error: Likelihood does not correlate
well with parsing accuracy anyway (Smith,
2006, Figure 3.2). Likelihood optimization
seeks latent trees that help to predict the
observed sentences, but these unsupervised
trees may use a non-standard syntactic anal-
ysis or even be optimized to predict non-
syntactic properties such as topic. We seek
a standard syntactic analysis—what Smith
(2006) calls the MATCHLINGUIST task.
We address both difficulties by using a super-
vised learning framework—one whose objective
function is easier to optimize and explicitly tries
to match linguists’ standard syntactic analyses.
Our approach is inspired by Wang and Eisner
(2017), who use an unparsed but tagged corpus
to predict the fine-grained syntactic typology of
Transactions of the Association for Computational Linguistics, vol. 6, pp. 667–685, 2018. Action Editor: Marco Kuhlmann.
Submission batch: 2/2018; Revision batch: 4/2018; Published 12/2018.
© 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
a language. For example, they may predict that
about 70% of the direct objects fall to the right of
the verb. Their system is trained on a large num-
ber of (unparsed corpus, true typology) pairs, each
representing a different language. With this train-
ing, it can generalize to predict typology from the
unparsed corpus of a new language. Our approach
is similar except that we predict parses rather than
just a typology. In both cases, the system is trained
to optimize a task-specific quality measure. The
system’s parameterization can be chosen to sim-
plify optimization (strikingly, the training objec-
tive could even be made convex by using a condi-
tional random field architecture) and/or to incor-
porate linguistically motivated features.
The positive results of Wang and Eisner (2017)
demonstrate that there are indeed surface clues to
syntactic structure in the input corpus, at least if it
is POS-tagged (as in their work and ours). How-
ever, their method only found global typological
information: it did not establish which 70% of the
direct objects fell to the right of their verbs, let
alone identify which nouns were in fact direct ob-
jects of which verbs. That requires a token-level
analysis of each sentence, which we undertake in
this paper. Again, the basic idea is that instead of
predicting interpretable typological properties of a
language as Wang and Eisner (2017) did, we will
predict a language-specific version of the scoring
function that a parser uses to choose among vari-
ous actions or substructures.
2 Unsupervised Parsing with Supervised
Tuning
Our fundamental question is whether gold part-of-
speech (POS) sequences carry useful information
about the syntax of a language.1 As we will show,
the answer is yes, and the information can be ex-
tracted and used to obtain actual parses.
This is the same question that has been implic-
itly asked by previous papers in the unsupervised
parsing tradition (see §5). Unsupervised pars-
ing of gold POS sequences is an artificial task,
to be sure.2 Nonetheless, it is a starting point
1We also include an experiment on noisy POS sequences.
2It is clearly not the task setting faced by human language
learners. Nor is it a plausible engineering setting: a language
with gold POS sequences often also has at least a small tree-
bank of gold parses, or at least parallel text in a language
from which noisy parses can be noisily projected (Agić et al.,
2016). There is also no practical reason to consider POS tags
without their attached words.
for more ambitious settings that would learn from
words and real-world grounding (with or without
the POS tags). Even this starting point has proved
surprisingly difficult over decades of research, so
it has not been clear whether the POS sequences
even contain the necessary information.
Yet this task—like others that engineers, lin-
guists, or human learners might face—might be
solvable with general knowledge about the distri-
bution of human languages. An experienced lin-
guist can sometimes puzzle out the structure of
a new language. The reader may be willing to
guess a parse for the gold POS sequence VERB
DET NOUN ADJ DET NOUN. After all, adjectives
usually attach to nouns (Naseem et al., 2010), E
the adjective in this example seems to attach to the
first noun—not to the second, since determiners
usually fall at the edge of a noun phrase. Mean-
while, the sequence’s sole verb is apparently fol-
lowed by two noun phrases, which suggests ei-
ther VSO (verb-subject-object) or VOS order—
and VSO is a good guess as it is more common
(Dryer and Haspelmath, 2013). Observing a cor-
pus of additional POS sequences might help re-
solve the question of whether this language is pri-
marily VSO or VOS, for example, by guessing that
short noun phrases in the corpus (for example, un-
modified pronouns) are more often subjects.
Thus, we propose to solve the task by training a
kind of “artificial linguist” that can do such analy-
sis on corpora of new languages.
This is a general approach to developing an un-
supervised method for a specific type of dataset:
tune its structure and hyperparameters so that it
works well on actual datasets of that sort, and then
apply it to new datasets. For example, consider
clustering—the canonical unsupervised problem.
What constitutes a useful cluster depends on the
type of data and the application. Basu et al. (2013)
develop a text clustering system specifically to aid
teachers. Their “Powergrading” system can group
all the student-written answers to a novel question,
having been trained on human judgments of an-
swer similarity for other questions. Their novel
questions are analogous to our novel languages:
their unsupervised system is specifically tailored
to match teachers’ semantic similarity judgments
within any corpus of student answers, just as ours
is tailored to match linguists’ syntactic judgments
within any corpus of human-language POS se-
quences. Other NLP work on supervised tuning of
unsupervised learners includes strapping (Eisner
and Karakos, 2005; Karakos et al., 2007), which
tunes with the help of both real and synthetic
datasets, just as we will (§3).
Are such systems really “unsupervised”? Yes,
in the sense that they are able to discover desirable
structure in a new dataset. Unsupervised learners
are normally crafted using assumptions about the
data domain. Their structure and hyperparameters
may have been manually tuned to produce pleas-
ing results for typical datasets in that domain. In
the domain of POS corpora, we simply scale up
this practice to automatically tune a large set of pa-
rameters, which later guide our system’s search for
linguist-approved structure on each new human-
language dataset. Our system should be regarded
as “supervised” if the examples are taken to be en-
tire languages: after all, we train it to map un-
labeled corpora to usefully labeled corpora. But
once trained, it is “unsupervised” if the examples
are taken to be the sentences within a given corpus:
by analyzing the corpus, our system figures out
how to map sentences of that language to parses,
without any labeled examples in that language.
3 Data
We use two datasets in our experiment:
UD: Universal Dependencies version 1.2 (Nivre
et al., 2015) A collection of 37 dependency tree-
banks of 33 languages, tokenized and annotated
with a common set of POS tags and dependency
relations.3 In principle, our trained system could
be applied to predict UD-style dependency re-
lations in any tokenized natural-language corpus
with UD-style POS tags.
GD: Galactic Dependencies version 1.0 (Wang
and Eisner, 2016) A collection of dependency
treebanks for 53,428 synthetic languages (of
which we will use a subset). A GD treebank
is generated by starting with some UD treebank
and stochastically permuting the child subtrees
of nouns and/or verbs to match their orders in
other UD treebanks. For example, one of the
GD treebanks reflects what the English UD tree-
bank might have looked like if English had been
both VSO (like Irish) and postpositional (like
Japanese). This typologically diverse collection
3While it might have been preferable to use the expanded
and revised UD version 2.0, we wished to compare fairly with
GD 1.0, which is based on UD 1.2.
of resource-rich synthetic languages aims to pro-
pel the development of NLP systems that can han-
dle diverse natural languages, such as multilingual
parsers and taggers.
3.1 Why Synthetic Training Languages?
We hope for our system to do well, on average, at
matching real linguist-parsed corpora of real hu-
man languages. We therefore tune its parameters
Θ on such treebanks. UD provides training exam-
ples actually drawn from that distribution D over
treebanks—but alas, rather few. Thus to better es-
timate the expected performance of Θ under D,
we follow Wang and Eisner (2017) and augment
our training data with GD’s synthetic treebanks.
Ideally we would have sampled these synthetic
treebanks from a careful estimate ˆD of D: for ex-
ample, the mean of a Bayesian posterior for D,
derived from prior assumptions and UD evidence.
However, such adventurous “extrapolation” of un-
seen languages would have required actually con-
structing such an estimate ˆD—which would em-
body a distribution over semantic content and a
full theory of universal grammar! The GD tree-
banks were derived more simply and more con-
servatively by “interpolation” among the actual
UD corpora. They combine observed parse trees
(which provide attested semantic content) with
stochastic word order models trained on observed
languages (which attempt to mimic attested pat-
terns for presenting that content). GD’s sampling
distribution ˆD still offers moderately varied syn-
thetic datasets, which remain moderately realistic,
as they are limited to phenomena observed in UD.
As Wang and Eisner (2016) pointed out, syn-
thetic examples have been used in many other su-
pervised machine learning settings. A common
technique is to exploit invariance: if real image
z should be classified as a cat, then so should a
rotated version of image z. Our technique is the
same! We assume that if real corpus u should be
parsed as having certain dependencies among the
word tokens, then so should a version of corpus
u in which those tokens have been systematically
permuted in a linguistically plausible way.4 This
is analogous to how rotation systematically trans-
forms the image (rotating all pixels through the
same angle) in a physically plausible way (as real
objects do rotate relative to the camera). The sys-
tematicity is needed to ensure that the task on syn-
4Another example is back-translation.
thetic data is feasible. In our case, the synthetic
corpus then provides many sentences that have
been similarly permuted, which may jointly pro-
vide enough clues to guess the word order of this
synthetic language (for example, VSO vs. VOS in
§2) and thus recover the dependencies. See Wang
and Eisner (2018, §2) for related discussion.
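To make this analogy concrete, the sketch below reorders the dependents of a single head according to a toy word-order model. It is only an illustration of such a systematic permutation—the actual GD treebanks permute child subtrees using order models trained on other UD treebanks—and the relation labels and weights here are hypothetical.

```python
import random

def permute_dependents(head, dependents, order_weights, seed=0):
    """Place each dependent subtree to the left or right of its head according
    to a toy word-order model. order_weights maps a relation label to a signed
    preferred position (< 0 = before the head, > 0 = after it). GD instead
    samples orders from models trained on other UD treebanks."""
    rng = random.Random(seed)
    # Give each dependent a noisy position score, then sort by it.
    scored = [(order_weights.get(d["rel"], 0.0) + rng.gauss(0.0, 0.1), d)
              for d in dependents]
    scored.sort(key=lambda pair: pair[0])
    left = [d for s, d in scored if s < 0]
    right = [d for s, d in scored if s >= 0]
    return left + [head] + right

# A VSO-like model puts both nominal arguments after the verb (toy weights).
deps = [{"rel": "nsubj", "form": "dog"}, {"rel": "dobj", "form": "ball"}]
print(permute_dependents({"rel": "root", "form": "chased"}, deps,
                         {"nsubj": 0.5, "dobj": 1.0}))
```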
With enough good synthetic languages to use
for training, even nearest-neighbor could be an
effective method. That is, one could obtain the
parser for a test corpus simply by copying the
trained parser for the most similar training cor-
pus (under some metric). Wang and Eisner (2016)
explored this approach of “single-source transfer”
from synthetic languages. Yet with only thousands
of synthetic languages, perhaps no single training
corpus is sufficiently similar.5 To draw on pat-
terns in many training corpora to figure out how to
parse the test corpus, we will train a single parser
that can handle all of the training corpora (Ammar
et al., 2016), much as we trained our typological
classifier in earlier work (Wang and Eisner, 2017).
5Wang and Eisner (2018) do investigate synthesis “on demand” of a permuted training corpus that is as similar as possible to the test corpus.

4 Task Formulation

An unsupervised parser for language ℓ is built without any gold parse trees for ℓ. However, we assume a corpus u of unparsed but POS-tagged sentences of ℓ is available. From u, we will extract statistics T(u) that are informative about the syntactic structure of ℓ, to guide us in parsing POS-tagged sentences of ℓ.

Overall, our approach is to train a “language-agnostic” parser—one that does not know what language ℓ it is parsing in. It produces a parse tree ˆy = ParseΘ(x; u) from a sentence x, constructing T(u) as an intermediate quantity that carries (for example) typological information about ℓ. The parameters Θ are shared by all languages, and determine how to construct and use T. To learn them, we will allow ℓ to range over training languages, and then test our ability to parse when ℓ ranges over novel test languages.

Our ParseΘ(x; u) system has two stages. First it uses a neural network to compute T(u) ∈ Rm, a vector that represents the typological properties of ℓ and resembles the language embedding of Ammar et al. (2016). Then it parses sentence x while taking T(u) as an additional input. We will give details of these two components in §6 and §7.

We assume in this paper that the input sentence x is given as a POS sequence: that is, our parser is delexicalized. This spares us from also needing language-specific lexical parameters associated with the specific vocabulary of each language, a problem that we leave to future work.

We will choose our universal parameter values by minimizing an estimate of their expected loss,

    ˆΘ = argmin_Θ  mean_{ℓ∈Ltrain}  Loss(Θ; x(ℓ), y(ℓ), u(ℓ))    (1)

where Ltrain is a collection of training languages (ideally drawn IID from the distribution D of possible human languages) for which some syntactic information is available. Specifically, each training language ℓ has a treebank (x(ℓ), y(ℓ)), where x(ℓ) is a collection of POS-tagged sentences whose correct dependency trees are given by y(ℓ). Each ℓ also has an unparsed corpus u(ℓ) (possibly equal to x(ℓ) or containing x(ℓ)). We can therefore define the parser’s loss on training language ℓ as

    Loss(Θ; x(ℓ), y(ℓ), u(ℓ)) = mean_{(x,y)∈(x(ℓ),y(ℓ))} loss(ParseΘ(x; u(ℓ)), y)    (2)

where loss(. . .) is a task-specific per-sentence loss (defined in §8.1) that evaluates the parser’s output ˆy on sentence x against x’s correct tree y.
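For concreteness, the nested averages in equations (1)–(2) can be read as the following sketch, where parse_fn stands in for ParseΘ and sentence_loss for the per-sentence loss of §8.1 (illustrative names, not our implementation):

```python
def expected_loss(theta, train_languages, parse_fn, sentence_loss):
    """Equation (1): mean over training languages of equation (2), which is
    itself the mean per-sentence loss on that language's treebank.
    parse_fn(theta, x, u) returns the predicted tree for sentence x, given
    the language's unparsed corpus u."""
    per_language = []
    for lang in train_languages:                      # lang has .x, .y, .u
        losses = [sentence_loss(parse_fn(theta, x, lang.u), y)
                  for x, y in zip(lang.x, lang.y)]
        per_language.append(sum(losses) / len(losses))     # equation (2)
    return sum(per_language) / len(per_language)           # equation (1)
```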
5 Related Work
5.1 Per-Language Learning
Many papers rely on some universal learning pro-
cedure to determine T(tu) (see §4) for a target lan-
guage. For example, T(·) may be the Expectation-
Maximization (EM) algorithm, yielding a PCFG
T(tu) that fully determines a CKY parser (Carroll
and Charniak, 1992; Klein and Manning, 2004).
Since EM and CKY are fixed algorithms, this ap-
proach has no trainable parameters.
Grammar induction tries to turn an unsuper-
vised corpus into a generative grammar. The ap-
proach of the previous paragraph is often modi-
fied to reduce model error or search error (§1). A
reduce model error, many papers have used de-
pendency grammar, with training objectives that
incorporate notions like lexical attraction (Yuret,
1998) and grammatical bigrams (Paskin, 2001,
2002).
The dependency model with valence
(DMV) (Klein and Manning, 2004) was the first
method to beat a simple right-branching heuris-
tic. Headden III et al. (2009) and Spitkovsky
et al. (2012) made the DMV more expressive by
considering higher-order valency or punctuation.
To reduce search error, strategies for eliminating
or escaping local optima have included convex-
ified objectives (Wang et al., 2008; Gimpel and
Smith, 2012), smart initialization (Klein and Man-
ning, 2004; Mareček and Straka, 2013), search
bias (Smith and Eisner, 2005, 2006; Naseem et al.,
2010; Gillenwater et al., 2010), branch-and-bound
search (Gormley and Eisner, 2013), and switching
objectives (Spitkovsky et al., 2013).
Unsupervised parsing (which is also our task)
tries to turn the same corpus directly into a tree-
bank, without necessarily finding a grammar. We
discuss some recent milestones here. Grave and
Elhadad (2015) propose a transductive learning
objective for unsupervised parsing, and a convex
relaxation of it. (Jiang et al. (2017) combined that
work with grammar induction.) Martínez Alonso
et al. (2017) create an unsupervised dependency
parser that is formally similar to ours in that it
uses cross-linguistic knowledge as well as statis-
tics computed from a corpus of POS sequences in
the target language. However, its cross-linguistic
knowledge is hand-coded: namely, the set of
POS-to-POS dependencies that are allowed by the
UD annotation scheme, and the typical directions
for some of these dependencies. The only cor-
pus statistic extracted from u is whether ADP-
NOMINAL or NOMINAL-ADP bigrams are more
frequent,6 which distinguishes prepositional from
postpositional languages. The actual parser starts
by identifying the head word as the most “central”
word according to a PageRank (Page et al., 1999)
analysis of the graph of candidate edges, and pro-
ceeds by greedily attaching words of decreasing
PageRank at lower depths in the tree.
5.2 Multi-Language Learning
This approach parses a “target” language using
the treebanks of other resource-rich languages as
“source” languages. There are two main variants.
Memory-based. This method trains a super-
vised parsing model on each source treebank. It
uses these (delexicalized) source-language models
to help parse the target sentence, favoring sources
that are similar to the target language. A common
similarity measure (Rosa and Žabokrtský, 2015a)
considers the probability of the target language’s
POS-corpus u under a trigram language model of
source-language POS sequences.

6In our notation of §6.1, below, this asks whether Σ_{t∈{NOUN,PRON,PROPN}} π^w_{t|ADP} is greater for w = 1 or w = −1.
Single-source transfer (SST) (Rosa and Žabokrtský, 2015a; Wang and Eisner, 2016)
simply uses the parser for the most similar source
treebank. Multi-source transfer (MST) (Rosa
and Žabokrtský, 2015a) parses the target POS
sequence with each of the source parsers, and
then combines these parses into a consensus tree
using the Chu-Liu-Edmonds algorithm (Chu,
1965; Edmonds, 1967). As a faster variant, model
interpolation (Rosa and Žabokrtský, 2015b) builds
a consensus model for the target language (via a
weighted average of source models’ parameters),
rather than a consensus parse for each target
sentence separately.
Memory-based methods require storing models
for all source treebanks, which is expensive when
we include thousands of GD treebanks (§3).
Model-based. This method trains a single
language-agnostic model. McDonald et al. (2011)
train a delexicalized parser on the concatenation
of all source treebanks, achieving a large gain over
grammar induction. This parser can learn univer-
sals such as the preference for determiners to at-
tach to nouns (which was hard-coded by Naseem
et al. (2010)). However, it is expected to parse a
sentence x without being told the language (cid:96) O
even a corpus u, possibly by guessing properties
of the language from the configurations it encoun-
ters in the single sentence x alone.
Further gains were achieved (Naseem et al.,
2012; Täckström et al., 2013b; Zhang and Barzi-
lay, 2015; Ammar et al., 2016) by providing the
parser with about 10 typological properties of
x’s language—for example, whether direct objects
generally fall to the right of the verb—as listed
in the World Atlas of Linguistic Structures (Dryer
and Haspelmath, 2013).
Tuttavia, relying on WALS raises some is-
sues. (1) The unknown language might not be in
WALS.7 (2) Some typological features are missing
for some languages. (3) All the WALS features
are categorical values, which loses useful infor-
mation about tendencies (for example, how often
the canonical word order is violated). (4) Not all
WALS features are useful—only 56 of them per-
tain to word order, and only 8 of those have been
7 2,679 out of about 7,000 world languages are in WALS.
used in past work. (5) With a richer parser (a stack
LSTM dependency parser), WALS features do not
appear to help at all on unknown languages (Am-
mar et al., 2016, footnote 30).
5.3 Exploiting Parallel Data
Some other work on generalizing from source
to target languages assumes the availability of
source-target parallel data, or bitext. Two uses:
Induction of multilingual word embeddings.
Similar to universal POS tags, multilingual word
embeddings serve as a universal representation
that bridges the lexical differences among lan-
guages. Guo et al. (2016) proposed two ap-
proaches: (1) Training a variant of the skip-gram
modello (Mikolov et al., 2013) by using bilingual
sets of context words. (2) Generating the embed-
ding of each target word by averaging the embed-
dings of the source words to which it is aligned.
Annotation projection. Given aligned bitext,
one can generate an approximate parse for a tar-
get sentence by “projecting” the parse tree of the
corresponding source sentence. A target-language
parser can then be trained from these approxi-
mate parses. The idea was originally proposed
by Yarowsky et al. (2001), and then applied to
dependency parsing on low-resource languages
(Hwa et al., 2005; Ganchev et al., 2009; Smith
and Eisner, 2009; Tiedemann, 2014, inter alia).
McDonald et al. (2011) extend this approach to
multiple source languages by projected transfer.
Later work in this vein mainly tries to improve
the approximate parses, including translating the
source treebanks into the target language with an
off-the-shelf machine translation system (Tiede-
mann et al., 2014), augmenting the trees with
weights (Agić et al., 2016), and using only partial
trees with high-confidence alignments (Rasooli
and Collins, 2015, 2017; Lacroix et al., 2016).
5.4 Situating Our Work
Our own approach can be categorized as model-
based multi-language learning with no parallel
text or target-side supervision. Tuttavia, we also
analyze an unparsed corpus u of the target lan-
guage, as the per-language systems of §5.1 do.
Our analysis of u does not produce a specialized
target grammar or parser, but only extracts a tar-
get vector T(tu) to be fed to the language-agnostic
parser. The analyzer is trained jointly with the
parser, over many languages.
Figure 1: A 2-layer typology component. The bias
vectors (bW) are suppressed for readability.
6 The Typology Component
Wang and Eisner (2017) extract typological prop-
erties of a language from its POS-tagged corpus u,
in effect predicting syntactic structure from super-
ficial features. Like them, we compute a hidden
layer T(tu) using a standard multilayer perceptron
architecture, Per esempio,
T(tu) = ψ(W π(tu) + bW ) ∈ Rh
(3)
where π(tu) ∈ Rd is the surface features of u,
W ∈ Rh×d maps π(tu) into a h-dimensional
spazio, bW ∈ Rh is a bias vector, and ψ is an
element-wise activation function. While equa-
zione (3) has only 1 layer, we explore versions with
from 0 A 3 layers (where T(tu) = π(tu) in the
0-layer case). A 2-layer version is shown in Fig-
ure 1. The number of layers is chosen by cross-
validation, as are h and the ψ function.
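As a minimal numerical sketch of equation (3) and its multi-layer variants (assuming only NumPy; the dimensions below are placeholders rather than the tuned values of §9.5):

```python
import numpy as np

def typology_vector(pi_u, layers, activation=np.tanh):
    """T(u) = psi(W pi(u) + b_W), applied once per layer. `layers` is a list
    of (W, b) pairs; an empty list gives the 0-layer case T(u) = pi(u)."""
    h = np.asarray(pi_u, dtype=float)
    for W, b in layers:
        h = activation(W @ h + b)
    return h

# Toy usage: d = 4 surface features mapped to an h = 2 dimensional T(u).
rng = np.random.default_rng(0)
W, b = rng.normal(size=(2, 4)), np.zeros(2)
print(typology_vector([0.1, 0.7, 0.2, 0.5], [(W, b)]))
```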
6.1 Design of the Surface Features π(tu)
To define π(tu), we used development data to se-
lect the following fast but effective subset of the
features proposed by Wang and Eisner (2017).
Hand-engineered features. Given a token j in a
sentence, let its right window Rj be the sequence
of POS tags pj+1, . . . , pj+w (padding the sentence
as needed with # symbols). w is the window size.
Define gw(T | j) ∈ [0, 1] to be the fraction of
words in Rj tagged with t. Now, given a corpus
tu, define
    π^w_t = mean_j g_w(t | j),    π^w_{t|s} = mean_{j: t_j = s} g_w(t | j)

where j ranges over tokens of u. The unigram
prevalence π^w_t measures the frequency of t over-
all, while the bigram prevalence π^w_{t|s} measures
the frequency with which t can be found to the right
of an average s tag (in a window of size w). For
each of these quantities, we have a corresponding
mirror-image quantity (denoted by negating w) by
computing it on a reversed version of the corpus.
The final hand-engineered π(tu) includes:
• π^w_t, for each tag type t and each w ∈
{1, 3, 8, 100}. This quantity measures how
frequently t appears in u.

• π^w_{t|s}//π^w_t and π^{−w}_{t|s}//π^{−w}_t, for each tag type
pair s, t and each w ∈ {1, 3, 8, 100}. We de-
fine x//y = min(x/y, 1) to bound the fea-
ture values for better generalization. Notice
that if w = 1, the log of π^w_{t|s}/π^w_t is the bigram
pointwise mutual information. Each matched
pair of these quantities is intuitively related
to the word order typology—for example,
if ADPs are more likely to have closely
following than closely preceding NOUNs
(π^w_{NOUN|ADP}//π^w_NOUN > π^{−w}_{NOUN|ADP}//π^{−w}_NOUN),
the language is more likely to be preposi-
tional than postpositional.
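These statistics can be computed in one pass over the tagged corpus. The sketch below is an illustrative reimplementation, not our actual feature extractor: it pads each sentence with # symbols, computes g_w(t | j) over right windows, and averages to obtain π^w_t and π^w_{t|s}; the mirror-image −w variants come from running the same code on reversed sentences.

```python
from collections import defaultdict

def window_fractions(corpus, w):
    """For each token j, g_w(t|j) is the fraction of the next w tags equal to t.
    Returns unigram prevalences uni[t] and bigram prevalences bi[(t, s)],
    i.e. the mean of g_w(t|j) over all j, and over j whose own tag is s."""
    g_sums, g_counts = defaultdict(float), 0
    cond_sums, cond_counts = defaultdict(float), defaultdict(int)
    for sent in corpus:
        padded = list(sent) + ["#"] * w          # pad the right edge
        for j, s in enumerate(sent):
            window = padded[j + 1 : j + 1 + w]
            g_counts += 1
            cond_counts[s] += 1
            for t in set(window):
                frac = window.count(t) / w
                g_sums[t] += frac
                cond_sums[(t, s)] += frac
    uni = {t: v / g_counts for t, v in g_sums.items()}
    bi = {(t, s): v / cond_counts[s] for (t, s), v in cond_sums.items()}
    return uni, bi

# Feature ratios then use x // y = min(x / y, 1).
uni, bi = window_fractions([["VERB", "DET", "NOUN", "ADJ"]], w=3)
print(round(uni["NOUN"], 3), round(bi[("NOUN", "VERB")], 3))
```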
Neural features.
In contrasto, our neural features
automatically learn to extract arbitrary predictive
configurations. As Figure 2 shows, we encode
each POS-tagged sentence ui ∈ u using a recur-
rent neural network, which reads one-hot POS em-
beddings from left to right, then outputs its final
hidden state vector f i as the encoding. The final
neural π(tu) is the average encoding of all sen-
tences (average-pooling): that is, the average of all
sentence-level configurations. We specifically use
a gated recurrent unit (GRU) network (Cho et al.,
2014). The GRU is jointly trained with all other
parameters in the system so that it focuses on de-
tecting word-order properties of u that are useful
for parsing.
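A minimal PyTorch-style sketch of this encoder, assuming a one-layer GRU over one-hot tag vectors (the tagset size and hidden size below are placeholders, not our tuned configuration):

```python
import torch
import torch.nn as nn

class NeuralSurfaceFeatures(nn.Module):
    """Sketch of the neural feature extractor: a GRU reads one-hot POS
    embeddings of each sentence left to right; the final hidden state f_i
    encodes sentence u_i, and pi(u) is the average of all f_i."""
    def __init__(self, num_tags, hidden_size):
        super().__init__()
        self.num_tags = num_tags
        self.gru = nn.GRU(num_tags, hidden_size, batch_first=True)

    def forward(self, sentences):
        # sentences: list of 1-D LongTensors of tag ids (one per sentence)
        finals = []
        for tags in sentences:
            onehot = torch.nn.functional.one_hot(tags, self.num_tags).float()
            _, h_n = self.gru(onehot.unsqueeze(0))   # h_n: (1, 1, hidden)
            finals.append(h_n.squeeze())
        return torch.stack(finals).mean(dim=0)       # average pooling

# Toy usage with a hypothetical 17-tag UD tagset:
enc = NeuralSurfaceFeatures(num_tags=17, hidden_size=8)
pi_u = enc([torch.tensor([3, 0, 5]), torch.tensor([3, 7, 5, 5])])
print(pi_u.shape)   # torch.Size([8])
```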
7 The Parsing Architecture
To construct Parse(X; tu), we can extend any sta-
tistical parsing architecture Parse(X) to be sen-
sitive to T(tu).
For our experiments, we ex-
tend the delexicalized graph-based implementa-
tion of the BIST parser (Kiperwasser and Gold-
berg, 2016)—an arc-factored dependency model
with neural context features extracted by a bidi-
rectional LSTM. This recent parser was the state
of the art when it was published.
Figure 2: Computing the neural feature vector π(u).

Given a POS-sentence x and a corpus u, our
parser first computes an unlabeled projective tree
    argmax_{y∈Y(x)} score(x, y; u)    (4)
Dove, letting a range over the arcs in tree y,
score(X, sì; tu) =
(cid:88)
a∈y
S(φ(UN; X, tu))
(5)
With this definition, the argmax in (4) is com-
puted efficiently by the algorithm of Eisner (1996).
S(·) is a neural scoring function on vectors,
S(φ(· · · )) = v tanh(V φ(· · · ) + bV )
(6)
where V is a matrix, bV is a bias vector, and v is
a vector, all being parameters in Θ.
The function φ(UN; X, tu) extracts the feature vec-
tor of arc a given x and u. BIST scores unlabeled
arcs, so a denotes a pair (io, j)—the indices of the
parent and child, rispettivamente. We define
φ(UN; X, tu) = [B(X, io; T(tu)); B(X, j; T(tu))]
(7)
which concatenates contextual representations of
tokens i and j. B(X, io) is itself a concatenation
of the hidden states of a left-to-right LSTM and a
right-to-left LSTM (Graves, 2012) when each has
read sentence x up through word i (really POS tag
io). These LSTM parameters are included in Θ.
The POS tags in x are provided to the LSTMs as
one-hot vectors. Crucially, T(tu) is also provided
to the LSTM at each step, as shown in Figure 3.
After selecting the best tree via equation (4), we
use each arc’s φ vector again to predict its label.
This yields the labeled tree ˆy = ParseΘ(X; tu).
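To make the scoring pipeline of equations (5)–(7) concrete, here is a minimal PyTorch-style sketch; the dimensions are placeholders and the handling of the root symbol is omitted, so this is an illustration rather than the BIST implementation itself:

```python
import torch
import torch.nn as nn

class ArcScorer(nn.Module):
    """Sketch of equations (5)-(7): a BiLSTM reads [one-hot POS; T(u)] at
    each position; an arc (i, j) is scored by v . tanh(V [b_i; b_j] + b_V)."""
    def __init__(self, num_tags, t_dim, lstm_dim=125, hidden=100):
        super().__init__()
        self.bilstm = nn.LSTM(num_tags + t_dim, lstm_dim,
                              bidirectional=True, batch_first=True)
        self.V = nn.Linear(4 * lstm_dim, hidden)   # [b_i; b_j] -> hidden
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, pos_onehot, t_u):
        # pos_onehot: (n, num_tags); t_u: (t_dim,), shared by all positions
        n = pos_onehot.size(0)
        inp = torch.cat([pos_onehot, t_u.expand(n, -1)], dim=1)
        b, _ = self.bilstm(inp.unsqueeze(0))        # (1, n, 2*lstm_dim)
        b = b.squeeze(0)
        # Score every candidate arc (i = head index, j = child index).
        phi = torch.cat([b.unsqueeze(1).expand(n, n, -1),
                         b.unsqueeze(0).expand(n, n, -1)], dim=-1)
        return self.v(torch.tanh(self.V(phi))).squeeze(-1)   # (n, n) scores
```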
The only extension that this makes to BIST is
to supply T(u) to the BiLSTM.8 This extension
is not a significant slowdown at test time, since
T(u) only needs to be computed once per test lan-
guage, not once per test sentence. Since T(u) can
be computed for any novel language at test time,
this differs from the “many languages, one parser”
architecture (Ammar et al., 2016), in which a test-
time language must have been seen at training time
or at least must have known WALS features.

8An alternative would be to concatenate T(u) with the representation computed by the BiLSTM. This gets empirically worse results, probably because the BiLSTM does not have advance knowledge of language-specific word order as it reads the sentence. We also tried an architecture that does both, with no notable improvement.

Figure 3: The architecture of the delexicalized graph-based BIST parser with the introduction of T(u), where si,j in each cell is the arc score s(φ(a; x, T(u))) from equation (6). The root of the tree is always position 0, where x0 is a distinguished “root” symbol that is prepended to the input sentence.

Product of experts. We also consider a variant
of the function (6) for scoring arc a, namely

    λ sH(a) + (1 − λ) sN(a)    (8)

where sH(a) and sN(a) are the scores produced by
separately trained systems using, respectively, the
hand-engineered and neural features from §6.1.
Hyperparameter λ ∈ [0, 1] is tuned on dev data.

8 Training the System

8.1 Training Objective

We exactly follow the training method of Kiper-
wasser and Goldberg (2016), who minimize a
structured max-margin hinge loss (Taskar et al.,
2004; McDonald et al., 2005; LeCun et al., 2007).
We want the correct tree y to beat each tree y′ by
a margin equal to the number of errors in y′ (we
count spurious edges). Formally, loss(x, y; u) is
given by

    max(0, −score(x, y; u) + max_{y′} [ score(x, y′; u) + Σ_{a∈y′} 1_{a∉y} ])    (9)

where score(x, y′; u) is the model score and
Σ_{a∈y′} 1_{a∉y} is the precision error; a ranges over
the arcs of a tree y′, and 1_{a∉y} is an indicator that
is 1 if a ∉ y. Thus, this loss function is high if
there exists a tree y′ that has a high score relative
to y yet low precision.9

9Formally, for this loss function to be used in equation (2), we must interpret ParseΘ in that equation as returning a forest of scored parses, not just a single parse.

The training algorithm makes use of loss-
augmented inference (Taskar et al., 2005), a vari-
ant on the ordinary inference of (4). The most vi-
olating tree y′ (in the max_{y′} above) is computed
again by an arc-factored dependency algorithm
(Eisner, 1996), where the score of any candidate
arc a is s(φ(a; x, u)) + 1_{a∉y}.

Actually, the above method would only train the
score function to predict the correct unlabeled tree
as above (since a ranges over unlabeled arcs as be-
fore). In practice, we also jointly train the labeler
to predict the correct labels on the gold arcs, using
a separate hinge-loss objective. Because these two
components share parameters through φ(a; x, u),
this is a multi-task learning problem.

8.2 Training Algorithm

Augment training data. Unlike ordinary NLP
problems whose training examples are sentences,
each training example in equation (1) is an entire
language. Unfortunately, UD (§3) only provides a
few dozen languages—presumably not enough to
generalize well to novel languages. We therefore
augment our training dataset Ltrain with thou-
sands of synthetic languages from the GD dataset
(§3), as already discussed in §3.1.

10More precisely, we use Adam (Kingma and Ba, 2015), a popular variant of SGD. The parameters Θ are initialized by “Xavier initialization” (Glorot and Bengio, 2010).

Stochastic gradient descent (SGD).10 Treating
each language as a single large example during
training would lead to slow SGD steps. Instead,
we take our SGD examples to be individual sen-
tences, by regarding equations (1)–(2) together as
an objective averaged over sentences. Each ex-
ample (X, sì, tu) is sampled hierarchically, by first
drawing a language (cid:96) from Ltrain and setting u =
tu((cid:96)), then drawing the sentence (X, sì) uniformly
from (X((cid:96)), sì((cid:96))). We train using mini-batches of
100 sentences; each mini-batch can mix many lan-
guages.
Encourage real languages. To sample (cid:96) from
Ltrain, we first flip a coin with weight β ∈ [0, 1] A
choose “real” vs. “synthetic,” and then sample uni-
formly within that set. Why? The test sentences
will come from real languages, so the synthetic
languages are out-of-domain. Including them re-
duces variance but increases bias. We raise β to
keep them from overwhelming the real languages.
Sample efficiently. The sentences (X, sì) are
stored in different files by language. To reduce
disk accesses, we do not visit a file on each
sample. Piuttosto, for each language (cid:96), we maintain
in memory a subset of (x(ℓ), y(ℓ)), obtained by
reservoir sampling. Samples from (x(ℓ), y(ℓ))
are drawn sequentially from this “chunk,” and
when it is used up we fetch a new chunk. We also
maintain u((cid:96)) and the hand-engineered features
from π(tu((cid:96))) in memory.
9 Experiments
9.1 Basic Setup
Our data split follows that of Wang and Eisner
(2017), as shown in Table 2,11 which has 18
training languages (20 treebanks) E 17 test lan-
guages. All hyperparameters are tuned via 5-fold
cross-validation on the 20 training treebanks—that
È, we evaluate each fold (4 treebanks) using the
model trained on the remaining folds (16 tree-
banks). However, a model trained on a treebank of
lingua (cid:96) is never evaluated on another treebank
of language (cid:96). We selected the hyperparameters
that maximized the average unlabeled attachment
score (UAS) (Kübler et al., 2009), which is the
evaluation metric that is reported by most previ-
11However, as we are interested in transfer to unseen lan-
guages, our Table 2 follows the principle of Eisner and Wang
(n.d.) and does not test on the Finnishftb or Latin treebanks
because other treebanks of those languages appeared in train-
ing data. Specifically, Latinitt and Latinproiel fall in the same
training folds as French and Italian, respectively. For the
same reason, Table 2 does not show cross-validation devel-
opment results on these Latin treebanks—nor on the Ancient
Greekgrc and Ancient Greekgrc_proiel treebanks, which fall in
the same training folds as Czech and Danish, respectively.
ous work on unsupervised parsing. We also report
labeled attachment score (LAS).12
When augmenting the data, the 16 training tree-
banks are “mixed and matched” to get GD tree-
banks for 16×17×17 = 4624 additional synthetic
training languages (Wang and Eisner, 2016, §5).
The next sections analyze these cross-validation
results. Finally, §9.8 will evaluate on 15 pre-
viously unseen languages (excluding Latin and
Finnishftb) with our model trained on all 18 train-
ing languages (20 treebanks for UD, plus 20×21×
21 = 8840 when adding GD) with the hyperpa-
rameters that achieved the best average unlabeled
attachment score during cross-validation.
The UD and GD corpora provide a train/dev/test
split of each treebank, denoted as (xtrain, ytrain),
(xdev, ydev) E (xtest, ytest). Throughout this
paper, for both training and testing languages, we
take (X((cid:96)), sì((cid:96))) = (xtrain, ytrain). We take u((cid:96))
to consist of all xtrain sentences with ≤ 40 tokens.
9.2 Comparison Among Architectures
Tavolo 1 shows the cross-validation parsing results
over different systems discussed so far. For each
architecture, we show the best average unlabeled
attachment score (the UAS column) chosen by
cross-validation, and the corresponding labeled at-
tachment score (the LAS column).
In brief, the
main sources of improvement are twofold:
Synthetic languages. We observe that +GD
consistently outperforms UD across all architec-
tures. It even helps with the baseline system that
we tried, which simply ignores the target cor-
pus u((cid:96)).
In that system (similar to McDonald
et al. (2011)), the BiLSTM may still manage to
extract (cid:96)-specific information from the single sen-
tence x ∈ x((cid:96)) that it is parsing.13 The additional
GD training languages apparently help it learn to
do so in a way that generalizes to new languages.
12When reporting LAS and when studying the labeling er-
rors in §9.7, we would ideally have first re-tuned our sys-
tem to optimize LAS via cross-validation. Unfortunately,
these potentially improved LAS results would have required
months of additional computation. The optimal hyperparam-
eters may not be very different, Tuttavia, since UAS and LAS
rose and fell together when we varied other training condi-
tions in Figures 4–6.
13That is, our baseline system has learned a single parser
that can handle a cross-linguistic variety of POS sequences
(cf. McDonald et al., 2011; Ammar et al., 2016, section 4.2),
just as the reader was able to parse VERB DET NOUN ADJ
DET NOUN in §2.
                        UAS                 LAS
System                UD       +GD        UD       +GD
SST                   66.22*   65.70      50.40    50.54
Baseline              63.95    67.97      48.46    52.78
features:
  H                   64.83    69.41      49.41    53.63
  N                   65.30    70.06      49.43    54.19
  H;N                 65.26    69.62      49.67    53.68
  H+N                 67.34*   70.65*     52.02*   55.18*
oracle:
  TD                  65.94    70.01*     49.77    53.43
  TW                  64.84    69.75      49.30    53.79

Table 1: Average parsing results over 16 languages,
computed by 5-fold cross-validation. We compare
training on real languages only (the UD column) ver-
sus augmenting with synthetic languages at β = 0.2
(the +GD column). Baseline is the ablated system that
omits T(tu) (§9.2). SST is the single-source transfer
approach (§5.2). H and N use only hand-engineered
features or neural features, while H;N defines π(tu)
to concatenate both (§6.1) and H+N is the product-of-
experts model (§7). TD and TW incorporate oracle
knowledge of the target-language syntax (§9.4). For
each comparison between UD and +GD, we boldface
the better (higher) result, or both if they are not signifi-
cantly different (paired permutation test over languages
with p < 0.05). In each column, we star the best result
as well as all results that are not significantly worse.
To better understand the trend, we study how
the performance varies when more synthetic lan-
guages are used. As shown in Figure 4, when β =
1, all the training languages are sampled from real
languages. By gradually increasing the propor-
tion of GD languages (reducing β from §8.2), the
baseline UAS increases dramatically from 63.95
to 67.97. However, if all languages are uniformly
sampled (β = 16/(4624+16) ≈ 0.003) or only synthetic
languages are used (β = 0), the UAS falls back
slightly to 67.42 or 67.36. The best β value is 0.2,
which treats each real language as
(0.2/16)/(0.8/4624) ≈ 72
times more helpful than each synthetic language,
yet 80% of the training data is contributed by syn-
thetic languages. β = 0.2 was also optimal for the
non-baseline systems in Table 1.
Unparsed corpora. The systems that exploit
unparsed corpora consistently outperform the
baseline system in both the UD and +GD condi-
tions. To investigate, we examine the impact of
reducing u((cid:96)) when parsing a held-out language (cid:96).
We used the system in row N and column +GD
of Table 1, which was trained on full-sized u cor-
pora. When testing on a held-out language (cid:96), we
compute T(u((cid:96))) using only a random size-t sub-
set of u((cid:96)). As shown in Figure 5, the system does
Figure 4: Effect of β. The UAS and LAS (y-axis) of
the baseline system as a function of β (x-axis).
Figure 5: Effect of the size |u((cid:96))| of the unparsed cor-
pus. The y-axis represents the cross-validation UAS
and LAS scores, averaged over the 7 languages that
have |u((cid:96))| ≥ 9000 sentences, when using only a sub-
set of the sentences from u((cid:96)). Using all of u((cid:96)) would
achieve 64.61 UAS and 49.04 LAS. The plot shows the
average over 10 runs with different random subsets; the
error bars indicate the 10th to the 90th percentile of
those runs. The 7 languages are Finnish (Finnic), Nor-
wegian (Germanic), Dutch (Germanic), Czech (Slavic),
German (Germanic), Hindi (Indic), and English (Ger-
manic).
not need a very large unparsed corpus—most of
the benefit is obtained by t = 256. Nonetheless,
a larger corpus always achieves a better and more
stable performance.
9.3 Comparison to SST
Besides Baseline, another directly comparable ap-
proach is SST (§5.2). As shown in Table 1, SST
gives a stronger baseline on the UD column—as
good as H+N. However, this advantage does not
carry over to the +GD column, meaning that SST
cannot exploit the extra training data. Wang and
Eisner (2016, Figure 5) already found that GD
languages provide diminishing benefit to SST as
more UD languages get involved.14 For H+N,
however, the extra GD languages do help to iden-
tify the truly useful surface patterns in u.
We also considered trying model interpolation
(Rosa and Žabokrtský, 2015b). Unfortunately, as
mentioned in §5.2, this method is impractical with
GD languages, because it requires storing 4624
(§9.1) additional local models. Nonetheless, we
can estimate an “upper bound” on how well the
interpolation might do. Our upper bound is SST
where an oracle is used to choose the source lan-
guage; Rosa and Žabokrtský (2015b) found that in
practice, this does better than interpolation. This
approximate upper bound is 68.03 UAS and
52.10 LAS, neither of which is significantly bet-
ter than H+N on UD, but both of which are signif-
icantly outperformed by H+N on +GD.
9.4 Oracle Typology vs. Our Learned T(u)
The results in Table 1 demonstrate that we learned
to extract features T(u), from the unparsed tar-
get corpus u, that improve the baseline parser. We
consider replacing T(u) by an oracle that has ac-
cess to the true syntax of the target language. We
consider two different oracles, TD and TW.
TD is the directionalities typology that was
studied by Liu (2010) and used as a training tar-
get by Wang and Eisner (2017). Specifically,
TD ∈ [0, 1]57 is a vector of the directionalities
of each type of dependency relation; it specifies
what fraction of direct objects fall to the right of
the verb, and so on.15 In principle, this should be
very helpful for parsing, but it must be extracted
from a treebank, which is presumably unavailable
for unknown languages.
We also consider TW—the WALS features—
as the typological classification given by linguists.
This resembles the previous multi-language learn-
ing approaches (Naseem et al., 2012; Täckström
et al., 2013b; Zhang and Barzilay, 2015; Am-
mar et al., 2016) that exploited the WALS fea-
tures. In particular, we use 81A, 82A, 83A, 85A,
86A, 87A, 88A and 89A—a union of WALS fea-
tures used by those works. In order to derive the
WALS features for a synthetic GD language, we
first copy the features from its substrate language
14The number of real treebanks in our cross-validation set-
ting is 16, greater than the 10 in Wang and Eisner (2016).
15The directionality of a relation a in language ℓ is given by countℓ(a→) / countℓ(a), where countℓ(a→) is the count of a-relations that point from left to right, and countℓ(a) is the count of all a-relations.
(Wang and Eisner, 2016). We then replace the
81A, 82A, 83A features—which concern the order
between verbs and their dependents—by those of
its V-superstrate language16 (if any). We replace
85A, 86A, 87A, 88A and 89A—which concern
the order between nouns and their dependents—
by those of its N-superstrate language (if any).
As a pleasant surprise, we find that our best sys-
tem (H+N) is competitive with both oracle meth-
ods.
It outperforms both of them on both UAS
and LAS, and the improvements are significant
and substantial in 3 of these 4 cases. Our parser
has learned to extract information T(u) that is not
only cheap (no treebank needed), but also at least
as useful as “gold” typology for parsing.
9.5 Selected Hyperparameter Settings
For the rest of the experiments, we use the H+N
system, as it wins under cross-validation on both
UD and +GD (Table 1). This is a combination via
(8) of the best H system and the best N system
under cross-validation, with the mixture hyperpa-
rameter λ also chosen by cross-validation.
For both UD and +GD, cross-validation se-
lected 125 as the sizes of the LSTM hidden states
and 100 as the sizes of the hidden layers for scor-
ing arcs (the length of v in equation (6)).
Hyperparameters for UD. The H system com-
putes T(u) with a 1-layer network (as in equa-
tion (3)), with hidden size h = 128 and ψ = tanh
as the activation function. For the N system, T(u)
is a 1-layer network with hidden size h = 64 and
ψ = sigmoid as the activation function. The size
of the hidden state of GRU as shown in Figure 2 is
128. The mixture weight for the final H+N system
is λ = 0.5.
Hyperparameters for +GD. The H system
computes T(u) with a 2-layer network (as shown
in Figure 1), with h = 128 and ψ = sigmoid for
both hidden layers. For N, T(u) is a 1-layer net-
work with hidden size h = 64 and ψ = sigmoid.
The size of the hidden state of GRU is 256. Both H
and N set β = 0.2 (see §8.2). The mixture weight
for the final H+N system is λ = 0.4.
9.6 Performance on Noisy Tag Sequences
We test our trained system in a more realistic sce-
nario where both u and x for held-out languages
16The language whose word order model is used to per-
mute the dependents of the verbs. See Wang and Eisner
(2016) for details.
consist of noisy POS tags rather than gold POS
tags. Following Wang and Eisner (2016, Ap-
pendix B), at test time, the gold POS tags in a cor-
pus are replaced by a noisy version produced by
the RDRPOSTagger (Nguyen et al., 2014) trained
on a subset of the original gold-tagged corpus.17
Figure 6 shows a linear relationship between the
performance of our best model (H+N with +GD)
and the noisiness of the POS tags, which is con-
trolled by altering the amount of training data.
With only 100 training sentences, the performance
suffers greatly—the UAS drops from 70.65 to
51.57. Nonetheless, even this is comparable to
Naseem et al. (2010) on gold POS tags, which
yields a UAS of 50.00. That system was the first
grammar induction approach to exploit knowledge
of the distribution of natural languages, and re-
mained state-of-the-art (Noji et al., 2016) until
the work of Mareček (2016) and Martínez Alonso
et al. (2017).

17Another way to get noisy tags, as a reviewer notes, would have been to use a cross-lingual POS tagger designed for low-resource settings (Täckström et al., 2013a; Kim et al., 2017).

Figure 6: Performance on noisy input over 16 training languages. Each dot is an experiment annotated by the number of sentences used to train the tagger. (The rightmost “∞” point uses gold tags instead of a tagger, which is the result from Table 1.) The x-axis gives the average accuracy of the trained RDRPOSTagger. The y-axis gives the average parsing performance.

Figure 7: Evaluation by dependency relation type, showing an equal-weighted average of the 16 development languages. Each vertical bar spans the range from labeled F1 (bottom) to unlabeled F1 (top), with error bars given by bootstrap resampling over the 16 languages. Precision and recall are also indicated. The pattern is that F1, precision, and recall—both labeled and unlabeled—are improved over baseline when we exploit unlabeled corpora (+T(u)), and improved again when we augment training data (+T(u)+GD). The relations are sorted by their average gold proportion in the 16 languages, shown by the gray area and right vertical axis. For example, nmod is the most common relation, accounting for 15.5% of all arcs. Altogether, the 20 most frequent relations (shown here) account for 94% of the arcs.

9.7 Analysis by Dependency Relation Type

Figure 7 breaks down the results by dependency
relation type—showing that using u and synthetic
data improves results almost across the board.
We also notice large differences between la-
beled and unlabeled F1 scores for some relations,
especially rarer ones. In other words, the sys-
tem mislabels the arcs that it correctly recov-
ers. (Remember from §9.2 that the hyperparam-
eters were selected to maximize unlabeled scores
(UAS) rather than labeled (LAS).)

Figure 8 gives the label confusion matrix.
While the dark NONE column shows that arcs of
each type are often missed altogether (recall er-
rors), the dark diagonal shows that they are usually
labeled correctly if found. That said, it is relatively
common to confuse the different labels for nom-
inal dependents of verbs (nsubj, dobj, nmod).
We suspect that lexical information could help sort
out these roles via distributional semantics. Some
other mistakes arise from discrepancies in the an-
notation scheme. For example, neg can be easily
confused with advmod, as some languages (for
example, Spanish) use ADV instead of PART for
negations.

9.8 Final Evaluation on Test Data

In all previous sections, we evaluated on the 16
languages in the training set by cross-validation.
For the final test, we combine all the 20 tree-
banks and train the system with the hyperparam-
eters given in §9.5, then test on the 15 unseen test
languages. Table 2 displays results on these 15
test languages (top) as well as the cross-validation
results on the 16 languages (bottom).

We see that we improve significantly over base-
line on almost every language. Indeed, on the test
languages, +T(u) improves both UAS and LAS
by > 3.5 percentage points on average. The im-
provement grows to > 5.6 if we augment the train-
ing data as well (+GD, meaning +T(u)+GD).

One disappointment concerns the added benefit
on the LAS of +GD over just +T(u): while this
data augmentation helped significantly on nearly
every one of the 16 development languages, it
produced less consistent improvements on the test
languages and hurt some of them. We suspect that
this is because we tuned the hyperparameters to
maximize UAS, not LAS (§9.2). As a result, while
the average benefit across our 15 test languages
was fairly large, this sample was not large enough
to establish that it was significantly greater than 0,
that is, that future test languages would also see an
improvement from data augmentation.

We also notice that there seems to be a small
difference between the pattern of results on devel-
opment versus test languages. This may simply
reflect overfitting to the development languages,
but we also note that the test languages (chosen by
Figure 7: Evaluation by dependency relation type, showing an equal-weighted average of the 16 development languages. Each vertical bar spans the range from labeled F1 (bottom) to unlabeled F1 (top), with error bars given by bootstrap resampling over the 16 languages. Precision and recall are also indicated. The pattern is that F1, precision, and recall—both labeled and unlabeled—are improved over baseline when we exploit unlabeled corpora (+T(u)), and improved again when we augment training data (+T(u)+GD). The relations are sorted by their average gold proportion in the 16 languages, shown by the gray area and right vertical axis. For example, nmod is the most common relation, accounting for 15.5% of all arcs. Altogether, the 20 most frequent relations (shown here) account for 94% of the arcs.
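A minimal sketch of that bootstrap procedure (generic; bootstrap_interval is a name introduced here, not the exact script behind the figure) resamples the per-language scores with replacement and reports a percentile interval:

    import random
    from typing import List, Sequence, Tuple


    def bootstrap_interval(
        per_language_scores: Sequence[float],
        samples: int = 1000,
        alpha: float = 0.05,
        seed: int = 0,
    ) -> Tuple[float, float]:
        # Percentile bootstrap confidence interval for the mean score,
        # resampling the languages (here 16 of them) with replacement.
        rng = random.Random(seed)
        n = len(per_language_scores)
        means: List[float] = []
        for _ in range(samples):
            resample = [per_language_scores[rng.randrange(n)] for _ in range(n)]
            means.append(sum(resample) / n)
        means.sort()
        lo = means[int((alpha / 2) * samples)]
        hi = means[int((1 - alpha / 2) * samples) - 1]
        return lo, hi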
10 Conclusion and Future Work
We showed how to build a “language-agnostic”
delexicalized dependency parser that can better
parse sentences of an unknown language by ex-
ploiting an unparsed (but POS-tagged) corpus of
that language. Unlike grammar induction, which
estimates a PCFG from the unparsed corpus, we
train a neural network to extract a feature vec-
tor from the unparsed corpus that helps a subse-
quent neural parser. By end-to-end training on the
treebanks of many languages (optionally includ-
ing synthetic languages), our neural network can
extract linguistic information that helps neural de-
pendency parsing.
Variants of our architecture are possible. In fu-
ture work, the neural parser could use attention to
look at individual relevant sentences of u, which
are posited to be triggers in some theories of child
grammar acquisition (Gibson and Wexler, 1994;
Frank and Kapur, 1996). We could also try in-
jecting T(u) into the neural parser by means other
than concatenating it with the input POS embed-
dings. We might also consider parsing architec-
tures other than BIST, such as the LSTM-Minus
architecture for scoring spans (Cross and Huang,
2016), or the recent attention-based arc-factored
model (Dozat and Manning, 2017). Finally, our
approach is applicable to tasks other than depen-
dency parsing, such as constituent parsing or se-
mantic parsing—if suitable treebanks are available
for many training languages.
For applied uses,
it would be interesting to
combine the unsupervised techniques of this pa-
per with low-resource techniques that make use of
some annotated or parallel data in the target lan-
guage. It would also be interesting to include fur-
ther synthetic languages that have been modified
to better resemble the actual target languages, us-
ing the method of Wang and Eisner (2018).
It is important to relax the delexicalized assumption. As shown in §9.6, the performance
of our system relies heavily on the gold POS
tags, which are presumably not available for un-
known languages. What is available is lexical
information—which has proved to be very impor-
tant for supervised parsing, and should help un-
supervised parsers as well. As discussed in §9.7,
some errors seem easily fixable by considering
word distributions. In the future, we will explore
ways to extend our cross-linguistic parser to work
with word sequences rather than POS sequences,
perhaps by learning a cross-language word rep-
resentation that is shared among training and test
languages (Ruder et al., 2017).
Language | UAS: B  +T(u)  +GD | LAS: B  +T(u)  +GD
Irish
49.89 54.34 57.59 | 27.07 31.46 35.32
Basque
Croatian | 65.03 67.78 68.65 | 48.68 52.29 53.68
65.91 68.37 70.46 | 50.1 56.73 57.89
Greek
Hebrew | 62.58 66.27 65.3 | 49.71 53.29 52.08
Hungarian | 58.5 64.13 70.02 | 42.85 47.73 49.99
Indonesian | 55.22 64.63 65.36 | 39.46 47.63 48.38
58.58 61.51 62.21 | 39.06 40.75 42.36
Japanese | 54.97 60.41 58.4 | 37.57 40.6 37.86
Slavonic | 68.79 71.13 71.54 | 40.03 43.95 44.12
40.38 34.2 57.25 | 30.06 24.6 47.14
Persian
72.15 76.85 78.28 | 50.08 54.85 58.15
Polish
Romanian | 66.55 69.69 71.18 | 50.9 53.42 55.17
Slovenian | 72.21 76.06 78.62 | 57.09 61.48 64.1
Swedish | 72.26 75.32 73.89 | 55.35 58.42 52.39
51.59 57.53 57.91 | 28.39 37.81 32.52
60.97 64.55 67.11 | 43.09 47.00 48.74
45.75 49.32 53.83 | 36.4 40.39 44.14
66.71 68.41 68.4 | 52.24 54.49 54.67
Norwegian | 68.35 70.89 71.22 | 52.33 56.01 56.37
64.31 68.77 72.42 | 50.19 55.16 57.95
Czech
Estonian | 72.67 79.88 81.67 | 42.81 51.32 52.57
Portuguese | 70.48 73.47 74.83 | 60.85 63.18 64.96
62.18 63.62 66.52 | 48.44 49.46 53.51
63.23 66.72 70.75 | 39.1 42.6 45.17
75.9 79.24 80.57 | 65.46 68.8 70.0
Bulgarian | 77.57 79.53 83.66 | 55.83 57.65 61.47
Finnish_fi | 53.73 58.03 60.44 | 34.68 39.55 43.15
74.57 76.88 79.34 | 64.1 66.83 68.48
French
59.63 62.58 60.31 | 45.84 48.28 47.98
Dutch
61.66 63.99 65.9 | 47.61 51.43 53.13
English
35.84 40.74 62.45 | 18.63 21.65 41.12
Hindi
70.65 75.36 78.03 | 60.8 65.45 68.23
Spanish
All Avg. | 62.51 65.99 68.94 | 45.86 49.59 52.07
Tamil
Avg.
Arabic
Danish
German
Gothic
Italian
Table 2: Data splits and final evaluation on the 15 test languages (top), along with cross-validation results on the 16 development languages (bottom) grouped by 5 folds (separated by dashed lines). For languages with multiple treebanks, we identify them by subscripts. We use "Slavonic" for Old Church Slavonic. Column B is the baseline that doesn't use T(u) (McDonald et al., 2011). +T(u) is our H+N system, and +GD is that system when the training data is augmented with synthetic languages. In comparing among these three systems, we boldface the highest score as well as all scores that are not significantly worse (paired permutation test, p < 0.05). If a row is an average over many sentences of a single language, then each paired datapoint is a sentence, so a significant improvement should generalize to new sentences. But if a row is an average over many languages, then each paired datapoint is a language (as in Table 1), so a significant improvement should generalize to new languages.
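A minimal sketch of such a paired permutation test (generic; paired_permutation_test is a name introduced here, not the exact script used for this table) flips the sign of each paired difference at random and compares the observed mean difference against the resulting null distribution:

    import random
    from typing import Sequence


    def paired_permutation_test(
        scores_a: Sequence[float],
        scores_b: Sequence[float],
        trials: int = 10000,
        seed: int = 0,
    ) -> float:
        # Two-sided p-value for the mean difference between paired scores.
        # Each index i is one paired datapoint (a sentence, or a language when
        # the row is an average over languages).  Under the null hypothesis the
        # two systems are exchangeable, so each pair's difference may flip sign.
        diffs = [a - b for a, b in zip(scores_a, scores_b)]
        observed = abs(sum(diffs) / len(diffs))
        rng = random.Random(seed)
        hits = 0
        for _ in range(trials):
            permuted = [d if rng.random() < 0.5 else -d for d in diffs]
            if abs(sum(permuted) / len(permuted)) >= observed:
                hits += 1
        return (hits + 1) / (trials + 1)  # add-one smoothing avoids a p-value of 0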
Figure 8: The confusion matrix of our parser, as an
equal-weight average over 16 development languages.
Each row is normalized to sum to 1 and represents
a frequent gold relation. For example, the nsubj
row shows how well we recovered the gold nsubj
arcs; the (nsubj, dobj) entry shows p(predicted =
dobj | gold = nsubj), which measures the fraction
of nsubj relations that are recovered but mislabeled
as dobj. The diagonal represents correct arcs: where
dark, it indicates high labeled recall for that relation.
The final column represents gold arcs that were not re-
covered with any label: where dark, it indicates low
unlabeled recall for that relation. We show the top 20
relations sorted by gold frequency.
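The matrix itself can be assembled from gold and predicted arcs as in this sketch (a generic illustration assuming arcs are (head, dependent, label) triples; confusion_matrix is a name introduced here, not our plotting code):

    from collections import defaultdict
    from typing import Dict, Set, Tuple

    Arc = Tuple[int, int, str]  # (head index, dependent index, relation label)


    def confusion_matrix(gold: Set[Arc], pred: Set[Arc]) -> Dict[str, Dict[str, float]]:
        # Row-normalized confusion matrix p(predicted label | gold label).
        # Gold arcs whose head/dependent pair is not predicted at all fall into
        # the special NONE column (low unlabeled recall shows up there).
        pred_by_edge = {(h, d): r for h, d, r in pred}
        counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
        for h, d, gold_rel in gold:
            predicted_rel = pred_by_edge.get((h, d), "NONE")
            counts[gold_rel][predicted_rel] += 1
        matrix: Dict[str, Dict[str, float]] = {}
        for gold_rel, row in counts.items():
            total = sum(row.values())
            matrix[gold_rel] = {p: c / total for p, c in row.items()}
        return matrix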
One takeaway message from this work is
contained in our title.
Surface statistics of a
language—mined from the surface part-of-speech
order—provide clues about how to find the un-
derlying syntactic dependencies. Chomsky (1965)
imagined that such clues might be exploited by a
Language Acquisition Device, so it is interesting
to know that they do exist.
Another takeaway message is that synthetic
training languages are useful for NLP. Using syn-
thetic examples in training is a way to encourage a
system to be invariant to superficial variation. We
created synthetic languages by varying the surface
structure in a way that “should” preserve the deep
structure. This allows our trained system to be in-
variant to variation in surface structure, just as ob-
ject recognition wants to be invariant to an image’s
angle or lighting conditions (§3.1).
Our final takeaway goes beyond language: one
can treat unsupervised structure discovery as a su-
pervised learning problem. As §§1–2 discussed,
this approach inherits the advantages of supervised
learning. Training may face an easier optimization
landscape, and we can train the system to find the specific kind of structure that we desire, using any features that we think may be discriminative.
Acknowledgements This material
is based
upon work supported by the U.S. National
Science Foundation under Grants No. 1423276
and 1718846. We are grateful to the state of
Maryland for providing indispensable computing
resources via the Maryland Advanced Research
Computing Center (MARCC). Thanks to Kiper-
wasser and Goldberg (2016) for releasing their
Dynet parsing code, which was the basis for
our reimplementation in PyTorch. We thank the
Argo lab members for useful discussions. Finally,
we thank TACL action editor Marco Kuhlmann
and the anonymous reviewers for high-quality
suggestions.
References
Željko Agić, Anders Johannsen, Barbara Plank,
Héctor Martínez Alonso, Natalie Schluter, and
Anders Søgaard. 2016. Multilingual projec-
tion for parsing truly low-resource languages.
Transactions of the Association for Computa-
tional Linguistics, 4:301–312.
Waleed Ammar, George Mulcaire, Miguel Balles-
teros, Chris Dyer, and Noah Smith. 2016. Many
languages, one parser. Transactions of the As-
sociation of Computational Linguistics, 4:431–
444.
Yoeng-Jin Chu. 1965. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400.
Shay B. Cohen and Noah A. Smith. 2012.
Empirical
risk minimization for probabilis-
tic grammars: Sample complexity and hard-
ness of learning. Computational Linguistics,
38(3):479–526.
James Cross and Liang Huang. 2016.
Incre-
mental parsing with minimal features using bi-
directional LSTM. In Proceedings of the 54th
Annual Meeting of the Association for Compu-
tational Linguistics (Volume 2: Short Papers),
pages 32–37, Berlin.
Timothy Dozat and Christopher D. Manning.
2017. Deep biaffine attention for neural depen-
dency parsing. In Proceedings of the Interna-
tional Conference on Learning Representations.
Matthew S. Dryer and Martin Haspelmath, editors.
2013. The World Atlas of Language Structures
Online. Max Planck Institute for Evolutionary
Anthropology. http://wals.info/.
Jack Edmonds. 1967. Optimum branchings. Jour-
nal of Research of the National Bureau of Stan-
dards B, 71(4):233–240.
Sumit Basu, Chuck Jacobs, and Lucy Vander-
wende. 2013. Powergrading: A clustering ap-
proach to amplify human effort for short an-
swer grading. Transactions of the Association
for Computational Linguistics, 1:391–402.
Jason Eisner. 1996. Three new probabilistic mod-
els for dependency parsing: An exploration. In
Proceedings of the 16th International Confer-
ence on Computational Linguistics, pages 340–
345.
Glenn Carroll and Eugene Charniak. 1992. Two experiments on learning probabilistic dependency grammars from corpora. In Working Notes of the Workshop on Statistically-Based NLP Techniques, pages 1–13.
Jason Eisner and Damianos Karakos. 2005. Bootstrapping without the boot. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 395–402.
Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111.
Jason Eisner and Dingquan Wang. No date. On
evaluating cross-lingual generalization. To be
posted on arXiv.
Robert Frank and Shyam Kapur. 1996. On the use
of triggers in parameter setting. Linguistic In-
quiry, 27:623–660.
Noam Chomsky. 1965. Aspects of the Theory of
Syntax. MIT Press, Cambridge, MA.
Kuzman Ganchev, Jennifer Gillenwater, and Ben
Taskar. 2009. Dependency grammar induction
via bitext projection constraints. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 369–377.
Edward Gibson and Kenneth Wexler. 1994. Trig-
gers. Linguistic Inquiry, 25(3):407–454.
Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, and Ben Taskar. 2010. Sparsity in dependency grammar induction. In Proceedings of the ACL 2010 Conference Short Papers, pages 194–199.
Kevin Gimpel and Noah A. Smith. 2012. Concav-
ity and initialization for unsupervised depen-
dency parsing. In Proceedings of the 2012 Con-
ference of the North American Chapter of the
Association for Computational Linguistics: Hu-
man Language Technologies, pages 577–581.
Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 9, pages 249–256.
Matthew Gormley and Jason Eisner. 2013. Nonconvex global optimization for latent-variable models. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 444–454.
Edouard Grave and Noémie Elhadad. 2015. A convex and feature-rich discriminative approach to dependency grammar induction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1375–1384.
Alex Graves. 2012.
Supervised Sequence
Labelling with Recurrent Neural Networks.
Springer.
Jiang Guo, Wanxiang Che, David Yarowsky,
Haifeng Wang, and Ting Liu. 2016. A repre-
sentation learning framework for multi-source
transfer parsing. In Proceedings of the Thirti-
eth AAAI Conference on Artificial Intelligence,
pages 2734–2740.
William P. Headden III, Mark Johnson, and David McClosky. 2009. Improving unsupervised dependency parsing with richer contexts and smoothing. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 101–109.
Rebecca Hwa, Philip Resnik, Amy Weinberg,
Clara Cabezas, and Okan Kolak. 2005. Boot-
strapping parsers via syntactic projection across
parallel texts. Natural Language Engineering,
11(3):311–325.
Yong Jiang, Wenjuan Han, and Kewei Tu. 2017.
Combining generative and discriminative ap-
proaches to unsupervised dependency parsing
via dual decomposition. In Proceedings of the
2017 Conference on Empirical Methods in Nat-
ural Language Processing, pages 1689–1694.
Damianos Karakos, Jason Eisner, Sanjeev Khudanpur, and Carey E. Priebe. 2007. Cross-instance tuning of unsupervised document clustering algorithms. In Human Language Technologies: Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 252–259.
Joo-Kyung Kim, Young-Bum Kim, Ruhi Sarikaya, and Eric Fosler-Lussier. 2017. Cross-lingual transfer learning for POS tagging without cross-lingual resources. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2832–2838.
Diederik Kingma and Jimmy Ba. 2015. Adam: A
method for stochastic optimization. In Proceed-
ings of the International Conference on Learn-
ing Representations.
Eliyahu Kiperwasser and Yoav Goldberg. 2016.
Simple and accurate dependency parsing us-
ing bidirectional LSTM feature representations.
Transactions of the Association of Computa-
tional Linguistics, 4:313–327.
Dan Klein and Christopher Manning. 2004.
Corpus-based induction of syntactic structure:
Models of dependency and constituency.
In
Proceedings of the 42nd Annual Meeting of
the Association for Computational Linguistics,
pages 478–485.
Sandra Kübler, Ryan McDonald, and Joakim
Nivre. 2009. Dependency Parsing. Morgan and
Claypool.
Ophélie Lacroix, Lauriane Aufrant, Guillaume Wisniewski, and François Yvon. 2016. Frustratingly easy cross-lingual transfer for transition-based dependency parsing. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1058–1063.
Karim Lari and Steve J. Young. 1990. The estima-
tion of stochastic context-free grammars using
the Inside-Outside algorithm. Computer Speech
and Language, 4(1):35–56.
Yann LeCun, Sumit Chopra, Raia Hadsell, and
Fu Jie Huang. 2007. A tutorial on energy-based
learning. In Gökhan Bakır, Thomas Hofmann,
Bernhard Schölkopf, Alexander J. Smola, Ben
Taskar, and S. V. N. Vishwanathan, editors, Pre-
dicting Structured Data. MIT Press.
Haitao Liu. 2010. Dependency direction as a means of word-order typology: A method based on dependency treebanks. Lingua, 120(6):1567–1578.
David Mareček. 2016. Delexicalized and minimally supervised parsing on universal dependencies. In Statistical Language and Speech Processing: 4th International Conference, SLSP 2016, Pilsen, Czech Republic, October 11-12, 2016, Proceedings, pages 30–42, Cham.
David Mareček and Milan Straka. 2013. Stop-
probability estimates computed on a large cor-
pus improve unsupervised dependency parsing.
In Proceedings of the 51st Annual Meeting of
the Association for Computational Linguistics,
pages 281–290.
David Mareček. 2016. Twelve years of unsupervised dependency parsing. In Proceedings of the 16th ITAT Conference Information Technologies—Applications and Theory, pages 56–62.
Héctor Martínez Alonso, Željko Agić, Barbara Plank, and Anders Søgaard. 2017. Parsing universal dependencies without training. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 230–240.
Ryan McDonald. 2006. Discriminative Learning
and Spanning Tree Algorithms for Dependency
Parsing. Ph.D. thesis, University of Pennsylva-
nia.
Ryan McDonald, Koby Crammer, and Fernando
Pereira. 2005. Online large-margin training of
dependency parsers. In Proceedings of the 43rd
Annual Meeting of the Association for Compu-
tational Linguistics (ACL’05), pages 91–98.
Ryan McDonald, Slav Petrov, and Keith Hall.
2011. Multi-source transfer of delexicalized de-
pendency parsers. In Proceedings of the 2011
Conference on Empirical Methods in Natural
Language Processing, pages 62–72.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S.
Corrado, and Jeff Dean. 2013. Distributed
representations of words and phrases and their
compositionality. In Advances in Neural Infor-
mation Processing Systems, pages 3111–3119.
Tahira Naseem, Regina Barzilay,
and Amir
Globerson. 2012. Selective sharing for multi-
lingual dependency parsing. In Proceedings of
the 50th Annual Meeting of the Association for
Computational Linguistics, pages 629–637.
Tahira Naseem, Harr Chen, Regina Barzilay, and
Mark Johnson. 2010. Using universal linguis-
tic knowledge to guide grammar induction. In
Proceedings of the 2010 Conference on Empir-
ical Methods in Natural Language Processing,
pages 1234–1244.
Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc
Pham, and Son Bao Pham. 2014. RDRPOSTag-
ger: A ripple down rules-based part-of-speech
tagger. In Proceedings of the Demonstrations at
the 14th Conference of the European Chapter of
the Association for Computational Linguistics,
pages 17–20.
Joakim Nivre. 2008. Algorithms for determinis-
tic incremental dependency parsing. Computa-
tional Linguistics, 34(4):513–553.
Joakim Nivre, Željko Agić, Maria Jesus Aranzabe,
Masayuki Asahara, Aitziber Atutxa, Miguel
Ballesteros, John Bauer, Kepa Bengoetxea,
Riyaz Ahmad Bhat, Cristina Bosco, Sam Bow-
man, Giuseppe G. A. Celano, Miriam Connor,
Marie-Catherine de Marneffe, Arantza Diaz de
Ilarraza, Kaja Dobrovoljc, Timothy Dozat,
Tomaž Erjavec, Richárd Farkas, Jennifer Foster,
Daniel Galbraith, Filip Ginter, Iakes Goenaga,
Koldo Gojenola, Yoav Goldberg, Berta Gon-
zales, Bruno Guillaume, Jan Hajič, Dag Haug,
Radu Ion, Elena Irimia, Anders Johannsen, Hi-
roshi Kanayama, Jenna Kanerva, Simon Krek,
Veronika Laippala, Alessandro Lenci, Nikola
Ljubešić, Teresa Lynn, Christopher Manning,
Cătălina Mărănduc, David Mareček, Héctor
Martínez Alonso, Jan Mašek, Yuji Matsumoto,
Ryan McDonald, Anna Missilä, Verginica Mi-
titelu, Yusuke Miyao, Simonetta Montemagni,
Shunsuke Mori, Hanna Nurmi, Petya Osenova,
Lilja Øvrelid, Elena Pascual, Marco Passarotti,
Cenel-Augusto Perez, Slav Petrov, Jussi Piitu-
lainen, Barbara Plank, Martin Popel, Prokopis
Prokopidis, Sampo Pyysalo, Loganathan Ra-
masamy, Rudolf Rosa, Shadi Saleh, Sebastian
Schuster, Wolfgang Seeker, Mojgan Seraji, Na-
talia Silveira, Maria Simi, Radu Simionescu,
Katalin Simkó, Kiril Simov, Aaron Smith, Jan
Štěpánek, Alane Suhr, Zsolt Szántó, Takaaki
Tanaka, Reut Tsarfaty, Sumire Uematsu, Lar-
raitz Uria, Viktor Varga, Veronika Vincze,
Zdeněk Žabokrtský, Daniel Zeman, and Hanzhi
Zhu. 2015. Universal dependencies 1.2. LIN-
DAT/CLARIN digital library at the Institute of
Formal and Applied Linguistics, Charles Uni-
versity in Prague.
Hiroshi Noji, Yusuke Miyao, and Mark Johnson.
2016. Using left-corner parsing to encode uni-
versal structural constraints in grammar induc-
tion. In Proceedings of the 2016 Conference on
Empirical Methods in Natural Language Pro-
cessing, pages 33–43.
Lawrence Page, Sergey Brin, Rajeev Motwani,
and Terry Winograd. 1999. The PageRank cita-
tion ranking: Bringing order to the Web. Tech-
nical Report 1999-66, Stanford InfoLab.
Mohammad Sadegh Rasooli and Michael Collins.
2015. Density-driven cross-lingual transfer of
dependency parsers. In Proceedings of the 2015
Conference on Empirical Methods in Natural
Language Processing, pages 328–338.
Mohammad Sadegh Rasooli and Michael Collins.
2017. Cross-lingual syntactic transfer with lim-
ited resources. Transactions of the Association
for Computational Linguistics, 5:279–293.
Rudolf Rosa and Zdeněk Žabokrtský. 2015a.
KLcpos3 – a language similarity measure for
delexicalized parser transfer. In Proceedings of
the 53rd Annual Meeting of the Association for
Computational Linguistics and the 7th Interna-
tional Joint Conference on Natural Language
Processing, pages 243–249.
Rudolf Rosa and Zdeněk Žabokrtský. 2015b. MSTParser model interpolation for multi-source delexicalized transfer. In Proceedings of the 14th International Conference on Parsing Technologies, pages 71–75.
Sebastian Ruder, Ivan Vuli´c, and Anders Søgaard.
2017. A survey of cross-lingual word embed-
ding models. Computing Research Repository,
arXiv:1706.04902.
David A. Smith and Jason Eisner. 2009.
Parser adaptation and projection with quasi-
synchronous grammar features. In Proceedings
of the Conference on Empirical Methods in
Natural Language Processing, pages 822–831.
Noah A. Smith. 2006. Novel Estimation Methods
for Unsupervised Discovery of Latent Structure
in Natural Language Text. Ph.D. thesis, Johns
Hopkins University.
Noah A. Smith and Jason Eisner. 2005. Guid-
ing unsupervised grammar induction using con-
trastive estimation. In International Joint Con-
ference on Artificial Intelligence (IJCAI) Work-
shop on Grammatical Inference Applications,
pages 73–82.
Mark A. Paskin. 2001. Cubic-time parsing and
learning algorithms for grammatical bigram.
Mark A. Paskin. 2002. Grammatical bigrams.
In Advances in Neural Information Processing
Systems, pages 91–97.
Noah A. Smith and Jason Eisner. 2006. Annealing
structural bias in multilingual weighted gram-
mar induction. In Proceedings of the Interna-
tional Conference on Computational Linguis-
tics and the Association for Computational Lin-
guistics, pages 569–576.
Valentin I. Spitkovsky, Hiyan Alshawi, and
Daniel Jurafsky. 2012. Three dependency-and-
boundary models for grammar induction.
In
Proceedings of the 2012 Joint Conference on
Empirical Methods in Natural Language Pro-
cessing and Computational Natural Language
Learning, pages 688–698.
Dingquan Wang and Jason Eisner. 2016. The Galactic Dependencies treebanks: Getting more data by synthesizing new languages. Transactions of the Association for Computational Linguistics, 4:491–505. Data available at https://github.com/gdtreebank/gdtreebank.
Dingquan Wang and Jason Eisner. 2017. Fine-
grained prediction of syntactic typology: Dis-
covering latent structure with supervised learn-
ing. Transactions of the Association for Com-
putational Linguistics, 5:147–161.
Dingquan Wang and Jason Eisner. 2018. Syn-
thetic data made to order: The case of parsing.
In Proceedings of the Conference on Empiri-
cal Methods in Natural Language Processing
(EMNLP), pages 1325–1337, Brussels.
Qin Iris Wang, Dale Schuurmans, and Dekang
Lin. 2008. Semi-supervised convex training for
dependency parsing. In Proceedings of the 46th
Annual Meeting of the Association for Computational Linguistics and the
2008 Conference of the North American Chap-
ter of the Association for Computational Lin-
guistics: Human Language Technologies, pages
532–540.
David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the First International Conference on Human Language Technology Research.
Deniz Yuret. 1998. Discovery of Linguistic Re-
lations Using Lexical Attraction. Ph.D. thesis,
Massachusetts Institute of Technology.
Yuan Zhang and Regina Barzilay. 2015. Hierarchical low-rank tensors for multilingual transfer parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1857–1867.
Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. 2013. Breaking out of local optima with count transforms and model recombination: A study in grammar induction. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1983–1995.
Oscar Täckström, Dipanjan Das, Slav Petrov,
Ryan McDonald, and Joakim Nivre. 2013a. To-
ken and type constraints for cross-lingual part-
of-speech tagging. Transactions of the Associa-
tion for Computational Linguistics, 1:1–12.
Oscar Täckström, Ryan McDonald, and Joakim
Nivre. 2013b. Target language adaptation of
discriminative transfer parsers. In Proceedings
of the 2013 Conference of the North Ameri-
can Chapter of the Association for Computa-
tional Linguistics: Human Language Technolo-
gies, pages 1061–1071.
Ben Taskar, Vassil Chatalbashev, Daphne Koller,
and Carlos Guestrin. 2005. Learning structured
prediction models: A large margin approach. In
Proceedings of the 22nd International Confer-
ence on Machine Learning, pages 896–903.
Ben Taskar, Dan Klein, Michael Collins, Daphne Koller, and Christopher Manning. 2004. Max-margin parsing. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.
Jörg Tiedemann. 2014. Rediscovering annotation
projection for cross-lingual parser induction. In
Proceedings of COLING 2014, the 25th Inter-
national Conference on Computational Linguis-
tics: Technical Papers, pages 1854–1864.
Jörg Tiedemann, Željko Agić, and Joakim Nivre. 2014. Treebank translation for cross-lingual parser induction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 130–140.