Surface Statistics of an Unknown Language Indicate How to Parse It
Dingquan Wang and Jason Eisner
Department of Computer Science, Johns Hopkins Universität
{wdd,jason}@cs.jhu.edu
Abstrakt
We introduce a novel framework for delex-
icalized dependency parsing in a new lan-
Spur. We show that useful features of the
target language can be extracted automati-
cally from an unparsed corpus, which con-
sists only of gold part-of-speech (POS) se-
quences. Providing these features to our
neural parser enables it to parse sequences
like those in the corpus. Auffallend, our sys-
tem has no supervision in the target lan-
Spur. Eher,
it is a multilingual sys-
tem that is trained end-to-end on a vari-
ety of other languages, so it learns a fea-
ture extractor that works well. We show
experimentally across multiple languages:
(1) Features computed from the unparsed
(2) In-
corpus improve parsing accuracy.
cluding thousands of synthetic languages
in the training yields further improvement.
(3) Despite being computed from unparsed
corpora, our learned task-specific features
beat previous work’s interpretable typolog-
ical features that require parsed corpora or
expert categorization of the language. Unser
best method improved attachment scores on
held-out test languages by an average of 5.6
percentage points over past work that does
not inspect the unparsed data (McDonald
et al., 2011), und von 20.7 points over past
“grammar induction” work that does not use
training languages (Naseem et al., 2010).
1
Einführung
Dependency parsing is one of the core natural
language processing tasks.
It aims to parse a
given sentence into its dependency tree: a directed
graph of
labeled syntactic relations between
Wörter. Supervised dependency parsers—which
are trained using a “treebank” of known parses in
the target language—have been very successful
(McDonald, 2006; Nivre, 2008; Kiperwasser
and Goldberg, 2016). Im Gegensatz, the progress
667
of unsupervised dependency parsers has been
slow, and they have apparently not been used in
any downstream NLP systems (Mareˇcek, 2016).
An unsupervised parser does not have access
to a treebank, but only to a corpus of unparsed
sentences in the target language.
Unsupervised parsing has been studied for
decades. The most common approach is gram-
mar induction (Lari and Young, 1990; Carroll
and Charniak, 1992; Klein and Manning, 2004).
Grammar induction induces an explicit grammar
from the unparsed corpus, such as a probabilis-
tic context-free grammar (PCFG), and uses that to
parse sentences of the language. This approach
has encountered two major difficulties:
• Search error: Most formulations of gram-
mar induction involve optimizing a highly
non-convex objective function such as likeli-
hood. The optimization is typically NP-hard
(Cohen and Smith, 2012), and approximate
local search methods tend to get stuck in lo-
cal optima.
• Model error: Likelihood does not correlate
well with parsing accuracy anyway (Schmied,
2006, Figur 3.2). Likelihood optimization
seeks latent trees that help to predict the
observed sentences, but these unsupervised
trees may use a non-standard syntactic anal-
ysis or even be optimized to predict non-
syntactic properties such as topic. We seek
a standard syntactic analysis—what Smith
(2006) calls the MATCHLINGUIST task.
We address both difficulties by using a super-
vised learning framework—one whose objective
function is easier to optimize and explicitly tries
to match linguists’ standard syntactic analyses.
Our approach is inspired by Wang and Eisner
(2017), who use an unparsed but tagged corpus
to predict the fine-grained syntactic typology of
Transactions of the Association for Computational Linguistics, Bd. 6, S. 667–685, 2018. Action Editor: Marco Kuhlmann.
Submission batch: 2/2018; Revision batch: 4/2018; Published 12/2018.
C(cid:13) 2018 Verein für Computerlinguistik. Distributed under a CC-BY 4.0 Lizenz.
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
T
l
A
C
_
A
_
0
0
2
4
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
a language. Zum Beispiel, they may predict that
um 70% of the direct objects fall to the right of
das Verb. Their system is trained on a large num-
ber of (unparsed corpus, true typology) pairs, jede
representing a different language. With this train-
ing, it can generalize to predict typology from the
unparsed corpus of a new language. Our approach
is similar except that we predict parses rather than
just a typology. In both cases, the system is trained
to optimize a task-specific quality measure. Der
system’s parameterization can be chosen to sim-
plify optimization (strikingly, the training objec-
tive could even be made convex by using a condi-
tional random field architecture) and/or to incor-
porate linguistically motivated features.
The positive results of Wang and Eisner (2017)
demonstrate that there are indeed surface clues to
syntactic structure in the input corpus, at least if it
is POS-tagged (as in their work and ours). Wie-
immer, their method only found global typological
Information: it did not establish which 70% of the
direct objects fell to the right of their verbs, let
alone identify which nouns were in fact direct ob-
jects of which verbs. That requires a token-level
analysis of each sentence, which we undertake in
this paper. Wieder, the basic idea is that instead of
predicting interpretable typological properties of a
language as Wang and Eisner (2017) did, we will
predict a language-specific version of the scoring
function that a parser uses to choose among vari-
ous actions or substructures.
2 Unsupervised Parsing with Supervised
Tuning
Our fundamental question is whether gold part-of-
Rede (POS) sequences carry useful information
about the syntax of a language.1 As we will show,
the answer is yes, and the information can be ex-
tracted and used to obtain actual parses.
This is the same question that has been implic-
itly asked by previous papers in the unsupervised
parsing tradition (see §5). Unsupervised pars-
ing of gold POS sequences is an artificial task,
to be sure.2 Nonetheless, it is a starting point
1We also include an experiment on noisy POS sequences.
2It is clearly not the task setting faced by human language
learners. Nor is it a plausible engineering setting: a language
with gold POS sequences often also has at least a small tree-
bank of gold parses, or at least parallel text in a language
from which noisy parses can be noisily projected (Agi´c et al.,
2016). There is also no practical reason to consider POS tags
without their attached words.
for more ambitious settings that would learn from
words and real-world grounding (with or without
the POS tags). Even this starting point has proved
surprisingly difficult over decades of research, Also
it has not been clear whether the POS sequences
even contain the necessary information.
Yet this task—like others that engineers, lin-
guists, or human learners might face—might be
solvable with general knowledge about the distri-
bution of human languages. An experienced lin-
guist can sometimes puzzle out the structure of
a new language. The reader may be willing to
guess a parse for the gold POS sequence VERB
DET NOUN ADJ DET NOUN. After all, Adjektive
usually attach to nouns (Naseem et al., 2010), Und
the adjective in this example seems to attach to the
first noun—not to the second, since determiners
usually fall at the edge of a noun phrase. Mean-
while, the sequence’s sole verb is apparently fol-
lowed by two noun phrases, which suggests ei-
ther VSO (verb-subject-object) or VOS order—
and VSO is a good guess as it is more common
(Dryer and Haspelmath, 2013). Observing a cor-
pus of additional POS sequences might help re-
solve the question of whether this language is pri-
marily VSO or VOS, Zum Beispiel, by guessing that
short noun phrases in the corpus (Zum Beispiel, un-
modified pronouns) are more often subjects.
Daher, we propose to solve the task by training a
kind of “artificial linguist” that can do such analy-
sis on corpora of new languages.
This is a general approach to developing an un-
supervised method for a specific type of dataset:
tune its structure and hyperparameters so that it
works well on actual datasets of that sort, and then
apply it to new datasets. Zum Beispiel, consider
clustering—the canonical unsupervised problem.
What constitutes a useful cluster depends on the
type of data and the application. Basu et al. (2013)
develop a text clustering system specifically to aid
Lehrer. Their “Powergrading” system can group
all the student-written answers to a novel question,
having been trained on human judgments of an-
swer similarity for other questions. Their novel
questions are analogous to our novel languages:
their unsupervised system is specifically tailored
to match teachers’ semantic similarity judgments
within any corpus of student answers, just as ours
is tailored to match linguists’ syntactic judgments
within any corpus of human-language POS se-
quences. Other NLP work on supervised tuning of
668
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
T
l
A
C
_
A
_
0
0
2
4
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
unsupervised learners includes strapping (Eisner
and Karakos, 2005; Karakos et al., 2007), welche
tunes with the help of both real and synthetic
datasets, just as we will (§3).
Are such systems really “unsupervised”? Ja,
in the sense that they are able to discover desirable
structure in a new dataset. Unsupervised learners
are normally crafted using assumptions about the
data domain. Their structure and hyperparameters
may have been manually tuned to produce pleas-
ing results for typical datasets in that domain. In
the domain of POS corpora, we simply scale up
this practice to automatically tune a large set of pa-
rameters, which later guide our system’s search for
linguist-approved structure on each new human-
language dataset. Our system should be regarded
as “supervised” if the examples are taken to be en-
tire languages: schließlich, we train it to map un-
labeled corpora to usefully labeled corpora. Aber
once trained, it is “unsupervised” if the examples
are taken to be the sentences within a given corpus:
by analyzing the corpus, our system figures out
how to map sentences of that language to parses,
without any labeled examples in that language.
3 Data
We use two datasets in our experiment:
UD: Universal Dependencies version 1.2 (Nivre
et al., 2015) A collection of 37 dependency tree-
banks of 33 languages, tokenized and annotated
with a common set of POS tags and dependency
relations.3 In principle, our trained system could
be applied to predict UD-style dependency re-
lations in any tokenized natural-language corpus
with UD-style POS tags.
GD: Galactic Dependencies version 1.0 (Wang
and Eisner, 2016) A collection of dependency
treebanks for 53,428 synthetic languages (von
which we will use a subset). A GD treebank
is generated by starting with some UD treebank
and stochastically permuting the child subtrees
of nouns and/or verbs to match their orders in
other UD treebanks. Zum Beispiel, one of the
GD treebanks reflects what the English UD tree-
bank might have looked like if English had been
both VSO (like Irish) and postpositional (wie
Japanese). This typologically diverse collection
3While it might have been preferable to use the expanded
and revised UD version 2.0, we wished to compare fairly with
GD 1.0, which is based on UD 1.2.
of resource-rich synthetic languages aims to pro-
pel the development of NLP systems that can han-
dle diverse natural languages, such as multilingual
parsers and taggers.
3.1 Why Synthetic Training Languages?
We hope for our system to do well, on average, bei
matching real linguist-parsed corpora of real hu-
man languages. We therefore tune its parameters
Θ on such treebanks. UD provides training exam-
ples actually drawn from that distribution D over
treebanks—but alas, rather few. Thus to better es-
timate the expected performance of Θ under D,
we follow Wang and Eisner (2017) and augment
our training data with GD’s synthetic treebanks.
Ideally we would have sampled these synthetic
treebanks from a careful estimate ˆD of D: zum Beispiel-
reichlich, the mean of a Bayesian posterior for D,
derived from prior assumptions and UD evidence.
Jedoch, such adventurous “extrapolation” of un-
seen languages would have required actually con-
structing such an estimate ˆD—which would em-
body a distribution over semantic content and a
full theory of universal grammar! The GD tree-
banks were derived more simply and more con-
servatively by “interpolation” among the actual
UD corpora. They combine observed parse trees
(which provide attested semantic content) mit
stochastic word order models trained on observed
languages (which attempt to mimic attested pat-
terns for presenting that content). GD’s sampling
distribution ˆD still offers moderately varied syn-
thetic datasets, which remain moderately realistic,
as they are limited to phenomena observed in UD.
As Wang and Eisner (2016) pointed out, syn-
thetic examples have been used in many other su-
pervised machine learning settings. A common
if real image
technique is to exploit invariance:
z should be classified as a cat, then so should a
rotated version of image z. Our technique is the
same! We assume that if real corpus u should be
parsed as having certain dependencies among the
word tokens, then so should a version of corpus
u in which those tokens have been systematically
permuted in a linguistically plausible way.4 This
is analogous to how rotation sytematically trans-
forms the image (rotating all pixels through the
same angle) in a physically plausible way (as real
objects do rotate relative to the camera). The sys-
tematicity is needed to ensure that the task on syn-
4Another example is back-translation.
669
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
T
l
A
C
_
A
_
0
0
2
4
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
thetic data is feasible. In unserem Fall, the synthetic
corpus then provides many sentences that have
been similarly permuted, which may jointly pro-
vide enough clues to guess the word order of this
synthetic language (Zum Beispiel, VSO vs. VOS in
§2) and thus recover the dependencies. See Wang
and Eisner (2018, §2) for related discussion.
With enough good synthetic languages to use
for training, even nearest-neighbor could be an
effective method. Das ist, one could obtain the
parser for a test corpus simply by copying the
trained parser for the most similar training cor-
pus (under some metric). Wang and Eisner (2016)
explored this approach of “single-source transfer”
from synthetic languages. Yet with only thousands
of synthetic languages, perhaps no single training
corpus is sufficiently similar.5 To draw on pat-
terns in many training corpora to figure out how to
parse the test corpus, we will train a single parser
that can handle all of the training corpora (Ammar
et al., 2016), much as we trained our typological
classifier in earlier work (Wang and Eisner, 2017).
details of these two components in §6 and §7.
We assume in this paper that the input sentence
x is given as a POS sequence: das ist, our parser
is delexicalized. This spares us from also need-
ing language-specific lexical parameters associ-
ated with the specific vocabulary of each language,
a problem that we leave to future work.
We will choose our universal parameter values
by minimizing an estimate of their expected loss,
ˆΘ = argmin
Θ
mean
(cid:96)∈Ltrain
Loss(Θ; X((cid:96)), j((cid:96)), u((cid:96)))
(1)
where Ltrain is a collection of training languages
(ideally drawn IID from the distribution D of pos-
sible human languages) for which some syntac-
tic information is available. Konkret, jede
training language (cid:96) has a treebank (X((cid:96)), j((cid:96))),
where x((cid:96)) is a collection of POS-tagged sentences
whose correct dependency trees are given by y((cid:96)).
Jede (cid:96) also has an unparsed corpus u((cid:96)) (möglicherweise
equal to x((cid:96)) or containing x((cid:96))). We can therefore
define the parser’s loss on training language (cid:96) als
4 Task Formulation
Loss(Θ; X((cid:96)), j((cid:96)), u((cid:96)))
(2)
An unsupervised parser for language (cid:96) is built
without any gold parse trees for (cid:96). Jedoch, Wir
assume a corpus u of unparsed but POS-tagged
sentences of (cid:96) is available. From u, we will extract
statistics T(u) that are informative about the syn-
tactic structure of (cid:96), to guide us in parsing POS-
tagged sentences of (cid:96).
Gesamt, our approach is to train a “language-
agnostic” parser—one that does not know what
Sprache (cid:96) it is parsing in. It produces a parse tree
ˆy = ParseΘ(X; u) from a sentence x, construct-
ing T(u) as an intermediate quantity that carries
(Zum Beispiel) typological information about (cid:96).
The parameters Θ are shared by all languages,
and determine how to construct and use T. To
learn them, we will allow (cid:96) to range over training
languages, and then test our ability to parse when
(cid:96) ranges over novel test languages.
Our ParseΘ(X; u) system has two stages. Erste
it uses a neural network to compute T(u) ∈ Rm, A
vector that represents the typological properties of
(cid:96) and resembles the language embedding of Am-
mar et al. (2016). Then it parses sentence x while
taking T(u) as an additional input. We will give
5Wang and Eisner (2018) do investigate synthesis “on de-
mand” of a permuted training corpus that is as similar as pos-
sible to the test corpus.
=
mean
(X,j)∈(X((cid:96)),j((cid:96)))
loss(ParseΘ(X; u((cid:96)))
, j)
(cid:125)
(cid:124)
(cid:123)(cid:122)
ˆy
where loss(. . .) is a task-specific per-sentence loss
(defined in §8.1) that evaluates the parser’s output
ˆy on sentence x against x’s correct tree y.
5 Related Work
5.1 Per-Language Learning
Many papers rely on some universal learning pro-
cedure to determine T(u) (see §4) for a target lan-
Spur. Zum Beispiel, T(·) may be the Expectation-
Maximization (EM) Algorithmus, yielding a PCFG
T(u) that fully determines a CKY parser (Carroll
and Charniak, 1992; Klein and Manning, 2004).
Since EM and CKY are fixed algorithms, this ap-
proach has no trainable parameters.
Grammar induction tries to turn an unsuper-
vised corpus into a generative grammar. The ap-
proach of the previous paragraph is often modi-
fied to reduce model error or search error (§1). To
reduce model error, many papers have used de-
pendency grammar, with training objectives that
incorporate notions like lexical attraction (Yuret,
1998) and grammatical bigrams (Paskin, 2001,
2002).
The dependency model with valence
(DMV) (Klein and Manning, 2004) was the first
670
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
T
l
A
C
_
A
_
0
0
2
4
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
method to beat a simple right-branching heuris-
tic. Headden III et al. (2009) and Spitkovsky
et al. (2012) made the DMV more expressive by
considering higher-order valency or punctuation.
To reduce search error, strategies for eliminating
or escaping local optima have included convex-
ified objectives (Wang et al., 2008; Gimpel and
Schmied, 2012), smart initialization (Klein and Man-
ning, 2004; Mareˇcek and Straka, 2013), suchen
bias (Smith and Eisner, 2005, 2006; Naseem et al.,
2010; Gillenwater et al., 2010), branch-and-bound
suchen (Gormley and Eisner, 2013), and switching
objectives (Spitkovsky et al., 2013).
Unsupervised parsing (which is also our task)
tries to turn the same corpus directly into a tree-
bank, without necessarily finding a grammar. Wir
discuss some recent milestones here. Grave and
Elhadad (2015) propose a transductive learning
objective for unsupervised parsing, and a convex
relaxation of it. (Jiang et al. (2017) combined that
work with grammar induction.) Martínez Alonso
et al. (2017) create an unsupervised dependency
parser that is formally similar to ours in that it
uses cross-linguistic knowledge as well as statis-
tics computed from a corpus of POS sequences in
the target language. Jedoch, its cross-linguistic
the set of
knowledge is hand-coded: nämlich,
POS-to-POS dependencies that are allowed by the
UD annotation scheme, and the typical directions
for some of these dependencies. The only cor-
pus statistic extracted from u is whether ADP-
NOMINAL or NOMINAL-ADP bigrams are more
frequent,6 which distinguishes prepositional from
postpositional languages. The actual parser starts
by identifying the head word as the most “central”
word according to a PageRank (Page et al., 1999)
analysis of the graph of candidate edges, and pro-
ceeds by greedily attaching words of decreasing
PageRank at lower depths in the tree.
5.2 Multi-Language Learning
This approach parses a “target” language using
the treebanks of other resource-rich languages as
“source” languages. There are two main variants.
Memory-based. This method trains a super-
vised parsing model on each source treebank. Es
uses these (delexicalized) source-language models
to help parse the target sentence, favoring sources
that are similar to the target language. A common
6In our notation of §6.1, below,
t∈{NOUN,PRON,PROPN} πw
this asks whether
T|ADP is greater for w = 1 or w = −1.
(cid:80)
similarity measure (Rosa and Žabokrtský, 2015A)
considers the probability of the target language’s
POS-corpus u under a trigram language model of
source-language POS sequences.
(Rosa
(SST)
transfer
Single-source
Und
Žabokrtský, 2015A; Wang and Eisner, 2016)
simply uses the parser for the most similar source
treebank. Multi-source transfer (MST) (Rosa
and Žabokrtský, 2015A) parses the target POS
sequence with each of the source parsers, Und
then combines these parses into a consensus tree
using the Chu-Liu-Edmonds algorithm (Chu,
1965; Edmonds, 1967). As a faster variant, Modell
interpolation (Rosa and Žabokrtský, 2015B) builds
a consensus model for the target language (via a
weighted average of source models’ parameters),
rather than a consensus parse for each target
sentence separately.
Memory-based methods require storing models
for all source treebanks, which is expensive when
we include thousands of GD treebanks (§3).
Model-based. This method trains a single
language-agnostic model. McDonald et al. (2011)
train a delexicalized parser on the concatenation
of all source treebanks, achieving a large gain over
grammar induction. This parser can learn univer-
sals such as the preference for determiners to at-
tach to nouns (which was hard-coded by Naseem
et al. (2010)). Jedoch, it is expected to parse a
sentence x without being told the language (cid:96) oder
even a corpus u, possibly by guessing properties
of the language from the configurations it encoun-
ters in the single sentence x alone.
Further gains were achieved (Naseem et al.,
2012; Täckström et al., 2013B; Zhang and Barzi-
lay, 2015; Ammar et al., 2016) by providing the
parser with about 10 typological properties of
x’s language—for example, whether direct objects
generally fall to the right of the verb—as listed
in the World Atlas of Linguistic Structures (Dryer
and Haspelmath, 2013).
Jedoch, relying on WALS raises some is-
sues. (1) The unknown language might not be in
WALS.7 (2) Some typological features are missing
(3) All the WALS features
for some languages.
are categorical values, which loses useful infor-
mation about tendencies (Zum Beispiel, how often
the canonical word order is violated). (4) Not all
WALS features are useful—only 56 of them per-
tain to word order, und nur 8 of those have been
72,679 out of about 7,000 world languages are in WALS.
671
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
T
l
A
C
_
A
_
0
0
2
4
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
used in past work. (5) With a richer parser (a stack
LSTM dependency parser), WALS features do not
appear to help at all on unknown languages (Bin-
mar et al., 2016, footnote 30).
5.3 Exploiting Parallel Data
Some other work on generalizing from source
to target languages assumes the availability of
source-target parallel data, or bitext. Two uses:
Induction of multilingual word embeddings.
Similar to universal POS tags, multilingual word
embeddings serve as a universal representation
that bridges the lexical differences among lan-
guages. Guo et al. (2016) proposed two ap-
proaches: (1) Training a variant of the skip-gram
Modell (Mikolov et al., 2013) by using bilingual
sets of context words. (2) Generating the embed-
ding of each target word by averaging the embed-
dings of the source words to which it is aligned.
Annotation projection. Given aligned bitext,
one can generate an approximate parse for a tar-
get sentence by “projecting” the parse tree of the
corresponding source sentence. A target-language
parser can then be trained from these approxi-
mate parses. The idea was originally proposed
by Yarowsky et al. (2001), and then applied to
dependency parsing on low-resource languages
(Hwa et al., 2005; Ganchev et al., 2009; Schmied
and Eisner, 2009; Tiedemann, 2014, Unter anderem).
McDonald et al. (2011) extend this approach to
multiple source languages by projected transfer.
Later work in this vein mainly tries to improve
the approximate parses, including translating the
source treebanks into the target language with an
off-the-shelf machine translation system (Tiede-
mann et al., 2014), augmenting the trees with
weights (Agi´c et al., 2016), and using only partial
trees with high-confidence alignments (Rasooli
and Collins, 2015, 2017; Lacroix et al., 2016).
5.4 Situating Our Work
Our own approach can be categorized as model-
based multi-language learning with no parallel
text or target-side supervision. Jedoch, we also
analyze an unparsed corpus u of the target lan-
Spur, as the per-language systems of §5.1 do.
Our analysis of u does not produce a specialized
target grammar or parser, but only extracts a tar-
get vector T(u) to be fed to the language-agnostic
parser. The analyzer is trained jointly with the
parser, over many languages.
Figur 1: A 2-layer typology component. The bias
vectors (bW ) are suppressed for readability.
6 The Typology Component
Wang and Eisner (2017) extract typological prop-
erties of a language from its POS-tagged corpus u,
in effect predicting syntactic structure from super-
ficial features. Like them, we compute a hidden
layer T(u) using a standard multilayer perceptron
architecture, Zum Beispiel,
T(u) = ψ(W π(u) + bW ) ∈ Rh
(3)
where π(u) ∈ Rd is the surface features of u,
W ∈ Rh×d maps π(u) into a h-dimensional
Raum, bW ∈ Rh is a bias vector, and ψ is an
element-wise activation function. While equa-
tion (3) has only 1 layer, we explore versions with
aus 0 Zu 3 layers (where T(u) = π(u) im
0-layer case). A 2-layer version is shown in Fig-
ure 1. The number of layers is chosen by cross-
validation, as are h and the ψ function.
6.1 Design of the Surface Features π(u)
To define π(u), we used development data to se-
lect the following fast but effective subset of the
features proposed by Wang and Eisner (2017).
Hand-engineered features. Given a token j in a
Satz, let its right window Rj be the sequence
of POS tags pj+1, . . . , pj+w (padding the sentence
as needed with # symbols). w is the window size.
Define gw(T | J) ∈ [0, 1] to be the fraction of
words in Rj tagged with t. Jetzt, given a corpus
u, define
πw
t = mean
J
gw(T | J), πw
T|s = mean
J: Tj =s
gw(T | J)
where j ranges over tokens of u. The unigram
prevalence πw
t measures the frequency of t over-
alle, while the bigram prevalence πw
T|s measures
the frequency with which t can be found to the left
672
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
T
l
A
C
_
A
_
0
0
2
4
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
T(u) W1W2
of an average s tag (in a window of size w). Für
each of these quantities, we have a corresponding
mirror-image quantity (denoted by negating w) von
computing it on a reversed version of the corpus.
The final hand-engineered π(u) includes:
• πw
T , for each tag type t and each w ∈
{1, 3, 8, 100}. This quantity measures how
frequently t appears in u.
• πw
T|s/πw
T
t and π−w
T|s //π−w
T|s//πw
, for each tag type
T
pair s, t and each w ∈ {1, 3, 8, 100}. We de-
fine x//y = min(x/y, 1) to bound the fea-
ture values for better generalization. Notice
that if w = 1, the log of πw
is the bigram
pointwise mutual information. Each matched
pair of these quantities is intuitively related
to the word order typology—for example,
if ADPs are more likely to have closely
following than closely preceding NOUNs
NOUN > π−w
(πw
NOUN),
the language is more likely to be preposi-
tional than postpositional.
NOUN|ADP//π−w
NOUN|ADP//πw
Neural features.
Im Gegensatz, our neural features
automatically learn to extract arbitrary predictive
configurations. As Figure 2 zeigt an, we encode
each POS-tagged sentence ui ∈ u using a recur-
rent neural network, which reads one-hot POS em-
beddings from left to right, then outputs its final
hidden state vector f i as the encoding. Das Finale
neural π(u) is the average encoding of all sen-
tences (average-pooling): das ist, the average of all
sentence-level configurations. We specifically use
a gated recurrent unit (GRU) Netzwerk (Cho et al.,
2014). The GRU is jointly trained with all other
parameters in the system so that it focuses on de-
tecting word-order properties of u that are useful
for parsing.
7 The Parsing Architecture
To construct Parse(X; u), we can extend any sta-
tistical parsing architecture Parse(X) to be sen-
sitive to T(u).
For our experiments, we ex-
tend the delexicalized graph-based implementa-
tion of the BIST parser (Kiperwasser and Gold-
berg, 2016)—an arc-factored dependency model
with neural context features extracted by a bidi-
rectional LSTM. This recent parser was the state
of the art when it was published.
Figur 2: Computing the neural feature vector π(u).
parser first computes an unlabeled projective tree
argmax
y∈Y(X)
Punktzahl(X, j; u)
(4)
Wo, letting a range over the arcs in tree y,
Punktzahl(X, j; u) =
(cid:88)
a∈y
S(Phi(A; X, u))
(5)
With this definition, the argmax in (4) is com-
puted efficiently by the algorithm of Eisner (1996).
S(·) is a neural scoring function on vectors,
S(Phi(· · · )) = v tanh(V φ(· · · ) + bV )
(6)
where V is a matrix, bV is a bias vector, and v is
a vector, all being parameters in Θ.
The function φ(A; X, u) extracts the feature vec-
tor of arc a given x and u. BIST scores unlabeled
arcs, so a denotes a pair (ich, J)—the indices of the
parent and child, jeweils. We define
Phi(A; X, u) = [B(X, ich; T(u)); B(X, J; T(u))]
(7)
which concatenates contextual representations of
tokens i and j. B(X, ich) is itself a concatenation
of the hidden states of a left-to-right LSTM and a
right-to-left LSTM (Graves, 2012) when each has
read sentence x up through word i (really POS tag
ich). These LSTM parameters are included in Θ.
The POS tags in x are provided to the LSTMs as
one-hot vectors. Crucially, T(u) is also provided
to the LSTM at each step, wie in der Abbildung gezeigt 3.
After selecting the best tree via equation (4), Wir
use each arc’s φ vector again to predict its label.
This yields the labeled tree ˆy = ParseΘ(X; u).
Given a POS-sentence x and a corpus u, unser
The only extension that this makes to BIST is
673
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
T
l
A
C
_
A
_
0
0
2
4
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
. . .. . .f1. . .. . .. . .. . .. . .f2. . . . . .avg-pooling⇡(u)
We want the correct tree y to beat each tree y(cid:48) von
a margin equal to the number of errors in y(cid:48) (Wir
count spurious edges). Formally, loss(X, j; u) Ist
given by
max(0, − score(X, j; u)+
(cid:16)
max
j(cid:48)
Punktzahl(X, j(cid:48); u)
(cid:123)(cid:122)
(cid:125)
(cid:124)
model score
+
(cid:88)
1a /∈y
(cid:17)
)
(9)
a∈y(cid:48)
(cid:125)
(cid:123)(cid:122)
(cid:124)
precision error
where a ranges over the arcs of a tree y, and 1a /∈y
is an indicator that is 1 if a /∈ y. Daher, this loss
function is high if there exists a tree y(cid:48) that has a
high score relative to y yet low precision.9
The training algorithm makes use of loss-
augmented inference (Taskar et al., 2005), a vari-
ant on the ordinary inference of (4). The most vi-
olating tree y(cid:48) (in the maxy(cid:48) über) is computed
again by an arc-factored dependency algorithm
(Eisner, 1996), where the score of any candidate
arc a is s(Phi(A; X, u)) + 1a /∈y.
Eigentlich, the above method would only train the
score function to predict the correct unlabeled tree
as above (since a ranges over unlabeled arcs as be-
Vordergrund). In der Praxis, we also jointly train the labeler
to predict the correct labels on the gold arcs, verwenden
a separate hinge-loss objective. Because these two
components share parameters through φ(A; X, u),
this is a multi-task learning problem.
8.2 Training Algorithm
Augment training data. Unlike ordinary NLP
problems whose training examples are sentences,
each training example in equation (1) is an entire
Sprache. Bedauerlicherweise, UD (§3) only provides a
few dozen languages—presumably not enough to
generalize well to novel languages. We therefore
augment our training dataset Ltrain with thou-
sands of synthetic languages from the GD dataset
(§3), as already discussed in §3.1.
Stochastic gradient descent (SGD).10 Treating
each language as a single large example during
training would lead to slow SGD steps. Stattdessen,
we take our SGD examples to be individual sen-
tences, by regarding equations (1)–(2) together as
Figur 3: The architecture of the delexicalized graph-
based BIST parser with the introduction of T(u),
where si,j in each cell is the arc score s(Phi(A; X, T(u))
from equation (6). The root of the tree is always posi-
tion 0, where x0 is a distinguished “root” symbol that
is prepended to the input sentence.
to supply T(u) to the BiLSTM.8 This extension
is not a significant slowdown at test time, seit
T(u) only needs to be computed once per test lan-
Spur, not once per test sentence. Since T(u) can
be computed for any novel language at test time,
this differs from the “many languages, one parser”
architecture (Ammar et al., 2016), in which a test-
time language must have been seen at training time
or at least must have known WALS features.
Product of experts. We also consider a variant
of the function (6) for scoring arc a, nämlich
λsH(A) + (1 − λ)sN(A)
(8)
where sH(A) and sN(A) are the scores produced by
separately trained systems using, jeweils, Die
hand-engineered and neural features from §6.1.
Hyperparameter λ ∈ [0, 1] is tuned on dev data.
8 Training the System
8.1 Training Objective
We exactly follow the training method of Kiper-
wasser and Goldberg (2016), who minimize a
structured max-margin hinge loss (Taskar et al.,
2004; McDonald et al., 2005; LeCun et al., 2007).
8An alternative would be to concatenate T(u) mit dem
representation computed by the BiLSTM. This gets empiri-
cally worse results, probably because the BiLSTM does not
have advance knowledge of language-specific word order as
it reads the sentence. We also tried an architecture that does
beide, with no notable improvement.
9Formally, for this loss function to be used in equation (2),
we must interpret ParseΘ in that equation as returning a for-
est of scored parses, not just a single parse.
10More precisely, we use Adam (Kingma and Ba, 2015), A
popular variant of SGD. The parameters Θ are initialized by
“Xavier initialization” (Glorot and Bengio, 2010).
674
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
T
l
A
C
_
A
_
0
0
2
4
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
T(u)T(u)T(u)T(u)123012s0,1s1,0s1,2s2,1s3,2s2,3s3,1s1,3s0,2s2,0s0,3s3,0yx+SUMx0x1x2x3uscore(X,j;u)langinfoarcsPOSseq.
an objective averaged over sentences. Each ex-
reichlich (X, j, u) is sampled hierarchically, by first
drawing a language (cid:96) from Ltrain and setting u =
u((cid:96)), then drawing the sentence (X, j) uniformly
aus (X((cid:96)), j((cid:96))). We train using mini-batches of
100 Sätze; each mini-batch can mix many lan-
guages.
Encourage real languages. To sample (cid:96) aus
Ltrain, we first flip a coin with weight β ∈ [0, 1] Zu
choose “real” vs. “synthetic,” and then sample uni-
formly within that set. Warum? The test sentences
will come from real languages, so the synthetic
languages are out-of-domain. Including them re-
duces variance but increases bias. We raise β to
keep them from overwhelming the real languages.
Sample efficiently. The sentences (X, j) Sind
stored in different files by language. To reduce
disk accesses, we do not visit a file on each
sample. Eher, for each language (cid:96), we maintain
in memory a subset of (X((cid:96)), j((cid:96))), obtained by
Samples from (X((cid:96)), j((cid:96)))
reservoir sampling.
are drawn sequentially from this “chunk,” and
when it is used up we fetch a new chunk. We also
maintain u((cid:96)) and the hand-engineered features
from π(u((cid:96))) in memory.
9 Experimente
9.1 Basic Setup
Our data split follows that of Wang and Eisner
(2017), as shown in Table 2,11 which has 18
training languages (20 treebanks) Und 17 test lan-
guages. All hyperparameters are tuned via 5-fold
cross-validation on the 20 training treebanks—that
Ist, we evaluate each fold (4 treebanks) using the
model trained on the remaining folds (16 tree-
banks). Jedoch, a model trained on a treebank of
Sprache (cid:96) is never evaluated on another treebank
of language (cid:96). We selected the hyperparameters
that maximized the average unlabeled attachment
Punktzahl (UAS) (Kübler et al., 2009), which is the
evaluation metric that is reported by most previ-
11Jedoch, as we are interested in transfer to unseen lan-
guages, our Table 2 follows the principle of Eisner and Wang
(n.d.) and does not test on the Finnishftb or Latin treebanks
because other treebanks of those languages appeared in train-
ing data. Konkret, Latinitt and Latinproiel fall in the same
training folds as French and Italian, jeweils. For the
same reason, Tisch 2 does not show cross-validation devel-
opment results on these Latin treebanks—nor on the Ancient
Greekgrc and Ancient Greekgrc_proiel treebanks, which fall in
the same training folds as Czech and Danish, jeweils.
ous work on unsupervised parsing. We also report
labeled attachment score (LAS).12
When augmenting the data, Die 16 training tree-
banks are “mixed and matched” to get GD tree-
banks for 16×17×17 = 4624 additional synthetic
training languages (Wang and Eisner, 2016, §5).
The next sections analyze these cross-validation
results. Endlich, §9.8 will evaluate on 15 pre-
viously unseen languages (excluding Latin and
Finnishftb) with our model trained on all 18 train-
ing languages (20 treebanks for UD, plus 20×21×
21 = 8840 when adding GD) with the hyperpa-
rameters that achieved the best average unlabeled
attachment score during cross-validation.
The UD and GD corpora provide a train/dev/test
split of each treebank, denoted as (xtrain, ytrain),
(xdev, ydev) Und (xtest, ytest). Throughout this
Papier, for both training and testing languages, Wir
take (X((cid:96)), j((cid:96))) = (xtrain, ytrain). We take u((cid:96))
to consist of all xtrain sentences with ≤ 40 tokens.
9.2 Comparison Among Architectures
Tisch 1 shows the cross-validation parsing results
over different systems discussed so far. For each
architecture, we show the best average unlabeled
attachment score (the UAS column) chosen by
cross-validation, and the corresponding labeled at-
tachment score (the LAS column).
In brief, Die
main sources of improvement are twofold:
Synthetic languages. We observe that +GD
consistently outperforms UD across all architec-
tures. It even helps with the baseline system that
we tried, which simply ignores the target cor-
pus u((cid:96)).
In that system (similar to McDonald
et al. (2011)), the BiLSTM may still manage to
extract (cid:96)-specific information from the single sen-
tence x ∈ x((cid:96)) that it is parsing.13 The additional
GD training languages apparently help it learn to
do so in a way that generalizes to new languages.
12When reporting LAS and when studying the labeling er-
rors in §9.7, we would ideally have first re-tuned our sys-
tem to optimize LAS via cross-validation. Bedauerlicherweise,
these potentially improved LAS results would have required
months of additional computation. The optimal hyperparam-
eters may not be very different, Jedoch, since UAS and LAS
rose and fell together when we varied other training condi-
tions in Figures 4–6.
13Das ist, our baseline system has learned a single parser
that can handle a cross-linguistic variety of POS sequences
(vgl. McDonald et al., 2011; Ammar et al., 2016, section 4.2),
just as the reader was able to parse VERB DET NOUN ADJ
DET NOUN in §2.
675
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
T
l
A
C
_
A
_
0
0
2
4
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
UAS
LAS
System UD
SST
Baseline
H
N
H;N
H+N
TD
TW
(cid:123)
(cid:124)
(cid:125)
(cid:122)
S
e
R
u
T
A
e
F
e
l
C
A
R
Ö
66.22*
63.95
64.83
65.30
65.26
67.34*
65.94
64.84
+GD
65.70
67.97
69.41
70.06
69.62
70.65*
70.01*
69.75
UD
50.40
48.46
49.41
49.43
49.67
52.02*
49.77
49.30
+GD
50.54
52.78
53.63
54.19
53.68
55.18*
53.43
53.79
Tisch 1: Average parsing results over 16 languages,
computed by 5-fold cross-validation. We compare
training on real languages only (the UD column) ver-
sus augmenting with synthetic languages at β = 0.2
(the +GD column). Baseline is the ablated system that
omits T(u) (§9.2). SST is the single-source transfer
Ansatz (§5.2). H and N use only hand-engineered
features or neural features, while H;N defines π(u)
to concatenate both (§6.1) and H+N is the product-of-
experts model (§7). TD and TW that incorporate oracle
knowledge of the target-language syntax (§9.4). Für
each comparison between UD and +GD, we boldface
the better (higher) Ergebnis, or both if they are not signifi-
cantly different (paired permutation test over languages
with p < 0.05). In each column, we star the best result
as well as all results that are not significantly worse.
To better understand the trend, we study how
the performance varies when more synthetic lan-
guages are used. As shown in Figure 4, when β =
1, all the training languages are sampled from real
languages. By gradually increasing the propor-
tion of GD languages (reducing β from §8.2), the
baseline UAS increases dramatically from 63.95
to 67.97. However, if all languages are uniformly
sampled (β = 16
4624+16 ≈ 0.003) or only synthetic
languages are used (β = 0), the UAS falls back
slightly to 67.42 or 67.36. The best β value is 0.2,
0.2/16
0.8/4624 ≈ 72
which treats each real language as
times more helpful than each synthetic language,
yet 80% of the training data is contributed by syn-
thetic languages. β = 0.2 was also optimal for the
non-baseline systems in Table 1.
Unparsed corpora. The systems that exploit
unparsed corpora consistently outperform the
baseline system in both the UD and +GD condi-
tions. To investigate, we examine the impact of
reducing u((cid:96)) when parsing a held-out language (cid:96).
We used the system in row N and column +GD
of Table 1, which was trained on full-sized u cor-
pora. When testing on a held-out language (cid:96), we
compute T(u((cid:96))) using only a random size-t sub-
set of u((cid:96)). As shown in Figure 5, the system does
Figure 4: Effect of β. The UAS and LAS (y-axis) of
the baseline system as a function of β (x-axis).
Figure 5: Effect of the size |u((cid:96))| of the unparsed cor-
pus. The y-axis represents the cross-validation UAS
and LAS scores, averaged over the 7 languages that
have |u((cid:96))| ≥ 9000 sentences, when using only a sub-
set of the sentences from u((cid:96)). Using all of u((cid:96)) would
achieve 64.61 UAS and 49.04 LAS. The plot shows the
average over 10 runs with different random subsets; the
error bars indicate the 10th to the 90th percentile of
those runs. The 7 languages are Finnish (Finnic), Nor-
wegian (Germanic), Dutch (Germanic), Czech (Slavic),
German (Germanic), Hindi (Indic), and English (Ger-
manic).
not need a very large unparsed corpus—most of
the benefit is obtained by t = 256. Nonetheless,
a larger corpus always achieves a better and more
stable performance.
9.3 Comparison to SST
Besides Baseline, another directly comparable ap-
proach is SST (§5.2). As shown in Table 1, SST
gives a stronger baseline on the UD column—as
good as H+N. However, this advantage does not
carry over to the +GD column, meaning that SST
cannot exploit the extra training data. Wang and
Eisner (2016, Figure 5) already found that GD
languages provide diminishing benefit to SST as
676
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
t
l
a
c
_
a
_
0
0
2
4
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
6465666768Avg. attachment scoreUAS0.00.20.40.60.81.04950515253LAS61626364Avg. attachment scoreUAS1632641282565121024Size of unparsed corpus4546474849LAS
more UD languages get involved.14 For H+N,
however, the extra GD languages do help to iden-
tify the truly useful surface patterns in u.
We also considered trying model interpolation
(Rosa and Žabokrtský, 2015b). Unfortunately, as
mentioned in §5.2, this method is impractical with
GD languages, because it requires storing 4624
(§9.1) additional local models. Nonetheless, we
can estimate an “upper bound” on how well the
interpolation might do. Our upper bound is SST
where an oracle is used to choose the source lan-
guage; Rosa and Žabokrtský (2015b) found that in
practice, this does better than interpolation. This
approximate upper bound is 68.03 of UAS and
52.10 of LAS, neither of which is significantly bet-
ter than H+N on UD, but both of which are signif-
icantly outperformed by H+N on +GD.
9.4 Oracle Typology vs. Our Learned T(u)
The results in Table 1 demonstrate that we learned
to extract features T(u), from the unparsed tar-
get corpus u, that improve the baseline parser. We
consider replacing T(u) by an oracle that has ac-
cess to the true syntax of the target language. We
consider two different oracles, TD and TW.
TD is the directionalities typology that was
studied by Liu (2010) and used as a training tar-
get by Wang and Eisner (2017). Specifically,
TD ∈ [0, 1]57 is a vector of the directionalities
of each type of dependency relation; it specifies
what fraction of direct objects fall to the right of
the verb, and so on.15 In principle, this should be
very helpful for parsing, but it must be extracted
from a treebank, which is presumably unavailable
for unknown languages.
We also consider TW—the WALS features—
as the typological classification given by linguists.
This resembles the previous multi-language learn-
ing approaches (Naseem et al., 2012; Täckström
et al., 2013b; Zhang and Barzilay, 2015; Am-
mar et al., 2016) that exploited the WALS fea-
tures. In particular, we use 81A, 82A, 83A, 85A,
86A, 87A, 88A and 89A—a union of WALS fea-
tures used by those works. In order to derive the
WALS features for a synthetic GD language, we
first copy the features from its substrate language
14The number of real treebanks in our cross-validation set-
ting is 16, greater than the 10 in Wang and Eisner (2016).
a
→)
15The directionality of a relation a in language (cid:96) is given
count(cid:96)(a) , where count(cid:96)( a→) is the count of a-relations that
by count(cid:96)(
point from left to right, and count(cid:96)(a) is the count of all a-
relations.
(Wang and Eisner, 2016). We then replace the
81A, 82A, 83A features—which concern the order
between verbs and their dependents—by those of
its V-superstrate language16 (if any). We replace
85A, 86A, 87A, 88A and 89A—which concern
the order between nouns and their dependents—
by those of its N-superstrate language (if any).
As a pleasant surprise, we find that our best sys-
tem (H+N) is competitive with both oracle meth-
ods.
It outperforms both of them on both UAS
and LAS, and the improvements are significant
and substantial in 3 of these 4 cases. Our parser
has learned to extract information T(u) that is not
only cheap (no treebank needed), but also at least
as useful as “gold” typology for parsing.
9.5 Selected Hyperparameter Settings
For the rest of the experiments, we use the H+N
system, as it wins under cross-validation on both
UD and +GD (Table 1). This is a combination via
(8) of the best H system and the best N system
under cross-validation, with the mixture hyperpa-
rameter λ also chosen by cross-validation.
For both UD and +GD, cross-validation se-
lected 125 as the sizes of the LSTM hidden states
and 100 as the sizes of the hidden layers for scor-
ing arcs (the length of v in equation (6)).
Hyperparameters for UD. The H system com-
putes T(u) with a 1-layer network (as in equa-
tion (3)), with hidden size h = 128 and ψ = tanh
as the activation function. For the N system, T(u)
is a 1-layer network with hidden size h = 64 and
ψ = sigmoid as the activation function. The size
of the hidden state of GRU as shown in Figure 2 is
128. The mixture weight for the final H+N system
is λ = 0.5.
Hyperparameters for +GD. The H system
computes T(u) with a 2-layer network (as shown
in Figure 1), with h = 128 and ψ = sigmoid for
both hidden layers. For N, T(u) is a 1-layer net-
work with hidden size h = 64 and ψ = sigmoid.
The size of the hidden state of GRU is 256. Both H
and N set β = 0.2 (see §8.2). The mixture weight
for the final H+N system is λ = 0.4.
9.6 Performance on Noisy Tag Sequences
We test our trained system in a more realistic sce-
nario where both u and x for held-out languages
16The language whose word order model is used to per-
mute the dependents of the verbs. See Wang and Eisner
(2016) for details.
677
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
t
l
a
c
_
a
_
0
0
2
4
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
ers. (Remember from §9.2 that the hyperparam-
eters were selected to maximize unlabeled scores
(UAS) rather than labeled (LAS).)
Figure 8 gives the label confusion matrix.
While the dark NONE column shows that arcs of
each type are often missed altogether (recall er-
rors), the dark diagonal shows that they are usually
labeled correctly if found. That said, it is relatively
common to confuse the different labels for nom-
inal dependents of verbs (nsubj, dobj, nmod).
We suspect that lexical information could help sort
out these roles via distributional semantics. Some
other mistakes arise from discrepancies in the an-
notation scheme. For example, neg can be easily
confused with advmod, as some languages (for
example, Spanish) use ADV instead of PART for
negations.
9.8 Final Evaluation on Test Data
In all previous sections, we evaluated on the 16
languages in the training set by cross-validation.
For the final test, we combine all the 20 tree-
banks and train the system with the hyperparam-
eters given in §9.5, then test on the 15 unseen test
languages. Table 2 displays results on these 15
test languages (top) as well as the cross-validation
results on the 16 languages (bottom).
We see that we improve significantly over base-
line on almost every language. Indeed, on the test
languages, +T(u) improves both UAS and LAS
by > 3.5 percentage points on average. The im-
provement grows to > 5.6 if we augment the train-
ing data as well (+GD, meaning +T(u)+GD).
One disappointment concerns the added benefit
on the LAS of +GD over just +T(u): while this
data augmentation helped significantly on nearly
every one of the 16 development languages, Es
produced less consistent improvements on the test
languages and hurt some of them. We suspect that
this is because we tuned the hyperparameters to
maximize UAS, not LAS (§9.2). Infolge, while
the average benefit across our 15 test languages
was fairly large, this sample was not large enough
to establish that it was significantly greater from 0,
das ist, that future test languages would also see an
improvement from data augmentation.
We also notice that there seems to be a small
difference between the pattern of results on devel-
opment versus test languages. This may simply
reflect overfitting to the development languages,
but we also note that the test languages (chosen by
Figur 6: Performance on noisy input over 16 train-
ing languages. Each dot is an experiment annotated by
the number of sentences used to train the tagger. (Der
rightmost “∞” point uses gold tags instead of a tagger,
which is the result from Table 1.) The x-axis gives the
average accuracy of the trained RDRPOSTagger. Der
y-axis gives the average parsing performance.
consist of noisy POS tags rather than gold POS
tags. Following Wang and Eisner (2016, Ap-
pendix B), at test time, the gold POS tags in a cor-
pus are replaced by a noisy version produced by
the RDRPOSTagger (Nguyen et al., 2014) trained
on a subset of the original gold-tagged corpus.17
Figur 6 shows a linear relationship between the
performance of our best model (H+N with +GD)
and the noisiness of the POS tags, which is con-
trolled by altering the amount of training data.
With only 100 training sentences, die Performance
suffers greatly—the UAS drops from 70.65 Zu
51.57. dennoch, even this is comparable to
Naseem et al. (2010) on gold POS tags, welche
yields a UAS of 50.00. That system was the first
grammar induction approach to exploit knowledge
of the distribution of natural languages, and re-
mained state-of-the-art (Noji et al., 2016) until
the work of Mareˇcek (2016) and Martínez Alonso
et al. (2017).
9.7 Analysis by Dependency Relation Type
Figur 7 breaks down the results by dependency
relation type—showing that using u and synthetic
data improves results almost across the board.
We also notice large differences between la-
beled and unlabeled F1 scores for some relations,
especially rarer ones.
Mit anderen Worten, the sys-
tem mislabels the arcs that it correctly recov-
17Another way to get noisy tags, as a reviewer notes, würde
have been to use a cross-lingual POS tagger designed for low-
resource settings (Täckström et al., 2013A; Kim et al., 2017).
678
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
T
l
A
C
_
A
_
0
0
2
4
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
55606570Avg. attachment score0.1K0.2K0.4K0.8K1.6K3.2K6.4K12.8KUAS80859095100Avg. tagging accuracy3540455055LAS
l
D
Ö
w
N
Ö
A
D
e
D
F
R
Ö
M
H
T
T
P
:
/
/
D
ich
R
e
C
T
.
M
ich
T
.
e
D
u
/
T
A
C
l
/
l
A
R
T
ich
C
e
–
P
D
F
/
D
Ö
ich
/
.
1
0
1
1
6
2
/
T
l
A
C
_
A
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
T
l
A
C
_
A
_
0
0
2
4
8
P
D
.
F
B
j
G
u
e
S
T
T
Ö
N
0
9
S
e
P
e
M
B
e
R
2
0
2
3
Figur 7: Evaluation by dependency relation type, showing an equal-weighted average of the 16 Entwicklung
languages. Each vertical bar spans the range from labeled F1 (bottom) to unlabeled F1 (top), with error bars
given by bootstrap resampling over the 16 languages. Precision and recall are also indicated. The pattern is that
F1, precision, and recall—both labeled and unlabeled—are improved over baseline when we exploit unlabeled
corpora (+T(u)), and improved again when we augment training data (+T(u)+GD). The relations are sorted by
their average gold proportion in the 16 languages, shown by the gray area and right vertical axis. Zum Beispiel,
nmod is the most common relation, accounting for 15.5% of all arcs. Insgesamt, Die 20 most frequent relations
(shown here) account for 94% of the arcs.
Wang and Eisner (2016)) tended to have consider-
ably smaller unparsed corpora u, so there may be
a domain mismatch problem. To ameliorate this
Problem, one could include training examples with
versions of u that are truncated to lengths seen in
test data (vgl. Figur 5). One could also include the
Größe |u| explicitly in T(u).
10 Conclusion and Future Work
We showed how to build a “language-agnostic”
delexicalized dependency parser that can better
parse sentences of an unknown language by ex-
ploiting an unparsed (but POS-tagged) corpus of
that language. Unlike grammar induction, welche
estimates a PCFG from the unparsed corpus, Wir
train a neural network to extract a feature vec-
tor from the unparsed corpus that helps a subse-
quent neural parser. By end-to-end training on the
treebanks of many languages (optionally includ-
ing synthetic languages), our neural network can
extract linguistic information that helps neural de-
pendency parsing.
Variants of our architecture are possible. In fu-
ture work, the neural parser could use attention to
look at individual relevant sentences of u, welche
are posited to be triggers in some theories of child
grammar acquisition (Gibson and Wexler, 1994;
Frank and Kapur, 1996). We could also try in-
jecting T(u) into the neural parser by means other
than concatenating it with the input POS embed-
dings. We might also consider parsing architec-
tures other than BIST, such as the LSTM-Minus
architecture for scoring spans (Cross and Huang,
2016), or the recent attention-based arc-factored
Modell (Dozat and Manning, 2017). Endlich, unser
approach is applicable to tasks other than depen-
dency parsing, such as constituent parsing or se-
mantic parsing—if suitable treebanks are available
for many training languages.
For applied uses,
it would be interesting to
combine the unsupervised techniques of this pa-
per with low-resource techniques that make use of
some annotated or parallel data in the target lan-
Spur. It would also be interesting to include fur-
ther synthetic languages that have been modified
to better resemble the actual target languages, us-
ing the method of (Wang and Eisner, 2018).
It is important to relax the delexicalized as-
sumption. As shown in §9.6, die Performance
of our system relies heavily on the gold POS
tags, which are presumably not available for un-
known languages. What is available is lexical
information—which has proved to be very impor-
tant for supervised parsing, and should help un-
supervised parsers as well. As discussed in §9.7,
some errors seem easily fixable by considering
word distributions. In the future, we will explore
ways to extend our cross-linguistic parser to work
with word sequences rather than POS sequences,
perhaps by learning a cross-language word rep-
resentation that is shared among training and test
languages (Ruder et al., 2017).
679
0.03.67.210.814.418.0Avg. proportion (%)nmodcasepunctdetnsubjrootamoddobjadvmodconjccmarkauxacladvclcompoundcopnummodnamexcompDependency relations0.00.20.40.60.81.0Precision \ Recall \ F1PrecisionRecallBaseline+ T(u)+ T(u) + GD
UAS
LAS
Irish
B +T(u) +GD B +T(u) +GD
Language
49.89 54.34 57.59 27.07 31.46 35.32
Basque
Croatian 65.03 67.78 68.65 48.68 52.29 53.68
65.91 68.37 70.46 50.1 56.73 57.89
Greek
Hebrew 62.58 66.27 65.3 49.71 53.29 52.08
Hungarian 58.5 64.13 70.02 42.85 47.73 49.99
Indonesian 55.22 64.63 65.36 39.46 47.63 48.38
58.58 61.51 62.21 39.06 40.75 42.36
Japanese 54.97 60.41 58.4 37.57 40.6 37.86
Slavonic 68.79 71.13 71.54 40.03 43.95 44.12
40.38 34.2 57.25 30.06 24.6 47.14
Persian
72.15 76.85 78.28 50.08 54.85 58.15
Polish
Romanian 66.55 69.69 71.18 50.9 53.42 55.17
Slovenian 72.21 76.06 78.62 57.09 61.48 64.1
Swedish 72.26 75.32 73.89 55.35 58.42 52.39
51.59 57.53 57.91 28.39 37.81 32.52
60.97 64.55 67.11 43.09 47.00 48.74
45.75 49.32 53.83 36.4 40.39 44.14
66.71 68.41 68.4 52.24 54.49 54.67
norwegisch 68.35 70.89 71.22 52.33 56.01 56.37
64.31 68.77 72.42 50.19 55.16 57.95
Czech
Estonian 72.67 79.88 81.67 42.81 51.32 52.57
Portuguese 70.48 73.47 74.83 60.85 63.18 64.96
62.18 63.62 66.52 48.44 49.46 53.51
42.6 45.17
63.23 66.72 70.75 39.1
70.0
75.9 79.24 80.57 65.46 68.8
Bulgarian 77.57 79.53 83.66 55.83 57.65 61.47
Finnishfi 53.73 58.03 60.44 34.68 39.55 43.15
74.57 76.88 79.34 64.1 66.83 68.48
French
59.63 62.58 60.31 45.84 48.28 47.98
Dutch
61.66 63.99 65.9 47.61 51.43 53.13
English
35.84 40.74 62.45 18.63 21.65 41.12
Hindi
70.65 75.36 78.03 60.8 65.45 68.23
Spanish
All Avg. 62.51 65.99 68.94 45.86 49.59 52.07
Tamil
Avg.
Arabic
Danish
Deutsch
Gothic
Italian
Tisch 2: Data splits and final evaluation on the 15 test
languages (top), along with cross-validation results on
Die 16 development languages (bottom) grouped by 5
folds (separated by dashed lines). For languages with
multiple treebanks, we identify them by subscripts.
We use “Slavonic” for Old Church Slavonic. Column
B is the baseline that doesn’t use T(u) (McDonald
et al., 2011). +T(u) is our H+N system, and +GD
is that system when the training data is augmented
with synthetic languages. In comparing among these
three systems, we boldface the highest score as well
as all scores that are not significantly worse (paired
permutation test, P < 0.05). If a row is an average over
many sentences of a single language, then each paired
datapoint is a sentence, so a significant improvement
should generalize to new sentences. But if a row is
an average, then each paired datapoint is a language
(as in Table 1), so a significant improvement should
generalize to new languages.
specific kind of structure that we desire, using any
features that we think may be discriminative.
680
Figure 8: The confusion matrix of our parser, as an
equal-weight average over 16 development languages.
Each row is normalized to sum to 1 and represents
a frequent gold relation. For example, the nsubj
row shows how well we recovered the gold nsubj
arcs; the (nsubj, dobj) entry shows p(predicted =
dobj | gold = nsubj), which measures the fraction
of nsubj relations that are recovered but mislabeled
as dobj. The diagonal represents correct arcs: where
dark, it indicates high labeled recall for that relation.
The final column represents gold arcs that were not re-
covered with any label: where dark, it indicates low
unlabeled recall for that relation. We show the top 20
relations sorted by gold frequency.
One takeaway message from this work is
contained in our title.
Surface statistics of a
language—mined from the surface part-of-speech
order—provide clues about how to find the un-
derlying syntactic dependencies. Chomsky (1965)
imagined that such clues might be exploited by a
Language Acquisition Device, so it is interesting
to know that they do exist.
Another takeaway message is that synthetic
training languages are useful for NLP. Using syn-
thetic examples in training is a way to encourage a
system to be invariant to superficial variation. We
created synthetic languages by varying the surface
structure in a way that “should” preserve the deep
structure. This allows our trained system to be in-
variant to variation in surface structure, just as ob-
ject recognition wants to be invariant to an image’s
angle or lighting conditions (§3.1).
Our final takeaway goes beyond language: one
can treat unsupervised structure discovery as a su-
pervised learning problem. As §§1–2 discussed,
this approach inherits the advantages of supervised
learning. Training may face an easier optimization
landscape, and we can train the system to find the
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
t
l
a
c
_
a
_
0
0
2
4
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
nmodpunctcasedetnsubjrootamoddobjadvmodconjccmarkauxacladvclcompoundcopnummodnamexcompOTHERSNONEPredicted relationsOTHERSxcompnamenummodcopcompoundadvclaclauxmarkccconjadvmoddobjamodrootnsubjdetcasepunctnmodGold relations0.00.5
Acknowledgements This material
is based
upon work supported by the U.S. National
Science Foundation under Grants No. 1423276
and 1718846. We are grateful to the state of
Maryland for providing indispensable computing
resources via the Maryland Advanced Research
Computing Center (MARCC). Thanks to Kiper-
wasser and Goldberg (2016) for releasing their
Dynet parsing code, which was the basis for
our reimplementation in PyTorch. We thank the
Argo lab members for useful discussions. Finally,
we thank TACL action editor Marco Kuhlmann
and the anonymous reviewers for high-quality
suggestions.
References
Željko Agi´c, Anders Johannsen, Barbara Plank,
Héctor Martínez Alonso, Natalie Schluter, and
Anders Søgaard. 2016. Multilingual projec-
tion for parsing truly low-resource languages.
Transactions of the Association for Computa-
tional Linguistics, 4:301–312.
Waleed Ammar, George Mulcaire, Miguel Balles-
teros, Chris Dyer, and Noah Smith. 2016. Many
languages, one parser. Transactions of the As-
sociation of Computational Linguistics, 4:431–
444.
Yoeng-Jin Chu. 1965. On the shortest arbores-
Science Sinica,
cence of a directed graph.
14:1396–1400.
Shay B. Cohen and Noah A. Smith. 2012.
Empirical
risk minimization for probabilis-
tic grammars: Sample complexity and hard-
ness of learning. Computational Linguistics,
38(3):479–526.
James Cross and Liang Huang. 2016.
Incre-
mental parsing with minimal features using bi-
directional LSTM. In Proceedings of the 54th
Annual Meeting of the Association for Compu-
tational Linguistics (Volume 2: Short Papers),
pages 32–37, Berlin.
Timothy Dozat and Christopher D. Manning.
2017. Deep biaffine attention for neural depen-
dency parsing. In Proceedings of the Interna-
tional Conference on Learning Representations.
Matthew S. Dryer and Martin Haspelmath, editors.
2013. The World Atlas of Language Structures
Online. Max Planck Institute for Evolutionary
Anthropology. http://wals.info/.
Jack Edmonds. 1967. Optimum branchings. Jour-
nal of Research of the National Bureau of Stan-
dards B, 71(4):233–240.
Sumit Basu, Chuck Jacobs, and Lucy Vander-
wende. 2013. Powergrading: A clustering ap-
proach to amplify human effort for short an-
swer grading. Transactions of the Association
for Computational Linguistics, 1:391–402.
Jason Eisner. 1996. Three new probabilistic mod-
els for dependency parsing: An exploration. In
Proceedings of the 16th International Confer-
ence on Computational Linguistics, pages 340–
345.
Glenn Carroll and Eugene Charniak. 1992. Two
experiments on learning probabilistic depen-
In Working
dency grammars from corpora.
Notes of the Workshop on Statistically-Based
NLP Techniques, pages 1–13.
Jason Eisner and Damianos Karakos. 2005. Boot-
In Proceedings of
strapping without the boot.
Human Language Technology Conference and
Conference on Empirical Methods in Natural
Language Processing, pages 395–402.
Kyunghyun Cho, Bart van Merrienboer, Dzmitry
Bahdanau, and Yoshua Bengio. 2014. On
the properties of neural machine translation:
In Proceedings
Encoder-decoder approaches.
of Eighth Workshop on Syntax, Semantics and
Structure in Statistical Translation, pages 103–
111.
Jason Eisner and Dingquan Wang. No date. On
evaluating cross-lingual generalization. To be
posted on arXiv.
Robert Frank and Shyam Kapur. 1996. On the use
of triggers in parameter setting. Linguistic In-
quiry, 27:623–660.
Noam Chomsky. 1965. Aspects of the Theory of
Syntax. MIT Press, Cambridge, MA.
Kuzman Ganchev, Jennifer Gillenwater, and Ben
Taskar. 2009. Dependency grammar induction
681
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
t
l
a
c
_
a
_
0
0
2
4
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
In Proceed-
via bitext projection constraints.
ings of the Joint Conference of the 47th An-
nual Meeting of the ACL and the 4th Interna-
tional Joint Conference on Natural Language
Processing of the AFNLP, pages 369–377.
Edward Gibson and Kenneth Wexler. 1994. Trig-
gers. Linguistic Inquiry, 25(3):407–454.
Jennifer Gillenwater, Kuzman Ganchev, João Gra
ca, Fernando Pereira, and Ben Taskar. 2010.
Sparsity in dependency grammar induction. In
Proceedings of the ACL 2010 Conference Short
Papers, pages 194–199.
Kevin Gimpel and Noah A. Smith. 2012. Concav-
ity and initialization for unsupervised depen-
dency parsing. In Proceedings of the 2012 Con-
ference of the North American Chapter of the
Association for Computational Linguistics: Hu-
man Language Technologies, pages 577–581.
Xavier Glorot and Yoshua Bengio. 2010. Under-
standing the difficulty of training deep feedfor-
In Proceedings of the
ward neural networks.
International Conference on Artificial Intelli-
gence and Statistics, volume 9, pages 249–256.
Matthew Gormley and Jason Eisner. 2013. Non-
convex global optimization for latent-variable
In Proceedings of the 51st Annual
models.
Meeting of the Association for Computational
Linguistics, pages 444–454.
Edouard Grave and Noémie Elhadad. 2015.
A convex and feature-rich discriminative ap-
In
proach to dependency grammar induction.
Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and
the 7th International Joint Conference on Natu-
ral Language Processing, pages 1375–1384.
Alex Graves. 2012.
Supervised Sequence
Labelling with Recurrent Neural Networks.
Springer.
Jiang Guo, Wanxiang Che, David Yarowsky,
Haifeng Wang, and Ting Liu. 2016. A repre-
sentation learning framework for multi-source
transfer parsing. In Proceedings of the Thirti-
eth AAAI Conference on Artificial Intelligence,
pages 2734–2740.
William P. Headden III, Mark Johnson, and David
Improving unsupervised
McClosky. 2009.
dependency parsing with richer contexts and
In Proceedings of Human Lan-
smoothing.
guage Technologies: The 2009 Annual Confer-
ence of the North American Chapter of the As-
sociation for Computational Linguistics, pages
101–109.
Rebecca Hwa, Philip Resnik, Amy Weinberg,
Clara Cabezas, and Okan Kolak. 2005. Boot-
strapping parsers via syntactic projection across
parallel texts. Natural Language Engineering,
11(3):311–325.
Yong Jiang, Wenjuan Han, and Kewei Tu. 2017.
Combining generative and discriminative ap-
proaches to unsupervised dependency parsing
via dual decomposition. In Proceedings of the
2017 Conference on Empirical Methods in Nat-
ural Language Processing, pages 1689–1694.
Damianos Karakos, Jason Eisner, Sanjeev Khu-
danpur, and Carey E. Priebe. 2007. Cross-
instance tuning of unsupervised document clus-
In Human Language Tech-
tering algorithms.
nologies: Proceedings of the Annual Confer-
ence of the North American Chapter of the As-
sociation for Computational Linguistics, pages
252–259.
Joo-Kyung Kim, Young-Bum Kim, Ruhi
Sarikaya,
and Eric Fosler-Lussier. 2017.
Cross-lingual transfer learning for POS tagging
In Proceed-
without cross-lingual resources.
ings of
the 2017 Conference on Empirical
Methods in Natural Language Processing,
pages 2832–2838.
Diederik Kingma and Jimmy Ba. 2015. Adam: A
method for stochastic optimization. In Proceed-
ings of the International Conference on Learn-
ing Representations.
Eliyahu Kiperwasser and Yoav Goldberg. 2016.
Simple and accurate dependency parsing us-
ing bidirectional LSTM feature representations.
Transactions of the Association of Computa-
tional Linguistics, 4:313–327.
Dan Klein and Christopher Manning. 2004.
Corpus-based induction of syntactic structure:
Models of dependency and constituency.
In
Proceedings of the 42nd Annual Meeting of
the Association for Computational Linguistics,
pages 478–485.
682
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
t
l
a
c
_
a
_
0
0
2
4
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Sandra Kübler, Ryan McDonald, and Joakim
Nivre. 2009. Dependency Parsing. Morgan and
Claypool.
and François Yvon.
Ophélie Lacroix, Lauriane Aufrant, Guillaume
2016.
Wisniewski,
for
Frustratingly easy cross-lingual
In
transition-based dependency parsing.
the
Proceedings of
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, pages 1058–1063.
the 2016 Conference of
transfer
Karim Lari and Steve J. Young. 1990. The estima-
tion of stochastic context-free grammars using
the Inside-Outside algorithm. Computer Speech
and Language, 4(1):35–56.
Yann LeCun, Sumit Chopra, Raia Hadsell, and
Fu Jie Huang. 2007. A tutorial on energy-based
learning. In Gökhan Bakır, Thomas Hofmann,
Bernhard Schölkopf, Alexander J. Smola, Ben
Taskar, and S. V. N. Vishwanathan, editors, Pre-
dicting Structured Data. MIT Press.
Haitao Liu. 2010.
Dependency direction as
a means of word-order typology: A method
Lingua,
based on dependency treebanks.
120(6):1567–1578.
David Mareˇcek. 2016. Delexicalized and min-
imally supervised parsing on universal de-
In Statistical Language and
pendencies.
Speech Processing: 4th International Confer-
ence, SLSP 2016, Pilsen, Czech Republic, Oc-
tober 11-12, 2016, Proceedings, pages 30–42,
Cham.
David Mareˇcek and Milan Straka. 2013. Stop-
probability estimates computed on a large cor-
pus improve unsupervised dependency parsing.
In Proceedings of the 51st Annual Meeting of
the Association for Computational Linguistics,
pages 281–290.
David Mareˇcek. 2016.
Twelve years of un-
In Proceed-
supervised dependency parsing.
ings of the 16th ITAT Conference Information
Technologies—Applications and Theory, pages
56–62.
Héctor Martínez Alonso, Željko Agi´c, Barbara
Plank, and Anders Søgaard. 2017. Parsing uni-
In Pro-
versal dependencies without training.
683
ceedings of the 15th Conference of the Euro-
pean Chapter of the Association for Compu-
tational Linguistics: Volume 1, Long Papers,
pages 230–240.
Ryan McDonald. 2006. Discriminative Learning
and Spanning Tree Algorithms for Dependency
Parsing. Ph.D. thesis, University of Pennsylva-
nia.
Ryan McDonald, Koby Crammer, and Fernando
Pereira. 2005. Online large-margin training of
dependency parsers. In Proceedings of the 43rd
Annual Meeting of the Association for Compu-
tational Linguistics (ACL’05), pages 91–98.
Ryan McDonald, Slav Petrov, and Keith Hall.
2011. Multi-source transfer of delexicalized de-
pendency parsers. In Proceedings of the 2011
Conference on Empirical Methods in Natural
Language Processing, pages 62–72.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S.
Corrado, and Jeff Dean. 2013. Distributed
representations of words and phrases and their
compositionality. In Advances in Neural Infor-
mation Processing Systems, pages 3111–3119.
Tahira Naseem, Regina Barzilay,
and Amir
Globerson. 2012. Selective sharing for multi-
lingual dependency parsing. In Proceedings of
the 50th Annual Meeting of the Association for
Computational Linguistics, pages 629–637.
Tahira Naseem, Harr Chen, Regina Barzilay, and
Mark Johnson. 2010. Using universal linguis-
tic knowledge to guide grammar induction. In
Proceedings of the 2010 Conference on Empir-
ical Methods in Natural Language Processing,
pages 1234–1244.
Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc
Pham, and Son Bao Pham. 2014. RDRPOSTag-
ger: A ripple down rules-based part-of-speech
tagger. In Proceedings of the Demonstrations at
the 14th Conference of the European Chapter of
the Association for Computational Linguistics,
pages 17–20.
Joakim Nivre. 2008. Algorithms for determinis-
tic incremental dependency parsing. Computa-
tional Linguistics, 34(4):513–553.
Joakim Nivre, Željko Agi´c, Maria Jesus Aranzabe,
Masayuki Asahara, Aitziber Atutxa, Miguel
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
t
l
a
c
_
a
_
0
0
2
4
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Ballesteros, John Bauer, Kepa Bengoetxea,
Riyaz Ahmad Bhat, Cristina Bosco, Sam Bow-
man, Giuseppe G. A. Celano, Miriam Connor,
Marie-Catherine de Marneffe, Arantza Diaz de
Ilarraza, Kaja Dobrovoljc, Timothy Dozat,
Tomaž Erjavec, Richárd Farkas, Jennifer Foster,
Daniel Galbraith, Filip Ginter, Iakes Goenaga,
Koldo Gojenola, Yoav Goldberg, Berta Gon-
zales, Bruno Guillaume, Jan Hajiˇc, Dag Haug,
Radu Ion, Elena Irimia, Anders Johannsen, Hi-
roshi Kanayama, Jenna Kanerva, Simon Krek,
Veronika Laippala, Alessandro Lenci, Nikola
Ljubeši´c, Teresa Lynn, Christopher Manning,
Cˇatˇalina Mˇarˇanduc, David Mareˇcek, Héctor
Martínez Alonso, Jan Mašek, Yuji Matsumoto,
Ryan McDonald, Anna Missilä, Verginica Mi-
titelu, Yusuke Miyao, Simonetta Montemagni,
Shunsuke Mori, Hanna Nurmi, Petya Osenova,
Lilja Øvrelid, Elena Pascual, Marco Passarotti,
Cenel-Augusto Perez, Slav Petrov, Jussi Piitu-
lainen, Barbara Plank, Martin Popel, Prokopis
Prokopidis, Sampo Pyysalo, Loganathan Ra-
masamy, Rudolf Rosa, Shadi Saleh, Sebastian
Schuster, Wolfgang Seeker, Mojgan Seraji, Na-
talia Silveira, Maria Simi, Radu Simionescu,
Katalin Simkó, Kiril Simov, Aaron Smith, Jan
Štˇepánek, Alane Suhr, Zsolt Szántó, Takaaki
Tanaka, Reut Tsarfaty, Sumire Uematsu, Lar-
raitz Uria, Viktor Varga, Veronika Vincze,
Zdenˇek Žabokrtský, Daniel Zeman, and Hanzhi
Zhu. 2015. Universal dependencies 1.2. LIN-
DAT/CLARIN digital library at the Institute of
Formal and Applied Linguistics, Charles Uni-
versity in Prague.
Hiroshi Noji, Yusuke Miyao, and Mark Johnson.
2016. Using left-corner parsing to encode uni-
versal structural constraints in grammar induc-
tion. In Proceedings of the 2016 Conference on
Empirical Methods in Natural Language Pro-
cessing, pages 33–43.
Lawrence Page, Sergey Brin, Rajeev Motwani,
and Terry Winograd. 1999. The PageRank cita-
tion ranking: Bringing order to the Web. Tech-
nical Report 1999-66, Stanford InfoLab.
Mohammad Sadegh Rasooli and Michael Collins.
2015. Density-driven cross-lingual transfer of
dependency parsers. In Proceedings of the 2015
Conference on Empirical Methods in Natural
Language Processing, pages 328–338.
Mohammad Sadegh Rasooli and Michael Collins.
2017. Cross-lingual syntactic transfer with lim-
ited resources. Transactions of the Association
for Computational Linguistics, 5:279–293.
Rudolf Rosa and Zdenˇek Žabokrtský. 2015a.
KLcpos3 – a language similarity measure for
delexicalized parser transfer. In Proceedings of
the 53rd Annual Meeting of the Association for
Computational Linguistics and the 7th Interna-
tional Joint Conference on Natural Language
Processing, pages 243–249.
Rudolf Rosa and Zdenˇek Žabokrtský. 2015b.
Mstparser model interpolation for multi-source
In Proceedings of the
delexicalized transfer.
14th International Conference on Parsing Tech-
nologies, pages 71–75.
Sebastian Ruder, Ivan Vuli´c, and Anders Søgaard.
2017. A survey of cross-lingual word embed-
ding models. Computing Research Repository,
arXiv:1706.04902.
David A. Smith and Jason Eisner. 2009.
Parser adaptation and projection with quasi-
synchronous grammar features. In Proceedings
of the Conference on Empirical Methods in
Natural Language Processing, pages 822–831.
Noah A. Smith. 2006. Novel Estimation Methods
for Unsupervised Discovery of Latent Structure
in Natural Language Text. Ph.D. thesis, Johns
Hopkins University.
Noah A. Smith and Jason Eisner. 2005. Guid-
ing unsupervised grammar induction using con-
trastive estimation. In International Joint Con-
ference on Artificial Intelligence (IJCAI) Work-
shop on Grammatical Inference Applications,
pages 73–82.
Mark A. Paskin. 2001. Cubic-time parsing and
learning algorithms for grammatical bigram.
Mark A. Paskin. 2002. Grammatical bigrams.
In Advances in Neural Information Processing
Systems, pages 91–97.
Noah A. Smith and Jason Eisner. 2006. Annealing
structural bias in multilingual weighted gram-
mar induction. In Proceedings of the Interna-
tional Conference on Computational Linguis-
tics and the Association for Computational Lin-
guistics, pages 569–576.
684
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
t
l
a
c
_
a
_
0
0
2
4
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Valentin I. Spitkovsky, Hiyan Alshawi, and
Daniel Jurafsky. 2012. Three dependency-and-
boundary models for grammar induction.
In
Proceedings of the 2012 Joint Conference on
Empirical Methods in Natural Language Pro-
cessing and Computational Natural Language
Learning, pages 688–698.
Dingquan Wang and Jason Eisner. 2016. The
treebanks: Getting
Galactic Dependencies
more data by synthesizing new languages.
Transactions of the Association of Computa-
tional Linguistics, 4:491–505. Data available
at https://github.com/gdtreebank/
gdtreebank.
Dingquan Wang and Jason Eisner. 2017. Fine-
grained prediction of syntactic typology: Dis-
covering latent structure with supervised learn-
ing. Transactions of the Association for Com-
putational Linguistics, 5:147–161.
Dingquan Wang and Jason Eisner. 2018. Syn-
thetic data made to order: The case of parsing.
In Proceedings of the Conference on Empiri-
cal Methods in Natural Language Processing
(EMNLP), pages 1325–1337, Brussels.
Qin Iris Wang, Dale Schuurmans, and Dekang
Lin. 2008. Semi-supervised convex training for
dependency parsing. In Proceedings of the 46th
Annual Meeting of Annual Meeting of the Asso-
ciation for Computational Linguistics and the
2008 Conference of the North American Chap-
ter of the Association for Computational Lin-
guistics: Human Language Technologies, pages
532–540.
David Yarowsky, Grace Ngai, and Richard Wicen-
towski. 2001. Inducing multilingual text anal-
ysis tools via robust projection across aligned
In Proceedings of the First Interna-
corpora.
tional Conference on Human Language Tech-
nology Research.
Deniz Yuret. 1998. Discovery of Linguistic Re-
lations Using Lexical Attraction. Ph.D. thesis,
Massachusetts Institute of Technology.
Yuan Zhang and Regina Barzilay. 2015. Hierar-
chical low-rank tensors for multilingual trans-
In Proceedings of the 2015 Con-
fer parsing.
ference on Empirical Methods in Natural Lan-
guage Processing, pages 1857–1867.
Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel
Jurafsky. 2013. Breaking out of local optima
with count transforms and model recombina-
In Pro-
tion: A study in grammar induction.
ceedings of the 2013 Conference on Empiri-
cal Methods in Natural Language Processing,
pages 1983–1995.
Oscar Täckström, Dipanjan Das, Slav Petrov,
Ryan McDonald, and Joakim Nivre. 2013a. To-
ken and type constraints for cross-lingual part-
of-speech tagging. Transactions of the Associa-
tion for Computational Linguistics, 1:1–12.
Oscar Täckström, Ryan McDonald, and Joakim
Nivre. 2013b. Target language adaptation of
discriminative transfer parsers. In Proceedings
of the 2013 Conference of the North Ameri-
can Chapter of the Association for Computa-
tional Linguistics: Human Language Technolo-
gies, pages 1061–1071.
Ben Taskar, Vassil Chatalbashev, Daphne Koller,
and Carlos Guestrin. 2005. Learning structured
prediction models: A large margin approach. In
Proceedings of the 22nd International Confer-
ence on Machine Learning, pages 896–903.
Ben Taskar, Dan Klein, Michael Collins, Daphne
Koller, and Christopher Manning. 2004. Max-
In Proceedings of the 2004
margin parsing.
Conference on Empirical Methods in Natural
Language Processing.
Jörg Tiedemann. 2014. Rediscovering annotation
projection for cross-lingual parser induction. In
Proceedings of COLING 2014, the 25th Inter-
national Conference on Computational Linguis-
tics: Technical Papers, pages 1854–1864.
Jörg Tiedemann, Željko Agi´c, and Joakim Nivre.
2014. Treebank translation for cross-lingual
In Proceedings of the Eigh-
parser induction.
teenth Conference on Computational Natural
Language Learning, pages 130–140.
685
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
2
4
8
1
5
6
7
6
7
2
/
/
t
l
a
c
_
a
_
0
0
2
4
8
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3