What Should/Do/Can LSTMs Learn
When Parsing Auxiliary Verb Constructions?
Miryam de Lhoneux∗
Department of Computer Science
University of Copenhagen
ml@di.ku.dk
Sara Stymne
Department of Linguistics and Philology
Uppsala University
sara.stymne@lingfil.uu.se
Joakim Nivré
Department of Linguistics and Philology
Uppsala University
joakim.nivre@lingfil.uu.se
There is a growing interest in investigating what neural NLP models learn about language. A
prominent open question is the question of whether or not it is necessary to model hierarchical
structure. We present a linguistic investigation of a neural parser adding insights to this
question. We look at transitivity and agreement information of auxiliary verb constructions
(AVCs) in comparison to finite main verbs (FMVs). This comparison is motivated by theoretical
work in dependency grammar and in particular the work of Tesnière (1959), where AVCs and
FMVs are both instances of a nucleus, the basic unit of syntax. An AVC is a dissociated nucleus;
it consists of at least two words, and an FMV is its non-dissociated counterpart, consisting of
exactly one word. We suggest that the representation of AVCs and FMVs should capture similar
information. We use diagnostic classifiers to probe agreement and transitivity information in
vectors learned by a transition-based neural parser in four typologically different languages. We
find that the parser learns different information about AVCs and FMVs if only sequential models
(BiLSTMs) are used in the architecture but similar information when a recursive layer is used.
We find explanations for why this is the case by looking closely at how information is learned
in the network and looking at what happens with different dependency representations of AVCs.
We conclude that there may be benefits to using a recursive layer in dependency parsing and that
we have not yet found the best way to integrate it in our parsers.
∗ Work carried out while at Uppsala University.
Submission received: 18 July 2019; revised version received: 19 May 2020; accepted for publication:
3 October 2020.
https://doi.org/10.1162/coli_a_00392
© 2020 Association for Computational Linguistics
Published under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
(CC BY-NC-ND 4.0) license
1. Introduction
In the past few years, the interest in investigating what neural models learn about
language has been growing. This can be interesting from both a machine learning perspective,
to better understand how our models work, and from a linguistic perspective,
to find out what aspects of linguistic theories are important to model. A popular method
is the use of diagnostic classifiers (Hupkes, Veldhoen, and Zuidema 2018), where the idea
is to probe whether or not a model trained for a source task (for example, language
modeling) learns a target task (for example, subject-verb agreement) as a byproduct of
learning the source task. This is done by training vectors on the source task, freezing
these vectors, and using them to train a classifier on a target task. If we can successfully
train that classifier, we have some indication that the target task has been learned as a
byproduct of learning the source task. In this article, we use this method to investigate
whether or not a specific aspect of linguistic theory is learned, the notion of dissociated
nucleus from dependency grammar (explained in §2.1), as a byproduct of learning the
task of dependency parsing for several languages. For this, we focus on auxiliary verb
constructions (AVCs).
A prominent question in neural modeling of syntax is whether or not it is neces-
sary to model hierarchical structure. Sequential models (long short-term memory net-
works [LSTMs]) have shown surprising capability for learning syntactic tasks (Linzen,
Dupoux, and Goldberg 2016; Gulordava et al. 2018), and models of dependency parsing
using sequential models are very accurate (Kiperwasser and Goldberg 2016). Although
recursive neural networks surpass the abilities of sequential models for learning syn-
tactic tasks (Kuncoro et al. 2018), their use in dependency parsing seems superfluous
compared to using sequential models when looking at parsing accuracy (de Lhoneux,
Ballesteros, and Nivre 2019). However, there may be benefits to using recursive neural
networks in parsing that are not reflected in parsing accuracy. In particular, for reasons
outlined in §2.2, they may be useful when it comes to learning the notion of dissociated
núcleo. In this article, we evaluate the usefulness of recursive neural networks when it
comes to learning the notion of dissociated nucleus.
The goals of this article are thus threefold. First, we look at dependency grammar
to theoretically motivate that our models should learn the notion of dissociated nucleus.
Second, we develop a method using diagnostic classifiers to test whether or not our
models do learn this notion. Third, we investigate the role that recursive neural networks
play in learning this notion and test whether or not our models can learn this notion
when augmented with a recursive layer.
2. Background and Research Questions
2.1 AVCs in Dependency Grammar and Dependency Parsing
Dependency parsing has gained popularity in the last 15 years, drawing ideas from de-
pendency grammar but diverging from it in important ways. Research on dependency
parsing has relied on a definition of dependency trees where the basic units of syntax are
words and the relations that hold between words in a sentence are binary asymmetric
relations. In most dependency grammar theories, representations are considerably more
complex, often consisting of multiple strata, as in Meaning-Text Theory (Mel'čuk 1988)
and Functional Generative Description (Sgall, Hajičová, and Panevová 1986). In the
seminal work of Tesnière (1959), a single level of representation is used but the basic
unit of syntax is not the word but the more abstract notion of nucleus. Nuclei often
Figure 1
Two different representations of a sentence with an auxiliary as used in dependency parsing (left) vs. as it can be represented following Tesnière (1959) (right).
correspond to individual words but sometimes correspond to several words, typically
a content word together with one or more function words, which are said to constitute
a dissociated nucleus. The internal elements of a dissociated nucleus are connected by
transfer relations, while the nuclei themselves are connected by dependency relations
or (in the case of coordination) by junction relations.
In the dependency parsing literature, a sentence with an auxiliary is usually rep-
resented as either the top or the bottom tree in the left part of Figure 1, with either
the auxiliary or the main verb being dependent on the other. If we follow the ideas
from Tesnière (1959), it can be represented as in the right part of Figure 1, where the
auxiliary and main verb are connected by a transfer relation to form a nucleus. This
nucleus is itself connected to the other words/nuclei in the sentence. In this example,
the word that corresponds to a nucleus and the words could and work are each
part of a dissociated nucleus. In this sense, the definition of dependency trees that is
used in dependency parsing is a simplification compared to the representations used
in Tesnière's dependency grammar. We are losing the information that the relation
between could and work is a different type of relation than the relation between that
and work.1 We are forced to choose a head between the main verb and the auxiliary,
even though the main verb and auxiliary share head properties: Inflectional verbal
features like agreement, tense, aspect, mood, Etcétera, are typically encoded in
the auxiliary whereas lexical features like valency are properties of the main verb.
As Williams, Drozdov, and Bowman (2018) have shown, a network that learns latent
trees as part of a downstream task does not necessarily learn trees that correspond to
our linguistic intuitions. This means that teaching our models the difference between
transfer and dependency relations will not necessarily make a difference when it comes
to downstream tasks. Sin embargo, it is still informative to find out whether we can learn
a representation that is linguistically motivated, to understand better how our models
represent certain linguistic phenomena. We can subsequently investigate if learning this
type of representation is useful for downstream tasks—we leave this to future work. In
turn, this could also inform linguistic theory, by finding out whether this notion is relevant
or not for practical language technology.
Universal Dependencies (UD) (Nivre et al. 2016) is a project that seeks to harmonize
the annotation of dependency treebanks across languages. Having such harmonized
annotations makes it easier to incorporate linguistic information into parsing models
while remaining language-independent. In particular, as described by Nivre (2015), UD
adopts an analysis of language where function words attach to content words. This analysis,
he argues, can be interpreted as a dissociated nucleus, as defined by Tesnière (1959).
1 Note that the labels disambiguate these cases but there are more labels than these two and labels do not
encode information about whether they are dependency or transfer relations.
However, this notion has not been made explicit when training parsers. In pre-neural
transition-based parsers as in Nivre, Hall, and Nilsson (2006), when a dependent gets
attached to its head, features of the head are still used for further parsing but features
of the dependent are usually discarded.2
In neural parsers, it is less clear what information is used by parsers. Current state-
of-the-art models use (Bi)LSTMs (Dyer et al. 2015; Kiperwasser and Goldberg 2016;
Dozat and Manning 2017), and LSTMs make it possible to encode information about
the surrounding context of words in an unbounded window (which is usually limited
to a sentence in practice). In this article, we take a step in finding out what neural parsers
learn by testing if they capture the notion of dissociated nuclei. We do this by looking
in detail at what a BiLSTM-based parser learns about a specific type of dissociated
núcleo: auxiliary verb constructions. We focus on AVCs as they are a typical example
of dissociated nucleus and are well attested typologically; see, for example, Anderson
(2011). We focus on four different languages, for reasons explained in §3.
We can use diagnostic classifiers to look at whether information like valency, agree-
ment, tense, mood, and so on, is encoded in a vector representing a word or a subtree
in parsing. This allows us to specify how to test whether or not our parser learns the
notion of dissociated nucleus. We ask the following question: When making a parsing
decision about a nucleus, does the parser have access to similar information regardless
of whether the nucleus is dissociated or not? For the case of AVCs, this means that
the parser should learn similar information about AVCs as it does about their non-
dissociated counterpart: simple finite main verbs (henceforth FMV). An example FMV
is did in the sentence I did that.
2.2 Recursive vs. Recurrent Neural Networks
LSTMs are sequential models and therefore do not explicitly model hierarchical struc-
tura. Dyer et al. (2015) have shown that using a recursive layer on top of an LSTM
is useful when learning a parsing model. This recursive layer is used to compose the
representation of subtrees. Sin embargo, Kiperwasser and Goldberg (2016) have more re-
cently obtained parsing results on par with the results from Dyer et al. (2015) using only
BiLSTMs. Moreover, recent work has claimed that LSTMs are capable of learning
hierarchical structure (Linzen, Dupoux, and Goldberg 2016; Enguehard, Goldberg, and
Linzen 2017; Gulordava et al. 2018; Blevins, Levy, and Zettlemoyer 2018). This indicates
that a BiLSTM might be sufficient to capture the hierarchical structure necessary in
parsing. However, Kuncoro et al. (2018) have also shown that although sequential
LSTMs can learn syntactic information, a recursive neural network that explicitly mod-
els hierarchy (the Recurrent Neural Network Grammar model from Dyer et al. [2015])
is better at this: It performs better on the number agreement task from Linzen, Dupoux,
and Goldberg (2016). Moreover, Ravfogel, Goldberg, and Tyers (2018) and Ravfogel,
Goldberg, and Linzen (2019) have cast some doubts on the results by Linzen, Dupoux,
and Goldberg (2016) and Gulordava et al. (2018) by looking at Basque and synthetic
languages with different word orders, respectively, in the two studies.
Motivated by these findings, we recently investigated the impact of adding a re-
cursive layer on top of a BiLSTM-based parser in de Lhoneux, Ballesteros, and Nivre
(2019) and found that this recursive layer is superfluous in that parsing model when
we look at parsing accuracy. This indicates that BiLSTM parsers capture information
2 Although features of the dependent can be used as features of the head.
about subtrees, but it is also possible that the advantages and disadvantages of using a
recursive layer cancel each other out in the context of our parser and that the advantages
of using it are not reflected in parsing accuracy. A recursive layer might still be useful
when it comes to learning the notion of dissociated nucleus. As a matter of fact, it might
make sense to use recursive composition to model relations of transfer and not relations
of dependency in the sense of Tesnière (1959).
2.3 Research Questions
We use diagnostic classifiers to probe what information is encoded in vectors represent-
ing AVCs learned by a dependency parser. We are interested in finding out whether
or not a parser learns similar information about AVCs as it learns about their non-
dissociated counterpart, FMVs. In other words, if a parser learns the notion of dissociated
nucleus, we expect it to have information about agreement, tense, aspect, and
mood, as well as valency, encoded in subtree representations of AVCs to the same extent
as it is encoded in vectors representing FMVs.
With UD treebanks, it is straightforward to design tasks that probe transitivity in
AVCs and FMVs: We can look at objects and indirect objects of the main verb. It is
also straightforward to design tasks that probe agreement of AVCs and FMVs: We can
use the morphological features that encode information about the subject’s number and
persona. It is less straightforward to design tasks that probe information about tense,
mood, and aspect (TMA) because that would require annotation of the verb phrases,
since the morphological features of individual verbs do not give enough information.
Por ejemplo, in the AVC has been, the tense feature for has is present and for been it is past
but no feature indicates that the AVC is in the present perfect.3 We therefore leave TMA
features to future work and instead only use agreement and transitivity tasks. We look
at whether or not subtrees representing AVCs encode the same information about these
tasks as FMVs. Note that there is a difference between transitivity as a lexical property
of a verb and transitivity of a specific clause: Some verbs can be used both transitively
and intransitively (Aikhenvald and Dixon 2000). For practical reasons, we only consider
transitivity as a property of a clause, rather than as a lexical property of a verb.
An assumption underlying our question is that agreement and transitivity infor-
mation is learned by the parser and, specifically, that it is available to the parser
when making decisions about main verbs. Johnson (2011) found that adding subject-
agreement features in the Charniak parser did not improve accuracy but explained this
result by the fact that this information was already present in POS tags. This information
is not present in POS tags in UD4 and it seems reasonable to assume that agreement
information is useful for parsing, which we want to test.
Our research questions can therefore be formulated as follows, with the first being
a pre-condition for the others:
RQ1 Is information about agreement and transitivity learned by the parser?
RQ2 Does a sequential NN-based dependency parser learn the notion of dissociated nucleus?
3 See https://universaldependencies.org/u/feat/Tense.html.
4 And we are not using POS tags in our parser anyway.
RQ3 Does a dependency parser augmented with a recursive layer learn the notion of dissociated nucleus?
How we go about answering these questions will be explained in more detail in §3.3.
3. Experimental Set-up
3.1 Data
Treebank Selection. We select UD treebanks according to the criteria from de Lhoneux,
Stymne, and Nivre (2017b) that are relevant for this task:
• Typological variety
• Variety of domains
• High annotation quality
We add criteria specific to our problem:
• A minimum number of AVCs (at least 4,000)
• High annotation quality of AVCs
• Availability of the information we need for the prediction tasks
Tests of treebank quality are available on the UD homepage, and we can look at the
quality of auxiliary chains to know if AVCs are well annotated.5 The information
we need for the prediction tasks are: the presence of Verbform=Fin in morphological
características, so as to collect FMVs, and the presence of the Number and Person features for
the agreement task.
These added criteria make the selection more difficult, as this discards many of
the small treebanks and makes it difficult to keep typological variety, since the bigger
treebanks come from the same language families and we want to avoid having results
that are biased in terms of typological properties. We select Catalan AnCora (ca), Croa-
tian SET (hr), Dutch Alpino (nl), and Finnish TDT (fi). Table 1 summarizes the data used.
We use UD version 2.2 (Nivre et al. 2018). Note that for Catalan, we use a list of lemmas
from the UD documentation to filter out noisy cases of auxiliary dependency relations,
which are numerous.
Creating the Data Set. We are mostly interested in what happens in parsing with the
UD representation of AVCs because we believe it is a sound representation, as argued
in the introduction. This is because function words attach to content words, and this
is compatible with an interpretation where these relations are part of a dissociated
nucleus. However, it is also informative to look at what happens with a representation
where auxiliaries are the head of AVCs. We do this in order to find out whether the
representation of the AVC subtree differs depending on which element is its head, since
elements of the AVC share head properties. We therefore also consider the representa-
tion described in Mel'čuk (1988) (Mel'čuk style, henceforth MS). We use the method in
5 http://universaldependencies.org/svalidation.html.
Table 1
Data set sizes for the transitivity (T) and agreement (A) tasks for finite verbs and AVCs (FMV and AVC) as well as punctuation items.

            FMV              punct            AVC
            train    dev     train    dev     train    dev
T    ca     14k      2k      7k       964     12k      2k
     fi     12k      1k      9k       1k      4k       458
     hr     6k       803     4k       491     5k       653
     nl     9k       618     6k       516     5k       251
A    ca     14k      2k      7k       964     12k      2k
     fi     10k      1k      8k       850     4k       443
     hr     6k       803     4k       491     5k       653
     nl     9k       618     6k       516     5k       246
Figure 2
Finite main verb in a UD tree.
de Lhoneux and Nivre (2016) to transform the data sets from UD to MS.6 An example
of AVC as represented in UD is given in the top part of Figure 3 and its transformed
representation into MS is given in the bottom part of that figure.
Collecting FMVs and AVCs in UD. Collecting FMVs such as in Figure 2 in UD treebanks
is straightforward: Verbs are annotated with a feature called VerbForm which has the
value Fin if the verb is finite. We find candidates using the feature VerbForm=Fin and
only keep those that are not dependents of copula or auxiliary dependency relations to
make sure they are not part of a larger verbal construction.
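A minimal sketch of this filter, assuming the CoNLL-U files have already been read into lists of token dicts with 'id', 'head', 'deprel', and a 'feats' dict (these field names, and the exact filtering condition, are illustrative; the released code may differ):

def collect_fmvs(sentence):
    """Return the ids of simple finite main verbs in one sentence.
    A candidate carries VerbForm=Fin and is not involved in an aux or cop
    relation (our reading of 'not part of a larger verbal construction')."""
    def base(deprel):
        return deprel.split(":")[0]
    verbal_rels = {"aux", "cop"}
    # ids of tokens that have an aux or cop dependent
    heads_of_verbal = {tok["head"] for tok in sentence
                       if base(tok["deprel"]) in verbal_rels}
    fmvs = []
    for tok in sentence:
        feats = tok.get("feats") or {}
        if feats.get("VerbForm") != "Fin":
            continue                              # must be a finite verb
        if base(tok["deprel"]) in verbal_rels:
            continue                              # is itself an auxiliary or copula
        if tok["id"] in heads_of_verbal:
            continue                              # heads a larger verbal construction
        fmvs.append(tok["id"])
    return fmvs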
Collecting AVCs like the one in Figure 3 is slightly more involved. We do this in
the same way as de Lhoneux and Nivre (2016). In UD (top of the figure), an AVC
vi has a main verb mvi (done in the example) and a set of auxiliaries AUXi with at
least one element (could and have in the example). AVCs are collected by traversing
the sentence from left to right, looking at auxiliary dependency relations and collecting
information about the AVCs that these relations are a part of. An auxiliary dependency
relation waux ←aux− wmv is a relation where the main verb is the head and the auxiliary is
the dependent (and the two verbs may occur in any order). Only auxiliary dependency
relations between two verbal forms are considered. This allows us to filter out cases
where a noun is the head of an auxiliary dependency relation and to make sure we have
a main verb. We maintain a dictionary of AVC main verbs. When we find an auxiliary
dependency relation, we add the dependent to the set of auxiliaries AUXi of the AVC
whose main verb mvi is the head of that dependency relation.
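A minimal sketch of this collection step, using the same illustrative token-dict format as above (the handling of subtyped relations such as aux:pass is our assumption; the released code may differ):

def collect_avcs_ud(sentence):
    """Map each AVC main verb id to the set of its auxiliary ids."""
    by_id = {tok["id"]: tok for tok in sentence}
    avcs = {}                                     # main verb id -> set of auxiliary ids
    for tok in sentence:                          # left-to-right traversal
        if tok["deprel"].split(":")[0] != "aux":
            continue                              # not an auxiliary dependency relation
        head = by_id.get(tok["head"])
        # only consider aux relations between two verbal forms
        if head is None or head["upos"] not in ("VERB", "AUX") \
                or tok["upos"] not in ("VERB", "AUX"):
            continue
        avcs.setdefault(head["id"], set()).add(tok["id"])
    return avcs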
6 With a slight modification: In that method, we discarded AVCs if they were in the passive voice; here we do not.
Figure 3
Example sentence with an AVC annotated in UD (top), and in MS (bottom). AVC subtree in thick blue.
Collecting AVCs in MS. To collect AVCs in MS such as in the bottom of Figure 3, we
also scan the sentence left to right, looking for auxiliary dependency relations, and we
maintain a list of auxiliaries which are part of AVCs of the sentence. When we find an
auxiliary dependency relation, if its dependent is not in the list of auxiliaries already
processed, we follow the chain of heads until we find an auxiliary that is not itself the
dependent of an auxiliary relation. We then follow the chain of dependents until we
find a node which is not the head of an auxiliary dependency relation, which is the main
verb. While recursing the auxiliary chain, we add each head of an auxiliary dependency
relation to the list of auxiliaries for the sentence.
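A minimal sketch of the same procedure for MS trees, in the illustrative token-dict format used above; marking the final main verb as processed, so that a chain is not collected twice, is our addition:

def collect_avcs_ms(sentence):
    """Collect AVC chains in an MS tree, where the outermost auxiliary (MAUX)
    heads a chain of aux relations ending in the main verb."""
    by_id = {tok["id"]: tok for tok in sentence}
    def is_aux(tok):
        return tok["deprel"].split(":")[0] == "aux"
    # head id -> its aux dependent (at most one per head in a well-formed chain)
    aux_child = {tok["head"]: tok for tok in sentence if is_aux(tok)}
    processed, avcs = set(), []
    for tok in sentence:                          # left-to-right scan
        if not is_aux(tok) or tok["id"] in processed:
            continue
        # follow the chain of heads up to the outermost auxiliary (MAUX)
        maux = by_id[tok["head"]]
        while is_aux(maux):
            maux = by_id[maux["head"]]
        # follow the chain of aux dependents down to the main verb,
        # recording every head of an aux relation as an auxiliary
        auxiliaries, node = [], maux
        while node["id"] in aux_child:
            auxiliaries.append(node["id"])
            processed.add(node["id"])
            node = aux_child[node["id"]]
        processed.add(node["id"])                 # the main verb, to avoid re-collection
        avcs.append({"maux": maux["id"], "main_verb": node["id"],
                     "auxiliaries": auxiliaries})
    return avcs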
Tasks. When we have our set of FMVs and AVCs, we can create our task data sets. The
transitivity task is a binary decision of whether the main verb has an object or not. This
information can be obtained by looking at whether or not the main verb has an obj
dependent. In UD, a verb can have only one such dependent.7
For the agreement task, we look at the morphological features of the verbs (the FMV
or the auxiliary in the case of AVCs) and concatenate the features Person and Number. The
possible values are therefore all possible combinations of 1st, 2nd, and 3rd person with
plural and singular. There are cases where this information is not available, in which
case the agreement task is undefined for the AVC.
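A minimal sketch of the label extraction, again over the illustrative token-dict format; the label encodings (a boolean, and a Person|Number string) are our choice:

def transitivity_label(sentence, main_verb_id):
    """Binary transitivity: does the main verb have an obj dependent?"""
    return any(tok["head"] == main_verb_id and tok["deprel"].split(":")[0] == "obj"
               for tok in sentence)

def agreement_label(verb_token):
    """Concatenation of the Person and Number features of the finite verb
    (the FMV, or the auxiliary in the case of an AVC), e.g. '3|Sing';
    None if either feature is missing, i.e. the task is undefined."""
    feats = verb_token.get("feats") or {}
    person, number = feats.get("Person"), feats.get("Number")
    if person is None or number is None:
        return None
    return person + "|" + number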
The code to reproduce our experiments is available at https://github.com
/mdelhoneux/avc analyser, including the modifications we made to the parser to
freeze the vector representations at the different layers in the network.
3.2 BiLSTM-based Parsing
We use UUParser, a greedy transition-based parser (Nivre 2008) based on the framework
of Kiperwasser and Goldberg (2016) where BiLSTMs (Hochreiter and Schmidhuber
7 We experimented with a harder task: predicting the number of objects. In the example in Figure 3, that
number would be 1. In the case of intransitive use of verbs, it would be 0, and with a ditransitive use of a
verb, it would be 2. We observed the same trends and therefore do not report these results.
1997; Graves 2008) learn representations of tokens in context, and are trained together
with a multilayer perceptron that predicts transitions and arc labels based on a few
BiLSTM vectors. Our parser uses the arc-hybrid transition system from Kuhlmann,
Gómez-Rodríguez, and Satta (2011) and is extended with a SWAP transition to allow
the construction of non-projective dependency trees (Nivre 2009). We also introduce a
static-dynamic oracle to allow the parser to learn from non-optimal configurations at
training time in order to recover better from mistakes at test time (de Lhoneux, Stymne,
and Nivre 2017a).
For an input sentence of length n with words w1, . . . , wn, the parser creates a
sequence of vectors x1:n, where the vector xi representing wi is the concatenation of a
randomly initialized word embedding e(wi) and a character vector. The character vector
is obtained by running a BiLSTM over the characters chj (1 ≤ j ≤ m) of wi. Finally, each
input element is represented by a BiLSTM vector, vi:
xi = [e(wi); BILSTM(ch1:m)] (1)
vi = BILSTM(x1:n, i) (2)
The parser therefore learns representations at a type level that consists of two parts: (1)
an embedding of the word type that represents its use in the corpus (e(wi)), and (2) a
character vector representing the sequence of characters of the word type. It also learns
a representation of the word at the token level, in the context of the sentence (vi). We
will refer to these as type, personaje, and token vectors, respectivamente.
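The following is an illustrative re-sketch of Equations (1) and (2) in PyTorch, not the parser's own code; the dimensions are those given in Section 3.4, while the remaining names and values are placeholders:

import torch
import torch.nn as nn

class TokenEncoder(nn.Module):
    """Sketch of Equations (1)-(2): a word type embedding concatenated with a
    character BiLSTM vector, fed to a sentence-level BiLSTM."""
    def __init__(self, n_words, n_chars, wdim=100, cdim=24, chidden=25, thidden=125):
        super().__init__()
        self.wemb = nn.Embedding(n_words, wdim)            # e(w_i), randomly initialized
        self.cemb = nn.Embedding(n_chars, cdim)
        self.char_lstm = nn.LSTM(cdim, chidden, bidirectional=True, batch_first=True)
        self.tok_lstm = nn.LSTM(wdim + 2 * chidden, thidden,
                                bidirectional=True, batch_first=True)

    def forward(self, word_ids, chars_per_word):
        # character vector: final forward and backward states over the word's characters
        char_vecs = []
        for chars in chars_per_word:                       # one LongTensor of char ids per word
            _, (h, _) = self.char_lstm(self.cemb(chars).unsqueeze(0))
            char_vecs.append(torch.cat([h[0, 0], h[1, 0]]))  # 2 * 25 = 50 dims
        x = torch.cat([self.wemb(word_ids), torch.stack(char_vecs)], dim=-1)  # x_i, Eq. (1)
        v, _ = self.tok_lstm(x.unsqueeze(0))               # v_i, Eq. (2), 250 dims
        return v.squeeze(0)                                # one token vector per word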
As is usual in transition-based parsing, the parser makes use of a configuration
that consists of a stack, a buffer, and a set of arcs. The configuration c is represented by a
feature function φ(·) over a subset of its elements; and for each configuration, transitions
are scored by a classifier. In this case, the classifier is a multilayer perceptron (MLP)
and φ(·) is a concatenation of BiLSTM vectors on top of the stack and the beginning of
the buffer. The MLP scores transitions together with the arc labels for transitions that
involve adding an arc. Both the embeddings and the BiLSTMs are trained together with
el modelo. Por simplicidad, we only use the 2 top items of the stack and the first item of
the buffer, as they are the tokens that may be involved in a transition in this transition
sistema.
AVC Subtree Vector. As explained earlier, we are interested in finding out whether
or not an LSTM trained with a parsing objective can learn the notion of dissociated
nucleus as well as whether a recursive composition function can help to learn this. The
head of an AVC in UD is a non-finite main verb, which we will refer to as NFMV.
The head of an AVC in MS is the outermost auxiliary, which we will refer to as the
main auxiliary MAUX. We therefore look at NFMV and MAUX token vectors for the
respective representation schemes and consider two definitions of these. In one, we use
the BiLSTM encoding of the main verb token vi. In the other, we construct a subtree
vector ci by recursively composing the representation of AVCs as auxiliaries get attached
to their main verb. When training the parser, we concatenate this composed vector to
a vector of the head of the subtree to form vi. This makes little difference in parsing
accuracy (see Table 2).
As in de Lhoneux, Ballesteros, and Nivre (2019), we follow Dyer et al. (2015) in
defining the composition function. The composed representation ci is built by con-
catenating the token vector vh of the head with the vector of the dependent vd being
Table 2
LAS results of the baseline (bas) and recursive (rec) parser with UD and MS representations.

          ca      fi      hr      nl      av
ud bas    87.8    78.9    81.0    84.1    83.0
ud rec    87.7    78.9    80.8    83.7    82.8
ms bas    87.8    78.6    80.5    84.1    82.7
ms rec    87.8    78.7    80.6    83.7    82.7
attached, as well as a vector r representing the label used and the direction of the arc; see
Equation 4. (In our case, since we are only composing the subtrees of AVCs, r can only
have two values: left-aux and right-aux.) That concatenated vector is passed through an
affine transformation and then a (tanh) nonlinear activation. Initially, ci is just a copy of
the token vector of the word (BILSTM(x1:n, i)).
vi = [BILSTM(x1:n, i); ci] (3)
ci = tanh(W[vh; vd; r] + b) (4)
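A PyTorch sketch in the spirit of Equation (4); the relation-embedding dimension is a placeholder, and c_head stands for the head's current composed vector (initially its BiLSTM token vector), which is our reading of how the recursion proceeds:

import torch
import torch.nn as nn

class AVCComposer(nn.Module):
    """Sketch of Equation (4): compose a head vector with the vector of the
    dependent being attached and a label/direction embedding."""
    def __init__(self, tdim=250, rdim=25):
        super().__init__()
        self.rel = nn.Embedding(2, rdim)          # r: left-aux or right-aux
        self.W = nn.Linear(2 * tdim + rdim, tdim) # affine transformation (W, b)

    def forward(self, c_head, v_dep, rel_id):
        r = self.rel(rel_id)                      # rel_id: 0-dim LongTensor (0 or 1)
        return torch.tanh(self.W(torch.cat([c_head, v_dep, r], dim=-1)))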
For our prediction experiments using this recursive composition function, we only
investigate what is encoded in ci. We call this experimental setup the recursive setup.
We refer to this recursive composition function as composition in the remainder of this
artículo. Note that in de Lhoneux, Ballesteros, and Nivre (2019), we used two different
composition functions, one using a simple recurrent cell and one using an LSTM cell.
We saw that the one using an LSTM cell performed better. However, in this set of
experiments, we only do recursive compositions over a limited part of the subtree: only
between auxiliaries and NFMVs. This means that the LSTM would only pass through
two states in most cases, and at most four.8 This does not allow us to learn proper
weights for the input, output, and forget gates. An RNN seems more appropriate here
and we only use that.
3.3 Research Questions Revisited
Now that we have introduced all experimental variables, we can explain in more detail
than before how we can go about answering our main research questions, repeated
below, and then explained in more detail in turn.
RQ1 Is information about agreement and transitivity learned by the parser?
RQ2 Does a sequential NN-based dependency parser learn the notion of dissociated nucleus?
RQ3 Does a dependency parser augmented with a recursive layer learn the notion of dissociated nucleus?
8 The maximum number of auxiliaries in one AVC in our data set is three.
RQ1. We verify that the parser has information about transitivity and agreement avail-
able when making parsing decisions about FMVs by comparing the accuracy of classi-
fiers trained on token vectors of FMVs on these tasks to the majority baseline (the most
frequent value for the task in training data). We expect them to perform substantially
better than that.
We also want to compare the representation of FMVs with the representation of a
nearby token that is not expected to have this information, to rule out the possibility
that the LSTM propagates information about AVCs to all the sentence tokens or at least
the nearby ones. We select punctuation items that are close to the main verb for this
purpose. Punctuation items attach to the main verb in UD and are frequent enough that
we can expect many main verbs to have at least one as a dependent. We expect these
vectors to be uninformative about the tasks.
If this information is available in token vectors of FMVs, we are also interested
in finding out how this information was obtained from the network. If it is available
in the context independent representation of words, namely, the type and character
representation of the word, it may be propagated upwards from these representations
to the token representation. Otherwise, we know that it is learned by the BiLSTM.
If it is present in the context independent representation of words, there is some
indication that this information is learned by the parser. To verify that it is learned
specifically for the task of parsing, we compare type vectors of FMVs obtained using
our parser to vectors trained using a language modeling objective. We train a word2vec
(Mikolov et al. 2013) language model on the same training set as for parsing. We thus
obtain vectors of the same dimension as our word type vectors and trained with the
same data set but trained with a different objective.9 We expect the following to hold:
• Main verb token vectors are informative with respect to transitivity and
agreement: they perform better than the majority baseline on these tasks.
• Main verb token vectors are more informative than punctuation token
vectors with respect to transitivity and agreement.
• Main verb type vectors trained with a parsing objective contain more
information about transitivity and agreement than type vectors trained
with a language modeling objective.
RQ2 and RQ3. If the parser learns a notion of dissociated nucleus we expect to observe
that AVC subtree vectors (es decir., the AVC’s NFMV token vector or its composed version
for UD, the AVC’s MAUX token vector or its composed version for MS) contain a similar
amount of information about agreement and transitivity as FMVs do.
Note that we investigate whether the parser learns information about these tasks.
We think it is reasonable to assume that if this information is learned by the parser, it is
useful for the task of parsing, since the representations are obtained from a model that
is trained end-to-end without explicit supervision about the prediction tasks. Recent re-
search has cast some doubt on this: As discussed by Belinkov (2018), it is not impossible
that the information is in the network activations but is not used by the network. In any
caso, we are not interested in improving the parser here but in finding out what it learns.
9 Note that for this kind of experiment, training language models that learn contextual representations of
words such as ELMo (Peters et al. 2018) or BERT (Devlin et al. 2019) and comparing these representations to
our token vectors would be more appropriate. Sin embargo, these models are typically trained on very large
data sets and it is unclear how well they perform when trained on just treebank data. We leave doing this
to future work.
3.4 Vectors
We train parsers for 30 epochs for all these treebanks and pick the model of the best
epoch based on LAS score on the development set. We report parsing results in Table 2.
We train the parser with the same hyperparameters as in Smith et al. (2018) except for
the character BiLSTM values: We set the character embedding size to 24 and the charac-
ter BiLSTM output dimension to 50. This is a good compromise between efficiency and
accuracy. The token vectors have a dimension of 250, type vectors 100, and character
vectors 50. We load the parameters of the trained parser, run the BiLSTM through the
training and development sentences, and collect the token vectors of NFMVs or MAUX,
as well as the type and token vectors of finite verbs, from the training and development
conjuntos. We use the vectors collected from the training set to train the classifiers and we test
the classifiers on the vectors of the development sets.
As punctuation vectors, we take the token vectors of the punctuation items that
are closest to the FMV. We look at children of the FMV that have punct as dependency
relation and take the closest one in linear order, first looking at the right children and
then at the left ones.
As for word2vec vectors, we use the Gensim (Řehůřek and Sojka 2010) implementa-
tion with default settings: using CBOW, a window of 5, and ignoring words with lower
frequency than 5. We learn embeddings of the same dimension as our type vectors for
comparability: 100.
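A minimal sketch of this step, assuming the treebank training sentences are available as lists of word forms in a variable called sentences; the parameter name vector_size is the Gensim 4 spelling (older releases call it size):

from gensim.models import Word2Vec

# sentences: tokenized sentences from the UD training set, e.g. [['I', 'did', 'that'], ...]
# CBOW (sg=0), window of 5, minimum frequency 5, 100-dimensional vectors,
# matching the settings and the type-vector dimension described above.
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=0)
fmv_vector = w2v.wv["did"]   # look up the vector of a finite main verb form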
In the recursive setup, we load the best model, pass the data through the BiLSTM,
and parse the sentences so as to obtain composed representations of AVCs. For that,
we collect the final composed vectors of NFMVs or MAUX after prediction for each
oración.
We illustrate the vectors of interest with either representation in figures 4 y 5,
except for the word2vec ones and the ones obtained recursively.
We follow Belinkov (2018) and choose to work with an MLP with one hidden layer
as a diagnostic classifier. Belinkov (2018) argues that it is important to avoid working
Figure 4
Example AVC with vectors of interest: token (tok), character (char), and type of MAUX and NFMV.
Figure 5
Example sentence with an FMV with vectors of interest: token (tok), character (char), and type of FMV and punctuation (punct).
with a classifier that is too weak or too strong. If it is too weak, it might fail to find
information that is in the vectors investigated. If it is too strong, it might find patterns
that the main network cannot use. He argues that a neural network with one hidden
layer strikes a good balance in classifier power. In one study, he compared accuracies
of a linear classifier and two nonlinear classifiers with one and two hidden layers and
found that although the nonlinear classifiers performed consistently better on the task,
the linear classifier showed the same trends. We ran all our experiments with a linear
classifier as well and observed the same thing: The trends are generally the same. For
the first research question, this means that the question can be answered positively
regardless of the classifier used. For our other research questions, because they are
more concerned with where the information is present in the network than how much
information there is, this means that our observations do not depend on the classifier
used and that our conclusions would be the same with different classifiers. For clarity,
we therefore only report results using the MLP.
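As a concrete sketch of the probing setup with such a one-hidden-layer classifier (here scikit-learn's MLPClassifier; the hidden size and iteration limit are illustrative, and X_train/y_train, X_dev/y_dev stand for the frozen vectors and task labels described above):

from collections import Counter
from sklearn.neural_network import MLPClassifier

# X_train, y_train: frozen vectors collected from the training set and their task labels
# (e.g. FMV token vectors with transitivity labels); X_dev, y_dev: dev-set counterparts.
probe = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)
probe.fit(X_train, y_train)
accuracy = probe.score(X_dev, y_dev)

maj_label = Counter(y_train).most_common(1)[0][0]            # most frequent training label
majority = sum(1 for y in y_dev if y == maj_label) / len(y_dev)
delta = accuracy - majority                                   # the difference reported in Section 4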
4. Results
We compare the prediction accuracy on each task to the majority baseline. To get a
measure that is comparable across languages and settings, we compute the difference
between the accuracy of a classifier on a task using a set of vectors to the majority
baseline for this set of vectors. The larger the difference, the greater the indication that
the vector encodes the information needed for the task.10 We perform paired t-tests to
measure whether the difference to the majority baseline is statistically significant on
promedio. Given that the set of labels we predict is very restricted, the majority baseline
performs reasonably well. As can be seen in tables 3 y 4, the performance ranges
from 49 to 81, with most values around 60. It seems reasonable to assume that a system
10 We calculated relative error reductions as well but because these results showed the same trends, we
exclusively report absolute difference in accuracy.
Table 3
Classification accuracy of the majority baseline (maj) and classifier trained on the type, token
(tok), and word2vec (w2v) vectors of FMVs, and token vectors of punctuation (punct) on the
agreement (A) and transitivity (T) tasks. Difference (δ) to majority baseline of these classifiers.
Average difference significantly higher than the baseline are marked with ∗ = p < .05 and
∗∗ = p < .01.
T
A
maj
70.5
ca
59.2
fi
55.9
hr
nl
61.7
av 61.8
6.2
sd
74.4
ca
61.6
fi
60.9
hr
81.6
nl
av 69.6
sd 10.1
tok
88.7
86.2
79.7
82.1
84.2
4.0
82.2
86.0
78.1
87.2
83.4
4.1
FMV
type
79.4
72.8
71.3
74.0
74.4
3.5
82.6
63.2
74.8
85.7
76.6
10.0
char w2v
74.4
75.0
59.2
74.9
57.8
70.6
64.8
69.4
64.0
72.5
7.5
2.9
74.4
98.4
61.6
93.5
60.9
97.8
81.6
96.3
69.6
96.5
10.1
2.2
punct
δ FMV
maj
67.5
56.6
61.5
62.0
61.9
4.5
76.7
59.7
64.0
81.8
70.5
10.4
tok
71.3
64.1
62.7
69.6
66.9
4.2
76.6
59.7
64.0
80.5
70.2
10.0
type
tok
9.0
18.2
13.5
27.0
15.4
23.8
20.5
12.4
22.4** 12.6*
2.7
8.1
1.6
13.9
4.2
6.9*
5.4
3.9
7.7
24.4
17.2
5.7
13.7*
8.7
4.5
15.6
14.7
7.7
char w2v
3.9
0.0
1.8
3.1
10.7* 2.2
1.7
5.4
0.0
24.0
0.0
31.9
0.0
36.9
0.0
14.8
26.9** 0.0
0.0
9.7
δ punct
tok
3.8
7.5
1.2
7.6
5.0*
3.1
−0.1
0.0
0.0
−1.2
−0.3
0.6
Table 4
Classification accuracy of the majority baseline (maj) and classifier trained on the token (tok),
type, and character (char) vectors of FMVs, NFMVs, and MAUX on agreement (A) and
transitivity (T) tasks. Difference (δ) to majority baseline of these classifiers. Average difference
significantly higher than the baseline are marked with ∗ = p < .05 and ∗∗ = p < .01.
FMV AVC NFMV-UD
MAUX-MS
maj tok maj
tok type char tok type char
T
A
ca 70.5 88.7 66.9 89.3 78.8 74.3 88.5 66.8
fi 59.2 86.2 49.1 81.8 70.7 71.1 72.4 62.0
hr 55.9 79.7 51.2 82.5 75.4 69.7 74.9 55.8
nl 61.7 82.1 70.5 88.6 74.5 71.5 86.8 80.2
av 61.8 84.2 59.4 85.6 74.8 71.6 80.6 66.2
sd 6.2 4.0 10.8
8.2 10.3
ca 74.4 82.2 76.0 76.0 76.0 76.0 83.3 99.3
fi 61.6 86.0 67.5 68.5 67.5 69.8 77.7 89.6
hr 60.9 78.1 71.2 71.2 71.9 71.2 77.2 95.3
nl 81.6 87.2 72.4 70.4 72.4 72.4 89.4 99.5 100.0
av 69.6 83.4 71.8 71.5 72.0 72.3 81.9 95.9
5.7 4.6
sd 10.1 4.1
3.2 3.5 2.7
3.9 3.3 1.9
3.5
FMV
tok
66.7 18.2
61.0 27.0
56.1 23.8
81.6 20.5
66.3 22.4**
3.9
11.1
99.7
7.7
92.3 24.4
94.6 17.2
5.7
96.7 13.7*
8.7
3.8
δ nfmv
tok
22.4
32.7
31.4
18.1
26.1**
7.0
0.0
1.0
0.0
−2.0
−0.2
1.2
type char
7.4
11.9
21.9
21.5
18.5
24.3
1.0
4.0
15.4* 12.2*
9.7
0.0
2.3
0.0
0.0
0.6
1.2
9.3
0.0
0.0
0.7
0.1
0.2
0.3
δ maux
tok
char
type
21.7 −0.1 −0.2
11.8
12.8
23.3
4.9
4.7
23.7
11.1
9.6
16.3
6.9*
6.8*
21.2**
5.7
5.7
3.4
23.7
23.2
7.2
24.8
22.1
10.2
23.4
24.0
6.0
27.6
27.2
17.1
24.9**
24.1**
10.1*
1.9
2.2
5.0
that performs significantly better than that learns information relevant to the notion
predicted. The majority baselines of FMVs and AVCs are on average very close (see
Table 4), indicating that results are comparable for these two sets.11
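A minimal sketch of this significance test with SciPy, using the FMV token-vector transitivity accuracies and their majority baselines (Tables 3 and 4) as the paired samples; whether the original test was one- or two-sided is not stated here:

from scipy.stats import ttest_rel

# Per-language accuracies (ca, fi, hr, nl) for one vector type on one task,
# paired with the corresponding majority baselines.
classifier_acc = [88.7, 86.2, 79.7, 82.1]   # FMV token vectors, transitivity
majority_acc = [70.5, 59.2, 55.9, 61.7]     # majority baselines for the same task
stat, p = ttest_rel(classifier_acc, majority_acc)
# p < .05 and p < .01 correspond to the * and ** markers in the tables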
RQ1: Is agreement and transitivity information learned by the parser?
Results pertaining to RQ1 are given in Table 3. Note that we only report results on
the UD data here because the representation of finite verbs does not change between
11 The color scheme in the tables indicates no change (yellow) to improvements (dark green).
UD and MS. The results conform to our expectations. FMV token vectors contain
information about both agreement and transitivity. They perform significantly better
than the majority baseline for both tasks.
Token vectors of punctuation marks related to FMVs contain some information
about transitivity, with some variance depending on the language, and perform signif-
icantly better than the majority baseline. However, they perform considerably worse
than token vectors of FMVs. The difference between FMV token vectors and their
majority baseline is substantially larger than the difference between punctuation vectors
and their majority baseline for all languages, and with 17 percentage points more on
average. Punctuation vectors seem to be completely uninformative about agreement.
The classifier seems to learn the majority baseline in most cases. This indicates that
information about agreement and transitivity is relevant for scoring transitions involv-
ing FMVs, but this information is not necessary to score transitions involving tokens
that are related to FMVs (or at least it is not necessary for punctuation tokens that are
related to FMVs). It indicates that this information is available in contextual information
of FMVs but not in contextual information of all tokens in the verb phrase, except
marginally for transitivity.
We can conclude that information about both agreement and transitivity is learned
by the parser. We can now look more closely at where in the network this information is
present by looking at context independent vectors: type and character vectors. FMV
type vectors seem to encode some information about transitivity: On average, they
perform significantly better than the majority baseline on the transitivity task. When
it comes to agreement, the difference to the baseline for FMV type vectors is smaller but
they still perform significantly better than the majority baseline on average, although
with more variation: They seem uninformative for Finnish. FMV token vectors contain
substantially more information than type vectors for both tasks. The difference between
token vectors and their majority baseline is, in most cases, slightly to substantially
larger than it is between type vectors and their majority baseline, but on average, it
is substantially larger.
FMV token vectors also contain substantially more information than character vec-
tors for the transitivity task. For the agreement task, however, character vectors contain
substantially more information than token vectors.
This indicates that the information flow for the two tasks differs to some extent.
Because token vectors contain much more information about transitivity than both
type and character vectors, we can conclude that this information is obtained from the
BiLSTM.12 For agreement, however, token vectors are less informative than character
vectors, which indicates that part of this information probably comes from the character
vector, but some of it gets filtered out between the character vector and the token
vector. This indicates that agreement may be useful for signaling potential relationships
between words that are then captured by the BiLSTM. A substantial part of the infor-
mation does remain though, indicating that agreement information is still useful when
it comes to making parsing decisions.
We finally compare representations of word types when trained for the parsing
as opposed to the language modeling task. For this, we look at word2vec vectors.
Word2vec vectors of FMVs contain little information about transitivity, from no differ-
12 Further evidence for this conclusion comes from the fact that some of this information, unlike the
agreement information, seems to spill over on neighboring tokens like the punctuation tokens considered
earlier.
ence with the majority baseline to 3.9 percentage points above it. FMV type vectors are
substantially better than word2vec vectors for all languages, with an average difference
to the majority baseline that is 10 percentage points larger than the difference between
word2vec vectors and their majority baseline. Word2vec vectors contain no information
at all about agreement; the network learns the majority baseline for these vectors for
all languages. FMV type vectors are better on this task for all languages. The difference
between FMV type vectors and the majority baseline is small for Finnish but on average
it is 6.9 percentage points larger than the majority baseline. This indicates that informa-
tion about transitivity and agreement is more relevant for the task of parsing than for
the task of language modeling.
We have clearly seen (1) that transitivity and agreement are learned by the parser
and that some information about these tasks is available to the parser when making
decisions about FMVs and (2) that this information is not available everywhere in the
network and is therefore available specifically when making decisions about FMVs. This
answers RQ1 positively.
We should keep in mind that we observed a different information flow for transitiv-
ity where information is obtained mostly by the BiLSTM compared to agreement where
it seems to be strongly signaled at the layer of context independent representations (in
particular in the character vector) and weaker at the output of the BiLSTM.
RQ2: Does a BiLSTM-based parser learn the notion of dissociated nucleus?
Results pertaining to RQ2 are given in Table 4. Comparing first FMV and NFMV token
vectors, we can see that NFMVs are somewhat better than FMVs at the transitivity task,
but both perform substantially better than the majority baseline. On the agreement task,
however, FMV vectors perform substantially better than NFMV vectors. NFMV vectors
seem completely uninformative when it comes to agreement, performing on average
slightly worse (−0.2 percentage point) than the majority baseline. FMV vectors perform
moderately (Dutch) to largely better (Finnish) than this depending on the language,
with an average difference to the majority baseline of 13.7.
An unpaired t-test reveals that the results of FMVs and NFMVs on the agreement
task are significantly different with p < .01, which further supports the hypothesis that
they do not capture the same information. The results between FMVs and NFMVs are
not significantly different on the transitivity task.
Looking at MAUX token vectors in an MS representation, we see that they are
also substantially better than the majority baseline on the transitivity task, performing
slightly worse than FMVs, on average. Contrary to NFMV token vectors, they perform
significantly better than the majority baseline on the agreement task and somewhat
worse than FMVs. The results between MAUX and FMVs are not significantly different
for either of the tasks, indicating that they do seem to capture a similar amount of
information.
We can conclude that a BiLSTM-based parser does not learn the notion of dissoci-
ated nucleus for AVCs when working with a representation of AVCs where the main
verb is the head such as is the case in UD: The representation of NFMVs contains less
information about agreement than the representation of FMVs. However, when using
a representation where the auxiliary is the head, a BiLSTM-based parser does seem
to learn this notion; it learns a representation of the AVC’s head that is similar to the
representation of FMVs.
This can be explained by the different information flow in the network for the two
tasks. In figures 6 and 7, we illustrate further the different information flow for the
Figure 6
Information flow for transitivity.
Figure 7
Information flow for agreement.
transitivity and agreement task, respectively, and for both FMVs and AVCs. We use the
same color scheme as in the tables (from yellow to dark green means no information
about a task to a substantial amount of information about the task, as measured by
the difference to the majority baseline) and we simplify the architecture illustration
from figures 4 and 5. As we saw in Table 3, and as we can see from the left part
of these figures, looking at FMVs, information about transitivity is mostly obtained
from the BiLSTM, whereas information about agreement is present in the character
vector and propagated to the token vector. We observe a similar phenomenon with
AVCs, as presented in Table 4: For the transitivity task, the type vectors of NFMVs
contain more information than the type vectors of MAUX, but in both cases, the token
representation of the head of the AVC contains substantially more information than
the type and character vectors. By contrast, both type and character vectors of MAUX
contain information about agreement, whereas NFMV type and character vectors do
not. It seems that, with this model, in order for agreement information to be available
to the head of the AVC, the head of the AVC needs to be the auxiliary. When it comes to
transitivity, the BiLSTM is able to construct this information regardless of what word is
Table 5
Classification accuracy of the majority baseline (maj) and classifier trained on the vectors (vec) of
NFMVs with (+c) and without composition and FMVs on agreement (A) and transitivity (T)
tasks. Difference (δ) to majority baseline of these classifiers. Average difference significantly
higher than the baseline are marked with ∗ = p < .05 and ∗∗ = p < .01.
T
A
FMV
maj
ca 70.5 88.7
fi 59.2 86.2
hr 55.9 79.7
nl 61.7 82.1
av 61.8 84.2
4.0
sd 6.2
ca 74.4 82.2
fi 61.6 86.0
hr 60.9 78.1
nl 81.6 87.2
av 69.6 83.4
4.1
sd 10.1
AVC
tok maj
66.9
49.1
51.2
70.5
59.4
10.8
76.0
67.5
71.2
72.4
71.8
3.5
NFMV
tok tok+c
89.3 88.5
81.8 76.0
82.5 82.0
88.6 83.4
85.6 82.5
5.2
76.0 91.6
68.5 79.2
71.2 83.6
70.4 84.0
71.5 84.6
5.2
3.2
3.9
MAUX
tok tok+c
88.5 86.5
72.4 79.4
74.9 79.5
86.8 88.5
80.6 83.5
4.7
83.3 77.1
77.7 74.7
77.2 73.3
89.4 83.6
81.9 77.2
4.6
5.7
8.2
δ
nfmv nfmv+c maux maux+c
fmv
22.4
18.2
32.7
27.0
31.4
23.8
18.1
20.5
26.1**
22.4**
7.0
3.9
0.0
7.7
1.0
24.4
0.0
17.2
−2.0
5.7
13.7* −0.2
1.2
8.7
21.7
26.9
30.9
12.9
23.1**
7.8
15.6
11.7
12.4
11.6
12.8*
1.9
21.7
23.3
23.7
16.3
21.2**
3.4
7.2
10.2
6.0
17.1
10.1*
5.0
19.7
30.3
28.3
17.9
24.0**
6.2
1.1
7.2
2.1
11.2
5.4
4.7
the head of the AVC, which is why the model is able to learn the notion of dissociated
nucleus for the MS representation but not the UD representation.
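To make the probing setup behind Tables 3–5 concrete, the sketch below shows one way a frozen vector (for example, the token vector of an FMV or of an NFMV) can be probed with a diagnostic classifier and compared against the majority baseline that δ is measured from. This is a minimal illustration rather than our experimental code: the synthetic arrays, the scikit-learn logistic regression probe, and the label names are assumptions made for the example.

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical frozen vectors extracted from the parser (e.g., token vectors of verbs),
# together with the property probed for (here: transitivity of the verb).
train_vecs, test_vecs = rng.normal(size=(800, 100)), rng.normal(size=(200, 100))
train_labels = rng.choice(["trans", "intrans"], size=800, p=[0.6, 0.4]).tolist()
test_labels = rng.choice(["trans", "intrans"], size=200, p=[0.6, 0.4]).tolist()

# Majority baseline: always predict the most frequent label seen in training.
majority_label = Counter(train_labels).most_common(1)[0][0]
maj = np.mean([label == majority_label for label in test_labels])

# Diagnostic classifier: the parser's vectors stay frozen; only the probe is trained.
probe = LogisticRegression(max_iter=1000).fit(train_vecs, train_labels)
acc = probe.score(test_vecs, test_labels)

print(f"maj = {maj:.3f}, probe accuracy = {acc:.3f}, delta = {acc - maj:.3f}")
```

With real parser vectors in place of the synthetic ones, the probe accuracy minus the majority baseline corresponds to the δ values reported in the tables.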
Because the MS representation does not improve parsing accuracy (see Table 2 and footnote 13),
it is possible either that learning this notion is not important for parsing or that the benefits of learning this notion are offset by other factors. We attempt to identify
whether or not we can get the best of both worlds by learning a BiLSTM parser that
uses recursive composition and training it on the UD representation. We also look at
what happens with recursive composition with the MS representation.
Note that, in all experiments, we only report results for NFMVs with a UD repre-
sentation and for MAUX with an MS representation because these are the vectors that
represent the subtree. However, the information that is learned in these vectors does
not seem to depend much on the representation of the AVC: token, type, and character
vectors representing MAUX learn similar information whether in UD or in MS and
token, type, and character vectors of NFMVs also learn similar information whether in
UD or in MS. This means that the parser learns similar representations of AVC elements
and only the representation of the subtree depends on the representation style of AVCs.
RQ3: Does subtree composition help?
As mentioned previously, we found in de Lhoneux, Ballesteros, and Nivre (2019) that
a recursive composition function does not make our parsing model more accurate. A
recursive composition function might not be necessary for parsing accuracy, but it might still help in this case. It could make it possible to get the relevant information from
the main verb and the auxiliary token vectors. As we have just seen, the token vector
13 Note that these results are not directly comparable; we would need to transform the MS representation
back to UD to compare against the same annotation type. However, we expect these results to be even
worse when transformed, as results from de Lhoneux and Nivre (2016) indicate.
of the MAUX has information that is missing in the token vector of the NFMV and that
could be propagated to the NFMV vector through the recursive composition function.
We train a version of the parser where we recursively compose the representation
of AVC subtrees during parsing. For UD, this means that the representation of the
NFMV token gets updated as auxiliaries get attached to it. For MS, this means that the
representation of AVC subtrees is composed in a chain from the outermost auxiliary
to the main verb. Note that this recursive composition function models the transfer
relation for UD: It is used when an auxiliary is attached to the main verb. In MS, by
contrast, it can also be used between two auxiliaries, which is a different type of relation.
As reported in Table 2, this decreases parsing accuracy very slightly. We compare the
vectors of this composed representation to the representation of FMVs. Results are given
in Table 5.
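To illustrate what such a composition function does (the parser itself follows de Lhoneux, Ballesteros, and Nivre (2019); the particular tanh-of-concatenation form, the toy dimensionality, and the example vectors below are simplifying assumptions for exposition), the subtree vector of a head is updated every time a dependent is attached to it:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 6  # toy dimensionality; the real token vectors are much larger

# Toy parameters for a composition function c(head, dep) = tanh(W [head; dep] + b).
W = rng.normal(scale=0.1, size=(DIM, 2 * DIM))
b = np.zeros(DIM)

def compose(head_vec, dep_vec):
    """Return an updated subtree vector for the head after attaching one dependent."""
    return np.tanh(W @ np.concatenate([head_vec, dep_vec]) + b)

# Toy token vectors for the AVC "has been done".
has, been, done = (rng.normal(size=DIM) for _ in range(3))

# UD-style AVC: the NFMV "done" is the head; each auxiliary attachment updates its vector.
nfmv_subtree = compose(compose(done, been), has)

# MS-style AVC: composition runs down the auxiliary chain to the main verb instead.
maux_subtree = compose(has, compose(been, done))

print(nfmv_subtree.round(2), maux_subtree.round(2))
```

With UD, the chain of updates ends at the NFMV, so information from the auxiliaries can reach the vector that represents the AVC subtree; with MS, it ends at the outermost auxiliary instead.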
On average, composed NFMV vectors perform similarly to non-composed NFMV
vectors on the transitivity task, slightly worse but substantially better than the majority
baseline. For the agreement task, composed NFMV vectors are much better than the
non-composed NFMV vectors, all performing substantially better than the majority
baseline, although with less variation than the FMV vectors. The difference between
composed NFMV vectors and the majority baseline is slightly higher (0.7 percent-
age points) than the difference between FMV vectors and their majority baseline for
transitivity and slightly lower for agreement, but with variation across languages. On
average, they seem to capture a similar amount of information. An unpaired t-test
reveals that there is no significant difference between the results of FMVs and composed
NFMV vectors. We can therefore conclude that a recursive composition function on top
of a BiLSTM allows the model to capture similar information about AVCs and their non-
dissociated counterpart, FMVs. This indicates that composing subtree representations
with a recursive layer makes it possible for the parser to learn the notion of dissociated
nucleus with a representation of AVCs where the head is the main verb.
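The statistical comparison can be illustrated with an unpaired two-sample t-test over the per-language differences to the majority baseline (δ) from Table 5. Whether the published comparison was run on exactly these per-language δ values or on another aggregation is an assumption of this sketch; it is meant only to show the shape of the test.

```python
from scipy.stats import ttest_ind

# Per-language δ values from Table 5 (ca, fi, hr, nl).
delta_fmv = {"T": [18.2, 27.0, 23.8, 20.5], "A": [7.7, 24.4, 17.2, 5.7]}
delta_nfmv_composed = {"T": [21.7, 26.9, 30.9, 12.9], "A": [15.6, 11.7, 12.4, 11.6]}

for task in ("T", "A"):
    # Unpaired (two-sample) t-test between FMV and composed NFMV differences.
    t_stat, p_value = ttest_ind(delta_fmv[task], delta_nfmv_composed[task])
    print(f"task {task}: t = {t_stat:.2f}, p = {p_value:.3f}")
```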
With the MS representation, composition improves accuracy on transitivity (except
for Catalan) but decreases accuracy on agreement, making it on average not signifi-
cantly better than the majority baseline and making the MAUX representation almost
completely uninformative with regard to agreement for Catalan and Croatian. There is
no statistically significant difference between the results of FMVs and composed MAUX on either task, however, indicating that they do capture a similar amount of information.
Overall, it seems that using a UD representation and a recursive composition function is the best option we have for obtaining an accurate parser that captures the notion
of dissociated nucleus, given our definition of what it means to capture this notion
(that FMV and NFMV representations encode a similar amount of information about
agreement and transitivity). This does not improve overall parsing accuracy, which
means either that it is not important to capture this notion for parsing or that the benefits
of doing so are offset by drawbacks of this method. It would be interesting to find out
whether learning this information is important to downstream tasks, which we leave to
future work.
5. Conclusion
We used diagnostic classifiers to investigate the question of whether or not a BiLSTM-
based parser learns the notion of dissociated nucleus, focusing on AVCs. We looked at
agreement and transitivity tasks and verified that the parser has access to this infor-
mation when making parsing decisions concerning main verbs. We compared what a
parser learns about AVCs with what it learns about their non-dissociated counterpart:
finite main verbs. We observed that, with a UD representation of AVCs, vectors that
represent AVCs encode information about transitivity to the same extent as FMVs but,
contrary to FMVs, they are mostly uninformative when it comes to agreement. We
concluded that a purely recurrent BiLSTM-based parser does not learn the notion of
dissociated nucleus. We found explanations for this by investigating the information
flow in the network and looking at what happens with a representation of AVCs where
the auxiliary is the head.
We finally investigated whether or not explicitly composing AVC subtree representations using a recursive layer makes a difference. It seems to make the representation of AVCs more similar to the representation of FMVs, indicating that recursively
composing the representation of subtrees makes it possible for the parser to learn the
notion of dissociated nucleus.
We started out by arguing that parsers should learn the notion of dissociated nucleus,
then found out that a BiLSTM-based parser does not learn this notion when working with
a UD representation of AVCs but can learn it if augmented by a recursive composition
function. This recursive layer has been shown to be superfluous in previous work when
we only look at parsing accuracy but our results here indicate that there may be benefits
to using this recursive layer that are not reflected in parsing accuracy. This suggests
that we may just not have found the best way to integrate this recursive layer into a
BiLSTM-based parser yet. More generally, this lends some support to the hypothesis
that hierarchical modeling is a useful inductive bias when modeling syntax.
In future work, we plan to use diagnostic classifiers to investigate other cases of
dissociated nuclei, such as combinations of adpositions and nouns and of subordinating
conjunctions and verbs. We also plan to further investigate the use of a recursive layer
in parsing with BiLSTM-based feature representations. It may be useful to model the
transfer relation between two elements of a dissociated nucleus to obtain a representa-
tion of the nucleus that is embedded in the same space as other nuclei. It would finally
be interesting to find out whether learning this notion of dissociated nucleus is useful
when it comes to downstream applications.
Acknowledgments
We thank Paola Merlo for thorough
comments on an earlier version of this paper,
as well as the anonymous reviewers. We
acknowledge the computational resources
provided by CSC in Helsinki and Sigma2 in
Oslo through NeIC-NLPL (www.nlpl.eu).
References
Aikhenvald, Alexandra Yurievna, and Robert
M. W. Dixon. 2000. Changing valency: Case studies in transitivity. Cambridge University Press, Cambridge.
Anderson, Gregory DS. 2011. Auxiliary verb
constructions (and other complex
predicate types): A functional–
constructional overview. Language
and Linguistics Compass, 5(11):795–828.
DOI: https://doi.org/10.1111/j.1749
-818X.2011.00311.x
Belinkov, Yonatan. 2018. On Internal Language
Representations in Deep Learning: An
Analysis of Machine Translation and Speech
Recognition. Ph.D. thesis, Massachusetts
Institute of Technology.
Blevins, Terra, Omer Levy, and Luke
Zettlemoyer. 2018. Deep RNNs encode soft
hierarchical syntax. In Proceedings of the
56th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short
Papers), pages 14–19. Melbourne. DOI:
https://doi.org/10.18653/v1/P18-2003
Devlin, Jacob, Ming-Wei Chang, Kenton Lee,
and Kristina Toutanova. 2019. BERT: Pre-
training of deep bidirectional transformers
for language understanding. In Proceedings
of the 2019 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers),
pages 4171–4186, Minneapolis, MN.
Dozat, Timothy and Christopher Manning.
2017. Deep Biaffine Attention for Neural
Dependency Parsing. In Proceedings of the
5th International Conference on Learning
Representations. Toulon.
Dyer, Chris, Miguel Ballesteros, Wang Ling,
Austin Matthews, and Noah A. Smith.
2015. Transition-based dependency
parsing with stack long short-term
memory. In Proceedings of the 53rd Annual
Meeting of the Association for Computational
Linguistics (ACL), pages 334–343, Beijing.
DOI: https://doi.org/10.3115/v1
/P15-1033
Enguehard, Émile, Yoav Goldberg, and Tal
Linzen. 2017. Exploring the syntactic
abilities of RNNs with multi-task learning.
In Proceedings of the 21st Conference on
Computational Natural Language Learning
(CoNLL 2017), pages 3–14, Vancouver. DOI:
https://doi.org/10.18653/v1
/K17-1003
Graves, Alex. 2008. Supervised Sequence
Labelling with Recurrent Neural Networks.
Ph.D. thesis, Technical University Munich.
Gulordava, Kristina, Piotr Bojanowski,
Edouard Grave, Tal Linzen, and Marco
Baroni. 2018. Colorless green recurrent
networks dream hierarchically. In
Proceedings of the 2018 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long Papers),
pages 1195–1205, New Orleans, LA. DOI:
https://doi.org/10.18653/v1/N18-1108
Hochreiter, Sepp and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. DOI: https://doi.org/10.1162/neco.1997.9.8.1735, PMID: 9377276
Hupkes, Dieuwke, Sara Veldhoen, and Willem Zuidema. 2018. Visualisation and ‘diagnostic classifiers’ reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926. DOI: https://doi.org/10.1613/jair.1.11196
Johnson, Mark. 2011. How relevant is linguistics to computational linguistics. Linguistic Issues in Language Technology, 6(7).
Kiperwasser, Eliyahu and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313–327.
Kuhlmann, Marco, Carlos Gómez-Rodríguez, and Giorgio Satta. 2011. Dynamic programming algorithms for transition-based dependency parsers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), pages 673–682, Portland, OR.
Kuncoro, Adhiguna, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436, Melbourne. DOI: https://doi.org/10.18653/v1/P18-1132
de Lhoneux, Miryam, Miguel Ballesteros, and Joakim Nivre. 2019. Recursive subtree composition in LSTM-based dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1566–1576, Minneapolis, MN. DOI: https://doi.org/10.18653/v1/N19-1159
de Lhoneux, Miryam and Joakim Nivre. 2016. Should have, would have, could have. Investigating verb group representations for parsing with universal dependencies. In Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP, pages 10–19, San Diego, CA. DOI: https://doi.org/10.18653/v1/W16-1202
de Lhoneux, Miryam, Sara Stymne, and Joakim Nivre. 2017a. Arc-hybrid non-projective dependency parsing with a static-dynamic oracle. In Proceedings of the 15th International Conference on Parsing Technologies, pages 99–104, Pisa.
de Lhoneux, Miryam, Sara Stymne, and
Joakim Nivre. 2017b. Old school vs. new
school: Comparing transition-based
parsers with and without neural network
enhancement. In Proceedings of the 15th
Treebanks and Linguistic Theories Workshop
(TLT), pages 99–110, Bloomington, IN.
Linzen, Tal, Emmanuel Dupoux, and Yoav
Goldberg. 2016. Assessing the ability of
LSTMs to learn syntax-sensitive
dependencies. Transactions of the
Association for Computational Linguistics,
4:521–535. DOI: https://doi.org
/10.1162/tacl a 00115
Mel’ˇcuk, Igor. 1988. Dependency Syntax:
Theory and Practice. State University of
New York Press.
Mikolov, Tomas, Ilya Sutskever, Kai Chen,
Greg S. Corrado, and Jeff Dean. 2013.
Distributed representations of words and
phrases and their compositionality. In C. J.
C. Burges, L. Bottou, M. Welling, Z.
Ghahramani, and K. Q. Weinberger,
editors, Advances in Neural Information
Processing Systems 26, Curran Associates,
Inc., pages 3111–3119.
Nivre, Joakim. 2008. Algorithms for
deterministic incremental dependency
parsing. Computational Linguistics,
34(4):513–553. DOI: https://doi.org
/10.1162/coli.07-056-R1-07-027
Nivre, Joakim. 2009. Non-projective
dependency parsing in expected linear
time. In Proceedings of the Joint Conference of
the 47th Annual Meeting of the ACL and the
4th International Joint Conference on Natural
Language Processing of the AFNLP
(ACL-IJCNLP), pages 351–359, Suntec.
DOI: https://doi.org/10.3115/1687878
.1687929
Nivre, Joakim. 2015. Towards a universal
grammar for natural language processing,
Alexander Gelbukh, editor, Computational
Linguistics and Intelligent Text Processing,
Springer, pages 3–16. DOI: https://doi
.org/10.1007/978-3-319-18111-0 1
Nivre, Joakim, Johan Hall, and Jens Nilsson.
2006. MaltParser: A data-driven parser-
generator for dependency parsing. In
Proceedings of the 5th International Confer-
ence on Language Resources and Evaluation
(LREC), pages 2216–2219, Genoa.
Nivre, Joakim, Marie-Catherine de Marneffe,
Filip Ginter, Yoav Goldberg, Jan Hajič,
Christopher Manning, Ryan McDonald,
Slav Petrov, Sampo Pyysalo, Natalia
Silveira, Reut Tsarfaty, and Daniel Zeman.
2016. Universal dependencies v1: A
multilingual treebank collection. In
Proceedings of the 10th International
Conference on Language Resources and
Evaluation (LREC 2016), pages 1659–1666,
European Language Resources
Association, Portorož.
Nivre, Joakim, Mitchell Abrams, Željko Agić,
et al. 2018. Universal Dependencies 2.2.
LINDAT/CLARIN digital library at the
Institute of Formal and Applied
Linguistics, Charles University, Prague.
Peters, Matthew, Mark Neumann, Mohit
Iyyer, Matt Gardner, Christopher Clark,
Kenton Lee, and Luke Zettlemoyer. 2018.
Deep contextualized word representations.
In Proceedings of the 2018 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long Papers),
pages 2227–2237, New Orleans, LA. DOI:
https://doi.org/10.18653/v1/N18-1202
Ravfogel, Shauli, Yoav Goldberg, and Tal
Linzen. 2019. Studying the inductive
biases of RNNs with synthetic variations
of natural languages. In Proceedings of the
2019 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers),
pages 3532–3542, Minneapolis, MN. DOI:
https://doi.org/10.18653/v1/N19-1356
Ravfogel, Shauli, Yoav Goldberg, and Francis
Tyers. 2018. Can LSTM learn to capture
agreement? The case of Basque. In
Proceedings of the 2018 EMNLP Workshop
BlackboxNLP: Analyzing and Interpreting
Neural Networks for NLP, pages 98–107,
Brussels. DOI: https://doi.org/10
.18653/v1/W18-5412
Řehůřek, Radim and Petr Sojka. 2010.
Software framework for topic modelling
with large corpora. In Proceedings of the
LREC 2010 Workshop on New Challenges for
NLP Frameworks, pages 45–50, ELRA,
Valletta.
Sgall, Petr, Eva Hajičová, and Jarmila Panevová. 1986. The Meaning of the Sentence
in Its Pragmatic Aspects, Reidel.
Smith, Aaron, Bernd Bohnet, Miryam
de Lhoneux, Joakim Nivre, Yan Shao, and
Sara Stymne. 2018. 82 treebanks, 34
models: Universal dependency parsing
with multi-treebank models. In Proceedings
of the CoNLL 2018 Shared Task: Multilingual
Parsing from Raw Text to Universal
Dependencies, pages 113–123, Brussels.
DOI: https://doi.org/10.18653/v1
/K18-2011
Tesnière, Lucien. 1959. Éléments de syntaxe structurale. Editions Klincksieck.
Williams, Adina, Andrew Drozdov, and
Samuel R. Bowman. 2018. Do latent tree
learning models identify meaningful
structure in sentences? Transactions of the
Association of Computational Linguistics, 6:
253–267. DOI: https://doi.org/10.1162
/tacl a 00019