Dependency Parsing with Backtracking using
Deep Reinforcement Learning
Franck Dary, Maxime Petit, Alexis Nasr
Aix Marseille Univ, Université de Toulon, CNRS, LIS, Marseille, France
{franck.dary,maxime.petit,alexis.nasr}@lis-lab.fr
Abstract
Greedy algorithms for NLP such as transition-
based parsing are prone to error propagation.
One way to overcome this problem is to al-
low the algorithm to backtrack and explore an
alternative solution in cases where new evi-
dence contradicts the solution explored so far.
In order to implement such a behavior, we use
reinforcement learning and let the algorithm
backtrack in cases where such an action gets
a better reward than continuing to explore the
current solution. We test this idea on both POS
tagging and dependency parsing and show that
backtracking is an effective means to fight
against error propagation.
1 Introduction
Transition-based parsing has become a major ap-
proach in dependency parsing, since the work of
Yamada and Matsumoto (2003) and Nivre et al.
(2004), for it combines linear time complexity
and high linguistic performance. The algorithm
follows a local and greedy approach to parsing
that consists of selecting at every step of the
parsing process the action that maximizes a lo-
cal score, typically computed by a classifier. The
action selected is greedily applied to the cur-
rent configuration of the parser and yields a new
configuration.
At training time, an oracle function transforms
the correct syntactic tree of a sentence into a
sequence of correct (configuration, action) pairs.
These pairs are used to train the classifier of the
parser. The configurations that do not pertain to
the set of correct configurations are never seen
during training.
At inference time, if the parser predicts and
executes an incorrect action, it produces an in-
correct configuration, with respect to the sentence
being parsed, which might have never been seen
during training, yielding a poor prediction of the
next action to perform. Additionally, the parser
follows a single hypothesis by greedily selecting
the best scoring action. The solution built by the
parser can be sub-optimal for there is no guarantee
that the sum of the scores of the actions selected
maximizes the global score.
These are well-known problems of transition-
based parsing and several solutions have been
proposed in the literature to overcome them, they
will be briefly discussed in Section 2. The solution
we propose in this article consists of allowing the
parser to backtrack. At every step of the parsing
process, the parser has the opportunity to undo its
n previous actions to explore alternative solutions.
The decision to backtrack or not is taken each time
a new word is considered, before trying to process
it, by giving the current configuration to a binary
classifier that will assign a score to the back-
tracking action. Traditional supervised learning is
not suited to learn such a score, since the training
data contains no occurrences of backtrack actions.
In order to learn in which situation a backtrack
action is worthwhile, we use reinforcement learning.
During training, the parser has the opportunity to
try backtracking actions and the training algorithm
responds to this choice by granting it a reward. If
the backtracking action is the adequate move to
make in the current configuration, it will receive a
positive reward and the parser will learn in which
situations backtracking is adequate.
The work presented here is part of a more am-
bitious project which aims at modeling eye move-
ments during human reading. More precisely,
we are interested in predicting regressive saccades:
eye movements that bring the gaze back to a
previous location in the sentence.
There is much debate in the psycholinguis-
tic literature concerning the reasons for such eye
movements (Lopopolo et al., 2019). Our position
with respect to this debate is the one advocated by
Rayner and Sereno (1994), for whom part of these
saccades are linguistically motivated and happen
in situations where the reader’s incremental com-
prehension of the sentence is misguided by an
ambiguous sentence start, until a novel word is reached whose integration proves incompatible with the current understanding and triggers a regressive saccade, as in garden path sentences.
Our long-term project is to model regressive sac-
cades with backtracking actions. Although this
work is part of this long-term project, the focus of this article is on the NLP aspects of this program
and we propose a way to implement backtrack-
ing in the framework of transition-based parsing.
We will just mention some preliminary studies on
garden path sentences in the Conclusion.
In order to move in the direction of a more cog-
nitively plausible model, we add two constraints
to our model.
The first one concerns the text window around
the current word that the parser takes into account
when predicting an action. This window can be
seen as an approximation of the sliding window
introduced by McConkie and Rayner (1975) to
model the perceptual span of a human reader.1
Transition-based parsers usually allow taking into
account the right context of the current word. The
words in the right context constitute a rich source
of information and yield better predictions of the
next action to perform.
In our model, the parser does not have access
to this right context, simulating the fact that a
human reader has only a limited access to the right
context (a few characters to the right of the fixation point; McConkie and Rayner, 1976). It is only after backtracking that a right context is available, for it has been uncovered before the backtrack took place.
The second constraint is incrementality. When
performing several tasks, such as POS tagging and
parsing, as will be done in the tagparser described
in Section 3, these tasks are organized in an incre-
mental fashion. At each step, a word is read, POS
tagged, and integrated in the syntactic structure—
a more cognitively plausible behavior than a se-
quential approach where the whole sentence is
first POS tagged then parsed.
The structure of the paper is the following:
In Section 2, we compare our work to other ap-
proaches in transition-based parsing which aim
at proposing solutions to the two problems men-
tioned above. In Section 3, we describe the model
that we use to predict the linguistic annotation and
1It is actually a rough approximation of the sliding window
because it defines a span over words and not characters.
introduce the notion of a BACK action. In Section 4
we show how backtracking actions can be pre-
dicted using reinforcement learning. Section 5
describes the experimental part of the work and
discusses the results obtained. Section 6 concludes
the paper and presents different directions in which
this work will be extended.
2 Related Work
Several ways to overcome the two limits of
transition-based parsing mentioned in the Intro-
duction, namely, training the parser with only
correct examples and exploring only a single hy-
pothesis at inference time, have been explored in
the literature.
Beam Search The standard solution to the sin-
gle hypothesis search is beam search, which allows
considering a fixed number of solutions in parallel
during parsing. Beam search is a general tech-
nique that has been applied to transition-based
parsing in many studies, including Zhang and
Clark (2008), Huang and Sagae (2010), and Zhang
and Nivre (2012). They show that exploring a
fixed number of solutions increases the linguistic
performance over a single-hypothesis parser. In
this work, we do not use a beam search algo-
rithm, but we do explore several solutions in a
non-parallel fashion, using backtracking.
Dynamic Oracles A first way to overcome the
problem of lack of exploration during training is
the proposition by Goldberg and Nivre (2012)
to replace the standard oracle of transition-based
parsing by a dynamic oracle that is able to de-
termine the optimal action a to perform for an
incorrect configuration c. During training,
the
dynamic oracle explores a larger part of the config-
uration space than the static oracle and produces
for an incorrect configuration c an optimal action
a. The pair (c, a) is given as training example
to the classifier, yielding a more robust classifier
that is able to predict the optimal action in some
incorrect configurations. Ballesteros et al. (2016)
show that the principle of the dynamic oracle can
be adapted to train the greedy Stack-LSTM de-
pendency parser of Dyer et al. (2015), improving
its performance.
Yu et al. (2018) describe a method for training
a small and efficient neural network model that
approximates a dynamic oracle for any transition
system. Their model is trained using reinforcement
learning.
In this paper, we use dynamic oracles in two
different ways. First, as baselines for models al-
lowing some exploration during training. Second,
in the definition of the immediate reward function,
as explained in Section 4.
Reinforcement Learning A second answer to
the lack of exploration problem is reinforcement
learning. The exploitation–exploration trade-off
of reinforcement learning allows, at training time,
the model to explore some incorrect configurations
and to learn a policy that selects in such cases the
action that maximizes the long-term reward of
choosing this action. Reinforcement learning is
well suited for transition-based algorithms.
To the best of our knowledge,
two papers
directly address the idea of training a syntac-
tic transition-based parser with reinforcement
learning: Zhang and Chan (2009) and Lê and
Fokkens (2017).
Zhang and Chan (2009) cast the problem, as we
do, as a Markov Decision Process but use a Re-
stricted Boltzmann Machine in order to compute
scores of actions and use the SARSA algorithm
to compute an optimal policy. In our case, we
use deep Q-learning based on a multilayer per-
ceptron, as described in Section 4. In addition,
their immediate reward function is based on the
number of arcs in the gold tree of a sentence that
are absent in the tree being built. We also use an
immediate reward function, but it is based on the
number of arcs in the gold tree that can no longer
be built given the current tree being built, an idea
introduced in the dynamic oracle of Goldberg and
Nivre (2012).
Two major differences distinguish our approach
from the work of Lê and Fokkens (2017). The first
is the idea of pre-training a parser in a supervised
way and then fine-tuning its parameters using re-
inforcement learning. In our case, the parser is
trained from scratch using reinforcement learn-
ing. The reason for this difference is related to our
desire to learn how to backtrack: It is difficult to
make the parser learn to backtrack when it has
been initially trained not to do so (using standard
supervised learning). The second major difference
is the use of a global reward function, which is
computed after the parser has parsed a sentence.
In our case, as mentioned above, we use an im-
mediate reward. The reason for this difference is
linked to pretraining. Since we do not pre-train
our parser, and allow it to backtrack, granting a
reward at the end of the sentence does not allow
the parser to converge since reaching the end of
the sentence is almost impossible for a parser that has not been pre-trained. Other less fundamental differ-
ences distinguish our approach, such as the use of
Q-learning, in our case, to find an optimal policy,
where they use a novel algorithm called Approx-
imate Learning Gradient, based on the fact that
their model’s output is a probability distribution
over parsing actions (as in standard supervised
learning). Another minor difference is the explo-
ration of the search space in the training phase.
They sample the next action to perform using the
probability distribution computed by their model,
while we use an adaptation of ε-greedy that we
describe in Section 4.
Reinforcement learning has also been used to
train a transition-based model for other tasks.
Naseem et al. (2019) present a method to fine-tune
the Stack-LSTM transition-based AMR parser of
Ballesteros and Al-Onaizan (2017) with reinforce-
ment learning, using the Smatch score of the
predicted graph as reward. For semantic depen-
dency parsing, Kurita and Søgaard (2019) found
that fine-tuning their parser with a policy gra-
dient allows it to develop an easy-first strategy,
reducing error propagation.
The fundamental difference between all papers
cited above and our work is the idea of adding
backtracking in a greedy transition-based model.
We use reinforcement learning as a means to
achieve the exploration necessary for this goal. In terms of parsing performance, our method will fare
lower than the state of the art in transition-based
parsing because our parser is constrained to not
see words beyond the current word, a constraint
that comes from the fact that our long-term goal
is not to improve parsing performance but to find
a natural way to encourage a parser to simu-
late regressive saccades observed during human
reading.
3 Backtracking Reading Machines
Our model is an extension of the Reading Ma-
chine, a general model for NLP proposed in Dary
and Nasr (2021) that generalizes transition-based
parsing to other NLP tasks. A Reading Machine
is a finite automaton whose states correspond to
linguistic levels. There can be, for example, one
state for POS tagging, one state for lemmatization, one state for syntactic parsing, and so on. When the machine is in a given state, an action is predicted, which generally writes on an output tape a label corresponding to the prediction just made.2 There are usually as many output tapes as there are levels of linguistic predictions and at least one input tape that usually contains the words of the sentence to process.3 Predictions are realized by classifiers that take as input the configuration of the machine and compute a probability distribution over the possible actions. Configurations are complex objects that describe the aspects of the machine that are useful in order to predict the next action to perform. Among the important elements of a configuration for the rest of the paper, we can cite the current state of the machine, the word index (noted wi), which is the position of the word currently processed, and the history, a list of all actions performed so far.

The text is read word by word; a window of an arbitrary size centered on wi defines the part of the text that can be used to make predictions on the current word.

2 Some actions can be complex and change, for example, the state of an internal stack. All details concerning the Reading Machine can be found in Dary and Nasr (2021).

3 For the sake of simplicity, we consider in this paper that the text to process has already been segmented into sentences and tokenized into words, even if the Reading Machine allows performing these two operations.

Figure 1 shows the architecture of three simple machines. The two machines in the top part of the figure realize a single task. The machine on the left realizes POS tagging. It is made of a single state and has a series of transitions that loop on the state. The machine has one input tape from which words are read and one output tape, on which predicted POS tags are written. Each transition is labeled with a tagging action of the form POS(p) that simply writes the POS tag p on the POS output tape at the word index position.

Figure 1: Three simple reading machines. The top left machine performs POS tagging; the top right one performs unlabeled dependency parsing; and the bottom one performs the two tasks simultaneously.

The machine on the right implements an arc-eager transition-based parser that produces unlabeled dependency trees. It has the same simple structure as the tagging machine. Its transitions are labeled with the four standard arc-eager actions of unlabeled transition-based parsing: LEFT, REDUCE, SHIFT, and RIGHT. The machine has two input tapes, one for words and one for POS tags, and one output tape on which it writes the index of the governor of the current word when LEFT or RIGHT actions are predicted.

The machine on the bottom part of the figure, which we call a tagparser, realizes the two tasks simultaneously in an incremental fashion. When in state POS, the machine tags the current word, then control jumps to the parser in order to attach the current word to the syntactic structure built so far or to store it in a stack. Once this is done, control returns to state POS to tag the next word. The reason why the transitions labeled RIGHT and SHIFT lead to state POS is that these two actions increase the word index wi and initiate the processing of a new word. The machine has one input tape for words and two output tapes, one for POS tags and one for the governor position.

We augment the machines described above in order to allow them to undo some preceding actions. This ability relies on three elements: (a) the definition of a new action, called BACK, that undoes a certain number of actions; (b) the history of the actions performed so far, in order to decide which actions to undo; and (c) the definition of undoing an action.

Undoing an action amounts to recovering the configuration that existed before the action was performed. This is quite straightforward for tagging and parsing actions.4

4 In order to be undone, actions REDUCE and LEFT also need to store stack elements that have been popped.
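As an illustration, the part of a configuration that matters for the rest of the paper can be sketched as follows. This is a simplified Python rendering under our own assumptions; the field names (words, pos, gov, stack, seen, history) are illustrative, and the actual machine stores richer information (several tapes, popped stack elements needed to undo REDUCE and LEFT, etc.).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Configuration:
    """Simplified view of a Reading Machine configuration."""
    words: List[str]                     # input tape
    state: str = "POS"                   # current state of the machine
    wi: int = 0                          # index of the word currently processed
    seen: int = 0                        # rightmost word index uncovered so far
    pos: List[Optional[str]] = field(default_factory=list)   # POS output tape
    gov: List[Optional[int]] = field(default_factory=list)   # governor output tape
    stack: List[int] = field(default_factory=list)           # arc-eager stack
    history: List[str] = field(default_factory=list)         # all actions performed so far

    def __post_init__(self):
        # one output cell per word, empty until an action writes it
        self.pos = [None] * len(self.words)
        self.gov = [None] * len(self.words)
```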
Three backtracking machines, based on the machines of Figure 1, are shown in Figure 2. They all have an extra state, named BACK, and two extra transitions. When in state BACK, the machine predicts one of the two actions BACK or ¬BACK. When action ¬BACK is selected, control jumps either to state POS or SYN, depending on the machine, and the machine behaves like the simple machines of Figure 1. Action ¬BACK does not modify the configuration of the machine. This is not the only possible architecture for backtracking machines; this point will be briefly addressed in the Conclusion.

Figure 2: Three backtracking machines based on the machines of Figure 1.
If action BACK is selected, the last actions of
the history are undone until a ¬BACK action is
reached. This definition of action BACK allows un-
doing all actions that are related to the previous
word. After BACK has been applied, the configu-
ration of the machine is almost the same as the
configuration it was in before processing the cur-
rent word. There is, however, a major difference:
It now has access to the following word. Other-
wise, the machine would deterministically predict
the action it has predicted before. One can
notice that the transition labeled BACK in the ma-
chine loops on a single state; this feature allows
the machine to perform several successive BACK
actions.
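A simple way to implement this behavior is to snapshot the configuration every time a ¬BACK action is taken and to restore the most recent snapshot when a BACK action is predicted. The sketch below follows this idea; configurations are plain dictionaries here, and keeping the uncovered right context visible through a seen field is our reading of the mechanism rather than the exact implementation.

```python
import copy

def apply_not_back(config, snapshots):
    """Not-BACK: record a restore point just before the current word is processed."""
    snapshots.append(copy.deepcopy(config))
    config["history"].append("NOT_BACK")

def apply_back(config, snapshots):
    """BACK: undo every action performed since the last not-BACK by restoring
    the corresponding snapshot. The right context uncovered in the meantime
    stays visible, which is what allows a different re-analysis."""
    restored = snapshots.pop()
    restored["seen"] = config["seen"]                   # keep the uncovered right context
    restored["history"] = config["history"] + ["BACK"]  # one possible bookkeeping of the history
    return restored
```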
Figure 3 shows how the tagparsing machine of
Figure 2 would ideally process the sentence the
old man the boat, a classical garden path sentence
for which two words (old and man) should be
re-analyzed after the noun phrase the boat has
been read. The figure describes the machine con-
figuration each time it is in state BACK. Three tapes
are represented: the input tape that contains tokens,
the POS tape, and the parsing tape that contains the index of the governor of the current word. The figure also represents, at the bottom, the BA array, which is described below, as well as the sequence of actions predicted since the last visit to state BACK. The current word appears in boldface. The figure shows two successive occurrences of a BACK action, after the second determiner is read, leading to the re-analysis of the word old that was tagged ADJ and the word man that was tagged NOUN.

Figure 3: Processing the garden path sentence the old man the boat with a backtracking tagparser. After reading the second determiner, the machine backtracks in order to reanalyze the words old and man.
A backtracking machine such as the one described above can run into infinite loops: Nothing prevents it from endlessly repeating the same sequence of actions. One can hope that, during training, such behavior leads to poor performance and is ruled out, but there is no guarantee that this will be the case. Furthermore, we would like to prevent the machine from exploring, at inference time, the whole (or a large part of the) configuration space.
In order to do so, we introduce a constraint on the
number of times a BACK action is taken when pars-
ing a sentence. A simple way to introduce such
a constraint is to limit to a given constant k the
number of authorized BACK actions per word. This
feature is implemented by introducing an array
BA of size n, where n is the number of words
of the sentence to process. Array BA is initialized
with zeros, and every time a BACK action is pre-
dicted in position i, the value of BA[wi] is incre-
mented. When the machine is in state BACK and
BA[wi] is equal to k, performing a BACK action is
not permitted. A ¬BACK action is forced, bypass-
ing the classifier that decides whether to back-
track or not.
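The constraint can be sketched as follows (the variable and function names are illustrative):

```python
def may_backtrack(ba, wi, k=1):
    """True if a BACK action is still allowed for word index wi; otherwise a
    not-BACK action is forced, bypassing the classifier."""
    return ba[wi] < k

def register_back(ba, wi):
    """Record that a BACK action was predicted while processing word wi."""
    ba[wi] += 1

# BA is initialized with zeros, one cell per word of the sentence:
# ba = [0] * n
```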
The introduction of array BA and the parameter
k defines an upper bound on the size of the
action sequence for a sentence of length n. This
upper bound is equal to 3nk + 2n for the tagger,
4nk + 3n for the parser, and 5nk + 4n for the
tagparser.5 As one can notice, linearity is pre-
served. In our experiments, we chose k = 1.
4 Training
Reading Machines, as introduced by Dary and
Nasr (2021), are trained in a supervised learning
fashion. Given data annotated at several linguis-
tic levels, an oracle function decomposes it into
a sequence of configurations and actions (c0, a0,
c1, a1, . . . , cn, an). This sequence of configura-
tions and actions constitutes the training data of the
classifiers of the machine to train: Pairs (ci, ai)
are presented iteratively to the classifier during
the training stage. A backtracking Reading Ma-
chine cannot be trained this way since there are no
occurrences of BACK actions in the data. In order
to learn useful occurrences of such actions, the
training process should have the ability to gener-
ate some BACK actions and be able to measure if
this action was beneficial.
In order to implement such a behavior, we use
reinforcement learning (RL). We cast our problem
as a Markov Decision Process (MDP). In an MDP, an
agent (the parser) is in configuration ct at time t. It
selects an action at from an action set (made of the
tagging actions, the parsing actions and the BACK
and ¬BACK actions) and performs it. We note C the
set of all configurations and A the set of actions.
The environment (the annotated data) responds
to action at by giving a reward rt = r(ct, at)
and by producing the succeeding configuration
ct+1 = δ(ct, at). In our case, configuration ct+1
is deterministically determined by the structure
of the machine. The reward function gives high
5See Appendix A for details.
reward to actions that move the parser towards
the correct parse of the sentence. The fundamental
difference with supervised training is that, during
training, the agent is not explicitly told which
actions to take, but instead must discover which
action yields the most reward through trial and
error. This feature gives the opportunity for the
parser to try some BACK actions, provided that a
reward can be computed for such actions.
Given an MDP that indicates the reward asso-
ciated to applying action a in configuration c,
the goal is to learn a function q∗(c, a) that maxi-
mizes the total amount of reward (also called dis-
counted return) that can be expected after action a
has been applied. Once this function (or an approx-
imation of it) is computed, one can use it to select
which action to choose when in configuration c,
by simply picking the action that maximizes func-
tion q∗. Such behavior is called an optimal policy
in the RL literature. Function q∗ is the solution to
the Bellman optimality equation.6
q∗(c, a) = r(c, a) + γ max_{a′} q∗(δ(c, a), a′)
where γ ∈ [0, 1] is the discount factor that al-
lows discounting the reward of future actions. The
equation expresses the relationship between the
value of an action a in configuration c and the value
of the best action a(cid:4) that can be performed in its
successor configuration δ(c, a). This recursive
definition is the basis for algorithms that itera-
tively approximate q∗, among which is Q learning
(Watkins, 1989), which approximates q∗ with a
function called Q. In Q learning, during training,
each time an action a is chosen in configuration c,
the value of Q(c, a) is updated:
Q(c, a) ← Q(c, a) + α (Q′(c, a) − Q(c, a))

where α is a learning rate and Q′(c, a) is a new estimation of Q(c, a):

Q′(c, a) = r(c, a) + γ max_{a′} Q(δ(c, a), a′)
It has been proven (Watkins and Dayan, 1992)
that such an iterative algorithm converges towards
the q∗ function.
6This is actually a simplified form of the Bellman opti-
mality equation, due to the fact that our MDP is deterministic:
Applying action a in configuration c yields configuration
δ(c, a) with probability 1.
In order to store the values of Q, the algorithm
uses a large table that has an entry for every
(configuration, action) pair. In our case, there
are far too many configurations to allocate such
a table. Instead we use a simple form of deep
Q learning (Mnih et al., 2013) and approximate
function Q using a multilayer perceptron CQ.
CQ takes as input a configuration c and out-
puts a vector whose dimension is the number of
different actions in the action set. The component
of the vector corresponding to action a is the
approximation of Q(c, a) computed by CQ. It is
noted CQ(c, a).
During training, every time an action a is
performed by the parser in configuration c, the
parameters of CQ are updated using gradient de-
scent of the loss function. The loss function should
be defined in a way to minimize the difference
between the actual value CQ(c, a) and its updated value CQ′(c, a). This difference is computed with the smooth l1 loss (Girshick, 2015):

L(c, a) = l1(CQ(c, a), CQ′(c, a))

where CQ′(c, a) is computed as follows:

CQ′(c, a) = r(c, a) + γ max_{a′} CQ(δ(c, a), a′)
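A single update step can be sketched as follows in PyTorch. This is a minimal rendering of the formulas above, not the exact implementation: feature extraction, batching, and the handling of final configurations are omitted, the function and variable names are ours, and the value of γ shown here is an arbitrary placeholder.

```python
import torch
import torch.nn.functional as F

def q_learning_step(cq, optimizer, c_feats, a_idx, reward, next_feats, gamma=0.99):
    """One deep Q-learning update: move CQ(c, a) towards
    r(c, a) + gamma * max_a' CQ(delta(c, a), a') using the smooth l1 loss."""
    q_ca = cq(c_feats)[a_idx]                              # CQ(c, a)
    with torch.no_grad():                                  # the target is not differentiated
        target = reward + gamma * cq(next_feats).max()     # CQ'(c, a)
    loss = F.smooth_l1_loss(q_ca, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```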
Reward Functions
In RL, the training process
is guided by the reward r(c, a) granted by the
environment when action a is performed in config-
uration c. Defining a reward function for tagging
action and parsing action is quite straightforward.
In the case of tagging, the reward should be
high when the tag chosen for the current word is
correct and low when it is not. A simple reward
function is one that associates, for example, value
0 in the first case and −1 in the second. More
elaborate reward functions could be defined that
penalize some confusions more heavily (e.g., tagging a
verb as a preposition).
In the case of parsing, a straightforward reward
function will give a reward of zero for a correct
action and a negative reward for an incorrect one.
We use a slightly more complex function inspired
by the dynamic oracle of Goldberg and Nivre
(2012). This function, in the case of an incorrect
action a, counts the number of dependencies of
the correct analysis that can no longer be predicted due to a. The reward for a is the opposite
of this number. Actions that cannot be executed,
such as popping an empty stack, are granted a
reward of −1.5.
Defining the reward function for BACK actions
is more challenging. When a BACK action is per-
formed, a certain number of actions ai . . . ai+k
are undone. Each of these actions was granted a
reward ri . . . ri+k. Let's call E = −Σ_{t=0}^{k} r_{i+t} the opposite of the sum of these rewards (E ≥ 0).
The larger E is, the more errors have been made.
Let’s call ϕ(E) the function that computes the
reward for executing a BACK action, given E.
Formally, we want ϕ to respect the following
principles:
1. Don’t execute a BACK action if there are no
errors: ϕ(0) < 0.
2. ϕ(E) should be increasing with respect to
E: the more errors, the more a BACK action
should be encouraged.
3. ϕ(E) should not grow too fast with respect to
E. Granting too much reward to BACK actions
encourages the system to make errors in order
to correct them with a highly rewarded BACK
action.
A function that fulfills these principles is the
following:
ϕ(E) = −1 if E = 0, and ϕ(E) = ln(E + 1) otherwise.
It is the function that we use to compute the
reward of a BACK action.
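The three reward functions can be summarized by the sketch below. Counting the gold dependencies that can no longer be built after an incorrect parsing action is delegated to a dynamic-oracle-style routine that is not shown here, and the function names are illustrative.

```python
import math

def tag_reward(predicted_tag, gold_tag):
    """Tagging: 0 for a correct tag, -1 otherwise."""
    return 0.0 if predicted_tag == gold_tag else -1.0

def parse_reward(n_lost_gold_arcs, executable=True):
    """Parsing: the opposite of the number of gold dependencies that can no
    longer be built because of the action; non-executable actions (e.g.,
    popping an empty stack) get -1.5."""
    if not executable:
        return -1.5
    return -float(n_lost_gold_arcs)

def back_reward(undone_rewards):
    """BACK: E is the opposite of the sum of the rewards of the undone
    actions; phi(E) = -1 if E == 0 and ln(E + 1) otherwise."""
    e = -sum(undone_rewards)
    return -1.0 if e == 0 else math.log(e + 1.0)
```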
Exploring the Configuration Space
In order to
learn useful BACK actions, the parser should try, in
the training phase, to perform some BACK actions
and update the classifier based on the reward it
receives from the environment. One standard way
to do that is to adopt an ε-greedy policy in which
the model selects a random action with probability
ε or the most probable action as predicted by the
classifier with probability 1 − ε. This setup allows
the system to perform exploitation (choosing the
best action so far) as well as exploration (randomly
choosing an action). We adopt a variant of this
policy, based on two parameters ε and β with
0 ≤ ε ≤ 1, 0 ≤ β ≤ 1, and ε + β ≤ 1. As
in standard ε-greedy policy, the agent chooses
a random action with probability ε, it chooses
the correct action as given by the oracle7 with probability β, and, finally, it chooses the most probable action as predicted by the classifier with probability 1 − (ε + β). Parameter β has been introduced in order to speed up training. In the beginning of the training process, the system is encouraged to follow the oracle. Then, as training progresses, the system relies more on its predictive capacity (exploitation augments) and less on the oracle. Figure 4 shows the evolution of these parameters.

Figure 4: Probabilities of choosing the next action during RL: at random, following the oracle, or the model.
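The resulting sampling policy can be sketched as follows; the classifier cq is assumed to return one score per action, and the handling of legal actions is illustrative.

```python
import random

def choose_action(cq, config_feats, oracle_action, legal_actions, eps, beta):
    """Epsilon-greedy variant used during training: with probability eps pick
    a random legal action, with probability beta follow the oracle (a forced
    not-BACK in the back state), otherwise pick the best-scoring action."""
    r = random.random()
    if r < eps:
        return random.choice(legal_actions)
    if r < eps + beta:
        return oracle_action
    scores = cq(config_feats)                 # one score per action
    return max(legal_actions, key=lambda a: float(scores[a]))
```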
5 Experiments
Three machines were used in our experiments:
a tagger, a parser and a tagparser, based on the
architectures of Figure 2. Each of the three ma-
chines has been trained in three different learning
regimes: Supervised Learning (SL) using a dy-
namic oracle, Reinforcement Learning without
back actions (RL), and Reinforcement Learning
with back actions (RLB).
5.1 Universal Dependencies Corpus
Our primary experiments were conducted on a
French Universal Dependencies corpus (Zeman
et al., 2021), more specifically the GSD corpus,
consisting of 16,341 sentences and 400,399 words.
The original split of the data was 88% train, 9%
dev, and 3% test. The size of the test set being
too small to obtain significant results, we decided
to use a k-fold strategy, where all the data was
first merged, randomly shuffled, and then split
into ten folds, each fold becoming the test set of a
new train/dev/test split of the data in the following
proportions: 80%/10%/10%.
Using the ten folds was unnecessary to obtain
significant results; we therefore decided to limit
ourselves to three folds. The test set for which results are reported contains 4,902 sentences and 124,560 words.

7 ¬BACK in the case of the back state.
5.2 Experimental Setup
Each machine consists of a single classifier, a
multilayer perceptron, with a single hidden layer
of size 3200. When the machine realizes several
tasks, as in case of the tagparser, the classifier has
one decision layer for each task. The output size
of each decision layer is the number of actions
of its corresponding task. A dropout of 30% is
applied to the input vector and to the output of
the hidden layer. The output of the hidden layer is
given as input to a ReLU function. The structure of the classifier is represented in Figure 5.

Figure 5: Structure of the classifier.
The features extracted from configurations and their encoding in the input layer of the classifier are detailed in Appendix B.
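The classifier can be sketched as the following PyTorch module. The exact ordering of dropout and ReLU around the hidden layer, as well as the task names, are our assumptions.

```python
import torch
import torch.nn as nn

class MachineClassifier(nn.Module):
    """Multilayer perceptron with one hidden layer of size 3200, 30% dropout
    on the input vector and on the hidden layer, and one decision layer per
    task (e.g., {"tagger": n_tag_actions, "parser": n_parse_actions})."""

    def __init__(self, input_size, n_actions_per_task, hidden_size=3200, dropout=0.3):
        super().__init__()
        self.input_dropout = nn.Dropout(dropout)
        self.hidden = nn.Linear(input_size, hidden_size)
        self.hidden_dropout = nn.Dropout(dropout)
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_size, n) for task, n in n_actions_per_task.items()}
        )

    def forward(self, features, task):
        h = torch.relu(self.hidden(self.input_dropout(features)))
        return self.heads[task](self.hidden_dropout(h))
```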
In the case of Supervised Learning, the ma-
chines are trained with a dynamic oracle. In the
beginning of the training process, the machines
are trained with a static oracle, for two epochs.
Then, every two epochs, the machines are used to decode the training corpus; for each configuration produced (which could be incorrect), the dynamic oracle selects the optimal action, and these (configuration, action) pairs are used to train the classifier.
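The supervised regime can thus be sketched as the following training schedule; machine.decode, machine.update, and the two oracle functions are placeholders for components that are not detailed here.

```python
def train_supervised(machine, static_oracle, dynamic_oracle, train_sentences, n_epochs):
    """Two epochs with the static oracle, then, every two epochs, the current
    machine decodes the training corpus and the dynamic oracle labels the
    (possibly incorrect) configurations it produces."""
    examples = static_oracle(train_sentences)         # gold (configuration, action) pairs
    for epoch in range(n_epochs):
        if epoch >= 2 and epoch % 2 == 0:
            configs = machine.decode(train_sentences)             # may contain incorrect configurations
            examples = [(c, dynamic_oracle(c)) for c in configs]  # optimal action for each of them
        for configuration, action in examples:
            machine.update(configuration, action)                 # one classifier training step
    return machine
```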
Training the machines in the RL regime takes longer than training them in the SL regime (300 epochs instead of 200). This difference is probably due to the larger exploration of the configuration space.
5.3 Results - Performance
The results for the three machines, under the
three learning regimes, are shown in Table 1. POS
tagging performance is measured with accuracy and displayed in column UPOS. Dependency parsing performance is measured with the unlabeled accuracy score (ratio of words that have been attached to the correct governor) and displayed in the UAS column. The p-value next to each score is a confidence metric indicating whether the score is significantly better than the one below (that's why the last line is never given a p-value). This p-value has been estimated with a paired bootstrap resampling algorithm (Koehn, 2004) using the script (Popel et al., 2017) of the CoNLL 2018 shared task.

          TAGGER             PARSER
Regime    UPOS     p val.    UAS      p val.
RLB       97.65    0.000     88.21    0.000
RL        96.84    0.000     86.60    0.037
SL        96.11              86.17

          TAGPARSER
Regime    UPOS     p val.    UAS      p val.
RLB       97.06    0.000     87.85    0.001
RL        96.73    0.090     87.12    0.211
SL        96.59              86.94

Table 1: Performances of tagger, parser, and tagparser under three learning regimes on our French corpus.
The table shows the same pattern for the three
machines: The RLB regime gets higher results than
the simple RL regime, which is itself better than
the SL regime. Two important conclusions can
be drawn from these results. The first is that the RLB
regime is consistently better than SL: Backtracking
machines make fewer errors than machines trained
in supervised mode. At this point we do not know
whether this superiority comes from reinforce-
ment learning or the addition of a BACK action.
In fact, previous experiments in Zhang and Chan
(2009) and Lê and Fokkens (2017) showed that re-
inforcement learning (without backtracking) can
lead to better results than supervised learning. The
comparison of RLB and RL shows that most of the
performance boost comes from backtracking.
The results of Table 1 also show that the tag-
parser gets better results than single-task machines
(the tagger and the parser) when trained with su-
pervised learning. Note that this comparison is
possible because the parser was not given gold
PoS as input, but instead the ones predicted by the tagparser. These results are in line with the work of Bohnet and Nivre (2012) and Alberti et al. (2015) that show that joint prediction of POS tags and the syntactic tree improves the performance of both. However, this is not true when the machines are trained with reinforcement learning. In this case the parser and the tagger get better results than the tagparser. One reason that could explain this difference is the size of the configuration space of the tagparser, which is an order of magnitude larger than those of the tagger or the parser. We will return to this point in the Conclusion.

            PARSER     TAGGER     TAGPARSER
#Actions    115,588    79,620     153,764
#Errs       3,597      1,323      6,249
#Backs      1,063      891        4,491
bPrec       76.86%     68.46%     73.48%
bRec        24.52%     46.49%     61.72%
C→C         18.07%     28.65%     56.89%
E→E         41.39%     26.09%     23.46%
C→E         05.15%     02.79%     02.81%
E→C         35.39%     42.47%     16.83%

Table 2: Behavior comparison of three RLB machines.
5.4 Results - Statistics
One can gain a better understanding of the ef-
fect of the BACK actions performed by the three
machines with the statistics displayed in Table 2.
Each column of the table concerns one machine
trained in RLB mode. The first line shows the total
number of actions predicted while decoding the
test set, the second line shows the number of errors made, and the third line shows the number of BACK actions predicted. Lines 4 and 5 give
the precision and recall of the BACK actions. The
precision is the ratio of BACK actions that were
predicted after an error was made and the recall is
the ratio of errors after which a BACK action was
predicted. These two figures measure the error de-
tection capabilities of the BACK actions prediction
mechanism. In the case of the parser, the precision
is equal to 76.86%, which means that 76.86% of
the BACK actions were predicted after an error was
made and 24.52% (recall) of the errors provoked
a BACK action prediction. The recall constitutes an
upper bound of the errors that could be corrected.
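These two quantities can be computed from the decoding trace as in the sketch below, under a simplified reading where "after an error" means the position immediately following the erroneous action (the actual bookkeeping may be finer-grained).

```python
def back_precision_recall(trace):
    """trace: list of (action, was_error) pairs in decoding order.
    bPrec = ratio of BACK actions predicted right after an error;
    bRec  = ratio of errors immediately followed by a BACK action."""
    hits = sum(1 for (_, prev_err), (a, _) in zip(trace, trace[1:])
               if prev_err and a == "BACK")
    n_backs = sum(1 for a, _ in trace if a == "BACK")
    n_errors = sum(1 for _, err in trace if err)
    bprec = hits / n_backs if n_backs else 0.0
    brec = hits / n_errors if n_errors else 0.0
    return bprec, brec
```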
The four last lines break down the BACK actions
predicted into four categories. C→C is the case
where a BACK action was predicted after a correct action but did not change that action; E→E is the case where a BACK action was predicted after an error but the error was not corrected: either the same erroneous action was predicted or another erroneous one was. E→C is the case where a BACK action was predicted after an error and has corrected it, while C→E is the case where a correct action was replaced by an incorrect one after a BACK action.
Several conclusions can be drawn from these
statistics.
First, backtracking corrects errors. For the three
machines, there are many more cases where an
error is corrected rather than introduced after a
BACK action was predicted (E→C ≫ C→E). This
means that the difference in scores that we ob-
served between RL and RLB in Table 1 can indeed
be attributed to backtracking.
Second, backtracking is conservative. The num-
ber of predicted BACK actions is quite low (around
1% of the actions for the tagger and the parser
and around 3% for the tagparser) and the preci-
sion is quite high. The machines do not backtrack
very often and they usually do it when errors are
made. This is the kind of behavior we were aim-
ing for. It can be modified by changing the reward
function ϕ of the BACK action.
Third, tagging errors are easier to correct than parsing errors. The comparison of columns 2 and 3 (parser and tagger) shows that the tagger has a higher recall than the parser; tagging errors are therefore easier to detect. This comparison also shows that E→C is higher for the tagger than it is for the parser; tagging errors are therefore easier to correct.
Lastly, the poor performance of the tagparser
does not come from the fact that it does not
backtrack. It actually does backtrack around three
times as often as the parser or the tagger. But it
has a hard time correcting the errors; most of the
time, it reproduces errors made before. We will
return to this point in the Conclusion.
5.5 Results on Other Languages
In order to study the behavior of backtracking
on other languages, we have trained and eval-
uated our system on six other languages from
various typological origins: Arabic (AR), Chinese
(ZH), English (EN), German (DE), Romanian (RO),
and Russian (RU). The experimental setup for
these languages is different from the one we have used for French: we have used the original split in train, development, and test sets, as defined in the Universal Dependencies corpora. We report in Table 3 the corpora used for each language as well as the size of the training, development, and test sets. We did not run experiments on the tagparser, for it gave poor results in our experiments on French. Additionally, as a sanity check, we have rerun the experiments on French data using the original split, in order to make sure that the difference of experimental conditions did not yield important differences in the results. The results of these experiments can be found in Table 4. The table indicates p-values of the difference between one system and the next best performing one. The system with the worst performance is therefore not associated with a p-value.

Lang.   Corpus      Train       Dev       Test
AR      PADT        254,400     34,261    32,132
ZH      GSD         98,616      12,663    12,012
EN      GUM         81,861      15,598    15,926
FR      GSD         364,349     36,775    10,298
DE      HDT         2,753,627   319,513   326,250
RO      RRT         185,113     17,074    16,324
RU      SynTagRus   871,526     118,692   117,523

Table 3: Corpora used for experiments on new languages, with the size of training, development, and test sets (in tokens).
The results obtained on French are lower than
the results obtained using the k-fold strategy. But
the drop is moderate—0.13% for the tagger and
0.52% for the parser—and could be explained
simply by the difference of the test corpora on
which the systems were evaluated.
We observe more or less the same pattern for
the new languages: The highest performance is
reached by reinforcement learning with backtrack
(RLB), for both the tagger and the parser. The
second-best performing systems for tagging are
usually trained with reinforcement learning but
differences are usually non-significant. In the case
of the parser, the second-best performing sys-
tems are trained in a supervised regime, but as
was the case for tagging, the differences are of-
ten non significant. The performance on Arabic
is different, where no significant advantage was
observed when using backtracking. The reason
for this is the agglutinative nature of Arabic and
the tokenization conventions of UD that tokenizes
agglutinated pronouns. The effect of this tokenization is to increase the distances between content words. The most common pattern that triggers a backtrack in tagging consists in going back to the previous word in order to modify its part of speech. In the case of Arabic, if the target of the backtrack has an agglutinated pronoun, the tagger has to perform two successive BACK actions to realize the correction, a pattern that is more difficult to learn.

                      TAGGER            PARSER
Regime         UPOS   p val.     UAS    p val.
English-GUM
  RLB          94.99  0.00       79.96  0.00
  RL           93.53  0.00       72.97
  SL           92.63             73.12  0.43
French-GSD
  RLB          97.53  0.00       87.70  0.17
  RL           96.64  0.10       86.63
  SL           96.25             86.97  0.33
German-HDT
  RLB          97.88  0.00       93.00  0.00
  RL           97.29  0.47       91.26
  SL           97.28             91.31  0.35
Romanian-RRT
  RLB          97.07  0.00       85.40  0.26
  RL           96.28  0.03       84.70
  SL           95.77             85.00  0.32
Russian-SynTagRus
  RLB          98.41  0.00       86.59  0.00
  RL           97.93  0.01       85.25
  SL           97.80             85.27  0.46
Chinese-GSD
  RLB          93.01  0.00       71.70  0.00
  RL           91.61  0.23       64.63
  SL           91.30             65.79  0.13
Arabic-PADT
  RLB          96.43  0.16       83.81  0.47
  RL           96.20  0.03       83.77
  SL           95.74             83.95  0.38

Table 4: Results for seven languages.
The general conclusion that we can therefore draw is that reinforcement learning with backtrack yields the best performance for both the parser and the tagger (with the exception of Arabic), but there are no notable differences between supervised
learning with a dynamic oracle and reinforcement learning (without backtrack).

Lang     EN        DE        RU        AR
#Act     31,852    652,500   234,658   56,528
#Errs    1,097     54,173    19,374    1,093
#Backs   691       53,573    19,435    310
bPrec    72.94%    95.88%    95.64%    58.06%
bRec     45.94%    94.83%    95.94%    16.47%
C→C      24.31%    03.77%    04.21%    39.35%
E→E      27.06%    07.35%    05.40%    29.03%
C→E      02.75%    00.35%    00.16%    02.58%
E→C      45.88%    88.53%    90.23%    29.03%

Table 5: Statistics for BACK actions performed during tagging for four languages.
The statistics on the situations in which BACK
actions are performed have been displayed for
four languages in Table 5. The table reveals some
striking differences for two languages: German
and Russian. For these languages, the ratio of
BACK actions with respect to the total number of
actions predicted is equal to 8.3% for German and
to 8.2% for Russian, a figure that is far above what
is observed for other languages. A closer look at
the results shows that for these two languages, the
machine learns a strategy that consists in provok-
ing errors (it tags linguistic tokens as punctuation),
in order to be able to correct them using a BACK
action. This behavior is not due to linguistic rea-
sons but rather to the size of the training corpora.
As one can see in Table 3, the training corpora for
German and Russian are much larger than they are
for other languages. Our hypothesis is that, when
trained on a large training corpus, the machine has
more opportunities to develop complex strategies
such as provoking errors in order to correct them
using a BACK action. Indeed, this phenomenon van-
ishes when we reduce the size of the train set. In
order to fight against this behavior, one can act
on the reward function and decrease the reward
of the BACK action as the size of the training corpus
increases. More investigation is needed to fully
understand this behavior.
The reason why the tagger chose to make reg-
ular mistakes—tagging as punctuation the words
that are later corrected—is not clear. Our hypoth-
esis is that this is an error that is easy to detect in
order to predict a BACK action.
The same phenomenon (intentional errors) has
been observed, to a lesser degree, on the parser.
6 Conclusions
We have proposed, in this article, an extension to
transition-based parsing that consists of allowing
the parser to undo the last actions performed in
order to explore alternative hypotheses. We have
shown that this kind of model can be effectively
trained using deep reinforcement learning.
This work will be extended in several ways.
The first concerns the disappointing results of
the tagparser. As already mentioned in Section 4,
many studies have shown that there is usually
an advantage in jointly predicting several types of linguistic annotations over predicting them separately. We did not observe this phenomenon in our
backtracking tagparser. The problem could come
from the structure of the backtracking machines.
The structures used, illustrated in Figure 2, are
just one possible architecture; others are possible, for example dedicating different BACK states to the parser and the tagger.
The second direction concerns the integration of
other NLP tasks in a single machine. Dary and Nasr
(2021) showed that reading machines can take
as input raw text and realize word segmentation,
POS tagging, lemmatization, parsing, and sentence
segmentation. We would like to train such a com-
plex machine with reinforcement learning and
backtracking to study whether the machine can
backtrack across several linguistic levels.
A third direction concerns the processing of
garden path sentences. Such sentences offer exam-
ples in which we expect a backtracking machine
to backtrack. We have built a corpus of 54 garden
path sentences in French, using four syntactic patterns of different complexity levels, organized the sentences in minimal pairs, and tested our machine on this corpus. The results are mixed;
in some cases, the machine behaves as expected,
in other cases it does not. A detailed analysis
showed that the machine usually backtracks on
garden path sentences but has a tendency not to reanalyze the sentence as expected. The
problem seems to be of a lexical nature. Some
words that should be attributed new POS tags in
the reanalysis phase resist this reanalysis. This is
probably due to their lexical representation. More
work is needed to understand this phenomenon
and find ways to overcome it.
The last direction is linked to the long-term
project of predicting regressive saccades. The gen-
eral idea is to compare back movements predicted
by our model and actual eye movement data and
study whether these movements are correlated.
Appendix A: Time Complexity of
Backtracking
In arc-eager dependency parsing, a sentence of
length n is processed in 2n actions: an action that
pushes a word on the stack (shift or right) and an
action that removes it from the stack (reduce or
left). In our backtracking tagparser, we must add
one tagging action and one ¬BACK per word.
Therefore, without applying any BACK action,
the number of actions needed to process a sentence
of size n is 4n: for each word a ¬BACK, a POS
action, a push action and a pop action.
Let si be the length of the action sequence tak-
ing place when processing the ith word of the
sentence. The sum of these lengths is also the
number of actions to process the whole sentence,
therefore
n
i=1 si = 4n.
(cid:3)
Now, in the worst case scenario where BACK
is applied k times per word, the total number of
actions is 5nk + 4n, which is the sum of:
• k back actions per word: nk.
• Initial application of the si sequences: 4n.
• The k re-processing of the sequences: Σ_{i=1}^{n} k × s_i = k Σ_{i=1}^{n} s_i = 4nk.
Appendix B: Features of the Classifiers
The input layer of the classifiers described in
Section 5 is a vector of features extracted from
the current configuration. Features are represented
as randomly initialized and learnable embeddings
of size 128, with the exception of words, that
are represented by fastText pretrained word em-
beddings of size 300 (Bojanowski et al., 2017).
Four embedding spaces are used: words, POS tags,
letters, and actions.
The features are the following:
• POS tags and form of the words in a window
of size [-2,2] centered on the current word,
with the addition, for the parser and tagparser,
of the POS tag of the governor, the rightmost
and the leftmost dependents of the three top-
most stack elements.
• The history of the 10 last actions performed.
• Prefix and suffix of size 4 for the current
word.
• For backtracking machines, a binary feature
indicating whether or not a back is allowed.
When the value of a feature is not available, it is
replaced by a special learnable embedding, repre-
senting the reason for its unavailability. The following situations are distinguished (a sketch of this encoding is given after the list):
• Out of bounds: target word is before the first
or after the last word of the sentence.
• Empty stack: target word is in the empty
stack.
• No dep / gov: target word is the dependent /
governor of a word without one.
• Not seen: target word is in right context, and
has not been seen yet.
• Erased: target value has been erased after a
BACK action.
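The sketch below illustrates this encoding; embeddings and special_embeddings stand for learnable lookup tables, and the dictionary-based lookup and function names are illustrative.

```python
import torch

def encode_feature(value, reason, embeddings, special_embeddings):
    """One feature of the configuration: a learnable embedding for an
    available value, or a special learnable embedding representing the reason
    of unavailability (out of bounds, empty stack, no dep/gov, not seen, erased)."""
    if value is not None:
        return embeddings[value]
    return special_embeddings[reason]

def build_input(feature_values, embeddings, special_embeddings):
    """Concatenate all feature embeddings into the input vector of the
    classifier. feature_values is the list of (value, reason) pairs extracted
    from the configuration."""
    parts = [encode_feature(value, reason, embeddings, special_embeddings)
             for value, reason in feature_values]
    return torch.cat(parts)
```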
Acknowledgments
We would like to thank the action editor as well
as the anonymous reviewers, for their detailed and
thoughtful insights, which helped us improve on
our work substantially.
References
Chris Alberti, David Weiss, Greg Coppola, and
Slav Petrov. 2015. Improved transition-based
parsing and tagging with neural networks. In
Proceedings of the 2015 Conference on Empir-
ical Methods in Natural Language Processing,
pages 1354–1359. https://doi.org/10
.18653/v1/D15-1159
Miguel Ballesteros and Yaser Al-Onaizan. 2017.
AMR parsing using stack-LSTMs. In Pro-
ceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing,
pages 1269–1275, Copenhagen, Denmark.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/D17
-1130
Miguel Ballesteros, Yoav Goldberg, Chris Dyer,
and Noah A. Smith. 2016. Training with explo-
ration improves a greedy stack LSTM parser.
In Proceedings of the 2016 Conference on
Empirical Methods in Natural Language Pro-
cessing, pages 2005–2010. https://doi
.org/10.18653/v1/D16-1211
Bernd Bohnet and Joakim Nivre. 2012. A
transition-based system for joint part-of-speech
tagging and labeled non-projective dependency
parsing. In Proceedings of the 2012 Joint Con-
ference on Empirical Methods in Natural Lan-
guage Processing and Computational Natural
Language Learning, pages 1455–1465. Associ-
ation for Computational Linguistics.
Piotr Bojanowski, Edouard Grave, Armand
Joulin, and Tomas Mikolov. 2017. Enriching
word vectors with subword information. Tran-
sactions of the Association for Computational
Linguistics, 5:135–146. https://doi.org
/10.1162/tacl_a_00051
Franck Dary and Alexis Nasr. 2021. The reading
machine: A versatile framework for studying
incremental parsing strategies. In Proceedings
of the 17th International Conference on Pars-
ing Technologies (IWPT 2021), pages 26–37,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2021.iwpt-1.3
Chris Dyer, Miguel Ballesteros, Wang Ling,
Austin Matthews, and Noah A. Smith. 2015.
Transition-based dependency parsing with stack
long short-term memory. In Proceedings of
the 53rd Annual Meeting of the Association
for Computational Linguistics and the 7th
International Joint Conference on Natural Lan-
guage Processing (Volume 1: Long Papers),
pages 334–343, Beijing, China. Association for
Computational Linguistics. https://doi
.org/10.3115/v1/P15-1033
Ross Girshick. 2015. Fast R-CNN. In Proceed-
ings of the IEEE International Conference on
Computer Vision, pages 1440–1448. https://
doi.org/10.1109/ICCV.2015.169
Yoav Goldberg and Joakim Nivre. 2012. A
dynamic oracle for arc-eager dependency
parsing. In Proceedings of COLING 2012,
pages 959–976.
Liang Huang and Kenji Sagae. 2010. Dy-
namic programming for linear-time incremen-
tal parsing. In Proceedings of the 48th Annual
Meeting of the Association for Computational
Linguistics, pages 1077–1086.
Philipp Koehn. 2004. Statistical significance tests
for machine translation evaluation. In Pro-
ceedings of the 2004 Conference on Empiri-
cal Methods in Natural Language Processing,
pages 388–395, Barcelona, Spain. Association
for Computational Linguistics.
Shuhei Kurita and Anders Søgaard. 2019. Multi-
task semantic dependency parsing with policy
gradient for learning easy-first strategies. In
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 2420–2430, Florence, Italy. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/P19-1232
Minh Lê and Antske Fokkens. 2017. Tackling
error propagation through reinforcement learn-
ing: A case of greedy dependency parsing.
In Proceedings of the 15th Conference of the
European Chapter of the Association for Com-
putational Linguistics: Volume 1, Long Papers,
pages 677–687.
Alessandro Lopopolo, Stefan L. Frank, Antal van den Bosch, and Roel Willems. 2019. Dependency parsing with your eyes: Dependency structure predicts eye regressions during reading. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 77–85.
George W. McConkie and Keith Rayner. 1975.
The span of the effective stimulus during a fix-
ation in reading. Perception & Psychophysics,
17(6):578–586. https://doi.org/10.3758
/BF03203972
George W. McConkie and Keith Rayner. 1976. Asymmetry of the perceptual span in reading. Bulletin of the Psychonomic Society, 8(5):365–368. https://doi.org/10.3758/BF03335168
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Tahira Naseem, Abhishek Shah, Hui Wan, Radu
Florian, Salim Roukos, and Miguel Ballesteros.
2019. Rewarding Smatch: Transition-based
AMR parsing with reinforcement learning. In
Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 4586–4592, Florence, Italy. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/P19-1451
Joakim Nivre, Johan Hall, and Jens Nilsson.
2004. Memory-based dependency parsing. In
Proceedings of the Eighth Conference on Com-
putational Natural Language Learning (CoNLL-
2004) at HLT-NAACL 2004, pages 49–56.
Martin Popel, Zdeněk Žabokrtský, and Martin Vojtek. 2017. UDAPI: Universal API for universal dependencies. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), pages 96–101.
Keith Rayner and Sara C. Sereno. 1994. Re-
gressive eye movements and sentence parsing:
On the use of regression-contingent analyses.
Memory & Cognition, 22(3):281–285. https://
doi.org/10.3758/BF03200855
Christopher J. C. H. Watkins and Peter Dayan. 1992. Q-learning. Machine Learning, 8(3–4):279–292. https://doi.org/10.1023/A:1022676722315
Christopher John Cornish Hellaby Watkins. 1989. Learning from Delayed Rewards. Ph.D. thesis, King’s College, Cambridge, United Kingdom.
Hiroyasu Yamada and Yuji Matsumoto. 2003.
Statistical dependency analysis with support
vector machines. In Proceedings of the Eighth
International Conference on Parsing Technol-
ogies, pages 195–206.
Xiang Yu, Ngoc Thang Vu, and Jonas Kuhn. 2018.
Approximate dynamic oracle for dependency
parsing with reinforcement learning. In Pro-
ceedings of the Second Workshop on Universal
Dependencies (UDW 2018), pages 183–191,
Brussels, Belgium. Association for Computa-
tional Linguistics.
Daniel Zeman, Joakim Nivre, Mitchell Abrams, et al. 2021. Universal Dependencies 2.9. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
Lidan Zhang and Kwok-Ping Chan. 2009. Depen-
dency parsing with energy-based reinforcement
learning. In Proceedings of the 11th Interna-
tional Conference on Parsing Technologies
(IWPT’09), pages 234–237. https://doi
.org/10.3115/1697236.1697284
Yue Zhang and Stephen Clark. 2008. A tale
of two parsers: Investigating and combining
graph-based and transition-based dependency
parsing. In Proceedings of the 2008 Conference
on Empirical Methods in Natural Language
Processing, pages 562–571. https://doi
.org/10.3115/1613715.1613784
Yue Zhang and Joakim Nivre. 2012. Analyz-
ing the effect of global learning and beam-
search on transition-based dependency parsing.
In Proceedings of COLING 2012: Posters,
pages 1391–1400.