Dependency Parsing with Backtracking using

Dependency Parsing with Backtracking using
Deep Reinforcement Learning

Franck Dary, Maxime Petit, Alexis Nasr
Aix Marseille Univ, Universit´e de Toulon, CNRS, LIS, Marseille, France
{franck.dary,maxime.petit,alexis.nasr}@lis-lab.fr

Abstract

Greedy algorithms for NLP such as transition-
based parsing are prone to error propagation.
One way to overcome this problem is to al-
low the algorithm to backtrack and explore an
alternative solution in cases where new evi-
dence contradicts the solution explored so far.
In order to implement such a behavior, we use
reinforcement learning and let the algorithm
backtrack in cases where such an action gets
a better reward than continuing to explore the
current solution. We test this idea on both POS
tagging and dependency parsing and show that
backtracking is an effective means to fight
against error propagation.

1

Introduction

Transition-based parsing has become a major ap-
proach in dependency parsing, since the work of
Yamada and Matsumoto (2003) and Nivre et al.
(2004), for it combines linear time complexity
and high linguistic performances. The algorithm
follows a local and greedy approach to parsing
that consists of selecting at every step of the
parsing process the action that maximizes a lo-
cal score, typically computed by a classifier. The
action selected is greedily applied to the cur-
rent configuration of the parser and yields a new
configuration.

At training time, an oracle function transforms
the correct syntactic tree of a sentence into a
sequence of correct (configuration, action) pairs.
These pairs are used to train the classifier of the
parser. The configurations that do not pertain to
the set of correct configurations are never seen
during training.

At inference time, if the parser predicts and
executes an incorrect action, it produces an in-
correct configuration, with respect to the sentence
being parsed, which might have never been seen
during training, yielding a poor prediction of the
next action to perform. Additionally, the parser
follows a single hypothesis by greedily selecting

the best scoring action. The solution built by the
parser can be sub-optimal for there is no guarantee
that the sum of the scores of the actions selected
maximizes the global score.

These are well-known problems of transition-
based parsing and several solutions have been
proposed in the literature to overcome them, they
will be briefly discussed in Section 2. The solution
we propose in this article consists of allowing the
parser to backtrack. At every step of the parsing
process, the parser has the opportunity to undo its
n previous actions to explore alternative solutions.
The decision to backtrack or not is taken each time
a new word is considered, before trying to process
it, by giving the current configuration to a binary
classifier, that will assign a score to the back-
tracking action. Traditional supervised learning is
not suited to learn such a score, since the training
data contains no occurrences of backtrack actions.
In order to learn in which situation a backtrack
action is worthy, we use reinforcement learning.
During training, the parser has the opportunity to
try backtracking actions and the training algorithm
responds to this choice by granting it a reward. If
the backtracking action is the adequate move to
make in the current configuration, it will receive a
positive reward and the parser will learn in which
situation backtrack is adequate.

The work presented here is part of a more am-
bitious project which aims at modeling eye move-
ments during human reading. More precisely,
we are interested to predict regressive saccades:
eye movements that bring the gaze back to a
previous location in the sentence.

There is much debate in the psycholinguis-
tic literature concerning the reasons of such eye
movements (Lopopolo et al., 2019). Our position
with respect to this debate is the one advocated by
Rayner and Sereno (1994), for whom part of these
saccades are linguistically motivated and happen
in situations where the reader’s incremental com-
prehension of the sentence is misguided by an

888

Transactions of the Association for Computational Linguistics, vol. 10, pp. 888–903, 2022. https://doi.org/10.1162/tacl a 00496
Action Editor: Yusuke Miyao. Submission batch: 12/2021; Revision batch: 3/2022; Published 9/2022.
c(cid:2) 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
4
9
6
2
0
4
2
5
8
1

/

/
t

l

a
c
_
a
_
0
0
4
9
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

ambiguous sentence start, until reaching a novel
word which integration will prove incompatible
with the current understanding and will trigger a
regressive saccade, as in garden path sentences.
Our long-term project is to model regressive sac-
cades with backtracking actions. Although this
work enters in this long-term project, the focus of
this article is on the NLP aspects of this program
and we propose a way to implement backtrack-
ing in the framework of transition-based parsing.
We will just mention some preliminary studies on
garden path sentences in the Conclusion.

In order to move in the direction of a more cog-
nitively plausible model, we add two constraints
to our model.

The first one concerns the text window around
the current word that the parser takes into account
when predicting an action. This window can be
seen as an approximation of the sliding window
introduced by McConkie and Rayner (1975) to
model the perceptual span of a human reader.1
Transition-based parsers usually allow taking into
account the right context of the current word. The
words in the right context constitute a rich source
of information and yield better predictions of the
next action to perform.

In our model, the parser does not have access
to this right context, simulating the fact that a
human reader has only a limited access to the right
context (few characters to the right of the fixation
point (McConkie and Rayner, 1976)). It is only
after backtracking that a right context is avail-
able for it has been uncovered before backtrack
took place.

The second constraint is incrementality. When
performing several tasks, such as POS tagging and
parsing, as will be done in the tagparser described
in Section 3, these tasks are organized in an incre-
mental fashion. At each step, a word is read, POS
tagged, and integrated in the syntactic structure—
a more cognitively plausible behavior than a se-
quential approach where the whole sentence is
first POS tagged then parsed.

The structure of the paper is the following:
In Section 2, we compare our work to other ap-
proaches in transition-based parsing which aim
at proposing solutions to the two problems men-
tioned above. In Section 3, we describe the model
that we use to predict the linguistic annotation and

1It is actually a rough approximation of the sliding window

because it defines a span over words and not characters.

introduce the notion of a BACK action. In Section 4
we show how backtracking actions can be pre-
dicted using reinforcement learning. Section 5
describes the experimental part of the work and
discusses the results obtained. Section 6 concludes
the paper and presents different directions in which
this work will be extended.

2 Related Work

Several ways to overcome the two limits of
transition-based parsing mentioned in the Intro-
duction, namely, training the parser with only
correct examples and exploring only a single hy-
pothesis at inference time, have been explored in
the literature.

Beam Search The standard solution to the sin-
gle hypothesis search is beam search, which allows
considering a fixed number of solutions in parallel
during parsing. Beam search is a general tech-
nique that has been applied to transition-based
parsing in many studies, including Zhang and
Clark (2008), Huang and Sagae (2010), and Zhang
and Nivre (2012). They show that exploring a
fixed number of solutions increases the linguistic
performances over a single hypothesis parser. In
this work, we do not use a beam search algo-
rithm, but we do explore several solutions in a
non-parallel fashion, using backtracking.

Dynamic Oracles A first way to overcome the
lack of exploration during training problem is
the proposition by Goldberg and Nivre (2012)
to replace the standard oracle of transition-based
parsing by a dynamic oracle that is able to de-
termine the optimal action a to perform for an
incorrect configuration c. During training,
the
dynamic oracle explores a larger part of the config-
uration space than the static oracle and produces
for an incorrect configuration c an optimal action
a. The pair (c, a) is given as training example
to the classifier, yielding a more robust classifier
that is able to predict the optimal action in some
incorrect configurations. Ballesteros et al. (2016)
show that the principle of the dynamic oracle can
be adapted to train the greedy Stack-LSTM de-
pendency parser of Dyer et al. (2015), improving
its performance.

Yu et al. (2018) describe a method for training
a small and efficient neural network model that
approximates a dynamic oracle for any transition

889

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
4
9
6
2
0
4
2
5
8
1

/

/
t

l

a
c
_
a
_
0
0
4
9
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

system. Their model is trained using reinforcement
learning.

In this paper, we use dynamic oracles in two
different ways. First, as baselines for models al-
lowing some exploration during training. Second,
in the definition of the immediate reward function,
as explained in Section 4.

Reinforcement Learning A second answer to
the lack of exploration problem is reinforcement
learning. The exploitation–exploration trade-off
of reinforcement learning allows, at training time,
the model to explore some incorrect configurations
and to learn a policy that selects in such cases the
action that maximizes the long-term reward of
choosing this action. Reinforcement learning is
well suited for transition-based algorithms.

To the best of our knowledge,

two papers
directly address the idea of training a syntac-
tic transition-based parser with reinforcement
learning: Zhang and Chan (2009) and Lˆe and
Fokkens (2017).

Zhang and Chan (2009) cast the problem, as we
do, as a Decision Markov Process but use a Re-
stricted Boltzmann Machine in order to compute
scores of actions and use the SARSA algorithm
to compute an optimal policy. In our case, we
use deep Q-learning based on a multilayer per-
ceptron, as described in Section 4. In addition,
their immediate reward function is based on the
number of arcs in the gold tree of a sentence that
are absent in the tree being built. We also use an
immediate reward function, but it is based on the
number of arcs in the gold tree that can no longer
be built given the current tree being built, an idea
introduced in the dynamic oracle of Goldberg and
Nivre (2012).

Two major differences distinguish our approach
and the work of Lˆe and Fokkens (2017). The first
is the idea of pre-training a parser in a supervised
way and then fine-tuning its parameters using re-
inforcement learning. In our case, the parser is
trained from scratch using reinforcement learn-
ing. The reason for this difference is related to our
desire to learn how to backtrack: It is difficult to
make the parser learn to backtrack when it has
been initially trained not to do so (using standard
supervised learning). The second major difference
is the use of a global reward function, which is
computed after the parser has parsed a sentence.
In our case, as mentioned above, we use an im-
mediate reward. The reason for this difference is

linked to pretraining. Since we do not pre-train
our parser, and allow it to backtrack, granting a
reward at the end of the sentence does not allow
the parser to converge since reaching the end of
the sentence is almost impossible using a non
pre-trained parser. Other less fundamental differ-
ences distinguish our approach, such as the use of
Q-learning, in our case, to find an optimal policy,
where they use a novel algorithm called Approx-
imate Learning Gradient, based on the fact that
their model’s output is a probability distribution
over parsing actions (as in standard supervised
learning). Another minor difference is the explo-
ration of the search space in the training phase.
They sample the next action to perform using the
probability distribution computed by their model,
while we use an adaptation of ε-greedy that we
describe in Section 4.

Reinforcement learning has also been used to
train a transition-based model for other tasks.
Naseem et al. (2019) present a method to fine-tune
the Stack-LSTM transition-based AMR parser of
Ballesteros and Al-Onaizan (2017) with reinforce-
ment learning, using the Smatch score of the
predicted graph as reward. For semantic depen-
dency parsing, Kurita and Søgaard (2019) found
that fine-tuning their parser with a policy gra-
dient allows it to develop an easy-first strategy,
reducing error propagation.

The fundamental difference between all papers
cited above and our work is the idea of adding
backtracking in a greedy transition-based model.
We use reinforcement learning as a means to
achieve the exploration necessary to this goal. In
term of parsing performance, our method will fare
lower than the state of the art in transition-based
parsing because our parser is constrained to not
see words beyond the current word, a constraint
that comes from the fact that our long-term goal
is not to improve parsing performance but to find
a natural way to encourage a parser to simu-
late regressive saccades observed during human
reading.

3 Backtracking Reading Machines

Our model is an extension of the Reading Ma-
chine, a general model for NLP proposed in Dary
and Nasr (2021) that generalizes transition-based
parsing to other NLP tasks. A Reading Machine
is a finite automaton whose states correspond to
linguistic levels. There can be, for example, one

890

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
4
9
6
2
0
4
2
5
8
1

/

/
t

l

a
c
_
a
_
0
0
4
9
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

which words are read and one output tape, on
which predicted POS are written. Each transition is
labeled with a tagging action of the form POS(p)
that simply writes the POS tag p on the POS output
tape at the word index position.

The machine on the right implements an arc-
eager transition-based parser that produces unla-
beled dependency trees. It has the same simple
structure than the tagging machine. Its transitions
are labeled with the four standard arc-eager actions
of unlabeled transition-based parsing: LEFT, RE-
DUCE, SHIFT, and RIGHT. The machine has two input
tapes, one for words and one for POS tags, and one
output tape on which it writes the index of the
governor of the current word when LEFT or RIGHT
actions are predicted.

The machine on the bottom part of the figure,
which we call a tagparser, realizes the two tasks
simultaneously in an incremental fashion. When
in state POS, the machine tags the current word then
control jumps to the parser in order to attach the
current word to the syntactic structure built so far
or to store it in a stack. Once this is done, control
returns to state POS to tag the next word. The reason
why the transitions labeled RIGHT and SHIFT lead
to state POS is that these two actions increase the
word index wi and initiate the processing of a new
word. The machine has one input tape for words
and two output tapes, one for POS tags and one for
the governor position.

We augment the machines described above in
order to allow them to undo some preceding ac-
tions. This ability relies on three elements: (a) the
definition of a new action, called BACK, that undoes
a certain number of actions; (b) the history of the
actions performed so far in order to decide which
actions to undo; and (c) the definition of undoing
an action.

Undoing an action amounts to recovering the
configuration that existed before the action was
performed. This is quite straightforward for tag-
ging and parsing actions.4

Three backtracking machines, based on the ma-
chines of Figure 1, are shown in Figure 2. They
all have an extra state, named BACK, and two extra
transitions. When in state BACK, the machine pre-
dicts one of the two actions BACK or ¬BACK. When
action ¬BACK is selected, control jumps either to
state POS or SYN, depending on the machine, and

4In order to be undone, actions REDUCE and LEFT also need

to store stack elements that have been popped.

Figure 1: Three simple reading machines. The top left
machine performs POS tagging; the top right one per-
forms unlabeled dependency parsing; and the bottom
one performs the two tasks simultaneously.

state for POS tagging, one state for lemmatization,
one state for syntactic parsing, and so on. When
the machine is in a given state, an action is pre-
dicted, which generally writes on an output tape a
label corresponding to the prediction just made.2
There are usually as many output tapes as there
are levels of linguistic predictions and at least one
input tape that usually contains the words of the
sentence to process.3 Predictions are realized by
classifiers that take as input the configuration of
the machine and compute a probability distribu-
tion over the possible actions. Configurations are
complex objects that describe the aspects of the
machine that are useful in order to predict the next
action to perform. Among the important elements
of a configuration for the rest of the paper, we
can cite the current state of the machine, the word
index (noted as wi) that is the position of the word
currently processed, and the history, a list of all
actions performed so far.

The text is read word by word, a window of an
arbitrary size centered on wi defines the part of
the text that can be used to make predictions on
the current word.

Figure 1 shows the architecture of three simple
machines. The two machines in the top part of the
figure realize a single task. The machine on the
left part realizes POS tagging. It is made of a single
state and has a series of transitions that loop on
the state. The machine has one input tape from

2Some actions can be complex and change, for example,
the state of an internal stack. All details concerning the
Reading Machine can be found in Dary and Nasr (2021).

3For the sake of simplicity, we consider in this paper that
the text to process has already been segmented into sentences
and tokenized into words, even if the Reading Machine allows
performing these two operations.

891

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
4
9
6
2
0
4
2
5
8
1

/

/
t

l

a
c
_
a
_
0
0
4
9
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
4
9
6
2
0
4
2
5
8
1

/

/
t

l

a
c
_
a
_
0
0
4
9
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Figure 2: Three backtracking machines based on the
machines of Figure 1.

the machine behaves like the simple machines
of Figure 1. Action ¬BACK does not modify the
configuration of the machine. This is not the
only possible architecture for backtracking ma-
chines, this point will be briefly addressed in the
Conclusion.

If action BACK is selected, the last actions of
the history are undone until a ¬BACK action is
reached. This definition of action BACK allows un-
doing all actions that are related to the previous
word. After BACK has been applied, the configu-
ration of the machine is almost the same as the
configuration it was in before processing the cur-
rent word. There is, however, a major difference:
It now has access to the following word. Other-
wise, the machine would deterministically predict
the action it has predicted before. One can
notice that the transition labeled BACK in the ma-
chine loops on a single state, this feature allows
the machine to perform several successive BACK
actions.

Figure 3 shows how the tagparsing machine of
Figure 2 would ideally process the sentence the
old man the boat, a classical garden path sentence
for which, two words (old and man) should be
re-analyzed after the noun phrase the boat has
been read. The figure describes the machine con-
figuration each time it is in state BACK. Three tapes
are represented: the input tape that contains tokens,

Figure 3: Processing the garden path sentence the
old man the boat with a bactracking tagparser. After
reading the second determiner,
the machine back-
tracks in order to reanalyze words old and man.

the POS tape, and the parsing tape that contains the
index of the governor of the current word. The
figure also represents, at the bottom, the BA array,
which is described below, as well as the sequence
of actions predicted since the last visit to state
BACK. The current word appears in boldface. The
figure shows two successive occurrences of a BACK
action, after the second determiner is read, leading
to the re-analysis of the word old that was tagged
ADJ and the word man that was tagged NOUN.

A backtracking machine as the one described
above can run into infinite loops: Nothing pre-
vents it to repeat endlessly the same sequence of
actions. One can hope that, during training, such
behavior leads to poor performance and is ruled
out. But there is no guarantee that this will be
the case. Furthermore, we would like to prevent
the machine for exploring, at inference time, the
whole (or a large part of the) configuration space.
In order to do so, we introduce a constraint on the
number of times a BACK action is taken when pars-
ing a sentence. A simple way to introduce such
a constraint is to limit to a given constant k the
number of authorized BACK actions per word. This

892

feature is implemented by introducing an array
BA of size n, where n is the number of words
of the sentence to process. Array BA is initialized
with zeros, and every time a BACK action is pre-
dicted in position i, the value of BA[wi] is incre-
mented. When the machine is in state BACK and
BA[wi] is equal to k, performing a BACK action is
not permitted. A ¬BACK action is forced, bypass-
ing the classifier that decides whether to back-
track or not.

The introduction of array BA and the parameter
k defines an upper bound on the size of the
action sequence for a sentence of length n. This
upper bound is equal to 3nk + 2n for the tagger,
4nk + 3n for the parser, and 5nk + 4n for the
tagparser.5 As one can notice, linearity is pre-
served. In our experiments, we chose k = 1.

4 Training

Reading Machines, as introduced by Dary and
Nasr (2021), are trained in a supervised learning
fashion. Given data annotated at several linguis-
tic levels, an oracle function decomposes it into
a sequence of configurations and actions (c0, a0,
c1, a1, . . . , cn, an). This sequence of configura-
tions and actions constitute the training data of the
classifiers of the machine to train: Pairs (ci, ai)
are presented iteratively to the classifier during
the training stage. A backtracking Reading Ma-
chine cannot be trained this way since there are no
occurrences of BACK actions in the data. In order
to learn useful occurrences of such actions, the
training process should have the ability to gener-
ate some BACK actions and be able to measure if
this action was beneficial.

In order to implement such a behavior, we use
reinforcement learning (RL). We cast our problem
as a Markov Decision Process (MDP). In an MDP, an
agent (the parser) is in configuration ct at time t. It
selects an action at from an action set (made of the
tagging actions, the parsing actions and the BACK
and ¬BACK actions) and performs it. We note C the
set of all configurations and A the set of actions.
The environment (the annotated data) responds
to action at by giving a reward rt = r(ct, at)
and by producing the succeeding configuration
ct+1 = δ(ct, at). In our case, configuration ct+1
is deterministically determined by the structure
of the machine. The reward function gives high

5See Appendix A for details.

reward to actions that move the parser towards
the correct parse of the sentence. The fundamental
difference with supervised training is that, during
training, the agent is not explicitly told which
actions to take, but instead must discover which
action yields the most reward through trial and
error. This feature gives the opportunity for the
parser to try some BACK actions, provided that a
reward can be computed for such actions.

Given an MDP that indicates the reward asso-
ciated to applying action a in configuration c,
the goal is to learn a function q∗(c, a) that maxi-
mizes the total amount of reward (also called dis-
counted return) that can be expected after action a
has been applied. Once this function (or an approx-
imation of it) is computed, one can use it to select
which action to choose when in configuration c,
by simply picking the action that maximizes func-
tion q∗. Such behavior is called an optimal policy
in the RL literature. Function q∗ is the solution to
the Bellman optimality equation.6

q∗(c, a) = r(c, a) + γ max

a(cid:4) q∗(δ(c, a), a(cid:4))

where γ ∈ [0, 1] is the discount factor that al-
lows discounting the reward of future actions. The
equation expresses the relationship between the
value of an action a in configuration c and the value
of the best action a(cid:4) that can be performed in its
successor configuration δ(c, a). This recursive
definition is the basis for algorithms that itera-
tively approximate q∗, among which is Q learning
(Watkins, 1989), which approximates q∗ with a
function called Q. In Q learning, during training,
each time an action a is chosen in configuration c,
the value of Q(c, a) is updated:

Q(c, a) ← Q(c, a) + α(Q(cid:4)(c, a) − Q(c, a))

where α is a learning rate and Q(cid:4)(c, a) is a new
estimation of Q(c, a):

Q(cid:4)(c, a) = r(c, a) + γ max

a(cid:4) Q(δ(c, a), a(cid:4))

It has been proven (Watkins and Dayan, 1992)
that such iterative algorithm converges towards
the q∗ function.

6This is actually a simplified form of the Bellman opti-
mality equation, due to the fact that our MDP is deterministic:
Applying action a in configuration c yields configuration
δ(c, a) with probability 1.

893

l

D
o
w
n
o
a
d
e
d

f
r
o
m
h

t
t

p

:
/
/

d
i
r
e
c
t
.

m

i
t
.

e
d
u

/
t

a
c
l
/

l

a
r
t
i
c
e

p
d

f
/

d
o

i
/

.

1
0
1
1
6
2

/
t

l

a
c
_
a
_
0
0
4
9
6
2
0
4
2
5
8
1

/

/
t

l

a
c
_
a
_
0
0
4
9
6
p
d

.

f

b
y
g
u
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

In order to store the values of Q, the algorithm
uses a large table that has an entry for every
(configuration, action) pair. In our case, there
are far too many configurations to allocate such
a table. Instead we use a simple form of deep
Q learning (Mnih et al., 2013) and approximate
function Q using a multilayer perceptron CQ.

CQ takes as input a configuration c and out-
puts a vector whose dimension is the number of
different actions in the action set. The component
of the vector corresponding to action a is the
approximation of Q(c, a) computed by CQ. It is
noted CQ(c, a).

During training, every time an action a is
performed by the parser in configuration c, the
parameters of CQ are updated using gradient de-
scent of the loss function. The loss function should
be defined in a way to minimize the difference
between the actual value CQ(c, a) and its updated
(cid:4)(c, a). This difference is computed with
value CQ
the smooth l1 loss (Girshick, 2015):

L(c, a) = l1(CQ(c, a), CQ

(cid:4)(c, a))

where CQ

(cid:4)(c, a) is computed as follows:

CQ

(cid:4)(c, a) = r(c, a) + γ max

a(cid:4) CQ(δ(c, a), a(cid:4))

Reward Functions
In RL, the training process
is guided by the reward r(c, a) granted by the
environment when action a is performed in config-
uration c. Defining a reward function for tagging
action and parsing action is quite straightforward.
In the case of tagging, the reward should be
high when the tag chosen for the current word is
correct and low when it is not. A simple reward
function is one that associates, for example, value
0 in the first case and −1 in the second. More
elaborate reward functions could be defined that
penalize more some confusions (e.g., tagging a
verb as a preposition).

In the case of parsing, a straightforward reward
function will give a reward of zero for a correct
action and a negative reward for an incorrect one.
We use a slightly more complex function inspired
by the dynamic oracle of Goldberg and Nivre
(2012). This function, in the case of an incorrect
action a, counts the number of dependencies of
the correct analysis that cannot anymore be pre-
dicted due to a. The reward for a is the opposite
of this number. Actions that cannot be executed,

such as popping an empty stack, are granted a
reward of −1.5.

Defining the reward function for BACK actions
is more challenging. When a BACK action is per-
formed, a certain number of actions ai . . . ai+k
are undone. Each of these actions was granted a
reward ri . . . ri+k. Let’s call E = −Σk
t=0ri+t the
opposite of the sum of these rewards (E ≥ 0).
The larger E is, the more errors have been made.
Let’s call ϕ(E) the function that computes the
reward for executing a BACK action, given E.
Formally, we want ϕ to respect the following
principles:

1. Don’t execute a BACK action if there are no

errors: ϕ(0) < 0. 2. ϕ(E) should be increasing with respect to E: the more errors, the more a BACK action should be encouraged. 3. ϕ(E) should not grow too fast with respect to E. Granting too much reward to BACK actions encourages the system to make errors in order to correct them with a highly rewarded BACK action. A function that fulfills these principles is the following: (cid:2) ϕ(E) = −1 ln(E + 1) if E = 0 else It is the function that we use to compute the reward of a BACK action. Exploring the Configuration Space In order to learn useful BACK actions, the parser should try, in the training phase, to perform some BACK actions and update the classifier based on the reward it receives from the environment. One standard way to do that is to adopt an ε-greedy policy in which the model selects a random action with probability ε or the most probable action as predicted by the classifier with probability 1 − ε. This setup allows the system to perform exploitation (choosing the best action so far) as well as exploration (randomly choosing an action). We adopt a variant of this policy, based on two parameters ε and β with 0 ≤ ε ≤ 1, 0 ≤ β ≤ 1, and ε + β ≤ 1. As in standard ε-greedy policy, the agent chooses a random action with probability ε, it chooses 894 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 9 6 2 0 4 2 5 8 1 / / t l a c _ a _ 0 0 4 9 6 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Figure 4: Probabilities of choosing the next action during RL: at random, following the oracle, or the model. the correct action as given by the oracle7 with probability β, and, finally, it chooses the most probable action as predicted by the classifier with probability 1 − (ε + β). Parameter β has been introduced in order to speed up training. In the beginning of the training process, the system is encouraged to follow the oracle. Then, as training progresses, the system relies more on its predicting capacity (exploitation augments) and less on the oracle. Figure 4 shows the evolution of theses parameters. 5 Experiments Three machines were used in our experiments: a tagger, a parser and a tagparser, based on the architectures of Figure 2. Each of the three ma- chines has been trained in three different learning regimes: Supervised Learning (SL) using a dy- namic oracle, Reinforcement Learning without back actions (RL), and Reinforcement Learning with back actions (RLB). 5.1 Universal Dependencies Corpus Our primary experiments were conducted on a French Universal Dependencies corpus (Zeman et al., 2021), more specifically the GSD corpus, consisting of 16,341 sentences and 400,399 words. The original split of the data was 88% train, 9% dev, and 3% test. The size of the test set being too small to obtain significant results, we decided to use a k-fold strategy, where all the data was first merged, randomly shuffled, and then split into ten folds, each fold becoming the test set of a new train/dev/test split of the data in the following proportions: 80%/10%/10%. Using the ten folds was unnecessary to obtain significant results; we therefore decided to limit ourselves to three folds. The size of the test set 7¬BACK in the case of the back state. Figure 5: Structure of the classifier. for which results are reported has a size of 4,902 sentences and 124,560 words. 5.2 Experimental Setup Each machine consists of a single classifier, a multilayer perceptron, with a single hidden layer of size 3200. When the machine realizes several tasks, as in case of the tagparser, the classifier has one decision layer for each task. The output size of each decision layer is the number of actions of its corresponding task. A dropout of 30% is applied to the input vector and to the output of the hidden layer. The output of the hidden layer is given as input to a ReLU function. The structure of the classifier is represented in Figure 5. The details of the features extracted from con- figurations and their encoding in the input layer of the classifier are detailed in Appendix B. In the case of Supervised Learning, the ma- chines are trained with a dynamic oracle. In the beginning of the training process, the machines are trained with a static oracle, for two epochs. Then every two epoch, the machines are used to decode the training corpus and for each con- figuration produced (which could be incorrect) the dynamic oracle selects the optimal action and these (configuration, action) pairs are used to train the classifier. Training the machines in the RL regime is longer than training them in the SL regime. In the first case, 200 epochs were needed and 300 in the second. This difference is probably due to the larger exploration of the configuration space. 5.3 Results - Performance The results for the three machines, under the three learning regimes, are shown in Table 1. POS 895 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 9 6 2 0 4 2 5 8 1 / / t l a c _ a _ 0 0 4 9 6 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 TAGGER PARSER PARSER TAGGER TAGPARSER Regime UPOS p val. UAS RLB RL SL 97.65 96.84 96.11 0.000 0.000 88.21 86.60 86.17 Regime UPOS RLB RL SL 97.06 96.73 96.59 TAGPARSER p val. UAS 0.000 0.090 87.85 87.12 86.94 p val. 0.000 0.037 p val. 0.001 0.211 Table 1: Performances of tagger, parser, and tag- parser under three learning regimes on our French corpus. tagging performance is measured with accuracy and displayed in column UPOS. Dependency pars- ing performance is measured with the unlabeled accuracy score (ratio of words that have been at- tached to the correct governor) and displayed in the UAS column. The p-value next to each score is a confidence metric indicating whether the score is significantly better than the one below (that’s why the last line is never given a p-value). This p-value has been estimated with a paired boot- strap resampling algorithm (Koehn, 2004) using the script (Popel et al., 2017) of the CoNLL 2018 shared task. The table shows the same pattern for the three machines: The RLB regime gets higher results than the simple RL regime, which is itself better than the SL regime. Two important conclusions can be drawn from these results. The first is that RLB regime is consistently better than SL: Backtracking machines make fewer errors than machines trained in supervised mode. At this point we do not know whether this superiority comes from reinforce- ment learning or the addition of a BACK action. In fact, previous experiments in Zhang and Chan (2009) and Lˆe and Fokkens (2017) showed that re- inforcement learning (without backtracking) can lead to better results than supervised learning. The comparison of RLB and RL shows that most of the performance boost comes from backtracking. The results of Table 1 also show that the tag- parser gets better results than single-task machines (the tagger and the parser) when trained with su- pervised learning. Note that this comparison is possible because the parser was not given gold #Actions #Errs #Backs bPrec bRec C→C E→E C→E E→C 115,588 3,597 1,063 76.86% 24.52% 18.07% 41.39% 05.15% 35.39% 79,620 1,323 891 68.46% 46.49% 28.65% 26.09% 02.79% 42.47% 153,764 6,249 4,491 73.48% 61.72% 56.89% 23.46% 02.81% 16.83% Table 2: Behavior comparison of three RLB machines. PoS as input, but instead the ones predicted by the tagparser. These results are in line with the work of Bohnet and Nivre (2012) and Alberti et al. (2015) that show that joint prediction of POS tags and syntactic tree improves the performances of both. However, this is not true when the machines are trained with reinforcement learning. In this case the parser and the tagger get better results than the tagparser. One reason that could explain this difference is the size of the configuration space of the tagparser that is an order of magnitude larger than those of the tagger or the parser. We will return to this point in the Conclusion. 5.4 Results - Statistics One can gain a better understanding of the ef- fect of the BACK actions performed by the three machines with the statistics displayed in Table 2. Each column of the table concerns one machine trained in RLB mode. The first line shows the total number of actions predicted while decoding the test set, the second line shows, the number of errors made, and the third line shows, the num- ber of BACK actions predicted. Lines 4 and 5 give the precision and recall of the BACK actions. The precision is the ratio of BACK actions that were predicted after an error was made and the recall is the ratio of errors after which a BACK action was predicted. These two figures measure the error de- tection capabilities of the BACK actions prediction mechanism. In the case of the parser, the precision is equal to 76.86%, which means that 76.86% of the BACK actions were predicted after an error was made and 24.52% (recall) of the errors provoked a BACK action prediction. The recall constitutes an upper bound of the errors that could be corrected. The four last lines break down the BACK actions predicted into four categories. C→C is the case 896 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 9 6 2 0 4 2 5 8 1 / / t l a c _ a _ 0 0 4 9 6 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 where a BACK action was predicted after a correct action, but did not change the action, E→E is the case where a BACK action was predicted after an error, but the error was not corrected, either the same erroneous action was predicted or another erroneous one was predicted. E→C is the case where a BACK action was predicted after an error and has corrected it, while C→E is the case where a correct action was replaced by an incorrect one after a BACK action. Several conclusions can be drawn from these statistics. First, backtracking corrects errors. For the three machines, there are many more cases where an error is corrected rather than introduced after a BACK action was predicted (E→C (cid:10) C→E). This means that the difference in scores that we ob- served between RL and RLB in Table 1 can indeed be attributed to backtracking. Second, backtracking is conservative. The num- ber of predicted BACK actions is quite low (around 1% of the actions for the tagger and the parser and around 3% for the tagparser) and the preci- sion is quite high. The machines do not backtrack very often and they usually do it when errors are made. This is the kind of behavior we were aim- ing for. It can be modified by changing the reward function ϕ of the BACK action. Third, tagging errors are easier to correct than parsing errors. The comparison of columns 2 and 3 (parser and tagger) shows that the tagger has a higher recall than the parser, tagging errors are therefore easier to detect. This comparison also shows that E→C is higher for the tagger than it is for the parser, tagging errors are therefore easier to correct. Lastly, the poor performance of the tagparser does not come from the fact that it does not backtrack. It actually does backtrack around three times as often as the parser or the tagger. But it has a hard time correcting the errors; most of the time, it reproduces errors made before. We will return to this point in the Conclusion. 5.5 Results on Other Languages In order to study the behavior of backtracking on other languages, we have trained and eval- uated our system on six other languages from various typological origin: Arabic (AR), Chinese (ZH), English (EN), German (DE), Romanian (RO), and Russian (RU). The experimental setup for Lang. Corpus Train Dev Test AR ZH EN FR DE RO RU PADT GSD GUM GSD HDT RRT SynTag 254,400 98,616 81,861 364,349 2,753,627 185,113 871,526 34,261 12,663 15,598 36,775 319,513 17,074 118,692 32,132 12,012 15,926 10,298 326,250 16,324 117,523 Table 3: Corpora used for experiments on new languages, with the size of training, development, and test sets (in tokens). these languages is different from the one we have used for French: we have used the original split in train, development, and test sets, as defined in the Universal Dependencies corpora. We report in Table 3 the corpora used for each language as well as the size of the training, development, and test sets. We did not run experiments on the tag- parser for it gave poor results on our experiments on French. Additionally, for a sanity check, we have rerun experiments on French data, using the original split in order to make sure that the dif- ference of experimental conditions did not yield important differences in the results. The results of these experiments can be found in Table 4. The table indicates p-values of the difference between one system and the next best performing one. The system with the worse performances is therefore not associated with a p-value. The results obtained on French are lower than the results obtained using the k-fold strategy. But the drop is moderate—0.13% for the tagger and 0.52% for the parser—and could be explained simply by the difference of the test corpora on which the systems were evaluated. We observe more or less the same pattern for the new languages: The highest performance is reached by reinforcement learning with backtrack (RLB), for both the tagger and the parser. The second-best performing systems for tagging are usually trained with reinforcement learning but differences are usually non-significant. In the case of the parser, the second-best performing sys- tems are trained in a supervised regime, but as was the case for tagging, the differences are of- ten non significant. The performance on Arabic is different, where no significant advantage was observed when using backtracking. The reason for this is the agglutinative nature of Arabic and the tokenization conventions of UD that tokenizes 897 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 9 6 2 0 4 2 5 8 1 / / t l a c _ a _ 0 0 4 9 6 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 TAGGER Regime UPOS p val. UAS PARSER RLB RL SL RLB RL SL RLB RL SL RLB RL SL RLB RL SL RLB RL SL RLB RL SL English-GUM 0.00 0.00 French-GSD 0.00 0.10 German-HDT 0.00 0.47 Romanian-RRT 0.00 0.03 94.99 93.53 92.63 97.53 96.64 96.25 97.88 97.29 97.28 97.07 96.28 95.77 79.96 72.97 73.12 87.70 86.63 86.97 93.00 91.26 91.31 85.40 84.70 85.00 Russian-SynTagRus 98.41 0.00 97.93 0.01 97.80 86.59 85.25 85.27 Chinese-GSD 0.00 0.23 Arabic-PADT 0.16 0.03 93.01 91.61 91.30 96.43 96.20 95.74 71.70 64.63 65.79 83.81 83.77 83.95 p val. 0.00 0.43 0.17 0.33 0.00 0.35 0.26 0.32 0.00 0.46 0.00 0.13 0.47 0.38 Table 4: Results for seven languages. agglutinated pronouns. The effect of this tokeniza- tion is to increase the distances between content words. The most common pattern that triggers a backtrack in tagging consists in going back to the previous word in order to modify its part of speech. In the case of Arabic, if the target of the backtrack has an agglutinated pronoun, the tagger has to perform two successive BACK actions to re- alize the correction, a pattern that is more difficult to learn. The general conclusions that we can draw there- fore is that reinforcement learning with backtrack yields the best performance for both the parser and the tagger (with the exception of Arabic), but there are no notable differences between supervised 898 Lang #Act #Errs #Backs bPrec bRec C→C E→E C→E E→C EN DE RU AR 31,852 1,097 691 652,500 54,173 53,573 234,658 19,374 19,435 56,528 1,093 310 72.94% 95.88% 95.64% 58.06% 45.94% 94.83% 95.94% 16.47% 24.31% 03.77% 04.21% 39.35% 27.06% 07.35% 05.40% 29.03% 02.75% 00.35% 00.16% 02.58% 45.88% 88.53% 90.23% 29.03% Table 5: Statistics for BACK actions performed during tagging for four languages. learning with a dynamic oracle and reinforcement learning (without backtrack). The statistics on the situations in which BACK actions are performed have been displayed for four languages in Table 5. The table reveals some striking differences for two languages: German and Russian. For these languages, the ratio of BACK actions with respect to the total number of actions predicted is equal to 8.3% for German and to 8.2% for Russian, a figure that is far above what is observed for other languages. A closer look at the results shows that for these two languages, the machine learns a strategy that consists in provok- ing errors (it tags as punctuation linguistic tokens), in order to be able to correct them using a BACK action. This behavior is not due to linguistic rea- sons but rather to the size of the training corpora. As one can see in Table 3, the training corpora for German and Russian are much larger than they are for other languages. Our hypothesis is that, when trained on a large training corpus, the machine has more opportunities to develop complex strategies such as provoking errors in order to correct them using a BACK action. Indeed, this phenomenon van- ishes when we reduce the size of the train set. In order to fight against this behavior, one can act on the reward function and decrease the reward of BACK action as the size of the training corpus increases. More investigation is needed to fully understand this behavior. The reason why the tagger chose to make reg- ular mistakes—tagging as punctuation the words that are later corrected—is not clear. Our hypoth- esis is that this is an error that is easy to detect in order to predict a BACK action. The same phenomenon (intentional errors) has been observed, to a lesser degree, on the parser. l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 9 6 2 0 4 2 5 8 1 / / t l a c _ a _ 0 0 4 9 6 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 6 Conclusions We have proposed, in this article, an extension to transition-based parsing that consists of allowing the parser to undo the last actions performed in order to explore alternative hypotheses. We have shown that this kind of model can be effectively trained using deep reinforcement learning. This work will be extended in several ways. The first concerns the disappointing results of the tagparser. As already mentioned in Section 4, many studies have shown that there is usually an advantage to jointly predict several types of linguistic annotations over predicted them sepa- rately. We did not observe this phenomenon in our backtracking tagparser. The problem could come from the structure of the backtracking machines. The structures used, illustrated in Figure 2, are just one possible architecture, others are possible, as for example dedicating different BACK states for the parser and the tagger. The second direction concerns the integration of other NLP tasks in a single machine. Dary and Nasr (2021) showed that reading machines can take as input raw text and realize word segmentation, POS tagging, lemmatization, parsing, and sentence segmentation. We would like to train such a com- plex machine with reinforcement learning and backtracking to study whether the machine can backtrack across several linguistic levels. A third direction concerns the processing of garden path sentences. Such sentences offer exam- ples in which we expect a backtracking machine to backtrack. We have built a corpus of 54 garden path sentences in French, using four different syn- tactic patterns of different complexity level and organized the sentences in minimal pairs and have tested our machine on it. The results are mixed; in some cases, the machine behaves as expected, in other cases it does not. A detailed analysis showed that the machine usually backtracks on garden path sentences but has a tendency of not reanalyzing the sentence as was expected. The problem seems to be of a lexical nature. Some words that should be attributed new POS tags in the reanalysis phase resist this reanalysis. This is probably due to their lexical representation. More work is needed to understand this phenomenon and find ways to overcome it. The last direction is linked to the long-term project of predicting regressive saccades. The gen- eral idea is to compare back movements predicted by our model and actual eye movement data and study whether these movements are correlated. Appendix A: Time Complexity of Backtracking In arc-eager dependency parsing, a sentence of length n is processed in 2n actions: an action that pushes a word on the stack (shift or right) and an action that removes it from the stack (reduce or left). In our backtracking tagparser, we must add one tagging action and one ¬BACK per word. Therefore, without applying any BACK action, the number of actions needed to process a sentence of size n is 4n: for each word a ¬BACK, a POS action, a push action and a pop action. Let si be the length of the action sequence tak- ing place when processing the ith word of the sentence. The sum of these lengths is also the number of actions to process the whole sentence, therefore n i=1 si = 4n. (cid:3) Now, in the worst case scenario where BACK is applied k times per word, the total number of actions is 5nk + 4n, which is the sum of: • k back actions per word: nk. • Initial application of the si sequences: 4n. • The k re-processing of the sequences: (cid:3) (cid:3) n i=1 k × si = k n i=1 si = 4nk. Appendix B: Features of the Classifiers The input layer of the classifiers described in Section 5 is a vector of features extracted from the current configuration. Features are represented as randomly initialized and learnable embeddings of size 128, with the exception of words, that are represented by fastText pretrained word em- beddings of size 300 (Bojanowski et al., 2017). Four embedding spaces are used: words, POS tags, letters, and actions. The features are the following: • POS tags and form of the words in a window of size [-2,2] centered on the current word, with the addition, for the parser and tagparser, of the POS tag of the governor, the rightmost and the leftmost dependents of the three top- most stack elements. 899 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 9 6 2 0 4 2 5 8 1 / / t l a c _ a _ 0 0 4 9 6 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 • The history of the 10 last actions performed. • Prefix and suffix of size 4 for the current word. • For backtracking machines, a binary feature indicating whether or not a back is allowed. When the value of a feature is not available, it is replaced by a special learnable embedding, repre- senting the reason of unavailability. The following situation are distinguished: • Out of bounds: target word is before the first or after the last word of the sentence. • Empty stack: target word is in the empty stack. • No dep / gov: target word is the dependent / governor of a word without one. • Not seen: target word is in right context, and has not been seen yet. • Erased: target value has been erased after a BACK action. Acknowledgments We would like to thank the action editor as well as the anonymous reviewers, for their detailed and thoughtful insights, which helped us improve on our work substantially. References Chris Alberti, David Weiss, Greg Coppola, and Slav Petrov. 2015. Improved transition-based parsing and tagging with neural networks. In Proceedings of the 2015 Conference on Empir- ical Methods in Natural Language Processing, pages 1354–1359. https://doi.org/10 .18653/v1/D15-1159 Miguel Ballesteros and Yaser Al-Onaizan. 2017. AMR parsing using stack-LSTMs. In Pro- ceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1269–1275, Copenhagen, Denmark. Association for Computational Linguistics. https://doi.org/10.18653/v1/D17 -1130 Miguel Ballesteros, Yoav Goldberg, Chris Dyer, and Noah A. Smith. 2016. Training with explo- ration improves a greedy stack LSTM parser. the 2016 Conference on In Proceedings of Empirical Methods in Natural Language Pro- cessing, pages 2005–2010. https://doi .org/10.18653/v1/D16-1211 Bernd Bohnet and Joakim Nivre. 2012. A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In Proceedings of the 2012 Joint Con- ference on Empirical Methods in Natural Lan- guage Processing and Computational Natural Language Learning, pages 1455–1465. Associ- ation for Computational Linguistics. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Tran- sactions of the Association for Computational Linguistics, 5:135–146. https://doi.org /10.1162/tacl a 00051 Franck Dary and Alexis Nasr. 2021. The reading machine: A versatile framework for studying incremental parsing strategies. In Proceedings of the 17th International Conference on Pars- ing Technologies (IWPT 2021), pages 26–37, Online. Association for Computational Lin- guistics. https://doi.org/10.18653/v1 /2021.iwpt-1.3 Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Lan- guage Processing (Volume 1: Long Papers), pages 334–343, Beijing, China. Association for Computational Linguistics. https://doi .org/10.3115/v1/P15-1033 Ross Girshick. 2015. Fast R-CNN. In Proceed- ings of the IEEE International Conference on Computer Vision, pages 1440–1448. https:// doi.org/10.1109/ICCV.2015.169 Yoav Goldberg and Joakim Nivre. 2012. A dynamic oracle for arc-eager dependency parsing. In Proceedings of COLING 2012, pages 959–976. Liang Huang and Kenji Sagae. 2010. Dy- namic programming for linear-time incremen- tal parsing. In Proceedings of the 48th Annual 900 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 9 6 2 0 4 2 5 8 1 / / t l a c _ a _ 0 0 4 9 6 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Meeting of the Association for Computational Linguistics, pages 1077–1086. Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Pro- ceedings of the 2004 Conference on Empiri- cal Methods in Natural Language Processing, pages 388–395, Barcelona, Spain. Association for Computational Linguistics. Shuhei Kurita and Anders Søgaard. 2019. Multi- task semantic dependency parsing with policy gradient for learning easy-first strategies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2420–2430, Florence, Italy. Association for Computational Linguistics. https://doi .org/10.18653/v1/P19-1232 Minh Lˆe and Antske Fokkens. 2017. Tackling error propagation through reinforcement learn- ing: A case of greedy dependency parsing. In Proceedings of the 15th Conference of the European Chapter of the Association for Com- putational Linguistics: Volume 1, Long Papers, pages 677–687. Alessandro Lopopolo, Stefan L. Frank, Antal Van Den Bosch, and Roel Willems. 2019. D. In Proceedings of the Workshop on Cogni- tive Modeling and Computational Linguistics, pages 77–85. George W. McConkie and Keith Rayner. 1975. The span of the effective stimulus during a fix- ation in reading. Perception & Psychophysics, 17(6):578–586. https://doi.org/10.3758 /BF03203972 George W. McConkie and Keith Rayner. 1976. Asymmetry of the perceptual span in reading. Bulletin of the psychonomic society, 8(5):365–368. https://doi.org/10.3758 /BF03335168 Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Tahira Naseem, Abhishek Shah, Hui Wan, Radu Florian, Salim Roukos, and Miguel Ballesteros. 2019. Rewarding Smatch: Transition-based AMR parsing with reinforcement learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4586–4592, Florence, Italy. Association for Computational Linguistics. https://doi .org/10.18653/v1/P19-1451 Joakim Nivre, Johan Hall, and Jens Nilsson. 2004. Memory-based dependency parsing. In Proceedings of the Eighth Conference on Com- putational Natural Language Learning (CoNLL- 2004) at HLT-NAACL 2004, pages 49–56. Martin Popel, Zdenˇek ˇZabokrtsk`y, and Martin Vojtek. 2017. UDAPI: Universal API for In Proceedings of universal dependencies. the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), pages 96–101. Keith Rayner and Sara C. Sereno. 1994. Re- gressive eye movements and sentence parsing: On the use of regression-contingent analyses. Memory & Cognition, 22(3):281–285. https:// doi.org/10.3758/BF03200855 Christopher and Peter J. C. H. Watkins Dayan. 1992. Q-learning. Machine learning, 8(3–4):279–292. https://doi.org/10.1023 /A:1022676722315 Christopher John Cornish Hellaby Watkins. 1989. Learning from Delayed Rewards. Ph.D. thesis, King’s College, Cambridge United Kingdom. Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proceedings of the Eighth International Conference on Parsing Technol- ogies, pages 195–206. Xiang Yu, Ngoc Thang Vu, and Jonas Kuhn. 2018. Approximate dynamic oracle for dependency parsing with reinforcement learning. In Pro- ceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 183–191, Brussels, Belgium. Association for Computa- tional Linguistics. Daniel Zeman, Joakim Nivre, Mitchell Abrams, Elia Ackermann, No¨emi Aepli, Hamid ˇZeljko Agi´c, Amir Ahmadi, Lars Aghaei, Ahrenberg, Chika Kennedy Ajede, Gabriel˙e Aleksandraviˇci¯ut˙e, Ika Alfina, Lene Antonsen, Katya Aplonova, Angelina Aquino, Carolina, Aragon, Maria Jesus Aranzabe, Bilge Nas Arıcan, P´orunn Arnard´ottir, Gashaw Arutie, Jessica Naraiswari Arwidarasti, Masayuki Asahara, Deniz Baran Aslan, Luma Ateyah, 901 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 9 6 2 0 4 2 5 8 1 / / t l a c _ a _ 0 0 4 9 6 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Eryi˘git, Furkan Atmaca, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, Elena Badmaeva, Keerthana Balasubramani, Miguel Ballesteros, Esha Banerjee, Sebastian Bank, Verginica Barbu Mititelu, Starkaður Barkarson, Rodolfo Basile, Victoria Basmov, Colin Batchelor, John Bauer, John Bauer, Seyyit Talha Bedir, Kepa Bengoetxea, G¨ozde Berk, Yevgeni Berzak, Irshad Ahmad Bhat, Riyaz Ahmad Bhat, Erica Biagetti, Eckhard Bick, Agn˙e Bielinskien˙e, Krist´ın Bjarnad´ottir, Rogier Blokland, Victoria Bobicev, Lo¨ıc Boizou, Emanuel Borges V¨olker, Carl B¨orstell, Cristina Bosco, Gosse Bouma, Sam Bowman, Adriane Boyd, Anouck Braggaar, Kristina Brokait˙e, Aljoscha Burchardt, Marie Candito, Bernard Caron, Gauthier Caron, Lauren Cassidy, Tatiana Cavalcanti, G¨uls¸en Cebiro˘glu Flavio Massimiliano Cecchini, Giuseppe G. A. Celano, Slavom´ır ˇC´epl¨o, Neslihan Cesur, Savas Cetin, ¨Ozlem C¸ etino˘glu, Fabricio Chalub, Shweta Chauhan, Ethan Chi, Taishi Chika, Yongseok Cho, Jinho Choi, Jayeol Chun, Juyeon Chung, Alessandra T. Cignarella, Silvie Cinkov´a, Aur´elie Collomb, C¸ a˘grı C¸ ¨oltekin, Miriam Connor, Marine Courtin, Mihaela Cristescu, Philemon Daniel, Elizabeth Davidson, Marie-Catherine de Marneffe, Valeria de Paiva, Mehmet Oguz Derin, Elvis de Souza, Arantza Diaz de Ilarraza, Carly Dickerson, Arawinda Dinakaramani, Elisa Di Nuovo, Bamba Dione, Peter Dirix, Kaja Dobrovoljc, Timothy Dozat, Kira Droganova, Puneet Dwivedi, Hanne Eckhoff, Sandra Eich, Marhaba Eli, Ali Elkahky, Binyam Ephrem, Olga Erina, Tomaˇz Erjavec, Aline Etienne, Wograine Evelyn, Sidney Facundes, Rich´ard Farkas, Jannatul Ferdaousi, Mar´ılia Fernanda, Hector Fernandez Alcalde, Jennifer Foster, Cl´audia Freitas, Kazunori Fujita, Katar´ına Gajdoˇsov´a, Daniel Galbraith, Marcos Garcia, Moa G¨ardenfors, Sebastian Garza, Fabr´ıcio Ferraz Gerardi, Kim Gerdes, Filip Ginter, Gustavo Iakes Goenaga, Koldo Gojenola, Godoy, Memduh G¨okırmak, Yoav Goldberg, Xavier G´omez Guinovart, Berta Gonz´alez Saavedra, Bernadeta Grici¯ut˙e, Matias Grioni, Lo¨ıc Grobol, Normunds Gr¯uz¯ıtis, Bruno Guillaume, C´eline Guillot-Barbance, Tunga G¨ung¨or, Nizar Habash, Hinrik Hafsteinsson, Jan Hajiˇc, Jan Hajiˇc Jr., Mika H¨am¨al¨ainen, Linh H´a M˜y, Na-Rae Han, Muhammad Yudistira Hanifmuti, Sam Hardwick, Kim Harris, Dag Haug, 902 Jha, Anders Johannes Heinecke, Oliver Hellwig, Felix Hennig, Barbora Hladk´a, Jaroslava Hlav´aˇcov´a, Florinel Hociung, Petter Hohle, Eva Huber, Jena Hwang, Takumi Ikeda, Anton Karl Ingason, Radu Ion, Elena Irimia, O. l´aj´ıd´e Ishola, Kaoru Ito, Siratun Jannat, Tom´aˇs Jel´ınek, Apoorva Johannsen, Hildur J´onsd´ottir, Fredrik Jørgensen, Markus Juutinen, Sarveswaran K, H¨uner Kas¸ıkara, Andre Kaasen, Nadezhda Kabaeva, Sylvain Kahane, Hiroshi Kanayama, Jenna Kanerva, Neslihan Kara, Boris Katz, Tolga Kayadelen, Jessica Kenney, V´aclava Kettnerov´a, Jesse Kirchner, Elena Klementieva, Elena Klyachko, Arne K¨ohn, Abdullatif K¨oksal, Kamil Kopacewicz, Timo Korkiakangas, Mehmet K¨ose, Natalia Kotsyba, Jolanta Kovalevskait˙e, Simon Krek, Parameswari Krishnamurthy, Sandra K¨ubler, O˘guzhan Kuyrukc¸u, Aslı Kuzgun, Sookyoung Kwak, Veronika Laippala, Lucia Lam, Lorenzo Lambertino, Tatiana Lando, Septina Dian Larasati, Alexei Lavrentiev, John Lee, Phương Lˆe Hồng, Alessandro Lenci, Saran Lertpradit, Herman Leung, Maria Levina, Cheuk Ying Li, Josie Li, Keying Li, Yuan Li, KyungTae Lim, Bruna Lima Padovani, Krister Lind´en, Nikola Ljubeˇsi´c, Olga Loginova, Stefano Lusito, Andry Luthfi, Mikko Luukko, Olga Lyashevskaya, Teresa Lynn, Vivien Macketanz, Menel Mahamdi, Jean Maillard, Aibek Makazhanov, Michael Mandl, Christopher Manning, Ruli Manurung, B¨us¸ra Mars¸an, C˘at˘alina M˘ar˘anduc, David Mareˇcek, Katrin Marheinecke, H´ector Mart´ınez Alonso, Lorena Mart´ın-Rodr´ıguez, Andr´e Martins, Jan Maˇsek, Hiroshi Matsuda, Yuji Matsumoto, Alessandro Mazzei, Ryan McDonald, Sarah McGuinness, Gustavo Mendonc¸a, Tatiana Merzhevich, Niko Miekka, Karina Mischenkova, Margarita Misirpashayeva, Anna Missil¨a, C˘at˘alin Mititelu, Maria Mitrofan, Yusuke Miyao, AmirHossein Mojiri Foroushani, Judit Moln´ar, Amirsaeid Moloodi, Simonetta Montemagni, Amir More, Laura Moreno Romero, Giovanni Moretti, Keiko Sophie Mori, Shinsuke Mori, Tomohiko Morioka, Shigeki Moro, Bjartur Mortensen, Bohdan Moskalevskyi, Kadri Muischnek, Robert Munro, Yugo Murawaki, Kaili M¨u¨urisep, Pinkey Nainwani, Mariam Nakhl´e, Ignacio Navarro Juan Hor˜niacek, Anna Nedoluzhko, Gunta Neˇspore- B¯erzkalne, Manuela Nevaci, Lương Nguyễn Thi., Huyền Nguyễn Thi. Minh, Yoshihiro l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 9 6 2 0 4 2 5 8 1 / / t l a c _ a _ 0 0 4 9 6 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Nikaido, Vitaly Nikolaev, Rattima Nitisaroj, Alireza Nourian, Hanna Nurmi, Stina Ojala, Atul Kr. Ojha, Ad´edayo Ol´u`okun, Mai Omura, Emeka Onwuegbuzia, Petya Osenova, Robert ¨Ostling, Lilja Øvrelid, S¸ aziye Bet¨ul ¨Ozates¸, Merve ¨Ozc¸elik, Arzucan ¨Ozg¨ur, Balkız ¨Ozt¨urk Bas¸aran, Hyunji Hayley Park, Niko Partanen, Elena Pascual, Marco Passarotti, Agnieszka Patejuk, Guilherme Paulino-Passos, Angelika Peljak-Łapi´nska, Siyao Peng, Cenel-Augusto Perez, Natalia Perkova, Guy Perrier, Slav Petrov, Daria Petrova, Jason Phelan, Jussi Piitulainen, Tommi A. Pirinen, Emily Pitler, Thierry Poibeau, Barbara Plank, Larisa Ponomareva, Martin Popel, Lauma Pretkalnin, a, Sophie Pr´evost, Prokopis Prokopidis, Adam Przepi´orkowski, Tiina Puolakainen, Sampo Pyysalo, Peng Qi, Andriela R¨a¨abis, Alexandre Rademaker, Mizanur Rahoman, Taraka Rama, Loganathan Ramasamy, Carlos Ramisch, Fam Rashel, Mohammad Sadegh Rasooli, Vinit Ravishankar, Livy Real, Petru Rebeja, Siva Reddy, Mathilde Regnault, Georg Rehm, Ivan Riabov, Michael Rieβler, Erika Rimkut˙e, Larissa Rinaldi, Laura Rituma, Putri Rizqiyah, Luisa Rocha, Eir´ıkur R¨ognvaldsson, Mykhailo Romanenko, Rudolf Rosa, Valentin Ros, ca, Davide Rovati, Olga Rudina, Jack Rueter, Kristj´an R´unarsson, Shoval Sadde, Pegah Safari, Benoˆıt Sagot, Aleksi Sahala, Shadi Saleh, Alessio Salomoni, Tanja Samardˇzi´c, Stephanie Samson, Manuela Sanguinetti, Ezgi Sanıyar, Dage S¨arg, Baiba Saul¯ıte, Yanin Sawanakunanon, Saxena, Kevin Shefali Scannell, Salvatore Scarlata, Nathan Schneider, Sebastian Schuster, Lane Schwartz, Djam´e Seddah, Wolfgang Seeker, Mojgan Seraji, Syeda Shahzadi, Mo Shen, Atsuko Shimada, Hiroyuki Shirasu, Yana Shishkina, Muh Shohibussirri, Dmitry Sichinava, Janine Siewert, Einar Freyr Sigurðsson, Aline Silveira, Natalia Silveira, Maria Simi, Radu Simionescu, Katalin Simk´o, M´aria ˇSimkov´a, Kiril Simov, Maria Skachedubova, Aaron Smith, Isabela Soares- Bastos, Shafi Sourov, Carolyn Spadine, Rachele Sprugnoli, Steinþ´or Steingr´ımsson, Antonio Stella, Milan Straka, Emmett Strickland, Jana Strnadov´a, Alane Suhr, Yogi Lesmana Sulestio, Umut Sulubacak, Shingo Suzuki, Zsolt Sz´ant`o, Chihiro Taguchi, Dima Taji, Yuta Takahashi, Fabio Tamburini, Mary Ann C. Tan, Takaaki Tanaka, Dipta Tanaya, Samson Tella, Isabelle Tellier, Marinella Testori, Guillaume Thomas, Liisi Torga, Marsida Toska, Trond Trosterud, Anna Trukhina, Reut Tsarfaty, Utku T¨urk, Francis Tyers, Sumire Uematsu, Roman Untilov, Zdeˇnka Ureˇsov´a, Larraitz Uria, Hans Uszkoreit, Andrius Utka, Sowmya Vajjala, Rob van der Goot, Martine Vanhove, Daniel van Niekerk, Gertjan van Noord, Viktor Varga, Eric Villemonte de la Clergerie, Veronika Vincze, Natalia Vlasova, Aya Wakasa, Joel C. Wallenberg, Lars Wallin, Abigail Walsh, Jing Xian Wang, Jonathan North Washington, Maximilan Wendt, Paul Widmer, Sri Hartati Wijono, Seyi Williams, Mats Wir´en, Christian Wittern, Tsegay Woldemariam, Tak-sum Wong, Alina Wr´oblewska, Mary Yako, Kayo Yamashita, Naoki Yamazaki, Chunxiao Yan, Koichi Yasuoka, Marat M. Yavrumyan, Arife Bet¨ul Yenice, Olcay Taner Yıldız, Zhuoran Yu, Arlisa Yuliawati, Zdenˇek ˇZabokrtsk´y, Shorouq Zahra, Amir Zeldes, He Zhou, Hanzhi Zhu, Anna Zhuravleva, and Rayan Ziane. 2021. Univer- sal dependencies 2.9. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics ( ´UFAL), Faculty of Math- ematics and Physics, Charles University. Lidan Zhang and Kwok-Ping Chan. 2009. Depen- dency parsing with energy-based reinforcement learning. In Proceedings of the 11th Interna- tional Conference on Parsing Technologies (IWPT’09), pages 234–237. https://doi .org/10.3115/1697236.1697284 Yue Zhang and Stephen Clark. 2008. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 562–571. https://doi .org/10.3115/1613715.1613784 Yue Zhang and Joakim Nivre. 2012. Analyz- ing the effect of global learning and beam- search on transition-based dependency parsing. In Proceedings of COLING 2012: Posters, pages 1391–1400. 903 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 9 6 2 0 4 2 5 8 1 / / t l a c _ a _ 0 0 4 9 6 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3
Download pdf