Coreference Resolution through a seq2seq Transition-Based System
Bernd Bohnet1, Chris Alberti2, Michael Collins2
1Google Research, The Netherlands
2Google Research, United States
{bohnetbd,chrisalberti,mjcollins}@google.com
Abstract
Most recent coreference resolution systems use search algorithms over possible spans to identify mentions and resolve coreference. We instead present a coreference resolution system that uses a text-to-text (seq2seq) paradigm to predict mentions and links jointly. We implement the coreference system as a transition system and use multilingual T5 as an underlying language model. We obtain state-of-the-art accuracy on the CoNLL-2012 datasets with 83.3 F1-score for English (a 2.3 higher F1-score than previous work [Dobrovolskii, 2021]) using only CoNLL data for training, 68.5 F1-score for Arabic (+4.1 higher than previous work), and 74.3 F1-score for Chinese (+5.3). In addition we use the SemEval-2010 data sets for experiments in the zero-shot setting, a few-shot setting, and a supervised setting using all available training data. We obtain substantially higher zero-shot F1-scores for 3 out of 4 languages than previous approaches and significantly exceed previous supervised state-of-the-art results for all five tested languages. We provide the code and models as open source.1
1 Introduction
There has been a great deal of recent research in pretrained language models that employ encoder-decoder or decoder-only architectures (e.g., see GPT-3, GLAM, Lamda [Brown et al., 2020; Du et al., 2021; Thoppilan et al., 2022]), and that can generate text using autoregressive or text-to-text (seq2seq) models (e.g., see T5, mT5 [Raffel et al., 2019; Xue et al., 2021]). These models have led to remarkable results on a number of problems.
Coreference resolution is the task of finding
referring expressions in text that point to the same
entity in the real world. Coreference resolution is
a core task in NLP, relevant to a wide range of
1https://github.com/google-research/google-research/tree/master/coref_mt5.
applications (e.g., see Jurafsky and Martin [2021] Chapter 21 for discussion), but somewhat surprisingly, there has been relatively limited work on coreference resolution using encoder-decoder or decoder-only architectures.
The state-of-the-art models on coreference problems are based on encoder-only models, such as BERT (Devlin et al., 2019) or SpanBERT (Joshi et al., 2020). All recent state-of-the-art coreference models (see Table 2), however, have the disadvantage of a) requiring engineering of a specialized search or structured prediction step for coreference resolution, on top of the encoder's output representations; b) often requiring a pipelined approach with intermediate stages of prediction (e.g., mention detection followed by coreference prediction); and c) an inability to leverage more recent work in pretrained seq2seq models.
This paper describes a text-to-text (seq2seq) ap-
proach to coreference resolution that can directly
leverage modern encoder-decoder or decoder-only
models. The method takes as input a sentence at
a time, together with prior context, encoded as
a string, and makes predictions corresponding to
coreference links. The method has the following
advantages over previous approaches:
• Simplicity: We use greedy seq2seq predic-
tion without a separate mention detection step
and do not employ a higher order decoder to
identify links.
• Accuracy: The accuracy of the method
exceeds the previous state of the art.
• Text-to-text (seq2seq) based: The method
can make direct use of modern generation
models that employ the generation of text
strings as the key primitive.
A key question that we address in our work is
how to frame coreference resolution as a seq2seq
problem. We describe three transition systems,
where the seq2seq model takes a single sentence
as input, and outputs an action corresponding to a set of coreference links involving that sentence as its output. Figure 1 gives an overview of the highest-performing system, ‘‘Link-Append’’, which encodes prior coreference decisions in the input to the seq2seq model and predicts new coreference links (either to existing clusters, or creating a new cluster) as its output. We provide the code and models as open source.1 Section 4 describes ablations considering other systems, such as a ‘‘Link-only’’ system (which does not encode previous coreference decisions in the input) and a mention-based system (Mention-Link-Append), which has a separate mention detection step, in some sense mirroring prior work (see Section 5). We describe results on the CoNLL-2012 data set in Section 4. In addition, Section 5 describes multilingual results in two settings: first, the setting where we fine-tune on each language of interest; second, zero-shot results, where an mT5 model fine-tuned on English alone is applied to languages other than English. The zero-shot experiments show that for most languages, accuracies are higher than recent translation-based approaches and early supervised systems.

Figure 1: Example of one of our transition-based coreference systems, the Link-Append system. The system processes a single sentence at a time, using an input encoding of the prior sentences annotated with coreference clusters, followed by the new sentence. As output, the system makes predictions that link mentions in the new sentence either to previously created coreference clusters (e.g., ‘‘You → [1’’) or, when a new cluster is created, to previous mentions (e.g., ‘‘the apartment → your house’’). The system predicts ‘‘SHIFT’’ when processing of the sentence is complete. Note that in the figure we use the word indices 2 and 17 to distinguish the two occurrences of ‘‘I’’ in the text.
2 Related Work

Most similar to our approach is the work of Webster and Curran (2014), who use a shift-reduce transition-based system for coreference resolution. The transition system uses two data structures, a queue initialized with all mentions and a list. The SHIFT transition moves a mention from the queue to the top of the list. The REDUCE transition merges the top mentions with selected clusters. Webster and Curran (2014) consider the approach to better reflect human cognitive processing, to be simple, and to have small memory requirements. Xia et al. (2020) use this transition-based system together with a neural approach for mention identification and transition prediction; this neural model (Xia et al., 2020) gives higher accuracy scores (see Table 2) than Webster and Curran (2014).
Lee et al. (2017) focus on predicting mentions
and spans using an end-to-end neural model based
on LSTMs (Hochreiter and Schmidhuber, 1997),
while Lee et al. (2018) extend this to a differ-
entiable higher-order model considering directed
paths in the antecedent tree.
Another important method to gain higher accuracy is to use stronger pretrained language models, which we follow in this paper as well. A number of recent coreference resolution systems kept the essential architecture fixed while replacing the pretrained models with increasingly stronger ones. Lee et al. (2018) used ELMo (Peters et al., 2018), including feature tuning, and showed an impressive improvement of 5.1 F1 on the English CoNLL-2012 test set over the baseline score of Lee et al. (2017). The extension from an end-to-end model to differentiable higher-order inference provides an additional 0.7 F1-score on
the test set, which leads to a final F1-score of 73.0 for this approach. Joshi et al. (2019) use the same inference model and explore how to best use BERT, gaining another significant improvement of 3.9 points absolute and reaching a score of 76.9 F1 on the test set (see Table 2). Finally, Joshi et al. (2020) use SpanBERT, which leads to an even higher accuracy score of 79.6. SpanBERT performs well for coreference resolution due to its span-based pretraining objective.
Dobrovolskii (2021) considers coreference links between words instead of spans, which reduces the complexity of the coreference model to O(n²), and uses RoBERTa as the language model, which provides better results than SpanBERT for many tasks.
Similarly, Kirstain et al. (2021) reduce the high memory footprint of mention detection by using the start- and end-points of mention spans to identify mentions with a bilinear scoring function. The top λn scored mentions are used to restrict the search space for coreference prediction, again using a bilinear function for scoring. The algorithm has quadratic complexity since each possible coreference pair has to be scored.
Wu et al. (2020) cast coreference resolution as question answering and report gains of 1 F1-score on the development set originating from pretraining on Quoref and SQuAD 2.0. The approach first predicts mentions with a recall-oriented objective, then creates queries for these potential mentions for the cluster prediction. This procedure requires applying the model multiple times per document, once for each mention candidate, which leads to high execution time.
Our work makes direct use of T5-based models (Raffel et al., 2019). T5 adopts the idea of treating tasks in Natural Language Processing uniformly as ‘‘text-to-text’’ problems, which means taking only text as input and generating text as output. This idea simplifies and unifies the approach for a large number of tasks by applying the same model, objective, training procedure, and decoding process.
3 Three seq2seq Transition Systems

3.1 The Link-Append System

The Link-Append system processes the document a single sentence at a time. At each point the input to the seq2seq model is a text string that encodes the first i sentences together with coreference clusters that have been built up over the first (i − 1) sentences. As an example, the input for i = 3 for the example in Figure 1 is the following:

Input: Speaker-A [1 I ] still have n't gone to that fresh French restaurant by your house # Speaker-A [1 I ] 'm like dying to go there | # Speaker-B You mean the one right next to the apartment **
Here the # symbol is used to delimit sentences,
and the start of the focus sentence is marked using
the pipe-symbol | and the end of a sentence with
two asterisk symbols **.
We have three sentences (i = 3). There is a single coreference cluster in the first i − 1 = 2 sentences, marked using the [1 . . . ] bracketings.
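As an illustration of this encoding, the following is a minimal sketch under our reading of the format described above, not the authors' released implementation; all helper and parameter names are ours. Mentions of cluster k are assumed to be stored as (sentence, start, end) spans.

```python
# Sketch of the Link-Append input encoding: prior sentences carry
# "[k ... ]" cluster brackets, "#" delimits sentences, "|" marks the
# focus sentence, and "**" ends it. All names are illustrative.

def bracket(tokens, j, clusters):
    """Insert "[k" and "]" around each cluster-k mention in sentence j."""
    opens, closes = {}, {}
    for k, spans in clusters.items():
        for (sent, a, b) in spans:
            if sent == j:
                opens.setdefault(a, []).append(k)
                closes[b] = closes.get(b, 0) + 1
    out = []
    for pos, tok in enumerate(tokens):
        out += [f"[{k}" for k in opens.get(pos, [])]
        out.append(tok)
        out += ["]"] * closes.get(pos, 0)
    return out

def encode_input(sentences, clusters, i):
    """Encode sentences 1..i (0-indexed lists), with sentence i in focus."""
    prior = [" ".join(bracket(sentences[j], j, clusters))
             for j in range(i - 1)]
    focus = " ".join(sentences[i - 1])
    return " # ".join(prior) + " | # " + focus + " **"
```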
The output from the seq2seq model is also a
text string. The text string encodes a sequence of
0 or more actions, terminated by the SHIFT token.
Each action links some mention (a span) in the ith sentence to some mention in the previous context (often in the first i − 1 sentences, but sometimes also in the ith sentence). An example prediction given the above input is the following:
Prediction: You → [1 ; the apartment → your house ; the one right next to the apartment → that fresh French restaurant by your house ; SHIFT
More precisely, the first action would actually be ‘‘You ## mean the one → [1’’, where the substring ‘‘mean the one’’ is the 3-gram in the original text immediately after the mention ‘‘You’’. The 3-gram helps to disambiguate the mention fully, in the case where the same string might appear multiple times in the sentence of interest. For brevity we omit these 3-grams in the following discussion, but they are used throughout the model's output to specify mentions.3

3Note that no explicit constraints are placed on the model's output, so there is the potential for the model to generate mention references that do not correspond to substrings within the input; however, this happens very rarely in practice, see Section 6.2 for discussion. There is also the potential for the 3-gram to be insufficient context to disambiguate the exact location of a mention; again, this happens rarely, see Section 6.2.
In this case there are three actions, separated by
the ‘‘;’’ symbol, followed by the terminating SHIFT
action. The first action is
You → [1
This is an append action: specifically, it appends
the mention ‘‘You’’ in the third sentence to the
existing coreference cluster labeled [1 . . . ]. Le
second action is
the apartment → your house
This is a link action. It links the mention ‘‘the
apartment’’ in the third sentence to ‘‘your house’’
in the previous context. Similarly the third action,

the one right next to the apartment → that fresh French restaurant by your house

is also a link action, in this case linking the mention ‘‘the one right next to the apartment’’ to a previous mention in the discourse.
The sequence of actions is terminated by the
SHIFT symbol. At this point the ith sentence has
been processed, and the model moves to the next
step where the (i+1)th sentence will be processed.
Assuming the next sentence is ‘‘Speaker-B yeah yeah yeah’’, the input at the (i + 1)th step will be

Input: Speaker-A [1 I ] still have n't gone to [3 that fresh French restaurant by [2 your house ] ] # Speaker-A [1 I ] 'm like dying to go there # Speaker-B [1 You ] mean [3 the one right next to [2 the apartment ] ] | # Speaker-B yeah yeah yeah
Note that the three actions in the previous predic-
tion have been reflected in the new input, lequel
now includes three coreference clusters, labeled
[1 . . . ], [2 . . . ] et [3 . . . ].
In summary, the method processes a sentence at a time, and uses append and link actions to build up links between mentions in the current sentence under focus and previous mentions in the discourse.
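Putting these pieces together, the overall inference loop can be sketched as follows. This is our reading of the description above, not the released code; `model.generate` stands in for the fine-tuned mT5 model, `encode_input` is the sketch from Section 3.1, and `parse_actions`/`apply_action` are assumed helpers that split the output string and update the clustering.

```python
# Sketch of Link-Append inference over a whole document. The model output
# is a string like "You ## mean the one -> [1 ; the apartment -> your
# house ; SHIFT".

def resolve_document(sentences, model, parse_actions, apply_action):
    clusters = {}  # cluster id -> set of mention spans, built incrementally
    for i in range(1, len(sentences) + 1):
        inp = encode_input(sentences, clusters, i)
        out = model.generate(inp)
        for action in parse_actions(out):      # link or append actions
            if action == "SHIFT":
                break                           # sentence i is done
            clusters = apply_action(clusters, action, i)
    return clusters
```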
A critical question is how to map training-data examples (which contain coreference clusters for entire documents) to sequences of actions for each sentence. Clearly there is some redundancy in the system, in that in many cases either link or append actions could be used to build up the same set of coreference clusters. We use the following method for creating training examples:
• Process mentions in the order in which they appear in the sentence. Specifically, mentions are processed in order of their end-point (earlier end-points are earlier in the ordering). Ties are broken by their start-point (later start-points are earlier in the ordering). It can be seen that the order in the previous example, You, the apartment, the one right next to the apartment, follows this procedure.
• For each mention, if there is another mention in the same coreference cluster earlier in the document, either:

1. Create an append action if there are at least two members of the cluster in the previous i − 1 sentences.

2. Otherwise create a link action to the most recent member of the coreference cluster (this may be either in the first i − 1 sentences, or in the ith sentence).

The basic idea then is to always use append actions where possible, but to use link actions where a suitable append action is not available, as in the sketch below.
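The following code renders these two rules compactly. It is a sketch under our reading of the rules; mention spans are (sentence, start, end) triples, and `gold_cluster(m)` is a hypothetical helper returning the set of gold mentions coreferent with m.

```python
# Sketch of the oracle that converts gold clusters into Link-Append
# actions for sentence i.

def oracle_actions(mentions_in_sentence, i, gold_cluster):
    # Order by end-point; ties broken so later start-points come first.
    ordered = sorted(mentions_in_sentence, key=lambda m: (m[2], -m[1]))
    actions = []
    for m in ordered:
        earlier = [p for p in gold_cluster(m)
                   if (p[0], p[2], p[1]) < (m[0], m[2], m[1])]
        if not earlier:
            continue  # first mention of its cluster: no action in Link-Append
        in_prior = [p for p in earlier if p[0] < i]  # in sentences 1..i-1
        if len(in_prior) >= 2:
            # append m to the existing cluster containing those mentions
            actions.append(("APPEND", m))
        else:
            # link m to the most recent earlier member of the cluster
            actions.append(("LINK", m,
                            max(earlier, key=lambda p: (p[0], p[1]))))
    return actions + [("SHIFT",)]
```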
3.2 The Link-only System

The Link-only system is a simple variant of the Link-Append system. There are two changes: First, the only actions in the Link-only system are link and SHIFT, as described in the previous section. Second, when encoding the input in the Link-only system, the first i sentences are again taken with the # separator, but no information about coreference clusters over the first i − 1 sentences is included.

The Link-only system can therefore be viewed as a simplification of the Link-Append system. We will compare the two systems in experiments, in general seeing that the Link-Append system provides significant improvements in performance.
3.3 The Mention-Link-Append System

The Mention-Link-Append system is a modification of the Link-Append system, which includes an additional class of actions, the mention actions. A mention action selects a single sub-string from the sentence under focus, and creates a singleton coreference cluster. The algorithm that creates training examples is modified to have an additional step for the creation of mention actions, as follows:

• Process mentions in the order in which they appear in the sentence.
• For each mention, if it is the first mention in
a coreference structure, introduce a mention
action for that mention.
• For each mention, if there is another mention in the same coreference cluster earlier in the document, either:

1. Create an append action if there are at least two members of the cluster in the previous i − 1 sentences.

2. Otherwise create a link action to the most recent member of the coreference cluster (this may be either in the first i − 1 sentences, or in the ith sentence).
Note that the Mention-Link-Append system can create singleton coreference structures, unlike the Link-Append or Link-only systems. This is its primary motivation.
3.4 A Formal Description

We now give a formal definition of the three systems. This section can be safely skipped on a first reading of the paper.
3.4.1 Initial Definitions and Problem Statement

We introduce some key initial definitions—of documents, potential mentions, and clusterings—before giving a problem statement:
Definition 1 (Documents). A document is a pair (w1 . . . wn, s1 . . . sm), where wi is the ith word in the document, and s1, s2, . . . , sm is a sequence of integers specifying a segmentation of w1 . . . wn into m sentences. Each si is the endpoint of sentence i in the document. Thus 1 ≤ s1 < s2 < · · · < sm, and sm = n. The ith sentence spans words (si−1 + 1) . . . si inclusive (where for convenience we define s0 = 0).
Definition 2 (Potential Mentions). Assume an input document (w1 . . . wn, s1 . . . sm). For each i ∈ 1 . . . m we define Mi to be the set of potential mentions in the ith sentence; specifically,

Mi = {(a, b) : si−1 < a ≤ b ≤ si}

Hence each member of Mi is a pair (a, b) specifying a subspan of the ith sentence. We define M = M1 ∪ · · · ∪ Mm and M≤i = M1 ∪ · · · ∪ Mi; hence M is the set of all potential mentions in the document, and M≤i is the set of potential mentions in sentences 1 . . . i.
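As an illustration, these sets can be enumerated directly from the sentence end-points; the following is a small sketch with names of our choosing.

```python
# Enumerating the potential-mention sets of Definition 2.
# s = [s_1, ..., s_m] are the sentence end-points, with s_0 = 0 implicit.

def potential_mentions(s, i):
    """M_i = {(a, b) : s_{i-1} < a <= b <= s_i}."""
    lo = s[i - 2] if i >= 2 else 0
    hi = s[i - 1]
    return {(a, b) for a in range(lo + 1, hi + 1)
                   for b in range(a, hi + 1)}

def mentions_up_to(s, i):
    """M_{<=i}: the union of M_1 ... M_i."""
    return set().union(*(potential_mentions(s, j) for j in range(1, i + 1)))
```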
Definition 3 (Clusterings). A clustering K is a sequence of sets K1, K2, . . . , K|K|, where each Ki ⊆ M, and for any i, j such that i ≠ j, we have Ki ∩ Kj = ∅. We in addition assume that for all i, |Ki| ≥ 2 (although see Section 3.5 for discussion of the case where |Ki| ≥ 1). We define K to be the set of all possible clusterings.
Definition 4 (Problem Statement). The coreference problem is to take a document x as input, and to predict a clustering K as the output. We assume a training set of N examples, {(x(i), K(i))} for i = 1 . . . N, consisting of documents paired with clusterings.
3.5 The Three Transition Systems

The transition systems considered in this paper take a document x as input, and produce a coreference clustering K as the output. We assume a definition of transition systems that is closely related to work on deterministic dependency parsing (Nivre, 2003, 2008), and which is very similar to the conventional definition of deterministic finite-state machines. Specifically, a transition system consists of: 1) a set of states C; 2) an initial state c0 ∈ C; 3) a set of actions A; 4) a transition function δ : C × A → C; and 5) a set of final states F ⊆ C. The transition function will usually be a partial function: that is, for a particular state c, there will be some actions a such that δ(c, a) is undefined. For convenience, for any state c we define A(c) ⊆ A to be the set of actions such that for all a ∈ A(c), δ(c, a) is defined.

A path is then a sequence c0, a0, c1, a1, . . . , cN where for i = 0 . . . N − 1, ci+1 = δ(ci, ai), and where cN ∈ F.
All transition systems in this paper use the following definition of states:

Definition 5 (States). A state is a pair (i, K) such that 1 ≤ i ≤ (m + 1) and K ∈ K is a clustering such that for k ∈ 1 . . . |K| and j ∈ (i + 1) . . . m, Kk ∩ Mj = ∅ (i.e., K is a clustering over the mentions in the first i sentences). In addition we define the following:

• C is the set of all possible states.

• c0 = (1, ε) is the initial state, where ε is the empty sequence.
• F = {(i, K) : (i, K) ∈ C, i = (m + 1)} is the set of final states.

Intuitively, the state (i, K) keeps track of which sentence is being worked on, through the index i, and also keeps track of a clustering of the partial mentions up to and including sentence i.

We now describe the actions used by the various transition systems. The actions will either augment the clustering K, or increment the index i. The actions fall into four classes—link actions, append actions, mention actions, and the shift action—defined as follows:

Link Actions. Given a state (i, K), we define the set of possible link actions as

L(i, K) = {m → m′ : m ∈ Mi, m′ ∈ M≤i}

A link action (m → m′) augments K by adding a link between mentions m and m′. We define K ⊕ (m → m′) to be the result of adding link m → m′ to clustering K.4 We can then define the transition function associated with a link action:

δ((i, K), m → m′) = (i, K ⊕ (m → m′))

Append Actions. Given a state (i, K), we define the set of possible append actions as

App(i, K) = {m → k : m ∈ Mi, k ∈ {1 . . . |K|}}

An append action (m → k) augments K by adding mention m to the cluster Kk within the sequence K. We define K ⊕ (m → k) to be the result of this action (thereby overloading the ⊕ operator); the transition function associated with an append action is then

δ((i, K), m → k) = (i, K ⊕ (m → k))

Mention Actions. Given a state (i, K), we define the set of possible mention actions as

Mention(i, K) = {Add(m) : m ∈ Mi}

A mention action Add(m) augments K by creating a new singleton cluster containing m alone, assuming that m does not currently appear in K; otherwise it leaves K unchanged. We define K ⊕ Add(m) to be the result of this action, and δ((i, K), Add(m)) = (i, K ⊕ Add(m)).

The SHIFT Action. The final action in the system is the SHIFT action. This can be applied in any state, and simply advances the index i, leaving the clustering K unchanged:

δ((i, K), SHIFT) = ((i + 1), K)

4Specifically, the addition of the link m → m′ can either: 1) create a new cluster within K, if neither m nor m′ is in an existing cluster within K; 2) add m to an existing cluster within K, if m′ is already in some cluster in K, and m is not in an existing cluster; 3) add m′ to an existing cluster within K, if m is already in some cluster in K, and m′ is not in an existing cluster; 4) merge two clusters, if m and m′ are both in clusters within K, and the two clusters are different; 5) leave K unchanged, if m and m′ are both within the same existing cluster within K. In practice cases (2), (3), (4), and (5) are never seen in oracle sequences of actions, but for completeness we include them.
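A short sketch of the five cases, representing a clustering as a list of disjoint mention sets (illustrative code, not the paper's implementation):

```python
# K ⊕ (m -> m2): the five cases from footnote 4.

def add_link(K, m, m2):
    cm  = next((c for c in K if m  in c), None)
    cm2 = next((c for c in K if m2 in c), None)
    if cm is None and cm2 is None:      # 1) neither clustered: new cluster
        return K + [{m, m2}]
    if cm is None:                      # 2) add m to m2's cluster
        cm2.add(m)
    elif cm2 is None:                   # 3) add m2 to m's cluster
        cm.add(m2)
    elif cm is not cm2:                 # 4) merge the two distinct clusters
        cm |= cm2
        K = [c for c in K if c is not cm2]
    # 5) otherwise both are already in the same cluster: K is unchanged
    return K
```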
We are now in a position to define the transition systems:

Definition 6 (The Three Transition Systems). The link-append transition system is defined as follows:

• C, c0, and F are as defined in Definition 5.

• For any state (i, K), the set of possible actions is A(i, K) = L(i, K) ∪ App(i, K) ∪ {SHIFT}. The full set of actions is A = ∪(i,K)∈C A(i, K).

• The transition function δ is as defined above.

The Link-only system is identical to the above, but with A(i, K) = L(i, K) ∪ {SHIFT}. The Mention-Link-Append system is identical to the above, but with A(i, K) = L(i, K) ∪ App(i, K) ∪ Mention(i, K) ∪ {SHIFT}.
All that remains to define the seq2seq method for each transition system is to: a) define an encoding of the state (i, K) as a string input to the seq2seq model; b) define an encoding of each type of action, and of a sequence of actions corresponding to a single sentence; and c) define a mapping from a training example consisting of an (x, K) pair to a sequence of input-output texts corresponding to training examples.
4 Experimental Setup

We train an mT5 model to predict a target text from an input text. We use the provided training, development, and test splits as described in Section 4.1.
Language | Training docs / tokens | Development docs / tokens | Test docs / tokens

OntoNotes / CoNLL-2012 datasets:
English | 1940 / 1.3M | 343 / 160k | 348 / 170k
Chinese | 1729 / 750k | 254 / 110k | 218 / 90k
Arabic  | 359 / 300k  | 44 / 30k   | 44 / 30k

SemEval-2010 data:
Catalan | 829 / 253k | 142 / 42k | 167 / 49k
Dutch   | 145 / 46k  | 23 / 9k   | 72 / 48k
German  | 900 / 331k | 199 / 73k | 136 / 50k
Italian | 80 / 81k   | 18 / 16k  | 46 / 41k
Spanish | 875 / 284k | 140 / 44k | 168 / 51k

Table 1: Sizes of the SemEval-2010 Shared Task data sets and OntoNotes (CoNLL-2012).
For the preparation of the input text, we follow previous work and include the speaker in the input text before each sentence (Wu et al., 2020), as well as the text genre at the document start if this information is available in the corpus. As described in Section 3, we apply the corresponding transitions as an oracle to obtain the input and target texts. We shorten the text at the front if the input text is larger than the sentence-piece token input size of the language model, and we add further context beyond sentence i when the input space is not filled up (note that, as described in Section 3, we use the pipe symbol | to mark the start of the focus sentence and two asterisk symbols ** to mark its end). A sketch of this fitting procedure is shown below.
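The following sketch makes the fitting step concrete; the names are illustrative, and `n_tokens` stands in for the sentence-piece tokenizer's length function.

```python
# Fit the encoded input into the model's token budget: truncate at the
# front if too long, otherwise extend the context beyond sentence i.

def fit_input(prior_sents, focus_sent, following_sents, budget, n_tokens):
    parts = list(prior_sents) + [focus_sent]
    # Drop the oldest context sentences while the input is too long.
    while len(parts) > 1 and n_tokens(parts) > budget:
        parts.pop(0)
    # Fill remaining space with sentences after the focus sentence.
    for s in following_sents:
        if n_tokens(parts + [s]) > budget:
            break
        parts.append(s)
    return parts
```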
4.1 Data

We use the English coreference resolution dataset from the CoNLL-2012 Shared Task (Pradhan et al., 2012) and the SemEval-2010 Shared Task datasets (Recasens et al., 2010) for multilingual coreference resolution experiments. The SemEval-2010 datasets include six languages and are therefore a good test bed for multilingual coreference resolution. We excluded English as the data overlaps with our training data.

The statistics on the dataset sizes are summarized in Table 1. The table shows that the English CoNLL-2012 Shared Task data is substantially larger than any of the other data sets.
4.2 Experiments

Setup for English. For our experiments, we use mT5 and initialize our model with either the xl or xxl checkpoints.5 For fine-tuning, we use the hyperparameters suggested by Raffel et al. (2019): a batch size of 128 sequences and a constant learning rate of 0.001. We use micro-batches of 8 to reduce the memory requirements. We save checkpoints every 2k steps. From these models, we select the model with the best development results. We train for 100k steps. We use inputs with 2048 sentence-piece tokens and 384 output tokens for training. All our models have been tested with 3k sentence-piece tokens input length if not stated otherwise. The training of the xxl model takes about 2 days on 128 TPUs-v4. On the development set, inference takes about 30 minutes on 8 TPUs.
Setup for Other Languages. We used the English model from this work to continue training with the above settings on languages other than English (Arabic, Chinese, and the SemEval-2010 datasets). For few-shot learning, we use the first 10 documents for each language and we train only for 200 steps, since the evaluation then shows a 100% fit to the training set.
In recent publications, the experimental evaluation for SemEval-2010 changed to reporting F1-scores as an average of MUC, B3, and CEAFΦ4, following the CoNLL-2012 evaluation schema (Roesiger and Kuhn, 2016; Schröder et al., 2021; Xue et al., 2021). We follow this schema in this paper as well. Another important difference between the SemEval-2010 and the CoNLL-2012 datasets is the annotation of singletons (mentions without antecedents) in the SemEval datasets. Most recent systems predict only coreference chains. This has also led to different evaluation methods for the SemEval-2010 datasets. The first method keeps the singletons for evaluation purposes (e.g., Xia and Durme, 2021) and the second excludes the singletons from the evaluation set (e.g., Roesiger and Kuhn, 2016; Schröder et al., 2021; Bitew et al., 2021). The exclusion of singletons seems better suited to comparing recent systems but makes direct comparison with previous work difficult. We report numbers for both setups.
In Section 5, we present our work on mul-
tilingual coreference resolution and Section 6
discusses the results for all languages.
5https://github.com/google-research/multilingual-t5.
English:
System | LM | Decoder | MUC (P/R/F1) | B3 (P/R/F1) | CEAFΦ4 (P/R/F1) | Avg. F1
Lee et al. (2017) | – | neural e2e | 78.4/73.4/75.8 | 68.6/61.8/65.0 | 62.7/59.0/60.8 | 67.2
Lee et al. (2018) | Elmo | c2f | 81.4/79.5/80.4 | 72.2/69.5/70.8 | 68.2/67.1/67.6 | 73.0
Joshi et al. (2019) | BERT | c2f | 84.7/82.4/83.5 | 76.5/74.0/75.3 | 74.1/69.8/71.9 | 76.9
Yu et al. (2020) | BERT | Ranking | 82.7/83.3/83.0 | 73.8/75.6/74.7 | 72.2/71.0/71.6 | 76.4
Joshi et al. (2020) | SpanBERT | c2f | 85.8/84.8/85.3 | 78.3/77.9/78.1 | 76.4/74.2/75.3 | 79.6
Xia et al. (2020) | SpanBERT | transitions | 85.7/84.8/85.3 | 78.1/77.5/77.8 | 76.3/74.1/75.2 | 79.4
Wu et al. (2020) | SpanBERT | QA | 88.6/87.4/88.0 | 82.4/82.0/82.2 | 79.9/78.3/79.1 | 83.1*
Xu and Choi (2020) | SpanBERT | hoi | 85.9/85.5/85.7 | 79.0/78.9/79.0 | 76.7/75.2/75.9 | 80.2
Kirstain et al. (2021) | LongFormer | bilinear | 86.5/85.1/85.8 | 80.3/77.9/79.1 | 76.8/75.4/76.1 | 80.3
Dobrovolskii (2021) | RoBERTa | c2f | 84.9/87.9/86.3 | 77.4/82.6/79.9 | 76.1/77.1/76.6 | 81.0
Link-Append | mT5 | transition | 87.4/88.3/87.8 | 81.8/83.4/82.6 | 79.1/79.9/79.5 | 83.3

Arabic:
Aloraini et al. (2020) | AraBERT | c2f | 63.2/70.9/66.8 | 57.1/66.3/61.3 | 61.6/65.5/63.5 | 63.9
Min (2021) | GigaBERT | c2f | 73.6/61.8/67.2 | 70.7/55.9/62.5 | 66.1/62.0/64.0 | 64.6
Link-Append | mT5 | transition | 71.0/70.9/70.9 | 66.5/66.7/66.6 | 68.3/68.6/68.4 | 68.7

Chinese:
Xia and Durme (2021) | XLM-R | transition | – | – | – | 69.0
Link-Append | mT5 | transition | 81.5/76.8/79.1 | 76.1/69.9/72.9 | 74.1/67.9/70.9 | 74.3

Table 2: English, Arabic, and Chinese test set results and comparison with previous work on the CoNLL-2012 Shared Task test data set. The average F1-score of MUC, B3, and CEAFΦ4 is the main evaluation criterion. *Wu et al. (2020) use additional training data.
5 Multilingual Coreference Resolution Results

5.1 Zero-Shot and Few-Shot

Since mT5 is pretrained on 100+ languages (Xue et al., 2021), we evaluate zero-shot transfer ability from English to other languages. We apply our system trained on the English CoNLL-2012 Shared Task dataset to the non-English SemEval-2010 test sets. Table 3 shows evaluation scores for our transition-based systems and reference systems. In our results overview (Table 3), the Sing. columns report whether singletons are included (Y) or excluded (N): the P-column for prediction and the E-column for evaluation. We use the same training setting as the reference systems (Kobdani and Schütze, 2010; Roesiger and Kuhn, 2016; Schröder et al., 2021). In the zero-shot experiments, the transition-based systems are trained only on the English CoNLL-2012 datasets and applied without modification to the multilingual SemEval-2010 test sets.
Bitew et al. (2021) use machine translation for coreference prediction on the SemEval-2010 datasets. The authors found they obtained the best accuracy when they first translated the test sets to English, then predicted the English coreferences with the system of Joshi et al. (2020), and finally projected the predictions back. They apply this method to four of the six languages of the SemEval-2010 datasets. We include their results in Table 3 as a comparison to our zero-shot results. The two methods are directly comparable as they do not use the target-language annotations for training. Our zero-shot F1-scores are substantially higher than those of the machine translation approach for Dutch, Italian, and Spanish, and a bit lower for Catalan; cf. Table 3.
Xia and Durme (2021) explored few-shot learning for a large number of settings using the continued-training approach. We use the same approach with a single setting that uses the first 10 documents for each language. For details about the experimental setup see Section 4.2. Table 3 presents the results for the Link-Append system. This shows that high accuracy can be reached with just a few additional training documents. This could be useful either to adapt to a specific coreference annotation schema or to a specific language (see examples in Figures 2 and 3).
5.2 Supervised

We also carried out experiments in a fully supervised setup in which we use all available training data of the SemEval-2010 Shared Task. We adopted the method of continued training of Xia and Durme (2021). In our experiments, we start from our finetuned English model and continue training on the SemEval-2010 datasets and the Arabic OntoNotes dataset; for the latter we use the data and splits of the CoNLL-2012 Shared Task.
System | Sing. (P/E) | # training docs / method | Avg. F1

Catalan:
Attardi et al. (2010) | Y/Y | all | 48.2
Mention-Link-Append | Y/Y | all | 83.5
Xia and Durme (2021) | N/Y | all | 51.0
Mention-Link-Append | N/Y | all | 59.2
Bitew et al. (2021) | N/N | ∅/Translation | 48.0
Link-Append | N/N | ∅/Zero-shot | 47.7
Link-Append | N/N | 10/Few-shot | 68.9

Dutch:
Kobdani and Schütze (2010) | Y/Y | all | 19.1
Mention-Link-Append | Y/Y | all | 66.6
Xia and Durme (2021) | N/Y | all | 55.4
Mention-Link-Append | N/Y | all | 59.9
Bitew et al. (2021) | N/N | ∅/Translation | 37.5
Link-Append | N/N | ∅/Zero-shot | 57.6
Link-Append | N/N | 10/Few-shot | 65.7

German:
Kobdani and Schütze (2010) | Y/Y | all | 59.8
Mention-Link-Append | Y/Y | all | 86.4
Roesiger and Kuhn (2016) | N/N | all | 48.6
Schröder et al. (2021) | N/N | all | 74.5
Link-Append | N/N | all | 77.8
Link-Append | N/N | ∅/Zero-shot | 55.0
Link-Append | N/N | 10/Few-shot | 69.8

Italian:
Kobdani and Schütze (2010) | Y/Y | all | 60.7
Mention-Link-Append | Y/Y | all | 65.9
Mention-Link-Append | N/N | all | 59.4
Bitew et al. (2021) | N/N | ∅/Translation | 36.2
Link-Append | N/N | ∅/Zero-shot | 39.4
Link-Append | N/N | 10/Few-shot | 61.2

Spanish:
Attardi et al. (2010) | Y/Y | all | 49.0
Mention-Link-Append | Y/Y | all | 83.9
Xia and Durme (2021) | N/Y | all | 51.3
Mention-Link-Append | N/Y | all | 59.3
Link-Append | N/N | all | 83.1
Bitew et al. (2021) | N/N | ∅/Translation | 46.1
Link-Append | N/N | ∅/Zero-shot | 49.4
Link-Append | N/N | 10/Few-shot | 72.5

Table 3: Test set results for the SemEval-2010 datasets. The Sing. columns show whether the singletons are included (Y) or removed (N) in the Prediction (P) and the Evaluation (E) set. The last column shows the average F1-score of MUC, B3, and CEAFΦ4.
To verify the finding of Xia and Durme (2021), we compared the results when we continue training from a finetuned model and from the initial mT5 model. We conducted this exploratory experiment using 1k training steps for the German dataset. The results are in favor of continued training from an already fine-tuned model, with a score of 84.5 F1 vs. 81.0 F1 for the fresh mT5 model. This model also achieves 77.3 F1 when evaluated without singletons (cf. Table 3), surpassing the previous SotA of 74.5 F1 (Schröder et al., 2021). We did not explore training longer from a fresh mT5 model to reach potentially better performance, due to the computational cost. We adopted this approach for all datasets of the SemEval-2010 Shared Task, as it provides competitive coreference models with low training cost.
Table 3 includes the accuracy scores for the mention-based transition system (Mention-Link-Append), which reaches SotA for all languages when the prediction and evaluation include the singletons (P=Y, E=Y). In order to compare the results with Xia and Durme (2021), we removed the singletons from the predictions of the Mention-Link-Append system but still include them in the evaluation (P=N, E=Y); a sketch of this filtering is shown below.
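The filtering itself is simple; a minimal, illustrative sketch of the P=N setting:

```python
# P=N: remove singleton clusters from the system predictions before
# scoring; E=N would apply the same filter to the gold clusters.

def drop_singletons(clusters):
    return [c for c in clusters if len(c) >= 2]

# Example: [{"m1", "m2"}, {"m3"}] -> [{"m1", "m2"}]
```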
Table 2 compares the results of our model for Arabic and Chinese with recent work. The Link-Append system is 4.1 points better than Min (2021) and 5.3 points better than Xia and Durme (2021), which represent the previous SotA for Arabic and Chinese, respectively.
6 Discussion

In this section, we analyze performance factors with an ablation study, analyze errors, and reflect on design choices that are important for the model's performance. Table 2 shows the results for our systems on the English, Arabic, and Chinese CoNLL-2012 Shared Task data and compares them with previous work.
6.1 Ablation Study
Our best-performing transition system is Link-Append, which predicts links and clusters iteratively for the sentences of a document without predicting mentions beforehand. Table 4 shows an ablation study.
System | Ablation | F1
Link-Append | 100k steps / 3k pieces | 83.2
Link-Append | 2k sentence pieces | 83.1
Link-Append | 50k steps | 82.9
Link-Append | no context beyond i | 82.8
Link-Append | xxl-T5.1.1 | 82.7
Link-Append | xl-mT5 | 78.0
Mention-Link-Append | 3k pieces | 82.6
Mention-Link-Append | 2k pieces | 82.2
Link-only | link transitions only | 81.4

Table 4: Development set results for an ablation study using the English CoNLL-2012 data sets, reporting Avg. F1-scores. The models have been trained with 100k training steps and tested with 2k sentence pieces, filling up remaining space in the input beyond the focus sentence i with further sentences of the document as context. In inference mode, the model uses an input length of 3k sentence pieces if not stated otherwise.
The results at the top of the table show the development set results for the Link-Append system when, with each SHIFT, the already identified coreference clusters are annotated in the input. This information is then available in the next step and the clusters can be extended by the APPEND transition.

The models are trained with an input size of 2048 tokens using mT5. We use a larger input size of 3000 (3k) tokens for decoding to accommodate long documents and very long distances between coreferent mentions. When we use 2k sentence pieces, the accuracy is 83.1 instead of 83.2 averaged F1-score on the development set using the model trained for 100k steps.

At the bottom of Table 4, the performance of a system is shown that does not annotate the identified clusters in the input. In this system the APPEND transition cannot be applied and hence only the LINK and SHIFT transitions are used. The accuracy of this system is substantially lower, by 1.8 F1-score.
We observe drops in accuracy when we do not use context beyond sentence i or when we train for only 50k steps. We observe a 0.5 lower F1-score when we use xxl-T5.1.16 instead of the xxl-mT5 model. An analysis shows that the English OntoNotes corpus contains some non-English text, speaker names, and special symbols. For instance, there are Arabic names that are mapped to OOV, but also the curly brackets {}. There are also cases where T5 translated non-English words to English (e.g., German 'nicht' to 'not').

6The xxl-T5.1.1 model refers to a model provided by Xue et al. (2021) trained for 1.1 million steps on English data.
Subset | #Docs | JS-L | CM | LA
1–128 | 57 | 84.6 | 84.5 | 85.8
129–256 | 73 | 83.7 | 83.6 | 85.2
257–512 | 78 | 82.9 | 83.4 | 86.0
513–768 | 71 | 80.1 | 79.3 | 83.2
769–1152 | 52 | 79.1 | 78.6 | 83.3
1153+ | 12 | 71.3 | 69.6 | 74.9
all | 343 | 80.1 | 79.5 | 83.2

Table 5: Average F1-score on the development set for buckets of document length incremented by 128 tokens. The column JS-L shows average F1-scores for the SpanBERT-Large model (Joshi et al., 2020), CM for the Constant Memory model (Xia et al., 2020), and LA for the Link-Append system. The entries for JS-L and CM are taken from the paper of Xia et al. (2020).
With the Mention-Link-Append system, we introduced a system that is capable of introducing mentions, which is useful for data sets that include singleton mentions, such as the SemEval-2010 data sets. This transition system has an 82.6 F1-score on the development set with an input context of 3k sentence pieces, which is 0.6 F1-score lower than the Link-Append transition system. We added examples in the Appendix to illustrate mistakes in a zero-shot setting (Figure 2) and in a supervised English setting (Figures 3 and 4).
6.2 Error Analysis

We observe two problems originating from the sequence-to-sequence models: first, hallucinations (words not found in the input), and second, ambiguous matches of mentions to the input. In order to evaluate the frequency of hallucinations, we counted cases where the predicted mentions and their context could not be matched to a word sequence in the input. We found only 11 such cases (0.07%) in all 14.5k Link and Append predictions for the development set. The second problem is mentions that, together with their n-gram context, are found more than once in the input. This constitutes 84 cases (0.6%) of all 14.5k Link and Append predictions.
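Both counts reduce to a substring check of the predicted mention together with its 3-gram context against the input; the following sketch (with names of our choosing) mirrors that check:

```python
# Classify a predicted mention by matching it (with its 3-gram context)
# against the input text, mirroring the counts reported above.

def classify_prediction(input_text, mention, next_3gram):
    needle = f"{mention} {next_3gram}".strip()
    n = input_text.count(needle)
    if n == 0:
        return "hallucination"  # not matchable to the input (11 cases, 0.07%)
    if n > 1:
        return "ambiguous"      # matches more than once (84 cases, 0.6%)
    return "unique"
```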
Table 5 shows average F1-scores for buckets of documents within a length range incremented by 128 tokens, analogous to the analysis of Xia et al. (2020). All systems' F1-scores drop substantially, by about 3–4 points, after the 257–512 segment. The Link-Append (LA) system seems to have two more stable F1-score regions, 1–512 and 513–1152 tokens, divided by the aforementioned larger drop, while the other systems show slightly lower accuracy in each successive segment.
6.3 Design Choices

With this paper, we follow the paradigm of a text-to-text approach. Our goal was to use only the text output from a seq2seq model, and potentially the score associated with the output. Crucial for the high accuracy of the Link-Append system are design choices that seem to fit a text-to-text approach well. (1) Initial experiments, not presented in the paper, showed lower performance for a standard two-stage approach using mention prediction followed by mention linking. The Link-only transition system, which we included as a baseline in the paper, was the first system that we implemented that only predicted coreference links, avoiding mention detection. Hence this crucial first design choice is to predict links and not to predict mentions first. (2) The prediction of links in a stateful fashion, where the prior input records previous coreference decisions, finally leads to the superior accuracy of the text-to-text model. (3) The larger model enables us to use the simpler paradigm of a text-to-text model successfully; the smaller models provide substantially lower performance. We speculate, in line with the arguments of Kaplan et al. (2020), that distinct capabilities of a model become strong or even emerge with model size. (4) The strong multilingual results originate from the multilingual T5 model, which was initially surprising to us. For English, the mT5 model performed better as well, which we attribute to the larger vocabulary of the sentence-piece encoding model of mT5.
7 Conclusions

In this paper, we combine a text-to-text (seq2seq) language model with a transition-based system to perform coreference resolution. We reach 83.3 F1-score on the English CoNLL-2012 data set, surpassing the previous SotA. In the text-to-text framework, the Link-Append transition system has been superior to the hybrid Mention-Link-Append transition system with its mixed prediction of mentions, links, and clusters. Our trained models are useful for future work as they could be used to initialize models for continued training or zero-shot transfer to new languages.
Acknowledgments
We would like to thank the action editor and
three anonymous reviewers for their thoughtful
and insightful comments, which were very helpful
in improving the paper.
Figure 2: German zero-shot predictions. The text marked in bold red shows wrong predictions.
Figure 3: Mistakes picked from the CoNLL-2012 development set, e.g., Hong Kong should have been identified recursively within Hong Kong Disneyland; in the last sentence, [3 Disney] refers to the [3 Disney Corporation] cluster instead of, correctly, to the [4 The world 's fifth Disney park] cluster.
Figure 4: Mistakes picked from the CoNLL-2012 development set, e.g., the coreferences [18 a lot of blood] as well as [27 [13 the ship's ] attorneys] are not in the gold annotation.
References
Abdulrahman Aloraini, Juntao Yu, and Massimo
Poesio. 2020. Neural coreference resolution for
Arabic. In Proceedings of the Third Work-
shop on Computational Models of Reference,
Anaphora and Coreference, pages 99–110,
Barcelona, Spain (online). Association for
Computational Linguistics.
Giuseppe Attardi, Maria Simi, and Stefano
Dei Rossi. 2010. TANL-1: Coreference res-
olution by parse analysis and similarity
clustering. In Proceedings of the 5th Inter-
national Workshop on Semantic Evaluation,
pages 108–111, Uppsala, Sweden. Association
for Computational Linguistics.
Semere Kiros Bitew, Johannes Deleu, Chris
Develder, and Thomas Demeester. 2021. Lazy
low-resource coreference resolution: A study
on leveraging black-box translation tools. In
Proceedings of the Fourth Workshop on Com-
putational Models of Reference, Anaphora
and Coreference, pages 57–62, Punta Cana,
Dominican Republic. Association for Compu-
tational Linguistics.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. CoRR, abs/2005.14165.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of
the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for Com-
putational Linguistics.
Vladimir Dobrovolskii. 2021. Word-level coreference resolution. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7670–7675, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.605
Nan Du, Yanping Huang, Andrew M. Dai, Simon
Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim
Krikun, Yanqi Zhou, Adams Wei Yu, Orhan
Firat, Barret Zoph, Liam Fedus, Maarten
Bosma, Zongwei Zhou, Tao Wang, Yu Emma
Wang, Kellie Webster, Marie Pellat, Kevin
Robinson, Kathy Meier-Hellstern, Toju Duke,
Lucas Dixon, Kun Zhang, Quoc V. Le, Yonghui
Wu, Zhifeng Chen, and Claire Cui. 2021.
Glam: Efficient scaling of language models with
mixture-of-experts. CoRR, abs/2112.06905.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735, PubMed: 9377276
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77. https://doi.org/10.1162/tacl_a_00300
Mandar Joshi, Omer Levy, Luke Zettlemoyer, and
Daniel Weld. 2019. BERT for coreference res-
olution: Baselines and analysis. In Proceedings
of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the
9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP),
pages 5803–5808, Hong Kong, China. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/D19-1588
Daniel Jurafsky and James H. Martin. 2021.
Speech and Language Processing: An Intro-
duction to Natural Language Processing, Com-
putational Linguistics, and Speech Recognition,
third edition.
Jared Kaplan, Sam McCandlish, Tom Henighan,
Tom B. Brown, Benjamin Chess, Rewon Child,
Scott Gray, Alec Radford, Jeffrey Wu, and
Dario Amodei. 2020. Scaling laws for neural
language models.
Yuval Kirstain, Ori Ram, and Omer Levy. 2021. Coreference resolution without span representations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 14–19, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-short.3
Hamidreza Kobdani and Hinrich Schütze. 2010. SUCRE: A modular system for coreference resolution. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 92–95, Uppsala, Sweden. Association for Computational Linguistics.
Kenton Lee, Luheng He, Mike Lewis, and Luke
Zettlemoyer. 2017. End-to-end neural corefer-
ence resolution. In Proceedings of the 2017
Conference on Empirical Methods in Nat-
ural Language Processing, pages 188–197,
Copenhagen, Denmark. Association for Com-
putational Linguistics.
Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 687–692, New Orleans, Louisiana. Association for Computational Linguistics.
Bonan Min. 2021. Exploring pre-trained transformers and bilingual transfer learning for Arabic coreference resolution. In Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference, pages 94–99, Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.crac-1.10
Joakim Nivre. 2003. An efficient algorithm for
projective dependency parsing. In Proceed-
ings of the Eighth International Conference on
Parsing Technologies, pages 149–160, Nancy,
France.
Joakim Nivre. 2008. Algorithms for deterministic incremental dependency parsing. Computational Linguistics, 34(4):513–553. https://doi.org/10.1162/coli.07-056-R1-07-027
Matthew E. Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextu-
alized word representations. In Proceedings of
the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 2227–2237,
New Orleans, Louisiana. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/N18-1202
Sameer Pradhan, Alessandro Moschitti, Nianwen
Xue, Olga Uryupina, and Yuchen Zhang. 2012.
CoNLL-2012 shared task: Modeling multilin-
gual unrestricted coreference in OntoNotes. In
Joint Conference on EMNLP and CoNLL -
Shared Task, pages 1–40, Jeju Island, Korea.
Association for Computational Linguistics.
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2019. Exploring the limits of transfer learning
with a unified text-to-text transformer. CoRR,
abs/1910.10683.
Marta Recasens, Lluís Màrquez, Emili Sapena, M. Antònia Martí, Mariona Taulé, Véronique Hoste, Massimo Poesio, and Yannick Versley. 2010. SemEval-2010 task 1: Coreference resolution in multiple languages. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 1–8, Uppsala, Sweden. Association for Computational Linguistics. https://doi.org/10.3115/1621969.1621982
Ina Roesiger and Jonas Kuhn. 2016. IMS HotCoref DE: A data-driven co-reference resolver for German. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 155–160, Portorož, Slovenia. European Language Resources Association (ELRA).
Fynn Schröder, Hans Ole Hatzel, and Chris Biemann. 2021. Neural end-to-end coreference resolution for German in different domains. In Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021), pages 170–181, Düsseldorf, Germany. KONVENS 2021 Organizers.
Romal Thoppilan, Daniel De Freitas, Jamie
Hall, Noam Shazeer, Apoorv Kulshreshtha,
Heng-Tze Cheng, Alicia Jin, Taylor Bos,
Leslie Baker, Yu Du, YaGuang Li, Hongrae
Lee, Huaixiu Steven Zheng, Amin Ghafouri,
Marcelo Menegali, Yanping Huang, Maxim
Krikun, Dmitry Lepikhin, James Qin, Dehao
Chen, Yuanzhong Xu, Zhifeng Chen,
Adam Roberts, Maarten Bosma, Yanqi
Zhou, Chung-Ching Chang, Igor Krivokon,
Will Rusch, Marc Pickett, Kathleen S.
Meier-Hellstern, Meredith Ringel Morris,
Tulsee Doshi, Renelito Delos Santos, Toju
Duke, Johnny Soraker, Ben Zevenbergen,
Vinodkumar Prabhakaran, Mark Diaz, Ben
Hutchinson, Kristen Olson, Alejandra Molina,
Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi
Rajakumar, Alena Butryna, Matthew Lamm,
Viktoriya Kuzmina, Joe Fenton, Aaron Cohen,
Rachel Bernstein, Ray Kurzweil, Blaise
Aguera-Arcas, Claire Cui, Marian Croak,
Ed Chi, and Quoc Le. 2022. Lamda: Lan-
guage models for dialog applications. CoRR,
abs/2201.08239.
Kellie Webster and James R. Curran. 2014.
Limited memory incremental coreference res-
olution. In Proceedings of COLING 2014,
the 25th International Conference on Com-
putational Linguistics: Technical Papers,
pages 2129–2139, Dublin, Ireland. Dublin City
University and Association for Computational
Linguistics.
Wei Wu, Fei Wang, Arianna Yuan, Fei Wu, and Jiwei Li. 2020. CorefQA: Coreference resolution as query-based span prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6953–6963, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.622
Patrick Xia and Benjamin Durme. 2021. Mov-
ing on from OntoNotes: Coreference resolution
model transfer. In Proceedings of the 2021
Conference on Empirical Methods in Natu-
ral Language Processing, pages 5241–5256,
Online and Punta Cana, Dominican Republic.
Association for Computational Linguistics.
Patrick Xia, João Sedoc, and Benjamin Van Durme. 2020. Incremental neural coreference resolution in constant memory. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8617–8624, Online. Association for Computational Linguistics.
Liyan Xu and Jinho D. Choi. 2020. Re-
vealing the myth of higher-order inference
in coreference resolution. In Proceedings of
the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 8527–8533, Online. Association for
Computational Linguistics.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
Juntao Yu, Alexandra Uma, and Massimo Poesio.
2020. A cluster ranking model for full anaphora
resolution. In Proceedings of the 12th Lan-
guage Resources and Evaluation Conference,
pages 11–20, Marseille, France. European
Language Resources Association.