Unleashing the True Potential of Sequence-to-Sequence Models

Unleashing the True Potential of Sequence-to-Sequence Models
for Sequence Tagging and Structure Parsing

Han He
Department of Computer Science
Emory University
Atlanta, GA 30322 Etats-Unis
han.he@emory.edu

Jinho D. Choi
Department of Computer Science
Emory University
Atlanta, GA 30322 Etats-Unis
jinho.choi@emory.edu

Abstrait

Sequence-to-Sequence (S2S) models have
achieved remarkable success on various text
generation tasks. Cependant, learning complex
structures with S2S models remains challeng-
ing as external neural modules and additional
lexicons are often supplemented to predict
non-textual outputs. We present a systematic
study of S2S modeling using contained de-
coding on four core tasks: part-of-speech tag-
ging, named entity recognition, constituency,
and dependency parsing, to develop efficient
exploitation methods costing zero extra param-
eters. En particulier, 3 lexically diverse lineari-
zation schemas and corresponding constrained
decoding methods are designed and evaluated.
Experiments show that although more lexical-
ized schemas yield longer output sequences
that require heavier training, their sequences
being closer to natural language makes them
easier to learn. De plus, S2S models using
our constrained decoding outperform other
S2S approaches using external resources. Notre
best models perform better than or comparably
to the state-of-the-art for all 4 tasks, lighting
a promise for S2S models to generate non-
sequential structures.

1

Introduction

Sequence-to-Sequence (S2S) models pretrained
for language modeling (PLM) and denoising ob-
jectives have been successful on a wide range of
NLP tasks where both inputs and outputs are se-
quences (Radford et al., 2019; Raffel et al., 2020;
Lewis et al., 2020; Brown et al., 2020). Cependant,
for non-sequential outputs like trees and graphs, un
procedure called linearization is often required to
flatten them into ordinary sequences (Li et al.,
2018; Fern´andez-Gonz´alez and G´omez-Rodr´ıguez,
2020; Yan et al., 2021; Bevilacqua et al., 2021; Il
and Choi, 2021un), where labels in non-sequential
structures are mapped heuristically as individual

582

tokens in sequences, and numerical properties like
indices are either predicted using an external de-
coder such as Pointer Networks (Vinyals et al.,
2015un) or cast to additional tokens in the vocab-
ulary. While these methods are found to be effec-
tive, we hypothesize that S2S models can learn
complex structures without adapting such patches.
To challenge the limit of S2S modeling, BART
(Lewis et al., 2020) is finetuned on four tasks with-
out extra decoders: part-of-speech tagging (POS),
named entity recognition (NER), constituency pars-
ing (CON), and dependency parsing (DEP). Three
novel linearization schemas are introduced for
each task: label sequence (LS), label with text
(LT), and prompt (PT). LS to PT feature an
increasing number of lexicons and a decreasing
number of labels, which are not in the vocabulary
(Section 3). Every schema is equipped with a
constrained decoding algorithm searching over
valid sequences (Section 4).

Our experiments on three popular datasets de-
pict that S2S models can learn these linguistic
structures without external resources such as in-
dex tokens or Pointer Networks. Our best models
perform on par with or better than the other state-
of-the-art models for all four tasks (Section 5).
Enfin, a detailed analysis is provided to compare
the distinctive natures of our proposed schemas
(Section 6).1

2 Related Work

S2S (Sutskever et al., 2014) architectures have
been effective on many sequential modeling tasks.
Conventionally, S2S is implemented as an en-
coder and decoder pair, where the encoder learns
input representations used to generate the output

1All our resources including source codes are publicly
available: https://github.com/emorynlp/seq2seq
-corenlp.

Transactions of the Association for Computational Linguistics, vol. 11, pp. 582–599, 2023. https://doi.org/10.1162/tacl a 00557
Action Editor: Emily Pitler. Submission batch: 11/2022; Revision batch: 1/2023; Published 6/2023.
c(cid:2) 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 Licence.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
5
7
2
1
3
4
4
9
5

/

/
t

je

un
c
_
un
_
0
0
5
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

sequence via the decoder. Since the input sequence
can be very long, attention mechanisms (Bahdanau
et coll., 2015; Vaswani et al., 2017) focusing on par-
ticular positions are often augmented to the basic
architecture. With transfer-learning, S2S models
pretrained on large unlabeled corpora have risen
to a diversity of new approaches that convert lan-
guage problems into a text-to-text format (Akbik
et coll., 2018; Lewis et al., 2020; Radford et al.,
2019; Raffel et al., 2020; Brown et al., 2020).
Among them, tasks most related to our work are
linguistic structure predictions using S2S, POS,
NER, DEP, and CON.

POS has been commonly tackled as a sequence
tagging task, where the input and output sequences
have equal lengths. S2S, on the other hand, does
not enjoy such constraints as the output sequence
can be arbitrarily long. Donc, S2S is not as
popular as sequence tagging for POS. Prevailing
neural architectures for POS are often built on top
of a neural sequence tagger with rich embeddings
(Bohnet et al., 2018; Akbik et al., 2018) and Con-
ditional Random Fields (Lafferty et al., 2001).

NER has been cast to a neural sequence tagging
task using the IOB notation (Lample et al., 2016)
over the years, which benefits most from contex-
tual word embeddings (Devlin et al., 2019; Wang
et coll., 2021). Early S2S-based works cast NER
to a text-to-IOB transduction problem (Chen and
Moschitti, 2018; Strakov´a et al., 2019; Zhu et al.,
2020), which is included as a baseline schema in
Section 3.2. Yan et al. (2021) augment Pointer Net-
works to generate numerical entity spans, lequel
we refrain to use because the focus of this work
is purely on the S2S itself. Most recently, Cui
et autres. (2021) propose the first template prompting
to query all possible spans against a S2S lan-
guage model, which is highly simplified into a
one-pass generation in our PT schema. Instead of
directly prompting for the entity type, Chen et al.
(2022) propose to generate its concepts first then
its type later. Their two-step generation is tai-
lored for few-shot learning, orthogonal to our ap-
proach. De plus, our prompt approach does not
rely on non-textual tokens as they do.

CON is a more established task for S2S models
since the bracketed constituency tree is naturally
a linearized sequence. Top-down tree lineariza-
tions based on brackets (Vinyals et al., 2015b)
or shift-reduce actions (Sagae and Lavie, 2005)
rely on a strong encoder over the sentence while
bottom-up ones (Zhu et al., 2013; Ma et al.,

2017) can utilize rich features from readily built
the in-order traversal
partial parses. Recently,
has proved superior to bottom-up and top-down
in both transition (Liu and Zhang, 2017) et
S2S (Fern´andez-Gonz´alez and G´omez-Rodr´ıguez,
2020) constituency parsing. Most recently, un
Pointer Networks augmented approach (Yang and
Tu, 2022) is ranked top among S2S approaches.
Since we are interested in the potential of S2S
models without patches, a naive bottom-up baseline
and its novel upgrades are studied in Section 3.3.
DEP has been underexplored as S2S due to the
linearization complexity. The first S2S work maps
a sentence to a sequence of source sentence words
interleaved with the arc-standard, reduce-actions
in its parse (Wiseman and Rush, 2016), which is
adopted as our LT baseline in Section 3.4. Zhang
et autres. (2017) introduce a stack-based multi-layer
attention mechanism to leverage structural lin-
guistics information from the decoding stack in
arc-standard parsing. Arc-standard is also used
in our LS baseline, cependant, we use no such ex-
tra layers. Apart from transition parsing, Li et al.
(2018) directly predict the relative head posi-
tion instead of the transition. This schema is
later extended to multilingual and multitasking by
Choudhary and O’riordan (2021). Their encoder
and decoder use different vocabularies, while in
our PT setting, we re-use the vocabulary in the
S2S language model.

S2S appears to be more prevailing for semantic
parsing due to two reasons. D'abord, synchronous
context-free grammar bridges the gap between
natural text and meaning representation for S2S.
It has been employed to obtain silver annotations
(Jia and Liang, 2016), and to generate canonical
natural language paraphrases that are easier to
learn for S2S (Shin et al., 2021). This trend of in-
sights viewing semantic parsing as prompt-guided
generation (Hashimoto et al., 2018) and paraphras-
ing (Berant and Liang, 2014) has also inspired our
design of PT. Deuxième, the flexible input/output
format of S2S facilitates joint learning of seman-
tic parsing and generation. Latent variable sharing
(Tseng et al., 2020) and unified pretraining (Bai
et coll., 2022) are two representative joint model-
ing approaches, which could be augmented with
our idea of PT schema as a potentially more ef-
fective linearization.

Our finding that core NLP tasks can be solved
using LT overlaps with the Translation between
Augmented Natural Languages (Paolini et al.,

583

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
5
7
2
1
3
4
4
9
5

/

/
t

je

un
c
_
un
_
0
0
5
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

2021). Cependant, we take one step further to
study the impacts of textual tokens in schema
design choices. Our constrained decoding is sim-
ilar to existing work (Hokamp and Liu, 2017;
Deutsch et al., 2019; Shin et al., 2021). Nous
craft constrained decoding algorithms for our pro-
posed schemas and provide a systematic ablation
study in Section 6.1.

3 Schemas

This section presents our output schemas for
POS, NER, CON, and DEP in Table 1. For each
task, 3 lexically diverse schemas are designed as
follows to explore the best practice for structure
learning. D'abord, Label Sequence (LS) is defined as
a sequence of labels consisting of a finite set of
task-related labels, that are merged into the S2S
vocabulary, with zero text. Deuxième, Label with
Texte (LT) includes tokens from the input text on
top of the labels such that it has a medium num-
ber of labels and text. Troisième, PrompT (PT) gives
a list of sentences describing the linguistic struc-
ture in natural language with no label. We hy-
pothesize that the closer the output is to natural
langue, the more advantage the S2S takes from
the PLM.

3.1 Part-of-Speech Tagging (POS)

LS LS defines the output as a sequence of POS
tags. Officiellement, given an input sentence of n to-
kens x = {x1, x2, · · · , xn}, its output is a tag se-
quence of the same length yLS = {y1, y2, · · · ,
yn}. Distinguished from sequence tagging, any
LS output sequence is terminated by the ‘‘end-
of-sequence’’ (EOS) token, which is omitted
from yLS for simplicity. Predicting POS tags
often depends on their neighbor contexts. Nous
challenge that the autoregressive decoder of a
S2S model can capture this dependency through
self-attention.

LT For LT, the token from the input is inserted
before its corresponding tag. Officiellement, the output
is defined yLT = {(x1, y1), (x2, y2), .., (xn, yn)}.
Both x and y are part of the output and the S2S
model is trained to generate each pair sequentially.

PT PT is a human-readable text describing the
POS sequence. Spécifiquement, we use a phrase
i =‘‘xi is y(cid:3)
yPT
i is
the definition of a POS tag yi, par exemple., a noun. Le

i’’ for the i-th token, where y(cid:3)

final prompt is then the semicolon concatenation
of all phrases: yPT = yPT

2 ; · · · ; yPT
n .

1 ; yPT

3.2 Named Entity Recognition (NER)

LS LT of an input sentence comprising n tokens
x = {x1, x2, · · · , xn} is defined as the BIEOS tag
sequence yLS = {y1, y2, · · · , yn}, which labels
each token as the Beginning, Inside, End, Outside,
or Single-token entity.

LT LT uses a pair of entity type labels to wrap
each entity: yLT = ..B-yj, xi, .., xi+k, E-yj, ..,
is the type label of the j-th entity
where yj
consisting of k tokens.

i’’, where y(cid:3)

PT PT is defined as a list of sentences describ-
i =‘‘xi is y(cid:3)
ing each entity: yPT
i is
the definition of a NER tag yi, par exemple., a person.
Different from the prior prompt work (Cui et al.,
2021), our model generates all entities in one
pass which is more efficient than their brute-force
approche.

3.3 Constituency Parsing (CON)

Schemas for CON are developed on constituency
trees pre-processed by removing the first level of
non-terminals (POS tags) and rewiring their chil-
les enfants (tokens) to parents, par exemple., (NP (PRON My)
(NOUN friend)) (NP My friend).

LS LS is based on a top-down shift-reduce sys-
tem consisting of a stack, a buffer, and a depth
record d. Initially, the stack contains only the root
constituent with label TOP and depth 0; the buffer
contains all tokens from the input sentence; d is
set to 1. A Node-X (N-X) transition creates a
new depth-d non-terminal labeled with X, pushes
it to the stack, and sets d ← d + 1. A Shift
(SH) transition removes the first token from the
buffer and pushes it to the stack as a new terminal
with depth d. A Reduce (RE) pops all elements
with the same depth d from the stack then make
them the children of the top constituent of the
stack, and it sets d ← d − 1. The linearization of
a constituency tree using our LS schema can be
obtained by applying 3 string substitutions: concernant-
place each left bracket and the label X following
it with a Node-X, replace terminals with SH,
replace right brackets with RE.

LT LT is derived by reverting all SH in LS back
to the corresponding tokens so that tokens in LT
effectively serves as SH in our transition system.

584

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
5
7
2
1
3
4
4
9
5

/

/
t

je

un
c
_
un
_
0
0
5
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
5
7
2
1
3
4
4
9
5

/

/
t

je

un
c
_
un
_
0
0
5
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Tableau 1: Schemas for the sentence ‘‘My friend who lives in Orlando bought me a gift from Disney
World’’.

PT PT is also based on a top-down lineariza-
tion, although it describes a constituent using tem-
plates: ‘‘pi has {cj}’’, where pi is a constituent
and cj-s are its children. To describe a constitu-

ent, the indefinite article ‘‘a’’ is used to denote a
new constituent (par exemple., ‘‘. . . has a noun phrase’’).
The definite article ‘‘the’’ is used for referring
to an existing constituent mentioned before (par exemple.,

585

‘‘the noun phrase has’’), or describing a con-
stituent whose children are all terminals (par exemple.,
‘‘. . . has the noun phrase ‘My friend’’’). Quand
describing a constituent that directly follows its
mention, the determiner ‘‘which’’ is used instead
of repeating itself multiple times e.g., ‘‘(. . . et
the subordinating clause, which has’’). Sen-
tences are joined with a semicolon ‘‘;’’ as the
final prompt.

3.4 Dependency Parsing (DEP)

LS LS uses three transitions from the arc-
standard system (Nivre, 2004): shift (SH), gauche
arc (<), and right arc (>).

LT LT for DEP is obtained by replacing each
SH in a LS with its corresponding token.

PT PT is derived from its LS sequence by re-
moving all SH. Alors, for each left arc creating
an arc from xj to xi with dependency relation r
(par exemple., a possessive modifier), a sentence is created
by applying the template ‘‘xi is r of xj’’. Pour
each right arc creating an arc from xi to xj with
the dependency relation r, a sentence is created
with another template ‘‘xi has r xj’’. The prompt
is finalized by joining all such sentences with a
semicolon.

4 Decoding Strategies

To ensure well-formed output sequences that
match the schemas (Section 3), a set of constrained
decoding strategies is designed per task except for
CON, which is already tackled as S2S model-
ing without constrained decoding (Vinyals et al.,
2015b; Fern´andez-Gonz´alez and G´omez-Rodr´ıguez,
2020). Officiellement, given an input x and any par-
tial y 2n then

return {EOS}

else

if i is even then
}
return {x i

2

else

return D

Algorithm 2: Prefix Matching
Function PrefixMatch(T , p):

node ← T
while node and p do

node ← node.children[p1]
p ← p>1
return node

depends on the parity of
Algorithm 1.

je, as defined in

PT The PT generation can be divided into two
phases: token and ‘‘is-a-tag’’ statement genera-
tion. A binary status u is used to indicate whether
yi is expected to be a token. To generate a token,
an integer k ≤ n is used to track the index of
the next token. To generate an ‘‘is-a-tag’’ state-
ment, an offset o is used to track the beginning
of an ‘‘is-a-tag’’ statement. Each description dj
of a POS tag tj is extended to a suffix sj =
‘‘is dj ; ’’.

Suffixes are stored in a trie tree T to facilitate
prefix matching between a partially generated
statement and all candidate suffixes, as shown
in Algorithm 2. The full decoding is depicted in
Algorithm 3.

4.2 Named Entity Recognition

LS Similar to POS-LS, the NextY for NER
returns BIEOS tags if i ≤ n else EOS.

LT Opening tags (<>) in NER-LT are grouped
into a vocabulary O. The last generated output
token yi−1 (assuming y0 = BOS, a.k.a. début
of a sentence) is looked up in O to decide what
type of token will be generated next. To enforce
label consistency between a pair of tags, a variable
e is introduced to record the expected closing
tag. Reusing the definition of k in Algorithm 3,
decoding of NER-LT is described in Algorithm 4.

586

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
5
7
2
1
3
4
4
9
5

/

/
t

je

un
c
_
un
_
0
0
5
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Algorithm 3: Constrained POS-PT
u ← true, k ← 0, o ← 0
Function NextY(X, ouio)
if node.children is empty then

u ← true
return NextY(X, oui n then

Y ← Y ∪ {EOS}

else

Y ← Y ∪ {xk}

return Y

PT For each entity type ei, its description di is
filled into the template ‘‘is di;’’ to create an ‘‘is-a’’
suffix si. Since the prompt is constructed using
text while the number of entities is variable, it is
not straightforward to tell whether a token belongs
to an entity or an ‘‘is-a’’ suffix. Donc, a noisy
segmentation procedure is utilized to split a phrase
into two parts: entity and ‘‘is-a’’ suffix. Each si is
collected into a trie S to perform segmentation of
a partially generated phrase p (Algorithm 5).

Once a segment is obtained, the decoder is
constrained to generate the entity or the suffix.
For the generation of an entity, string matching
is used to find every occurrence o of its partial
generation in x and add the following token xo+1

587

Algorithm 5: Segmentation
Function Segment(S, p):
for i ← 1 à |p| do

entity, suffix = p≤i, p>i
node ← PrefixMatch(S, p>i)
if node then

return entity, suffix, node

return null

Algorithm 6: Constrained NER-PT
Function NextY(X, oui o then

spans ← spans ∪ {(o, je, null)}

spans ← spans ∪ {(je, j, v)}

if o < |x| + 1 then spans ← spans ∪ {(o, |x| + 1, null)} return spans while parent do foreach sibling of parent do if sibling.label is label and sibling has no children then return sibling parent ← parent.parent return null Algorithm 10: Reverse CON-PT Function Reverse(T , x): root ← parent ← new TOP-tree latest ← null foreach (i, j, v) ∈ Split(T , x) do if v then if xi:j starts with ‘‘the’’ then target ← FindTarget(parent, v) else latest ← new v-tree add latest to parent.children latest.parent ← parent else if xi:j starts with ‘‘has’’ or ‘‘which has’’ then parent ← latest add tokens in ‘‘’’ into latest return root parent. The search of the target constituent is described in Algorithm 9. shows Algorithm 10 reverse final the creating new constituents, and sentences attach- ing new constituents to existing ones. Splitting is done by longest-prefix-matching (Algorithm 7) using a trie T built with the definite and indef- inite article versions of the description of each constituent label e.g., ‘‘the noun phrase’’ and ‘‘a noun phrase’’ of NP. Algorithm 8 describes the splitting procedure. Once a prompt is split into two types of sen- tences, a constituency tree is then built accord- ingly. We use a variable parent to track the last constituent that gets attachments, and another var- iable latest to track the current new consis- tent that gets created. Due to the top-down nature of linearization, the target constituent that new constituents are attached to is always among the siblings of either parent or the ancestors of linearization. 4.4 Dependency Parsing LS Arc-standard (Nivre, 2004) transitions are added to a candidate set and only transitions per- mitted by the current parsing state are allowed. LT DEP-LS replaces all SH transitions with input tokens in left-to-right order. Therefore, an incremental offset is kept to generate the next token in place of each SH in DEP-LT. PT DEP-PT is more complicated than CON-PT because each sentence contains one more token. Its generation is therefore divided into 4 possible token (1st), relation (rel), sec- states: first ond token (2ed), and semicolon. An arc-standard 588 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 5 5 7 2 1 3 4 4 9 5 / / t l a c _ a _ 0 0 5 5 7 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Algorithm 11: Recall Shift Function RecallShift(system, i, xj): while system.si is not xj do system.apply(SH) transition system is executed synchronously with constrained decoding since PT is essentially a sim- plified transition sequence with all SH removed. Let b and s be the system buffer and stack, respec- tively. Let c be a set of candidate tokens that will be generated in y, which initially contains all input tokens and an inserted token ‘‘sentence’’ that is only used to represent the root in ‘‘the sentence has a root . . .’’ A token is removed from c once it gets popped out of s. Since DEP-PT generates no SH, each input token xj in y effectively introduces SH(s) till it is pushed onto s at index i (i ∈ {1, 2}), as formally described in Algorithm 11. After the first token is generated, its offset o in y is recorded such that the following relation sequence yi>o can be located. To decide the next
token of yi>o, it is then prefix-matched with a
trie T built with the set of ‘‘has-’’ and ‘‘is-’’
dependency relations. The children of the prefix-
matched node are considered candidates if it
has any. Otherwise, the dependency relation is
marked as completed. Once a relation is gen-
erated, the second token will be generated in a
similar way. Enfin, upon the completion of a
sentence, the transition it describes is applied to
the system and c is updated accordingly. The full
procedure is described in Algorithm 12. Since a
transition system has been synchronously main-
tained with constrained decoding, no extra re-
verse linearization is needed.

5 Experiments

For all tasks, BART-Large (Lewis et al., 2020)
is finetuned as our underlying S2S model. Nous
also tried T5 (Raffel et al., 2020), although its
performance was less satisfactory. Every model
is trained three times using different random seeds
and their average scores and standard deviations
on the test sets are reported. Our models are ex-
perimented on the OntoNotes 5 (Weischedel et al.,
2013) using the data split suggested by Pradhan
et autres. (2013). En outre, two other popular data-
sets are used for fair comparisons to previous
travaux: the Wall Street Journal corpus from the

Algorithm 12: Constrained DEP-PT
(status, transition, t1, t2, o) (1st, null, null, null, 0)
c ← {sentence} ∪ y
Function NextY(X, ouio)
if node.children then

Y ← Y ∪ {node.children}

else

relation ← the relation in y>o
if y>o starts with ‘‘is’’ then

transition ← LA-relation

else

transition ← RA-relation

status ← 2ed

else if status is 2ed then

Y ← Y ∪ c
status ← semicolon

else if status is semicolon then

t2 ← yi−1
Y ← Y ∪ {; }
RecallShift(système, 1, t2)
RecallShift(système, 2, t1)
if transition starts with LA then

remove s1 from c

else

remove s2 from c
system.apply(transition)
if system is terminal then
Y ← Y ∪ {EOS}

status ← 1st

return Y

Penn Treebank 3 (Marcus et al., 1993) for POS,
DEP, and CON, as well as the English portion
of the CoNLL’03 dataset (Tjong Kim Sang and
De Meulder, 2003) for NER.

589

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
5
7
2
1
3
4
4
9
5

/

/
t

je

un
c
_
un
_
0
0
5
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

OntoNotes

Model

CoNLL’03

OntoNotes 5

Model

Bohnet et al. (2018)

He and Choi (2021b)
LS

LT

PT

PTB

97.96


97.51 ± 0.11
97.70 ± 0.02
97.64 ± 0.01


98.32 ± 0.02
98.21 ± 0.02
98.40 ± 0.01
98.37 ± 0.02

Tableau 2: Results for POS.

Each token is independently tokenized using
the subword tokenizer of BART and merged into
an input sequence. The boundary information for
each token is recorded to ensure full tokens are
generated in LT and PT without broken pieces.
To fit in the positional embeddings of BART, sen-
tences longer than 1,024 subwords are discarded,
qui comprennent 1 sentence from the Penn Tree-
bank 3 training set, et 24 sentences from the
OntoNotes 5 training set. Development sets and
test sets are not affected.

5.1 Part-of-Speech Tagging

Token level accuracy is used as the metric for
POS. LT outperforms LS although LT is twice as
long as LS, suggesting that textual tokens posi-
tively impact the learning of the decoder (Tableau 2).
PT performs almost the same with LT, peut-être
due to the fact that POS is not a task requiring a
powerful decoder.

5.2 Named Entity Recognition

For CoNLL’03, the provided splits without merg-
ing the development and training sets are used.
For OntoNotes 5, the same splits as Chiu and
Nichols (2016), Li et al. (2017); Ghaddar and
Langlais (2018); He and Choi (2020, 2021b) sont
used. Labeled span-level F1 score is used for
evaluation.

We acknowledge that the performance of NER
systems can be largely improved by rich embed-
dings (Wang et al., 2021), document context fea-
photos (Yu et al., 2020), dependency tree features
(Xu et al., 2021), and other external resources.
While our focus is the potential of S2S, we mainly
consider two strong baselines that also use BART
the generative
as the only external resource:
BART-Pointer framework (Yan et al., 2021) et
the recent template-based BART NER (Cui et al.,
2021).

As shown in Table 3, LS performs the worst
on both datasets, possibly attributed to the fact

590

Clark et al. (2018)

Peters et al. (2018)

Akbik et al. (2019)

Strakov´a et al. (2019)

Yamada et al. (2020)
Yu et al. (2020)
Yan et al. (2021)‡S
Cui et al. (2021)S
He and Choi (2021b)

Wang et al. (2021)

Zhu and Li (2022)

Ye et al. (2022)
LS

LT

PT

92.60

92.22

93.18

93.07

92.40

92.50

93.24

92.55

94.6


70.29 ± 0.70
92.75 ± 0.03
93.18 ± 0.04

89.83

90.38


89.04 ± 0.14

91.74

91.9
84.61 ± 1.18
89.60 ± 0.06
90.33 ± 0.04

Tableau 3: Results for NER. S denotes S2S.

that the autoregressive decoder overfits the high-
order left-to-right dependencies of BIEOS tags.
LT performs close to the BERT-Large biaffine
model (Yu et al., 2020). PT performs compa-
rably well with the Pointer Networks approach
(Yu et al., 2020) and it outperforms the template
prompting (Cui et al., 2021) by a large margin,
suggesting S2S has the potential to learn struc-
tures without using external modules.

5.3 Constituency Parsing

All POS tags are removed and not used in train-
ing or evaluation. Terminals belonging to the
same non-terminal are flattened into one con-
stituent before training and unflattened in post-
traitement. The standard constituent-level F-score
produced by the EVALB3 is used as the evalua-
tion metric.

Tableau 4 shows the results on OntoNotes 5 et
PTB 3. Incorporating textual tokens into the out-
put sequence is important on OntoNotes 5, lead-
ing to a +0.9 F-score, while it is not the case on
PTB 3. It is possibly due to the fact that Onto-
Notes is more diverse in domains, requiring a
higher utilization of pre-trained S2S for domain
transfer. PT performs the best, and it has a com-
petitive performance to recent works, despite the
fact that it uses no extra decoders.

3https://nlp.cs.nyu.edu/evalb/.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
5
7
2
1
3
4
4
9
5

/

/
t

je

un
c
_
un
_
0
0
5
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Model

PTB 3

OntoNotes 5

Fern´andez-Gonz´alez and
G´omez-Rodr´ıguez (2020)S
Mrini et al. (2020)

He and Choi (2021b)
Yang and Tu (2022)S
LS

LT

PT

91.6

96.38

96.01
95.23 ± 0.08
95.24 ± 0.04
95.34 ± 0.06


94.43 ± 0.03

93.40 ± 0.31
94.32 ± 0.11
94.55 ± 0.03

Tableau 4: Results for CON. S denotes S2S.

UAS
91.17

93.71

94.11

Model
Wiseman and Rush (2016)S
Zhang et al. (2017)S
Li et al. (2018)S
Mrini et al. (2020)
LS

LT

PT

97.42
92.83 ± 0.43
95.79 ± 0.07
95.91 ± 0.06
(un) PTB results for DEP.

Model
He and Choi (2021b)
LS

LT

PT

UAS
95.92 ± 0.02
86.54 ± 0.12
94.15 ± 0.14
94.51 ± 0.22

(b) OntoNotes results for DEP.

LAS
87.41

91.60

92.08

96.26
90.50 ± 0.53
93.17 ± 0.16
94.31 ± 0.09

LAS
94.24 ± 0.03
83.84 ± 0.13
91.27 ± 0.19
92.81 ± 0.21

Tableau 5: Results for DEP. S denotes S2S.

Model

LS

w/o CD

LT

w/o CD

PT

w/o CD

PTB
97.51 ± 0.11
97.51 ± 0.11
97.70 ± 0.02
97.67 ± 0.02
97.64 ± 0.01
97.55 ± 0.02

OntoNotes
98.21 ± 0.02
98.21 ± 0.02
98.40 ± 0.01
98.39 ± 0.01
98.37 ± 0.02
98.29 ± 0.05

(un) Accuracy of ablation tests for POS.

Model

LS

w/o CD

LT

w/o CD

PT

w/o CD

CoNLL 03
70.29 ± 0.70
66.33 ± 0.73
92.75 ± 0.03
92.72 ± 0.02
93.18 ± 0.04
93.12 ± 0.06

OntoNotes 5
84.61 ± 1.18
84.57 ± 1.16
89.60 ± 0.06
89.50 ± 0.07
90.33 ± 0.04
90.23 ± 0.05

(b) F1 of ablation tests for NER.

Model

LS

w/o CD

LT

w/o CD

PT

w/o CD

PTB
90.50 ± 0.53
90.45 ± 0.47
93.17 ± 0.16
93.12 ± 0.14
94.31 ± 0.09
81.50 ± 0.27

OntoNotes
83.84 ± 0.13
83.78 ± 0.13
91.27 ± 0.19
91.05 ± 0.20
92.81 ± 0.21
81.76 ± 0.36

5.4 Dependency Parsing

(c) LAS of ablation tests for DEP.

The constituency trees from PTB and OntoNotes
are converted into the Stanford dependencies
v3.3.0 (De Marneffe and Manning, 2008) for DEP
experiments. Forty and 1 non-projective trees are
removed from the training and development sets
of PTB 3, respectivement. For OntoNotes 5, ces
numbers are 262 et 28. Test sets are not affected.
As shown in Table 5, textual tokens are cru-
cial
in learning arc-standard transitions using
S2S, leading to +2.6 et +7.4 LAS improve-
ments, respectivement. Although our PT method
underperforms recent state-of-the-art methods, it
has the strongest performance among all S2S
approaches. Fait intéressant, our S2S model man-
ages to learn a transition system without explic-
itly modeling the stack, the buffer, the partial
parse, or pointers.

We believe that the performance of DEP with
S2S can be further improved with a larger and
more recent pretrained S2S model and dynamic
oracle (Goldberg and Nivre, 2012).

Tableau 6: Ablation test results.

6 Analysis

6.1 Ablation Study

We perform an ablation study to show the perfor-
mance gain of our proposed constrained decoding
algorithms on different tasks. Constrained decod-
ing algorithms (CD) are compared against free
generation (w/o CD) where a model freely gener-
ates an output sequence that is later post-processed
into task-specific structures using string-matching
rules. Invalid outputs are patched to the greatest
extent, par exemple., POS label sequences are padded or
truncated. As shown in Table 6, ablation of con-
strained decoding seldom impacts the performance
of LS on all tasks, suggesting that the decoder of
seq2seq can acclimatize to the newly added label

591

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
5
7
2
1
3
4
4
9
5

/

/
t

je

un
c
_
un
_
0
0
5
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

tokens. Fait intéressant, the less performant NER-LS
model degrades the most, promoting the neces-
sity of constrained decoding for weaker seq2seq
models. The performance of LT on all tasks is
marginally degraded when constrained decoding
is ablated, indicating the decoder begins to gen-
erate structurally invalid outputs when textual
tokens are freely generated. This type of problem
seems to be exacerbated when more tokens are
freely generated in the PT schemas, especially for
the DEP-PT.

Unlike POS and NER, DEP is more prone
to hallucinated textual tokens as early errors in
the transition sequence get accumulated in the
arc-standard system which shifts all later predic-
tions off the track. It is not yet a critical problem
as LS generates no textual tokens while a textual
token in LT still serves as a valid shift action
even if it is hallucinated. Cependant, a hallucinated
textual token in PT is catastrophic as it could be
part of any arc-standard transitions. As no explicit
shift transition is designed, a hallucinated token
could lead to multiple instances of missing shifts
in Algorithm 12.

6.2 Case Study

To facilitate understanding and comparison of
different models, a concrete example of input
(je), gold annotation (G), and actual model pre-
diction per each schema is provided below for
each task. Wrong predictions and corresponding
ground truth are highlighted in red and teal,
respectivement.

POS In the following example, only PT correctly
detects the past tense (VBD) of ‘‘put’’.

je: The word I put in boldface is extremely interesting.
G: DT NN PRP VBD IN NN VBZ RB JJ .
LS: DT NN PRP VBP IN NN VBZ RB RB JJ
LT: The/DT word/NN I/PRP put/VBP in/IN

boldface/NN is/VBZ extr./RB interesting/JJ./.
PT: ‘‘The’’ is a determiner; ‘‘word’’ is a singular noun;

‘‘I’’ is a personal pronoun; ‘‘put’’ is a past tense verb;
‘‘in’’ is a preposition or subordinating conjunction;
‘‘boldface’’ is a singular noun; ‘‘is’’ is a 3rd person
singular present verb; ‘‘extremely’’ is an adverb;
‘‘interesting’’ is an adjective; ‘‘.’’ is a period.

je: Large image of the Michael Jackson HIStory statue.
G: Large image of the Michael Jackson
statue.
(cid:5)

(cid:2)

(cid:3)(cid:4)
PERSON(PER)

HIStory
(cid:2) (cid:3)(cid:4) (cid:5)
WOA

LS: O O O O B-PER E-PER S-ORG O O O
LT: Large image of the Michael Jackson

HIStory statue.
PT: ‘‘Michael Jackson’’ is a person;
‘‘HIStory’’ is an art work.

CON As highlighted with strikeout
text be-
faible, LS and LT failed to parse ‘‘how much’’
as a wh-noun phrase and a wh-adverb phrase,
respectivement.

je: It’s crazy how much he eats.
G: (S (NP (NP It)) (VP ’s (ADJP crazy) (SBAR
(WHNP (WHADJP how much)) (S (NP he)
(VP eats)))) .)

LS: N-S N-NP N-NP SH RE RE N-VP SH N-ADJP SH
RE N-SBAR N-WHNP N-WHADVP SH SH RE RE
N-S N-NP SH RE N-VP SH RE RE RE RE SH RE
LT: (S (NP (NP It)) (VP ’s (ADJP crazy) (SBAR
(WHNP (WHADJP how much)) (S (NP he)
(VP eats)))) .)

PT: a sentence has a simple clause, which has a noun

phrase and a verb phrase and ‘‘.’’; the noun phrase
has a noun phrase ‘‘It’’, the verb phrase has ‘‘’s’’
and an adjective phrase ‘‘crazy’’ and a subordinating
clause, which has a wh-noun phrase and a simple
clause; the wh-noun phrase has a wh-adjective
phrase ‘‘how much’’, the simple clause has a noun
phrase ‘‘he’’ and a verb phrase ‘‘eats’’.

DEP In the following example, LS incorrectly
attached ‘‘so out of’’ to ‘‘place’’, and LT wrongly
attached ‘‘so’’ to ‘‘looks’’.

je: It looks so out of place.

G:
LS: SH SH LA-nsubj SH SH SH SH LA-advmod
LA-advmod LA-advmod RA-acomp SH
RA-punct RA-root

LT: It looks LA-nsubj so out of place RA-pobj

RA-pcomp RA-prep RA-ccomp . RA-punct
RA-root

PT: ‘‘It’’ is a nominal subject of ‘‘looks’’; ‘‘so’’ is an

adverbial modifier of ‘‘out’’; ‘‘of’’ has an object of
a preposition ‘‘place’’; ‘‘out’’ has a prepositional
complement ‘‘of’’; ‘‘looks’’ has a prepositional
modifier ‘‘out’’; ‘‘looks’’ has a punctuation ‘‘.’’;
‘‘sentence’’ has a root ‘‘looks’’.

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
5
7
2
1
3
4
4
9
5

/

/
t

je

un
c
_
un
_
0
0
5
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

NER In the following example, LS and LT
could not correctly recognize ‘‘HIStory’’ as an
art work, possibly due to its leading uppercase
letters.

6.3 Design Choices

In the interest of experimentally comparing the
schema variants, we would like each design we

592

Model
POS-PT
dec.LEX
DEP-PT
dec.LEX

PTB 3
97.64 ± 0.01
97.63 ± 0.02
94.31 ± 0.09
93.89 ± 0.18

OntoNotes 5
98.37 ± 0.02
98.35 ± 0.03
92.81 ± 0.21
91.19 ± 0.86

Tableau 7: Study of lexicality on POS and DEP.

Model
NER-PT
inc.VRB
Model
CON-PT
inc.VRB

CoNLL 03
93.18 ± 0.04
92.47 ± 0.03
PTB 3
95.34 ± 0.06
95.19 ± 0.06

OntoNotes 5
90.33 ± 0.04
89.63 ± 0.23
OntoNotes 5
94.55 ± 0.03
94.02 ± 0.49

Tableau 8: Study of verbosity on NER and CON.

consider to be equivalent in some systematic way.
To this end, we fix other aspects and variate
two dimensions of the prompt design, lexicality,
and verbosity, to isolate the impact of individual
variables.

Lexicality We call the portion of textual tok-
ens in a sequence its lexicality. Ainsi, LS and PT
have zero and full lexicality, respectivement, alors que
LT falls in the middle. To tease apart the impact
of lexicality, we substitute the lexical phrases
with corresponding tag abbreviations in PT on
POS and DEP, par exemple., ‘‘friend’’ is a noun →
‘‘friend’’ is a NN, ‘‘friend’’ is a nominal sub-
ject of ‘‘bought’’ → ‘‘friend’’ is a nsubj of
‘‘bought’’. Tags are added to the BART vocabu-
lary and learned from scratch as LS and LT. Comme
shown in Table 7, decreasing the lexicality of PT
marginally degrades the performance of S2S on
POS. On DEP, the performance drop is rather sig-
nificant. Similar trends are observed comparing
LT and LS in Section 5, confirming that lexicons
play an important role in prompt design.

Verbosity Our PT schemas on NER and CON
are designed to be as concise as human narrative,
and as easy for S2S to generate. Another design
choice would be as verbose as some LS and LT
schemas. To explore this dimension, we increase
the verbosity of NER-PT and CON-PT by adding
‘‘isn’t an entity’’ for all non-entity tokens and
substituting each ‘‘which’’ to its actual referred
phrase, respectivement. The results are presented in
Tableau 8. Though increased verbosity would elim-
inate any ambiguity, unfortunately, it hurts per-
formance. Emphasizing a token ‘‘isn’t an entity’’
might encounter the over-confidence issue as the
boundary annotation might be ambiguous in gold
NER data (Zhu and Li, 2022). CON-PT deviates
from human language style when reference is
forbidden, which eventually makes it lengthy and
hard to learn.

6.4 Stratified Analysis

Section 5 shows that our S2S approach performs
comparably to most ad-hoc models. To reveal its
pros and cons, we further partition the test data
using task-specific factors and run tests on them.
The stratified performance on OntoNotes 5 is com-
pared to the strong BERT baseline (He and Choi,
2021b), which is representative of non-S2S models
implementing many state-of-the-art decoders.

For POS, we consider the rate of Out-Of-
Vocabulary tokens (OOV, tokens unseen in the
training set) in a sentence as the most significant
factor. As illustrated in Figure 1a, the OOV rate
degrades the baseline performance rapidly, es-
pecially when over half tokens in a sentence are
OOV. Cependant, all S2S approaches show strong
resistance to OOV, suggesting that our S2S mod-
els unleash greater potential
through transfer
learning.

For NER, entities unseen during training often
confuse a model. This negative impact can be
observed on the baseline and LT in Figure 1b.
Cependant, the other two schemas generating tex-
tual tokens, LT and PT, are less severely impacted
by unseen entities. It further supports the intuition
behind our approach and agrees with the finding
by Shin et al. (2021): With the output sequence
being closer to natural language, the S2S model
has less difficulty generating it even with unseen
entities.

Since the number of binary parses for a sen-
tence of n + 1 tokens is the nth Catalan Number
(Church and Patil, 1982), the length is a crucial
factor for CON. As shown in Figure 1c, all models,
especially LS, perform worse when the sentence
gets longer. Fait intéressant, by simply recalling all
the lexicons, LT easily regains the ability to parse
long sentences. Using an even more natural rep-
resentation, PT outperforms them with a perfor-
mance on par with the strong baseline. It again

593

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
5
7
2
1
3
4
4
9
5

/

/
t

je

un
c
_
un
_
0
0
5
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Chiffre 1: Factors impacting each task: the rate of OOV tokens for POS, the rate of unseen entities for NER, le
sentence length for CON, and the head-dependent distance for DEP.

supports our intuition that natural language is
beneficial for pretrained S2S.

For DEP, the distance between each depen-
dent and its head is used to factorize the overall
performance. As shown in Figure 1d, the gap
between S2S models and the baseline increases
with head-dependent distance. The degeneration
of relatively longer arc-standard transition se-
quences could be attributed to the static oracle
used in finetuning.

Comparing the three schemas across all sub-
groupes, LT uses the most special
tokens but
performs the worst, while PT uses zero special
tokens and outperforms the rest two. It suggests
that special tokens could harm the performance
of the pretrained S2S model as they introduce
a mismatch between pretraining and finetuning.
With zero special tokens, PT is most similar to
natural language, and it also introduces no ex-
tra parameters in finetuning, leading to better
performance.

7 Conclusion

We aim to unleash the true potential of S2S
models for sequence tagging and structure parsing.
To this end, we develop S2S methods that rival
state-of-the-art approaches more complicated than
ours, without substantial task-specific architecture
modifications. Our experiments with three novel
prompting schemas on four core NLP tasks dem-
onstrated the effectiveness of natural language
in S2S outputs. Our systematic analysis revealed
the pros and cons of S2S models, appealing for
more exploration of structure prediction with S2S.
Our proposed S2S approach reduces the need
for many heavily engineered task-specific archi-
tectures. It can be readily extended to multi-task
and few-shot learning. We have a vision of S2S
playing an integral role in more language under-
standing and generation systems. The limitation

of our approach is its relatively slow decoding
speed due to serial generation. This issue can be
mitigated with non-autoregressive generation and
model compression techniques in the future.

Remerciements

We would like to thank Emily Pitler, Cindy
Robinson, Ani Nenkova, and the anonymous
TACL reviewers for their insightful and thoughtful
feedback on the early drafts of this paper.

Les références

Alan Akbik, Tanja Bergmann, and Roland
Vollgraf. 2019. Pooled contextualized embed-
dings for named entity recognition. En Pro-
ceedings of the 2019 Conference of the North
the Association for
American Chapter of
Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short
Papers), pages 724–728, Minneapolis,Minnesota.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/N19
-1078

Alan Akbik, Duncan Blythe, and Roland Vollgraf.
2018. Contextual String Embeddings for Se-
quence Labeling. In Proceedings of the 27th
International Conference on Computational
Linguistics, COLING’18, pages 1638–1649.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua
Bengio. 2015. Neural machine translation by
jointly learning to align and translate. In 3rd
International Conference on Learning Repre-
sentations, ICLR 2015, San Diego, Californie, Etats-Unis,
May 7–9, 2015, Conference Track Proceedings.

Xuefeng Bai, Yulong Chen, and Yue Zhang.
2022. Graph pre-training for AMR parsing

594

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
5
7
2
1
3
4
4
9
5

/

/
t

je

un
c
_
un
_
0
0
5
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

and generation. In Proceedings of the 60th
Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers),
pages 6001–6015, Dublin, Ireland. Associa-
tion for Computational Linguistics.

Jonathan Berant and Percy Liang. 2014. Seman-
tic parsing via paraphrasing. In Proceedings
of the 52nd Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 1:
Long Papers), pages 1415–1425, Baltimore,
Maryland. Association for Computational Lin-
guistics. https://doi.org/10.3115/v1
/P14-1133

Michele Bevilacqua, Rexhina Blloshmi, et
Roberto Navigli. 2021. One spring to rule them
les deux: Symmetric AMR semantic parsing and
generation without a complex pipeline. En Pro-
ceedings of AAAI. https://est ce que je.org/10
.1609/aaai.v35i14.17489

Bernd Bohnet, Ryan McDonald, Gonc¸alo
Sim˜oes, Daniel Andor, Emily Pitler, et
Joshua Maynez. 2018. Morphosyntactic tag-
ging with a Meta-BiLSTM model over context
sensitive token encodings. In Proceedings of
the 56th Annual Meeting of
the Associa-
tion for Computational Linguistics, ACL’18,
pages 2642–2652. https://est ce que je.org/10
.18653/v1/P18-1246

Tom Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared D. Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen Krueger,
Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel Ziegler, Jeffrey Wu, Clemens Winter,
Chris Hesse, Mark Chen, Eric Sigler, Mateusz
Litwin, Scott Gray, Benjamin Chess, Jack
Clark, Christopher Berner, Sam McCandlish,
Alec Radford,
Ilya Sutskever, and Dario
Amodei. 2020. Language models are few-
Dans-
shot
formation Processing Systems, volume 33,
pages 1877–1901. Curran Associates, Inc.

In Advances in Neural

learners.

Jiawei Chen, Qing Liu, Hongyu Lin, Xianpei
Han, and Le Sun. 2022. Few-shot named en-
tity recognition with self-describing networks.
In Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 5711–5722,

595

Dublin,
Ireland. Association for Computa-
tional Linguistics. https://est ce que je.org/10
.18653/v1/2022.acl-long.392

Lingzhen Chen and Alessandro Moschitti. 2018.
Learning to progressively recognize new
named entities with sequence to sequence
models. In Proceedings of
the 27th Inter-
national Conference on Computational Lin-
guistics, pages 2181–2191, Santa Fe, Nouveau
Mexico, Etats-Unis. Association for Computational
Linguistics.

Jason Chiu and Eric Nichols. 2016. Named entity
recognition with bidirectional LSTM-CNNs.
Transactions of the Association for Computa-
tional Linguistics, 4:357–370. https://est ce que je
.org/10.1162/tacl_a_00104

Chinmay Choudhary and Colm O’riordan. 2021.
End-to-end mBERT based seq2seq enhanced
dependency parser with linguistic typology
connaissance. In Proceedings of the 17th Inter-
national Conference on Parsing Technologies
and the IWPT 2021 Shared Task on Parsing
into Enhanced Universal Dependencies (IWPT
2021), pages 225–232, En ligne. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2021.iwpt-1.24

Kenneth Church and Ramesh Patil. 1982.
Coping with syntactic ambiguity or how to
put the block in the box on the table. Amer-
ican Journal of Computational Linguistics,
8(3–4):139–149.

Kevin Clark, Minh-Thang Luong, Christopher D.
Manning, and Quoc Le. 2018. Semi-supervised
sequence modeling with cross-view training.
In Proceedings of
le 2018 Conference on
Empirical Methods in Natural Language Pro-
cessation, pages 1914–1925, Brussels, Belgium.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/D18
-1217

Leyang Cui, Yu Wu, Jian Liu, Sen Yang, et
Yue Zhang. 2021. Template-based named en-
tity recognition using BART. In Findings of
the Association for Computational Linguis-
tics: ACL-IJCNLP 2021, pages 1835–1845,
En ligne. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2021.findings-acl.161

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
5
7
2
1
3
4
4
9
5

/

/
t

je

un
c
_
un
_
0
0
5
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

Marie-Catherine De Marneffe and Christopher
D. Manning. 2008. The Stanford typed de-
pendencies representation. In COLING 2008:
Proceedings of
the Workshop on Cross-
framework and Cross-domain Parser Evalu-
ation, pages 1–8. https://est ce que je.org/10
.3115/1608858.1608859

Daniel Deutsch, Shyam Upadhyay, and Dan
Roth. 2019. A general-purpose algorithm for
constrained sequential inference. In Proceed-
ings of
the 23rd Conference on Computa-
tional Natural Language Learning (CoNLL),
pages 482–492, Hong Kong, Chine. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/K19-1045

Jacob Devlin, Ming-Wei Chang, Kenton Lee, et
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of
le 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for Com-
putational Linguistics.

Daniel Fern´andez-Gonz´alez and Carlos G´omez-
Rodr´ıguez. 2020. Enriched in-order lineariza-
tion for faster sequence-to-sequence constituent
parsing. In Proceedings of the 58th Annual
Meeting of the Association for Computational
Linguistics, pages 4092–4099, En ligne. Associ-
ation for Computational Linguistics.

Abbas Ghaddar and Phillippe Langlais. 2018.
Robust lexical features for improved neural
network named-entity recognition. In Proceed-
ings of the 27th International Conference on
Computational Linguistics, pages 1896–1907,
Santa Fe, New Mexico, Etats-Unis. Association for
Computational Linguistics.

Yoav Goldberg and Joakim Nivre. 2012. UN
dynamic oracle for arc-eager dependency
parsing. In Proceedings of COLING 2012,
pages 959–976, Mumbai, India. The COLING
2012 Organizing Committee.

Tatsunori B. Hashimoto, Kelvin Guu, Yonatan
Oren, and Percy S. Liang. 2018. A retrieve-
and-edit framework for predicting structured
outputs. Advances in Neural Information Pro-
cessing Systems, 31.

Han He and Jinho Choi. 2020. Establishing
strong baselines for the new decade: Sequence
tagging, syntactic and semantic parsing with
bert. In The Thirty-Third International Flairs
Conference.

Han He and Jinho D. Choi. 2021un. Levi graph
AMR parser using heterogeneous attention. Dans
Proceedings of
the 17th International Con-
ference on Parsing Technologies and the
IWPT 2021 Shared Task on Parsing into En-
hanced Universal Dependencies (IWPT 2021),
pages 50–57, En ligne. Association for Compu-
tational Linguistics. https://est ce que je.org/10
.18653/v1/2021.iwpt-1.5

Han He and Jinho D. Choi. 2021b. The stem
cell hypothesis: Dilemma behind multi-task
learning with transformer encoders. En Pro-
ceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing,
pages 5555–5577, Online and Punta Cana,
Dominican Republic. Association for Compu-
tational Linguistics. https://est ce que je.org/10
.18653/v1/2021.emnlp-main.451

Chris Hokamp and Qun Liu. 2017. Lexically
constrained decoding for sequence generation
using grid beam search. In Proceedings of the
55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Pa-
pers), pages 1535–1546, Vancouver, Canada.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/P17
-1141

Robin Jia and Percy Liang. 2016. Data recom-
bination for neural semantic parsing. En Pro-
ceedings of
the 54th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 12–22, Berlin,
Allemagne. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/P16-1002

John D. Lafferty, Andrew McCallum, et
Fernando C. N. Pereira. 2001. Conditional
random fields: Probabilistic models for seg-
menting and labeling sequence data. In ICML,
pages 282–289.

Guillaume Lample, Miguel Ballesteros, Sandeep
Subramanien, Kazuya Kawakami, and Chris
Dyer. 2016. Neural architectures for named en-
tity recognition. In Proceedings of the 2016

596

je

D
o
w
n
o
un
d
e
d

F
r
o
m
h

t
t

p

:
/
/

d
je
r
e
c
t
.

m

je
t
.

e
d
toi

/
t

un
c
je
/

je

un
r
t
je
c
e

p
d

F
/

d
o

je
/

.

1
0
1
1
6
2

/
t

je

un
c
_
un
_
0
0
5
5
7
2
1
3
4
4
9
5

/

/
t

je

un
c
_
un
_
0
0
5
5
7
p
d

.

F

b
oui
g
toi
e
s
t

t

o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3

the North American Chap-
Conference of
the Association for Computational
ter of
Linguistics: Human Language Technologies,
pages 260–270, San Diego, California. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/N16-1030

Mike Lewis, Yinhan Liu, Naman Goyal,
Marjan Ghazvininejad, Abdelrahman Mohamed,
Omer Levy, Veselin Stoyanov, and Luke
Zettlemoyer. 2020. BART: Denoising sequence-
to-sequence pre-training for natural language
generation, translation, and comprehension. Dans
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics,
pages 7871–7880, En ligne. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.acl-main.703

Peng-Hsuan Li, Ruo-Ping Dong, Yu-Siang Wang,
Ju-Chieh Chou, and Wei-Yun Ma. 2017. Lev-
eraging linguistic structures for named en-
tity recognition with bidirectional recursive
neural networks. In Proceedings of the 2017
Conference on Empirical Methods in Natu-
ral Language Processing, pages 2664–2669,
Copenhagen, Denmark. Association for Com-
putational Linguistics.

Zuchao Li, Jiaxun Cai, Shexia He, and Hai Zhao.
2018. Seq2seq dependency parsing. In Proceed-
ings of the 27th International Conference on
Computational Linguistics, pages 3203–3214,
Santa Fe, New Mexico, Etats-Unis. Association for
Computational Linguistics.

Jiangming Liu and Yue Zhang. 2017. In-order
transition-based constituent parsing. Transac-
tions of
the Association for Computational
Linguistics, 5:413–424. https://doi.org
/10.1162/tacl_a_00070

Chunpeng Ma, Lemao Liu, Akihiro Tamura,
Tiejun Zhao, and Eiichiro Sumita. 2017. De-
terministic attention for sequence-to-sequence
constituent parsing. In Thirty-First AAAI Con-
ference on Artificial Intelligence. https://
doi.org/10.1609/aaai.v31i1.10967

Mitchell P. Marcus, Mary Ann Marcinkiewicz,
and Beatrice Santorini. 1993. Building a
Large Annotated Corpus of English: Le
Penn Treebank. Computational Linguistics,
19(2):313–330. https://doi.org/10.21236
/ADA273556

Khalil Mrini, Franck Dernoncourt, Quan Hung
Tran, Trung Bui, Walter Chang, and Ndapa
Nakashole. 2020. Rethinking self-attention:
Towards interpretability in neural parsing. Dans
Findings of the Association for Computational
Linguistics: EMNLP 2020, pages 731–742,
En ligne. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2020.findings-emnlp.65

Joakim Nivre. 2004. Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together, pages 50–57, Barcelona, Spain. Association for Computational Linguistics. https://doi.org/10.3115/1613148.1613156

Giovanni Paolini, Ben Athiwaratkun, Jason Krone, Jie Ma, Alessandro Achille, Rishita Anubhai, Cicero Nogueira dos Santos, Bing Xiang, and Stefano Soatto. 2021. Structured prediction as translation between augmented natural languages. In International Conference on Learning Representations.

Matthew E. Peters, Mark Neumann, Mohit Iyyer,
Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextu-
alized word representations. In Proceedings of
the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 2227–2237,
New Orleans, Louisiana. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/N18-1202

Sameer Pradhan, Alessandro Moschitti, Nianwen
Xue, Hwee Tou Ng, Anders Björkelund, Olga
Uryupina, Yuchen Zhang, and Zhi Zhong.
2013. Towards Robust Linguistic Analysis
using OntoNotes. In Proceedings of the Seven-
teenth Conference on Computational Natural
Language Learning, pages 143–152, Sofia,
Bulgaria. Association for Computational Lin-
guistics.

Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, Ilya Sutskever, et al.
2019. Language models are unsupervised mul-
titask learners. OpenAI blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael

Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. Journal
of Machine Learning Research, 21(140):1–67.

Kenji Sagae and Alon Lavie. 2005. A classifier-based parser with linear run-time complexity. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 125–132, Vancouver, British Columbia. Association for Computational Linguistics. https://doi.org/10.3115/1654494.1654507

Richard Shin, Christopher Lin, Sam Thomson,
Charles Chen, Subhro Roy, Emmanouil
Antonios Platanios, Adam Pauls, Dan Klein,
Jason Eisner, and Benjamin Van Durme. 2021.
Constrained language models yield few-shot
semantic parsers. In Proceedings of the 2021
Conference on Empirical Methods in Natu-
ral Language Processing, pages 7699–7715,
Online and Punta Cana, Dominican Republic.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021
.emnlp-main.608

Jana Straková, Milan Straka, and Jan Hajic.
2019. Neural architectures for nested NER
through linearization. In Proceedings of the
57th Annual Meeting of the Association for
Computational Linguistics, pages 5326–5331,
Florence, Italy. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/P19-1527

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le.
2014. Sequence to sequence learning with
neural networks. Advances in Neural Informa-
tion Processing Systems, 27.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147. https://doi.org/10.3115/1119176.1119195

Bo-Hsiang Tseng, Jianpeng Cheng, Yimai Fang, and David Vandyke. 2020. A generative model for joint natural language understanding and generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1795–1807, Online.
Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.163

Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Łukasz Kaiser, and Illia Polosukhin.
2017. Attention is all you need. In Advances
in Neural Information Processing Systems,
pages 5998–6008.

Oriol Vinyals, Meire Fortunato, and Navdeep
Jaitly. 2015un. Pointer networks. In Advances
in Neural Information Processing Systems,
volume 28.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav
Petrov, Ilya Sutskever, and Geoffrey Hinton.
2015b. Grammar as a foreign language. In
Advances in Neural Information Processing
Systems, volume 28. Curran Associates, Inc.

Xinyu Wang, Yong Jiang, Nguyen Bach, Tao
Wang, Zhongqiang Huang, Fei Huang, et
Kewei Tu. 2021. Automated concatenation
of embeddings for structured prediction. In
Proceedings of the 59th Annual Meeting of
the Association for Computational Linguistics
and the 11th International Joint Conference
on Natural Language Processing (Volume 1:
Long Papers), pages 2643–2660, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021
.acl-long.206

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA.

Sam Wiseman and Alexander M. Rush. 2016.
Sequence-to-sequence learning as beam-search
optimization. In Proceedings of the 2016 Con-
ference on Empirical Methods in Natural Lan-
guage Processing, pages 1296–1306, Austin,
Texas. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/D16-1137

Lu Xu, Zhanming Jie, Wei Lu, and Lidong Bing.
2021. Better feature integration for named entity
recognition. In Proceedings of the 2021 Con-
ference of the North American Chapter of the

Association for Computational Linguistics: Hu-
man Language Technologies, pages 3457–3469,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2021.naacl-main.271

Ikuya Yamada, Akari Asai, Hiroyuki Shindo,
Hideaki Takeda, and Yuji Matsumoto. 2020.
LUKE: Deep contextualized entity repre-
sentations with entity-aware self-attention.
In Proceedings of
the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6442–6454, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.emnlp-main.523

Hang Yan, Tao Gui, Junqi Dai, Qipeng Guo,
Zheng Zhang, and Xipeng Qiu. 2021. A uni-
fied generative framework for various NER
subtasks. In Proceedings of the 59th Annual
Meeting of the Association for Computational
Linguistics and the 11th International Joint
Conference on Natural Language Processing
(Volume 1: Long Papers), pages 5808–5822,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2021.acl-long.451

Songlin Yang and Kewei Tu. 2022. Bottom-up constituency parsing and nested named entity recognition with pointer networks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2403–2416, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.171

Deming Ye, Yankai Lin, Peng Li, and Maosong
Sun. 2022. Packed levitated marker for entity
and relation extraction. In Proceedings of the
60th Annual Meeting of the Association for

Computational Linguistics (Volume 1: Long
Papers), pages 4904–4917, Dublin, Ireland.
Association for Computational Linguistics.

Juntao Yu, Bernd Bohnet, and Massimo Poesio.
2020. Named entity recognition as dependency
parsing. In Proceedings of the 58th Annual
Meeting of the Association for Computational
Linguistics, pages 6470–6476, Online. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/2020.acl-main
.577

Zhirui Zhang, Shujie Liu, Mu Li, Ming
Zhou, and Enhong Chen. 2017. Stack-based
multi-layer attention for transition-based de-
pendency parsing. In Proceedings of the 2017
Conference on Empirical Methods in Natu-
ral Language Processing, pages 1677–1682,
Copenhagen, Denmark. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/D17-1175

Enwei Zhu and Jinpeng Li. 2022. Boundary
smoothing for named entity recognition. In
Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 7096–7108,
Dublin, Ireland. Association for Computational
Linguistics. https://doi.org/10.18653/v1
/2022.acl-long.490

Huiming Zhu, Chunhui He, Yang Fang, and
Weidong Xiao. 2020. Fine grained named en-
tity recognition via seq2seq framework. IEEE
Access, 8:53953–53961. https://doi.org
/10.1109/ACCESS.2020.2980431

Muhua Zhu, Yue Zhang, Wenliang Chen, Min
Zhang, and Jingbo Zhu. 2013. Fast and accurate
shift-reduce constituent parsing. In Proceedings
of the 51st Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 1:
Long Papers), pages 434–443, Sofia, Bulgaria.
Association for Computational Linguistics.
