
Unleashing the True Potential of Sequence-to-Sequence Models
for Sequence Tagging and Structure Parsing

Han He
Department of Computer Science
Emory University
Atlanta, GA 30322, USA
han.he@emory.edu

Jinho D. Choi
Department of Computer Science
Emory University
Atlanta, GA 30322, USA
jinho.choi@emory.edu

Abstract

Sequence-to-Sequence (S2S) models have achieved remarkable success on various text generation tasks. However, learning complex structures with S2S models remains challenging as external neural modules and additional lexicons are often supplemented to predict non-textual outputs. We present a systematic study of S2S modeling using constrained decoding on four core tasks: part-of-speech tagging, named entity recognition, constituency parsing, and dependency parsing, to develop efficient exploitation methods costing zero extra parameters. In particular, 3 lexically diverse linearization schemas and corresponding constrained decoding methods are designed and evaluated. Experiments show that although more lexicalized schemas yield longer output sequences that require heavier training, their sequences being closer to natural language makes them easier to learn. Moreover, S2S models using our constrained decoding outperform other S2S approaches using external resources. Our best models perform better than or comparably to the state-of-the-art for all 4 tasks, lighting a promise for S2S models to generate non-sequential structures.

1 Introduction

Sequence-to-Sequence (S2S) models pretrained for language modeling (PLM) and denoising objectives have been successful on a wide range of NLP tasks where both inputs and outputs are sequences (Radford et al., 2019; Raffel et al., 2020; Lewis et al., 2020; Brown et al., 2020). However, for non-sequential outputs like trees and graphs, a procedure called linearization is often required to flatten them into ordinary sequences (Li et al., 2018; Fernández-González and Gómez-Rodríguez, 2020; Yan et al., 2021; Bevilacqua et al., 2021; He and Choi, 2021a), where labels in non-sequential structures are mapped heuristically as individual tokens in sequences, and numerical properties like indices are either predicted using an external decoder such as Pointer Networks (Vinyals et al., 2015a) or cast to additional tokens in the vocabulary. While these methods are found to be effective, we hypothesize that S2S models can learn complex structures without adapting such patches.
To challenge the limit of S2S modeling, BART (Lewis et al., 2020) is finetuned on four tasks without extra decoders: part-of-speech tagging (POS), named entity recognition (NER), constituency parsing (CON), and dependency parsing (DEP). Three novel linearization schemas are introduced for each task: label sequence (LS), label with text (LT), and prompt (PT). LS to PT feature an increasing number of lexicons and a decreasing number of labels, which are not in the vocabulary (Section 3). Every schema is equipped with a constrained decoding algorithm searching over valid sequences (Section 4).

Our experiments on three popular datasets depict that S2S models can learn these linguistic structures without external resources such as index tokens or Pointer Networks. Our best models perform on par with or better than the other state-of-the-art models for all four tasks (Section 5). Finally, a detailed analysis is provided to compare the distinctive natures of our proposed schemas (Section 6).1

2 Related Work

S2S (Sutskever et al., 2014) architectures have
been effective on many sequential modeling tasks.
Conventionally, S2S is implemented as an en-
coder and decoder pair, where the encoder learns
input representations used to generate the output

1All our resources including source codes are publicly available: https://github.com/emorynlp/seq2seq-corenlp.


sequence via the decoder. Since the input sequence can be very long, attention mechanisms (Bahdanau et al., 2015; Vaswani et al., 2017) focusing on particular positions are often augmented to the basic architecture. With transfer learning, S2S models pretrained on large unlabeled corpora have risen to a diversity of new approaches that convert language problems into a text-to-text format (Akbik et al., 2018; Lewis et al., 2020; Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020). Among them, the tasks most related to our work are linguistic structure predictions using S2S: POS, NER, DEP, and CON.

POS has been commonly tackled as a sequence tagging task, where the input and output sequences have equal lengths. S2S, on the other hand, does not enjoy such constraints as the output sequence can be arbitrarily long. Therefore, S2S is not as popular as sequence tagging for POS. Prevailing neural architectures for POS are often built on top of a neural sequence tagger with rich embeddings (Bohnet et al., 2018; Akbik et al., 2018) and Conditional Random Fields (Lafferty et al., 2001).

NER has been cast to a neural sequence tagging task using the IOB notation (Lample et al., 2016) over the years, which benefits most from contextual word embeddings (Devlin et al., 2019; Wang et al., 2021). Early S2S-based works cast NER to a text-to-IOB transduction problem (Chen and Moschitti, 2018; Straková et al., 2019; Zhu et al., 2020), which is included as a baseline schema in Section 3.2. Yan et al. (2021) augment Pointer Networks to generate numerical entity spans, which we refrain to use because the focus of this work is purely on the S2S itself. Most recently, Cui et al. (2021) propose the first template prompting to query all possible spans against a S2S language model, which is highly simplified into a one-pass generation in our PT schema. Instead of directly prompting for the entity type, Chen et al. (2022) propose to generate its concepts first and its type later. Their two-step generation is tailored for few-shot learning, orthogonal to our approach. Moreover, our prompt approach does not rely on non-textual tokens as theirs does.

CON is a more established task for S2S models since the bracketed constituency tree is naturally a linearized sequence. Top-down tree linearizations based on brackets (Vinyals et al., 2015b) or shift-reduce actions (Sagae and Lavie, 2005) rely on a strong encoder over the sentence, while bottom-up ones (Zhu et al., 2013; Ma et al., 2017) can utilize rich features from readily built partial parses. Recently, the in-order traversal has proved superior to bottom-up and top-down in both transition-based (Liu and Zhang, 2017) and S2S (Fernández-González and Gómez-Rodríguez, 2020) constituency parsing. Most recently, a Pointer Networks augmented approach (Yang and Tu, 2022) is ranked top among S2S approaches. Since we are interested in the potential of S2S models without patches, a naive bottom-up baseline and its novel upgrades are studied in Section 3.3.
DEP has been underexplored as S2S due to the linearization complexity. The first S2S work maps a sentence to a sequence of source sentence words interleaved with the arc-standard, reduce-actions in its parse (Wiseman and Rush, 2016), which is adopted as our LT baseline in Section 3.4. Zhang et al. (2017) introduce a stack-based multi-layer attention mechanism to leverage structural linguistics information from the decoding stack in arc-standard parsing. Arc-standard is also used in our LS baseline; however, we use no such extra layers. Apart from transition parsing, Li et al. (2018) directly predict the relative head position instead of the transition. This schema is later extended to multilingual and multitasking by Choudhary and O'riordan (2021). Their encoder and decoder use different vocabularies, while in our PT setting, we re-use the vocabulary in the S2S language model.

S2S appears to be more prevailing for semantic parsing due to two reasons. First, synchronous context-free grammar bridges the gap between natural text and meaning representation for S2S. It has been employed to obtain silver annotations (Jia and Liang, 2016), and to generate canonical natural language paraphrases that are easier to learn for S2S (Shin et al., 2021). This trend of insights viewing semantic parsing as prompt-guided generation (Hashimoto et al., 2018) and paraphrasing (Berant and Liang, 2014) has also inspired our design of PT. Second, the flexible input/output format of S2S facilitates joint learning of semantic parsing and generation. Latent variable sharing (Tseng et al., 2020) and unified pretraining (Bai et al., 2022) are two representative joint modeling approaches, which could be augmented with our idea of the PT schema as a potentially more effective linearization.

Our finding that core NLP tasks can be solved
using LT overlaps with the Translation between
Augmented Natural Languages (Paolini et al.,


2021). However, we take one step further to study the impacts of textual tokens in schema design choices. Our constrained decoding is similar to existing work (Hokamp and Liu, 2017; Deutsch et al., 2019; Shin et al., 2021). We craft constrained decoding algorithms for our proposed schemas and provide a systematic ablation study in Section 6.1.

3 Schemas

This section presents our output schemas for POS, NER, CON, and DEP in Table 1. For each task, 3 lexically diverse schemas are designed as follows to explore the best practice for structure learning. First, Label Sequence (LS) is defined as a sequence of labels consisting of a finite set of task-related labels, that are merged into the S2S vocabulary, with zero text. Second, Label with Text (LT) includes tokens from the input text on top of the labels such that it has a medium number of labels and text. Third, PrompT (PT) gives a list of sentences describing the linguistic structure in natural language with no label. We hypothesize that the closer the output is to natural language, the more advantage the S2S takes from the PLM.

3.1 Part-of-Speech Tagging (POS)

LS LS defines the output as a sequence of POS tags. Formally, given an input sentence of n tokens x = {x1, x2, ··· , xn}, its output is a tag sequence of the same length yLS = {y1, y2, ··· , yn}. Distinguished from sequence tagging, any LS output sequence is terminated by the ‘‘end-of-sequence’’ (EOS) token, which is omitted from yLS for simplicity. Predicting POS tags often depends on their neighbor contexts. We challenge that the autoregressive decoder of a S2S model can capture this dependency through self-attention.

LT For LT, the token from the input is inserted before its corresponding tag. Formally, the output is defined as yLT = {(x1, y1), (x2, y2), .., (xn, yn)}. Both x and y are part of the output and the S2S model is trained to generate each pair sequentially.

PT PT is a human-readable text describing the POS sequence. Specifically, we use a phrase yPT_i = ‘‘xi is y'_i’’ for the i-th token, where y'_i is the definition of a POS tag yi, e.g., a noun. The final prompt is then the semicolon concatenation of all phrases: yPT = yPT_1 ; yPT_2 ; ··· ; yPT_n.
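To make the three schemas concrete, the sketch below (Python) builds the LS, LT, and PT outputs for a POS-tagged sentence; the DESCRIPTION glossary is an illustrative subset of the tag-to-phrase mapping, not the full mapping used in our experiments.

# Sketch: building the three POS linearizations from a tagged sentence.
DESCRIPTION = {"PRP$": "a possessive pronoun", "NN": "a singular noun",
               "WP": "a wh-pronoun", "VBZ": "a 3rd person singular present verb",
               "IN": "a preposition or subordinating conjunction",
               "NNP": "a proper noun"}

def pos_ls(tags):
    return " ".join(tags)                                        # LS: labels only

def pos_lt(tokens, tags):
    return " ".join(f"{w}/{t}" for w, t in zip(tokens, tags))    # LT: token/tag pairs

def pos_pt(tokens, tags):
    return "; ".join(f'"{w}" is {DESCRIPTION[t]}'                # PT: natural-language prompt
                     for w, t in zip(tokens, tags))

tokens = ["My", "friend", "who", "lives", "in", "Orlando"]
tags = ["PRP$", "NN", "WP", "VBZ", "IN", "NNP"]
print(pos_ls(tags))
print(pos_lt(tokens, tags))
print(pos_pt(tokens, tags))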

3.2 Named Entity Recognition (NER)

LS LS of an input sentence comprising n tokens x = {x1, x2, ··· , xn} is defined as the BIEOS tag sequence yLS = {y1, y2, ··· , yn}, which labels each token as the Beginning, Inside, End, Outside, or Single-token entity.

LT LT uses a pair of entity type labels to wrap each entity: yLT = .., B-yj, xi, .., xi+k, E-yj, .., where yj is the type label of the j-th entity consisting of k tokens.

i’’, where y(西德:3)

PT PT is defined as a list of sentences describ-
i =‘‘xi is y(西德:3)
ing each entity: yPT
i is
the definition of a NER tag yi, 例如, a person.
Different from the prior prompt work (Cui et al.,
2021), our model generates all entities in one
pass which is more efficient than their brute-force
方法.
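As a concrete illustration, the sketch below derives the LT and PT outputs from token-level entity spans; the spans are given as (start, end, type) with an exclusive end index, and DESC is an illustrative stand-in for the full type-description glossary.

# Sketch: deriving NER-LT and NER-PT from token-level entity spans.
DESC = {"PER": "a person", "WOA": "an art work", "GPE": "a geopolitical entity"}

def ner_lt(tokens, spans):
    out = list(tokens)
    for start, end, typ in sorted(spans, reverse=True):       # right-to-left so indices stay valid
        out[start:end] = [f"B-{typ}"] + tokens[start:end] + [f"E-{typ}"]
    return " ".join(out)

def ner_pt(tokens, spans):
    return "; ".join(f'"{" ".join(tokens[s:e])}" is {DESC[t]}' for s, e, t in spans)

tokens = "Large image of the Michael Jackson HIStory statue .".split()
spans = [(4, 6, "PER"), (6, 7, "WOA")]
print(ner_lt(tokens, spans))   # ... B-PER Michael Jackson E-PER B-WOA HIStory E-WOA statue .
print(ner_pt(tokens, spans))   # "Michael Jackson" is a person; "HIStory" is an art work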

3.3 Constituency Parsing (CON)

Schemas for CON are developed on constituency trees pre-processed by removing the first level of non-terminals (POS tags) and rewiring their children (tokens) to parents, e.g., (NP (PRON My) (NOUN friend)) → (NP My friend).

LS LS is based on a top-down shift-reduce system consisting of a stack, a buffer, and a depth record d. Initially, the stack contains only the root constituent with label TOP and depth 0; the buffer contains all tokens from the input sentence; d is set to 1. A Node-X (N-X) transition creates a new depth-d non-terminal labeled with X, pushes it to the stack, and sets d ← d + 1. A Shift (SH) transition removes the first token from the buffer and pushes it to the stack as a new terminal with depth d. A Reduce (RE) transition pops all elements with the same depth d from the stack, makes them the children of the top constituent of the stack, and sets d ← d − 1. The linearization of a constituency tree using our LS schema can be obtained by applying 3 string substitutions: replace each left bracket and the label X following it with Node-X, replace terminals with SH, and replace right brackets with RE.

LT LT is derived by reverting all SH in LS back to the corresponding tokens so that tokens in LT effectively serve as SH in our transition system.
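The three substitutions can be applied directly to the bracketed tree string; the sketch below is a minimal version that tokenizes the tree into opening brackets with labels, closing brackets, and terminals, and also derives LT by keeping the terminals in place of SH.

import re

# Sketch: linearizing a bracketed constituency tree (POS level already removed)
# into the LS and LT schemas via the three string substitutions described above.
def con_linearize(tree, keep_tokens=False):
    out = []
    for piece in re.findall(r"\([^\s()]+|\)|[^\s()]+", tree):
        if piece.startswith("("):
            out.append("N-" + piece[1:])                    # left bracket + label -> Node-X
        elif piece == ")":
            out.append("RE")                                # right bracket -> Reduce
        else:
            out.append(piece if keep_tokens else "SH")      # terminal -> Shift (LS) or token (LT)
    return " ".join(out)

tree = "(TOP (S (NP My friend) (VP bought (NP a gift))))"
print(con_linearize(tree))                    # LS: N-TOP N-S N-NP SH SH RE N-VP SH N-NP SH SH RE RE RE RE
print(con_linearize(tree, keep_tokens=True))  # LT: N-TOP N-S N-NP My friend RE N-VP bought N-NP a gift RE RE RE RE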


Table 1: Schemas for the sentence ‘‘My friend who lives in Orlando bought me a gift from Disney World’’.

PT PT is also based on a top-down linearization, although it describes a constituent using templates: ‘‘pi has {cj}’’, where pi is a constituent and the cj-s are its children. To describe a constituent, the indefinite article ‘‘a’’ is used to denote a new constituent (e.g., ‘‘. . . has a noun phrase’’). The definite article ‘‘the’’ is used for referring to an existing constituent mentioned before (e.g., ‘‘the noun phrase has’’), or describing a constituent whose children are all terminals (e.g., ‘‘. . . has the noun phrase `My friend'’’). When describing a constituent that directly follows its mention, the determiner ‘‘which’’ is used instead of repeating it multiple times (e.g., ‘‘. . . and the subordinating clause, which has’’). Sentences are joined with a semicolon ‘‘;’’ as the final prompt.

3.4 Dependency Parsing (DEP)

LS LS uses three transitions from the arc-standard system (Nivre, 2004): shift (SH), left arc (<), and right arc (>).

LT LT for DEP is obtained by replacing each
SH in a LS with its corresponding token.

PT PT is derived from its LS sequence by removing all SH. Then, for each left arc creating an arc from xj to xi with dependency relation r (e.g., a possessive modifier), a sentence is created by applying the template ‘‘xi is r of xj’’. For each right arc creating an arc from xi to xj with the dependency relation r, a sentence is created with another template ‘‘xi has r xj’’. The prompt is finalized by joining all such sentences with a semicolon.
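A minimal sketch of the PT construction, assuming the arcs are given as (head, dependent, relation) triples over 1-based token indices with 0 for the root, and REL_DESC as an illustrative relation-description glossary; in the paper the sentence order follows the arc-standard transition sequence, which the sketch does not reproduce.

# Sketch: rendering DEP-PT sentences from labeled dependency arcs.
REL_DESC = {"nsubj": "a nominal subject", "advmod": "an adverbial modifier",
            "pobj": "an object of a preposition", "root": "a root"}

def dep_pt(tokens, arcs):
    sents = []
    for head, dep, rel in arcs:                       # 1-based indices, 0 = artificial root
        head_w = "sentence" if head == 0 else tokens[head - 1]
        desc = REL_DESC[rel]
        if dep < head:                                # left arc: "x_i is r of x_j"
            sents.append(f'"{tokens[dep - 1]}" is {desc} of "{head_w}"')
        else:                                         # right arc: "x_i has r x_j"
            sents.append(f'"{head_w}" has {desc} "{tokens[dep - 1]}"')
    return "; ".join(sents)

tokens = "It looks so out of place .".split()
arcs = [(2, 1, "nsubj"), (0, 2, "root")]
print(dep_pt(tokens, arcs))  # "It" is a nominal subject of "looks"; "sentence" has a root "looks"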

4 Decoding Strategies

To ensure well-formed output sequences that match the schemas (Section 3), a set of constrained decoding strategies is designed per task except for CON, which is already tackled as S2S modeling without constrained decoding (Vinyals et al., 2015b; Fernández-González and Gómez-Rodríguez, 2020). Formally, given an input x of n tokens and any partial output y<i generated so far, a function NextY(x, y<i) returns the set of candidate tokens allowed at position i.

4.1 Part-of-Speech Tagging

LS For LS, NextY returns the POS tag set D if i ≤ n, and {EOS} otherwise.

LT The candidate set for LT depends on the parity of i, as defined in Algorithm 1.

Algorithm 1: Constrained POS-LT
Function NextY(x, y<i):
  if i > 2n then
    return {EOS}
  else
    if i is even then
      return {x_{i/2}}
    else
      return D

Algorithm 2: Prefix Matching
Function PrefixMatch(T, p):
  node ← T
  while node and p do
    node ← node.children[p_1]
    p ← p_{>1}
  return node
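The candidate function of Algorithm 1 translates directly into code; the sketch below mirrors it at the word level (the exact position indexing is an assumption, following the token-before-tag order of the LT schema). At decoding time the word-level candidates would still have to be mapped to subword ids; the prefix_allowed_tokens_fn hook of the Hugging Face generate method is one possible place to apply such a mask, though our implementation may differ.

# Sketch of the POS-LT candidate function at the word level.
EOS = "<eos>"

def next_y_pos_lt(x, generated, D):
    i = len(generated) + 1            # 1-based index of the position about to be filled
    if i > 2 * len(x):
        return {EOS}                  # every (token, tag) pair has been emitted
    if i % 2 == 1:
        return {x[(i - 1) // 2]}      # token positions: copy the next input token
    return set(D)                     # tag positions: any POS tag is allowed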

PT The PT generation can be divided into two stages: token and ‘‘is-a-tag’’ statement generation. A binary status u is used to indicate whether yi is expected to be a token. To generate a token, an integer k ≤ n is used to track the index of the next token. To generate an ‘‘is-a-tag’’ statement, an offset o is used to track the beginning of an ‘‘is-a-tag’’ statement. Each description dj of a POS tag tj is extended to a suffix sj = ‘‘is dj ; ’’.

Suffixes are stored in a trie tree T to facilitate prefix matching between a partially generated statement and all candidate suffixes, as shown in Algorithm 2. The full decoding is depicted in Algorithm 3.

4.2 Named Entity Recognition

LS Similar to POS-LS, the NextY for NER
returns BIEOS tags if i ≤ n else EOS.

LT Opening tags (<>) in NER-LT are grouped into a vocabulary O. The last generated output token yi−1 (assuming y0 = BOS, i.e., the beginning of a sentence) is looked up in O to decide what type of token will be generated next. To enforce label consistency between a pair of tags, a variable e is introduced to record the expected closing tag. Reusing the definition of k in Algorithm 3, decoding of NER-LT is described in Algorithm 4.
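Algorithm 4 is not reproduced above; the sketch below is one way to realize the described bookkeeping, with OPEN standing in for the opening-tag vocabulary O, e for the expected closing tag, and k for the index of the next input token to copy. It illustrates the idea rather than the exact algorithm.

# Sketch: constrained NER-LT decoding state.
EOS = "<eos>"
OPEN = {"B-PER": "E-PER", "B-WOA": "E-WOA"}   # illustrative opening -> closing tag pairs

class NerLtDecoder:
    def __init__(self, x):
        self.x, self.k, self.e = x, 0, None   # input tokens, next-token index, expected closing tag

    def next_y(self):
        if self.k == len(self.x):             # every input token has been copied
            return {self.e} if self.e else {EOS}
        cand = {self.x[self.k]}               # copying the next input token is always allowed
        if self.e:
            cand.add(self.e)                  # inside an entity: may close it
        else:
            cand.update(OPEN)                 # outside an entity: may open a new one
        return cand

    def step(self, y):                        # update the state with the emitted symbol
        if y in OPEN:
            self.e = OPEN[y]
        elif y == self.e:
            self.e = None
        elif y != EOS:
            self.k += 1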


Algorithm 3: Constrained POS-PT
u ← true, k ← 0, o ← 0
Function NextY(x, y<i):
  if node.children is empty then
    u ← true
    return NextY(x, y<i)
  if k > n then
    Y ← Y ∪ {EOS}
  else
    Y ← Y ∪ {x_k}
  return Y

PT For each entity type ei, its description di is filled into the template ‘‘is di;’’ to create an ‘‘is-a’’ suffix si. Since the prompt is constructed using text while the number of entities is variable, it is not straightforward to tell whether a token belongs to an entity or an ‘‘is-a’’ suffix. Therefore, a noisy segmentation procedure is utilized to split a phrase into two parts: entity and ‘‘is-a’’ suffix. Each si is collected into a trie S to perform segmentation of a partially generated phrase p (Algorithm 5).

Once a segment is obtained, the decoder is
constrained to generate the entity or the suffix.
For the generation of an entity, string matching
is used to find every occurrence o of its partial
generation in x and add the following token xo+1

587

Algorithm 5: Segmentation
Function Segment(S, p):
  for i ← 1 to |p| do
    entity, suffix ← p≤i, p>i
    node ← PrefixMatch(S, suffix)
    if node then
      return entity, suffix, node
  return null
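Algorithms 2 and 5 map naturally onto a word-level trie; the sketch below is a minimal Python rendering of prefix matching over ‘‘is-a’’ suffixes and the noisy segmentation of a partially generated phrase. The suffix strings are illustrative examples.

# Minimal trie with PrefixMatch (Algorithm 2) and Segment (Algorithm 5).
class Trie:
    def __init__(self):
        self.children, self.end = {}, False

    def add(self, words):
        node = self
        for w in words:
            node = node.children.setdefault(w, Trie())
        node.end = True

def prefix_match(trie, words):
    node = trie
    for w in words:                          # follow the trie as far as the prefix allows
        node = node.children.get(w)
        if node is None:
            return None
    return node

def segment(trie, phrase):
    for i in range(1, len(phrase)):          # try every split of the phrase
        entity, suffix = phrase[:i], phrase[i:]
        node = prefix_match(trie, suffix)
        if node is not None:
            return entity, suffix, node
    return None

suffix_trie = Trie()
suffix_trie.add("is a person ;".split())
suffix_trie.add("is an art work ;".split())
print(segment(suffix_trie, "Michael Jackson is a".split()))
# -> (['Michael', 'Jackson'], ['is', 'a'], <trie node reached by "is a">)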

Algorithm 6: Constrained NER-PT
Function NextY(x, y<i):

4.3 Constituency Parsing

A generated CON-PT prompt consists of two types of sentences: sentences creating new constituents, and sentences attaching new constituents to existing ones. Splitting is done by longest-prefix-matching (Algorithm 7) using a trie T built with the definite and indefinite article versions of the description of each constituent label, e.g., ‘‘the noun phrase’’ and ‘‘a noun phrase’’ of NP. Algorithm 8 describes the splitting procedure.

Algorithm 8: Split
Function Split(T, x):
  ...
  if i > o then
    spans ← spans ∪ {(o, i, null)}
  spans ← spans ∪ {(i, j, v)}
  ...
  if o < |x| + 1 then
    spans ← spans ∪ {(o, |x| + 1, null)}
  return spans

Once a prompt is split into the two types of sentences, a constituency tree is then built accordingly. We use a variable parent to track the last constituent that gets attachments, and another variable latest to track the current new constituent that gets created. Due to the top-down nature of the linearization, the target constituent that new constituents are attached to is always among the siblings of either parent or the ancestors of parent. The search of the target constituent is described in Algorithm 9. Algorithm 10 shows the final reverse linearization.

Algorithm 9: Find Target
Function FindTarget(parent, label):
  while parent do
    foreach sibling of parent do
      if sibling.label is label and sibling has no children then
        return sibling
    parent ← parent.parent
  return null

Algorithm 10: Reverse CON-PT
Function Reverse(T, x):
  root ← parent ← new TOP-tree
  latest ← null
  foreach (i, j, v) ∈ Split(T, x) do
    if v then
      if x_{i:j} starts with ‘‘the’’ then
        target ← FindTarget(parent, v)
      else
        latest ← new v-tree
        add latest to parent.children
        latest.parent ← parent
    else
      if x_{i:j} starts with ‘‘has’’ or ‘‘which has’’ then
        parent ← latest
      add tokens in ‘‘’’ into latest
  return root

4.4 Dependency Parsing

LS Arc-standard (Nivre, 2004) transitions are added to a candidate set and only transitions permitted by the current parsing state are allowed.

LT DEP-LT replaces all SH transitions with input tokens in left-to-right order. Therefore, an incremental offset is kept to generate the next token in place of each SH in DEP-LT.

PT DEP-PT is more complicated than CON-PT because each sentence contains one more token. Its generation is therefore divided into 4 possible states: first token (1st), relation (rel), second token (2ed), and semicolon. An arc-standard transition system is executed synchronously with constrained decoding since PT is essentially a simplified transition sequence with all SH removed. Let b and s be the system buffer and stack, respectively. Let c be a set of candidate tokens that will be generated in y, which initially contains all input tokens and an inserted token ‘‘sentence’’ that is only used to represent the root in ‘‘the sentence has a root . . .’’ A token is removed from c once it gets popped out of s. Since DEP-PT generates no SH, each input token xj in y effectively introduces SH(s) till it is pushed onto s at index i (i ∈ {1, 2}), as formally described in Algorithm 11.

Algorithm 11: Recall Shift
Function RecallShift(system, i, xj):
  while system.si is not xj do
    system.apply(SH)

After the first token is generated, its offset o in y is recorded such that the following relation sequence y_{i>o} can be located. To decide the next
token of y_{i>o}, it is then prefix-matched with a trie T built with the set of ‘‘has-’’ and ‘‘is-’’ dependency relations. The children of the prefix-matched node are considered candidates if it has any. Otherwise, the dependency relation is marked as completed. Once a relation is generated, the second token will be generated in a similar way. Finally, upon the completion of a sentence, the transition it describes is applied to the system and c is updated accordingly. The full procedure is described in Algorithm 12. Since a transition system has been synchronously maintained with constrained decoding, no extra reverse linearization is needed.
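For reference, a bare-bones arc-standard system of the kind the DEP decoders are synchronized with is sketched below; s1 and s2 denote the top two stack items, and apply covers SH, LA (left arc), and RA (right arc). This is a schematic illustration, not the exact implementation.

# Minimal arc-standard transition system (Nivre, 2004).
class ArcStandard:
    def __init__(self, n):
        self.buffer = list(range(1, n + 1))   # token indices 1..n
        self.stack = [0]                      # 0 is the artificial root
        self.arcs = []                        # collected (head, dependent, relation) arcs

    @property
    def s1(self):  return self.stack[-1] if self.stack else None
    @property
    def s2(self):  return self.stack[-2] if len(self.stack) > 1 else None

    def apply(self, transition, relation=None):
        if transition == "SH":                # shift: move the next buffer token to the stack
            self.stack.append(self.buffer.pop(0))
        elif transition == "LA":              # left arc: s1 -> s2, pop s2
            self.arcs.append((self.s1, self.s2, relation))
            self.stack.pop(-2)
        elif transition == "RA":              # right arc: s2 -> s1, pop s1
            self.arcs.append((self.s2, self.s1, relation))
            self.stack.pop()

    def is_terminal(self):
        return not self.buffer and len(self.stack) == 1

system = ArcStandard(2)                       # e.g., "It looks"
for t, r in [("SH", None), ("SH", None), ("LA", "nsubj"), ("RA", "root")]:
    system.apply(t, r)
print(system.arcs)                            # [(2, 1, 'nsubj'), (0, 2, 'root')]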

5 Experiments

For all tasks, BART-Large (Lewis et al., 2020) is finetuned as our underlying S2S model. We also tried T5 (Raffel et al., 2020), although its performance was less satisfactory. Every model is trained three times using different random seeds and their average scores and standard deviations on the test sets are reported. Our models are experimented on the OntoNotes 5 (Weischedel et al., 2013) using the data split suggested by Pradhan et al. (2013). In addition, two other popular datasets are used for fair comparisons to previous works: the Wall Street Journal corpus from the

Algorithm 12: Constrained DEP-PT
(status, transition, t1, t2, o) ← (1st, null, null, null, 0)
c ← {‘‘sentence’’} ∪ x
Function NextY(x, y<i):
  if node.children then
    Y ← Y ∪ {node.children}
  else
    relation ← the relation in y>o
    if y>o starts with ‘‘is’’ then
      transition ← LA-relation
    else
      transition ← RA-relation
    status ← 2ed
  else if status is 2ed then
    Y ← Y ∪ c
    status ← semicolon
  else if status is semicolon then
    t2 ← y_{i−1}
    Y ← Y ∪ {;}
    RecallShift(system, 1, t2)
    RecallShift(system, 2, t1)
    if transition starts with LA then
      remove s1 from c
    else
      remove s2 from c
    system.apply(transition)
    if system is terminal then
      Y ← Y ∪ {EOS}
    status ← 1st
  return Y

Penn Treebank 3 (Marcus et al., 1993) for POS,
DEP, and CON, as well as the English portion
of the CoNLL’03 dataset (Tjong Kim Sang and
De Meulder, 2003) for NER.


Model                    PTB             OntoNotes
Bohnet et al. (2018)     97.96           –
He and Choi (2021b)      –               98.32 ± 0.02
LS                       97.51 ± 0.11    98.21 ± 0.02
LT                       97.70 ± 0.02    98.40 ± 0.01
PT                       97.64 ± 0.01    98.37 ± 0.02

Table 2: Results for POS.

Each token is independently tokenized using
the subword tokenizer of BART and merged into
an input sequence. The boundary information for
each token is recorded to ensure full tokens are
generated in LT and PT without broken pieces.
To fit in the positional embeddings of BART, sen-
tences longer than 1,024 subwords are discarded,
which include 1 sentence from the Penn Treebank 3 training set, and 24 sentences from the OntoNotes 5 training set. Development sets and
test sets are not affected.
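A sketch of the per-token tokenization with boundary bookkeeping described above, assuming the Hugging Face BART tokenizer; the exact bookkeeping in our code base may differ, but each word maps to a contiguous span of subword ids whose boundaries can later be used to force whole-word generation.

from transformers import BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")

def encode_with_boundaries(words):
    ids, boundaries = [], []
    for i, w in enumerate(words):
        text = w if i == 0 else " " + w                        # BART's BPE is whitespace-sensitive
        piece = tokenizer(text, add_special_tokens=False)["input_ids"]
        boundaries.append((len(ids), len(ids) + len(piece)))   # [start, end) subword span of this word
        ids.extend(piece)
    return ids, boundaries

ids, spans = encode_with_boundaries("My friend who lives in Orlando".split())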

5.1 Part-of-Speech Tagging

Token-level accuracy is used as the metric for POS. LT outperforms LS although LT is twice as long as LS, suggesting that textual tokens positively impact the learning of the decoder (Table 2). PT performs almost the same as LT, perhaps due to the fact that POS is not a task requiring a powerful decoder.

5.2 Named Entity Recognition

For CoNLL'03, the provided splits without merging the development and training sets are used. For OntoNotes 5, the same splits as Chiu and Nichols (2016), Li et al. (2017), Ghaddar and Langlais (2018), and He and Choi (2020, 2021b) are used. Labeled span-level F1 score is used for evaluation.

We acknowledge that the performance of NER systems can be largely improved by rich embeddings (Wang et al., 2021), document context features (Yu et al., 2020), dependency tree features (Xu et al., 2021), and other external resources. While our focus is the potential of S2S, we mainly consider two strong baselines that also use BART as the only external resource: the generative BART-Pointer framework (Yan et al., 2021) and the recent template-based BART NER (Cui et al., 2021).

Model                      CoNLL'03        OntoNotes 5
Clark et al. (2018)        92.60           –
Peters et al. (2018)       92.22           –
Akbik et al. (2019)        93.18           –
Straková et al. (2019)     93.07           –
Yamada et al. (2020)       92.40           –
Yu et al. (2020)†          92.50           89.83
Yan et al. (2021)‡S        93.24           90.38
Cui et al. (2021)S         92.55           –
He and Choi (2021b)        –               89.04 ± 0.14
Wang et al. (2021)         94.6            –
Zhu and Li (2022)          –               91.74
Ye et al. (2022)           –               91.9
LS                         70.29 ± 0.70    84.61 ± 1.18
LT                         92.75 ± 0.03    89.60 ± 0.06
PT                         93.18 ± 0.04    90.33 ± 0.04

Table 3: Results for NER. S denotes S2S.

As shown in Table 3, LS performs the worst on both datasets, possibly attributed to the fact

that the autoregressive decoder overfits the high-order left-to-right dependencies of BIEOS tags. LT performs close to the BERT-Large biaffine model (Yu et al., 2020). PT performs comparably well with the Pointer Networks approach (Yan et al., 2021) and it outperforms the template prompting (Cui et al., 2021) by a large margin, suggesting S2S has the potential to learn structures without using external modules.

5.3 Constituency Parsing

All POS tags are removed and not used in train-
ing or evaluation. Terminals belonging to the
same non-terminal are flattened into one con-
stituent before training and unflattened in post-
processing. The standard constituent-level F-score
produced by the EVALB3 is used as the evalua-
tion metric.

Table 4 shows the results on OntoNotes 5 and PTB 3. Incorporating textual tokens into the output sequence is important on OntoNotes 5, leading to a +0.9 F-score, while it is not the case on PTB 3. It is possibly due to the fact that OntoNotes is more diverse in domains, requiring a higher utilization of pre-trained S2S for domain transfer. PT performs the best, and it has a competitive performance to recent works, despite the fact that it uses no extra decoders.

3https://nlp.cs.nyu.edu/evalb/.


Model                                               PTB 3           OntoNotes 5
Fernández-González and Gómez-Rodríguez (2020)S      91.6            –
Mrini et al. (2020)                                 96.38           –
He and Choi (2021b)                                 –               94.43 ± 0.03
Yang and Tu (2022)S                                 96.01           –
LS                                                  95.23 ± 0.08    93.40 ± 0.31
LT                                                  95.24 ± 0.04    94.32 ± 0.11
PT                                                  95.34 ± 0.06    94.55 ± 0.03

Table 4: Results for CON. S denotes S2S.

Model                         UAS             LAS
Wiseman and Rush (2016)S      91.17           87.41
Zhang et al. (2017)S          93.71           91.60
Li et al. (2018)S             94.11           92.08
Mrini et al. (2020)           97.42           96.26
LS                            92.83 ± 0.43    90.50 ± 0.53
LT                            95.79 ± 0.07    93.17 ± 0.16
PT                            95.91 ± 0.06    94.31 ± 0.09

(a) PTB results for DEP.

Model                         UAS             LAS
He and Choi (2021b)           95.92 ± 0.02    94.24 ± 0.03
LS                            86.54 ± 0.12    83.84 ± 0.13
LT                            94.15 ± 0.14    91.27 ± 0.19
PT                            94.51 ± 0.22    92.81 ± 0.21

(b) OntoNotes results for DEP.

Table 5: Results for DEP. S denotes S2S.

Model        PTB             OntoNotes
LS           97.51 ± 0.11    98.21 ± 0.02
  w/o CD     97.51 ± 0.11    98.21 ± 0.02
LT           97.70 ± 0.02    98.40 ± 0.01
  w/o CD     97.67 ± 0.02    98.39 ± 0.01
PT           97.64 ± 0.01    98.37 ± 0.02
  w/o CD     97.55 ± 0.02    98.29 ± 0.05

(a) Accuracy of ablation tests for POS.

Model        CoNLL 03        OntoNotes 5
LS           70.29 ± 0.70    84.61 ± 1.18
  w/o CD     66.33 ± 0.73    84.57 ± 1.16
LT           92.75 ± 0.03    89.60 ± 0.06
  w/o CD     92.72 ± 0.02    89.50 ± 0.07
PT           93.18 ± 0.04    90.33 ± 0.04
  w/o CD     93.12 ± 0.06    90.23 ± 0.05

(b) F1 of ablation tests for NER.

Model        PTB             OntoNotes
LS           90.50 ± 0.53    83.84 ± 0.13
  w/o CD     90.45 ± 0.47    83.78 ± 0.13
LT           93.17 ± 0.16    91.27 ± 0.19
  w/o CD     93.12 ± 0.14    91.05 ± 0.20
PT           94.31 ± 0.09    92.81 ± 0.21
  w/o CD     81.50 ± 0.27    81.76 ± 0.36

(c) LAS of ablation tests for DEP.

Table 6: Ablation test results.

5.4 Dependency Parsing

The constituency trees from PTB and OntoNotes are converted into the Stanford dependencies v3.3.0 (De Marneffe and Manning, 2008) for the DEP experiments. Forty and 1 non-projective trees are removed from the training and development sets of PTB 3, respectively. For OntoNotes 5, these numbers are 262 and 28. Test sets are not affected. As shown in Table 5, textual tokens are crucial in learning arc-standard transitions using S2S, leading to +2.6 and +7.4 LAS improvements, respectively. Although our PT method underperforms recent state-of-the-art methods, it has the strongest performance among all S2S approaches. Interestingly, our S2S model manages to learn a transition system without explicitly modeling the stack, the buffer, the partial parse, or pointers.

We believe that the performance of DEP with S2S can be further improved with a larger and more recent pretrained S2S model and a dynamic oracle (Goldberg and Nivre, 2012).

6 Analysis

6.1 Ablation Study

We perform an ablation study to show the performance gain of our proposed constrained decoding algorithms on different tasks. Constrained decoding algorithms (CD) are compared against free generation (w/o CD) where a model freely generates an output sequence that is later post-processed into task-specific structures using string-matching rules. Invalid outputs are patched to the greatest extent, e.g., POS label sequences are padded or truncated. As shown in Table 6, ablation of constrained decoding seldom impacts the performance of LS on all tasks, suggesting that the decoder of seq2seq can acclimatize to the newly added label
591

D

w
n

A
d
e
d

F
r


H

t
t

p

:
/
/

d

r
e
C
t
.


t
.

e
d

/
t

A
C

/

A
r
t

C
e

p
d

F
/

d


/

.

1
0
1
1
6
2

/
t

A
C
_
A
_
0
0
5
5
7
2
1
3
4
4
9
5

/

/
t

A
C
_
A
_
0
0
5
5
7
p
d

.

F


y
G

e
s
t

t


n
0
7
S
e
p
e


e
r
2
0
2
3

tokens. Interestingly, the less performant NER-LS model degrades the most, promoting the necessity of constrained decoding for weaker seq2seq models. The performance of LT on all tasks is marginally degraded when constrained decoding is ablated, indicating the decoder begins to generate structurally invalid outputs when textual tokens are freely generated. This type of problem seems to be exacerbated when more tokens are freely generated in the PT schemas, especially for the DEP-PT.

Unlike POS and NER, DEP is more prone to hallucinated textual tokens as early errors in the transition sequence get accumulated in the arc-standard system, which shifts all later predictions off the track. It is not yet a critical problem as LS generates no textual tokens while a textual token in LT still serves as a valid shift action even if it is hallucinated. However, a hallucinated textual token in PT is catastrophic as it could be part of any arc-standard transition. As no explicit shift transition is designed, a hallucinated token could lead to multiple instances of missing shifts in Algorithm 12.

6.2 Case Study

To facilitate understanding and comparison of different models, a concrete example of input (I), gold annotation (G), and actual model prediction per schema is provided below for each task. Wrong predictions and the corresponding ground truth are highlighted in red and teal, respectively.

POS In the following example, only PT correctly
detects the past tense (VBD) of ‘‘put’’.

I: The word I put in boldface is extremely interesting.
G: DT NN PRP VBD IN NN VBZ RB JJ .
LS: DT NN PRP VBP IN NN VBZ RB RB JJ
LT: The/DT word/NN I/PRP put/VBP in/IN

boldface/NN is/VBZ extr./RB interesting/JJ./.
PT: ‘‘The’’ is a determiner; ‘‘word’’ is a singular noun;

‘‘I’’ is a personal pronoun; ‘‘put’’ is a past tense verb;
‘‘in’’ is a preposition or subordinating conjunction;
‘‘boldface’’ is a singular noun; ‘‘is’’ is a 3rd person
singular present verb; ‘‘extremely’’ is an adverb;
‘‘interesting’’ is an adjective; ‘‘.’’ is a period.

NER In the following example, LS and LT could not correctly recognize ‘‘HIStory’’ as an art work, possibly due to its leading uppercase letters.

I: Large image of the Michael Jackson HIStory statue.
G: Large image of the [Michael Jackson]PERSON(PER) [HIStory]WOA statue.
LS: O O O O B-PER E-PER S-ORG O O O
LT: Large image of the Michael Jackson HIStory statue.
PT: ‘‘Michael Jackson’’ is a person; ‘‘HIStory’’ is an art work.

CON As highlighted with strikeout text below, LS and LT failed to parse ‘‘how much’’ as a wh-noun phrase and a wh-adverb phrase, respectively.

我: It’s crazy how much he eats.
G: (S (NP (NP It)) (VP ’s (ADJP crazy) (SBAR
(WHNP (WHADJP how much)) (S (NP he)
(VP eats)))) .)

LS: N-S N-NP N-NP SH RE RE N-VP SH N-ADJP SH
RE N-SBAR N-WHNP N-WHADVP SH SH RE RE
N-S N-NP SH RE N-VP SH RE RE RE RE SH RE
LT: (S (NP (NP It)) (VP ’s (ADJP crazy) (SBAR
(WHNP (WHADJP how much)) (S (NP he)
(VP eats)))) .)

PT: a sentence has a simple clause, which has a noun

phrase and a verb phrase and ‘‘.’’; the noun phrase
has a noun phrase ‘‘It’’, the verb phrase has ‘‘’s’’
and an adjective phrase ‘‘crazy’’ and a subordinating
条款, which has a wh-noun phrase and a simple
条款; the wh-noun phrase has a wh-adjective
phrase ‘‘how much’’, the simple clause has a noun
phrase ‘‘he’’ and a verb phrase ‘‘eats’’.

DEP In the following example, LS incorrectly
attached ‘‘so out of’’ to ‘‘place’’, and LT wrongly
attached ‘‘so’’ to ‘‘looks’’.

I: It looks so out of place.

G:
LS: SH SH LA-nsubj SH SH SH SH LA-advmod
LA-advmod LA-advmod RA-acomp SH
RA-punct RA-root

LT: It looks LA-nsubj so out of place RA-pobj

RA-pcomp RA-prep RA-ccomp . RA-punct
RA-root

PT: ‘‘It’’ is a nominal subject of ‘‘looks’’; ‘‘so’’ is an

adverbial modifier of ‘‘out’’; ‘‘of’’ has an object of
a preposition ‘‘place’’; ‘‘out’’ has a prepositional
complement ‘‘of’’; ‘‘looks’’ has a prepositional
modifier ‘‘out’’; ‘‘looks’’ has a punctuation ‘‘.’’;
‘‘sentence’’ has a root ‘‘looks’’.



6.3 Design Choices

In the interest of experimentally comparing the schema variants, we would like each design we consider to be equivalent in some systematic way. To this end, we fix other aspects and vary two dimensions of the prompt design, lexicality and verbosity, to isolate the impact of individual variables.

Model       PTB 3           OntoNotes 5
POS-PT      97.64 ± 0.01    98.37 ± 0.02
  dec.LEX   97.63 ± 0.02    98.35 ± 0.03
DEP-PT      94.31 ± 0.09    92.81 ± 0.21
  dec.LEX   93.89 ± 0.18    91.19 ± 0.86

Table 7: Study of lexicality on POS and DEP.

Model       CoNLL 03        OntoNotes 5
NER-PT      93.18 ± 0.04    90.33 ± 0.04
  inc.VRB   92.47 ± 0.03    89.63 ± 0.23

Model       PTB 3           OntoNotes 5
CON-PT      95.34 ± 0.06    94.55 ± 0.03
  inc.VRB   95.19 ± 0.06    94.02 ± 0.49

Table 8: Study of verbosity on NER and CON.

Lexicality We call the portion of textual tokens in a sequence its lexicality. Thus, LS and PT have zero and full lexicality, respectively, while LT falls in the middle. To tease apart the impact of lexicality, we substitute the lexical phrases with corresponding tag abbreviations in PT on POS and DEP, e.g., ‘‘friend’’ is a noun → ‘‘friend’’ is a NN, ‘‘friend’’ is a nominal subject of ‘‘bought’’ → ‘‘friend’’ is a nsubj of ‘‘bought’’. Tags are added to the BART vocabulary and learned from scratch as in LS and LT. As shown in Table 7, decreasing the lexicality of PT marginally degrades the performance of S2S on POS. On DEP, the performance drop is rather significant. Similar trends are observed comparing LT and LS in Section 5, confirming that lexicons play an important role in prompt design.

Verbosity Our PT schemas on NER and CON
are designed to be as concise as human narrative,
and as easy for S2S to generate. Another design
choice would be as verbose as some LS and LT
schemas. To explore this dimension, we increase
the verbosity of NER-PT and CON-PT by adding
‘‘isn’t an entity’’ for all non-entity tokens and
substituting each ‘‘which’’ to its actual referred
phrase, respectively. The results are presented in Table 8. Though increased verbosity would elim-
inate any ambiguity, unfortunately, it hurts per-
formance. Emphasizing a token ‘‘isn’t an entity’’
might encounter the over-confidence issue as the
boundary annotation might be ambiguous in gold
NER data (Zhu and Li, 2022). CON-PT deviates
from human language style when reference is
forbidden, which eventually makes it lengthy and
hard to learn.

6.4 Stratified Analysis

Section 5 shows that our S2S approach performs comparably to most ad-hoc models. To reveal its pros and cons, we further partition the test data using task-specific factors and run tests on the partitions. The stratified performance on OntoNotes 5 is compared to the strong BERT baseline (He and Choi, 2021b), which is representative of non-S2S models implementing many state-of-the-art decoders.

For POS, we consider the rate of Out-Of-Vocabulary tokens (OOV, tokens unseen in the training set) in a sentence as the most significant factor. As illustrated in Figure 1a, the OOV rate degrades the baseline performance rapidly, especially when over half of the tokens in a sentence are OOV. However, all S2S approaches show strong resistance to OOV, suggesting that our S2S models unleash greater potential through transfer learning.

For NER, entities unseen during training often confuse a model. This negative impact can be observed on the baseline and LS in Figure 1b. However, the other two schemas generating textual tokens, LT and PT, are less severely impacted by unseen entities. It further supports the intuition behind our approach and agrees with the finding by Shin et al. (2021): With the output sequence being closer to natural language, the S2S model has less difficulty generating it even with unseen entities.

Since the number of binary parses for a sentence of n + 1 tokens is the nth Catalan Number (Church and Patil, 1982), the length is a crucial factor for CON. As shown in Figure 1c, all models, especially LS, perform worse when the sentence gets longer. Interestingly, by simply recalling all the lexicons, LT easily regains the ability to parse long sentences. Using an even more natural representation, PT outperforms them with a performance on par with the strong baseline.


Figure 1: Factors impacting each task: the rate of OOV tokens for POS, the rate of unseen entities for NER, the sentence length for CON, and the head-dependent distance for DEP.

It again supports our intuition that natural language is beneficial for pretrained S2S.

For DEP, the distance between each dependent and its head is used to factorize the overall performance. As shown in Figure 1d, the gap between S2S models and the baseline increases with the head-dependent distance. The degeneration on relatively longer arc-standard transition sequences could be attributed to the static oracle used in finetuning.

Comparing the three schemas across all subgroups, LS uses the most special tokens but performs the worst, while PT uses zero special tokens and outperforms the other two. It suggests that special tokens could harm the performance of the pretrained S2S model as they introduce a mismatch between pretraining and finetuning. With zero special tokens, PT is most similar to natural language, and it also introduces no extra parameters in finetuning, leading to better performance.

7 Conclusion

We aim to unleash the true potential of S2S
models for sequence tagging and structure parsing.
To this end, we develop S2S methods that rival
state-of-the-art approaches more complicated than
ours, without substantial task-specific architecture
modifications. Our experiments with three novel
prompting schemas on four core NLP tasks dem-
onstrated the effectiveness of natural language
in S2S outputs. Our systematic analysis revealed
the pros and cons of S2S models, appealing for
more exploration of structure prediction with S2S.
Our proposed S2S approach reduces the need
for many heavily engineered task-specific archi-
tectures. It can be readily extended to multi-task
and few-shot learning. We have a vision of S2S
playing an integral role in more language under-
standing and generation systems. The limitation

of our approach is its relatively slow decoding
speed due to serial generation. This issue can be
mitigated with non-autoregressive generation and
model compression techniques in the future.

Acknowledgments

We would like to thank Emily Pitler, Cindy Robinson, Ani Nenkova, and the anonymous
TACL reviewers for their insightful and thoughtful
feedback on the early drafts of this paper.

References

Alan Akbik, Tanja Bergmann, and Roland Vollgraf. 2019. Pooled contextualized embeddings for named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 724–728, Minneapolis, Minnesota. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1078

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, COLING'18, pages 1638–1649.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.

Xuefeng Bai, Yulong Chen, and Yue Zhang. 2022. Graph pre-training for AMR parsing

and generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6001–6015, Dublin, Ireland. Association for Computational Linguistics.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425, Baltimore, Maryland. Association for Computational Linguistics. https://doi.org/10.3115/v1/P14-1133

Michele Bevilacqua, Rexhina Blloshmi, and Roberto Navigli. 2021. One SPRING to rule them both: Symmetric AMR semantic parsing and generation without a complex pipeline. In Proceedings of AAAI. https://doi.org/10.1609/aaai.v35i14.17489

Bernd Bohnet, Ryan McDonald, Gonçalo Simões, Daniel Andor, Emily Pitler, and Joshua Maynez. 2018. Morphosyntactic tagging with a Meta-BiLSTM model over context sensitive token encodings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL'18, pages 2642–2652. https://doi.org/10.18653/v1/P18-1246

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Jiawei Chen, Qing Liu, Hongyu Lin, Xianpei Han, and Le Sun. 2022. Few-shot named entity recognition with self-describing networks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5711–5722, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.392

Lingzhen Chen and Alessandro Moschitti. 2018. Learning to progressively recognize new named entities with sequence to sequence models. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2181–2191, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Jason Chiu and Eric Nichols. 2016. Named entity
recognition with bidirectional LSTM-CNNs.
Transactions of the Association for Computa-
tional Linguistics, 4:357–370. https://土井
.org/10.1162/tacl_a_00104

Chinmay Choudhary and Colm O'riordan. 2021. End-to-end mBERT based seq2seq enhanced dependency parser with linguistic typology knowledge. In Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021), pages 225–232, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.iwpt-1.24

Kenneth Church and Ramesh Patil. 1982. Coping with syntactic ambiguity or how to put the block in the box on the table. American Journal of Computational Linguistics, 8(3–4):139–149.

Kevin Clark, Minh-Thang Luong, Christopher D. Manning, and Quoc Le. 2018. Semi-supervised sequence modeling with cross-view training. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1914–1925, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1217

Leyang Cui, Yu Wu, Jian Liu, Sen Yang, and Yue Zhang. 2021. Template-based named entity recognition using BART. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1835–1845, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-acl.161


Marie-Catherine De Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In COLING 2008: Proceedings of the Workshop on Cross-framework and Cross-domain Parser Evaluation, pages 1–8. https://doi.org/10.3115/1608858.1608859

Daniel Deutsch, Shyam Upadhyay, and Dan Roth. 2019. A general-purpose algorithm for constrained sequential inference. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 482–492, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/K19-1045

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Daniel Fernández-González and Carlos Gómez-Rodríguez. 2020. Enriched in-order linearization for faster sequence-to-sequence constituent parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4092–4099, Online. Association for Computational Linguistics.

Abbas Ghaddar and Phillippe Langlais. 2018. Robust lexical features for improved neural network named-entity recognition. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1896–1907, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Yoav Goldberg and Joakim Nivre. 2012. A dynamic oracle for arc-eager dependency parsing. In Proceedings of COLING 2012, pages 959–976, Mumbai, India. The COLING 2012 Organizing Committee.

Tatsunori B. Hashimoto, Kelvin Guu, Yonatan Oren, and Percy S. Liang. 2018. A retrieve-and-edit framework for predicting structured outputs. Advances in Neural Information Processing Systems, 31.

Han He and Jinho Choi. 2020. Establishing strong baselines for the new decade: Sequence tagging, syntactic and semantic parsing with BERT. In The Thirty-Third International Flairs Conference.

Han He and Jinho D. Choi. 2021a. Levi graph AMR parser using heterogeneous attention. In Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021), pages 50–57, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.iwpt-1.5

Han He and Jinho D. Choi. 2021b. The stem cell hypothesis: Dilemma behind multi-task learning with transformer encoders. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5555–5577, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.451

Chris Hokamp and Qun Liu. 2017. Lexically constrained decoding for sequence generation using grid beam search. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, Vancouver, Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1141

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22, Berlin, Germany. Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1002

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016


Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270, San Diego, California. Association for Computational Linguistics. https://doi.org/10.18653/v1/N16-1030

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.703

Peng-Hsuan Li, Ruo-Ping Dong, Yu-Siang Wang, Ju-Chieh Chou, and Wei-Yun Ma. 2017. Leveraging linguistic structures for named entity recognition with bidirectional recursive neural networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2664–2669, Copenhagen, Denmark. Association for Computational Linguistics.

Zuchao Li, Jiaxun Cai, Shexia He, and Hai Zhao. 2018. Seq2seq dependency parsing. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3203–3214, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Jiangming Liu and Yue Zhang. 2017. In-order transition-based constituent parsing. Transactions of the Association for Computational Linguistics, 5:413–424. https://doi.org/10.1162/tacl_a_00070

Chunpeng Ma, Lemao Liu, Akihiro Tamura, Tiejun Zhao, and Eiichiro Sumita. 2017. Deterministic attention for sequence-to-sequence constituent parsing. In Thirty-First AAAI Conference on Artificial Intelligence. https://doi.org/10.1609/aaai.v31i1.10967

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330. https://doi.org/10.21236/ADA273556

Khalil Mrini, Franck Dernoncourt, Quan Hung Tran, Trung Bui, Walter Chang, and Ndapa Nakashole. 2020. Rethinking self-attention: Towards interpretability in neural parsing. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 731–742, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.65

Joakim Nivre. 2004. Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together, pages 50–57, Barcelona, Spain. Association for Computational Linguistics. https://doi.org/10.3115/1613148.1613156

Giovanni Paolini, Ben Athiwaratkun, Jason Krone, Jie Ma, Alessandro Achille, Rishita Anubhai, Cicero Nogueira dos Santos, Bing Xiang, and Stefano Soatto. 2021. Structured prediction as translation between augmented natural languages. In International Conference on Learning Representations.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1202

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 143–152, Sofia, Bulgaria. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael

Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Kenji Sagae and Alon Lavie. 2005. A classifier-based parser with linear run-time complexity. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 125–132, Vancouver, British Columbia. Association for Computational Linguistics. https://doi.org/10.3115/1654494.1654507

Richard Shin, Christopher Lin, Sam Thomson, Charles Chen, Subhro Roy, Emmanouil Antonios Platanios, Adam Pauls, Dan Klein, Jason Eisner, and Benjamin Van Durme. 2021. Constrained language models yield few-shot semantic parsers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7699–7715, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.608

Jana Straková, Milan Straka, and Jan Hajic. 2019. Neural architectures for nested NER through linearization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5326–5331, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1527

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147. https://doi.org/10.3115/1119176.1119195

Bo-Hsiang Tseng, Jianpeng Cheng, Yimai Fang, and David Vandyke. 2020. A generative model for joint natural language understanding and generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1795–1807, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.163

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015a. Pointer networks. In Advances in Neural Information Processing Systems, volume 28.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015b. Grammar as a foreign language. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.

Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, and Kewei Tu. 2021. Automated concatenation of embeddings for structured prediction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2643–2660, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.206

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2013. OntoNotes release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1296–1306, Austin, Texas. Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1137

Lu Xu, Zhanming Jie, Wei Lu, and Lidong Bing. 2021. Better feature integration for named entity recognition. In Proceedings of the 2021 Conference of the North American Chapter of the


Association for Computational Linguistics: Human Language Technologies, pages 3457–3469, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.271

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6442–6454, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.523

Hang Yan, Tao Gui, Junqi Dai, Qipeng Guo, Zheng Zhang, and Xipeng Qiu. 2021. A unified generative framework for various NER subtasks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5808–5822, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.451

Songlin Yang and Kewei Tu. 2022. Bottom-up constituency parsing and nested named entity recognition with pointer networks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2403–2416, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.171

Deming Ye, Yankai Lin, Peng Li, and Maosong Sun. 2022. Packed levitated marker for entity and relation extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4904–4917, Dublin, Ireland. Association for Computational Linguistics.

Juntao Yu, Bernd Bohnet, and Massimo Poesio. 2020. Named entity recognition as dependency parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6470–6476, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.577

Zhirui Zhang, Shujie Liu, Mu Li, Ming Zhou, and Enhong Chen. 2017. Stack-based multi-layer attention for transition-based dependency parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1677–1682, Copenhagen, Denmark. Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1175

Enwei Zhu and Jinpeng Li. 2022. Boundary smoothing for named entity recognition. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7096–7108, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.490

Huiming Zhu, Chunhui He, Yang Fang, and Weidong Xiao. 2020. Fine grained named entity recognition via seq2seq framework. IEEE Access, 8:53953–53961. https://doi.org/10.1109/ACCESS.2020.2980431

Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. 2013. Fast and accurate shift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 434–443, Sofia, Bulgaria. Association for Computational Linguistics.
