LOT: A Story-Centric Benchmark for Evaluating Chinese Long Text
Understanding and Generation
Jian Guan1, Zhuoer Feng1, Yamei Chen1, Ruilin He2,
Xiaoxi Mao3, Changjie Fan3, Minlie Huang1∗
1The CoAI group, DCST, China; 2Huawei
Technologies Co., Ltd., China; 3Netease Fuxi AI Lab., China
{j-guan19,fze17}@mails.tsinghua.edu.cn, chenziym4132013@163.com,
{maoxiaoxi,fanchangjie}@corp.netease.com,
heruilin@huawei.com,
aihuang@tsinghua.edu.cn
Abstract
Standard multi-task benchmarks are essential for developing pretraining models that
can generalize to various downstream tasks. Existing benchmarks for natural language
processing (NLP) usually focus only on understanding or generating short texts. However,
long text modeling requires many distinct abilities in contrast to short texts, such as the
modeling of long-range discourse and commonsense relations, and the coherence and
controllability of generation. The lack of standardized benchmarks makes it difficult to
assess these abilities of a model and fairly compare different models, especially Chinese
models. Therefore, we propose a story-centric benchmark named LOT for evaluating Chinese
long text modeling, which aggregates two understanding tasks and two generation tasks.
We construct new datasets for these tasks based on human-written Chinese stories with
hundreds of words. Furthermore, we release an encoder-decoder-based Chinese long text
pretraining model named LongLM with up to 1 billion parameters. We pretrain LongLM on
120G Chinese novels with two generative tasks including text infilling and conditional
continuation. Extensive experiments show that LongLM outperforms similar-sized pretraining
models substantially on both the understanding and generation tasks in LOT.
1 Introduction
∗ Corresponding author.
Pretrained language models have achieved sig-
nificant advances in various natural language
understanding (NLU) and generation (NLG) tasks
(Devlin et al., 2019; Radford et al., 2019). Standard
benchmarks such as GLUE (Wang et al., 2019)
further boost the improvement and fast iteration
of pretrained models. Popular benchmarks usually
aggregate multiple tasks to spur the progress of
generalizable models. But these benchmarks fo-
cus mainly on understanding or generating short
texts. Per esempio, the GLUE tasks take at most
two sentences as input. And most tasks in NLG
benchmarks such as GLGE (Liu et al., 2020) E
GEM (Gehrmann et al., 2021) require generating
only several words (per esempio., dialogue generation). Al-
though there have been many models pretrained
on long texts such as GPT3 (Brown et al., 2020)
and CPM (Zhang et al., 2020), the lack of bench-
mark datasets makes it difficult to fully assess and
compare their abilities of long text modeling.
In this paper, we present LOT, a benchmark for
evaluating Chinese LOng Text understanding and
generation. As shown in Table 1, modeling long
texts requires many distinct abilities compared to
short texts, including (1) commonsense reasoning
regarding characters’ reaction and intention, and
knowledge about physical objects (e.g., ‘‘river’’)
and abstract concepts (e.g., ‘‘irony’’); (2) model-
ing discourse-level features such as inter-sentence
relations (e.g., causality) and global discourse
structures (e.g., the order of events); and (3) the
generation coherence and controllability, which
require both maintaining a coherent plot and
adhering to controllable attributes (e.g., topics).
Accordingly, LOT contains two understanding
tasks and two generation tasks regarding the above
abilities. We construct new datasets for these tasks
based on various kinds of stories such as fables and
fairy tales collected from public web resources,
Transactions of the Association for Computational Linguistics, vol. 10, pp. 434–451, 2022. https://doi.org/10.1162/tacl_a_00469
Action Editor: Dipanjan Das. Submission batch: 10/2021; Revision batch: 12/2021; Published 4/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Effendi’s son is eccentric, always behaving
opposed to what Effendi has ordered him
to do. Familiar to his son’s temper, Effendi
usually communicates using irony. One day,
the father and son were blocked by a river after
purchasing flour from a mill. And while they
were crossing the river, one bag on the donkey’s
back lost its weight and leaned. Effendi told
his son with irony:‘‘My boy! drop the sack
into the river!’’ The son heard the words and
thought:‘‘I have been opposed to my father for
so many years. For this only time, I have to
obey him.’’ Therefore, he followed Effendi’s
words and indeed pushed the sack into the
river. ‘‘My boy! What are you doing?’’ Effendi
shouted in anger. …
Tavolo 1: A long text example. The concepts and
events concerning commonsense and discourse
relations are highlighted in bold.
considering that stories usually contain abundant
commonsense and discourse relations. All these
tasks require processing stories with hundreds of
parole. Note that LOT does not involve extra-long
texts with thousands of words since the compli-
cated linguistic phenomena in these texts make
it hard to test individual abilities and guide the
improvement of generation models.
Moreover, we release LongLM, a Chinese
Long text pretraining Language Model. LongLM
is a Transformer-based model with an encoder-
decoder architecture. LongLM has three different
versions ranging from 60 million to 1 billion pa-
rameters. We pretrain LongLM on 120G Chinese
novels with two generative tasks, including text
infilling (Lewis et al., 2020) and conditional con-
tinuation (Radford et al., 2018). The pretraining
data do not include other types of texts (e.g., news,
Wiki-texts) since we mainly focus on common-
sense and discourse relations within general long
texts instead of factual and technical knowledge.
To the best of our knowledge, LongLM is the
first pretraining model of the same size scale that
focuses on modeling long-form stories. Extensive
experiments on LOT show that LongLM outper-
forms strong baselines substantially on both the
understanding and generation tasks. However, we
also observe that LongLM is still far behind hu-
man performance, which requires better semantic
representations of events and deeper modeling of
the commonsense and discourse relations between
them. We summarize the main contributions of
this paper as follows:
IO. We propose a new story-centric benchmark
LOT for evaluating Chinese long text understand-
ing and generation. LOT consists of four tasks
for testing the fundamental abilities to model long
texts. We also present new datasets for these tasks.
II. We release a new Chinese pretraining model
named LongLM. Experiment results demonstrate
the strong performance of LongLM on LOT, but
there still exists considerable room for
improvement.1
2 Related Work
NLP Benchmarks Recently, there have been a
lot of multi-task benchmarks proposed to drive
the progress of generalizable models. The bench-
marks usually aggregate multiple model-agnostic
tasks under a unified framework, enabling re-
searchers to fairly compare different models.
SentEval (Conneau and Kiela, 2018) gathered
multiple classification tasks involving either one
or two sentences as inputs to evaluate sentence
representations. DiscoEval (Chen et al., 2019)
extended these tasks to the discourse level regard-
ing inter-sentence relations. GLUE (Wang et al.,
2019) included more diverse tasks such as natu-
ral language inference (Rocktäschel et al., 2016).
Sarlin et al. (2020) proposed SuperGLUE as a
more challenging counterpart of GLUE by in-
troducing multi-sentence tasks. But the additional
tasks are only limited to the formats of coreference
resolution and question answering. In addition
to these English benchmarks, many benchmarks
were proposed to evaluate NLU for other lan-
guages, such as CLUE (Xu et al., 2020a) for
Chinese. Furthermore, GLGE (Liu et al., 2020) and
GEM (Gehrmann et al., 2021) were proposed for
evaluating NLG models across diversified gen-
eration tasks such as text summarization and
personalized dialogue. However, there is no
benchmark designed specifically for long text
modeling, especially Chinese. Additionally, the
above benchmarks were originally designed to
cover as diverse task formats as possible. In con-
trast, we design the LOT tasks with the guidance
1The LOT benchmark, the pretraining resources, and the
appendix are available at https://github.com/thu
-coai/LOT-LongLM.
of necessary abilities for long text modeling as
suggested by Ribeiro et al. (2020), making it eas-
ier to figure out where models are failing, and how
to improve them.
Long Text Datasets Previous studies in the field
of long text modeling have frequently focused
on the ROCStories (Mostafazadeh et al., 2016)
and WritingPrompts (Fan et al., 2018) datasets.
ROCStories contains 100k artificial five-sentence
stories, while WritingPrompts consists of 300K
pairs of prompts and stories with hundreds
of words. Recent works collected stories with
thousands of words to model longer-range de-
pendencies, such as WikiText-103 (Merity et al.,
2016), roleplayerguild (Louis and Sutton, 2018),
PG-19 (Rae et al., 2020), STORIUM (Akoury
et al., 2020), and Long-Range Arena (Tay et al.,
2020). However, these datasets are written in En-
glish. LOT will drive the development of Chinese
language models.
Inoltre, LOT does not include datasets of
extra-long texts, like PG-19, for the following two
reasons: (1) Extra-long texts are far beyond the
scope of current machine learning models because
the discourse-level linguistic phenomena are en-
tangled and complicated in these texts. Therefore,
extra-long texts usually serve for computing per-
plexity of language models (Dai et al., 2019) Ma
hardly provide fine-grained guidance for improv-
ing model designs. (2) LOT aims not to spur
research on building fuller connections across to-
kens within an extra-long sequence, but to drive
the progress of machines in the aforementioned
fundamental abilities for long text modeling.
Story Understanding and Generation LOT is
centered on fundamental abilities for long text
modeling and thus includes four story understand-
ing and generation tasks concerning common-
sense and discourse relations. Recent studies
have proposed various tasks to evaluate story
understanding and generation. First, story end-
ing selection (Mostafazadeh et al., 2016), story
ending generation (Guan et al., 2019), and story
completion (Wang and Wan, 2019) focused on
the commonsense reasoning ability on inter-event
causal and temporal relations. Second, Chen et al.
(2019) evaluated the ability to model discourse re-
lations by predicting the position of a sentence or a
paragraph in a text. Third, some works focused on
the coherence of story generation conditioned on
short prompts (Fan et al., 2018), titles (Yao et al.,
2019) and beginnings (Guan et al., 2020). Fourth,
some studies centered on controllability, that is,
the imposing of controllable attributes on story
generation such as keywords (Xu et al., 2020b),
emotional trajectories (Brahman and Chaturvedi,
2020), outlines (Rashkin et al., 2020), and styles
(Kong et al., 2021). LOT is a comprehensive
benchmark to test the above abilities for Chinese
long text modeling.
On the other hand, LOT does not involve
those tasks that require learning more partic-
ular features of stories, such as event chains
(Chambers and Jurafsky, 2008), character types
(Bamman et al., 2013), inter-character relations
(Chaturvedi et al., 2016, 2017), social networks
(Agarwal et al., 2013), and abstractive structures
(Finlayson, 2012). Non-neural story generation
models usually retrieved events from a knowl-
edge base with pre-specified semantic relations
based on handcrafted rules (Li et al., 2013), which
are costly and lack generalization. In this paper,
we focus mainly on evaluating neural models for
story understanding and generation.
3 LOT Benchmark
We design LOT as an aggregation of two un-
derstanding tasks including Cloze Test (ClozeT)
and Sentence Position Prediction (SenPos), and
two generation tasks including Plot Completion
(PlotCom) and Outline-conditioned Generation
(OutGen). We show the task descriptions and
data statistics in Tables 2 and 3, respectively. We
use the jieba tokenizer2 for word tokenization.
We design LOT based on the following prin-
ciples: (1) Task Diversity: The tasks vary in
task formats, types and lengths of inputs and
outputs, and focused abilities, making LOT a
comprehensive framework for evaluating the gen-
eralization of models. (2) Task Difficulty: The
tasks take hundreds of words as inputs or outputs,
and do not involve domain-specific knowledge
about science, films, and so forth. Therefore, they
are beyond the scope of current state-of-the-art
models, but are solvable by most Chinese native
speakers. (3) Task Formulation: The tasks have
been well formulated in prior studies and agreed
to be challenging but meaningful. We introduce
new Chinese datasets for these tasks, which are
constructed to focus more specifically on testing
2https://github.com/fxsjy/jieba.
Tasks | Abilities | Inputs | Outputs | Metrics
ClozeT | Commonsense Reasoning | A text with a sentence removed (the position specified); Two candidate sentences. | Choosing the correct sentence from two candidates. | Accuracy
SenPos | Inter-sentence Relationship | A text with a sentence removed (the position unspecified); The removed sentence. | Choosing the correct position for the removed sentence. | Accuracy
PlotCom | Commonsense Reasoning; Inter-sentence Relationship | A text with a sentence removed (the position specified). | Generating a sentence to complete the text. | BLEU; Dist
OutGen | Discourse Structure; Coherence; Controllability | A title, an outline as an out-of-order set of phrases about characters and events. | Generating a coherent text adhering to the title and outline. | BLEU; Dist; Cover; Order

Table 2: Overview of the tasks in LOT for the abilities they test, inputs and outputs, and the evaluation
metrics. Dist and Cover refer to Distinct and Coverage (Section 5.3), respectively.
Datasets | Train | Val | Test

Task: ClozeT
# Examples | 644 | 294 | 294
Vocabulary Size | 9k | 7k | 7k
Avg. # Char in Input Text | 139.07 | 138.95 | 141.15
Avg. # Word in Input Text | 89.28 | 89.03 | 90.20
Avg. # Sent in Input Text | 5.95 | 5.94 | 5.95
Avg. # Word in Candidate | 15.60 | 16.38 | 15.75

Task: SenPos
# Examples | 20,000 | 800 | 863
Vocabulary Size | 147k | 10k | 22k
Avg. # Char in Input Text | 289.59 | 258.48 | 258.52
Avg. # Word in Input Text | 254.11 | 224.20 | 223.25
Avg. # Sent in Input Text | 9.61 | 8.43 | 8.44
Avg. # Word in Removed Sent | 30.48 | 29.28 | 30.26
Avg. # Candidate Positions | 8.05 | 6.91 | 6.91

Task: PlotCom
# Examples | 13,099 | 465 | 464
Vocabulary Size | 22k | 8k | 8k
Avg. # Char in Input Text | 164.35 | 137.67 | 133.26
Avg. # Word in Input Text | 105.48 | 87.56 | 84.98
Avg. # Sent in Input Text | 7.17 | 5.59 | 5.48
Avg. # Word in Output Sent | 15.08 | 15.96 | 16.15

Task: OutGen
# Examples | 1,456 | 242 | 729
Vocabulary Size | 19k | 6k | 12k
Avg. # Word in Input Title | 4.64 | 4.89 | 4.64
Avg. # Word in Input Outline | 19.20 | 19.05 | 19.47
Avg. # Phrase in Input Outline | 8.00 | 8.00 | 8.00
Avg. # Char in Output Text | 169.94 | 169.80 | 170.49
Avg. # Word in Output Text | 108.91 | 108.68 | 109.04
Avg. # Sent in Output Text | 7.20 | 7.11 | 7.15

Table 3: Data statistics of LOT tasks. The abbreviation char/sent/len is short for
character/sentence/length, respectively.
a certain ability than original datasets. (4) Au-
tomatic Evaluation: These tasks have reliable
automatic metrics to evaluate the focused abili-
ties. We exclude open-ended generation tasks such
as story generation from titles, which is difficult to
automatically evaluate (Guan et al., 2021) because
the tasks suffer from the notorious one-to-many
issue: There are many plausible outputs for the
same input (Zhao et al., 2017).
We constructed datasets for LOT through au-
tomatic and manual annotation. First, we crawled
human-written stories from public web pages as
the data source. These stories are under licenses
that allow use and redistribution for research pur-
poses. Then, we hired a commercial team to create
the LOT examples. The team is led by a profes-
sional screenwriter and has taken on hundreds of
NLP annotation projects. All annotators are native
Chinese speakers and well-trained for the annota-
tion tasks. We show the full list of the source web
pages and the annotation details in the appendix.
3.1 Cloze Test
Mostafazadeh et al. (2016) introduced the Story
Cloze Test (SCT) task for evaluating story com-
prehension, which requires selecting the right
ending from two candidates for a four-sentence
leading context. However, SCT suffers from
the following issues: (1) Its dataset is artificial
and contains innate biases between right and
wrong endings in some features such as lengths
(Schwartz et al., 2017; Sharma et al., 2018). Such
biases may leak information about the target la-
bels. (2) SCT focuses on reasoning only endings
but neglects other types of reasoning, ad esempio
abductive reasoning (Bhagavatula et al., 2019),
which requires reasoning what happens between
observed beginnings and endings. (3) SCT limits
the scope of commonsense reasoning to realistic
events. The limitation may be neither necessary
nor sufficient. For example, ‘‘Cupid can fly’’ can
A goblin had buried a treasure under the
ground. After that, he received a long flight
mission from the Devil King. The goblin be-
gan to worry about how to guard the treasure
during his mission. The goblin thought for a
long time and decided to give the treasure
to a miser. The miser clung to his vault even
when he was asleep, so the goblin trusted him
very much · · ·
Table 4: An example for selecting a sentence that
can be reasoned based on the context and common
sense (in red). We also highlight a sentence that
does not satisfy the requirement in green, which
introduces a new character ‘‘the Devil King’’.
be reasoned based on common sense although it
is not realistic, while some story settings may be
realistic but fail to be reasoned only based on the
context and common sense, as shown in Table 4.
Therefore, when constructing our ClozeT dataset,
we adopt the following approaches to alleviate
the above issues: (1) All examples are derived
from existing human-written stories. (2) We allow
annotators to create examples where the removed
sentence is initially in the middle of the story. (3)
We change the scope of commonsense reasoning
to all events that embody characters’ reaction and
intention, or the nature of physical objects and
concepts. Table 6 shows two ClozeT examples.
In addition, we also conducted experiments to
investigate the potential biases of our dataset in
Section 5.5.
Story Filtering To ensure the quality of LOT
examples, we asked annotators to judge whether
each crawled story meets the following defini-
tion: ‘‘anything which is told in the form of a
coherent event sequence involving several spe-
cific and related characters’’ (Mostafazadeh et al.,
2016). We provided detailed cases for annotators
to instruct them about this definition. Then, anno-
tators needed to refine those stories which do not
meet the definition by rewriting the plots. They
should also clean up the stories by the following
heuristics: (1) refusing examples that may vio-
late ethical principles (e.g., discrimination); (2)
deleting noisy words (e.g., links); (3) changing
slang and informal words into standard modern
Chinese; (4) rewriting all dialogues to objective
events. Finally, we collected 2,427 high-quality
Text:
I couldn’t control my anger very
well.[1]My parents would yell at me, and
I ran to my room.[2]I buried my head in a
pillow and screamed.[3]I threw my pillow
and hit it hard.
Removed Sentence: I tried to express my
anger.
Table 5: A poor example for the SenPos task.
The removed sentence has multiple reasonable
positions including [2] E [3] in the original
testo.
Chinese stories, which will be used to con-
struct the datasets for the ClozeT, PlotCom, and
OutGen tasks.
Dataset Construction We presented the stories
to another group of annotators to construct the
ClozeT dataset. For each story, they should select
a sentence as the right candidate that can be rea-
soned based on the context and common sense.
Table 4 shows an example presented to the annota-
tors to illustrate how to judge whether a sentence
satisfies this requirement. Then, the annotators
rewrite the sentence into another one as the wrong
candidate that maintains a good topical relatedness
with the context but violates common sense. The
wrong candidates should either embody unreason-
able reactions or intentions, or violate the nature
of physical objects or concepts. And we require
annotators not to select the first sentence, which
usually aims to introduce story settings instead of
narrating an event. We browse through the an-
notation results and give the annotators detailed
feedback before approving their submissions. Fi-
nally, we collected 1,232 examples in total and
split them for training, validation and testing.
3.2 Sentence Position Prediction
We use the sentence position prediction task
(Chen et al., 2019) to evaluate the ability to cap-
ture inter-sentence relations (e.g., causality). We
formulate the task as follows: Given a text with
a sentence removed, models should choose the
correct position of the sentence in the text from
multiple candidates. Chen et al. (2019) constructed
an English dataset for this task by randomly re-
moving sentences from existing texts. However,
such examples may be invalid since a sentence
may have multiple plausible positions in a text,
Table 6: Two ClozeT examples. The right candidates are extracted from the original stories (at the
position of ‘‘[MASK]’’) while the wrong candidates are written by crowd-sourced annotators. The
first example focuses on common sense regarding the fox’s reaction to the silly wolf’s behavior, while
the second example focuses on common sense regarding the relations between palace and prince. We
highlight the entities and events related to the commonsense relations in red, and those which violate
common sense in the wrong candidates in green.
as illustrated in Table 5. Therefore, we construct
the dataset for our task based on the following
pipeline: (1) extracting paragraphs with less than
500 words from crawled stories; (2) randomly se-
lecting a sentence to remove for each paragraph,
and regarding all positions between two adjacent
sentences as candidates3; and (3) asking annota-
tors to refine part of the auto-constructed examples
as the validation and test sets, and the remaining
as the training set. Table 7 shows two SenPos
examples.
Dataset Construction We asked annotators to
refine each example so that the removed sentence
has only one reasonable position in the text. We did
not allow annotators to select the first or last sen-
tence of the original text as the removed sentence
because they usually contain obvious wording
features (e.g., ‘‘once upon a time,’’ ‘‘they lived
happily together’’), which may make this task
trivial. Unlike ClozeT, we allowed the texts for
SenPos to be incomplete or include dialogues
that also embody rich inter-sentence relations. Fi-
nally, we collected 1,663 examples for validation
and testing through human annotation. And we
constructed 20,000 examples automatically for
training.
3We set the minimum length of the removed sentence to
10 Chinese characters, and we merge a sentence in a story
with its neighbors if it contains less than 10 characters.
3.3 Plot Completion
We use the Plot Completion task (Wang and Wan,
2019) to test the ability to make inferences based
on common sense. We formulate this task as
follows: Given a story with a sentence removed,
models should generate a sentence to complete the
story and make it reasonable and coherent.
Dataset Construction Prior studies (Wang and
Wan, 2019; Paul and Frank, 2021) automatically
constructed datasets for this task based on exist-
ing datasets by randomly removing one sentence
from a story. However, as shown in Table 4, not
all sentences in a story can be reasoned only based
on the context and common sense. Therefore, we
only used the above automatic method to construct
the training data. And we adapted the ClozeT data
to this task for validation and testing, since an-
notators have marked out the qualified sentences.
Specifically, we randomly sampled some ClozeT
examples and took the incomplete story of each
example as input, and the right candidate as the
target sentence to be generated.
Table 7: Two SenPos examples. The special tokens from [1] to [9] refer to the candidate positions.
The first/second example focuses on testing the ability to capture the inter-sentence causal/temporal
relations, respectively. We highlight the entities and events implying the relations in red.
3.4 Outline-conditioned Generation
Prior work tended to test the ability of long text
generation through story generation conditioned
on inputs with limited information such as titles
(Yao et al., 2019). However, these tasks are ex-
tremely open-ended so that it is difficult to reliably
measure the generation quality using automatic
metrics (Guan and Huang, 2020). To alleviate the
issue, we introduce the Outline-conditioned Gen-
eration task (Rashkin et al., 2020), which requires
generating a coherent long-form story conditioned
on an outline of characters and events. We formu-
late the outline as a set of out-of-order phrases,
which not only narrows down the set of plausible
stories but also serves for testing the controllabil-
ity and planning ability of models to arrange the
given events reasonably at the discourse level.
Dataset Construction We built the dataset for
this task automatically based on filtered stories.
We followed Rashkin et al. (2020) to extract
the outline of a story using the RAKE algorithm
(Rose et al., 2010). We extract at most eight
phrases for each story, and each phrase contains no
more than eight words. For example, the outline
for the story in Table 1 is {‘‘told his son with
irony,’’ ‘‘purchasing flour from a mill,’’ ‘‘crossing
the river,’’ ‘‘drop the sack into the river,’’ ‘‘indeed
pushed the sack,’’ ‘‘familiar to his son’s temper,’’
‘‘shouted,’’ ‘‘one bag’’}. The outline can serve as
discourse-level guidance for generation models,
which should rearrange the events reasonably and
generate a story with a good global discourse
structure, rather than focus on modeling only the
local coherence.
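For illustration, this extraction step can be sketched as follows (a minimal Python sketch, not the released preprocessing code; the extract_keyphrases function stands in for whichever RAKE implementation is plugged in, and the limits of eight phrases with at most eight words each follow the description above):

```python
import random
import jieba

def build_outline(story_text, extract_keyphrases, max_phrases=8, max_words=8):
    """Build an out-of-order outline of at most `max_phrases` phrases,
    each containing no more than `max_words` words.

    `extract_keyphrases` is a stand-in for a RAKE-style extractor that
    returns candidate phrases ranked by salience (an assumption here).
    """
    outline = []
    for phrase in extract_keyphrases(story_text):
        # Count words with the same jieba tokenizer used elsewhere in LOT.
        if len(list(jieba.cut(phrase))) <= max_words:
            outline.append(phrase)
        if len(outline) == max_phrases:
            break
    random.shuffle(outline)  # the outline is treated as an unordered set of phrases
    return outline
```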
3.5 Overall Score
Existing benchmarks usually summarize the per-
formance of a model as a single score by averaging
all metric scores without considering task diffi-
culties. To encourage models to progress on those
tasks where there is a more significant gap between
machines and humans, we propose to average
metric scores with different weights. Suppose that
there are a total of M metrics for all tasks, we
derive the overall score as follows:
S = \sum_{i=1}^{M} \frac{w_i}{\sum_{j=1}^{M} w_j} S_i, \quad (1)

w_i = \frac{H_i}{B_i}, \quad (2)
where Hi, Bi, and Si are the score of humans, a
pre-selected baseline, and the evaluated model for
the i-th metric, respectively, and wi is the weight
for this metric. Intuitively, the metric scores where
the baseline model has a larger gap with humans
will have a larger weight when computing the
overall score. We use BERT and GPT2 as the base-
line models for the understanding and generation
tasks in LOT, respectively.
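A minimal sketch of this weighting scheme is given below (Python; not the official evaluation script). Plugging in the BERT (baseline) and LongLMbase validation scores from Table 10 reproduces the reported overall score up to rounding.

```python
def overall_score(human, baseline, model):
    """Weighted average of metric scores following Eq. (1) and (2).

    Each argument maps a metric name to the score of humans (H_i),
    the pre-selected baseline (B_i), and the evaluated model (S_i).
    """
    weights = {m: human[m] / baseline[m] for m in human}  # w_i = H_i / B_i
    total = sum(weights.values())
    return sum(weights[m] / total * model[m] for m in model)

# Validation scores from Table 10 (BERT is the understanding baseline):
print(overall_score(
    human={"ClozeT": 99.00, "SenPos": 97.00},
    baseline={"ClozeT": 70.75, "SenPos": 40.13},
    model={"ClozeT": 75.17, "SenPos": 64.38},   # LongLM-base
))  # ~68.3, matching the reported overall score of 68.34
```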
Versions | dm | dff | dkv | nh | ne/nd | # P
Small | 512 | 2,048 | 64 | 8 | 6/6 | 60M
Base | 768 | 3,072 | 64 | 12 | 12/12 | 223M
Large | 1,536 | 3,072 | 64 | 12 | 24/32 | 1B

Table 8: Hyper-parameter settings for different
versions of LongLM. dm, dff, and dkv are the
dimension of hidden states, the feed forward
layers, and the keys/values in the self-attention
layers, respectively. nh is the number of at-
tention heads. ne and nd denote the number
of hidden layers for the encoder and decoder,
respectively. # P is the number of parameters.
4 Long Text Pretraining Model
To provide more flexibility on both understanding
and generation tasks, we build LongLM fol-
lowing the original encoder-decoder design of
Transformer (Vaswani et al., 2017) with three
different sizes, as shown in Table 8. We follow
Cui et al. (2020) to use a sentencepiece vocabu-
lary of 32,000 wordpieces (Kudo and Richardson,
2018). And we set the maximum sequence length
A 512 for both the encoder and decoder.
Pretraining Data We collect 120G novels as the
pretraining data for LongLM, which cover various
topics such as romance, military, and so on. Since
a novel is usually much longer than the maximum
input and output length of LongLM, we split a
novel into multiple segments for pretraining.
Pretraining Tasks Encoder-decoder models are
trained typically by maximizing the likelihood of
the target output given an input. To improve capac-
ities of both the encoder and decoder, we propose
to train LongLM with two pretraining tasks in-
cluding text infilling (Raffel et al., 2020) and
conditional continuation (Radford et al., 2019).
For the first task, the input is a text where a
number of spans are sampled and replaced by
special tokens with unique IDs, while the output
is the spans delimited by the special tokens used
in the input. The lengths of masked spans are
drawn from a Poisson distribution with λ=3 and
all masked tokens compress 15% of the original
texts. As for the second task, the input and output
are, respectively, the front and back half of a text,
which is split into two parts randomly. We show
an example of the pretraining tasks in Figure 1.
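For concreteness, the two pretraining examples can be constructed roughly as follows (a sketch under the stated settings: span lengths drawn from a Poisson distribution with λ = 3, about 15% of tokens masked, and a random split point; the sentinel-token format mirrors T5-style infilling and is an assumption rather than the exact released preprocessing):

```python
import random
import numpy as np

def make_infilling_example(tokens, mask_ratio=0.15, mean_span_len=3):
    """Text infilling: replace sampled spans with unique sentinel tokens;
    the target is the removed spans delimited by the same sentinels."""
    budget = int(len(tokens) * mask_ratio)   # total number of tokens to mask
    source, target, i, span_id = [], [], 0, 0
    while i < len(tokens):
        if budget > 0 and random.random() < mask_ratio:
            span_len = min(max(1, np.random.poisson(mean_span_len)),
                           budget, len(tokens) - i)
            sentinel = f"<extra_id_{span_id}>"   # sentinel naming is illustrative
            source.append(sentinel)
            target.append(sentinel)
            target.extend(tokens[i:i + span_len])
            i += span_len
            budget -= span_len
            span_id += 1
        else:
            source.append(tokens[i])
            i += 1
    return source, target

def make_continuation_example(tokens):
    """Conditional continuation: the front part of a randomly split text is
    the input and the back part is the target."""
    split = random.randint(1, max(1, len(tokens) - 1))
    return tokens[:split], tokens[split:]
```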
Figure 1: Schematic of the pretraining tasks. The special token shown in the figure is the ‘‘end of
sequence’’ token.

Pretraining Details We set the learning rate to
1e-4 with the Adam optimizer and the batch size
to 1,000. We pretrained LongLM for 2.5M steps.
It took about two months to train the largest model
using eight NVIDIA V100 GPUs.

Models | TextInfill PPL | TextInfill BLEU-3/4 | CondCont PPL | CondCont BLEU-3/4
LongLMsmall | 11.61 | 73.80/68.96 | 22.91 | 5.30/2.43
LongLMbase | 8.24 | 75.65/71.05 | 17.03 | 5.73/2.64
LongLMlarge | 6.50 | 77.08/72.65 | 14.08 | 8.91/5.97

Table 9: Perplexity (PPL) and BLEU scores
of LongLM for text infilling (TextInfill) and
conditional continuation (CondCont). The best
performance is in bold and the second best is
underlined.
Model Performance To assess the performance
of LongLM on the pretraining tasks, we randomly
separated out 1,000 texts from the initial pre-
training data for testing, which were never seen
in the pretraining phase. We used perplexity and
BLEU-n (n = 3, 4) to evaluate both pretraining
tasks. And we generated outputs using the greedy
decoding algorithm for the text infilling task, and
top-k sampling (Fan et al., 2018) with k = 40 and
a softmax temperature of 0.7 (Goodfellow et al.,
2014) for the conditional continuation task. As
shown in Table 9, the performance improves sub-
stantially as the number of parameters increases.
5 Experiments
In this section, we tested LongLM and exist-
ing models on LOT with automatic and manual
evaluation. In addition, we conducted extensive
experiments to investigate the potential biases of
the ClozeT and SenPos datasets (Section 5.5), and
measure the overlap between training and testing
data (Section 5.6).
5.1 Evaluated Models
We evaluated the following models, which
are implemented based on the register models
of HuggingFace Transformers:4 (1) Vanilla
Transformer: It has the same architecture as
BERTbase except that the number of layers is
set to 3 (Vaswani et al., 2017). (2) BERT: It
is implemented based on the bert-base-chinese
register model (Devlin et al., 2019). (3)
RoBERTa: It is implemented based on the
hfl/chinese-roberta-wwm-ext register model
(Cui et al., 2020). (4) GPT2: It is implemented
based on the uer/gpt2-chinese-cluecorpussmall
register model (Zhao et al., 2019). (5) mT5: It
is implemented based on the google/mt5-base
register model (Xue et al., 2021). We set all the
baseline models to the base version due to limited
computational resources.
To show the generic benefits of the pretraining
data of LongLM for long text modeling, we pre-
trained a left-to-right language model from scratch
on the data with the standard language modeling
objective. This model has the same architecture as
GPT2base and is denoted as GPT2†base. In addition,
we evaluated two task-specific pretraining mod-
els including PlotMachines (PM) (Rashkin et al.,
2020) and Plan&Write (PW) (Yao et al., 2019),
and two typical non-pretrained models includ-
ing ConvS2S (Gehring et al., 2017) and Fusion
(Fan et al., 2018) on the generation tasks in LOT.
We used GPT2base as the backbone model of PM
and PW. For PM, we regard input sentences (for
PlotCom) or input phrases (for OutGen) as the plot
elements used in the memory network, and update
the memory representations at each step of decod-
ing. As for PW, we take a keyword extracted from
the target sentence using the RAKE algorithm (for
PlotCom) or the sorted input phrases in order
(for OutGen) as the intermediate representations
for planning. We implemented these models based
on the codes provided by the original papers.
5.2 Experiment Settings
Understanding Tasks For both tasks, we en-
code the input of each example and then predict
a distribution over all candidates by normalizing
the dot-product values between the representa-
tions of each candidate and the context. We use
the candidate with the maximum probability as
the prediction result. For ClozeT, we represent
a candidate using the hidden state at the end of
it, and we regard the hidden state at the position
of the removed sentence appearing in the orig-
inal text as the context representation. And for
SenPos, we take the hidden state at each candidate
position as the candidate representation and the
hidden state at the end of the removed sentence as
the context representation. When evaluating mT5
and LongLM, we feed the same input into the
encoder and decoder (Lewis et al., 2020) and use
the hidden states of the decoder for prediction in
the above way.
4https://huggingface.co/models.
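A minimal sketch of this scoring scheme is given below (PyTorch; variable names are illustrative, and the hidden states would come from whichever encoder is being evaluated):

```python
import torch
import torch.nn.functional as F

def choose_candidate(context_hidden, candidate_hiddens):
    """Normalize the dot products between the context representation and each
    candidate representation, and return the index of the most probable one.

    context_hidden:    tensor of shape [hidden_size], e.g., the hidden state at
                       the removed-sentence position (ClozeT) or at the end of
                       the removed sentence (SenPos).
    candidate_hiddens: tensor of shape [num_candidates, hidden_size].
    """
    logits = candidate_hiddens @ context_hidden      # [num_candidates]
    probs = F.softmax(logits, dim=-1)
    return int(torch.argmax(probs))
```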
Generation Tasks For PlotCom, we take the
incomplete story of an example as input to gen-
erate the missing sentence. And for OutGen, we
concatenate all phrases in an outline with special
tokens as input to generate a story.
Hyper-Parameters For all models, we set the
batch size to 12, the maximum sequence length
A 512, and the learning rate to 3e-5. We decode
outputs using top-k sampling with k = 40 and
a softmax temperature of 0.7 for the generation
tasks.
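The decoding configuration can be sketched as follows, assuming a HuggingFace-style encoder-decoder checkpoint; the checkpoint path is a placeholder rather than an identifier taken from the paper:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# "path/to/LongLM-checkpoint" is a placeholder for the released weights.
tokenizer = AutoTokenizer.from_pretrained("path/to/LongLM-checkpoint")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/LongLM-checkpoint")

# The input is an incomplete story (PlotCom) or concatenated outline phrases (OutGen).
inputs = tokenizer("<input text>", return_tensors="pt",
                   truncation=True, max_length=512)
outputs = model.generate(
    inputs.input_ids,
    do_sample=True,    # top-k sampling rather than greedy decoding
    top_k=40,
    temperature=0.7,
    max_length=512,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```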
5.3 Automatic Evaluation
Metrics We use accuracy to evaluate the un-
derstanding tasks. As for generation tasks, we
use BLEU-n (B-n) and Distinct-n (D-n) to eval-
uate the n-gram overlap with ground-truth texts
(Papineni et al., 2002) and n-gram generation
diversity (Li et al., 2016), respectively. We set
n = 1, 2 for both generation tasks. Additionally,
we also use the following two metrics to eval-
uate OutGen: (1) Coverage (Cover): It is used
to evaluate the generation controllability, which
is computed as the average Rouge-L recall score
(Lin, 2004) between the generated text and each
input phrase. A higher coverage score indicates
the generated text covers more input phrases. (2)
Order: It is used to measure the gap between the
positional orders of input phrases appearing in the
generated texts and ground-truth texts. Specifi-
cally, we compute the order score as the average
ratio of the number of inversions in the generated
story to the number of all position pairs of any
two phrases. An inversion refers to a position pair
that are out of the ground-truth order. And we use
the position of the longest common subsequence
between a story and a phrase as the position of
the phrase in the story. Because an input phrase
does not always appear in the generated story,
Models | # P | ClozeT | SenPos | Overall

Validation Set
Transformer | 38M | 55.78 | 17.38 | 31.46
BERTbase | 102M | 70.75 | 40.13 | 51.36
RoBERTabase | 102M | 72.11 | 51.63 | 59.14
GPT2base | 102M | 70.07 | 37.78 | 49.62
GPT2†base | 102M | 74.49 | 39.25 | 52.17
mT5base | 582M | 72.45 | 63.25 | 66.62
LongLMsmall | 60M | 73.81 | 48.75 | 57.94
LongLMbase | 223M | 75.17 | 64.38 | 68.34
LongLMlarge | 1B | 79.93 | 70.00 | 73.64
Humans | N/A | 99.00 | 97.00 | 97.73
wi | N/A | 0.37 | 0.63 | 1.00

Test Set
Transformer | 38M | 54.42 | 16.34 | 31.23
BERTbase | 102M | 69.39 | 43.68 | 53.74
RoBERTabase | 102M | 67.69 | 51.35 | 57.74
GPT2base | 102M | 73.13 | 37.25 | 51.28
GPT2†base | 102M | 76.87 | 39.28 | 53.98
mT5base | 582M | 75.17 | 61.41 | 66.79
LongLMsmall | 60M | 77.21 | 53.07 | 62.51
LongLMbase | 223M | 77.55 | 62.34 | 68.29
LongLMlarge | 1B | 80.61 | 69.41 | 73.39
Humans | N/A | 100.00 | 98.00 | 98.78
wi | N/A | 0.39 | 0.61 | 1.00

Table 10: Accuracy (%) on the understanding
tasks in LOT. # P means the number of
parameters. The best performance is in bold
and the second best is underlined. wi is the
metric weight with BERT as the baseline
model when computing the overall score.
we regard all position pairs of such a phrase and
others as inversions.
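A sketch of how the two OutGen-specific metrics could be computed is given below (Python; this is one consistent reading of the description above rather than the official scoring script, and it reports the order score as the fraction of phrase pairs that are not inverted, so the ground truth scores 1.0):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence between two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def coverage(story_tokens, outline):
    """Average Rouge-L recall of the generated story against each input phrase
    (each phrase is a token list)."""
    recalls = [lcs_len(phrase, story_tokens) / len(phrase) for phrase in outline]
    return sum(recalls) / len(recalls)

def order_score(gen_positions, ref_positions):
    """Fraction of phrase-position pairs that keep the ground-truth order.
    Positions are the indices of each phrase in the generated / ground-truth
    story; a phrase missing from the generation has position None, and every
    pair involving it is counted as an inversion."""
    n, inversions, pairs = len(ref_positions), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            pairs += 1
            gi, gj = gen_positions[i], gen_positions[j]
            if gi is None or gj is None:
                inversions += 1
            elif (gi - gj) * (ref_positions[i] - ref_positions[j]) < 0:
                inversions += 1
    return 1 - inversions / pairs if pairs else 1.0
```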
Results Tables 10 E 11 show the results on
the understanding and generation tasks, respec-
tively. To obtain the human performance on
the understanding tasks, we randomly sampled
100 examples from the validation set or test set
and hired three crowd-sourced annotators (na-
tive Chinese speakers) to do these tasks. We
made final decisions among them through ma-
jority voting. All results show an almost perfect
inter-annotator agreement with Fleiss’s κ > 0.85
(Fleiss and Joseph, 1971). For generation tasks, we
regard the scores of ground-truth texts as human
performance.
We summarize the evaluation results as follows:
(1) Pretrained models have significantly bet-
ter performance than non-pretrained models. (2)
LongLMlarge outperforms other baselines substan-
tially on both the understanding and generation
compiti. LongLMbase/LongLMsmall achieves better
overall scores with half fewer parameters than
mT5/GPT2. (3) By comparing GPT2† and GPT2,
we can derive that our pretraining data can effec-
tively improve the ability to model long texts. (4)
LongLMsmall has a better performance than GPT2†
on the understanding tasks, and is comparable
with GPT2† on the generation tasks, suggesting
the benefits of the encoder-decoder framework
and the text infilling task. (5) It is still extremely
challenging for all models to capture the com-
monsense and inter-sentence discourse relations
between events in long texts for tackling the
ClozeT and SenPos tasks. Furthermore, we in-
vestigate how the size of training data influences
the accuracy of BERT for SenPos. The result in
Figura 2 indicates the necessity to develop bet-
ter representations of discourse relations instead
of relying only on increasing the data size. (6)
The results on the generation tasks show that
LongLM does well in generating more word over-
laps with references than similar-sized baselines
for both tasks, and covers more input phrases
and arranges them in correct orders for OutGen.
But LongLM underperforms GPT2-based models
in terms of diversity on PlotCom. (7) Dynami-
cally tracking plot states (cioè., PM) does not bring
significant improvement on the generation tasks
compared with GPT2, suggesting that it may re-
quire modeling the discourse structure explicitly
to tackle the generation tasks. And the superiority
of PW to GPT2 on OutGen further indicates the
benefit of modeling discourse-level features. In
summary, we believe LOT will serve as an effec-
tive evaluation for capturing the commonsense and
discourse relations of long texts beyond the surface
events, and generating coherent and controllable
long-form texts.
5.4 Manual Evaluation
Because automatic metrics may be unreliable
for evaluating NLG (Guan and Huang, 2020),
we conducted a point-wise manual evaluation
to measure the disparity between machines and
humans for the generation tasks in LOT. For
each task, we randomly sampled 100 examples
from the test set and obtained 100 ground-truth
texts and 300 generated texts from three typ-
ical models including GPT2base, mT5base and
Models | # P | PlotCom B-1 | B-2 | D-1 | D-2 | OutGen B-1 | B-2 | D-1 | D-2 | Cover | Order | Overall

Validation Set
ConvS2S | 58M | 18.92 | 4.18 | 6.31 | 32.18 | 29.23 | 10.38 | 3.45 | 21.79 | 14.81 | 25.34 | 11.85
Fusion | 109M | 20.56 | 4.69 | 8.63 | 35.73 | 29.22 | 10.34 | 3.39 | 22.67 | 17.41 | 26.55 | 12.61
GPT2base | 102M | 22.67 | 6.22 | 24.75 | 70.57 | 30.43 | 14.87 | 10.95 | 44.38 | 60.90 | 55.52 | 20.24
GPT2†base | 102M | 22.49 | 5.43 | 26.88 | 74.87 | 35.29 | 18.31 | 13.89 | 51.36 | 64.01 | 57.64 | 21.73
PM | 102M | 22.11 | 5.49 | 23.89 | 69.74 | 31.81 | 14.94 | 12.99 | 50.56 | 62.98 | 56.75 | 20.45
PW | 102M | 22.45 | 5.57 | 25.64 | 71.54 | 35.84 | 18.47 | 11.86 | 47.62 | 64.93 | 57.30 | 21.48
mT5base | 582M | 22.56 | 6.46 | 24.44 | 71.31 | 36.71 | 22.25 | 14.52 | 50.01 | 77.98 | 63.15 | 23.53
LongLMsmall | 60M | 21.78 | 7.11 | 20.17 | 59.63 | 35.03 | 19.17 | 10.80 | 39.70 | 62.53 | 56.53 | 21.02
LongLMbase | 223M | 22.91 | 8.28 | 22.16 | 63.54 | 40.33 | 24.29 | 14.66 | 51.82 | 79.60 | 62.78 | 24.75
LongLMlarge | 1B | 23.76 | 8.70 | 25.93 | 72.18 | 42.79 | 24.91 | 16.13 | 57.71 | 80.46 | 64.36 | 26.12
Truth | N/A | 100.00 | 100.00 | 35.32 | 84.33 | 100.00 | 100.00 | 21.66 | 71.43 | 100.00 | 100.00 | 92.23
wi | N/A | 0.11 | 0.40 | 0.04 | 0.03 | 0.08 | 0.17 | 0.05 | 0.04 | 0.04 | 0.04 | 1.00

Test Set
ConvS2S | 58M | 19.60 | 4.20 | 6.00 | 32.42 | 29.00 | 10.14 | 1.60 | 13.95 | 15.45 | 25.77 | 11.27
Fusion | 109M | 20.52 | 4.90 | 8.43 | 35.09 | 28.77 | 10.22 | 1.47 | 14.12 | 17.10 | 26.36 | 11.91
GPT2base | 102M | 22.94 | 5.76 | 24.69 | 70.30 | 30.17 | 14.91 | 7.62 | 36.87 | 60.87 | 55.90 | 19.21
GPT2†base | 102M | 22.45 | 5.38 | 26.08 | 73.26 | 35.79 | 18.68 | 9.89 | 43.52 | 64.43 | 56.96 | 20.76
PM | 102M | 22.87 | 5.75 | 24.08 | 71.19 | 31.85 | 15.24 | 8.62 | 41.32 | 63.15 | 57.21 | 19.77
PW | 102M | 22.76 | 6.07 | 25.55 | 70.72 | 35.12 | 17.96 | 8.68 | 40.17 | 63.70 | 55.17 | 20.52
mT5base | 582M | 22.52 | 6.48 | 24.33 | 70.53 | 36.33 | 22.07 | 10.90 | 43.65 | 78.66 | 63.79 | 22.59
LongLMsmall | 60M | 22.05 | 7.45 | 19.93 | 59.79 | 34.48 | 19.17 | 7.93 | 34.25 | 63.75 | 57.64 | 20.48
LongLMbase | 223M | 23.28 | 8.58 | 21.37 | 62.43 | 40.25 | 24.15 | 10.75 | 44.40 | 79.88 | 63.67 | 23.93
LongLMlarge | 1B | 24.20 | 9.06 | 25.75 | 71.08 | 42.10 | 24.77 | 12.04 | 50.29 | 81.48 | 64.82 | 25.29
Truth | N/A | 100.00 | 100.00 | 35.01 | 84.56 | 100.00 | 100.00 | 15.71 | 63.46 | 100.00 | 100.00 | 91.64
wi | N/A | 0.10 | 0.42 | 0.03 | 0.03 | 0.08 | 0.16 | 0.05 | 0.04 | 0.04 | 0.04 | 1.00
Tavolo 11: Evaluation results on the generation tasks in LOT. # P means the number of parameters. IL
best performance is in bold and the second best is underlined. wi is the metric weight with GPT2base
as the baseline model when computing the overall score.
LongLMlarge. For each text along with the in-
put, we hired three crowd-sourced workers to
judge its quality with a binary score (1 for good,
E 0 otherwise) in terms of three aspects: (1)
grammaticality (intra-sentence grammar quality
of generated texts), (2) coherence (causal and
temporal dependencies within generated texts),
E (3) relatedness to inputs (reasonable logical
connections to the input context for PlotCom; E
reasonable utilization of input phrases for Out-
Gen). These aspects are independently evaluated.
We made final decisions among three annotators
through majority voting. We show the annotation
instructions in the appendix.
Tavolo 12 shows the evaluation results. For both
tasks, LongLM outperforms GPT2 and mT5 sig-
nificantly in all aspects (P < 0.05, sign test).
However, it is difficult for all models to generate
a logical completion for PlotCom (relatedness
score < 0.1), showing their poor ability to cap-
ture commonsense and inter-sentence relations.
l
a
c
_
a
_
0
0
4
6
9
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
Figure 2: Accuracy of BERT for SenPos as the size of
training data increases.
And the big gap between LongLM and humans
also proves both tasks challenging to existing
generation models. We also observe the positive
correlation between the manual evaluation and
automatic evaluation (Table 11), suggesting that
it may be acceptable to use automatic evaluation
to compare and improve models on the generation
tasks in LOT.
5.5 Bias Investigation
It is essential to investigate potential biases of a
dataset, which may leak information about target
Models | Gram (κ) | Cohe (κ) | Relat (κ)

Task: PlotCom
GPT2base | 0.84 (0.49) | 0.41 (0.71) | 0.01 (0.50)
mT5base | 0.85 (0.24) | 0.53 (0.65) | 0.01 (0.50)
LongLMlarge | 0.95 (0.48) | 0.82 (0.64) | 0.09 (0.69)
Truth | 1.00 (1.00) | 1.00 (1.00) | 0.99 (0.49)

Task: OutGen
GPT2base | 0.54 (0.52) | 0.18 (0.52) | 0.39 (0.43)
mT5base | 0.53 (0.26) | 0.08 (0.46) | 0.49 (0.38)
LongLMlarge | 0.81 (0.23) | 0.37 (0.43) | 0.62 (0.45)
Truth | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00)

Table 12: Manual evaluation results for PlotCom
and OutGen in terms of grammaticality (Gram),
coherence (Cohe), and relatedness (Relat). The
best performance is highlighted in bold. All
results show a fair inter-annotator agreement
with Fleiss’ κ > 0.2.
labels and enable models to easily use short-
cuts to handle complex inputs without actually
mastering the focused abilities (Ribeiro et al.,
2020). Therefore, we experimented with the fol-
lowing baselines to inspect the ClozeT and SenPos
datasets: (1) Random: It chooses a candidate ran-
domly. (2) Majority: It chooses the candidate
with an index that is most frequently selected in
the training set. (3) Length: For ClozeT, it chooses
the candidate that contains more words; And for
SenPos, it chooses the position of which the adja-
cent sentences have the closest number of words to
the removed sentence. (4) BLEU-n: For ClozeT,
it chooses the candidate with a higher BLEU-n
score (Papineni et al., 2002) with the context;
And for SenPos, it chooses the position of which
the adjacent sentences have the largest average
BLEU-n score with the removed sentence (n =
1,2). (5) Sentiment: For ClozeT, it chooses the
candidate with a higher sentiment score computed
by an off-the-shelf Chinese sentiment analyzer;5
for SenPos, it chooses the position where the aver-
age sentiment score of its adjacent two sentences
is the closest to the score of the removed sentence.
(6) Discourse Markers: For ClozeT, it chooses
the candidate where its adjacent sentences contain
a discourse marker matching with it. For example,
if ‘‘because’’ occurs in the last sentence before
the position of the candidates, this baseline will
choose the candidate that contains ‘‘so’’.6 If there
5https://github.com/isnowfy/snownlp.
6Different from English, paired discourse markers like
‘‘because’’-‘‘so’’ should be used together in Chinese.
Baselines | ClozeT | SenPos
Random | 50.00 | 16.03
Majority | 52.72 | 16.24
Length | 52.72 | 16.45
BLEU-1/2 | 46.94/48.98 | 14.14/14.95
Sentiment | 50.34 | 16.49
Discourse Markers | 45.92 | 9.15
BERT w/o Context | 57.82 | 18.08
BERT w/o Long | 62.24 | 19.00
BERT | 69.39 | 43.68

Table 13: Accuracy (%) of different baselines
on the test sets of ClozeT and SenPos for bias
investigation. We use the results of BERT as
a reference.
do not exist such paired markers in an exam-
ple or there are multiple eligible candidates, this
baseline will randomly choose one. The setting of
this baseline for SenPos is similar to ClozeT. We
manually define 24 marker pairs for this baseline.
(7) BERT w/o Context: We fine-tuned BERT to
directly choose without taking the context as in-
put (Schwartz et al., 2017). (8) BERT w/o Long:
It is used to study whether solving these tasks
requires modeling long-range dependencies. For
ClozeT, we fine-tuned BERT to choose with only
the adjacent sentences of the removed sentence
as input. And for SenPos, we encoded each posi-
tion and its adjacent sentences respectively using
BERT and then took the hidden states at these
positions for prediction. These baselines cover
different levels of features ranging from the to-
ken level (e.g., Length), the sentence level (e.g.,
Sentiment), to the discourse level (e.g., Discourse
Markers, BERT w/o Context). We believe that
these baselines will provide a comprehensive in-
spection for the potential biases of our datasets.
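As an illustration, two of the shallow baselines for ClozeT can be sketched as follows (Python; the BLEU here is simplified to plain n-gram precision rather than the exact BLEU-n used in the paper):

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, context, n=1):
    """A crude n-gram precision of the candidate against the context."""
    cand, ctx = ngrams(candidate, n), ngrams(context, n)
    matched = sum(min(c, ctx[g]) for g, c in cand.items())
    return matched / max(1, sum(cand.values()))

def length_baseline(cand_a, cand_b):
    """Length baseline for ClozeT: pick the candidate with more words."""
    return 0 if len(cand_a) >= len(cand_b) else 1

def bleu_baseline(cand_a, cand_b, context, n=1):
    """BLEU-n-style baseline for ClozeT: pick the candidate with the higher
    n-gram overlap with the context."""
    score_a = ngram_precision(cand_a, context, n)
    score_b = ngram_precision(cand_b, context, n)
    return 0 if score_a >= score_b else 1
```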
As shown in Table 13, both tasks can not be
trivially solved by these baselines, suggesting that
the datasets may be free of biases in terms of
the above features. Therefore, we believe that the
tasks can focus on testing the ability of models to
capture long-range commonsense and discourse
relations.
5.6 Memorization Investigation
Overlap between training and test data may re-
sult in an over-reporting of the generalization
performance of machines. Therefore, it is neces-
sary to investigate how many test data also show
up in the training data. To this end, we follow
Tasks | ClozeT | SenPos | PlotCom | OutGen

Overlap with the Training Sets
Percent | 0.00% | 0.62% | 0.02% | 0.00%
# 8-grams | 0 | 1,040 | 6 | 2
# Exam | 0 | 45 | 3 | 2
# Exam>10% | 0 | 17 | 0 | 0
Max Percent | 0.00% | 60.98% | 2.53% | 1.00%

Overlap with the Pretraining Data
Percent | 0.67% | 4.68% | 0.38% | 1.22%
# 8-grams | 172 | 7,844 | 151 | 1,212
# Exam | 83 | 486 | 88 | 161
# Exam>10% | 4 | 71 | 1 | 26
Max Percent | 47.22% | 60.96% | 30.77% | 41.18%

Table 14: Overlapping analysis for the test sets
of the four tasks with respect to their own train-
ing sets or the pretraining data of LongLM. We
compute the following statistics: (1) Percent:
the percentage of 8-grams from the test set that
are also in the training sets or the pretraining
data; (2) # 8-grams: the number of overlapped
8-grams; (3) # Exam: the number of examples
that contain at least one overlapped 8-gram;
(4) # Exam>10%: the number of examples that
have more than 10% overlapped 8-grams; (5)
Max Percent: the maximum percentage of
overlapped 8-grams from an example.
Radford et al. (2019) to measure the overlap be-
tween two datasets by calculating the percentage
of 8-grams from one that are also in the other. Noi
use the jieba tokenizer for tokenization.
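A sketch of this measurement is given below (Python with jieba tokenization, as elsewhere in LOT; the exact thresholding in the released analysis code may differ):

```python
import jieba

def eight_grams(text, n=8):
    tokens = list(jieba.cut(text))
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_stats(test_texts, train_texts, n=8, threshold=0.10):
    """Return (percentage of test-set 8-grams also found in the training data,
    number of test examples with more than `threshold` overlapped 8-grams)."""
    train_grams = set()
    for text in train_texts:
        train_grams |= eight_grams(text, n)
    total, overlapped, heavy_examples = 0, 0, 0
    for text in test_texts:
        grams = eight_grams(text, n)
        hits = len(grams & train_grams)
        total += len(grams)
        overlapped += hits
        if grams and hits / len(grams) > threshold:
            heavy_examples += 1
    return overlapped / max(1, total), heavy_examples
```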
Tavolo 14 shows the overlapping analysis for
test sets of the four tasks in LOT. We can see
that all test sets have less than 1% overlap with
their own training sets. Notably, there are 17 test
examples of SenPos that contain more than 10%
overlapped 8-grams with the training set. This
is because a training example and a test example
may come from the same story, and thus they share
similar information (e.g., characters, locations). A
test example contains at most 60.98% overlapped
8-grams, suggesting that the training set and test
set do not include exactly the same example.
As for the pretraining data of LongLM, the test
sets of ClozeT and PlotCom still have less than
1% overlap. However, there are dozens of test
examples in SenPos and OutGen that contain more
than 10% overlapped 8-grams. Through manual
inspection of the overlaps, we found that they
mainly come from idioms, proverbs and classic
SenPos | Total | w/o Overlap (Training Set) | Δ
# Exam | 863 | 846 | N/A
mT5base | 61.41% | 61.82% | +0.41%
LongLMlarge | 69.41% | 69.50% | +0.09%

SenPos | Total | w/o Overlap (Pretraining Data) | Δ
# Exam | 863 | 792 | N/A
mT5base | 61.41% | 61.24% | –0.17%
LongLMlarge | 69.41% | 69.32% | –0.09%

Table 15: Accuracy on the test set of SenPos.
Total means using the whole test set while
w/o Overlap means excluding the examples
that have more than 10% overlapped 8-grams
with the training set or pretraining data from
the test set. # Exam is the number of exam-
ples. Δ denotes the change of accuracy when
excluding the overlapping data compared with
using the total test set.
OutGen | Total | w/o Overlap (Pretraining Data) | Δ
# Exam | 729 | 703 | N/A
mT5base | 36.33 | 36.45 | +0.12
LongLMlarge | 42.10 | 42.22 | +0.12

Table 16: BLEU-1 score on the test set of
OutGen. Other notations are the same as
Table 15.
fairy tales, which may be part of some novels in
the pretraining data.
To investigate how the overlapping data influ-
ence the measurement of models’ performance,
we re-evaluated LongLMlarge on the test sets of
SenPos and OutGen with exclusion of the exam-
ples that have more than 10% overlapped 8-grams
with the training sets or pretraining data. We also
used mT5base as a baseline in the same setting
of LongLM. The results for SenPos and Out-
Gen are shown in Tables 15 and 16, respectively.
The change of accuracy or BLEU-1 score is very
marginal for both mT5 and LongLM when ex-
cluding the overlapping data, suggesting that the
superior performance of LongLM is rarely at-
tributable to the memorization of training data.
Therefore, we believe that it is fair to compare
LongLM and other models on these tasks.
6 Conclusions
We present LOT, a story-centric benchmark
for Chinese long text understanding and gen-
eration. LOT includes two story understanding
tasks and two story generation tasks, which
comprehensively investigate the abilities of com-
monsense reasoning, controllable generation, E
modeling inter-sentence relations and the global
discourse structures. We provide standard datasets
for the four tasks, which are constructed based
on human-written stories processed by auto-
matic and manual annotation. Moreover, we
release a new Chinese long text pretraining model
LongLM, which outperforms strong baseline mod-
els substantially on both the understanding and
generation tasks in LOT. The LOT benchmark
and the pretraining model will encourage further
research on Chinese long text modeling.
Acknowledgments
This work was supported by the National Science
Foundation for Distinguished Young Scholars (no.
62125604) and the NSFC projects (Key project
no. 61936010 and regular project no. 61876096).
This work was also supported by the Guoqiang
Institute of Tsinghua University, with grant nos.
2019GQG1 and 2020GQG0005. We would also
like to thank our action editor, Dipanjan Das,
and the anonymous reviewers for their invaluable
suggestions and feedback.
References
Apoorv Agarwal, Anup Kotalwar, and Owen
Rambow. 2013. Automatic extraction of so-
cial networks from literary text: A case study
on Alice in Wonderland. In Proceedings of the
Sixth International Joint Conference on Natural
Language Processing, pages 1202–1208.
Nader Akoury, Shufan Wang, Josh Whiting, Stephen Hood, Nanyun Peng, and Mohit Iyyer. 2020. STORIUM: A dataset and evaluation platform for machine-in-the-loop story generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6470–6484, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.525
David Bamman, Brendan O’Connor, and Noah A. Smith. 2013. Learning latent personas of film characters. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 352–361.
Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen-tau Yih, and Yejin Choi. 2019. Abductive commonsense reasoning. In International Conference on Learning Representations.
Faeze Brahman and Snigdha Chaturvedi. 2020. Modeling protagonist emotions for emotion-aware storytelling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5277–5294, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.426
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.
Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL-08: HLT, pages 789–797.
Snigdha Chaturvedi, Mohit Iyyer, and Hal Daume III. 2017. Unsupervised learning of evolving relationships between literary characters. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.
Snigdha Chaturvedi, Shashank Srivastava, Hal Daume III, and Chris Dyer. 2016. Modeling evolving relationships between characters in literary novels. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.
Mingda Chen, Zewei Chu, and Kevin Gimpel.
2019. Evaluation benchmarks and learning
criteria for discourse-aware sentence representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 649–662. https://doi.org/10.18653/v1/D19-1060
Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. 2020. Revisiting pre-trained models for Chinese natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 657–668, Online. Association for Computational Linguistics.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Angela Fan, Mike Lewis, and Yann Dauphin.
2018. Hierarchical neural story generation. In
Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 889–898.
Mark Alan Finlayson. 2012. Learning narrative
structure from annotated folktales. Ph.D. thesis,
Massachusetts Institute of Technology.
Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382. https://doi.org/10.1037/h0031619
Jonas Gehring, Michael Auli, David Grangier,
Denis Yarats, and Yann N. Dauphin. 2017.
Convolutional sequence to sequence learn-
ing. In International Conference on Machine
Apprendimento, pages 1243–1252. PMLR.
Sebastian Gehrmann, Tosin Adewumi, Karmanya
Aggarwal, Pawan Sasanka Ammanamanchi,
Aremu Anuoluwapo, Antoine Bosselut, Khyathi
Raghavi Chandu, Miruna Clinciu, Dipanjan
Das, Kaustubh D. Dhole, Wanyu Du, Esin
Durmus, Ondřej Dušek, Chris Emezue,
Varun Gangal, Cristina Garbacea, Tatsunori
Hashimoto, Yufang Hou, Yacine Jernite,
Harsh Jhamtani, Yangfeng Ji, Shailza Jolly,
Dhruv Kumar, Faisal Ladhak, Aman Madaan,
Mounica Maddela, Khyati Mahajan, Saad
Mahamood, Bodhisattwa Prasad Majumder, Pedro
Henrique Martins, Angelina McMillan-Major,
Simon Mille, Emiel van Miltenburg, Moin
Nadeem, Shashi Narayan, Vitaly Nikolaev,
Rubungo Andre Niyongabo, Salomey Osei,
Ankur Parikh, Laura Perez Beltrachini, Niranjan
Ramesh Rao, Vikas Raunak, Juan Diego
Rodriguez, Sashank Santhanam, João Sedoc,
Thibault Sellam, Samira Shaikh, Anastasia
Shimorina, Marco Antonio Sobrevilla Cabezudo,
Hendrik Strobelt, Nishant Subramani, Wei Xu,
Diyi Yang, Akhila Yerukola, and Jiawei Zhou.
2021. The GEM benchmark: Natural language
generation, its evaluation and metrics. arXiv
preprint arXiv:2102.01672. https://doi
.org/10.18653/v1/2021.gem-1.10
Ian Goodfellow, Jean Pouget-Abadie, Mehdi
Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio.
2014. Generative adversarial nets. In Advances
in Neural Information Processing Systems,
pages 2672–2680.
Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, and Minlie Huang. 2020. A knowledge-enhanced pretraining model for commonsense story generation. Transactions of the Association for Computational Linguistics, 8:93–108. https://doi.org/10.1162/tacl_a_00302
Jian Guan and Minlie Huang. 2020. UNION: An unreferenced metric for evaluating open-ended story generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16–20, 2020,
pages 9157–9166. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.736
Jian Guan, Yansen Wang, and Minlie Huang.
2019. Story ending generation with incremental
encoding and commonsense knowledge. In
Proceedings of the AAAI Conference on Artifi-
cial Intelligence, volume 33, pages 6473–6480.
https://doi.org/10.1609/aaai.v33i01
.33016473
Jian Guan, Zhexin Zhang, Zhuoer Feng, Zitao Liu, Wenbiao Ding, Xiaoxi Mao, Changjie Fan, and Minlie Huang. 2021. OpenMEVA: A benchmark for evaluating open-ended story generation metrics. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6394–6407, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.500
Xiangzhe Kong, Jialiang Huang, Ziquan Tung, Jian Guan, and Minlie Huang. 2021. Stylized story generation with style-guided planning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2430–2436, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-acl.215
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71. https://doi.org/10.18653/v1/D18-2012
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5–10, 2020, pages 7871–7880. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.703
Boyang Li, Stephen Lee-Urban, George Johnston, and Mark Riedl. 2013. Story generation with crowdsourced plot graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 27.
Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and William B. Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi,
Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu,
Linjun Shou, Ming Gong, Pengcheng Wang,
Jiusheng Chen, Daxin Jiang, Jiancheng Lv,
Ruofei Zhang, Winnie Wu, Ming Zhou, and
Nan Duan. 2020. GLGE: A new general lan-
guage generation evaluation benchmark. arXiv
preprint arXiv:2011.11928.
Annie Louis and Charles Sutton. 2018. Deep dungeons and dragons: Learning character-action interactions from role-playing game transcripts. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 708–713. https://doi.org/10.18653/v1/N18-2111
Stephen Merity, Caiming Xiong, James Bradbury,
and Richard Socher. 2016. Pointer sentinel mix-
ture models. arXiv preprint arXiv:1609.07843.
Nasrin Mostafazadeh, Nathanael Chambers,
Xiaodong He, Devi Parikh, Dhruv Batra, Lucy
Vanderwende, Pushmeet Kohli, and James
Allen. 2016. A corpus and cloze evaluation for
deeper understanding of commonsense stories.
In Proceedings of NAACL-HLT, pages 839–849.
https://doi.org/10.18653/v1/N16
-1098
Kishore Papineni, Salim Roukos, Todd Ward,
and Wei-Jing Zhu. 2002. BLEU: A method for
automatic evaluation of machine translation.
In Proceedings of the 40th annual meeting of
the Association for Computational Linguistics,
pages 311–318.
Debjit Paul and Anette Frank. 2021. COINS: Dynamically generating COntextualized inference rules for narrative story completion. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5086–5099, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.395
Alec Radford, Karthik Narasimhan, Tim Salimans,
and Ilya Sutskever. 2018. Improving language
understanding with unsupervised learning.
Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners. OpenAI blog, 1(8):9.
Jack W. Rae, Anna Potapenko, Siddhant M.
Jayakumar, Chloe Hillier, and Timothy P.
Lillicrap. 2020. Compressive transformers for
long-range sequence modelling. In Interna-
tional Conference on Learning Representa-
zioni.
Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. Journal
of Machine Learning Research, 21:1–67.
Hannah Rashkin, Asli Celikyilmaz, Yejin Choi, and Jianfeng Gao. 2020. PlotMachines: Outline-conditioned generation with dynamic plot state tracking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4274–4295. https://doi.org/10.18653/v1/2020.emnlp-main.349
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.442
Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomás Kociský, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Conference Track Proceedings. http://arxiv.org/abs/1509.06664.
Stuart Rose, Dave Engel, Nick Cramer, E
Wendy Cowley. 2010. Automatic keyword ex-
traction from individual documents. Text Mining:
Applications and Theory, 1:1–20. https://
doi.org/10.1002/9780470689646.ch1
Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. 2020. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4938–4947. https://doi.org/10.1109/CVPR42600.2020.00499
Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila
Zilles, Yejin Choi, and Noah A. Smith. 2017.
The effect of different writing tasks on linguistic
style: A case study of the ROC story cloze
task. In Proceedings of the 21st Conference
on Computational Natural Language Learning
(CoNLL 2017), pages 15–25. https://doi
.org/10.18653/v1/K17-1004
Rishi Sharma, James Allen, Omid Bakhshandeh, and Nasrin Mostafazadeh. 2018. Tackling the story ending biases in the story cloze test. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 752–757. https://doi.org/10.18653/v1/P18-2119
Yi Tay, Mostafa Dehghani, Samira Abnar,
Yikang Shen, Dara Bahri, Philip Pham, Jinfeng
Rao, Liu Yang, Sebastian Ruder, and Donald
Metzler. 2020. Long range arena: A bench-
mark for efficient transformers. In International
Conference on Learning Representations.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Alex Wang, Amanpreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel R. Bowman.
2019. GLUE: A multi-task benchmark and
analysis platform for natural
language un-
derstanding. In International Conference on
Learning Representations. https://doi
.org/10.18653/v1/W18-5446
Tianming Wang and Xiaojun Wan. 2019.
T-CVAE: Transformer-based conditioned vari-
ational autoencoder for story completion. In
Proceedings of the Twenty-Eighth International
Joint Conference on Artificial Intelligence,
IJCAI 2019, Macao, China, August 10–16,
2019, pages 5233–5239. ijcai.org.
Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, et al. 2020a. CLUE: A Chinese language understanding evaluation benchmark. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4762–4772.
Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Raul Puri, Pascale Fung, Anima Anandkumar, and Bryan Catanzaro. 2020b. MEGATRON-CNTRL: Controllable story generation with external knowledge using large-scale language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16–20, 2020, pages 2831–2845. Association for Computational Linguistics.
Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498.
Lili Yao, Nanyun Peng, Ralph Weischedel,
Kevin Knight, Dongyan Zhao, and Rui Yan.
2019. Plan-and-write: Towards better automatic
storytelling. In Proceedings of the AAAI Con-
ference on Artificial Intelligence, volume 33,
pages 7378–7385. https://doi.org/10
.1609/aaai.v33i01.33017378
Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke,
Yuxian Gu, Deming Ye, Yujia Qin, Yusheng
Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi
Wang, Yanan Zheng, Guoyang Zeng, Huanqi
Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun,
Zhiyuan Liu, Minlie Huang, Wentao Han, Jie
Tang, Juanzi Li, Xiaoyan Zhu, and Maosong
Sun. 2020. CPM: A large-scale generative
Chinese pre-trained language model. arXiv
preprint arXiv:2012.00413. https://doi.org
/10.1016/j.aiopen.2021.07.001
Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–664.
Zhe Zhao, Hui Chen, Jinbin Zhang, Xin Zhao, Tao Liu, Wei Lu, Xi Chen, Haotang Deng, Qi Ju, and Xiaoyong Du. 2019. UER: An open-source toolkit for pre-training models. EMNLP-IJCNLP 2019, page 241. https://doi.org/10.18653/v1/D19-3041