LOT: A Story-Centric Benchmark for Evaluating Chinese Long Text
Understanding and Generation

Jian Guan1, Zhuoer Feng1, Yamei Chen1, Ruilin He2,
Xiaoxi Mao3, Changjie Fan3, Minlie Huang1∗

1The CoAI group, DCST, China; 2Huawei
Technologies Co., Ltd., China; 3Netease Fuxi AI Lab., China
{j-guan19,fze17}@mails.tsinghua.edu.cn, chenziym4132013@163.com,
{maoxiaoxi,fanchangjie}@corp.netease.com,
heruilin@huawei.com,

aihuang@tsinghua.edu.cn

Abstract

Standard multi-task benchmarks are essen-
tial for developing pretraining models that
can generalize to various downstream tasks.
Existing benchmarks for natural language
processing (NLP) usually focus only on under-
standing or generating short texts. However,
long text modeling requires many distinct abil-
ities in contrast to short texts, such as the
modeling of long-range discourse and com-
monsense relations, and the coherence and
controllability of generation. The lack of stan-
dardized benchmarks makes it difficult to
assess these abilities of a model and fairly
compare different models, especially Chinese
models. Therefore, we propose a story-centric
benchmark named LOT for evaluating Chi-
nese long text modeling, which aggregates
two understanding tasks and two generation
tasks. We construct new datasets for these
tasks based on human-written Chinese sto-
ries with hundreds of words. Furthermore,
we release an encoder-decoder-based Chinese
long text pretraining model named LongLM
with up to 1 billion parameters. We pre-
train LongLM on 120G Chinese novels with
two generative tasks including text infilling
and conditional continuation. Extensive ex-
periments show that LongLM outperforms
similar-sized pretraining models substantially
on both the understanding and generation tasks
in LOT.

1 Introduction

Pretrained language models have achieved sig-
nificant advances in various natural language
understanding (NLU) and generation (NLG) tasks
(Devlin et al., 2019; Radford et al., 2019). Standard
benchmarks such as GLUE (Wang et al., 2019)
further boost the improvement and fast iteration
of pretrained models. Popular benchmarks usually
aggregate multiple tasks to spur the progress of
generalizable models. But these benchmarks fo-
cus mainly on understanding or generating short
texts. For example, the GLUE tasks take at most
two sentences as input. And most tasks in NLG
benchmarks such as GLGE (Liu et al., 2020) and
GEM (Gehrmann et al., 2021) require generating
only several words (e.g., dialogue generation). Al-
though there have been many models pretrained
on long texts such as GPT3 (Brown et al., 2020)
and CPM (Zhang et al., 2020), the lack of bench-
mark datasets makes it difficult to fully assess and
compare their abilities of long text modeling.

∗ Corresponding author.

In this paper, we present LOT, a benchmark for
evaluating Chinese LOng Text understanding and
generation. As shown in Table 1, modeling long
texts requires many distinct abilities compared to
short texts, including (1) commonsense reasoning
regarding characters’ reaction and intention, and
knowledge about physical objects (e.g., ‘‘river’’)
and abstract concepts (e.g., ‘‘irony’’); (2) model-
ing discourse-level features such as inter-sentence
relations (e.g., causality) and global discourse
structures (e.g., the order of events); and (3) the
generation coherence and controllability, which
require both maintaining a coherent plot and
adhering to controllable attributes (e.g., topics).
Accordingly, LOT contains two understanding
tasks and two generation tasks regarding the above
abilities. We construct new datasets for these tasks
based on various kinds of stories such as fables and
fairy tales collected from public web resources,

Transactions of the Association for Computational Linguistics, vol. 10, pp. 434–451, 2022. https://doi.org/10.1162/tacl_a_00469
Action Editor: Dipanjan Das. Submission batch: 10/2021; Revision batch: 12/2021; Published 4/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Effendi’s son is eccentric, always behaving
opposed to what Effendi has ordered him
to do. Familiar to his son’s temper, Effendi
usually communicates using irony. One day,
the father and son were blocked by a river after
purchasing flour from a mill. And while they
were crossing the river, one bag on the donkey’s
back lost its weight and leaned. Effendi told
his son with irony:‘‘My boy! drop the sack
into the river!’’ The son heard the words and
thought:‘‘I have been opposed to my father for
so many years. For this only time, I have to
obey him.’’ Therefore, he followed Effendi’s
words and indeed pushed the sack into the
river. ‘‘My boy! What are you doing?’’ Effendi
shouted in anger.

Table 1: A long text example. The concepts and
events concerning commonsense and discourse
relations are highlighted in bold.

considering that stories usually contain abundant
commonsense and discourse relations. All these
tasks require processing stories with hundreds of
words. Note that LOT does not involve extra-long
texts with thousands of words since the compli-
cated linguistic phenomena in these texts make
it hard to test individual abilities and guide the
improvement of generation models.

Furthermore, we release LongLM, a Chinese
Long text pretraining Language Model. LongLM
is a Transformer-based model with an encoder-
decoder architecture. LongLM has three different
versions ranging from 60 million to 1 billion pa-
rameters. We pretrain LongLM on 120G Chinese
novels with two generative tasks, including text
infilling (Lewis et al., 2020) and conditional con-
tinuation (Radford et al., 2018). The pretraining
data do not include other types of texts (e.g., news,
Wiki-texts) since we mainly focus on common-
sense and discourse relations within general long
texts instead of factual and technical knowledge.
To the best of our knowledge, LongLM is the
first pretraining model of the same size scale that
focuses on modeling long-form stories. Extensive
experiments on LOT show that LongLM outper-
forms strong baselines substantially on both the
understanding and generation tasks. However, we
also observe that LongLM is still far behind hu-
man performance; closing this gap requires better
semantic representations of events and deeper
modeling of the commonsense and discourse
relations between them. We summarize the main
contributions of this paper as follows:
I. We propose a new story-centric benchmark
LOT for evaluating Chinese long text understand-
ing and generation. LOT consists of four tasks
for testing the fundamental abilities to model long
texts. We also present new datasets for these tasks.
II. We release a new Chinese pretraining model
named LongLM. Experiment results demonstrate
the strong performance of LongLM on LOT,
but there still exists considerable room for
improvement.1

2 Related Work

NLP Benchmarks Recently, there have been a
lot of multi-task benchmarks proposed to drive
the progress of generalizable models. The bench-
marks usually aggregate multiple model-agnostic
tasks under a unified framework, enabling re-
searchers to fairly compare different models.
SentEval (Conneau and Kiela, 2018) gathered
multiple classification tasks involving either one
or two sentences as inputs to evaluate sentence
representations. DiscoEval (Chen et al., 2019)
extended these tasks to the discourse level regard-
ing inter-sentence relations. GLUE (Wang et al.,
2019) included more diverse tasks such as natu-
ral language inference (Rocktäschel et al., 2016).
Sarlin et al. (2020) proposed SuperGLUE as a
more challenging counterpart of GLUE by in-
troducing multi-sentence tasks. But the additional
tasks are limited to the formats of coreference
resolution and question answering. In addition
to these English benchmarks, many benchmarks
were proposed to evaluate NLU for other lan-
guages, such as CLUE (Xu et al., 2020a) for
Chinese. Moreover, GLGE (Liu et al., 2020) and
GEM (Gehrmann et al., 2021) were proposed for
evaluating NLG models across diversified gen-
eration tasks such as text summarization and
personalized dialogue. However, there is no
benchmark designed specifically for long text
modeling, especially Chinese. Besides, the
above benchmarks were originally designed to
cover as diverse task formats as possible. In con-
trast, we design the LOT tasks with the guidance

1The LOT benchmark, the pretraining resources, and the
appendix are available at https://github.com/thu-coai/LOT-LongLM.

of necessary abilities for long text modeling as
suggested by Ribeiro et al. (2020), making it eas-
ier to figure out where models are failing, and how
to improve them.

Long Text Datasets Previous studies in the field
of long text modeling have frequently focused
on the ROCStories (Mostafazadeh et al., 2016)
and WritingPrompts (Fan et al., 2018) datasets.
ROCStories contains 100k artificial five-sentence
stories, while WritingPrompts consists of 300K
pairs of prompts and stories with hundreds
of words. Recent works collected stories with
thousands of words to model longer-range de-
pendencies, such as WikiText-103 (Merity et al.,
2016), roleplayerguild (Louis and Sutton, 2018),
PG-19 (Rae et al., 2020), STORIUM (Akoury
et al., 2020), and Long-Range Arena (Tay et al.,
2020). However, these datasets are written in En-
glish. LOT will drive the development of Chinese
language models.

Moreover, LOT does not include datasets of
extra-long texts, like PG-19, for the following two
reasons: (1) Extra-long texts are far beyond the
scope of current machine learning models because
the discourse-level linguistic phenomena are en-
tangled and complicated in these texts. Therefore,
extra-long texts usually serve for computing per-
plexity of language models (Dai et al., 2019) but
hardly provide fine-grained guidance for improv-
ing model designs. (2) LOT aims not to spur
research on building fuller connections across to-
kens within an extra-long sequence, but to drive
the progress of machines in the aforementioned
fundamental abilities for long text modeling.

Story Understanding and Generation LOT is
centered on fundamental abilities for long text
modeling and thus includes four story understand-
ing and generation tasks concerning common-
sense and discourse relations. Recent studies
have proposed various tasks to evaluate story
understanding and generation. First, story end-
ing selection (Mostafazadeh et al., 2016), story
ending generation (Guan et al., 2019), and story
completion (Wang and Wan, 2019) focused on
the commonsense reasoning ability on inter-event
causal and temporal relations. Second, Chen et al.
(2019) evaluated the ability to model discourse re-
lations by predicting the position of a sentence or a
paragraph in a text. Third, some works focused on
the coherence of story generation conditioned on
short prompts (Fan et al., 2018), titles (Yao et al.,
2019) and beginnings (Guan et al., 2020). Fourth,
some studies centered on controllability, that is,
the imposing of controllable attributes on story
generation such as keywords (Xu et al., 2020b),
emotional trajectories (Brahman and Chaturvedi,
2020), outlines (Rashkin et al., 2020), and styles
(Kong et al., 2021). LOT is a comprehensive
benchmark to test the above abilities for Chinese
long text modeling.

On the other hand, LOT does not involve
those tasks that require learning more partic-
ular features of stories, such as event chains
(Chambers and Jurafsky, 2008), character types
(Bamman et al., 2013), inter-character relations
(Chaturvedi et al., 2016, 2017), social networks
(Agarwal et al., 2013), and abstractive structures
(Finlayson, 2012). Non-neural story generation
models usually retrieved events from a knowl-
edge base with pre-specified semantic relations
based on handcrafted rules (Li et al., 2013), which
are costly and lack generalization. In this paper,
we focus mainly on evaluating neural models for
story understanding and generation.

3 LOT Benchmark

We design LOT as an aggregation of two un-
derstanding tasks including Cloze Test (ClozeT)
and Sentence Position Prediction (SenPos), and
two generation tasks including Plot Completion
(PlotCom) and Outline-conditioned Genera-
tion (OutGen). We show the task descriptions and
data statistics in Tables 2 and 3, respectively. We
use the jieba tokenizer2 for word tokenization.
We design LOT based on the following prin-
ciples: (1) Task Diversity: The tasks vary in
task formats, types and lengths of inputs and
outputs, and focused abilities, making LOT a
comprehensive framework for evaluating the gen-
eralization of models. (2) Task Difficulty: The
tasks take hundreds of words as inputs or outputs,
and do not involve domain-specific knowledge
about science, films, and so forth. Therefore, they
are beyond the scope of current state-of-the-art
models, but are solvable by most Chinese native
speakers. (3) Task Formulation: The tasks have
been well formulated in prior studies and agreed
to be challenging but meaningful. We introduce
new Chinese datasets for these tasks, which are
constructed to focus more specifically on testing

2https://github.com/fxsjy/jieba.

Tasks   | Abilities                                          | Inputs                                                                             | Outputs                                                       | Metrics
--------|----------------------------------------------------|------------------------------------------------------------------------------------|---------------------------------------------------------------|-------------------------
ClozeT  | Commonsense Reasoning                              | A text with a sentence removed (the position specified); two candidate sentences.  | Choosing the correct sentence from two candidates.            | Accuracy
SenPos  | Inter-sentence Relationship                        | A text with a sentence removed (the position unspecified); the removed sentence.   | Choosing the correct position for the removed sentence.       | Accuracy
PlotCom | Commonsense Reasoning; Inter-sentence Relationship | A text with a sentence removed (the position specified).                           | Generating a sentence to complete the text.                   | BLEU; Dist
OutGen  | Discourse Structure; Coherence; Controllability    | A title, an outline as an out-of-order set of phrases about characters and events. | Generating a coherent text adhering to the title and outline. | BLEU; Dist; Cover; Order

Table 2: Overview of the tasks in LOT for the abilities they test, inputs and outputs, and the evaluation
metrics. Dist and Cover refer to Distinct and Coverage (Section 5.3), respectively.

Datasets                       | Train  | Val    | Test
-------------------------------|--------|--------|-------
Task: ClozeT                   |        |        |
# Examples                     | 644    | 294    | 294
Vocabulary Size                | 9k     | 7k     | 7k
Avg. # Char in Input Text      | 139.07 | 138.95 | 141.15
Avg. # Word in Input Text      | 89.28  | 89.03  | 90.20
Avg. # Sent in Input Text      | 5.95   | 5.94   | 5.95
Avg. # Word in Candidate       | 15.60  | 16.38  | 15.75
Task: SenPos                   |        |        |
# Examples                     | 20,000 | 800    | 863
Vocabulary Size                | 147k   | 10k    | 22k
Avg. # Char in Input Text      | 289.59 | 258.48 | 258.52
Avg. # Word in Input Text      | 254.11 | 224.20 | 223.25
Avg. # Sent in Input Text      | 9.61   | 8.43   | 8.44
Avg. # Word in Removed Sent    | 30.48  | 29.28  | 30.26
Avg. # Candidate Positions     | 8.05   | 6.91   | 6.91
Task: PlotCom                  |        |        |
# Examples                     | 13,099 | 465    | 464
Vocabulary Size                | 22k    | 8k     | 8k
Avg. # Char in Input Text      | 164.35 | 137.67 | 133.26
Avg. # Word in Input Text      | 105.48 | 87.56  | 84.98
Avg. # Sent in Input Text      | 7.17   | 5.59   | 5.48
Avg. # Word in Output Sent     | 15.08  | 15.96  | 16.15
Task: OutGen                   |        |        |
# Examples                     | 1,456  | 242    | 729
Vocabulary Size                | 19k    | 6k     | 12k
Avg. # Word in Input Title     | 4.64   | 4.89   | 4.64
Avg. # Word in Input Outline   | 19.20  | 19.05  | 19.47
Avg. # Phrase in Input Outline | 8.00   | 8.00   | 8.00
Avg. # Char in Output Text     | 169.94 | 169.80 | 170.49
Avg. # Word in Output Text     | 108.91 | 108.68 | 109.04
Avg. # Sent in Output Text     | 7.20   | 7.11   | 7.15

Table 3: Data statistics of LOT tasks. The
abbreviation char/sent/len is short for charac-
ter/sentence/length, respectively.

a certain ability than original datasets. (4) Au-
tomatic Evaluation: These tasks have reliable
automatic metrics to evaluate the focused abili-
ties. We exclude open-ended generation tasks such
as story generation from titles, which are difficult to
automatically evaluate (Guan et al., 2021) because
the tasks suffer from the notorious one-to-many
issue: There are many plausible outputs for the
same input (Zhao et al., 2017).

We constructed datasets for LOT through au-
tomatic and manual annotation. First, we crawled
human-written stories from public web pages as
the data source. These stories are under licenses
that allow use and redistribution for research pur-
poses. Then, we hired a commercial team to create
the LOT examples. The team is led by a profes-
sional screenwriter and has taken on hundreds of
NLP annotation projects. All annotators are native
Chinese speakers and well-trained for the annota-
tion tasks. We show the full list of the source web
pages and the annotation details in the appendix.

3.1 Cloze Test

Mostafazadeh et al. (2016) introduced the Story
Cloze Test (SCT) task for evaluating story com-
prehension, which requires selecting the right
ending from two candidates for a four-sentence
leading context. However, SCT suffers from
the following issues: (1) Its dataset is artificial
and contains innate biases between right and
wrong endings in some features such as lengths
(Schwartz et al., 2017; Sharma et al., 2018). Such
biases may leak information about the target la-
bels. (2) SCT focuses only on reasoning about
endings but neglects other types of reasoning, such as
abductive reasoning (Bhagavatula et al., 2019),
which requires reasoning what happens between
observed beginnings and endings. (3) SCT limits
the scope of commonsense reasoning to realistic
events. The limitation may be neither necessary
nor sufficient. For example, ‘‘Cupid can fly’’ can

A goblin had buried a treasure under the
ground. After that, he received a long flight
mission from the Devil King. The goblin be-
gan to worry about how to guard the treasure
during his mission. The goblin thought for a
long time and decided to give the treasure
to a miser. The miser clung to his vault even
when he was asleep, so the goblin trusted him
very much · · ·

Table 4: An example for selecting a sentence that
can be reasoned based on the context and common
sense (in red). We also highlight a sentence that
does not satisfy the requirement in green, which
introduces a new character ‘‘the Devil King’’.

be reasoned based on common sense although it
is not realistic, while some story settings may be
realistic but fail to be reasoned only based on the
context and common sense, as shown in Table 4.
Therefore, when constructing our ClozeT dataset,
we adopt the following approaches to alleviate
the above issues: (1) All examples are derived
from existing human-written stories. (2) We allow
annotators to create examples where the removed
sentence is initially in the middle of the story. (3)
We change the scope of commonsense reasoning
to all events that embody characters’ reaction and
intention, or the nature of physical objects and
concepts. Table 6 shows two ClozeT examples.
Furthermore, we also conducted experiments to
investigate the potential biases of our dataset in
Section 5.5.

Story Filtering To ensure the quality of LOT
examples, we asked annotators to judge whether
each crawled story meets the following defini-
tion: ‘‘anything which is told in the form of a
coherent event sequence involving several spe-
cific and related characters’’ (Mostafazadeh et al.,
2016). We provided detailed cases for annotators
to instruct them about this definition. Then, anno-
tators needed to refine those stories which do not
meet the definition by rewriting the plots. They
should also clean up the stories by the following
heuristics: (1) refusing examples that may vio-
late ethical principles (e.g., discrimination); (2)
deleting noisy words (e.g., links); (3) changing
slang and informal words into standard modern
Chinese; (4) rewriting all dialogues to objective
events. Finally, we collected 2,427 high-quality

Text:
I couldn’t control my anger very
well.[1]My parents would yell at me, and
I ran to my room.[2]I buried my head in a
pillow and screamed.[3]I threw my pillow
and hit it hard.

Removed Sentence: I tried to express my
anger.

Table 5: A poor example for the SenPos task.
The removed sentence has multiple reasonable
positions including [2] and [3] in the original
text.

Chinese stories, which will be used to con-
struct the datasets for the ClozeT, PlotCom, and
OutGen tasks.

Dataset Construction We presented the stories
to another group of annotators to construct the
ClozeT dataset. For each story, they should select
a sentence as the right candidate that can be rea-
soned based on the context and common sense.
Table 4 shows an example presented to the annota-
tors to illustrate how to judge whether a sentence
satisfies this requirement. Then, the annotators
rewrite the sentence into another one as the wrong
candidate that maintains a good topical relatedness
with the context but violates common sense. The
wrong candidates should either embody unreason-
able reactions or intentions, or violate the nature
of physical objects or concepts. And we require
annotators not to select the first sentence, which
usually aims to introduce story settings instead of
narrating an event. We browse through the an-
notation results and give the annotators detailed
feedback before approving their submissions. Fi-
nally, we collected 1,232 examples in total and
split them for training, validation and testing.

3.2 Sentence Position Prediction

We use the sentence position prediction task
(Chen et al., 2019) to evaluate the ability to cap-
ture inter-sentence relations (e.g., causality). We
formulate the task as follows: Given a text with
a sentence removed, models should choose the
correct position of the sentence in the text from
multiple candidates. Chen et al. (2019) constructed
an English dataset for this task by randomly re-
moving sentences from existing texts. However,
such examples may be invalid since a sentence
may have multiple plausible positions in a text,

Table 6: Two ClozeT examples. The right candidates are extracted from the original stories (at the
position of ‘‘[MASK]’’) while the wrong candidates are written by crowd-sourced annotators. The
first example focuses on common sense regarding the fox’s reaction to the silly wolf’s behavior, while
the second example focuses on common sense regarding the relations between palace and prince. We
highlight the entities and events related to the commonsense relations in red, and those which violate
common sense in the wrong candidates in green.

as illustrated in Table 5. Therefore, we construct
the dataset for our task based on the following
pipeline: (1) extracting paragraphs with fewer than
500 words from crawled stories; (2) randomly se-
lecting a sentence to remove for each paragraph,
and regarding all positions between two adjacent
sentences as candidates3; and (3) asking annota-
tors to refine part of the auto-constructed examples
as the validation and test sets, and keeping the
remaining examples as the training set. Table 7
shows two SenPos examples.
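
The automatic part of this pipeline (step 2, together with the sentence-merging rule in footnote 3) can be sketched as follows. This is a minimal illustration, not the authors' released code: the function names, the punctuation-based sentence splitter, and the choice to never remove the first or last sentence (so that the gold position is always one of the gaps between adjacent sentences) are assumptions made here.

```python
import random
import re

MIN_SENT_CHARS = 10  # minimum sentence length stated in footnote 3


def split_sentences(text):
    """Split on sentence-final punctuation (an assumed, simplified rule)."""
    parts = re.findall(r"[^。！？!?]+[。！？!?]?", text)
    return [p.strip() for p in parts if p.strip()]


def merge_short(sentences, min_chars=MIN_SENT_CHARS):
    """Merge any sentence shorter than min_chars with a neighbor."""
    merged = []
    for sent in sentences:
        if merged and (len(sent) < min_chars or len(merged[-1]) < min_chars):
            merged[-1] += sent
        else:
            merged.append(sent)
    return merged


def make_senpos_example(text, rng):
    """Remove one random inner sentence; candidates are the gaps between neighbors."""
    sents = merge_short(split_sentences(text))
    if len(sents) < 3:
        return None  # no inner sentence to remove
    idx = rng.randrange(1, len(sents) - 1)
    rest = sents[:idx] + sents[idx + 1:]
    # Gap i (1-based) lies between rest[i-1] and rest[i]; the gold gap is idx.
    return {
        "context": rest,
        "removed": sents[idx],
        "num_positions": len(rest) - 1,
        "label": idx,
    }
```

Examples produced this way would still need the manual refinement of step 3 before being used for validation or testing, since a randomly removed sentence may fit at several positions.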

Dataset Construction We asked annotators to
refine each example so that the removed sentence
has only one reasonable position in the text. We did
not allow annotators to select the first or last sen-
tence of the original text as the removed sentence
because they usually contain obvious wording
features (e.g., ‘‘once upon a time,’’ ‘‘they lived
happily together’’), which may make this task
trivial. Unlike ClozeT, we allowed the texts for
SenPos to be incomplete or include dialogues
that also embody rich inter-sentence relations. Fi-
nally, we collected 1,663 examples for validation
and testing through human annotation. And we

3We set the minimum length of the removed sentence to
10 Chinese characters, and we merge a sentence in a story
with its neighbors if it contains less than 10 characters.

constructed 20,000 examples automatically for
training.

3.3 Plot Completion

We use the Plot Completion task (Wang and Wan,
2019) to test the ability to make inferences based
on common sense. We formulate this task as
follows: Given a story with a sentence removed,
models should generate a sentence to complete the
story and make it reasonable and coherent.

Dataset Construction Prior studies (Wang and
Wan, 2019; Paul and Frank, 2021) automatically
constructed datasets for this task based on exist-
ing datasets by randomly removing one sentence
from a story. However, as shown in Table 4, not
all sentences in a story can be reasoned only based
on the context and common sense. Therefore, we
only used the above automatic method to construct
the training data. And we adapted the ClozeT data
to this task for validation and testing, since an-
notators have marked out the qualified sentences.
Specifically, we randomly sampled some ClozeT
examples and took the incomplete story of each
example as input, and the right candidate as the
target sentence to be generated.
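
The two construction routes described above can be sketched as follows. The dictionary field names (`story`, `candidates`, `label`) and the use of a `[MASK]` placeholder in the input are hypothetical choices made for illustration; the paper does not specify the data format.

```python
import random

MASK = "[MASK]"  # placeholder string; its use as the PlotCom input marker is assumed


def plotcom_from_story(sentences, rng):
    """Training-style example: drop one random sentence and mark its position."""
    idx = rng.randrange(len(sentences))
    masked = sentences[:idx] + [MASK] + sentences[idx + 1:]
    return {"input": masked, "target": sentences[idx]}


def plotcom_from_clozet(example):
    """Validation/test-style example: reuse the annotated right candidate as the target."""
    right = example["candidates"][example["label"]]
    return {"input": example["story"], "target": right}
```

The second route inherits the annotators' judgment that the removed sentence can be inferred from the context and common sense, which the random route cannot guarantee.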

Table 7: Two SenPos examples. The special tokens from [1] to [9] refer to the candidate positions.
The first/second example focuses on testing the ability to capture the inter-sentence causal/temporal
relations, respectively. We highlight the entities and events implying the relations in red.

3.4 Outline-conditioned Generation

Prior work tended to test the ability of long text
generation through story generation conditioned
on inputs with limited information such as titles
(Yao et al., 2019). However, these tasks are so
open-ended that it is difficult to reliably
measure the generation quality using automatic
metrics (Guan and Huang, 2020). To alleviate the
issue, we introduce the Outline-conditioned Gen-
eration task (Rashkin et al., 2020), which requires
generating a coherent long-form story conditioned
on an outline of characters and events. We formu-
late the outline as a set of out-of-order phrases,
which not only narrows down the set of plausible
stories but also serves for testing the controllabil-
ity and planning ability of models to arrange the
given events reasonably at the discourse level.

Dataset Construction We built the dataset for
this task automatically based on filtered stories.
We followed Rashkin et al. (2020) to extract
the outline of a story using the RAKE algorithm
(Rose et al., 2010). We extract at most eight
phrases for each story, and each phrase contains no
more than eight words. For example, the outline
for the story in Table 1 is {‘‘told his son with
irony,’’ ‘‘purchasing flour from a mill,’’ ‘‘crossing
the river,’’ ‘‘drop the sack into the river,’’ ‘‘indeed
pushed the sack,’’ ‘‘familiar to his son’s temper,’’
‘‘shouted,’’ ‘‘one bag’’}. The outline can serve as
discourse-level guidance for generation models,
which should rearrange the events reasonably and
generate a story with a good global discourse
structure, rather than focus on modeling only the
local coherence.
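
To make the outline construction concrete, here is a simplified RAKE-style sketch. It is not the authors' implementation: it works on whitespace-tokenized English for readability (the actual data are Chinese and tokenized with jieba), the stopword list is a tiny illustrative one, and dropping phrases longer than eight words (rather than truncating them) is an assumption.

```python
import random
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real run would use a full list for the target language.
STOPWORDS = {"a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "at",
             "by", "for", "with", "from", "was", "were", "is", "are", "he", "she",
             "they", "his", "her", "their", "it", "that", "this", "while", "after"}


def candidate_phrases(text):
    """Split text into runs of non-stopwords, breaking at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[^\w\s'-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_outline(text, max_phrases=8, max_words=8, rng=None):
    """Score phrases by the sum of degree/frequency of their words, keep the top ones, shuffle."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrence degree, counting the word itself
    scored = {}
    for phrase in phrases:
        if len(phrase) <= max_words:
            scored[" ".join(phrase)] = sum(degree[w] / freq[w] for w in phrase)
    top = sorted(scored, key=scored.get, reverse=True)[:max_phrases]
    (rng or random).shuffle(top)  # the outline is given to the model as an out-of-order set
    return top
```

Shuffling the selected phrases is what turns the extraction into a planning problem: the model sees which events must appear but not the order in which they occur.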

3.5 Overall Score

Existing benchmarks usually summarize the per-
formance of a model as a single score by averaging
all metric scores without considering task diffi-
culties. To encourage models to progress on those
tasks where there is a more significant gap between
machines and humans, we propose to average
metric scores with different weights. Suppose that
there are a total of M metrics for all tasks. We
derive the overall score as follows:

S = \sum_{i=1}^{M} \frac{w_i}{\sum_{j=1}^{M} w_j} S_i,    (1)

w_i = \frac{H_i}{B_i},    (2)

where H_i, B_i, and S_i are the scores of humans, a
pre-selected baseline, and the evaluated model for
the i-th metric, respectively, and w_i is the weight
for this metric. Intuitively, the metric scores where
the baseline model has a larger gap with humans
will have a larger weight when computing the
overall score. We use BERT and GPT2 as the base-
line models for the understanding and generation
tasks in LOT, respectively.
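
As a small worked illustration of the two equations above, the following sketch computes the overall score from three parallel lists of per-metric scores. The function name and the example numbers are hypothetical and are not results from the paper.

```python
def overall_score(model_scores, human_scores, baseline_scores):
    """Weighted average of metric scores with weights w_i = H_i / B_i."""
    if not (len(model_scores) == len(human_scores) == len(baseline_scores)):
        raise ValueError("all score lists must cover the same M metrics")
    weights = [h / b for h, b in zip(human_scores, baseline_scores)]
    total = sum(weights)
    return sum(w / total * s for w, s in zip(weights, model_scores))


# Made-up example with M = 2 metrics.
# Metric 1: humans 100, baseline 50 -> weight 2.0 (large human-baseline gap).
# Metric 2: humans 80, baseline 80  -> weight 1.0 (no gap).
score = overall_score([60.0, 70.0], [100.0, 80.0], [50.0, 80.0])
print(round(score, 2))  # (2/3) * 60 + (1/3) * 70 = 63.33
```

Because the weights are normalized, a model that scores the same value on every metric receives exactly that value as its overall score; the weighting only matters when performance differs across metrics.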

Versions | d_m   | d_ff  | d_kv | n_h | n_e/n_d | # P
---------|-------|-------|------|-----|---------|-----
Small    | 512   | 2,048 | 64   | 8   | 6/6     | 60M
Base     | 768   | 3,072 | 64   | 12  | 12/12   | 223M
Large    | 1,536 | 3,072 | 64   | 12  | 24/32   | 1B

Table 8: Hyper-parameter settings for different
versions of LongLM. d_m, d_ff, and d_kv are the
dimension of hidden states, the feed forward
layers, and the keys/values in the self-attention
layers, respectively. n_h is the number of at-
tention heads. n_e and n_d denote the number
of hidden layers for the encoder and decoder,
respectively. # P is the number of parameters.

4 Long Text Pretraining Model

To provide more flexibility on both understanding
and generation tasks, we build LongLM fol-
lowing the original encoder-decoder design of
Transformer (Vaswani et al., 2017) with three
different sizes, as shown in Table 8. We follow
Cui et al. (2020) to use a sentencepiece vocabu-
lary of 32,000 wordpieces (Kudo and Richardson,
2018). And we set the maximum sequence length
to 512 for both the encoder and decoder.

Pretraining Data We collect 120G novels as the
pretraining data for LongLM, which cover various
topics such as romance, military, and so on. Since
a novel is usually much longer than the maximum
input and output length of LongLM, we split a
novel into multiple segments for pretraining.

Pretraining Tasks Encoder-decoder models are
trained typically by maximizing the likelihood of
the target output given an input. To improve capac-
ities of both the encoder and decoder, we propose
to train LongLM with two pretraining tasks in-
cluding text infilling (Raffel et al., 2020) and
conditional continuation (Radford et al., 2019).
For the first task, the input is a text where a
number of spans are sampled and replaced by
special tokens with unique IDs, while the output
is the spans delimited by the special tokens used
in the input. The lengths of masked spans are
drawn from a Poisson distribution with λ=3 and
all masked tokens account for 15% of the original
texts. As for the second task, the input and output
are, respectively, the front and back half of a text,
which is split into two parts randomly. We show
an example of the pretraining tasks in Figure 1.
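
The two pretraining tasks can be sketched on a token list as follows. This is a minimal sketch under several assumptions, not the released training code: the sentinel names `<extra_id_N>` are hypothetical, zero-length Poisson draws are clamped to one token, spans are kept non-adjacent so each gets its own sentinel, and Knuth's method stands in for a library Poisson sampler.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's method for drawing a Poisson-distributed integer."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def text_infilling(tokens, rng, mask_ratio=0.15, lam=3.0):
    """Mask about mask_ratio of the tokens in non-overlapping spans."""
    budget = max(1, round(len(tokens) * mask_ratio))
    masked = [False] * len(tokens)
    attempts = 0
    while budget > 0 and attempts < 100 * len(tokens):
        attempts += 1
        length = min(max(1, sample_poisson(lam, rng)), budget)
        start = rng.randrange(0, len(tokens) - length + 1)
        # Require a gap on both sides so that spans never touch or overlap.
        lo, hi = max(0, start - 1), min(len(tokens), start + length + 1)
        if any(masked[lo:hi]):
            continue
        for i in range(start, start + length):
            masked[i] = True
        budget -= length
    source, target, sentinel = [], [], 0
    for i, tok in enumerate(tokens):
        if masked[i]:
            if i == 0 or not masked[i - 1]:
                tag = f"<extra_id_{sentinel}>"  # hypothetical sentinel naming
                source.append(tag)
                target.append(tag)
                sentinel += 1
            target.append(tok)
        else:
            source.append(tok)
    return source, target


def conditional_continuation(tokens, rng):
    """Split a text at a random point into a prefix (input) and a suffix (output)."""
    cut = rng.randrange(1, len(tokens))
    return tokens[:cut], tokens[cut:]
```

In both tasks the encoder reads the first returned sequence and the decoder is trained to produce the second, so the same encoder-decoder objective covers a denoising task and a left-to-right generation task.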

Pretraining Details We set the learning rate to 1e-4 with the Adam optimizer and the batch size to 1,000. We pretrained LongLM for 2.5M steps. It took about two months to train the largest model using eight NVIDIA V100 GPUs.

Figure 1: Schematic of the pretraining tasks. Special tokens are used for masking spans, and a separate token marks the ‘‘end of sequence’’.

                 TextInfill               CondCont
Models           PPL      BLEU-3/4        PPL      BLEU-3/4
LongLMsmall      11.61    73.80/68.96     22.91    5.30/2.43
LongLMbase       8.24     75.65/71.05     17.03    5.73/2.64
LongLMlarge      6.50     77.08/72.65     14.08    8.91/5.97

Table 9: Perplexity (PPL) and BLEU scores
of LongLM for text infilling (TextInfill) and
conditional continuation (CondCont). The best
performance is in bold and the second best is
underlined.

Model Performance To assess the performance
of LongLM on the pretraining tasks, we randomly
separated out 1,000 texts from the initial pre-
training data for testing, which were never seen
in the pretraining phase. We used perplexity and
BLEU-n (n = 3, 4) to evaluate both pretraining
tasks. We generated outputs using the greedy
decoding algorithm for the text infilling task, and
top-k sampling (Fan et al., 2018) with k = 40 and
a softmax temperature of 0.7 (Goodfellow et al.,
2014) for the conditional continuation task. As
shown in Table 9, the performance improves
substantially as the number of parameters increases.
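
The two decoding strategies mentioned above can be sketched for a single decoding step as follows. This is a plain-Python illustration over a list of logits, not the decoding code used for LongLM; the function names are hypothetical.

```python
import math
import random


def greedy_decode_step(logits):
    """Greedy decoding: return the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_sample(logits, k=40, temperature=0.7, rng=random):
    """Top-k sampling: keep the k best tokens, rescale, then sample one."""
    # Indices of the k largest logits (all of them if there are fewer than k).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Dividing by a temperature below 1 sharpens the distribution.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    # Unnormalized softmax weights; subtracting the peak avoids overflow.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]
```

With k = 1 the sampler reduces to greedy decoding, which is one way to sanity-check an implementation.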

5 Experiments

In this section, we tested LongLM and existing
models on LOT with automatic and manual
evaluation. In addition, we conducted extensive
experiments to investigate the potential biases of
the ClozeT and SenPos datasets (Section 5.5), and
to measure the overlap between training and testing
data (Section 5.6).

5.1 Evaluated Models

We evaluated the following models, which are implemented based on the registered models of HuggingFace Transformers:4 (1) Vanilla Transformer: It has the same architecture as BERTbase except that the number of layers is set to 3 (Vaswani et al., 2017). (2) BERT: It is implemented based on the bert-base-chinese registered model (Devlin et al., 2019). (3) RoBERTa: It is implemented based on the hfl/chinese-roberta-wwm-ext registered model (Cui et al., 2020). (4) GPT2: It is implemented based on the uer/gpt2-chinese-cluecorpussmall registered model (Zhao et al., 2019). (5) mT5: It is implemented based on the google/mt5-base registered model (Xue et al., 2021). We set all the baseline models to the base version due to limited computational resources.

To show the generic benefits of the pretraining data of LongLM for long text modeling, we pretrained a left-to-right language model from scratch on the data with the standard language modeling objective. This model has the same architecture as GPT2base and is denoted as GPT2†base. Moreover, we evaluated two task-specific pretraining models, PlotMachines (PM) (Rashkin et al., 2020) and Plan&Write (PW) (Yao et al., 2019), and two typical non-pretrained models, ConvS2S (Gehring et al., 2017) and Fusion (Fan et al., 2018), on the generation tasks in LOT. We used GPT2base as the backbone model of PM and PW. For PM, we regard input sentences (for PlotCom) or input phrases (for OutGen) as the plot elements used in the memory network, and update the memory representations at each step of decoding. As for PW, we take a keyword extracted from the target sentence using the RAKE algorithm (for PlotCom) or the sorted input phrases in order (for OutGen) as the intermediate representations for planning. We implemented these models based on the code provided by the original papers.

5.2 Experiment Settings

Understanding Tasks For both tasks, we encode the input of each example and then predict a distribution over all candidates by normalizing the dot-product values between the representations of each candidate and the context. We use the candidate with the maximum probability as the prediction result. For ClozeT, we represent a candidate using the hidden state at the end of it, and we regard the hidden state at the position of the removed sentence appearing in the original text as the context representation. And for SenPos, we take the hidden state at each candidate position as the candidate representation and the hidden state at the end of the removed sentence as the context representation. When evaluating mT5 and LongLM, we feed the same input into the encoder and decoder (Lewis et al., 2020) and use the hidden states of the decoder for prediction in the above way.

4 https://huggingface.co/models.
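
The candidate scoring used for the understanding tasks can be sketched as follows. The sketch assumes that "normalizing" means applying a softmax to the dot products, which the text does not state explicitly, and it uses plain lists of floats in place of real hidden states.

```python
import math


def predict_candidate(context_vec, candidate_vecs):
    """Score candidates by dot product with the context and pick the best.

    Returns (index of the most probable candidate, list of probabilities).
    """
    scores = [
        sum(c * x for c, x in zip(cand, context_vec)) for cand in candidate_vecs
    ]
    peak = max(scores)
    # Assumed normalization: softmax over the dot-product scores.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs
```

For ClozeT the list would hold two candidate vectors, and for SenPos one vector per candidate position.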

Generation Tasks For PlotCom, we take the
incomplete story of an example as input to
generate the missing sentence. For OutGen, we
concatenate all phrases in an outline with special
tokens as input to generate a story.

Hyper-Parameters For all models, we set the
batch size to 12, the maximum sequence length
to 512, and the learning rate to 3e-5. We decode
outputs using top-k sampling with k = 40 and
a softmax temperature of 0.7 for the generation
tasks.

5.3 Automatic Evaluation

Models         # P     ClozeT    SenPos    Overall

Validation Set
Transformer    38M     55.78     17.38     31.46
BERTbase       102M    70.75     40.13     51.36
RoBERTabase    102M    72.11     51.63     59.14
GPT2base       102M    70.07     37.78     49.62
GPT2†base      102M    74.49     39.25     52.17
mT5base        582M    72.45     63.25     66.62
LongLMsmall    60M     73.81     48.75     57.94
LongLMbase     223M    75.17     64.38     68.34
LongLMlarge    1B      79.93     70.00     73.64
Humans         N/A     99.00     97.00     97.73
wi             N/A     0.37      0.63      1.00

Test Set
Transformer    38M     54.42     16.34     31.23
BERTbase       102M    69.39     43.68     53.74
RoBERTabase    102M    67.69     51.35     57.74
GPT2base       102M    73.13     37.25     51.28
GPT2†base      102M    76.87     39.28     53.98
mT5base        582M    75.17     61.41     66.79
LongLMsmall    60M     77.21     53.07     62.51
LongLMbase     223M    77.55     62.34     68.29
LongLMlarge    1B      80.61     69.41     73.39
Humans         N/A     100.00    98.00     98.78
wi             N/A     0.39      0.61      1.00

Table 10: Accuracy (%) on the understanding
tasks in LOT. # P means the number of
parameters. The best performance is in bold
and the second best is underlined. wi is the
metric weight with BERT as the baseline
model when computing the overall score.

Metrics We use accuracy to evaluate the understanding tasks. For the generation tasks, we use BLEU-n (B-n) and Distinct-n (D-n) to evaluate the n-gram overlap with ground-truth texts (Papineni et al., 2002) and n-gram generation diversity (Li et al., 2016), respectively. We set n = 1, 2 for both generation tasks. In addition, we use the following two metrics to evaluate OutGen: (1) Coverage (Cover): It is used to evaluate the generation controllability, and is computed as the average Rouge-L recall score (Lin, 2004) between the generated text and each input phrase. A higher coverage score indicates that the generated text covers more input phrases. (2) Order: It is used to measure the gap between the positional orders of input phrases appearing in the generated texts and ground-truth texts. Specifically, we compute the order score as the average ratio of the number of inversions in the generated story to the number of all position pairs of any two phrases. An inversion refers to a position pair that is out of the ground-truth order. We use the position of the longest common subsequence between a story and a phrase as the position of the phrase in the story. Because an input phrase does not always appear in the generated story, we regard all position pairs of such a phrase and others as inversions.
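
The Coverage and Order metrics for OutGen can be sketched as below. The sketch makes several assumptions: Rouge-L recall is computed at the character level as LCS length divided by phrase length, phrase positions are passed in precomputed (with `None` for a phrase that is missing from the generated story), and the function returns the raw inversion ratio. Since the ground-truth texts score 100 on Order in Table 11, the reported number presumably corresponds to the share of pairs that are not inverted, but that is an inference rather than something the text states.

```python
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def coverage(generated, phrases):
    """Average Rouge-L recall of each input phrase against the generated text."""
    return sum(lcs_length(p, generated) / len(p) for p in phrases) / len(phrases)


def inversion_ratio(gen_pos, ref_pos):
    """Share of phrase pairs whose order in the generated story is inverted.

    gen_pos and ref_pos map each phrase to its position; a phrase missing
    from the generated story has position None, and every pair that
    involves it is counted as an inversion.
    """
    pairs = list(combinations(list(ref_pos), 2))
    if not pairs:
        return 0.0
    inversions = 0
    for a, b in pairs:
        if gen_pos.get(a) is None or gen_pos.get(b) is None:
            inversions += 1
        elif (gen_pos[a] - gen_pos[b]) * (ref_pos[a] - ref_pos[b]) < 0:
            inversions += 1
    return inversions / len(pairs)
```

Averaging `inversion_ratio` over all test examples gives the corpus-level figure described in the text.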

Results Tables 10 and 11 show the results on
the understanding and generation tasks,
respectively. To obtain the human performance on
the understanding tasks, we randomly sampled
100 examples from the validation set or test set
and hired three crowd-sourced annotators (native
Chinese speakers) to do these tasks. We
made final decisions among them through
majority voting. All results show an almost perfect
inter-annotator agreement with Fleiss’s κ > 0.85
(Fleiss, 1971). For generation tasks, we
regard the scores of ground-truth texts as human
performance.
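
The majority voting and agreement statistic mentioned above can be computed as in the following sketch. It implements the standard Fleiss' κ formula over per-item label lists; the data layout and function names are illustrative assumptions, and the function is undefined when every rating falls into a single category (the chance agreement is then 1).

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label (ties go to the first label seen)."""
    return Counter(labels).most_common(1)[0][0]


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    categories = set().union(*counts)
    # Mean observed agreement over items.
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / n_items
    # Chance agreement from the overall category proportions.
    p_e = sum(
        (sum(cnt[cat] for cnt in counts) / (n_items * n_raters)) ** 2
        for cat in categories
    )
    return (p_bar - p_e) / (1 - p_e)
```

With three annotators and binary labels a tie cannot occur, so `majority_vote` is unambiguous in that setting.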

We summarize the evaluation results as follows: (1) Pretrained models have significantly better performance than non-pretrained models. (2) LongLMlarge outperforms other baselines substantially on both the understanding and generation tasks. LongLMbase/LongLMsmall achieves better overall scores with about half as many parameters as mT5/GPT2. (3) By comparing GPT2† and GPT2, we can conclude that our pretraining data can effectively improve the ability to model long texts. (4) LongLMsmall has a better performance than GPT2† on the understanding tasks, and is comparable with GPT2† on the generation tasks, suggesting the benefits of the encoder-decoder framework and the text infilling task. (5) It is still extremely challenging for all models to capture the commonsense and inter-sentence discourse relations between events in long texts for tackling the ClozeT and SenPos tasks. In addition, we investigate how the size of training data influences the accuracy of BERT for SenPos. The result in Figure 2 indicates the necessity to develop better representations of discourse relations instead of relying only on increasing the data size. (6) The results on the generation tasks show that LongLM does well in generating more word overlaps with references than similar-sized baselines for both tasks, and covers more input phrases and arranges them in correct orders for OutGen. But LongLM underperforms GPT2-based models in terms of diversity on PlotCom. (7) Dynamically tracking plot states (i.e., PM) does not bring significant improvement on the generation tasks compared with GPT2, suggesting that it may require modeling the discourse structure explicitly to tackle the generation tasks. The superiority of PW to GPT2 on OutGen further indicates the benefit of modeling discourse-level features. In summary, we believe LOT will serve as an effective evaluation for capturing the commonsense and discourse relations of long texts beyond the surface events, and generating coherent and controllable long-form texts.

Figure 2: Accuracy of BERT for SenPos as the size of training data increases.

                        PlotCom                         OutGen
Models        # P    B-1     B-2     D-1    D-2      B-1     B-2     D-1    D-2    Cover   Order    Overall

Validation Set
ConvS2S       58M    18.92   4.18    6.31   32.18    29.23   10.38   3.45   21.79  14.81   25.34    11.85
Fusion        109M   20.56   4.69    8.63   35.73    29.22   10.34   3.39   22.67  17.41   26.55    12.61
GPT2base      102M   22.67   6.22    24.75  70.57    30.43   14.87   10.95  44.38  60.90   55.52    20.24
GPT2†base     102M   22.49   5.43    26.88  74.87    35.29   18.31   13.89  51.36  64.01   57.64    21.73
PM            102M   22.11   5.49    23.89  69.74    31.81   14.94   12.99  50.56  62.98   56.75    20.45
PW            102M   22.45   5.57    25.64  71.54    35.84   18.47   11.86  47.62  64.93   57.30    21.48
mT5base       582M   22.56   6.46    24.44  71.31    36.71   22.25   14.52  50.01  77.98   63.15    23.53
LongLMsmall   60M    21.78   7.11    20.17  59.63    35.03   19.17   10.80  39.70  62.53   56.53    21.02
LongLMbase    223M   22.91   8.28    22.16  63.54    40.33   24.29   14.66  51.82  79.60   62.78    24.75
LongLMlarge   1B     23.76   8.70    25.93  72.18    42.79   24.91   16.13  57.71  80.46   64.36    26.12
Truth         N/A    100.00  100.00  35.32  84.33    100.00  100.00  21.66  71.43  100.00  100.00   92.23
wi            N/A    0.11    0.40    0.04   0.03     0.08    0.17    0.05   0.04   0.04    0.04     1.00

Test Set
ConvS2S       58M    19.60   4.20    6.00   32.42    29.00   10.14   1.60   13.95  15.45   25.77    11.27
Fusion        109M   20.52   4.90    8.43   35.09    28.77   10.22   1.47   14.12  17.10   26.36    11.91
GPT2base      102M   22.94   5.76    24.69  70.30    30.17   14.91   7.62   36.87  60.87   55.90    19.21
GPT2†base     102M   22.45   5.38    26.08  73.26    35.79   18.68   9.89   43.52  64.43   56.96    20.76
PM            102M   22.87   5.75    24.08  71.19    31.85   15.24   8.62   41.32  63.15   57.21    19.77
PW            102M   22.76   6.07    25.55  70.72    35.12   17.96   8.68   40.17  63.70   55.17    20.52
mT5base       582M   22.52   6.48    24.33  70.53    36.33   22.07   10.90  43.65  78.66   63.79    22.59
LongLMsmall   60M    22.05   7.45    19.93  59.79    34.48   19.17   7.93   34.25  63.75   57.64    20.48
LongLMbase    223M   23.28   8.58    21.37  62.43    40.25   24.15   10.75  44.40  79.88   63.67    23.93
LongLMlarge   1B     24.20   9.06    25.75  71.08    42.10   24.77   12.04  50.29  81.48   64.82    25.29
Truth         N/A    100.00  100.00  35.01  84.56    100.00  100.00  15.71  63.46  100.00  100.00   91.64
wi            N/A    0.10    0.42    0.03   0.03     0.08    0.16    0.05   0.04   0.04    0.04     1.00

Table 11: Evaluation results on the generation tasks in LOT. # P means the number of parameters. The
best performance is in bold and the second best is underlined. wi is the metric weight with GPT2base
as the baseline model when computing the overall score.

5.4 Manual Evaluation

Because automatic metrics may be unreliable for evaluating NLG (Guan and Huang, 2020), we conducted a point-wise manual evaluation to measure the disparity between machines and humans for the generation tasks in LOT. For each task, we randomly sampled 100 examples from the test set and obtained 100 ground-truth texts and 300 generated texts from three typical models: GPT2base, mT5base, and LongLMlarge. For each text along with the input, we hired three crowd-sourced workers to judge its quality with a binary score (1 for good, and 0 otherwise) in terms of three aspects: (1) grammaticality (intra-sentence grammar quality of generated texts), (2) coherence (causal and temporal dependencies within generated texts), and (3) relatedness to inputs (reasonable logical connections to the input context for PlotCom; and reasonable utilization of input phrases for OutGen). These aspects are independently evaluated. We made final decisions among three annotators through majority voting. We show the annotation instructions in the appendix.

Table 12 shows the evaluation results. For both tasks, LongLM outperforms GPT2 and mT5 significantly in all aspects (p < 0.05, sign test). However, it is difficult for all models to generate a logical completion for PlotCom (relatedness score < 0.1), showing their poor ability to capture commonsense and inter-sentence relations. The big gap between LongLM and humans also shows that both tasks are challenging for existing generation models. We also observe a positive correlation between the manual evaluation and automatic evaluation (Table 11), suggesting that it may be acceptable to use automatic evaluation to compare and improve models on the generation tasks in LOT.

Models         Gram (κ)       Cohe (κ)       Relat (κ)

Task: PlotCom
GPT2base       0.84 (0.49)    0.41 (0.71)    0.01 (0.50)
mT5base        0.85 (0.24)    0.53 (0.65)    0.01 (0.50)
LongLMlarge    0.95 (0.48)    0.82 (0.64)    0.09 (0.69)
Truth          1.00 (1.00)    1.00 (1.00)    0.99 (0.49)

Task: OutGen
GPT2base       0.54 (0.52)    0.18 (0.52)    0.39 (0.43)
mT5base        0.53 (0.26)    0.08 (0.46)    0.49 (0.38)
LongLMlarge    0.81 (0.23)    0.37 (0.43)    0.62 (0.45)
Truth          1.00 (1.00)    1.00 (1.00)    1.00 (1.00)

Table 12: Manual evaluation results for PlotCom and OutGen in terms of grammaticality (Gram), coherence (Cohe), and relatedness (Relat). The best performance is highlighted in bold. All results show a fair inter-annotator agreement with Fleiss’ κ > 0.2.

5.5 Bias Investigation

It is essential to investigate potential biases of a dataset, which may leak information about target
labels and enable models to easily use shortcuts to handle complex inputs without actually mastering the focused abilities (Ribeiro et al., 2020). Therefore, we experimented with the following baselines to inspect the ClozeT and SenPos datasets: (1) Random: It chooses a candidate randomly. (2) Majority: It chooses the candidate with an index that is most frequently selected in the training set. (3) Length: For ClozeT, it chooses the candidate that contains more words; for SenPos, it chooses the position of which the adjacent sentences have the closest number of words to the removed sentence. (4) BLEU-n: For ClozeT, it chooses the candidate with a higher BLEU-n score (Papineni et al., 2002) with the context; for SenPos, it chooses the position of which the adjacent sentences have the largest average BLEU-n score with the removed sentence (n = 1, 2). (5) Sentiment: For ClozeT, it chooses the candidate with a higher sentiment score computed by an off-the-shelf Chinese sentiment analyzer;5 for SenPos, it chooses the position where the average sentiment score of its adjacent two sentences is the closest to the score of the removed sentence. (6) Discourse Markers: For ClozeT, it chooses the candidate whose adjacent sentences contain a discourse marker matching with it. For example, if ‘‘because’’ occurs in the last sentence before the position of the candidates, this baseline will choose the candidate that contains ‘‘so’’.6 If no such paired markers exist in an example or there are multiple eligible candidates, this baseline will randomly choose one. The setting of this baseline for SenPos is similar to ClozeT. We manually define 24 marker pairs for this baseline. (7) BERT w/o Context: We fine-tuned BERT to directly choose without taking the context as input (Schwartz et al., 2017). (8) BERT w/o Long: It is used to study whether solving these tasks requires modeling long-range dependencies. For ClozeT, we fine-tuned BERT to choose with only the adjacent sentences of the removed sentence as input. For SenPos, we encoded each position and its adjacent sentences respectively using BERT and then took the hidden states at these positions for prediction. These baselines cover different levels of features ranging from the token level (e.g., Length), the sentence level (e.g., Sentiment), to the discourse level (e.g., Discourse Markers, BERT w/o Context). We believe that these baselines will provide a comprehensive inspection for the potential biases of our datasets. As shown in Table 13, neither task can be trivially solved by these baselines, suggesting that the datasets may be free of biases in terms of the above features. Therefore, we believe that the tasks can focus on testing the ability of models to capture long-range commonsense and discourse relations.

5 https://github.com/isnowfy/snownlp.
6 Different from English, paired discourse markers like ‘‘because’’-‘‘so’’ should be used together in Chinese.

Baselines            ClozeT         SenPos
Random               50.00          16.03
Majority             52.72          16.24
Length               52.72          16.45
BLEU-1/2             46.94/48.98    14.14/14.95
Sentiment            50.34          16.49
Discourse Markers    45.92          9.15
BERT w/o Context     57.82          18.08
BERT w/o Long        62.24          19.00
BERT                 69.39          43.68

Table 13: Accuracy (%) of different baselines
on the test sets of ClozeT and SenPos for bias
investigation. We use the results of BERT as
a reference.
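
The simplest of these baselines are easy to reproduce. The following sketch covers the Majority and Length baselines for ClozeT, assuming each candidate is already a list of words and each label is the index of the correct candidate; the function names and data layout are hypothetical.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Majority: always predict the candidate index seen most often in training."""
    return Counter(train_labels).most_common(1)[0][0]


def length_baseline(candidates):
    """Length (ClozeT): predict the index of the candidate with more words."""
    return max(range(len(candidates)), key=lambda i: len(candidates[i]))


def accuracy(predictions, golds):
    """Accuracy in percent, the unit used in Table 13."""
    correct = sum(p == g for p, g in zip(predictions, golds))
    return 100.0 * correct / len(golds)
```

A baseline that scores well above the random level with such shallow features would signal a shortcut in the dataset, which is the check this section performs.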

5.6 Memorization Investigation

Overlap between training and test data may result in an over-reporting of the generalization performance of machines. Therefore, it is necessary to investigate how many test data also show up in the training data. To this end, we follow Radford et al. (2019) to measure the overlap between two datasets by calculating the percentage of 8-grams from one that are also in the other. We use the jieba tokenizer for tokenization.

Tasks           ClozeT    SenPos    PlotCom   OutGen

Overlap with the Training Sets
Percent         0.00%     0.62%     0.02%     0.00%
# 8-grams       0         1,040     6         2
# Exam          0         45        3         2
# Exam>10%      0         17        0         0
Max Percent     0.00%     60.98%    2.53%     1.00%

Overlap with the Pretraining Data
Percent         0.67%     4.68%     0.38%     1.22%
# 8-grams       172       7,844     151       1,212
# Exam          83        486       88        161
# Exam>10%      4         71        1         26
Max Percent     47.22%    60.96%    30.77%    41.18%

Table 14: Overlapping analysis for the test sets
of the four tasks with respect to their own training
sets or the pretraining data of LongLM. We
compute the following statistics: (1) Percent:
the percentage of 8-grams from the test set that
are also in the training sets or the pretraining
data; (2) # 8-grams: the number of overlapped
8-grams; (3) # Exam: the number of examples
that contain at least one overlapped 8-gram;
(4) # Exam>10%: the number of examples that
have more than 10% overlapped 8-grams; (5)
Max Percent: the maximum percentage of
overlapped 8-grams from an example.
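
The statistics reported in Table 14 can be computed along the following lines. The sketch expects examples that are already tokenized (the paper uses the jieba tokenizer for this step), and it counts every occurrence of an overlapped 8-gram in the test set, which is an assumption since the text does not say whether duplicates are merged. The dictionary keys are hypothetical names.

```python
def ngrams(tokens, n=8):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_stats(test_examples, reference_examples, n=8, threshold=0.10):
    """Overlap of test n-grams with a reference corpus (training or pretraining)."""
    reference = {g for ex in reference_examples for g in ngrams(ex, n)}
    total = overlapped = 0
    n_exam = n_exam_over = 0
    max_pct = 0.0
    for ex in test_examples:
        grams = ngrams(ex, n)
        hits = sum(g in reference for g in grams)
        total += len(grams)
        overlapped += hits
        if hits > 0:
            n_exam += 1
        pct = hits / len(grams) if grams else 0.0
        if pct > threshold:
            n_exam_over += 1
        max_pct = max(max_pct, pct)
    return {
        "percent": 100.0 * overlapped / total if total else 0.0,
        "num_ngrams": overlapped,
        "num_examples": n_exam,
        "num_examples_over": n_exam_over,
        "max_percent": 100.0 * max_pct,
    }
```

For a corpus as large as the pretraining data, the set of reference n-grams would need a more memory-efficient representation than an in-memory Python set.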

Table 14 shows the overlapping analysis for test sets of the four tasks in LOT. We can see that all test sets have less than 1% overlap with their own training sets. Notably, there are 17 test examples of SenPos that contain more than 10% overlapped 8-grams with the training set. This is because a training example and a test example may come from the same story, and thus they share similar information (e.g., characters, locations). A test example contains at most 60.98% overlapped 8-grams, suggesting that the training set and test set do not include exactly the same example. As for the pretraining data of LongLM, the test sets of ClozeT and PlotCom still have less than 1% overlap. However, there are dozens of test examples in SenPos and OutGen that contain more than 10% overlapped 8-grams. Through manual inspection of the overlaps, we found that they mainly come from idioms, proverbs and classic fairy tales, which may be part of some novels in the pretraining data.

SenPos         Total     w/o Overlap (Training Set)        Δ
# Exam         863       846                               N/A
mT5base        61.41%    61.82%                            +0.41%
LongLMlarge    69.41%    69.50%                            +0.09%

SenPos         Total     w/o Overlap (Pretraining Data)    Δ
# Exam         863       792                               N/A
mT5base        61.41%    61.24%                            –0.17%
LongLMlarge    69.41%    69.32%                            –0.09%

Table 15: Accuracy on the test set of SenPos.
Total means using the whole test set while
w/o Overlap means excluding the examples
that have more than 10% overlapped 8-grams
with the training set or pretraining data from
the test set. # Exam is the number of examples.
Δ denotes the change of accuracy when
excluding the overlapping data compared with
using the total test set.

OutGen         Total     w/o Overlap (Pretraining Data)    Δ
# Exam         729       703                               N/A
mT5base        36.33     36.45                             +0.12
LongLMlarge    42.10     42.22                             +0.12

Table 16: BLEU-1 score on the test set of
OutGen. Other notations are the same as
Table 15.

To investigate how the overlapping data influence the measurement of models’ performance, we re-evaluated LongLMlarge on the test sets of SenPos and OutGen with exclusion of the examples that have more than 10% overlapped 8-grams with the training sets or pretraining data. We also used mT5base as a baseline in the same setting as LongLM. The results for SenPos and OutGen are shown in Tables 15 and 16, respectively. The change of accuracy or BLEU-1 score is very marginal for both mT5 and LongLM when excluding the overlapping data, suggesting that the superior performance of LongLM is rarely attributable to the memorization of training data. Therefore, we believe that it is fair to compare LongLM and other models on these tasks.

6 Conclusions

We present LOT, a story-centric benchmark for Chinese long text understanding and generation. LOT includes two story understanding tasks and two story generation tasks, which comprehensively investigate the abilities of commonsense reasoning, controllable generation, and modeling inter-sentence relations and the global discourse structures. We provide standard datasets for the four tasks, which are constructed based on human-written stories processed by automatic and manual annotation. In addition, we release a new Chinese long text pretraining model LongLM, which outperforms strong baseline models substantially on both the understanding and generation tasks in LOT. The LOT benchmark and the pretraining model will encourage further research on Chinese long text modeling.

Acknowledgments

This work was supported by the National Science
Foundation for Distinguished Young Scholars (No.
62125604) and the NSFC projects (Key project
no. 61936010 and regular project no. 61876096).
This work was also supported by the Guoqiang
Institute of Tsinghua University, with grant nos.
2019GQG1 and 2020GQG0005. We would also
like to thank our action editor, Dipanjan Das,
and the anonymous reviewers for their invaluable
suggestions and feedback.

References

Apoorv Agarwal, Anup Kotalwar, and Owen
Rambow. 2013. Automatic extraction of so-
cial networks from literary text: A case study
on alice in wonderland. In Proceedings of the
Sixth International Joint Conference on Natural
Language Processing, pages 1202–1208.

Nader Akoury, Shufan Wang, Josh Whiting,
Stephen Hood, Nanyun Peng, and Mohit Iyyer.
2020. STORIUM: A Dataset and evaluation
platform for machine-in-the-loop story generation.
In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language
Processing (EMNLP), pages 6470–6484, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.emnlp-main.525

David Bamman, Brendan O’Connor, and Noah A.
Smith. 2013. Learning latent personas of film
characters. In Proceedings of the 51st Annual
Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers),
pages 352–361.

Chandra Bhagavatula, Ronan Le Bras, Chaitanya
Malaviya, Keisuke Sakaguchi, Ari Holtzman,
Hannah Rashkin, Doug Downey, Wen-tau Yih,
and Yejin Choi. 2019. Abductive common-
sense reasoning. In International Conference
on Learning Representations.

Faeze Brahman and Snigdha Chaturvedi.
2020. Modeling protagonist emotions for
emotion-aware storytelling. In Proceedings
of the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 5277–5294, Online. Association for
Computational Linguistics.
https://doi.org/10.18653/v1/2020.emnlp-main.426

Tom B. Brown, Benjamin Mann, Nick Ryder,
Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, Sandhini Agarwal,
Ariel Herbert-Voss, Gretchen Krueger, Tom
Henighan, Rewon Child, Aditya Ramesh,
Daniel M. Ziegler, Jeffrey Wu, Clemens
Winter, Christopher Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin
Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever, and
Dario Amodei. 2020. Language models are
few-shot learners.

Nathanael Chambers and Dan Jurafsky. 2008.
Unsupervised learning of narrative event chains. In
Proceedings of ACL-08: HLT, pages 789–797.

Snigdha Chaturvedi, Mohit Iyyer, and Hal
Daume III. 2017. Unsupervised learning of
evolving relationships between literary characters.
In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 31.

Snigdha Chaturvedi, Shashank Srivastava, Hal
Daume III, and Chris Dyer. 2016. Modeling
evolving relationships between characters
in literary novels. In Proceedings of the
AAAI Conference on Artificial Intelligence,
volume 30.

Mingda Chen, Zewei Chu, and Kevin Gimpel.
2019. Evaluation benchmarks and learning
criteria for discourse-aware sentence representations.
In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 649–662.
https://doi.org/10.18653/v1/D19-1060

Alexis Conneau and Douwe Kiela. 2018.
Senteval: An evaluation toolkit for univer-
sal sentence representations. In Proceedings
of the Eleventh International Conference on
Language Resources and Evaluation (LREC
2018).

Yiming Cui, Wanxiang Che, Ting Liu, Bing
Qin, Shijin Wang, and Guoping Hu. 2020.
Revisiting pre-trained models for Chinese natural
language processing. In Proceedings of
the 2020 Conference on Empirical Methods
in Natural Language Processing: Findings,
pages 657–668, Online. Association for
Computational Linguistics.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G.
Carbonell, Quoc Le, and Ruslan Salakhutdinov.
2019. Transformer-xl: Attentive language models
beyond a fixed-length context. In Proceedings
of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 2978–2988.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of the 2019
Conference of the North American Chapter of the
Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long
and Short Papers), pages 4171–4186.

Angela Fan, Mike Lewis, and Yann Dauphin.
2018. Hierarchical neural story generation. In
Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 889–898.

Mark Alan Finlayson. 2012. Learning narrative
structure from annotated folktales. Ph.D. thesis,
Massachusetts Institute of Technology.

Joseph L. Fleiss. 1971. Measuring nominal scale
agreement among many raters. Psychological
Bulletin, 76(5):378–382.
https://doi.org/10.1037/h0031619

Jonas Gehring, Michael Auli, David Grangier,
Denis Yarats, and Yann N. Dauphin. 2017.
Convolutional sequence to sequence learning.
In International Conference on Machine
Learning, pages 1243–1252. PMLR.

Sebastian Gehrmann, Tosin Adewumi, Karmanya
Aggarwal, Pawan Sasanka Ammanamanchi,
Aremu Anuoluwapo, Antoine Bosselut, Khyathi
Raghavi Chandu, Miruna Clinciu, Dipanjan
Das, Kaustubh D. Dhole, Wanyu Du, Esin
Durmus, Ondřej Dušek, Chris Emezue,
Varun Gangal, Cristina Garbacea, Tatsunori
Hashimoto, Yufang Hou, Yacine Jernite,
Harsh Jhamtani, Yangfeng Ji, Shailza Jolly,
Dhruv Kumar, Faisal Ladhak, Aman Madaan,
Mounica Maddela, Khyati Mahajan, Saad
Mahamood, Bodhisattwa Prasad Majumder, Pedro
Henrique Martins, Angelina McMillan-Major,
Simon Mille, Emiel van Miltenburg, Moin
Nadeem, Shashi Narayan, Vitaly Nikolaev,
Rubungo Andre Niyongabo, Salomey Osei,
Ankur Parikh, Laura Perez Beltrachini, Niranjan
Ramesh Rao, Vikas Raunak, Juan Diego
Rodriguez, Sashank Santhanam, João Sedoc,
Thibault Sellam, Samira Shaikh, Anastasia
Shimorina, Marco Antonio Sobrevilla Cabezudo,
Hendrik Strobelt, Nishant Subramani, Wei Xu,
Diyi Yang, Akhila Yerukola, and Jiawei Zhou.
2021. The GEM benchmark: Natural language
generation, its evaluation and metrics. arXiv
preprint arXiv:2102.01672.
https://doi.org/10.18653/v1/2021.gem-1.10

Ian Goodfellow, Jean Pouget-Abadie, Mehdi
Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio.
2014. Generative adversarial nets. In Advances
in Neural Information Processing Systems,
pages 2672–2680.

Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan
Zhu, and Minlie Huang. 2020. A knowledge-enhanced
pretraining model for commonsense story
generation. Transactions of the
Association for Computational Linguistics,
8:93–108.
https://doi.org/10.1162/tacl_a_00302

Jian Guan and Minlie Huang. 2020. UNION:
An unreferenced metric for evaluating
open-ended story generation. In Proceedings
of the 2020 Conference on Empirical
Methods in Natural Language Processing,
EMNLP 2020, Online, November 16-20, 2020,
pages 9157–9166. Association for Computational
Linguistics.
https://doi.org/10.18653/v1/2020.emnlp-main.736

Jian Guan, Yansen Wang, and Minlie Huang.
2019. Story ending generation with incremental
encoding and commonsense knowledge. In
Proceedings of the AAAI Conference on Artificial
Intelligence, volume 33, pages 6473–6480.
https://doi.org/10.1609/aaai.v33i01.33016473

Jian Guan, Zhexin Zhang, Zhuoer Feng, Zitao Liu,
Wenbiao Ding, Xiaoxi Mao, Changjie Fan, and
Minlie Huang. 2021. OpenMEVA: A benchmark
for evaluating open-ended story generation
metrics. In Proceedings of the 59th Annual
Meeting of the Association for Computational
Linguistics and the 11th International Joint
Conference on Natural Language Processing
(Volume 1: Long Papers), pages 6394–6407,
Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021.acl-long.500

Xiangzhe Kong, Jialiang Huang, Ziquan
Tung, Jian Guan, and Minlie Huang.
2021. Stylized story generation with
style-guided planning. In Findings of the
Association for Computational Linguistics:
ACL-IJCNLP 2021, pages 2430–2436,
Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021.findings-acl.215

Taku Kudo and John Richardson. 2018.
Sentencepiece: A simple and language independent
subword tokenizer and detokenizer for neural
text processing. In Proceedings of the
2018 Conference on Empirical Methods in
Natural Language Processing: System
Demonstrations, pages 66–71.
https://doi.org/10.18653/v1/D18-2012

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
Ghazvininejad, Abdelrahman Mohamed, Omer
Levy, Veselin Stoyanov, and Luke Zettlemoyer.
2020. BART: Denoising sequence-to-sequence
pre-training for natural language generation,
translation, and comprehension. In Proceedings
of the 58th Annual Meeting of
the Association for Computational Linguistics,
ACL 2020, Online, July 5-10, 2020,
pages 7871–7880. Association for Computational
Linguistics.
https://doi.org/10.18653/v1/2020.acl-main.703

Boyang Li, Stephen Lee-Urban, George Johnston,
and Mark Riedl. 2013. Story generation with
crowdsourced plot graphs. In Proceedings of
the AAAI Conference on Artificial Intelligence,
volume 27.

Jiwei Li, Michel Galley, Chris Brockett,
Jianfeng Gao, and William B. Dolan. 2016.
A diversity-promoting objective function for
neural conversation models. In Proceedings of
the 2016 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 110–119.

Chin-Yew Lin. 2004. ROUGE: A package for
automatic evaluation of summaries. In Text
Summarization Branches Out, pages 74–81,
Barcelona, Spain. Association for Computational
Linguistics.

Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi,
Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu,
Linjun Shou, Ming Gong, Pengcheng Wang,
Jiusheng Chen, Daxin Jiang, Jiancheng Lv,
Ruofei Zhang, Winnie Wu, Ming Zhou, and
Nan Duan. 2020. GLGE: A new general language
generation evaluation benchmark. arXiv
preprint arXiv:2011.11928.

Annie Louis and Charles Sutton. 2018. Deep
dungeons and dragons: Learning character-action
interactions from role-playing game transcripts.
In Proceedings of the 2018 Conference of
the North American Chapter of the Association
for Computational Linguistics: Human
Language Technologies, Volume 2 (Short
Papers), pages 708–713.
https://doi.org/10.18653/v1/N18-2111

Stephen Merity, Caiming Xiong, James Bradbury,
and Richard Socher. 2016. Pointer sentinel mix-
ture models. arXiv preprint arXiv:1609.07843.

Nasrin Mostafazadeh, Nathanael Chambers,
Xiaodong He, Devi Parikh, Dhruv Batra, Lucy
Vanderwende, Pushmeet Kohli, and James
Allen. 2016. A corpus and cloze evaluation for
deeper understanding of commonsense stories.
In Proceedings of NAACL-HLT, pages 839–849.
https://doi.org/10.18653/v1/N16
-1098

Kishore Papineni, Salim Roukos, Todd Ward,
and Wei-Jing Zhu. 2002. BLEU: A method for
automatic evaluation of machine translation.
In Proceedings of the 40th annual meeting of
the Association for Computational Linguistics,
pages 311–318.

Debjit Paul and Anette Frank. 2021. COINS:
Dynamically generating COntextualized in-
ference rules for narrative story completion.
In Proceedings of the 59th Annual Meeting
of the Association for Computational
Linguistics and the 11th International Joint
Conference on Natural Language Processing
(Volume 1: Long Papers), pages 5086–5099,
Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021.acl-long.395

Alec Radford, Karthik Narasimhan, Tim Salimans,
and Ilya Sutskever. 2018. Improving language
understanding with unsupervised learning.

Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners. OpenAI blog, 1(8):9.

Jack W. Rae, Anna Potapenko, Siddhant M.
Jayakumar, Chloe Hillier, and Timothy P.
Lillicrap. 2020. Compressive transformers for
long-range sequence modelling. In International
Conference on Learning Representations.

Colin Raffel, Noam Shazeer, Adam Roberts,
Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.
2020. Exploring the limits of transfer learning
with a unified text-to-text transformer. Journal
of Machine Learning Research, 21:1–67.

Hannah Rashkin, Asli Celikyilmaz, Yejin Choi, and
Jianfeng Gao. 2020. Plotmachines: Outline-conditioned
generation with dynamic plot state
tracking. In Proceedings of the 2020 Confer-
ence on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 4274–4295.
https://doi.org/10.18653/v1/2020
.emnlp-main.349

Marco Tulio Ribeiro, Tongshuang Wu, Carlos
Guestrin, and Sameer Singh. 2020. Beyond
accuracy: Behavioral testing of NLP models
with CheckList. In Proceedings of the 58th
Annual Meeting of the Association for
Computational Linguistics, pages 4902–4912,
Online. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020.acl-main.442

Tim Rocktäschel, Edward Grefenstette, Karl Moritz
Hermann, Tomás Kociský, and Phil Blunsom.
2016. Reasoning about entailment with neu-
ral attention. In 4th International Conference
on Learning Representations, ICLR 2016, San
Juan, Puerto Rico, May 2-4, 2016, Con-
ference Track Proceedings. http://arxiv
.org/abs/1509.06664.

Stuart Rose, Dave Engel, Nick Cramer, and
Wendy Cowley. 2010. Automatic keyword
extraction from individual documents. Text Mining:
Applications and Theory, 1:1–20. https://
doi.org/10.1002/9780470689646.ch1

Paul-Edouard Sarlin, Daniel DeTone, Tomasz
Malisiewicz, and Andrew Rabinovich. 2020.
Superglue: Learning feature matching with
graph neural networks. In Proceedings of the
IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 4938–4947.
https://doi.org/10.1109/CVPR42600.2020
.00499

Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila
Zilles, Yejin Choi, and Noah A. Smith. 2017.
The effect of different writing tasks on linguistic
style: A case study of the ROC story cloze
task. In Proceedings of the 21st Conference
on Computational Natural Language Learning
(CoNLL 2017), pages 15–25.
https://doi.org/10.18653/v1/K17-1004

Rishi Sharma, James Allen, Omid Bakhshandeh,
and Nasrin Mostafazadeh. 2018. Tackling
the story ending biases in the story cloze
test. In Proceedings of the 56th Annual
Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers),
pages 752–757.
https://doi.org/10.18653/v1/P18-2119

Yi Tay, Mostafa Dehghani, Samira Abnar,
Yikang Shen, Dara Bahri, Philip Pham, Jinfeng
Rao, Liu Yang, Sebastian Ruder, and Donald
Metzler. 2020. Long range arena: A bench-
mark for efficient transformers. In International
Conference on Learning Representations.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances
in Neural Information Processing Systems,
pages 5998–6008.

Alex Wang, Amanpreet Singh, Julian Michael,
Felix Hill, Omer Levy, and Samuel R. Bowman.
2019. GLUE: A multi-task benchmark and
analysis platform for natural language
understanding. In International Conference on
Learning Representations.
https://doi.org/10.18653/v1/W18-5446

Tianming Wang and Xiaojun Wan. 2019.
T-CVAE: Transformer-based conditioned
variational autoencoder for story completion. In
Proceedings of the Twenty-Eighth International
Joint Conference on Artificial Intelligence,
IJCAI 2019, Macao, China, August 10-16,
2019, pages 5233–5239. ijcai.org.

Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li,
Chenjie Cao, Yudong Li, Yechen Xu, Kai
Sun, Dian Yu, Cong Yu, et al. 2020a. CLUE:
A Chinese language understanding evalua-
tion benchmark. In Proceedings of the 28th
International Conference on Computational
Linguistics, pages 4762–4772.

Peng Xu, Mostofa Patwary, Mohammad
Shoeybi, Raul Puri, Pascale Fung, Anima
Anandkumar, and Bryan Catanzaro. 2020b.
MEGATRON-CNTRL: Controllable story
generation with external knowledge using
large-scale language models. In Proceedings
of the 2020 Conference on Empirical
Methods in Natural Language Processing,
EMNLP 2020, Online, November 16-20, 2020,
pages 2831–2845. Association for Computational
Linguistics.

Linting Xue, Noah Constant, Adam Roberts,
Mihir Kale, Rami Al-Rfou, Aditya Siddhant,
Aditya Barua, and Colin Raffel. 2021. mT5: A
massively multilingual pre-trained text-to-text
transformer. In Proceedings of the 2021
Conference of the North American Chapter
of the Association for Computational
Linguistics: Human Language Technologies,
pages 483–498.

Lili Yao, Nanyun Peng, Ralph Weischedel,
Kevin Knight, Dongyan Zhao, and Rui Yan.
2019. Plan-and-write: Towards better automatic
storytelling. In Proceedings of the AAAI Con-
ference on Artificial Intelligence, volume 33,
pages 7378–7385.
https://doi.org/10.1609/aaai.v33i01.33017378

Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke,
Yuxian Gu, Deming Ye, Yujia Qin, Yusheng
Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi
Wang, Yanan Zheng, Guoyang Zeng, Huanqi
Cao, Shengqi Chen, Daixuan Li, Zhenbo Sun,
Zhiyuan Liu, Minlie Huang, Wentao Han, Jie
Tang, Juanzi Li, Xiaoyan Zhu, and Maosong
Sun. 2020. CPM: A large-scale generative
Chinese pre-trained language model. arXiv
preprint arXiv:2012.00413. https://doi.org
/10.1016/j.aiopen.2021.07.001

Tiancheng Zhao, Ran Zhao, and Maxine
Eskenazi. 2017. Learning discourse-level diver-
sity for neural dialog models using conditional
variational autoencoders. In Proceedings of the
55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long
Papers), pages 654–664.

Zhe Zhao, Hui Chen, Jinbin Zhang, Xin Zhao,
Tao Liu, Wei Lu, Xi Chen, Haotang Deng,
Qi Ju, and Xiaoyong Du. 2019. UER: An
open-source toolkit for pre-training models.
EMNLP-IJCNLP 2019, page 241.
https://doi.org/10.18653/v1/D19-3041
