A Knowledge-Enhanced Pretraining Model
for Commonsense Story Generation
Jian Guan1 Fei Huang1 Zhihao Zhao2 Xiaoyan Zhu1 Minlie Huang1∗
1Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
2School of Software, Beihang University, Beijing, China
1Institute for Artificial Intelligence, State Key Lab of Intelligent Technology and Systems
1Beijing National Research Center for Information Science and Technology
j-guan19@mails.tsinghua.edu.cn,f-huang18@mails.tsinghua.edu.cn,
extsuioku@gmail.com, zxy-dcs@tsinghua.edu.cn,
aihuang@tsinghua.edu.cn
Abstract

Story generation, namely, generating a reasonable story from a leading context, is an important but challenging task. In spite of the success in modeling fluency and local coherence, existing neural language generation models (e.g., GPT-2) still suffer from repetition, logic conflicts, and lack of long-range coherence in generated stories. We conjecture that this is because of the difficulty of associating relevant commonsense knowledge, understanding the causal relationships, and planning entities and events with proper temporal order. In this paper, we devise a knowledge-enhanced pretraining model for commonsense story generation. We propose to utilize commonsense knowledge from external knowledge bases to generate reasonable stories. To further capture the causal and temporal dependencies between the sentences in a reasonable story, we use multi-task learning, which combines a discriminative objective to distinguish true and fake stories during fine-tuning. Automatic and manual evaluation shows that our model can generate more reasonable stories than state-of-the-art baselines, particularly in terms of logic and global coherence.
1 Introduction
Story generation is a strong indicator of machine understanding of natural language. It is often approached as selecting a sequence of events to form a story with a reasonable logic or plot. Although existing generative models (Roemmele, 2016; Fan et al., 2018; Fan et al., 2019) can generate stories with good local coherence, they are still struggling to plan a coherent plot and maintain a reasonable event sequence throughout the story, or they are often biased towards generating a limited set of stories with generic plots (See et al., 2019) (e.g., I have a great time), even when using the powerful generative model OpenAI's GPT-2 (Radford et al., 2019), as shown in Table 1.

∗ Corresponding author: Minlie Huang.
Pretrained GPT-2 has been shown to capture useful semantic and syntactic features (Alt et al., 2019), as demonstrated by state-of-the-art performance on some generation tasks such as machine translation and text summarization (Radford et al., 2019). However, compared with such tasks, whose source inputs contain sufficient information to generate the desired target texts, story generation is a typical open-ended generation task, where only very limited information is given in the input. As shown in this paper, we observe some severe issues when applying GPT-2 to generate reasonable stories, particularly commonsense stories from a limited beginning. These issues include repetition, logic conflicts, and lack of long-range coherence (See et al., 2019; Holtzman et al., 2019), as exemplified in Table 1. Specifically, although GPT-2 performs reasonably well at generating some concepts related to bus (e.g., driver, and the probable destinations home or party), it completely ignores the other key entity accident in the leading context, which could be caused by its lower frequency in GPT-2's initial training corpus (less than 7% of that of bus). Besides, even though the concepts are relevant, they are usually generic, and used repeatedly and illogically in the generated stories. Therefore, given such limited information as input, subsequent generation is extremely challenging without any external guidance, for instance, commonsense knowledge.
Context:
The first time I saw the results of an accident, I was in a bus.

Golden Continuation by Human:
We were stopped and a friend pointed out the accident on the road. A man was under a car after having been run over. People tried to move the car off him but only made it worse. After seeing that, I am much more careful when crossing the street.

Pretrained GPT-2:
She was trying to get home but the other passengers would not let her go. He thought she was going to die because of her weight. She was trying to get home but the other passengers would not let her go. The bus driver would not let her go.

Fine-tuned GPT-2:
I was on my way to a party. I was on my way to a party and I'd gotten out of my seat, and started driving. I got a flat tire, so I stopped driving. I drove to the party and had a great time.

Table 1: Story examples generated by human and GPT-2 models. The stories written by the pretrained GPT-2 and fine-tuned GPT-2 (post-trained on ROCStories [Mostafazadeh et al., 2016b]) suffer from repetition (in italics), bad inter-sentence coherence with the context (e.g., ignoring key entities such as accident, in bold), as well as conflicting logic (underlined, e.g., first stopped driving and then drove to the party), in spite of their good fluency and intra-sentence coherence.
The difficulties lie in associating interdependent commonsense knowledge for expanding a reasonable story, handling the causal relationships, and deciding the temporal order between entities and events in context.
Explicitly introducing external commonsense knowledge has been shown helpful to improve language understanding and the long-range coherence of generated texts (Zhou et al., 2018; Guan et al., 2019; Yang et al., 2019b). For example, for the entities in the given context of Table 1, many potentially related concepts (e.g., run over, cross street) can be inferred and predicted based on external commonsense knowledge bases such as ConceptNet (Speer and Havasi, 2012) and ATOMIC (Sap et al., 2019). These knowledge bases contain abundant semantic knowledge of concepts and inferential knowledge for commonsense reasoning. We enhance GPT-2 with such knowledge by post-training the model on knowledge examples constructed from these knowledge bases, which can provide additional crucial information for story generation. Empirical experiments demonstrate that training with millions of such examples helps improve the coherence and logicality of generated stories. Meanwhile, we adopt multi-task learning to address the problem of handling causal and temporal dependencies. We combine the generation objective with an auxiliary multi-label classification objective, which requires distinguishing true stories from fake stories that are constructed by randomly shuffling the sentences, replacing a sentence with a negatively sampled one, or repeating a sentence in an original story. The additional classification task empowers our model to better capture the logicality in a story implicitly, namely, modeling the causal and temporal dependencies and inter-sentence coherence, and avoiding repetition.
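As a concrete illustration of how such fake stories might be constructed, the sketch below applies one of the three perturbations to a true story. The function and variable names are illustrative assumptions; the exact sampling procedure may differ from the authors' implementation.

```python
import random

def make_fake_story(sentences, corpus_sentences):
    """Perturb a true story into a fake one for the auxiliary
    classification task (illustrative sketch, not the authors' code).

    sentences: list of sentence strings forming a true story
               (assumed to contain at least two sentences).
    corpus_sentences: pool of sentences for negative sampling.
    """
    story = list(sentences)
    op = random.choice(["shuffle", "replace", "repeat"])
    if op == "shuffle":
        # Randomly shuffle the sentence order.
        random.shuffle(story)
    elif op == "replace":
        # Replace one sentence with a negatively sampled one.
        i = random.randrange(len(story))
        story[i] = random.choice(corpus_sentences)
    else:
        # Repeat a sentence: copy one sentence over its successor.
        i = random.randrange(len(story) - 1)
        story[i + 1] = story[i]
    return story
```

For a five-sentence ROCStories-style story, each call yields one perturbed story that can be paired with a negative label for the classification objective.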
The main contributions of this paper are summarized as follows:
• We propose a knowledge-enhanced pretraining model for commonsense story generation by extending GPT-2 with external commonsense knowledge. The model is post-trained on knowledge examples constructed from ConceptNet and ATOMIC, thereby improving the long-range coherence of generated stories.

• To generate reasonable stories, we adopt a classification task to distinguish true stories from auto-constructed fake stories. The auxiliary task makes the model implicitly capture the causal and temporal dependencies between sentences and inter-sentence coherence, and leads to less repetition.

• We conduct extensive experiments with automatic and manual evaluation. Results show that our model can generate more reasonable stories than strong baselines, particularly in terms of logicality and global coherence.1
2 Related Work
2.1 Neural Story Generation
Many existing neural story generation models generated stories by conditioning upon various contents such as images (Huang et al., 2016) and short text descriptions (Jain et al., 2017).
Different from these studies, we consider in this paper the setting of open-ended story generation from only a limited leading context. For this task, prior studies have attempted to build specific sentence representations by modeling story entities and events to simplify the dependencies between phrases (Ji et al., 2017; Clark et al., 2018). Another line is to decompose story generation into separate steps (Martin et al., 2018; Fan et al., 2018; Wang et al., 2016; Xu et al., 2018; Yao et al., 2019; Fan et al., 2019). These models usually focused on first planning story sketches and then generating sentences from the sketches. However, improving pretrained models to generate commonsense stories is yet to be well investigated.

1 Our implementation is available at https://github.com/thu-coai/CommonsenseStoryGen, and a demo is available at http://coai.cs.tsinghua.edu.cn/static/CommonsenseStoryGen.
2.2 Pretraining
Recently, large-scale pretraining models have been widely developed in various NLP tasks. Some work leveraged pretraining to provide better language representations at the word level (Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2018) or sentence level (Le and Mikolov, 2014; Kiros et al., 2015) for various downstream task-specific architectures. However, Radford et al. (2018) and Devlin et al. (2018) suggest that these complex task-specific architectures are no longer necessary, and that it is sufficient to merely fine-tune pretrained task-independent transformer language models for downstream tasks. Mehri et al. (2019) explored different pretraining methods based on language models for dialogue context representation learning. Furthermore, Radford et al. (2019) demonstrate that pretrained language models (i.e., GPT-2) can perform downstream tasks better than state-of-the-art models even in a zero-shot setting (i.e., without any fine-tuning on task-specific data). Wolf et al. (2019) fine-tuned GPT-2 for personalized conversation generation, which obtains very competitive results in the challenge. However, as previous studies (See et al., 2019; Holtzman et al., 2019) observed, transferring GPT-2 directly to open-ended text generation still suffers from several issues such as repetition or lack of knowledge and inter-sentence coherence with different decoding algorithms. Besides, although Song et al. (2019) and Dong et al. (2019) extended the language model to support an encoder-decoder framework (Sutskever et al., 2014), we build our model based on GPT-2 because of its simplicity and broad applicability.
2.3 Commonsense Knowledge
Incorporating commonsense knowledge is necessary and beneficial for language inference (LoBue and Yates, 2011; Bowman et al., 2015; Rashkin et al., 2018b), reading comprehension (Mihaylov and Frank, 2018; Rashkin et al., 2018a), and particularly for open-ended language generation, which usually requires external knowledge to enrich the limited source information. Commonsense knowledge has been demonstrated to significantly improve dialogue generation (Zhou et al., 2018), story ending generation (Guan et al., 2019), and essay generation from given topics (Yang et al., 2019b). Recently, some work also attempted to integrate external commonsense knowledge into pretrained models such as BERT (Devlin et al., 2018) to enhance language representation for reading comprehension (Yang et al., 2019a) and other knowledge-driven NLP tasks like entity typing and relation classification (Zhang et al., 2019). Besides, Sun et al. (2019) improved BERT on Chinese NLP tasks with a multi-stage knowledge masking strategy to integrate phrase- and entity-level knowledge into the language representation. Moreover, Bosselut et al. (2019) transferred the implicit knowledge from GPT-2 by fine-tuning the model to generate an object given the subject and a relation as input in commonsense knowledge graphs, that is, automatic knowledge base construction. However, the low novelty of the generated objects showed that it could still be difficult for GPT-2 to generate commonsense texts solely based on its implicit knowledge. Therefore, we target integrating external knowledge into GPT-2 for generating more reasonable commonsense stories.
2.4 Multi-Task Learning
Incorporating other auxiliary task objectives to complement the primary goal has been shown to improve the performance in many NLP tasks such as sentiment classification (Yu and Jiang, 2016) and conversation generation (Zhao et al., 2017). Recently, multi-task learning was also used to pretrain language models to capture dependencies in context (Devlin et al., 2018; Mehri et al., 2019) and to further improve pretrained models' representation power during fine-tuning (Wolf et al., 2019).
3 Methodology
The task in this work can be defined as follows:
Given a one-sentence story beginning X as the
leading context, the model should continue to
complete a K-sentence story Y with a reasonable
plot. The sentences in a generated story should
have reasonable logical connections, causal
relationships, and temporal dependencies with
each other and with the given beginning. To this
end, we devise a novel framework to leverage
knowledge and handle the causal and temporal
dependencies, as Figure 1 shows.
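To make the overall training signal of this framework concrete, here is a minimal sketch assuming the generation (language-modeling) objective and the true/fake story classification objective are simply summed with a tunable weight; the function name and the weight cls_weight are illustrative assumptions, not details taken from this paper.

```python
import torch

def fine_tuning_loss(lm_loss: torch.Tensor,
                     cls_loss: torch.Tensor,
                     cls_weight: float = 1.0) -> torch.Tensor:
    """Combine the story-generation (language-modeling) loss with the
    auxiliary true/fake story classification loss during fine-tuning.
    cls_weight is an assumed hyperparameter, not a value from the paper.
    """
    return lm_loss + cls_weight * cls_loss
```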
3.1 Pretrained Transformer Language Model
The transformer architecture is a general model used in language modeling (Vaswani et al., 2017), which consists of multiple transformer blocks of multi-head self-attention followed by layer normalization and fully connected layers. Radford et al. (2019) used a 12-layer decoder-only transformer (GPT-2) (i.e., a left-to-right language model) with masked self-attention heads, which are constrained so that every token can only attend to its left context. Formally, the objective at this stage is to minimize the following negative log-likelihood:
$$\mathcal{L}_{\mathrm{GPT}} = -\sum_{t=1}^{|u|} \log P(u_t \mid u_{<t})$$
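A minimal sketch of this objective in code, assuming PyTorch and precomputed per-token logits from the left-to-right transformer (the tensor names are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def gpt_lm_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood L_GPT = -sum_t log P(u_t | u_{<t}).

    logits: [batch, seq_len, vocab] outputs of a left-to-right transformer,
            where each position only attends to its left context.
    tokens: [batch, seq_len] token ids of the sequence u.
    """
    # The prediction at position t-1 is scored against the gold token at t,
    # so shift logits and targets by one step.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_targets = tokens[:, 1:].reshape(-1)
    # Summing over all positions matches the summation in the formula above.
    return F.cross_entropy(shift_logits, shift_targets, reduction="sum")
```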