Visual Writing Prompts:
Character-Grounded Story Generation with Curated Image Sequences
Xudong Hong1,2,4, Asad Sayeed3, Khushboo Mehra2,4,
Vera Demberg2,4 and Bernt Schiele1,4
1Dept. of Computer Vision and Machine Learning, MPI Informatics, Germany
2Dept. of Language Science and Technology and Dept. of Computer Science,
Saarland University, Germany
3Dept. of Philosophy, Linguistics, and Theory of Science, University of Gothenburg, Sweden
4Saarland Informatics Campus, Saarbr¨ucken, Germany
{xhong,kmehra,vera}@lst.uni-saarland.de
schiele@mpi-inf.mpg.de, asad.sayeed@gu.se
Abstract
Current work on image-based story generation
suffers from the fact that the existing image se-
quence collections do not have coherent plots
behind them. We improve visual story gen-
eration by producing a new image-grounded
dataset, Visual Writing Prompts (VWP). VWP
contains almost 2K selected sequences of
movie shots, each including 5-10 images. The
image sequences are aligned with a total of
12K stories which were collected via crowd-
sourcing given the image sequences and a set
of grounded characters from the corresponding
image sequence. Our new image sequence col-
lection and filtering process has allowed us to
obtain stories that are more coherent, diverse,
and visually grounded compared to previous
work. We also propose a character-based story
generation model driven by coherence as a
strong baseline. Evaluations show that our
generated stories are more coherent, visually
grounded, and diverse than stories generated
with the current state-of-the-art model. Our
code, image features, annotations and collected
stories are available at https://vwprompt
.github.io/.
1
Introduction
In this work, we improve the quality of text
stories generated by neural models from image
sequences. We do so by improving the curation
of the image sequences that form the basis for
collecting the story/image pairs used to train the
models: We build a dataset in which the images
lend themselves better to telling a story. To show
the usefulness of our dataset, we train a coherence-
driven model where we design a coherence compo-
nent inspired by entity grid models. Experiments
565
show that our model produces more coherent, vi-
sually grounded and diverse stories than previous
models.
Stories are essential in natural language un-
derstanding and generation because they are the
key mechanism for humans to understand the
world (Piper et al., 2021). Automatically gener-
ating good stories is a challenging task requiring
various capabilities in language processing (Peng
et al., 2018), event understanding (Martin et al.,
2018; Hong et al., 2020), and world knowledge
(Guan et al., 2020; Hsu et al., 2020) to come to-
gether. Previous approaches to story generation
have used different kinds of input to guide the
story: Some use a textual prompt to start the story
(Fan et al., 2018), yet others involve describing
a sequence of images to direct the story (Huang
et al., 2016). We choose to work inside the latter
family of approaches in order to exploit the rich
information contained in image sequences and
to prevent suffering from the symbol grounding
problem (Harnad, 1990).
Research on visual narratives shows how it
would be possible to construct the sort of dataset
we propose: Image sequences should consist of
a series of coherent events centered around one
or more main characters (Cohn, 2020). In fact,
even Aristotle points out in Poetics that event and
character are the most important elements for a
good story.
To date, several datasets of image sequences
for narrative generation exist, such as the Visual
Storytelling (VIST; Huang et al., 2016) dataset,
which includes sets of images extracted from
Flickr albums. However, image sequences gener-
ated this way have the drawback that they may
Transactions of the Association for Computational Linguistics, vol. 11, pp. 565–581, 2023. https://doi.org/10.1162/tacl a 00553
Action Editor: Marco Baroni. Submission batch: 6/2022; Revision batch: 12/2022; Published 6/2023.
c(cid:2) 2023 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
5
3
2
1
3
4
4
8
7
/
/
t
l
a
c
_
a
_
0
0
5
5
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
5
3
2
1
3
4
4
8
7
/
/
t
l
a
c
_
a
_
0
0
5
5
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Figure 1: Comparison of one story in Visual Writing Prompts with one story in Visual Storytelling and five stories
Travel Blogs. Our dataset has recurring characters across all five images and sub-stories. Each occurrence of a
character in a sub-story has a bounding box in the corresponding image, which grounds the textual appearance to
visual input.
not lend themselves well to storytelling. Consider
for instance the image sequence shown in the first
column of Figure 1: The people featured across
the image sequence are all different, and there is
no real development of an event or a plot. So the
stories that humans were able to write for these
types of image sequences are often quite poor from
a narrative point of view and lead to low-quality
training data for our story generation algorithms,
which in turn, unsurprisingly, generate quite bad
stories.
We thus argue that image sequences serving
as writing prompts should be comprehensible as
visual narratives by humans. Humans (with rea-
sonable writing proficiency) can then ‘‘translate’’
such visual narratives into textual narratives. For
an image sequence to qualify as a visual narrative,
events and characters must have two proper-
ties: coherence, meaning that the events are se-
mantically related and centered around recurring
characters; and diversity, meaning that several
different events jointly construct a plot. Psycho-
linguistic experiments show that missing either of
these properties impedes human comprehension
of image sequences as visual narratives (Cohn
et al., 2012). In addition, the characters should be
easily recognized in the image sequences and can
be straightforwardly linked to the stories (visual
groundedness). Image sequences without these
properties are hence not effective writing prompts.
In this work, we define the term visual tellability
to mean the tellability (H¨uhn et al., 2014) of
image sequences, that is, how likely it is that
humans can write a story with an image sequence,
which measures whether the image sequences
have the three properties described above. We
propose a new dataset, Visual Writing Prompts
(VWP), containing curated image sequences and
matching user-generated stories. Our image se-
lection process allows us to choose optimized
image sequences that have high visual tellability,
and to encourage our crowdsourced storytellers
to produce coherent and visually grounded stories
with high diversity.
To obtain coherent and visually grounded sto-
ries, we provide cropped images of characters
explicitly with image sequences for storytellers.
To improve diversity, we select images from a
data source that is already likely to have a plot:
image sequences selected from movie scenes with
566
aligned synopses. To further show the impor-
tance of coherence and visual groundedness, we
propose a story generation model with a repre-
sentation of visual coherence focused principally
on character continuity as a strong baseline. Ex-
periments show that our model outperforms the
current state-of-the-art model TAPM (Yu et al.,
2021) and generates stories that are more coher-
ent, visually grounded, and diverse.
We summarize our contributions in this work
as follows: (a) We propose a pipeline to extract
image sequences automatically from annotated
movies as story writing prompts, which leads to
image sequences with higher visual tellability.
(b) We collect a new dataset of stories based
on curated image sequences with grounded char-
acters, which is more coherent and has better
diversity than previous datasets. (c) We propose a
character-grounded story generation model driven
by visual coherence as a strong baseline for image-
based story generation, which generates more co-
herent, diverse, and visually grounded stories than
the current state-of-the-art model.
2 Related Work
Story Generation. There are several existing
datasets for generating a story conditioned on a
prompt such as title (Fan et al., 2018), keyword
(Yao et al., 2019), cue phrase (Xu et al., 2020),
script (Pu et al., 2022), or story plot (Rashkin et al.,
2020). The ROCStories corpus (Mostafazadeh
et al., 2016) is a collection of short stories with
rich causal and temporal relations. In subsequent
work, new datasets have also been formed by
gathering annotations on subsets of ROCStories
for specialized story generation tasks such as mod-
eling character psychology (Rashkin et al., 2018),
counterfactual reasoning (Qin et al., 2019), and
so forth. The STORIUM dataset (Akoury et al.,
2020) of collaboratively written long stories con-
tains rich annotations such as narrator prompts,
character goals, and other attributes to guide story
generation. However, all these datasets relying on
textual prompts suffer from the symbol ground-
ing problem that the meanings of textual stories
are grounded on textual symbols (Harnad, 1990).
In contrast, our dataset contains stories grounded
on nonsymbolic prompts from visual perception,
that is, characters in image sequences.
Visually Grounded Stories. Early work on the
VIST dataset (Huang et al., 2016) identified that
language in visually grounded stories is much
more diverse than in image captions. However,
most of the previous datasets of visually grounded
stories have several limitations: characters are not
explicitly annotated (Chandu et al., 2019), the
dataset is limited in scale (Xiong et al., 2019), or
there is no sequence of events behind the images
(Park and Kim, 2015; Huang et al., 2016). Our
dataset is the first large-scale dataset that is fo-
cused on overcoming these limitations. Unlike the
VIST images, images in our VWP dataset do not
feature people posing for the camera in limited
contexts. Instead, they depict a rich range of sit-
uations, interactions, and emotions. Furthermore,
providing character annotations in VWP ensures
that the entities in the narrative are grounded to the
image sequence and can be easily tracked across
the sequence even when some visual attributes
change. We hypothesize that these features will
result in more coherent and visually grounded
stories while maintaining a high level of diversity.
3 Image Sequence Construction
In this section, we describe how we obtain im-
age sequences and design a pipeline to filter
and sample images. Our objective is to construct
image sequences that are visually tellable, that
is, are coherent, diverse, and visually grounded.
Our pipeline for image sequence construction is
shown in Figure 2.
Movie Scene Extraction. To achieve high co-
herence and diversity, we choose to select images
from movie scenes that have a plot consisting of
a series of events around several main charac-
ters. We extract movie scenes from the MovieNet
dataset (Huang et al., 2020) since it is a dataset
that contains movie synopses, annotated movie
scenes with extracted movie shots, and identified
main characters. The paragraphs in each movie
synopsis describe sub-plots of the movie plot,
which are aligned with one or more movie scenes.
Changing from one paragraph to another in the
synopsis indicates scene changes (Xiong et al.,
2019). Moreover, events and characters in one
movie scene are semantically coherent. We can
make use of these properties to achieve high di-
versity by sampling image sequences from movie
scenes aligned with only one paragraph, so that
image sequences are from one sub-plot with a
series of different events.
567
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
5
3
2
1
3
4
4
8
7
/
/
t
l
a
c
_
a
_
0
0
5
5
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
detect the degree of similarity, we first feed the
images to a ResNet-50 pre-trained on ImageNet
and extract image features after the f c7 layer.
Then we compute pairwise cosine similarities of
the image features within each image sequence
and discard an image if its cosine similarity with
any one of the other images is larger than 0.89.
Additionally, we detect adult content by apply-
ing a pre-trained classifier3 and exclude images
that trigger the classifier. We also remove the first
or the last image sequence in a movie to avoid
images with credits.
Image Sampling. The most intuitive way to
collect stories is to use extracted movie scenes
directly as writing prompts. Since these movie
scenes contain a large number of movie shots, we
control the workload by constraining the number
of images for each writing task to a lower number
K which is obtained through the second pilot
studies in Section 4.1. So from each selected
movie scene, we sample images consecutively in
non-overlapping sliding windows with a size of
K and use each set of K images as one writing
prompt.
4 Crowdsourcing Experiment Design
In this section, we design a crowdsourcing exper-
iment to collect stories using our collected image
sequences as writing prompts. Our objective is to
obtain coherent stories that have high diversity
from crowdsourced storytellers.
We design and run all our studies on Amazon
Mechanical Turk (AMT).4 The worker user inter-
face is shown in Figure 3. In each assignment, we
ask the worker to select a subset of images from the
image sequence and write a short story (50 to 300
words) that fits the image sequence. To ensure
that the human-written stories are grounded on
main characters, we provide names and cropped
images of at most five major characters. We re-
trieve the bounding boxes for each character from
the MovieNet annotations and choose the least
blurry appearance of each character in the image
sequence. We pose three questions to the workers.
The first two questions are used to identify work-
ers who have watched the movie from which the
image sequence is taken, as they might exhibit
3https://github.com/notAI-tech/NudeNet/.
4https://www.mturk.com/.
Figure 2: Image processing pipeline. Black squares are
input or output. Circles are processing steps.
Filtering Movies. Since we want to constrain
the range of commonsense inferences of story-
tellers to the real world and help them to produce
coherent stories, we first filter out all fantasy,
science fiction, and horror movies. We also filter
out all animations because their image charac-
teristics are too different from the other movies.
Filtering Images.1 To help storytellers to write
stories that are visually grounded on characters
or objects around them, we discard blurry images
and images without any COCO ‘‘objects’’.2 We
measure the amount of image blur by calculating
the variance of the Laplacian (Pech-Pacheco et al.,
2000) and remove images with a variance lower
than 30. We further apply a MaskRCNN-based
object detector (He et al., 2020) and filter out
images without any detected objects—this will
help us generate stories with interesting grounded
objects in the image.
To increase the diversity of image sequences,
we need to avoid including shots that are very
similar (as can happen when a person speaks in a
long monologue, for example) to one another. To
1Hyper-parameters in this section are determined by a
that optimizes the filter process
preliminary experiment
manually on 50 image sequences.
2A human character is also labeled as an ‘‘object’’ in
COCO dataset (Lin et al., 2014).
568
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
5
3
2
1
3
4
4
8
7
/
/
t
l
a
c
_
a
_
0
0
5
5
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
5
3
2
1
3
4
4
8
7
/
/
t
l
a
c
_
a
_
0
0
5
5
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Figure 3: Worker interface on Amazon Mechanical Turk. We first show the instructions and the requirements.
The main characters are provided on the left side. On the right side, each image is accompanied by a textarea.
The full story is presented under the input area. We also show the word count and the number of images used for
workers’ convenience. The questionnaire is at the bottom.
569
different behaviors during story-writing. The
third question is to measure visual tellability on
a 5-point Likert scale, which is used to show the
effectiveness of our image sequence construction
pipeline.
We also design a review form for story re-
viewers to judge the quality of collected stories.
We ask the reviewers: 1) whether they want to
approve the story; 2) if not, which requirement
does it break? 3) if yes, judge the statement: this
is a good story. on a 5-point Likert scale. The
first two questions are to assure that the collected
stories fulfill the following requirements: the story
is grammatical, the story is diverse, and the story
is visually grounded. The third question is to get
judgments of the quality of the approved stories.
4.1 Pilot Studies
We identify the following design questions of the
crowdsourcing experiment for data collection:
1. Does the image filtering process improve the
tellability of the image sequences?
2. What is the optimal number of images to provide
to workers to achieve high visual tellability at a
reasonable workload in one writing prompt?
We conducted two pilot studies to investigate
these questions. We collect 5 stories per image
sequence at most from different writers.
Pilot Study 1: Effectiveness of Image Filtering.
The first study tests whether our image-filtering
steps (see Section 3) increase the visual tellability
of the extracted image sequences. We extract 180
movie scenes containing 10 images each from
selected movies; on half of these, we apply our
image filters, while we leave the others as is. All
resulting image sequences have 5 to 10 images.
Results show that the average visual tellabil-
ity score of image sequences with filtering is
3.7, which is significantly higher (unpaired t-test,
t = 4.89, p-value < 0.001) than the average vi-
sual tellability score of image sequences without
filtering (3.29). This shows that our image filter-
ing process in the image sequence construction
pipeline leads to higher visual tellability and we
will apply image filtering in our data collection.
Pilot Study 2: Number of Images to Display.
The second study explores the effect of the number
of images K in a writing prompt on workload and
visual tellability. We randomly sample 150 movie
scenes with 20 images, where writers can choose
from 5 to 20 images for their stories. We set the
minimum number of images to 5 because the most
common narrative structure is 5-part play that
contains five components (Cohn, 2013). In addi-
tion, since there are five, we can make our data-
set comparable to theirs. We set the maximum
number to 20 because we find in a preliminary
experiment that the workload of writing prompts
with more than 20 images is too high considering
our budget. We then run our study on these scenes.
We find a negative correlation between the
actual number of images used by storytellers and
the visual tellability scores, r(500) = −0.17,
p < 0.001. This result indicates that showing
fewer images can both improve visual tellability
and reduce workload. However, we also want to
obtain longer stories. Since a majority of 89% of
the human-written stories use 5 to 10 images out
of 20 and achieve a reasonably high average vi-
sual tellability (3.75), we set the maximum num-
ber of images we display to 10.
5 Data Collection
In this section, we describe how we collect and
process the stories in the VWP dataset. Our goal
is to obtain narratives given the curated image
sequences as writing prompts.
Worker Qualification.
In order to improve
story quality, we apply a qualification process
to workers. We first collect 4.5K stories together
with visual tellability judgments and obtain 556
candidate workers. Each story is reviewed by one
of five graduate students from Saarland Univer-
sity who are proficient in English. To ensure that
the reviewers mutually understand the purpose of
the task, we let the reviewers judge 100 stories
then check the reviews together to agree on the
judgment standards. We then select 58 qualified
workers with an acceptance rate ≥90%, average
story quality >3.1, and accepted assignments ≥5.
We assign a qualification to these workers and
invite them to bulk collection.
Bulk Collection. We collect 7.5K stories with
the qualified workers in bulk collection. We group
about 300 image sequences into a batch and col-
lect 1.5K stories per batch. For each batch, we
570
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
–
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
5
3
2
1
3
4
4
8
7
/
/
t
l
a
c
_
a
_
0
0
5
5
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Name
VIST
Travel Blogs
VWP (Ours)
Image
Genre
photos
photos
movie shots
# Text
50 K
10 K
12 K
# Image
per Text
5
1
[5, 10]
# token
per Text
57.6
222.3‡
83.7
# Event
per Text
6.3
3.8‡
12.8
# Char.
per Text
3.4
2.3‡
13.1
Table 1: Comparison of statistics of VWP against previous datasets. Numbers with ‡ are obtained from
a small sample of the Disney split of the Travel Blogs dataset that is available in their repository.
sample s stories from each worker and review the
stories to update the assessment of the worker,
(cid:2)
s =
10,
10 log nw,
if nw < 10
otherwise
where nw is the number of stories that worker w
wrote in this batch. We run the bulk collection
batch by batch and revoke the qualification if
the worker does not satisfy the selection criteria
anymore.
Text Processing. We process the raw text to
make it easier for training story generation mod-
els. We tokenize all stories with the spaCy English
tokenizer (Virtanen et al., 2020). We then recog-
nize all entities using a Name Entity Recogni-
tion model (Peters et al., 2017). We change all
location names to placeholders and replace all
named characters in each story to [male0], . . . ,
[maleM ], [f emale0], . . . , [f emaleN ]. We ob-
tain the gender of each named person based on
a name statistics following Huang et al. (2016).
Finally, to mark the alignment between images
and story sections, we add a special separator to-
ken [sent]. We randomly sample 849 stories as
validation split and 586 stories as test split.
5.1 Statistics of the Dataset
We present statistics, automatic measures of co-
herence and diversity of our dataset
to show
that our collected stories are more coherent and
diverse.
Statistics. We compare the properties of our
to similar previous datasets including
dataset
Travel blogs (Park and Kim, 2015)5 and VIST
(Huang et al., 2016) in Table 1. Our VWP dataset
has 1965 image sequences with 20763 unique im-
ages from 122 movies. Each image sequence has
5https://github.com/cesc-park/CRCN.
Dataset
VIST
VWP (Ours)
# stories
4987
4680
Avg. LL
LL
−4017
−0.8055
−3722* −0.7953*
Table 2: Coherence by log-likelihood (LL) and
average log-likelihood (Avg. LL) on validation
split of VIST versus a sample split from our
VWP dataset with the same number of image
sequences. The stories are more coherent if the
number is larger.
5 to 10 images. Our stories have 45% more to-
kens, 103% more events, and 285% more char-
acters per text compared to the VIST dataset.
While the Travel blogs dataset has longer stories,
it has only one image per story.
Coherence. We first analyze coherence of the
stories focusing on the characters and their ap-
pearances. According to Centering theory (Grosz
et al., 1995), coherent narratives are typically
structured such that salient entities often appear
in strong grammatical roles like subject or ob-
ject. As a result, we apply a model based on this
theory, Entity Grid (Lapata and Barzilay, 2005),
to measure the local coherence of our dataset.
We apply the generative Entity Grid model im-
plemented in the Cohere toolkit (Smith et al.,
2016) on VIST and VWP. We calculate the log-
likelihood based on entity transitions as the story
coherence. The results in Table 2 show that our
dataset is significantly more coherent compared
to the VIST dataset (unpaired t-test, t = −5,
p-value < 0.001).
To further check whether event elements are
semantically related given the same image se-
quence, we also compute the average Jaccard sim-
ilarities between event elements of the stories for
each image sequence by main characters, pred-
icates (without auxiliary verbs), and arguments
in different semantic roles. We identify the main
571
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
5
3
2
1
3
4
4
8
7
/
/
t
l
a
c
_
a
_
0
0
5
5
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Dataset
VIST
VWP (Ours)
#
998
1000
PRD Characters Arguments Arg0 Arg1 Arg2 ArgM-LOC
0.063
0.068
0.055
0.057
0.013
0.017
0.018
0.048
0.041
0.101
0.018
0.025
0.184
0.21
Table 3: Semantic similarity between stories of each image sequence. For all results, the higher the
number, the better except the first column which is the number of image sequences. PRD refers to
predicate.
Dataset
Voc
Verb
VIST
VWP (Ours)
12627
13637
3447
4811
Diverse
Verb:
Verb:
Voc % Tok % Verb %
1.2
27.3
1.23
35.28
73.6
79
unigram bigram trigram
3.39
2.71
33.48
34.87
75.22
79.10
Table 4: Comparison of diversity. The first five columns show event diversity for the validation split of
VIST versus a comparable sample of VWP. We report measures including the vocabulary size (Voc),
unique number of verbs (Verb), verb-vocabulary ratio (Verb: Voc %), verb-token ratio (Verb: Tok %),
and percentage of diverse verbs (Diverse Verb %). The last three columns show predicate n-gram
diversity for VIST versus VWP. We measure diversity using unique:total ratios of predicate unigram,
bigram, and trigram. For all results the higher the number, the better.
characters in the raw text using coreference clus-
ters (Lee et al., 2018). To ensure that characters
mentioned only once in the story can be detected
by the coreference resolution model, we append
the stories with one introductory sentence per
character. For example, to identify the character
Jack in Figure 1, we add ‘‘This is Jack.’’ before
the story. The Jaccard similarity between story A
and B is defined as J(A, B) = A∩B
A∪B , where A, B
are the token sets of predicate/argument in story A
and B. The results in Table 3 show that the event
elements of stories conditioned on the same image
sequence are more semantically related to each
other. Our dataset has higher semantic cohesion
compared to the VIST dataset.
Diversity. We then measure diversity of the
stories from two perspectives: 1) If a story has
a plot with a series of different events, it must
have diverse events instead of just repeating one
event; 2) If these events are combined into dif-
ferent n-grams in the plot, then the story must
have diverse predicate n-grams. For example, in
the last column in Figure 1, the character Will
has a predicate trigram (tell, convince, work),
which is different from the next trigram (convince,
work, call).
For event diversity, we follow Goldfarb-
Tarrant et al. (2020) to obtain the unique number
of verbs, the verb-vocabulary ratio, verb-token
ratio, and the percentage of diverse verbs (not
in the top 5 most frequent verbs). The results in
Table 4 show that our dataset has higher event
diversity than VIST across all measures. To mea-
sure predicate n-gram diversity, we extract and
lemmatize verbs obtained from a Semantic Role
Labeling model (Shi and Lin, 2019) and calculate
the unique:total ratios of predicate unigram, bi-
gram, and trigram (Table 4). We observe that the
event sequences in VWP are more diverse than
those in VIST, because VWP has higher bigram
and trigram ratios.
Visual Groundedness. To check visual ground-
edness of the stories, we first apply the same
semantic role labeler to 25 human-written stories
each from VWP and VIST. We obtain 299 events
and 715 arguments from the VWP samples, and
84 events and 196 arguments from the VIST sam-
ples. We then manually annotated these events
and arguments with three labels: 1) Grounded
means the event or argument is in the correspond-
ing image; 2) Inferred means not in the image,
but can be inferred; 3) Hallucianted means not
in the image and cannot be inferred.
The results in Table 5 show that about 55%
of the events and 63% of the arguments in the
VWP stories appear in images, which are higher
than 45% of the events and 54% of the arguments
in the VIST stories. The numbers of events and
arguments that are not in the images but can be
inferred are similar between two datasets. Only
572
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
5
3
2
1
3
4
4
8
7
/
/
t
l
a
c
_
a
_
0
0
5
5
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Label
E Grounded
E Inferred
E Hallucianted
A Grounded
A Inferred
A Hallucianted
VWP
#
164
134
1
447
254
14
VIST
#
38
39
7
105
64
27
%
54.9
44.8
0.3
62.5
35.5
2.0
%
45.2
46.4
8.3
53.6
32.7
13.8
Table 5: Visual Groundedness of stories. We re-
port counts and percentages of each label in each
data. E means event and A means argument.
2% of the arguments in VWP stories are not in the
images and cannot be inferred (i.e., not visually
grounded). However, there are 8% of the events
and 14% of arguments are not visually grounded
in VIST. The results show that stories in VWP are
more visually grounded than stories in VIST.
6 Experiment and Evaluation
In this section, we propose a strong baseline model
for character-grounded story generation. We then
experiment on our VWP dataset and show the
results. Our goal is to demonstrate the usefulness
of our dataset.
We extract features for all images with Swin
Transformer (Liu et al., 2021), a state-of-the-art
computer vision backbone model where all pa-
rameters are fixed. We use their official model
checkpoint, pre-trained on the ImageNet-21K da-
taset, to increase domain generality. We extract
three different visual features:
1. Global features (global) are most commonly
used in image-based language generation. We
extract global features from the output of the last
feedforward layer.
2. Object
features (obj) are widely used in
image-based language generation. Since person
is also a label in object detection (Lin et al., 2014),
using object features is a proper baseline for char-
acter features. We obtain object features using a
Cascade Mask R-CNN object detector (Cai and
Vasconcelos, 2021) with the same Swin Trans-
former backbone. We crop the bounding boxes
of the top 20 objects that the detector predicts for
each image and extract the features the same way
as global features.
3. Character features (char) are extracted by crop-
ping out the least blurry instance of each character
using bounding boxes from our dataset. We feed
the bounding boxes to the same Swin Trans-
former backbone and get the features from the last
feedforward layer.
We use the following models for visual story
generation as baselines:
GPT-2.
(GPT-2; Radford et al., 2019) is a
Transformer-based language model pre-trained
on large-scale text. We use the small version,
which is widely used in previous works of story
generation.
al., 2021)
(TAPM; Yu et
TAPM.
a
Transformer-based model which adapts the vi-
sual features with pre-trained GPT-2. This is the
current state-of-the-art model for visual story
generation.
is
For each baseline, we consider four different
variants with different inputs: 1) only global image
features; 2) global features and object features;
3) global features and character features; and 4)
all three available features.
6.1 Character-based Visual Story
Generation
We propose the character-grid transformer model
(CharGrid) as a strong baseline to show the
importance of modeling coherence and visual
groundedness. We hypothesize that characters and
different instances of them in image sequences
play an important role in visual story genera-
tion models in two dimensions: firstly, explicit
character representations can improve quality of
generated stories, which has been observed in tex-
tual story generation (Clark et al., 2018). Secondly,
representations that describe different instances
to
of characters across images are beneficial
image-based story generation models.
Character Grid. To represent coherence of
image sequences, we proposed a novel visual rep-
resentation, character grid. As we mentioned in
Section 5.1, one of the most effective methods to
measure text coherence is Entity Grid, a matrix
of sentences by entities where the cells are the
grammatical roles of the entities in the sentence
573
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
5
3
2
1
3
4
4
8
7
/
/
t
l
a
c
_
a
_
0
0
5
5
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Figure 4: Example of character grid representations. Each row represents an image and each column represents a
character. Shades of the cells indicate the similarities between the character features and the image features. The
darker color represents higher similarity. The green square shows a pattern that indicates high coherence and the
red square represents low coherence.
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
5
3
2
1
3
4
4
8
7
/
/
t
l
a
c
_
a
_
0
0
5
5
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Figure 5: Architecture of character-grid transformer. The blue circles are pre-trained components where the
parameters are fixed.
context (Lapata and Barzilay, 2005). The contri-
bution of an entity’s mention to the sentence’s
coherence is defined by its within-sentence gram-
matical role.
Inspired by this, we measure the narrative
importance of a character in an image by the
similarity between global image features and the
character’s features. We thus model the coherence
of an image sequence using a matrix C of images
by character instances shown in Figure 4. We
obtain the narrative importance of each character
instance by computing the dot product of each
character’s features and the corresponding global
image features. In the character grid C, each ele-
ment is computed as cab = ia · lb, where ia is the
global features of image a, and lb is the features
of character b.
Model Architecture. As we show in Figure 5,
the architecture is based on the Transformer
model. The input to the Transformer is a sequence
of tokenized features including global image fea-
tures, character features, character grid, and text
features. Global image features and character fea-
tures are the same as the features for baseline
models described above, which are first fed to
574
trainable image and character encoders that con-
sists of a feedforward layer. Text features are to-
kenized representations of the generated context,
which are presented to the model incrementally.
The character grid is flattened and fed to a feed-
forward layer. The four inputs then pass through
the Transformer module. The output obtained at
each time step is a probability distribution over all
possible output tokens from a pre-trained GPT-2
tokenizer (Wolf et al., 2020).
We also construct two variants of our model to
inspect the contributions of each design decision.
We replace the character features with object fea-
tures to obtain the object-grid transformer model
(ObjGrid). We use both character features and
object features to obtain the entity-grid trans-
former model (EntiGrid).
Model Training. We randomly initialized the
model parameters except for the vision backbone
model. We optimize the model by maximizing the
likelihood of the image sequence-story pairs in
the training set. The parameters are updated via
backpropagation. We employ Nucleus sampling
(Holtzman et al., 2020) to obtain the full output se-
quence for validation. We compute the METEOR
score (Banerjee and Lavie, 2005) on the validation
set after each training epoch. If the current epoch
gets a lower METEOR score, we consider the
current epoch as the best epoch and run auto-
matic metrics on the test set. We choose the
METEOR score following previous work in vi-
sual story generation (see Section 2). In addition,
Huang et al. (2016) found METEOR correlates
than BLEU and
better with human judgment
Skip-Thoughts similarity on the VIST dataset.
6.2 Reference-based Metrics
Our goal is to show the effectiveness of character
grid representations. Although it has been shown
that reference-based metrics correlate poorly with
human judgments in open-ended language gener-
ation tasks (Guan and Huang, 2020; Gehrmann
et al., 2021), it is still efficient to use them for
comparison across many different models. Fur-
thermore, we want to make our results compa-
rable to the results of the state-of-the-art model
TAPM (Yu et al., 2021). They applied greedy
search to generate stories with their models for
testing and reported reference-based metrics. We
thus follow the same setting and compare our
proposed CharGrid model against several previ-
ous baselines.
We train all the models for at most 15 epochs
with 3 different random seeds. We apply the
reference-based metrics including unigram (B-1),
bigram (B-2), trigram (B-3), and 4-gram (B-4)
BLEU scores (B; Papineni et al., 2002), METEOR
(M; Banerjee and Lavie, 2005), ROUGE-L (R;
Lin, 2004), and CIDEr (C; Vedantam et al., 2015),
which were used in the visual storytelling shared
task (Mitchell et al., 2018). We then report the
mean and standard deviation of 3 runs.
Results in Table 6 show that the character-grid
transformer model (CharGrid) driven by visual
coherence outperforms TAPM with character
features (TAPM + char) significantly on BLEU-
1/2/3 and CIDEr. CharGrid model also outper-
forms GPT-2 with character features (GPT-2 +
char) significantly on most metrics except mar-
ginally on BLEU-4 and METEOR. The object-
grid transformer model (ObjGrid) outperforms
TAPM with object features (TAPM + obj) signif-
icantly on BLEU-1/2/3 and CIDEr. The ObjGrid
model also outperforms GPT-2 with object fea-
tures (GPT-2 + obj) significantly on most metrics
except marginally on BLEU-4. The entity-grid
transformer model (EntiGrid) outperforms TAPM
with all features (TAPM + obj, char) signifi-
cantly on most metrics except marginally on
METEOR and ROUGE-L. The EntiGrid model
also outperforms GPT-2 with all features (GPT-2 +
obj, char) on most metrics except BLEU-4.
These results show the effectiveness of character/
object/entity grid representations for coherence of
image sequences.
6.3 Human Evaluation
Because story generation is an open-domain task,
reference-based metrics can only show how out-
put stories match with the references. To measure
the quality of generated stories directly, we con-
duct a crowdsourcing experiment to obtain human
binary judgments between two systems. We de-
sign the first question for Grammaticality, which
measures whether the textual outputs are at least
grammatical and sets a foundation for other met-
rics. We then design questions for two properties
that we identified for good textual stories: Co-
herence and Diversity. Finally, we ask a question
to compare the Visual Groundedness in order to
make sure that the stories are relevant to the input
image sequence.
575
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
5
3
2
1
3
4
4
8
7
/
/
t
l
a
c
_
a
_
0
0
5
5
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Model
Features
B-1
B-2
B-3
B-4
M
R-L
C
GPT-2
GPT-2 + obj
GPT-2 + char
GPT-2 + obj,char
TAPM
TAPM + obj
TAPM + char
TAPM + obj,char
global
global, obj
global, char
global, obj, char
global
global, obj
global, char
global, obj, char
38.65**
40.65**
39.95**
40.41**
39.85**
40.86**
40.03**
40.87**
20.28**
21.35**
21.04**
21.44**
21.7**
22.13**
21.68**
21.99**
9.78**
10.2**
10.11**
10.56**
10.72**
10.83**
10.66**
10.72**
Ours
ObjGrid
EntiGrid
CharGrid
global, obj
global, obj, char
global, char
47.66
45.83
47.71
25.26
24.85
25.33
11.95
12.11
11.95
4.68*
4.87*
4.92+
5.06
5.19
5.25
5.18
5.06+
5.42
5.7
5.42
31.64**
31.69**
31.85*
32.03*
32.38+
32.34+
32.42+
32.48+
24.24+
24.05+
24.19+
24.38
25.09
24.91
24.88
24.87
1.66**
1.85**
1.57**
1.87**
1.48**
1.82**
1.4**
1.59**
32.83
32.68
33.03
24.42
24.89
4.68
3.53+
25.01
4.83
Table 6: Results of all models using different input features on the test set of VWP using reference-
based metrics including BLEU (B), METEOR (M), ROUGE-L (R-L), and CIDEr (C). All numbers are
average of three runs with different random seeds. +, *, and ** represent that the number is one, two,
or three standard deviations away from the mean of the CharGrid model.
Model
TAPM + char vs. TAPM
CharGrid vs. TAPM + char
Grammatical
+2.45
+6.49**
Coherence
+1.99
+8.41**
Visual Groundedness
+3.99*
+6.25*
Diversity
+1.69
+11.06**
Table 7: Human binary judgments (in percentage) of generated stories between TAPM and TAPM with
character features (TAPM + char), TAPM + char and our model (CharGrid) on the test set of VWP
across four criteria: Grammaticality, Coherence, Visually Groundedness, and Diversity. The numbers
are percentages. * p-value < 0.05. ** p-value < 0.01.
We conduct the experiment with 28 crowd
workers over 50 pairs of stories and report the
percentage of the judgments for each system that
annotators are in favor of. To make the stories
more readable, we change the generated charac-
ter placeholders to randomly sampled names. The
results in Table 7 show that TAPM with charac-
ter features (TAPM + char) outperforms TAPM
in Visual Groundedness significantly. CharGrid
outperforms TAPM + char on all metrics signifi-
cantly. We use two-sided binomial tests. This indi-
cates that our character grid representation yields
better stories. These results confirm the findings
in the evaluation with reference-based metrics.
use Nucleus Sampling (Holtzman et al., 2020)
with p = 0.1 on all models to generate the stories.
As in Figure 6, TAPM generates unreasonable
noun phrases (the train). With character features,
TAPM + char is able to explore character-object
interaction and reason that there is no train in the
image. So it generates more reasonable terms (a
street).
However, TAPM + char model fails to repre-
sent the relations between characters, TAPM +
char generates the pronoun they without introduc-
ing characters in the second image. In contrast,
CharGrid introduces two new characters correctly.
6.4 Qualitative Evaluation
7 Conclusions and Future Work
We also conduct a qualitative evaluation to show
that stories generated by TAPM with character
features are more visually grounded than without
character features and character grid representa-
tion further improves the coherence and visual
groundedness. To obtain more diverse text, we
We show that curated image sequences with char-
acters are effective as writing prompts for visual
story generation in both data collection and model
design. By filtering images without any objects
that could be recognized by the object detector
576
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
5
3
2
1
3
4
4
8
7
/
/
t
l
a
c
_
a
_
0
0
5
5
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Figure 6: Qualitative results of generated and human-written stories. The red color represents errors made by
models and the green color indicates better output.
and removing highly similar images to boost di-
versity, we can improve the visual tellability of
image sequences. Presenting selected characters
during the story-writing yields stories with charac-
ters grounded in images, which are more coherent
and diverse. Correspondingly, using character fea-
tures as input to the story generation model can
improve the quality of generated stories. Adding
the character grid representation can bring fur-
ther improvements in coherence, grammaticality,
visual groundedness, and diversity.
Future Work. One important property of visual
narratives not covered in this work is narrativity
(Piper et al., 2021), that is, whether an image se-
quence contains the necessary narrative structures
to make a good story. A narrative structure can
be achieved by events following a typical order
with roles like Establisher, Initial, Initial Peak,
and Release (Cohn, 2013). We observe that these
roles of events emerge in our collected stories.
Our annotations of different instances of the same
character across a story allow us to construct event
chains for each character. Future work should in-
vestigate how to annotate the roles of these events,
measure narrativity, and build a model to generate
stories with higher narrativity.
A major assumption of all previous work in
storytelling is that all humans are equally and
in story-writing and can
reasonably proficient
translate visual narratives into textual narratives.
However, individual differences in the writing
proficiency of humans must have an impact on
story quality. Exploring this from the perspective
of both data selection and model design would be
an interesting future direction to take.
Acknowledgments
Xudong Hong is supported by International Max
Planck Research School for Computer Science
(IMPRS-CS) of Max-Planck Institute for Infor-
matics (MPI-INF). This research was funded in
part by a Swedish Research Council (VR) grant
(2014-39) for the Centre for Linguistic Theory and
Studies in Probability (CLASP). This research was
also funded in part by the Chair of Computer Sci-
ence and Computational Linguistics at Saarland
University. We thank three anonymous review-
ers for their detailed and insightful reviews that
helped us to improve this paper. We sincerely
thank our action editor, and the editorial team at
Transactions of the Association for Computational
Linguistics. We also thank our student assistants:
Andrew Johnson, AriaRay Brown, Danielle Gregg
and Teresa Mart´ın Soeder. Last but not least, we
thank all the anonymous story writers for their
hard work and creative stories.
References
Nader Akoury, Shufan Wang, Josh Whiting,
Stephen Hood, Nanyun Peng, and Mohit Iyyer.
2020. STORIUM: A dataset and evaluation
platform for machine-in-the-loop story gener-
ation. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language
Processing (EMNLP), pages 6470–6484, On-
line. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2020.emnlp-main.525
Satanjeev Banerjee and Alon Lavie. 2005. ME-
TEOR: An automatic metric for MT evaluation
577
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
5
3
2
1
3
4
4
8
7
/
/
t
l
a
c
_
a
_
0
0
5
5
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
with improved correlation with human judg-
ments. In Proceedings of the ACL Workshop
on Intrinsic and Extrinsic Evaluation Measures
for Machine Translation and/or Summariza-
tion, pages 65–72, Ann Arbor, Michigan.
Association for Computational Linguistics.
Zhaowei Cai and Nuno Vasconcelos. 2021. Cas-
cade R-CNN: High quality object detection
and instance segmentation. IEEE Transactions
on Pattern Analysis and Machine Intelligence,
43(5):1483–1498. https://doi.org/10
.1109/TPAMI.2019.2956516, PubMed:
31794388
Khyathi Chandu, Eric Nyberg, and Alan W. Black.
2019. Storyboarding of
recipes: Grounded
contextual generation. In Proceedings of the
57th Annual Meeting of the Association for
Computational Linguistics, pages 6040–6046,
Florence, Italy. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/P19-1606
Elizabeth Clark, Yangfeng Ji, and Noah A.
Smith. 2018. Neural text generation in stories
using entity representations as context.
In
Proceedings of the 2018 Conference of the
North American Chapter of the Association
for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long Papers),
pages 2250–2260, New Orleans, Louisiana.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/N18
-1204
Neil Cohn. 2013. Visual narrative structure.
Cognitive Science, 37(3):413–452. https://
doi.org/10.1111/cogs.12016, PubMed:
23163777
Neil Cohn. 2020. Visual narrative comprehen-
sion: Universal or not? Psychonomic Bulletin &
Review, 27(2):266–285. https://doi.org
/10.3758/s13423-019-01670-1, PubMed:
31820277
Neil Cohn, Martin Paczynski, Ray Jackendoff,
Phillip J. Holcomb, and Gina R. Kuperberg.
2012. (Pea)nuts and bolts of visual narrative:
Structure and meaning in sequential image com-
prehension. Cognitive Psychology, 65(1):1–38.
https://doi.org/10.1016/j.cogpsych
.2012.01.003, PubMed: 22387723
Angela Fan, Mike Lewis, and Yann Dauphin.
2018. Hierarchical neural story generation. In
Proceedings of the 56th Annual Meeting of
the Association for Computational Linguis-
tics (Volume 1: Long Papers), pages 889–898,
Melbourne, Australia. Association for Compu-
tational Linguistics.
Sebastian Gehrmann, Tosin Adewumi, Karmanya
Aggarwal, Pawan Sasanka Ammanamanchi,
Anuoluwapo Aremu, Antoine Bosselut, Khyathi
Raghavi Chandu, Miruna-Adriana Clinciu,
Dipanjan Das, Kaustubh Dhole, Wanyu Du,
Esin Durmus, Ondˇrej Duˇsek, Chris Chinenye
Emezue, Varun Gangal, Cristina Garbacea,
Tatsunori Hashimoto, Yufang Hou, Yacine
Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza
Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak,
Aman Madaan, Mounica Maddela, Khyati
Mahajan, Saad Mahamood, Bodhisattwa Prasad
Majumder, Pedro Henrique Martins, Angelina
McMillan-Major, Simon Mille, Emiel van
Miltenburg, Moin Nadeem, Shashi Narayan,
Vitaly Nikolaev, Andre Niyongabo Rubungo,
Salomey Osei, Ankur Parikh, Laura Perez-
Beltrachini, Niranjan Ramesh Rao, Vikas
Juan Diego Rodriguez, Sashank
Raunak,
Jo˜ao Sedoc, Thibault Sellam,
Santhanam,
Samira Shaikh, Anastasia Shimorina, Marco
Antonio Sobrevilla Cabezudo, Hendrik Strobelt,
Nishant Subramani, Wei Xu, Diyi Yang,
Akhila Yerukola, and Jiawei Zhou. 2021. The
GEM benchmark: Natural
language genera-
tion, its evaluation and metrics. In Proceedings
of the 1st Workshop on Natural Language Gen-
eration, Evaluation, and Metrics (GEM 2021),
pages 96–120, Online. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/2021.gem-1.10
Seraphina Goldfarb-Tarrant, Tuhin Chakrabarty,
Ralph Weischedel, and Nanyun Peng. 2020.
Content planning for neural story generation
with aristotelian rescoring. In Proceedings of
the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 4319–4338, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.emnlp-main.351
Barbara J. Grosz, Aravind K. Joshi, and Scott
Weinstein. 1995. Centering: A framework for
modeling the local coherence of discourse.
Computational Linguistics,
21(2):203–225.
https://doi.org/10.21236/ADA324949
578
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
5
3
2
1
3
4
4
8
7
/
/
t
l
a
c
_
a
_
0
0
5
5
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan
Zhu, and Minlie Huang. 2020. A knowledge-
enhanced pretraining model for commonsense
story generation. Transactions of the Associa-
tion for Computational Linguistics, 8:93–108.
https://doi.org/10.1162/tacl a 00302
Jian Guan and Minlie Huang. 2020. UNION:
An Unreferenced Metric for Evaluating Open-
ended Story Generation. In Proceedings of
the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 9157–9166, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.emnlp-main.736
Stevan Harnad. 1990. The symbol grounding
problem. Physica D: Nonlinear Phenomena,
42(1):335–346. https://doi.org/10.1016
/0167-2789(90)90087-6
Kaiming He, Georgia Gkioxari, Piotr Doll´ar,
and Ross Girshick. 2020. Mask R-CNN. IEEE
Transactions on Pattern Analysis and Machine
Intelligence, 42(2):386–397. https://doi
.org/10.1109/TPAMI.2018.2844175,
PubMed: 29994331
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes,
and Yejin Choi. 2020. The curious case of
neural text degeneration. In International Con-
ference on Learning Representations.
Xudong Hong, Rakshith Shetty, Asad Sayeed,
Khushboo Mehra, Vera Demberg, and Bernt
Schiele. 2020. Diverse and relevant visual
storytelling with scene graph embeddings.
the 24th Conference on
In Proceedings of
Computational Natural Language Learning,
pages 420–430, Online. Association for Com-
putational Linguistics. https://doi.org/10
.18653/v1/2020.conll-1.34
Chao-Chun Hsu, Zi-Yuan Chen, Chi-Yang
Hsu, Chih-Chia Li, Tzu-Yuan Lin, Ting-Hao
Huang, and Lun-Wei Ku. 2020. Knowledge-
enriched visual storytelling. Proceedings of
the AAAI Conference on Artificial Intelligence,
34(05):7952–7960. https://doi.org/10
.1609/aaai.v34i05.6303
Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze
Wang, and Dahua Lin. 2020. Movienet: A ho-
listic dataset for movie understanding. In Com-
puter Vision – ECCV 2020, pages 709–727,
International Publishing.
Cham. Springer
https://doi.org/10.1007/978-3-030
-58548-8 41
Ting-Hao Kenneth Huang, Francis Ferraro, Nasrin
Mostafazadeh, Ishan Misra, Aishwarya Agrawal,
Jacob Devlin, Ross Girshick, Xiaodong He,
Pushmeet Kohli, Dhruv Batra, C. Lawrence
Zitnick, Devi Parikh, Lucy Vanderwende,
Michel Galley, and Margaret Mitchell. 2016.
the
Visual storytelling.
2016 Conference of
the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 1233–1239, San Diego, California. Asso-
ciation for Computational Linguistics.
In Proceedings of
Peter H¨uhn, Jan Christoph Meister, John Pier,
and Wolf Schmid, editors. 2014. Hand-
book of Narratology. De Gruyter, Berlin,
M¨unchen, Boston. https://doi.org/10
.1515/9783110316469
Maria Lapata and Regina Barzilay. 2005. Au-
tomatic evaluation of text coherence: Models
and representations. In IJCAI’05 Proceedings
the 19th International Joint Conference
of
on Artificial Intelligence, pages 1085–1090.
Morgan Kaufmann Publishers Inc.
Kenton Lee, Luheng He, and Luke Zettlemoyer.
2018. Higher-order coreference resolution with
In Proceedings of
coarse-to-fine inference.
the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 2 (Short Papers), pages 687–692,
New Orleans, Louisiana. Association for Com-
putational Linguistics.
Chin-Yew Lin. 2004. ROUGE: A package for
automatic evaluation of summaries. In Text
Summarization Branches Out, pages 74–81,
Barcelona, Spain. Association for Computa-
tional Linguistics.
Tsung-Yi Lin, Michael Maire, Serge Belongie,
James Hays, Pietro Perona, Deva Ramanan,
Piotr Doll´ar, and C. Lawrence Zitnick. 2014.
Microsoft COCO: Common objects in con-
In Computer Vision – ECCV 2014,
text.
pages 740–755, Cham. Springer International
Publishing. https://doi.org/10.1007
/978-3-319-10602-1_48
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan
Wei, Zheng Zhang, Stephen Lin, and Baining
579
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
5
3
2
1
3
4
4
8
7
/
/
t
l
a
c
_
a
_
0
0
5
5
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Guo. 2021. Swin transformer: Hierarchical
vision transformer using shifted windows. In
2021 IEEE/CVF International Conference on
Computer Vision (ICCV), pages 9992–10002.
https://doi.org/10.1109/ICCV48922
.2021.00986
Lara Martin, Prithviraj Ammanabrolu, Xinyu
Wang, William Hancock, Shruti Singh, Brent
Harrison, and Mark Riedl. 2018. Event repre-
sentations for automated story generation with
deep neural nets. Proceedings of
the AAAI
Conference on Artificial Intelligence, 32(1).
https://doi.org/10.1609/aaai.v32i1
.11430
Margaret Mitchell, Ting-Hao ‘Kenneth’ Huang,
Francis Ferraro, and Ishan Misra, editors.
2018. Proceedings of the First Workshop on
Storytelling. Association for Computational
Linguistics, New Orleans, Louisiana.
Nasrin Mostafazadeh, Nathanael Chambers,
Xiaodong He, Devi Parikh, Dhruv Batra, Lucy
Vanderwende, Pushmeet Kohli, and James
Allen. 2016. A corpus and cloze evaluation
for deeper understanding of commonsense sto-
ries. In Proceedings of the 2016 Conference
of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human
Language Technologies, pages 839–849, San
Diego, California. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/N16-1098
Kishore Papineni, Salim Roukos, Todd Ward,
and Wei-Jing Zhu. 2002. BLEU: A method for
automatic evaluation of machine translation.
In Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics,
pages 311–318, Philadelphia, Pennsylvania,
USA. Association for Computational Linguis-
tics. https://doi.org/10.3115/1073083
.1073135
Cesc C. Park and Gunhee Kim. 2015. Expressing
an image stream with a sequence of natural
sentences. In Advances in Neural Informa-
tion Processing Systems, volume 28. Curran
Associates, Inc.
Jos´e Luis Pech-Pacheco, Gabriel Crist´obal, Jes´us
Chamorro-Martinez, and Joaqu´ın Fern´andez-
Valdivia. 2000. Diatom autofocusing in bright-
field microscopy: A comparative study. In
Proceedings 15th International Conference on
Pattern Recognition. ICPR-2000, volume 3,
pages 314–317.
Nanyun Peng, Marjan Ghazvininejad, Jonathan
May, and Kevin Knight. 2018. Towards con-
trollable story generation. In Proceedings of the
First Workshop on Storytelling, pages 43–49,
Association for Computational Linguistics.
Matthew E. Peters, Waleed Ammar, Chandra
Bhagavatula, and Russell Power. 2017. Semi-
supervised sequence tagging with bidirectional
language models. In Proceedings of the 55th
Annual Meeting of the Association for Compu-
tational Linguistics (Volume 1: Long Papers),
pages 1756–1765, Vancouver, Canada. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/P17-1161
Andrew Piper, Richard Jean So, and David
Bamman. 2021. Narrative theory for compu-
tational narrative understanding. In Proceed-
ings of
the 2021 Conference on Empirical
Methods in Natural Language Processing,
pages 298–311, Online and Punta Cana,
Dominican Republic. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/2021.emnlp-main.26
Dongqi Pu, Xudong Hong, Pin-Jie Lin,
Ernie Chang, and Vera Demberg. 2022. Two-
stage movie script summarization: An efficient
method for low-resource long document sum-
marization. In Proceedings of the Workshop on
Automatic Summarization for Creative Writing,
pages 57–66, Gyeongju, Republic of Korea.
Association for Computational Linguistics.
Lianhui Qin, Antoine Bosselut, Ari Holtzman,
Chandra Bhagavatula, Elizabeth Clark, and
Yejin Choi. 2019. Counterfactual story reason-
ing and generation. In Proceedings of the 2019
Conference on Empirical Methods in Natural
Language Processing and the 9th International
Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 5043–5053,
Hong Kong, China. Association for Computa-
tional Linguistics.
Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever.
2019. Language models are unsupervised mul-
titask learners. OpenAI Blog, 1(8):9.
Hannah Rashkin, Antoine Bosselut, Maarten
Sap, Kevin Knight, and Yejin Choi. 2018.
580
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
5
3
2
1
3
4
4
8
7
/
/
t
l
a
c
_
a
_
0
0
5
5
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3
Modeling naive psychology of characters in
simple commonsense stories. In Proceedings
of the 56th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 1:
Long Papers), pages 2289–2299, Melbourne,
Australia. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/P18-1213
Hannah Rashkin, Asli Celikyilmaz, Yejin
Choi, and Jianfeng Gao. 2020. PlotMa-
chines: Outline-conditioned generation with
dynamic plot state tracking. In Proceedings
of the 2020 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP),
pages 4274–4295, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.emnlp-main.349
Peng Shi and Jimmy Lin. 2019. Simple BERT
models for relation extraction and semantic
role labeling. CoRR, abs/1904.05255.
Karin Sim Smith, Wilker Aziz, and Lucia Specia.
2016. Cohere: A toolkit for local coherence.
the Tenth International
In Proceedings of
Conference on Language Resources and Eval-
uation (LREC’16), pages 4111–4114, Portoroˇz,
Slovenia. European Language Resources Asso-
ciation (ELRA).
R. Vedantam, C. Zitnick, and D. Parikh. 2015.
CIDEr: Consensus-based image description
evaluation. In 2015 IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR),
pages 4566–4575, Los Alamitos, CA, USA.
IEEE Computer Society. https://doi.org
/10.1109/CVPR.2015.7299087
Pauli Virtanen, Ralf Gommers, Travis E.
Oliphant, Matt Haberland, Tyler Reddy, David
Cournapeau, Evgeni Burovski, Pearu Peterson,
Warren Weckesser, Jonathan Bright, St´efan J.
van der Walt, Matthew Brett, Joshua Wilson,
K. Jarrod Millman, Nikolay Mayorov, Andrew
R. J. Nelson, Eric Jones, Robert Kern, Eric
Larson, C J Carey, ˙Ilhan Polat, Yu Feng, Eric
W. Moore, Jake VanderPlas, Denis Laxalde,
Josef Perktold, Robert Cimrman, Ian Henriksen,
E. A. Quintero, Charles R. Harris, Anne M.
Archibald, Antˆonio H. Ribeiro, Fabian
Pedregosa, Paul van Mulbregt, and SciPy 1.0
Contributors. 2020. SciPy 1.0: Fundamental
algorithms for scientific computing in Python.
Nature Methods, 17:261–272. https://doi
.org/10.1038/s41592-020-0772-5,
PubMed: 32094914
Thomas Wolf, Lysandre Debut, Victor Sanh,
Julien Chaumond, Clement Delangue, Anthony
Moi, Pierric Cistac, Tim Rault, Remi Louf,
Morgan Funtowicz, Joe Davison, Sam Shleifer,
Patrick von Platen, Clara Ma, Yacine Jernite,
Julien Plu, Canwen Xu, Teven Le Scao,
Sylvain Gugger, Mariama Drame, Quentin
Lhoest, and Alexander Rush. 2020. Transform-
ers: State-of-the-art natural language process-
ing. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Pro-
cessing: System Demonstrations, pages 38–45,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2020.emnlp-demos.6
Yu Xiong, Qingqiu Huang, Lingfeng Guo, Hang
Zhou, Bolei Zhou, and Dahua Lin. 2019.
A graph-based framework to bridge movies
and synopses. In 2019 IEEE/CVF International
Conference on Computer Vision, ICCV 2019,
Seoul, Korea (South), October 27 – November
2, 2019, pages 4591–4600. IEEE. https://
doi.org/10.1109/ICCV.2019.00469
Peng Xu, Mostofa
Patwary, Mohammad
Shoeybi, Raul Puri, Pascale Fung, Anima
Anandkumar, and Bryan Catanzaro. 2020.
MEGATRON-CNTRL: Controllable story gen-
eration with external knowledge using large-
In Proceedings of
scale language models.
the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP),
pages 2831–2845, Online. Association for
Computational Linguistics.
Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin
Knight, Dongyan Zhao, and Rui Yan. 2019.
Plan-and-write: Towards better automatic story-
telling. Proceedings of the AAAI Conference
on Artificial Intelligence, 33(01):7378–7385.
https://doi.org/10.1609/aaai.v33i01
.33017378
Youngjae Yu, Jiwan Chung, Heeseung Yun,
Jongseok Kim, and Gunhee Kim. 2021. Tran-
sitional adaptation of pretrained models for
visual storytelling. In 2021 IEEE/CVF Confer-
ence on Computer Vision and Pattern Recog-
nition (CVPR), pages 12653–12663.
581
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
5
5
3
2
1
3
4
4
8
7
/
/
t
l
a
c
_
a
_
0
0
5
5
3
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
9
S
e
p
e
m
b
e
r
2
0
2
3