Controllable Summarization with Constrained Markov Decision Process

Hou Pong Chan1, Lu Wang2, and Irwin King3

1University of Macau, Macau SAR, China
2University of Michigan, Ann Arbor, MI, USA
3The Chinese University of Hong Kong, Hong Kong SAR, China
1hpchan@um.edu.mo
2wangluxy@umich.edu
3king@cse.cuhk.edu.hk

Abstract

We study controllable text summarization,
which allows users to gain control on a particu-
lar attribute (e.g., length limit) of the generated
summaries. In this work, we propose a novel
training framework based on Constrained
Markov Decision Process (CMDP), which
conveniently includes a reward function along
with a set of constraints, to facilitate better
summarization control. The reward function
encourages the generation to resemble the
human-written reference, while the constraints
are used to explicitly prevent the generated
summaries from violating user-imposed re-
quirements. Our framework can be applied
to control important attributes of summariza-
tion, including length, covered entities, and
abstractiveness, as we devise specific con-
straints for each of these aspects. Extensive
experiments on popular benchmarks show that
our CMDP framework helps generate informa-
tive summaries while complying with a given
attribute’s requirement.1

1 Introduction

Text summarization aims to condense the informa-
tion of an input document into a concise summary.
Although recently neural abstractive summariza-
tion models have achieved promising performance
(See et al., 2017; Paulus et al., 2018), they do not
allow users to indicate their preference to control
different aspects of the generated summaries. Con-
trollable summarization has many use cases. For
instance, it can summarize product descriptions
to fit within a word limit in online advertising.
In another example, teachers can demonstrate the
technique of paraphrasing important information
by showing a system-generated summary with
high abstractiveness. Controllable summarization
can also complement information retrieval systems,
for example, to only generate summaries
covering the entities that users are interested in.
Figure 1 illustrates one such usage, where our
proposed model produces distinct abstractive summaries
of the same source document, focusing on
different input entities.

1Our source code is available at https://github.com/kenchan0226/control-sum-cmdp.

To allow users to control a particular attribute
of the generated summaries, Fan et al. (2018)
proposed a token-based controllable summariza-
tion model (ControlSum). Although ControlSum
incorporates control tokens that let users specify
a requirement on a summary attribute, the max-
imum likelihood training objective of the model
does not provide explicit supervision signals that
prevent the model from violating the specified
attribute requirement. Consequently, a substantial
portion of the generated summaries still fail to
meet the specified attribute requirement as shown
in our experiments.

One possible solution to enforce the attribute re-
quirement is to apply reinforcement learning (RL)
with Markov Decision Process (MDP) (Bellman,
1957) to optimize a weighted sum of reward func-
tions, including a penalty function to penalize
the violation of the attribute requirement, and a
summarization metric to encourage the generated
summaries to be consistent with the references.
However, selecting appropriate weights for dif-
ferent reward functions is a delicate task, and
requires intensive hyperparameter tuning.

In this work, we argue that applying constraints
on the training objective is a more convenient way
to control an attribute of a summary, since it avoids
tuning reward function weights. We formulate the
problem of training controllable text summarization
models as a constrained Markov Decision
Process (CMDP) (Altman, 1999), an RL framework
trained with both rewards and constraints. In
this setup, we maximize a summarization metric
to encourage the similarity between the output
summaries and the references, and we impose
constraints to prevent the summaries from
violating a specified attribute requirement.

Moreover, we apply our approach to improve
token-based controllable summarization models
and control important summary attributes, including
length, covered entities, and abstractiveness, by
creating specific constraints for each attribute. For
length control, we divide summary length into disjoint
length bins and restrict the summary length
according to the desired length bin. For entity
control, we design constraints that guide the generated
summary to cover the salient information
of user-specified entities. To control abstractiveness,
which measures the degree of textual novelty
between a summary and its input document, we
define bins corresponding to three abstractiveness
levels, and design constraints that allow users to
control the summary's abstractiveness.

Extensive experiments are conducted on popular
benchmarks to evaluate the effectiveness
of our CMDP training framework with different
types of attribute requirements. Concretely, we use
our CMDP framework to fine-tune controllable
summarization models based on the pointer-generator
network (See et al., 2017), a Recurrent Neural
Network (RNN) (Hochreiter and Schmidhuber,
1997) model, and DistilGPT2 (Sanh et al., 2019),
a large-scale pre-trained Transformer (Vaswani
et al., 2017) model.2 Experiment results demonstrate
that our approach consistently improves the
ability of both controllable summarization models
to follow the specified attribute requirement. In
addition, our framework increases the ROUGE
scores of the generated summaries when provided
with the reference control tokens (e.g., the tokens
that represent the entities in the reference summary).
Human evaluations further confirm that
our framework produces informative summaries
that conform to the attribute requirement.

The key contributions of this paper include:
(1) A novel training framework that provides explicit
guidance signals to supervise a controllable
summarization model to conform to the specified
attribute requirement; (2) Constraints that allow
users to control the length, covered entities, and
the abstractiveness of the generated summaries,
respectively; (3) Consistent performance improvement
of controllable summarization models based
on two different architectures.

2We choose DistilGPT2 since it is smaller than GPT2.

Figure 1: A sample document and three summaries
generated by our entity-controlled model based on
DistilGPT2 (Sanh et al., 2019) and fine-tuned by our
proposed method. Each summary corresponds to the
requested entity inside the pair of brackets.

Transactions of the Association for Computational Linguistics, vol. 9, pp. 1213–1232, 2021. https://doi.org/10.1162/tacl_a_00423
Action Editor: Xiaojun Wan. Submission batch: 1/2021; Revision batch: 5/2021; Published 11/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

2 Related Work

Summarization Systems with Specified Attributes.
Several methods extend abstractive
summarization models to allow users to control a
specific attribute of summaries. Fan et al. (2018)
propose a method that allows users to control an
attribute such as length, entity, and style of summaries
by prepending special tokens to the input
document. Liu et al. (2018) focus on controlling
the exact length of summaries. They multiply the
input word embeddings in the decoder by the specified
summary length. Song et al. (2020) propose
a masked language model to control the portion of
copied words in the output summary for the sentence
summarization task. This model controls the
abstractiveness of a summary at the word level.
In contrast, our work controls the extractive fragment
density (Grusky et al., 2018) of the output
summary, which restricts the abstractiveness at the
fragment level. Makino et al. (2019) and Laban
et al. (2020) incorporate a penalty term on the
training objective to penalize a model for violating
the length requirement for word limit control.
However, their method requires tuning the
weight of the penalty whenever it is applied
to another dataset. Our approach imposes
constraints on the training objective and does not
need to search for suitable penalty weights based
on human inspection.

Query-focused summarization aims to predict
a summary that answers specific questions, for
example, ‘‘How often did Lebron James visit his
hometown?’’. Most of the query-focused sum-
marization methods are extractive and they are
based on centrality ranking (Wan, 2008; Wan and
Zhang, 2014), manifold-ranking (Wan et al., 2007;
Wan and Xiao, 2009; Wan, 2009), or sentence-
compression framework (Wang et al., 2013). Re-
cently, Nema et al. (2017) propose two query
attention-based models for abstractive query-
focused summarization. On the other hand, entity-
controlled summarization aims to produce a
summary that captures the salient information of
the desired entities, for example, ‘‘Lebron James’’.

Abstractive Summarization. Most of the exist-
ing abstractive summarization models (Gehrmann
et al., 2018; Zhang et al., 2020a; Chan et al.,
2020) are built on the encoder-decoder model
(Bahdanau et al., 2015) to generate summaries.
See et al. (2017) propose the pointer-generator
network, which allows copying words from the
source to the output summary. The structure-
infused copy mechanism (Song et al., 2018) in-
corporates the syntactic structure of the source
text into the pointer-generator network to facilitate
copying important words to the output summary.
Lebanoff et al. (2019) propose a summarization
framework that first extracts either a single sen-
tence or a pair of sentences from the source doc-
ument, then it condenses or fuses the selected
sentence(s) to generate a summary. The above
models do not allow users to constrain the degree
of copying nor sentence fusion from the source
document.

Recent methods apply RL with MDP to opti-
mize an abstractive summarization model towards
a single or a weighted sum of reward functions.
Several methods (Paulus et al., 2018; Çelikyilmaz
et al., 2018) adopt the ROUGE-L score (Lin,
2004) as the reward function. The SENECA model
(Sharma et al., 2019) optimizes a weighted sum
of ROUGE-2, ROUGE-L, and a coherence score
from a coherence model. To improve the factual
correctness of the generated summaries, several
methods (Huang et al., 2020; Zhang et al., 2020c)
use RL to maximize a weighted sum of ROUGE
scores and a factual correctness score computed by

a model. Kryscinski et al. (2018) use the weighted
sum of ROUGE-L and 3-gram novelty as the re-
ward to increase the abstractiveness of summaries,
but this method does not allow users to control the
abstractiveness level of summaries. Pasunuru and
Bansal (2018) extend the ROUGE-L reward by
up-weighting the salient words detected by a clas-
sifier. One can modify this word-level weighting
scheme to encourage the summary to contain cer-
tain keywords, but this method does not explicitly
encourage the model to generate relevant infor-
mation about the keywords. In contrast, we design
a constraint to enforce a summary to retain rele-
vant information of the requested entities. Ziegler
et al. (2020) apply RL to fine-tune a GPT2 model
(Radford et al., 2019). The reward is provided by
a model trained from human preferences on dif-
ferent summaries. Though one can use a weighted
sum of rewards to control an attribute of gener-
ated summaries, such a method needs to tune the
weights for rewards. Our CMDP approach avoids
the tuning of such weights.

Controllable Text Generation.
Controllable
text generation has received increasing attention
from researchers. In machine translation, several
methods (Sennrich et al., 2016; Kobus et al.,
2017; Takeno et al., 2017) apply special tokens
to control the politeness, domain, or length of
the translation output. Ficler and Goldberg (2017)
concatenate a style embedding with the decoder
input to control the style of the generated review.
Kikuchi et al. (2016), Miao et al. (2019), and
Schumann et al. (2020) introduce different tech-
niques to control sentence length for the headline
generation task, such as feeding a length embed-
ding to the decoder. The label-fine-tuning (LFT)
model (Niu and Bansal, 2018) uses special tokens
to control the politeness of responses for dialogue
response generation. Several insertion-based decoding
methods (Sun et al., 2017; Zhu et al.,
2019; Gu et al., 2019) are proposed to complete
a fill-in-the-blank sentence, for example,
‘‘__ keywords 1 __ keywords 2 __’’. These decoding
methods can be used to enforce the output to contain
certain keywords, but users need to specify the
relative order among the keywords. In contrast,
entity-controlled summarization lets the model
determine the relative order among the requested
entities. Recently, Keskar et al. (2019) train a large
language model conditioned on control codes that
specify particular attributes such as domain or
language style. Compared with the above methods,
our approach incorporates the attribute require-
ment into the training objective, which gives more
explicit supervision signals to the summarizer.

3 Controllable Summarization with

Constrained Markov Decision Process

3.1 Problem Definition

Given a text document x and a requirement on an
attribute a (e.g., length limit of 20 words), the goal
of controllable text summarization is to generate
a summary y that satisfies the requirement. Both
the input document and output summary are sequences
of words, namely, $x = [x_1, \ldots, x_{l_x}]$ and
$y = [y_1, \ldots, y_{l_y}]$, where $l_x$ and $l_y$ are the numbers
of words in x and y, respectively. In this work,
we focus on single-document summarization.

3.2 Constrained Markov Decision

Process Formulation

We propose a constrained Markov Decision Pro-
cess (CMDP) approach to guide a controllable
summarization model to follow the attribute re-
quirement. Assume an agent interacts with an
environment to generate a summary in discrete
time steps. At each step t, the agent performs
an action by sampling a word yt from its pol-
icy πθ, which is a controllable summarization
model. Then the agent updates its internal state
representation (hidden state of the decoder) and
proceeds to the next step. Once the agent produces
the end-of-sequence (EOS) token, we denote the
current time step as T , the environment gives
a reward r(y1, . . . , yT , y∗, x), and a set of costs
ci(y1, . . . , yT , y∗, x) to the agent. The process
then terminates. The reward function r mea-
sures the similarity between the output summary
[y1, . . . , yT ] and the reference summary y∗, while
a cost function ci quantifies how far a summary is
from satisfying an attribute requirement. For example, we
can define a length cost function that measures the
difference between the output summary length ly
and the specified length limit l: ly − l. The goal
of the agent is to maximize the expected reward
while ensuring the costs are under constraints
as follows:

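$$\max_{\pi_\theta} \; \mathbb{E}_{y_{1:T} \sim \pi_\theta}\left[ r(y_{1:T}, y^*, x) \right] \quad \text{s.t.} \quad \mathbb{E}_{y_{1:T} \sim \pi_\theta}\left[ c_i(y_{1:T}, y^*, x) \right] \le \alpha_i, \quad i = 1, \ldots, m, \tag{1}$$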

where y1:T denotes y1, . . . , yT , αi is a pre-defined
threshold associated with cost function ci, m is
the size of the set of constraints. A constraint
restricts an attribute of the generated summary.
For example, to limit the summary length, we
can define a constraint to enforce the length cost
function to be no larger than 0, ly − l ≤ 0.
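
As a minimal illustration of the length constraint above, the sketch below estimates the expected length cost from a handful of sampled summaries and compares it with the threshold. The function names (`length_cost`, `constraint_satisfied`) and the whitespace tokenization are assumptions made for this example; they are not taken from the released code.

```python
from typing import List


def length_cost(summary: str, limit: int) -> int:
    """Length cost l_y - l: positive when the summary exceeds the limit."""
    return len(summary.split()) - limit


def constraint_satisfied(samples: List[str], limit: int, alpha: float = 0.0) -> bool:
    """Monte Carlo check of E[c(y)] <= alpha over sampled summaries."""
    costs = [length_cost(s, limit) for s in samples]
    return sum(costs) / len(costs) <= alpha
```

Note that the constraint in Eq. (1) is stated in expectation, so this check averages the cost over samples rather than testing each summary individually.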

Lagrange Relaxation. Following Tessler et al.
(2019), we apply the Lagrange relaxation technique
(Bertsekas, 1997) to approximate the constrained
optimization problem in Eq. (1). We use $J(\pi_\theta)$
as a shorthand for $\mathbb{E}_{y_{1:T} \sim \pi_\theta}[r(y_{1:T}, y^*, x)]$
and $J_{c_i}(\pi_\theta)$ for $\mathbb{E}_{y_{1:T} \sim \pi_\theta}[c_i(y_{1:T}, y^*, x)]$.
We then define a Lagrangian function
$L(\lambda, \theta) = J(\pi_\theta) - \sum_{i=1}^{m} \lambda_i (J_{c_i}(\pi_\theta) - \alpha_i)$,
where $\lambda_i$ is a Lagrangian multiplier and
$\lambda = [\lambda_1, \ldots, \lambda_m] \in \mathbb{R}^m$.
When $\lambda_i \ge 0$ for all i, the optimal value of
$\max_\theta L(\lambda, \theta)$ is an upper bound on the optimal
value of Eq. (1). If we minimize the optimal value
of $\max_\theta L(\lambda, \theta)$ over λ, we obtain a tighter upper
bound on the optimal value of Eq. (1). Thus,
we approximate Eq. (1) by the following relaxed
problem:

$$\min_{\lambda \succeq 0} \; \max_{\theta} \; J(\pi_\theta) - \sum_{i=1}^{m} \lambda_i \left( J_{c_i}(\pi_\theta) - \alpha_i \right), \tag{2}$$

where $\lambda \succeq 0$ denotes that every entry in λ is
non-negative. Intuitively, this relaxed problem pe-
nalizes the behavior of violating the constraints,
and all the Lagrange multipliers λi are learnable.
In contrast, the MDP formulation requires the
manual tuning of weights for penalty terms.
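
The upper-bound property used in the relaxation can be checked in one step. For any θ that is feasible for Eq. (1) and any λ with non-negative entries, every term $\lambda_i (J_{c_i}(\pi_\theta) - \alpha_i)$ is non-positive, so subtracting these terms can only increase the objective. Written out (a worked step added for clarity, using only the symbols defined above):

```latex
L(\lambda, \theta)
  = J(\pi_\theta) - \sum_{i=1}^{m} \lambda_i \bigl( J_{c_i}(\pi_\theta) - \alpha_i \bigr)
  \;\ge\; J(\pi_\theta)
\quad\Longrightarrow\quad
\max_{\theta} L(\lambda, \theta)
  \;\ge\; \max_{\theta \,:\, J_{c_i}(\pi_\theta) \le \alpha_i \;\forall i} J(\pi_\theta).
```

Because the maximization on the left ranges over all θ, not only the feasible ones, it is at least as large as the constrained optimum on the right.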

Policy Training. We optimize θ and λ alternately
using gradient ascent and descent:
$\theta \leftarrow \theta + \eta_1 \nabla_\theta L(\lambda, \theta)$, $\lambda \leftarrow \lambda - \eta_2 \nabla_\lambda L(\lambda, \theta)$,
where η1 and η2 are the learning rates for θ and λ,
respectively. The gradients are expressed as follows:
$\nabla_\theta L = \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(y_t \mid y_{1:t-1}) \, (r - \lambda^\top c) \right]$ and
$\nabla_\lambda L = -(\mathbb{E}_{\pi_\theta}[c] - \alpha)$, where $c = [c_1, \ldots, c_m] \in \mathbb{R}^m$,
$\alpha = [\alpha_1, \ldots, \alpha_m] \in \mathbb{R}^m$, and $\mathbb{E}_{\pi_\theta}$
is a shorthand for $\mathbb{E}_{y_{1:T} \sim \pi_\theta}$. Since it is intractable
to enumerate all possible $y_{1:T}$, we approximate
the expectation using a sample output
sequence $y_{1:T} \sim \pi_\theta$. Moreover, we subtract
a baseline b from the reward, which is a standard
technique to reduce the variance of the gradient
estimator (Sutton and Barto, 1998). The gradients
are then estimated by:

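$$\nabla_\theta L \approx \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(y_t \mid y_{1:t-1}) \, (r - \lambda^\top c - b), \tag{3}$$

$$\nabla_\lambda L \approx -(c - \alpha). \tag{4}$$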

We can interpret ∇θL as the standard policy
gradient with a regularization term −λT c, where
λ is trained by a gradient descent algorithm.

In this work, we apply the self-critical baseline
(Rennie et al., 2017). Specifically, we use greedy
search to generate an output sequence ¯y from the
policy. Then, we treat the reward of this sequence
r(¯y, y∗, x) as the baseline b.
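
The per-sample quantities in Eqs. (3) and (4) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, the baseline is taken to be the reward of the greedy output as described above, and clipping the multipliers at zero is an assumption of this example (motivated by the requirement that every entry of λ be non-negative), not a detail confirmed by the text.

```python
from typing import List, Sequence


def reinforce_weight(reward: float, baseline: float,
                     lam: Sequence[float], costs: Sequence[float]) -> float:
    """Scalar (r - lambda^T c - b) that multiplies grad log pi in Eq. (3)."""
    penalty = sum(l * c for l, c in zip(lam, costs))
    return reward - penalty - baseline


def update_multipliers(lam: Sequence[float], costs: Sequence[float],
                       alpha: Sequence[float], lr: float) -> List[float]:
    """Descent step on lambda using Eq. (4), whose gradient is -(c - alpha).

    Clipping at zero keeps each multiplier non-negative; this projection
    is an assumption of the sketch.
    """
    return [max(0.0, l + lr * (c - a)) for l, c, a in zip(lam, costs, alpha)]
```

A multiplier grows while its cost exceeds the threshold and shrinks once the constraint is met, which is how the penalty weights are learned rather than tuned by hand.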

Reward Function. We apply BERTScore
(Zhang et al., 2020b) as the reward function to
measure the similarity between an output sum-
mary and the reference summary based on their
BERT (Devlin et al., 2019) contextual embed-
dings. We do not use ROUGE scores (Lin, 2004)
as the reward since they cannot match paraphrases
in an output.

3-gram Repetition Constraint. Similar to prior
work (Paulus et al., 2018; Liu and Lapata, 2019;
Laban et al., 2020), we address the problem of rep-
etition of text fragments by adding a 3-gram repe-
tition constraint into our framework. We define a
cost function that measures the ratio of 3-gram
repetition in a summary: RepeatRatio3(y) =
#repeat 3-gram/# 3-gram. Then we set its
threshold to zero and apply the following 3-gram
repetition constraint: RepeatRatio3(y) ≤ 0.
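
One plausible reading of the repetition ratio is sketched below: every 3-gram occurrence that duplicates an earlier 3-gram in the same summary counts as a repeat. The exact counting convention is an assumption of this example, and `repeat_ratio` is a hypothetical name.

```python
from typing import List


def repeat_ratio(tokens: List[str], n: int = 3) -> float:
    """Fraction of n-gram occurrences that repeat an earlier n-gram."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # summaries shorter than n tokens have no n-grams
    seen = set()
    repeats = 0
    for gram in ngrams:
        if gram in seen:
            repeats += 1
        seen.add(gram)
    return repeats / len(ngrams)
```

With a threshold of zero, the constraint is met only when this ratio is exactly zero, that is, when no 3-gram occurs twice.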

3.3 Implementation with RNN and
Pre-trained Transformer

We apply our CMDP framework to train two types
of controllable summarization models: pointer-
generator network (See et al., 2017) and Distil-
GPT2 (Sanh et al., 2019). The pointer-generator
network is a popular abstractive summarization
model based on RNN encoder-decoder model
(Bahdanau et al., 2015). We also incorporate
the intra decoder attention (Paulus et al., 2018)
mechanism since it has been shown to improve
the performance of the pointer-generator. GPT2
(Radford et al., 2019) is a large-scale pre-trained
language model based on Transformer (Vaswani
et al., 2017). DistilGPT2 is a compressed version
of GPT2 model using the knowledge distillation
technique (Sanh et al., 2019). We append the
text ‘‘TL;DR’’ to the input document to trigger
the summarization operation by DistilGPT2. We
append control tokens to these two models.

3.4 Length-controlled Summarization

Length-controlled summarization aims to control
the length of generated summaries. We adopt

the setting proposed by Fan et al. (2018), which
allows users to constrain the summary length to a
pre-defined range (e.g., 33 to 37 words). We first
divide summary length into 10 disjoint length bins
LB = (lb1, . . . , lb10). Each length bin corresponds
to a range of length, and each bin contains a
roughly equal number of training samples in the
corpus. Let lbi∗ denote the specified length bin.
The goal of this task is to generate a summary y
that satisfies the specified length bin lbi∗.

Base Model. We expand the vocabulary of the
model with ten special tokens, one for each
length bin. In training,
we feed the token that indicates the length bin
of the reference summary. During testing, we
control the length of the output summary by inputting
the token of our specified length bin. For
pointer-generator, we prepend the token at the
beginning of the document. For DistilGPT2, we
insert the special token into the ‘‘TL;DR:’’ prefix.

Length Bin Constraint. To encourage the sum-
mary length to match the specified length bin, we
define a cost function that computes the nor-
malized distance between the length bin of the
generated summary ˆi and the specified length bin
i∗: |ˆi − i∗|/10, then we set the threshold α = 0,
which leads to the following length bin constraint:
|ˆi − i∗| ≤ 0. We adopt a normalized cost func-
tion to prevent the values of costs from being
too large and dominating the gradient ∇θL in
Eq. (3).
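
The length-bin setup can be illustrated as follows. This is a sketch under stated assumptions: bin boundaries are taken at empirical quantiles of the training summary lengths so that bins hold roughly equal numbers of samples, and all function names are hypothetical.

```python
from bisect import bisect_left
from typing import List


def make_bin_edges(lengths: List[int], num_bins: int = 10) -> List[int]:
    """Upper edges of the first num_bins - 1 bins, at empirical quantiles."""
    ordered = sorted(lengths)
    return [ordered[len(ordered) * k // num_bins - 1] for k in range(1, num_bins)]


def bin_index(length: int, edges: List[int]) -> int:
    """1-based index of the bin whose range contains `length`."""
    return bisect_left(edges, length) + 1


def bin_cost(generated_len: int, target_bin: int, edges: List[int]) -> float:
    """Normalized distance |i_hat - i*| / number_of_bins."""
    return abs(bin_index(generated_len, edges) - target_bin) / (len(edges) + 1)
```

Dividing by the number of bins keeps the cost in [0, 1), in line with the normalization argument given above.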

3.5 Entity-controlled Summarization

Our second task is to generate a summary that
focuses on entities requested by a user. Fan et al.
(2018) anonymize each entity in the document by
a special token. In contrast, we do not anonymize
the entities, which is a more realistic setup.

Base Model. During training, we prepend the
reference entities to the document. These re-
quested entities are separated by segmenters, for
example, ‘‘Lebron James LA Lakers’’. At
test time, we control the focus of the summary by
feeding in our specified entities. To make the ref-
erence summaries focus on the reference entities,
we remove the reference summary sentences that
contain neither reference entities nor coreferent
mentions of reference entities on the training, validation,
and test splits.3

QA Constraint. We apply a question-answering
(QA) constraint to guide the generated
summary to capture the important information of
the requested entities. The main idea is to use
the QA-based metric from Eyal et al. (2019) and
Scialom et al. (2019) to evaluate the capability of
a summary to answer a set of questions regarding
the reference entities. The QA constraint ensures
that the score of the QA-based metric is above a
threshold.

Specifically, we first construct a set of cloze
question-answer pairs by individually masking
each of the named entities from the reference
summary to create the question, with the masked
entity as its gold-standard answer. The summary
predicted by a system is considered as the context
for a QA model. We feed each of the cloze
questions and the context to the QA model, then
the QA model extracts an answer from the context
for each cloze question. We use the F1 score of
the answers extracted by the QA model as the
evaluation metric, denoted as QA-F1 score. If
a summary presents the key information of the
reference entities, then the QA-model can predict
the correct answers from the summary most of
the time. We use the negative of QA-F1 as our
cost function and set the threshold to -0.9. Our QA
constraint is then defined as: −QA-F1(y) ≤ −0.9.
The QA model is a BERT model (Devlin et al.,
2019) with a span classification head on top of
the last-layer hidden states. The span classification
head is a fully connected layer that predicts the be-
ginning and ending positions of the answer span on
the context. We obtain a BERT-based QA model
that is fine-tuned on SQuAD 2.0 (Rajpurkar et al.,
2018) from Huggingface Transformers (Wolf
et al., 2019). Then we further fine-tune the QA
model on the CNN/Dailymail (Hermann et al.,
2015; Nallapati et al., 2016) corpus using our
constructed question-context-answer triplets. We
construct 349,653/17,442 cloze question-context-
answer triplets for training and development. The
details of the construction method are described
in §A.2.
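
The cloze construction and the QA-F1 cost can be sketched without the QA model itself. In this sketch the mask string `[MASK]`, the function names, and the SQuAD-style token-overlap F1 are assumptions made for illustration; the answers passed to `qa_cost` stand in for whatever the BERT-based QA model would extract from the generated summary.

```python
from collections import Counter
from typing import List, Tuple


def make_cloze_pairs(reference: str, entities: List[str],
                     mask: str = "[MASK]") -> List[Tuple[str, str]]:
    """One (question, answer) pair per entity that occurs in the reference."""
    return [(reference.replace(ent, mask, 1), ent)
            for ent in entities if ent in reference]


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer span."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def qa_cost(predictions: List[str], golds: List[str]) -> float:
    """Negative mean QA-F1, to be constrained by -QA-F1(y) <= -0.9."""
    scores = [token_f1(p, g) for p, g in zip(predictions, golds)]
    return -sum(scores) / len(scores)
```

Using the negative score as the cost turns "QA-F1 at least 0.9" into the same "cost at most threshold" form as the other constraints.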

3Fewer than 2% of the removed sentences contain named
entities that have coreferent mentions.

Entity Repetition Constraint. We find that the
QA constraint will cause the model to repeatedly
generate the same requested entity in a sentence,
because the model wants to increase the chance
that the QA model will select the requested entities
as the answer. Since a named entity usually
contains one or two words, the entity repetition
behavior cannot be fixed by the 3-gram repetition
constraint. To address this problem, we first define
a function ER(y) to measure the fraction of
sentences in y that contain repetition of requested
entities. We then use ER(y) as the cost function
and apply the following constraint: ER(y) ≤ 0.
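
A small sketch of the entity repetition measure follows. The sentence splitter (a regular expression on sentence-final punctuation), the case-insensitive substring matching, and the name `entity_repetition` are assumptions of this example.

```python
import re
from typing import List


def entity_repetition(summary: str, entities: List[str]) -> float:
    """Fraction of sentences in which some requested entity occurs more than once."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    if not sentences:
        return 0.0
    repeated = sum(
        any(s.lower().count(e.lower()) > 1 for e in entities) for s in sentences
    )
    return repeated / len(sentences)
```

Because the check is per sentence, an entity mentioned once in each of several sentences is not penalized; only repetition within a single sentence counts.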

3.6 Abstractiveness-controlled

Summarization

Our third task is abstractiveness-controlled sum-
marization, which allows a user to specify the
degree of text novelty between a generated sum-
mary and the corresponding document.4 In this
work, we adopt extractive fragment density
(Grusky et al., 2018) to measure the abstrac-
tiveness of a summary. Given a document x and
a summary y, the set of extractive fragments
F(x, y) is the set of common sequences of words
in x and y. Extractive fragment density is defined
as the mean square of the extractive fragment
lengths: $\frac{1}{l_y} \sum_{f \in F(x,y)} |f|^2$. Intuitively, a summary
that copies many longer text fragments from the
document has a higher extractive fragment density
and a lower abstractiveness. We divide the values
of extractive fragment density into three abstractiveness
bins: ab1 = (3.3, +∞), ab2 = (1.3, 3.3],
ab3 = [0, 1.3], which indicate low, medium, and
high abstractiveness, respectively. The goal of abstractiveness
control is to generate a summary y
that follows the specified abstractiveness bin abi∗.
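
The density computation and the bin assignment can be sketched as below. The fragment extraction follows the greedy longest-match procedure commonly associated with Grusky et al. (2018); treating it as the exact procedure used here is an assumption, as are the function names and the use of pre-tokenized word lists.

```python
from typing import List


def extractive_fragments(doc: List[str], summ: List[str]) -> List[List[str]]:
    """Greedily match the longest shared token span starting at each summary position."""
    frags = []
    i = 0
    while i < len(summ):
        best = 0
        j = 0
        while j < len(doc):
            if summ[i] == doc[j]:
                k = 0
                while (i + k < len(summ) and j + k < len(doc)
                       and summ[i + k] == doc[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            frags.append(summ[i:i + best])
        i += max(best, 1)  # skip past the matched span, or one unmatched token
    return frags


def fragment_density(doc: List[str], summ: List[str]) -> float:
    """(1 / l_y) * sum of squared fragment lengths."""
    if not summ:
        return 0.0
    return sum(len(f) ** 2 for f in extractive_fragments(doc, summ)) / len(summ)


def abstractiveness_bin(density: float) -> int:
    """Bin 1: density > 3.3 (low abstractiveness); bin 2: (1.3, 3.3]; bin 3: [0, 1.3]."""
    if density > 3.3:
        return 1
    if density > 1.3:
        return 2
    return 3
```

For example, a four-word summary that copies one three-word span from the document has density 9/4 = 2.25 and falls into bin 2.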

Base Model. Similar
to length control, we
use special tokens to denote the abstractiveness
bins and input a special token to control the
abstractiveness level of the output summary.

Abstractiveness Bin Constraint. To prevent the
output summary from violating the specified abstractiveness
bin, we apply a cost function that
evaluates the normalized distance between the abstractiveness
bin of the output summary ˆi and
the desired abstractiveness bin i∗: |ˆi − i∗|/3. We
set the threshold to 0 and obtain the following
abstractiveness bin constraint: |ˆi − i∗| ≤ 0.

4Abstraction refers to the process of semantic generaliza-
tion of concepts in the source document. The degree of text
novelty is a proxy for measuring abstractiveness.

Bin    CNN/DM    Newsroom    Newsroom-b
1      37.88%    33.94%      45.92%
2      57.56%    37.54%      25.96%
3       4.56%    28.52%      28.12%

Table 1: Distribution of abstractiveness bins of
reference summaries on CNN/DM, Newsroom,
and Newsroom-b training sets. Bin 3 is the most
abstractive bin. Newsroom-b is a subset of News-
room which has a more balanced distribution of
abstractiveness bins.

Conjunction Constraint. We find that after ap-
plying the abstractiveness constraint, the model
often inserts the conjunction ‘‘but’’ into a copied
fragment to decrease the extractive fragment den-
sity, even if there is no contrast relationship.
Since it is difficult to detect the improper use of
conjunction, we devise a constraint to prevent the
model from generating ‘‘but’’ when the reference
summary does not contain ‘‘but’’. Concretely, we
first define a binary function IC(y) as follows.
IC(y) = 1 if the predicted summary y contains
‘‘but’’ and the reference summary does not con-
tain ‘‘but’’; otherwise, IC(y) = 0. We then apply
the following conjunction constraint: IC(y) ≤ 0.
This method can be generalized to other discourse
markers depending on specific model behavior.
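
The indicator IC(y) is simple enough to state directly in code. The sketch below assumes lowercased whitespace tokenization, so a marker with attached punctuation (such as "but,") would not be detected; the function name and the `marker` parameter are hypothetical.

```python
def improper_conjunction(summary: str, reference: str, marker: str = "but") -> int:
    """IC(y): 1 if `marker` occurs in the summary but not in the reference, else 0."""
    in_summary = marker in summary.lower().split()
    in_reference = marker in reference.lower().split()
    return int(in_summary and not in_reference)
```

Passing a different `marker` value is one way to extend the same check to other discourse markers.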

4 Experimental Setup

Datasets. We use three popular summarization
datasets in our experiments. The first one is the
CNN/DailyMail (Hermann et al., 2015; Nallapati
et al., 2016) corpus. We use the standard splits,
which have 287,113/13,368/11,490 samples for
training, validation, and test sets. Each summary
in the training set has 66 words on average. We
follow the preprocessing steps of See et al. (2017).
Table 1 shows the distribution of abstractiveness
bins. We can observe that most of the reference
summaries belong to abstractiveness bins 1 and 2,
indicating that this dataset is not highly abstractive.

Moreover, we use a subset of the Newsroom
(Grusky et al., 2018) corpus. Newsroom contains
1.3 million news articles with summaries from 38
different news publishers. We construct a subset
of the Newsroom corpus called Newsroom-b that
has a more balanced distribution of abstractiveness
bins. We extract all the samples from three of the
news publishers (Washington Post, The Guardian,
and New York Times) and obtain the splits of

297,327/31,815/32,047 for training, validation,
and test sets. The distribution of abstractiveness
bins is shown in Table 1.

Furthermore, we conduct experiments of length
control on the DUC-2002 dataset (Ellis, 2002) us-
ing a test-only setup (Chen et al., 2018; Chen and
Bansal, 2018; Chan and King, 2021). DUC-2002
consists of 567 documents and each document has
two reference summaries. We remove the doc-
uments that are shorter than their corresponding
reference summaries, resulting in 554 documents.
This dataset has long reference summaries with
an average length of 113 words.

Baselines and Comparison. We use maximum
likelihood (ML) loss to train the pointer-generator
and DistilGPT2 based controllable summariza-
tion models described in §3.5, denoted as PG
and D.GPT2 respectively. We then use a suffix
‘‘+CMDP’’ to indicate that a model is fine-tuned
by our CMDP framework. The following baselines
do not use pre-trained models. We consider the
ControlSum (Fan et al., 2018) model as a baseline
for all of our control settings. For entity con-
trol, we incorporate query-focused summarization
baselines including GRSUM (Wan, 2008), an ex-
tractive model that incorporates query-relevance
into a random walk algorithm, QueryAtt (Nema
et al., 2017), an abstractive model that applies
a query attention to focus on different parts of
the input query, and SD2 (Nema et al., 2017),
which integrates an orthogonality constraint into
the QueryAtt model to encourage the successive
query attention context vectors to be orthogonal
to each other. Both the QueryAtt and SD2 models
have a strong inductive bias that the generated
summary should focus on the query. We mod-
ify the ROUGESal (Pasunuru and Bansal, 2018)
method by doubling the weights to the words of
the requested entities and treat it as a baseline,
denoted as ROUGEEnt.

Evaluation Metrics. For length control and entity
control, we evaluate the quality of summaries using
ROUGE-1, ROUGE-2, and ROUGE-L F1 scores
with full-length and stemming (Lin, 2004). For
abstractiveness control, we use embedding-based
metrics, BERTScore (Zhang et al., 2020b) and
MoverScore (Zhao et al., 2019),
to measure
the semantic similarity between an output summary and a
reference summary. To evaluate how well the generated
summaries satisfy the attribute requirement, we define a
metric called bin % to measure the percentage of generated
summaries that follow the specified bin (length or
abstractiveness bin). We use the QA-F1 score defined in §3.5
to evaluate whether a summary retains the essential
information of the reference entities. We define reference
entities as all the named entities (typed as location,
person, and organization) that appear in both the reference
summary and the first 400 words of the input document. We
also define appear % to measure the percentage of requested
entities that appear in the summary. For the non-reference
control settings, the entire test set is evaluated under
different control constraints, and reference summaries do not
exist in these cases.

                      Bin 1                   Bin 4                   Bin 7                   Bin 10
Method          R-1    R-2    R-L       R-1    R-2    R-L       R-1    R-2    R-L       R-1    R-2    R-L
ControlSum     32.40  14.30  28.28     36.30  15.34  31.95     38.55  16.18  34.50     40.30  17.08  36.59
PG             27.93  12.06  24.40     31.41  12.51  27.23     31.81  12.27  27.54     31.94  11.79  28.09
PG+CMDP        35.30  17.00  31.98     37.88  17.59  34.27     39.85  18.46  36.17     40.73  17.11  37.30
D.GPT2         31.21  13.36  27.12     36.27  15.97  31.91     38.18  16.43  33.64     40.87  17.45  36.62
D.GPT2+CMDP    33.09  13.48  29.74     38.41  16.55  34.59     39.65  16.77  35.79     42.05  17.77  38.35

Table 2: Results of length control on different specified length bins using the DUC-2002 data. Our CMDP framework consistently improves the ROUGE scores of PG and D.GPT2 (p < 0.04, approximate randomization test, for ROUGE-1 and ROUGE-L).

Implementation Details. We use spaCy (Honnibal et al., 2020) for coreference resolution. For RNN-based models, we use the Adam algorithm (Kingma and Ba, 2015) for training. We first use ML loss to train an RNN-based model until the validation loss stops decreasing for three consecutive checkpoints. Then we start the (C)MDP training. The initial learning rates are 1e-3 and 5e-5 for ML and CMDP training, respectively. For Transformer-based models, we use the AdamW algorithm (Loshchilov and Hutter, 2017) for training. We first use ML loss to train a Transformer-based model for 12 epochs. Then we start the (C)MDP training. The initial learning rates are 5e-5 and 1.77e-5 for ML and CMDP training, respectively. During CMDP training of D.GPT2, we freeze the bottom four layers of the model. We initialize the values of λ to 0.01.

5 Automatic Evaluation Results

5.1 Results of Length Control

Reference Length Bin.
We first evaluate the performance of length-controlled models when supplying the length bin of the reference summary (reference length bin) at testing time. The results are shown in Table 3. We observe that after applying our CMDP framework, both PG and D.GPT2 models obtain significantly higher ROUGE scores, and a larger portion of their generated summaries follow the specified length bin.

Method          R-1     R-2     R-L     Bin %
ControlSum      39.75   17.43   36.70   48.15
PG              35.07   15.05   32.11   74.09
PG+CMDP         39.77   16.65   36.66   94.37
D.GPT2          39.28   17.36   36.07   50.74
D.GPT2+CMDP     41.72   17.99   39.00   70.13
D.GPT2+MDP      41.46   17.69   38.74   69.71

Table 3: Results of length control using reference length bins as the input on the CNN/DM dataset. Our CMDP framework significantly improves the ROUGE scores and bin % of both PG and D.GPT2 (p < 0.0001, approx. randomization test).

We also report the results of the D.GPT2 model after it is fine-tuned by RL with an MDP (D.GPT2+MDP). In this MDP approach, the reward is BERTScore minus a weighted sum of the length bin distance and the 3-gram repetition ratio. We tune the penalty weights on the validation set and set the weights for length bin distance and 3-gram repetition to 0.4 and 0.6, respectively. Our CMDP approach outperforms the MDP approach on all ROUGE scores and on bin %. The above results demonstrate the effectiveness of our framework.

Moreover, we observe that the D.GPT2-based models obtain higher ROUGE scores but lower bin % than the PG-based models. One possible reason is that the large-scale pre-training in D.GPT2 makes the model more difficult to adapt to a specific bin requirement. This suggests a trade-off between the task metrics and the bin %.
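
The scalarized reward of the D.GPT2+MDP baseline described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the 3-gram repetition ratio is taken to be the fraction of 3-grams that repeat an earlier 3-gram, the length bin distance is taken to be the absolute bin difference divided by an assumed total of 10 bins, and the BERTScore value is passed in as a precomputed number. The function names are hypothetical; only the weights 0.4 and 0.6 come from the text.

```python
from collections import Counter
from typing import Sequence


def trigram_repeat_ratio(tokens: Sequence[str]) -> float:
    """Fraction of 3-grams that repeat an earlier 3-gram (assumed definition)."""
    trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(trigrams)


def mdp_length_reward(
    bert_score: float,
    generated_bin: int,
    target_bin: int,
    tokens: Sequence[str],
    num_bins: int = 10,   # assumption: ten length bins
    w_len: float = 0.4,   # penalty weight for length bin distance (from the text)
    w_rep: float = 0.6,   # penalty weight for 3-gram repetition (from the text)
) -> float:
    # Assumption: the bin distance is normalized by the number of bins.
    bin_distance = abs(generated_bin - target_bin) / num_bins
    penalty = w_len * bin_distance + w_rep * trigram_repeat_ratio(tokens)
    return bert_score - penalty
```

In this formulation both penalty weights are free hyperparameters that have to be tuned on validation data.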
Figure 2: Bin % of different models with different specified length bins on the DUC-2002 dataset. Our framework improves the bin % of PG and D.GPT2 for bins 4, 7, and 10 by a wide margin.

Arbitrary Length Bin. We evaluate the performance of length-controlled models when supplying different length bins at testing time. We report the results of length-controlled models on four different length bins: 1, 4, 7, and 10. We adopt the DUC-2002 dataset because its reference summaries are long, so we can evaluate the quality of summaries of different lengths by truncating the summaries. When computing ROUGE scores, we truncate the reference and system summaries to 33, 46, 59, and 100 words for specified length bins of 1, 4, 7, and 10, respectively. ROUGE evaluation with truncation is a common practice for evaluating a system summary when given a length budget (Hong et al., 2014). The intuition is that a good summary should contain the more essential information at the beginning.

We analyze the results of length-controlled models on different length bins. Figure 2 illustrates the bin % obtained by different models. We observe that all the models achieve a bin % of more than 90 for length bin 1. This is because length bin 1 covers the length range (0, 33], so it is easy to satisfy the requirement by generating a very short summary. For length bins 4, 7, and 10, our CMDP framework improves the bin % of both PG and D.GPT2 models by a wide margin. From Table 2, we can see that our framework consistently improves the ROUGE scores of PG and D.GPT2 models.

Costs and Lagrangian Multipliers. Furthermore, we analyze the values of costs (c) and Lagrangian multipliers (λ) of our PG+CMDP model during training. From Figure 3, we can see that the costs received by the agent decrease gradually over iterations. This is because the relaxed training objective of our framework in Eq.
(2) penalizes the behavior of violating the constraints. We also observe that the values of the Lagrangian multipliers λ keep increasing. The reason is that, according to Eq. (4), the gradient of λ is negative as long as there is a sample that violates the constraints during training. As mentioned in §3.2, λ is learned by a gradient descent algorithm, and the algorithm increases λ when the gradient is negative.

Figure 3: Values of costs (c) and Lagrangian multipliers (λ) of PG+CMDP for length control on every checkpoint (4k iterations) during training. Each value is averaged over 4k iterations.

5.2 Results of Entity Control

Reference Entities. We first evaluate the performance of entity-controlled models in summarizing the reference entities. For each of the models, we feed in all the reference entities to generate a summary that centers on the reference entities. The results are presented in Table 4. We use the CNN/DM dataset for entity-controlled summarization because it contains named entities in 99.74% of the reference summaries in its test set, whereas the Newsroom-b dataset only has 85.24%. When computing QA-F1 and appear %, we ignore the samples that do not have a named entity in the reference summary. We observe that our framework consistently and significantly improves the ROUGE scores, QA-F1 score, and appear % for both the PG and D.GPT2 models. These results demonstrate the effectiveness of our framework in summarizing reference entities.

We also adopt the D.GPT2+MDP model as a rival system. In this control setting, the reward is BERTScore(y) + γ1 QAF1(y) − γ2 RepeatRatio3(y) − γ3 ER(y). We set γ1, γ2, and γ3 to 0.15, 0.4, and 0.5, respectively, after hyperparameter tuning. It is observed that the MDP
approach and our CMDP approach obtain similar performance, while our approach has fewer hyperparameters to tune.

Method          R-1     R-2     R-L     QA-F1   Appear %
GRSUM           35.89   15.86   31.96   34.92   76.22
ROUGEEnt        39.45   20.36   36.78   23.47   83.75
ControlSum      39.41   19.94   36.55   27.02   74.08
QueryAtt        38.92   20.38   36.47   25.12   75.10
SD2             39.43   20.71   36.88   27.23   75.97
PG              37.61   19.27   35.04   23.53   37.96
PG+CMDP         40.81   20.23   37.56   30.38   86.64
D.GPT2          41.68   22.32   38.85   35.32   82.31
D.GPT2+CMDP     45.00   23.65   41.85   36.00   93.37
D.GPT2+MDP      45.00   23.50   41.90   35.72   94.46

Table 4: Results of entity-controlled models using reference entities as the input on the CNN/DM dataset. Our CMDP framework significantly improves the ROUGE scores, QA-F1, and appear % (p < 0.0001, approx. randomization test).

Entities at Different Positions. Next, we evaluate the capability of entity-controlled models to summarize entities at different positions of the document with the following setup. For each of these models, we use the named entities at document sentences 1 to 2, 3 to 4, 5 to 6, and 7 to 8 as the requested entities, respectively. Since we do not have reference summaries for these entities, we use the document sentences to construct cloze questions to evaluate the output summaries. For each requested entity, we build cloze questions by masking each document sentence that contains the entity or its coreferent mention. We use the F1 score of the answer predicted by the QA model as an evaluation metric, denoted as QA′-F1.

We analyze the performance of our method for entities at various sentences of the document. The results of appear % and QA′-F1 scores are presented in Figure 4. We observe that our CMDP framework consistently improves the appear % and QA′-F1 scores of both PG and D.GPT2 models for entities at different positions. Without our CMDP training, the appear % is low for entities at later positions of the document.
The reason is that we use reference entities for model training, and the reference entities are concentrated in the first few sentences of the document, which biases a neural model towards these sentences. In the training set of CNN/DM, 45.6% of the reference entities appear in the first two document sentences. Nevertheless, the neural models fine-tuned with our CMDP framework achieve high appear % for entities at varying positions.

Figure 4: Results of entity-controlled models for entities in different document sentences. Our CMDP framework consistently improves the QA′-F1 and appear % for entities at different positions.

Moreover, we observe that the GRSUM system achieves the highest QA′-F1 scores and that its appear % scores are similar to those of D.GPT2+CMDP. We analyze the reasons as follows. The GRSUM system is an extractive method, while all other methods in Figure 4 are abstractive methods. It is relatively easy for an extractive method to select document sentences that mention the requested entities to obtain high appear %. In the setting of non-reference entity control, we use document sentences to construct the cloze questions for the QA′-F1 metric since we do not have a reference summary. Hence, the QA′-F1 metric tends to give higher scores to extractive summaries. We also observe that the GRSUM model achieves higher QA′-F1 scores for the entities at later sentences of the document. The entities at later positions of a news article are usually less important entities that are only mentioned once and do not have coreferent mentions. The GRSUM system relies on term vectors to measure the relevance of a sentence. Thus, this system cannot recognize a coreferent mention that uses completely different words (e.g., a pronoun). As a result, it is easier for GRSUM to extract a summary for entities at later locations.
However, an extractive method cannot paraphrase the information of the document to generate a concise entity-focused summary.

5.3 Results of Abstractiveness Control

We analyze the capability of abstractiveness-controlled models to generate summaries with different abstractiveness levels. In our experiments, for each of the abstractiveness-controlled models, we feed in abstractiveness bin 1, bin 2, and bin 3 independently. The results on the Newsroom-b and CNN/DM datasets are presented in Tables 5 and 6. We can see that our CMDP framework consistently improves the BERTScores and MoverScores of PG and D.GPT2 models. We also observe that all the models achieve a bin % of more than 99 for bin 1 (least abstractive), because it is easier for models to directly copy document sentences than to paraphrase document information. For abstractiveness bins 2 and 3, our CMDP framework substantially improves the bin % of PG and D.GPT2 models, which shows that our framework improves the ability of summarization models to generate summaries of higher abstractiveness levels. Similar to the results of length control, there is a trade-off between the task metrics and the bin %.

Method ControlSum PG PG+CMDP D.GPT2 D.GPT2+CMDP D.GPT2+MDP bin 1 bin 2 BERTS. MoverS. Bin % BERTS. MoverS. Bin % BERTS. MoverS. Bin % 24.30 13.11 26.03 11.15 98.88 12.96 00.37 14.57 17.52 87.09 80.95 17.06 9.87 7.45 7.67 11.16 13.18 13.06 6.42 8.49 97.65 0.47 72.48 72.37 20.53 17.67 22.78 22.30 25.75 25.25 24.50 22.59 26.95 26.41 31.21 30.40 16.15 15.99 18.44 17.05 18.45 18.27 26.53 26.49 29.62 27.85 30.12 29.77 99.40 99.36 99.72 99.17 99.72 99.77 bin 3

Table 5: Results of abstractiveness-controlled models with different specified bins on the Newsroom-b dataset. Bin 3 is the most abstractive bin. Our CMDP framework significantly improves the BERTScore, MoverScore, and bin % over all the bins (p < 0.003, approx. randomization test).

                       Bin 1                      Bin 2                      Bin 3
Method         BERTS.  MoverS.  Bin %     BERTS.  MoverS.  Bin %     BERTS.  MoverS.  Bin %
ControlSum     38.55   23.56    99.94     39.47   23.23     1.09     37.51   20.58     0.03
PG             35.42   20.00    99.85     34.91   18.67     1.42     32.53   15.87     0.21
PG+CMDP        41.77   25.96   100.00     40.79   23.54    75.10     34.22   17.71    48.62
D.GPT2         39.02   23.82    99.90     39.58   23.30     1.93     38.15   21.23     0.01
D.GPT2+CMDP    43.23   26.65    99.56     44.07   26.39    62.60     42.03   24.71     1.94
D.GPT2+MDP     42.56   26.43    99.77     43.59   26.23    55.67     41.44   24.42     2.09

Table 6: Results of abstractiveness-controlled models with different specified bins on the CNN/DM dataset. Our CMDP framework significantly improves the BERTScore and MoverScore (p < 0.003, approx. randomization test) over all the bins. It also significantly improves the bin % for bins 2 and 3 (p < 0.00001, approx. randomization test).

We then compare the bin % results on the CNN/DM dataset with those on Newsroom-b. For abstractiveness bin 3 (most abstractive), all the models achieve a low bin % on CNN/DM but a substantially higher bin % on Newsroom-b. This is because only 4.6% of the CNN/DM training samples belong to bin 3. Hence, it is difficult for a model to learn to generate a highly abstractive summary. In contrast, the Newsroom-b dataset has a balanced distribution of abstractiveness bins, so that a model can learn from more abstractive references.

Furthermore, we compare our framework with the D.GPT2+MDP model on both datasets. The reward is BERTScore(y) − γ1 |î − i∗| / 3 − γ2 RepeatRatio3(y) − γ3 IC(y), where î denotes the abstractiveness bin of the generated summary and i∗ denotes the specified abstractiveness bin. On the CNN/DM dataset, we set γ1, γ2, and γ3 to 0.3, 0.5, and 0.3, respectively.
On the Newsroom-b dataset, we set these weights to 0.4, 0.5, and 0.3, respectively. We observe that the MDP approach and our CMDP approach obtain similar performance, while our approach has fewer hyperparameters to tune.

6 Human Evaluation

We conduct human evaluation to verify the quality of the generated summaries. We hire postgraduate students as annotators, and each test sample is evaluated by three annotators. The names of the models are hidden from the annotators.

6.1 Results of Entity Control

The human annotators evaluate entity-controlled summarization models using the following metrics: (i) fluency: estimating the readability and grammaticality of a summary using a rating from 1 to 5; (ii) faithfulness: a yes/no question indicating whether a summary is factually consistent with the document, where the annotators are instructed to state "yes" only if the summary does not contain any factual inconsistencies; and (iii) entity-relevance: evaluating how well a summary retains the key information of the requested entities using a rating from 1 to 5.

Method          Fluency   Entity-rel.   Faithful.
SD2             4.83      3.63          70%
D.GPT2          4.65      3.33          68%
D.GPT2+CMDP     4.83      3.92          71%

Table 7: Human fluency, entity-relevance, and faithfulness scores of entity-controlled models with the reference entities as the input. Faithful. denotes the percentage of generated summaries that are faithful. Krippendorff's α inter-rater agreement for the three columns is 0.68, 0.77, and 0.56, respectively.

Reference Entities. We ask human annotators to evaluate the quality of summaries when requesting reference entities. For each of the entity-controlled models, we feed in all the reference entities. The overall number of annotators is six.
For each of the test samples, we present the input document, requested entities, reference summary, and three system summaries generated by the SD2, D.GPT2, and D.GPT2+CMDP models. We present the evaluation scores on 100 random samples of the CNN/DM dataset in Table 7. For the faithfulness metric, we report the percentage of faithful summaries computed by majority vote (i.e., at least two out of three annotators vote as faithful). Our D.GPT2+CMDP method significantly outperforms the D.GPT2 and SD2 models in terms of entity-relevance (power analysis with mixed effects model (Card et al., 2020), power > 0.99; approx. randomization test, p < 0.0001) while maintaining fluency and faithfulness similar to those of the SD2 model (approx. randomization test, p > 0.97).
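
The majority-vote aggregation behind the faithfulness column can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes each summary comes with a list of yes/no votes, one per annotator, and treats a strict majority as "faithful", which for three annotators is the "at least two out of three" rule stated above.

```python
from typing import List, Sequence


def is_faithful_by_majority(votes: Sequence[bool]) -> bool:
    # Strict majority; with three annotators this means at least two "yes" votes.
    return sum(votes) * 2 > len(votes)


def faithful_percentage(all_votes: List[Sequence[bool]]) -> float:
    # Percentage of summaries judged faithful by the majority of their annotators.
    faithful = sum(is_faithful_by_majority(votes) for votes in all_votes)
    return 100.0 * faithful / len(all_votes)
```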

Sent.   Method          Fluen.   Ent.-rel.   Faith.
3&4     SD2             4.75     2.81        63%
3&4     D.GPT2+CMDP     4.79     3.36        64%
5&6     SD2             4.78     2.68        62%
5&6     D.GPT2+CMDP     4.78     3.29        62%

Table 8: Human fluency, entity-relevance, and
faithfulness scores of entity-controlled models
for entities at different document sentences.
Krippendorff's α inter-rater agreement for these
scores is 0.60, 0.78, and 0.44, respectively.

Entities at Different Positions. We pick the
best two models (SD2 and D.GPT2+CMDP) in
the previous section to further conduct human
evaluation for entities at different sentences of the
document. The total number of annotators is four.
As mentioned in §5.2, a large share of the reference
entities (45.6% in the CNN/DM training set) are located
in document sentences 1 to 2. To avoid excessive overlap
with the reference-entity setting, we skip the bin of
sentences 1 to 2 and conduct evaluation on the subsequent
two bins, sentences 3 to 4 and 5 to 6. For each
model, we feed in the named entities at document
sentences 3 to 4 and 5 to 6 as the requested entities
respectively. Since we do not have gold-standard
summaries for this setup, we cannot show the ref-
erence summaries to the annotators. The results
on 100 random samples are shown in Table 8.
Our D.GPT2+CMDP model consistently achieves
higher entity-relevance scores than the SD2 model
(power analysis with mixed effects model, power
> 0.81, approx. randomization test, p < 0.0001) and obtains competitive fluency and faithfulness scores (approx. randomization test, p > 0.41).
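
The significance values reported here come from an approximate randomization test. A minimal sketch of one common paired variant is given below. The source does not specify the test statistic, the number of shuffles, or the implementation, so the choices here are assumptions: the statistic is the absolute difference in mean score between two systems, each shuffle randomly swaps the two systems' scores on every sample, and the function name and default parameters are hypothetical.

```python
import random
from typing import Sequence


def approx_randomization_test(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    trials: int = 10000,
    seed: int = 0,
) -> float:
    """Two-sided paired approximate randomization test on the mean difference."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs the same number of scores per system")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        # Swapping the two systems on a sample flips the sign of its difference.
        shuffled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(shuffled) / n >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)
```

Called with the per-sample ratings of two systems on the same test samples, the function returns an estimated p-value; small values indicate that the observed gap is unlikely under random relabeling of the systems.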

6.2 Results of Abstractiveness Control

The annotators evaluate abstractiveness-controlled
models using the following setting. There are six
annotators for the results of CNN/DM dataset and
three annotators for the results of Newsroom-b
dataset. For each test sample, we generate two
groups of system summaries (group 1 and group
2). For group 1, we use our D.GPT2+CMDP
model to generate three different summaries by
feeding abstractiveness bin 1, bin 2, and bin 3
respectively. For group 2, we use our PG+CMDP
model to generate three different summaries using
a similar method. During evaluation, we present
the source document, the reference summary,
and two groups of system summaries to the
annotators. The summaries within each group are
randomly shuffled.

                    CNN/DM         Newsroom-b
Method            EM     PM       EM     PM
PG+CMDP           66%    94%      84%    96%
D.GPT2+CMDP       66%    92%      86%    98%

Table 9: Percentage of human abstractiveness
rankings that are consistent with the specified
bins, measured by exact match (EM) and partial
match (PM). Krippendorff's α inter-rater agreement
for the abstractiveness rankings on CNN/DM and
Newsroom-b is 0.85 and 0.72, respectively.

Abstractiveness Among Summaries. We eval-
uate the abstractiveness of the generated sum-
maries by human judgments using the following
setup. For each group of system summaries, we
ask the annotators to give a ranking among the
three system summaries according to their ab-
stractiveness. For instance, if an annotator thinks
that summary 1 > summary 2 > summary 3 in
terms of abstractiveness, then the annotator gives
a ranking of [3, 2, 1] to them. The abstractive-
ness rankings from different annotators are then
aggregated by averaging. If the aggregated ab-
stractiveness ranking is consistent with the order of
our specified abstractiveness bins, then this group
of summaries has an exact match. For example,
suppose the order of our specified abstractiveness
bins is [3, 2, 1]. If the aggregated abstractiveness
ranking is [3, 1.6, 1.3], then this group of
summaries has an exact match. If the aggregated
abstractiveness ranking is [3, 1.3, 1.6], then there is
no exact match. Moreover, we investigate whether
the summaries of abstractiveness bin 1 and bin 3
can be distinguished by annotators. If the aggre-
gated abstractiveness ranking is consistent with
the order of abstractiveness bin 1 and bin 3, then
there is a partial match. Suppose the order of
our specified abstractiveness bins is [3, 2, 1], if
the aggregated ranking is [3, 1.3, 1.6], then there
is a partial match. If the aggregated ranking is
[1.6, 1.3, 3], then there is no partial match.
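
The aggregation and matching rules above can be sketched as follows, using the worked examples from the text as checks. The function names are hypothetical, and one detail is an assumption: tied aggregated values are treated as a mismatch, since the text does not say how ties are handled.

```python
from typing import List, Sequence


def aggregate_rankings(rankings: Sequence[Sequence[float]]) -> List[float]:
    """Average the rank values that the annotators assign to each summary."""
    n = len(rankings)
    return [sum(r[i] for r in rankings) / n for i in range(len(rankings[0]))]


def exact_match(aggregated: Sequence[float], specified_bins: List[int]) -> bool:
    """True if every pair of summaries is ordered as the specified bins are."""
    k = len(specified_bins)
    return all(
        aggregated[i] > aggregated[j]  # assumption: ties count as a mismatch
        for i in range(k)
        for j in range(k)
        if specified_bins[i] > specified_bins[j]
    )


def partial_match(aggregated: Sequence[float], specified_bins: List[int]) -> bool:
    """True if the most abstractive bin is ranked above the least abstractive bin."""
    most = specified_bins.index(max(specified_bins))
    least = specified_bins.index(min(specified_bins))
    return aggregated[most] > aggregated[least]
```

For example, three annotator rankings [3, 2, 1], [3, 1, 2], and [3, 2, 1] average to roughly [3, 1.67, 1.33], which is an exact match for specified bins [3, 2, 1].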

We analyze the exact match and partial match scores of abstractiveness-controlled models as follows. The results on 100 random test samples of the CNN/DM and Newsroom-b datasets⁵ are presented in Table 9. We observe that our models achieve very high partial match scores on both datasets, but our models on the CNN/DM dataset obtain lower exact match scores than those on the Newsroom-b dataset (approx. randomization test, p < 0.02). This is because the CNN/DM dataset is extractive in nature. Hence, it is more difficult to learn three levels of abstractiveness on CNN/DM. Nonetheless, our models still achieve exact match scores above 60%.

⁵We use both CNN/DM and Newsroom-b because we want to understand the impact of the training dataset on the abstractiveness of the output summaries.

Bin   Method          Flu.    Rel.    Faithful.
1     PG+CMDP         4.79    3.43    98%
1     D.GPT2+CMDP     4.75    3.34    96%
2     PG+CMDP         4.52    2.34    58%
2     D.GPT2+CMDP     4.57    3.14    66%
3     PG+CMDP         4.47    2.00    52%
3     D.GPT2+CMDP     4.60    2.99    66%

Table 10: Human fluency, relevance, and faithfulness scores of abstractiveness-controlled models on Newsroom-b. Krippendorff's α inter-rater agreement for these metrics is 0.51, 0.37, and 0.40, respectively.

Quality of Individual Summaries. Next, we ask the annotators to evaluate the quality of the summaries of the three abstractiveness bins using the following metrics: (i) fluency: measuring the readability of a summary on a scale from 1 to 5; (ii) faithfulness: a yes/no question asking whether a summary is factually consistent with the document; and (iii) relevance: evaluating how well a summary retains the salient information of the document on a scale from 1 to 5. The results on 100 random test samples from the Newsroom-b dataset⁶ are presented in Table 10. When using abstractiveness bin 1 (lowest level), all the models achieve significantly higher fluency, relevance, and faithfulness (approx. randomization test, p < 0.005). The scores of all these metrics drop for abstractiveness bins 2 and 3, most sharply for faithfulness, because paraphrasing is more challenging than copying. Figure 5 illustrates sample summaries generated by our D.GPT2+CMDP model on the Newsroom-b dataset. We observe that the generated summary of bin 3 has a factual error, which is highlighted in the figure.

⁶We choose Newsroom-b because there are more generated summaries that satisfy the abstractiveness bin requirement, which is more suitable for comparing the quality of summaries of different abstractiveness bins.

Figure 5: Sample summaries generated by our D.GPT2+CMDP model with abstractiveness bins 1, 2, and 3 on the Newsroom-b testing set. Extractive fragments in summaries are in blue color. Factual errors are in red color.

7 Conclusion

We propose a novel CMDP training framework for controllable text summarization. Our framework imposes constraints on the training objective to explicitly prevent the output summaries from violating the requirements specified by users. Moreover, we apply our framework to control key summarization attributes such as length, covered entities, and abstractiveness of the summaries. We then devise specific constraints to restrict each of these attributes respectively. Empirical studies on popular benchmarks demonstrate that our framework significantly improves the capability of controllable summarization models to conform to the desired attribute requirement.

In our framework, we can set hard constraints without tuning threshold values. For instance, we set the threshold of our length bin constraint to 0 to disallow any violation of the length bin requirement. Compared to the penalty weights in the MDP framework, the threshold value in a soft constraint is also easier to set. For example, the goal of entity control is to generate a summary that presents the key information of the requested entities, which implies that the generated summaries should obtain a high QA-F1 score. The range of the QA-F1 score is [0, 1]. To encourage the generated summaries to obtain a high QA-F1 score, the threshold for the QA-F1 score should be close to 1, which gives us a clue about how to set the threshold value. On the other hand, the MDP framework does not give us any clues for setting the values of the penalty weights. In summary, our CMDP framework needs to tune one threshold value for entity control, and it does not need to tune any threshold for the other control settings. In contrast, the numbers of penalty weights to be tuned in the MDP framework are 2, 3, and 3 for length, entity, and abstractiveness control, respectively.

Acknowledgments

The work described in this paper was partially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (CUHK 2410021, Research Impact Fund, R5034-18), National Key Research and Development Program of China (No. 2018AAA0100204), the Science and Technology Development Fund of Macau SAR (File no. 0015/2019/AKP), and Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems (No. 2019B121205007). Lu Wang is supported in part by the National Science Foundation through a CAREER award IIS-2046016. We would like to thank the action editor and the anonymous reviewers for their comments.

References

Eitan Altman. 1999. Constrained Markov Decision Processes, volume 7. CRC Press.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.

Richard Bellman. 1957. A Markovian decision process. Journal of Mathematics and Mechanics, pages 679–684.

Dimitri P. Bertsekas. 1997. Nonlinear programming. Journal of the Operational Research Society, 48(3):334–334. https://doi.org/10.1057/palgrave.jors.2600425

Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky. 2020. With little power comes great responsibility. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16–20, 2020, pages 9263–9274. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.745

Asli Çelikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018.
Deep communicating agents for abstractive summarization. In Pro- ceedings of the 2018 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Tech- nologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1–6, 2018, Volume 1 (Long Papers), pages 1662–1675. Association for Computational Linguistics. https://doi .org/10.18653/v1/N18-1150 Hou Pong Chan, Wang Chen, and Irwin King. 2020. A unified dual-view model for review summarization and sentiment clas- In Pro- sification with inconsistency loss. the 43rd International ACM ceedings of SIGIR conference on research and develop- ment in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25–30, 2020, pages 1191–1200. https://doi.org/10 .1145/3397271.3401039 Hou Pong Chan and Irwin King. 2021. A condense-then-select strategy for text summa- rization. Knowledge-Based Systems, page 107235. https://doi.org/10.1016/j.knosys .2021.107235 Xiuying Chen, Shen Gao, Chongyang Tao, Yan Song, Dongyan Zhao, and Rui Yan. 2018. Iterative document representation learning to- wards summarization with polishing. In Pro- ceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 – November 4, 2018, pages 4088–4097. https://doi.org /10.18653/v1/D18-1442 Yen-Chun Chen and Mohit Bansal. 2018. Fast ab- stractive summarization with reinforce-selected sentence rewriting. In Proceedings of the 56th Annual Meeting of the Association for Com- putational Linguistics, ACL 2018, Melbourne, Australia, July 15–20, 2018, Volume 1: Long Papers, pages 675–686. Association for Com- putational Linguistics. https://doi.org /10.18653/v1/P18-1063 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. 
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186.
Angela Ellis. 2002. Document Understanding Conferences (DUC) 2002 Data.
Matan Eyal, Tal Baumel, and Michael Elhadad. 2019. Question answering as an automatic evaluation metric for news article summarization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers), pages 3938–3948. Association for Computational Linguistics.
Angela Fan, David Grangier, and Michael Auli. 2018. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, NMT@ACL 2018, Melbourne, Australia, July 20, 2018, pages 45–54.
Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. CoRR, abs/1707.02633.
Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 – November 4, 2018, pages 4098–4109. Association for Computational Linguistics.
Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1–6, 2018, Volume 1 (Long Papers), pages 708–719. https://doi.org/10.18653/v1/N18-1065
Jiatao Gu, Qi Liu, and Kyunghyun Cho. 2019. Insertion-based decoding with automatically inferred generation order. Transactions of the Association for Computational Linguistics, 7:661–676.
Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7–12, 2015, Montreal, Quebec, Canada, pages 1693–1701.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735, PubMed: 9377276
Kai Hong, John M. Conroy, Benoît Favre, Alex Kulesza, Hui Lin, and Ani Nenkova. 2014. A repository of state of the art and competitive baseline summaries for generic news summarization. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014, Reykjavik, Iceland, May 26–31, 2014, pages 1608–1616.
Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python.
Luyang Huang, Lingfei Wu, and Lu Wang. 2020. Knowledge graph-augmented abstractive summarization with semantic-driven cloze reward. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5–10, 2020, pages 5094–5107. https://doi.org/10.18653/v1/2020.acl-main.457
Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. CoRR, abs/1909.05858.
Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1–4, 2016, pages 1328–1338. https://doi.org/10.18653/v1/D16-1140
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.
Catherine Kobus, Josep Crego, and Jean Senellart. 2017. Domain control for neural machine translation. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 372–378, Varna, Bulgaria. INCOMA Ltd.
Wojciech Kryscinski, Romain Paulus, Caiming Xiong, and Richard Socher. 2018. Improving abstraction in text summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 – November 4, 2018, pages 1808–1817. https://doi.org/10.18653/v1/D18-1207
Philippe Laban, Andrew Hsi, John Canny, and Marti A. Hearst. 2020. The summary loop: Learning to write abstractive summaries without examples. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5–10, 2020, pages 5135–5150. https://doi.org/10.18653/v1/2020.acl-main.460
Logan Lebanoff, Kaiqiang Song, Franck Dernoncourt, Doo Soon Kim, Seokhwan Kim, Walter Chang, and Fei Liu. 2019. Scoring sentence singletons and pairs for abstractive summarization. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 – August 2, 2019, Volume 1: Long Papers, pages 2175–2189. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1209
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries.
Text Summarization Branches Out.
Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019, pages 3728–3738. https://doi.org/10.18653/v1/D19-1387
Yizhu Liu, Zhiyi Luo, and Kenny Q. Zhu. 2018. Controlling length in abstractive summarization using a convolutional neural network. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 – November 4, 2018, pages 4110–4119.
Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in Adam. CoRR, abs/1711.05101.
Takuya Makino, Tomoya Iwakura, Hiroya Takamura, and Manabu Okumura. 2019. Global optimization under length constraint for neural text summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1039–1048, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1099
Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. 2019. CGMH: Constrained sentence generation by Metropolis-Hastings sampling. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 – February 1, 2019, pages 6834–6842. https://doi.org/10.1609/aaai.v33i01.33016834
Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11–12, 2016, pages 280–290. https://doi.org/10.18653/v1/K16-1028
Preksha Nema, Mitesh M. Khapra, Anirban Laha, and Balaraman Ravindran. 2017. Diversity driven attention model for query-based abstractive summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 – August 4, Volume 1: Long Papers, pages 1063–1072. https://doi.org/10.18653/v1/P17-1098
Tong Niu and Mohit Bansal. 2018. Polite dialogue generation without parallel data. TACL, 6:373–389. https://doi.org/10.1162/tacl_a_00027
Ramakanth Pasunuru and Mohit Bansal. 2018. Multi-reward reinforced summarization with saliency and entailment. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1–6, 2018, Volume 2 (Short Papers), pages 646–653. https://doi.org/10.18653/v1/N18-2102
Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15–20, 2018, Volume 2: Short Papers, pages 784–789. https://doi.org/10.18653/v1/P18-2124
Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, pages 1179–1195. https://doi.org/10.1109/CVPR.2017.131
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. CoRR, abs/1910.01108.
Raphael Schumann, Lili Mou, Yao Lu, Olga Vechtomova, and Katja Markert. 2020. Discrete optimization for unsupervised sentence summarization with word-level extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5–10, 2020, pages 5032–5042. https://doi.org/10.18653/v1/2020.acl-main.452
Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2019. Answers unite! Unsupervised metrics for reinforced summarization models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019, pages 3244–3254.
Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 – August 4, Volume 1: Long Papers, pages 1073–1083.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Controlling politeness in neural machine translation via side constraints.
In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12–17, 2016, pages 35–40. https://doi.org/10.18653/v1/N16-1005
Eva Sharma, Luyang Huang, Zhe Hu, and Lu Wang. 2019. An entity-driven framework for abstractive summarization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019, pages 3278–3289. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1323
Kaiqiang Song, Bingqing Wang, Zhe Feng, Ren Liu, and Fei Liu. 2020. Controlling the amount of verbatim copying in abstractive summarization. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, February 7–12, 2020, pages 8902–8909. AAAI Press. https://doi.org/10.1609/aaai.v34i05.6420
Kaiqiang Song, Lin Zhao, and Fei Liu. 2018. Structure-infused copy mechanisms for abstractive summarization. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20–26, 2018, pages 1717–1729.
Qing Sun, Stefan Lee, and Dhruv Batra. 2017. Bidirectional beam search: Forward-backward inference in neural sequence models for fill-in-the-blank image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, pages 7215–7223. https://doi.org/10.1109/CVPR.2017.763
Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning - an Introduction. Adaptive computation and machine learning. MIT Press.
Shunsuke Takeno, Masaaki Nagata, and Kazuhide Yamamoto. 2017. Controlling target features in neural machine translation via prefix constraints.
In Proceedings of the 4th Workshop on Asian Translation, WAT@IJCNLP 2017, Taipei, Taiwan, November 27 – December 1, 2017, pages 55–63.
Chen Tessler, Daniel J. Mankowitz, and Shie Mannor. 2019. Reward constrained policy optimization. In International Conference on Learning Representations (ICLR).
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pages 5998–6008.
Xiaojun Wan. 2008. Using only cross-document relationships for both generic and topic-focused multi-document summarizations. Information Retrieval, 11(1):25–49. https://doi.org/10.1007/s10791-007-9037-5
Xiaojun Wan. 2009. Topic analysis for topic-focused multi-document summarization. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, November 2–6, 2009, pages 1609–1612. ACM.
Xiaojun Wan and Jianguo Xiao. 2009. Graph-based multi-modality learning for topic-focused multi-document summarization. In IJCAI 2009, Proceedings of the 21st International Joint Conference on Artificial Intelligence, Pasadena, California, USA, July 11–17, 2009, pages 1586–1591.
Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. 2007. Manifold-ranking based topic-focused multi-document summarization. In IJCAI 2007, Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, January 6–12, 2007, pages 2903–2908.
Xiaojun Wan and Jianmin Zhang. 2014. CTSUM: Extracting more certain summaries for news articles. In The 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’14, Gold Coast, QLD, Australia, July 6–11, 2014, pages 787–796. ACM.
Lu Wang, Hema Raghavan, Vittorio Castelli, Radu Florian, and Claire Cardie. 2013. A sentence compression based framework to query-focused multi-document summarization. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4–9 August 2013, Sofia, Bulgaria, Volume 1: Long Papers, pages 1384–1394.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface’s transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771. https://doi.org/10.18653/v1/2020.emnlp-demos.6
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020a. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 11328–11339. PMLR.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020b. BERTScore: Evaluating text generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net.
Yuhao Zhang, Derek Merck, Emily Bao Tsai, Christopher D. Manning, and Curtis Langlotz. 2020c. Optimizing the factual correctness of a summary: A study of summarizing radiology reports. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5–10, 2020, pages 5108–5120.
https://doi.org/10.18653/v1/2020.acl-main.458
Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019, pages 563–578. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1053
Wanrong Zhu, Zhiting Hu, and Eric P. Xing. 2019. Text infilling. CoRR, abs/1901.00158.
Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2020. Fine-tuning language models from human preferences. CoRR, abs/1909.08593.

A Appendix

A.1 Output Samples for Length Control

Figure 6 presents sample summaries generated by our D.GPT2+CMDP model using different length bins on the DUC-2002 testing set. We observe that our model discards secondary information when given a shorter length budget.

A.2 Training Data for the QA Model

We construct question-context-answer triplets to train a QA model. We individually mask each named entity in a reference summary to create a cloze question, and the masked entity is its answer. The reference summary is used as the context. For example, suppose the reference summary y∗ is “Arsenal beat Chelsea 3-1 yesterday.” Then we construct two cloze questions, q1 = “[MASK] beat Chelsea 3-1 yesterday.” and q2 = “Arsenal beat [MASK] 3-1 yesterday.”, and two answers, a1 = “Arsenal” and a2 = “Chelsea”.
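The cloze construction in Section A.2 can be illustrated with a minimal sketch. It is not the authors' code: the function name `build_cloze_triplets` is hypothetical, the named entities are assumed to be given as plain strings (an NER step would normally supply them), and only the first verbatim occurrence of each entity is masked.

```python
from typing import List, Optional, Tuple

MASK = "[MASK]"


def build_cloze_triplets(
    summary: str, entities: List[str], context: Optional[str] = None
) -> List[Tuple[str, str, str]]:
    """Return one (question, context, answer) triplet per entity.

    Hypothetical helper: each entity is masked individually in `summary`
    to form a cloze question; the masked entity is the answer. If no
    context is given, the summary itself serves as the context.
    """
    if context is None:
        context = summary
    triplets = []
    for entity in entities:
        start = summary.find(entity)
        if start < 0:
            continue  # entity does not occur verbatim; skip it
        question = summary[:start] + MASK + summary[start + len(entity):]
        triplets.append((question, context, entity))
    return triplets


reference = "Arsenal beat Chelsea 3-1 yesterday."
triplets = build_cloze_triplets(reference, ["Arsenal", "Chelsea"])
for question, _, answer in triplets:
    print(question, "->", answer)
```

Passing a different `context` string (for instance a pseudo reference summary) reuses the same questions and answers with another context, which is one plausible way to build the second set of triplets described in this section.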
After that, we obtain two question-context-answer triplets, (q1, y∗, a1) and (q2, y∗, a2).

Since the constructed cloze questions are very similar to the corresponding reference summaries, using only reference summaries as the context in our training data would encourage the QA model to rely solely on surface clues to extract answers. To alleviate this problem, we use the method of Chen and Bansal (2018) to extract a pseudo reference summary ỹ from the source document. We then use ỹ as the context to construct another set of question-context-answer triplets {(qi, ỹ, ai)}. The pseudo reference summary includes the document sentences that achieve the highest ROUGE-L recall with the reference summary. We discard a triplet if ỹ does not contain all the named entities in the reference. To obtain balanced training data, we only keep the training triplets (qi, y∗, ai) that have a corresponding pseudo reference triplet (qi, ỹ, ai).

Figure 6: Sample summaries generated by our D.GPT2+CMDP model using different length bins on the DUC-2002 testing set.

To allow the QA model to predict “unanswerable” for low-quality summaries, we construct two types of unanswerable training samples: irrelevant training samples and repeated-entity training samples. For irrelevant training samples, we select document sentences that do not contain the reference entities and have a low textual overlap with the reference summary (ROUGE-L recall ≤ 0.2). For repeated-entity training samples, we find the sentences in the reference summary that contain two named entities and repeat one of the named entities. We treat such samples as unanswerable since they contain factual inconsistencies. Overall, our training data consists of 109,815 unanswerable samples and 239,838 answerable samples. We will release our training data for the QA model.
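The selection of irrelevant training samples (no reference entity, ROUGE-L recall ≤ 0.2) can be sketched as follows. This is an illustration under stated assumptions, not the released pipeline: ROUGE-L recall is approximated as the longest common subsequence over lowercased whitespace tokens divided by the reference length, entity matching is a case-insensitive substring test, and the names `rouge_l_recall` and `is_irrelevant` are hypothetical.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L recall: LCS length over reference length."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return lcs_length(cand, ref) / len(ref) if ref else 0.0


def is_irrelevant(
    sentence: str, reference: str, entities: List[str], max_recall: float = 0.2
) -> bool:
    """True if the sentence has no reference entity and low overlap."""
    has_entity = any(e.lower() in sentence.lower() for e in entities)
    return not has_entity and rouge_l_recall(sentence, reference) <= max_recall


reference = "Arsenal beat Chelsea 3-1 yesterday."
entities = ["Arsenal", "Chelsea"]
document = [
    "The weather in London was cloudy.",
    "Arsenal scored three goals.",
    "The home side beat their rivals 3-1 yesterday.",
]
irrelevant = [s for s in document if is_irrelevant(s, reference, entities)]
print(irrelevant)
```

The 0.2 threshold comes from the text; a production setup would more likely rely on an established ROUGE implementation with proper tokenization rather than this whitespace-based approximation.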