Document Summarization with Latent Queries
Yumo Xu and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, United Kingdom
yumo.xu@ed.ac.uk
mlap@inf.ed.ac.uk
Abstract

The availability of large-scale datasets has driven the development of neural models that create generic summaries for single or multiple documents. For query-focused summarization (QFS), labeled training data in the form of queries, documents, and summaries is not readily available. We provide a unified modeling framework for any kind of summarization, under the assumption that all summaries are a response to a query, which is observed in the case of QFS and latent in the case of generic summarization. We model queries as discrete latent variables over document tokens, and learn representations compatible with observed and unobserved query verbalizations. Our framework formulates summarization as a generative process, and jointly optimizes a latent query model and a conditional language model. Despite learning from generic summarization data only, our approach outperforms strong comparison systems across benchmarks, query types, document settings, and target domains.1
1 Introduction
Recent years have witnessed substantial progress in generic summarization (See et al., 2017; Gehrmann et al., 2018; Liu and Lapata, 2019a, inter alia) thanks to neural architectures based on the encoder-decoder paradigm (Sutskever et al., 2014) and the availability of large-scale datasets containing hundreds of thousands of document-summary pairs. Unfortunately, training data of this magnitude is not readily available for the related task of query-focused summarization (QFS; Dang 2005), which aims to create a summary from one or multiple document(s) that answers a specific query. Existing QFS benchmarks (Dang, 2005; Hoa, 2006; Nema et al., 2017; Baumel et al., 2016) have been used for evaluation but are relatively small for training large neural models.

1Our code and models can be found at https://github.com/yumoxu/lqsum.
To make up for the absence of labeled QFS data, recent work has resorted to distant supervision provided by pretrained models, paraphrase identification, and question-answering datasets (Xu and Lapata, 2020; Su et al., 2020; Laskar et al., 2020b). Other work induces proxy queries (Xu and Lapata, 2021) from generic summarization datasets, without additional question-answering resources, which can also be extremely expensive to acquire (Bajaj et al., 2016). Despite this progress, building and scaling QFS systems remains challenging due to the many different ways natural language queries express users' information needs. For example, queries can consist of one or multiple keyword(s) (Baumel et al., 2016; Zhu et al., 2019), a simple question (Nema et al., 2017), or a longer narrative composed of multiple sub-queries (Dang, 2006) (see the examples in Table 1). Although QFS systems can potentially handle queries resembling those seen in training, they are not expected to work well on out-of-distribution queries (Xu and Lapata, 2021), namely, queries with surface forms different from those seen in training. In order to cover new types of queries, it might be necessary to gather more data, re-design proxy queries, and re-train one or more system components, which can be computationally inefficient and in some cases practically infeasible.
In this work, we provide a unified modeling framework for generic summarization and QFS, under the assumption that only data for the former is available. Specifically, we treat generic summarization as a special case of QFS where the query is latent. We model queries as discrete latent variables over document tokens, and learn representations compatible with observed and unobserved query verbalizations.
Dataset     | Task | Domain   | Size       | D/Q/S Tokens    | Query Type | Query Example
CNN/DM      | SDS  | News     | 11,490     | 760.5/0.0/45.7  | Empty      | ∅
WikiCatSum  | MDS  | Wiki     | 8,494      | 800.0/0.0/105.6 | Empty      | ∅
WikiRef     | SDS  | Wiki     | 12,000     | 398.7/6.7/36.2  | Keywords   | Marina Beach, Incidents
Debatepedia | SDS  | Debates  | 1,000      | 66.4/10.0/11.3  | Question   | Is euthanasia better than withdrawing life support?
DUC 2006    | MDS  | Newswire | 1,250 (50) | 699.3/32.8/250  | Composite  | AMNESTY INTERNATIONAL – What is the scope of operations of Amnesty International and what are the international reactions to its activities?
DUC 2007    | MDS  | Newswire | 1,125 (45) | 540.3/30.5/250  | Composite  | (same example as DUC 2006)
TD-QFS      | MDS  | Medical  | 7,099 (50) | 182.9/3.0/250   | Title      | Alzheimer's Disease

Table 1: Test data statistics. SDS/MDS stand for single-/multi-document summarization. Size refers to the number of test documents; for multi-document QFS, we specify the number of clusters in brackets. D/Q/S are Document/Query/Summary tokens. Composite queries consist of a TOPIC and a narrative.
Our framework formulates abstractive summarization as a generative process, and decomposes the learning objective into: (1) latent query modeling (i.e., generating latent query variables from document observations) and (2) conditional language modeling (i.e., generating summaries conditioned on observed documents and latent queries). To further handle user queries at test time, we propose a non-parametric calibration of the latent query distribution, which allows us to perform zero-shot QFS without model re-training.

Our contributions in this work are threefold: (a) we bring together generic summarization and QFS under a unified modeling framework that does not require query-related resources for training or development; (b) we provide a deep generative formulation for document summarization, where queries are represented directly from input documents in latent space, that is, without resorting to pipeline-style query extraction or generation; and (c) experiments on a range of summarization benchmarks show that across query types, document settings, and target domains, our model achieves better results than strong comparison systems.
2 Related Work
Rush et al. (2015) and Nallapati et al. (2016) were among the first to apply the neural encoder-decoder architecture to abstractive summarization. See et al. (2017) enhance their approach with a pointer-generator model, essentially a copy mechanism allowing words from the source document to be copied directly into the summary. Gehrmann et al. (2018) incorporate a content selection model that decides on relevant aspects of the source document. They frame this task as a word-level tagging problem, with the objective of separately identifying tokens from a document that should be part of its summary; at test time, they produce content selection probabilities for each word, which are then used to restrict the copy mechanism by performing hard masking over the input document. Another line of research controls summary generation via topics (Perez-Beltrachini et al., 2019a; Wang et al., 2020), retrieve-and-edit methods (Cao et al., 2018), factual relations (Jin et al., 2020), keywords, relational triples, or preselected source phrases (Dou et al., 2021).
The majority of previous QFS approaches have been extractive, composing summaries by selecting central and query-relevant sentences (Wan et al., 2007; Badrinath et al., 2011; Wan and Zhang, 2014; Li et al., 2017b,a). More recently, Xu and Lapata (2020) propose a coarse-to-fine framework that leverages distant supervision from question answering for summary sentence extraction. Abstractive QFS has received significantly less attention in comparison, due to generation models being particularly data-hungry (Lebanoff et al., 2018; Liu and Lapata, 2019a). As a result, resources from a wider range of NLP tasks have been used. Su et al. (2020) rank document paragraphs against queries with the aid of QA and machine reading datasets (Su et al., 2019; Rajpurkar et al., 2016), and then iteratively summarize selected paragraphs. Similarly, Laskar et al. (2020b) jointly exploit supervision from QFS data (typically reserved for evaluation) and related QA and paraphrase identification tasks. Because query-related resources can also be costly to obtain (Bajaj et al., 2016; Kwiatkowski et al., 2019), Xu and Lapata (2021) use none whatsoever. Instead, they create proxy queries by selectively masking information slots in generic summaries. Despite promising system performance,
their approach assumes prior knowledge of target queries (proxies are created to match their length and content), and a development set is used (Xu and Lapata, 2021). Also, their system is particularly tailored to multi-document QFS and includes a sophisticated evidence selection component. Our work is closely related to theirs in that we also do not take advantage of query-related resources. We go a step further and do not require a development set either, allowing our model to be independent of specific query verbalizations and to produce QFS summaries in zero-shot settings.

Our approach is generally applicable to single- and multi-document QFS. For any summarization task we assume that queries are latent and estimate them jointly via a summarization and (weakly supervised) tagging task. The latter draws inspiration from Gehrmann et al. (2018), under the assumption that document tokens found in the summary also provide evidence for the (latent) query that gave rise to it. Finally, our model is fundamentally different from approaches that rely on document-based guidance to improve the informativeness (Cao et al., 2018) or faithfulness (Chen et al., 2021) of summaries. While these models exploit guidance from supervision signals in training data, we are faced with the problem of estimating queries when none are available (at least during training).
3 Problem Formulation
Let {(D, Q, S)} denote a summarization dataset, where document D is a sequence of tokens and S its corresponding summary; query Q additionally specifies an information request. In generic summarization, Q = ∅, whereas in QFS Q can assume various formats, ranging from keywords to composite questions (see Table 1 for examples). Our model learns from generic summarization data alone, while robustly generalizing to a range of tasks at test time, including out-of-domain QFS. A shared characteristic between generic summarization and QFS is the fact that user intent is underspecified. Even when queries are available (i.e., Q ≠ ∅), they are incomplete expressions of intent, since users are unlikely to specify queries at the level of detail necessary to compose a good summary (Xu and Lapata, 2021). We thus identify latent query signals from D, and optionally take advantage of Q as an additional observation for belief update.
Generative Model We model an observed input document D as a sequence of random variables x = [x1; x2; . . . ; xM], where xi is a token and M the length of the document. We define the latent query as a sequence of discrete latent states over input document tokens: z = [z1; z2; . . . ; zM]. Specifically, from each document token xi, we generate a binary query variable zi, whose distribution p(zi) represents the belief that xi contributes to a potential query for document D. Modeling latent queries at the token level allows us to regularize the model by taking into account weak supervision in the form of token-level tagging (Gehrmann et al., 2018). It also renders the model independent of the query form, thereby enabling zero-shot inference (see Section 4).
The output summary y = [y1; y2; . . . ; yT] is then generated from {x, z} using teacher forcing at training time. At test time, we may additionally be presented with a query Q; we ground this optional information to the input document via discrete observed variables z̃ = [z̃1; z̃2; . . . ; z̃M], and generate y by additionally conditioning on z̃ (if it exists) in an autoregressive manner.
Our model estimates the conditional distribution pθ(y|x) according to the generative process just described (and illustrated in Figure 1) as:

$$p_\theta(y|x) = \sum_z p_\theta(y|z, x)\, p_\theta(z|x) = \sum_z p_\theta(y|z, x) \prod_i p_\theta(z_i|x_i) \qquad (1)$$
Inference Model The posterior distribution of latent variable z is calculated as:

$$p_\theta(z|x, y) = \frac{p_\theta(x, y, z)}{p_\theta(x, y)} = \frac{p_\theta(x, y, z)}{\sum_z p_\theta(x, y, z)} \qquad (2)$$
Unfortunately, exact inference of this posterior is computationally intractable due to the joint probability pθ(x, y). We therefore approximate it with a variational posterior qφ(z|x, y). Inspired by β-VAE (Higgins et al., 2017), we maximize the probability of generating summary y, provided the distance between the prior and variational posterior distributions is below a small constant δ:

$$\max_{\phi,\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathbb{E}_{z\sim q_\phi(z|x,y)} \log p_\theta(y|x, z)\right] \qquad (3)$$
$$\text{subject to } D_{KL}\left(q_\phi(z|x, y)\,\|\,p_\theta(z|x)\right) < \delta \qquad (4)$$
Figure 1: Proposed summarization framework: generative process and neural parametrization. Shaded nodes represent observed variables, unshaded nodes indicate latent variables, arrows represent conditional dependencies between variables, and plates refer to repetitions of sampling steps. Dashed lines denote optional queries at test time. Latent queries create a query-focused view of the input document, which together with a query-agnostic view serves as input to a decoder for summary generation.
Because we cannot solve Equation (4) directly, we invoke the Karush-Kuhn-Tucker conditions (Kuhn et al., 1951) and cast the above constrained optimization problem into unconstrained optimization, with the following ELBO objective:

$$\mathcal{L}_{ELBO} = \mathbb{E}_{q_\phi(z|x,y)}\left[\log p_\theta(y|x, z)\right] - \beta\, D_{KL}\left(q_\phi(z|x, y)\,\|\,p_\theta(z|x)\right) \qquad (5)$$
where the Lagrangian multiplier β is a hyperparameter. To minimize our model's dependence on queries (which we assume are unavailable for both training and development), we adopt a uniform prior pθ(z|x). In other words, the probability of variable z being a query word (given all instances of x) follows a uniform distribution. In this case, minimizing the KL term in Equation (5) is equivalent to maximizing the entropy of the variational posterior.2 We further assume that the tokens observed in a document are a superset of potential query tokens, and therefore z ⊥⊥ y and qφ(z|x, y) = qφ(z|x).3
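As an aside, the step from Equations (3)–(4) to Equation (5) can be spelled out via the standard β-VAE manipulation; the following unpacking is ours, written in the paper's notation:

$$\mathcal{F}(\theta, \phi; \beta) = \mathbb{E}_{z \sim q_\phi(z|x,y)}\left[\log p_\theta(y|x, z)\right] - \beta\left(D_{KL}\left(q_\phi(z|x, y)\,\|\,p_\theta(z|x)\right) - \delta\right) \;\geq\; \mathcal{L}_{ELBO}$$

Since β ≥ 0 and δ > 0, the constant βδ term can be dropped, leaving exactly the lower bound in Equation (5).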
While this simplification reduces the risk of exposure to bias from training on y, it makes learning meaningful latent variables more challenging, as they depend solely on x. We alleviate this by introducing a new type of weak supervision o(ẑ|x, y), which we automatically extract from data (i.e., document-summary pairs). Essentially, we tag tokens in the document as likely to be in the summary and, by extension, in the query.

2When pθ(z|x) ∼ U(a, b), DKL(qφ(z|x, y)‖pθ(z|x)) = −H(qφ(z|x)) + log(b − a + 1) always holds (z ∈ [a, b]).

3We experimentally verified this assumption on several QFS datasets. In WikiRef (Zhu et al., 2019) and Debatepedia (Nema et al., 2017), 1.57% and 4.27% of query tokens are not attested in the input document, respectively. In DUC (Dang, 2005) and TD-QFS (Baumel et al., 2016) where the input contains multiple documents, all query tokens are attested. Across all datasets, only 1.69% of query tokens are not attested in the input document/cluster.
We discuss how this tagger is learned in Section 4. For now, suffice it to say that weak supervision is a form of posterior regularization adding an extra term to the objective, which we rewrite as:

$$\mathcal{L} = \underbrace{\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(y|x, z)\right]}_{\text{conditional language modeling}} + \underbrace{\beta H\left(q_\phi(z|x)\right) - \omega H\left(o(\hat{z}|x, y),\, q_\phi(z|x)\right)}_{\text{latent query modeling}} \qquad (6)$$
where H(·) denotes posterior entropy and H(·, ·) denotes cross entropy.

As can be seen from Equation (6), we decompose summarization into two modeling objectives, namely, latent query modeling and conditional language modeling. Inside the query modeling term, hyperparameter ω controls the influence of weak supervision ẑ, while β controls the strength of label smoothing on the weak annotations.
Neural Parametrization We parametrize the two objectives in Equation (6) with a latent query model and a conditional language model, illustrated in Figure 1. The query model estimates latent query z from input variable x. At inference time, it optionally conditions on query knowledge ẑ (when this is available). The conditional language model is based on the vanilla encoder-decoder architecture, the main difference being that it encodes two views of input document D. One encoding is query-focused and depends directly on z as generated from the query model. The second encoding is query-agnostic, allowing the original document to provide complementary context. A decoder conditioned on both encodings autoregressively generates the summary y. In contrast to previous work (Xu and Lapata, 2021), the latent query model and
conditional language model are trained jointly in
a fully differentiable end-to-end manner. In the
following sections we explain in detail how these
two models are parametrized.
4 Latent Query Model

In this section we discuss how the inference network for latent queries is constructed. We also explain how query-focused document representations are obtained, our attempts to mitigate posterior collapse via weak supervision o(ẑ|x, y) (see Equation (6)), and how query beliefs are updated when queries are available at test time.
Inference Network for Latent Queries We construct a neural network model to infer for each token in the input document whether it constitutes a query term. Given a contextual token representation matrix Hq ∈ RM×dh, we project it to RM×2 with a two-layer MLP as a scoring function:

$$H_s = \text{ReLU}\left(H_q W_h + b_h^\top\right) \qquad (7)$$
$$\pi = H_s W_s + b_s^\top \qquad (8)$$

where Wh ∈ Rdh×dh, bh ∈ Rdh×1, Ws ∈ Rdh×2, and bs ∈ R2×1 are learnable model parameters.
Let G(0) denote the standard Gumbel distribution, and let gℓ ∼ G(0), ℓ ∈ {0, 1}, be i.i.d. Gumbel noise. We normalize π to form a variational distribution as:

$$q_\phi(z_i = \ell|x) = \text{softmax}_\ell\left([\pi_0 + g_0, \pi_1 + g_1]\right) = \frac{\exp\left((\pi_\ell + g_\ell)/\tau\right)}{\sum_{\ell' \in \{0,1\}} \exp\left((\pi_{\ell'} + g_{\ell'})/\tau\right)} \qquad (9)$$

where τ is the temperature controlling how close qφ(z|x) is to arg maxℓ qφ(z|x), and is optimized on the development set. Note that Gumbel noise is only applied during learning and is set to its mode (i.e., 0) for inference.
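To make Equations (7)–(9) concrete, here is a minimal PyTorch sketch of the inference network; the class name, argument names, and default temperature are our own illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentQueryScorer(nn.Module):
    """Sketch of the inference network in Equations (7)-(9): a two-layer
    MLP scores each token, and a Gumbel-softmax relaxation turns the two
    logits into q_phi(z_i | x)."""

    def __init__(self, d_h: int, tau: float = 0.5):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(d_h, d_h),  # H_s = ReLU(H_q W_h + b_h), Eq. (7)
            nn.ReLU(),
            nn.Linear(d_h, 2),    # pi = H_s W_s + b_s, Eq. (8)
        )
        self.tau = tau            # temperature, tuned on the dev set

    def forward(self, h_q: torch.Tensor, training: bool = True) -> torch.Tensor:
        # h_q: contextual token states of shape (..., M, d_h).
        # Returns q_phi(z_i = l | x) of shape (..., M, 2).
        pi = self.score(h_q)
        if training:
            # i.i.d. standard Gumbel noise, Eq. (9); at test time the
            # noise is set to its mode (0) and this is a plain softmax.
            u = torch.rand_like(pi).clamp_min(1e-10)
            pi = pi - torch.log(-torch.log(u))
        return F.softmax(pi / self.tau, dim=-1)
```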
Query-focused View As explained earlier, in
addition to a canonical, query-agnostic encoding
of the input document D (which we discuss in
Section 5), we further introduce a query-focused
encoding factorized via latent queries z.
Specifically, for the ith token, we take the continuous relaxation of its discrete latent variable zi, and ground4 it to the input document via:

$$Q_i = q_\phi(z_i = 1|x) \cdot H_{q,i} \qquad (10)$$

4We also experimented with drawing hard samples from z via the straight-through trick (Jang et al., 2016), which is differentiable with biased gradient estimation. However, it did not yield better results than continuous relaxation.
As we can see, the query-focused view explicitly models the dependency on latent queries. From a learning perspective, this factorization leads to the following partial derivatives of the query encoder states with respect to the query-focused view:

$$\frac{\partial Q_i}{\partial H_{q,i}} = \underbrace{\left(1 - q^{(1)}_\phi\right)}_{\text{carry gate}} \odot Q_i \cdot \frac{\partial \Delta\pi}{\partial H_{q,i}} + \underbrace{q^{(1)}_\phi}_{\text{transform gate}} \cdot \mathbf{1} \qquad (11)$$

where q(ℓ)φ is shorthand for the variational probability of zi = ℓ given x, Δπ = π1 − π0 (see Equation (8)), and 1 denotes an all-one vector. This can be viewed as a special case of highway networks (Srivastava et al., 2015), where the transform gate q(1)φ compresses the information captured by a token based on its likelihood of being a query term.
Token Tagging as Weak Supervision Although it is possible to optimize latent queries solely based on conditional language modeling (our approach is fully differentiable), we additionally exploit weak supervision to label tokens in the document as query-specific or not. Weak supervision is advantageous as it imposes extra regularization on the posterior (see Equation (6)), thereby mitigating its collapse (i.e., the decoder may learn to ignore the query-focused view and instead rely solely on the query-agnostic view).

Let t1, . . . , tn denote binary tags for each of the source tokens, that is, 1 if a token is query-specific and 0 otherwise. We could learn such a tagger from training data generated by aligning query tokens to the document. In default of such gold-standard data, we approximate queries by summaries and obtain silver-standard token labels by aligning summaries to their corresponding documents. Specifically, inspired by Gehrmann et al. (2018), we assume a token in the document is query-specific if it is part of the longest common subsequence (LCS) of tokens in the summary. Our tagging model is built on top of a pretrained language model, and thus operates on subwords. We first byte-pair encode (BPE; Sennrich et al., 2016) documents and summaries, and then search for the LCS over BPE sequences.
If there exist multiple identical LCSs, only the one appearing at the earliest document position is tagged as positive. We refer to this tagging scheme as BPE-LCS.
Note that although we model query variables at the token level, we take phrases indirectly into account through LCS, which identifies subsequences of tokens (or phrases) as query annotations. Our tagging model is therefore able to capture dependencies between tokens, albeit indirectly.
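For concreteness, a minimal sketch of deriving BPE-LCS tags is given below; the function name and the greedy earliest-position tie-breaking are simplifying assumptions rather than the released implementation:

```python
def bpe_lcs_tags(doc: list[str], summ: list[str]) -> list[int]:
    """Tag each document BPE token with 1 if it lies on the longest
    common subsequence (LCS) with the summary, else 0."""
    m, n = len(doc), len(summ)
    # dp[i][j] = LCS length of doc[i:] and summ[j:].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if doc[i] == summ[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Greedy front-to-back backtrace: matching as early as possible
    # approximates the earliest-position tie-breaking described above.
    tags, i, j = [0] * m, 0, 0
    while i < m and j < n:
        if doc[i] == summ[j] and dp[i][j] == dp[i + 1][j + 1] + 1:
            tags[i], i, j = 1, i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return tags

# e.g., z_hat = bpe_lcs_tags(bpe(document), bpe(summary))
```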
Training To optimize the variational inference model, that is, the MLP defined in Equations (7)–(9), we use a cross-entropy loss for token tagging, together with the posterior entropy term from Equation (6). Formally, we write the query modeling loss as follows:

$$\mathcal{L}_{query} = -\omega\mathcal{L}_{tag} + \beta\mathcal{L}_{entropy} = -\sum_{j=1}^{N}\sum_{i=1}^{M}\Big[\big(\omega\hat{z}^j_i - \beta q^{(1)}_\phi\big)\log q^{(1)}_\phi + \big(\omega(1 - \hat{z}^j_i) - \beta q^{(0)}_\phi\big)\log q^{(0)}_\phi\Big] \qquad (12)$$

where ẑi is a binary label automatically assigned via BPE-LCS(D, S), the alignment procedure described above. As we can see, the entropy term dynamically smooths the weak annotations ẑi (the degree of smoothing is modulated by qφ). We optimize ω and β on a development set.
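A minimal sketch of this loss, following the expanded form of Equation (12), might look as follows; tensor shapes and the numerical clamping are our assumptions:

```python
import torch

def query_loss(q1: torch.Tensor, z_hat: torch.Tensor,
               omega: float, beta: float) -> torch.Tensor:
    """L_query of Equation (12). q1 holds q_phi(z_i = 1 | x) for every
    token (shape (N, M)); z_hat holds the binary BPE-LCS labels."""
    z_hat = z_hat.float()
    eps = 1e-8
    log_q1 = torch.log(q1.clamp(eps, 1.0))
    log_q0 = torch.log((1.0 - q1).clamp(eps, 1.0))
    # -omega * L_tag: cross entropy against the weak labels;
    # +beta * L_entropy: q log q terms, which smooth the labels.
    per_token = ((omega * z_hat - beta * q1) * log_q1
                 + (omega * (1.0 - z_hat) - beta * (1.0 - q1)) * log_q0)
    return -per_token.sum()
```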
In the initial stages of training, the tagger might lead to inaccurate posterior probability assignments qφ(zi|x) and, consequently, hurt the summarization model, which relies heavily on a high-quality query-focused view. To address this issue, we introduce a posterior dropout mechanism that replaces the estimated posterior with weak supervision o(ẑ|x) according to probability α. We initialize α to 1, so that only o(ẑ|x) is used at the beginning of training, and the tagger is supervised via Equation (12). We then linearly anneal α over optimization steps so that the gradients from the summarization objective (which we introduce in Section 5) can jointly optimize the tagger.
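A minimal sketch of posterior dropout follows; whether replacement happens per token or per example, and the exact annealing endpoints, are simplifying assumptions here:

```python
import torch

def posterior_dropout(q1: torch.Tensor, z_hat: torch.Tensor,
                      step: int, total_steps: int) -> torch.Tensor:
    """With probability alpha, replace the estimated posterior
    q_phi(z_i = 1 | x) with the weak supervision o(z_hat | x);
    alpha is annealed linearly from 1 to 0 over training."""
    alpha = max(0.0, 1.0 - step / float(total_steps))
    keep_weak = torch.rand_like(q1) < alpha   # per-token replacement
    return torch.where(keep_weak, z_hat.float(), q1)
```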
Zero-shot Transfer We now explain how queries are taken into account at test time by performing query belief updates Δ(zi|x, z̃). In the case of generic summarization where no queries are available, we simply perform no update. When Q ≠ ∅, some tokens in the document become more relevant, and we consequently set Δ(zi = 1|x, z̃) = 1, ∀wi ∈ BPE-LCS(D, Q), and all other tokens to zero. We further incorporate query information via a simple calibration as:

$$q_\phi(z_i = 1|x, \tilde{z}) = \min\left\{1,\; q_\phi(z_i = 1|x) + \Delta(z_i = 1|x, \tilde{z})\right\} \qquad (13)$$
Note that our calibration is non-parametric, since it is not realistic to assume access to a development set for each query type (e.g., in order to perform hyperparameter tuning). This enables zero-shot transfer to QFS tasks with varying characteristics.
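For concreteness, here is a minimal sketch of this belief update, assuming the query mask has already been computed with BPE-LCS(D, Q):

```python
import torch

def calibrate(q1: torch.Tensor, query_mask: torch.Tensor) -> torch.Tensor:
    """Equation (13): q_phi(z_i = 1 | x, z~) = min{1, q_phi(z_i = 1 | x) + Delta}.
    query_mask is 1 for tokens on the BPE-LCS with the test query Q and
    0 elsewhere; an all-zero mask leaves the posterior unchanged."""
    return torch.clamp(q1 + query_mask.float(), max=1.0)
```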
5 Conditional Language Model

In this section we describe our conditional language model, which estimates the expected log-likelihood of a summary sequence under the variational posterior (see Equation (6)). As mentioned earlier, we adopt an encoder-decoder architecture tailored to document summarization with latent queries.
Encoder We encode two views of the input document: a generic, query-agnostic view D, and a query-focused one Q (see Equation (10)). As shown in Figure 1(c), our encoder module consists of three encoders: a shared encoder, a document encoder, and a query encoder. Because both views are created from the same document, we use a shared encoder for general document understanding, which also reduces model parameters. The shared document representation serves as input to the more specialized encoders. Each encoder contains one or multiple Transformer layers (Vaswani et al., 2017), each composed of a multi-head attention (MHA) layer and a feed-forward (FFN) layer:
$$H^{(enc)} = \text{LN}\left(H^{(enc)} + \text{MHA}\left(H^{(enc)}, H^{(enc)}, H^{(enc)}\right)\right)$$
$$H^{(enc)} = \text{LN}\left(H^{(enc)} + \text{FFN}\left(H^{(enc)}\right)\right) \qquad (14)$$
where LN denotes layer normalization. As shown
in Figure 1(c), the query-focused view Q directly
conditions on sampled latent queries, while D is
based on the original document and its content.
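As an illustration, a minimal PyTorch sketch of this dual-view encoder stack might look as follows; the layer counts and the reuse of the LatentQueryScorer from the earlier sketch are our assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class DualViewEncoder(nn.Module):
    """Shared encoder feeding a document encoder (query-agnostic view D)
    and a query encoder whose states are gated by the latent posterior
    into the query-focused view Q (Equation (10))."""

    def __init__(self, d_h: int, n_heads: int = 8):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(d_h, n_heads, batch_first=True)
        self.shared = nn.TransformerEncoder(make(), num_layers=4)
        self.doc_enc = nn.TransformerEncoder(make(), num_layers=2)
        self.query_enc = nn.TransformerEncoder(make(), num_layers=2)

    def forward(self, x_emb: torch.Tensor, scorer) -> tuple:
        h = self.shared(x_emb)          # general document understanding
        d_view = self.doc_enc(h)        # query-agnostic view D
        h_q = self.query_enc(h)         # query encoder states H_q
        q1 = scorer(h_q, training=self.training)[..., 1:]  # q_phi(z_i=1|x)
        return d_view, q1 * h_q         # query-focused view Q, Eq. (10)
```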
Decoder We adopt a decoder structure similar
to Dou et al. (2021) to handle multiple inputs. Our
decoder sequentially attends to the two encoded
views of the same document:
$$H^{(dec)} = \text{LN}\left(H^{(dec)} + \text{MHA}\left(H^{(dec)}, H^{(dec)}, H^{(dec)}\right)\right)$$
$$H^{(dec)} = \text{LN}\left(H^{(dec)} + \text{MHA}\left(H^{(dec)}, Q, Q\right)\right)$$
$$H^{(dec)} = \text{LN}\left(H^{(dec)} + \text{MHA}\left(H^{(dec)}, D, D\right)\right)$$
$$H^{(dec)} = \text{LN}\left(H^{(dec)} + \text{FFN}\left(H^{(dec)}\right)\right) \qquad (15)$$
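A minimal PyTorch sketch of one such decoder layer, mirroring Equation (15), is shown below; dimensions, head counts, and the class name are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DualViewDecoderLayer(nn.Module):
    """One decoder layer of Equation (15): self-attention, cross-attention
    to the query-focused view Q, cross-attention to the document view D,
    and a feed-forward block, each with a residual connection and LN."""

    def __init__(self, d_h: int, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        attn = lambda: nn.MultiheadAttention(d_h, n_heads, batch_first=True)
        self.self_attn, self.q_attn, self.d_attn = attn(), attn(), attn()
        self.ffn = nn.Sequential(nn.Linear(d_h, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_h))
        self.norms = nn.ModuleList(nn.LayerNorm(d_h) for _ in range(4))

    def forward(self, h, q_view, d_view, causal_mask=None):
        h = self.norms[0](h + self.self_attn(h, h, h, attn_mask=causal_mask)[0])
        h = self.norms[1](h + self.q_attn(h, q_view, q_view)[0])  # Q first
        h = self.norms[2](h + self.d_attn(h, d_view, d_view)[0])  # then D
        return self.norms[3](h + self.ffn(h))
```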
After taking the context of the previous generation H(dec) into account, the decoder first attends to signals coming from query view Q, and then to the original document view D (based on guidance provided by the query). The final summary generation objective is calculated autoregressively as:

$$\mathcal{L}_{lm} = \sum_{j=1}^{N} \sum_{t=1}^{T} \log p_\theta\left(y^j_t \mid y^j_{<t}, D, Q\right)$$