Document Summarization with Latent Queries
Yumo Xu and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, United Kingdom
yumo.xu@ed.ac.uk
mlap@inf.ed.ac.uk
Abstract

The availability of large-scale datasets has driven the development of neural models that create generic summaries for single or multiple documents. For query-focused summarization (QFS), labeled training data in the form of queries, documents, and summaries is not readily available. We provide a unified modeling framework for any kind of summarization, under the assumption that all summaries are a response to a query, which is observed in the case of QFS and latent in the case of generic summarization. We model queries as discrete latent variables over document tokens, and learn representations compatible with observed and unobserved query verbalizations. Our framework formulates summarization as a generative process, and jointly optimizes a latent query model and a conditional language model. Despite learning from generic summarization data only, our approach outperforms strong comparison systems across benchmarks, query types, document settings, and target domains.1
1 Introduction
Recent years have witnessed substantial progress in generic summarization (See et al., 2017; Gehrmann et al., 2018; Liu and Lapata, 2019a, inter alia) thanks to neural architectures based on the encoder-decoder paradigm (Sutskever et al., 2014) and the availability of large-scale datasets containing hundreds of thousands of document-summary pairs. Unfortunately, training data of this magnitude is not readily available for the related task of query-focused summarization (QFS; Dang 2005), which aims to create a summary from one or multiple document(s) that answers a specific query. Existing QFS benchmarks (Dang, 2005; Hoa, 2006; Nema et al., 2017; Baumel et al., 2016) have been used for evaluation but are relatively small for training large neural models.

1Our code and models can be found at https://github.com/yumoxu/lqsum.
To make up for the absence of labeled QFS data, recent work has resorted to distant supervision provided by pretrained models, paraphrase identification, and question-answering datasets (Xu and Lapata, 2020; Su et al., 2020; Laskar et al., 2020b). Other work induces proxy queries (Xu and Lapata, 2021) from generic summarization datasets, without additional question-answering resources, which can also be extremely expensive to acquire (Bajaj et al., 2016). Despite this progress, building and scaling QFS systems remains challenging due to the many different ways natural language queries express users' information needs. For example, queries can consist of one or multiple keyword(s) (Baumel et al., 2016; Zhu et al., 2019), a simple question (Nema et al., 2017), or a longer narrative composed of multiple sub-queries (Dang, 2006) (see the examples in Table 1). Although QFS systems can potentially handle queries resembling those seen in training, they are not expected to work well on out-of-distribution queries (Xu and Lapata, 2021), namely, queries with surface forms different from those seen in training. In order to cover new types of queries, it might be necessary to gather more data, re-design proxy queries, and re-train one or more system components, which can be computationally inefficient and in some cases practically infeasible.
In this work, we provide a unified modeling framework for generic summarization and QFS, under the assumption that only data for the former is available. Specifically, we treat generic summarization as a special case of QFS where the query is latent. We model queries as discrete latent variables over document tokens, and learn representations compatible with observed and unobserved query verbalizations.
Dataset     | Task | Domain   | Size       | D/Q/S Tokens    | Query Type | Query Example
CNN/DM      | SDS  | News     | 11,490     | 760.5/0.0/45.7  | Empty      | ∅
WikiCatSum  | MDS  | Wiki     | 8,494      | 800.0/0.0/105.6 | Empty      | ∅
WikiRef     | SDS  | Wiki     | 12,000     | 398.7/6.7/36.2  | Keywords   | Marina Beach, Incidents
Debatepedia | SDS  | Debates  | 1,000      | 66.4/10.0/11.3  | Question   | Is euthanasia better than withdrawing life support?
DUC 2006    | MDS  | Newswire | 1,250 (50) | 699.3/32.8/250  | Composite  | AMNESTY INTERNATIONAL – What is the scope of operations of Amnesty International and what are the international reactions to its activities?
DUC 2007    | MDS  | Newswire | 1,125 (45) | 540.3/30.5/250  | Composite  | (same example as DUC 2006)
TD-QFS      | MDS  | Medical  | 7,099 (50) | 182.9/3.0/250   | Title      | Alzheimer's Disease

Table 1: Test data statistics. SDS/MDS stand for single-/multi-document summarization. Size refers to the number of test documents; for multi-document QFS, we specify the number of clusters in brackets. D/Q/S are Document/Query/Summary tokens. Composite queries consist of a TOPIC and a narrative.
Our framework formulates abstractive summarization as a generative process, and decomposes the learning objective into: (1) latent query modeling (i.e., generating latent query variables from document observations) and (2) conditional language modeling (i.e., generating summaries conditioned on observed documents and latent queries). To further handle user queries at test time, we propose a non-parametric calibration of the latent query distribution, which allows us to perform zero-shot QFS without model re-training.

Our contributions in this work are threefold: (a) we bring together generic summarization and QFS under a unified modeling framework that does not require query-related resources for training or development; (b) we provide a deep generative formulation for document summarization, where queries are represented directly from input documents in latent space, that is, without resorting to pipeline-style query extraction or generation; and (c) experiments on a range of summarization benchmarks show that across query types, document settings, and target domains, our model achieves better results than strong comparison systems.
2 Related Work
Rush et al. (2015) and Nallapati et al. (2016) were among the first to apply the neural encoder-decoder architecture to abstractive summarization. See et al. (2017) enhance their approach with a pointer-generator model, essentially a copy mechanism allowing words from the source document to be copied directly into the summary. Gehrmann et al. (2018) incorporate a content selection model that decides on relevant aspects of the source document. They frame this task as a word-level tagging problem, with the objective of separately identifying tokens from a document that should be part of its summary; at test time, they produce content selection probabilities for each word, which are then used to restrict the copy mechanism by performing hard masking over the input document. Another line of research controls summary generation via topics (Perez-Beltrachini et al., 2019a; Wang et al., 2020), retrieve-and-edit methods (Cao et al., 2018), factual relations (Jin et al., 2020), keywords, relational triples, or preselected source phrases (Dou et al., 2021).
The majority of previous QFS approaches have been extractive, composing summaries by selecting central and query-relevant sentences (Wan et al., 2007; Badrinath et al., 2011; Wan and Zhang, 2014; Li et al., 2017b,a). More recently, Xu and Lapata (2020) propose a coarse-to-fine framework that leverages distant supervision from question answering for summary sentence extraction. Abstractive QFS has received significantly less attention in comparison, due to generation models being particularly data-hungry (Lebanoff et al., 2018; Liu and Lapata, 2019a). As a result, resources from a wider range of NLP tasks have been used. Su et al. (2020) rank document paragraphs against queries with the aid of QA and machine reading datasets (Su et al., 2019; Rajpurkar et al., 2016), and then iteratively summarize selected paragraphs. Similarly, Laskar et al. (2020b) jointly exploit supervision from QFS data (typically reserved for evaluation) and related QA and paraphrase identification tasks. Because query-related resources can also be costly to obtain (Bajaj et al., 2016; Kwiatkowski et al., 2019), Xu and Lapata (2021) use none whatsoever. Instead, they create proxy queries by selectively masking information slots in generic summaries. Despite promising system performance,
their approach assumes prior knowledge of target queries (proxies are created to match their length and content), and a development set is used (Xu and Lapata, 2021). Also, their system is particularly tailored to multi-document QFS and includes a sophisticated evidence selection component. Our work is closely related to theirs in that we also do not take advantage of query-related resources. We go a step further and do not require a development set either, allowing our model to be independent of specific query verbalizations and to produce QFS summaries in zero-shot settings.

Our approach is generally applicable to single- and multi-document QFS. For any summarization task we assume that queries are latent and estimate them jointly via a summarization and (weakly supervised) tagging task. The latter draws inspiration from Gehrmann et al. (2018), under the assumption that document tokens found in the summary also provide evidence for the (latent) query that gave rise to it. Finally, our model is fundamentally different from approaches that rely on document-based guidance to improve the informativeness (Cao et al., 2018) or faithfulness (Chen et al., 2021) of summaries. While these models exploit guidance from supervision signals in training data, we are faced with the problem of estimating queries when none are available (at least during training).
3 Problem Formulation
Let {(D, Q, S)} denote a summarization dataset, where document D is a sequence of tokens and S its corresponding summary; query Q additionally specifies an information request. In generic summarization, Q = ∅, whereas in QFS Q can assume various formats, ranging from keywords to composite questions (see Table 1 for examples). Our model learns from generic summarization data alone, while robustly generalizing to a range of tasks at test time, including out-of-domain QFS. A shared characteristic between generic summarization and QFS is the fact that user intent is underspecified. Even when queries are available (i.e., Q ≠ ∅), they are incomplete expressions of intent, since users are unlikely to specify queries at the level of detail necessary to compose a good summary (Xu and Lapata, 2021). We thus identify latent query signals from D, and optionally take advantage of Q as an additional observation for belief update.
Generative Model We model an observed input document D as a sequence of random variables x = [x1; x2; . . . ; xM], where xi is a token and M the length of the document. We define the latent query as a sequence of discrete latent states over input document tokens: z = [z1; z2; . . . ; zM]. Specifically, from each document token xi, we generate a binary query variable zi, whose distribution p(zi) represents the belief that xi contributes to a potential query for document D. Modeling latent queries at the token level allows us to regularize the model by taking into account weak supervision in the form of token-level tagging (Gehrmann et al., 2018). It also renders the model independent of the query form, thereby enabling zero-shot inference (see Section 4).
The output summary y = [y1; y2; . . . ; yT] is then generated from {x, z} using teacher forcing at training time. At test time, we may additionally be presented with a query Q; we ground this optional information to the input document via discrete observed variables z̃ = [z̃1; z̃2; . . . ; z̃M], and generate y by additionally conditioning on z̃ (if it exists) in an autoregressive manner.
Our model estimates the conditional distribution pθ(y|x) according to the generative process just described (and illustrated in Figure 1) as:

$$p_\theta(y|x) = \sum_z p_\theta(y|z, x)\, p_\theta(z|x) = \sum_z p_\theta(y|z, x) \prod_i p_\theta(z_i|x_i) \qquad (1)$$
Inference Model The posterior distribution of latent variable z is calculated as:

$$p_\theta(z|x, y) = \frac{p_\theta(x, y, z)}{p_\theta(x, y)} = \frac{p_\theta(x, y, z)}{\sum_z p_\theta(x, y, z)} \qquad (2)$$
Unfortunately, exact inference of this posterior is computationally intractable due to the joint probability pθ(x, y). We therefore approximate it with a variational posterior qφ(z|x, y). Inspired by β-VAE (Higgins et al., 2017), we maximize the probability of generating summary y, provided the distance between the prior and variational posterior distributions is below a small constant δ:

$$\max_{\phi,\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathbb{E}_{z\sim q_\phi(z|x,y)} \log p_\theta(y|x, z)\right] \qquad (3)$$
$$\text{subject to } D_{KL}\left(q_\phi(z|x, y)\,\|\,p_\theta(z|x)\right) < \delta \qquad (4)$$
Figure 1: Proposed summarization framework: generative process and neural parametrization. Shaded nodes represent observed variables, unshaded nodes indicate latent variables, arrows represent conditional dependencies between variables, and plates refer to repetitions of sampling steps. Dashed lines denote optional queries at test time. Latent queries create a query-focused view of the input document, which together with a query-agnostic view serves as input to a decoder for summary generation.
Because we cannot solve Equation (4) directly, we invoke the Karush-Kuhn-Tucker conditions (Kuhn et al., 1951) and cast the above constrained optimization problem into unconstrained optimization, with the following ELBO objective:

$$\mathcal{L}_{ELBO} = \mathbb{E}_{q_\phi(z|x,y)}\left[\log p_\theta(y|x, z)\right] - \beta\, D_{KL}\left(q_\phi(z|x, y)\,\|\,p_\theta(z|x)\right) \qquad (5)$$
where the Lagrangian multiplier β is a hyperparameter. To minimize our model's dependence on queries (which we assume are unavailable for both training and development), we adopt a uniform prior pθ(z|x). In other words, the probability of variable z being a query word (given all instances of x) follows a uniform distribution. In this case, minimizing the KL term in Equation (5) is equivalent to maximizing the entropy of the variational posterior.2 We further assume that the tokens observed in a document are a superset of potential query tokens, and therefore z ⊥⊥ y and qφ(z|x, y) = qφ(z|x).3
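As an aside, the step from Equations (3)–(4) to Equation (5) can be spelled out via the standard β-VAE manipulation; the following unpacking is ours, written in the paper's notation:

$$\mathcal{F}(\theta, \phi; \beta) = \mathbb{E}_{z \sim q_\phi(z|x,y)}\left[\log p_\theta(y|x, z)\right] - \beta\left(D_{KL}\left(q_\phi(z|x, y)\,\|\,p_\theta(z|x)\right) - \delta\right) \;\geq\; \mathcal{L}_{ELBO}$$

Since β ≥ 0 and δ > 0, the constant βδ term can be dropped, leaving exactly the lower bound in Equation (5).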
While this simplification reduces the risk of exposure to bias from training on y, it makes learning meaningful latent variables more challenging, as they depend solely on x. We alleviate this by introducing a new type of weak supervision o(ẑ|x, y), which we automatically extract from data (i.e., document-summary pairs). Essentially, we tag tokens in the document as likely to be in the summary and, by extension, in the query.

2When pθ(z|x) ∼ U(a, b), DKL(qφ(z|x, y)‖pθ(z|x)) = −H(qφ(z|x)) + log(b − a + 1) always holds (z ∈ [a, b]).

3We experimentally verified this assumption on several QFS datasets. In WikiRef (Zhu et al., 2019) and Debatepedia (Nema et al., 2017), 1.57% and 4.27% of query tokens are not attested in the input document, respectively. In DUC (Dang, 2005) and TD-QFS (Baumel et al., 2016) where the input contains multiple documents, all query tokens are attested. Across all datasets, only 1.69% of query tokens are not attested in the input document/cluster.
We discuss how this tagger is learned in Section 4. For now, suffice it to say that weak supervision is a form of posterior regularization adding an extra term to the objective, which we rewrite as:

$$\mathcal{L} = \underbrace{\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(y|x, z)\right]}_{\text{conditional language modeling}} + \underbrace{\beta H\left(q_\phi(z|x)\right) - \omega H\left(o(\hat{z}|x, y),\, q_\phi(z|x)\right)}_{\text{latent query modeling}} \qquad (6)$$
where H(·) denotes posterior entropy and H(·, ·) denotes cross entropy.

As can be seen from Equation (6), we decompose summarization into two modeling objectives, namely, latent query modeling and conditional language modeling. Inside the query modeling term, hyperparameter ω controls the influence of weak supervision ẑ, while β controls the strength of label smoothing on the weak annotations.
Neural Parametrization We parametrize the two objectives in Equation (6) with a latent query model and a conditional language model, illustrated in Figure 1. The query model estimates latent query z from input variable x. At inference time, it optionally conditions on query knowledge ẑ (when this is available). The conditional language model is based on the vanilla encoder-decoder architecture, the main difference being that it encodes two views of input document D. One encoding is query-focused and depends directly on z as generated from the query model. The second encoding is query-agnostic, allowing the original document to provide complementary context. A decoder conditioned on both encodings autoregressively generates the summary y. In contrast to previous work (Xu and Lapata, 2021), the latent query model and
conditional language model are trained jointly in
a fully differentiable end-to-end manner. In the
following sections we explain in detail how these
two models are parametrized.
4 Latent Query Model

In this section we discuss how the inference network for latent queries is constructed. We also explain how query-focused document representations are obtained, our attempts to mitigate posterior collapse via weak supervision o(ẑ|x, y) (see Equation (6)), and how query beliefs are updated when queries are available at test time.
Inference Network for Latent Queries We construct a neural network model to infer for each token in the input document whether it constitutes a query term. Given a contextual token representation matrix Hq ∈ RM×dh, we project it to RM×2 with a two-layer MLP as a scoring function:

$$H_s = \text{ReLU}\left(H_q W_h + b_h^\top\right) \qquad (7)$$
$$\pi = H_s W_s + b_s^\top \qquad (8)$$

where Wh ∈ Rdh×dh, bh ∈ Rdh×1, Ws ∈ Rdh×2, and bs ∈ R2×1 are learnable model parameters.
Let G(0) denote the standard Gumbel distribution, and let gℓ ∼ G(0), ℓ ∈ {0, 1}, be i.i.d. Gumbel noise. We normalize π to form a variational distribution as:

$$q_\phi(z_i = \ell|x) = \text{softmax}_\ell\left([\pi_0 + g_0, \pi_1 + g_1]\right) = \frac{\exp\left((\pi_\ell + g_\ell)/\tau\right)}{\sum_{\ell' \in \{0,1\}} \exp\left((\pi_{\ell'} + g_{\ell'})/\tau\right)} \qquad (9)$$

where τ is the temperature controlling how close qφ(z|x) is to arg maxℓ qφ(z|x), and is optimized on the development set. Note that Gumbel noise is only applied during learning and is set to its mode (i.e., 0) for inference.
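To make Equations (7)–(9) concrete, here is a minimal PyTorch sketch of the inference network; the class name, argument names, and default temperature are our own illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentQueryScorer(nn.Module):
    """Sketch of the inference network in Equations (7)-(9): a two-layer
    MLP scores each token, and a Gumbel-softmax relaxation turns the two
    logits into q_phi(z_i | x)."""

    def __init__(self, d_h: int, tau: float = 0.5):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(d_h, d_h),  # H_s = ReLU(H_q W_h + b_h), Eq. (7)
            nn.ReLU(),
            nn.Linear(d_h, 2),    # pi = H_s W_s + b_s, Eq. (8)
        )
        self.tau = tau            # temperature, tuned on the dev set

    def forward(self, h_q: torch.Tensor, training: bool = True) -> torch.Tensor:
        # h_q: contextual token states of shape (..., M, d_h).
        # Returns q_phi(z_i = l | x) of shape (..., M, 2).
        pi = self.score(h_q)
        if training:
            # i.i.d. standard Gumbel noise, Eq. (9); at test time the
            # noise is set to its mode (0) and this is a plain softmax.
            u = torch.rand_like(pi).clamp_min(1e-10)
            pi = pi - torch.log(-torch.log(u))
        return F.softmax(pi / self.tau, dim=-1)
```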
Query-focused View As explained earlier, in
addition to a canonical, query-agnostic encoding
of the input document D (which we discuss in
Section 5), we further introduce a query-focused
encoding factorized via latent queries z.
Specifically, for the ith token, we take the continuous relaxation of its discrete latent variable zi, and ground4 it to the input document via:

$$Q_i = q_\phi(z_i = 1|x) \cdot H_{q,i} \qquad (10)$$

4We also experimented with drawing hard samples from z via the straight-through trick (Jang et al., 2016), which is differentiable with biased gradient estimation. However, it did not yield better results than continuous relaxation.
As we can see, the query-focused view explicitly models the dependency on latent queries. From a learning perspective, this factorization leads to the following partial derivatives of the query encoder states with respect to the query-focused view:

$$\frac{\partial Q_i}{\partial H_{q,i}} = \underbrace{\left(1 - q^{(1)}_\phi\right)}_{\text{carry gate}} \odot Q_i \cdot \frac{\partial \Delta\pi}{\partial H_{q,i}} + \underbrace{q^{(1)}_\phi}_{\text{transform gate}} \cdot \mathbf{1} \qquad (11)$$

where q(ℓ)φ is shorthand for the variational probability of zi = ℓ given x, Δπ = π1 − π0 (see Equation (8)), and 1 denotes an all-one vector. This can be viewed as a special case of highway networks (Srivastava et al., 2015), where the transform gate q(1)φ compresses the information captured by a token based on its likelihood of being a query term.
Token Tagging as Weak Supervision Although it is possible to optimize latent queries solely based on conditional language modeling (our approach is fully differentiable), we additionally exploit weak supervision to label tokens in the document as query-specific or not. Weak supervision is advantageous as it imposes extra regularization on the posterior (see Equation (6)), thereby mitigating its collapse (i.e., the decoder may learn to ignore the query-focused view and instead rely solely on the query-agnostic view).

Let t1, . . . , tn denote binary tags for each of the source tokens, that is, 1 if a token is query-specific and 0 otherwise. We could learn such a tagger from training data generated by aligning query tokens to the document. In default of such gold-standard data, we approximate queries by summaries and obtain silver-standard token labels by aligning summaries to their corresponding documents. Specifically, inspired by Gehrmann et al. (2018), we assume a token in the document is query-specific if it is part of the longest common subsequence (LCS) of tokens in the summary. Our tagging model is built on top of a pretrained language model, and thus operates on subwords. We first byte-pair encode (BPE; Sennrich et al., 2016) documents and summaries, and then search for the LCS over BPE sequences.
If there exist multiple identical LCSs, only the one appearing at the earliest document position is tagged as positive. We refer to this tagging scheme as BPE-LCS.
Note that although we model query variables at the token level, we take phrases indirectly into account through LCS, which identifies subsequences of tokens (or phrases) as query annotations. Our tagging model is therefore able to capture dependencies between tokens, albeit indirectly.
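For concreteness, a minimal sketch of deriving BPE-LCS tags is given below; the function name and the greedy earliest-position tie-breaking are simplifying assumptions rather than the released implementation:

```python
def bpe_lcs_tags(doc: list[str], summ: list[str]) -> list[int]:
    """Tag each document BPE token with 1 if it lies on the longest
    common subsequence (LCS) with the summary, else 0."""
    m, n = len(doc), len(summ)
    # dp[i][j] = LCS length of doc[i:] and summ[j:].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if doc[i] == summ[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Greedy front-to-back backtrace: matching as early as possible
    # approximates the earliest-position tie-breaking described above.
    tags, i, j = [0] * m, 0, 0
    while i < m and j < n:
        if doc[i] == summ[j] and dp[i][j] == dp[i + 1][j + 1] + 1:
            tags[i], i, j = 1, i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return tags

# e.g., z_hat = bpe_lcs_tags(bpe(document), bpe(summary))
```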
Training To optimize the variational inference model, that is, the MLP defined in Equations (7)–(9), we use a cross-entropy loss for token tagging, together with the posterior entropy term from Equation (6). Formally, we write the query modeling loss as follows:

$$\mathcal{L}_{query} = -\omega\mathcal{L}_{tag} + \beta\mathcal{L}_{entropy} = -\sum_{j=1}^{N}\sum_{i=1}^{M}\Big[\big(\omega\hat{z}^j_i - \beta q^{(1)}_\phi\big)\log q^{(1)}_\phi + \big(\omega(1 - \hat{z}^j_i) - \beta q^{(0)}_\phi\big)\log q^{(0)}_\phi\Big] \qquad (12)$$

where ẑi is a binary label automatically assigned via BPE-LCS(D, S), the alignment procedure described above. As we can see, the entropy term dynamically smooths the weak annotations ẑi (the degree of smoothing is modulated by qφ). We optimize ω and β on a development set.
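A minimal sketch of this loss, following the expanded form of Equation (12), might look as follows; tensor shapes and the numerical clamping are our assumptions:

```python
import torch

def query_loss(q1: torch.Tensor, z_hat: torch.Tensor,
               omega: float, beta: float) -> torch.Tensor:
    """L_query of Equation (12). q1 holds q_phi(z_i = 1 | x) for every
    token (shape (N, M)); z_hat holds the binary BPE-LCS labels."""
    z_hat = z_hat.float()
    eps = 1e-8
    log_q1 = torch.log(q1.clamp(eps, 1.0))
    log_q0 = torch.log((1.0 - q1).clamp(eps, 1.0))
    # -omega * L_tag: cross entropy against the weak labels;
    # +beta * L_entropy: q log q terms, which smooth the labels.
    per_token = ((omega * z_hat - beta * q1) * log_q1
                 + (omega * (1.0 - z_hat) - beta * (1.0 - q1)) * log_q0)
    return -per_token.sum()
```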
In the initial stages of training, the tagger might lead to inaccurate posterior probability assignments qφ(zi|x) and, consequently, hurt the summarization model, which relies heavily on a high-quality query-focused view. To address this issue, we introduce a posterior dropout mechanism that replaces the estimated posterior with weak supervision o(ẑ|x) according to probability α. We initialize α to 1, so that only o(ẑ|x) is used at the beginning of training, and the tagger is supervised via Equation (12). We then linearly anneal α over optimization steps so that the gradients from the summarization objective (which we introduce in Section 5) can jointly optimize the tagger.
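A minimal sketch of posterior dropout follows; whether replacement happens per token or per example, and the exact annealing endpoints, are simplifying assumptions here:

```python
import torch

def posterior_dropout(q1: torch.Tensor, z_hat: torch.Tensor,
                      step: int, total_steps: int) -> torch.Tensor:
    """With probability alpha, replace the estimated posterior
    q_phi(z_i = 1 | x) with the weak supervision o(z_hat | x);
    alpha is annealed linearly from 1 to 0 over training."""
    alpha = max(0.0, 1.0 - step / float(total_steps))
    keep_weak = torch.rand_like(q1) < alpha   # per-token replacement
    return torch.where(keep_weak, z_hat.float(), q1)
```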
Zero-shot Transfer We now explain how queries are taken into account at test time by performing query belief updates Δ(zi|x, z̃). In the case of generic summarization where no queries are available, we simply perform no update. When Q ≠ ∅, some tokens in the document become more relevant, and we consequently set Δ(zi = 1|x, z̃) = 1, ∀wi ∈ BPE-LCS(D, Q), and all other tokens to zero. We further incorporate query information via a simple calibration as:

$$q_\phi(z_i = 1|x, \tilde{z}) = \min\left\{1,\; q_\phi(z_i = 1|x) + \Delta(z_i = 1|x, \tilde{z})\right\} \qquad (13)$$
Note that our calibration is non-parametric, since it is not realistic to assume access to a development set for each query type (e.g., in order to perform hyperparameter tuning). This enables zero-shot transfer to QFS tasks with varying characteristics.
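For concreteness, here is a minimal sketch of this belief update, assuming the query mask has already been computed with BPE-LCS(D, Q):

```python
import torch

def calibrate(q1: torch.Tensor, query_mask: torch.Tensor) -> torch.Tensor:
    """Equation (13): q_phi(z_i = 1 | x, z~) = min{1, q_phi(z_i = 1 | x) + Delta}.
    query_mask is 1 for tokens on the BPE-LCS with the test query Q and
    0 elsewhere; an all-zero mask leaves the posterior unchanged."""
    return torch.clamp(q1 + query_mask.float(), max=1.0)
```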
5 Conditional Language Model

In this section we describe our conditional language model, which estimates the expected log-likelihood of a summary sequence under the variational posterior (see Equation (6)). As mentioned earlier, we adopt an encoder-decoder architecture tailored to document summarization with latent queries.
Encoder We encode two views of the input document: a generic, query-agnostic view D, and a query-focused one Q (see Equation (10)). As shown in Figure 1(c), our encoder module consists of three encoders: a shared encoder, a document encoder, and a query encoder. Because both views are created from the same document, we use a shared encoder for general document understanding, which also reduces model parameters. The shared document representation serves as input to the more specialized encoders. Each encoder contains one or multiple Transformer layers (Vaswani et al., 2017), each composed of a multi-head attention (MHA) layer and a feed-forward (FFN) layer:
$$H^{(enc)} = \text{LN}\left(H^{(enc)} + \text{MHA}\left(H^{(enc)}, H^{(enc)}, H^{(enc)}\right)\right)$$
$$H^{(enc)} = \text{LN}\left(H^{(enc)} + \text{FFN}\left(H^{(enc)}\right)\right) \qquad (14)$$
where LN denotes layer normalization. As shown
in Figure 1(c), the query-focused view Q directly
conditions on sampled latent queries, while D is
based on the original document and its content.
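As an illustration, a minimal PyTorch sketch of this dual-view encoder stack might look as follows; the layer counts and the reuse of the LatentQueryScorer from the earlier sketch are our assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class DualViewEncoder(nn.Module):
    """Shared encoder feeding a document encoder (query-agnostic view D)
    and a query encoder whose states are gated by the latent posterior
    into the query-focused view Q (Equation (10))."""

    def __init__(self, d_h: int, n_heads: int = 8):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(d_h, n_heads, batch_first=True)
        self.shared = nn.TransformerEncoder(make(), num_layers=4)
        self.doc_enc = nn.TransformerEncoder(make(), num_layers=2)
        self.query_enc = nn.TransformerEncoder(make(), num_layers=2)

    def forward(self, x_emb: torch.Tensor, scorer) -> tuple:
        h = self.shared(x_emb)          # general document understanding
        d_view = self.doc_enc(h)        # query-agnostic view D
        h_q = self.query_enc(h)         # query encoder states H_q
        q1 = scorer(h_q, training=self.training)[..., 1:]  # q_phi(z_i=1|x)
        return d_view, q1 * h_q         # query-focused view Q, Eq. (10)
```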
Decoder We adopt a decoder structure similar
to Dou et al. (2021) to handle multiple inputs. Our
decoder sequentially attends to the two encoded
views of the same document:
$$H^{(dec)} = \text{LN}\left(H^{(dec)} + \text{MHA}\left(H^{(dec)}, H^{(dec)}, H^{(dec)}\right)\right)$$
$$H^{(dec)} = \text{LN}\left(H^{(dec)} + \text{MHA}\left(H^{(dec)}, Q, Q\right)\right)$$
$$H^{(dec)} = \text{LN}\left(H^{(dec)} + \text{MHA}\left(H^{(dec)}, D, D\right)\right)$$
$$H^{(dec)} = \text{LN}\left(H^{(dec)} + \text{FFN}\left(H^{(dec)}\right)\right) \qquad (15)$$
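A minimal PyTorch sketch of one such decoder layer, mirroring Equation (15), is shown below; dimensions, head counts, and the class name are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DualViewDecoderLayer(nn.Module):
    """One decoder layer of Equation (15): self-attention, cross-attention
    to the query-focused view Q, cross-attention to the document view D,
    and a feed-forward block, each with a residual connection and LN."""

    def __init__(self, d_h: int, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        attn = lambda: nn.MultiheadAttention(d_h, n_heads, batch_first=True)
        self.self_attn, self.q_attn, self.d_attn = attn(), attn(), attn()
        self.ffn = nn.Sequential(nn.Linear(d_h, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_h))
        self.norms = nn.ModuleList(nn.LayerNorm(d_h) for _ in range(4))

    def forward(self, h, q_view, d_view, causal_mask=None):
        h = self.norms[0](h + self.self_attn(h, h, h, attn_mask=causal_mask)[0])
        h = self.norms[1](h + self.q_attn(h, q_view, q_view)[0])  # Q first
        h = self.norms[2](h + self.d_attn(h, d_view, d_view)[0])  # then D
        return self.norms[3](h + self.ffn(h))
```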
After taking the context of the previous generation H(dec) into account, the decoder first attends to signals coming from query view Q, and then to the original document view D (based on guidance provided by the query). The final summary generation objective is calculated autoregressively as:

$$\mathcal{L}_{lm} = \sum_{j=1}^{N} \sum_{t=1}^{T} \log p_\theta\left(y^j_t \mid y^j_{<t}, D, Q\right)$$