Augmenting Transformers with KNN-Based
Composite Memory for Dialog
Angela Fan
Facebook AI Research
Université de Lorraine
LORIA
angelafan@fb.com
Claire Gardent
CNRS/LORIA
claire.gardent@loria.fr
Chloé Braud
CNRS/IRIT
chloe.braud@irit.fr
Antoine Bordes
Facebook AI Research
abordes@fb.com
Abstract
Various machine learning tasks can benefit
from access to external information of different
modalities, such as text and images. Recent
work has focused on learning architectures
with large memories capable of storing this
knowledge. We propose augmenting genera-
tive Transformer neural networks with KNN-
based Information Fetching (KIF) modules.
Each KIF module learns a read operation to
access fixed external knowledge. We apply
these modules to generative dialog modeling,
a challenging task where information must be
flexibly retrieved and incorporated to maintain
the topic and flow of conversation. We demon-
strate the effectiveness of our approach by
identifying relevant knowledge required for
knowledgeable but engaging dialog from
Wikipedia, images, and human-written dialog
utterances, and show that
leveraging this
retrieved information improves model perfor-
mance, measured by automatic and human
evaluation.
1 Introduction
Machine learning approaches to various tasks,
such as game-playing or dialog, are often depen-
dent on external information. This information
can take multimodal forms, including structured
knowledge bases, free text, and images, and
also comes in overwhelmingly large quantities.
A pressing challenge is to create models that
can identify which specific elements of multiple
information sources are relevant in a particular
context, and incorporate them into standard architectures
on each task. In this work, we focus
on human–machine dialog and how to efficiently
retrieve external knowledge that is relevant to
the dialog. We consider two scenarios and for
each scenario, retrieve two types of knowledge:
(i) knowledge about similar dialog contexts and
(ii) external knowledge used to ground the
conversation into real world information.
Knowledge about similar dialog contexts allows
for a hybrid retrieval/generative approach to dialog
where the system response is generated based not
only on a representation of the current dialog
context and of the relevant world knowledge,
but also based on a response retrieved from a
similar dialog context. The retrieved knowledge
can be viewed as providing information about
structure and dialog sentences, or utterances:
which response is likely given a similar context?
External knowledge is also retrieved to improve
the semantic content of the dialog model. In
one scenario, Wizard of Wikipedia (Dinan et al.,
2018), general topics are provided to crowdworkers,
who are asked to have in-depth and specific
conversations about these topics by referencing
specific Wikipedia sentences as knowledge. In this
scenario, external knowledge is retrieved from a
pre-selected set of Wikipedia sentences associated
with the current dialog topic. Retrieval aims to
select the sentence that is most relevant at each
step of the dialog and thereby to ground system
responses in relevant world knowledge (e.g., by
referring to Star Wars when talking about science
fiction).
In the other scenario, Engaging ImageChat
(Shuster et al., 2020), crowdworkers are provided
with images and asked to have a conversation
Transactions of the Association for Computational Linguistics, vol. 9, pp. 82–99, 2021. https://doi.org/10.1162/tacl_a_00356
Action Editor: Masaaki Nagata. Submission batch: 6/2020; Revision batch: 9/2020; Published 3/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
inspired by or about the image. In this case, the
retrieved external knowledge is images and their
associated dialogs. By retrieving images that are
similar to the image being talked about, we aim to
enrich system responses with knowledge about
what is typically mentioned when describing
similar images (e.g., when talking about an image
with dogs, mentioning their breed).
Our work on incorporating different types and
modalities of knowledge is related to methods that
strive to add external memory, such as knowledge
bases, to neural networks. Previous work has ex-
plored incorporating large external memories into
neural network layers (Weston et al., 2015;
Sukhbaatar et al., 2015, 2019; Lample et al., 2019).
Many existing approaches focus on using attention
over the memory slots, which is computationally
intensive and becomes less effective as the size
of the memory grows. In this work, we propose
representing multiple sources of external infor-
mation as fixed encodings and using K Nearest
Neighbors (KNN) search to fetch relevant information.
KNN search is computationally efficient
and scalable, and libraries like faiss (Johnson
et al., 2019) allow KNN to be easily used on GPUs
and integrated into neural networks. Further,
the external memories are pre-encoded, so the
information encoding is only computed once. As
the external memories are kept fixed, they do not
require any training to learn the memories along
with the model. We can thus scale easily to larger
memories by learning only the KNN-based read
operation to identify relevant information from
the memory.
Our core contribution proposes an efficient,
KNN-based Information Fetching (KIF) module
that can access relevant external knowledge, com-
bine knowledge from different sources, and inte-
grate this information into standard sequence to
sequence architectures. We apply these flexible
modules to two dialog datasets that challenge gen-
erative models to leverage external information to
write coherent, on-topic responses. Both of our
chosen tasks require models to leverage external
information, such as information from Wikipedia
or images, to engage in the conversation. We
show that relevant information can be identified
from hundreds of thousands of candidates in a
multimodal, multi-knowledge-source setting to
improve the performance of generative dialog
models. Further, the output of the KIF modules
is interpretable as specific human-readable knowledge
elements are selected, allowing users to
better understand the information the generative
model conditions upon when writing the subse-
quent utterance. On both datasets, we achieve
state-of-the-art results compared to generative
models and find there is no statistically significant
difference in the interestingness or human pre-
ference of our model output compared to state-
of-the-art retrieval models.
2 Related Work
We discuss related work on learning to incorporate
external knowledge into neural networks and
efficiently access relevant information. We then
describe work in generative dialog that incor-
porates knowledge.
2.1 Incorporating External Knowledge
Augmenting neural networks with memory, or
longer-term components that can be accessed
with read and write operations, has been
explored in various proposed architectures. For
example, Memory Networks (Weston et al., 2015;
Sukhbaatar et al., 2015, 2019) introduce attention
mechanisms over large external memories. Neural
cache models (Grave et al., 2017b) simplify these
to access previous memories with a dot product.
Previous work has also studied how to read and
write into these memory architectures (Rae et al.,
2016; Graves et al., 2014; Joulin and Mikolov,
2015). In contrast, we focus on how to read large
memories.
Another line of research has focused on
computational scalability for larger external me-
mories to allow efficient access of information.
For example, Chandar et al. (2016) propose a
hierarchical memory network rather than a flat
one and Rae et al. (2016) learn sparse operations
to read and write. Lample et al. (2019) focus
on learning memories of up to one million
slots and how to efficiently access the slots
using product keys. Khandelwal et al. (2019)
use nearest neighbor operations to augment
language models by performing retrieval at the
token level—in contrast, we focus on multimodal
retrieval of multiple pieces of knowledge based
on an entire dialog context. Beyond explicit
memory representations, it may be possible to
store information implicitly during training time
by memorizing common patterns present in text
(Petroni et al., 2019). We focus on learning
to fetch relevant
information from multiple
explicit external multimodal knowledge sources
and integrate them into one network. Further,
our work allows the retrieved information to be
interpreted as each memory slot is an explicit fact
that can be read as text, rather than a learned vector
such as in Lample et al. (2019).
Work has also focused on computationally
efficient softmax operations (Mnih and Hinton,
2009; Grave et al., 2017a; Chen et al., 2016).
Many approximate softmax techniques use KNN-
like operations to form clusters, and the overall
softmax operation is constrained by the slow
calculation of the exponential. Our usage of KNN
benefits from efficient and scalable libraries such
as faiss and nmslib.
2.2 Generative Dialog
We develop a general architecture for incorpo-
rating external information and apply it to the
case of generative dialog models. Previous work
in dialog has leveraged knowledge as necessary
information to accomplish the task. For example,
airline and restaurant booking tasks often use
API calls to access information about reservation
times and availability (Bordes et al., 2017). In
contrast, our work focuses on how to incorporate
unstructured knowledge, such as free text found
on the Web. Previous work has used architectures
that attend over the available knowledge and
identify relevant pieces of information, which
scales poorly with large quantities of information
(Dinan et al., 2018; Qin et al., 2019; Lian
et al., 2019). We replace the use of attention
over external information with the output of
a KNN module. Other work has investigated
incorporating information retrieval in language
modeling and question answering (Chen et al.,
2017; Fan et al., 2019; Seo et al., 2019; Guu et al.,
2020), while we focus on dialog applications and
flexibly incorporating knowledge from multiple,
multimodal sources.
On the modeling side, work has explored
both generative (Serban et al., 2016a, 2016b)
and retrieval based models (Zhang et al., 2018),
which identify the best utterance from the
training set to return as the dialog response. This
often leverages self-attention or cross-attention
mechanisms (Humeau et al., 2019). Further work
has explored hybrid models, for example, using the
output of a retrieval model as input for a generative
model (Dinan et al., 2018; Weston et al., 2018;
Cai et al., 2019; Zhu et al., 2020). Some of this
work has specialized to use both types of models to
generate conversations in an ensemble (Song et al.,
2016) or to specifically improve consistency (Song
et al., 2020). We extend these approaches by
augmenting generative models with retrieval-like
operations based on KNN search, allowing dialog
models to flexibly incorporate various sources of
external knowledge at the same time and scale to
large quantities of retrieval candidates.
3 KNN-based Information
Fetching Modules
Broadly, the KIF module assumes an encoder
model M can access inputs X =
{x1, x2, . . . , xn}. For example, X can be a
collection of sentences, and xi represents an
individual sentence. In a setting without additional
supporting information, the encoder will process
an input xi and produce the encoder output
M (xi). If xi is a sequence such as a sentence,
then M (xi) is a representation of the variable
size of the sequence length by the fixed size
encoder M ’s hidden size. However, in many tasks,
additional information is present, represented as
E = {e1, e2, . . . , em}. We encode each element
of X and E into a vector representation using the
encoder. To identify the closest information in E
that is relevant to xi, our general approach will
be to use KNN by comparing the representation
of xi with the representation of each element in
the set E. KNN is a fully differentiable operation
(Plötz and Roth, 2018), so it can be incorporated in a
straightforward way into neural models. The most
relevant information in E will then be available in
the model. We display a KIF-augmented model
in Figure 1 and describe how the KIF module
operates.
One challenge to overcome is that the
representations of all elements of the knowledge
source E are pre-computed and kept fixed,
creating M (E); we do not backpropagate to
affect the embeddings of the pre-encoded
knowledge. In the early stages of training, the
model receives large amounts of loss, which would
affect the quality of the pre-encoded embeddings
if we backpropagated to them. Further, encoding
the fixed external knowledge once and re-using
it allows for greater scalability. However, this
lack of backpropagation can introduce a mismatch
Figure 1: KIF modules fetch relevant information from multimodal external knowledge. External
knowledge sources E1 and E2 are pre-encoded by encoder M (green). In the model, input xi is encoded
by encoder M ′ (blue) to produce M ′(xi). KIF modules (orange) operate on M ′(xi) and identify the
nearest neighbors encoded in M (E1) and M (E2) using KNN. Identified relevant elements from E1 and
E2 are re-encoded by M ′ in a gating mechanism with a weighted sum (represented by σ(WS1i) · WS1i,
where WS stands for weighted sum), then concatenated to M ′(xi). Full description with notation can be
found in Section 3.
between the encoding of E and the encodings
produced by a model that is training, as the training
model has constantly changing representations
because the weights are being learned. We use
M to represent the original encoder model used
to encode E and M ′ to represent the constantly
training model that is encoding X. The model
must learn a function to align M ′(xi) to the
pre-encoded elements of the external memory
M (E).
To circumvent
this misalignment, we learn
a mapping operator fE(M ′(xi)) that trains to
map elements of the model’s representation of X,
or M ′(X), into the additional information repre-
sentation space M (E). Concretely, fE(M ′(xi))
is a multilayer perceptron with ReLU nonlineari-
ties. From the input elements of X, fE(M ′(xi))
learns representations of an output close to the
corresponding projection of X into E. This can
be interpreted as learning a read operation on a
fixed external memory. If there was no change
to the encoding of the model compared to the
pre-computed knowledge, then the ideal mapping
operator would be the identity function (as
M ′ would equal M ). However, as the model
changes significantly during the training process,
the nonlinear mapping capability of fE(M ′(xi))
is essential to be able to identify the correct
knowledge E from the input X.
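As a rough illustration of such a mapping operator, the following NumPy sketch applies a two-layer perceptron with a ReLU nonlinearity to an encoder output. The class name `MappingOperator`, the number of layers, the inner width, and the random initialization are assumptions made for this example only; learning the weights during training is not shown.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)

class MappingOperator:
    """Illustrative f_E: a small perceptron mapping M'(x_i) toward the space of M(E).

    The two-layer shape and the inner width are assumptions, not the paper's settings.
    """

    def __init__(self, hidden: int, inner: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=hidden ** -0.5, size=(hidden, inner))
        self.b1 = np.zeros(inner)
        self.w2 = rng.normal(scale=inner ** -0.5, size=(inner, hidden))
        self.b2 = np.zeros(hidden)

    def __call__(self, encoded: np.ndarray) -> np.ndarray:
        # encoded has shape (length, hidden); the mapping is applied per position.
        return relu(encoded @ self.w1 + self.b1) @ self.w2 + self.b2

f_E = MappingOperator(hidden=8, inner=16)
mapped = f_E(np.ones((5, 8)))  # same shape as the input: (5, 8)
```

The output keeps the hidden size of the input, so it can be compared directly against the pre-encoded memory.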
Thus, a model augmented with KIF will
incorporate external knowledge in the following
manner. First, we find the k nearest elements
to fE(M ′(xi)) in M (E), based on KNN search
with inner product. Then, the relevant elements
identified by KNN are re-encoded by M ′. For
example, if element ej is retrieved by KIF, it would
produce M ′(ej). We use the optimized faiss
library for KNN search, which can conduct
billion-scale KNN efficiently on GPUs.
The KNN output for an element xi is produced
by using faiss to search for the k nearest
representations to fE(M ′(xi)) in M (E). Note
that as the encoders M and M ′ produce output
representations of variable length (for example, in
the case where xi is a variable length sequence,
such as a sentence), we average across the length
dimension to produce a fixed-size representation
r to conduct the KNN search.
$$r_{x_i} = \mathrm{Avg}\big(f_E(M'(x_i))\big) \tag{1}$$
$$R_E = \big\{\mathrm{Avg}(M(e)) \mid e \in E\big\} \tag{2}$$
$$\mathrm{KNN}_{x_i} = \mathrm{KNearest}\big(k, r_{x_i}, R_E\big) \tag{3}$$
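The three steps in Equations (1) to (3) can be sketched with brute-force NumPy operations. This is only meant to show the computation: the paper relies on faiss for the search, and the helper names, the random stand-in encodings, and the sizes below are hypothetical.

```python
import numpy as np

def average_over_length(seq_repr: np.ndarray) -> np.ndarray:
    """Collapse a (length, hidden) representation into one (hidden,) vector."""
    return seq_repr.mean(axis=0)

def k_nearest(k: int, query: np.ndarray, memory: np.ndarray):
    """Return indices and inner-product scores of the k memory rows closest to the query."""
    scores = memory @ query            # inner product with every memory slot
    top = np.argsort(-scores)[:k]      # k highest-scoring slots, best first
    return top, scores[top]

rng = np.random.default_rng(0)
hidden = 8
# Hypothetical stand-ins for f_E(M'(x_i)) and for M(e) of 20 knowledge elements.
mapped_input = rng.normal(size=(5, hidden))
encoded_memory = [rng.normal(size=(int(rng.integers(3, 9)), hidden)) for _ in range(20)]

r_x = average_over_length(mapped_input)                           # Eq. (1)
R_E = np.stack([average_over_length(e) for e in encoded_memory])  # Eq. (2)
knn_idx, knn_scores = k_nearest(5, r_x, R_E)                      # Eq. (3)
```

Because the memory matrix `R_E` is computed once and never updated, only the query side changes as training progresses.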
Then, the KIF module output for an element xi
is the set of all re-encoded representations of the
KNN-retrieved knowledge:
$$\mathrm{KIF}_{x_i} = \big\{M'(e) \mid e \in \mathrm{KNN}_{x_i}\big\} \tag{4}$$
These elements are weighted by their normal-
ized nearest neighbor scores and then summed.
This is subsequently concatenated to M ′(xi) to
form the final encoder output:
$$\big[M'(x_i), \mathrm{WeightedSum}(\mathrm{KIF}_{x_i})\big] \tag{5}$$
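A minimal sketch of the weighting and concatenation steps follows. Two points are assumptions rather than details stated in the text: the neighbor scores are normalized with a softmax, and each re-encoded element is reduced to a single hidden-size vector so that the weighted sum can be appended to M ′(xi) as one extra position along the length dimension.

```python
import numpy as np

def weighted_sum(reencoded: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Combine k retrieved vectors of shape (k, hidden) using normalized scores."""
    shifted = scores - scores.max()                    # for numerical stability
    weights = np.exp(shifted) / np.exp(shifted).sum()  # softmax (assumed normalization)
    return weights @ reencoded                         # shape (hidden,)

def encoder_output(encoded_input: np.ndarray, reencoded: np.ndarray,
                   scores: np.ndarray) -> np.ndarray:
    """Append the weighted sum to M'(x_i) as one additional position (assumed layout)."""
    ws = weighted_sum(reencoded, scores)
    return np.concatenate([encoded_input, ws[None, :]], axis=0)

encoded_input = np.zeros((4, 3))                           # stand-in for M'(x_i)
reencoded = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # stand-in for M'(e), k = 2
out = encoder_output(encoded_input, reencoded, np.array([2.0, 2.0]))
```

With equal scores the two retrieved vectors receive equal weight, so the appended row is their mean.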
This can be easily extended to using multiple
modules simultaneously. For instance, two
sources of external information, E1 and E2, can
be combined by identifying the top candidates
of each information source. The weighted sum
of the KIF output on each information source is
concatenated with the encoded input M ′(xi). The
KIF output dimensionality is the same size as the
hidden size of M ′(xi), so they can be directly
concatenated.
Finally, different sources of information may
not be required for every prediction and some
information sources can be more important than
others. To allow the model to make more fine-
grained decisions about what
information to
use from what source, and how much of it,
we add a gating mechanism using a sigmoid
function around each weighted sum of KNN
representations. KIF1i and KIF2i denote the KIF
module from Equation (4) applied to E1 and E2,
respectively.
$$\mathrm{WS}_{1i} = \mathrm{WeightedSum}(\mathrm{KIF}_{1i}) \tag{6}$$
$$\mathrm{WS}_{2i} = \mathrm{WeightedSum}(\mathrm{KIF}_{2i}) \tag{7}$$
which produces the final encoder output, a
concatenation of M ′(xi) with the output of
multiple KIF modules:
$$\big[M'(x_i), \sigma(\mathrm{WS}_{1i}) \cdot \mathrm{WS}_{1i}, \sigma(\mathrm{WS}_{2i}) \cdot \mathrm{WS}_{2i}\big] \tag{8}$$
This concatenation represents the output of the
encoder M ′ and can be used for various purposes,
such as providing the encoder output to a decoder
in a sequence to sequence model.
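The gated combination of several KIF outputs can be sketched as below. The gate is applied elementwise here, which is an assumption (the text does not say whether the sigmoid yields a scalar or a vector), and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def gate(ws: np.ndarray) -> np.ndarray:
    """sigma(WS) * WS, applied elementwise (assumed) to one weighted sum."""
    return sigmoid(ws) * ws

def gated_encoder_output(encoded_input: np.ndarray, weighted_sums) -> np.ndarray:
    """Concatenate M'(x_i) with one gated weighted sum per knowledge source."""
    gated = [gate(ws)[None, :] for ws in weighted_sums]
    return np.concatenate([encoded_input, *gated], axis=0)

encoded_input = np.ones((4, 3))        # stand-in for M'(x_i)
ws1 = np.zeros(3)                      # stand-in for WS_1i
ws2 = np.array([100.0, 0.0, -100.0])   # stand-in for WS_2i
out = gated_encoder_output(encoded_input, [ws1, ws2])
```

Large positive entries pass through almost unchanged, while large negative entries are pushed toward zero, which lets the model suppress a source it does not need for a given prediction.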
4 Applying KIF to Dialog Tasks
We describe how to apply KIF to the task of
generative dialog, a setting where models must
generate engaging and on-topic responses. We
investigate dialog for two reasons: First, dialog
agents must be able to consult relevant information
to maintain the topic of the conversation. Second,
retrieval-based agents have strong performance
compared to generative ones, due to their ability to
copy dialog utterances from the training set. Using
KIF, we can incorporate the benefits of retrieval
architectures into generative, knowledge-based
models.
4.1 KIF for Generative Dialog
In dialog, xi represents the text of the conversation
i. A conversation consists of multiple back-
and-forth utterances (or turns). For example, a
conversation could consist of 4 turns: xi =
[xi,1, xi,2, xi,3, xi,4] where xi,4 is the direct
utterance the model should respond to, and the
earlier utterances are the conversation context.
Standard generative dialog models use a
Transformer neural network as the encoder M
and want to produce an output that is an appropriate
response to the conversation. However,
in many cases, the conversation history alone
does not include all of the information required to
produce an appropriate response. For example, if
a model needs to chat about a specific movie,
it can be helpful to provide the model with
more information about that movie so a more
interesting dialog response could be produced. To
incorporate knowledge, models often concatenate
a knowledge source E such as Wikipedia to
xi and use attention modules to identify the
most relevant knowledge. However, this approach
is computationally intensive when handling
large quantities of information. Further, attention
mechanisms have been found to operate poorly
over long sequences, as the mechanism becomes
blurry due to the softmax and struggles to make
fine-grained decisions (Fan et al., 2018b). The
same is true for hierarchical approaches, which
lack scalability.
We augment Transformer sequence to sequence
(seq2seq) networks on the encoder side with KIF
to improve generative dialog models. We experi-
ment on two dialog tasks, Wizard of Wikipedia
(Dinan et al., 2018) and Engaging ImageChat
(Shuster et al., 2020). In both datasets, models
must leverage information external to the dialog
history alone—in Wizard of Wikipedia, the chat
requires access to knowledgeable facts and in
Engaging ImageChat, discussion about a specific
image. As models must process multiple inputs
and ground responses in the knowledgeable facts
or images, these tasks challenge existing seq2seq
approaches.
4.2 Wizard of Wikipedia
The goal of the Wizard of Wikipedia dataset is to
train knowledgeable agents that can chat in any
domain. The dataset contains 1,365 various topics
discussed in 18,430 dialogs in the training set,
totalling 166,787 training utterances. Each topic is
a general concept, such as dogs or ice cream, and is
included as the first utterance of the conversation.
The conversation is meant to be in-depth and
detailed, so individual utterances must reference
specific knowledge as a basis for the utterance. The
knowledge takes the form of Wikipedia sentences.
For example, the chat utterance I love Toy Story!
It was released in 1995 would reference the
Wikipedia sentence Toy Story is a 1995 American
computer-animated buddy comedy […]. For each
utterance, a set of sentences are identified by an
information retrieval system, and the crowdworker
selected one knowledge sentence as the basis for
their utterance.
Knowledge Sources. Our model for Wizard of
Wikipedia has access to two sources of external
information, E1 and E2:
• E1 is Wikipedia Knowledge provided by the
dataset as evidence to support knowledgeable
chitchat (initially curated by the information
retrieval system used in Dinan et al. [2018]).
The scale of this KNN search is to filter
through an average of 34 sentences. The KIF
module uses dialog features to fetch relevant
knowledge to condition upon to generate the
subsequent utterance.
• E2 is Training Utterances. To incorporate
the benefits of retrieval-based dialog models
to the generative setting, we use KIF to
identify relevant utterances from the training
set and take their responses as input. If
many conversations about dogs have already
occurred, models should be able to take
advantage of these human-written examples
to improve their generations. For example,
likely conversation could occur about the
breed of the dog, daily routine with a pet, and
similar topics. There are around 170K dialog
utterances as inputs to KNN search. This can
be interpreted as incorporating the benefits of
retrieval models by identifying an utterance
with similar structure as the text the model
would like to generate. We do not allow the
module to fetch the correct response of the
current conversation context.
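One simple way to keep the gold response out of the retrieved set is to mask its score before taking the top k. The sketch below shows this with a brute-force inner-product search; the function name and the masking strategy are assumptions for illustration, not a description of the actual implementation.

```python
import numpy as np

def k_nearest_excluding(k: int, query: np.ndarray, memory: np.ndarray, excluded) -> np.ndarray:
    """Top-k inner-product search that never returns the indices in `excluded`."""
    scores = memory @ query
    idx = np.fromiter(excluded, dtype=np.int64)
    scores[idx] = -np.inf               # masked slots can no longer rank in the top k
    return np.argsort(-scores)[:k]

memory = np.eye(4)                      # four toy training-utterance encodings
query = np.array([3.0, 2.0, 1.0, 0.0])  # slot 0 would otherwise be the best match
retrieved = k_nearest_excluding(2, query, memory, excluded={0})
```

Here slot 0 plays the role of the correct response for the current context, so the search returns the next-best slots instead.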
Access to these two sources of knowledge
can be seen as learning a template and a topic
separately. Sample templates can be identified
from the training utterances, and topic-specific
information learned by accessing the Wikipedia
knowledge.
Additional KNN Features. To better identify
relevant training utterances from the large quantity
available, we break down xi into conversation
sub-features for a more fine-grained match in the
KNN search step. By conducting KNN on more
features, we can achieve higher quality retrieval.
We leverage the nature of dialog to decide these
features.
We concatenate the encoding of the most
recent dialog utterance (e.g., xi,last) with the
encoding of the dialog context from the current
conversation and the turn number t, such that
M ′(xi,last), M ′(xi,−last), t is the representation
used for KNN search. Concretely, if the model is
trying to produce the 5th turn of the conversation,
then xi,last
is the most recent utterance from the
dialog partner, xi,−last would be the last 3 turns
of exchange, and t would be 4. Note that the turn
number is represented as a standalone number.
These are known to be salient conversation fea-
tures. The most recent dialog utterance is the di-
rect turn the model is responding to, and the
dialog context may provide additional clues. The
turn number is important, as earlier turns are often
generic (e.g., how are you doing today) and later
turns are more specific.
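The feature concatenation used as the KNN query can be sketched as follows. It assumes the two encodings have already been averaged into fixed-size vectors and that the turn number is appended as a single scalar; the function name is hypothetical.

```python
import numpy as np

def knn_query_features(last_utterance_vec: np.ndarray, context_vec: np.ndarray,
                       turn_number: int) -> np.ndarray:
    """Build [M'(x_last), M'(x_-last), t] as one flat query vector."""
    turn = np.array([float(turn_number)])  # the turn number as a standalone number
    return np.concatenate([last_utterance_vec, context_vec, turn])

hidden = 4
last_vec = np.full(hidden, 0.1)     # stand-in for the averaged M'(x_i,last)
context_vec = np.full(hidden, 0.2)  # stand-in for the averaged M'(x_i,-last)
query = knn_query_features(last_vec, context_vec, turn_number=4)
```

For the Engaging ImageChat setting, an encoding of the personality word would be appended in the same way.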
4.3 Engaging ImageChat
The goal of Engaging ImageChat is to create
agents capable of chitchatting about
images
selected from the YFCC100M dataset (Thomee
et al., 2016). The dataset contains 186,782 dialogs
in the training set, each about a unique image,
totalling 355,862 utterances. Agents are assigned
one of 215 personalities (e.g., sweet, caring,
excited) to increase engagingness. Previous work
(Shuster et al., 2020, 2019) identified that both
crowdworkers and models, when provided with
personalities, produced more diverse, interesting
responses, as evaluated by humans.
We use a multimodal neural network designed
to handle both image input and text
input.
Following Shuster et al. (2020), the images are
encoded using a pre-trained ResNeXt network
(Xie et al., 2017). To extract the final image
representation, we project the 2048-dimensional
output of the image encoder to 512-dimensions
using a deep multilayer perceptron with ReLU
activation units. The conversation history, which
includes the one-word personality, is encoded with
a Transformer encoder network. The image and
conversation are integrated using the Multimodal-
Sum-Combiner module proposed in Shuster et al.
(2020).
Knowledge Sources. Our model for Engaging
ImageChat has access to two sources of external
information, E1 and E2:
• E1 is Chat on Similar Images. Although there
are over 180K different images in this dataset,
many of the images are similar. For example,
conversations associated with two pictures
of dogs could be relevant to each other. The
model is able to use KIF directly on the
current image features to fetch from around
180K different images and return 6 turns of
related chat for each fetched image. Fetching
from E1 consists of identifying related image
chats, or conversations on related topics.
• E2 is Training Utterances. Similar to the
motivation for the previous dataset, we allow
the model to identify training utterances that
could be useful for responding in the current
conversation. The scale of this fetching task
is large: 350K dialog utterances. This could
be interpreted as identifying utterances with
similar structure to what the model would
like to generate, and is complementary to the
topic-based related image chats.
Additional KNN Features. To identify relevant
information from training utterances, we use the
same dialog features as Wizard of Wikipedia in
the KNN search step, with one modification: We
add the personality provided by the dataset. We
represent the personality feature as the personality
word, such as caring, and embed it with the
encoder M ′. As utterances from speakers with
the same personality are more likely to be
similar, this feature improves the quality of the
fetched information. For example, conversations
with the sweet personality often include similar
text such as aww, that’s wonderful. We use
two additional features for the KNN search: t,
the turn number, and p, the personality. This
feature is explicitly used in Shuster et al. (2020)
to improve the engagingness and flow of the
conversation. Similar to Wizard of Wikipedia, we
represent the conversation turn t as a number.
The Transformer model is used to encode text
xi and produce a representation of the text,
then the turn number t and personality p are
represented separately. As the personality is a
word, we use the same Transformer to encode
it. The concatenation of features used for KNN
search is: M ′(xi,last), M ′(xi,−last), t, p.
5 Experimental Setup
5.1 Implementation Details
Parameter Settings. We use parl.ai (Miller
et al., 2017) to implement our models. The data for
both datasets used is available for download from
parl.ai as well. We use byte-pair encoding
(Sennrich et al., 2016) to represent the text to better
handle the rare word problem (Dinan et al., 2018;
Fan et al., 2018a). Our generative Transformer
models have 8 encoder layers and 8 decoder layers,
with FFN size 2048, embedding dimension 512,
and 4 attention heads. We optimize using Adam
(Kingma and Ba) and the inverse square root
learning schedule (Vaswani et al., 2017) with 10k
warmup updates. The initial learning rate is 0.0001
and we optimize for model perplexity. We use a
dropout of 0.5 and set gradient clipping to 0.1.
We set k = 5 for all cases. For both datasets,
we model a vocabulary size of 54,944 based on
the BPE-based vocabulary from the Reddit pre-training.
We tuned the learning rate and batch size
hyperparameters together.
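For reference, a common form of the inverse square root schedule with linear warmup is sketched below. It assumes that the stated learning rate of 0.0001 is the value reached at the end of the 10k warmup updates; the text is ambiguous on this point, and the exact schedule used in the experiments may differ.

```python
import math

def inverse_sqrt_lr(step: int, peak_lr: float = 1e-4, warmup: int = 10_000) -> float:
    """Linear warmup to `peak_lr` (assumed), then decay proportional to 1/sqrt(step)."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * math.sqrt(warmup / step)

lr_mid_warmup = inverse_sqrt_lr(5_000)    # halfway through warmup
lr_at_peak = inverse_sqrt_lr(10_000)      # end of warmup
lr_after_decay = inverse_sqrt_lr(40_000)  # four times the warmup length
```

After warmup, quadrupling the number of updates halves the learning rate.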
Pre-training. We pre-train the Transformer
seq2seq model used for both datasets on 250M
comments from Reddit. The Reddit dataset was
made available by pushshift.io. The comments
are parsed to maintain conversational threads
of users responding to each other, so the
encoder network has been exposed to conversational
context at training time. Note that the
Reddit dataset does not include aspects such as
personality, as those are unique to specific datasets
such as Engaging ImageChat. The context size in
pre-training is set to 512 tokens. The ResNeXt
encoder used to model images for the Engaging
ImageChat dataset was pre-trained on 3.5 billion
images (Mahajan et al., 2018).
5.2 Evaluation
Generation. We generate with beam search,
setting the beam size to 4. We use 3-gram block-
ing. This technique disallows repeated n-grams
from being generated multiple times and reduces
repetition.
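The check at the heart of n-gram blocking can be sketched as a small function that a beam search would call before extending a hypothesis. The function name and the way it would be wired into decoding are assumptions for illustration.

```python
def creates_repeated_ngram(tokens, candidate, n=3):
    """Return True if appending `candidate` would repeat an n-gram already in `tokens`."""
    if len(tokens) < n - 1:
        return False
    # The n-gram that would be formed by the last n-1 tokens plus the candidate.
    new_ngram = tuple(tokens[len(tokens) - (n - 1):]) + (candidate,)
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return new_ngram in seen

prefix = ["i", "love", "dogs", "and", "i", "love"]
blocked = creates_repeated_ngram(prefix, "dogs")  # would repeat "i love dogs"
allowed = creates_repeated_ngram(prefix, "cats")  # "i love cats" is new
```

During decoding, a candidate token for which this check returns True would be given zero probability, so the hypothesis cannot be extended with it.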
Automatic Metrics. Following Dinan et al.
(2018), we compute F1, a metric of unigram
overlap, between the generated utterance and
the human-written reference utterance from the
dataset. For generative models, utterances are
generated using beam search. For retrieval models,
the next utterance is predicted by ranking the entire
set of training utterances, and the highest scoring
utterance is chosen.
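A minimal version of unigram F1 is sketched below. It lowercases and splits on whitespace, which is an assumed normalization; the evaluation code used for the reported numbers may normalize text differently (for example, by stripping punctuation).

```python
from collections import Counter

def unigram_f1(generated: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall between two utterances."""
    gen = generated.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

score = unigram_f1("Toy Story is great", "Toy Story is a 1995 film")
```

In this example three of the four generated tokens appear in the six-token reference, giving a precision of 0.75 and a recall of 0.5.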
In Wizard of Wikipedia, there are two test sets:
A set of seen topics, or topics that have been
seen at training time with new test-time dialogs.
The second set is unseen, or topics that have not
been encountered at all during training time. We
evaluate on both subsets.
Human Evaluation. We follow the setup and
use the analysis questions proposed in the
Acute-Eval dialog evaluation system (Li et al.,
2019). For reproducibility, we adopt this existing
evaluation setting that has been applied to several
dialog datasets. We use the question wording
suggested by Acute-Eval and follow their
self-chat procedure and interface. As one of the
original datasets assessed in this system was
Wizard of Wikipedia, their evaluation setting
extends naturally to ours. We collect 100 human-
bot conversational dialogs on a crowdsourcing
platform for both datasets. The dialogs are eight
turns long. Then, we show pairs of the collected
conversations side by side, one conversation with
a human and model A and the other conversation
with a human and model B. We ask annotators the
following questions:
• Who would you prefer to talk to for a long
conversation?
• If you had to say one of the speakers is
interesting and one is boring, who would you
say is more interesting?
• Which speaker sounds more human?
• Which speaker has more coherent responses
in the conversation?
• If you had to say that one speaker is more
knowledgeable and one is more ignorant,
who is more knowledgeable? (Wizard of
Wikipedia only)
We measure the percentage of time one model
was chosen over the other, taking the majority
agreement between three evaluators. To reduce
variance, dialogs paired in the evaluation were
collected on the same topic for Wizard of Wikipedia
and collected on the same image and personalities
for Engaging ImageChat. Topics and
images selected for evaluation are unique and
taken randomly from the test set.
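The aggregation of pairwise judgements can be sketched as follows. The data layout (one list of three votes per conversation pair, each vote being the string "A" or "B") and the function name are assumptions for illustration only.

```python
from collections import Counter


def preference_for_a(votes_per_pair):
    """Percentage of conversation pairs where model A wins the majority vote.

    `votes_per_pair` holds one list of three votes per pair (hypothetical
    format); with three binary votes a majority always exists.
    """
    wins_a = 0
    for votes in votes_per_pair:
        winner, _count = Counter(votes).most_common(1)[0]
        if winner == "A":
            wins_a += 1
    return 100.0 * wins_a / len(votes_per_pair)


judgements = [
    ["A", "A", "B"],
    ["B", "B", "A"],
    ["A", "A", "A"],
    ["A", "B", "A"],
]
print(preference_for_a(judgements))  # 75.0
```

A value above 50 would indicate that model A is preferred on that question.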
5.3 Baselines
We compare Transformers augmented with KIF to
other existing approaches on Wizard of Wikipedia
and Engaging ImageChat. The best approaches,
judged by human evaluation, are retrieval models,
the Retrieval Transformer Memory Network from
Dinan et al. (2018) and the Retrieval Transformer
from Shuster et al. (2020). These have been
shown to be strong baselines compared with
other retrieval techniques based on TF-IDF (Chen
et al., 2017). Thus, we report the existing retrieval
models for both datasets, but focus on comparing
to other generative baselines.
We compare to three additional generative
baselines. Note that in Wizard of Wikipedia,
the dataset is constructed so that sentences
of Wikipedia knowledge are provided with the
utterances in a concatenated form. Models must
identify the relevant information in this provided
knowledge, or can access more Wikipedia knowledge
beyond the provided sentences. The following
baseline methods always have access to the
information provided in the dataset already, but
no additional Wikipedia knowledge beyond that.
• Transformer Memory Networks. To contrast
KIF with existing work, we compare our
models to published Transformer Memory
Networks (Dinan et al., 2018). These models
encode each piece of external information
independently with a Transformer Encoder,
and the encodings are stored as memory slots.
To access information in the memory slots,
a model performs dot-product attention
between the memory slots and the dialog
context. In Dinan et al. (2018), the knowledge
selection from Wikipedia was supervised with
either (a) a two-stage model where the first
model was trained to predict the right
knowledge and a second model conditions on
the predicted knowledge to generate the next
utterance, or (b) an end-to-end model with an
auxiliary loss for knowledge prediction accuracy.
• Retrieve and Refine. We implement a hybrid
model (Weston et al., 2018) that incorporates
top retrieval candidates as additional input
to Generative Transformer MemNets. In Re-
trieve and Refine, a fixed number of candi-
dates are retrieved and concatenated to the
conversational history in the encoder, making
the input much longer. For both datasets, the
Retrieve and Refine mechanism that fetches
a fixed number of training utterances is added
to the Generative Transformer MemNet with
Reddit Pre-Training baseline.
Unlike the KIF-Augmented Transformer, the
retrieval is conducted with a separate model
so there is no backpropagation to affect the
retrieval. With KIF, models can alter the
retrieved candidates by learning the mapping
operator. Further, a fixed amount of infor-
mation is always retrieved, without the cap-
ability to easily rescale to focus on specific
candidates. KIF modules have weighting
mechanisms to focus more on certain information,
and the modules are combined with
gating so models can learn which knowledge
sources are more important and adjust
flexibly. Lastly, Retrieve and Refine is only
used to retrieve one source of information:
training set utterances.
• Response Generation with MR. We imple-
ment the model proposed in Qin et al. (2019),
which encodes the conversation history and
document contextually with a biLSTM before
generating the next dialog utterance. The
initial model was applied to a machine
reading task where a knowledge document
was provided along with the conversation
history. For Wizard of Wikipedia, we replace
the knowledge document with the Wikipedia
sentences provided in the dataset. The model
then uses the conversation to identify the
most relevant information in the document
using a cross-attention mechanism. For the
Engaging ImageChat dataset, as there is no
document provided with the dataset, we
replace the expected document with the
conversation history, and use the most recent
utterance in the conversation to attend to the
conversation history.
We make an additional improvement to this
baseline: in Qin et al. (2019), the embeddings
used pre-trained CoVE vectors (McCann
et al., 2017). We found our Reddit pre-trained
Transformer embeddings to work
more effectively as they are trained for dialog.
Thus, we replace CoVE embeddings with
domain-specific ones.
All of the Transformer generative baselines are
initialized with the same pre-training on Reddit
that we use for our models, for a fair comparison of
modeling quality.
6 Results
We describe the results of incorporating KIF
modules into Transformer networks. We display
an example conversation between a human and
our model in Figure 4, and show the top scoring
Wikipedia knowledge and Training Utterance
fetched by KIF modules. We compare to various
baselines using automatic and human evaluation,
and discuss our experiments. We present various
ablation settings to understand the key features
that make our method function.
6.1 KIF is Effective for Incorporating
Knowledge
Automatic Evaluation. Comparing KIF-augmented
Transformer networks to published baselines
and Retrieve and Refine, we find improved
results.
For Wizard of Wikipedia, the improvement in
F1 score over the best baseline is around 8 points
(see Table 1). A major contributing factor is the
construction of the dataset: as each dialog turn
is grounded in a specific knowledge sentence
from Wikipedia, improving the ability to identify
the relevant fact strongly improves performance.
Contrasting the results from the seen and unseen
test sets in Table 1, the improvement on unseen is
smaller, as it is harder to fetch training utterances
for unseen topics.
While ImageChat has no explicit dependency
on knowledge, we still see a 2 point improvement
compared to the Generative Transformer
MemNet (with the additional Reddit pre-training),
indicating that KIF can be generally useful (see
Table 2). Compared to an even stronger baseline
that we tune in this work, Retrieve and Refine, we
see a 1 point improvement.
Model | Test F1 (Seen) | Test F1 (Unseen)
Retrieval Baselines
Retrieval Transformer MemNet (Dinan et al., 2018) | 15.4 | 12.4
Generative Baselines
2-Stage Generative MemNet (Dinan et al., 2018) | 18.9 | 17.4
Generative Transformer MemNet (Dinan et al., 2018) | 16.9 | 14.4
+ Reddit Pre-Training | 17.6 | 16.3
Retrieve and Refine (Weston et al., 2018) | 18.2 | 17.9
Response Generation with MR (Qin et al., 2019) | 17.5 | 16.8
KIF-Augmented Transformer | 25.9 | 22.3
Table 1: Results on the Wizard of Wikipedia dataset. We implement the Retrieve and Refine
and Response Generation with MR approaches, all with Reddit Pre-Training, and evaluate them on
Wizard of Wikipedia. The Seen test set consists of conversations on topics seen at training time, and
the Unseen test set consists of conversations about new topics that were not in the training set.
Model | Test F1
Retrieval Baselines
Retrieval Transformer (Shuster et al., 2020) | 9.8¹
Generative Baselines
Generative Transformer MemNet (Dinan et al., 2018) | 7.1
+ Reddit Pre-Training | 12.8
Retrieve and Refine (Weston et al., 2018) | 13.6
Response Generation with MR (Qin et al., 2019) | 13.2
KIF-Augmented Transformer | 14.4
Table 2: Results on the Engaging ImageChat dataset. We implement the Generative Transformer
Memory Network, Retrieve and Refine, and Response Generation with MR approaches, all with
Reddit Pre-Training, and evaluate them on Engaging ImageChat.
¹ In Shuster et al. (2020), retrieval Transformer models
report Hits@N using a fixed candidate set of 99 distractor
candidates and 1 true candidate. We compute F1 using their
open-sourced model by scoring the entire training set of over
350K utterances with the model and taking the top scoring
candidate as the response.
Human Evaluation. Results are shown in
Figure 2. On both datasets, we find there is a large
improvement over existing generative models
(green bars) that is statistically significant for some
of the evaluation questions. Evaluators agree that
KIF-augmented Transformers are generally more
coherent and human-sounding compared to the
Generative MemNet.
Comparison with existing retrieval models
(shown in blue) is more nuanced. Along the
lines of existing work (Zhang et al., 2018; Dinan
et al., 2018), we find that retrieval-based models
score very well in human evaluations that ask
how human or interesting a dialog sounds. This
is because retrieval models return human-written
utterances from the training set and do not suffer
from decoding mistakes present in generative
models. For example, on Engaging ImageChat,
while our model has significantly improved over
the generative baseline (see green bars in Figure 2,
right), it does not beat retrieval-based methods in
sounding more human or being more interesting
(see blue bars in Figure 2, right). As the Retrieval
baseline returns human-written text for other
humans to evaluate, we hypothesize that humans
score each other’s writing quite well. Compared
with generative models, which we focus on
improving, retrieval models often produce longer
text with more interesting, nuanced vocabulary
usage, and do not make generation mistakes
such as repetition. These factors often lead to
the stronger performance of retrieval models.
Figure 2: Human Evaluation Results on Both Datasets. More than 50% indicates the KNN Model is
preferred. Stars indicate statistical significance at p < 0.05.
A surprising result is that KIF-augmented
Transformers are more human-sounding than
retrieval models on Wizard of Wikipedia. This
is because the dataset’s utterances are long and
factual due to the tendency of crowdworkers
to copy Wikipedia. Sometimes humans chatting
with the retrieval bot would respond "uh... that’s
an interesting fact?" Otherwise, our model
scores similarly to retrieval models, with most
evaluations not showing a statistically significant
difference.
We conduct a second evaluation on the Unseen
Test Set of the Wizard of Wikipedia dataset.
Results are shown in Figure 3. Trends are similar
compared to the results on the Seen Test set,
though the preference for the KIF-augmented
Transformer is greater over the retrieval baseline.
We hypothesize that because the Unseen Test Set
is on entirely held out topics, the retrieval baseline
can struggle to identify relevant utterances. In
contrast, the KIF-augmented Transformer, similar
to the generative baseline from Dinan et al. (2018),
can use the generative capability to produce
utterances.
Lastly, we conduct an additional study to
examine the variance of the comparative dialog
judgements. The evaluation study for Wizard of
Wikipedia is repeated three times on different
days, and evaluators who have answered on
previous days are not allowed to evaluate again
in any subsequent experiments. Overall, we
find reasonable interannotator agreement rates,
around 73% averaged across all evaluations,
which is similar to the agreement rates reported
in Li et al. (2019). We find there is greater
variance on questions asking which dialog is
more human and more interesting, most likely as
different evaluators can interpret these in different
ways. Further, we see that comparison with
the Retrieval model has less variance compared
to the Generative model, possibly because the
Retrieval model’s human-written text is devoid of
mistakes. Overall, we find that the conclusions
(and statistical significance) are stable across
multiple evaluations.
Figure 3: Human Evaluation on the Unseen
Test Set of Wizard of Wikipedia. More than
50% indicates the KNN Model is preferred. Stars
indicate statistical significance at p < 0.05.
6.2 Analysis of Fetched Knowledge
Example conversations from our KIF-augmented
generative model are shown in Figure 4 on
Wizard of Wikipedia. We find that relevant
knowledge is identified that affects the content
of the generated utterance. For example, the
model finds knowledge sentences about Disney
movies as the human conversationalist starts
the conversation discussing Disney. The model
leverages the fetched knowledge to write the
content of the generated utterance. In a concrete
example, the fetched sentence "disney announced
intentions [...] after the success of the incredibles"
leads the model to generate the utterance "i love the
incredibles, they are my favorite disney movie."
In contrast, the model often uses the form of the
fetched training utterance as a template for
writing a response. For example, the model copies
the training utterance "Ohhh . . . what do people
with color blindness do to cope with the effects?"
and starts the model generation with "Ohhh ..." and
continues with the question "i think toy story is a
classic?", following the form of the selected training
utterance.
Figure 4: Conversation between Human and KIF-Augmented Transformer on Wizard of
Wikipedia. The top-scoring Wikipedia knowledge and training utterances fetched by KIF are displayed
with model output.
Figure 5 displays the top-3 fetched training
set utterances and knowledge sentences on the
Wizard of Wikipedia dataset when responding
to a human utterance. KIF modules can identify
multiple relevant items. In response to the human
question about blue skies the 1946 movie, the model
identifies both the comedy film and the band.
Finally, the elements retrieved by KIF modules
provide a more interpretable understanding of
what the model is conditioning upon to generate
a dialog response. In Table 3, we show, for the
same dialog history, the effect of replacing the
model’s fetched training utterance and knowledge
sentence with our own examples. The model heavily
incorporates our manual changes of the fetched
information into
the generated utterance. For example, changing
the knowledge directly affects what the model
generates as the favorite character—from buzz
lightyear to mr potato head to slinky dog—while
changing the fetched training utterance changes
the form of the generated sentence.
6.3 Scaling KIF to Challenging
Retrieval Settings
KIF modules can be used in more realistic and
challenging settings for knowledge retrieval that
test the scalability of the module. In Figure 6(a),
we compare the Generative Transformer MemNet
Baseline with KIF-Augmented Transformers in
three settings. The first is the standard Wikipedia
sentences provided by the dataset (on average,
34 sentences). Then, we extend to providing
the model with the full Wikipedia article (on
average, 57 sentences) and finally to multiple
Wikipedia articles (on average, totaling 205
sentences), identified using the conversation’s
topic. This increasing size of available knowledge
could be realistic for settings where it
is unclear what information is most relevant,
if filtering steps to preprocess the data remove
potentially relevant information, or if information
synthesis from multiple knowledge sources is
necessary to produce a high-quality generation.
As the Wikipedia knowledge becomes more
difficult to identify, performance decreases, but
KIF still outperforms the baseline that uses the
dataset-provided set of 34 sentences.
Comparing the scaling capability of KIF to the
standard Generative Transformer MemNet Baseline
highlights the advantage of using KNN. The
attention-based mechanism used in Dinan et al.
(2018) struggles to identify salient information
when given increasingly larger quantities of
knowledge, unlike the KNN information fetch. We
hypothesize the attention mechanism is challenged
by softmax-ing over a larger quantity of inputs, as
it can be difficult to make sharp distinctions.
Figure 5: Examples of Top-3 Fetched Training Utterances and Fetched Knowledge when responding
to a human chat from the dataset using a trained Wizard of Wikipedia model. Examples are taken from
validation.
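The hypothesis in Section 6.3, that a softmax over many inputs struggles to make sharp distinctions, can be illustrated with a toy calculation. The scores below (2.0 for one relevant sentence, 0.0 for every distractor) and the choice of k = 5 are invented for illustration and are not measurements from our models. The sketch compares the weight placed on the relevant sentence when normalizing over the whole pool against normalizing over only the top-k candidates.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def full_attention_weight(num_distractors, relevant=2.0, distractor=0.0):
    """Weight on the relevant item when softmax covers every candidate."""
    scores = [relevant] + [distractor] * num_distractors
    return softmax(scores)[0]


def top_k_weight(num_distractors, k=5, relevant=2.0, distractor=0.0):
    """Weight on the relevant item when only the k best scores are kept."""
    scores = [relevant] + [distractor] * num_distractors
    kept = sorted(scores, reverse=True)[:k]
    return softmax(kept)[0]


# Pool sizes mirror the three knowledge settings (34, 57, 205 sentences).
for pool_size in (34, 57, 205):
    distractors = pool_size - 1
    print(pool_size,
          round(full_attention_weight(distractors), 3),
          round(top_k_weight(distractors), 3))
```

Under these toy assumptions the full-softmax weight shrinks as the pool grows, while the top-k weight does not depend on pool size.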
6.4 Ablations
Importance of Multiple Knowledge Sources.
One benefit of the KIF module approach is
that several modules can be combined, each
capturing information from a different source. In
both settings, Wizard of Wikipedia and Engaging
ImageChat, two modules were used to incorporate
multiple forms of knowledge—training utterances
to capture the capability of a retrieval-based model
and knowledge from Wikipedia or related chats
based on image features. We perform here an
ablation study to evaluate the impact of using
only one source of information. As can be seen
in Table 4, performance decreases when only one
source of information is used.
For Engaging ImageChat, this study also
underlines the importance of being able to fetch
in a multimodal fashion. The general form
of the KIF module, which requires only a feature
vector to find nearest neighbors from, allows
fetching on multiple modalities such as text and
images. In Table 4, using the Image-based KIF
to fetch text from Related Images is important
to reach the strongest performance (compare
Training Utterances Only, which uses text-based KIF,
with using both Training Utterances and Related
Images).
Using Dialog Features for KNN Performance.
The quality of the KNN search is critical to the
performance of KIF modules. As the external
knowledge is kept fixed, KIF must be able to align
the dialog context with the knowledge to identify
relevant pieces of information. In Table 5, we
show that matching on more features can improve
the quality of the retrieved information. Using only
the encoding of the immediate previous utterance
can improve results on Wizard of Wikipedia by
7 F1 points, but this is further improved by
also leveraging the encoding of context (+1.8
F1) and using the dialog turn number (+1 F1).
These features are available in the datasets, and
we leverage them to improve the relatedness of
retrieved knowledge.
Knowledge | Training Utterance | Generation
Top section (fixed training utterance, varying knowledge)
buzz lightyear’s name is in honor of astronaut edwin ‘buzz’ aldrin | my favorite character in that book series is hermione granger | cool! my favorite character in that movie is buzz lightyear
mr potato head is based on the real-life mr. potato head toy | my favorite character in that book series is hermione granger | my favorite character in that movie is real-life mr potato head
slinky dog is a toy dachschund with a metal slinky for a body | my favorite character in that book series is hermione granger | cool! my favorite character is the slinky dog
Bottom section (fixed knowledge, varying training utterance)
slinky dog is a toy dachschund with a metal slinky for a body | i really like the character hermione granger | cool! i really like slinky dog
slinky dog is a toy dachschund with a metal slinky for a body | my favorite character of all time has to be hermione granger | i love that movie, my favorite character has to be slinky dog the dachshund
slinky dog is a toy dachschund with a metal slinky for a body | i agree with you! that’s my favorite character as well | i think so too! my favorite is slinky
Table 3: Effect of Fetched Information on Generated Utterances. The top section provides examples
for a fixed training utterance, changing the knowledge: the generated text maintains the construction
of the training utterance but changes the favorite character to match the knowledge. The bottom section
provides examples for fixed knowledge but changing the training utterance: the generated text modifies
its form to match the training utterance, but the favorite character information remains consistent.
Figure 6: Ablations on Wizard of Wikipedia. (a) KIF can scale to hundreds of relevant sentences (blue)
while the baseline model, the Generative Transformer MemNet (gray), scales poorly. (b) Gating can
remove irrelevant information. In the 3 Sources case, one source of external information is unrelated.
(c) Performance as k varies.
Multi-Hop Retrieval with KIF. Work in memory
networks (Weston et al., 2015; Sukhbaatar
et al., 2015) utilized multi-hop mechanisms. Such
capacity could be useful when multiple sources are
necessary or information is incrementally fetched.
To emulate multi-hop memory mechanisms, we
use KIF to retrieve relevant information for
N = 2 or N = 3 fixed hops. As the number
of hops is fixed, the multi-hop operation remains
differentiable. We do not allow the model to
retrieve the same information in a second hop.
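A fixed-hop fetch that never returns the same item twice can be sketched as below. This is a simplified illustration: the dot-product scoring, the reuse of one unchanged query across hops, and the name `multi_hop_fetch` are assumptions, and the sketch omits the learned mapping that a KIF module applies before the search.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def multi_hop_fetch(query, knowledge, hops=2, k=1):
    """Fetch k items per hop, excluding anything fetched in earlier hops.

    `knowledge` is a list of fixed vectors; the returned list holds the
    indices of fetched items in the order they were retrieved.
    """
    fetched = []
    used = set()
    for _ in range(hops):
        # Score only the items that have not been retrieved yet.
        candidates = [(dot(query, vec), idx)
                      for idx, vec in enumerate(knowledge)
                      if idx not in used]
        candidates.sort(reverse=True)
        for _score, idx in candidates[:k]:
            used.add(idx)
            fetched.append(idx)
    return fetched


knowledge = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(multi_hop_fetch([1.0, 0.0], knowledge, hops=2, k=1))  # [0, 1]
```

The representations at the fetched indices would then be concatenated before being passed on, as in the first multi-hop setting.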
We experimented in two settings. First, the
same KIF module is used multiple times to fetch
different information, and then all of the fetched
knowledge is concatenated. Results are shown
in Table 6 (top). Second, we examine spreading
the fetches into different KIF modules at various
encoder depths. This could be interpreted as the
model learning to access more information at each
layer. As the model progresses deeper, more
abstract and high-level representations are built,
which could allow different knowledge to be
retrieved. Results are shown in Table 6 (bottom).
Model | Test F1
Wizard of Wikipedia
Training Utterances Only | 18.1
Wiki Knowledge Only | 23.9
Training Utterances and Wiki Knowledge | 25.9
Engaging ImageChat
Training Utterances Only | 13.9
Related Images Only | 13.8
Training Utterances and Related Images | 14.4
Table 4: Using Multiple KIF Modules on Multiple
Sources is important for improved performance.
Model | Valid F1
Wizard of Wikipedia
Previous Utterance Only | 24.6
+ dialog Context | 26.4
+ Turn Embedding | 27.4
Engaging ImageChat
Previous Utterance Only | 13.3
+ dialog Context | 14.5
+ Turn Embedding + Personality | 15.1
Table 5: Important Features for KNN Search
using KIF. Salient conversation features
improve performance on both datasets.
Model | Valid F1
KIF-Augmented Transformer | 27.4
One KIF Module fetches multiple times
2 Fetches | 26.9
3 Fetches | 26.0
Multiple KIF Modules fetch once each
2 Fetches | 26.5
3 Fetches | 25.9
Table 6: Multi-hop with KIF to retrieve
information with multiple fetch steps.
In both multi-hop settings, no improvement in
performance on the Wizard of Wikipedia dataset
is observed. We hypothesize that this can be
partially attributed to the construction of the
dataset—as humans explicitly based their written
dialog utterance on one knowledge sentence.
Further, it is possible that concatenation brings
together too much information for the model to
incorporate, and thus adding additional fetches
makes the retrieval more noisy.
Effect of Gating. We analyze the effect of the
gating mechanism by evaluating the capability of
the gate to identify and focus on salient infor-
mation. On Wizard of Wikipedia, we concatenate
a third source of information: dialog turns from
a completely different corpus called PersonaChat
(Zhang et al., 2018). This dataset looks quite
different (short utterances without factual
knowledge) and should be easy for the model
to identify as distinct from Wizard of Wikipedia.
As shown in Figure 6(b), if KIF on PersonaChat is
included without gating, it has a harmful effect as
the model includes irrelevant information. When
equipped with gating, the model learns to use
the gate to ignore some inputs, and can recover
almost the full performance of a model without
this irrelevant information source.
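The effect of gating can be sketched with a per-source sigmoid gate. This is a minimal illustration under assumptions: the gate logits are set by hand here, whereas in a trained model they would be produced by learned parameters, and the function `gated_sum` is a hypothetical name rather than part of our implementation.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_sum(source_vectors, gate_logits):
    """Scale each source's vector by its gate, then sum across sources."""
    gates = [sigmoid(logit) for logit in gate_logits]
    dim = len(source_vectors[0])
    combined = [0.0] * dim
    for gate, vec in zip(gates, source_vectors):
        for i, value in enumerate(vec):
            combined[i] += gate * value
    return combined, gates


# Two useful sources plus one unrelated source with large values.
sources = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
combined, gates = gated_sum(sources, gate_logits=[10.0, 10.0, -10.0])
print([round(g, 4) for g in gates])
print([round(c, 3) for c in combined])
```

With a strongly negative logit on the third source, its contribution to the combined vector is close to zero, which mirrors the behavior described for the unrelated PersonaChat source.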
Size of K in KNN. Figure 6(c) shows the
performance on Wizard of Wikipedia when
varying the amount of knowledge. Being able to
access multiple relevant pieces of information is
helpful, but too much information can be harmful.
This is likely because the weighted sum becomes
blurry if too many sentences are incorporated.
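The blurring effect of a large k can be seen in a small sketch of a KNN read followed by a weighted sum. The dot-product scoring, the softmax normalization of the top-k scores, and the toy keys and values are assumptions for illustration; they are not the trained representations.

```python
import math


def knn_read(query, keys, values, k):
    """Return the weighted sum of the k best-matching values.

    Also returns the selected indices and their normalized weights.
    """
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only.
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, top))
              for d in range(dim)]
    return output, top, weights


keys = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [-1.0, 0.0]]
values = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
for k in (1, 2, 4):
    _output, _top, weights = knn_read([1.0, 0.0], keys, values, k)
    print(k, round(max(weights), 3))
```

In this toy setup the largest weight falls as k grows, so the best-matching item contributes less to the summed representation.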
7 Conclusion
We present a KNN-based Information Fetching
module that learns to identify relevant information
from external knowledge sources by learning a
mapping-based read operation. KIF modules ben-
efit from the scalability and efficiency of KNN
search, enabling computation with large external
memories. We show in the context of two dialog
datasets that relevant knowledge can be identi-
fied and incorporated to create more engaging,
high-quality dialog.
Acknowledgments
We thank the reviewers and action editor for
their comments and insightful discussion. We
thank Emily Dinan and Kurt Shuster for provid-
ing assistance to reproduce their original works.
References
Antoine Bordes, Y-Lan Boureau, and Jason
Weston. 2017. Learning end-to-end goal-oriented
dialog. In 5th International Conference
on Learning Representations, ICLR 2017,
Toulon, France, April 24-26, 2017, Conference
Track Proceedings.
Deng Cai, Yan Wang, Wei Bi, Zhaopeng Tu,
Xiao-jiang Liu, and Shuming Shi. 2019.
Retrieval-guided dialogue response generation
via a matching-to-generation framework.
In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 1866–1875. DOI:
https://doi.org/10.18653/v1/D19-1195
Sarath Chandar, Sungjin Ahn, Hugo Larochelle,
Pascal Vincent, Gerald Tesauro, and Yoshua
Bengio. 2016. Hierarchical memory networks.
CoRR, abs/1605.07427.
Danqi Chen, Adam Fisch, Jason Weston, and
Antoine Bordes. 2017. Reading Wikipedia to
answer open-domain questions. In Proceedings
of the 55th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long
Papers), pages 1870–1879. DOI:
https://doi.org/10.18653/v1/P17-1171, PMCID:
PMC5579958
Wenlin Chen, David Grangier, and Michael Auli.
2016. Strategies for training large vocabulary
neural language models. In Proceedings of
the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1:
Long Papers), pages 1975–1985. DOI:
https://doi.org/10.18653/v1/P16-1186
Emily Dinan, Stephen Roller, Kurt Shuster,
Angela Fan, Michael Auli, and Jason Weston.
2018. Wizard of Wikipedia: Knowledge-
powered conversational agents. In International
Conference on Learning Representations.
Angela Fan, Claire Gardent, Chloé Braud, and
Antoine Bordes. 2019. Using local knowledge
graph construction to scale seq2seq models
to multi-document inputs. In Proceedings of
the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th
International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP),
pages 4177–4187.
Angela Fan, David Grangier, and Michael Auli.
2018a. Controllable abstractive summarization.
In Proceedings of
the 2nd Workshop on
Neural Machine Translation and Generation,
pages 45–54.
Angela Fan, Mike Lewis, and Yann Dauphin.
2018b. Hierarchical neural story generation. In
Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 889–898.
Edouard Grave, Armand Joulin, Moustapha Cissé,
David Grangier, and Hervé Jégou. 2017a.
Efficient softmax approximation for GPUs.
In Proceedings of the 34th International
Conference on Machine Learning-Volume 70,
pages 1302–1310.
Edouard Grave, Armand Joulin, and Nicolas
Usunier. 2017b. Improving neural language
models with a continuous cache. In 5th
International Conference on Learning
Representations, ICLR 2017, Toulon, France, April
24-26, 2017, Conference Track Proceedings.
Alex Graves, Greg Wayne, and Ivo Danihelka.
2014. Neural Turing machines. arXiv preprint
arXiv:1410.5401.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong
Pasupat, and Ming-Wei Chang. 2020. Retrieval
augmented language model pre-training. In
Proceedings of the International Conference
on Machine Learning, pages 5695–5704.
Samuel Humeau, Kurt Shuster, Marie-Anne
Lachaux, and Jason Weston. 2019. Poly-
encoders: Architectures and pre-training strate-
gies for fast and accurate multi-sentence scoring.
In International Conference on Learning
Representations.
Jeff Johnson, Matthijs Douze, and Hervé Jégou.
2019. Billion-scale similarity search with
GPUs. IEEE Transactions on Big Data. DOI:
https://doi.org/10.1109/TBDATA.2019.2921572
Armand Joulin and Tomas Mikolov. 2015.
Inferring algorithmic patterns with
stack-augmented recurrent nets. In Advances
in Neural Information Processing Systems,
pages 190–198.
Urvashi Khandelwal, Omer Levy, Dan Jurafsky,
Luke Zettlemoyer, and Mike Lewis. 2019.
Generalization through memorization: Nearest
neighbor language models. In International
Conference on Learning Representations.
Diederik P. Kingma and Jimmy Ba. 2015. Adam:
A method for stochastic optimization. In
3rd International Conference on Learning
Representations, ICLR 2015, San Diego, CA,
USA, May 7-9, 2015, Conference Track
Proceedings.
Guillaume Lample, Alexandre Sablayrolles,
Marc’Aurelio Ranzato, Ludovic Denoyer,
and Hervé Jégou. 2019. Large memory
layers with product keys. In Advances
in Neural Information Processing Systems,
pages 8548–8559.
Margaret Li, Jason Weston, and Stephen
Roller. 2019. ACUTE-EVAL: Improved
dialogue evaluation with optimized questions
and multi-turn comparisons. arXiv preprint
arXiv:1909.03087.
Rongzhong Lian, Min Xie, Fan Wang, Jinhua
Peng, and Hua Wu. 2019. Learning to select
knowledge for response generation in dialog
systems. In Proceedings of the 28th Interna-
tional Joint Conference on Artificial Intel-
ligence, pages 5081–5087. AAAI Press.
Dhruv Mahajan, Ross Girshick, Vignesh
Ramanathan, Kaiming He, Manohar Paluri,
Yixuan Li, Ashwin Bharambe, and Laurens
van der Maaten. 2018. Exploring the limits of
weakly supervised pretraining. In Proceedings
of the European Conference on Computer
Vision (ECCV), pages 181–196. DOI:
https://doi.org/10.1007/978-3-030-01216-8_12
Bryan McCann,
James Bradbury, Caiming
Xiong, and Richard Socher. 2017. Learned
in translation: Contextualized word vectors.
In Advances in Neural Information Processing
Systems, pages 6294–6305.
Alexander Miller, Will Feng, Dhruv Batra,
Antoine Bordes, Adam Fisch, Jiasen Lu, Devi
Parikh, and Jason Weston. 2017. parl.ai: A
dialog research software platform. pages 79–84.
https://arxiv.org/abs/1705.06476,
DOI: https://doi.org/10.18653/v1/D17-2014
Andriy Mnih and Geoffrey Hinton. 2009.
A scalable hierarchical distributed language
model. In Advances in Neural Information
Processing Systems, pages 1081–1088.
Fabio Petroni, Tim Rocktäschel, Sebastian
Riedel, Patrick Lewis, Anton Bakhtin,
Yuxiang Wu, and Alexander Miller. 2019.
Language models as knowledge bases? In
Proceedings of the 2019 Conference on
Empirical Methods in Natural Language
Processing and the 9th International Joint
Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 2463–2473. DOI:
https://doi.org/10.18653/v1/D19-1250
Tobias Plötz and Stefan Roth. 2018. Neural
nearest neighbors networks. In Advances
in Neural Information Processing Systems,
pages 1087–1098.
Lianhui Qin, Michel Galley, Chris Brockett,
Xiaodong Liu, Xiang Gao, Bill Dolan, Yejin
Choi, and Jianfeng Gao. 2019. Conversing by
reading: Contentful neural conversation with
on-demand machine reading. In Proceedings of
the 57th Annual Meeting of the Association for
Computational Linguistics, pages 5427–5436.
Jack Rae, Jonathan J. Hunt, Ivo Danihelka,
Timothy Harley, Andrew W. Senior, Gregory
Wayne, Alex Graves, and Timothy Lillicrap.
2016. Scaling memory-augmented neural
networks with sparse reads and writes. In
Advances in Neural Information Processing
Systems, pages 3621–3629.
Rico Sennrich, Barry Haddow, and Alexandra
Birch. 2016. Neural machine translation of rare
words with subword units. In Proceedings of
the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 1:
Long Papers), pages 1715–1725. DOI:
https://doi.org/10.18653/v1/P16-1162
Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski,
Ankur Parikh, Ali Farhadi, and Hannaneh
Hajishirzi. 2019. Real-time open-domain
question answering with dense-sparse phrase
index. In Proceedings of the 57th Annual
Meeting of the Association for Computational
Linguistics, pages 4430–4441.
Iulian V. Serban, Ryan Lowe, Laurent Charlin, and
Joelle Pineau. 2016a. Generative deep neural
networks for dialogue: A short review. arXiv
preprint arXiv:1611.06216.
Iulian V. Serban, Alessandro Sordoni, Yoshua
Bengio, Aaron Courville, and Joelle Pineau.
2016b. Building end-to-end dialogue systems
using generative hierarchical neural network
models. In Thirtieth AAAI Conference on
Artificial Intelligence.
Kurt Shuster, Samuel Humeau, Antoine Bordes,
and Jason Weston. 2020. Image-chat: Engaging
grounded conversations. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, pages 2414–2429.
DOI: https://doi.org/10.18653/v1/2020.acl-main.219
Kurt Shuster, Samuel Humeau, Hexiang Hu,
Antoine Bordes, and Jason Weston. 2019.
Engaging image captioning via personality.
In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition,
pages 12516–12526. DOI:
https://doi.org/10.1109/CVPR.2019.01280
Haoyu Song, Yan Wang, Wei-Nan Zhang,
Xiaojiang Liu, and Ting Liu. 2020. Generate,
delete and rewrite: A three-stage framework for
improving persona consistency of dialogue
generation. arXiv preprint arXiv:2004.07672.
DOI: https://doi.org/10.18653/v1/2020.acl-main.516,
PMID: 32249355
Yiping Song, Rui Yan, Xiang Li, Dongyan
Zhao, and Ming Zhang. 2016. Two are
better than one: An ensemble of retrieval- and
generation-based dialog systems. arXiv preprint
arXiv:1610.07149.
Sainbayar Sukhbaatar, Edouard Grave, Guillaume
Lample, Herve Jegou, and Armand Joulin.
2019. Augmenting self-attention with persistent
memory. https://arxiv.org/abs/1907.01470
Sainbayar Sukhbaatar, Jason Weston, Rob Fergus,
et al. 2015. End-to-end memory networks. In
Advances in Neural Information Processing
Systems, pages 2440–2448.
Bart Thomee, David A Shamma, Gerald
Friedland, Benjamin Elizalde, Karl Ni,
Douglas Poland, Damian Borth, and Li-Jia
Li. 2016. YFCC100M: The new data in multi-
media research. Communications of the ACM,
59(2):64–73. DOI:
https://doi.org/10.1145/2812802
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In Advances
in Neural Information Processing Systems,
pages 5998–6008.
Jason Weston, Sumit Chopra, and Antoine Bordes.
2015. Memory networks. In 3rd International
Conference on Learning Representations, ICLR
2015, San Diego, CA, USA, May 7-9, 2015,
Conference Track Proceedings.
Jason Weston, Emily Dinan, and Alexander
Miller. 2018. Retrieve and refine: Improved
sequence generation models for dialogue.
In Proceedings of the 2018 EMNLP Workshop
SCAI: The 2nd International Workshop on
Search-Oriented Conversational AI,
pages 87–92. DOI:
https://doi.org/10.18653/v1/W18-5713
Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen
Tu, and Kaiming He. 2017. Aggregated residual
transformations for deep neural networks.
In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition,
pages 1492–1500.
Saizheng Zhang, Emily Dinan, Jack Urbanek,
Arthur Szlam, Douwe Kiela, and Jason Weston.
2018. Personalizing dialogue agents: I have a
dog, do you have pets too? In Proceedings of
the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1:
Long Papers), pages 2204–2213. DOI:
https://doi.org/10.18653/v1/P18-1205
Yutao Zhu, Zhicheng Dou, Jian-Yun Nie, and
Ji-Rong Wen. 2020. ReBoost: A
retrieval-boosted sequence-to-sequence model
for neural response generation. Information
Retrieval Journal, 23(1):27–48. DOI:
https://doi.org/10.1007/s10791-019-09364-x