Augmenting Transformers with KNN-Based
Composite Memory for Dialog
Angela Fan
Facebook AI Research
Université de Lorraine
LORIA
angelafan@fb.com
Claire Gardent
CNRS/LORIA
claire.gardent@loria.fr
Chloé Braud
CNRS/IRIT
chloe.braud@irit.fr
Antoine Bordes
Facebook AI Research
abordes@fb.com
Abstract

Various machine learning tasks can benefit from access to external information of different modalities, such as text and images. Recent work has focused on learning architectures with large memories capable of storing this knowledge. We propose augmenting generative Transformer neural networks with KNN-based Information Fetching (KIF) modules. Each KIF module learns a read operation to access fixed external knowledge. We apply these modules to generative dialog modeling, a challenging task where information must be flexibly retrieved and incorporated to maintain the topic and flow of conversation. We demonstrate the effectiveness of our approach by identifying relevant knowledge required for knowledgeable but engaging dialog from Wikipedia, images, and human-written dialog utterances, and show that leveraging this retrieved information improves model performance, measured by automatic and human evaluation.
1 Introduction

Machine learning approaches to various tasks, such as game-playing or dialog, are often dependent on external information. This information can take multimodal forms, including structured knowledge bases, free text, and images, and also comes in overwhelmingly large quantities. A pressing challenge is to create models that can identify which specific elements of multiple information sources are relevant in a particular context, and incorporate them into standard architectures for each task. In this work, we focus on human–machine dialog and how to efficiently retrieve external knowledge that is relevant to the dialog. We consider two scenarios and, for each scenario, retrieve two types of knowledge: (i) knowledge about similar dialog contexts and (ii) external knowledge used to ground the conversation in real world information.
Knowledge about similar dialog contexts allows
for a hybrid retrieval/generative approach to dialog
where the system response is generated based not
only on a representation of the current dialog
context and of the relevant world knowledge,
but also based on a response retrieved from a
similar dialog context. The retrieved knowledge
can be viewed as providing information about
structure and dialog sentences, or utterances:
which response is likely given a similar context?
External knowledge is also retrieved to improve the semantic content of the dialog model. In one scenario, Wizard of Wikipedia (Dinan et al., 2018), general topics are provided to crowdworkers, who are asked to have in-depth and specific conversations about these topics by referencing specific Wikipedia sentences as knowledge. In this scenario, external knowledge is retrieved from a pre-selected set of Wikipedia sentences associated with the current dialog topic. Retrieval aims to select the sentence that is most relevant at each step of the dialog and thereby to ground system responses in relevant world knowledge (e.g., by referring to Star Wars when talking about science fiction).
In the other scenario, Engaging ImageChat
(Shuster et al., 2020), crowdworkers are provided
with images and asked to have a conversation
inspired by or about the image. In this case, the retrieved external knowledge is images and their associated dialogs. By retrieving images that are similar to the image being talked about, we aim to enrich system responses with knowledge about what is typically mentioned when describing similar images (e.g., when talking about an image with dogs, mentioning their breed).
Our work on incorporating different types and
modalities of knowledge is related to methods that
strive to add external memory, such as knowledge
bases, to neural networks. Previous work has ex-
plored incorporating large external memories into
neural network layers (Weston et al., 2015;
Sukhbaatar et al., 2015, 2019; Lample et al., 2019).
Many existing approaches focus on using attention over the memory slots, which is computationally intensive and becomes less effective as the size of the memory grows. In this work, we propose representing multiple sources of external information as fixed encodings and using K Nearest Neighbors (KNN) search to fetch relevant information. KNN search is computationally efficient and scalable, and libraries like faiss (Johnson et al., 2019) allow KNN to be easily used on GPUs and integrated into neural networks. Further, the external memories are pre-encoded, so the information encoding is only computed once. As the external memories are kept fixed, they do not require any training to learn the memories along with the model. We can thus scale easily to larger memories by learning only the KNN-based read operation to identify relevant information from the memory.
Our core contribution proposes an efficient, KNN-based Information Fetching (KIF) module that can access relevant external knowledge, combine knowledge from different sources, and integrate this information into standard sequence to sequence architectures. We apply these flexible modules to two dialog datasets that challenge generative models to leverage external information to write coherent, on-topic responses. Both of our chosen tasks require models to leverage external information, such as information from Wikipedia or images, to engage in the conversation. We show that relevant information can be identified from hundreds of thousands of candidates in a multimodal, multi-knowledge-source setting to improve the performance of generative dialog models. Further, the output of the KIF modules is interpretable, as specific human-readable knowledge elements are selected, allowing users to better understand the information the generative model conditions upon when writing the subsequent utterance. On both datasets, we achieve state-of-the-art results compared to generative models and find there is no statistically significant difference in the interestingness or human preference of our model output compared to state-of-the-art retrieval models.
2 Related Work
We discuss related work on learning to incorporate
external knowledge into neural networks and
efficiently access relevant information. We then
describe work in generative dialog that incor-
porates knowledge.
2.1 Incorporating External Knowledge
Augmenting neural networks with memory, or longer-term components that can be accessed with read and write operations, has been explored in various proposed architectures. For example, Memory Networks (Weston et al., 2015; Sukhbaatar et al., 2015, 2019) introduce attention mechanisms over large external memories. Neural cache models (Grave et al., 2017b) simplify these to access previous memories with a dot product.
Previous work has also studied how to read and write into these memory architectures (Rae et al., 2016; Graves et al., 2014; Joulin and Mikolov, 2015). In contrast, we focus on how to read large memories.
Another line of research has focused on computational scalability for larger external memories to allow efficient access of information. For example, Chandar et al. (2016) propose a
hierarchical memory network rather than a flat
one and Rae et al. (2016) learn sparse operations
to read and write. Lample et al. (2019) focus
on learning memories of up to one million
slots and how to efficiently access the slots
using product keys. Khandelwal et al. (2019)
use nearest neighbor operations to augment
language models by performing retrieval at the
token level—in contrast, we focus on multimodal
retrieval of multiple pieces of knowledge based
on an entire dialog context. Beyond explicit
memory representations, it may be possible to
store information implicitly during training time
by memorizing common patterns present in text
(Petroni et al., 2019). We focus on learning
to fetch relevant information from multiple explicit external multimodal knowledge sources and integrate them into one network. Further, our work allows the retrieved information to be interpreted, as each memory slot is an explicit fact that can be read as text, rather than a learned vector such as in Lample et al. (2019).
Work has also focused on computationally efficient softmax operations (Mnih and Hinton, 2009; Grave et al., 2017a; Chen et al., 2016).
Many approximate softmax techniques use KNN-
like operations to form clusters, and the overall
softmax operation is constrained by the slow
calculation of the exponential. Our usage of KNN
benefits from efficient and scalable libraries such
as faiss and nmslib.
2.2 Generative Dialog
We develop a general architecture for incorpo-
rating external information and apply it to the
case of generative dialog models. Previous work
in dialog has leveraged knowledge as necessary
information to accomplish the task. For example, airline and restaurant booking tasks often use API calls to access information about reservation times and availability (Bordes et al., 2017). In contrast, our work focuses on how to incorporate unstructured knowledge, such as free text found on the Web. Previous work has used architectures that attend over the available knowledge and identify relevant pieces of information, which scales poorly with large quantities of information (Dinan et al., 2018; Qin et al., 2019; Lian et al., 2019). We replace the use of attention over external information with the output of a KNN module.
incorporating information retrieval in language
modeling and question answering (Chen et al.,
2017; Fan et al., 2019; Seo et al., 2019; Guu et al.,
2020), while we focus on dialog applications and
flexibly incorporating knowledge from multiple,
multimodal sources.
On the modeling side, work has explored both generative (Serban et al., 2016a, 2016b) and retrieval-based models (Zhang et al., 2018), which identify the best utterance from the training set to return as the dialog response. This often leverages self-attention or cross-attention mechanisms (Humeau et al., 2019).
has explored hybrid models, Zum Beispiel, using the
output of a retrieval model as input for a generative
model (Dinan et al., 2018; Weston et al., 2018;
Cai et al., 2019; Zhu et al., 2020). Some of this
work has specialized to use both types of models to
generate conversations in an ensemble (Song et al.,
2016) or to specifically improve consistency (Song
et al., 2020). We extend these approaches by
augmenting generative models with retrieval-like
operations based on KNN search, allowing dialog
models to flexibly incorporate various sources of
external knowledge at the same time and scale to
large quantities of retrieval candidates.
3 KNN-based Information
Fetching Modules
Broadly, the KIF module assumes an encoder model M can access inputs X = {x1, x2, . . . , xn}. For example, X can be a
collection of sentences, and xi represents an
individual sentence. In a setting without additional
supporting information, the encoder will process
an input xi and produce the encoder output
M (xi). If xi is a sequence such as a sentence,
then M (xi) is a representation of the variable
size of the sequence length by the fixed size
encoder M's hidden size. However, in many tasks,
additional information is present, represented as
E = {e1, e2, . . . , em}. We encode each element
of X and E into a vector representation using the
encoder. To identify the closest information in E
that is relevant to xi, our general approach will
be to use KNN by comparing the representation
of xi with the representation of each element in
the set E. KNN is a fully differentiable operation
(Plötz and Roth, 2018), so it can be incorporated in a straightforward way into neural models. The most relevant information in E will then be available in the model. We display a KIF-augmented model in Figure 1 and describe how the KIF module operates.
One challenge to overcome is that the representations of all elements of the knowledge source E are pre-computed and kept fixed, creating M(E)—we do not backpropagate to affect the embeddings of the pre-encoded knowledge. In the early stages of training, the model receives large amounts of loss, which would affect the quality of the pre-encoded embeddings if we backpropagated to them. Further, encoding the fixed external knowledge once and re-using it allows for greater scalability.
Figure 1: KIF modules fetch relevant information from multimodal external knowledge. External knowledge sources E1 and E2 are pre-encoded by encoder M (green). In the model, input xi is encoded by encoder M′ (blue) to produce M′(xi). KIF modules (orange) operate on M′(xi) and identify the nearest neighbors encoded in M(E1) and M(E2) using KNN. Identified relevant elements from E1 and E2 are re-encoded by M′ in a gating mechanism with a weighted sum (represented by σ(WS1i) · WS1i, where WS stands for weighted sum), then concatenated to M′(xi). A full description with notation can be found in Section 3.
However, this lack of backpropagation can introduce a mismatch between the encoding of E and the encodings produced by a model that is training, as the training model has constantly changing representations because the weights are being learned. We use
M to represent the original encoder model used
to encode E and M ′ to represent the constantly
training model that is encoding X. The model
must learn a function to align M ′(xi) to the
pre-encoded elements of the external memory
M (E).
To circumvent this misalignment, we learn a mapping operator fE(M′(xi)) that trains to map elements of the model's representation of X, or M′(X), into the additional information representation space M(E). Concretely, fE(M′(xi)) is a multilayer perceptron with ReLU nonlinearities. From the input elements of X, fE(M′(xi))
learns representations of an output close to the
corresponding projection of X into E. This can
be interpreted as learning a read operation on a
fixed external memory. If there was no change to the encoding of the model compared to the pre-computed knowledge, then the ideal mapping operator would be the identity function (as M′ would equal M). However, as the model changes significantly during the training process, the nonlinear mapping capability of fE(M′(xi)) is essential to be able to identify the correct knowledge E from the input X.
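To make the read operation concrete, the mapping operator can be sketched as a small PyTorch module. The class name, hidden size, and pooling below are our own illustrative assumptions, not the authors' released implementation:

```python
import torch
import torch.nn as nn

class MappingOperator(nn.Module):
    """Sketch of the mapping operator f_E: projects the training
    encoder's output M'(x_i) toward the space of the fixed,
    pre-encoded knowledge M(E). Layer sizes are assumptions."""

    def __init__(self, dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, enc: torch.Tensor) -> torch.Tensor:
        # enc: [batch, seq_len, dim], the output of the training encoder M'.
        # Apply the MLP token-wise, then average over the length dimension
        # to form the fixed-size query of Equation (1) below.
        return self.mlp(enc).mean(dim=1)
```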
Thus, a model augmented with KIF will incorporate external knowledge in the following manner. First, we find the k nearest elements to fE(M′(xi)) in M(E), based on KNN search with inner product. Then, the relevant elements identified by KNN are re-encoded by M′. For example, if element ej is retrieved by KIF, it would produce M′(ej). We use the optimized faiss
library for KNN search, which can conduct
billion-scale KNN efficiently on GPUs.
The KNN output for an element xi is produced
by using faiss to search for the k nearest
representations to fE(M′(xi)) in M(E). Note that as the encoders M and M′ produce output representations of variable length (for example, in the case where xi is a variable length sequence, such as a sentence), we average across the length dimension to produce a fixed-size representation r to conduct the KNN search.
rxi = Avg(fE(M′(xi)))  (1)

RE = {Avg(M(e)) | e ∈ E}  (2)

KNNxi = KNearest(k, rxi, RE)  (3)
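Equations (1)–(3) map directly onto the faiss API. Below is a minimal sketch with random vectors standing in for the pooled encodings Avg(M(e)) and rxi; only the index construction and search calls are real faiss usage, everything else is a placeholder:

```python
import faiss
import numpy as np

rng = np.random.default_rng(0)
dim, k = 512, 5

# Stand-ins: in the model, R_E holds Avg(M(e)) for each knowledge
# element e in E (Equation 2), computed once and kept fixed.
knowledge = [f"knowledge sentence {i}" for i in range(1000)]
R_E = rng.standard_normal((len(knowledge), dim)).astype("float32")

index = faiss.IndexFlatIP(dim)  # exact inner-product search
index.add(R_E)                  # encode once, reuse for every query

# Equation (1): the mapped, pooled query r_{x_i}; Equation (3): search.
r_xi = rng.standard_normal((1, dim)).astype("float32")
scores, knn_ids = index.search(r_xi, k)
retrieved = [knowledge[j] for j in knn_ids[0]]  # re-encoded by M' next
```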
Then, the KIF module output for an element xi is the set of all re-encoded representations of the KNN-retrieved knowledge:

KIFxi = {M′(e) | e ∈ KNNxi}  (4)
These elements are weighted by their normalized nearest neighbor scores and then summed. This is subsequently concatenated to M′(xi) to form the final encoder output:

[M′(xi), WeightedSum(KIFxi)]  (5)
This can be easily extended to using multiple modules simultaneously. For example, two sources of external information, E1 and E2, can be combined by identifying the top candidates of each information source. The weighted sum of the KIF output on each information source is concatenated with the encoded input M′(xi). The KIF output dimensionality is the same size as the hidden size of M′(xi), so they can be directly concatenated.
Finally, different sources of information may not be required for every prediction, and some information sources can be more important than others. To allow the model to make more fine-grained decisions about what information to use from which source, and how much of it, we add a gating mechanism using a sigmoid function around each weighted sum of KNN representations. KIF1i and KIF2i denote the KIF module from Equation (4) applied to E1 and E2, respectively.

WS1i = WeightedSum(KIF1i)  (6)

WS2i = WeightedSum(KIF2i)  (7)
which produces the final encoder output, a concatenation of M′(xi) with the output of multiple KIF modules:

[M′(xi), σ(WS1i) · WS1i, σ(WS2i) · WS2i]  (8)
This concatenation represents the output of the
encoder M ′ and can be used for various purposes,
such as providing the encoder output to a decoder
in a sequence to sequence model.
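As an illustration of Equations (6)–(8), the sketch below weights the re-encoded neighbors by normalized KNN scores, sums them, and applies the sigmoid gate; softmax normalization and the pooled tensor shapes are our assumptions:

```python
import torch
import torch.nn.functional as F

def gated_kif(neighbors: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """One gated KIF module.
    neighbors: [batch, k, dim] re-encoded retrieved elements M'(e)
    scores:    [batch, k]      inner-product scores from the KNN search
    Returns sigma(WS_i) * WS_i as in Equation (8)."""
    w = F.softmax(scores, dim=-1)                  # normalized KNN scores
    ws = (w.unsqueeze(-1) * neighbors).sum(dim=1)  # WeightedSum(KIF_i)
    return torch.sigmoid(ws) * ws                  # sigmoid gate

# Equation (8): concatenate M'(x_i) with the gated output of each module.
batch, k, dim = 2, 5, 512
enc_x = torch.randn(batch, dim)  # pooled M'(x_i); shape is assumed
out1 = gated_kif(torch.randn(batch, k, dim), torch.randn(batch, k))
out2 = gated_kif(torch.randn(batch, k, dim), torch.randn(batch, k))
encoder_output = torch.cat([enc_x, out1, out2], dim=-1)  # [batch, 3*dim]
```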
4 Applying KIF to Dialog Tasks
We describe how to apply KIF to the task of generative dialog, a setting where models must generate engaging and on-topic responses. We investigate dialog for two reasons: first, dialog agents must be able to consult relevant information to maintain the topic of the conversation. Second, retrieval-based agents have strong performance compared to generative ones, due to their ability to copy dialog utterances from the training set. Using KIF, we can incorporate the benefits of retrieval architectures into generative, knowledge-based models.
4.1 KIF for Generative Dialog
In dialog, xi represents the text of conversation i. A conversation consists of multiple back-and-forth utterances (or turns). For example, a conversation could consist of 4 turns: xi = [xi,1, xi,2, xi,3, xi,4], where xi,4 is the direct utterance the model should respond to, and the earlier utterances are the conversation context.
Standard generative dialog models use a Transformer neural network as the encoder M and want to produce an output that is an appropriate response to the conversation. However, in many cases, the conversation history alone does not include all of the information required to produce an appropriate response. For example, if a model needs to chat about a specific movie, it can be helpful to provide the model with more information about that movie so a more interesting dialog response can be produced. To incorporate knowledge, models often concatenate a knowledge source E such as Wikipedia to xi and use attention modules to identify the most relevant knowledge. However, this approach is computationally intensive when handling large quantities of information. Further, attention mechanisms have been found to operate poorly over long sequences, as the mechanism becomes blurry due to the softmax and struggles to make fine-grained decisions (Fan et al., 2018b). The same is true for hierarchical approaches, which lack scalability.
We augment Transformer sequence to sequence (seq2seq) networks on the encoder side with KIF to improve generative dialog models. We experiment on two dialog tasks, Wizard of Wikipedia (Dinan et al., 2018) and Engaging ImageChat (Shuster et al., 2020). In both datasets, models must leverage information external to the dialog history alone—in Wizard of Wikipedia, the chat requires access to knowledgeable facts, and in Engaging ImageChat, discussion about a specific image. As models must process multiple inputs and ground responses in the knowledgeable facts or images, these tasks challenge existing seq2seq approaches.
4.2 Wizard of Wikipedia
The goal of the Wizard of Wikipedia dataset is to
train knowledgeable agents that can chat in any
domain. The dataset contains 1,365 various topics
discussed in 18,430 dialogs in the training set,
totalling 166,787 training utterances. Each topic is a general concept, such as dogs or ice cream, and is included as the first utterance of the conversation. The conversation is meant to be in-depth and detailed, so individual utterances must reference specific knowledge as a basis for the utterance. The knowledge takes the form of Wikipedia sentences. For example, the chat utterance I love Toy Story! It was released in 1995 would reference the Wikipedia sentence Toy Story is a 1995 American computer-animated buddy comedy […]. For each utterance, a set of sentences is identified by an information retrieval system, and the crowdworker selected one knowledge sentence as the basis for their utterance.
Knowledge Sources. Our model for Wizard of Wikipedia has access to two sources of external information, E1 and E2:

• E1 is Wikipedia Knowledge provided by the dataset as evidence to support knowledgeable chitchat (initially curated by the information retrieval system used in Dinan et al. [2018]). The scale of this KNN search is to filter through an average of 34 sentences. The KIF module uses dialog features to fetch relevant knowledge to condition upon to generate the subsequent utterance.
• E2 is Training Utterances. To incorporate the benefits of retrieval-based dialog models to the generative setting, we use KIF to identify relevant utterances from the training set and take their responses as input. If many conversations about dogs have already occurred, models should be able to take advantage of these human-written examples to improve their generations. For example, likely conversations could occur about the breed of the dog, daily routine with a pet, and similar topics. There are around 170K dialog utterances as inputs to KNN search. This can be interpreted as incorporating the benefits of retrieval models by identifying an utterance with similar structure as the text the model would like to generate. We do not allow the module to fetch the correct response of the current conversation context.
Access to these two sources of knowledge
can be seen as learning a template and a topic
separately. Sample templates can be identified
from the training utterances, and topic-specific
information learned by accessing the Wikipedia
knowledge.
Additional KNN Features. To better identify relevant training utterances from the large quantity available, we break down xi into conversation sub-features for a more fine-grained match in the KNN search step. By conducting KNN on more features, we can achieve higher quality retrieval. We leverage the nature of dialog to decide these features.
We concatenate the encoding of the most recent dialog utterance (e.g., xi,last) with the encoding of the dialog context from the current conversation and the turn number t, such that M′(xi,last), M′(xi,−last), t is the representation used for KNN search. Concretely, if the model is trying to produce the 5th turn of the conversation, then xi,last is the most recent utterance from the dialog partner, xi,−last would be the last 3 turns of exchange, and t would be 4. Note that the turn number is represented as a standalone number. These are known to be salient conversation features. The most recent dialog utterance is the direct turn the model is responding to, and the dialog context may provide additional clues. The turn number is important, as earlier turns are often generic (e.g., how are you doing today) and later turns are more specific.
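A sketch of how such a query could be assembled (mean-pooling and plain concatenation are our reading of the text; the helper name is hypothetical):

```python
import torch

def build_knn_query(enc_last, enc_context, turn_number):
    """Assemble the Wizard of Wikipedia KNN query from dialog
    features: M'(x_{i,last}), M'(x_{i,-last}), and the turn
    number t represented as a standalone number."""
    turn = torch.tensor([float(turn_number)])
    return torch.cat([enc_last.mean(dim=0), enc_context.mean(dim=0), turn])

# Producing the 5th turn: pool the encoded last utterance and the
# encoded earlier context, and append t = 4.
query = build_knn_query(torch.randn(12, 512), torch.randn(48, 512), 4)
```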
4.3 Engaging ImageChat
The goal of Engaging ImageChat is to create agents capable of chitchatting about images selected from the YFCC100M dataset (Thomee et al., 2016). The dataset contains 186,782 dialogs in the training set, each about a unique image, totalling 355,862 utterances. Agents are assigned one of 215 personalities (e.g., sweet, caring, excited) to increase engagingness. Previous work (Shuster et al., 2020, 2019) identified that both crowdworkers and models, when provided with personalities, produced more diverse, interesting responses, as evaluated by humans.
We use a multimodal neural network designed to handle both image input and text input. Following Shuster et al. (2020), the images are
encoded using a pre-trained ResNeXt network
(Xie et al., 2017). To extract the final image
representation, we project the 2048-dimensional
output of the image encoder to 512-dimensions
using a deep multilayer perceptron with ReLU
activation units. The conversation history, which
includes the one-word personality, is encoded with
a Transformer encoder network. The image and
conversation are integrated using the Multimodal-
Sum-Combiner module proposed in Shuster et al.
(2020).
Knowledge Sources. Our model for Engaging ImageChat has access to two sources of external information, E1 and E2:
• E1 is Chat on Similar Images. Although there are over 180K different images in this dataset, many of the images are similar. For example, conversations associated with two pictures of dogs could be relevant to each other. The model is able to use KIF directly on the current image features to fetch from around 180K different images and return 6 turns of related chat for each fetched image. Fetching from E1 consists of identifying related image chats, or conversations on related topics.
• E2 is Training Utterances. Similar to the
motivation for the previous dataset, we allow
the model to identify training utterances that
could be useful for responding in the current
conversation. The scale of this fetching task
is large: 350K dialog utterances. This could
be interpreted as identifying utterances with
similar structure to what the model would
like to generate, and is complementary to the
topic-based related image chats.
Additional KNN Features. To identify relevant information from training utterances, we use the same dialog features as Wizard of Wikipedia in the KNN search step, with one modification: we add the personality provided by the dataset. We represent the personality feature as the personality word, such as caring, and embed it with the encoder M′. As utterances from speakers with the same personality are more likely to be similar, this feature improves the quality of the fetched information. For example, conversations with the sweet personality often include similar text such as aww, that's wonderful. We use two additional features for the KNN search: t, the turn number, and p, the personality. This feature is explicitly used in Shuster et al. (2020) to improve the engagingness and flow of the conversation. Similar to Wizard of Wikipedia, we represent the conversation turn t as a number.
The Transformer model is used to encode the text xi and produce a representation of the text; the turn number t and the personality p are represented separately. As the personality is a word, we use the same Transformer to encode it. The concatenation of features used for KNN search is: M′(xi,last), M′(xi,−last), t, p.

5 Experimental Setup

5.1 Implementation Details
Parameter Settings. We use parl.ai (Miller et al., 2017) to implement our models. The data for both datasets used is available for download from parl.ai as well. We use byte-pair encoding (Sennrich et al., 2016) to represent the text to better handle the rare word problem (Dinan et al., 2018; Fan et al., 2018a). Our generative Transformer models have 8 encoder layers and 8 decoder layers, with FFN size 2048, embedding dimension 512, and 4 attention heads. We optimize using Adam (Kingma and Ba, 2015) and the inverse square root learning schedule (Vaswani et al., 2017) with 10k warmup updates. The initial learning rate is 0.0001 and we optimize for model perplexity. We use a dropout of 0.5 and set gradient clipping to 0.1. We set k = 5 for all cases. For both datasets, we model a vocabulary size of 54,944 based on the BPE-based vocabulary from the Reddit pre-training. We tuned the learning rate and batch size hyperparameters together.
Pre-training. We pre-train the Transformer seq2seq model used for both datasets on 250M comments from Reddit. The Reddit dataset was made available by pushshift.io. The comments are parsed to maintain conversational threads of users responding to each other, so the encoder network has been exposed to conversational context at training time. Note that the Reddit dataset does not include aspects such as personality, as those are unique to specific datasets such as Engaging ImageChat. The context size in pre-training is set to 512 tokens. The ResNeXt encoder used to model images for the Engaging ImageChat dataset was pre-trained on 3.5 billion images (Mahajan et al., 2018).
5.2 Evaluation
Generation. We generate with beam search,
setting the beam size to 4. We use 3-gram block-
ing. This technique disallows repeated n-grams
from being generated multiple times and reduces
repetition.
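A minimal sketch of 3-gram blocking as described (the function name and pruning convention are our own):

```python
def repeats_ngram(hypothesis, candidate, n=3):
    """Return True if appending `candidate` to the token list
    `hypothesis` would recreate an n-gram already present."""
    if len(hypothesis) < n - 1:
        return False
    new_ngram = tuple(hypothesis[-(n - 1):] + [candidate])
    seen = {tuple(hypothesis[i:i + n])
            for i in range(len(hypothesis) - n + 1)}
    return new_ngram in seen

# During beam search, a candidate failing this check is pruned,
# e.g., by setting its score to -inf before selecting the top beams.
assert repeats_ngram(["i", "love", "dogs", "i", "love"], "dogs")
```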
Automatic Metrics. Following Dinan et al. (2018), we compute F1, a metric of unigram overlap, between the generated utterance and the human-written reference utterance from the dataset. For generative models, utterances are generated using beam search. For retrieval models, the next utterance is predicted by ranking the entire set of training utterances, and the highest scoring utterance is chosen.
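For reference, unigram F1 can be computed as below; whitespace tokenization is a simplification of whatever normalization the official evaluation uses:

```python
from collections import Counter

def unigram_f1(generated: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall, with token
    counts clipped to the reference (a common formulation)."""
    gen, ref = generated.split(), reference.split()
    if not gen or not ref:
        return 0.0
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("i love toy story", "i love the movie toy story"))  # 0.8
```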
In Wizard of Wikipedia, there are two test sets:
a set of seen topics, or topics that have been
seen at training time with new test-time dialogs.
The second set is unseen, or topics that have not
been encountered at all during training time. Wir
evaluate on both subsets.
Human Evaluation. We follow the setup and use the analysis questions proposed in the Acute-Eval dialog evaluation system (Li et al., 2019). For reproducibility, we adopt this existing evaluation setting that has been applied to several dialog datasets. We use the question wording suggested by Acute-Eval and follow their self-chat procedure and interface. As one of the original datasets assessed in this system was Wizard of Wikipedia, their evaluation setting extends naturally to ours. We collect 100 human-bot conversational dialogs on a crowdsourcing platform for both datasets. The dialogs are eight turns long. Then, we show pairs of the collected conversations side by side, one conversation with a human and model A and the other conversation with a human and model B. We ask annotators the following questions:
• Who would you prefer to talk to for a long
conversation?
• If you had to say one of the speakers is
interesting and one is boring, who would you
say is more interesting?
• Which speaker sounds more human?
• Which speaker has more coherent responses
in the conversation?
• If you had to say that one speaker is more
knowledgeable and one is more ignorant,
who is more knowledgeable? (Wizard of
Wikipedia only)
We measure the percentage of time one model
was chosen over the other, taking the majority
agreement between three evaluators. To reduce
variance, dialogs paired in the evaluation were
collected on the same topic for Wizard of Wiki-
pedia and collected on the same image and per-
sonalities for Engaging ImageChat. Topic and
images selected for evaluation are unique and
taken randomly from the test set.
5.3 Baselines
We compare Transformers augmented with KIF to
other existing approaches on Wizard of Wikipedia
and Engaging ImageChat. The best approaches,
judged by human evaluation, are retrieval models,
the Retrieval Transformer Memory Network from
Dinan et al. (2018) and the Retrieval Transformer
from Shuster et al. (2020). These have been shown to be strong baselines compared with other retrieval techniques based on TF-IDF (Chen et al., 2017). Thus, we report the existing retrieval models for both datasets, but focus on comparing to other generative baselines.
We compare to three additional generative baselines. Note that in Wizard of Wikipedia, the construction of the dataset is that sentences of Wikipedia knowledge are provided with the utterances in a concatenated form. Models must identify the relevant information in this provided knowledge, or can access more Wikipedia knowledge beyond the provided sentences. The following baseline methods always have access to the information provided in the dataset already, but no additional Wikipedia knowledge beyond that.
• Transformer Memory Networks. To contrast the ability of KIF to existing work, we compare our models to published Transformer Memory Networks (Dinan et al., 2018). These models encode each piece of external information independently with a Transformer encoder, and these are stored as memory slots. To access information in the memory slots, a model performs dot-product attention between the memory slots and the dialog context. In Dinan et al. (2018), the knowledge selection from Wikipedia was supervised with either (a) a two-stage model, where the first model was trained to predict the right knowledge and a second model conditions on the predicted knowledge to generate the next utterance, or (b) an end-to-end model with an auxiliary loss for knowledge prediction accuracy.
• Retrieve and Refine. We implement a hybrid
model (Weston et al., 2018) that incorporates
top retrieval candidates as additional input to Generative Transformer MemNets. In Retrieve and Refine, a fixed number of candidates are retrieved and concatenated to the conversational history in the encoder, making the input much longer. For both datasets, the Retrieve and Refine mechanism that fetches a fixed number of training utterances is added to the Generative Transformer MemNet with Reddit Pre-Training baseline.

Unlike the KIF-Augmented Transformer, the retrieval is conducted with a separate model, so there is no backpropagation to affect the retrieval. With KIF, models can alter the retrieved candidates by learning the mapping operator. Further, a fixed amount of information is always retrieved, without the capability to easily rescale to focus on specific candidates. KIF modules have weighting mechanisms to focus more on certain information, and the modules are combined with gating so models can learn which knowledge sources are more important and adjust flexibly. Lastly, Retrieve and Refine is only used to retrieve one source of information: training set utterances.
• Response Generation with MR. We implement the model proposed in Qin et al. (2019), which encodes the conversation history and document contextually with a biLSTM before generating the next dialog utterance. The initial model was applied to a machine reading task where a knowledge document was provided along with the conversation history. For Wizard of Wikipedia, we replace the knowledge document with the Wikipedia sentences provided in the dataset. The model then uses the conversation to identify the most relevant information in the document using a cross-attention mechanism. For the Engaging ImageChat dataset, as there is no document provided with the dataset, we replace the expected document with the conversation history, and use the most recent utterance in the conversation to attend to the conversation history.
We make an additional improvement to this baseline: in Qin et al. (2019), the embeddings used pre-trained CoVE vectors (McCann et al., 2017). We found our Reddit pre-trained Transformer embeddings to work more effectively, as they are trained for dialog. Thus, we replace CoVE embeddings with domain-specific ones.
All of the Transformer generative baselines are initialized with the same pre-training on Reddit that we use for our models, for fair comparison on modeling quality.
6 Results
We describe the results of incorporating KIF
modules into Transformer networks. We display
an example conversation between a human and
our model in Figure 4, and show the top scoring
Wikipedia knowledge and Training Utterance
fetched by KIF modules. We compare to various
baselines using automatic and human evaluation,
and discuss our experiments. We present various
ablation settings to understand the key features
that make our method function.
6.1 KIF is Effective for Incorporating
Knowledge
Automatic Evaluation. Comparing KIF aug-
mented Transformer networks to published base-
lines and Retrieve and Refine, we find improved
results.
For Wizard of Wikipedia, the improvement in F1 score over the best baseline is around 8 points (see Table 1). A major contributing factor is the construction of the dataset—as each dialog turn is grounded in a specific knowledge sentence from Wikipedia, improving the ability to identify the relevant fact strongly improves performance. Contrasting the results from the seen and unseen test sets in Table 1, the improvement on unseen is worse—it is harder to fetch training utterances for unseen topics.

While ImageChat has no explicit dependency on knowledge, we still see a 2 point improvement compared to the Generative Transformer MemNet (with the additional Reddit pre-training), indicating that KIF can be generally useful (see Table 2). Compared to an even stronger baseline that we tune in this work, Retrieve and Refine, we see a 1 point improvement.
Human Evaluation. Results are shown in Figure 2. On both datasets, we find there is a large improvement over existing generative models (green bars) that is statistically significant for some of the evaluation questions. Evaluators agree that KIF-augmented Transformers are generally more coherent and human-sounding compared to the Generative MemNet.
Model                                                  Test F1 (Seen)   Test F1 (Unseen)

Retrieval Baselines
Retrieval Transformer MemNet (Dinan et al., 2018)          15.4             12.4

Generative Baselines
2-Stage Generative MemNet (Dinan et al., 2018)             18.9             17.4
Generative Transformer MemNet (Dinan et al., 2018)         16.9             14.4
  + Reddit Pre-Training                                    17.6             16.3
Retrieve and Refine (Weston et al., 2018)                  18.2             17.9
Response Generation with MR (Qin et al., 2019)             17.5             16.8
KIF-Augmented Transformer                                  25.9             22.3
Tisch 1: Results on the Wizard of Wikipedia dataset. We implement the Retrieve and Refine
and Response Generation with MR approaches, all with Reddit Pre-Training, and evaluate them on
Wizard of Wikipedia. The Seen test set consists of conversations on topics seen at training time, Und
the Unseen test set consists of conversations about new topics that were not in the training set.
Model                                                  Test F1

Retrieval Baselines
Retrieval Transformer (Shuster et al., 2020)               9.8¹

Generative Baselines
Generative Transformer MemNet (Dinan et al., 2018)         7.1
  + Reddit Pre-Training                                   12.8
Retrieve and Refine (Weston et al., 2018)                 13.6
Response Generation with MR (Qin et al., 2019)            13.2
KIF-Augmented Transformer                                 14.4
Tisch 2: Results on the Engaging ImageChat dataset. We implement the Generative Transformer
Memory Network, Retrieve and Refine, and Response Generation with MR approaches, all with
Reddit Pre-Training, and evaluate them on Engaging ImageChat.
Comparison with existing retrieval models (shown in blue) is more nuanced. Along the lines of existing work (Zhang et al., 2018; Dinan et al., 2018), we find that retrieval-based models score very well in human evaluations that ask how human or interesting a dialog sounds. This is because retrieval models return human-written utterances from the training set and do not suffer from decoding mistakes present in generative models.
¹In Shuster et al. (2020), retrieval Transformer models report Hits@N using a fixed candidate set of 99 distractor candidates and 1 true candidate. We compute F1 using their open-sourced model by scoring the entire training set of over 350K utterances with the model and taking the top scoring candidate as the response.
For example, on Engaging ImageChat,
while our model has significantly improved over the generative baseline (see green bars in Figure 2, right), it does not beat retrieval-based methods in sounding more human or being more interesting (see blue bars in Figure 2, right). As the Retrieval
baseline returns human-written text for other
humans to evaluate, we hypothesize that humans
score each other’s writing quite well. Compared
with generative models, which we focus on
improving, retrieval models often produce longer
text with more interesting, nuanced vocabulary
usage, and do not make generation mistakes
such as repetition. These factors often lead to
the stronger performance of retrieval models.
A surprising result is that KIF-augmented Transformers are more human sounding than retrieval models on Wizard of Wikipedia.
Figur 2: Human Evaluation Results on Both Datasets. More than 50% indicates the KNN Model is
bevorzugt. Stars indicate statistical significance at p < 0.05.
This
is because the dataset’s utterances are long and
factual due to the tendency of crowdworkers
to copy Wikipedia. Sometimes humans chatting
with the retrieval bot would respond uh. . . that’s
an interesting fact? Otherwise, our model
scores similarly to retrieval models, with most
evaluations not having a statistically significant
difference.
We conduct a second evaluation on the Unseen
Test Set of the Wizard of Wikipedia dataset.
Results are shown in Figure 3. Trends are similar
compared to the results on the Seen Test set,
though the preference for the KIF-augmented
Transformer is greater over the retrieval baseline.
We hypothesize that because the Unseen Test Set
is on entirely held out topics, the retrieval baseline
can struggle to identify relevant utterances. In
contrast, the KIF-augmented Transformer, similar
to the generative baseline from Dinan et al. (2018),
can use the generative capability to produce
utterances.
Lastly, we conduct an additional study to
examine the variance of the comparative dialog
judgements. The evaluation study for Wizard of
Wikipedia is repeated three times on different
days, and evaluators who have answered on
previous days are not allowed to evaluate again
in any subsequent experiments. Overall, we
find reasonable interannotator agreement rates,
around 73% averaged across all evaluations,
which is similar to the agreement rates reported
in Li et al. (2019). We find there is greater
variance on questions asking which dialog is
more human and more interesting, most likely as
different evaluators can interpret these in different
ways. Further, we see that comparison with
the Retrieval model has less variance compared
to the Generative model, possibly because the
Retrieval model's human-written text is devoid of mistakes.
Figure 3: Human Evaluation on the Unseen
Test Set of Wizard of Wikipedia. More than
50% indicates the KNN Model is preferred. Stars
indicate statistical significance at p < 0.05.
Overall, we find that the conclusions
(and statistical significance) are stable across
multiple evaluations.
6.2 Analysis of Fetched Knowledge
Example conversations from our KIF-augmented
generative model are shown in Figure 4 on
Wizard of Wikipedia. We find that relevant
knowledge is identified that affects the content
of the generated utterance. For example,
the
model finds knowledge sentences about Disney
movies as the human conversationalist starts
the conversation discussing Disney. The model
leverages the fetched knowledge to write the
content of the generated utterance. In a concrete
example, the fetched sentence disney announced
intentions [...] after the success of the incredibles
leads the model to generate the utterance i love the
incredibles, they are my favorite disney movie.
In contrast, the model uses the form of the
fetched training utterance often as a template for
writing a response. For example, the model copies
the training utterance Ohhh . . . what do people
with color blindness do to cope with the effects?
and starts the model generation with Ohhh ... and
continues with the question i think toy story is a
classic? following the form of the selected training
utterance.
Figure 4: Conversation between Human and KIF-Augmented Transformer on Wizard of
Wikipedia. The top-scoring Wikipedia knowledge and training utterances fetched by KIF are displayed
with model output.
Figure 5 displays the top-3 fetched training
set utterances and knowledge sentences on the
Wizard of Wikipedia dataset when responding
to a human utterance. KIF modules can identify
multiple relevant items. In response to the human
question about blue skies the 1946 movie, the model
identifies both the comedy film and the band.
Finally, the elements retrieved by KIF modules
provide a more interpretable understanding of
what the model is conditioning upon to generate
a dialog response. In Table 3, we display for the
same dialog history, changing the model’s fetched
training utterance and knowledge sentence for our
own examples. The model heavily incorporates
our manual changes of the fetched information into
the generated utterance. For example, changing
the knowledge directly affects what the model
generates as the favorite character—from buzz
lightyear to mr potato head to slinky dog—while
changing the fetched training utterance changes
the form of the generated sentence.
6.3 Scaling KIF to Challenging
Retrieval Settings
KIF modules can be used in more realistic and
challenging settings for knowledge retrieval that
test the scalability of the module. In Figure 6(a),
we compare the Generative Transformer MemNet
Baseline with KIF-Augmented Transformers in
three settings. The first is the standard Wikipedia
sentences provided by the dataset (on average, 34 sentences). Then, we extend to providing the model with the full Wikipedia article (on average, 57 sentences) and finally to multiple Wikipedia articles (on average, totaling 205 sentences), identified using the conversation's topic. This increasing size of available knowledge could be realistic for settings where it is unclear what information is most relevant, if filtering steps to preprocess the data remove potentially relevant information, or if information synthesis from multiple knowledge sources is necessary to produce a high-quality generation.
As the Wikipedia knowledge becomes more
difficult to identify, performance decreases, but
still outperforms the baseline that uses the
dataset-provided set of 34 sentences.
Comparing the scaling capability of KIF to the
standard Generative Transformer MemNet Base-
line highlights the advantage of using KNN. The
attention-based mechanism used in Dinan et al. (2018) struggles to identify salient information when given increasingly larger quantities of knowledge, unlike the KNN information fetch.
Figure 5: Examples of Top-3 Fetched Training Utterances and Fetched Knowledge when responding
to a human chat from the dataset using a trained Wizard of Wikipedia model. Examples are taken from
validation.
We
hypothesize the attention mechanism is challenged
by softmax-ing over a larger quantity of inputs, as
it can be difficult to make sharp distinctions.
6.4 Ablations
Importance of Multiple Knowledge Sources.
One benefit of the KIF module approach is
that several modules can be combined, each
capturing information from a different source. In
both settings, Wizard of Wikipedia and Engaging
ImageChat, two modules were used to incorporate
multiple forms of knowledge—training utterances
to capture the capability of a retrieval-based model
and knowledge from Wikipedia or related chats
based on image features. We perform here an
ablation study to evaluate the impact of using
only one source of information. As can be seen
in Table 4, performance decreases when only one
source of information is used (see Table 4).
For Engaging ImageChat,
this study also
underlines the importance of being able to fetch
form
in a multimodal
fashion. The general
of the KIF module—requiring only a feature
vector to find nearest neighbors from—allows
fetching on multiple modalities such as text and
images. In Table 4, using the Image-based KIF
to fetch text from Related Images is important
to reach the strongest performance (compare
Training Utterances Only that uses text-based KIF
and using both Training Utterances and Related
Images).
Using Dialog Features for KNN Performance.
The quality of the KNN search is critical to the
performance of KIF modules. As the external
knowledge is kept fixed, KIF must be able to align
the dialog context with the knowledge to identify
relevant pieces of information. In Table 5, we
show that matching on more features can improve
the quality of the retrieved information. Using only
the encoding of the immediate previous utterance
can improve results on Wizard of Wikipedia by
7 F1 points, but
this is further improved by
also leveraging the encoding of context (+1.8
F1) and using the dialog turn number (+1 F1).
These features are available in the datasets, and we leverage them to improve the relatedness of retrieved knowledge.
Knowledge | Training Utterance | Generation
buzz lightyear's name is in honor of astronaut edwin 'buzz' aldrin | my favorite character in that book series is hermione granger | cool! my favorite character in that movie is buzz lightyear
mr potato head is based on the real-life mr. potato head toy | my favorite character in that book series is hermione granger | my favorite character in that movie is real-life mr potato head
slinky dog is a toy dachschund with a metal slinky for a body | my favorite character in that book series is hermione granger | cool! my favorite character is the slinky dog
slinky dog is a toy dachschund with a metal slinky for a body | i really like the character hermione granger | cool! i really like slinky dog
slinky dog is a toy dachschund with a metal slinky for a body | my favorite character of all time has to be hermione granger | i love that movie, my favorite character has to be slinky dog the dachshund
slinky dog is a toy dachschund with a metal slinky for a body | i agree with you! that's my favorite character as well | i think so too! my favorite is slinky
Table 3: Effect of Fetched Information on Generated Utterances. The top section provides examples
for a fixed training utterance, changing the knowledge—the generated text maintains the construction
of the training utterance but changes the favorite character to match the knowledge. The bottom section
provides examples for fixed knowledge but changing the training utterance—the generated text modifies
its form to match the training utterance, but the favorite character information remains consistent.
Figure 6: Ablations on Wizard of Wikipedia. (a) KIF can scale to hundreds of relevant sentences (blue)
while the baseline model, the Generative Transformer MemNet (gray), scales poorly. (b) Gating can
remove irrelevant information. In the 3 Sources case, one source of external information is unrelated.
(c) Performance as k varies.
Multi-Hop Retrieval with KIF. Work in me-
mory networks (Weston et al., 2015; Sukhbaatar
et al., 2015) utilized multi-hop mechanisms. Such
capacity could be useful when multiple sources are
necessary or information is incrementally fetched.
To emulate multi-hop memory mechanisms, we use KIF to retrieve relevant information for N = 2 or N = 3 fixed hops. As the number
of hops is fixed, the multi-hop operation remains
differentiable. We do not allow the model to
retrieve the same information in a second hop.
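A sketch of the fixed-hop fetch under our reading of this constraint; over-fetching and filtering already-seen ids is our own device, and with an unchanged query this reduces to taking the top k·N results, whereas in the model each hop's query comes from the evolving encoder state:

```python
import faiss
import numpy as np

def multi_hop_fetch(index, query, n_hops=2, k=5):
    """Fetch k elements per hop for a fixed number of hops, never
    retrieving the same element twice; the fetched knowledge from
    all hops is concatenated downstream."""
    seen, hops = set(), []
    for _ in range(n_hops):
        _, ids = index.search(query, k * n_hops)  # over-fetch candidates
        new = [int(j) for j in ids[0] if j not in seen][:k]
        seen.update(new)
        hops.append(new)
    return hops

# Toy usage with a random index standing in for M(E).
rng = np.random.default_rng(0)
index = faiss.IndexFlatIP(64)
index.add(rng.standard_normal((100, 64)).astype("float32"))
print(multi_hop_fetch(index, rng.standard_normal((1, 64)).astype("float32")))
```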
We experimented in two settings. First, the
same KIF module is used multiple times to fetch
different information, and then all of the fetched
knowledge is concatenated. Results are shown
in Table 6 (top). Second, we examine spreading the fetches into different KIF modules at various encoder depths.
Model                                        Test F1

Wizard of Wikipedia
Training Utterances Only                        18.1
Wiki Knowledge Only                             23.9
Training Utterances and Wiki Knowledge          25.9

Engaging ImageChat
Training Utterances Only                        13.9
Related Images Only                             13.8
Training Utterances and Related Images          14.4

Table 4: Using Multiple KIF Modules on Multiple Sources is important for improved performance.
Model                                        Valid F1

KIF-Augmented Transformer                       27.4

One KIF Module fetches multiple times
2 Fetches                                       26.9
3 Fetches                                       26.0

Multiple KIF Modules fetch once each
2 Fetches                                       26.5
3 Fetches                                       25.9

Table 6: Multi-hop with KIF to retrieve information with multiple fetch steps.
Model                                        Valid F1

Wizard of Wikipedia
Previous Utterance Only                         24.6
  + Dialog Context                              26.4
  + Turn Embedding                              27.4

Engaging ImageChat
Previous Utterance Only                         13.3
  + Dialog Context                              14.5
  + Turn Embedding + Personality                15.1

Table 5: Important Features for KNN Search using KIF. Salient conversation features improve performance on both datasets.
This could be interpreted as the
model learning to access more information each
layer. As the model progresses deeper, more
abstract and high level representations are built,
which could allow different knowledge to be
retrieved. Results are shown in Table 6 (bottom).
In both multi-hop settings, no improvement in
performance on the Wizard of Wikipedia dataset
is observed. We hypothesize that this can be
partially attributed to the construction of the
dataset—as humans explicitly based their written
dialog utterance on one knowledge sentence.
Further, it is possible that concatenation brings
together too much information for the model to
incorporate, so that additional fetches make the
retrieval noisier.
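For concreteness, here is a minimal sketch of the fixed-hop procedure described above. This is our illustration, not the authors' released code: the softmax weighting over raw similarity scores stands in for the learned read operation, and all names (knn_fetch, multi_hop_fetch, memory_keys, memory_values) are ours.

```python
import torch
import torch.nn.functional as F

def knn_fetch(query, memory_keys, k, exclude=None):
    """Top-k lookup over a fixed external memory.

    query:       (d,)   encoding of the dialog context
    memory_keys: (M, d) precomputed embeddings of the memory entries
    exclude:     optional indices fetched on an earlier hop
    """
    scores = memory_keys @ query              # inner-product similarity
    if exclude is not None:
        scores[exclude] = float("-inf")       # forbid re-fetching (see text)
    weights, idx = scores.topk(k)
    return idx, F.softmax(weights, dim=-1)

def multi_hop_fetch(query, memory_keys, memory_values, k=5, hops=2):
    """Run a fixed number of hops and concatenate the fetched vectors."""
    fetched, seen = [], None
    for _ in range(hops):
        idx, w = knn_fetch(query, memory_keys, k, exclude=seen)
        # weighted sum of the k fetched values -> one vector per hop
        fetched.append((w.unsqueeze(-1) * memory_values[idx]).sum(dim=0))
        seen = idx if seen is None else torch.cat([seen, idx])
    return torch.cat(fetched)                 # (hops * d,)
```

Because the number of hops is fixed in advance, the operation as a whole remains differentiable in the same sense as the single-hop case; only the hard top-k selection is discrete.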
Effect of Gating. We analyze the effect of the
gating mechanism by evaluating the capability of
the gate to identify and focus on salient infor-
mation. On Wizard of Wikipedia, we concatenate
a third source of information: dialog turns from
a completely different corpus called PersonaChat
(Zhang et al., 2018). This dataset looks quite
different—short utterances without
factual
knowledge—and should be easy for the model
to identify as distinct from Wizard of Wikipedia.
As shown in Figure 6(b), if KIF on PersonaChat is
included without gating, it has a harmful effect as
the model includes irrelevant information. When
equipped with gating, the model learns to use
the gate to ignore some inputs, and can recover
almost the full performance of a model without
this irrelevant information source.
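As a rough sketch of how such gating could be implemented, the module below scales each fetched source by a sigmoid gate conditioned on the dialog context, so that an irrelevant source can be driven toward zero before concatenation. The gate parameterization and all names here are our assumption, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class GatedKnowledgeCombiner(nn.Module):
    """Combine several fetched knowledge vectors with per-source gates."""

    def __init__(self, dim, num_sources):
        super().__init__()
        # one scalar gate per source, conditioned on context + fetch
        self.gates = nn.ModuleList(
            [nn.Linear(2 * dim, 1) for _ in range(num_sources)]
        )

    def forward(self, context, fetched):
        """context: (B, d); fetched: list of (B, d), one per source."""
        gated = []
        for gate, knowledge in zip(self.gates, fetched):
            g = torch.sigmoid(gate(torch.cat([context, knowledge], dim=-1)))
            gated.append(g * knowledge)        # (B, d), scaled per example
        return torch.cat(gated, dim=-1)        # (B, num_sources * d)
```

With PersonaChat concatenated as a third source, a gate of this form could learn to output values near zero for that source, which is one way the model could realize the recovery shown in Figure 6(b).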
Size of K in KNN. Figure 6(c) shows the
performance on Wizard of Wikipedia when
varying the amount of knowledge. Being able to
access multiple relevant pieces of information is
helpful, but too much information can be harmful.
This is likely because the weighted sum becomes
blurry if too many sentences are incorporated.
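A toy computation illustrates the blurring; the data, dimensions, and simplified softmax weighting below are invented for the demonstration. As k grows, the weighted sum over the top-k neighbors drifts away from the single most relevant memory entry.

```python
import torch
import torch.nn.functional as F

def fetch_summary(query, memory, k):
    """Softmax-weighted sum of the top-k memory rows (simplified weighting)."""
    scores = memory @ query
    vals, idx = scores.topk(min(k, memory.size(0)))
    return F.softmax(vals, dim=-1) @ memory[idx]

torch.manual_seed(0)
memory = F.normalize(torch.randn(500, 64), dim=-1)
query = memory[0] + 0.05 * torch.randn(64)    # entry 0 is the relevant one

for k in (1, 5, 20, 200):
    summary = fetch_summary(query, memory, k)
    sim = F.cosine_similarity(summary, memory[0], dim=0)
    print(f"k={k:3d}  cosine to the relevant entry: {sim:.3f}")
```

The cosine similarity to the relevant entry decays as k grows, mirroring the degradation in Figure 6(c).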
7 Conclusion
We present a KNN-based Information Fetching
module that learns to identify relevant information
from external knowledge sources by learning a
mapping-based read operation. KIF modules ben-
efit from the scalability and efficiency of KNN
search, enabling computation with large external
memories. We show in the context of two dialog
datasets that relevant knowledge can be identi-
fied and incorporated to create more engaging,
high-quality dialog.
Acknowledgments
We thank the reviewers and action editor for
their comments and insightful discussion. We
thank Emily Dinan and Kurt Shuster for provid-
ing assistance to reproduce their original works.
References
Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
Deng Cai, Yan Wang, Wei Bi, Zhaopeng Tu, Xiaojiang Liu, and Shuming Shi. 2019. Retrieval-guided dialogue response generation via a matching-to-generation framework. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1866–1875. DOI: https://doi.org/10.18653/v1/D19-1195
Sarath Chandar, Sungjin Ahn, Hugo Larochelle,
Pascal Vincent, Gerald Tesauro, and Yoshua
Bengio. 2016. Hierarchical memory networks.
CoRR, abs/1605.07427.
Danqi Chen, Adam Fisch, Jason Weston, and
Antoine Bordes. 2017. Reading Wikipedia to
answer open-domain questions. In Proceedings
of the 55th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long
Papers), pages 1870–1879. DOI: https://doi.org/10.18653/v1/P17-1171, PMCID: PMC5579958
Wenlin Chen, David Grangier, and Michael Auli. 2016. Strategies for training large vocabulary neural language models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1975–1985. DOI: https://doi.org/10.18653/v1/P16-1186
Emily Dinan, Stephen Roller, Kurt Shuster,
Angela Fan, Michael Auli, and Jason Weston.
2018. Wizard of Wikipedia: Knowledge-
powered conversational agents. In International
Conference on Learning Representations.
Angela Fan, Claire Gardent, Chloé Braud, and Antoine Bordes. 2019. Using local knowledge graph construction to scale seq2seq models to multi-document inputs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4177–4187.
Angela Fan, David Grangier, and Michael Auli.
2018a. Controllable abstractive summarization.
In Proceedings of
the 2nd Workshop on
Neural Machine Translation and Generation,
pages 45–54.
Angela Fan, Mike Lewis, and Yann Dauphin.
2018b. Hierarchical neural story generation. In
Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics
(Volume 1: Long Papers), pages 889–898.
Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. 2017a. Efficient softmax approximation for GPUs. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1302–1310.
Edouard Grave, Armand Joulin, and Nicolas Usunier. 2017b. Improving neural language models with a continuous cache. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
Alex Graves, Greg Wayne, and Ivo Danihelka.
2014. Neural Turing machines. arXiv preprint
arXiv:1410.5401.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong
Pasupat, and Ming-Wei Chang. 2020. Retrieval
augmented language model pre-training. In
Proceedings of the International Conference
on Machine Learning, pages 5695–5704.
Samuel Humeau, Kurt Shuster, Marie-Anne
Lachaux, and Jason Weston. 2019. Poly-
encoders: Architectures and pre-training strate-
gies for fast and accurate multi-sentence scoring.
In International Conference on Learning
Representations.
Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data. DOI: https://doi.org/10.1109/TBDATA.2019.2921572
Armand Joulin and Tomas Mikolov. 2015. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances
in Neural Information Processing Systems,
pages 190–198.
Urvashi Khandelwal, Omer Levy, Dan Jurafsky,
Luke Zettlemoyer, and Mike Lewis. 2019.
Generalization through memorization: Nearest
neighbor language models. In International
Conference on Learning Representations.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2019. Large memory layers with product keys. In Advances in Neural Information Processing Systems, pages 8548–8559.
Margaret Li, Jason Weston, and Stephen Roller. 2019. ACUTE-EVAL: Improved dialogue evaluation with optimized questions and multi-turn comparisons. arXiv preprint arXiv:1909.03087.
Rongzhong Lian, Min Xie, Fan Wang, Jinhua
Peng, and Hua Wu. 2019. Learning to select
knowledge for response generation in dialog
systems. In Proceedings of the 28th Interna-
tional Joint Conference on Artificial Intel-
ligence, pages 5081–5087. AAAI Press.
Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. 2018. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 181–196. DOI: https://doi.org/10.1007/978-3-030-01216-8_12
Bryan McCann,
James Bradbury, Caiming
Xiong, and Richard Socher. 2017. Learned
in translation: Contextualized word vectors.
In Advances in Neural Information Processing
Systems, pages 6294–6305.
Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. 2017. ParlAI: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 79–84. https://arxiv.org/abs/1705.06476, DOI: https://doi.org/10.18653/v1/D17-2014
Andriy Mnih and Geoffrey Hinton. 2009.
A scalable hierarchical distributed language
model. In Advances in Neural Information
Processing Systems, pages 1081–1088.
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473. DOI: https://doi.org/10.18653/v1/D19-1250
Tobias Plötz and Stefan Roth. 2018. Neural nearest neighbors networks. In Advances in Neural Information Processing Systems, pages 1087–1098.
Lianhui Qin, Michel Galley, Chris Brockett,
Xiaodong Liu, Xiang Gao, Bill Dolan, Yejin
Choi, and Jianfeng Gao. 2019. Conversing by
reading: Contentful neural conversation with
on-demand machine reading. In Proceedings of
the 57th Annual Meeting of the Association for
Computational Linguistics, pages 5427–5436.
Jack Rae, Jonathan J. Hunt,
Ivo Danihelka,
Timothy Harley, Andrew W. Senior, Gregory
Wayne, Alex Graves, and Timothy Lillicrap.
2016. Scaling memory-augmented neural
networks with sparse reads and writes. In
Advances in Neural Information Processing
Systems, pages 3621–3629.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. DOI: https://doi.org/10.18653/v1/P16-1162
Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski, Ankur Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2019. Real-time open-domain question answering with dense-sparse phrase index. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4430–4441.
Iulian V. Serban, Ryan Lowe, Laurent Charlin, and
Joelle Pineau. 2016a. Generative deep neural
networks for dialogue: A short review. arXiv
preprint arXiv:1611.06216.
Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016b. Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence.
Kurt Shuster, Samuel Humeau, Antoine Bordes,
and Jason Weston. 2020. Image-chat: Engaging
grounded conversations. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, pages 2414–2429.
DOI: https://doi.org/10.18653/v1/2020.acl-main.219
Kurt Shuster, Samuel Humeau, Hexiang Hu,
Antoine Bordes, and Jason Weston. 2019.
Engaging image captioning via personality.
In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition,
pages 12516–12526. DOI: https://doi.org/10.1109/CVPR.2019.01280
Haoyu Song, Yan Wang, Wei-Nan Zhang,
Xiaojiang Liu, and Ting Liu. 2020. Generate,
delete and rewrite: A three-stage framework for
improving persona consistency of dialogue
generation. arXiv preprint arXiv:2004.07672.
DOI: https://doi.org/10.18653/v1/2020.acl-main.516, PMID: 32249355
Yiping Song, Rui Yan, Xiang Li, Dongyan
Zhao, and Ming Zhang. 2016. Two are
better than one: An ensemble of retrieval- and
generation-based dialog systems. arXiv preprint
arXiv:1610.07149.
Sainbayar Sukhbaatar, Edouard Grave, Guillaume
Lample, Herve Jegou, and Armand Joulin.
2019. Augmenting self-attention with persistent
memory. https://arxiv.org/abs/1907.01470
Sainbayar Sukhbaatar, Jason Weston, Rob Fergus,
et al. 2015. End-to-end memory networks. In
Advances in Neural Information Processing
Systems, pages 2440–2448.
Bart Thomee, David A Shamma, Gerald
Friedland, Benjamin Elizalde, Karl Ni,
Douglas Poland, Damian Borth, and Li-Jia
Li. 2016. YFCC100M: The new data in multi-
media research. Communications of the ACM,
59(2):64–73. DOI: https://doi.org/10.1145/2812802
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Jason Weston, Sumit Chopra, and Antoine Bordes.
2015. Memory networks. In 3rd International
Conference on Learning Representations, ICLR
2015, San Diego, CA, USA, May 7-9, 2015,
Conference Track Proceedings.
Jason Weston, Emily Dinan, and Alexander Miller. 2018. Retrieve and refine: Improved sequence generation models for dialogue. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pages 87–92. DOI: https://doi.org/10.18653/v1/W18-5713
Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500.
Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213. DOI: https://doi.org/10.18653/v1/P18-1205
Yutao Zhu, Zhicheng Dou, Jian-Yun Nie, and Ji-Rong Wen. 2020. ReBoost: A retrieval-boosted sequence-to-sequence model for neural response generation. Information Retrieval Journal, 23(1):27–48. DOI: https://doi.org/10.1007/s10791-019-09364-x