Improving Dialog Evaluation with a Multi-reference Adversarial
Dataset and Large Scale Pretraining
Ananya B. Sai∗ and Akash Kumar Mohankumar∗ and
Siddhartha Arora and Mitesh M. Khapra
{ananya, miteshk}@cse.iitm.ac.in, {makashkumar99, sidarora1990}@gmail.com
Robert-Bosch Centre for Data Science and Artificial Intelligence
Indian Institute of Technology, Madras
Abstract
There is an increasing focus on model-based
dialog evaluation metrics such as ADEM,
RUBER, and the more recent BERT-based
metrics. These models aim to assign a high
score to all relevant responses and a low score
to all irrelevant responses. Ideally, such models
should be trained using multiple relevant and
irrelevant responses for any given context.
However, no such data is publicly available,
and hence existing models are usually trained
using a single relevant response and multiple
randomly selected responses from other con-
texts (random negatives). To allow for better
training and robust evaluation of model-based
metrics, we introduce the DailyDialog++
dataset, consisting of (i) five relevant responses
for each context and (ii) five adversarially
crafted irrelevant responses for each context.
Using this dataset, we first show that even in
the presence of multiple correct references,
n-gram based metrics and embedding based
metrics do not perform well at separating
relevant responses from even random
negatives. While model-based metrics perform
better than n-gram and embedding based
metrics on random negatives, their performance
drops substantially when evaluated on
adversarial examples. To check if large scale
pretraining could help, we propose a new
BERT-based evaluation metric called DEB,
which is pretrained on 727M Reddit conversations
and then finetuned on our dataset. DEB
significantly outperforms existing models,
showing better correlation with human judgments
and better performance on random
negatives (88.27% accuracy). However, its
performance again drops substantially when
evaluated on adversarial responses, thereby
highlighting that even large-scale pretrained
evaluation models are not robust to the
adversarial examples in our dataset. The
dataset1 and code2 are publicly available.

∗The first two authors worked equally towards the project.
1 Introduction
Open-domain conversational systems are increas-
ingly in demand for several applications ranging
from personal digital assistants to entertainers
for recreation. While several automated dialogue
agents such as Siri, Alexa, Cortana, and Google
Assistant have been built and deployed, there is no
good automatic evaluation metric to measure the
quality of their conversations. Researchers have
usually adopted n-gram based metrics (Papineni
et al., 2002; Banerjee and Lavie, 2005; Lin, 2004)
or embedding based metrics (Forgues et al., 2014;
Rus and Lintean, 2012; Zhang et al., 2020a) to
compare the model’s response with a single refer-
enz. These metrics assume that a valid response
should be semantically or lexically similar to the
reference without taking the context of the con-
versation into consideration. However, in open
domain conversations, a given context can have a
wide range of possible responses that may be lex-
ically and semantically very different from each
other. For example, the context, ‘‘I like danc-
ing and swimming, what about you?’’ can be
responded to with ‘‘I paint in my free time’’ or ‘‘I
do not have time for hobbies right now’’, both of
which are valid responses. As a result, n-gram and
word embedding based metrics, which rely on lex-
ical and/or semantic match, correlate very weakly
with human judgments for dialogue evaluation
(Liu et al., 2016).
1Dataset: https://iitmnlp.github.io/DailyDialog-plusplus/.
2Code: https://github.com/iitmnlp/Dialogue-Evaluation-with-BERT.
Transactions of the Association for Computational Linguistics, vol. 8, pp. 810–827, 2020. https://doi.org/10.1162/tacl_a_00347
Action Editor: Xiaojun Wan. Submission batch: 6/2020; Revision batch: 8/2020; Published 12/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Given the shortcomings of context-agnostic
n-gram and embedding based metrics, the focus
has now shifted to building neural network-based,
trainable dialogue evaluation models (Lowe et al.,
2017; Tao et al., 2018; Shimanaka et al., 2019;
Ghazarian et al., 2019). Such models are trained
to identify whether a given response can be
considered as a valid continuation of the given
context or not. In other words, the model should
(i) assign a high score to all relevant responses
no matter how diverse they are and (ii) assign a
low score to all irrelevant responses, preferably
with a clear margin of separation from relevant
responses. Although there exist several open-
domain dialogue datasets (Forsythand and Martell,
2007; Tiedemann, 2012; Ritter et al., 2010; Li
et al., 2017b) that are used for training dialogue
response generation systems, they are not suitable
for training and testing such evaluation models.
This is because these datasets have only a single
relevant response and no irrelevant responses.
Irrelevant responses can of course be generated by
sampling random utterances from other contexts,
but such examples typically do not have any
overlap with the context and hence are easier for
the model to distinguish from relevant responses
(as we will show in our results later). We refer
to the randomly sampled responses as random
negatives.
Some efforts have been made to build dialog
datasets with multiple relevant responses (i.e.,
multiple references), but these datasets are either
very small (1000 contexts) (Moghe et al., 2018;
Gupta et al., 2019) or automatically constructed
from Reddit conversations and thus potentially
noisy (Gao et al., 2019). Further, these datasets
do not have any carefully crafted adversarial
irrelevant responses. We define an adversarial
irrelevant response as one that has a significant
word overlap with the context but
is still an
irrelevant response (hence harder to identify than
randomly selected irrelevant examples, which
may not have any relation to the context). To
overcome this limitation of existing datasets, we
propose a large scale multi-reference dataset,
DailyDialog++, which is an extension of the
DailyDialog dataset. In particular, for each of
the 19K contexts derived from DailyDialog, we
collect an additional 5 reference responses with
the help of human annotators. Further, for ∼11K
contexts in DailyDialog, we also ask human
annotators to carefully craft irrelevant responses
that have a significant word overlap with the
context. This dataset will be made publicly
available and help towards better training and
more robust evaluation of dialogue evaluation
metrics.
Using this dataset, we extensively evaluate a
wide range of n-gram-based and embedding-
based metrics. In particular, we compute (i) the
correlation of these metrics with binary human
judgments and (ii) the accuracy obtained by
using the scores assigned by the metrics to
classify relevant/irrelevant responses. The
performance of these metrics improves when
presented with multiple references as opposed
to a single reference, but they still leave a lot to
be desired. On the other hand, most model-based
evaluation metrics, when trained and evaluated
using multiple relevant and random negative
responses, perform significantly better than the
n-gram-based and embedding-based methods.
However, their performance drops substantially
on the adversarial examples in our dataset.
Lastly, one could argue that dialog evaluation
metrics could be improved by pretraining on
large amounts of data. To check if
this is
indeed the case, we propose a new BERT-
based evaluation metric called DEB (Dialog
Evaluation using BERT), which is pretrained on
727M Reddit conversations. Indeed, this model
performs significantly better on random negatives
with an accuracy of 88.27% in distinguishing
the positive and random negative responses. It
also correlates well with human judgments on
responses generated by five dialog generation
systems (Serban et al., 2016, 2017; Park et al.,
2018; Zhang et al., 2020b). In particular, the
Spearman rank correlation between human scores
and DEB scores is 0.52 at the response level
and 0.70 at the system level, calculated by
aggregating the scores on all responses by each
system. However, once again, when evaluated
on adversarial examples from our dataset, its
performance drops substantially, underscoring
that even large-scale pretrained models are not
robust to adversarial examples.
2 Proposed Dataset
Our goal was to build a dataset with manu-
ally created multiple relevant and adversarial
irrelevant responses. For this, we wanted to start
with an existing base dataset that already has one
relevant response for every context, and then
extend it to include multiple responses. For
the base dataset, we considered several popular
datasets such as Twitter (Ritter et al., 2010),
Reddit (Henderson et al., 2019), Open Subtitles
(Tiedemann, 2012), NPS Chat (Forsythand and
Martell, 2007), PersonaChat (Zhang et al., 2018),
and DailyDialog (Li et al., 2017B). Of these,
Twitter and Reddit are generally considered noisy,
so we chose not to use either of them as the base
dataset. Ähnlich, Open Subtitles and NPS Chat
did not have speaker-aligned utterances, and hence
were not suitable for our purposes. We found
that the DailyDialog dataset was clean, human-
written, readily available, and covered a diverse
set of generic topics such as ordinary life, school
life, tourism, attitude & emotion, relationship,
health, work, politics, culture & education, and
finance. It contains a total of 13K conversations
with an average of 8 turns between exactly 2
speakers. Alternatively, we could have also chosen
PersonaChat, which is of a similar size and also
contains chit-chat style conversations, but we
chose the antecedent DailyDialog dataset.
For shorter conversations in DailyDialog
(having fewer than 8 turns), we collected multiple
relevant responses only for the last utterance. For
longer conversations (having 8 turns or more),
we divided the conversation into two or more
smaller chunks and collected multiple relevant
responses for the last utterance in every chunk.
In this way, from the 13K conversations3 in
DailyDialog, we were able to create 19K sub-
conversations with multiple relevant responses
for the last utterance in each sub-conversation
or context. The responses were created by in-
house annotators. Each context was shown to 2–3
annotators, and each of them was asked to generate
1–3 alternative responses for the last utterance,
capping the total number of alternative responses
to 5 (in addition to the one response already
available in DailyDialog). The annotators were
strictly instructed to avoid short generic responses
(‘‘Okay’’, ‘‘Thank you’’, ‘‘Sure’’, etc.), and write
longer meaningful responses containing at least
8–10 words. These responses were then verified
3Out of the 13K conversations released in DailyDialog,
we found that a good number of contexts were repeated,
either with slightly different spellings or through some subtle
differences such as representing numbers using digits versus
using words. We filtered out the repetitions and worked with
the remaining ∼11K contexts.
(and if needed, corrected and re-validated) by a
different set of annotators.
2.1 Adversarial Irrelevant Responses
In addition to collecting multiple relevant re-
sponses for each context, we also wanted to collect
irrelevant responses for each context. Most of the
models that are trained for the task of dialogue
evaluation (and dialogue generation) (Tao et al.,
2018; Ghazarian et al., 2019; Li et al., 2017a)
procure irrelevant responses by randomly sam-
pling responses from other contexts. Such random
negatives are often entirely out of context (un-
related) and hence are too easy for the model
to distinguish. To allow for a more critical or
adversarial examination of dialogue evaluation
systems, we propose creating adversarially crafted
irrelevant responses that have lexical or semantic
overlap with the context but are still unacceptable
as valid responses.
For obtaining such tricky negative responses,
the annotators were asked to choose some words
from the context and use them directly or indirectly
while writing the responses. Indirect usage here
refers to using words closely related to the context
words. For example, using synonyms, antonyms,
homonyms, subwords, or other words that are
known to frequently co-occur with the words in
the context (e.g., the words ‘‘flexibility’’ and
‘‘injuries’’ co-occur with ‘‘acrobatics’’). Once
again, each context was shown to 2–3 annota-
tors, and each of them was asked to generate 1–3
adversarially crafted responses for the last utter-
ance, capping the total number of alternative
responses to 5. Each response was then validated
by two different annotators. The validating an-
notators were instructed to either eliminate or
modify the responses that were not negative or
were borderline. A final check was made by one
more evaluator to ensure that the responses were
adversarially crafted, irrelevant, and grammati-
cally correct. We collected 5 such responses for
11,429 contexts. Table 1 shows examples of
relevant and irrelevant responses in our dataset and
Table 2 shows some statistics about our dataset.
We acknowledge that, in practice, a given
context can have a large number of relevant
responses (>> 5). However, exhaustively
collecting all such responses is prohibitively
expensive and time-consuming. Although it is
desirable to have even more than 5 responses for
every context, we believe that having at least 5 is a
good starting point given the dearth of such multi-
reference conversation datasets. The proposed
dataset thus serves as a pragmatic substitute for
an ideal dataset that would have contained a large
number of responses per context. Having said
that, we would also like to point out that the
value of the proposed dataset goes beyond having
multiple relevant references as it is also the first
dataset containing adversarial irrelevant responses
for given contexts.

Context:
FS: Can you do push-ups?
SS: Of course I can. It's a piece of cake! Believe it or not, I can do 30 push-ups a minute.
FS: Really? I think that's impossible!
SS: You mean 30 push-ups?
FS: Yeah!

Valid responses:
SS: Watch me do it.
SS: You don't believe me, do you?
SS: Start your timer, here we go.
SS: That's because you can't do it.
SS: You don't know that I am a fitness trainer, do you?

Invalid, adversarial responses:
SS: Push up the window and look out for a minute.
SS: Would you like to eat a piece of cake before gym?
SS: I like watching the Ripley's Believe it or Not show where they discuss nearly impossible feats and gymnastics.
SS: I have enough time for my treadmill exercises.
SS: Are you asking me to do 40 squats?

Table 1: Examples from the DailyDialog++ dataset with the context consisting of 2 speakers [annotated as FS (First Speaker) and SS (Second Speaker)], and multiple reference responses and adversarial negative responses. The underlined, purple colored words in the adversarial responses are those that overlap or are closely related to the theme or words in the context.

Total # of contexts: 19,071
Avg. # of turns per context: 3.31
Avg. # of words per context: 45.32
Avg. # of words per utterance: 13.55
# of contexts with 5 relevant responses: 19,071
# of contexts with 5 adv. irrelevant responses: 11,429
Avg. # of words per relevant response: 10.13
Avg. # of words per irrelevant response: 13.8

Table 2: DailyDialog++ dataset statistics.
3 Existing Metrics
In diesem Abschnitt, we present a brief overview
von
the existing automatic metrics used for
dialogue evaluation. The existing metrics can be
broadly classified into two categories, namely,
(i) untrained metrics and (ii) trained metrics.
Untrained evaluation metrics, usually adopted
from the NLG literature, use a predefined formula
to compare the candidate response with a reference
without taking the context into account. On the
other hand, trained metrics are usually trained
specifically for the task of dialogue response
evaluation to identify valid and invalid responses
for a given context.
3.1 Untrained Metrics
Untrained metrics can be further sub-classified
into (i) n-gram based, (ii) word embedding based,
Und (iii) contextualized embedding based metrics.
N-gram Based: N-gram based metrics score a
candidate response based on the amount of n-gram
overlap it has with a given reference. BLEU
(Papineni et al., 2002), ROUGE-L (Lin, 2004), Und
METEOR (Banerjee and Lavie, 2005) are among
the most commonly adopted n-gram based metrics
to evaluate dialogue systems. BLEU is calculated
using n-gram precision scores between the can-
didate response and the reference. ROUGE-L (Lin,
2004) is based on the F-measure of the longest
common subsequence between the candidate and
reference responses. METEOR (Banerjee and
Lavie, 2005) relaxes the exact match criteria by
including word stems, synonyms, and paraphrases.
More recently, Galley et al. (2015) proposed
deltaBLEU, which takes in multiple references
and rewards n-gram matches with positive
references and penalizes the matches with the
negative references.
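As an illustration of how such reference-based scores are computed in practice, the sketch below scores a candidate response against a single reference with sentence-level BLEU. It is a minimal example using NLTK (not the evaluation code used in this paper), and the smoothing choice is an assumption made only to keep short sentences from scoring zero.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "I paint in my free time".split()
candidate = "I do not have time for hobbies right now".split()

# BLEU-2: n-gram precision up to bigrams, with smoothing for short sentences
smooth = SmoothingFunction().method1
bleu2 = sentence_bleu([reference], candidate,
                      weights=(0.5, 0.5), smoothing_function=smooth)
print(f"BLEU-2 = {bleu2:.3f}")
```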
Word Embedding Based: These methods use
word embeddings to compute the similarity
between the candidate response and the reference
response. The most commonly used word embed-
ding based metrics are Embedding Average
(Wieting et al., 2016), Vector Extrema (Forgues
et al., 2014), and Greedy Matching (Rus and
Lintean, 2012). Embedding Average defines a
sentence embedding as the average word embed-
ding of the constituent words. The final score is
calculated using the cosine similarity of candi-
date and reference sentence embeddings. Vector
Extrema (Forgues et al., 2014) instead computes
the sentence embedding by taking the most ex-
treme value for each dimension. In other words,
the value of the i-th dimension of the sentence
embedding is computed by taking a maximum
over the i-th dimension of all words in the
sentence. Greedy Matching (Rus and Lintean,
2012) first computes the maximum cosine
similarity that every word in the candidate
response has with any word in the reference
response. Similarly, the highest cosine similarity
for each of the reference words with any of
the candidate response words is calculated. The
similarity between the candidate response and
reference response is then computed by taking
an average of the maximum cosine similarities
computed above.
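The sketch below illustrates Embedding Average, Vector Extrema, and Greedy Matching on pre-computed word vectors. It is a minimal NumPy rendering of the definitions above; the toy embedding lookup (`emb`) is an assumption standing in for real pretrained vectors.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def embedding_average(vectors):
    # sentence embedding = mean of the word embeddings
    return np.mean(vectors, axis=0)

def vector_extrema(vectors):
    # per dimension, keep the most extreme (largest-magnitude) value over the words
    vectors = np.asarray(vectors)
    idx = np.argmax(np.abs(vectors), axis=0)
    return vectors[idx, np.arange(vectors.shape[1])]

def greedy_matching(cand_vecs, ref_vecs):
    # average of each candidate word's best match in the reference, and vice versa
    sims = np.array([[cosine(c, r) for r in ref_vecs] for c in cand_vecs])
    return 0.5 * (sims.max(axis=1).mean() + sims.max(axis=0).mean())

# toy 4-dimensional embeddings (assumed, for illustration only)
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=4) for w in "i paint swim in my free time".split()}
cand = [emb[w] for w in "i paint in my free time".split()]
ref = [emb[w] for w in "i swim in my free time".split()]

print(cosine(embedding_average(cand), embedding_average(ref)))
print(cosine(vector_extrema(cand), vector_extrema(ref)))
print(greedy_matching(cand, ref))
```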
BERTScore: Recently, Zhang et al. (2020a)
proposed BERTScore, which uses contextualized
word embeddings of the candidate and reference
sentences to compute the score. BERTScore is
similar to greedy matching but uses contextualized
embeddings from BERT instead of static word
embeddings.
3.2 Trained Metrics
ADEM: Automatic Dialogue Evaluation Model
(ADEM) (Lowe et al., 2017) uses pretrained vector
representations of the dialogue context c,
reference response r, and proposed response r̂
to compute the evaluation score as follows:

score(c, r, r̂) = (c^T M r̂ + r^T N r̂ − α)/β    (1)

where M, N ∈ R^{n×n} are learned matrices,
and α, β are scalar constants used to re-scale
scores in the range [1, 5]. The context, proposed
response and reference response are encoded using
a Hierarchical RNN (H-RNN) encoder consisting
of utterance-level and context-level RNNs. Der
H-RNN encoder is pretrained on a Twitter dataset
(Dhingra et al., 2016) in a generative setup using
the latent variable hierarchical recurrent encoder
decoder (VHRED) Modell (Serban et al., 2017).
The weight matrices, M, N, are later finetuned
for the task of dialogue response evaluation.
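A minimal NumPy rendering of Equation (1), assuming the context, reference, and candidate have already been encoded into n-dimensional vectors by the hierarchical RNN encoder (the encoder itself is not shown, and the values of n, α, and β here are placeholders, not ADEM's actual settings):

```python
import numpy as np

def adem_score(c, r, r_hat, M, N, alpha=2.0, beta=4.0):
    # Equation (1): bilinear match of the candidate with the context and the reference,
    # shifted and scaled so that the output roughly lies in the range [1, 5]
    return (c @ M @ r_hat + r @ N @ r_hat - alpha) / beta

n = 8                                   # toy encoder dimensionality (assumed)
rng = np.random.default_rng(0)
c, r, r_hat = rng.normal(size=(3, n))   # stand-ins for the H-RNN encodings
M, N = np.eye(n), np.eye(n)             # learned matrices (identity used here only as a stand-in)
print(adem_score(c, r, r_hat, M, N))
```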
RUBER: Tao et al. (2018) introduced an un-
referenced evaluation model consisting of GRU
encoders (Chung et al., 2014) to measure the
relatedness between the dialogue context and
a given response. The authors train the model
on Chinese dialogue data with the hinge loss
objective.
BERT Regressor4: Shimanaka et al. (2019)
propose a BERT based evaluation model to score
a candidate sentence based on a reference. Unlike
BERTScore, the BERT model is finetuned to
predict human judgement scores from the conca-
tenated reference and candidate sentence.
BERT+DNN5: Ghazarian et al.
(2019) use
contextualized embeddings to compute a related-
ness score between the dialogue context and
response. The best performing model of Ghazarian
et al. (2019) consists of a multilayer perceptron
that takes the concatenation of contextualized
representations of the context and response as
input. The contextualized representations are
obtained by max-pooling the respective BERT
embeddings for each token. Note that the BERT
embeddings are not finetuned.
4 Dialogue Evaluation using BERT
In the last two years, considerable success in NLP
has been driven by large pretrained transformer-
based models (Radford et al., 2019; Devlin et al.,
2019; Zhang et al., 2019). These models are
typically trained with a language model objective
and leverage large amounts of unlabeled data.
Jedoch, none of the trained metrics discussed in
the previous section leverage pretraining on large-
scale dialogue corpora. With the hope that such
pretraining should help dialog evaluation models
too, we introduce DEB (Dialog Evaluation using
BERT) which is trained using a masked language
model objective (similar to BERT) and a modified
next response prediction objective.
We set up the task of next response prediction
as one of identifying whether the given response
is a valid next response for the given context.
Formally, given a context C = {w^c_1, . . . , w^c_N}
and a response R = {w^r_1, . . . , w^r_M}, we first pass
the concatenated sequence U = {[CLS], w^c_1, . . . , w^c_N, [SEP], w^r_1, . . . , w^r_M}
through the BERT transformer and obtain H_cls ∈ R^H, the last-layer
activations corresponding to the special [CLS]
token. We then make our final next response
prediction as follows: ŷ = softmax(W H_cls),
where W ∈ R^{2×H} is a learnable matrix.
4Because we couldn't find an exact name for the evaluator model by Shimanaka et al. (2019), we adopt the name ‘BERT regressor’ from their paper's title.
5Due to the lack of a specific name for the models in Ghazarian et al. (2019), we refer to the model adopted from their work as ‘BERT+DNN’.
We use cross entropy loss with binary targets for the
next-response prediction. Zusätzlich, we use the
standard masked language model objective by
randomly masking 15% of the words in C and R.
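The next-response prediction head can be sketched as follows with the HuggingFace Transformers library (an assumption; the paper does not specify the implementation). This is a minimal illustration of ŷ = softmax(W H_cls), using a generic bert-base-uncased checkpoint as a stand-in for the Reddit-pretrained DEB weights; the masked language model objective and the released training code are not reproduced here.

```python
import torch
from transformers import BertTokenizer, BertModel

class DEBScorer(torch.nn.Module):
    """Scores whether a response is a valid continuation of a context."""

    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        # W in the paper: maps the [CLS] representation to 2 classes (invalid / valid)
        self.W = torch.nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        h_cls = out.last_hidden_state[:, 0]          # H_cls: last-layer [CLS] activations
        return torch.softmax(self.W(h_cls), dim=-1)  # y_hat

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = DEBScorer()
enc = tok("I like dancing and swimming, what about you?",
          "I paint in my free time.", return_tensors="pt")
with torch.no_grad():
    probs = model(**enc)
print(probs[0, 1].item())  # probability that the response is a valid continuation
```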
Note that the proposed model is a straight-
forward extension of the standard BERT model
used for language modeling. We do not claim any
novelty on this front. The key contribution here
is to assess if pretraining on large-scale dialogue
corpora improves the performance of dialogue
evaluation metrics. Existing BERT-based evalua-
tion metrics (Shimanaka et al., 2019; Ghazarian
et al., 2019) do not use such pretraining on
any large-scale, domain-related corpora. In other
Wörter, they do not leverage the more successful
recipe of (i) pretraining with a masked language
modeling objective and (ii) finetuning with a task-
specific objective (dialog evaluation in this case).
The idea behind DEB is to check if this successful
recipe can be replicated for dialog evaluation,
making use of the dialogues in the large-scale
Reddit corpus.
4.1 Training Details
For pretraining, we use a massive open-domain
dialogue dataset of Reddit comments from 2005
Zu 2019 consisting of 256M threads with a
total of 3.68B comments. From this dataset,
we extracted a total of 727M {Kontext, positive
response} pairs with 654M for training and 73M
for testing following the method described in
Henderson et al. (2019). We used an equal number
of negative responses by randomly sampling
responses from other contexts. We use the BERT
base model with 110M parameters consisting of
12 layers, 768 dimensional hidden space, Und 12
attention heads per layer in all our experiments.
We finetune the pretrained DEB model on our
DailyDialog++ dataset for 1 Epoche (we did not
see any advantage of finetuning beyond 1 Epoche).
Note that during finetuning we only use the next
response prediction objective.
5 Experimental Setup
Our goal is to check if the adversarial responses
in our dataset, which are specifically crafted
to target context-dependent model-based metrics
(such as ADEM, RUBER, BERT+DNN, and DEB),
indeed affect the performance of such models.
To do so, we first need to benchmark
the models' performance on random negatives
and then check if the performance drops when
evaluated on adversarial examples. Thus, in this
section, we describe (i) the process of creating
and validating such random negatives and (ii) the
process used for training model-based metrics.
We randomly divide our dataset
into train
(80% contexts), validation (10% contexts), Und
test (10% contexts) splits. Note that adversarial
negatives are not used for training or finetuning
the models unless explicitly specified.
5.1 Creating & Validating
Random Negatives
For every context in our dataset, which has 5
relevant responses, we also sample 5 random
negatives. While sampling random negatives, Wir
avoid short responses that may be generic and
relevant for any context. To verify whether the
sampled random negatives were indeed irrelevant,
we asked human annotators to manually check 500
such sampled responses. More specifically, we
showed them the original context and the sampled
random negative response and asked them if it
was a relevant or irrelevant response. In 95%
of the cases, the annotators confirmed that the
random negative response was irrelevant, thereby
confirming that a random sampling strategy indeed
results in irrelevant responses (although they may
not be as hard as our adversarial negative examples
as shown later).
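A minimal sketch of the random-negative sampling described above, assuming the dataset is a mapping from a context id to its list of relevant responses; the minimum-length filter used to skip short generic utterances is an assumed value, not the exact rule used by the authors.

```python
import random

def sample_random_negatives(dataset, context_id, k=5, min_words=6, seed=0):
    """Sample k responses from other contexts to serve as random negatives."""
    rng = random.Random(seed)
    pool = [resp
            for cid, responses in dataset.items() if cid != context_id
            for resp in responses
            if len(resp.split()) >= min_words]  # skip short, generic responses
    return rng.sample(pool, k)

dataset = {
    "c1": ["I paint in my free time and sometimes sell my work.",
           "I do not have time for hobbies right now unfortunately."],
    "c2": ["The meeting got moved to Thursday because of the holiday."],
    "c3": ["My car broke down on the highway again this morning."],
}
print(sample_random_negatives(dataset, "c1", k=2))
```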
5.2 Pretraining & Finetuning
Trained Metrics
We describe the pretraining and finetuning
procedure for the various models used in our
analysis below.
ADEM: As previously mentioned in Section 3,
ADEM was pretrained on Twitter corpus using
the VHRED setup and then finetuned for dialogue
response evaluation. We take this publicly
available model and finetune it further using
our DailyDialog++ dataset with a target of 5 for
positive responses and 1 for random negatives.
The reference response could be any of the
other four relevant responses. Note that ADEM
produces a score on a scale of 1 to 5 whereas the
other models produce a score on a scale of 0 to
1. For easier comparison, we scale the output of
ADEM so that it lies in the range of 0 to 1.
BERT regressor: We finetune the publicly
available pretrained BERT base model (110M
parameters) on our DailyDialog++ dataset. We
train the model with a label of 1 for positive
responses and 0 for random negative responses
using any one of the other four positive responses
as the reference. We train the model using
cross-entropy loss and follow the same set of
hyperparameters as used by Shimanaka et al.
(2019) during finetuning.
BERT+DNN: We use the best performing model
from Ghazarian et al. (2019), which consists of
a three-layered feed-forward neural network and
uses pretrained BERT embeddings as input. Wir
train the model on our DailyDialog++ dataset with
random negatives using cross entropy loss.
RUBER and RUBER-Large: We experiment
with two variants of Tao et al.'s (2018) models
with different sizes, namely (i) RUBER (34M
parameters), which consists of single-layer GRUs
with a hidden size of 1024, and (ii) RUBER-
Large (236M parameters), which consists of two-
layered GRUs with a hidden size of 2048. As
shown in Vaswani et al. (2017), the training
time for RNN-based architectures is very high
when compared with the transformer models that
allow much greater parallelization. We observed
an estimated time of over 200 days to train
the RUBER-Large model on the 727M Reddit
corpus on a 1080ti GPU,
thereby making it
practically infeasible to train such models on
large-scale datasets. Taking the computational
costs into consideration, we pretrained RUBER
and RUBER-Large on a sample of 20M contexts
with relevant and random irrelevant responses
from Reddit. We then finetuned these models on
our proposed dataset with random negatives.6
DEB: We pretrained DEB on the entire 727M
Reddit corpus using the masked language model
and the modified next
response prediction
objective. Pretraining DEB took 4 days on a
single Google Cloud TPUv2. We achieved a test
accuracy of 90% on the next response prediction
task and a perplexity of 15.47 (58% accuracy)
on the masked language modeling task in the
pretraining corpus. We then finetuned DEB on
our dataset with random negatives.
6We agree that this may not be a fair comparison but we
were constrained by the inherent limitations of such RNN-
based, sequential models that make large-scale pretraining
prohibitively expensive and time-consuming.
5.3 Untrained Metrics with
Multiple References
Untrained metrics like METEOR, Greedy Match-
ing, und so weiter, usually work with a single ref-
erence response but can also be adapted to work
with multiple reference responses. Zum Beispiel,
for a given candidate response c and a set of refer-
ence responses r1, r2, r3, …, rk, we can compute
the multi-reference METEOR score as:
METEOR_multi = max_{i=1..k} METEOR(c, r_i)
Instead of the max function we can also use the
average function. We use a similar formula for all
the untrained metrics.
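A small sketch of this aggregation, wrapping any single-reference scoring function (here sentence-level BLEU-1 from NLTK as a stand-in) so that it can be applied with multiple references using either the max or the average:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def multi_ref_score(metric, candidate, references, aggregate=max):
    """Apply a single-reference metric against each reference and aggregate."""
    return aggregate(metric(candidate, ref) for ref in references)

def bleu1(candidate, reference):
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference.split()], candidate.split(),
                         weights=(1.0,), smoothing_function=smooth)

refs = ["I paint in my free time",
        "I do not have time for hobbies right now"]
cand = "I usually paint in my spare time"

print(multi_ref_score(bleu1, cand, refs, aggregate=max))
print(multi_ref_score(bleu1, cand, refs,
                      aggregate=lambda scores: sum(scores) / len(refs)))
```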
A few metrics like BLEU, deltaBLEU, Und
ROUGE-L have their own standard formula to
incorporate multiple references. BLEU calculates
the number of matches for each n-gram based
on the maximum number of times the n-gram
occurs in common with any one of the references.
deltaBLEU further extends the same idea to
incorporate a score for each reference. We follow
the implementation from Galley et al. (2015) Zu
compute the deltaBLEU scores. For ROUGE-L,
we follow the strategy in Sharma et al. (2017),
where the score is an F-measure of the maximum
precision and maximum recall over all the refer-
zen. In addition to the average and maximum
aggregations, we also report these standard multi-
reference scores for BLEU, deltaBLEU, Und
ROUGE-L.
6 Results
In diesem Abschnitt, we compare the performance of
different dialog evaluation metrics in separating
relevant references from (i) random negatives,
(ii) synthetically crafted adversarial irrelevant
responses (explained below), and (iii) manually
crafted adversarial irrelevant responses (as in our
DailyDialog++ dataset).
6.1 Performance on Random Negatives
For every context in our test split, we obtain the
scores assigned by a given metric to the 5 positive
Und 5 random negative responses. Insbesondere,
we treat each of the 5 relevant and 5 random
irrelevant responses as a candidate response. Für
all untrained metrics other than deltaBLEU, we
consider the remaining 4 relevant responses as
reference responses. For deltaBLEU, we consider
the remaining 4 relevant responses as references
with a score of 1 and the remaining 4 irrelevant
responses as references with a score of −1.

Point Biserial Correlation (all individual p-values <1e-9):
Metric                                  | Single | Avg  | Max  | Standard
BLEU-1                                  | 0.26   | 0.42 | 0.41 | 0.41
BLEU-2                                  | 0.22   | 0.39 | 0.36 | 0.40
BLEU-3                                  | 0.14   | 0.26 | 0.24 | 0.28
BLEU-4                                  | 0.08   | 0.17 | 0.15 | 0.18
METEOR                                  | 0.23   | 0.40 | 0.41 | −
ROUGE-L                                 | 0.23   | 0.41 | 0.40 | 0.37
deltaBLEU (Galley et al., 2015)         | −      | −    | −    | 0.29
Embed Avg                               | 0.23   | 0.25 | 0.23 | −
Vec Extr (Forgues et al., 2014)         | 0.24   | 0.35 | 0.33 | −
GreedyMatch (Rus and Lintean, 2012)     | 0.24   | 0.36 | 0.32 | −
BERTScore (Zhang et al., 2020a)         | 0.29   | 0.39 | 0.39 | −

Accuracy in percentage:
Metric                                  | Single | Avg   | Max   | Standard
BLEU-1                                  | 61.26  | 68.75 | 68.60 | 70.36
BLEU-2                                  | 58.09  | 68.37 | 68.26 | 68.66
BLEU-3                                  | 53.11  | 58.90 | 58.85 | 58.89
BLEU-4                                  | 51.16  | 53.56 | 53.56 | 53.50
METEOR                                  | 59.77  | 68.01 | 68.51 | −
ROUGE-L                                 | 59.47  | 68.25 | 67.89 | 68.43
deltaBLEU (Galley et al., 2015)         | −      | −     | −     | 64.89
Embed Avg                               | 61.27  | 62.67 | 61.56 | −
Vec Extr (Forgues et al., 2014)         | 59.22  | 63.90 | 63.70 | −
GreedyMatch (Rus and Lintean, 2012)     | 60.02  | 65.56 | 63.99 | −
BERTScore (Zhang et al., 2020a)         | 63.71  | 68.59 | 69.05 | −

Trained metrics:
Metric                                  | PBC    | Accuracy (%)
ADEM (Lowe et al., 2017)                | 0.40   | 64.74
BERT regressor (Shimanaka et al., 2019) | 0.52   | 73.40
BERT+DNN (Ghazarian et al., 2019)       | 0.57   | 74.67
RUBER (Tao et al., 2018)                | 0.64   | 78.18
RUBER-Large (Tao et al., 2018)          | 0.69   | 82.36
DEB (ours)                              | 0.79*  | 88.27*

Table 3: Automatic evaluation metrics performance on random negatives. PBC refers to point-biserial correlation; all individual correlations have p-values <1e-9. Column subheading ‘Single’ refers to experiments using a single reference response, ‘Avg’ and ‘Max’ are the average and maximum aggregation strategies when using multiple reference responses, and ‘Standard’ is applicable when the metric aggregates multiple references differently. * indicates statistical significance in performance over all other metrics (with p-values <1e-9) on William's test for comparing correlations and the Chi-squared test for accuracies.

We
expect a good evaluation metric to provide high
scores on relevant responses and low scores on
the irrelevant responses. We then quantify the
performance of all metrics using two measures.
First, we compute the Point Biserial correlation
(PBC) between the scores assigned by a metric
and the binary target i.e., a score of 1 for positive
responses and 0 for random negative responses.7
Second, we compute the classification accuracy
of the metric by using a threshold and marking all
responses having a score above this threshold as
positive and others as negative. We use a threshold
of 0.5 for the trained metrics. For all the untrained
metrics, we perform a search from 0 to 1 with
step size of 0.01 and select the threshold that
minimizes the error rate on the validation set.8
Later in Section 6.1.1, we shall observe that if
we use 0.5 as the threshold, the performance of
7Note that it can be shown that PBC is equivalent to the
Pearson correlation when one of the variables is binary, as is
the case above.
8With this approach of setting a threshold, we want to
be lenient with the untrained metrics and investigate how
best they can be adopted. One might also think of using the
median of all the scores assigned by a metric as its threshold,
however, such an approach is error-prone and has several
boundary conditions that would fail the purpose. We hence
estimate the threshold by minimizing the risk.
most untrained metrics would be abysmally poor.
Note that for the trained metrics we found that
the scores were spread evenly in the range of 0
to 1 and there was no benefit of doing a grid
search to find the threshold—a threshold of 0.5
was adequate.
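These two measures can be computed as in the sketch below (a minimal illustration with SciPy and NumPy, not the authors' evaluation code). As the footnote above notes, PBC reduces to the Pearson correlation when one variable is binary, which is what scipy's pointbiserialr computes; the grid-searched threshold for untrained metrics is selected on the validation scores as described.

```python
import numpy as np
from scipy.stats import pointbiserialr

def evaluate_metric(scores, labels, threshold=None, val_scores=None, val_labels=None):
    """labels: 1 for relevant responses, 0 for (random) negatives."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pbc, p_value = pointbiserialr(labels, scores)
    if threshold is None:
        # grid search on the validation split: pick the threshold minimizing error rate
        grid = np.arange(0.0, 1.01, 0.01)
        errors = [np.mean((np.asarray(val_scores) >= t) != np.asarray(val_labels))
                  for t in grid]
        threshold = grid[int(np.argmin(errors))]
    accuracy = np.mean((scores >= threshold) == labels)
    return pbc, p_value, accuracy, threshold

# toy usage: a trained metric is evaluated with the fixed 0.5 threshold
rng = np.random.default_rng(0)
labels = np.repeat([1, 0], 100)
scores = np.clip(labels * 0.8 + rng.normal(0.1, 0.2, size=200), 0, 1)
print(evaluate_metric(scores, labels, threshold=0.5))
```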
In Table 3, we report PBC and accuracy of
the different untrained metrics with both single
and multiple references, and the trained metrics.
When evaluating using single references, we use
any one of the 5 relevant responses as a reference
response (other than the one being used as
a candidate). We observe that with a single
reference, all the untrained metrics are poor at
distinguishing between the positive and random
negative responses as inferred from the low
accuracy and correlation values. When we use
multiple responses, we observe a relatively better
performance. We notice that the performance is
largely similar across the aggregation techniques:
average, maximum, and standard (when appli-
cable). Metrics such as BLEU-1, METEOR,
ROUGE-L, and BERTScore with multiple refer-
ences are able to achieve modest correlations
with the binary target. Interestingly, we observe
that all the word embedding based methods, even
in the presence of multiple references, perform
badly in scoring the positive and random negative
responses. In contrast, trained metrics such as
BERT regressor, RUBER, BERT+DNN, and DEB
perform substantially better than the untrained
metrics. Our proposed DEB model achieves
state-of-the-art performance, with an accuracy of
88.27% and a strong correlation of 0.79.

Figure 1: Box plots of the scores given by various metrics to the positive and random negative responses.
6.1.1 Analysis using Box Plots
We now visualize the box plots of the scores given
by the various metrics to the positive and random
negative responses. Figure 1 shows these box plots
for the multi-reference untrained metrics (max
aggregation) and the trained metrics. We observe
several shortcomings of the untrained metrics.
Firstly, all the untrained metrics have a significant
overlap in the interquartile range of the positive
and random negative scores, implying that there
is a high degree of intermixing of scores given
to the positive and random negative responses.
The overlap is even higher for word embedding
based metrics, which obtain low point biserial
correlations. Secondly, we note that the score
distributions of the untrained metrics are highly
skewed. For instance, the scores of BERTScore
are almost always greater than 0.75 even though
it scores responses in the range [0,1]. Therefore,
it is difficult to tell at what value of the metric a
response can be safely considered relevant. These
observations suggest that untrained metrics even
with multiple references cannot be reliably used
to score dialogue responses.
For the ADEM evaluation model, we observe
that it outputs scores close to the mean score of 0.5
with little spread in their values. Sai et al. (2019)
also made a similar observation about the clustering
of the scores around the mean in ADEM, which
they explain using linear system theory. In BERT
regressor, there is a high overlap in the scores
given to positives and random negatives. We
further observe that the RUBER and BERT+DNN
are able to better distinguish the positive and
random negative responses. Although there is
separation in the interquartile range for the two
classes in RUBER and BERT+DNN scores, there
is a greater spread within each class and a lot of
points of the two classes substantially overlap.
RUBER-Large is able to reduce the overlap,
while DEB further achieves better performance
by pushing the scores for positive responses close
to 1 and the scores for random negatives to 0 with
high accuracy. We shall show in Section 7.3 that
DEB achieves this by pushing the Hcls embed-
dings for the positive and random negative res-
ponses farther apart in space.
6.2 Performance on Synthetically Crafted
Adversarial Responses
Due to space constraints, in the remainder of
this section we present results only for the best
performing evaluation metrics from Table 3,
namely, BERT+DNN, RUBER, RUBER-Large,
and DEB. Before evaluating them using the
adversarial examples in our dataset, we first
investigate the performance of the models with
synthetically crafted adversarial attacks, similar
to Sai et al. (2019). In particular, we perform
simple transformations on relevant responses by
(i) jumbling words in the sequence, (ii) reversing
the sequence, (iii) dropping all words except
nouns, (iv) dropping all stop words, (v) dropping
punctuation, and (vi) replacing words with
synonyms. These results are presented in Table 4.

Modification            | DEB           | RUBER-Large  | RUBER        | BERT+DNN
% classified as positive
Unmodified positives    | 87.9%         | 81.7%        | 77.5%        | 93.5%
Reverse word order      | 60.0%         | 70.3%        | 71.3%        | 80.4%
Jumble word order       | 69.3%         | 71.2%        | 72.3%        | 77.4%
Retain only nouns       | 60.1%         | 27.9%        | 27.8%        | 0.0%
Remove punctuation      | 86.4%         | 72.9%        | 72.4%        | 88.5%
Remove stopwords        | 85.8%         | 73.6%        | 69.6%        | 29.3%
Replace with synonyms   | 81.2%         | 70.8%        | 65.6%        | 91.1%
Pearson Correlation with human scores
Remove stopwords        | 0.58 (<1e-9)  | 0.57 (<1e-9) | 0.56 (<1e-9) | 0.056 (0.26)
Replace with synonyms   | 0.68 (<1e-9)  | 0.54 (<1e-9) | 0.52 (<1e-9) | −0.017 (0.67)

Table 4: Fraction of responses classified as positives with synthetic modifications. Unmodified positives are presented in the 1st row for reference (p-values for individual correlations in brackets).
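The synthetic transformations listed above are simple to script; a minimal sketch is shown below (with a small inline stop-word list rather than a full resource; dropping to nouns and synonym replacement would additionally need a POS tagger and a lexical resource, which are omitted here).

```python
import random
import string

STOPWORDS = {"a", "an", "the", "is", "are", "do", "you", "i", "it", "to", "of", "and"}

def reverse_words(resp):
    return " ".join(reversed(resp.split()))

def jumble_words(resp, seed=0):
    words = resp.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def remove_stopwords(resp):
    return " ".join(w for w in resp.split() if w.lower() not in STOPWORDS)

def remove_punctuation(resp):
    return resp.translate(str.maketrans("", "", string.punctuation))

resp = "Believe it or not, I can do 30 push-ups a minute."
for fn in (reverse_words, jumble_words, remove_stopwords, remove_punctuation):
    print(fn.__name__, "->", fn(resp))
```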
The modifications of reversing and jumbling
the word order in a relevant response make it
irrelevant (grammatically wrong) and hence we
expect to see more of the original true positives
get classified as negatives. BERT+DNN classifies
a majority of these responses as positives. One
possible reason for this is that their model
only uses a max pooled aggregation on BERT
embeddings and does not explicitly model the
sequential order of words. On the other hand,
DEB fares better than the other models as seen
by the drop in fraction of responses identified
as positives. However, RUBER variants and
BERT+DNN do better than DEB when retaining
only nouns in a response. On removing punc-
tuation, we expect
that most of the positive
responses without punctuation would remain
positive and hence the percentage of responses
marked positive should remain about the same.
In this case, both DEB and BERT+DNN perform
better than the RUBER models. For the modifi-
cations of removing stopwords and replacing
words with synonyms, it is hard to generalize the
trend that is observed. Hence, we perform human
evaluations by presenting in-house annotators
with contexts and modified responses. We ask
them to provide scores in the range 0 to 3, with
higher scores meaning better responses. We obtain
human scores on 400 samples for this task and
compute the Pearson correlation of the model
predictions with the human judgements. In this
case, we find DEB is better correlated with human
judgements on both the modifications.

Figure 2: Accuracy of different models in identifying adversarial and random negatives versus positive responses.
6.3 Performance of Model-Based Metrics on
Manually Crafted Adversarial Responses
So far we have established that (i) untrained
metrics perform poorly compared to trained met-
rics even for separating random negatives from
positives, (ii) trained models like RUBER, BERT+
DNN, RUBER-Large, and DEB perform remark-
ably well in distinguishing relevant responses
from random responses, and (iii) RUBER variants and
DEB perform well on most synthetically mutated
responses whereas BERT+DNN performs poorly
against certain mutations. However, we still need
to check if the trained models are robust to adver-
sarial examples which are specifically crafted to
fool such context-dependent, model-based met-
rics. Note that none of the untrained metrics are
context dependent as they directly compute the
similarity between the reference and candidate
response without considering the context.
We consider the 5 relevant and the 5 adversarial
irrelevant responses in our dataset and just as
before compute the scores assigned by the dif-
ferent metrics to each of these responses. We
then compute the accuracy of a metric using the
target label as 0 for irrelevant responses and 1
for relevant responses. As expected, the accuracy
of all the models drops, as seen in Figure 2. In
particular, we observe that the models wrongly
classify most of the irrelevant responses as
positive/relevant responses. This can be seen from
the confusion matrices in Table 5, where it is clear
that the number of false positives is very high.
                 Positive vs Random negatives   | Positive vs Adversarial negatives
Model            TP    FN    FP    TN           | TP    FN    FP    TN
BERT+DNN         5337  373   2520  3190         | 5337  373   4179  1531
BERT regressor   3442  1126  1304  3264         | 3442  1126  1837  2731
RUBER            4420  1280  1207  4493         | 4420  1280  2714  2986
RUBER-Large      4659  1041  970   4730         | 4659  1041  2500  3200
DEB              5011  689   646   5054         | 5011  689   3101  2599

Table 5: Confusion matrix showing changes in the performance of different models on DailyDialog++ with random and adversarial negatives.
Model                                  | Pos vs Rand Neg | Pos vs Adv Neg
BERT original                          | 72.65           | 58.10
DEB pretrained on Reddit               | 84.16           | 59.82
Pretrained DEB finetuned on rand neg   | 88.29           | 66.75

Table 6: Ablation studies on DEB.
7 Discussion
In this section, we do further analysis of DEB.
7.1 Ablation Studies on DEB
There are different stages of training our DEB
model. First, the underlying BERT model is
already pretrained on English Wikipedia and the
BooksCorpus. We then pretrain it further for our
task using Reddit corpus and finally finetune it
on the DailyDialog++ dataset. We now evaluate
the contributions of each of these stages of
training (see Table 6). First, we find that the
original BERT model when adopted directly for
the task of dialog evaluation gives an accuracy of
72.65% and 58.10% on random and adversarial
negatives respectively. On further analysis, we
find that it has a high false positive rate, with
more than 52% of
the adversarial negatives
getting classified as positives. After pretraining
it with Reddit data, it achieves an accuracy of
84.16% on DailyDialog++ even though it has
not seen any training instances from this dataset.
Model                     Training/Finetuning Data | Pos vs Rand Neg | Pos vs Adv Neg
BERT regressor            Rand neg                 | 73.40           | 67.57
                          Adv neg                  | 69.89           | 75.92
                          Rand + Adv neg           | 72.77           | 74.55
BERT+DNN                  Rand neg                 | 74.67           | 60.14
                          Adv neg                  | 60.49           | 87.67
                          Rand + Adv neg           | 73.87           | 86.61
RUBER (Pretrained)        Rand neg                 | 78.18           | 64.96
                          Adv neg                  | 70.82           | 76.50
                          Rand + Adv neg           | 75.11           | 83.88
RUBER-Large (Pretrained)  Rand neg                 | 82.35           | 68.94
                          Adv neg                  | 63.99           | 90.49
                          Rand + Adv neg           | 79.91           | 86.54
DEB (Pretrained)          Rand neg                 | 88.29           | 66.75
                          Adv neg                  | 86.24           | 82.04
                          Rand + Adv neg           | 88.67           | 92.65

Table 7: Accuracy in classifying Pos vs Rand Neg and Pos vs Adv Neg responses for various model variants trained/finetuned on DailyDialog++.
However, there is only a marginal improvement on
adversarial negatives. Finally, finetuning BERT
on DailyDialog++ using only random negatives
further improves the accuracy to 88.29% and
66.75%, respectively.
7.2 Training with Adversarial Examples
We examine whether the evaluation models can
learn to distinguish the adversarial negatives when
specifically finetuned for that task. By training on
DailyDialog++ with adversarial negatives rather
than random negatives, we find that all models
give an improved performance in identifying
adversarial negatives (see Table 7). However,
with such training, every model’s performance
drops when evaluated on DailyDialog++ with
random negatives, with BERT+DNN dropping
substantially to 60.49%. The best overall perform-
ance is seen when the models are finetuned with
both random and adversarial negatives, with DEB
achieving the highest accuracies on both test sets.
While such improvement is expected given the
capacity of the models, obtaining such adversarial
examples for training is not always feasible.
Effect of the Number of Adversarial Negatives
Added to Training: Because of the difficulty
in manually creating adversarial examples, we
study the effect of the number of the adversarial
examples added to the training set. Our findings
are presented in Figure 3, where we progressively
increase the percentage of adversarial negative
examples added as input to the DEB model during
training with random negatives. As expected,
the accuracy in identifying adversarial negatives
improves as the model is exposed to more data
points of the same type, where we specifically
note the considerable improvement from 45.6%
to 70.85% after adding just 1% of adversarial
negatives from our dataset (i.e., 100 contexts with
5 adversarial examples each). With the addition of
more adversarial negatives, we find a small drop
in the accuracy of identifying random negatives.
There is also a slight decrease in the performance
on the positive responses when the number of
adversarial examples is small. We note that the
adversarial negatives are hard negatives close to
the positive responses in the embedding space, as
we elaborate in Section 7.3, thereby confusing the
model.

Figure 3: Effect of varying the amount of adversarial negatives added to the training set.
7.3 Conicity Analysis on DEB
We analyze the embeddings from the final
embeddings projection space, that is, the one used
by the softmax layer for next response prediction.
We check for the spread of the embeddings of
the positive and negative responses. Specifically,
let P, R, and A be the set of embeddings of all
positive responses, random negative responses,
and adversarial negative responses respectively
for a given context. We want that if we consider
the set P then the spread of this set should be
low in the projected space (all positive responses
embedded close to each other). At the same time, if
we consider the union of the sets P, R, and A then
the spread of this set should be high (positive
responses separated from negative responses).
We measure this spread using conicity analysis
(Chandrahas et al., 2018). Conicity on a set of
vectors V is defined as the average of the cosine
similarity of the vectors with their mean vector, v̄.
The lower the conicity, the higher the spread.
For each utterance in DailyDialog++, we first
construct the sets P, R, and A using the pretrained
DEB model. We find that the average conicity of
the set P is 0.89 (averaged over all utterances),
indicating that the positive responses get mapped
very close to each other. The average conicity
of the set P ∪ R is 0.59, indicating that the
positive responses are well separated from the
random negatives. However, the average conicity
of the set P ∪ A is 0.74, indicating that the
positive responses are not well separated from
the adversarial negative responses. We illustrate
this in Figure 4a by representing the mean
vector of each of the sets along a corresponding
highlighted region where the vectors of the set lie
on average.9 We then finetune the DEB model on
the DailyDialog++ dataset. Once again, for every
utterance we construct the sets P, R, and A using
this finetuned model. We now observe that the
average conicity of the sets P, P ∪ R, and P ∪ A
are 0.86, 0.37, and 0.35 respectively. Thus, after
finetuning, the model is able to achieve a clear
separation between positive responses and random
or adversarial negative responses. Furthermore,
the positive responses are still close to each other
(illustrated in Figure 4b).

9Note that separation of cones in the figure does not indicate complete separation of all the vectors between the sets, rather separation on average, as there could be some overlap or outliers, as evident from the model's performance in various experiments.
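Conicity as used above can be computed with a few lines of NumPy; this is a minimal sketch of the definition (average cosine similarity of each vector to the mean vector of the set), not the authors' analysis code, and the toy embeddings are assumptions.

```python
import numpy as np

def conicity(vectors):
    """Average cosine similarity between each vector and the mean vector of the set."""
    V = np.asarray(vectors, dtype=float)
    mean = V.mean(axis=0)
    cos = (V @ mean) / (np.linalg.norm(V, axis=1) * np.linalg.norm(mean) + 1e-8)
    return float(cos.mean())

rng = np.random.default_rng(0)
P = rng.normal(loc=1.0, scale=0.1, size=(5, 16))   # tightly clustered "positive" embeddings
R = rng.normal(loc=-1.0, scale=0.1, size=(5, 16))  # a well-separated "random negative" set

print(conicity(P))                    # high: low spread within the positives
print(conicity(np.vstack([P, R])))    # lower: high spread across the union
```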
Figure 4: Illustration of the spread of the positive and negative response embeddings by DEB (not to scale).
8 Generalization to Other Datasets

In this section, we investigate how well the
different model-based metrics trained on
DailyDialog++ generalize to other datasets that
are not seen during training. We evaluate the 3
unreferenced models, BERT+DNN, RUBER, and
DEB, which require only context and candidate
response as inputs, on these 3 datasets.

Twitter: Microsoft Research Social Media
Conversation Corpus (Sordoni et al., 2015) con-
tains a curated list of 3-turn Twitter conversations,
all of which are human-verified as good responses.

PersonaChat: The dialogues in PersonaChat
(Zhang et al., 2018) are associated with well-
defined personalities of the speakers involved.
We consider the verified human-human chat logs,
released by See et al. (2019), as positive examples.

Model        | Persona | Twitter | Holl-E
BERT+DNN     | 54.60   | 48.71   | 71.01
RUBER        | 54.83   | 71.18   | 61.17
RUBER-Large  | 55.94   | 77.18   | 62.32
DEB          | 62.74   | 82.71   | 78.55

Table 8: Transferability to other datasets.
Holl-E: This dataset
(Moghe et al., 2018)
contains conversations about movies, where each
response is generated by copying and modifying
content from a relevant background document.
We use the multi-reference test set of Holl-E
containing 4 positive responses for each context.
For all the 3 datasets, we consider the reference
responses as positive responses and obtain nega-
tive examples by randomly sampling responses
from other contexts. We reiterate that we do not
train the models on these datasets but simply
evaluate the models trained on DailyDialog++
on these datasets. Table 8 shows that DEB
outperforms the other unreferenced models on all
the 3 datasets. With Holl-E dataset being specific
to conversations about movies rather than generic
topics, we find the scores are relatively lower on it
for all the models. The other evaluation models and
metrics cannot be compared on PersonaChat and
Twitter without additional reference responses,
since the available single reference in these
datasets is being evaluated. On the multi-reference
test set of Holl-E, however, we find that their
performance is lower than the three unreferenced
models.
9 Correlations with Human Judgments
on System Generated Responses
Lastly, we wanted to check if DEB scores correlate
well with scores assigned by humans on responses
generated by dialogue systems (as opposed to
humans). To do so, we collected responses gener-
ated by the following five dialogue response
generation models:
HRED: Hierarchical Recurrent Encoder De-
coder (HRED) (Serban et al., 2016) extends the
traditional seq2seq model by adding an additional
utterance-level RNN.
VHRED: Latent Variable HRED (VHRED)
(Serban et al., 2017) includes a latent variable at
the decoder, and is trained by maximizing a
variational lower-bound on the log-likelihood.
VHCR: Variational Hierarchical Conversation
RNN (VHCR) (Park et al., 2018) further extends
VHRED by drawing a prior encoding for each
conversation.
DialoGPT small: Zhang et al. (2020b) pre-
trained GPT-2-like (Radford et al., 2019) trans-
former models on 147M conversations extracted
from Reddit comments. The small version con-
tains 12 layers and 768 hidden dimensions.
DialoGPT medium: The medium version of
DialoGPT contains 24 layers and 1024 hidden
dimensions.
For the RNN-based models (HRED, VHRED,
VHCR), we use a single-layer bidirectional en-
coder and single-layer decoder each with a
hidden size of 1024. We pretrain the RNN-based models on the casual conversation subset of the Reddit dataset, consisting of 10M conversation exchanges. We finetune all the models on the DailyDialog++ dataset.

Response level
Model            Pearson           Spearman          Kendall tau
BERT+DNN         0.016 (0.73)      0.009 (0.89)      0.007 (0.88)
RUBER            0.111 (2.5e-2)    0.126 (1.1e-2)    0.090 (8.9e-2)
RUBER-Large      0.265 (<1e-7)     0.256 (<1e-6)     0.173 (<1e-6)
DEB w/o Reddit   0.356 (<1e-9)     0.295 (<1e-9)     0.202 (<1e-9)
DEB w/o DD++     0.274 (<1e-9)     0.337 (<1e-9)     0.232 (<1e-9)
DEB              0.440* (<1e-9)    0.523* (<1e-9)    0.374* (<1e-9)

System level
Model            Pearson           Spearman          Kendall tau
BERT+DNN         0.050 (0.89)      -0.100 (0.87)     0.000 (1.1)
RUBER            0.221 (0.72)      0.300 (0.62)      0.200 (0.81)
RUBER-Large      0.679 (0.20)      0.499 (0.39)      0.399 (0.483)
DEB w/o Reddit   0.784 (0.12)      0.600 (0.28)      0.400 (0.48)
DEB w/o DD++     0.855 (0.06)      0.600 (0.28)      0.400 (0.48)
DEB              0.973 (5.2e-3)    0.700 (0.18)      0.600 (0.23)

Table 9: Human correlations on DailyDialog++ data with different models (individual p-values in parentheses). * indicates statistical significance in performance over other models, with p-values <1e-6 on Williams' test.
We conducted human evaluations to compare
the extent to which the model-based metrics agree
with human judgements. We randomly sampled
100 contexts from the test set of the DailyDialog++
dataset and obtained the responses generated by
each of the above models. Annotators were shown
a context-response pair and were asked to rate
how human-like the response is with respect to
the context, on a scale of 0–3. The annotators were
asked to check for both fluency and coherence.
A total of 15 in-house annotators participated
in the human evaluation study. The annotators
were Computer Science graduates competent in
English. Each context-response pair was rated by
5 annotators and the final score was obtained
by averaging the 5 scores. We also obtained
scores at the system level by aggregating the
scores for each model. In Table 9, we report the
correlations of human judgments with the model
scores at the response level and system level.
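Concretely, the response-level and system-level correlations can be computed as sketched below with scipy; the placeholder arrays are random and only mirror the shape of our setup (100 contexts × 5 systems), so the aggregation step, not the numbers, is the point of the example.

    import numpy as np
    from scipy.stats import kendalltau, pearsonr, spearmanr

    # Placeholder data: 500 responses = 100 contexts x 5 systems.
    rng = np.random.default_rng(0)
    system_ids = np.repeat(np.arange(5), 100)       # which system produced each response
    human_scores = rng.uniform(0, 3, size=500)      # averaged 0-3 annotator ratings
    metric_scores = rng.uniform(0, 1, size=500)     # scores assigned by an automatic metric

    def correlations(x, y):
        """Pearson, Spearman, and Kendall tau as (value, p-value) pairs."""
        return {
            "pearson": pearsonr(x, y),
            "spearman": spearmanr(x, y),
            "kendall": kendalltau(x, y),
        }

    # Response level: correlate over all individual responses.
    response_level = correlations(metric_scores, human_scores)

    # System level: aggregate per system first, then correlate the 5 aggregates.
    sys_metric = np.array([metric_scores[system_ids == s].mean() for s in range(5)])
    sys_human = np.array([human_scores[system_ids == s].mean() for s in range(5)])
    system_level = correlations(sys_metric, sys_human)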
We observe that the BERT+DNN model, which
only has a feed-forward neural network that is
learnable, does not have any significant correlation
with human judgments. On the other hand,
RUBER, consisting of pretrained GRUs, obtains
low to moderate correlations. RUBER-Large
further obtains improved correlations, indicating
that using large-scale pretrained models helps.
This trend is also observed in the comparisons
of DEB with its ablated versions
(without
Reddit pretraining and without finetuning on
DailyDialog++),
indicating the contribution of
these steps in training the final model. Our
proposed DEB model obtains significantly higher
correlations at the response level. We checked for significance using Williams' test to compare DEB with all other models and found the p-values to be < 1e−6. This establishes the effectiveness of
DEB in scoring model-generated responses. At the system level, we find that DEB correlates substantially better with the human rankings of the models than the other metrics do. However, the p-values in this case are not significant due to the limited number of systems. In hindsight, we realize that reporting system-level correlations is not very informative, as the number of samples is very small (equal to the number of systems); hence, these numbers are not very reliable. However,
following Lowe et al. (2017), we still report the
system-level correlations (along with the p-values)
for the sake of completeness.
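For completeness, the form of Williams' test referred to above, which tests whether one metric's correlation with human scores is significantly higher than another's when both are computed on the same responses, can be sketched as follows. This is a generic implementation of the standard formulation (the inputs are the two metric-human correlations, the correlation between the two metrics, and the sample size), not the exact script used for the numbers reported here.

    import numpy as np
    from scipy.stats import t as t_dist

    def williams_test(r1h, r2h, r12, n):
        """One-sided test that r1h (metric 1 vs. humans) exceeds r2h
        (metric 2 vs. humans), given the inter-metric correlation r12
        and the number of paired observations n."""
        K = 1 - r12**2 - r1h**2 - r2h**2 + 2 * r12 * r1h * r2h
        rbar = (r1h + r2h) / 2
        t_stat = (r1h - r2h) * np.sqrt(
            ((n - 1) * (1 + r12))
            / (2 * K * (n - 1) / (n - 3) + rbar**2 * (1 - r12) ** 3)
        )
        p_value = 1 - t_dist.cdf(t_stat, df=n - 3)
        return t_stat, p_value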
10 Related Work
We point the reader to Serban et al. (2018) for
an excellent survey of existing datasets containing
single reference responses. Recently, there has
been some effort to create datasets containing
multiple references but these datasets are either
too small (around 1,000 contexts) (Moghe et al.,
2018; Gupta et al., 2019) or noisy (Gao et al.,
2019).
We have already reviewed all
the existing
dialog metrics in Section 3 and hence we do
not discuss them again here. Instead, we quickly
mention existing works which critically examine
dialog evaluation metrics. For example, Liu et al.
(2016) show that existing n-gram based metrics
do not correlate well with human judgements for
dialog evaluation. We report similar results but
additionally show that the correlation improves in
the presence of multiple references. Similarly, Sai
et al. (2019) have critically examined ADEM and
shown that in most cases it produces a score close
to 2.5 (on a scale of 1 to 5) and hence does not
clearly separate relevant and irrelevant responses.
Lastly, we also mention a very recent work, Zhang et al. (2020b), which pretrained a large-scale transformer on the Reddit corpus for building conversation systems. However, their focus is on dialog generation and not on evaluation metrics.
11 Conclusions
We propose a multi-reference open-domain
dialogue dataset with multiple relevant responses
and adversarial irrelevant responses. We perform
an extensive study of the existing dialogue evaluation metrics using this dataset and also propose a new transformer-based evaluator
pretrained on large-scale dialogue datasets. We
identify the strengths and weaknesses of such
a model through studies of its performance on
untrained and synthetically modified data. We
find DEB to be easily adaptable to other open-
domain dialogue datasets. We also highlight the scope of the adversarial responses in our dataset for driving the development of better evaluation metrics, since none of the current models performs well on them unless explicitly trained.
Acknowledgments
We thank the Department of Computer Science
and Engineering, IIT Madras and the Robert
Bosch Center for Data Science and Artificial
Intelligence, IIT Madras (RBC-DSAI) for pro-
viding us resources required to carry out this re-
search. We are grateful to Google for the TFRC
credits that supported our usage of TPUs for
several experiments in this paper. We also thank
Google for supporting Ananya Sai through their
Google India Ph.D. Fellowship Program. We
thank the action editor, Xiaojun Wan, and all
the anonymous reviewers for their very helpful
comments in enhancing the work. We thank the in-
house human annotators and evaluators for helping
us create the dataset.
References
Satanjeev Banerjee and Alon Lavie. 2005.
METEOR: An automatic metric for MT evalu-
ation with improved correlation with human
judgments. In Proceedings of the ACL Work-
shop on Intrinsic and Extrinsic Evaluation
Measures for Machine Translation and/or
Summarization, pages 65–72, Ann Arbor,
Michigan. Association for Computational
Linguistics.
Chandrahas, Aditya Sharma, and Partha P.
Talukdar. 2018. Towards understanding the
geometry of knowledge graph embeddings. In
Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics,
ACL 2018, Melbourne, Australia, July 15-20,
2018, Volume 1: Long Papers, pages 122–131.
Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1
/P18-1012
Junyoung Chung, Caglar Gulcehre, Kyunghyun
Cho, and Yoshua Bengio. 2014. Empirical
evaluation of gated recurrent neural networks
on sequence modeling. In NeurIPS 2014 Work-
shop on Deep Learning, December 2014.
Jacob Devlin, Ming-Wei Chang, Kenton Lee,
and Kristina Toutanova. 2019. BERT: Pre-
training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
NAACL-HLT 2019, Minneapolis, MN, USA,
June 2-7, 2019, Volume 1 (Long and Short
Papers), pages 4171–4186. Association for
Computational Linguistics.
Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick,
Michael Muehl, and William W. Cohen. 2016.
Tweet2vec: Character-based distributed repre-
sentations for social media. In Proceedings
of the 54th Annual Meeting of the Association
for Computational Linguistics, ACL 2016,
August 7-12, 2016, Berlin, Germany, Volume 2:
Short Papers. The Association for Computer
Linguistics. DOI: https://doi.org/10
.18653/v1/P16-2044
Gabriel Forgues, Joelle Pineau, Jean-Marie Larchevêque, and Réal Tremblay. 2014. Bootstrapping dialog systems with word embeddings. In NeurIPS, modern machine learning and natural language processing workshop, volume 2.
Eric N. Forsyth and Craig H. Martell. 2007.
Lexical and discourse analysis of online chat
dialog. In Proceedings of the First IEEE Inter-
national Conference on Semantic Computing
(ICSC 2007), September 17-19, 2007, Irvine,
California, USA, pages 19–26. IEEE Computer
Society. DOI: https://doi.org/10.1109
/ICSC.2007.55
Michel Galley, Chris Brockett, Alessandro
Sordoni, Yangfeng Ji, Michael Auli, Chris
Quirk, Margaret Mitchell,
Jianfeng Gao,
and Bill Dolan. 2015. deltaBLEU: A discrim-
inative metric for generation tasks with intrin-
sically diverse targets. In Proceedings of the
53rd Annual Meeting of the Association for
Computational Linguistics and the 7th Inter-
national Joint Conference on Natural Lan-
guage Processing (Volume 2: Short Papers),
pages 445–450, Beijing, China. Association
for Computational Linguistics. DOI: https://
doi.org/10.3115/v1/P15-2073
Xiang Gao, Sungjin Lee, Yizhe Zhang, Chris
Brockett, Michel Galley, Jianfeng Gao, and
Bill Dolan. 2019. Jointly optimizing diversity
and relevance in neural response generation.
In Proceedings of the 2019 Conference of the
North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, NAACL-HLT 2019, Minneapo-
lis, MN, USA, June 2-7, 2019, Volume 1 (Long
and Short Papers), pages 1229–1238. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/N19
-1125
Sarik Ghazarian, Johnny Wei, Aram Galstyan,
and Nanyun Peng. 2019. Better automatic eval-
uation of open-domain dialogue systems with
contextualized embeddings. In Proceedings
of the Workshop on Methods for Optimizing
and Evaluating Neural Language Generation,
pages 82–89. Association for Computational
Linguistics, Minneapolis, Minnesota. DOI:
https://doi.org/10.18653/v1/W19
-2310
Prakhar Gupta, Shikib Mehri, Tiancheng Zhao,
Amy Pavel, Maxine Eskénazi, and Jeffrey P.
Bigham. 2019.
Investigating evaluation of
open-domain dialogue systems with human
generated multiple references. In Proceedings of
the 20th Annual SIGdial Meeting
on Discourse and Dialogue, SIGdial 2019,
Stockholm, Sweden, September 11-13, 2019,
pages 379–391. Association for Computatio-
nal Linguistics. DOI: https://doi.org
/10.18653/v1/W19-5944, PMCID:
PMC6813692
Matthew Henderson, Paweł Budzianowski, Iñigo Casanueva, Sam Coope, Daniela Gerz, Girish Kumar, Nikola Mrkšić, Georgios Spithourakis,
Pei-Hao Su, Ivan Vulic, and Tsung-Hsien Wen.
2019. A repository of conversational datasets.
In Proceedings of the Workshop on NLP for
Conversational AI. Data available at github
.com/PolyAI-LDN/conversational
-datasets. DOI: https://doi.org/10
.18653/v1/W19-4101
Jiwei Li, Will Monroe, Tianlin Shi, Sébastien
Jean, Alan Ritter, and Dan Jurafsky. 2017a.
Adversarial learning for neural dialogue gen-
eration. In Proceedings of the 2017 Conference
on Empirical Methods in Natural Language
Processing, pages 2157–2169, Copenhagen,
Denmark. Association for Computational Linguistics.
Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li,
Ziqiang Cao, and Shuzi Niu. 2017b. Daily-
dialog: A manually labelled multi-turn dialogue
dataset. In Proceedings of the Eighth Interna-
tional Joint Conference on Natural Language
Processing, IJCNLP 2017, Taipei, Taiwan,
November 27 - December 1, 2017 - Volume 1:
Long Papers, pages 986–995. Asian Federation
of Natural Language Processing.
Chin-Yew Lin. 2004. ROUGE: A package for
automatic evaluation of summaries. In Text
Summarization Branches Out, pages 74–81,
Barcelona, Spain. Association for Computa-
tional Linguistics.
Chia-Wei Liu, Ryan Lowe,
Iulian Serban,
Michael Noseworthy, Laurent Charlin, and
Joelle Pineau. 2016. How NOT to evaluate
your dialogue system: An empirical study of
unsupervised evaluation metrics for dialogue
response generation. In Proceedings of the
2016 Conference on Empirical Methods in Nat-
ural Language Processing, EMNLP 2016,
Austin, Texas, USA, November 1-4, 2016,
pages 2122–2132. The Association for Com-
putational Linguistics.
Ryan Lowe, Michael Noseworthy, Iulian Vlad
Serban, Nicolas Angelard-Gontier, Yoshua
Bengio, and Joelle Pineau. 2017. Towards
an automatic turing test: Learning to evalu-
ate dialogue responses. In Proceedings of the
55th Annual Meeting of the Association for
Computational Linguistics, ACL 2017, Van-
couver, Canada, July 30 - August 4, Volume 1:
Long Papers, pages 1116–1126. Association for
Computational Linguistics. DOI: https://
doi.org/10.18653/v1/P17-1103
Nikita Moghe,
Siddhartha Arora,
Suman
Banerjee, and Mitesh M. Khapra. 2018.
Towards exploiting background knowledge
for building conversation systems. In Proceed-
ings of
the 2018 Conference on Empirical
Methods in Natural Language Processing,
Brussels, Belgium, October 31 - November 4,
2018, pages 2322–2332. Association for Com-
putational Linguistics. DOI: https://doi
.org/10.18653/v1/D18-1255
Kishore Papineni, Salim Roukos, Todd Ward,
and Wei-Jing Zhu. 2002. BLEU: A method for
automatic evaluation of machine translation.
In Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics,
pages 311–318, Philadelphia, Pennsylvania,
USA. Association for Computational Linguis-
tics. DOI: https://doi.org/10.3115
/1073083.1073135
Yookoon Park, Jaemin Cho, and Gunhee Kim.
2018. A hierarchical latent structure for vari-
ational conversation modeling. In Proceedings
of the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 1792–1801,
New Orleans, Louisiana. Association for Com-
putational Linguistics. DOI: https://doi
.org/10.18653/v1/N18-1162
Alec Radford, Jeffrey Wu, Rewon Child, David
Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask
learners. Technical report. OpenAI.
Alan Ritter, Colin Cherry, and Bill Dolan. 2010.
Unsupervised modeling of Twitter conver-
sations. In Human Language Technologies:
Conference of the North American Chapter of
the Association of Computational Linguistics,
Proceedings, June 2-4, 2010, Los Angeles, Cal-
ifornia, USA, pages 172–180. The Association
for Computational Linguistics.
Vasile Rus and Mihai C. Lintean. 2012. A
comparison of greedy and optimal assessment
of natural language student input using word-
to-word similarity metrics. In Proceedings of
the Seventh Workshop on Building Educational
Applications Using NLP, BEA@NAACL-HLT
2012, June 7, 2012, Montréal, Canada,
pages 157–162. The Association for Computer
Linguistics.
Ananya B. Sai, Mithun Das Gupta, Mitesh M.
Khapra, and Mukundhan Srinivasan. 2019. Re-
evaluating ADEM: A deeper look at scor-
ing dialogue responses. In The Thirty-Third
AAAI Conference on Artificial Intelligence,
AAAI 2019, The Thirty-First Innovative Appli-
cations of Artificial Intelligence Conference,
IAAI 2019, The Ninth AAAI Symposium on
Educational Advances in Artificial
Intelli-
gence, EAAI 2019, Honolulu, Hawaii, USA,
January 27-February 1, 2019, pages 6220–6227.
AAAI Press. DOI: https://doi.org/10
.1609/aaai.v33i01.33016220
Abigail See, Stephen Roller, Douwe Kiela,
and Jason Weston. 2019. What makes a
good conversation? How controllable attributes
affect human judgments. In Proceedings of
the 2019 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
NAACL-HLT 2019, Minneapolis, MN, USA,
June 2-7, 2019, Volume 1 (Long and Short
Papers), pages 1702–1723. Association for
Computational Linguistics.
Iulian Vlad Serban, Ryan Lowe, Peter Henderson,
Laurent Charlin, and Joelle Pineau. 2018. A
survey of available corpora for building data-
driven dialogue systems: The journal version.
Dialogue Discourse, 9(1):1–49. DOI: https://doi.org/10.5087/dad.2018.101

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 3776–3784. AAAI Press.
Iulian Vlad Serban, Alessandro Sordoni, Ryan
Lowe, Laurent Charlin, Joelle Pineau, Aaron C.
Courville, and Yoshua Bengio. 2017. A hier-
archical latent variable encoder-decoder model
for generating dialogues. In Proceedings of the
Thirty-First AAAI Conference on Artificial In-
telligence, February 4-9, 2017, San Francisco,
California, USA, pages 3295–3301. AAAI
Press.
Shikhar Sharma, Layla El Asri, Hannes Schulz,
and Jeremie Zumer. 2017. Relevance of
unsupervised metrics in task-oriented dialogue
for evaluating natural
language generation.
CoRR, abs/1706.09799.
Hiroki Shimanaka, Tomoyuki Kajiwara, and
Mamoru Komachi. 2019. Machine translation
evaluation with BERT regressor. ArXiv,
abs/1907.12679.
Alessandro Sordoni, Michel Galley, Michael
Auli, Chris Brockett, Yangfeng Ji, Margaret
Mitchell, Jian-Yun Nie, Jianfeng Gao, and
Bill Dolan. 2015. A neural network approach
to context-sensitive generation of conversa-
tional responses. In NAACL HLT 2015, The
2015 Conference of
the North American
Chapter of
the Association for Computa-
tional Linguistics: Human Language Tech-
nologies, Denver, Colorado, USA, May 31
- June 5, 2015, pages 196–205. The Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.3115/v1/N15-1020
Chongyang Tao, Lili Mou, Dongyan Zhao, and
Rui Yan. 2018. RUBER: an unsupervised
method for automatic evaluation of open-
domain dialog systems. In Proceedings of the
Thirty-Second AAAI Conference on Artificial
Intelligence, (AAAI-18),
the 30th innovative
Applications of Artificial Intelligence (IAAI-
18), and the 8th AAAI Symposium on Edu-
cational Advances in Artificial Intelligence
(EAAI-18), New Orleans, Louisiana, USA,
February 2-7, 2018, pages 722–729. AAAI
Press.
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the
Eighth International Conference on Language
Resources and Evaluation, LREC 2012, Istanbul,
Turkey, May 23-25, 2012, pages 2214–2218.
European Language Resources Association
(ELRA).
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017.
Attention is all you need. In I. Guyon, U. V.
Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, editors,
Advances in Neural Information Processing
Systems 30, pages 5998–6008. Curran Associates, Inc.
John Wieting, Mohit Bansal, Kevin Gimpel, and
Karen Livescu. 2016. Towards universal para-
phrastic sentence embeddings. Yoshua Bengio
and Yann LeCun, editors, In 4th International
Conference on Learning Representations, ICLR
2016, San Juan, Puerto Rico, May 2-4, 2016,
Conference Track Proceedings.
Saizheng Zhang, Emily Dinan, Jack Urbanek,
Arthur Szlam, Douwe Kiela, and Jason Weston.
2018. Personalizing dialogue agents: I have a
dog, do you have pets too? In Proceedings of
the 56th Annual Meeting of the Association
for Computational Linguistics, ACL 2018, Mel-
bourne, Australia, July 15-20, 2018, Volume 1:
Long Papers, pages 2204–2213. Associa-
tion for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/P18-1205
Tianyi Zhang, Varsha Kishore, Felix Wu,
Kilian Q. Weinberger,
and Yoav Artzi.
2020a. BERTScore: Evaluating text generation
with BERT. In 8th International Conference
on Learning Representations,
ICLR 2020,
Addis Ababa, Ethiopia, April 26-30, 2020.
OpenReview.net.
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020b. DialoGPT: Large-scale generative pre-training for conversational response generation. In
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics:
System Demonstrations, ACL 2020, Online,
July 5-10, 2020, pages 270–278. Associ-
ation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/2020
.acl-demos.30
Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin
Jiang, Maosong Sun, and Qun Liu. 2019.
ERNIE: Enhanced language representation
with informative entities. In Proceedings of the 57th Conference of the Association
for Computational Linguistics, ACL 2019,
Florence, Italy, July 28- August 2, 2019, Vol-
ume 1: Long Papers, pages 1441–1451. Asso-
ciation for Computational Linguistics. DOI:
https://doi.org/10.18653/v1/P19
-1139