Improving Dialog Evaluation with a Multi-reference Adversarial
Dataset and Large Scale Pretraining

Ananya B. Sai∗ and Akash Kumar Mohankumar∗ and
Siddhartha Arora and Mitesh M. Khapra
{ananya, miteshk}@cse.iitm.ac.in, {makashkumar99, sidarora1990}@gmail.com
Robert-Bosch Centre for Data Science and Artificial Intelligence
Indian Institute of Technology, Madras

Abstract

There is an increasing focus on model-based
dialog evaluation metrics such as ADEM,
RUBER, and the more recent BERT-based
metrics. These models aim to assign a high
score to all relevant responses and a low score
to all irrelevant responses. Ideally, such models
should be trained using multiple relevant and
irrelevant responses for any given context.
However, no such data is publicly available,
and hence existing models are usually trained
using a single relevant response and multiple
randomly selected responses from other con-
texts (random negatives). To allow for better
training and robust evaluation of model-based
metrics, we introduce the DailyDialog++
dataset, consisting of (i) five relevant re-
sponses for each context and (ii) five adver-
sarially crafted irrelevant responses for each
context. Using this dataset, we first show that
even in the presence of multiple correct refer-
ences, n-gram based metrics and embedding
based metrics do not perform well at separat-
ing relevant responses from even random
negatives. While model-based metrics perform
better than n-gram and embedding based
metrics on random negatives, their performance
drops substantially when evaluated on
adversarial examples. To check if large scale
pretraining could help, we propose a new
BERT-based evaluation metric called DEB,
which is pretrained on 727M Reddit conversa-
tions and then finetuned on our dataset. DEB
significantly outperforms existing models,
showing better correlation with human judg-
ments and better performance on random
negatives (88.27% accuracy). However, its
performance again drops substantially when

∗The first two authors worked equally towards the project.

evaluated on adversarial responses, thereby
highlighting that even large-scale pretrained
evaluation models are not robust to the
adversarial examples in our dataset. The
dataset1 and code2 are publicly available.

1 Introduction

Open-domain conversational systems are increas-
ingly in demand for several applications ranging
from personal digital assistants to entertainers
for recreation. While several automated dialogue
agents such as Siri, Alexa, Cortana, and Google
Assistant have been built and deployed, there is no
good automatic evaluation metric to measure the
quality of their conversations. Researchers have
usually adopted n-gram based metrics (Papineni
et al., 2002; Banerjee and Lavie, 2005; Lin, 2004)
or embedding based metrics (Forgues et al., 2014;
Rus and Lintean, 2012; Zhang et al., 2020a) to
compare the model’s response with a single refer-
ence. These metrics assume that a valid response
should be semantically or lexically similar to the
reference without taking the context of the con-
versation into consideration. However, in open
domain conversations, a given context can have a
wide range of possible responses that may be lex-
ically and semantically very different from each
other. For example, the context, ‘‘I like danc-
ing and swimming, what about you?’’ can be
responded to with ‘‘I paint in my free time’’ or ‘‘I
do not have time for hobbies right now’’, both of
which are valid responses. As a result, n-gram and
word embedding based metrics, which rely on lex-
ical and/or semantic match, correlate very weakly
with human judgments for dialogue evaluation
(Liu et al., 2016).

1Dataset: https://iitmnlp.github.io/DailyDialog-plusplus/.

2Code: https://github.com/iitmnlp/Dialogue-Evaluation-with-BERT.

Transactions of the Association for Computational Linguistics, vol. 8, pp. 810–827, 2020. https://doi.org/10.1162/tacl_a_00347
Action Editor: Xiaojun Wan. Submission batch: 6/2020; Revision batch: 8/2020; Published 12/2020.
© 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Given the shortcomings of context-agnostic
n-gram and embedding based metrics, the focus
has now shifted to building neural network-based,
trainable dialogue evaluation models (Lowe et al.,
2017; Tao et al., 2018; Shimanaka et al., 2019;
Ghazarian et al., 2019). Such models are trained
to identify whether a given response can be
considered as a valid continuation of the given
context or not. In other words, the model should
(i) assign a high score to all relevant responses
no matter how diverse they are and (ii) assign a
low score to all irrelevant responses, preferably
with a clear margin of separation from relevant
responses. Although there exist several open-domain
dialogue datasets (Forsyth and Martell, 2007;
Tiedemann, 2012; Ritter et al., 2010; Li
et al., 2017b) that are used for training dialogue
response generation systems, they are not suitable
for training and testing such evaluation models.
This is because these datasets have only a single
relevant response and no irrelevant responses.
Irrelevant responses can of course be generated by
sampling random utterances from other contexts,
but such examples typically do not have any
overlap with the context and hence are easier for
the model to distinguish from relevant responses
(as we will show in our results later). We refer
to the randomly sampled responses as random
negatives.

Some efforts have been made to build dialog
datasets with multiple relevant responses (i.e.,
multiple references), but these datasets are either
very small (1000 contexts) (Moghe et al., 2018;
Gupta et al., 2019) or automatically constructed
from Reddit conversations, hence, potentially
noisy (Gao et al., 2019). Further, these datasets
do not have any carefully crafted adversarial
irrelevant responses. We define an adversarial
irrelevant response as one that has a significant
word overlap with the context but is still an
irrelevant response (hence harder to identify than
randomly selected irrelevant examples, which
may not have any relation to the context). To
overcome this limitation of existing datasets, we
propose a large scale multi-reference dataset,
DailyDialog++, which is an extension of the
DailyDialog dataset. In particular, for each of
the 19K contexts derived from DailyDialog, we
collect an additional 5 reference responses with
the help of human annotators. Further, for ∼11K
contexts in DailyDialog, we also ask human
annotators to carefully craft irrelevant responses

that have a significant word overlap with the
context. This dataset will be made publicly
available and help towards better training and
more robust evaluation of dialogue evaluation
metrics.

Using this dataset, we extensively evaluate a
wide range of n-gram-based and embedding-based
metrics. In particular, we compute (i)
the correlation of these metrics with binary
human judgments and (ii) the accuracy obtained
by using the scores assigned by the metrics
to classify relevant/irrelevant responses. The
performance of these metrics improves when
presented with multiple references as opposed
to a single reference, but they still leave a lot to
be desired. On the other hand, most model-based
evaluation metrics, when trained and evaluated
using multiple relevant and random negative
responses, perform significantly better than the
n-gram-based and embedding-based methods.
However, their performance drops substantially
on the adversarial examples in our dataset.

Lastly, one could argue that dialog evaluation
metrics could be improved by pretraining on
large amounts of data. To check if this is
indeed the case, we propose a new BERT-based
evaluation metric called DEB (Dialog
Evaluation using BERT), which is pretrained on
727M Reddit conversations. Indeed, this model
performs significantly better on random negatives
with an accuracy of 88.27% in distinguishing
the positive and random negative responses. It
also correlates well with human judgments on
responses generated by five dialog generation
systems (Serban et al., 2016, 2017; Park et al.,
2018; Zhang et al., 2020b). In particular, the
Spearman rank correlation between human scores
and DEB scores is 0.52 at the response level
and 0.70 at the system level, where system-level
scores are calculated by aggregating the scores on
all responses by each system. However, once again,
when evaluated on adversarial examples from our
dataset, its performance drops substantially, underscoring
that even large-scale pretrained models are not
robust to adversarial examples.

2 Proposed Dataset

Our goal was to build a dataset with manu-
ally created multiple relevant and adversarial
irrelevant responses. For this, we wanted to start
with an existing base dataset that already has one

relevant response for every context, and then
extend it to include multiple responses. For
the base dataset, we considered several popular
datasets such as Twitter (Ritter et al., 2010),
Reddit (Henderson et al., 2019), Open Subtitles
(Tiedemann, 2012), NPS Chat (Forsyth and
Martell, 2007), PersonaChat (Zhang et al., 2018),
and DailyDialog (Li et al., 2017b). Of these,
Twitter and Reddit are generally considered noisy,
so we chose not to use either of them as the base
dataset. Similarly, Open Subtitles and NPS Chat
did not have speaker-aligned utterances, and hence
were not suitable for our purposes. We found
that the DailyDialog dataset was clean, human-written,
readily available, and covered a diverse
set of generic topics such as ordinary life, school
life, tourism, attitude & emotion, relationship,
health, work, politics, culture & education, and
finance. It contains a total of 13K conversations
with an average of 8 turns between exactly 2
speakers. Alternatively, we could have also chosen
PersonaChat, which is of a similar size and also
contains chit-chat style conversations, but we
chose the antecedent DailyDialog dataset.

For shorter conversations in DailyDialog
(having fewer than 8 turns) we collected multiple
relevant responses only for the last utterance. For
longer conversations (having 8 turns or more),
we divided the conversation into two or more
smaller chunks and collected multiple relevant
responses for the last utterance in every chunk.
In this way, from the 13K conversations3 in
DailyDialog, we were able to create 19K sub-
conversations with multiple relevant responses
for the last utterance in each sub-conversation
or context. The responses were created by in-
house annotators. Each context was shown to 2–3
annotators, and each of them was asked to generate
1–3 alternative responses for the last utterance,
capping the total number of alternative responses
to 5 (in addition to the one response already
available in DailyDialog). The annotators were
strictly instructed to avoid short generic responses
(‘‘Okay’’, ‘‘Thank you’’, ‘‘Sure’’, etc.), and write
longer meaningful responses containing at least
8–10 words. These responses were then verified

3Out of the 13K conversations released in DailyDialog,
we found that a good number of contexts were repeated,
either with slightly different spellings or through some subtle
differences such as representing numbers using digits versus
using words. We filtered out the repetitions and worked with
the remaining ∼11K contexts.

(and if needed, corrected and re-validated) by a
different set of annotators.

2.1 Adversarial Irrelevant Responses

In addition to collecting multiple relevant re-
sponses for each context, we also wanted to collect
irrelevant responses for each context. Most of the
models that are trained for the task of dialogue
evaluation (and dialogue generation) (Tao et al.,
2018; Ghazarian et al., 2019; Li et al., 2017a)
procure irrelevant responses by randomly sam-
pling responses from other contexts. Such random
negatives are often entirely out of context (un-
related) and hence are too easy for the model
to distinguish. To allow for a more critical or
adversarial examination of dialogue evaluation
systems, we propose creating adversarially crafted
irrelevant responses that have lexical or semantic
overlap with the context but are still unacceptable
as valid responses.

For obtaining such tricky negative responses,
the annotators were asked to choose some words
from the context and use them directly or indirectly
while writing the responses. Indirect usage here
refers to using words closely related to the context
words, for example, synonyms, antonyms,
homonyms, subwords, or other words that are
known to frequently co-occur with the words in
the context (e.g., the words ‘‘flexibility’’ and
‘‘injuries’’ co-occur with ‘‘acrobatics’’). Once
again, each context was shown to 2–3 annota-
tors, and each of them was asked to generate 1–3
adversarially crafted responses for the last utter-
ance, capping the total number of alternative
responses to 5. Each response was then validated
by two different annotators. The validating an-
notators were instructed to either eliminate or
modify the responses that were not negative or
were borderline. A final check was made by one
more evaluator to ensure that the responses were
adversarially crafted, irrelevant, and grammati-
cally correct. We collected 5 such responses for
11,429 contexts. Table 1 shows examples of rel-
evant and irrelevant responses in our dataset and
Table 2 shows some statistics about our dataset.

We acknowledge that, in practice, a given
context can have a large number of relevant
responses (>> 5). However, exhaustively
collecting all such responses is prohibitively
expensive and time-consuming. Although it is
desirable to have even more than 5 responses for

Context
FS: Can you do push-ups ?
SS: Of course I can . It’s a piece of cake ! Believe it or not , I can do 30 push-ups a minute.
FS: Really ? I think that’s impossible !
SS: You mean 30 push-ups ?
FS: Yeah !

Valid responses
SS: You don’t believe me, do you?
SS: Watch me do it.
SS: That’s because you can’t do it.
SS: You don’t know that I am a fitness trainer, do you ?
SS: Start your timer, here we go.

Invalid, adversarial responses
SS: Push up the window and look out for a minute
SS: Would you like to eat a piece of cake before gym?
SS: I like watching the Ripley’s Believe it or Not show where they discuss nearly impossible feats and gymnastics
SS: I have enough time for my treadmill exercises
SS: Are you asking me to do 40 squats?

Table 1: Examples from DailyDialog++ dataset with the context consisting of 2 speakers [annotated as
FS (First Speaker) and SS (Second Speaker)], and multiple reference responses and adversarial negative
responses. The underlined, purple colored words in the adversarial responses are those that overlap or
are closely related to the theme or words in the context.

Total # of contexts                             19,071
Avg. # of turns per context                     3.31
Avg. # of words per context                     45.32
Avg. # of words per utterance                   13.55
# of contexts with 5 relevant responses         19,071
# of contexts with 5 adv. irrelevant responses  11,429
Avg. # of words per relevant response           10.13
Avg. # of words per irrelevant response         13.8

Table 2: DailyDialog++ dataset statistics.

every context, we believe that having at least 5 is a
good starting point given the dearth of such multi-
reference conversation datasets. The proposed
dataset thus serves as a pragmatic substitute for
an ideal dataset that would have contained a large
number of responses per context. Having said
that, we would also like to point out that the
value of the proposed dataset goes beyond having
multiple relevant references as it is also the first
dataset containing adversarial irrelevant responses
for given contexts.

3 Existing Metrics

In this section, we present a brief overview
of the existing automatic metrics used for
dialogue evaluation. The existing metrics can be
broadly classified into two categories, namely,
(i) Untrained metrics, and (ii) Trained metrics.
Untrained evaluation metrics, usually adopted
from the NLG literature, use a predefined formula
to compare the candidate response with a reference
without taking the context into account. On the
other hand, trained metrics are usually trained
specifically for the task of dialogue response
evaluation to identify valid and invalid responses
for a given context.

3.1 Untrained Metrics

Untrained metrics can be further sub-classified
into (i) n-gram based, (ii) word embedding based,
and (iii) contextualized embedding based metrics.

N-gram Based: N-gram based metrics score a
candidate response based on the amount of n-gram
overlap it has with a given reference. BLEU
(Papineni et al., 2002), ROUGE-L (Lin, 2004), and
METEOR (Banerjee and Lavie, 2005) are among
the most commonly adopted n-gram based metrics
to evaluate dialogue systems. BLEU is calculated
using n-gram precision scores between the can-
didate response and the reference. ROUGE-L (Lin,
2004) is based on the F-measure of the longest
common subsequence between the candidate and
reference responses. METEOR (Banerjee and
Lavie, 2005) relaxes the exact match criteria by
including word stems, synonyms, and paraphrases.
More recently, Galley et al. (2015) proposed
deltaBLEU, which takes in multiple references
and rewards n-gram matches with positive
references and penalizes the matches with the
negative references.

Word Embedding Based: These methods use
word embeddings to compute the similarity
between the candidate response and the reference
response. The most commonly used word embed-
ding based metrics are Embedding Average
(Wieting et al., 2016), Vector Extrema (Forgues
et al., 2014), and Greedy Matching (Rus and
Lintean, 2012). Embedding Average defines a
sentence embedding as the average word embed-
ding of the constituent words. The final score is
calculated using the cosine similarity of candi-
date and reference sentence embeddings. Vector
Extrema (Forgues et al., 2014) instead computes
the sentence embedding by taking the most ex-
treme value for each dimension. In other words,

the value of the i-th dimension of the sentence
embedding is computed by taking a maximum
over the i-th dimension of all words in the
sentence. Greedy Matching (Rus and Lintean,
2012) first computes the maximum cosine
similarity that every word in the candidate
response has with any word in the reference
response. Similarly, the highest cosine similarity
for each of the reference words with any of
the candidate response words is calculated. The
similarity between the candidate response and
reference response is then computed by taking
an average of the maximum cosine similarities
computed above.
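
The three word embedding based metrics described above can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the implementation used in the experiments: the function names and the toy 2-dimensional word vectors are hypothetical, ‘‘most extreme’’ in Vector Extrema is assumed to mean the value of largest magnitude (sign preserved), and Greedy Matching is assumed to average the two directional scores.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def embedding_average(vectors):
    """Sentence embedding as the mean of the word vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

def vector_extrema(vectors):
    """Sentence embedding from the most extreme value in each dimension
    (assumption: largest absolute value, keeping its sign)."""
    return [max(dim, key=abs) for dim in zip(*vectors)]

def greedy_matching(cand, ref):
    """Average of the candidate-to-reference and reference-to-candidate greedy scores."""
    def one_way(src, tgt):
        # For every source word, keep its best cosine match among the target words.
        return sum(max(cosine(w, t) for t in tgt) for w in src) / len(src)
    return (one_way(cand, ref) + one_way(ref, cand)) / 2

# Hypothetical toy word vectors for a candidate and a reference response.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [-2.0, 0.5]]

avg_score = cosine(embedding_average(candidate), embedding_average(reference))
ext_score = cosine(vector_extrema(candidate), vector_extrema(reference))
gm_score = greedy_matching(candidate, reference)
```

All three scores depend only on the candidate and the reference, which is why these metrics cannot take the dialogue context into account.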

BERTScore: Recently, Zhang et al. (2020a)
proposed BERTScore, which uses contextualized
word embeddings of the candidate and reference
sentences to compute the score. BERTScore is
similar to greedy matching but uses contextualized
embeddings from BERT instead of static word
embeddings.

3.2 Trained Metrics

ADEM: Automatic Dialogue Evaluation Model
(ADEM) (Lowe et al., 2017) uses pretrained vector
representations of the dialogue context $c$,
reference response $r$, and proposed response $\hat{r}$
to compute the evaluation score as follows:

$$\mathrm{Score}(c, r, \hat{r}) = (c^{T} M \hat{r} + r^{T} N \hat{r} - \alpha)/\beta \qquad (1)$$

where $M, N \in \mathbb{R}^{n \times n}$ are learned matrices,
and $\alpha, \beta$ are scalar constants used to re-scale
scores in the range [1, 5]. The context, proposed
response and reference response are encoded using
a Hierarchical RNN (H-RNN) encoder consisting
of utterance-level and context-level RNNs. The
H-RNN encoder is pretrained on a Twitter dataset
(Dhingra et al., 2016) in a generative setup using
the latent variable hierarchical recurrent encoder
decoder (VHRED) model (Serban et al., 2017).
The weight matrices, M, N, are later finetuned
for the task of dialogue response evaluation.
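
As a worked instance of Equation (1), the sketch below evaluates the ADEM score for small hand-picked vectors. The 2-dimensional encodings, the identity matrices, and the values of α and β are hypothetical placeholders chosen only to make the arithmetic easy to follow; in ADEM the encodings come from the H-RNN encoder and M, N are learned.

```python
def bilinear(x, A, y):
    """Compute x^T A y for a list-based vector x, a list-of-rows matrix A, and a vector y."""
    return sum(x[i] * A[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def adem_score(c, r, r_hat, M, N, alpha, beta):
    """Equation (1): (c^T M r_hat + r^T N r_hat - alpha) / beta."""
    return (bilinear(c, M, r_hat) + bilinear(r, N, r_hat) - alpha) / beta

# Hypothetical toy values (not learned parameters).
identity = [[1.0, 0.0], [0.0, 1.0]]
score = adem_score(
    c=[1.0, 2.0],      # context encoding
    r=[0.5, 0.5],      # reference response encoding
    r_hat=[2.0, 1.0],  # proposed response encoding
    M=identity,
    N=identity,
    alpha=0.5,
    beta=2.0,
)
# c^T M r_hat = 4.0 and r^T N r_hat = 1.5, so score = (4.0 + 1.5 - 0.5) / 2.0 = 2.5
```

The first term compares the proposed response with the context and the second compares it with the reference, so the score uses both sources of evidence.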

RUBER: Tao et al. (2018) introduced an unreferenced
evaluation model consisting of GRU
encoders (Chung et al., 2014) to measure the
relatedness between the dialogue context and
a given response. The authors train the model
on Chinese dialogue data with the hinge loss
objective.

BERT Regressor4: Shimanaka et al. (2019)
propose a BERT based evaluation model to score
a candidate sentence based on a reference. Unlike
BERTScore, the BERT model is finetuned to
predict human judgement scores from the conca-
tenated reference and candidate sentence.

BERT+DNN5: Ghazarian et al. (2019) use
contextualized embeddings to compute a relatedness
score between the dialogue context and
response. The best performing model of Ghazarian
et al. (2019) consists of a multilayer perceptron
that takes the concatenation of contextualized
representations of the context and response as
input. The contextualized representations are
obtained by max-pooling the respective BERT
embeddings for each token. Note that the BERT
embeddings are not finetuned.

4 Dialogue Evaluation using BERT

In the last two years, considerable success in NLP
has been driven by large pretrained transformer-
based models (Radford et al., 2019; Devlin et al.,
2019; Zhang et al., 2019). These models are
typically trained with a language model objective
and leverage large amounts of unlabeled data.
However, none of the trained metrics discussed in
the previous section leverage pretraining on large-
scale dialogue corpora. With the hope that such
pretraining should help dialog evaluation models
also, we introduce DEB (Dialog Evaluation using
BERT) which is trained using a masked language
model objective (similar to BERT) and a modified
next response prediction objective.

We set up the task of next response prediction
as one of identifying whether the given
response is a valid next response for the given context.
Formally, given a context $C = \{w^c_1, \dots, w^c_n\}$
and a response $R = \{w^r_1, \dots, w^r_m\}$, we first pass
the concatenated sequence
$U = \{[CLS], w^c_1, \dots, w^c_n, [SEP], w^r_1, \dots, w^r_m\}$
through the BERT transformer and obtain $H_{cls} \in \mathbb{R}^{H}$, the last-layer
activations corresponding to the special [CLS]
token. We then make our final next response
predictions as follows: $\hat{y} = \mathrm{softmax}(W H_{cls})$,
where $W \in \mathbb{R}^{2 \times H}$ is a learnable matrix. We

4Because we couldn’t find an exact name for the evaluator
model by Shimanaka et al. (2019), we adopt the name ‘BERT
regressor’ from their paper’s title.

5Due to the lack of a specific name for the models in
Ghazarian et al. (2019), we refer to the model adopted from
their work as ‘BERT+DNN’.

use cross entropy loss with binary targets for the
next-response prediction. In addition, we use the
standard masked language model objective by
randomly masking 15% of the words in C and R.
Note that the proposed model is a straight-
forward extension of the standard BERT model
used for language modeling. We do not claim any
novelty on this front. The key contribution here
is to assess if pretraining on large-scale dialogue
corpora improves the performance of dialogue
evaluation metrics. Existing BERT-based evalua-
tion metrics (Shimanaka et al., 2019; Ghazarian
et al., 2019) do not use such pretraining on
any large-scale, domain-related corpora. In other
words, they do not leverage the more successful
recipe of (i) pretraining with a masked language
modeling objective and (ii) finetuning with a task-
specific objective (dialog evaluation in this case).
The idea behind DEB is to check if this successful
recipe can be replicated for dialog evaluation,
making use of the dialogues in the large-scale
Reddit corpus.
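
The next response prediction head described in this section can be sketched in a few lines. This is a minimal illustration and not the released DEB code: the BERT encoder is left out, `h_cls` is a hypothetical stand-in for its [CLS] activations with H = 3, the values in `W` are arbitrary, and treating index 1 as the ‘‘valid response’’ class is an assumption.

```python
import math

def build_input(context_tokens, response_tokens):
    """Concatenated sequence U: [CLS], context tokens, [SEP], response tokens."""
    return ["[CLS]"] + context_tokens + ["[SEP]"] + response_tokens

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def next_response_probs(W, h_cls):
    """y_hat = softmax(W h_cls), with W of shape 2 x H given as a list of rows."""
    logits = [sum(w * h for w, h in zip(row, h_cls)) for row in W]
    return softmax(logits)

def cross_entropy(probs, label):
    """Cross entropy loss for one example with a binary target (0 or 1)."""
    return -math.log(probs[label])

# Hypothetical toy values; in DEB, h_cls would come from the BERT transformer.
W = [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0]]
h_cls = [2.0, 5.0, 2.0]
tokens = build_input(["can", "you", "do", "push-ups", "?"], ["of", "course", "."])
y_hat = next_response_probs(W, h_cls)   # both logits are 0.0 here
loss = cross_entropy(y_hat, label=1)    # assumed: label 1 = valid next response
```

At evaluation time, the probability assigned to the valid class can be read off as the score of the response for the given context.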

4.1 Training Details

For pretraining, we use a massive open-domain
dialogue dataset of Reddit comments from 2005
to 2019 consisting of 256M threads with a
total of 3.68B comments. From this dataset,
we extracted a total of 727M {context, positive
response} pairs with 654M for training and 73M
for testing following the method described in
Henderson et al. (2019). We used an equal number
of negative responses by randomly sampling
responses from other contexts. We use the BERT
base model with 110M parameters consisting of
12 layers, 768 dimensional hidden space, and 12
attention heads per layer in all our experiments.
We finetune the pretrained DEB model on our
DailyDialog++ dataset for 1 epoch (we did not
see any advantage of finetuning beyond 1 epoch).
Note that during finetuning we only use the next
response prediction objective.

5 Experimental Setup

Our goal is to check if the adversarial responses
in our dataset, which are specifically crafted
to target context-dependent model-based metrics
(such as ADEM, RUBER, BERT+DNN, and
DEB), indeed affect the performance of such
models. To do so, we first need to benchmark
the models’ performance on random negatives
and then check if the performance drops when
evaluated on adversarial examples. Hence, in this
section, we describe (i) the process of creating
and validating such random negatives and (ii) the
process used for training model-based metrics.

We randomly divide our dataset into train
(80% contexts), validation (10% contexts), and
test (10% contexts) splits. Note that adversarial
negatives are not used for training or finetuning
the models unless explicitly specified.
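
A context-level split of this kind could be produced as in the sketch below. The function name, the fixed seed, and the use of integer context ids are assumptions made for illustration; the point is that the split is made over contexts, so all responses of a context fall into the same partition.

```python
import random

def split_contexts(context_ids, seed=0):
    """Shuffle context ids and split them 80/10/10 into train, validation, and test."""
    ids = list(context_ids)
    random.Random(seed).shuffle(ids)   # seeded for reproducibility (assumed choice)
    n_train = len(ids) * 8 // 10
    n_val = len(ids) // 10
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]       # the remainder goes to the test split
    return train, val, test

train_ids, val_ids, test_ids = split_contexts(range(100))
```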

5.1 Creating & Validating Random Negatives

For every context in our dataset, which has 5
relevant responses, we also sample 5 random
negatives. While sampling random negatives, we
avoid short responses that may be generic and
relevant for any context. To verify whether the
sampled random negatives were indeed irrelevant,
we asked human annotators to manually check 500
such sampled responses. More specifically, we
showed them the original context and the sampled
random negative response and asked them if it
was a relevant or irrelevant response. In 95%
of the cases, the annotators confirmed that the
random negative response was irrelevant, thereby
confirming that a random sampling strategy indeed
results in irrelevant responses (although they may
not be as hard as our adversarial negative examples
as shown later).
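
The sampling procedure can be sketched as follows. The data layout (a list of dicts with a `responses` field), the function name, and the `min_words` threshold used to filter out short generic responses are all hypothetical; the text does not state the exact length cutoff that was used.

```python
import random

def sample_random_negatives(contexts, idx, k=5, min_words=4, seed=0):
    """Sample k responses from contexts other than contexts[idx],
    skipping short responses that could be generic and fit any context."""
    rng = random.Random(seed)
    pool = [
        resp
        for j, ctx in enumerate(contexts)
        if j != idx                         # never sample from the context itself
        for resp in ctx["responses"]
        if len(resp.split()) >= min_words   # assumed length filter
    ]
    return rng.sample(pool, k)

# Hypothetical toy data.
toy_contexts = [
    {"responses": ["I paint in my free time", "Okay"]},
    {"responses": ["The train leaves at nine tomorrow", "Sure",
                   "We should book the tickets today"]},
    {"responses": ["My brother works at the bank", "Thank you",
                   "It has been raining all week"]},
]
negatives = sample_random_negatives(toy_contexts, idx=0, k=3)
```

Even with such a filter, a sampled response can occasionally be relevant by chance, which is why a manual check on a subset of the samples is still useful.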

5.2 Pretraining & Finetuning Trained Metrics

We describe the pretraining and finetuning
procedure for the various models used in our
analysis below.

ADEM: As previously mentioned in Section 3,
ADEM was pretrained on a Twitter corpus using
the VHRED setup and then finetuned for dialogue
response evaluation. We take this publicly
available model and finetune it further using
our DailyDialog++ dataset with a target of 5 for
positive responses and 1 for random negatives.
The reference response could be any of the
other four relevant responses. Note that ADEM
produces a score on a scale of 1 to 5 whereas the
other models produce a score on a scale of 0 to
1. For easier comparison, we scale the output of
ADEM so that it lies in the range of 0 to 1.

BERT regressor: We finetune the publicly
available pretrained BERT base model (110M

parameters) on our DailyDialog++ dataset. We
train the model with a label of 1 for positive
responses and 0 for random negative responses
using any one of the other four positive responses
as the reference. We train the model using
cross-entropy loss and follow the same set of
hyperparameters as used by Shimanaka et al.
(2019) during finetuning.

BERT+DNN: We use the best performing model
from Ghazarian et al. (2019), which consists of
a three-layered feed-forward neural network and
uses pretrained BERT embeddings as input. We
train the model on our DailyDialog++ dataset with
random negatives using cross entropy loss.

RUBER and RUBER-Large: We experiment
with two variants of Tao et al.’s (2018) models
with different sizes, namely (i) RUBER (34M
parameters), which consists of single-layer GRUs
with a hidden size of 1024, and (ii) RUBER-
Large (236M parameters), which consists of two
layered GRUs with a hidden size of 2048. As
shown in Vaswani et al. (2017), the training
time for RNN based architectures is very high
when compared with the transformer models that
allow much greater parallelization. We observed
an estimated time of over 200 days to train
the RUBER-Large model on the 727M Reddit
corpus on a 1080ti GPU,
thereby making it
practically infeasible to train such models on
large-scale datasets. Taking the computational
costs into consideration, we pretrained RUBER
and RUBER-Large on a sample of 20M contexts
with relevant and random irrelevant responses
from Reddit. We then finetuned these models on
our proposed dataset with random negatives.6

DEB: We pretrained DEB on the entire 727M
Reddit corpus using the masked language model
and the modified next
response prediction
objective. Pretraining DEB took 4 days on a
single Google Cloud TPUv2. We achieved a test
accuracy of 90% on the next response prediction
task and a perplexity of 15.47 (58% accuracy)
on the masked language modeling task in the
pretraining corpus. We then finetuned DEB on
our dataset with random negatives.

6We agree that this may not be a fair comparison but we
were constrained by the inherent limitations of such RNN-
based, sequential models that make large-scale pretraining
prohibitively expensive and time-consuming.

5.3 Untrained Metrics with Multiple References

Untrained metrics like METEOR, Greedy Match-
ing, and so forth, usually work with a single ref-
erence response but can also be adapted to work
with multiple reference responses. For example,
for a given candidate response c and a set of refer-
ence responses r1, r2, r3, …, rk, we can compute
the multi-reference METEOR score as:

$$\mathrm{METEOR}_{multi} = \max_{i=1}^{k} \mathrm{METEOR}(c, r_i)$$

Instead of the max function we can also use the
average function. We use a similar formula for all
the untrained metrics.
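
The max and average aggregations can be written as a small wrapper around any single-reference metric. In the sketch below, `unigram_overlap` is a hypothetical stand-in metric used only to keep the example self-contained; it is not METEOR, and the function names are assumptions.

```python
def multi_reference_score(metric, candidate, references, aggregate="max"):
    """Apply a single-reference metric to every reference and aggregate the scores."""
    scores = [metric(candidate, ref) for ref in references]
    if aggregate == "max":
        return max(scores)
    if aggregate == "avg":
        return sum(scores) / len(scores)
    raise ValueError("aggregate must be 'max' or 'avg'")

def unigram_overlap(candidate, reference):
    """Stand-in metric: fraction of candidate tokens that also occur in the reference."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return sum(tok in ref for tok in cand) / len(cand)

references = ["i paint in my free time", "i do not have time for hobbies"]
best = multi_reference_score(unigram_overlap, "i paint a lot", references, "max")
mean = multi_reference_score(unigram_overlap, "i paint a lot", references, "avg")
# Per-reference scores are 0.5 and 0.25, so best = 0.5 and mean = 0.375.
```

The max aggregation rewards a candidate for matching any one reference well, whereas the average penalizes it for being far from the other references.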

A few metrics like BLEU, deltaBLEU, and
ROUGE-L have their own standard formula to
incorporate multiple references. BLEU calculates
the number of matches for each n-gram based
on the maximum number of times the n-gram
occurs in common with any one of the references.
deltaBLEU further extends the same idea to
incorporate a score for each reference. We follow
the implementation from Galley et al. (2015) to
compute the deltaBLEU scores. For ROUGE-L,
we follow the strategy in Sharma et al. (2017),
where the score is an F-measure of the maximum
precision and maximum recall over all the refer-
ences. In addition to the average and maximum
aggregations, we also report these standard multi-
reference scores for BLEU, deltaBLEU, and
ROUGE-L.
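
As a rough sketch of the two standard multi-reference schemes described above, the following Python code computes a clipped n-gram precision (the core quantity in BLEU) and a ROUGE-L score formed from the maximum precision and maximum recall over references. The function names are hypothetical, inputs are assumed to be non-empty token lists, and the F-measure weight `beta` is an assumption set to 1.0 here; the brevity penalty and geometric averaging of full BLEU are omitted.

```python
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Clip each candidate n-gram count at its maximum count in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l_multi(candidate, references, beta=1.0):
    """F-measure of the maximum LCS precision and maximum LCS recall over references."""
    lcs = [lcs_length(candidate, ref) for ref in references]
    precision = max(l / len(candidate) for l in lcs)
    recall = max(l / len(ref) for l, ref in zip(lcs, references))
    if precision == 0 or recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


p1 = clipped_precision("the the the cat".split(),
                       ["the cat sat".split(), "the the dog".split()], 1)
f = rouge_l_multi("a b c d".split(), ["a b".split(), "a x c d y z".split()])
```

Note that in `rouge_l_multi` the maximum precision and the maximum recall may come from different references, which is what distinguishes this scheme from simply taking the best single-reference F-measure.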

6 Results

In this section, we compare the performance of
different dialog evaluation metrics in separating
relevant references from (i) random negatives,
(ii) synthetically crafted adversarial irrelevant
responses (explained below), and (iii) manually
crafted adversarial irrelevant responses (as in our
DailyDialog++ dataset).

6.1 Performance on Random Negatives

Table 3: Automatic evaluation metrics performance on random negatives (PBC refers to point-biserial correlation). Column subheading 'Single' refers to experiments using a single reference response, and 'Avg' and 'Max' are the average and maximum aggregation strategies when using multiple reference responses. 'Standard' is applicable when the metric aggregates multiple references differently. * indicates statistical significance in performance over all other metrics (with p-values <1e-9) on William's test for comparing correlations and Chi-squared test for accuracies. The p-values of all individual correlations are <1e-9.

Untrained metrics (Avg, Max, and Standard use multiple references; Acc. is accuracy in percentage)
Metric | PBC Single | PBC Avg | PBC Max | PBC Standard | Acc. Single | Acc. Avg | Acc. Max | Acc. Standard
BLEU-1 | 0.26 | 0.42 | 0.41 | 0.41 | 61.26 | 68.75 | 68.60 | 70.36
BLEU-2 | 0.22 | 0.39 | 0.36 | 0.40 | 58.09 | 68.37 | 68.26 | 68.66
BLEU-3 | 0.14 | 0.26 | 0.24 | 0.28 | 53.11 | 58.90 | 58.85 | 58.89
BLEU-4 | 0.08 | 0.17 | 0.15 | 0.18 | 51.16 | 53.56 | 53.56 | 53.50
METEOR | 0.23 | 0.40 | 0.41 | − | 59.77 | 68.01 | 68.51 | −
ROUGE-L | 0.23 | 0.41 | 0.40 | 0.37 | 59.47 | 68.25 | 67.89 | 68.43
deltaBLEU (Galley et al., 2015) | − | − | − | 0.29 | − | − | − | 64.89
Embed Avg | 0.23 | 0.25 | 0.23 | − | 61.27 | 62.67 | 61.56 | −
Vec Extr (Forgues et al., 2014) | 0.24 | 0.35 | 0.33 | − | 59.22 | 63.90 | 63.70 | −
GreedyMatch (Rus and Lintean, 2012) | 0.24 | 0.36 | 0.32 | − | 60.02 | 65.56 | 63.99 | −
BERTScore (Zhang et al., 2020a) | 0.29 | 0.39 | 0.39 | − | 63.71 | 68.59 | 69.05 | −

Trained metrics
Metric | PBC | Accuracy in percentage
ADEM (Lowe et al., 2017) | 0.40 | 64.74
BERT regressor (Shimanaka et al., 2019) | 0.52 | 73.40
BERT+DNN (Ghazarian et al., 2019) | 0.57 | 74.67
RUBER (Tao et al., 2018) | 0.64 | 78.18
RUBER-Large (Tao et al., 2018) | 0.69 | 82.36
DEB (ours) | 0.79* | 88.27*

For every context in our test split, we obtain the scores assigned by a given metric to the 5 positive and 5 random negative responses. In particular, we treat each of the 5 relevant and 5 random irrelevant responses as a candidate response. For all untrained metrics other than deltaBLEU, we consider the remaining 4 relevant responses as reference responses. For deltaBLEU, we consider the remaining 4 relevant responses as references with a score of 1 and the remaining 4 irrelevant responses as references with a score of −1. We expect a good evaluation metric to provide high scores on relevant responses and low scores on the irrelevant responses. We then quantify the performance of all metrics using two measures.
First, we compute the Point Biserial correlation (PBC) between the scores assigned by a metric and the binary target, i.e., a score of 1 for positive responses and 0 for random negative responses.7 Second, we compute the classification accuracy of the metric by using a threshold and marking all responses having a score above this threshold as positive and others as negative. We use a threshold of 0.5 for the trained metrics. For all the untrained metrics, we perform a search from 0 to 1 with a step size of 0.01 and select the threshold that minimizes the error rate on the validation set.8 Later in Section 6.1.1, we shall observe that if we use 0.5 as the threshold, the performance of most untrained metrics would be abysmally poor. Note that for the trained metrics we found that the scores were spread evenly in the range of 0 to 1 and there was no benefit of doing a grid search to find the threshold; a threshold of 0.5 was adequate.

7Note that it can be shown that PBC is equivalent to the Pearson correlation when one of the variables is binary, as is the case above.

8With this approach of setting a threshold, we want to be lenient with the untrained metrics and investigate how best they can be adopted. One might also think of using the median of all the scores assigned by a metric as its threshold; however, such an approach is error-prone and has several boundary conditions that would defeat the purpose. We hence estimate the threshold by minimizing the risk.

In Table 3, we report PBC and accuracy of the different untrained metrics with both single and multiple references, and the trained metrics. When evaluating using single references, we use any one of the 5 relevant responses as a reference response (other than the one being used as a candidate). We observe that with a single reference, all the untrained metrics are poor at distinguishing between the positive and random negative responses as inferred from the low accuracy and correlation values.
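
The two measures reported in Table 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, PBC is computed as a Pearson correlation against 0/1 labels (as footnote 7 permits), and the grid search returns the first threshold with the lowest error when several tie.

```python
import math


def point_biserial(scores, labels):
    """Pearson correlation between real-valued scores and 0/1 labels."""
    n = len(scores)
    mean_s, mean_l = sum(scores) / n, sum(labels) / n
    cov = sum((s - mean_s) * (l - mean_l) for s, l in zip(scores, labels))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_l = sum((l - mean_l) ** 2 for l in labels)
    return cov / math.sqrt(var_s * var_l)


def accuracy(scores, labels, threshold):
    """Responses scoring above the threshold are predicted positive."""
    correct = sum((s > threshold) == bool(l) for s, l in zip(scores, labels))
    return correct / len(scores)


def best_threshold(val_scores, val_labels, step=0.01):
    """Grid search over [0, 1] for the threshold with the lowest validation error."""
    candidates = [i * step for i in range(int(round(1 / step)) + 1)]
    return max(candidates, key=lambda t: accuracy(val_scores, val_labels, t))


# Hypothetical validation scores: three positives followed by three random negatives.
val_scores = [0.9, 0.8, 0.7, 0.2, 0.3, 0.1]
val_labels = [1, 1, 1, 0, 0, 0]
threshold = best_threshold(val_scores, val_labels)
pbc = point_biserial(val_scores, val_labels)
```

For a trained metric, the same `accuracy` function would simply be called with `threshold=0.5`.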
When we use multiple references, we observe relatively better performance. We notice that the performance is largely similar across the aggregation techniques: average, maximum, and standard (when applicable). Metrics such as BLEU-1, METEOR, ROUGE-L, and BERTScore with multiple references are able to achieve modest correlations with the binary target. Interestingly, we observe that all the word embedding based methods, even in the presence of multiple references, perform badly in scoring the positive and random negative responses. In contrast, trained metrics such as BERT regressor, RUBER, BERT+DNN, and DEB perform substantially better than the untrained metrics. Our proposed DEB model achieves state-of-the-art performance, with an accuracy of 88.27% and a strong correlation of 0.79.

Figure 1: Box plots of the scores given by various metrics to the positive and random negative responses.

6.1.1 Analysis using Box Plots

We now visualize the box plots of the scores given by the various metrics to the positive and random negative responses. Figure 1 shows these box plots for the multi-reference untrained metrics (max aggregation) and the trained metrics. We observe several shortcomings of the untrained metrics. Firstly, all the untrained metrics have a significant overlap in the interquartile range of the positive and random negative scores, implying that there is a high degree of intermixing of scores given to the positive and random negative responses.
The overlap is even higher for word embedding based metrics, which obtain low point biserial correlations. Secondly, we note that the score distributions of the untrained metrics are highly skewed. For instance, the scores of BERTScore are almost always greater than 0.75 even though it scores responses in the range [0,1]. Therefore, it is difficult to tell at what value of the metric a response can be safely considered relevant. These observations suggest that untrained metrics even with multiple references cannot be reliably used to score dialogue responses.

For the ADEM evaluation model, we observe that it outputs scores close to the mean score of 0.5 with little spread in their values. Sai et al. (2019) also made a similar observation about the clustering of the scores around the mean in ADEM, which they explain using linear system theory. In BERT regressor, there is a high overlap in the scores given to positives and random negatives. We further observe that RUBER and BERT+DNN are able to better distinguish the positive and random negative responses. Although there is separation in the interquartile range for the two classes in RUBER and BERT+DNN scores, there is a greater spread within each class and a lot of points of the two classes substantially overlap. RUBER-Large is able to reduce the overlap, while DEB further achieves better performance by pushing the scores for positive responses close to 1 and the scores for random negatives to 0 with high accuracy. We shall show in Section 7.3 that DEB achieves this by pushing the H_cls embeddings for the positive and random negative responses farther apart in space.

6.2 Performance on Synthetically Crafted Adversarial Responses

Due to space constraints, in the remainder of this section we present results only for the best performing evaluation metrics from Table 3, namely, BERT+DNN, RUBER, RUBER-Large, and DEB.
Before evaluating them using the adversarial examples in our dataset, we first investigate the performance of the models with synthetically crafted adversarial attacks, similar to Sai et al. (2019). In particular, we perform simple transformations on relevant responses by (i) jumbling words in the sequence, (ii) reversing the sequence, (iii) dropping all words except nouns, (iv) dropping all stop words, (v) dropping punctuation, and (vi) replacing words with synonyms. These results are presented in Table 4.

Table 4: Fraction of responses classified as positives with synthetic modifications. Unmodified positives are presented in the 1st row for reference (p-values for individual correlations in brackets).

Modification | DEB | RUBER-Large | RUBER | BERT+DNN
% classified as positive
Unmodified positives | 87.9% | 81.7% | 77.5% | 93.5%
Reverse word order | 60.0% | 70.3% | 71.3% | 80.4%
Jumble word order | 69.3% | 71.2% | 72.3% | 77.4%
Retain only nouns | 60.1% | 27.9% | 27.8% | 0.0%
Remove punctuation | 86.4% | 72.9% | 72.4% | 88.5%
Remove stopwords | 85.8% | 73.6% | 69.6% | 29.3%
Replace with synonyms | 81.2% | 70.8% | 65.6% | 91.1%
Pearson correlation with human scores
Remove stopwords | 0.58 (<1e-9) | 0.57 (<1e-9) | 0.56 (<1e-9) | 0.056 (0.26)
Replace with synonyms | 0.68 (<1e-9) | 0.54 (<1e-9) | 0.52 (<1e-9) | −0.017 (0.67)

The modifications of reversing and jumbling the word order in a relevant response make it irrelevant (grammatically wrong) and hence we expect to see more of the original true positives get classified as negatives. BERT+DNN classifies a majority of these responses as positives. One possible reason for this is that their model only uses a max pooled aggregation on BERT embeddings and does not explicitly model the sequential order of words. On the other hand, DEB fares better than the other models, as seen by the drop in the fraction of responses identified as positives. However, RUBER variants and BERT+DNN do better than DEB when retaining only nouns in a response.
On removing punctuation, we expect that most of the positive responses without punctuation would remain positive and hence the percentage of responses marked positive should remain about the same. In this case, both DEB and BERT+DNN perform better than the RUBER models. For the modifications of removing stopwords and replacing words with synonyms, it is hard to generalize the trend that is observed. Hence, we perform human evaluations by presenting in-house annotators with contexts and modified responses. We ask them to provide scores in the range 0 to 3, with higher scores meaning better responses. We obtain human scores on 400 samples for this task and compute the Pearson correlation of the model predictions with the human judgements. In this case, we find DEB is better correlated with human judgements on both the modifications.

Figure 2: Accuracy of different models in identifying adversarial and random negatives versus positive responses.

6.3 Performance of Model-Based Metrics on Manually Crafted Adversarial Responses

So far we have established that (i) untrained metrics perform poorly compared to trained metrics even for separating random negatives from positives, (ii) trained models like RUBER, BERT+DNN, RUBER-Large and DEB perform remarkably well in distinguishing relevant responses from random responses, and (iii) RUBER variants and DEB perform well on most synthetically mutated responses whereas BERT+DNN performs poorly against certain mutations. However, we still need to check if the trained models are robust to adversarial examples which are specifically crafted to fool such context-dependent, model-based metrics. Note that none of the untrained metrics are context dependent as they directly compute the similarity between the reference and candidate response without considering the context.
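
For reference, the synthetic mutations of Section 6.2 recapped above can be sketched in Python as below. This is an illustration only: the paper does not specify its stopword list or synonym source, so `STOPWORDS` and `SYNONYMS` are tiny hypothetical stand-ins, and the "retain only nouns" modification is omitted because it needs a part-of-speech tagger.

```python
import random
import string

# Hypothetical stand-ins; the actual stopword list and synonym source are not given.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "i", "you", "it"}
SYNONYMS = {"good": "fine", "movie": "film", "big": "large"}


def reverse_words(response):
    return " ".join(reversed(response.split()))


def jumble_words(response, seed=0):
    words = response.split()
    random.Random(seed).shuffle(words)  # seeded so the mutation is reproducible
    return " ".join(words)


def remove_punctuation(response):
    stripped = response.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def remove_stopwords(response):
    return " ".join(w for w in response.split() if w.lower() not in STOPWORDS)


def replace_synonyms(response):
    return " ".join(SYNONYMS.get(w.lower(), w) for w in response.split())


def fraction_positive(scores, threshold=0.5):
    """Share of (modified) responses that a metric still classifies as positive."""
    return sum(s > threshold for s in scores) / len(scores)
```

Each function maps a relevant response to a mutated one; scoring the mutated responses with a trained metric and calling `fraction_positive` yields the kind of percentage reported in Table 4.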
We consider the 5 relevant and the 5 adversarial irrelevant responses in our dataset and just as before compute the scores assigned by the different metrics to each of these responses. We then compute the accuracy of a metric using the target label as 0 for irrelevant responses and 1 for relevant responses. As expected, the accuracy of all the models drops, as seen in Figure 2. In particular, we observe that the models wrongly classify most of the irrelevant responses as positive/relevant responses. This can be seen from the confusion matrices in Table 5, where it is clear that the number of false positives is very high.

Table 5: Confusion matrix showing changes in the performance of different models on DailyDialog++ with random and adversarial negatives.

Model | Setting | TP | FN | FP | TN
BERT+DNN | Positive vs Random negatives | 5337 | 373 | 2520 | 3190
BERT+DNN | Positive vs Adversarial negatives | 5337 | 373 | 4179 | 1531
BERT regressor | Positive vs Random negatives | 3442 | 1126 | 1304 | 3264
BERT regressor | Positive vs Adversarial negatives | 3442 | 1126 | 1837 | 2731
RUBER | Positive vs Random negatives | 4420 | 1280 | 1207 | 4493
RUBER | Positive vs Adversarial negatives | 4420 | 1280 | 2714 | 2986
RUBER-Large | Positive vs Random negatives | 4659 | 1041 | 970 | 4730
RUBER-Large | Positive vs Adversarial negatives | 4659 | 1041 | 2500 | 3200
DEB | Positive vs Random negatives | 5011 | 689 | 646 | 5054
DEB | Positive vs Adversarial negatives | 5011 | 689 | 3101 | 2599

7 Discussion

In this section, we do further analysis of DEB.

7.1 Ablation Studies on DEB

Table 6: Ablation studies on DEB.

Model | Pos vs Rand Neg | Pos vs Adv Neg
BERT original | 72.65 | 58.10
DEB pretrained on Reddit | 84.16 | 59.82
Pretrained DEB finetuned on rand neg | 88.29 | 66.75

There are different stages of training our DEB model. First, the underlying BERT model is already pretrained on English Wikipedia and the BooksCorpus. We then pretrain it further for our task using Reddit corpus and finally finetune it on the DailyDialog++ dataset. We now evaluate the contributions of each of these stages of training (see Table 6).
First, we find that the original BERT model when adopted directly for the task of dialog evaluation gives an accuracy of 72.65% and 58.10% on random and adversarial negatives respectively. On further analysis, we find that it has a high false positive rate, with more than 52% of the adversarial negatives getting classified as positives. After pretraining it with Reddit data, it achieves an accuracy of 84.16% on DailyDialog++ even though it has not seen any training instances from this dataset. However, there is only a marginal improvement on adversarial negatives. Finally, finetuning BERT on DailyDialog++ using only random negatives further improves the accuracy to 88.29% and 66.75%, respectively.

7.2 Training with Adversarial Examples

Table 7: Accuracy in classifying Pos vs Rand Neg and Pos vs Adv Neg responses for various model variants trained/finetuned on DailyDialog++.

Model | Training/Finetuning Data | Pos vs Rand Neg | Pos vs Adv Neg
BERT regressor | Rand neg | 73.40 | 67.57
BERT regressor | Adv neg | 69.89 | 75.92
BERT regressor | Rand + Adv neg | 72.77 | 74.55
BERT+DNN | Rand neg | 74.67 | 60.14
BERT+DNN | Adv neg | 60.49 | 87.67
BERT+DNN | Rand + Adv neg | 73.87 | 86.61
RUBER (Pretrained) | Rand neg | 78.18 | 64.96
RUBER (Pretrained) | Adv neg | 70.82 | 76.50
RUBER (Pretrained) | Rand + Adv neg | 75.11 | 83.88
RUBER-Large (Pretrained) | Rand neg | 82.35 | 68.94
RUBER-Large (Pretrained) | Adv neg | 63.99 | 90.49
RUBER-Large (Pretrained) | Rand + Adv neg | 79.91 | 86.54
DEB (Pretrained) | Rand neg | 88.29 | 66.75
DEB (Pretrained) | Adv neg | 86.24 | 82.04
DEB (Pretrained) | Rand + Adv neg | 88.67 | 92.65

We examine whether the evaluation models can learn to distinguish the adversarial negatives when specifically finetuned for that task. By training on DailyDialog++ with adversarial negatives rather than random negatives, we find that all models give an improved performance in identifying adversarial negatives (see Table 7). However, with such training, every model's performance drops when evaluated on DailyDialog++ with random negatives, with BERT+DNN dropping substantially to 60.49%.
The best overall performance is seen when the models are finetuned with both random and adversarial negatives, with DEB achieving the highest accuracies on both test sets. While such improvement is expected given the capacity of the models, obtaining such adversarial examples for training is not always feasible.

Effect of the Number of Adversarial Negatives Added to Training: Because of the difficulty in manually creating adversarial examples, we study the effect of the number of the adversarial examples added to the training set. Our findings are presented in Figure 3, where we progressively increase the percentage of adversarial negative examples added as input to the DEB model during training with random negatives. As expected, the accuracy in identifying adversarial negatives improves as the model is exposed to more data points of the same type, where we specifically note the considerable improvement from 45.6% to 70.85% after adding just 1% of adversarial negatives from our dataset (i.e., 100 contexts with 5 adversarial examples each). With the addition of more adversarial negatives, we find a small drop in the accuracy of identifying random negatives. There is also a slight decrease in the performance on the positive responses when the number of adversarial examples is small. We note that the adversarial negatives are hard negatives close to the positive responses in the embedding space, as we elaborate in Section 7.3, thereby confusing the model.

Figure 3: Effect of varying the amount of adversarial negatives added to the training set.

7.3 Conicity Analysis on DEB

We analyze the embeddings from the final embeddings projection space, that is, the one used by the softmax layer for next response prediction. We check for the spread of the embeddings of the positive and negative responses. Specifically, let P, R, and A be the set of embeddings of all positive responses, random negative responses, and adversarial negative responses respectively for a given context. We want the spread of the set P to be low in the projected space (all positive responses embedded close to each other). At the same time, if we consider the union of the sets P, R, and A then the spread of this set should be high (positive responses separated from negative responses). We measure this spread using conicity analysis (Chandrahas et al., 2018). Conicity on a set of vectors V is defined as the average of the cosine similarity of the vectors with their mean vector, v̄. The lower the conicity, the higher the spread.

For each utterance in DailyDialog++, we first construct the sets P, R, and A using the pretrained DEB model. We find that the average conicity of the set P is 0.89 (averaged over all utterances), indicating that the positive responses get mapped very close to each other. The average conicity of the set P ∪ R is 0.59, indicating that the positive responses are well separated from the random negatives. However, the average conicity of the set P ∪ A is 0.74, indicating that the positive responses are not well separated from the adversarial negative responses. We illustrate this in Figure 4a by representing the mean vector of each of the sets along a corresponding highlighted region where the vectors of the set lie on average.9 We then finetune the DEB model on the DailyDialog++ dataset. Once again, for every utterance we construct the sets P, R, and A using this finetuned model. We now observe that the average conicity of the sets P, P ∪ R, and P ∪ A are 0.86, 0.37, and 0.35 respectively. Thus, after finetuning, the model is able to achieve a clear separation between positive responses and random or adversarial negative responses. Furthermore, the positive responses are still close to each other (illustrated in Figure 4b).

9Note that separation of cones in the figure does not indicate complete separation of all the vectors between the sets, rather separation on average, as there could be some overlap or outliers, as evident from the model's performance in various experiments.

Figure 4: Illustration of the spread of the positive and negative response embeddings by DEB (not to scale).

8 Generalization to Other Datasets

In this section, we investigate how well the different model-based metrics trained on DailyDialog++ generalize to other datasets that are not seen during training. We evaluate the 3 unreferenced models, BERT+DNN, RUBER, and DEB, which require only context and candidate response as inputs on these 3 datasets.

Twitter: Microsoft Research Social Media Conversation Corpus (Sordoni et al., 2015) contains a curated list of 3-turn Twitter conversations, all of which are human-verified as good responses.

PersonaChat: The dialogues in PersonaChat (Zhang et al., 2018) are associated with well-defined personalities of the speakers involved. We consider the verified human-human chat logs, released by See et al. (2019), as positive examples.

Holl-E: This dataset (Moghe et al., 2018) contains conversations about movies, where each response is generated by copying and modifying content from a relevant background document. We use the multi-reference test set of Holl-E containing 4 positive responses for each context.

For all the 3 datasets, we consider the reference responses as positive responses and obtain negative examples by randomly sampling responses from other contexts. We reiterate that we do not train the models on these datasets but simply evaluate the models trained on DailyDialog++ on these datasets. Table 8 shows that DEB outperforms the other unreferenced models on all the 3 datasets. With Holl-E dataset being specific to conversations about movies rather than generic topics, we find the scores are relatively lower on it for all the models. The other evaluation models and metrics cannot be compared on PersonaChat and Twitter without additional reference responses, since the available single reference in these datasets is being evaluated. On the multi-reference test set of Holl-E, however, we find that their performance is lower than the three unreferenced models.

Table 8: Transferability to other datasets.

Model | Persona | Twitter | Holl-E
BERT+DNN | 54.60 | 48.71 | 71.01
RUBER | 54.83 | 71.18 | 61.17
RUBER-Large | 55.94 | 77.18 | 62.32
DEB | 62.74 | 82.71 | 78.55

9 Correlations with Human Judgments on System Generated Responses

Lastly, we wanted to check if DEB scores correlate well with scores assigned by humans on responses generated by dialogue systems (as opposed to humans). To do so, we collected responses generated by the following five dialogue response generation models:

HRED: Hierarchical Recurrent Encoder Decoder (HRED) (Serban et al., 2016) extends the traditional seq2seq model by adding an additional utterance-level RNN.

VHRED: Latent Variable HRED (VHRED) (Serban et al., 2017) includes a latent variable at the decoder, and is trained by maximizing a variational lower-bound on the log-likelihood.

VHCR: Variational Hierarchical Conversation RNN (VHCR) (Park et al., 2018) further extends VHRED by drawing a prior encoding for each conversation.

DialoGPT small: Zhang et al.
(2020b) pretrained GPT-2-like (Radford et al., 2019) transformer models on 147M conversations extracted from Reddit comments. The small version contains 12 layers and 768 hidden dimensions.

DialoGPT medium: The medium version of DialoGPT contains 24 layers and 1024 hidden dimensions.

For the RNN-based models (HRED, VHRED, VHCR), we use a single-layer bidirectional encoder and single-layer decoder each with a hidden size of 1024. We pretrain the RNN-based models on the casual conversation subset of the Reddit dataset, consisting of 10M conversation exchanges. We finetune all the models on the DailyDialog++ dataset.

Table 9: Human correlations on DailyDialog++ data with different models. (Individual p-values in parentheses.) * indicates statistical significance in performance over other models, with p-values <1e-6 on the William's test.

Model | Pearson | Spearman | Kendall tau
Response level
BERT+DNN | 0.016 (0.73) | 0.009 (0.89) | 0.007 (0.88)
RUBER | 0.111 (2.5e-2) | 0.126 (1.1e-2) | 0.090 (8.9e-2)
RUBER-Large | 0.265 (<1e-7) | 0.256 (<1e-6) | 0.173 (<1e-6)
DEB w/o Reddit | 0.356 (<1e-9) | 0.295 (<1e-9) | 0.202 (<1e-9)
DEB w/o DD++ | 0.274 (<1e-9) | 0.337 (<1e-9) | 0.232 (<1e-9)
DEB | 0.440* (<1e-9) | 0.523* (<1e-9) | 0.374* (<1e-9)
System level
BERT+DNN | 0.050 (0.89) | -0.100 (0.87) | 0.000 (1.1)
RUBER | 0.221 (0.72) | 0.300 (0.62) | 0.200 (0.81)
RUBER-Large | 0.679 (0.20) | 0.499 (0.39) | 0.399 (0.483)
DEB w/o Reddit | 0.784 (0.12) | 0.600 (0.28) | 0.400 (0.48)
DEB w/o DD++ | 0.855 (0.06) | 0.600 (0.28) | 0.400 (0.48)
DEB | 0.973 (5.2e-3) | 0.700 (0.18) | 0.600 (0.23)

We conducted human evaluations to compare the extent to which the model-based metrics agree with human judgements. We randomly sampled 100 contexts from the test set of the DailyDialog++ dataset and obtained the responses generated by each of the above models. Annotators were shown a context-response pair and were asked to rate how human-like the response is with respect to the context, on a scale of 0–3.
The annotators were asked to check for both fluency and coherence. A total of 15 in-house annotators participated in the human evaluation study. The annotators were Computer Science graduates competent in English. Each context-response pair was rated by 5 annotators and the final score was obtained by averaging the 5 scores. We also obtained scores at the system level by aggregating the scores for each model.

In Table 9, we report the correlations of human judgments with the model scores at the response level and system level. We observe that the BERT+DNN model, which only has a feed-forward neural network that is learnable, does not have any significant correlation with human judgments. On the other hand, RUBER, consisting of pretrained GRUs, obtains low to moderate correlations. RUBER-Large further obtains improved correlations, indicating that using large-scale pretrained models helps. This trend is also observed in the comparisons of DEB with its ablated versions (without Reddit pretraining and without finetuning on DailyDialog++), indicating the contribution of these steps in training the final model. Our proposed DEB model obtains significantly higher correlations at response level. We checked for significance using William's test to compare DEB with all other models and found p-values to be < 1e−6. This establishes the effectiveness of DEB in scoring model generated responses.

At the system level, we find that DEB correlates substantially higher than other models with the human rankings of the models. However, the p-values in this case are not significant due to the limited number of systems. In hindsight, we realize that reporting system level correlations is not very informative as the number of samples is very small (as many as the number of systems). Hence, these numbers are not very reliable. However, following Lowe et al. (2017), we still report the system-level correlations (along with the p-values) for the sake of completeness.
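
The difference between response-level and system-level correlation can be sketched as below. This is an illustrative sketch under stated assumptions rather than the authors' analysis code: the function names and the `records` layout are hypothetical, each human score is assumed to already be the mean over annotators, and system-level aggregation is assumed to be a plain average per system.

```python
import math
from collections import defaultdict


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def system_level(records):
    """records: (system, human_score, metric_score) triples, one per response."""
    human, metric = defaultdict(list), defaultdict(list)
    for system, h, m in records:
        human[system].append(h)
        metric[system].append(m)
    systems = sorted(human)
    return ([sum(human[s]) / len(human[s]) for s in systems],
            [sum(metric[s]) / len(metric[s]) for s in systems])


# Hypothetical scores for three systems with two responses each.
records = [("A", 3.0, 0.9), ("A", 2.0, 0.7), ("B", 1.0, 0.4),
           ("B", 2.0, 0.6), ("C", 0.0, 0.1), ("C", 1.0, 0.3)]
response_r = pearson([h for _, h, _ in records], [m for _, _, m in records])
system_r = pearson(*system_level(records))
```

With only as many points as there are systems, a system-level coefficient rests on very few samples, which is why its p-values are hard to make significant.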
10 Related Work

We point the reader to Serban et al. (2018) for an excellent survey of existing datasets containing single reference responses. Recently, there has been some effort to create datasets containing multiple references but these datasets are either too small (around 1,000 contexts) (Moghe et al., 2018; Gupta et al., 2019) or noisy (Gao et al., 2019).

We have already reviewed all the existing dialog metrics in Section 3 and hence we do not discuss them again here. Instead, we quickly mention existing works which critically examine dialog evaluation metrics. For example, Liu et al. (2016) show that existing n-gram based metrics do not correlate well with human judgements for dialog evaluation. We report similar results but additionally show that the correlation improves in the presence of multiple references. Similarly, Sai et al. (2019) have critically examined ADEM and shown that in most cases it produces a score close to 2.5 (on a scale of 1 to 5) and hence does not clearly separate relevant and irrelevant responses. Lastly, we also mention a very recent work, Zhang et al. (2020b), which has pretrained a large scale transformer on Reddit corpus for building conversation systems. However, their focus is on dialog generation and not on evaluation metrics.

11 Conclusions

We propose a multi-reference open-domain dialogue dataset with multiple relevant responses and adversarial irrelevant responses. We perform an extensive study of the existing dialogue evaluation metrics using this dataset and also propose a new transformer-based evaluator pretrained on large-scale dialogue datasets.
We identify the strengths and weaknesses of such a model through studies of its performance on untrained and synthetically modified data. We find DEB to be easily adaptable to other open-domain dialogue datasets. We also present the scope of the adversarial responses in our dataset towards bringing out better evaluation metrics, since none of the current models perform well on those unless explicitly trained.

Acknowledgments

We thank the Department of Computer Science and Engineering, IIT Madras and the Robert Bosch Center for Data Science and Artificial Intelligence, IIT Madras (RBC-DSAI) for providing us resources required to carry out this research. We are grateful to Google for the TFRC credits that supported our usage of TPUs for several experiments in this paper. We also thank Google for supporting Ananya Sai through their Google India Ph.D. Fellowship Program. We thank the action editor, Xiaojun Wan, and all the anonymous reviewers for their very helpful comments in enhancing the work. We thank the in-house human annotators and evaluators for helping us create the dataset.

References

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.

Chandrahas, Aditya Sharma, and Partha P. Talukdar. 2018. Towards understanding the geometry of knowledge graph embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 122–131. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P18-1012

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014.
Empirical evaluation of gated recurrent neural networks on sequence modeling. In NeurIPS 2014 Workshop on Deep Learning, December 2014.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W. Cohen. 2016. Tweet2vec: Character-based distributed representations for social media. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers. The Association for Computer Linguistics. DOI: https://doi.org/10.18653/v1/P16-2044

Gabriel Forgues, Joelle Pineau, Jean-Marie Larchevêque, and Réal Tremblay. 2014. Bootstrapping dialog systems with word embeddings. In NeurIPS, modern machine learning and natural language processing workshop, volume 2.

Eric N. Forsyth and Craig H. Martell. 2007. Lexical and discourse analysis of online chat dialog. In Proceedings of the First IEEE International Conference on Semantic Computing (ICSC 2007), September 17-19, 2007, Irvine, California, USA, pages 19–26. IEEE Computer Society. DOI: https://doi.org/10.1109/ICSC.2007.55

Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, and Bill Dolan. 2015.
deltaBLEU: A discriminative metric for generation tasks with intrinsically diverse targets. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 445–450, Beijing, China. Association for Computational Linguistics. DOI: https://doi.org/10.3115/v1/P15-2073

Xiang Gao, Sungjin Lee, Yizhe Zhang, Chris Brockett, Michel Galley, Jianfeng Gao, and Bill Dolan. 2019. Jointly optimizing diversity and relevance in neural response generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 1229–1238. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N19-1125

Sarik Ghazarian, Johnny Wei, Aram Galstyan, and Nanyun Peng. 2019. Better automatic evaluation of open-domain dialogue systems with contextualized embeddings. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 82–89. Association for Computational Linguistics, Minneapolis, Minnesota. DOI: https://doi.org/10.18653/v1/W19-2310

Prakhar Gupta, Shikib Mehri, Tiancheng Zhao, Amy Pavel, Maxine Eskénazi, and Jeffrey P. Bigham. 2019. Investigating evaluation of open-domain dialogue systems with human generated multiple references. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, SIGdial 2019, Stockholm, Sweden, September 11-13, 2019, pages 379–391. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W19-5944, PMCID: PMC6813692

Matthew Henderson, Paweł Budzianowski, Iñigo Casanueva, Sam Coope, Daniela Gerz, Girish Kumar, Nikola Mrkšić, Georgios Spithourakis, Pei-Hao Su, Ivan Vulic, and Tsung-Hsien Wen. 2019. A repository of conversational datasets.
In Proceedings of the Workshop on NLP for Conversational AI. Data available at github.com/PolyAI-LDN/conversational-datasets. DOI: https://doi.org/10.18653/v1/W19-4101

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017a. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2157–2169, Copenhagen, Denmark. Association for Computational Linguistics.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017b. Dailydialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, IJCNLP 2017, Taipei, Taiwan, November 27 - December 1, 2017 - Volume 1: Long Papers, pages 986–995. Asian Federation of Natural Language Processing.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2122–2132. The Association for Computational Linguistics.

Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an automatic turing test: Learning to evaluate dialogue responses. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1116–1126. Association for Computational Linguistics.
DOI: https://doi.org/10.18653/v1/P17-1103

Nikita Moghe, Siddhartha Arora, Suman Banerjee, and Mitesh M. Khapra. 2018. Towards exploiting background knowledge for building conversation systems. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2322–2332. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1255

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1073083.1073135

Yookoon Park, Jaemin Cho, and Gunhee Kim. 2018. A hierarchical latent structure for variational conversation modeling. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1792–1801, New Orleans, Louisiana. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/N18-1162

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report. OpenAI.

Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised modeling of Twitter conversations. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 2-4, 2010, Los Angeles, California, USA, pages 172–180. The Association for Computational Linguistics.
Vasile Rus and Mihai C. Lintean. 2012. A comparison of greedy and optimal assessment of natural language student input using word-to-word similarity metrics. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, BEA@NAACL-HLT 2012, June 7, 2012, Montréal, Canada, pages 157–162. The Association for Computer Linguistics.

Ananya B. Sai, Mithun Das Gupta, Mitesh M. Khapra, and Mukundhan Srinivasan. 2019. Re-evaluating ADEM: A deeper look at scoring dialogue responses. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27-February 1, 2019, pages 6220–6227. AAAI Press. DOI: https://doi.org/10.1609/aaai.v33i01.33016220

Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? How controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 1702–1723. Association for Computational Linguistics.

Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2018. A survey of available corpora for building data-driven dialogue systems: The journal version. Dialogue & Discourse, 9(1):1–49. DOI: https://doi.org/10.5087/dad.2018.101

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 3776–3784. AAAI Press.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 3295–3301. AAAI Press.

Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. 2017. Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. CoRR, abs/1706.09799.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2019. Machine translation evaluation with BERT regressor. ArXiv, abs/1907.12679.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pages 196–205. The Association for Computational Linguistics. DOI: https://doi.org/10.3115/v1/N15-1020

Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2018. RUBER: an unsupervised method for automatic evaluation of open-domain dialog systems.
In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 722–729. AAAI Press.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, May 23-25, 2012, pages 2214–2218. European Language Resources Association (ELRA).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 2204–2213. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P18-1205

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020a. BERTScore: Evaluating text generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020b. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, ACL 2020, Online, July 5-10, 2020, pages 270–278. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/2020.acl-demos.30

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 1441–1451. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P19-1139