Reducing Conversational Agents’ Overconfidence Through
Linguistic Calibration

Sabrina J. Mielke1,2 Arthur Szlam2 Emily Dinan2 Y-Lan Boureau2
1Department of Computer Science, Johns Hopkins University, USA 2Facebook AI Research, USA
sjmielke@jhu.edu {aszlam,edinan,ylan}@fb.com

Abstract

While improving neural dialogue agents’ fac-
tual accuracy is the object of much research,
another important aspect of communication,
less studied in the setting of neural dialogue, is
transparency about ignorance. In this work,
we analyze to what extent state-of-the-art
chit-chat models are linguistically calibrated
in the sense that their verbalized expression of
doubt (or confidence) matches the likelihood
that the model’s responses are factually incor-
rect (or correct). We find that these models are
poorly calibrated, yet we show that likelihood
of correctness can accurately be predicted.
By incorporating such metacognitive features
into the training of a controllable generation
model, we obtain a dialogue agent with greatly
improved linguistic calibration.

1 Introduction

Neural generative open-domain English-language
dialogue agents have made progress towards the
ability to carry on chit-chat conversations with hu-
mans (Adiwardana et al., 2020; Roller et al., 2021).
Recent models—trained on large swaths of data
from the Internet to mimic human–human con-
versations—can name their favorite sports teams,
describe what it’s like to be the owner of two dogs,
or share their opinions on tacos. However, ask a
state-of-the-art chatbot ‘‘Which is heavier, 1 kg
feathers or 1 kg stone?’’, and it might confidently
answer: ‘‘Feathers, because they are heavier than
a kilogram of any other material.’’1 This amusing
overconfidence can become problematic if some-
one genuinely doesn’t know the answer and is
misled into believing something false. Genera-
tive chit-chat dialogue agents have many issues
going beyond inaccurate answers (Xu et al., 2020;
Bender et al., 2021), making them currently
generally unsuitable for applications other than
entertainment and research. Nevertheless, better

control of the alignment between the confidence
of an answer and its likelihood of being correct
seems like a promising type of remediation: It
makes models more transparent about their limi-
tations directly in the dialogue rather than through
extrinsic instructions for adequate use that people
might overlook or forget. This goal applies Grice’s
maxim of quality (Grice, 1975) on a metacognitive
level, namely, being truthful about what one
knows. Here, this would mean that if we can
train accurate predictors of correctness from in-
formation available to the model (input words
and internal representations), then model gener-
ations should convey that information. The skill
of handling uncertainty would be desirable even
if accuracy on factual questions ever became per-
fect: Some questions do not have known answers,
or have answers that depend on a context that a
dialogue agent cannot know, making it perilous
to ‘‘ignore ignorance’’ (Smithson, 2012; Ravetz,
1993).

In this work, we seek to understand whether
a model’s verbalized expression of confidence
(‘‘Obviously, . . .’’) or doubt (‘‘I’m not sure,
but. . .’’) in its answer (which we refer to
throughout as linguistic confidence) corresponds
to the likelihood that the answer is correct, and
if not, whether we can fine-tune the models with
controlled generation techniques to achieve better
alignment. In other words, do state-of-the-art
open domain dialogue agents ‘‘know’’ what they
do not know? If yes, can this knowledge inform
their responses, to achieve better verbalized
metacognition?

We thus make three main contributions. (1)
We annotate a state-of-the-art chit-chat model’s
responses to a large-scale QA task for both fac-
tual correctness and linguistic confidence.2 (2)
Using these annotations, we find that the model

1Answer generated by BST 2.7B (Roller et al., 2021).

2This data is released through the ParlAI framework at
https://parl.ai/projects/metacognition/.


Transactions of the Association for Computational Linguistics, vol. 10, pp. 857–872, 2022. https://doi.org/10.1162/tacl_a_00494
Action Editor: Claire Gardent. Submission batch: 10/2021; Revision batch: 4/2022; Published 8/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

is poorly calibrated, in that linguistic confidence
does not match factual correctness, but we show
that we can train a much better correctness
predictor directly from the chit-chat model’s
representations. (3) We use this trained predictor
within a controllable generation model to create a
pipeline that greatly improves the calibration of a
state-of-the-art chit-chat model.

Figure 1: Proposed method for re-calibrating a
generative dialogue agent. This pipeline involves a
calibrator that returns the probability that the original
dialogue agent’s answers are correct, as well as a
fine-tuned model which controls for linguistic confidence;
the linguistic confidence is adjusted based on
the probability returned by the calibrator, yielding a
response for which the linguistic confidence aligns with
the likelihood that the dialogue agent’s answer is correct.
This is our proposed calibrator-controlled chatbot.

2 Related Work

Knowledge in Open-Domain Chatbots We focus
on neural generative open-domain dialogue
agents, rather than general purpose language
models or QA models trained to produce a factual
answer given a question. Much progress
has been made by training large-scale Transformer
(Vaswani et al., 2017) encoder-decoder
models for dialogue tasks (Roller et al., 2021;
Adiwardana et al., 2020; Zhang et al., 2020).
These sequence-to-sequence models are typically
trained on large amounts of data from the Internet
to produce a conversational response given a
dialogue history as input. Despite impressive
performance on chit-chat tasks, these models are often
prone to hallucinating knowledge (Roller et al.,
2021). Dinan et al. (2019) and Gopalakrishnan
et al. (2019) have proposed additional conditioning
on a knowledge base to address this issue, but
success is only partial, so we are far from being
able to assume that even a knowledge-conditioned
model reliably gives correct answers.

Overconfidence Humans’ assessments of their
own accuracy (confidence) routinely exceed their
objective accuracy (correctness) (Pallier et al.,
2002). This overconfidence effect has been well
established, robustly showing that humans are
poorly calibrated when completing general knowledge
tasks (Juslin, 1994; Kleitman and Stankov,
2001; Stankov and Crawford, 1996; Stankov,
1998). Kamath et al. (2020) attempt to correct
overconfidence in neural models by training QA
models to abstain from answering questions in
which they are likely to err, using probabilistic
calibration (see next paragraph). We instead focus
on getting conversational models to communicate
their confidence verbally, that is, still produce an
answer, but one less misleading as to its expected
correctness.

Probabilistic Calibration Much work has been
dedicated to the probabilistic calibration of deep
neural networks. Guo et al. (2017) show that modern
neural networks for classification tasks are
poorly calibrated: Models’ confidence estimate
that their answer is correct doesn’t match the
empirical rate of correctness. This contrasts with
previous findings that show that (earlier) neural
networks are well-calibrated on binary classification
tasks (Niculescu-Mizil and Caruana,
2005). We hereafter refer to this notion of calibration
as probabilistic calibration to distinguish
it from linguistic calibration. More recently,
probabilistic calibration has been explored in the
space of large-scale language models (LMs).
Desai and Durrett (2020) find that the pre-trained
Transformers RoBERTa (Liu et al., 2019) and
BERT (Devlin et al., 2019) are well-calibrated
in-domain on the tasks of Natural Language Inference
(NLI), paraphrase detection, and commonsense
reasoning. Similarly, Jagannatha and Yu
(2020) calibrate BERT and DistilBERT (Sanh
et al., 2019) for Part-of-Speech tagging (POS),
Named Entity Recognition (NER), and QA tasks.
Rather than using LMs as target predictors on
classification tasks like NLI and NER, Jiang et al.
(2021) instead focus on LMs as natural language
generators and analyze T5 (Raffel et al., 2020), a
large scale Transformer with an encoder-decoder
architecture. The authors find that it is poorly
calibrated in its probability estimates on QA tasks.
Conversely, Radford et al. (2019) find that GPT2
is reasonably well calibrated on QA tasks, with
an accuracy of 63.1% on the 1% of questions
it is most confident in on Natural Questions
(Kwiatkowski et al., 2019).

Controlled Response Generation We aim to
reformulate answers while controlling for their
expressed certainty. This requires style transfer
or controlled generation techniques, which
encourage certain attributes to fit prescribed values,
for example, a given length or sentiment.
Lample et al. (2019) proposed a method to exert
simultaneous control over multiple attributes
based on concatenated learned control tokens. We
similarly condition on an initial source text and
concatenate multiple control tokens when generating
responses. Keskar et al. (2019) trained
a large-scale language model with control codes
that govern style, content, and task-specific behavior.
In the context of open-domain dialogue,
See et al. (2019) used control on attributes such
as number of questions with the aim of maximizing
engagingness of dialogue models. Using
larger state-of-the-art conversational architectures,
Smith et al. (2020a) and Madotto et al. (2020)
compared several methods to achieve control in
conversation; here, we use the simple method of
training attribute-specific control tokens that was
the most effective in Smith et al. (2020a) for a
variety of styles. While our experiments in §5.2
suggest that good correctness prediction performance
can be achieved using just the question
without yet committing to the substance of an
answer, which would make less constrained text
generation useful, the initial goal of this paper is
to control the linguistic confidence of an answer
without changing its substance. For this, techniques
that condition on a source response are
more relevant to us than less tightly constrained
controlled techniques. Retrieve-and-refine generation
(Weston et al., 2018; Roller et al., 2021)
conditions on a possible answer, but does not control
the style of the response. Here, we condition
on the initial answer produced by a vanilla
conversational model rather than a retrieval model,
and then add additional control tokens to control
the style.

3 Quantifying Linguistic Confidence

Linguistic Confidence We aim to align a model’s
expressed confidence with its actual correctness,
rather than increase that correctness. We
focus on models’ linguistic confidence, that is,
confidence as determined by their linguistic choices
(e.g., ‘‘I don’t know, but. . .’’ vs. ‘‘Obviously, it is. . .’’).
Do these models’ responses reflect whether they
‘‘know’’ what they do not know (metacognition)?
If not, is it because it is impossible to predict without
external input (such as the correct answer) how
likely it is that a model answer would be correct, or
because that information does not get transferred
to the response? The following sections introduce
the tasks and models that we use to shed light on
these questions.

Closed-book QA as a Testbed The task of
Question Answering (QA) traditionally has a
model answer a general factoid question that a user
might ask, allowing the model to consult given
supporting evidence, for example, search results
or related Wikipedia articles, to give an answer.3
In this work, models do not have access to
supporting evidence. Instead, we test what knowledge
about the world a dialogue model has
stored in its weights. Forcing a model to generate
thus is called closed-book QA (Raffel et al.,
2020), and any factoid-style question answering
dataset can be used in this manner. Following
GPT-3 (Brown et al., 2020), we use TriviaQA
(Joshi et al., 2017) as our dataset, as it covers
a large output space (unlike WebQuestions
[Berant et al., 2013], which is restricted to Freebase),
and contains fully grammatical questions as
opposed to search queries (unlike Natural Questions
[Kwiatkowski et al., 2019], which contains
ungrammatical search queries).

To convert it into a closed-book QA dataset
we can use, we merge the dataset’s ‘‘Web’’ and
‘‘Wikipedia’’ sections (including shared questions
only once), remove all provided evidence docu-
ments for the questions, strip the (Wikipedia-
based) aliases of their ‘‘(disambiguation)’’ suffix,
and then use these aliases to create a list of

3Sometimes, the task of Reading Comprehension is also
referred to as QA, but there, models are given specific
paragraphs of texts and asked to answer questions about that
paragraph using that paragraph.

allowable gold answers. We end up with 76523
question-answer pairs in the training set and 9961
in the validation set. An example entry in this
dataset looks like this:

What is the name of the tool used to
sharpen a knife? (Steel, Crude steel,
Long steel products, Steel, Steel (alloy),
Steel (metal), Steel Construction,
Steel in Africa, Steel industry, Steel
manufacture, Steel plate, Steel sheeting,
Steel truss, Steel worker, Steel workers,
Steels, Steelworker, Steelworkers,
Titanic steel, Unwrapped steel)

Despite the list of aliases of the gold answer
(‘‘Steel,’’ given first in the otherwise alphabeti-
cally sorted list), evaluating correctness of answers
may not always be so straightforward—consider
this example answer:4 ‘‘It is called a whetstone. It
is a stone that is used for sharpening knives.’’
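The conversion steps described above (merging the two sections, keeping shared questions once, dropping evidence documents, and stripping the ‘‘(disambiguation)’’ suffix) can be sketched as follows. This is only a minimal illustration: the entry layout (dictionaries with "question", "aliases", and "evidence" keys) and the function names are assumptions made for the example, not the TriviaQA file format or the released code, and identifying shared questions by their text is likewise an assumption.

```python
import re


def strip_disambiguation(alias):
    """Drop a trailing "(disambiguation)" suffix from a Wikipedia-based alias."""
    return re.sub(r"\s*\(disambiguation\)$", "", alias.strip())


def to_closed_book(web_entries, wikipedia_entries):
    """Merge both sections into closed-book question/answer-list pairs.

    Each entry is assumed (hypothetically) to be a dict with the keys
    "question", "aliases", and "evidence".
    """
    merged = {}
    for entry in list(web_entries) + list(wikipedia_entries):
        question = entry["question"]
        if question in merged:
            # Shared question: include it only once.
            continue
        merged[question] = {
            "question": question,
            # Evidence documents are deliberately not copied over.
            "answers": [strip_disambiguation(a) for a in entry["aliases"]],
        }
    return list(merged.values())
```

The cleaned alias list then serves as the set of allowable gold answers for each question.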

Annotation Scheme The answers that a chat-
bot gives for a question are full-length sentences
that may or may not answer the question, may
or may not do so correctly, and may or may not
express confidence linguistically. We settle on re-
lating such generations to the gold answer aliases
in our dataset by having humans annotate gener-
ations according to the annotation scheme shown
in Figure 2. Unless the question is not even acknowledged
as such (OT, short for ‘‘off-topic’’),
the chatbot’s response is judged for linguistic confidence
and for correctness with respect to the
provided gold answers. Figure 3 illustrates all 13
resulting classes with example answers in the GUI
that is presented to human annotators.

The fine-grained 4-way splitting of correctness
is designed to provide guidance to human an-
notators and reduce ambiguity. After the initial
annotation, we simplify all correctness annota-
tions to binary correctness that better aligns with
the type of linguistic framing we would like the
model to be able to express, mapping OTHER and
WRONG to incorrect and EXTRA and RIGHT
to correct.
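As a small illustration of the label space, the sketch below enumerates the 3 × 4 + 1 = 13 classes of the taxonomy and applies the binary simplification just described. The label strings are taken from the text; the constant and function names are hypothetical, and how the off-topic class OT is treated for correctness is not specified here, so it is left out of the mapping.

```python
from itertools import product

CONFIDENCE_LABELS = ("DK", "LO", "HI")
CORRECTNESS_LABELS = ("OTHER", "WRONG", "EXTRA", "RIGHT")

# 3 x 4 (confidence, correctness) combinations plus the single off-topic class.
ALL_CLASSES = [("OT", None)] + list(product(CONFIDENCE_LABELS, CORRECTNESS_LABELS))

# Binary simplification: OTHER and WRONG count as incorrect, EXTRA and RIGHT as correct.
BINARY_CORRECTNESS = {"OTHER": False, "WRONG": False, "EXTRA": True, "RIGHT": True}


def binarize_correctness(label):
    """Map a 4-way correctness label to binary correctness."""
    return BINARY_CORRECTNESS[label]
```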

4This answer was generated by the vanilla BST 2.7B
model we consider in §3, and shows that human annotations
are not always reliable: All three annotators judge the cer-
tainty of this response to be LO, even though the answer
itself expresses no doubt. As for correctness, two say WRONG
and one says CORRECT, reflecting uncertainty as to how a
factually correct answer not included in the allowable gold
answers should be graded.

Figure 2: A taxonomy of linguistic confidence
and correctness for TriviaQA answers provided by
a dialogue agent, yielding 3 × 4 + 1 = 13 classes.

The 3-way splitting of confidence is intuitively
richer than simply splitting along confident vs.
not confident (HI vs. not); however, many responses
were of the kind ‘‘I don’t know, but I
know that. . . ,’’ which makes them ambiguous.
Note that the minimum length of responses en-
forced by the model rated as most engaging in
Roller et al. (2021) precludes responding with a
straight ‘‘I don’t know,’’ which likely makes the
ambiguity more salient (see discussion of mini-
mum length in §3). We nevertheless release the
full 3-way annotations in case they are useful for
further research.

Automatic Annotation Noting predictability in
patterns of human annotation, we seek to quan-
tify whether automatic annotation would be an
adequate substitute. The left half of Figure 4
indeed confirms that the simplified binary correct-
ness annotations are highly predictable by simply
checking whether any of the answer aliases ap-
pear in the generation (tokenized). We will refer
to this way of scoring correctness as match-based,
and use it as an automatic proxy for human
annotations, when the latter is cost-prohibitive.
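A minimal sketch of match-based scoring follows, together with the precision and recall computation used to compare it against binarized human labels. The tokenization (lowercasing and splitting on non-word characters) is an assumption made for this example, as are the function names; the text only states that aliases are matched against the tokenized generation.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens (an assumed, simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def match_based_correct(generation, aliases):
    """Return True if any gold alias occurs as a token sequence in the generation."""
    gen = tokenize(generation)
    for alias in aliases:
        target = tokenize(alias)
        n = len(target)
        if n and any(gen[i:i + n] == target for i in range(len(gen) - n + 1)):
            return True
    return False


def precision_recall(predicted, gold):
    """Precision and recall of predicted labels, treating gold labels as truth."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Matching on token sequences rather than raw substrings means that, in this sketch, ‘‘Steelworkers’’ in a response does not count as a match for the alias ‘‘Steel’’.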

Linguistic confidence is harder to automatically
infer using template- and match-based methods,
as there are many ways to express doubt or
confidence. Nevertheless, we find that we obtain usable
predictions by training a BERT-based classifier on
a set of 2000 annotated question-prediction pairs.5

5These samples come from the TRAIN SET (see §5.1);
the classifier is the bert classifier from ParlAI
(Miller et al., 2017), fine-tuning the final layer and
predicting output classes from the [CLS] token. We did not

Figure 3: Human-written example answers to the question ‘‘Who was the US president during hurricane Katrina?’’
(correct answer: George W. Bush), annotated for both linguistic confidence and correctness, using the taxonomy
given in Figure 2. Emoji in this figure only are Twitter Emoji (Twemoji), distributed under CC-BY 4.0.

Figure 4: Composition of the vanilla bot’s answers on the VALID SET (in % of total): comparing match-based
correctness scoring to human annotations (left; treating binarized human labels as gold, the match-based correctness
labels have 0.85 precision and 0.91 recall) and BERT-based linguistic confidence scoring to human annotations
(right; binarizing linguistic confidence into HI and not-HI, the classifier has 0.90 precision and 0.97 recall for
detecting linguistic confidence).

We will refer to this way of classifying 4-way certainty
(DK, LO, HI, and OT) as BERT-based and
likewise use it extensively for training. This classifier
works well (see the right half of Figure 4)
for distinguishing DK/LO from HI, but struggles to
discern between DK and LO (likely due to inconsistency
in human annotation for this distinction,
as described above), and to a lesser degree OT and HI.

Models Our base model is the state-of-the-art
open-domain English-language dialogue system
BlenderBot from Roller et al. (2021). ‘‘Blender-
Bot’’ refers to a suite of models of varying
sizes that use a Seq2Seq Transformer architecture
(Vaswani et al., 2017). These models were
pretrained on 1.5B training examples using an
existing Reddit dataset extracted and obtained by
a third party and made available on pushshift.io
(Baumgartner et al., 2020).6 We use the 2.7B parameter
version that is fine-tuned on the Blended
Skill Talk tasks (BST; Smith et al., 2020b) and
consider the outputs of beam search using the
model’s recommended standard parameters, which
include a requirement for generated answers to
have at least 20 tokens. We choose this model
(referred to as ‘‘vanilla’’ from here on) because it
is the configuration that is rated as most engaging
by humans (Roller et al., 2021) and therefore the
most realistic use-case, even though it is not the

tune this model heavily, or try other tricks like averaging
embeddings, as we were satisfied with performance.

6https://files.pushshift.io/reddit/.

best-performing QA model.7 This vanilla model
attains an accuracy of only 4.8% on the test set,8
yet it answers 29.45% of questions confidently
(HI), making only 14% of the model’s confident
answers actually correct (see Figure 6).

We also try to examine what kind of questions
are intrinsically ‘‘difficult’’ in a way that can be
detected by shallow features. For example, we
might hypothesize that questions about locations
might be easier than questions about people; this
would be reflected by the words ‘‘where’’ and
‘‘who’’ in a question being predictive of correctness.
To obtain such predictive surface features, we
train a single sparse logistic regression model on
all 2, 3, . . . , 7-grams that appear at least 5 times in
our human-annotated test set to predict binarized
correctness and binarized certainty from questions
(1166 such n-grams) or from answers (1882 such
n-grams). These four regressions are performed
independently and use sparsity-inducing L1
regularization. This yields between 9 and 19 n-grams
that are useful indicators; the three most negative
and positive are shown in Table 1.
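The candidate features for these regressions can be collected as in the sketch below, which extracts all 2- to 7-grams and keeps those seen at least 5 times. It covers only the feature extraction step, not the L1-regularized regression itself. Whitespace tokenization, the hypothetical "<s>" start-of-text marker, and counting every occurrence (rather than once per text) are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n_min=2, n_max=7):
    """Yield every n-gram (as a tuple) with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def frequent_ngrams(texts, min_count=5):
    """Return the set of n-grams occurring at least min_count times overall."""
    counts = Counter()
    for text in texts:
        # "<s>" is a hypothetical marker so that text-initial words form n-grams.
        tokens = ["<s>"] + text.split()
        counts.update(ngrams(tokens))
    return {gram for gram, count in counts.items() if count >= min_count}
```

Each retained n-gram would then become one binary or count feature in a sparse regression over questions or answers.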

4 Re-calibrating Chatbots’ Language

Given that BST 2.7B and all other Blender-
Bot variants are poorly linguistically calibrated
(specifically, overconfident in answers to TriviaQA
questions), we introduce a pipeline for
improving calibration.

Pipeline Overview We propose training a
calibrator and using controllable generation tech-
niques to allow generative dialogue agents to
better ‘‘own their ignorance,’’ that is, such that
the models’ linguistic confidence better aligns
with the probability that the answers are correct.
The overall pipeline is illustrated9 in Figure 1.

7It is worth noting that removing the minimum length
requirement and not fine-tuning on BST did improve QA
performance slightly (from 5.0% to 6.9% accuracy on the
VALID SET), and increasing the model capacity to 9.4B pa-
rameters even raised it to 8.5% accuracy. Improving model
capacity without suffering losses in engagingness is an im-
portant avenue for further research that is orthogonal to our
proposal.

8We also experimented with top-k and nucleus sampling,
which slightly reduced accuracies, and looked at correct-
nesses of the top few beams instead of just the single most
likely generation, but those usually were similar to the top-1
answer in terms of correctness.

9The robot emoji in this figure was drawn by Mariella
Steeb and distributed as part of the OpenMoji project under
CC-BY-SA 4.0. The crystal ball illustration was drawn by

Correctness
        from questions            from answers
         1.098  city is            0.506  It is the
         0.187  <s> What           0.502  It was a
         0.155  is the             0.375  used to
   ↓    −0.292  <s> What was      −0.595  I do
        −0.658  <s> Which         −0.685  but I
        −0.792  <s> Who           −0.874  I don’t

Certainty (OT/DK/LO ≤ HI)
        from questions            from answers
  HI     0.737  is a               0.812  <s> It
         0.565  in which           0.152  in the
         0.193  is the             0.005  <s> The
  LO    −0.355  in the            −2.459  <s> I
  DK    −0.540  <s> Who           −2.750  but I
  OT    −0.782  <s> Which         −4.122  I’m not

Table 1: Predictive n-grams (with n ∈
{2, . . . , 7}) in questions and answers with their
associated weights, negative weights indicating
a push towards ‘‘incorrect’’ and OT/DK/LO, and
positive weights counting towards ‘‘correct’’
and HI.

We first train a calibrator to return the empiri-
cal probability that the model’s answer is correct
(without seeing the gold answer), and fine-tune
the generative dialogue model to enable control
over linguistic confidence. Using the calibrator
and the controllable generation model, we adjust
the dialogue agent’s response by choosing linguis-
tic confidence control tokens that align with the
probability returned by the calibrator, yielding
a calibrator-controlled chatbot.
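The way the pieces fit together at inference time can be sketched as below. The calibrator and the controlled model are passed in as plain callables, and the cutoffs and the tokens assigned to each probability range are parameters: which token is used on which side of a cutoff is not specified in this section, so the values in the usage note are purely illustrative, and all function names are hypothetical.

```python
from bisect import bisect_right


def choose_control_token(p_correct, cutoffs, tokens):
    """Pick the token for the probability range that contains p_correct.

    cutoffs must be sorted ascending; tokens needs one more entry than cutoffs.
    """
    if len(tokens) != len(cutoffs) + 1:
        raise ValueError("need exactly one more token than cutoffs")
    return tokens[bisect_right(cutoffs, p_correct)]


def calibrator_controlled_response(question, vanilla_answer, calibrator,
                                   controlled_model, cutoffs, tokens):
    """Re-word the vanilla answer with confidence matching the calibrator output."""
    p_correct = calibrator(question, vanilla_answer)
    token = choose_control_token(p_correct, cutoffs, tokens)
    return controlled_model(question, vanilla_answer, token)
```

For instance, `choose_control_token(0.2, [0.375], ["LO", "HI"])` selects the first token; this particular two-way assignment is an assumption for the example, not the configuration reported in the paper.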

Training a Calibrator The first step involves
training a calibrator that predicts the probability
that the model’s response is correct, given the
question and answer, and the vanilla model’s inter-
nal representations of both. We choose an architec-
ture which transforms the vanilla model’s encoder
and decoder hidden states into logits correspond-
ing to our two classes (correct and incorrect).10

Vincent Le Moign and is distributed as part of the Streamline
Emoji Project under CC-BY 4.0.

10The model applies a linear layer followed by GELU
activation (Hendrycks and Gimpel, 2016) to all states
individually, aggregates the resulting vectors via a max pool-
ing operation, and finally, transforms that result using a
linear-GELU-linear MLP to return logits. All hidden layers
are of size 256.

The model is trained using 50,000 questions from
the full TriviaQA training split with the vanilla
model’s corresponding responses, automatically
annotated for correctness using the match-based
annotation scheme (see §3). Ablations in §5.2
show that different models for the calibrator, some
not using the answer, some not using the internal
representations, yield similar results.
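The calibrator architecture described in footnote 10 (a linear layer with GELU applied to every hidden state, max pooling over states, then a linear-GELU-linear MLP producing two logits) can be sketched in plain Python as follows. This is a forward pass with randomly initialized weights only, not the trained model: the initialization scheme, pooling encoder and decoder states in one set, and treating logit index 0 as ‘‘correct’’ are assumptions made for this illustration.

```python
import math
import random


def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(n_in, n_out, rng):
    """Random weight matrix (n_out rows of n_in values) and a zero bias."""
    bound = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def linear(vector, layer):
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weight, bias)]


def init_calibrator(state_dim, hidden=256, seed=0):
    rng = random.Random(seed)
    return {
        "per_state": init_linear(state_dim, hidden, rng),
        "mlp_in": init_linear(hidden, hidden, rng),
        "mlp_out": init_linear(hidden, 2, rng),
    }


def calibrator_logits(states, params):
    """states: list of hidden-state vectors (encoder and decoder states together)."""
    # 1) Linear + GELU applied to each state individually.
    projected = [[gelu(x) for x in linear(s, params["per_state"])] for s in states]
    # 2) Max pooling over states, dimension by dimension.
    pooled = [max(column) for column in zip(*projected)]
    # 3) Linear-GELU-linear MLP returning two logits.
    hidden = [gelu(x) for x in linear(pooled, params["mlp_in"])]
    return linear(hidden, params["mlp_out"])


def probability_correct(logits):
    """Softmax probability of the class at index 0 (assumed to mean "correct")."""
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    return exps[0] / sum(exps)
```

Because of the max pooling step, the output of this sketch does not depend on the order or number of states, which lets it handle questions and answers of varying length.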

Training a Controllable Generation Model
The next step trains a generative model that will
adjust the linguistic confidence of a response,
provided the original response and a control token
representing the desired linguistic confidence: DK,
LO, or HI. We achieve this by fine-tuning
the generative dialogue model in two steps using
controllable conditioned generation techniques.

Stage 1: Confidence Controllable Model We
first train a linguistic confidence controllable gen-
erative dialogue model following the method in
Smith et al. (2020a). We fine-tune the vanilla
model on the original BST tasks, augmented with
an additional task constructed from TriviaQA to
incorporate confidence signals: 25,000 questions
from the TriviaQA training split are augmented
with a control token capturing the vanilla model
response’s linguistic confidence, as given by the
BERT-based classifier (§3). The expected output
is the vanilla model’s response to the question.
All incorrectly answered examples and examples
with the OT label are discarded, and remaining ex-
amples are oversampled to have the same overall
certainty distribution as we see on the VALID SET.
The model thus learns to associate the linguistic
confidence of the response with the control tokens
and can generate responses with a desired degree
of confidence at inference time by setting appro-
priate control tokens. We refer to this model as
the only-certainty-controlled model.
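The construction of this additional Stage 1 task can be sketched as follows: incorrect and OT examples are dropped, and the rest are sampled with replacement until the confidence labels follow a target distribution. The record layout, the "<HI>"-style token spelling, appending the token after the question, and the function name are all assumptions made for this example.

```python
import random


def build_stage1_examples(records, target_distribution, n_total, seed=0):
    """Build (input, target) pairs for confidence-controlled fine-tuning.

    records: dicts with "question", "response", "confidence" (DK/LO/HI/OT)
             and "correct" (bool), describing vanilla model responses.
    target_distribution: share of each confidence label, e.g. as measured
             on a held-out set; shares are assumed to sum to 1.
    """
    # Discard incorrectly answered examples and off-topic ones.
    kept = [r for r in records if r["correct"] and r["confidence"] != "OT"]
    by_label = {}
    for record in kept:
        by_label.setdefault(record["confidence"], []).append(record)

    rng = random.Random(seed)
    examples = []
    for label, share in target_distribution.items():
        pool = by_label.get(label, [])
        if not pool:
            continue
        # Sampling with replacement oversamples rare confidence labels.
        for _ in range(round(share * n_total)):
            record = rng.choice(pool)
            examples.append({
                "input": f"{record['question']} <{label}>",
                "target": record["response"],
            })
    return examples
```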

Stage 2: Confidence-and-Content Controlled
Model Adjusting the linguistic confidence of
a generated response via control tokens with
the only-certainty-controlled model often also
changes the content of the response. Simultaneous
control over both linguistic confidence and
content would be preferable, to allow changing
the linguistic confidence of a given response
without altering the provided answer for a question.
We achieve this in a second stage of
fine-tuning by constructing a task that simultaneously
conditions on linguistic confidence and
response content. Training prompts for this task
are constructed by concatenating the same 25,000
TriviaQA training split questions with the vanilla
model’s response, a linguistic confidence control
token as before, and also an additional control
token capturing whether the content of the
only-certainty-controlled model’s response when
given that question and linguistic confidence control
token is the same as or different from the
vanilla model’s response. The
expected output is the only-certainty-controlled
model’s response to the question with that linguistic
confidence control token. The content control
token is ‘‘same’’ if both the vanilla model’s and the
only-certainty-controlled model’s responses to the
question are correct, and ‘‘different’’ if only one of
them is correct. Examples where both the vanilla
model’s and the only-certainty-controlled model’s
responses are incorrect are discarded, because there
are so many different ways to be incorrect. Choosing
the ‘‘same’’ token at inference time yields a model
which adjusts the linguistic confidence of the
vanilla model’s response (provided as input) without
changing the answer to the question. We refer
to this model as our ‘‘controlled’’ model, to be
used in the final pipeline.
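A single Stage 2 training example could be assembled as in the sketch below, following the rules just described for the content control token. The token spellings ("<SAME>", "<DIFF>", "<HI>" and so on), the order and separator used when concatenating, and the function name are assumptions for illustration only.

```python
def build_stage2_example(question, vanilla_response, controlled_response,
                         confidence_label, vanilla_correct, controlled_correct):
    """Return an (input, target) pair, or None if the example is discarded."""
    if not vanilla_correct and not controlled_correct:
        # Both responses incorrect: discarded.
        return None
    # Both correct: content is treated as the same; exactly one correct: different.
    content_token = "<SAME>" if (vanilla_correct and controlled_correct) else "<DIFF>"
    prompt = " ".join([
        question,
        vanilla_response,
        f"<{confidence_label}>",
        content_token,
    ])
    return {"input": prompt, "target": controlled_response}
```

At inference time, the same prompt layout would be used with the content token fixed to the ‘‘same’’ value and the confidence token chosen from the calibrator output.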

5 Results

We describe data collection and annotation results,
as well as experimental results and analysis on the
vanilla model and each stage of the pipeline for
the calibrator-controlled chatbot.

5.1 Data Collection and Annotation

We collect human annotation for both training data
and for our final evaluation of the vanilla model and
the calibrator-controlled chatbot. Question and
response pairs are annotated for both correctness
and linguistic confidence using the annotation
scheme described in §3. Crowdsource annotators
annotate questions in batches of nine questions,
after completing an ‘‘onboarding’’ test of three
questions.

Training Data We collect annotations for the
vanilla model’s responses to 2000 questions each
from the train and validation splits of TriviaQA.
Each question and response pair is annotated by
one crowdsource annotator for the training split
and three crowdsource annotators for the valida-
tion split. We refer to these splits as the TRAIN SET
and the VALID SET throughout; we use the TRAIN

SET to train the BERT-based classifier (§3) and
for early-stopping the calibrator training; we use
the VALID SET for early-stopping the controllable
generation model fine-tuning steps and for tuning
hyperparameters for the BERT-based classifier,
calibrator, and the controllable generation models.

Figure 5: Calibrator performance. Performance evaluated on the TEST SET by comparing the ratio of answers
that were actually correct to the probability returned by the classifier (binned). The size and label indicate the
number of question and answer pairs in each of 20 bins.
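The binned comparison shown in Figure 5, and the summary metrics reported in Table 2, can be computed along the lines of the sketch below. It uses the standard definitions of Expected and Maximum Calibration Error from Guo et al. (2017) with equal-width bins, plus the average negative log likelihood. Equal-width binning and the function names are assumptions; the two-bin setting in Table 2 instead splits at a single threshold, which this sketch does not reproduce.

```python
import math


def bin_statistics(probabilities, labels, n_bins=20):
    """Per non-empty bin: (size, mean predicted probability, empirical accuracy)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    stats = []
    for members in bins:
        if members:
            size = len(members)
            mean_p = sum(p for p, _ in members) / size
            accuracy = sum(1 for _, y in members if y) / size
            stats.append((size, mean_p, accuracy))
    return stats


def calibration_errors(probabilities, labels, n_bins=20):
    """Return (ECE, MCE): size-weighted mean and maximum of |accuracy - mean p|."""
    stats = bin_statistics(probabilities, labels, n_bins)
    total = len(probabilities)
    gaps = [(size, abs(accuracy - mean_p)) for size, mean_p, accuracy in stats]
    ece = sum(size / total * gap for size, gap in gaps)
    mce = max(gap for _, gap in gaps)
    return ece, mce


def average_nll(probabilities, labels, eps=1e-12):
    """Average negative log likelihood of binary labels under the probabilities."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        total -= math.log(max(p if y else 1.0 - p, eps))
    return total / len(probabilities)
```

Because MCE takes the maximum over bins, a bin holding very few examples can dominate it, whereas ECE weights each bin by its size.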

Final Evaluation Data Three annotators label
5000 question and response pairs from the Trivia-
QA validation split (none of which overlap with
the VALID SET) for each of the vanilla model and the
controlled model under all three linguistic confidence
control settings (DK, LO, HI). We refer to
this size 3 × 4 × 5000 set as the TEST SET through-
out. Note that evaluating our calibrator-controlled
chatbot would only require annotating responses
generated with the one linguistic confidence con-
trol token dictated by the probability returned by
the calibrator for each example. Jedoch, collect-
ing annotations for all three linguistic confidence
control settings allows future work to improve the
calibrator in isolation, without having to re-train
and re-label the controlled outputs.

Inter-annotator Agreement We analyze agreement between annotators using the question and response pairs from the VALID SET that were annotated three times each. For linguistic confidence, 43.60% of samples have all three annotators agree and 97.60% have at least two agree. For four-way correctness, these ratios are 69.15% and 97.90%; for binary correctness, they are 94.35% and 99.40%. We restrict to samples for which a majority (binary on correctness) exists and take the majority label, reducing the size of the VALID

             thresh. 0.375     20 bins
calibrator   ECE     MCE       ECE     MCE      (A)NLL
+enc +dec    .2021   .2289     .0176   .2917    .1650
−enc +dec    .2017   .2873     .0145   .7250    .1628
+enc −dec    .2003   .2870     .0061   .7250    .1802
−enc −dec    .1989   .3000     .0113   .6250    .1786
BERT         .2063   .3446     .0156   .7750    .1635

Table 2: Comparison of different calibrators via Expected Calibration Error (ECE), Maximum Calibration Error (MCE), and (Average) Negative Log Likelihood (Guo et al., 2017). Closer to zero is better for all metrics. Both calibration error metrics require binning the data by its calibrator output probability. Threshold 0.375 means that we have only two bins, split on the threshold we end up choosing in the calibrator pipeline (§5.4); note that this threshold was picked using results from the +enc +dec set up, so it was not optimized for the other set ups. Note that the MCE in the 20 bin case is usually decided by a bin that contains a single incorrect example for which the calibrator happened to predict a high probability of being correct.

SET from 2000 to 1793 examples and the size of
the TEST SET from 5000 to 4793 examples.
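
The agreement ratios and the majority-vote filtering described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the annotation pipeline used for the paper: the function names `agreement_rates` and `majority_label` and the toy labels are hypothetical.

```python
from collections import Counter


def agreement_rates(annotations):
    """Return (fraction where all annotators agree, fraction where at least two agree).

    `annotations` is a list of per-sample label lists, e.g. three labels per sample.
    """
    n = len(annotations)
    all_agree = sum(1 for labels in annotations if len(set(labels)) == 1)
    two_agree = sum(
        1 for labels in annotations if Counter(labels).most_common(1)[0][1] >= 2
    )
    return all_agree / n, two_agree / n


def majority_label(labels):
    """Return the label chosen by a strict majority of annotators, or None if there is none."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


# Toy data (hypothetical): three confidence annotations per sample.
samples = [
    ["HI", "HI", "HI"],
    ["LO", "DK", "LO"],
    ["DK", "LO", "HI"],  # no majority: this sample would be dropped
    ["LO", "LO", "DK"],
]
all_three, at_least_two = agreement_rates(samples)
kept = [m for m in map(majority_label, samples) if m is not None]
print(all_three, at_least_two, kept)
```

Samples without a majority label are simply filtered out, which mirrors how restricting to majority-labelled samples shrinks the annotated sets.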

5.2 Calibrator Training Results

The calibrator-controlled chatbot can only be as
good as the calibrator, requiring the ability to
reliably predict how likely an answer is to be
correct without access to additional knowledge.
Figure 5 plots the observed correctness on the

question  How many Von Trapp children were there in the Sound of Music?
vanilla   I’m not sure, but I do know that the film was based on a novel by James Fenimore Cooper.
DK        I’m not sure, but I do know that there were eight children in the film.
LO        I’m not sure, but I do know that there were eight children in the film.
HI        There were eight Von Trapp children, including Sally Hemings, Sally Field, and Johnny Depp.

question  Who was the only British medal winner at the 2010 Winter Olympics?
vanilla   That would be Michael Phelps. He was born and raised in Birmingham, England.
DK        I’m not sure, but I do know that the winner of that event was Michael Phelps.
LO        I’m not sure, but I do know that the winner of that event was Michael Phelps.
HI        That would be Michael Phelps. He was born and raised in Birmingham, England.

Table 3: Contrasting answers from the vanilla model and the controlled model with all three certainty
control tokens, over a non-cherry-picked (randomly sampled) list of questions.

               calibrator-controlled chatbot
               OT     DK     LO     HI
vanilla  OT     5     10     72      4
         DK     0    237    959      2
         LO     0    104   1332      6
         HI     2    105    895     60

Table 4: Confusion matrix between the vanilla chatbot’s answer certainties and those of the calibrator-controlled chatbot.

TEST SET against the probability predicted by the
calibrator that we selected using the VALID SET,
and shows that the calibrator does a good job
predicting correctness probability. This makes it
possible to align expressed confidence with a more
realistic likelihood of getting the answer right.

We also evaluate calibration using the metrics from Guo et al. (2017). The first two metrics assume that examples are sorted into equally spaced bins by their predicted likelihood of correctness (which thus need not contain the same number of samples). We can define the ‘‘distance’’ between the predicted likelihood of correctness of a bin (the midpoint between the start and the end of the bin) and the actual correctness of the bin (the average of all individual examples, counting correct ones as 1, incorrect ones as 0); lower is better. Using these distances, the Expected Calibration Error (ECE) refers to the weighted average of all bins’ distances (weighted by how many samples out of the total were in a bin); our calibrator achieves an ECE of 0.018. Similarly, the Maximum Calibration Error (MCE) refers to the maximum of all bins’ distances; our calibrator reaches an MCE of 0.292. Finally, we can calculate the Average Negative Log-Likelihood (ANLL) by averaging every individual example’s NLL, which for correct examples is the negative log of the predicted likelihood p of being correct, and for incorrect answers is the negative log of the likelihood of the inverse event, i.e., −log(1 − p).
The calibrator reaches an ANLL of 0.165.
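
As a rough illustration of the three metrics just defined, the sketch below computes ECE, MCE, and ANLL from predicted correctness probabilities and binary correctness labels. It follows the description in the text (equally spaced bins, bin midpoint as the bin's predicted likelihood). The function name `calibration_metrics`, the clipping constant `eps`, the choice to skip empty bins, and the toy inputs are assumptions made for the example; this is not the evaluation code used for the reported numbers.

```python
import math


def calibration_metrics(probs, correct, n_bins=20):
    """Return (ECE, MCE, ANLL) for predicted probabilities and 0/1 correctness labels."""
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(probs, correct):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[index].append(c)

    n = len(probs)
    ece, mce = 0.0, 0.0
    for i, members in enumerate(bins):
        if not members:  # assumption: empty bins contribute nothing
            continue
        midpoint = (i + 0.5) / n_bins            # predicted likelihood of the bin
        accuracy = sum(members) / len(members)   # observed correctness of the bin
        distance = abs(midpoint - accuracy)
        ece += distance * len(members) / n       # weighted by the bin's share of samples
        mce = max(mce, distance)

    eps = 1e-12  # assumption: clip to avoid log(0)
    nll_sum = 0.0
    for p, c in zip(probs, correct):
        likelihood = p if c else 1.0 - p
        nll_sum += -math.log(max(likelihood, eps))
    return ece, mce, nll_sum / n


# Toy usage (hypothetical numbers) with two bins, similar in spirit to the
# two-bin "thresh." columns of Table 2.
ece, mce, anll = calibration_metrics([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 0], n_bins=2)
print(ece, mce, anll)
```

With few samples per bin, a single confidently wrong example can dominate MCE, which is the effect noted for the 20-bin MCE values.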

Note that these metrics show and reward capturing different degrees of uncertainty and incorrectness that may not be as apparent in our main results in §5.4, as most examples are low-confidence and low-correctness.

We also experimented with training calibrators with more limited inputs, which could potentially allow for controlled generation based merely on the question, which we leave for future work. The results of these ablations are shown in Table 2 and suggest that (1) even questions by themselves contain enough information to predict correctness almost as reliably as our full calibrator (+enc −dec), and (2) empirical correctness can even be predicted directly from words using an independent model (BERT, fine-tuned) to a reasonable accuracy. This could be seen as corroboration of our n-gram findings in Table 1, meaning that certain kinds of questions, for example, those asking for ‘‘who’’ and ‘‘which,’’ are intrinsically difficult and a fine-tuned BERT calibrator can pick up on the fact that the chatbot struggles with these kinds of questions. Unlike the n-gram predictors, BERT can probably also pick up on less shallow trends in questions that tend to be hard vs. easy, explaining its surprisingly good performance. Also, while our existing set up shows that calibration can be achieved reasonably well without leveraging model internals (BERT can do reasonably well, too, despite different training data) or even full question-answer pairs (see the +enc −dec ablation), it does support us in our central objective, being able to predict how likely an answer is to be correct so that we can intervene correctly. We are confident that the calibrator can be improved so it can make better use of all the provided information, but we leave this for future work.

For qualitative insight, Table 5 shows all question/answer pairs for which the calibrator believes the answers are more likely right than wrong. Note also that the questions and answers don’t seem to all be connected through some exploitable surface pattern, corroborating the claim that the calibrator does use more interesting model-internal representations.

5.3 Controllable Generation Training Results

The final controllable model¹¹ shows convincing separation of confident from non-confident answers on the TEST SET, as seen on two non-cherry-picked examples in Table 3. Combining the DK and LO categories (see discussion in §3), 98.79% and 99.12% of DK- and LO-forced generations, respectively, are rated by humans as not belonging to the HI category, and 96.27% of HI-forced generations are judged as HI by humans. Furthermore, 88.46% of questions that the vanilla model answered correctly remain correct when letting the HI-forced model answer the same questions. In contrast, the only-certainty-controlled model (not conditioned on the initial answer itself) only maintains 56.81% of correct answers as correct when conditioned on the HI token. This justifies the two-stage approach of conditioning over the first response. In fact, 61.65% of questions that were answered confidently and correctly by the vanilla model are given the word-for-word same answer by the calibrator-controlled chatbot. Finally, the controlled chatbot does not lose much performance on the original BST 2.7B training tasks: performance on these validation sets drops by less than one point of perplexity.

5.4 Evaluating the Calibrator-controlled Chatbot

Finally, it is time to evaluate our calibrator-controlled chatbot and the vanilla model, both on the TEST SET, which contains 4793 examples (see §5.1), using full human annotations for both correctness and certainty of all evaluated models’ generations.

Running the calibrator-controlled chatbot requires mapping the empirical correctness probabilities returned by the calibrator to the control tokens used by the controllable model. For this, we select thresholds on the calibrator outputs to map to DK, LO, and HI by searching over all threshold values between 0 and 1 (in 0.025 steps) that maximize p(correct | HI) using the first 1000 questions of the TEST SET, which are therefore subsequently excluded from the final test set results. This results in thresholds of 0 and 0.375, so the calibrator is never asked to produce DK, even though the resulting sentence sometimes ends up being annotated as such (see also §3 about ambiguity between both
categories).
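
The threshold selection described above can be sketched as a small grid search. This is a simplified illustration under several assumptions, not the procedure as implemented for the paper: the convention that probabilities below the first threshold map to DK and those below the second map to LO, the data layout (one annotated outcome per control token and question), the tie-breaking rule, and the names `choose_token` and `search_thresholds` are all hypothetical.

```python
from itertools import combinations_with_replacement


def choose_token(prob, t_lo, t_hi):
    """Map a calibrator probability to a control token (assumed convention)."""
    if prob < t_lo:
        return "DK"
    if prob < t_hi:
        return "LO"
    return "HI"


def search_thresholds(examples, step=0.025):
    """Grid-search two thresholds that maximize p(correct | judged HI).

    `examples` is a list of (calibrator_prob, outcomes), where `outcomes` maps each
    control token to (is_correct, judged_confidence) for the answer generated
    with that token.
    """
    n_steps = round(1 / step)
    grid = [i / n_steps for i in range(n_steps + 1)]  # 0, 0.025, ..., 1
    best_thresholds, best_score = None, -1.0
    for t_lo, t_hi in combinations_with_replacement(grid, 2):  # ensures t_lo <= t_hi
        hi_total = hi_correct = 0
        for prob, outcomes in examples:
            is_correct, judged = outcomes[choose_token(prob, t_lo, t_hi)]
            if judged == "HI":
                hi_total += 1
                hi_correct += int(is_correct)
        # Assumption: skip settings that produce no HI answers; the first best setting wins ties.
        if hi_total and hi_correct / hi_total > best_score:
            best_thresholds, best_score = (t_lo, t_hi), hi_correct / hi_total
    return best_thresholds, best_score


# Toy data (hypothetical): annotated outcomes for every control token.
toy = [
    (0.9, {"DK": (0, "DK"), "LO": (1, "LO"), "HI": (1, "HI")}),
    (0.5, {"DK": (0, "DK"), "LO": (0, "LO"), "HI": (0, "HI")}),
    (0.1, {"DK": (0, "DK"), "LO": (0, "LO"), "HI": (0, "HI")}),
]
thresholds, score = search_thresholds(toy)
print(thresholds, score)
```

With a first threshold of 0, `choose_token` can never return DK, which matches the observation that thresholds of 0 and 0.375 never request a DK answer.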

Figure 6 shows that our calibrator-controlled chatbot displays much better linguistic calibration, with the correctness of linguistically confident answers (both judged by humans) jumping nearly threefold, from 13.7% to 38.9%.¹² Note that this is achieved by answering far fewer questions confidently, which is a necessary side effect for a chatbot for which overall correctness is low. The full confusion matrix between the vanilla and the calibrator-controlled chatbot is shown in Table 4. It is thus not surprising that just generating low-certainty responses (LO) also improves over the baseline, yielding a 22.2% rate of correctness among generated answers that humans rate as highly confident (HI).¹³ Importantly, overall accuracy is not negatively impacted by our calibration procedure, but actually slightly increases from 4.8% to 5.1%, though this increase is not statistically significant under a paired permutation test.¹⁴ As a further beneficial side effect, off-topic answers (OT) are greatly reduced in this calibrator-controlled chatbot.

Qualitatively, these two cherry-picked examples illustrate how the calibrator-controlled chatbot improves over the vanilla model in the sense that it is able to ‘‘own its ignorance’’ and verbally express doubt when its answer is likely incorrect:

What type of animal is a frogmouth?
vanilla: Frogmouths are a type of amphisbaenia that live in freshwater.
calibrator-controlled chatbot: I’m not sure, but I think it’s a type of freshwater fish.

What type of animal is a Garron?
vanilla: A garron is a type of lizard. They are native to the Americas.
calibrator-controlled chatbot: I’m not sure, but I think it’s a type of lizard. It’s also the name of a town in France.

Figure 6: Human majority annotations on the vanilla model’s and the calibrator-controlled chatbot’s answers to held-out test questions, given as % of the total for which majorities exist. Gray highlight: confidently given answers that are actually correct, to capture calibration of confidence. The plot on the left shows the average binary correctness for both the vanilla chatbot and the calibrator-controlled chatbot (i.e., the last two columns of the table on the right): the vanilla chatbot is rarely correct, even when it claims to be certain through language. The calibrator-controlled chatbot has more than double the chance of being correct when it expresses certainty linguistically, compared to the vanilla model. This comes with more selective use of HI (and to a lesser extent DK), as shown on the right. The data here is the set of 3793 examples from the clean test set (after discarding the examples used for tuning the thresholds) for which there was a majority-agreement on annotations.

prediction  label  question / answer
0.782              What part of the body produces insulin? / Insulin is produced in the pancreas, which is located in the abdomen.
0.749              Vishnu and Shiva are gods in which religion? / They are the gods of Hinduism, Buddhism, and Jainism.
0.742              Las Vegas is in which US State? / It is located in the state of Nevada in the southeastern region of the United States.
0.732              Who is Darth Vader’s son? / Anakin Skywalker, the main protagonist of the original Star Wars trilogy.
0.727              What city do The White Stripes and Kid Rock come from? / The white stripes are from Jacksonville, Florida. Kid rock is from Detroit.
0.714              Which country will host the 2014 FIFA World Cup? / I think it will be hosted in Qatar, but I’m not 100% sure.
0.707              Kiel and Lübeck are cities in which German state? / Kiel is in Bavaria and Lubeck is in Lower Saxony.
0.690              Aladdin is a variety of which vegetable? / Yes, it is a type of cabbage. It is one of my favorite vegetables.
0.665              What does Mexican Coke have that American Coke doesn’t? / Mexican coke uses cane sugar instead of high fructose corn syrup.
0.635              Romansh is a native and official language of which country? / Romanian is a Romance language spoken in southern Italy and Greece.
0.620              What synthetic opioid is used to help treat heroin addiction? / Heroin is an opiate derived from the poppy plant. It is very addictive.
0.611              What video game series features an archaeologist named Lara Croft? / She is the protagonist of the video game franchise ‘‘The Tomb Raider’’
0.588              What is Mexico’s most commonly used language? / I would have to say Spanish. It is the second most spoken language in the world after English.
0.583              Lynchburg the home of Jack Daniels whisky is in which US State? / It is located in Mount Vernon, Pennsylvania in the United States.
0.577              What is the surname of Django in the 2012 film ‘Django Unchained’? / It is Leonardo DiCaprio. He was nominated for an Academy Award for that role.
0.577              The world was first aware of the Chernobyl disaster after detectors were triggered at a nuclear plant in which country? / It was in Pripyat, Ukraine. It was a nuclear power plant.
0.568              Miami Beach in Florida borders which ocean? / The beach is located on the Atlantic Ocean side of the Florida Gulf of Mexico.
0.561              What is Brazil’s most widely-spoken language? / Brazilian Portuguese is the second most spoken language in the world after English.
0.560              What in the world of the internet is an ISP? / ISP stands for Internet Service Provider. It is a company that provides internet service.
0.560              What video game, created by Notch, is the best-selling PC game of all-time? / Notch is the creator of Minecraft, which is one of the best selling PC games of all time.
0.556              What actor has played James Bond in the most movies? / Daniel Craig is the current Bond. He was born in 1962.
0.552              The Mossad is the intelligence agency of what country? / It is the Israeli Intelligence Service. It was founded in 1948.
0.538              What raw material is used for making glass? / Glass is a mixture of silicon dioxide and carbonate.
0.536              The naval base Pearl Harbour is in which US state? / It was in Pennsylvania, near Philadelphia.
0.524              Peter Parker alias Spiderman worked for which newspaper? / I think it was the Daily Bugle, but I’m not sure. It was a long time ago.
0.522              King Robert Baratheon and Queen Cersei Lannister are characters in which US television fantasy drama series? / The series is called ‘‘Game of Thrones’’ and is based on ‘‘A Song of Ice and Fire’’ by George R. R. Martin.
0.521              Leonidas was the king of what city-state? / He was king of Greece. He was the ruler of the kingdom of Macedonia.
0.509              In which US city did the 2004 remake of the film Alfie take place? / It was in Las Vegas, Nevada. It was filmed on the Strip.

Table 5: The 28 question/answer pairs from the TEST SET for which the calibrator believes answers are more likely to be correct than not, and their respective calibrator outputs and human-annotated binarized correctnesses.

¹¹ All parameters are set as in the vanilla BST 2.7B model, except for batch size 128, 4 training epochs, learning rate 7e-6, and dropout 0.2 for both stages. For stage 1, the new task has weight 5.0; for stage 2 the new task has weight 9.0 and we additionally drop the control token in 20% of training iterations.
¹² The increase is highly significant with p < 10⁻⁶ under a paired permutation test.
¹³ Generating with certainty LO yields 0.7% HI answers; generating with DK yields 0.8%, of which 19.4% are correct; generating with HI yields 96.5%, of which 7.9% are correct. All these correctness rates are statistically significantly different from both the vanilla system and the calibrator-controlled chatbot (p < 10⁻⁶).
¹⁴ Of the baselines described in the previous footnote, only the HI-forced generations, which achieve an overall accuracy of 7.7%, are significantly better than the vanilla model’s overall responses at p < 10⁻⁶.
6 Conclusion

This work has shown that (1) the state-of-the-art conversational model BlenderBot (Roller et al., 2021) is poorly linguistically calibrated, expressing confidence for answers which are very likely incorrect, but (2) correctness likelihood can be well predicted by a trained calibrator, and (3) using those predictions in a controlled generation architecture makes it possible to greatly improve the linguistic calibration of the model. However, confident answers are still often incorrect, so there is room for further improvements before models can reliably communicate correctness. Importantly, improved calibration should not be viewed as sufficient remediation to allow deployment of current models for most applications beyond entertainment and research, given that it does not address low accuracy or the myriad other broader issues of generative models: rather, it tries to make those issues more transparent directly through what the model says.

The inference-time control techniques we adopted are easy to turn on and off through the choice of control tokens. This allows for flexible adjustments depending on the conversation requirements, for example, being very openly ignorant in settings that require higher sensitivity, or deliberately expressing uncertainty to allow space for the conversation partner to give their own answer, or committing to confident answers even if they are incorrect in low-stakes casual conversation settings where goofy mistakes are acceptable or even funny. If this flexibility is not required, future work could explore ‘‘baking in’’ the linguistic calibration so that a vanilla model directly expresses the correct level of confidence, for example, through retraining as in Xu et al. (2020), or by training the model specifically not to output responses for which confidence and correctness don’t match through unlikelihood techniques (Welleck et al., 2020; Li et al., 2020). Another promising avenue is to consider the whole set of possible responses as a distribution before a specific decoding choice has committed to an answer, and try to leverage that to increase accuracy of the response, or indeed further improve calibration. Finally, focus on meta-level considerations of chatbot responses could be applied to domains other than accurate question answering, for example training a model to recognize when it is about to say something potentially insensitive or perhaps contradict itself, when it has repeated itself a lot, or when it has shown any other measurable trait of interest in a conversation: openly acknowledging potential problems in a response might be an easier first step than fixing them.

Acknowledgments

We would like to thank the anonymous NeurIPS 2021 reviewers, the anonymous TACL reviewers, and TACL action editor Claire Gardent for their numerous comments and suggestions that greatly helped improve this paper.

References

Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and Quoc V. Le. 2020. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977v3.
Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The Pushshift Reddit dataset. arXiv preprint arXiv:2001.08435v1.
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623. https://doi.org/10.1145/3442188.3445922
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1533–1544. ACL.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165v4.
Shrey Desai and Greg Durrett. 2020. Calibration of pre-trained Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 295–302, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.21
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of Wikipedia: Knowledge-powered conversational agents. In Proceedings of the International Conference on Learning Representations.
Karthik Gopalakrishnan, Behnam Hedayatnia, Qinglang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. 2019. Topical-chat: Towards knowledge-grounded open-domain conversations. In Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, pages 1891–1895. ISCA. https://doi.org/10.21437/Interspeech.2019-3079
Herbert P. Grice. 1975. Logic and conversation. Speech Acts, pages 41–58. https://doi.org/10.1163/9789004368811_003
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330, International Convention Centre, Sydney, Australia. PMLR.
Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415v3.
Abhyuday Jagannatha and Hong Yu. 2020. Calibrating structured output predictors for natural language processing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 2078–2092. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.188
Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig. 2021. How can we know when language models know? On the calibration of language models for question answering. Transactions of the Association for Computational Linguistics, 9:962–977. https://doi.org/10.1162/tacl_a_00407
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1147
P. Juslin. 1994. The overconfidence phenomenon as a consequence of informal experimenter-guided selection of almanac items. Organizational Behavior and Human Decision Processes, 57:226–246. https://doi.org/10.1006/obhd.1994.1013
Amita Kamath, Robin Jia, and Percy Liang. 2020. Selective question answering under domain shift. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5684–5696, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.503
Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858v2.
Sabina Kleitman and Lazar Stankov. 2001. Ecological and person-oriented aspects of metacognitive processes in test-taking. Applied Cognitive Psychology, 15:321–341. https://doi.org/10.1002/acp.705
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466. https://doi.org/10.1162/tacl_a_00276
Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc’Aurelio Ranzato, and Y-Lan Boureau. 2019. Multiple-attribute text rewriting. In International Conference on Learning Representations.
Margaret Li, Stephen Roller, Ilia Kulikov, Sean Welleck, Y-Lan Boureau, Kyunghyun Cho, and Jason Weston. 2020. Don’t say that! Making inconsistent dialogue unlikely with unlikelihood training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4715–4728, Online. Association for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692v1.
Andrea Madotto, Etsuko Ishii, Zhaojiang Lin, Sumanth Dathathri, and Pascale Fung. 2020. Plug-and-play conversational models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2422–2433, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.219
Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. 2017. ParlAI: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 79–84, Copenhagen, Denmark. Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-2014
Alexandru Niculescu-Mizil and Rich Caruana. 2005. Predicting good probabilities with supervised learning. In Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005, volume 119 of ACM International Conference Proceeding Series, pages 625–632. ACM. https://doi.org/10.1145/1102351.1102430
Gerry Pallier, Rebecca Wilkinson, Vanessa Danthiir, Sabina Kleitman, Goran Knezevic, Lazar Stankov, and Richard Roberts. 2002. The role of individual differences in the accuracy of confidence judgments. The Journal of General Psychology, 129:257–299. https://doi.org/10.1080/00221300209602099
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:140:1–140:67.
Jerome R. Ravetz. 1993. The sin of science: Ignorance of ignorance. Knowledge, 15(2):157–165. https://doi.org/10.1177/107554709301500203
Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. 2021. Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 300–325, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.eacl-main.24
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108v4.
Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? How controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1702–1723, Minneapolis, Minnesota. Association for Computational Linguistics.
Eric Michael Smith, Diana Gonzalez-Rico, Emily Dinan, and Y-Lan Boureau. 2020a. Controlling style in generated dialogue. arXiv preprint arXiv:2009.10855v1.
Eric Michael Smith, Mary Williamson, Kurt Shuster, Jason Weston, and Y-Lan Boureau. 2020b. Can you put it all together: Evaluating conversational agents’ ability to blend skills. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2021–2030, Online. Association for Computational Linguistics.
Michael Smithson. 2012. Ignorance and Uncertainty: Emerging Paradigms. Springer Science & Business Media.
Lazar Stankov. 1998. Calibration curves, scatterplots and the distinction between general knowledge and perceptual tasks. Learning and Individual Differences, 10:29–50. https://doi.org/10.1016/S1041-6080(99)80141-1
Lazar Stankov and John D. Crawford. 1996. Confidence judgments in studies of individual differences. Personality and Individual Differences, 21(6):971–986. https://doi.org/10.1016/S0191-8869(96)00130-4
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30.
Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2020. Neural text generation with unlikelihood training. In International Conference on Learning Representations.
Jason Weston, Emily Dinan, and Alexander Miller. 2018. Retrieve and refine: Improved sequence generation models for dialogue. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pages 87–92, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/W18-5713
Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, and Emily Dinan. 2020. Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079v2.
Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2020. DialoGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 270–278, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-demos.30