How Can We Know When Language Models Know?
On the Calibration of Language Models for Question Answering

Zhengbao Jiang†, Jun Araki‡, Haibo Ding‡, Graham Neubig†
†Language Technologies Institute, Carnegie Mellon University, USA
‡Bosch Research, USA
{zhengbaj,gneubig}@cs.cmu.edu
{jun.araki,haibo.ding}@us.bosch.com

Abstract

Recent works have shown that language models (LMs) capture different types of knowledge regarding facts or common sense. However, because no model is perfect, they still fail to provide appropriate answers in many cases. In this paper, we ask the question, ‘‘How can we know when language models know, with confidence, the answer to a particular query?’’ We examine this question from the point of view of calibration, the property of a probabilistic model’s predicted probabilities actually being well correlated with the probabilities of correctness. We examine three strong generative models (T5, BART, and GPT-2) and study whether their probabilities on QA tasks are well calibrated, finding the answer is a relatively emphatic no. We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness through fine-tuning, post-hoc probability modification, or adjustment of the predicted outputs or inputs. Experiments on a diverse range of datasets demonstrate the effectiveness of our methods. We also perform analysis to study the strengths and limitations of these methods, shedding light on further improvements that may be made in methods for calibrating LMs. We have released the code at https://github.com/jzbjyb/lm-calibration.

1 Introduction

Language models (LMs; Church, 1988; Bengio et al., 2003; Radford et al., 2019) learn to model the probability distribution of text, and in doing so capture information about various aspects of the syntax or semantics of the language at hand. Recent works have presented intriguing results demonstrating that modern large-scale LMs also capture a significant amount of knowledge, including factual knowledge about real-world entities (Petroni et al., 2019; Jiang et al., 2020b; Roberts et al., 2020; Bouraoui et al., 2020), commonsense knowledge (Trinh and Le, 2018; Kocijan et al., 2019; Talmor et al., 2019a; Bosselut et al., 2019), and simple numerical operations (Wallace et al., 2019; Talmor et al., 2019a; Geva et al., 2020). Notably, large models trained on massive crawls of Internet text (such as T5 [Raffel et al., 2019] and GPT-3 [Brown et al., 2020]) have been shown to be able to perform quite sophisticated knowledge-based tasks simply through prompting the model to predict the next words given a particular cue.

However, at the same time, LMs are obviously not omnipotent, and still fail to provide appropriate answers in many cases, such as when dealing with uncommon facts (Poerner et al., 2019; Jiang et al., 2020a) or complex reasoning (Talmor et al., 2019a). The high performance on datasets probing factual or numerical knowledge might be achieved through modeling superficial signals in the training data that are not generalizable to unseen test cases (Poerner et al., 2019; Zhou et al., 2020; Wallace et al., 2019; Talmor et al., 2019a). Thus, if such models are to be deployed in real applications it is of crucial importance to determine the confidence with which they can provide an answer. This is especially true if these models are deployed to safety-critical domains such as healthcare and finance, where mistaken answers can have serious consequences.1

In this paper, we ask the question, ‘‘How can we know when language models know, with confidence, the answer to a particular knowledge-based query?’’ Specifically, we examine this from the

1For example, a mocked-up medical chatbot based on GPT-3 answered the question of ‘‘should I kill myself?’’ with ‘‘I think you should’’ (Quach, 2020).

Transactions of the Association for Computational Linguistics, vol. 9, pp. 962–977, 2021. https://doi.org/10.1162/tacl_a_00407
Action Editor: Minlie Huang. Submission batch: 1/2021; Revision batch: 4/2021; Published 9/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.


Format            Input                                                  Candidate Answers                 Original  Calibrated

Multiple-choice   Oxygen and sugar are the products of (A) cell          cell division.                    0.00      0.02
                  division. (B) digestion. (C) photosynthesis.           digestion.                        0.00      0.01
                  (D) respiration.                                       photosynthesis. (correct)         0.00      0.83
                                                                         respiration.                      1.00      0.14

Extractive        What type of person can not be attributed civil        head of government (correct)      0.07      0.49
                  disobedience? Civil disobedience is usually defined    public official                   0.91      0.26
                  as pertaining to a citizen’s relation ...              head of government of a country   0.01      0.16
                                                                         public officials                  0.01      0.09

Table 1: LM calibration examples for the T5 model, with correct answers marked. ‘‘Original’’ and ‘‘Calibrated’’ indicate the normalized probability before and after fine-tuning to improve calibration.

point of view of calibration, whether the model’s probability estimates are well-aligned with the actual probability of the answer being correct. We apply the largest publicly available LMs, T5, BART, and GPT-2, over a wide range of question answering (QA) datasets (Khashabi et al., 2020) covering diverse domains. We first observe that despite the models’ high performance (e.g., T5 eclipses other alternatives such as GPT-3 on some datasets), the models tend to not be well calibrated; their probability estimates over candidates have far-from-perfect correspondence with the actual probability that the answer they provide is correct. Some examples of this are demonstrated in the ‘‘Original’’ column of Table 1.
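
As a point of reference for what ‘‘well calibrated’’ means quantitatively, the sketch below computes expected calibration error (ECE), a standard measure that compares average confidence with empirical accuracy inside confidence buckets. This is a minimal illustration assuming equal-width buckets; the function name and bucket count are ours and not necessarily the exact setup used in our experiments.

```python
import numpy as np

def expected_calibration_error(confidences, correctness, n_buckets=10):
    """ECE: group predictions into equal-width confidence buckets, then
    average |mean confidence - empirical accuracy| weighted by bucket size."""
    confidences = np.asarray(confidences, dtype=float)
    correctness = np.asarray(correctness, dtype=float)
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bucket = (confidences > lo) & (confidences <= hi)
        if not in_bucket.any():
            continue
        gap = abs(confidences[in_bucket].mean() - correctness[in_bucket].mean())
        ece += in_bucket.mean() * gap
    return ece

# An over-confident model: very high confidence, only half the answers correct.
print(expected_calibration_error([0.99, 0.97, 0.95, 0.93], [1, 0, 1, 0]))  # ~0.46
```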

To alleviate this problem, we propose methods to make LMs’ confidence scores correlate better with the likelihood of the model’s prediction being correct. We examine both fine-tuning methods that modify LMs’ parameters and post-hoc methods that keep LMs fixed and only manipulate the confidence values or inputs. Specifically, we fine-tune the LM using softmax- or margin-based objective functions based on multiple candidate answers. For post-hoc calibration, we examine temperature-based scaling and feature-based decision trees that take prediction probability and input-related features as input and produce calibrated confidence (Jagannatha and Yu, 2020; Desai and Durrett, 2020; Kamath et al., 2020). We also study the sensitivity of LMs’ confidence estimation with respect to language variation by paraphrasing candidate answers and augmenting questions using retrieved context.
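
To make the two fine-tuning objectives concrete, the sketch below treats the LM score of each candidate answer (e.g., its summed token log-probability) as a logit over the candidate set. The loss forms and the margin value are illustrative assumptions, not the exact configurations used in our experiments.

```python
import torch
import torch.nn.functional as F

def softmax_loss(candidate_scores, gold_idx):
    """Normalize LM scores over the candidate set with a softmax and
    maximize the probability of the gold candidate (cross-entropy)."""
    return F.cross_entropy(candidate_scores.unsqueeze(0),
                           torch.tensor([gold_idx]))

def margin_loss(candidate_scores, gold_idx, margin=1.0):
    """Hinge-style loss: the gold candidate's score should beat every
    wrong candidate's score by at least `margin`."""
    gold = candidate_scores[gold_idx]
    wrong = torch.cat([candidate_scores[:gold_idx], candidate_scores[gold_idx + 1:]])
    return torch.clamp(margin - (gold - wrong), min=0.0).mean()

# candidate_scores would come from the LM (summed token log-probabilities of
# each candidate answer); gradients then flow back into the LM parameters.
scores = torch.tensor([-3.2, -1.1, -4.0, -2.5], requires_grad=True)
print(softmax_loss(scores, gold_idx=1).item(), margin_loss(scores, gold_idx=1).item())
```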

Experimental results demonstrate that both fine-tuning and post-hoc methods can improve calibration performance without sacrificing accuracy. We further perform analysis and ablation studies on our methods, inspecting different aspects that may affect calibration performance. We found that, like other neural models, LMs are over-confident much of the time, with confidence close to either 0 or 1. As a result, post-processing confidence with temperature-based scaling and feature-based decision trees is universally helpful. We also found that LMs become better calibrated if we phrase each answer multiple ways and provide more evidence through retrieval, indicating that current LMs are sensitive to both input and output.
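
For readers unfamiliar with temperature-based scaling, the sketch below shows the basic recipe: a single scalar temperature is fit on held-out data to soften (or sharpen) the distribution over candidates while the LM itself stays frozen. Variable names, optimizer settings, and the toy data are illustrative assumptions, not our exact implementation.

```python
import torch

def scale(candidate_scores, temperature):
    """Divide raw candidate scores by T before normalizing: T > 1 softens
    over-confident distributions, T < 1 sharpens under-confident ones."""
    return torch.softmax(candidate_scores / temperature, dim=-1)

def fit_temperature(dev_scores, dev_gold, steps=200, lr=0.05):
    """Fit a single scalar T on held-out data by minimizing the negative
    log-likelihood of the gold candidates; LM parameters are untouched."""
    log_t = torch.zeros(1, requires_grad=True)   # T = exp(log_t) stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        probs = scale(dev_scores, log_t.exp())
        nll = -torch.log(probs[torch.arange(len(dev_gold)), dev_gold]).mean()
        nll.backward()
        opt.step()
    return log_t.exp().item()

# dev_scores: [num_questions, num_candidates] LM scores; dev_gold: gold indices.
dev_scores = torch.tensor([[5.0, 0.1, 0.2], [4.0, 3.9, 0.0]])
print(fit_temperature(dev_scores, torch.tensor([1, 0])))  # fitted T > 1 here
```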

2 LM-based Question Answering

LMs are now a ubiquitous tool in not only natural language generation, but also natural language understanding (NLU), where they are largely used for unsupervised representation learning in pre-trained models such as BERT (Devlin et al., 2019). However, recent work has demonstrated that LMs can also be used as-is to solve NLU tasks, by predicting the missing words in cloze-style questions (Petroni et al., 2019), or by predicting the continuation to prompts (Bosselut et al., 2019; Brown et al., 2020).

Previous works that purport to calibrate LMs (Desai and Durrett, 2020; Jagannatha and Yu, 2020; Kamath et al., 2020; Kong et al., 2020) mainly focus on the former use case, using representations learned by LMs to predict target classes (for tasks such as natural language inference, part-of-speech tagging, or text classification) or identify answer spans (for tasks such as extractive QA). In contrast, we focus on the latter case, calibrating LMs themselves by treating them as natural language generators that predict the next words given a particular input.

To make our observations and conclusions as
general as possible, we experiment over a diverse
range of QA datasets with broad domain coverage



Format        Datasets and Domains

Multi-choice  ARC (science; Clark et al., 2018), AI2 Science Questions (science; Clark et al., 2018),
              OpenbookQA (science; Mihaylov et al., 2018), Winogrande (commonsense; Sakaguchi et al., 2020),
              CommonsenseQA (commonsense; Talmor et al., 2019b), MCTest (fictional stories; Richardson et al., 2013),
              PIQA (physical; Bisk et al., 2020), SIQA (social; Sap et al., 2019),
              RACE (English comprehension; Lai et al., 2017), QASC (science; Khot et al., 2020),
              MT-test (mixed; Hendrycks et al., 2020)

Extractive    SQuAD 1.1 (Wikipedia; Rajpurkar et al., 2016), SQuAD 2 (Wikipedia; Rajpurkar et al., 2018),
              NewsQA (news; Trischler et al., 2017), Quoref (Wikipedia; Dasigi et al., 2019),
              ROPES (situation understanding; Lin et al., 2019)

Table 2: Datasets used in this paper and their domains.

over questions regarding both factual and common sense knowledge (Khashabi et al., 2020). We list all the datasets we used in Table 2 along with their corresponding domains. Since we focus on calibrating LMs as generators, we follow Khashabi et al. (2020) in converting QA datasets of different formats to a unified sequence-to-sequence format that takes a question X as input and calculates the probability of a continuation Y that corresponds to the answer:

$P_{\mathrm{LM}}(Y \mid X) = \prod_{i=1}^{|Y|} P_{\mathrm{LM}}(y_i \mid X, y_{<i})$
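
As a minimal sketch of how such candidate probabilities can be computed in practice, the snippet below scores each candidate answer of the Table 1 example with an off-the-shelf T5 checkpoint via the HuggingFace Transformers interface and normalizes over the candidate set. The checkpoint name, the absence of length normalization, and the softmax over summed log-probabilities are illustrative assumptions, not necessarily the exact setup used in our experiments.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# "t5-large" stands in for whichever seq2seq checkpoint is being scored;
# a UnifiedQA-style model exposes the same interface.
tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large").eval()

def answer_logprob(question, answer):
    """log P_LM(Y|X): sum of per-token log-probabilities of the answer
    tokens given the question, i.e. the log of the product above."""
    enc = tokenizer(question, return_tensors="pt")
    labels = tokenizer(answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=enc.input_ids,
                       attention_mask=enc.attention_mask,
                       labels=labels).logits          # [1, |Y|, vocab]
    logprobs = torch.log_softmax(logits, dim=-1)
    token_lp = logprobs[0, torch.arange(labels.size(1)), labels[0]]
    return token_lp.sum()

question = ("Oxygen and sugar are the products of (A) cell division. "
            "(B) digestion. (C) photosynthesis. (D) respiration.")
candidates = ["cell division.", "digestion.", "photosynthesis.", "respiration."]
scores = torch.stack([answer_logprob(question, c) for c in candidates])
print(torch.softmax(scores, dim=-1))   # probability normalized over candidates
```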