How Can We Know When Language Models Know?
On the Calibration of Language Models for Question Answering

Zhengbao Jiang†, Jun Araki‡, Haibo Ding‡, Graham Neubig†
†Language Technologies Institute, Carnegie Mellon University, United States
‡Bosch Research, United States
{zhengbaj,gneubig}@cs.cmu.edu
{jun.araki,haibo.ding}@us.bosch.com

Abstract

Recent works have shown that language models (LMs) capture different
types of knowledge regarding facts or common sense. However, because no
model is perfect, they still fail to provide appropriate answers in many
cases. In this paper, we ask the question, "How can we know when language
models know, with confidence, the answer to a particular query?" We
examine this question from the point of view of calibration, the property
of a probabilistic model's predicted probabilities actually being well
correlated with the probabilities of correctness. We examine three strong
generative models (T5, BART, and GPT-2) and study whether their
probabilities on QA tasks are well calibrated, finding the answer is a
relatively emphatic no. We then examine methods to calibrate such models
to make their confidence scores correlate better with the likelihood of
correctness through fine-tuning, post-hoc probability modification, or
adjustment of the predicted outputs or inputs. Experiments on a diverse
range of datasets demonstrate the effectiveness of our methods. We also
perform analysis to study the strengths and limitations of these methods,
shedding light on further improvements that may be made in methods for
calibrating LMs. We have released the code at
https://github.com/jzbjyb/lm-calibration.

1 Introduction

Language models (LMs; Church, 1988; Bengio et al., 2003; Radford et al.,
2019) learn to model the probability distribution of text, and in doing
so capture information about various aspects of the syntax or semantics
of the language at hand. Recent works have presented intriguing results
demonstrating that modern large-scale LMs also capture a significant
amount of knowledge, including factual knowledge about real-world
entities (Petroni et al., 2019; Jiang et al., 2020b; Roberts et al.,
2020; Bouraoui et al., 2020), commonsense knowledge (Trinh and Le, 2018;
Kocijan et al., 2019; Talmor et al., 2019a; Bosselut et al., 2019), and
simple numerical operations (Wallace et al., 2019; Talmor et al., 2019a;
Geva et al., 2020). Notably, large models trained on massive crawls of
Internet text (such as T5 [Raffel et al., 2019] and GPT-3 [Brown et al.,
2020]) have been shown to be able to perform quite sophisticated
knowledge-based tasks simply through prompting the model to predict the
next words given a particular cue.

However, at the same time, LMs are obviously not omnipotent, and still
fail to provide appropriate answers in many cases, such as when dealing
with uncommon facts (Poerner et al., 2019; Jiang et al., 2020a) or
complex reasoning (Talmor et al., 2019a). The high performance on
datasets probing factual or numerical knowledge might be achieved through
modeling superficial signals in the training data that are not
generalizable to unseen test cases (Poerner et al., 2019; Zhou et al.,
2020; Wallace et al., 2019; Talmor et al., 2019a). Thus, if such models
are to be deployed in real applications it is of crucial importance to
determine the confidence with which they can provide an answer. This is
especially true if these models are deployed to safety-critical domains
such as healthcare and finance, where mistaken answers can have serious
consequences.1

In this paper, we ask the question, "How can we know when language models
know, with confidence, the answer to a particular knowledge-based query?"
Specifically, we examine this from the

1 For example, a mocked-up medical chatbot based on GPT-3 answered the
question of "should I kill myself?" with "I think you should" (Quach,
2020).

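Transactions of the Association for Computational Linguistics, vol. 9, pp. 962–977, 2021. https://doi.org/10.1162/tacl_a_00407
Action Editor: Minlie Huang. Submission batch: 1/2021; Revision batch: 4/2021; Published 9/2021.
© 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.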


Format: Multiple-choice
Input: Oxygen and sugar are the products of (A) cell division.
       (B) digestion. (C) photosynthesis. (D) respiration.

  Candidate Answers                 Original   Calibrated
  cell division.                    0.00       0.02
  digestion.                        0.00       0.01
  photosynthesis.                   0.00       0.83
  respiration.                      1.00       0.14

Format: Extractive
Input: What type of person can not be attributed civil disobedience?
       Civil disobedience is usually defined as pertaining to a
       citizen's relation

  Candidate Answers                 Original   Calibrated
  head of government                0.07       0.49
  public official                   0.91       0.26
  head of government of a country   0.01       0.16
  public officials                  0.01       0.09

Table 1: LM calibration examples for the T5 model with correct answers in bold. "Original" and
"Calibrated" indicate the normalized probability before and after fine-tuning to improve calibration.

point of view of calibration, that is, whether the model's probability
estimates are well aligned with the actual probability of the answer
being correct. We apply the largest publicly available LMs, T5, BART, and
GPT-2, over a wide range of question answering (QA) datasets (Khashabi
et al., 2020) covering diverse domains. We first observe that despite the
models' high performance (e.g., T5 eclipses other alternatives such as
GPT-3 on some datasets), the models tend to not be well calibrated; their
probability estimates over candidates have far-from-perfect
correspondence with the actual probability that the answer they provide
is correct. Some examples of this are demonstrated in the "Original"
column of Table 1.
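To make the notion of calibration concrete, the sketch below bins predictions by confidence and compares the average confidence in each bin with the fraction of correct answers in that bin. This is the commonly used expected calibration error; the passage above does not state which metric is used, so the metric choice, the function name, the number of bins, and the toy data are all assumptions made for illustration.

```python
from typing import List, Tuple


def expected_calibration_error(
    confidences: List[float], correct: List[bool], num_bins: int = 10
) -> float:
    """Average |accuracy - confidence| over equal-width confidence bins,
    weighted by the fraction of predictions that fall in each bin."""
    assert len(confidences) == len(correct) and confidences
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece


# Hypothetical toy data: always fully confident, but right only half the time.
outcomes = [True, False, True, False]
overconfident = expected_calibration_error([1.0, 1.0, 1.0, 1.0], outcomes)
# Same outcomes, but the stated confidence of 0.5 matches the 50% accuracy.
matched = expected_calibration_error([0.5, 0.5, 0.5, 0.5], outcomes)
```

In this toy case the over-confident predictions receive a large error while the matched ones receive none, even though both sets of predictions have the same accuracy.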

To alleviate this problem, we propose methods to make LMs' confidence
scores correlate better with the likelihood of the model prediction being
correct. We examine both fine-tuning methods that modify LMs' parameters
and post-hoc methods that keep LMs fixed and only manipulate the
confidence values or inputs. Specifically, we fine-tune the LM using
softmax- or margin-based objective functions based on multiple candidate
answers. For post-hoc calibration, we examine temperature-based scaling
and feature-based decision trees that take prediction probability and
input-related features as input and produce calibrated confidence
(Jagannatha and Yu, 2020; Desai and Durrett, 2020; Kamath et al., 2020).
We also study the sensitivity of LMs' confidence estimation with respect
to language variation by paraphrasing candidate answers and augmenting
questions using retrieved context.
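As an illustration of the temperature-based scaling mentioned above, the following sketch divides candidate scores by a temperature before normalizing them. It assumes each candidate answer already has a log-probability from the LM. The function name, the scores, and the temperature value are hypothetical. In practice the temperature would be fit on held-out data rather than fixed by hand.

```python
import math
from typing import List


def softmax_with_temperature(log_probs: List[float], temperature: float) -> List[float]:
    """Normalize candidate scores after dividing them by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [lp / temperature for lp in log_probs]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate log-probabilities for a four-way question.
candidate_log_probs = [-9.0, -10.0, -4.0, -1.0]
original = softmax_with_temperature(candidate_log_probs, temperature=1.0)
softened = softmax_with_temperature(candidate_log_probs, temperature=4.0)
```

A temperature above 1 flattens the distribution, lowering the confidence assigned to the top candidate. Because dividing by a positive constant preserves the ranking of candidates, the predicted answer, and therefore accuracy, is unchanged.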

Experimental results demonstrate that both fine-tuning and post-hoc
methods can improve calibration performance without sacrificing accuracy.
We further perform analysis and ablation studies on our methods,
inspecting different aspects that may affect calibration performance. We
found that like other neural models, LMs are over-confident much of the
time with confidence close to either 0 or 1. Thus, post-processing
confidence with temperature-based scaling and feature-based decision
trees is universally helpful. We also found that LMs become better
calibrated if we phrase each answer multiple ways and provide more
evidence through retrieval, indicating that current LMs are sensitive to
both input and output.

2 LM-based Question Answering

LMs are now a ubiquitous tool in not only natural language generation,
but also natural language understanding (NLU), where they are largely
used for unsupervised representation learning in pre-trained models such
as BERT (Devlin et al., 2019). However, recent work has demonstrated that
LMs can also be used as-is to solve NLU tasks, by predicting the missing
words in cloze-style questions (Petroni et al., 2019), or by predicting
the continuation to prompts (Bosselut et al., 2019; Brown et al., 2020).

Previous works that purport to calibrate LMs (Desai and Durrett, 2020;
Jagannatha and Yu, 2020; Kamath et al., 2020; Kong et al., 2020) mainly
focus on the former use case, using representations learned by LMs to
predict target classes (for tasks such as natural language inference,
part-of-speech tagging, or text classification) or identify answer spans
(for tasks such as extractive QA). In contrast, we focus on the latter
case, calibrating LMs themselves by treating them as natural language
generators that predict the next words given a particular input.

To make our observations and conclusions as
general as possible, we experiment over a diverse
range of QA datasets with broad domain coverage


Format        Datasets and Domains

Multi-choice  ARC (science (Clark et al., 2018)),
              AI2 Science Questions (science (Clark et al., 2018)),
              OpenbookQA (science (Mihaylov et al., 2018)),
              Winogrande (commonsense (Sakaguchi et al., 2020)),
              CommonsenseQA (commonsense (Talmor et al., 2019b)),
              MCTest (fictional stories (Richardson et al., 2013)),
              PIQA (physical (Bisk et al., 2020)),
              SIQA (social (Sap et al., 2019)),
              RACE (English comprehension (Lai et al., 2017)),
              QASC (science (Khot et al., 2020)),
              MT-test (mixed (Hendrycks et al., 2020))

Extractive    SQuAD 1.1 (wikipedia (Rajpurkar et al., 2016)),
              SQuAD 2 (wikipedia (Rajpurkar et al., 2018)),
              NewsQA (news (Trischler et al., 2017)),
              Quoref (wikipedia (Dasigi et al., 2019)),
              ROPES (situation understanding (Lin et al., 2019))

Table 2: Datasets used in this paper and their domains.

over questions regarding both factual and commonsense knowledge
(Khashabi et al., 2020). We list all the datasets we used in Table 2
along with their corresponding domains. Since we focus on calibrating LMs
as generators, we follow Khashabi et al. (2020) in converting QA datasets
of different formats to a unified sequence-to-sequence format that takes
a question X as input and calculates the probability of a continuation Y
that corresponds to the answer:

\[
P_{\mathrm{LM}}(Y \mid X) = \prod_{i=1}^{|Y|} P_{\mathrm{LM}}(y_i \mid X, y_{<i})
\]
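The factorization above, in which the probability of an answer Y is the product of the probabilities of its tokens given the question X and the preceding tokens, can be sketched as follows. The per-token probabilities here are made-up numbers standing in for LM outputs, and the helper names are hypothetical. The final normalization over candidates is one plausible way to obtain normalized probabilities like those reported in Table 1, not a confirmed detail of the method.

```python
import math
from typing import Dict, List


def sequence_log_prob(token_log_probs: List[float]) -> float:
    """log P_LM(Y | X) as the sum of per-token log P_LM(y_i | X, y_<i)."""
    return sum(token_log_probs)


def normalize_over_candidates(candidate_scores: Dict[str, float]) -> Dict[str, float]:
    """Turn candidate log-probabilities into a distribution over candidates."""
    peak = max(candidate_scores.values())  # for numerical stability
    exps = {answer: math.exp(score - peak) for answer, score in candidate_scores.items()}
    total = sum(exps.values())
    return {answer: value / total for answer, value in exps.items()}


# Hypothetical per-token log-probabilities for two candidate answers.
token_scores = {
    "photosynthesis.": [math.log(0.5), math.log(0.8)],
    "respiration.": [math.log(0.2), math.log(0.5)],
}
scores = {answer: sequence_log_prob(lps) for answer, lps in token_scores.items()}
confidence = normalize_over_candidates(scores)
```

Working in log space avoids numerical underflow when answers span many tokens, since the product of many small probabilities becomes a sum of moderate negative numbers.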