Aligning Faithful Interpretations with their Social Attribution
Alon Jacovi
Bar Ilan University
alonjacovi@gmail.com
Yoav Goldberg
Bar Ilan University and
Allen Institute for AI
yoav.goldberg@gmail.com
Abstract
We find that the requirement of model inter-
pretations to be faithful is vague and incom-
plete. With interpretation by textual highlights
as a case study, we present several failure
cases. Borrowing concepts from social science,
we identify that the problem is a misalignment
between the causal chain of decisions (causal
attribution) and the attribution of human be-
havior to the interpretation (social attribution).
We reformulate faithfulness as an accurate
attribution of causality to the model, and introduce the concept of aligned faithfulness:
faithful causal chains that are aligned with
their expected social behavior. The two steps
of causal attribution and social attribution
together complete the process of explaining
behavior. With this formalization, we charac-
terize various failures of misaligned faithful
highlight interpretations, and propose an al-
ternative causal chain to remedy the issues.
Finally, we implement highlight explanations
of the proposed causal format using contras-
tive explanations.
1 Introduction
When formalizing the desired properties of a
quality interpretation of a model or a decision, the
NLP community has settled on the key property
of faithfulness (Lipton, 2018; Herman, 2017;
Wiegreffe and Pinter, 2019; Jacovi and Goldberg,
2020), or how ‘‘accurately’’ the interpretation
represents the true reasoning process of the model.
A common pattern of achieving faithfulness in
interpretation of neural models is via decompo-
sition of a model into steps and inspecting the
intermediate steps (Doshi-Velez and Kim, 2017,
cognitive chunks). For example, neural modular networks (NMNs; Andreas et al., 2016) first build
an execution graph out of neural building blocks,
and then apply this graph to data. The graph structure is taken to be a faithful interpretation of the model's behavior, as it describes the computation precisely. Similarly, highlight methods (also
called extractive rationales1), decompose a textual
prediction problem into first selecting highlighted
text, and then predicting based on the selected words (select-predict, described in Section 2). The
output of the selection component is taken to be a
faithful interpretation, as we know exactly which
words were selected. Similarly, we know that
words that were not selected do not participate in
the final prediction.
However, Subramanian et al. (2020) call NMN graphs not faithful in cases where there is a discrepancy between a building block's behavior and its name (i.e., expected behavior). Can we
better characterize the requirement of faithful
interpretations and amend this discrepancy?
We take an extensive and critical look at the
formalization of faithfulness and of explanations,
with textual highlights as an example use-case.
In particular, the select-predict formulation for
faithful highlights raises more questions than it
provides answers: We describe a variety of curious
failure cases of such models in Section 4, as well
as experimentally validate that the failure cases
are indeed possible and do occur in practice.
Concretely,
the behavior of the selector and
predictor in these models does not necessarily line up
with expectations of people viewing the highlight.
Current literature in ML and NLP interpretability
fails to provide a theoretical foundation to
characterize these issues (Sections 4, 6.1).
1The term ''rationale'' (Lei et al., 2016) is more commonly used for this format of explanation in NLP, for historical reasons: Highlights were associated with human rationalization of data annotation (Zaidan et al., 2007). We argue against widespread use of this term, as it refers to multiple concepts in NLP and ML (e.g., Zaidan et al., 2007; Bao et al., 2018; DeYoung et al., 2019; Das and Chernova, 2020), and importantly, ''rationalization'' attributes human intent to the highlight selection, which is not necessarily compatible with the model, as we show in this work.
To remedy this, we turn to literature on the
science of social explanations and how they are
utilized and perceived by humans (Section 6):
The social and cognitive sciences find that hu-
man explanations are composed of two, equally
important parts: The attribution of a causal chain
to the decision process (causal attribution), 和
the attribution of social or human-like intent to the
causal chain (social attribution) (Miller, 2019),
where ‘‘human-like intent’’ refers to a system
of beliefs and goals behind or following the
causal process. 例如, ‘‘she drank the water
because she was thirsty.’’2
People may also attribute social intent to models: In the context of NLP, when observing that a
model consistently translates ‘‘doctor’’ with male
morphological features (Stanovsky et al., 2019),
the user may attribute the model with a ‘‘belief’’
that all doctors are male, despite the model lacking
an explicit system of beliefs. Explanations can
influence this social attribution: For example, a
highlight-based explanation may influence the
user to attribute the model with the intent of
‘‘performing a summary before making a deci-
sion’’ or ‘‘attempting to justify a prior decision’’.
Fatally, the second key component of human
explanations—the social attribution of intent—has
been missing from current formalization on the de-
siderata of artificial intelligence explanations. In
Section 7 we define that a faithful interpretation—a causal chain of decisions—is aligned with human expectations if it is adequately constrained by the social behavior attributed to it by human observers.
Armed with this knowledge, we can now ver-
balize the issue behind the ‘‘non-faithfulness’’
perceived by Subramanian et al. (2020) for NMNs:
The inconsistency between component names and
their actual behavior causes a misalignment bet-
ween the causal and social attributions. We can
also characterize the more subtle issue underlying
the failures of the select-predict models described
in Section 4: In Section 8 we argue that for a set
of possible social attributions, the select-predict
formulation fails to guarantee any of them.
In Section 9 we propose an alternative causal
chain for highlights explanations: predict-select-
verify. Predict-select-verify does not suffer from
the issue of misaligned social attribution, as
highlights can only be attributed as evidence to-
wards the predictor's decision. Therefore, predict-
select-verify highlights do not suffer from the
misalignment failures of Section 4, and guarantee
that the explanation method does not reduce the
score of the original model.
Finally, in Section 10 we discuss an implementation of predict-select-verify, namely, designing
the components in the roles predictor and selector.
Designing the selector is non-trivial, as there are
many possible options to select highlights that
evidence the predictor’s decision, and we are only
interested in selecting ones that are meaningful
for the user to understand the decision. We le-
verage observations from cognitive research re-
garding the internal structure of (human-given)
explanations, dictating that explanations must be
contrastive to hold tangible meaning to humans.
We propose a classification predict-select-verify
model that provides contrastive highlights—to
our knowledge, a first in NLP—and qualitatively
exemplify and showcase the solution.
Contributions. We identify shortcomings in
the definitions of faithfulness and plausibility to
characterize what is a useful explanation, and argue
that the social attribution of an interpretation
method must be taken into account. We formalize
‘‘aligned faithfulness’’ as the degree to which the
causal chain is aligned with the social attribution
of intent that humans perceive from it. Based on
the new formalization, we (1) identify issues with
current select-predict models that derive faithful
highlight interpretations, and (2) propose a new
causal chain that addresses these issues, termed
predict-select-verify. Finally, we implement this
chain with contrastive explanations, previously
unexplored in NLP explainability. We make our
code available online.3
2 Highlights as Faithful Interpretations
Highlights, also known as extractive rationales,
are binary masks over a given input that imply
some behavioral interpretation (as an incomplete
description) of a particular model’s decision pro-
cess to arrive at a decision on the input. Given input sequence x ∈ Σ^n and model m : Σ^n → Y, a highlight interpretation h ∈ {0, 1}^n is a binary mask over x that attaches a meaning to m(x), where the portion of x highlighted by h was important to the decision.

2Note that coincidence (lack of intent) is also a part of this system.
3https://github.com/alonjacovi/aligned-highlights.
This functionality of h was interpreted by Lei
et al. (2016) as an implication of a behavioral
process of m(X), where the decision process is a
modular composition of two unfolding stages:
1. Selector component ms : Σ^n → {0, 1}^n selects a binary highlight h over x.

2. Predictor component mp : Σ^n → Y makes a prediction on the input h ⊙ x.

The final prediction of the system at inference is m(x) = mp(ms(x) ⊙ x). We refer to h := ms(x) as the highlight and h ⊙ x as the highlighted text.
What does the term ‘‘faithfulness’’ mean in
this context? A highlight interpretation can be
faithful or unfaithful to a model. Literature accepts
a highlight
interpretation as ‘‘faithful’’ if the
highlighted text was provably the only input to
the predictor.
Implementations. Various methods have been
proposed to train select-predict models. Of note:
Lei et al. (2016) propose to train the selector and
predictor end-to-end via REINFORCE (Williams,
1992), and Bastings et al. (2019) replace REIN-
FORCE with the reparameterization trick (Kingma
and Welling, 2014). Jain et al. (2020) propose
FRESH, where the selector and predictor are
trained separately and sequentially.
3 Use-cases for Highlights
To discuss whether an explanation procedure is
useful as a description of the model’s decision
process, we must first discuss what is considered
useful for the technology.
We refer to the following use-cases:
Dispute: A user may want to dispute a model’s
decision (e.g., in a legal setting). They can do this
by disputing the selector or predictor: by pointing
to some non-selected words, saying: ''the model
wrongly ignored A,’’ or by pointing to selected
words and saying: ‘‘based on this highlighted text,
I would have expected a different outcome.’’
Debug: Highlights allow the developer to
designate model errors into one of two categories:
Did the model focus on the wrong part of the
input, or did the model make the wrong prediction based on the correct part of the input? Each
category implies a different method of alleviating
problem.
Advice: Assuming that the user is unaware of
the ‘‘correct’’ decision, they may (1) elect to trust
the model, and learn from feedback on the part of the input relevant to the decision; or (2) elect to increase or decrease trust in the model, based
on whether the highlight is aligned with the user’s
prior on what the highlight should or should not
include. For example, if the highlight is focused
on punctuation and stop words, whereas the user
believes it should focus on content words.
4 Limitations of Select-Predict
Highlights as (Faithful) Interpretations
We now explore a variety of circumstances in
which select-predict highlight interpretations are
uninformative to the above use-cases. While we
present these failures in context of highlights,
they should be understood generally as symptoms
of mistaken or missing formal perspective of
explanations in machine learning. Specifically,
that faithfulness is insufficient to formalize the
desired properties of interpretations.
4.1 Trojan Explanations
Task information can manifest in the interpretation
in exceedingly unintuitive ways, making it faith-
ful, but functionally useless. We lead with an example, to be followed by a more exact definition.
Leading Example. Consider the following case
of a faithfully highlighted decision process:
1. The selector makes a prediction y, and encodes y in a highlight pattern h. It then returns the highlighted text h ⊙ x.

2. The predictor recovers h (the binary mask vector) from h ⊙ x and decodes y from the mask pattern h, without relying on the text.
The selector may choose to encode the predicted
class by the location of the highlight (beginning vs.
end of the text), or as text with more highlighted
tokens than non-highlighted, and so on. This is
problematic: Although the model appears to make
a decision based on the highlight’s word content,
the functionality of the highlight serves a different
purpose entirely.
Obviously, this highlight ''explains'' nothing of
value to the user. The role of the highlight is
completely misaligned with the role expected by
the user. For example, in the advice use-case,
the highlight may appear random or incompre-
hensible. This will cause the user to lose trust in the prediction, even if the model was making a reasonable and informed decision.
Model                   SST-2   AGNews   IMDB    Ev.Inf.   20News   Elec
Random baseline         50.0    25.0     50.0    33.33     5.0      50.0
Lei et al. (2016)       59.7    41.4     69.11   33.45     9.45     60.75
Bastings et al. (2019)  62.8    42.4     –       54.23     11.11    58.38
FRESH                   52.22   35.35    –       38.88     –        –

Table 1: The performance of an RNN classifier using h alone as input, in comparison to the random baseline. Missing cells denote cases where we were unable to converge training.
This case may seem unnatural and unlikely.
Nevertheless, it is not explicitly avoided by faith-
ful highlights of select-predict models: This ‘‘un-
intentional’’ exploit of the modular process is a
valid trajectory in the training process of current
methods. We verify this by attempting to predict
the model’s decision based on the mask h alone
via another model (Table 1). This experiment
surprisingly succeeds at above-random chance.
Although the result does not ‘‘prove’’ that the
predictor uses this unintuitive signal, it shows that
there is no guarantee that it doesn’t.
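A rough sketch of such a probe follows; the architecture and training details are illustrative assumptions, not necessarily the exact setup behind Table 1. The classifier sees only the binary mask h, never the text, so above-random accuracy means the highlight pattern alone can leak the model's decision.

```python
import torch
import torch.nn as nn

class MaskOnlyProbe(nn.Module):
    """RNN classifier over the binary mask h only (no word content)."""

    def __init__(self, hidden: int = 64, num_classes: int = 2):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, h: torch.Tensor):            # h: (batch, n) binary mask
        _, last = self.rnn(h.unsqueeze(-1).float())
        return self.out(last.squeeze(0))            # predict the original model's decision
```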
Definition (Trojan Explanations). We term
the more general phenomenon demonstrated by
the example a Trojan explanation: The explana-
tion (in our case, h ⊙ x) carries information that is
encoded in ways that are ‘‘unintuitive’’ to the user,
who observes the interpretation as an explanation
of model behavior. This means that
the user
observing the explanation naturally expects the
explanation to convey information in a particular
way,4 which is different from the true mechanism
of the explanation.
The ‘‘unintuitive’’ information encoded in h(西德:5)X
is not limited to h itself, and can be anything that is
useful to predict y and that the user will be unlikely
to easily comprehend. To illustrate, we summarize
cases of Trojans (in highlight
interpretations),
which are reasonably general to multiple tasks:
1. Highlight signal: The label is encoded in the
mask h alone, requiring no information from
the original text it is purported to focus on.
Model                   AGNews   IMDB    20News   Elec
Full text baseline      41.39    51.22   8.91     56.41
Lei et al. (2016)       46.83    57.83   9.66     60.4
Bastings et al. (2019)  47.69    52.46   12.53    57.7
FRESH                   43.29    –       –        –

Table 2: The performance of a classifier using quantities of the following tokens in h ⊙ x: comma, period, dash, escape, ampersand, brackets, and star; as well as the quantity of capital letters and |h ⊙ x|. Missing cells are cases where we were unable to converge training.
2. Arbitrary token mapping: The label is encoded via some mapping from highlighted tokens to labels which is considered arbitrary to the user: For example, commas for one class, and periods for another; the quantity of capital letters; distance between dashes, and so on.
3. The default class: In a classification case,
a class can be predicted by precluding the
ability to predict all other classes and select-
ing it by default. Thus, the selector may
decide that the absence of class features in
itself defines one of the classes.
In Table 2 we experiment with type (2) above:
we predict the decision (via MLP classifier) of a
model from quantities of various characters, such
as commas and dashes, in the highlighted texts
generated by the models.5 We compare to a base-
line of predicting the decisions based on the same
statistics from the full text. Surprisingly, all mod-
els show an increased ability to predict their
decisions on some level compared to the baseline.
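A sketch of the surface-feature extraction follows, based on the feature list given in the Table 2 caption and footnote 5; the tokenization and exact ordering of features are assumptions.

```python
def surface_features(highlighted_tokens):
    """Fixed-length feature vector of punctuation statistics from h ⊙ x."""
    text = " ".join(highlighted_tokens)
    counted = [",", ".", "-", "\\", "&", "(", ")", "*"]     # comma, period, dash, escape,
    feats = [text.count(c) for c in counted]                # ampersand, brackets, star
    feats.append(sum(ch.isupper() for ch in text))          # number of capital letters
    feats.append(len(highlighted_tokens))                   # |h ⊙ x| in tokens
    return feats                                            # fed to an MLP classifier
```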
Conclusion. Trojan explanations are not merely possible, but just as reasonable to the model
as any other option unless countered explicitly.
However, explicit modeling of Trojans is difficult,
as our definition depends on user perception and
contains limitless possible exploits. More critically, our current formal definition of what
constitutes a faithful explanation does not rule
out trojan explanations, and we cannot point to
a property that makes such trojan explanations
undesirable.
4The expectation of a ‘‘particular way’’ is defined by the
attribution of intent, explored later (§7). Precise descriptions
of this expectation depend on the model, task, and user.
4.2 The Dominant Selector
In another failure case, the selector makes an
implicit decision, and proceeds to manipulate the
5We count the following characters for a feature vector
of length 10: comma, period, dash, escape, ampersand, both brackets, quantity of capital letters, and length (by tokens) of
the highlighted text.
Model
(A)
(B)
Text and Highlight
i really don’t have much to say about this book holder, not that it’s just a book holder. it’s a nice one. it does it’s job
. it’s a little too expensive for just a piece of plastic. it’s strong, sturdy, and it’s big enough, even for those massive
heavy textbooks, like the calculus ones. 虽然, i would not recommend putting a dictionary or reference that’s
like 6’’ thick (even though it still may hold). it’s got little clamps at the bottom to prevent the page from flipping all
over the place, although those tend to fall off when you move them. but that’s no big deal. just put them back on.
this book holder is kind of big, and i would not put it on a small desk in the middle of a classroom, but it’s not too
big. you should be able to put it almost anywhere when studying on your own time.
i really don’t have much to say about this book holder, not that it’s just a book holder. it’s a nice one. it does it’s job .
it’s a little too expensive for just a piece of plastic. it’s strong, sturdy, and it’s big enough, even for those massive
heavy textbooks, like the calculus ones. 虽然, i would not recommend putting a dictionary or reference that’s like
6’’ thick (even though it still may hold). it’s got little clamps at the bottom to prevent the page from flipping all over
the place, although those tend to fall off when you move them. but that’s no big deal.
just put them back on. this
book holder is kind of big , and i would not put it on a small desk in the middle of a classroom, but it’s not too big.
you should be able to put it almost anywhere when studying on your own time.
Prediction
Positive
Positive
Table 3: Highlights faithfully attributed to two fictional select-predict models on an elaborate Amazon
Reviews sentiment classification example. Although highlight (A) is easier to understand, it is also far
less useful, as the selector clearly made hidden decisions.
predictor towards this decision (without neces-
sarily manifesting as a ‘‘Trojan’’). This means
that the selector can dictate the decision with
a highlight that is detached from the selector’s
inner reasoning process.
Whereas in the case of Trojan explanations the
highlight’s explanatory power is misunderstood
by the user (but nevertheless exists), in this
failure case, the information in the highlight is
unproductive as an explanation altogether.
Suppose that the selector has made some deci-
sion based on some span A in the input, while pro-
ducing span B to pass to the predictor—confident
that the predictor will make the same prediction
on span B as the selector did on span A. 虽然
span B may seem reasonable to human observers,
it is a ‘‘malicious’’ manipulation of the predictor.
The dominant selector can realistically manifest
when span A is more informative to a decision
than span B, but the selector was incentivized, 为了
some reason, to prefer producing span B over span
A. This is made possible because, while span B
may not be a good predictor for the decision, 它
can become a good predictor conditioned on the
existence of span A in the input. 所以, as far
as the predictor is concerned, 概率
of the label conditioned on span B is as high as
the true probability of the label conditioned on
span A. We demonstrate with examples:
Example 1. Consider the case of the binary sentiment analysis task, where the model predicts the polarity of a particular snippet of text. Given this fictional movie review:

''Interesting movie about the history of IranA, only disappointedB that it's so short.''

Assume a select-predict model where the
selector was trained to mimic human-provided
rationales (DeYoung et al., 2019), and the predic-
tor made a (mistaken) negative sentiment classifi-
cation. Assume that Iran (A) is highly correlated
with negative sentiment, more-so than disap-
pointed (B)—as ''not disappointed'' and such
are also common. Changing the word ‘‘Iran’’ to
‘‘Hawaii’’, 例如, will change the prediction
of the model from negative to positive. 然而,
this correlation may appear controversial or uneth-
ical, and thus humans tend to avoid rationalizing
with it explicitly. The selector will be incentivized
to make a negative prediction because of Iran
while passing disappointed to the predictor.
Because the choice of span B is conditioned on
the choice of span A (meaning that the selector
will choose disappointed only if it had a priori
decided on the negative class thanks to Iran), span
B is just as informative to the predictor as span A
is in predicting the negative label.
This example is problematic not only due to the
‘‘interpretable’’ model behaving unethically, 但
due to the inherent incentive of the model to lie
and pretend it had made an innocent mistake
of overfitting to the word ‘‘disappointed’’.
Example 2. Assume that two fictional select-
predict models attempt to classify a complex,
mixed-polarity review of a product. Table 3 des-
cribes two fictional highlights faithfully attributed
to these two models, on an example selected from
the Amazon Reviews dataset.
The models make the same decision, yet their
implied reasoning process is wildly different,
thanks to the different highlight interpretations:
模型 (A)’s selector made some decision and
selected a word, ‘‘nice’’, which trivially supports
that decision. The predictor, which can only ob-
serve this word, simply does as it is told. Comparatively, the selector of model (B) performed a
very different job: as a summarizer. The predictor
then made an informed decision based on this
summary. How the predictor made its decision
is unclear, but the division of roles in (B) is sig-
nificantly easier to comprehend—since the user
expects the predictor to make the decision based
on the highlight.
This has direct practical implications: In the
dispute use-case, given a dispute claim such as
‘‘the sturdiness of the product was not important
to the decision’’, the claim appears impossible to
verify in (A). The true decision may have been
influenced by words which were not highlighted.
The claim appears to be safer in (B). But why is
this the case?
4.3 Loss of Performance
It is common for select-predict models to perform
worse on a given task in comparison to models
that classify the full text in ‘‘one’’ opaque step
(Table 4). Is this phenomenon a reasonable ne-
cessity of interpretability? Naturally, humans are
able to provide highlights of decisions without
any loss of accuracy. Furthermore, while interpre-
tability may sometimes be prioritized over state-
of-the-art performance, there are also cases that
will disallow the implementation of artificial
models unless they are strong and interpretable.6
We can say that there is some expectation for
whether models can or cannot surrender perfor-
mance in order to explain themselves. This expec-
tation may manifest in one way or the other for
a given interaction of explanation. And regardless
of what this expectation may be in this scenario,
select-predict models do follow the former (loss
of performance exists) and human rationalization
follows the latter (loss of performance does not
存在), such that there is a clear mismatch between
the two. How can we formalize if, or whether, the
6For example, in the case of a doctor or patient seeking life-saving advice—it is difficult to quantify a trade-off between performance and explanation ability.
behavior of select-predict models is reasonable?
What is the differentiating factor between the two
scenarios?
5 Explanatory Power of Highlights
We have established three failure cases of select-
predict highlights: Trojan explanations (§4.1)
cause a misinterpretation of the highlight’s func-
tionality, and in dominant selectors (§4.2), 这
highlight does not convey any useful information.
Finally, loss of performance (§4.3) shows an inherent, unexplained mismatch between the behavior
of select-predict explainers and human explainers.
All of these cases stem from a shared failure
in formally characterizing the information to be
conveyed to the user. For example, Trojan expla-
nations are a symptom of the selector and predictor
communicating through the highlight interface in
an ‘‘unintended’’ manner; dominant selectors are
a symptom of the selector making the highlight
decision in an ‘‘unintended’’ manner, as well—but
this is entirely due to the fact that we did not define
what is intended, to begin with. Further, this is a
general failure of interpretability, not restricted to
select-predict highlight interpretations.
6 On Faithfulness, Plausibility, and Explainability from the Science of Human Explanations
The mathematical foundations of machine learn-
ing and natural language processing are insuffi-
cient to tackle the underlying issue behind the
symptoms described in Section 4. In fact, formal-
izing the problem itself is difficult. What enables
a faithful explanation to be ‘‘understood’’ as ac-
curate to the model? And what causes an expla-
nation to be perceived as a Trojan?
In this section, we attempt to better formalize
this problem on a vast foundation of social, psy-
chological and cognitive literature about human
explanations.7
6.1 Plausibility is not the Answer
Plausibility8 (or persuasiveness) is the property
of an interpretation being convincing towards
the model prediction, regardless of whether the
model was correct or whether the interpretation is
7Refer to Miller (2019) for a substantial survey in this
area, which was especially motivating to us.
8Refer to the Appendix for a glossary of relevant ter-
minology from the human explanation sciences.
Model                   SST-2   SST-3   SST-5   AG News   IMDB    Ev.Inf.   MultiRC   Movies   Beer
Lei et al. (2016)       22.65   7.09    9.85    33.33     22.23   36.59     31.43     160.0    13.64
Bastings et al. (2019)  3.31    0       2.97    199.02    12.63   85.19     75.0      20.0     –
FRESH                   90.0    17.82   13.45   50.0      14.66   9.76      0.0       37.93    –

Table 4: The percentage increase in error of selector-predictor highlight methods compared to an equivalent architecture model which was trained to classify complete text. We report the numbers reported in previous work whenever possible (italics means our results). Architectures are not necessarily consistent across the table, thus they do not imply performance superiority of any method. The highlight lengths chosen for each experiment were chosen with precedence whenever possible, and otherwise chosen as 20% following Jain et al. (2020) precedence.
faithful. It is inspired by human-provided expla-
nations as post hoc stories generated to plausibly
justify our actions (Rudin, 2019). Plausibility is of-
ten quantified by the degree that the model’s high-
lights resemble gold-annotated highlights given
by humans (Bastings et al., 2019; Chang et al.,
2020) or by querying for the feedback of people
directly (Jain et al., 2020).
Following the failure cases in Section 4, one may conclude that plausibility is a desirable, or
even necessary, condition for a good interpreta-
tion: After all, Trojan explanations are by default
implausible. We strongly argue this is not the case.
Plausibility should be viewed as an incentive of
the explainer, and not as a property of the ex-
planation: Human explanations can be catego-
rized by utility across multiple axes (Miller, 2019),
among them are (1) learning a better internal model for future decisions and calculations
(Lombrozo, 2006; Williams et al., 2013); (2)
examination to verify the explainer has a correct
internal prediction model; (3) teaching9 to modify
the internal model of the explainer towards a more
correct one (can be seen as the opposite end of
(1)); (4) assignment of blame to a component of the
internal model; and finally, (5) justification and persuasion. These goals can be trivially mapped
to our case, where the explainer is artificial.
Critically, goal (5) of justification and persua-
sion by the explainer may not necessarily be the
goal of the explainee. Indeed, in the case of AI
explainability, it is not a goal of the explainee to
be persuaded that the decision is correct (even
when it is), but to understand the decision process.
If plausibility is a goal of the artificial model, this
perspective outlines a game theoretic mismatch of
incentives between the two players. And specif-
ically in cases where the model is incorrect, it is
interpreted as the difference between an innocent
mistake and an intentional lie—of course, lying is
considered more unethical. Thus, we con-
clude that modeling and pursuing plausibility
in AI explanations is an ethical issue.
The failures discussed above do not stem from
how (un)convincing the interpretation is, but from how well
the user understands the reasoning
process of the model. If the user is able to com-
prehend the steps that the model has taken towards
its decision, then the user will be the one to decide
whether these steps are plausible or not, based on
how closely they fit the user’s prior knowledge
on whatever correct steps should be taken—
regardless of whether the user knows the correct
answer or whether the model is correct.
6.2 The Composition of Explanations
Miller (2019) describes human explanations of
behavior as a social interaction of knowledge
transfer between the explainer and the explainee,
and thus they are contextual, and can be perceived
differently depending on this context. Two central
pillars of the explanation are causal attribution
—the attribution of a causal chain10 of events to
the behavior—and social attribution—the attri-
bution of intent to others (Heider et al., 1958).
Causal attribution describes faithfulness: We
note a stark parallel between causal attribution
and faithfulness—for example, the select-predict
composition of modules defines an unfolding
causal chain where the selector hides portions of
input, causing the predictor to make a decision based on the remaining portions. Indeed, recent
9Although (1) and (3) are considered one-and-the-same
in the social sciences, we disentangle them as that is only the
case when the explainer and explainee are both human.
10See Hilton et al. (2005) for a breakdown of types of
causal chains; we focus on unfolding chains in this work, but
others may be relevant as well.
work has connected an accurate (faithful) attri-
bution of causality with increased explainability
(Feder et al., 2020; Madumal et al., 2020).
Furthermore, social attribution is missing:
Heider and Simmel (1944) describe an experi-
ment where participants attribute human concepts
of emotion, intentionality, and behavior to ani-
mated shapes. Clearly,
the same phenomenon
persists when humans attempt to understand the
predictions of artificial models: We naturally
attribute social intent to artificial decisions.
Can models be constrained to adhere to this
attribution? Informally, prior work on highlights
has considered such factors before. Lei et al.
(2016) describe desiderata for highlights as being
‘‘short and consecutive’’, and Jain et al. (2020)
interpreted ‘‘short’’ as ‘‘around the same length
as that of human-annotated highlights’’. We assert
that
the nature of these claims is an attempt
to constrain highlights to the social behavior
implicitly attributed to them by human observers
in the select-predict paradigm (discussed later).
#   Claim
1   The marked text is supportive of the decision.
2   The marked text is selected after making the decision.
3   The marked text is sufficient for making the decision.
4   The marked text is selected irrespective of the decision.
5   The marked text is selected prior to making the decision.
6   The marked text includes all the information that informed the decision.
7   The marking pattern alone is sufficient for making the decision by the predictor.
8   The marked text provides no explanation whatsoever.

Table 5: A list of claims that attribute causality or social intent to the highlight selection.
7 (Socially) Aligned Faithfulness
Unlike human explainers, artificial explainers
can exhibit a misalignment between the causal
chain behind a decision and the social attribution
attributed to it. This is because the artificial deci-
sion process may not resemble human behavior.
By presenting to the user the causal pipeline of
decisions in the model’s decision process as an
interpretation of this process, the user naturally
conjures social intent behind this pipeline. In
order to be considered comprehensible to the user,
the attributed social intent must match the actual
behavior of the model. This can be formalized
as a set of constraints on the possible range of
decisions at each step of the causal chain.
We claim that this problem is the root cause
behind the symptoms in Section 4. Here we de-
fine the general problem independently from the
narrative of highlight interpretations.
Definition. We say that an interpretation method
is faithful if it accurately describes causal infor-
mation about the decision process of the decision.
We say that the faithful method is human-aligned
(short for ‘‘aligned with human expectations of
social intent’’) if the model and method adhere to
the social attribution of intent by human observers.
Conversely, ''misaligned'' interpretations are interpretations whose mechanism of conveying causal information is different from the mechanism that the user utilizes to glean causal information from the interpretation, where the mechanism is defined by the user's social attribution of intent towards the model. Furthermore, we claim that this attribution (expected intent) is heavily dependent on the model's task and use-case: Different use cases may call for different alignments.

8 Alignment of Select-Predict Highlight Interpretations
In contrast to the human-provided explanations,
in the ML setup, our situation is unique in that
we have control over the causal chain but not the
social attribution. Therefore, the social attribution
must lead the design of the causal chain. In other words, we argue that we must first identify the
behavior expected of the decision process, and
then constrain the process around it.
8.1 Attribution of Highlight Interpretations
In order to understand what ‘‘went wrong’’ in the
failure cases above, we need to understand what
are possible expectations—potential instances of
social attribution—to the ‘‘rationalizing’’ select-
predict models. Table 5 outlines possible claims
that could be attributed to a highlight explanation.
Claims 1–6 are claims that could reasonably be
attributed to highlights, while claims 7 and 8 are
not likely to manifest.
These claims can be packaged as two high-level
‘‘behaviors’’:
Summarizing (3–6), where the highlight serves as
an extractive summary of the most important and
useful parts of the complete text. The highlight is
merely considered a compression of the text, and
sufficient information to make informed decisions
in a different context, towards some concrete goal.
It is not selected with an answer in mind, but in
anticipation that an answer will be derived in the
未来, for a question that has not been asked yet.
And evidencing (1–3), in which the highlight
serves as supporting evidence towards a prior
decision that was not necessarily restricted to the
highlighted text.
Claims 7–8 are representative of the examples
of failure cases discussed in Section 4.
8.2 Issues with Select-Predict
We argue that select-predict is inherently mislead-
ing. Although claims 1–6 are plausible attribu-
tions to select-predict highlights, none of them
can be guaranteed by a select-predict system,
in particular for systems in which the selector
and predictor are exposed to the end task during
训练.
If the select-predict system acts ‘‘as intended’’,
selection happens before prediction, which is incompatible with claims 1–2. However, as we
do not have control over the selector component,
it cannot be guaranteed that the selector will not
perform an implicit decision prior to its selection,
and once a selector makes an implicit decision,
the selected text becomes disconnected from the
explanation. 例如, the selector decided on
class A, and chose span B because it ‘‘knows’’
this span will cause the predictor to predict class
A (see Section 4.2).
In other words, the advertised select-predict
chain may implicitly become a ‘‘predict-select-
predict’’ chain. The first and hidden prediction
step makes the final prediction step disconnected
from the cause of the model's decision, because the
second prediction is conditioned on the first one.
This invalidates attributions 3–6. It also allows for
7–8.
Select-predict models are closest to the char-
acteristics of highlights selected as summaries
by humans—therefore they can theoretically be
aligned with summary attribution if the selector
is designed as a truly summarizing component,
and has no access to the end-task. This is hard to
achieve, and no current model has this property.
The issues of Section 4 are direct results of the
above conflation of interests: Trojan highlights
and dominant selectors result from a selector that
makes hidden and unintended decisions, so they
serve as neither summary nor evidence towards
the predictor’s decision. Loss of performance is
due to the selector acting as an imperfect summa-
rizer—whether summary is relevant to the task
to begin with, or not (as is the case in agreement
classification, or natural language inference).
9 Predict-Select-Verify
We propose the predict-select-verify causal chain
as a solution that can be constrained to provide
highlights as evidence (i.e., guarantee claims 1–3).
This framework solves the misalignment problem
by allowing the derivation of faithful highlights
aligned with evidencing social attribution.
The decision pipeline is as follows:
1. The predictor mp makes a prediction ˆy := mp(x) on the full text.

2. The selector ms selects h := ms(x) such that mp(h ⊙ x) = ˆy.
In this framework, the selector provides evi-
dence that is verified to be useful to the predictor
towards a particular decision. Importantly, the final decision has been made on the full text, and
the selector is constrained to provide a highlight
that adheres to this exact decision. The selector
does not purport to provide a highlight which is
comprehensive of all evidence considered by the
predictor, but it provides a guarantee that the
highlighted text is supportive of the decision.
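A minimal sketch of this chain follows, assuming a generic predictor interface over token lists and an unspecified candidate-highlight generator; it is illustrative, not the exact implementation.

```python
def predict_select_verify(tokens, predictor, candidate_highlights, mask_token="[MASK]"):
    """Predict on the full text, then select a highlight whose prediction is verified to match."""
    y_hat = predictor(tokens)                              # 1. predict on the full text
    for h in candidate_highlights(tokens):                 # 2. select a candidate highlight
        highlighted = [t if keep else mask_token for t, keep in zip(tokens, h)]
        if predictor(highlighted) == y_hat:                # 3. verify the same decision
            return y_hat, h                                # h evidences the prior decision
    return y_hat, None                                     # decision stands; no verified highlight found
```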
Causal Attribution. The selector highlights
are provably faithful to the predict-select-verify
chain of actions. They can be said to be faith-
ful by construction (Jain et al., 2020), similarly
to how select-predict highlights are considered
faithful—the models undergo the precise chain of
actions that is attributed to their highlights.
Social Attribution. The term ‘‘rationalization’’
fits the current causal chain, unlike in select-
predict, and so there is no misalignment: The
derived highlights adhere to the properties of
highlights as evidence described in Section 8.1.
The highlight selection is made under constraints
that the highlight serve the predictor’s prior deci-
sion, which is not caused by the highlighted text.
The constraints are then verified at the verify step.
Solving the Failure Cases (§4). As a natural but
important by-product result of the above, predict-
select-verify addresses the failures of Section 4:
Trojan highlights and dominant selectors are im-
possible, as the selector is constrained to only pro-
vide ‘‘retroactive’’ selections towards a specific
a priori decided prediction. The selector cannot
cause the decision, because it was made without
its intervention. Finally, the highlights inherently cannot cause loss of performance, as they
merely support a decision which was made based
on the full text.
10 Constructing a Predict-Select-Verify
Model with Contrastive Explanations
In order to design a model adhering to the predict-
select-verify chain, we require solutions for the
predictor and for the selector.
The predictor is constrained to be able to accept
both full-text inputs and highlighted inputs. 为了
this reason, we use masked language modeling
(MLM) 型号, such as BERT (Devlin et al.,
fine-tuned on the downstream task. The
MLM pre-training is performed by recovering
partially masked text, which conveniently suits
our needs. We additionally provide randomly
highlighted inputs to the model during fine-tuning.
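A sketch of how highlighted inputs can be fed to such an MLM predictor is shown below; the fine-tuning loop and the random-highlight augmentation are omitted, and the helper itself is an assumption rather than the exact implementation.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def apply_highlight(tokens, h):
    """Replace non-highlighted tokens with the mask token, keeping positions intact,
    so the same fine-tuned MLM can score both full text and highlighted text."""
    return [t if keep else tokenizer.mask_token for t, keep in zip(tokens, h)]
```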
The selector is constrained to select highlights
for which the predictor made the same decision as
it did on the full text. 然而, there are likely
many possible choices that the selector may make
under these constraints, as there are many possible
highlights that all result in the same decision
by the predictor. We wish for the selector to
select meaningful evidence to the predictor’s
decision.11 What
is meaningful evidence? To answer this question, we again refer to cognitive
science on necessary attributes of explanations
that are easy to comprehend by humans. We stress
that selecting meaningful evidence is critical for
predict-select-verify to be useful.
10.1 Contrastive Explanations
An especially relevant observation in the social
science literature is of contrastive explanations
(Miller, 2019, 2020), following the notion that the
question ‘‘why P ?’’ is followed by an addendum:
‘‘why P , rather than Q?’’ (Hilton, 1988). We refer
to P as the fact and Q as the foil (Lipton, 1990).
The concrete valuation in the community is that
in the vast majority of cases, the cognitive burden
11例如, the word ‘‘nice’’ in Table 3a is likely not
useful supporting evidence, since it is a rather trivial claim,
even in a predict-select-verify setup.
of a ‘‘complete’’ explanation, 那是, where Q
is P , is too great, and thus Q is selected as a
subset of all possible foils (Hilton and Slugoski,
1986; Hesslow, 1988), and often not explicitly,
but implicitly derived from context.
例如, ‘‘Elmo drank the water because
he was thirsty,’’ explains the fact ‘‘Elmo drank
water’’ without mentioning a foil. But while this
explanation is acceptable if the foil is ‘‘not drink-
ing’’, it is not acceptable if the foil is ‘‘drinking
tea’’: ‘‘Elmo drank the water (rather than the tea)
because he was thirsty.’’ Similarly, ‘‘Elmo drank
the water because he hates tea’’ only answers the
latter foil. The foil is implicit in both cases, but
nevertheless it is not P , but only a subset.
In classification tasks, the implication is that an
interpretation of a prediction of a specific class is
hard to understand, and should be contextualized
by the preference of the class over another—and
the selection of the foil (the non-predicted class)
is non-trivial, and a subject of ongoing discussion
even in human explanations literature.
Contrastive explanations have many implica-
tions for explanations in AI as a vehicle for expla-
nations that are easy to understand. Although there
is a modest body of work on contrastive expla-
nations in machine learning (Dhurandhar et al.,
2018; Chakraborti et al., 2019; Chen et al., 2020),
To our knowledge, the NLP community seldom
discusses this format.
10.2 Contrastive Highlights
An explanation in a classification setting should not only address the fact (predicted class), but do so against a foil (some other class).12 Given classes c and ˆc, where mp(x) = c, we will derive a
contrast explanation towards the question: ‘‘why
did you choose c, rather than ˆc?’’.
We assume a scenario where, having observed
c, the user is aware of some highlight h which
should serve, they believe, as evidence for class
ˆc. In other words, we assume the user believes a pipeline where mp(x) = ˆc and ms(x) = h is
reasonable.
12Selecting the foil, or selecting what to explain, is a difficult and interesting problem even in philosophical literature (Hesslow, 1988; Mcgill and Klein, 1993; Chin-Parker and Cantelon, 2017). In the classification setting, it is relatively simple, as we may request the foil (class) from the user, or provide separate contrastive explanations for each foil.
Procedure | Text and Highlight | Label y | Prediction mp(x) | Foil Prediction mp(h ⊙ x) | Contrast Prediction mp(hc ⊙ x)
Manual
Ohio Sues Best Buy, Alleging Used Sales (AP): AP – Ohio authorities
sued Best Buy Co. Inc. on Thursday, alleging the electronics retailer
Business Sci/Tech
Business
Sci/Tech
Manual
Manual
Manual
Manual
engaged in unfair and deceptive business practices .
HK Disneyland Theme Park to Open in September: Hong Kong’s
Disneyland theme park will open on Sept. 12, 2005 and become the
driving force for growth in the city’s tourism industry , Hong Kong’s
government and Walt Disney Co.
Poor? Who’s poor? Poverty is down: The proportion of people living
on less than $1 a day decreased from 40 到 21 percent of the global population between 1981 和 2001, says the World Bank’s latest annual report. Poor? Who’s poor? Poverty is down: The proportion of people living on less than $1 a day decreased from 40 到 21 的百分比
global population between 1981 和 2001, says the World Bank’s
latest annual report.
Poor? Who’s poor? Poverty is down: The proportion of people
living on less than $1 a day decreased from 40 到 21 的百分比
global population between 1981 和
2001, says the World Bank’s latest annual report.
Business World
Business
World
World Business
World
Business
World Business
World
Business
World Business
World
Business
Automatic Siemens Says Cellphone Flaw May Hurt Users and Its Profit: Siemens,
Business Sci/Tech
Business
Sci/Tech
the world’s fourth-largest maker of mobile phones, said Friday that a
software flaw that can create a piercing ring in its newest phone models
might hurt earnings in its handset division.
Automatic Siemens Says Cellphone Flaw May Hurt Users and Its Profit: Siemens
Business Sci/Tech
Business
Sci/Tech
the world’s fourth-largest maker of mobile phones, said Friday that a
software flaw that can create a piercing ring in its newest phone models
might hurt earnings in its handset division.
Automatic Siemens Says Cellphone Flaw May Hurt Users and Its Profit: Siemens
Business Sci/Tech
Business
Sci/Tech
the world’s fourth-largest maker of mobile phones, said Friday that a
software flaw that can create a piercing ring in its newest phone models
might hurt earnings in its handset division.
Automatic Siemens Says Cellphone Flaw May Hurt Users and Its Profit: Siemens,
Business Sci/Tech
Business
Sci/Tech
the world’s fourth-largest maker of mobile phones, said Friday that a
software flaw that can create a piercing ring in its newest phone models
might hurt earnings in its handset division.
Table 6: Examples of contrastive highlights (§10) of instances from the AG News corpus. The model
used for mp is fine-tuned bert-base-cased. The foil highlight h is in standard yellow ; the contrastive
delta hΔ is in bold yellow ; and hc := h + hΔ. All examples are cases of model errors, and the foil was
chosen as the gold label.
If mp(h ⊙ x) ≠ ˆc, then the user is made aware that the predictor disagrees that h serves as evidence for ˆc. Otherwise, mp(h ⊙ x) = ˆc. We define:

hc := argmin |h + hΔ|   s.t.   |hΔ| > 0   ∧   mp((h + hΔ) ⊙ x) = c.

hc is the minimal highlight containing h such that mp(hc ⊙ x) = c. Intuitively, the claim by the model is as such: ''I consider hΔ as a sufficient change from h (evidence to ˆc) to hc so that it is evidence towards c.''

The final manual procedure is, given a model mp and input x:

1. The user observes mp(x) and chooses a relevant foil ˆc ≠ mp(x).

2. The user chooses a highlight h which they believe supports ˆc.

3. If mp(h ⊙ x) ≠ ˆc, the shortest hc is derived such that h ⊂ hc and mp(hc ⊙ x) = m(x) by brute-force search.
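A sketch of the brute-force search in step 3 is given below, assuming a generic predictor interface over token lists; it grows the user's highlight h by the smallest hΔ until the predictor, run on hc = h + hΔ, returns its original full-text decision c.

```python
from itertools import combinations

def minimal_contrast_highlight(tokens, h, predictor, c, mask_token="[MASK]"):
    """Find the smallest hΔ such that the predictor on hc = h + hΔ outputs c."""
    rest = [i for i, keep in enumerate(h) if not keep]      # positions outside h
    for k in range(1, len(rest) + 1):                       # smallest |hΔ| first
        for delta in combinations(rest, k):
            hc = [keep or (i in delta) for i, keep in enumerate(h)]
            masked = [t if keep else mask_token for t, keep in zip(tokens, hc)]
            if predictor(masked) == c:
                return hc, list(delta)                      # minimal hc and its hΔ
    return None, None                                       # no contrastive highlight found
```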
Automation. Although we believe the above procedure is most useful and informative, we
acknowledge the need for automation of it to ease
the explanation process. Steps 1 和 2 involve
human input which can be automated: In place
of step 1, we may simply repeat the procedure
separately for each of all foils (and if there are too
many to display, select them with some priority
and ability to switch between them after-the-fact);
and in place of step 2, we may heuristically derive
candidates for h—e.g., the longest highlight for
which the model predicts the foil:
h := argmax_{h : mp(h ⊙ x) = ˆc} |h|.
The automatic procedure is then, for each class
ˆc ≠ mp(x):
1. Candidates for h are derived, e.g., the longest highlight h for which mp(h ⊙ x) = ˆc.
2. The shortest hc is derived such that h ⊂ hc
and mp(hc ⊙ x) = mp(x).
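A sketch of the heuristic in step 1 follows; the candidate-highlight generator is an assumption (exhaustive enumeration is exponential, so a practical generator would be restricted, e.g., greedily dropping tokens from the full text).

```python
def longest_foil_highlight(tokens, predictor, foil, candidate_highlights, mask_token="[MASK]"):
    """Among candidate highlights, keep the longest one on which the predictor outputs the foil."""
    best = None
    for h in candidate_highlights(tokens):
        masked = [t if keep else mask_token for t, keep in zip(tokens, h)]
        if predictor(masked) == foil and (best is None or sum(h) > sum(best)):
            best = h
    return best
```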
We show examples of both procedures in
Table 6 on examples from the AG News dataset.
For illustration purposes, we selected incorrectly
classified examples, and selected the foil to be the
true label of the example. The highlight for the
foil was chosen by us in the manual examples.
In the automatic example,
the model made
an incorrect Sci/Tech prediction on a Business
example. The procedure reveals that the model
would have made the correct prediction if the
body of the news article was given without its
title, and that the words ''Siemens'', ''Cell'', and
‘‘Users’’ in the title are independently sufficient to
flip the prediction on the highlight from Business
to Sci/Tech.
We stress that while the examples presented in
these figures appear reasonable, the true goal of
this method is not to provide highlights that seem
justified, but to provide a framework that allows
models to be meaningfully incorporated in use-
cases of dispute, debug, and advice, with robust
and proven guarantees of behavior.
For example, in each of the example use-cases:
Dispute: The user verifies if the model ‘‘correctly’’
considered a specific portion of the input in the
decision: The model made decision c, where the
user believes decision ˆc is appropriate and is sup-
ported by evidence h ⊙ x. If mp(h ⊙ x) ≠ c, they
may dispute the claim that the model interpreted
h ⊙ x with ''correct'' evidence intent. Otherwise
the dispute cannot be made, as the model provably
considered h as evidence for ˆc, yet insufficiently
so when combined with hΔ as hc ⊙ x.
Debug: Assuming c is incorrect, the user performs
error analysis by observing which part of the input
is sufficient to steer the predictor away from the
correct decision ˆc. This is provided by hΔ.
Advice: When the user is unaware of the answer,
and is seeking perspective from a trustworthy
model: They are given explicit feedback on which
part of the input the model ‘‘believes’’ is sufficient
to overturn the signal in h towards ˆc. If the model
is not considered trustworthy, the user may gain
or reduce trust by observing whether m(h ⊙ x)
and hΔ align with user priors.
11 Discussion
Causal Attribution of Heat-maps. Recent
work on the faithfulness of attention heat-maps
(Baan et al., 2019; Pruthi et al., 2019; Serrano and
Smith, 2019) or saliency distributions (Alvarez-
Melis and Jaakkola, 2018; Kindermans et al.,
2019) cast doubt on their faithfulness as indicators
to the significance of parts of the input (to a model
决定). Similar arguments can be made regard-
ing any explanation in the format of heat-maps,
such as LIME and SHAP (Jacovi and Goldberg,
2020). We argue that this is a natural conclu-
sion from the fact that, as a community, we have
not envisioned an appropriate causal chain that
utilizes heat-maps in the decision process, rein-
forcing the claims in this work on the parallel
between causal attribution and faithfulness. 这
point is also discussed at length by Grimsley et al.
(2020).
Social Attribution of Heat-maps. As mentioned above, the lack of a clear perception of
a causal chain behind heat-map feature attribution
explanations in NLP makes it difficult to dis-
cuss the social intent attributed by these methods.
Nevertheless, it is possible to do so under two per-
spectives: (1) when the heat-map is discretized into
a highlight, and thus can be analyzed along the list
of possible attributions in Table 5; or (2) when the
heat-map is regarded as a collection of pair-wise
claims about which part the input is more impor-
tant, given two possibilities. Perspective (1) can
be likened to claims #1 and #2 in Table 5, namely,
‘‘evidencing’’ attributions sans sufficiency.
Contrastive Explanations in NLP. We are not
aware of prior work that discusses or implements
contrastive explanations explicitly in NLP, how-
ever this does not imply that existing explanation
methods in NLP are not contrastive. To the con-
trary, the social sciences argue that every manner
of explanation has a foil, and is comprehended by
the explainee against some foil—including pop-
ular methods such as LIME (Ribeiro et al., 2016)
and SHAP (Lundberg and Lee, 2017). The ques-
tion then becomes what the foil is, and whether
this foil is intuitive and thus useful. In the case
of LIME, 例如, the foil is defined by the
aggregation of all possible perturbations admitted
to the approximating linear model—where such
perturbations may not be natural language, 和
thus less intuitive as foil; additionally, Kumar
et al. (2020) have recently derived the foil behind general Shapley value-based explanations, and
have shown that this foil is not entirely aligned
with human intuition. We argue that making the
foil explicit and intuitive is an important goal of
any interpretation system.
Inter-disciplinary Research. Research on
explanations in artificial intelligence will ben-
efit from a deeper interdisciplinary perspective on
two axes: (1) literature on causality and causal
attribution, regarding causal effects in a model’s
reasoning process; 和 (2) literature on the social
perception and attribution of human-like intent to
causal chains of model decisions or behavior.
12 Related Work
How interpretations are comprehended by people is related to simulatability (Kim et al., 2017): the degree to which humans can simulate model decisions. Quantifying simulatability (Hase and Bansal, 2020) is decidedly different from social alignment, since, for example, it will not necessarily detect dominant selectors. We theorize that aligned faithful interpretations will increase simulatability.
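For intuition, forward simulation can be scored roughly as follows. This is a schematic reading of the simulatability setup, not Hase and Bansal's exact protocol, and all names are illustrative; note that a high score here would not by itself reveal a dominant selector.

```python
from typing import Callable, List

def simulatability(model_fn: Callable[[str], str],
                   explain_fn: Callable[[str], str],
                   simulate_fn: Callable[[str, str], str],  # human stand-in: (input, explanation) -> guess
                   inputs: List[str]) -> float:
    """Fraction of inputs on which the simulator reproduces the model's output
    after seeing the input together with its explanation."""
    hits = sum(simulate_fn(x, explain_fn(x)) == model_fn(x) for x in inputs)
    return hits / len(inputs)
```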
Predict-select-verify is reminiscent of iterative erasure (Feng et al., 2018). By iteratively removing ‘‘significant’’ tokens in the input, Feng et al. show that a surprisingly small portion of the input could be interpreted as evidence for the model to make the prediction, leading to conclusions about the pathological nature of neural models and their sensitivity to badly structured text. This experiment retroactively serves as a successful application of debugging using our formulation.
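A schematic version of such iterative erasure, written in the spirit of input reduction rather than as Feng et al.'s exact algorithm, might look like this: tokens are removed one at a time as long as the predicted label is preserved, and the small remainder is then read as the ‘‘evidence’’ the model relies on. The interface and the confidence-based removal choice are our assumptions.

```python
from typing import Callable, List, Tuple

def iterative_erasure(
    predict_proba: Callable[[List[str]], Tuple[str, float]],  # -> (label, confidence)
    tokens: List[str],
) -> List[str]:
    """Greedily drop the token whose removal keeps the label with highest confidence,
    stopping once every further removal would flip the prediction."""
    label, _ = predict_proba(tokens)
    reduced = list(tokens)
    while len(reduced) > 1:
        best = None  # (confidence after removal, index to remove)
        for i in range(len(reduced)):
            cand = reduced[:i] + reduced[i + 1:]
            cand_label, cand_conf = predict_proba(cand)
            if cand_label == label and (best is None or cand_conf > best[0]):
                best = (cand_conf, i)
        if best is None:  # every removal flips the label: stop
            break
        reduced.pop(best[1])
    return reduced
```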
The approach by Chang et al. (2019) for class-wise highlights is reminiscent of contrastive highlights, but nevertheless distinct, since such highlights still explain a fact against all foils.
13 Conclusion
Highlights are a popular format for explanations
of decisions on textual inputs, for which there are
models available today with the ability to derive
highlights ‘‘faithfully’’. We analyze highlights as
a case study in pursuit of rigorous formalization
of quality artificial intelligence explanations.
We redefine faithfulness as
the accurate
representation of the causal chain of decision
making in the model, and aligned faithfulness as a
faithful interpretation which is also aligned to the
social attribution of intent behind the causal chain.
The two steps of causal attribution and social
attribution together complete the process of
‘‘explaining’’ the decision process of the model
to humans.
With this formalization, we characterize various failures in faithful highlights that ‘‘seem’’ strange, but could not be properly described previously, noting they are not properly constrained by their social attribution as summaries or evidence. We propose an alternative which can be constrained to serve as evidence. Finally, we implement our alternative by formalizing contrastive explanations in the highlight format.
Acknowledgements
This project has received funding from the
European Research Council (ERC) under the
European Union’s Horizon 2020 Research and
Innovation program, grant agreement no. 802774
(iEXTRACT).
References
David Alvarez-Melis and Tommi S. Jaakkola.
2018. On the robustness of interpretability
方法. CoRR, abs/1806.08049.
Jacob Andreas, Marcus Rohrbach, Trevor Darrell,
and Dan Klein. 2016. Learning to compose neu-
ral networks for question answering. CoRR,
abs/1601.01705. DOI: https://doi.org
/10.18653/v1/N16-1181
Joris Baan, Maartje ter Hoeve, Marlies van der
Wees, Anne Schuth, and Maarten de Rijke.
2019. Do transformer attention heads provide
transparency in abstractive summarization?
CoRR, abs/1907.00570.
Yujia Bao, Shiyu Chang, Mo Yu, and Regina Barzilay. 2018. Deriving machine attention from human rationales. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 – November 4, 2018, pages 1903–1913. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1216, PMID: 29551352
Jasmijn Bastings, Wilker Aziz, and Ivan Titov. 2019. Interpretable neural predictions with differentiable binary variables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2963–2977. Association for Computational Linguistics, Florence, Italy. DOI: https://doi.org/10.18653/v1/P19-1284
Tapabrata Chakraborti, Arijit Patra, and J. Alison Noble. 2019. Contrastive algorithmic fairness: Part 1 (theory). ArXiv, abs/1905.07360.
Shiyu Chang, Yang Zhang, Mo Yu, and Tommi S. Jaakkola. 2019. A game theoretic approach to class-wise selective rationalization. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 10055–10065.
Shiyu Chang, Yang Zhang, Mo Yu, and Tommi S. Jaakkola. 2020. Invariant rationalization. CoRR, abs/2003.09772.
Sheng-Hui Chen, Kayla Boggess, and Lu Feng.
2020. Towards transparent robotic planning
via contrastive explanations. ArXiv, abs/2003
.07425.
Seth Chin-Parker and Julie Cantelon. 2017. Contrastive constraints guide explanation-based category learning. Cognitive Science, 41(6):1645–1655. DOI: https://doi.org/10.1111/cogs.12405, PMID: 27564059
Devleena Das and Sonia Chernova. 2020. Leveraging rationales to improve human task performance. In Proceedings of the 25th International Conference on Intelligent User Interfaces, IUI '20, pages 510–518. Association for Computing Machinery, New York, NY, USA.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. 2019. ERASER: A benchmark to evaluate rationalized NLP models. DOI: https://doi.org/10.18653/v1/2020.acl-main.408
Amit Dhurandhar, Pin-Yu Chen, Ronny Luss,
Chun-Chen Tu, Pai-Shun Ting, Karthikeyan
Shanmugam, and Payel Das. 2018. Explana-
tions based on the missing: Towards con-
trastive explanations with pertinent negatives.
In NeurIPS.
Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
Amir Feder, Nadav Oved, Uri Shalit, and Roi Reichart. 2020. CausaLM: Causal model explanation through counterfactual language models. CoRR, abs/2005.13407.
Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan L. Boyd-Graber. 2018. Pathologies of neural models make interpretation difficult. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 – November 4, 2018, pages 3719–3728. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1407
Christopher Grimsley, Elijah Mayfield, and Julia R. S. Bursten. 2020. Why attention is not explanation: Surgical intervention and causal reasoning about neural models. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1780–1790. European Language Resources Association, Marseille, France.
Peter Hase and Mohit Bansal. 2020. Evaluating
explainable AI: Which algorithmic explana-
tions help users predict model behavior? CoRR,
abs/2005.01831. DOI: https://doi.org
/10.18653/v1/2020.acl-main.491
Been Kim, Martin Wattenberg, Justin Gilmer,
Carrie Cai, James Wexler, Fernanda Viegas,
and Rory Sayres. 2017. Interpretability beyond
feature attribution: Quantitative testing with
concept activation vectors (tcav).
Fritz Heider. 1958. The Psychology of Interpersonal Relations. New York: Wiley; Hillsdale, NJ: Lawrence Erlbaum.
Fritz Heider and Marianne Simmel. 1944. An experimental study of apparent behavior. The American Journal of Psychology, 57(2):243–259. DOI: https://doi.org/10.2307/1416950
Bernease Herman. 2017. The promise and peril
of human evaluation for model interpretability.
CoRR, abs/1711.07414. Withdrawn.
Germund Hesslow. 1988. The problem of causal selection. In Denis J. Hilton, editor, Contemporary Science and Natural Explanation: Commonsense Conceptions of Causality. New York University Press.
Denis J. Hilton. 1988. Logic and causal attribution. In Contemporary Science and Natural Explanation: Commonsense Conceptions of Causality, pages 33–65. New York University Press.
Denis J. Hilton, David R. Mandel, and Patrizia Catellani. 2005. The Psychology of Counterfactual Thinking. London; New York: Routledge.
Denis J. Hilton and Ben R. Slugoski. 1986. Knowledge-based causal attribution: The abnormal conditions focus model. Psychological Review, 93(1):75. DOI: https://doi.org/10.1037/0033-295X.93.1.75
Alon Jacovi and Yoav Goldberg. 2020. Towards
faithfully interpretable NLP systems: How should
we define and evaluate faithfulness? CoRR,
abs/2004.03685. DOI: https://doi.org
/10.18653/v1/2020.acl-main.386
Sarthak Jain, Sarah Wiegreffe, Yuval Pinter, and Byron C. Wallace. 2020. Learning to faithfully rationalize by construction. CoRR, abs/2005.00115. DOI: https://doi.org/10.18653/v1/2020.acl-main.409
Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. 2019. The (un)reliability of saliency methods. In Wojciech Samek, Grégoire Montavon, Andrea Vedaldi, Lars Kai Hansen, and Klaus-Robert Müller, editors, Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Lecture Notes in Computer Science, 11700, Springer, pages 267–280. DOI: https://doi.org/10.1007/978-3-030-28954-6_14
Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In Yoshua Bengio and Yann LeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.
I. Elizabeth Kumar, Suresh Venkatasubramanian, Carlos Scheidegger, and Sorelle A. Friedler. 2020. Problems with Shapley-value-based explanations as feature importance measures. CoRR, abs/2002.11097.
Tao Lei, Regina Barzilay, and Tommi S. Jaakkola.
2016. Rationalizing neural predictions. CoRR,
abs/1606.04155. DOI: https://doi.org
/10.18653/v1/D16-1011
Peter Lipton. 1990. Contrastive explanation.
Royal Institute of Philosophy Supplements,
27:247–266. DOI: https://doi.org/10
.1017/S1358246100005130
Zachary C. Lipton. 2018. The mythos of model interpretability. Communications of the ACM, 61(10):36–43. DOI: https://doi.org/10.1145/3233231
Tania Lombrozo. 2006. The structure and function of explanations. Trends in Cognitive Sciences, 10(10):464–470. DOI: https://doi.org/10.1016/j.tics.2006.08.004, PMID: 16942895
Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions.
In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, Curran Associates, Inc., pages 4765–4774.
Prashan Madumal, Tim Miller, Liz Sonenberg, and Frank Vetere. 2020. Explainable reinforcement learning through a causal lens. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 2493–2500. AAAI Press. DOI: https://doi.org/10.1609/aaai.v34i03.5631
Ann McGill and Jill Klein. 1993. Contrastive and counterfactual thinking in causal judgment. Journal of Personality and Social Psychology, 64:897–905. DOI: https://doi.org/10.1037/0022-3514.64.6.897
Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38. DOI: https://doi.org/10.1016/j.artint.2018.07.007
Tim Miller. 2020. Contrastive explanation: A
structural-model approach.
Danish Pruthi, Mansi Gupta, Bhuwan Dhingra,
Graham Neubig, and Zachary C. Lipton. 2019.
Learning to deceive with attention-based expla-
nations. CoRR, abs/1909.07913. DOI: https://
doi.org/10.18653/v1/2020.acl-main
.432
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. ‘‘Why Should I Trust You?’’: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 1135–1144. ACM, New York, NY, USA. DOI: https://doi.org/10.1145/2939672.2939778
Cynthia Rudin. 2019. Stop explaining black box
machine learning models for high stakes deci-
sions and use interpretable models instead. Na-
ture Machine Intelligence, 1(5):206–215. DOI:
https://doi.org/10.1038/s42256
-019-0048-X
Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 – August 2, 2019, Volume 1: Long Papers, pages 2931–2951.
Gabriel Stanovsky, Noah A. Smith, and Luke Zettlemoyer. 2019. Evaluating gender bias in machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1679–1684. Association for Computational Linguistics, Florence, Italy. DOI: https://doi.org/10.18653/v1/P19-1164
Sanjay Subramanian, Ben Bogin, Nitish Gupta, Tomer Wolfson, Sameer Singh, Jonathan Berant, and Matt Gardner. 2020. Obtaining faithful interpretations from compositional neural networks. CoRR, abs/2005.00724. DOI: https://doi.org/10.18653/v1/2020.acl-main.495
Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 11–20. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D19-1002
Joseph Jay Williams, Tania Lombrozo, and Bob Rehder. 2013. The hazards of explanation: Overgeneralization in the face of exceptions. Journal of Experimental Psychology: General, 142(4):1006. DOI: https://doi.org/10.1037/a0030996, PMID: 23294346
Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256. DOI: https://doi.org/10.1007/BF00992696
Omar Zaidan, Jason Eisner, and Christine D.
Piatko. 2007. Using ‘‘annotator rationales’’ to
improve machine learning for text categorization. In Candace L. Sidner, Tanja Schultz, Matthew Stone, and ChengXiang Zhai, editors, Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, April 22-27, 2007, Rochester, New York, USA, pages 260–267. The Association for Computational Linguistics.
A Glossary
This work is concerned with the formalization and theory of artificial models' explanations. We provide a (non-alphabetical) summary of terminology and their definitions as we utilize them. We stress that these definitions are not universal, as the human explanation sciences describe multiple distinct perspectives, and explanations in AI are still a new field.
Unfolding causal chain: A path of causes
between a set of events, in which a cause from
event C to event E indicates that C must occur
before E.
Human intent: An objective behind an action. In our context, reasoning steps in the causal chain are actions that can be attributed with intent.
Interpretation: A (possibly lossy) mapping from
the full reasoning process of the model to a human-
readable format, involving some implication of a
causal chain of events in the reasoning process.
Faithful interpretation: An interpretation is said
to be faithful if the causal chain it describes is
accurate to the model’s full reasoning process.
Explanation: A process of conveying causal
information about a model’s decision to a person.
We assume that the explainee always attributes
intent to the actions of the explainer.
Plausibility: The incentive of the explainer to provide a justifying explanation that appears convincing.