Evaluating Explanations: How Much Do Explanations
from the Teacher Aid Students?

Danish Pruthi1∗ Rachit Bansal2 Bhuwan Dhingra3
Livio Baldini Soares3 Michael Collins3 Zachary C. Lipton1
Graham Neubig1 William W. Cohen3
1 Carnegie Mellon University, USA 2 Delhi Technological University, India
3 Google Research, USA
{ddanish, zlipton, gneubig}@cs.cmu.edu, racbansa@gmail.com
{bdhingra, liviobs, mjcollins, wcohen}@google.com

Abstract

While many methods purport to explain pre-
dictions by highlighting salient features, what
aims these explanations serve and how they
ought to be evaluated often go unstated. In
this work, we introduce a framework to quan-
tify the value of explanations via the accuracy
gains that they confer on a student model
trained to simulate a teacher model. Cru-
cially, the explanations are available to the
student during training, but are not available
at test time. Compared with prior proposals,
our approach is less easily gamed, enabling
principled, automatic, model-agnostic evalu-
ation of attributions. Using our framework,
we compare numerous attribution methods
for text classification and question answer-
ing, and observe quantitative differences that
are consistent (to a moderate to high degree)
across different student model architectures
and learning strategies.1

1 Introduction

The success of deep learning models, together with
the difficulty of understanding how they work,
has inspired a subfield of research on explaining
predictions, often by highlighting specific input
features deemed somehow important to a pre-
diction (Ribeiro et al., 2016; Sundararajan et al.,
2017; Shrikumar et al., 2017). For instance, we
might expect such a method to highlight spans like
‘‘poorly acted’’ and ‘‘slow-moving’’ to explain a
prediction of negative sentiment for a given movie
review. However, there is little agreement in the literature as to what constitutes a good explanation (Lipton, 2016; Jacovi and Goldberg, 2021).

∗Part of this work was done at Google.
1Code for the evaluation protocol: https://github.com/danishpruthi/evaluating-explanations.
Moreover, various popular methods for generat-
ing such attributions disagree considerably over
which tokens to highlight (Table 1). With so many
methods claimed to confer the same property while
disagreeing so markedly, one path forward is to
develop clear quantitative criteria for evaluating
purported explanations at scale.

The status quo for evaluating so-called expla-
nations skews qualitative—many proposed tech-
niques are evaluated only via visual inspection
of a few examples (Simonyan et al., 2014;
Sundararajan et al., 2017; Shrikumar et al.,
2017). While several quantitative evaluation tech-
niques have recently been proposed, many of
these are easily gamed (Treviso and Martins,
2020; Hase et al., 2020).2 Some depend upon the
model outputs corresponding to deformed exam-
ples that lie outside the support of the training
distribution (DeYoung et al., 2020), and a few
validate explanations on specifically crafted tasks
(Poerner et al., 2018).

In this work, we propose a new framework,
where explanations are quantified by the degree
to which they help a student model in learn-
ing to simulate the teacher on future examples
(Figure 1). Our framework addresses a coherent
goal, is model-agnostic and broadly applicable
across tasks, and (when instantiated with models
as students) can easily be automated and scaled.
Our method is inspired by argumentative mod-
els for justifying human reasoning, which posit
that the role of explanations is to communicate
information about how decisions are made, and
thus to enable a recipient to anticipate future

2See §7 for a comprehensive discussion on existing metrics, and how they can be gamed by trivial strategies.

Transactions of the Association for Computational Linguistics, vol. 10, pp. 359–375, 2022. https://doi.org/10.1162/tacl_a_00465
Action Editor: Trevor Cohn. Submission batch: 6/2021; Revision batch: 11/2021; Published 4/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

decisions (Mercier and Sperber, 2017). Our framework is similar to human studies conducted by Hase and Bansal (2020), who evaluate if explanations help predict model behavior. However, here we focus on protocols that do not rely on human-subject experiments.

Using our framework, we conduct extensive experiments on two broad categories of NLP tasks: text classification and question answering. For classification tasks, we compare seven widely used input attribution techniques, covering gradient-based methods (Simonyan et al., 2014; Sundararajan et al., 2017), perturbation-based techniques (Ribeiro et al., 2016), attention-based explanations (Bahdanau et al., 2015), and other popular attributions (Shrikumar et al., 2017; Dhamdhere et al., 2019). These comparisons lead to observable quantitative differences—we find attention-based explanations and integrated gradients (Sundararajan et al., 2017) to be the most effective, and vanilla gradient-based saliency maps and LIME to be the least effective. Further, we observe moderate to high agreement among rankings obtained by varying student architectures and learning strategies in our framework. For question answering, we validate the effectiveness of student learners on both human-produced explanations collected by Lamm et al. (2021), and automatically generated explanations from a SpanBERT model (Joshi et al., 2020).

Table 1: Overlap among the top-10% tokens selected by different explanation techniques for sentiment analysis. In each row, for a given technique, we tabulate the fraction of explanatory tokens that overlap with other explanations. A value of 1.0 implies perfect overlap and 0.0 denotes no overlap.

Figure 1: The proposed framework for quantifying explanation quality. Student models learn to mimic the teacher, with and without explanations (provided as ‘‘side information’’ with each example). Explanations are effective if they help students to better approximate the teacher on unseen examples for which explanations are not available. Students and teachers could be either models or people.

2 Explanation as Communication

2.1 An Illustrative Example

In our framework, we view explanations as a communication channel between a teacher T and a student S, whose purpose is to help S to predict T’s outputs on a given input. As an example, consider the case of graduate admissions: An aspirant submits their application x and subsequently the admission committee T decides whether the candidate is to be accepted or not. The acceptance criterion, fT(x), represents a typical black box function—one that is of great interest to future aspirants.3 To simulate the admission criterion, a student S might study profiles of several applicants from previous iterations, x1, . . . , xn, and their admission outcomes fT(x1), . . . , fT(xn). Let A(fS, fT) be the simulation accuracy, that is, the accuracy with which the student predicts the teacher’s decisions on unseen future applications (defined formally below in §2.2).

3Our illustrative example assumes that the admission decision depends solely upon the student application, and ignores how other competing applicants might affect the outcome.

Now suppose each previous admission outcome was supplemented with an additional explanation eT(x) from the admission committee, intended to help S understand the decisions made by T. Ideally, these explanations would enhance students’ understanding of the admission process, and would help students simulate the admission decisions better, leading to a higher accuracy. We argue that the degree of improvement in simulation accuracy is a quantitative indicator of the utility of the explanations. Note that generic explanations or explanations that simply encode the final decision (e.g., ‘‘We received far too many applications’’) are unlikely to help students simulate fT(x), as they provide no additional information.

2.2 Quantifying Explanations

For concreteness, we assume a classification task, and for a teacher T, we let fT denote a model that computes the teacher’s predictions. Let S be a student (either human or a machine); then T could teach S to simulate fT by sampling n examples, x1, . . . , xn, and sharing with S a dataset ˆD containing its associated predictions {(x1, ˆy1), . . . , (xn, ˆyn)}, where ˆyi = fT(xi), and S could then learn some approximation of fT from this data:

fS,ˆD = learn(S, ˆD).

Additionally, we assume that for a given teacher T, an explanation generation method can generate an explanation eT(x) for any example x, which is some side information that potentially helps S in predicting fT(x). We use ˆE to denote a dataset of explanation-augmented examples, that is,

ˆE = {(x1, eT(x1), ˆy1), . . . , (xn, eT(xn), ˆyn)},

and the student learner can make use of this side information during training, to learn a classifier

fS,ˆE = learn(S, ˆE).

Note that none of the learning tasks discussed above involve the ‘‘gold’’ label y for any instance x, only the prediction ˆy for x, produced by the teacher. While the student S can use the explanations for learning, all the classifiers fT, fS,ˆD, and fS,ˆE predict labels given only the input x, without using the explanations; that is, explanations are only available during training, not at test time.

In our framework the benefit of explanations is
measured by how much they help the student to
simulate the teacher. In particular, we quantify the
ability of a student fS to simulate a teacher using
the simulation accuracy:

A(fS, fT) = Ex[ 1{fS(x) = fT(x)} ],    (1)

where the expected agreement between student
and teacher is computed over test examples.
Better explanations will lead to higher values of A(fS,ˆE, fT) than the accuracy associated with learning to simulate the teacher without explanations, namely, A(fS,ˆD, fT).
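For concreteness, the simulation accuracy of Eq. (1) is simply the empirical agreement rate between student and teacher predictions on held-out examples. The sketch below is a minimal Python illustration; the function and variable names are our own and not taken from the released code.

```python
import numpy as np

def simulation_accuracy(student_preds, teacher_preds):
    """Empirical estimate of A(fS, fT): the fraction of test examples
    on which the student's prediction matches the teacher's."""
    student_preds = np.asarray(student_preds)
    teacher_preds = np.asarray(teacher_preds)
    return float(np.mean(student_preds == teacher_preds))

# Toy usage: teacher predictions on six unseen examples, compared against
# a student trained with explanations and a student trained without them.
teacher = [1, 0, 1, 1, 0, 1]
student_with_expl = [1, 0, 1, 0, 0, 1]      # agrees on 5/6 examples
student_without_expl = [1, 1, 1, 0, 0, 1]   # agrees on 4/6 examples

gain = (simulation_accuracy(student_with_expl, teacher)
        - simulation_accuracy(student_without_expl, teacher))
print(f"simulation accuracy gain from explanations: {gain:.3f}")
```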

So far, for a given teacher model, our criterion for explanation quality depends upon the choice of the student model (S), its learning procedure, and the number of examples used to train it (n). To reduce the reliance on a given student, we could assume that the student S is drawn from a distribution of students Pr(S), and extend our framework by considering the expected benefit for a random student averaged over various values of n. In practice, we experiment with a small set of diverse students (e.g., models with different sizes, architectures, learning procedures) and consider different values of n.

2.3 Automated Teachers and Students

In principle, T and S could be either people or
algorithms. However, quantitative measurements
are easier to conduct when T and (especially)
S are algorithms. In particular, imagine that T
(which for example could be a BERT-based clas-
sifier) identifies an explanation eT(x) that is some
subset of tokens in a document x that are rele-
vant to the prediction (acquired by, for example,
any of the explanation methods mentioned in
the introduction) and S is some machine learner
that makes use of the explanation. The value of
teacher-explanations for S can then be assessed
via standard evaluation of explanation-aware stu-
dent learners, using predicted labels instead of
gold labels. This value can then be compared to
other schemes for producing explanations (e.g., integrated gradients). However, an important concern in automated evaluation is that, by design, the obtained results are contingent on the student model(s) and how explanations are incorporated by the student model(s).


Table 2: Example of annotated rationales in sentiment analysis and referential equalities in QA.

Another apparent ‘‘bug’’ in this framework is
that in the automated case, one could obtain a
perfect simulation accuracy with an explanation
that communicates all the weights of the teacher
classifier fT to the student.4 We propose two
approaches to address this problem. First, we
simply limit explanations to be of a form that
people can comprehend—for example, spans in a
document x. That is, we consider only popular
formats of explanations that are considered to
be human understandable (see §3 for details and
Table 2 for examples). Second, we experiment
with a diverse set of student models (e.g., networks
with architectures different from the original
teacher model), which precludes trivial weight-
copying solutions.

2.4 Discussion

In our framework, two design choices are crucial:
(i) students do not have access to explanations
at test time; and (ii) we use a machine learning
model as a substitute for the student learner. These
two design choices differentiate our framework
from similar communication games proposed by
Treviso and Martins (2020) and Hase and Bansal
(2020). When explanations are available at test
time, they can leak the teacher output directly
or indirectly, thus corrupting the simulation task.
Both genuine and trivial explanations can encode
the teacher output, making it difficult to discern
the quality of explanations.5 The framework of
Treviso and Martins (2020) is affected by this

4All the weights of the model can be thought of as a complete explanation, which is a reasonable choice for simpler models, e.g., a linear model with a few parameters.

5A trivial explanation may highlight the first input token if the teacher output is 0, and the second token if the output is 1. Such explanations, termed ‘‘Trojan explanations’’, are a problematic manifestation of several approaches, as discussed in Chang et al. (2020) and Jacovi and Goldberg (2021).

issue, which is probably only partially addressed
by enforcing constraints on the student. Prevent-
ing access to explanations while testing solves
this problem and offers flexibility in choosing
student models.

Substituting machine learners for people al-
lows us to train student models on thousands of
examples, in contrast to the study by Hase and
Bansal (2020), where (human) students were
trained on only 16 or 32 examples. As a con-
sequence, the observed differences among many
explanation techniques were statistically insignif-
icant in their studies. While human subject ex-
periments are a valuable complement to scalable
automatic evaluations, it is expensive to conduct
sufficiently large-scale studies; people’s precon-
ceived notions might impair their ability to sim-
ulate the models accurately;6 and lastly these
preconceived notions might bias performance for
different people differently.

3 Learning with Explanations

Our student-teacher framework does not specify
how to use explanations while training the student
model. Below, we examine two broad approaches
to incorporate explanations: attention regulariza-
tion and multitask learning. Our first approach
regularizes attention values of the student model
to align with the information communicated in
explanations. In the second method, we pose the
learning task for the student as a joint task of
prediction and explanation generation, expecting


6We speculate this effect to be pronounced when the
models’ outputs and the true labels differ only over a few
samples.


to improve prediction due to the benefits of multi-
task learning. We show that both of these methods
indeed improve student performance when using
human-provided explanations (and gold labels) for
classification tasks. We explore variants of these
two approaches for question answering tasks.

Classification Tasks The training data for the student model consists of n documents x1, . . . , xn, and the output to be learned, y1, . . . , yn, comes from the teacher, that is, yi = fT(xi), along with teacher-explanations eT(x1), . . . , eT(xn). In this work, we consider teacher explanations in the form of a binary vector eT(xi), such that eT(xi)j = 1 if the jth token in document xi is a part of the teacher-explanation, and 0 otherwise (see Table 2 for an example).7
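Attribution methods that produce a continuous importance score per token can be mapped to this binary form by keeping the top-k% scoring tokens (cf. footnote 7). A minimal sketch of that conversion; the helper name is ours:

```python
import numpy as np

def binarize_explanation(scores, k_percent=10):
    """Turn per-token attribution scores into a binary explanation vector
    e, with e[j] = 1 for the top-k% highest-scoring tokens and 0 otherwise."""
    scores = np.asarray(scores, dtype=float)
    n_keep = max(1, int(round(len(scores) * k_percent / 100.0)))
    top_idx = np.argsort(-scores)[:n_keep]   # indices of the largest scores
    e = np.zeros(len(scores), dtype=int)
    e[top_idx] = 1
    return e

# Attribution scores for a 10-token review; keep the top-20% tokens.
scores = [0.02, 0.91, 0.10, 0.05, 0.87, 0.03, 0.01, 0.04, 0.06, 0.02]
print(binarize_explanation(scores, k_percent=20))
# -> [0 1 0 0 1 0 0 0 0 0]
```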
To incorporate explanations during training, we
suggest two different approaches. First, we use
attention regularization, where we add a reg-
ularization term to our loss to reduce the KL
divergence between the attention distribution of
the student model (αstudent) and the distribution of
the teacher-explanation (αexp):

R = −λ KL(αexp ‖ αstudent),    (2)

where the explanation distribution (αexp) is uniform over all the tokens in the explanation and ε elsewhere (where ε is a very small constant).
When dealing with student models that employ
multi-headed attention, which use multiple differ-
ent attention vectors at each layer of the model
(Vaswani et al., 2017), we take αstudent to be the
attention from the [CLS] token to other tokens in
the last layer, averaged across all attention heads.
Several past approaches have used attention reg-
ularization to incorporate human rationales, with
an aim to improve the overall performance of
the system for classification tasks (Bao et al.,
2018; Zhong et al., 2019) and machine translation
(Yin et al., 2021).
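To make the regularizer in Eq. (2) concrete, the sketch below assumes αstudent has already been extracted as the last-layer [CLS]-to-token attention averaged over heads; the function name, the ε value, and the exact normalization are our own choices rather than details of the authors' implementation.

```python
import torch

def attention_regularizer(student_attn, explanation_mask, lam=1.0, eps=1e-6):
    """lam * KL(alpha_exp || alpha_student), the penalty added to the
    classification loss (equivalently, -lam * KL added to the objective).

    student_attn:     (seq_len,) [CLS]-to-token attention, last layer,
                      averaged over heads (roughly sums to 1).
    explanation_mask: (seq_len,) binary vector marking explanation tokens.
    """
    # alpha_exp: uniform over explanation tokens, (near-)zero elsewhere.
    alpha_exp = explanation_mask.float() + eps
    alpha_exp = alpha_exp / alpha_exp.sum()
    alpha_student = student_attn.clamp_min(eps)
    alpha_student = alpha_student / alpha_student.sum()
    kl = torch.sum(alpha_exp * (alpha_exp.log() - alpha_student.log()))
    return lam * kl

# Toy usage: a 6-token input whose explanation marks tokens 2 and 3.
attn = torch.tensor([0.50, 0.10, 0.15, 0.10, 0.10, 0.05])
expl = torch.tensor([0, 0, 1, 1, 0, 0])
print(attention_regularizer(attn, expl).item())
```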

Second, we use explanations via multitask learning, where the two tasks are prediction and explanation generation (a sequence labeling problem). Formally, the overall loss can be written as:
L = − Σ_{i=1}^{n} [ log p(yi | xi; θ) + log p(ei | xi; φ, θ) ],

where the first term corresponds to the classification task (‘‘classify’’) and the second to the explanation task (‘‘explain’’).

7Explanations that generate a continuous ‘‘importance’’
score for each token can also be used as per this definition,
e.g., by selecting the top-k% tokens from those scores.


As in multitask learning, if the tasks of prediction and explanation generation are complementary, then the two tasks would benefit from each other. As a corollary, if the teacher-explanations offer no additional information about the prediction, then
we would see no benefit from multitask learning
(appropriately so). For most of our classification
experiments, we use BERT (Devlin et al., 2019) with a linear classifier on top of the [CLS] vector to model p(y|x; θ). To model p(e|x; φ, θ) we use
a linear-chain CRF (Lafferty et al., 2001) on top
of the sequence vectors from BERT. Note that we
share the BERT parameters θ between classifica-
tion and explanation tasks. In prior work, similar
multitask formulations have been demonstrated to
effectively incorporate rationales to improve clas-
sification performance (Zaidan and Eisner, 2008)
and evidence extraction (Pruthi et al., 2020).
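A compact sketch of the joint objective is shown below. Note that the setup described above pairs BERT with a linear-chain CRF head for the explanation task; for brevity, this sketch substitutes a small generic encoder and a per-token softmax, so it illustrates the shape of the multitask loss rather than the exact model.

```python
import torch
import torch.nn as nn

class MultitaskStudent(nn.Module):
    """Shared encoder with two heads: one predicts the teacher's label from
    the first ([CLS]-like) position, the other tags which tokens belong to
    the teacher-explanation (a sequence labeling task)."""

    def __init__(self, vocab_size=100, hidden=32, num_labels=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)   # stand-in for BERT
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.classify = nn.Linear(hidden, num_labels)   # "classify" head
        self.explain = nn.Linear(hidden, 2)             # "explain" head
        self.ce = nn.CrossEntropyLoss()

    def forward(self, input_ids, labels, expl_mask):
        seq, _ = self.encoder(self.embed(input_ids))    # (B, T, H)
        loss_classify = self.ce(self.classify(seq[:, 0]), labels)
        loss_explain = self.ce(self.explain(seq).reshape(-1, 2),
                               expl_mask.reshape(-1))
        return loss_classify + loss_explain             # joint loss L

# Toy batch: two 5-token documents, teacher labels, binary explanations.
model = MultitaskStudent()
ids = torch.randint(0, 100, (2, 5))
labels = torch.tensor([1, 0])
expl = torch.randint(0, 2, (2, 5))
loss = model(ids, labels, expl)
loss.backward()
print(float(loss))
```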

Question Answering Let the question q consist
of m tokens q1 . . . qm, along with passage x that
provides the answer to the question, consisting
of n tokens x1, . . . , xn. Let us define a set of
question phrases Q and passage phrases P to be

Q = {(i, j) : 1 ≤ i ≤ j ≤ m}
P = {(i, j) : 1 ≤ i ≤ j ≤ n}.

We consider a subset of QED explanations (Lamm
et al., 2021), which consist of a sequence of
one or more ‘‘referential equality annotations’’
e1 . . . e|e|. Formally, each referential equality annotation ek for k = 1 . . . |e| is a pair (φk, πk) ∈ Q × P, specifying that phrase φk in the question refers to the same thing in the world as the phrase πk in the passage (see Table 2 for an example).

To incorporate explanations for question answer-
ing tasks, we use the two approaches discussed
for text classification tasks, namely, attention reg-
ularization and multitask learning. Since the expla-
nation format for question answering is different
from the explanations in text classification, we
use a lossy transformation, where we construct a
binary explanation vector, where 1 corresponds to
tokens that appear in one or more referential equal-
ities and 0 otherwise. Given this transformation, neither approach uses the alignment information present in the referential equalities.
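The lossy transformation described above can be written in a few lines; the helper below (our own naming) assumes the referential-equality spans are already given as token indices into the concatenated question-plus-passage input:

```python
import numpy as np

def equalities_to_binary(seq_len, equalities):
    """Binary explanation vector over the question+passage input: 1 for
    tokens covered by any referential equality, 0 otherwise. `equalities`
    holds ((q_start, q_end), (p_start, p_end)) inclusive token spans.
    Note that the pairing (alignment) between spans is discarded."""
    e = np.zeros(seq_len, dtype=int)
    for (q_start, q_end), (p_start, p_end) in equalities:
        e[q_start:q_end + 1] = 1
        e[p_start:p_end + 1] = 1
    return e

# One equality: question phrase at tokens 2-3, passage phrase at tokens 9-11.
print(equalities_to_binary(14, [((2, 3), (9, 11))]))
```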

To exploit the alignment information provided
by referential equalities, we introduce and append


Student Model                      600    900    1200
BERT-base                          75.5   79.0   81.1
w/ explanations used via
  multitask learning               75.2   80.0   82.5
  attention regularization         81.5   83.1   84.0

Table 3: Simulation accuracy of a student model when trained with and without explanations from human experts for sentiment analysis. We note that both the proposed methods to learn with explanations improve performance: attention regularization leads to large gains, whereas multitask learning requires more examples to yield improvements.

an attention alignment loss to the standard loss:

R = −λ log( (1/|e|) Σ_{k=1}^{|e|} αstudent[φk → πk] ),

where ek = (φk, πk) is the kth referential equality, and αstudent[φk → πk] is the last-layer average
attention originating from tokens in φk to tokens in
πk. The average is computed across all the tokens
in φk and across all attention heads. The underly-
ing idea is to increase attention values correspond-
ing to the alignments provided in explanations.
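A sketch of this alignment loss follows. It assumes access to the last-layer attention matrix averaged over heads, and reads αstudent[φk → πk] as the attention mass from each token in φk onto the span πk, averaged over the tokens of φk; the function name and this exact reduction are our own reading of the description above, not the authors' code.

```python
import torch

def attention_alignment_loss(attn, equalities, lam=1.0, eps=1e-8):
    """-lam * log( (1/|e|) * sum_k alpha_student[phi_k -> pi_k] ).

    attn:       (seq_len, seq_len) last-layer attention averaged over heads;
                rows are source tokens, columns are attended-to tokens.
    equalities: list of ((q_start, q_end), (p_start, p_end)) inclusive
                token-index spans for each referential equality.
    """
    per_equality = []
    for (q_start, q_end), (p_start, p_end) in equalities:
        block = attn[q_start:q_end + 1, p_start:p_end + 1]
        # Attention mass from each phi_k token onto the pi_k span,
        # averaged over the tokens of phi_k.
        per_equality.append(block.sum(dim=1).mean())
    mean_attn = torch.stack(per_equality).mean()
    return -lam * torch.log(mean_attn + eps)

# Toy usage: an 8-token input with one referential equality linking
# question tokens 1-2 to passage tokens 5-6.
attn = torch.softmax(torch.randn(8, 8), dim=-1)
print(attention_alignment_loss(attn, [((1, 2), (5, 6))]).item())
```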

4 Human Experts as Teachers

Below, we discuss the results of applying our framework to explanations and outputs from human teachers, to confirm whether expert explanations improve the student models’ performance.

Setup There exist a few tasks where researchers
have collected explanations from experts besides
the output label. For the task of sentiment analysis
on movie reviews, Zaidan et al. (2007) collected
‘‘rationales’’ where people highlighted portions
of the movie reviews that would encourage (or
discourage) readers to watch (or avoid) the movie.
In another recent effort, Lamm et al. (2021) col-
lected ‘‘QED annotations’’ over questions and
the passages from the Natural Questions (NQ)
dataset (Kwiatkowski et al., 2019). These anno-
tations contain the salient entity in the question
and its referential mentions in the passage that
need to be resolved to answer the question. For
both these tasks, our student-learners are pre-
trained BERT-base models, which are further

Student Model                      500    1500   2500
BERT-base                          28.9   43.7   49.0
w/ explanations used via
  multitask learning               29.7   42.7   49.2
  attention regularization         31.2   47.2   52.6
  attention alignment loss         37.3   49.6   54.0

Table 4: Simulation performance (F1 score) of a student model when trained with and without explanations from human experts for question answering. We find that attention regularization and attention alignment loss result in large improvements upon incorporating explanations.

fine-tuned with outputs and explanations from
human experts.

Results Our suggested methods to learn from explanations indeed benefit from human explanations. For the sentiment analysis task, attention
regularization boosts performance, as depicted
in Table 3. For instance, attention regulariza-
tion improves the accuracy by an absolute 6
points, for 600 examples. The performance ben-
efits, unsurprisingly, diminish with increasing
training examples—for 1200 examples, the atten-
tion regularization improves performance by 2.9
points. While attention regularization is imme-
diately effective, the multitask learning requires
more examples to learn the sequence labeling task.
We do not see any improvement using multitask
learning for 600 examples, but for 900 and 1200 training examples, we see absolute improvements of 1 and 1.4 points, respectively.

We follow up our findings to validate if the
simulation performance of the student model is
correlated with explanation quality. To do so, we
corrupt human explanations by unselecting the
marked tokens with varying noising probabilities
(ranging from 0 to 1, in steps of 0.1). We train stu-
dent models on corrupted explanations using at-
tention regularization and find their performance
to be highly negatively correlated with the amount
of noise (Pearson correlation ρ = −0.72). This
study verifies that our metric is correlated with (an
admittedly simple notion of) explanation quality.
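The corruption used in this sanity check amounts to independently unselecting each marked token with a given noising probability; a minimal sketch (function name ours):

```python
import numpy as np

def corrupt_explanation(expl, noise_prob, rng=None):
    """Flip each marked token (1 -> 0) independently with probability
    `noise_prob`; unmarked tokens are left untouched."""
    rng = rng or np.random.default_rng(0)
    expl = np.asarray(expl)
    drop = rng.random(expl.shape) < noise_prob
    return np.where((expl == 1) & drop, 0, expl)

# A human rationale over a 10-token review, corrupted at rising noise levels.
rationale = np.array([0, 1, 1, 0, 0, 1, 0, 0, 1, 0])
for p in (0.0, 0.5, 1.0):
    print(p, corrupt_explanation(rationale, p))
```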
For the question-answering task, we measure
the F1 score of the student model on the test set
carved from the QED dataset. As one can observe
from Table 4, both attention regularization and at-
tention alignment loss improve the performance,


Table 5: We evaluate the effectiveness of attribution methods for sentiment analysis using simulation accuracy of student models trained with these explanations on varying amounts of data (§5.2). Each method selects top-10% ‘‘important’’ tokens for each example. We find attention-based explanations to be most effective, followed by integrated gradients. We also tabulate the average rank as per our metric.
Statistically significant differences (p-value < 0.05) from the no-explanation control are underlined. whereas multitask learning is not effective.8 At- tention regularization and attention alignment loss improve F1 score by 2.3 and 8.4 points for 500 examples, respectively. The gains decrease with increasing examples (e.g., the improvement due to attention alignment loss is 5 points on 2500 exam- ples, compared to 8.4 points with 500 examples). The key takeaway from these experiments (with explanations and outputs from human experts) is that we observe benefits with the learning proce- dures discussed in previous section. This provides support to our proposal to use these methods for evaluating various explanation techniques. 5 Automated Evaluation of Attributions Here, we use a machine learning model as our choice for the teacher, and subsequently train student models using the output and explanations produced by the teacher model. Such a setup allows us to compare attributions produced by different techniques for a given teacher model. 5.1 Setup For sentiment analysis, we use BERT-base (Devlin et al., 2019) as our teacher model and train it on the IMDb dataset (Maas et al., 2011). The accuracy of the teacher model is 93.5%. 8We speculate that multitask learning might require more than 2500 examples to yield benefits. Unfortunately, for the QED dataset, we only have 2500 training examples. We compare seven commonly used methods for producing explanations. These techniques in- clude LIME (Ribeiro et al., 2016), gradient-based saliency methods, that is, gradient norm and gra- dient × input (Simonyan et al., 2014), DeepLIFT (Shrikumar et al., 2017), layer conductance (Dhamdhere et al., 2019), integrated gradients (I.G.) (Sundararajan et al., 2017), and attention- based explanations (Bahdanau et al., 2015). More details about these explanation techniques are provided in the Appendix. For each explanation technique to be compa- rable to others, we sort the tokens as per scores assigned by a given explanation technique, and use only the top-k% tokens. This also ensures that across different explanations, the quantity of information from the teacher to the student per example is constant. Additionally, we evaluate no-explanation, random-explanation, and trivial- explanation baselines. For random explanations, we randomly choose k% tokens, and for triv- ial explanations, we use the first k% tokens for the positive class, and the next k% tokens for the negative class. Such trivial explanations en- code the label and can achieve perfect scores for many evaluation protocols that use explanations at test time. Corresponding to each explanation type, we train 4 different student models—comprising BERT and BiLSTM based models—using out- puts and explanations from the teacher. The test 365 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 6 5 2 0 0 6 9 7 1 / / t l a c _ a _ 0 0 4 6 5 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Table 6: Evaluatin different attribution methods for sentiment analysis using the simulation accuracy of BiLSTM-based student models trained with these explanations on varying amounts of data (§5.2). We find attention-based explanations, integrated gradients, and layer conductance to be effective techniques. The rankings are largely consistent with those attained using transformer-based student models (Table 5). 
Statistically significant differences (p-value < 0.05) from the no-explanation control are underlined. set of the teacher model is divided to construct train, development, and test splits for the student model. We train student models with explanations by using attention regularization and multitask learning. We vary the amount of training data available and note the simulation performance of student models. For the question answering task, we use the Natural Questions dataset (Kwiatkowski et al., 2019). The teacher model is a SpanBERT-based model that is trained jointly to answer the question and produce explanations (Lamm et al., 2021). We use the model made available by the authors. The test set of Natural Questions is split to form the training, development, and test set for the student model. We use a BERT-base QA model as our student model to evaluate the value of teacher explanations. 5.2 Main Results We evaluate different explanation generation methods based upon the simulation accuracy of various student models for two NLP tasks: text classification and question answering. For the sentiment analysis task, we present the simulation performance of BERT-base and BERT-large student models in Table 5, and BiL- STM and BiLSTM+Attention student models in Student Model BERT w/ explanations used via 250 1K 4K 16K 25.0 37.7 52.6 61.6 27.6 42.8 54.4 62.3 attention regularization attention alignment loss 32.9 46.9 55.5 62.2 Table 7: Simulation performance (F1 score) of a student model when trained with and with- out explanations from the SpanBERT QA model (the teacher model in this case). We find these explanations to be effective across both the learn- ing strategies. Table 6. From these two tables, we first note that attention-based explanations are effective, result- ing in large and statistically significant improve- ments over the no-explanation control. We see an improvement of 1.4 to 2.6 points for transformer- based student models (Table 5), and up to 7 points for the Bi-LSTM student model (Table 6). While it may seem that attention is effective because it aligns most directly with attention regu- larization learning strategy, we note that the trends from multitask learning corroborate the same con- clusion for different student models—especially the Bi-LSTM student model, which does not even use the attention mechanism, and therefore cannot 366 Bi-LSTM w/ Attention Student Models BS, ED, HS LR 16, 128, 256, 2.5 × 10−3 16, 64, 256, 0.5 × 10−2 64, 256, 256, 0.5 × 10−2 32, 128, 768, 2.5 × 10−3 None LIME Grad Norm Grad×Inp. DeepLIFT Layer Cond. I.G. Attention MTL AR Rank MTL AR Rank MTL AR Rank MTL AR Rank 8.0 83.9 5.5 84.3 3.0 84.1 6.5 84.0 5.5 84.0 3.5 84.7 1.5 84.8 2.0 84.7 79.9 80.8 81.2 80.6 81.7 83.8 83.6 81.8 80.0 81.0 81.5 81.3 82.1 82.3 82.4 82.1 82.3 83.4 83.9 83.4 83.5 83.7 84.1 83.7 77.4 79.4 79.4 79.0 79.0 80.2 80.3 80.6 79.7 80.3 80.5 80.0 81.2 82.0 82.3 82.6 84.1 84.9 85.0 83.9 84.6 85.1 84.9 85.5 84.6 85.0 84.7 84.8 84.9 84.7 84.8 85.3 7.5 5.0 4.0 7.5 5.0 1.5 3.5 2.0 8.0 5.0 5.0 6.5 5.5 3.0 1.5 1.5 8.0 4.5 5.5 5.0 3.5 4.5 3.0 2.5 Table 8: Gauging the sensitivity of our framework to different hyperparameter values of student models. We note the simulation accuracies of 4 BiLSTM with attention student models with varying batch size (BS), learning rate (LR), embedding dimension (ED), and hidden size (HS). We incorporate explanations via multi-task learning (MTL) over 4K examples and attention regularisation (AR) on 2K examples. 
The average rank correlation coefficient τ between all five configurations (including one from Table 6) is 0.95. Statistically significant differences (p-value < 0.05) from the no-explanation control are underlined. incorporate explanations using attention regular- ization. Besides attention explanations, we also find integrated gradients and layer conductance to be effective techniques. Qualitatively inspect- ing a few examples, we notice that attention and integrated gradients indeed highlight spans that convey the sentiment of the review. Lastly, we see that trivial explanations do not outperform the control experiment, confirming that our framework is robust to such gamifica- tion attempts. These explanations would result in a perfect score for the protocol discussed in Treviso and Martins (2020). The metric by Hase et al. (2020) would be undefined in the case when 100% of the explanations trivially leak the label—in the limiting case (when all but one explanation leak the label trivially), the metric would result in a high score, which is unintended. For the question answering task, we observe from Table 7 that explanations from SpanBERT QA model are effective, as indicated by both the approaches to learn from explanations. The performance benefit of using attention alignment loss with 250 examples is 7.9 absolute points, and these gains decrease (unsurprisingly) with increas- ing number of training examples. For instance, the gain is only 2.9 points for 4000 examples and the benefits vanish with 16000 training examples. 5.3 Analysis Here, we analyze the the effect of different instan- tiations of our framework—namely, sensitivity to the choice of student architectures, their hyper- parameters, learning strategies, and so forth. Ad- ditionally, we examine the effect of varying the percentage of explanatory tokens (k in top-k to- kens) on the results obtained from our framework. Varying Student Models and Learning Strategies We evaluate the agreement among attribution rankings obtained using (i) different learning strategies; and (ii) different student models. We compute the Kendall rank corre- lation coefficient τ to measure the agreement among different attribution rankings.9 We report different τ values for varying combinations of student models and learning strategies in the Appendix (Table 10). The key takeaways from this investigation are twofold: first the rank cor- relation between rankings produced using the two learning strategies—attention regularization (AR) and multi-task learning (MTL)—for the same student model is 0.64, which is considered a high agreement. This value is obtained by averaging τ values from 3 different student models that 9Note that τ lies in [−1, 1] with 1 signifying perfect agreement and −1 perfect disagreement. 367 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 6 5 2 0 0 6 9 7 1 / / t l a c _ a _ 0 0 4 6 5 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 can use both these learning strategies. Second, the rank correlation among rankings produced using different student models (given the the same learning strategy) is also high—we report average values of 0.65 and 0.47 when we use AR and MTL learning strategies, respectively. For completion, we also compute τ for all distinct combinations across student models and learning strategies (21 combinations in total) and obtain an average value of 0.52. 
Overall, we observe high agreement among different rankings attained through different instantiations of our student-teacher framework. Sensitivity to Hyperparameters We examine the sensitivity of our framework to different hy- perparameter values of the student models. For BiLSTM-based student models, we perform a random search over different values of four hyperparameters, that is, number of embedding dimensions (ED ∈ {64, 128, 256, 512}), num- ber of hidden size (HS ∈ {256, 512, 768, 1024}), batch size (BS ∈ {8, 16, 32, 64}) and learning rate (LR ∈ {0.5 × 10−3, 1 × 10−3, 2.5 × 10−3, 0.5 × 10−2}). From all possible configurations above, we randomly sample 4 configurations and train a BiLSTM with attention student model corre- sponding to each configuration. The simulation accuracy of student models with different choices of hyperparameters are presented in Table 8. For a given hyperparameter configuration, we average the ranks across the two learning strategies. We compute the Kendall rank correlation coefficient τ among rankings obtained using different hyper- parameter configurations (including the default (cid:18) 5 configuration from Table 6, thus resulting in 2 comparisons). We obtain a high average correla- tion of 0.95, suggesting that our framework yields largely consistent ranking of attributions across varying hyperparameters. (cid:17) Varying the Percentage of Explanatory Tokens To examine the effect of k in selecting top-k% tokens, we evaluate the simulation performance of BERT-base students trained with varying val- ues of k ∈ {5, 10, 20, 40} on 2000 examples.10 For these values of k, we corroborate the same trend, that is, attention-based explanations are the most effective, followed by integrated gradients 10Note that k is not a parameter of our framework, but controls the number of explanatory tokens for each attribution. Sufficiency ↓ Explanations Value Rank Comprehensive. ↑ Rank Value Random Trivial LIME Grad Norm Grad×Inp. DeepLIFT Layer Cond. I.G. Attention 0.29 0.29 0.06 0.25 0.33 0.39 0.11 0.13 0.11 6 7 1 5 8 9 2 4 3 0.04 0.04 0.32 0.11 0.06 0.06 0.24 0.17 0.28 9 8 1 5 7 6 3 4 2 Table 9: Comparing attribution methods as per the sufficiency (lower the better) and comprehensive- ness metrics proposed in (DeYoung et al., 2020). (see Table 11 in the Appendix). We also perform an experiment where we consider the entire at- tention vector to be an explanation, as it does not lose any information due to thresholding. For 500 examples, we see a statistically significant improvement of 0.9 over the top-10% attention variant (p-value = 0.03), the difference shrinks with increasing numbers of training examples (0.1 for 2000 examples). 5.4 Comparison With Other Benchmarks For completeness, we compare the ranking of explanations obtained through our metrics with existing metrics of sufficiency and comprehen- siveness introduced in (DeYoung et al., 2020). The sufficiency metric computes the average difference in the model output upon using the input example versus using the explanation alone (fT (x) − fT (e)), while the comprehensiveness metric is the average of fT (x) − fT (x\e) over the examples. Note that using these metrics is not ideal as they rely upon the model output on de- formed input instances that lie outside the support of the training distribution. We present these metrics for different ex- planations in Table 9. We observe that LIME outperforms other explanations on both the suf- ficiency and comprehensiveness metrics. 
We attribute this to the fact that LIME explanations rely on attributions from a surrogate linear model trained on perturbed sentences, akin to the inputs used to compute these metrics. The average rank correlation of rankings obtained by our metrics (across all students and tasks) with the rankings from these two metrics is moderate (τ = 0.39), 368 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 6 5 2 0 0 6 9 7 1 / / t l a c _ a _ 0 0 4 6 5 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 which indicates that the two proposals produce slightly different orderings. This is unsurprising as our protocol, in principle, is different from the compared metrics. Ideally, we would like to link this comparison with some notion of user preference. This aspira- tion to evaluate inferred associations with users is similar to that of evaluating latent topics for topic models (Chang et al., 2009). However, directly asking users for their preference (for one expla- nation versus the other) would be inadequate, as users would not be able to comment upon the faithfulness of the explanation to the computation that resulted in the prediction. Instead, we con- duct a study inspired from our protocol, that is, where users simulate the model with and with- out explanations. 5.5 Human Students As discussed in §2.4, it is difficult to ‘‘train’’ peo- ple using a small number of input, output, expla- nation triples to understand the model sufficiently to simulate the model (on unseen examples) better than the control baseline. A recent study trained students with 16 or 32 examples, and tested if students could simulate the model better using different explanations, however the observed dif- ferences among techniques were not statistically significant (Hase and Bansal, 2020). Here, we at- tempt a similar human study, where we present each crowdworker 60 movie reviews, and for 40 (out of 60) reviews we supplement explanations of the model predictions. The goal for the work- ers is to understand the teacher model and guess the output of the model on the 20 unseen movie reviews for which explanations are unavailable. In our case, the teacher model accurately pre- dicts 93.5% of the test examples, therefore to avoid crowdworkers conflating the task of simulation with that of sentiment prediction, we over-sample the error cases such that our final setup comprises 50% correctly classified and 50% incorrectly clas- sified reviews. We experiment with 3 different attribution techniques: attention (as it is one of the best performing explanation technique as per our protocol), LIME (as it is not very effective according to our metrics, but nonetheless is a popular technique), and random (for control). We divide a total of 30 crowdworkers in three cohorts corresponding to each explanation type. The av- erage simulation accuracy of workers is 68.0%, 69.0%, and 75.0% using LIME, attention, and ran- dom explanations, respectively. However, given the large variance in the performance of workers in each cohort, the differences between any pair of these explanations is not statistically significant. The p-value for random vs LIME, random vs at- tention and LIME vs attention is 0.35, 0.14, and 0.87 respectively. This study, similar to past human-subject ex- periments on model simulatability, concludes that explanations do not definitively help crowdwork- ers to simulate text classification models. 
We speculate that it is difficult for people to simu- late models, especially when they see a few fixed examples. A promising direction for future work could be to explore interactive studies, where peo- ple could query the model on inputs of their choice to evaluate any hypotheses they might conjecture. 6 Limitations and Future Directions There are a few important limitations of our work that could motivate future work in this space. First, our current experiments only compare ex- planations that are of the same format. More work is required to compare explanations of different formats, for example, comparing natural language explanations to the top-k% highlighted tokens, or even comparing two methods to produce natural language explanations. To make such compar- isons, one would have to ensure that different explanations (potentially with different formats) communicate comparable bits of information, and subsequently develop learning strategies to train student models. Second, validating the results of any automated evaluation with human judgement of explana- tion quality remains inherently difficult. When people evaluate input attributions (or any form of explanations) qualitatively, they can deter- mine whether the attributions match their intu- ition about what portions of the input should be important to solve the task (i.e., plausibility of explanations), but it is not easy to evaluate if the highlighted portions are responsible for the model’s prediction. Going forward, we think that more granular notions of simulatability, coupled with counterfactual access to models (where peo- ple can query the model), might help people bet- ter assess the role of explanations. Third, while we observe moderate to high agree- ment among attribution rankings across different 369 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 6 5 2 0 0 6 9 7 1 / / t l a c _ a _ 0 0 4 6 5 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 student architectures and learning schemes, it is conceivable that different explanations are favored based on the choice of student model. This is a natural drawback of using a learning model for evaluation as the measurement could be sensitive to its design. Therefore, we recommend users to average simulation results over a diverse set of stu- dent architectures, training examples, and learn- ing strategies; and, wherever possible, validate explanation quality with its intended users. Lastly, an interesting future direction is to train explanation modules to generate explana- tions that optimize our metric, that is, learning to produce explanations based on the feedback from the students. To start with, an explanation generation module could be a simple transfor- mation over the attention heads of the teacher model (as attention-based explanations are effec- tive explanations as per our framework). Learning explanations can be modeled as a meta-learning problem, where the meta-objective is the few-shot test performance of the student trained with in- termediate explanations, and this performance could serve as a signal to update the explana- tion generation module using implicit gradients as in (Rajeswaran et al., 2019). 7 Related Work Several papers have suggested simulatability as an approach to measure interpretability (Lipton, 2016; Doshi-Velez and Kim, 2017). 
In a survey on interpretability, Doshi-Velez and Kim (2017) propose the task of forward simulation: Given an input and an explanation, people must predict what a model would output for that instance. Chandrasekaran et al. (2018) conduct human- studies to evaluate if explanations from Visual Question Answering (VQA) models help users predict the output. Recently, Hase and Bansal (2020) perform a similar human-study across text and tabular classification tasks. Due to the na- ture of these two studies, the observed differences with and without explanation, and among different explanation types, were not significant. Con- ducting large-scale human studies poses several challenges, including the considerable financial expense and the logistical challenge of recruit- ing and retaining participants for unusually long tasks (Chandrasekaran et al., 2018). By automat- ing students in our framework, we mitigate such challenges, and observe quantitative differences among methods in our comparisons. Closest in spirit to our work, Treviso and Martins (2020) propose a new framework to assess explanatory power as the communication success rate between an explainer and a layperson (which can be people or machines). However, as a part of their communication, they pass on explanations during test time, which could easily leak the label, and the models trained to play this communication game can learn trivial protocols (e.g., explainer generating a period for positive examples and a comma for negative examples). This is probably only partially addressed by enforcing constraints on the explainer and the explainee. Our setup does not face this issue as explanations are not available at test time. To counter the effects of leakage due to explanations, Hase et al. (2020) present a Leakage-Adjusted Simulatability (LAS) metric. Their metric quantifies the difference in perfor- mance of the simulation models (analogous to our student models) with and without explana- tions at test time. To adjust for this leakage, they average their simulation results across two dif- ferent sets of examples, ones that leak the label, and others that do not. Leakage is modeled as a binary variable, which is estimated by whether a discriminator can predict the answer using the explanation alone. It is unclear how the average of simulation results solves the problem, espe- cially when trivial explanations leak the label. Interpretability Benchmarks DeYoung et al. (2020) introduce the ERASER benchmark to as- sess how well the rationales provided by models align with human rationales, and also how faith- ful these rationales are to model predictions. To measure faithfulness, they propose two met- rics: comprehensiveness and sufficiency. They compute sufficiency by calculating the model performance using only the rationales, and com- prehensiveness by measuring the performance without the rationales. This approach violates the i.i.d assumption, as the training and evaluation data do not come from the same distribution. It is possible that the differences in model perfor- mance are due to distribution shift rather than the features that were removed. This concern is also highlighted by Hooker et al. (2019), who instead evaluate interpretability methods via their Re- mOve And Retrain (ROAR) benchmark. Because 370 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 6 5 2 0 0 6 9 7 1 / / t l a c _ a _ 0 0 4 6 5 p d . 
f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 the ROAR approach uses explanations at test time, it could be gamed: Depending upon the predic- tion, an adversarial teacher could use a different pre-specified ordering of important pixels as an explanation. Lastly, Poerner et al. (2018) present a hybrid document classification task, where the sentences are sampled from different documents with different class labels. The evaluation metric validates if the important tokens (as per a given in- terpretation technique) point to the tokens from the ‘‘right’’ document, that is, one with the same label as the predicted class. This protocol, too, relies on model output for out-of-distribution samples (i.e., hybrid documents), and is very task-specific. 8 Conclusion We have formalized the value of explanations as their utility in a student-teacher framework, measured by how much they improve the stu- dent’s ability to simulate the teacher. In our setup, explanations are provided by the teacher as additional side information during training, but are not available at test time, thus prevent- ing ‘‘leakage’’ between explanations and output labels. Our proposed evaluation confirms the value of human-provided explanations, and cor- relates with a (simplistic) notion of explanation quality. Additionally, we conduct extensive ex- periments that measure the value of numerous previously-proposed schemes for producing ex- planations. Our experiments result in clear quan- titative differences between different explanation methods, which are consistent, to a moderate to high degree, across different choices. Among ex- planation methods, we find attention to be the most effective. For student models, we find that both multitask and attention-regularized student learn- ers are effective, but attention-based learners are more effective, especially in low-resource settings. Acknowledgments We are grateful to Jasmijn Bastings, Katja Filippova, Matthew Lamm, Mansi Gupta, Patrick Verga, Slav Petrov, and Paridhi Asija for insight- ful discussions that shaped this work. We also thank the TACL reviewers and action editor for providing high-quality feedback that improved our work considerably. Lastly, we acknowledge Chris Alberti for sharing explanations from the SpanBERT model. References Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. 3rd International Conference on Learning Representations, ICLR. Yujia Bao, Shiyu Chang, Mo Yu, and Regina atten- Barzilay. 2018. Deriving machine In Proceed- tion from human rationales. ings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1903–1913. https://doi.org/10 .18653/v1/D18-1216 Arjun Chandrasekaran, Viraj Prabhu, Deshraj Yadav, Prithvijit Chattopadhyay, and Devi Parikh. 2018. Do explanations make VQA models more predictable to a human? In Proceedings of the 2018 Conference on Empir- ical Methods in Natural Language Processing, pages 1036–1042. https://doi.org/10 .18653/v1/D18-1128 Jonathan Chang, Sean Gerrish, Chong Wang, Jordan Boyd-graber, and David Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems, volume 22. Shiyu Chang, Yang Zhang, Mo Yu, and Tommi S. Jaakkola. 2020. Invariant rationalization. In the 37th International Con- Proceedings of ference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 1448–1458. 
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Con- ference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186. Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and ByronC.Wallace. 2020. ERASER: A benchmark to evaluate rationalized NLP mod- els. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguis- tics, pages 4443–4458. https://doi.org /10.18653/v1/2020.acl-main.408 371 l D o w n o a d e d f r o m h t t p : / / d i r e c t . m i t . e d u / t a c l / l a r t i c e - p d f / d o i / . 1 0 1 1 6 2 / t l a c _ a _ 0 0 4 6 5 2 0 0 6 9 7 1 / / t l a c _ a _ 0 0 4 6 5 p d . f b y g u e s t t o n 0 7 S e p e m b e r 2 0 2 3 Kedar Dhamdhere, Mukund Sundararajan, and Qiqi Yan. 2019. How important is a neuron. In 7th International Conference on Learning Representations, ICLR. Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608. Peter Hase and Mohit Bansal. 2020. Evaluating explainable AI: Which algorithmic explana- tions help users predict model behavior? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5540–5552. https://doi.org/10 .18653/v1/2020.acl-main.491 Peter Hase, Shiyue Zhang, Harry Xie, and Mohit Bansal. 2020. Leakage-adjusted simulatability: Can models generate non-trivial explanations of their behavior in natural language? In Find- the Association for Computational ings of Linguistics: EMNLP 2020, pages 4351–4367. Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. 2019. A benchmark for inter- pretability methods in deep neural networks. In Advances in Neural Information Process- ing Systems, pages 9737–9748. https:// doi.org/10.18653/v1/2020.findings -emnlp.390 Alon Jacovi and Yoav Goldberg. 2021. Aligning Interpretations with their Social Faithful Attribution. Transactions of the Association for Computational Linguistics, 9:294–310. https://doi.org/10.1162/tacl a 00367 Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Lin- guistics, 8:64–77. https://doi.org/10 .1162/tacl_a_00300 Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Illia Polosukhin, Alberti, Danielle Epstein, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computa- tional Linguistics, 7:453–466. https://doi .org/10.1162/tacl_a_00276 John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and label- ing sequence data. 18th International Confer- ence on Machine Learning 2001 (ICML 2001), pages 282–289. Matthew Lamm, Jennimaria Palomaki, Chris Alberti, Daniel Andor, Eunsol Choi, Livio Baldini Soares, and Michael Collins. 2021. QED: A framework and dataset for explana- tions in question answering. Transactions of the Association for Computational Linguistics, 9:790–806. 
Zachary C. Lipton. 2016. The mythos of model interpretability. ACM Queue, 16(3):31–57. https://doi.org/10.1145/3236386.3241340

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150.

Hugo Mercier and Dan Sperber. 2017. The Enigma of Reason. Harvard University Press.

Nina Poerner, Hinrich Schütze, and Benjamin Roth. 2018. Evaluating neural network explanation methods using hybrid documents and morphosyntactic agreement. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 340–350. https://doi.org/10.18653/v1/P18-1032

Danish Pruthi, Bhuwan Dhingra, Graham Neubig, and Zachary C. Lipton. 2020. Weakly- and semi-supervised evidence extraction. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3965–3970. https://doi.org/10.18653/v1/2020.findings-emnlp.353

Aravind Rajeswaran, Chelsea Finn, Sham M. Kakade, and Sergey Levine. 2019. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, pages 113–124.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. ‘‘Why should I trust you?’’: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144.

Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 3145–3153.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In 2nd International Conference on Learning Representations, ICLR.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, ICML, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328.

Marcos V. Treviso and André F. T. Martins. 2020. The explanation game: Towards prediction explainability through sparse communication. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2020, Online, November 2020, pages 107–118. https://doi.org/10.18653/v1/2020.blackboxnlp-1.10

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Kayo Yin, Patrick Fernandes, Danish Pruthi, Aditi Chaudhary, André F. T. Martins, and Graham Neubig. 2021. Do context-aware translation models pay the right attention? In Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), Virtual. https://doi.org/10.18653/v1/2021.acl-long.65
Omar Zaidan and Jason Eisner. 2008. Modeling annotators: A generative approach to learning from annotator rationales. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 31–40. https://doi.org/10.3115/1613715.1613721

Omar Zaidan, Jason Eisner, and Christine Piatko. 2007. Using ‘‘annotator rationales’’ to improve machine learning for text categorization. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 260–267.

Ruiqi Zhong, Steven Shao, and Kathleen McKeown. 2019. Fine-grained sentiment analysis with faithful attention. arXiv preprint arXiv:1908.06870.

Supplementary Material

9 Explanation Types

We examine the following attribution methods:

LIME  Locally Interpretable Model-agnostic Explanations (Ribeiro et al., 2016), or LIME, are explanations produced by a linear interpretable model that is trained to approximate the original black-box model in the local neighborhood of the input example. For a given example, several samples are constructed by perturbing the input string, and these samples are used to train the linear model. We draw twice as many samples as the number of tokens in the example, and select the top words that explain the predicted class. We set the number of features for the linear classifier to be 2k, where k is the number of tokens to be selected.

Gradient-based Saliency Methods  Several papers, both in NLP and computer vision, use gradients of the log-likelihood of the predicted label to understand the effect of infinitesimally small perturbations in the input. While no perturbation of an input string is infinitesimally small, researchers have nonetheless continued to use this metric. It is most commonly used in two forms: grad norm, i.e., the ℓ2 norm of the gradient w.r.t. the token representation, and grad × input (also called grad dot), i.e., the dot product of the gradient w.r.t. the token representation and the token representation itself.

Integrated Gradients  Gradients capture only the effect of perturbations in an infinitesimally small neighborhood; integrated gradients (Sundararajan et al., 2017) instead compute and integrate gradients along the line joining a starting reference point and the given input example. For each example, we integrate the gradients over 50 points on the line. A minimal code sketch of these gradient-based attributions is given below.

Layer Conductance  Dhamdhere et al. (2019) introduce and extend the notion of conductance to compute neuron-level importance scores. We apply layer conductance on the first encoder layer of our teacher model and aggregate the scores to define the attributions over the input tokens.

DeepLIFT  DeepLIFT uses a reference for the model input and target, and measures the contribution of each input feature in the pair-wise difference from this reference (Shrikumar et al., 2017). It addresses limitations of gradient-based attribution methods in regimes with zero or discontinuous gradients. It backpropagates the contributions of neurons using multipliers, analogously to partial derivatives. We use a reference input (embeddings) of all zeros for our experiments.
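To make the gradient-based definitions above concrete, here is a minimal sketch, not the implementation behind the results reported above: it assumes a toy mean-pooled embedding classifier (ToyClassifier) as a stand-in for the BERT teacher, with hypothetical vocabulary size and example tokens, and only the attribution arithmetic (grad norm, grad × input, and a Riemann-sum approximation of integrated gradients with a zero baseline and 50 steps) mirrors the descriptions above.

import torch
import torch.nn as nn

class ToyClassifier(nn.Module):
    """Mean-pooled bag-of-embeddings classifier; a hypothetical stand-in for the teacher."""
    def __init__(self, vocab_size=100, dim=16, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, num_classes)

    def logits_from_embeddings(self, emb):              # emb: (seq_len, dim)
        return self.out(emb.mean(dim=0))                 # (num_classes,)

torch.manual_seed(0)
model = ToyClassifier()
tokens = torch.tensor([5, 17, 42, 8])                    # hypothetical tokenized input
inputs = model.embed(tokens).detach()                    # token representations, (seq_len, dim)
target = model.logits_from_embeddings(inputs).argmax()   # explain the predicted class

def target_logit(emb):
    return model.logits_from_embeddings(emb)[target]

# Grad norm and grad x input: one backward pass w.r.t. the token representations.
emb = inputs.clone().requires_grad_(True)
target_logit(emb).backward()
grad_norm = emb.grad.norm(dim=-1)                        # l2 norm of the gradient per token
grad_x_input = (emb.grad * inputs).sum(dim=-1)           # dot product of gradient and representation

# Integrated gradients: average gradients along the straight line from a zero
# baseline to the input, then scale by (input - baseline).
steps, baseline = 50, torch.zeros_like(inputs)
total_grad = torch.zeros_like(inputs)
for alpha in torch.linspace(1.0 / steps, 1.0, steps):
    point = (baseline + alpha * (inputs - baseline)).requires_grad_(True)
    target_logit(point).backward()
    total_grad += point.grad
integrated_grads = ((inputs - baseline) * total_grad / steps).sum(dim=-1)

print(grad_norm.tolist(), grad_x_input.tolist(), integrated_grads.tolist())

In practice, the toy model would be replaced by the trained teacher and the gradients taken w.r.t. its input embedding layer; the resulting per-token scores are then ranked to select the top-k% tokens used as explanations.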
Attention-based Explanations  Attention mechanisms were originally introduced by Bahdanau et al. (2015) to align source and target tokens in neural machine translation. Because attention mechanisms allocate weight among the encoded tokens, these coefficients are sometimes thought of intuitively as indicating which tokens the model focuses on when making a prediction. A sketch of selecting the top-k% tokens from such attention scores is given after the tables below.

Table 10: The Kendall rank correlation coefficient, τ, comparing rankings obtained through different settings of our metric. We also compute correlations with the sufficiency and comprehensiveness metrics from the ERASER benchmark (DeYoung et al., 2020). MTL and AR denote Multitask Learning and Attention Regularization. Values can range from −1.0 (perfect disagreement) to 1.0 (perfect agreement). Across different students and different learning strategies, the rankings obtained are highly correlated.

                         Attention Regularization        Multitask Learning
Value of k               5%     10%    20%    40%        5%     10%    20%    40%
LIME                     93.0   92.6   92.5   92.0       92.8   92.6   92.5   91.8
Gradient Norm            92.8   92.4   90.6   90.6       93.1   93.1   92.9   93.0
Gradient × Input         92.5   92.2   92.6   92.8       92.4   92.7   92.5   91.3
Layer Conductance        93.6   93.5   93.4   92.9       92.2   92.9   92.5   92.3
Integrated Gradients     94.1   93.6   93.6   93.1       93.4   93.3   93.1   92.1
Attention                94.7   95.2   95.3   94.6       94.0   94.4   94.7   94.9

Table 11: Simulation accuracy of a BERT-base student model, examining the effect of k in selecting top-k% explanatory tokens. The student model without explanations obtains a simulation accuracy of 92.6.
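As a companion to the attention-based explanations and the top-k% selection studied in Table 11, the following is a toy sketch: it assumes a single-head self-attention layer over random hidden states as a stand-in for the teacher's attention, and the top_k_percent helper is hypothetical; the actual experiments use BERT/SpanBERT teachers rather than this toy layer.

import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, dim = 8, 16
hidden = torch.randn(seq_len, dim)                      # encoder states for one example

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=1)
# Self-attention over the sequence; weights have shape (1, seq_len, seq_len).
_, weights = attn(hidden.unsqueeze(1), hidden.unsqueeze(1), hidden.unsqueeze(1),
                  need_weights=True)
token_scores = weights.squeeze(0).mean(dim=0)           # attention mass received per token

def top_k_percent(scores, k_percent):
    """Indices of the top-k% highest-scoring tokens (at least one token)."""
    k = max(1, round(len(scores) * k_percent / 100))
    return scores.topk(k).indices.sort().values

for k in (5, 10, 20, 40):                               # the k values studied in Table 11
    print(k, top_k_percent(token_scores, k).tolist())

With a multi-head, multi-layer teacher, the heads would first need to be aggregated (for example, averaged) before ranking tokens; the choice of aggregation is itself a design decision in this family of explanations.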
