Is My Model Using the Right Evidence? Systematic Probes for Examining
Evidence-Based Tabular Reasoning
Vivek Gupta1, Riyaz A. Bhat2, Atreya Ghosal3,
Manish Shrivastava3, Maneesh Singh2, Vivek Srikumar1
1University of Utah, USA, 2Verisk Inc., India, 3IIIT-Hyderabad, India
{vgupta, svivek}@cs.utah.edu, {riyaz.bhat, maneesh.singh}@verisk.com,
{atreyee.ghosal@research, m.shrivastava}@iiit.ac.in
Abstract
Neural models command state-of-the-art per-
formance across NLP tasks, including ones
involving ‘‘reasoning’’. Models claiming to
reason about the evidence presented to them
should attend to the correct parts of the input
while avoiding spurious patterns therein, be
self-consistent in their predictions across in-
puts, and be immune to biases derived from
their pre-training in a nuanced, context-
sensitive fashion. Do the prevalent *BERT-
family of models do so? In this paper, we
study this question using the problem of rea-
soning on tabular data. Tabular inputs are
especially well-suited for the study—they ad-
mit systematic probes targeting the properties
listed above. Our experiments demonstrate that
a RoBERTa-based model, representative of
the current state-of-the-art, fails at reason-
ing on the following counts: it (a) ignores
relevant parts of the evidence, (b) is over-
sensitive to annotation artifacts, and (c) relies
on the knowledge encoded in the pre-trained
language model rather than the evidence pre-
sented in its tabular inputs. Finally, through
inoculation experiments, we show that fine-
tuning the model on perturbed data does not
help it overcome the above challenges.
1 Introduction
The problem of understanding tabular or semi-
structured data is a challenge for modern NLP.
Recently, Chen et al. (2020b) and Gupta et al.
(2020) have framed this problem as a natural
language inference question (NLI, Dagan et al.,
2013; Bowman et al., 2015, inter alia) via the
TabFact and the INFOTABS datasets, respectively.
The tabular version of the NLI task seeks to de-
termine whether a tabular premise entails or con-
tradicts a textual hypothesis, or is unrelated to it.
One strategy for such tabular reasoning tasks
relies on the successes of contextualized represen-
tations (par exemple., Devlin et al., 2019; Liu et al., 2019b)
for the sentential version of the problem. Tables
are flattened into artificial sentences using heuris-
tics to be processed by these models. Surprisingly,
even this naïve strategy leads to high predictive
accuracy, as shown not only by the introductory
papers but also by related lines of recent work
(e.g., Eisenschlos et al., 2020; Yin et al., 2020).
In this paper, we ask: Do these seemingly ac-
curate models for tabular inference effectively
use and reason about their semi-structured in-
puts? While ‘‘reasoning’’ can take varied forms,
a model that claims to do so should at least ground
its outputs on the evidence provided in its inputs.
Concretely, we argue that such a model should
(a) be self-consistent in its predictions across con-
trolled variants of the input, (b) use the evidence
presented to it, and the right parts thereof, and,
(c) avoid being biased against
the given evi-
dence by knowledge encoded in the pre-trained
embeddings.
Corresponding to these three properties, we
identify three dimensions on which to evaluate a
tabular NLI system: robustness to annotation arti-
facts, relevant evidence selection, and robustness
to counterfactual changes. We design systematic
probes that exploit the semi-structured nature of
the premises. This allows us to semi-automatically
construct the probes and to unambiguously de-
fine the corresponding expected model response.
These probes either introduce controlled edits to
the premise or the hypothesis, or to both, thereby
also creating counterfactual examples. Experi-
ments reveal that despite seemingly high test set
accuracy, a model based on RoBERTa (Liu et al.,
2019b), a good representative of BERT deriva-
tive models, is far from being reliable. Not only
Transactions of the Association for Computational Linguistics, vol. 10, pp. 659–679, 2022. https://doi.org/10.1162/tacl_a_00482
Action Editor: Katrin Erk. Submission batch: 9/2021; Revision batch: 12/2021; Published 6/2022.
© 2022 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
does it ignore relevant evidence from its inputs, it
also relies excessively on annotation artifacts, in
particular the sentence structure of the hypothe-
sis, and pre-trained knowledge in the embeddings.
Finally, we found that attempts to inoculate the
model (Liu et al., 2019a) along these dimensions
degrade its overall performance.
The rest of the paper is structured as follows.
§2 introduces the Tabular NLI task, while §3
articulates the need for probing evidence-based
tabular reasoning in extant high-performing mod-
els. §4–§6 detail the probes designed for such an
examination and the results thereof, while §7 ana-
lyzes the impact of inoculation against the aforementioned
challenges through model fine-tuning. §8 presents
the main takeaways and contextualization in the
related art. §9 provides concluding remarks and
indicates future directions of work.1
2 Preliminaries: Tabular NLI
Tabular natural language inference is a task similar
to standard NLI in that it examines if a natural
language hypothesis can be derived from the given
premise. Unlike standard NLI, where the evidence
is presented in the form of sentences, the premises
in tabular NLI are semi-structured tables that may
contain both text and data.
Dataset Recently, datasets such as TabFact
(Chen et al., 2020b) and INFOTABS (Gupta et al.,
2020), and also shared tasks such as SemEval
2021 Task 9 (Wang et al., 2021a) and FEVER-
OUS (Aly et al., 2021), have sparked interest in
tabular NLI research. In this study, we use the
INFOTABS dataset for our investigations.
INFOTABS consists of 23,738 premise-
hypothesis pairs, whose premises are based on
Wikipedia infoboxes. Unlike TabFact, which
only contains ENTAIL and CONTRADICT hypotheses,
INFOTABS also includes NEUTRAL ones. Figure 1
shows an example table from the dataset with four
hypotheses, which will be our running example.
The dataset contains 2,540 distinct infoboxes
representing a variety of domains. All hypotheses
were written and labeled by Amazon MTurk work-
ers. The tables contain a title and two columns,
as shown in the example. Since each row takes
the form of a key-value pair, we will refer to the
1The dataset and the scripts used for our analysis are
available at https://tabprobe.github.io.
Chiffre 1: An example of a tabular premise from
INFOTABS. Hypothesis H1 is entailed by it, H2
contradicts it, and H3 and H4 are neutral (i.e., neither
entailed nor contradictory).
elements in the left column as the keys, and the
right column provides the corresponding values.
In addition to the usual train and development
sets, INFOTABS includes three test sets, α1, α2, and
α3. The α1 set represents a standard test set that is
both topically and lexically similar to the training
data. In the α2 set, hypotheses are designed to be
lexically adversarial, and the α3 tables are drawn
from topics not present in the training set. We will
use all three test sets for our analysis.
Models over Tabular Premises Unlike stan-
dard NLI, which can use off-the-shelf pre-trained
contextualized embeddings, the semi-structured
nature of premises in tabular NLI necessitates a
different modeling approach.
Following Chen et al. (2020b), tabular premises
are flattened into token sequences that fit the
input interface of such models. While different
flattening strategies exist in the literature, we adopt
the Table as a Paragraph strategy of Gupta et al.
(2020), where each row is converted to a sentence
of the form ‘‘The key of title is value’’. This
seemingly naïve strategy, with RoBERTa-large
embeddings (RoBERTaL henceforth), achieved
the highest accuracy in the original work, shown in
Model | dev | α1 | α2 | α3
Human | 79.78 | 84.04 | 83.88 | 79.33
Hypothesis Only | 60.51 | 60.48 | 48.26 | 48.89
RoBERTaL | 75.55 | 74.88 | 65.55 | 64.94
5xCV | 73.59(2.3) | 72.41(1.4) | 63.02(1.9) | 61.82(1.4)
Table 1: Results of the Table as a Paragraph strategy on INFOTABS subsets with the RoBERTaL model, the hypothesis-only baseline, and majority human agreement. The first three rows are reproduced from Gupta et al. (2020). The last row reports the average performance (and standard deviation as a subscript) of models obtained via five-fold cross validation (5xCV).
Tableau 1.2 The table also shows the hypothesis-only
baseline (Poliak et al., 2018; Gururangan et al.,
2018) and human agreement on the labels.3
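To make the Table as a Paragraph flattening concrete, here is a minimal sketch of the strategy described above; the dictionary-based table representation, the function name, and the example rows are our own illustration of the running example, not the released preprocessing code.

```python
# A minimal sketch of the "Table as a Paragraph" flattening (illustrative only).

def flatten_table(title: str, rows: dict) -> str:
    """Turn each key-value row into the sentence 'The <key> of <title> is <value>.'"""
    return " ".join(f"The {key} of {title} is {value}." for key, value in rows.items())

# The flattened premise is then paired with the hypothesis and fed to a
# RoBERTa-style sentence-pair classifier over {ENTAIL, NEUTRAL, CONTRADICT}.
premise = flatten_table(
    "Breakfast in America",
    {"Genre": "pop", "Length": "46:06"},  # rows paraphrasing the running example
)
hypothesis = "Breakfast in America is a pop album of 46 minutes length."
print(premise)
# The Genre of Breakfast in America is pop. The Length of Breakfast in America is 46:06.
```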
To study the stability of the models to vari-
ations in the training data, we performed 5-fold
cross validation (5xCV). An average cross valida-
tion accuracy of 73.53% with a standard deviation
of 2.73% was observed on the training set, which
is close to the performance on the α1 test set
(74.88%). In addition, we also evaluated perfor-
mance on the development and test sets. The
penultimate row of Table 1 presents the perfor-
mance for the model trained on the entire training
data, and the last row presents the performance
of the 5xCV models. The results demonstrate
that model performance is reasonably stable to
variations in the training set.
3 Reasoning: An Illusion?
Given the surprisingly high accuracies in Table 1,
especially on the α1 test dataset, can we conclude
that the RoBERTa-based model reasons effec-
tively about the evidence in the tabular input to
make its inference? That is, does it arrive at its
answer via a sound logical process that takes into
account all available evidence along with common sense knowledge?
2Other flattening strategies have similar performance
(Gupta et al., 2020).
3Preliminary experiments on the development set showed
that RoBERTaL outperformed other pre-trained embeddings.
We found that BERTB, RoBERTaB, BERTL, ALBERTB,
and ALBERTL reached development set accuracies of 63.0%,
67.23%, 69.34%, 70.44%, and 70.88%, respectively. While
we have not replicated our experiments on these other models
due to a prohibitively high computational cost, we expect the
conclusions to carry over to these other models as well.
Merely achieving high
accuracy is not sufficient evidence of reasoning:
The model may arrive at the right answer for the
wrong reasons leading to improper and inadequate
generalization over unseen data. This observation
is in line with the recent work pointing out that
the high-capacity models we use may be relying
on spurious correlations (e.g., Poliak et al., 2018).
‘‘Reasoning’’ is a multi-faceted phenomenon,
and fully characterizing it is beyond the scope of
this work. Cependant, we can probe for the ab-
sence of evidence-grounded reasoning via model
responses to carefully constructed inputs and their
variants. The guiding premise for this work is:
Any ‘‘evidence-based reasoning’’ system
should demonstrate expected, predict-
able behavior in response to controlled
changes to its inputs.
In other words, ‘‘reasoning failures’’ can be
identified by checking if a model deviates from ex-
pected behavior in response to controlled changes
to inputs. We note that this strategy has been ei-
ther explicitly or implicitly employed in several
lines of recent work (Ribeiro et al., 2020; Gardner
et coll., 2020). In this work, we instantiate the above
strategy along three specific dimensions, briefly
introduced here using the running example in Fig-
ure 1. Each dimension is used to define several
concrete probes that subsequent sections detail.
1. Avoiding Annotation Artifacts A model
should not rely on spurious lexical correlations. In
general, it should not be able to infer the label using
only the hypothesis. Lexical differences in closely
related hypotheses should produce predictable
changes in the inferred label. For example, in
the hypothesis H2 of Figure 1, if the token ‘‘end’’
is replaced with ‘‘start’’, the model prediction
should change from CONTRADICT to ENTAIL.
2. Evidence Selection A model should use the
correct evidence in the premise for determining
the hypothesis label. For example, ascertaining
that the hypothesis H1 is entailed requires the
Genre and Length rows of Figure 1. When a
relevant row is removed from a table, a model that
predicts the ENTAIL or the CONTRADICT label should
predict the NEUTRAL label. When an irrelevant row
is removed, it should not change its prediction
from ENTAIL to NEUTRAL or vice versa.
3. Robustness to Counterfactual Changes A
model’s prediction should be grounded in the
provided information even if it contradicts the
real world, that is, to counterfactual information.
For example, if the month of the Released date
changed to ‘‘December’’, then the model should
change the label of H2 in Figure 1 to ENTAIL
from CONTRADICT. Since this information about
release date contradicts the real world, the model
cannot rely on its pre-trained knowledge, say,
from Wikipedia. For the model to predict the label
correctly, it needs to reason with the information
in the table as the primary evidence. Although
the importance of pre-trained knowledge cannot
be overlooked, it must not be at the expense of
primary evidence.
Further, there are certain pieces of information
in the premise (irrelevant to the hypothesis) that
do not impact the outcome, making the outcome
invariant to these changes. For example, delet-
ing irrelevant rows from the premise should not
change the model’s predicted label. Contrary to
this is the relevant information (‘‘evidence’’) in
the premise. Changing these pieces of informa-
tion should vary the outcome in a predictable
manner, making the model covariant with these
changes. For example, deleting relevant evidence
rows should change the model’s predicted label
to NEUTRAL.
The three dimensions above are not limited to
tabular inference. They can be extended to other
NLP tasks, such as reading comprehension as well
as the standard sentential NLI. However, directly
checking for such properties would require consid-
erable labeled data—a big practical impediment.
Fortunately, in the case of tabular inference, the
(in-/co-)variants associated with these dimensions
allow controlled and semi-automatic edits to the
inputs leading to predictable variation of the ex-
pected output. This insight underlies the design of
probes using which we examine the robustness of
the reasoning employed by a model performing
tabular inference. As we will see in the following
sections, highly effective and precise probes can
be designed without extensive annotation.
4 Probing Annotation Artifacts
Can a model make an inference about a hypothesis without a premise? It is natural to answer in the negative in general. (Of course, certain hypotheses may admit strong priors, e.g., tautologies.)
Preliminary experiments by Gupta et al. (2020)
on INFOTABS, however, reveal that a model trained
just on hypotheses performs surprisingly well on
the test data. This phenomenon, an inductive bias
entirely predicated on the hypotheses, is called
hypothesis bias. Models for other NLI tasks have
been similarly shown to exhibit hypothesis bias,
whereby the models learn to rely on spurious
correlations between patterns in the hypotheses
and corresponding labels (Poliak et al., 2018;
Gururangan et al., 2018; Geva et al., 2019, and
others). Par exemple, negations are observed to be
highly correlated with contradictions (Niven and
Kao, 2019).
To better characterize a model’s reliance on
such artifacts, we perform controlled edits to
hypotheses without altering associated premises.
Unlike the α2 set, which includes minor changes
to function words, we aim to create more sophis-
ticated changes by altering content expressions or
noun phrases in a hypothesis. Two possible sce-
narios arise where a hypothesis alteration, without
a change in the premise, either (un) leads to a change
in the label (i.e., the label covaries with the vari-
ation in the hypothesis), ou (b) does not induce
a label change (i.e., the label is invariant to the
variation in the hypothesis).
In INFOTABS, a set of reasoning categories is
identified to characterize the relationship between
a tabular premise and a hypothesis. We use a
subset of these, listed below, to perform controlled
changes in the hypotheses.
• Named Entities: such as Person, Location,
Organization;
• Nominal modifiers: nominal phrases or
clauses;
• Negation: markers such as no, not;
• Numerical Values: numeric expressions
representing weights, percentages, domaines;
• Temporal Values: Date and Time; and,
• Quantifiers: like most, many, every.
Although we can easily track these expressions
in a hypothesis using tools like entity recogniz-
ers and parsers, it is non-trivial to automatically
modify them with a predictable change on the hy-
pothesis label. For example, some label changes
can only be controlled if the target expression in
Preposition | Upward Monotonicity | Downward Monotonicity
over | CONTRADICT | ENTAIL
under | ENTAIL | CONTRADICT
more than | CONTRADICT | ENTAIL
less than | ENTAIL | CONTRADICT
before | ENTAIL | CONTRADICT
after | CONTRADICT | ENTAIL
Table 2: Monotonicity properties of prepositions.
the hypothesis is correctly aligned with the facts
in the premise. Such cases include CONTRADICT to
ENTAIL, and NEUTRAL to CONTRADICT or ENTAIL,
which are difficult without extensive expression-
level annotations. Nevertheless, in several cases,
label changes can be deterministically known even
with imprecise changes in the hypothesis. For ex-
ample, we can convert a hypothesis from ENTAIL
to CONTRADICT by replacing a named entity in the
hypothesis with a random entity of the same type.
Hence we adopt the following strategy: (a) We
avoid perturbations involving the NEUTRAL la-
bel altogether, as they often need changes in the
premise (table) as well. (b) We generate all label-
preserving and some label-flipping transforma-
tions automatically using the approach described
below. (c) We annotate the CONTRADICT to ENTAIL
label-flipping perturbations manually.
Automatic Generation of Label-preserving
Transformations To automatically perturb hy-
potheses, we leverage the syntactic structure of
a hypothesis and the monotonicity properties of
function words like prepositions. First, we per-
form syntactic analysis of a hypothesis to identify
named entities and their relations to title expres-
sions via dependency paths.4 Then, based on the
entity type, we either substitute or modify them.
Named entities such as person names and loca-
tions are substituted with entities of the same type.
Expressions containing numbers are modified us-
ing the monotonicity property of the prepositions
(or other function words) governing them in their
corresponding syntactic trees.
Given the monotonicity property of a prepo-
sition (see Table 2), we modify its governing
numerical expression in a hypothesis in the same
order to preserve the hypothesis label. Consider
hypothesis H5 in Figure 2, which contains a prepo-
sition (over) with upward monotonicity. Because
of upward monotonicity, we can increase the num-
ber of hours in H5 without altering the label.
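As an illustration of the automatic perturbations just described, the sketch below substitutes named entities for the label-flipping case and moves numbers in the label-preserving direction given by Table 2. The replacement-entity pool, the two-token context window, and the doubling/halving step size are our own simplifications; the paper only states that spaCy (v2.3.2) is used for the underlying syntactic analysis, and the example hypothesis wording is ours.

```python
# Illustrative sketch of the automatic hypothesis perturbations; assumes the
# small English spaCy model is installed. Pools and step sizes are our own.
import random
import spacy

nlp = spacy.load("en_core_web_sm")

# From Table 2: for each function word, the direction in which its governed
# number can be moved while preserving the given gold label (E/C).
PRESERVING_DIRECTION = {
    "E": {"over": "down", "under": "up", "more": "down",
          "less": "up", "before": "up", "after": "down"},
    "C": {"over": "up", "under": "down", "more": "up",
          "less": "down", "before": "down", "after": "up"},
}
REPLACEMENT_POOL = {"PERSON": ["Marie Curie"], "GPE": ["Norway"], "ORG": ["UNESCO"]}

def substitute_entity(hypothesis: str) -> str:
    """Label-flipping edit: swap a named entity for a random entity of the same type."""
    doc = nlp(hypothesis)
    for ent in doc.ents:
        if ent.label_ in REPLACEMENT_POOL:
            return hypothesis.replace(ent.text, random.choice(REPLACEMENT_POOL[ent.label_]))
    return hypothesis

def perturb_number(hypothesis: str, gold_label: str) -> str:
    """Label-preserving edit: move a number in the direction allowed by Table 2."""
    doc = nlp(hypothesis)
    for tok in doc:
        if not (tok.like_num and tok.text.isdigit()):
            continue
        # Look at the function words just before the number ("over", "more than", ...);
        # the paper uses the dependency path instead of this two-token window.
        for word in (t.text.lower() for t in doc[max(0, tok.i - 2):tok.i]):
            direction = PRESERVING_DIRECTION[gold_label].get(word)
            if direction == "up":
                return hypothesis.replace(tok.text, str(int(tok.text) * 2), 1)
            if direction == "down":
                return hypothesis.replace(tok.text, str(max(1, int(tok.text) // 2)), 1)
    return hypothesis

# A CONTRADICT hypothesis in the style of H5 (wording ours): "over" is upward
# monotone for CONTRADICT, so the number of hours is increased.
print(perturb_number("Bridesmaids is over 2 hours long.", gold_label="C"))
```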
Chiffre 2: Hypothesis H5 contradicts the premise.
Type of Modification | Perturbed Hypothesis
Nominal Modifier | H1 (E→E): Breakfast in America which was produced by Peter Henderson is a pop album of 46 minutes length.
Temporal Expression | H1 (E→C): Breakfast in America is a pop album with a length of 56 minutes.
Negation | H2 (C→E): Breakfast in America was not released towards the end of 1979.
Temporal Expression | H2 (C→C): Breakfast in America was released towards the end of 1989.
Table 3: Example hypothesis perturbations for the running example from Figure 1. Each perturbed hypothesis is marked with its gold label followed by its label after perturbation (E = ENTAIL, C = CONTRADICT); the perturbed expressions appear in place of the original ones.
Manual Annotation of Label-flipping Trans-
formations Note that
in the above example,
modifying the numerical expression in the reverse
direction (par exemple., decreasing the number of hours)
does not guarantee a label flip. We need to know
the premise to be accurate. During the experi-
ments, we observed that a large step (half/twice
the actual number) suffices in most cases. We used
this heuristic and manually curated the erroneous
cases. In addition, all the cases of CONTRADICT
to ENTAIL label-flipping perturbations were anno-
tated manually.5
We generated 2,891 perturbed examples from
the α1 set with 1,203 instances preserving the label
and 1,688 instances flipping it. We also generated
11,550 examples from the Train set, with 4,275
preserving and 7,275 flipping the label. Some
example perturbations using different types of
expressions are listed in Table 3. It should be noted
that there may not be a one-to-one correspondence
between the gold and perturbed examples, as a
hypothesis may be perturbed numerous times or
not at all. Consequently, in order for the results to be
comparable, a single perturbed example must be
sampled for each gold example: we sampled 967
from the α1 set and 4,274 from the Train set.
4We used spaCy v2.3.2 for the syntactic analysis.
5Annotation was done by an expert well versed in the NLI task.
Model | Original mean(stdev) | Label Preserved | Label Flipped
Train Set (w/o NEUTRAL)
Prem+Hypo | 99.44(0.06) | 92.98(0.20) | 53.92(0.28)
Hypo-Only | 96.39(0.13) | 70.23(0.35) | 19.23(0.27)
α1 Set (w/o NEUTRAL)
Prem+Hypo | 68.94(0.76) | 69.56(0.77) | 51.48(0.86)
Hypo-Only | 63.52(0.75) | 60.27(0.85) | 31.02(0.63)
Table 4: Results of the Hypothesis-only model and Prem+Hypo model on the gold and perturbed hypotheses.
Results and Analysis We tested the hypothesis-
only and full models (both trained on the original
Train set) on the perturbed examples, without
subsequent fine-tuning on the perturbed exam-
ples.6 The results are presented in Table 4, with
each cell representing the average accuracy and
standard deviation (subscript) across 100 sam-
plings, with 80% of the data selected at random in
each sampling.
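This resampling protocol can be computed with a routine like the sketch below; the fixed seed and the function name are our own additions for the illustration.

```python
# Sketch of the evaluation protocol behind Table 4: average accuracy and its
# standard deviation over 100 random subsamples of 80% of the examples.
import random
from statistics import mean, stdev

def sampled_accuracy(gold, pred, n_rounds=100, frac=0.8, seed=0):
    rng = random.Random(seed)
    indices = list(range(len(gold)))
    scores = []
    for _ in range(n_rounds):
        sample = rng.sample(indices, int(frac * len(indices)))
        scores.append(sum(gold[i] == pred[i] for i in sample) / len(sample))
    return 100 * mean(scores), 100 * stdev(scores)
```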
We note that the performance degrades substan-
tially in both label-preserved and flipped settings
when the model is trained on just the hypotheses.
When labels are flipped after perturbations, the
decrease in performance (averaged across both
models) is about 25% and 61% points, on the α1
set and Train set, respectively. However, for the
full model, perturbations that retain the hypothesis
label have little effect on model performance.
The contrast in the performance drop between
the label-preserved and label-flipped cases sug-
gests that changes to the content expressions have
little effect on the model’s original predictions. In-
terestingly, the predictions are invariant to changes
to functions words as well, as per results on α2 in
Gupta et al. (2020). This suggests that the model
might be more sensitive to the template or
structure of a hypothesis than to its lexical makeup.
Consequently, a model that relies on correlations
between the hypothesis structure and the label is
expected to suffer on the label-flipped cases. In
case of label-preserving perturbations of similar
kind, structural correlations between the hypothe-
sis and the label are retained leading to minimal
drop in model performance.
The results of the hypothesis-only model on
the Train set may appear slightly surprising at
d'abord. Cependant, given that the model was trained
6We analyze the impact of fine-tuning on perturbed
examples in §7.
on this dataset, it seems reasonable to assume
that the model has ‘‘overfit’’ to the training data.
Therefore, the model is expected to be vulnerable
even to slight label-preserving modifications to
the examples it was trained on, leading to the huge
drop of 26%. In the same setting, for the α1 set the
performance drop is smaller, namely, about 3%.
Taken together, we can conclude from these re-
sults that the model ignores the information in the
hypotheses (thereby perhaps also the aligned facts
in the premise), and instead relies on irrelevant
structural patterns in the hypotheses.
5 Probing Evidence Selection
Predictions of an NLI model should primarily be
based on the evidence in the premise, that is, on
the facts relevant to the hypothesis. For a tabu-
lar premise, rows containing the evidence neces-
sary to infer the associated hypothesis are called
relevant rows. Short-circuiting the evidence in rel-
evant rows for inference using annotation artifacts
as suggested in §4 or other spurious artifacts in
irrelevant rows of the table is expected to lead to
poor generalization over unseen data.
To better understand the model’s ability to
select evidence in the premise, we use two kinds
of controlled edits: (a) automatic edits without any
information about relevant rows, and (b) semi-
automatic edits using knowledge of relevant rows
via manual annotation. The rest of the section goes
over both scenarios in detail. All experiments in
this section use the full model that is trained on
both premises and their associated hypotheses.
5.1 Automatic Probing
We define four kinds of table modifications that
are agnostic to the relevance of rows to a hy-
pothesis: (a) row deletion, (b) row insertion,
(c) row-value update, that is, changing existing
information, and (d) row permutation, that is, re-
ordering rows. Each modification allows certain
desired (valid) changes to model predictions.7 We
examine below the case of row deletion in detail
and refer the reader to the Appendix for the others.
Row deletion should lead to the following de-
sired effects: (a) If the deleted row is relevant to
the hypothesis (e.g., Length for H1), the model
7In performing these modifications, we made sure
that the modified table does not become inconsistent or
self-contradicting.
Dataset | ENTAIL | NEUTRAL | CONTRADICT | Average
α1 | 5.76 | 4.43 | 3.23 | 4.47
α2 | 7.26 | 3.91 | 3.70 | 4.96
α3 | 5.01 | 5.24 | 3.01 | 4.42
Average | 6.01 | 4.53 | 3.31 | –
Table 5: Percentage of invalid transitions after row deletion. For an ideal model, all these numbers should be zero.
side, the model mostly retains its predictions on
row-value update operations. We refer the reader
to the Appendix for more details.
5.2 Manual Probing
Row modification for automatic probing in §5.1
is agnostic to the relevance of the row to a given
hypothesis. Since only a few rows (one or two)
are relevant to the hypothesis, the probing skew
towards hypothesis-unrelated rows weakens the
investigations into the evidence-grounding capa-
bility of the model. Knowing the relevance of
rows allows for the creation of stronger probes.
Par exemple, if a relevant row is deleted, the EN-
TAIL and CONTRADICT predictions should change to
NEUTRAL. (Recall that after deleting an irrelevant
row the model should retain its original label.)
Probing by altering or deleting relevant rows re-
quires human annotation of relevant rows for each
table-hypothesis pair. We used MTurk to annotate
the relevance of rows in the development and the
test sets, with turkers identifying the relevant rows
for each table-hypothesis pair.
Inter-annotator Agreement. We employed
majority voting to derive ground truth labels from
multiple annotations for each row. The inter-
annotator agreement macro F1 score for each of
the four datasets is over 90% and the average
Fleiss’ kappa is 0.78 (std: 0.22). This suggests good
inter-annotator agreement. In 82.4% of cases,
at least 3 out of 5 annotators marked the same
relevant rows.
Results and Analysis We examined the re-
sponse of the model when relevant rows are de-
leted. Figure 4 shows the label transitions. The
fact that even after the deletion of relevant rows,
ENTAIL and CONTRADICT predictions don’t change
to NEUTRAL a large percentage of times (mostly
the original label remains unchanged and at other
Chiffre 3: Changes in model predictions after automatic
row deletion. Directed edges are labeled with transition
percentages from the source node label to the target
node label. The number triple corresponds to α1, α2,
and α3 test sets, respectively and, for each source
node, adds up to 100% over the outgoing edges. Red
lines represent invalid transitions. Dashed and solid
black lines represent valid transitions for irrelevant
and relevant row deletion respectively. * represents
valid transitions with either row deletions.
prediction should change to NEUTRAL. (b) If the
deleted row is irrelevant (e.g., Producer for H1),
the model should retain its original prediction.
NEUTRAL predictions should remain unaffected by
row deletion.
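These expectations can be encoded directly as a small probing routine; in the sketch below, predict stands in for the trained premise+hypothesis classifier (returning "E", "C", or "N"), and the key-value dictionary table format is our own illustration.

```python
# Sketch of the row-deletion probe and the transition-validity checks described above.

def delete_row(table: dict, key: str) -> dict:
    return {k: v for k, v in table.items() if k != key}

def deletion_is_valid(before: str, after: str, row_is_relevant: bool) -> bool:
    """Expected behavior when the relevance of the deleted row is known (manual probing)."""
    if before == "N":                 # NEUTRAL predictions should be unaffected
        return after == "N"
    if row_is_relevant:               # evidence removed: E/C should become N
        return after == "N"
    return after == before            # irrelevant row: prediction should not change

def deletion_is_invalid_unknown_relevance(before: str, after: str) -> bool:
    """For automatic probing (relevance unknown), only E<->C and N->E/C are clearly invalid."""
    return before != after and not (before in {"E", "C"} and after == "N")

def probe_row_deletion(predict, table: dict, hypothesis: str, key: str, relevant: bool):
    before = predict(table, hypothesis)
    after = predict(delete_row(table, key), hypothesis)
    return before, after, deletion_is_valid(before, after, relevant)
```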
Results and Analysis We studied the impact
of row deletion on the α1, α2, and α3 test sets.
Figure 3 shows aggregate changes to labels after
row deletions as a directed labeled graph. The
nodes in this graph represent the three labels in
INFOTABS, and the edges denote transitions after
row deletion. The source and end nodes of an
edge represent predictions before and after the
modification.
We see that the model makes invalid transi-
tions in all three datasets. Table 5 summarizes
the invalid transitions by aggregating them over
the label originally predicted by the model. The
percentage of invalid transitions is higher for EN-
TAIL predictions than for CONTRADICT and NEUTRAL.
After row deletion, many ENTAIL examples are in-
correctly transitioning to CONTRADICT rather than
to NEUTRAL. The opposite trend is observed for the
CONTRADICT predictions.
As with row deletion, the model exhibits invalid
responses to other row modifications listed in
the beginning of this section, like row insertion.
Surprisingly, the performance degrades due to
row permutations as well, suggesting some form
of position bias in the model. On the positive
with ENTAIL and CONTRADICT labels, where delet-
ing a relevant row should change the prediction
to NEUTRAL.9
Results and Analysis On the human-annotated
relevant rows, the model has an average precision
de 41.0% and a recall of 40.9%. Further analysis
reveals that the model (a) uses all relevant rows
in 27% of cases, (b) uses incorrect or no rows as
evidence in 52% of occurrences, and (c) is only
partially accurate in identifying relevant rows in
the remaining 21% of examples. Upon further
analyzing the cases in (b), we observed that the
model actually ignores premises completely in
88% (of 52%) of cases. This accounts for 46%
(absolute) of all occurrences. In comparison, in
the human-annotated data, such cases only amount
to < 2%.
Although the model’s predictions are 70% cor-
rect in the 4,600 examples, only 21% can be
attributed to using all relevant evidence. The cor-
rect label in 37% of the 4,600 examples is from
irrelevant rows, while the remaining 12% of correct
predictions use some, but not all, relevant rows.
We can conclude from the findings in this sec-
tion that the model does not seem to need all the
relevant evidence to arrive at its predictions, rais-
ing questions about trust in its predictions.
6 Probing with Counterfactual Examples
Since INFOTABS is a dataset of facts based on
Wikipedia, pre-trained language models such as
RoBERTa, trained on Wikipedia and other pub-
licly available text, may have already encountered
information in INFOTABS during pre-training. As a
result, NLI models built on top of RoBERTaL can
learn to infer a hypothesis using the knowledge
of the pre-trained language model. More specifi-
cally, the model may be relying on ‘‘confirmation
bias’’, in which it selects evidence/patterns from
both premise and hypothesis that matches its prior
knowledge. While world knowledge is necessary
for table NLI (Neeraja et al., 2021), models should
still treat the premise as the primary evidence.
Counterfactual examples can help test whether
the model is grounding its inference on the ev-
idence provided in the tabular premise. In such
examples, the tabular premise is modified such
that the content does not reflect the real world. In
9We did not include the 2,400 NEUTRAL example pairs
and the ambiguous 200 ENTAIL or CONTRADICT examples that
had no relevant rows as per the consensus annotation.
Figure 4: Changes in model predictions after deletion
of relevant rows. Red lines represent invalid transitions
while black lines represent valid transitions. The di-
rected edges are labeled in the same manner as they are
in Figure 3.
Dataset | ENTAIL | NEUTRAL | CONTRADICT | Average
α1 | 75.41 | 8.39 | 77.02 | 53.60
α2 | 74.70 | 6.58 | 81.10 | 54.14
α3 | 77.31 | 8.01 | 77.80 | 54.35
Average | 75.80 | 7.66 | 78.64 | –
Table 6: Percentage of invalid transitions follow-
ing deletion of relevant rows. For an ideal model,
all these numbers should be zero.
times, it changes incorrectly), indicates that the
model is likely utilizing spurious statistical pat-
terns in the data for making the prediction.
We summarize the combined invalid transitions
for each label in Table 6. We see that the percent-
age of invalid transitions is considerably higher
compared to random row deletion in Figure 3.8
The large percentage of invalid transitions in the
ENTAIL and CONTRADICT cases indicates a rather
high utilization of spurious statistical patterns by
the model to arrive at its answers.
5.3 Human vs Model Evidence Selection
We further analyze the model’s capability for se-
lecting relevant evidence by comparing it with
human annotators. All rows that alter the model
predictions during automatic row deletion are con-
sidered as model relevant rows and are compared
to the human-annotated relevant rows. We only
consider the subset of 4,600 (from 7,200 anno-
tated dev/test sets pairs) hypothesis-table pairs
8Note that the dashed black lines from Figure 3 are now
red in Figure 4, indicating invalid transitions.
Figure 5: Counterfactual table-hypothesis pair created
from Figure 1 and Figure 2. Only the values of ‘Length’
rows are swapped; the rest of the rows from Figure 1 are
copied over.
this study, we limit ourselves to modifying only
the ENTAIL and CONTRADICT examples. We omit
the NEUTRAL cases because the majority of them in
INFOTABS involve out-of-table information; pro-
ducing counterfactuals for them is much harder
and involves the laborious creation of new rows
with the right information.
The task of creating counterfactual tables pre-
sents two challenges. First, the modified tables
should not be self-contradictory. Second, we need
to determine the labels of the associated hypo-
theses after the table is modified. We employ a
simple approach to generate counterfactuals that
addresses both challenges. We use the evidence se-
lection data (§5.2) to gather all premise-hypothesis
pairs that share relevant keys such as ‘‘Born’’,
‘‘Occupation’’, and so forth. Counterfactual ta-
bles are generated by swapping the values of rel-
evant keys from one table to another.10
Figure 5 shows an example. We create counter-
factuals from the premises in Figure 1 and Figure 2
by swapping their Length rows. We also swap the
hypotheses (H1 and H5) aligned to the Length
rows in both premises by replacing the title ex-
pression Bridesmaids in H5 with Breakfast in
America and vice versa. The simple procedure en-
10There may still be a few cases of self-contradiction, but
we expect that such invalid cases would not exist in the rows
that are relevant to the hypothesis.
Figure 6: A counterfactual tabular premise and the as-
sociated hypotheses created from Figures 1 and 2. The
hypothesis H̃1 is entailed by the premise, H̃2 con-
tradicts it, and H̃3 and H̃4 are neutral.
sures that the hypotheses labels are left unchanged
in the process, resulting in high-quality data.
In addition, we also generated counterfactuals
by swapping the table title and associated expres-
sions in the hypotheses with the title of another ta-
ble, resulting in a counterfactual table-hypothesis
pair, as in the row swapping strategy. Figure 6
shows an example created from the premises in
Figure 1 and Figure 2 by swapping the title rows
Breakfast in America and Bridesmaids. The title
expression in all hypotheses in Figure 1 are also
replaced by Bridesmaids. This strategy also pre-
serves the hypothesis label similar to row swapping.
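The sketch below illustrates the two label-preserving constructions: swapping the value of a shared relevant key between two tables, and retargeting the aligned hypotheses at the other table's title. The table format and the concrete values are our own illustration of the Figure 5 and Figure 6 examples.

```python
# Illustrative sketch of label-preserving counterfactual construction.

def swap_key_values(table_a: dict, table_b: dict, key: str):
    """Exchange the value of a shared relevant key (e.g., 'Length') between two premises."""
    new_a, new_b = dict(table_a), dict(table_b)
    new_a[key], new_b[key] = table_b[key], table_a[key]
    return new_a, new_b

def retarget_hypothesis(hypothesis: str, old_title: str, new_title: str) -> str:
    """Point a hypothesis aligned to one table at the other table's title."""
    return hypothesis.replace(old_title, new_title)

# Rough analogue of Figures 5 and 6 (values illustrative): swap the Length rows
# of the two premises and move each aligned hypothesis to the other title, so
# the gold labels are unchanged by construction.
album = {"Length": "46:06"}
movie = {"Length": "125 minutes"}
album_cf, movie_cf = swap_key_values(album, movie, "Length")
h1_cf = retarget_hypothesis(
    "Breakfast in America is a pop album of 46 minutes length.",
    "Breakfast in America", "Bridesmaids",
)
```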
The above approaches are Label Preserving as
they do not alter the entailment labels. Counter-
factual pairs with flipped labels are important for
filtering out the contribution of artifacts or other
spurious correlations that originate from a hypo-
thesis (see §4). So, in addition, we also created
counterfactual table-hypothesis pairs where the
original labels are flipped. These counterfactual
cases are, however, non-trivial to generate auto-
matically, and are therefore created manually.
To create the Label-Flipped counterfactual data,
three annotators manually modified tables from
Model | Original mean(stdev) | Label Preserved | Label Flipped
Train Set (without NEUTRAL)
Prem+Hypo | 94.38(0.39) | 78.53(0.65) | 48.70(0.72)
Hypo-Only | 99.94(0.06) | 82.23(0.65) | 00.06(0.01)
α1 Set (without NEUTRAL)
Prem+Hypo | 71.99(0.69) | 69.65(0.78) | 44.01(0.72)
Hypo-Only | 60.89(0.76) | 58.19(0.91) | 27.68(0.65)
Table 7: Results of the Hypothesis-only and
Prem+Hypo models on the gold and counterfac-
tual examples.
the Train and α1 datasets corresponding to ENTAIL
and CONTRADICT labels, producing 885 counter-
factual examples from the Train set and 942
from the α1 set. The annotators cross-checked the
labels to determine annotation accuracy, which
was 88.45% for the Train set and 86.57% for
the α1 set.
Results and Analysis We tested both hypothesis-
only and full
(Prem+Hypo) models on the
counterfactual examples created above, without
fine-tuning on a subset of these examples. The re-
sults are presented in Table 7 where each cell rep-
resents average accuracy and standard deviation
(subscript) over 100 sets of 80% randomly sam-
pled counterfactual examples. We see that the
(Prem+Hypo) model is not robust to counter-
factual perturbations. On the label-flipped coun-
terfactuals, the performance drops down to close
to a random prediction (48.70% for Train and
44.01% for α1). The performance on the label-
preserved counterfactuals is relatively better which
leads us to conjecture that the model largely ex-
ploits artifacts in hypotheses.
Due to over-fitting, the Train set has a larger
drop of 15.85%, compared to only 2.70% on the
α1 set on the label-preserved examples. Moreover,
the drop in performance for both Prem+Hypo and
Hypo-Only models is comparable to their perfor-
mance drop on the original table-hypothesis pairs.
This shows that, regardless of whether the rele-
vant information in the premise is accurate, both
models rely substantially on hypothesis artifacts.
On the Label-Flipped counterfactuals, the large
drop in accuracy could be due to both ambiguous
hypothesis artifacts or counterfactual information.
To disentangle these two factors, we can take
advantage of the fact that the counterfactual ex-
amples are constructed from, and hence paired
Prem+Hypo (C-THP) | Prem+Hypo (O-THP) | Hypo-Only (O-Hypo) | Train | α1
✓ | ✗ | ✗ | 0.00 | 11.43
✗ | ✓ | ✗ | 0.00 | 11.79
✓ | ✗ | ✓ | 3.57 | 6.48
✗ | ✓ | ✓ | 49.36 | 33.12
Table 8: Performance of the full and hypothesis-only models on the original and counterfactual examples. O-THP and C-THP represent original and counterfactual table-hypothesis pairs; O-Hypo represents hypotheses from the original data; ✓ represents correct predictions and ✗ represents incorrect predictions.
with, the original examples. This allows us to
examine pairs of examples where the full model
makes an incorrect prediction on one, but not the
other. Especially of interest are the cases where
the full model makes a correct prediction on the
original example, but not on the corresponding
counterfactual example.
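A sketch of this paired analysis is shown below: predictions on each counterfactual example are compared with those on its original counterpart, keeping only pairs where the full model is wrong on exactly one side, and grouping them by the (C-THP, O-THP, O-Hypo) correctness pattern used in Table 8. The boolean-list interface and the normalization over all pairs are our reading of the setup, not the released analysis code.

```python
# Sketch of the paired breakdown behind Table 8. Inputs are aligned lists of
# booleans: whether the full model is correct on the counterfactual (C-THP) and
# original (O-THP) pair, and whether the hypothesis-only model is correct on the
# original hypothesis (O-Hypo).
from collections import Counter

def pair_breakdown(full_ok_cf, full_ok_orig, hypo_ok_orig):
    counts = Counter()
    for fc, fo, ho in zip(full_ok_cf, full_ok_orig, hypo_ok_orig):
        if fc != fo:                  # keep pairs where exactly one prediction is wrong
            counts[(fc, fo, ho)] += 1
    n = len(full_ok_cf)
    return {pattern: round(100 * c / n, 2) for pattern, c in counts.items()}
```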
Table 8 shows the results of this analysis.
Each row represents a condition corresponding
to whether the full and the hypothesis-only mod-
els are correct on the original example. The two
cases of interest, described above, correspond to
the second and fourth rows of the table. The sec-
ond row shows the case where the full model is
correct on the original example (and not on the
counter-factual example), but the hypothesis-only
model is not. Since we can discount the impact of
hypothesis bias in these examples, the error in the
counter-factual version could be attributed to re-
liance on pre-trained knowledge. Unsurprisingly,
there are no such examples in the training set. In
the α1 set, we see a substantial fraction of coun-
terfactual examples (11.79%) belong to this cat-
egory. The last row considers the case where the
hypothesis-only model is correct. We see that this
accounts for a larger fraction of the counterfac-
tual errors, both in the training and the α1 sets.
Among these examples, despite the (albeit unfor-
tunate) fact that the hypothesis alone can support
a correct prediction, the model’s reliance on its
pre-trained knowledge leads to errors in the coun-
terfactual cases.
The results, taken in aggregate, suggest that the
model produces predictions based on hypothesis
artifacts and pre-trained knowledge rather than
the evidence presented to it, thus impacting its
robustness and generalization.
7 Inoculation by Fine-Tuning
Our probing experiments demonstrate that the
models, trained on the INFOTABS training set, failed
along all three dimensions that we investigated.
This leads us to the following question: Can addi-
tional fine-tuning with perturbed examples help?
Liu et al. (2019a) point out that poor perfor-
mance on challenging datasets can be ascribed to
either a weakness in the model, a lack of diversity
in the dataset used for training, or information
leakage in the form of artifacts.11 They suggest
that models can be further fine-tuned on a few
challenging examples to determine the possible
source of degradation. Inoculation can lead to one
of three outcomes: (a) Outcome 1: The perfor-
mance gap between the challenge and the original
test sets reduces, possibly due to addition of di-
verse examples, (b) Outcome 2: Performance on
both the test sets remains unchanged, possibly be-
cause of the model’s inability to adapt to the new
phenomena or the changed data distribution, or,
(c) Outcome 3: Performance degrades on the test
set, but improves on the challenge set, suggesting
that adding new examples introduces ambiguity
or contradictions.
We conducted two sets of inoculation experi-
ments to help categorize performance degradation
of our models into one of these three categories.
For each experiment described below, we gen-
erated additional inoculation datasets with 100,
200, and 300 examples to inoculate the origi-
nal task-specific RoBERTaL models trained on
both premises and hypotheses. As in the original
inoculation work, we created these adversarial
the
datasets by sub-sampling inclusively,
smaller datasets are subsets of the larger ones. Fol-
lowing the training protocol in Liu et al. (2019a),
we tried learning rates of 10−4, 5 × 10−5, and
10−5. We performed inoculation for a maximum
of 30 epochs with early stopping based on the
development set accuracy. We found that with the
first two learning rates, the model does not con-
verge, and underperforms on the development set.
The model performance was best with the learn-
ing rate of 10−5, which we used throughout the
inoculation experiments. The standard deviation
11Model weakness is the inherent inability of a model (or
a model family) to handle certain linguistic phenomena.
#Samples | α1 | α2 | α3
0 (w/o Ino) | 74.88 | 65.55 | 64.94
100 | 67.44 | 62.17 | 58.51
200 | 67.34 | 61.88 | 58.61
300 | 67.24 | 61.84 | 58.62
Table 9: Performance of the inoculated models on the original INFOTABS test sets.
#Samples | Original | Label Preserved | Label Flipped
Train Set (w/o NEUTRAL)
0 (w/o Ino) | 99.44 | 92.98 | 53.92
100 | 97.24 | 95.58 | 79.25
200 | 97.24 | 95.65 | 78.75
300 | 97.24 | 95.64 | 78.74
α1 Set (w/o NEUTRAL)
0 (w/o Ino) | 68.94 | 69.56 | 51.48
100 | 68.05 | 65.67 | 57.91
200 | 68.37 | 66.29 | 57.49
300 | 68.36 | 66.29 | 57.49
Table 10: Performance of the inoculated models on the hypothesis-perturbed INFOTABS sets.
over 100 sample splits for all experiments was
≤ 0.91.
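The inoculation protocol can be summarized as in the sketch below: the challenge examples are sub-sampled inclusively, and the already-trained model is further fine-tuned on each subset. The fine_tune callable stands in for the actual training loop (learning rate 10^-5, at most 30 epochs, early stopping on development accuracy, as described above), and the random seed is our own addition.

```python
# Sketch of the inoculation-by-fine-tuning protocol described above. The
# challenge sets are sampled inclusively (each smaller set is a subset of the
# larger ones); `fine_tune` stands in for the actual training loop.
import random

def inclusive_subsets(challenge_examples, sizes=(100, 200, 300), seed=0):
    rng = random.Random(seed)                 # seed is our own addition
    shuffled = list(challenge_examples)
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sizes}   # nested subsets by construction

def inoculate(trained_model, challenge_examples, fine_tune, sizes=(100, 200, 300)):
    subsets = inclusive_subsets(challenge_examples, sizes)
    # Each inoculated model starts from the same task-trained RoBERTaL model.
    return {n: fine_tune(trained_model, subsets[n]) for n in sizes}
```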
Annotation Artifacts Table 9 shows the per-
formance of the inoculated models on the original
INFOTABS test sets, and Table 10 shows the re-
sults on the hypothesis-perturbed examples (from
§4). We see that fine-tuning on the hypothesis-
perturbed examples decreases performance on the
original α1, α2, and α3 test sets, but performance
improves on the more difficult label-flipped ex-
amples of the hypothesis-perturbed test set.
Counterfactual Examples Tables 11 and 12
show the performance of models inoculated on
the original INFOTABS test sets and the counterfac-
tual examples from §6, respectively. Once again,
we see that fine-tuning on counterfactual examples
improves performance on the adversarial coun-
terfactual examples test set, at the cost of per-
formance on the original test sets.
Analysis We see that both experiments above
belong to Outcome 3, where the performance im-
proves on the challenge set, but degrades on the
test set(s). The change in the distribution of inputs
hurts the model: we conjecture that this may
be because the RoBERTaL model exploits data
artifacts in the original dataset but fails to do so
for the challenge dataset and vice versa.
669
l
D
o
w
n
o
a
d
e
d
f
r
o
m
h
t
t
p
:
/
/
d
i
r
e
c
t
.
m
i
t
.
e
d
u
/
t
a
c
l
/
l
a
r
t
i
c
e
-
p
d
f
/
d
o
i
/
.
1
0
1
1
6
2
/
t
l
a
c
_
a
_
0
0
4
8
2
2
0
2
8
8
4
7
/
/
t
l
a
c
_
a
_
0
0
4
8
2
p
d
.
f
b
y
g
u
e
s
t
t
o
n
0
7
S
e
p
e
m
b
e
r
2
0
2
3
#Samples | α1 | α2 | α3
0 (w/o Ino) | 74.88 | 65.55 | 64.94
100 | 69.72 | 63.88 | 59.66
200 | 69.88 | 63.78 | 58.89
300 | 67.34 | 62.23 | 57.58
Table 11: Performance after inoculation by fine-tuning on the original INFOTABS test sets.
#Samples | Original | Label Preserved | Label Flipped
Train Set (w/o NEUTRAL)
0 (w/o Ino) | 94.38 | 78.53 | 48.70
100 | 91.82 | 84.61 | 57.62
200 | 92.46 | 84.92 | 59.43
300 | 91.08 | 83.54 | 63.58
α1 Set (w/o NEUTRAL)
0 (w/o Ino) | 71.99 | 69.65 | 44.01
100 | 66.05 | 75.03 | 50.40
200 | 65.86 | 75.03 | 50.57
300 | 65.59 | 74.23 | 52.09
Table 12: Performance after inoculation fine-tuning on the INFOTABS counterfactual example sets.
We expect our model to handle both original and
challenge datasets, at least after fine-tuning (i.e.,
it should belong to Outcome 1). Its failure points
to the need for better models or training regimes.
8 Discussion and Related Work
What Did We Learn? Firstly, through system-
atic probing, we have shown that despite good
performance on the evaluation sets, the model for
tabular NLI fails at reasoning. From the analysis
of hypothesis perturbations (§4), we show that the
model heavily relies on correlations between a hy-
pothesis’ sentence structure and its label. Models
should be systematically evaluated on adversarial
sets like α2 for robustness and sensitivity. This
observation is concordant with multiple studies
that probe deep learning models on adversarial
examples in a variety of tasks such as question
answering, sentiment analysis, document classifi-
cation, natural language inference, and so forth
(e.g., Ribeiro et al., 2020; Richardson et al., 2020;
Goel et al., 2021; Lewis et al., 2021; Tarunesh
et al., 2021).
Secondly, the model does not look at correct
evidence required for reasoning, as is evident
from the evidence-selection probing (§5). Rather,
it leverages spurious patterns and statistical cor-
relations to make predictions. A recent study by
Lewis et al. (2021) on question-answering shows
that models indeed leverage spurious patterns to
answer a large fraction (60–70%) of questions.
Thirdly, from counterfactual probes (§6), we
found that the model relies more on knowledge of pre-
trained language models than on tabular evidence
as the primary source of knowledge for making
predictions. This is in addition to the spurious
patterns or hypothesis artifacts leveraged by the
model. Similar observations are made by Clark
and Etzioni (2016), Jia and Liang (2017), Kaushik
et al. (2020), Huang et al. (2020), Gardner et al.
(2020), Tu et al. (2020), Liu et al. (2021), Zhang
et al. (2021), and Wang et al. (2021b) for un-
structured text.
Finally, from the inoculation study (§7), we
found that fine-tuning on challenge sets improves
model performance on challenge sets but degrades
on the original α1, α2, and α3 test sets. That is,
changes in the data distribution during training
have a negative impact on model performance.
This adds weight to the argument that the model
relies excessively on data artifacts.
Benefit of Tabular Data Unlike unstructured
data, where creating challenge datasets may be
more difficult (e.g., Ribeiro et al., 2020; Goel et al.,
2021; Mishra et al., 2021), we can analyze semi-
structured data more effectively. Although con-
nected with the title, the rows in the table are still
independent, linguistically and otherwise. Thus,
controlled experiments are easier to design and
study. For example, the analysis done for evidence
selection via multiple table perturbation opera-
tions such as row deletion and insertion is possible
mainly due to the tabular nature of the data. Such
granularity and component-independence is gen-
erally absent for raw text at the token, sentence,
and even paragraph level. As a result, designing
suitable probes with sufficient coverage can be
a challenging task, and can require more man-
ual effort.
Additionally, probes defined on one tabular
dataset (INFOTABS in our case) can be easily ported
to other tabular datasets such as WikiTableQA
(Pasupat and Liang, 2015), TabFact (Chen et al.,
2020b), HybridQA (Chen et al., 2020c; Zayats
et al., 2021; Oguz et al., 2020), OpenTableQA
(Chen et al., 2021), ToTTo (Parikh et al., 2020),
Turing Tables (Yoran et al., 2021), and Logic-
Table (Chen et al., 2020a). Moreover, such probes
can be used to better understand the behavior
of various tabular reasoning models (e.g., Müller
et al., 2021; Herzig et al., 2020; Yin et al., 2020;
Iida et al., 2021; Pramanick and Bhattacharya,
2021; Glass et al., 2021; and others).
Interpretability for NLI Models For classifi-
cation tasks such as NLI, correct predictions do
not always mean that the underlying model is em-
ploying correct reasoning. More work is needed to
make models interpretable, either through expla-
nations or by pointing to the evidence that is used
for predictions (e.g., Feng et al., 2018; Serrano and
Smith, 2019; Jain and Wallace, 2019; Wiegreffe
and Pinter, 2019; DeYoung et al., 2020; Paranjape
et al., 2020; Hewitt and Liang, 2019; Richardson
and Sabharwal, 2020; Niven and Kao, 2019;
Ravichander et al., 2021). Many recent shared
tasks on reasoning over semi-structured tabular
data (such as SemEval 2021 Task 9 [Wang et al.,
2021a] and FEVEROUS [Aly et al., 2021]) have
highlighted the importance of, and the challenges
associated with, evidence extraction for claim
verification.
Finally, NLI models should be tested on mul-
tiple test sets in adversarial settings (e.g., Ribeiro
et al., 2016, 2018a,b; Alzantot et al., 2018; Iyyer
et al., 2018; Glockner et al., 2018; Naik et al.,
2018; McCoy et al., 2019; Nie et al., 2019;
Liu et al., 2019a) focusing on particular prop-
erties or aspects of reasoning, such as perturbed
premises for evidence selection, zero-shot trans-
fer (α3), counterfactual premises or alternate facts,
and contrasting hypotheses via perturbation (α2).
Such behavioral probing by evaluating on multi-
ple test-only benchmarks and controlled probes is
essential to better understand both the abilities and
the weaknesses of pre-trained language models.
9 Conclusion
This paper presented a targeted probing study
to highlight the limitations of tabular inference
models using a case study on a tabular NLI task
on INFOTABS. Our findings show that despite good
performance on standard splits, a RoBERTa-based
tabular NLI model, fine-tuned on the existing
pre-trained language model, fails to select the
correct evidence, makes incorrect predictions on
adversarial hypotheses, and is not grounded in pro-
vided evidence, counterfactual or otherwise. We
expect that insights from the study can help in
designing rationale selection techniques based on
structural constraints for tabular inference and
other tasks. While inoculation experiments showed
partial success, diverse data augmentation may
help mitigate challenges. However, annotation of
such data can be expensive. It may also be possi-
ble to train models to satisfy domain-based con-
straints (e.g., Li et al., 2020) to improve model
robustness. Finally, probing techniques described
here may be adapted to other NLP tasks involv-
ing tables such as tabular question answering and
tabular text generation.
Acknowledgments
We thank the reviewing team for their valuable
feedback. This work is partially supported by NSF
grants #1801446 and #1822877.
References
Rami Aly, Zhijiang Guo, Michael Sejr Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, and Arpit Mittal. 2021. The fact extraction and VERification over unstructured and structured information (FEVEROUS) shared task. In Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER), pages 1–13, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.fever-1.1

Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2890–2896, Brussels, Belgium. Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1316

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics. https://doi.org/10.18653/v1/D15-1075

Wenhu Chen, Ming-Wei Chang, Eva Schlinger, William Yang Wang, and William W. Cohen.
2021. Open question answering over tables and
text. In International Conference on Learning
Representations.
Wenhu Chen, Jianshu Chen, Yu Su, Zhiyu Chen,
and William Yang Wang. 2020a. Logical natu-
ral
language generation from open-domain
tables. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Lin-
guistics, pages 7929–7942, Online. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/2020.acl-main.708
Wenhu Chen, Hongmin Wang, Jianshu Chen,
Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou
Zhou, and William Yang Wang. 2020b. Tab-
fact: A large-scale dataset for table-based fact
verification. In International Conference on
Learning Representations.
Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan
Xiong, Hong Wang, and William Yang Wang.
2020c. HybridQA: A dataset of multi-hop ques-
tion answering over tabular and textual data. In
Findings of the Association for Computational
Linguistics: EMNLP 2020, pages 1026–1036,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/2020.findings-emnlp.91
Peter Clark and Oren Etzioni. 2016. My computer
is an honor student—but how intelligent is
it? Standardized tests as a measure of AI. AI
Magazine, 37(1):5–12. https://doi.org
/10.1609/aimag.v37i1.2636
Ido Dagan, Dan Roth, Mark Sammons, and Fabio
Massimo Zanzotto. 2013. Recognizing textual
entailment: Models and applications. Synthesis
Lectures on Human Language Technologies,
6(4):1–220. https://doi.org/10.2200
/S00509ED1V01Y201305HLT023
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training
of deep bidirectional transformers for language
understanding. In Proceedings of
the 2019
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long and Short Papers), pages 4171–4186,
Minneapolis, Minnesota. Association for Com-
putational Linguistics.
Jay DeYoung, Sarthak Jain, Nazneen Fatema
Rajani, Eric Lehman, Caiming Xiong, Richard
Socher, and Byron C. Wallace. 2020. ERASER:
A benchmark to evaluate rationalized NLP mod-
els. In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguis-
tics, pages 4443–4458, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.acl-main.408
Julian Eisenschlos, Syrine Krichene, and Thomas
M¨uller. 2020. Understanding tables with inter-
mediate pre-training. In Findings of the Asso-
ciation for Computational Linguistics: EMNLP
2020, pages 281–296, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.findings-emnlp.27
Shi Feng, Eric Wallace, Alvin Grissom II,
Mohit Iyyer, Pedro Rodriguez, and Jordan
Boyd-Graber. 2018. Pathologies of neural
models make interpretations difficult. In Pro-
ceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing,
pages 3719–3728, Brussels, Belgium. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/D18-1407
Matt Gardner, Yoav Artzi, Victoria Basmov,
Jonathan Berant, Ben Bogin, Sihao Chen,
Pradeep Dasigi, Dheeru Dua, Yanai Elazar,
Ananth Gottumukkala, Nitish Gupta, Hannaneh
Hajishirzi, Gabriel Ilharco, Daniel Khashabi,
Kevin Lin, Jiangming Liu, Nelson F. Liu,
Phoebe Mulcaire, Qiang Ning, Sameer Singh,
Noah A. Smith, Sanjay Subramanian, Reut
Tsarfaty, Eric Wallace, Ally Zhang, and Ben
Zhou. 2020. Evaluating models’ local decision
boundaries via contrast sets. In Findings of
the Association for Computational Linguistics:
EMNLP 2020, pages 1307–1323, Online. Asso-
ciation for Computational Linguistics. https://
doi.org/10.18653/v1/2020.findings
-emnlp.117
Mor Geva, Yoav Goldberg, and Jonathan Berant.
2019. Are we modeling the task or the annota-
tor? an investigation of annotator bias in natural
language understanding datasets. In Proceed-
ings of
the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 1161–1166, Hong Kong, China. Associ-
ation for Computational Linguistics. https://
doi.org/10.18653/v1/D19-1107
Michael Glass, Mustafa Canim, Alfio Gliozzo,
Saneem Chemmengath, Vishwajeet Kumar,
Rishav Chakravarti, Avi Sil, Feifei Pan, Samarth
Bharadwaj, and Nicolas Rodolfo Fauceglia.
2021. Capturing row and column semantics
in transformer based question answering over
tables. In Proceedings of the 2021 Conference
of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human
Language Technologies, pages 1212–1224,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/2021.naacl-main.96
Max Glockner, Vered Shwartz,
and Yoav
Goldberg. 2018. Breaking NLI systems with
sentences that require simple lexical inferences.
In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguis-
tics (Volume 2: Short Papers), pages 650–655,
Melbourne, Australia. Association for Compu-
tational Linguistics. https://doi.org/10
.18653/v1/P18-2103
Karan Goel, Nazneen Fatema Rajani, Jesse
Vig, Zachary Taschdjian, Mohit Bansal, and
Christopher R´e. 2021. Robustness gym: Uni-
fying the NLP evaluation landscape. In Pro-
ceedings of the 2021 Conference of the North
American Chapter of
the Association for
Computational Linguistics: Human Language
Technologies: Demonstrations, pages 42–55,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/2021.naacl-demos.6
Vivek Gupta, Maitrey Mehta, Pegah Nokhiz,
and Vivek Srikumar. 2020. INFOTABS: In-
ference on tables as semi-structured data. In
Proceedings of the 58th Annual Meeting of
the Association for Computational Linguis-
tics, pages 2309–2324, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2020.acl-main.210
Suchin Gururangan, Swabha Swayamdipta, Omer
Levy, Roy Schwartz, Samuel Bowman, and
Noah A. Smith. 2018. Annotation artifacts in
natural language inference data. In Proceedings
of the 2018 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
Volume 2 (Short Papers), pages 107–112, New
Orleans, Louisiana. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/N18-2017
Jonathan Herzig, Pawel Krzysztof Nowak,
Thomas M¨uller, Francesco Piccinno, and Julian
Eisenschlos. 2020. TaPas: Weakly supervised
table parsing via pre-training. In Proceedings of
the 58th Annual Meeting of the Association for
Computational Linguistics, pages 4320–4333,
Online. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2020.acl-main.398
John Hewitt and Percy Liang. 2019. Designing and
interpreting probes with control tasks. In Pro-
ceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and
the 9th International Joint Conference on Nat-
ural Language Processing (EMNLP-IJCNLP),
pages 2733–2743, Hong Kong, China. Associa-
tion for Computational Linguistics. https://
doi.org/10.18653/v1/D19-1275
William Huang, Haokun Liu, and Samuel R.
Bowman. 2020. Counterfactually-augmented
SNLI training data does not yield better gener-
alization than unaugmented data. In Proceed-
ings of the First Workshop on Insights from
Negative Results in NLP, pages 82–87, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.insights-1.13
Hiroshi Iida, Dung Thai, Varun Manjunatha, and
Mohit Iyyer. 2021. TABBIE: Pretrained rep-
resentations of tabular data. In Proceedings
of the 2021 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 3446–3456, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2021.naacl-main.270
Mohit Iyyer, John Wieting, Kevin Gimpel, and
Luke Zettlemoyer. 2018. Adversarial example
generation with syntactically controlled para-
phrase networks. In Proceedings of the 2018
Conference of the North American Chapter
of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1
(Long Papers), pages 1875–1885, New Orleans,
Louisiana. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/N18-1170
Sarthak Jain and Byron C. Wallace. 2019. At-
tention is not explanation. In Proceedings of
the 2019 Conference of the North American
Chapter of
the Association for Computa-
tional Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers),
pages 3543–3556, Minneapolis, Minnesota.
Association for Computational Linguistics.
Robin Jia and Percy Liang. 2017. Adversarial ex-
amples for evaluating reading comprehension
systems. In Proceedings of the 2017 Conference
on Empirical Methods in Natural Language
Processing, pages 2021–2031, Copenhagen,
Denmark. Association for Computational Lin-
guistics. https://doi.org/10.18653
/v1/D17-1215
Divyansh Kaushik, Eduard Hovy, and Zachary
Lipton. 2020. Learning the difference that makes
a difference with counterfactually-augmented
data. In International Conference on Learning
Representations.
Patrick Lewis, Pontus Stenetorp, and Sebastian
Riedel. 2021. Question and answer test-train
overlap in open-domain question answering
datasets. In Proceedings of the 16th Conference
of the European Chapter of the Association
for Computational Linguistics: Main Volume,
pages 1000–1008, Online. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/2021.eacl-main.86
Tao Li, Parth Anand Jawale, Martha Palmer, and
Vivek Srikumar. 2020. Structured tuning for
semantic role labeling. In Proceedings of the
58th Annual Meeting of the Association for
Computational Linguistics, pages 8402–8412,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2020.acl-main.744
Nelson F. Liu, Roy Schwartz, and Noah A. Smith.
2019a. Inoculation by fine-tuning: A method
for analyzing challenge datasets. In Proceedings
of the 2019 Conference of the North
American Chapter of the Association for Com-
putational Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers),
pages 2171–2179, Minneapolis, Minnesota.
Association for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei
Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin
Stoyanov. 2019b. RoBERTa: A robustly op-
timized BERT pretraining approach. arXiv
preprint arXiv:1907.11692. Version 1.
Zeyu Liu, Yizhong Wang, Jungo Kasai, Hannaneh
Hajishirzi, and Noah A. Smith. 2021. Probing
across time: What does RoBERTa know and
when? In Findings of
the Association for
Computational Linguistics: EMNLP 2021,
pages 820–842, Punta Cana, Dominican Republic.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021
.findings-emnlp.71
Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019.
Right for the wrong reasons: Diagnosing syn-
tactic heuristics in natural language inference.
In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics,
pages 3428–3448, Florence, Italy. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/P19-1334
Anshuman Mishra, Dhruvesh Patel, Aparna
Vijayakumar, Xiang Lorraine Li, Pavan
Kapanipathi, and Kartik Talamadupula. 2021.
Looking beyond sentence-level natural language
inference for question answering and text sum-
marization. In Proceedings of the 2021 Con-
ference of the North American Chapter of the
Association for Computational Linguistics: Hu-
man Language Technologies, pages 1322–1336,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2021.naacl-main.104
Thomas M¨uller, Julian Eisenschlos, and Syrine
Krichene. 2021. TAPAS at SemEval-2021 task
9: Reasoning over tables with intermediate
pre-training. In Proceedings of the 15th In-
ternational Workshop on Semantic Evalua-
tion (SemEval-2021), pages 423–430, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021
.semeval-1.51
Aakanksha Naik, Abhilasha Ravichander, Norman
Sadeh, Carolyn Rose, and Graham Neubig.
2018. Stress test evaluation for natural lan-
guage inference. In Proceedings of the 27th
International Conference on Computational
Linguistics, pages 2340–2353, Santa Fe, New
Mexico, USA. Association for Computational
Linguistics.
J. Neeraja, Vivek Gupta, and Vivek Srikumar.
2021. Incorporating external knowledge to en-
hance tabular reasoning. In Proceedings of
the 2021 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 2799–2809, Online. Association for
Computational Linguistics. https://doi.org
/10.18653/v1/2021.naacl-main.224
Yixin Nie, Yicheng Wang,
and Mohit
Bansal. 2019. Analyzing compositionality-
sensitivity of NLI models. Proceedings of the
AAAI Conference on Artificial Intelligence,
33(01):6867–6874. https://doi.org/10
.1609/aaai.v33i01.33016867
Timothy Niven and Hung-Yu Kao. 2019. Prob-
ing neural network comprehension of natural
language arguments. In Proceedings of
the
57th Annual Meeting of the Association for
Computational Linguistics, pages 4658–4664,
Florence, Italy. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/P19-1459
Barlas Oguz, Xilun Chen, Vladimir Karpukhin,
Stan Peshterliev, Dmytro Okhonko, Michael
Schlichtkrull, Sonal Gupta, Yashar Mehdad,
and Scott Yih. 2020. Unified open-domain
question answering with structured and un-
structured knowledge. arXiv preprint arXiv:
2012.14610, Version 2.
Bhargavi Paranjape, Mandar Joshi, John
Thickstun, Hannaneh Hajishirzi, and Luke
Zettlemoyer. 2020. An information bottleneck ap-
proach for controlling conciseness in rationale
extraction. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 1938–1952,
Online. Association for Computational Lin-
guistics. https://doi.org/10.18653/v1
/2020.emnlp-main.153
Ankur
Parikh, Xuezhi Wang,
Sebastian
Gehrmann, Manaal Faruqui, Bhuwan Dhingra,
Diyi Yang, and Dipanjan Das. 2020. ToTTo:
A controlled table-to-text generation dataset.
In Proceedings of
the 2020 Conference on
Empirical Methods
in Natural Language
Processing (EMNLP), pages 1173–1186, On-
line. Association for Computational Linguis-
tics. https://doi.org/10.18653/v1
/2020.emnlp-main.89
Panupong Pasupat and Percy Liang. 2015. Com-
positional semantic parsing on semi-structured
tables. In Proceedings of
the 53rd Annual
Meeting of the Association for Computational
Linguistics and the 7th International Joint
Conference on Natural Language Processing
(Volume 1: Long Papers), pages 1470–1480,
Beijing, China. Association for Computational
Linguistics. https://doi.org/10.3115
/v1/P15-1142
Adam Poliak,
Jason Naradowsky, Aparajita
Haldar, Rachel Rudinger, and Benjamin Van
Durme. 2018. Hypothesis only baselines in
natural language inference. In Proceedings of
the Seventh Joint Conference on Lexical and
Computational Semantics, pages 180–191, New
Orleans, Louisiana. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/S18-2023
Aniket Pramanick and Indrajit Bhattacharya.
2021. Joint learning of representations for web-
tables, entities and types using graph convo-
lutional network. In Proceedings of the 16th
Conference of the European Chapter of the
Association for Computational Linguistics:
Main Volume, pages 1197–1206, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2021
.eacl-main.102
Abhilasha Ravichander, Yonatan Belinkov, and
Eduard Hovy. 2021. Probing the probing
paradigm: Does probing accuracy entail task
relevance? In Proceedings of the 16th Confer-
ence of the European Chapter of the Asso-
ciation for Computational Linguistics: Main
Volume, pages 3363–3377, Online. Association
for Computational Linguistics. https://doi
.org/10.18653/v1/2021.eacl-main.295
Marco Tulio Ribeiro, Sameer Singh, and Carlos
Guestrin. 2016. ‘‘Why should I trust you?’’:
Explaining the predictions of any classifier. In
Proceedings of the 22nd ACM SIGKDD Inter-
national Conference on Knowledge Discovery
and Data Mining, KDD ’16, pages 1135–1144,
New York, NY, USA. Association for Com-
puting Machinery. https://doi.org/10
.1145/2939672.2939778
Marco Tulio Ribeiro, Sameer Singh, and Carlos
Guestrin. 2018a. Anchors: High-precision model-
agnostic explanations. Proceedings of the AAAI
Conference on Artificial Intelligence, 32(1).
Marco Tulio Ribeiro, Sameer Singh, and Carlos
Guestrin. 2018b. Semantically equivalent ad-
versarial rules for debugging NLP models.
In Proceedings of the 56th Annual Meeting
of the Association for Computational Linguis-
tics (Volume 1: Long Papers), pages 856–865,
Melbourne, Australia. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/P18-1079
Marco Tulio Ribeiro, Tongshuang Wu, Carlos
Guestrin, and Sameer Singh. 2020. Beyond ac-
curacy: Behavioral testing of NLP models with
CheckList. In Proceedings of the 58th Annual
Meeting of the Association for Computa-
tional Linguistics, pages 4902–4912, Online.
Association for Computational Linguistics.
https://doi.org/10.18653/v1/2020
.acl-main.442
Kyle Richardson, Hai Hu, Lawrence Moss, and
Ashish Sabharwal. 2020. Probing natural lan-
guage inference models through semantic frag-
ments. Proceedings of the AAAI Conference
on Artificial Intelligence, 34(05):8713–8721.
https://doi.org/10.1609/aaai.v34i05
.6397
Kyle Richardson and Ashish Sabharwal. 2020.
What does my QA model know? Devising con-
trolled probes using expert knowledge. Trans-
actions of the Association for Computational
Linguistics, 8:572–588. https://doi.org
/10.1162/tacl_a_00331
Sofia Serrano and Noah A. Smith. 2019. Is at-
tention interpretable? In Proceedings of the
57th Annual Meeting of the Association for
Computational Linguistics, pages 2931–2951,
Florence, Italy. Association for Computational
Linguistics. https://doi.org/10.18653
/v1/P19-1282
Ishan Tarunesh, Somak Aditya, and Monojit
Choudhury. 2021. Trusting RoBERTa over
BERT: Insights from checklisting the natural
language inference task. arXiv preprint arXiv:
2107.07229. Version 1.
Lifu Tu, Garima Lalwani, Spandana Gella, and
He He. 2020. An empirical study on robust-
ness to spurious correlations using pre-trained
language models. Transactions of the Associa-
tion for Computational Linguistics, 8:621–633.
https://doi.org/10.1162/tacl_a_00335
Nancy Xin Ru Wang, Diwakar Mahajan,
Marina Danilevsky, and Sara Rosenthal. 2021a.
SemEval-2021 task 9: Fact verification and ev-
idence finding for tabular data in scientific
documents (SEM-TAB-FACTS). In Proceedings
of the 15th International Workshop on Semantic
Evaluation (SemEval-2021), pages 317–326,
Online. Association for Computational Linguistics.
Siyuan Wang, Wanjun Zhong, Duyu Tang,
Zhongyu Wei, Zhihao Fan, Daxin Jiang, Ming
Zhou, and Nan Duan. 2021b. Logic-driven
context extension and data augmentation for
logical reasoning of text. arXiv preprint arXiv:
2105.03659. Version 1.
Sarah Wiegreffe and Yuval Pinter. 2019. Atten-
tion is not not explanation. In Proceedings of
the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th Inter-
national Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 11–20,
Hong Kong, China. Association for Computa-
tional Linguistics. https://doi.org/10
.18653/v1/D19-1002
Pengcheng Yin, Graham Neubig, Wen-tau Yih,
and Sebastian Riedel. 2020. TaBERT: Pretrain-
ing for joint understanding of textual and tabular
data. In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguis-
tics, pages 8413–8426, Online. Association for
Computational Linguistics.
Ori Yoran, Alon Talmor, and Jonathan Berant.
2021. Turning tables: Generating examples
from semi-structured tables for endowing lan-
guage models with reasoning skills. arXiv
preprint arXiv:2107.07261. Version 1.
Vicky Zayats, Kristina Toutanova, and Mari
Ostendorf. 2021. Representations for question
answering from documents with tables and
text. In Proceedings of the 16th Conference
of the European Chapter of the Association
for Computational Linguistics: Main Volume,
pages 2895–2906, Online. Association for
Computational Linguistics. https://doi
.org/10.18653/v1/2021.eacl-main
.253
Chong Zhang, Jieyu Zhao, Huan Zhang, Kai-Wei
Chang, and Cho-Jui Hsieh. 2021. Double per-
turbation: On the robustness of robustness and
counterfactual bias evaluation. In Proceedings
of the 2021 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies,
pages 3899–3916, Online. Association for Com-
putational Linguistics. https://doi.org
/10.18653/v1/2021.naacl-main.305
Appendix
In Section 5.1, we defined four types of row-
agnostic table modifications: (a) row deletion,
(b) row insertion, (c) row-value update, and
(d) row permutation, and presented the first one
there. We present details of the remaining three
here, along with their respective impact on the
α1, α2, and α3 test sets.
Row Insertion. When we insert new informa-
tion that does not contradict the existing table,12
the original predictions should be retained in
almost all cases. Very rarely, NEUTRAL labels may
change to ENTAIL or CONTRADICT. For example,
adding the Singles row below to our running
example table does not change the label of any
hypothesis except H4 (see Figure 1), which
changes to CONTRADICT given the additional
information.
Singles   The Logical Song; Breakfast in America;
          Goodbye Stranger; Take the Long Way Home
Figure 7 shows the possible label changes after
new row insertion as a directed labeled graph, and
the results are summarized in Table 13. Note that
all transitions from NEUTRAL are valid upon row
insertion, although not all may be accurate.
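To make the procedure concrete, the sketch below is a minimal, illustrative implementation of the row-insertion perturbation, not the exact code used in our experiments. It assumes a table is an ordered mapping from keys to lists of values, and that donor_tables (a name introduced only for this sketch) is a pool of other INFOTABS tables from which rows can be borrowed; following footnote 12, only rows whose keys are absent from the original table are inserted.

import random
from collections import OrderedDict

def insert_row(table, donor_tables, rng=random):
    """Insert one borrowed row whose key does not already occur in the
    table, so the added information cannot contradict existing rows."""
    candidates = [
        (key, values)
        for donor in donor_tables
        for key, values in donor.items()
        if key not in table
    ]
    if not candidates:
        return table  # no safe row to insert
    key, values = rng.choice(candidates)
    perturbed = OrderedDict(table)
    perturbed[key] = list(values)
    return perturbed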
Row Update. In the case of a row update, we
change only a portion of a row's value. Whole-row
substitutions are examined separately, as composite
operations of deletion followed by insertion.
Unlike a whole-row update, changing only a
portion of a row is non-trivial: we must ensure
that the updated value is appropriate for the key
in question while avoiding self-contradictions.
To satisfy these constraints, we only update
values in multi-valued rows, replacing one value
with a value drawn from a random table that has
the same key. A row update operation may affect
all labels.
12To ensure that the information added is not contradictory
to existing rows, we only add rows with new keys instead of
changing values for the existing keys.
Figure 7: Changes in model predictions after new row
insertion. (Notation similar to Figure 3).
Dataset    ENTAIL   NEUTRAL   CONTRADICT   Average
α1         2.81     0         6.77         3.19
α2         4.99     0         6.54         3.84
α3         2.51     0         6.35         2.95
Average    3.44     0         6.55         –
Table 13: Percentage of invalid transitions after
new row insertion. For an ideal model, all these
numbers should be zero.
Though feasible, we consider transitions from
CONTRADICT to ENTAIL to be prohibited: unlike
ENTAIL-to-CONTRADICT transitions, they would be
extremely rare, because values are updated
randomly, without regard to their semantics. For
example, if we substitute pop under the multi-
valued key Genre in our running example with
another genre, the label of hypothesis H1 is
likely to change to CONTRADICT.
Since we are updating a single value of a
multi-valued key, the changes to the table are
minimal and may not be perceived by the model.
As a result, we should expect row updates to have
a lower impact on model predictions. This appears
to be the case: the results in Figure 8 show that
the labels do not change drastically after an
update. These results are summarized in Table 14.
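As a rough sketch of the row-value update just described (again under our assumed table representation, with hypothetical helper names rather than the actual implementation): only multi-valued rows are modified, and the replacement value comes from a donor table that shares the key.

import random
from collections import OrderedDict

def update_row_value(table, donor_tables, rng=random):
    """Replace one value of a multi-valued row with a value drawn from
    another table that has the same key, leaving the rest intact."""
    multi_valued = [key for key, values in table.items() if len(values) > 1]
    rng.shuffle(multi_valued)
    for key in multi_valued:
        donor_values = [
            value
            for donor in donor_tables
            if key in donor
            for value in donor[key]
            if value not in table[key]
        ]
        if not donor_values:
            continue
        perturbed = OrderedDict((k, list(v)) for k, v in table.items())
        position = rng.randrange(len(perturbed[key]))
        perturbed[key][position] = rng.choice(donor_values)
        return perturbed
    return table  # no eligible multi-valued row found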
Row Permutation. By the design of the premises,
the order of their rows should have no effect on
hypothesis labels. In other words, the labels
should be invariant to row permutation. However,
Figure 9 shows that even a simple shuffling of
rows, where no information has been altered, can
have a notable effect on performance.
Dataset    ENTAIL   NEUTRAL   CONTRADICT   Average
α1         9.25     7.1       11.6         9.34
α2         12.2     6.8       8.76         9.26
α3         14.6     12.5      13.7         13.6
Average    12.02    8.79      11.36        –
Table 15: Percentage of invalid transitions after
row permutations. For an ideal model, all these
numbers should be zero.
Figure 8: Changes in model predictions after row value
update. (Notation similar to Figure 3).
Dataset    ENTAIL   NEUTRAL   CONTRADICT   Average
α1         0.08     0.12      0.49         0.23
α2         0.22     0.11      0.30         0.21
α3         0.12     0.09      0.19         0.13
Average    0.14     0.11      0.33         –
Table 14: Percentage of invalid transitions after
row value update. For an ideal model, all these
numbers should be zero.
Figure 9: Changes in model predictions after shuffling
of table rows. (Notation similar to Figure 3.)
This shows that the model incorrectly relies on
row positions, even though the semantics of a
table is invariant to row order. We summarize the
combined invalid transitions from Figure 9 in
Table 15.
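The numbers in Tables 13–17 can be read as the percentage of examples, grouped by the label before perturbation, whose prediction changes invalidly afterwards; the grouping by the model's original prediction is our reading of the tables. The sketch below (our own simplification, with predict standing in for the fine-tuned NLI model applied to a flattened table-hypothesis pair) computes this quantity for the row-permutation probe, where any label change is invalid.

import random
from collections import OrderedDict, defaultdict

def permutation_label_changes(examples, predict, rng=random):
    """`examples` is a list of (table, hypothesis) pairs; `predict` maps a
    (table, hypothesis) pair to ENTAIL, NEUTRAL, or CONTRADICT.  For row
    permutation, every label change counts as an invalid transition."""
    changed = defaultdict(int)
    total = defaultdict(int)
    for table, hypothesis in examples:
        before = predict(table, hypothesis)
        rows = list(table.items())
        rng.shuffle(rows)
        after = predict(OrderedDict(rows), hypothesis)
        total[before] += 1
        changed[before] += int(after != before)
    return {label: 100.0 * changed[label] / total[label] for label in total}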
Irrelevant Row Deletion. Ideally, deleting an
irrelevant row should have no effect on a
hypothesis label. The results in Figure 10 and
Table 16 show that even irrelevant rows affect
model predictions. This further illustrates that
the seemingly accurate model predictions are not
appropriately grounded in the evidence.
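A sketch of this probe under the same assumed representation; relevant_keys is a hypothetical per-hypothesis annotation of the row keys the hypothesis actually depends on (the relevance information used in our experiments may be encoded differently).

import random
from collections import OrderedDict

def delete_irrelevant_row(table, relevant_keys, rng=random):
    """Delete one row that the hypothesis does not depend on; a model
    grounded in the right evidence should keep its prediction unchanged."""
    irrelevant = [key for key in table if key not in relevant_keys]
    if not irrelevant:
        return table  # every row is relevant; nothing to delete
    key = rng.choice(irrelevant)
    return OrderedDict((k, v) for k, v in table.items() if k != key)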
Figure 10: Change in model predictions after deletion
of an irrelevant row. (Notation similar to Figure 3.)
Dataset    ENTAIL   NEUTRAL   CONTRADICT   Average
α1         5.14     3.9       5.94         4.99
α2         6.97     3.54      5.09         5.2
α3         6.09     5.01      6.91         6.01
Average    6.07     4.15      5.98         –
Table 16: Percentage of invalid transitions after
deletion of irrelevant rows. For an ideal model,
all these numbers should be zero.
Composition of Perturbation Operations. In
addition to probing individual operations, we can
also study their compositions. For example, we
could delete one row and then insert a different
one, and so on. The compositions of these
operations have interesting properties with
respect to the allowed transitions. For example,
when an operation is composed with itself (e.g.,
two deletions), the set of valid label changes is
the same as for a single application of that
operation. A particularly interesting composition
is a deletion followed by an insertion, since this
can be viewed as a row update. Figure 11 shows
the transition graph for the composition of row
deletion followed by insertion, and the invalid
transitions are summarized in Table 17.
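One convenient way to derive the valid transitions for a composed operation is to treat each operation's valid label changes as a relation over {ENTAIL, NEUTRAL, CONTRADICT} and compose the relations. The sketch below illustrates this for deletion followed by insertion; the two transition sets are placeholders under stated assumptions and should be read off the actual graphs (Figure 7 for insertion and the corresponding deletion graph in Section 5.1), not treated as copies of them.

E, N, C = "ENTAIL", "NEUTRAL", "CONTRADICT"
LABELS = (E, N, C)

# Placeholder transition sets; replace with the graphs from the paper.
# Keeping the same label is always valid; insertion additionally permits
# any transition out of NEUTRAL, and deletion is assumed here to permit
# transitions into NEUTRAL.
VALID_DELETE = {(lab, lab) for lab in LABELS} | {(E, N), (C, N)}
VALID_INSERT = {(lab, lab) for lab in LABELS} | {(N, E), (N, C)}

def compose(first, second):
    """(a, c) is valid for the composed operation if some intermediate
    label b makes (a, b) valid for `first` and (b, c) valid for `second`."""
    return {(a, c) for (a, b1) in first for (b2, c) in second if b1 == b2}

VALID_DELETE_THEN_INSERT = compose(VALID_DELETE, VALID_INSERT)

def is_invalid(before, after, valid=VALID_DELETE_THEN_INSERT):
    """A prediction change counts as invalid if it is not in the valid set."""
    return (before, after) not in valid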
Dataset    ENTAIL   NEUTRAL   CONTRADICT   Average
α1         3.02     0.00      9.81         4.28
α2         6.53     0.00      7.88         4.80
α3         4.16     0.00      6.71         3.63
Average    4.57     0.00      8.13         –
Table 17: Percentage of invalid transitions after
deletion followed by an insertion operation.
For an ideal model, all these numbers should
be zero.
Figure 11: Changes in model predictions after deletion
followed by an insert operation. (Notation similar to
Figure 3.)